CSCE 590 sec 2: Web Scraping
General Information
Instructor
Manton M. Matthews
3A53 Swearingen
Phone: 777-3285
Office Hours: TH 11:30-1:00PM, others by appointment
Email: mm at sc in the domain edu
Teaching Assistant
Course Description:
This course covers Web scraping with Python: POSIX regular expressions, the Beautiful Soup and Natural Language Toolkit (NLTK) libraries, Selenium WebDriver for
automating interaction with browsers, and finally the Scrapy library for creating web crawlers.
Main text and References
Required: Web Scraping with Python: Collecting Data from the Modern Web by Ryan Mitchell, O'Reilly 2015.
Resources: Websites or texts with online versions or substitutes.
- Python 3.5 documentation - https://docs.python.org/3.5/
- Tutorial (Required) https://docs.python.org/3.5/tutorial/index.html
- The Standard Library https://docs.python.org/3.5/library/index.html
- Natural Language Toolkit - for Python 3.x http://www.nltk.org/book/ and for Python 2.x http://www.nltk.org/book_1ed/
- Scrapy - https://doc.scrapy.org/en/latest/intro/tutorial.html
- Requests -
Learning Outcomes
At the end of the course, students should have demonstrated the ability to:
- Open URLs and extract desired information using several tools, including regular expressions, BeautifulSoup, Requests, and the Natural Language Toolkit (NLTK).
- Clean data and save the data gathered in databases.
- Automatically drive browsers with Selenium for testing websites and for gathering of data.
- Log in to websites to gather information, including handling CAPTCHAs.
- Create web crawlers by hand and with Scrapy.
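As a rough illustration of the first outcome, the sketch below pulls link targets out of HTML with a regular expression, using only the Python standard library. The sample HTML, the `HREF_RE` pattern, and the `extract_links` helper are hypothetical and chosen for illustration; they are not taken from the course materials, and for real pages a parser such as BeautifulSoup is usually more robust than a regular expression.

```python
import re

# Hypothetical HTML standing in for a page that has already been downloaded.
SAMPLE_HTML = """
<html><body>
<a href="https://example.com/docs/">Docs</a>
<a href='https://example.com/book/'>Book</a>
<a class="nav" href="/relative/path">Relative</a>
</body></html>
"""

# Capture the href value of each anchor tag, quoted with either ' or ".
HREF_RE = re.compile(r"""<a\s[^>]*?href=(["'])(.*?)\1""", re.IGNORECASE)


def extract_links(html):
    """Return the href values of all anchor tags found in the HTML text."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link)
```

The pattern is deliberately simple: it assumes quoted attribute values and does not handle malformed markup, which is one reason dedicated HTML parsers are typically preferred.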
Date | Significance |
February 16 | Test 1 |
Thursday, March 2 | Last day to withdraw without WF |
April 6 | Test 2 |
Tuesday, May 2 @ 9:00 a.m. | Final Exam |
Link to the Exam Schedule for Spring 2017
Policies
Homework:
The homework is submitted through the "dropbox" system on the CSE secure site.
All Homework is to be turned in as ASCII files, i.e. no "word documents."
No late homework or projects will be accepted.
All Homework is expected to be individual work unless explicitly specified otherwise.
Academic Integrity: You are expected to practice the highest possible standards of academic integrity.
Any deviation from this expectation will result in a minimum academic penalty of your failing the assignment, and will result in additional disciplinary measures, including referring you to the Office of
Academic Integrity. Violations of the University's Honor Code include, but are not limited to, improper
citation of sources, using another student's work, and any other form of academic misrepresentation. For more information, please see the Honor Code.
Accommodating Disabilities:
Reasonable accommodations are available for students with a documented disability. If you have a disability and may need accommodations to fully participate in this class, contact the Office of Student
Disability Services: 777-6142, TDD 777-6744, email sasds@mailbox.sc.edu, or stop by LeConte College
Room 112A. All accommodations must be approved through the Office of Student Disability Services.
Amending the Syllabus/Rules
Amendments and changes to the syllabus, including evaluation and grading mechanisms, are possible. The instructor must initiate any changes. Changes to the grading and evaluation scheme will be voted
on by the entire class and approved only with unanimous vote of all students present in class on the
day the issue is decided. The lecture schedule and reading assignments (daily schedule) will not require a vote and may be altered at the instructor's discretion. Once approved, amendments will be
distributed in writing to all students.
Grading policy:
The final grade will be based on assignments and quizzes, two tests,
and the final exam, according to the following weights:
- Assignments and Quizzes: 35%
- Two Tests: 20% each
- Final: 25%
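The weights above can be applied as a simple weighted sum. The following is a minimal sketch of that arithmetic; the component names, the `final_grade` helper, and the example scores are hypothetical and used only for illustration, and the sketch assumes each component is scored on a 0-100 scale.

```python
# Weights taken from the grading policy above (they sum to 100%).
WEIGHTS = {
    "assignments_and_quizzes": 0.35,
    "test_1": 0.20,
    "test_2": 0.20,
    "final": 0.25,
}


def final_grade(scores):
    """Return the weighted course percentage from component scores (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "assignments_and_quizzes": 90,
    "test_1": 80,
    "test_2": 70,
    "final": 85,
}
print(round(final_grade(example), 2))
```

With these example scores the components contribute 31.5, 16, 14, and 21.25 points respectively, for a course percentage of 82.75.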