CSCE 590: Web Scraping Assignments Page
- Homework 1: Find and run Python3.
- Run the Unix "script" command to create the file typescript.
- Run python3 on the programs of chapter 1 (emailed to you).
- Run the command "exit" after you have run the Python programs.
- Homework 2: Regular expressions
CSCE 590 HW 2 - Regular expressions, Due Jan 29 Sunday night 11:55PM
- ) Give regular expressions that denote the following languages (a sketch for part (a) appears after the list):
a) { strings x such that x starts with 'aa', followed by any number of b's and c's, and then ends in an 'a' }
b) Phone numbers with optional area codes and optional extensions of the form " ext 432".
c) Email Addresses
d) A Python function definition that just has pass as the body
e) A link in a web page
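For illustration, a minimal sketch of how part (a) might be tested in Python; the pattern shown is one candidate answer, not the only one, and re.fullmatch is assumed as the way to test whole strings:

    import re

    # Candidate for part (a): starts with 'aa', then any number of b's and c's,
    # then ends in a single 'a'.
    pattern = re.compile(r"aa[bc]*a")

    for s in ["aabcbca", "aaa", "aab", "abca"]:
        print(s, bool(pattern.fullmatch(s)))   # fullmatch tests the whole string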
- ) What languages (sets of strings that match the RE) are denoted by the following regular expressions:
a) (a|b)[cde]{3,4}b
b) \w+\W+
c) \d{3}-\d{2}-\d{4}
- ) Give a regular expression that extracts the "login" for a USC student from their email address "login@email.sc.edu"
(after the match, one could use login = match.group(1))
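A short sketch of that extraction, using a capture group as the hint above suggests (the address and the \w+ character class for the login are assumptions):

    import re

    addr = "jsmith@email.sc.edu"                     # made-up example address
    match = re.search(r"(\w+)@email\.sc\.edu", addr)
    if match:
        login = match.group(1)                       # as in the hint above
        print(login)                                 # -> jsmith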
- ) Write a Python program that processes a file line by line and cleans it by removing (using re.sub):
social security numbers (replacing with )
email addresses (replacing with "")
phone numbers (replacing with
For extra credit, instead of removing Social Security numbers, leave the number but replace the first three digits with the last three and the last three with the first three of the original string, i.e. 123-45-6789 becomes 789-45-6123.
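A possible skeleton for the cleaning program. The replacement strings are not specified above, so the placeholders "&lt;SSN&gt;" and "&lt;phone&gt;" are assumptions, as are the file names and the exact phone/email patterns:

    import re

    SSN_RE   = re.compile(r"\d{3}-\d{2}-\d{4}")
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"(\(\d{3}\)\s*|\d{3}-)?\d{3}-\d{4}(\s+ext\s+\d+)?")

    def clean_line(line):
        line = SSN_RE.sub("<SSN>", line)        # placeholder replacement (assumed)
        line = EMAIL_RE.sub("", line)           # emails replaced with ""
        line = PHONE_RE.sub("<phone>", line)    # placeholder replacement (assumed)
        return line

    with open("input.txt") as src, open("cleaned.txt", "w") as dst:
        for line in src:
            dst.write(clean_line(line))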
- Homework 3: Due Monday Feb 6 at 11:55PM
- Write a short program named 3_1.py (less than 10 lines)
that imports only urlopen and BeautifulSoup and then builds a
list of all links (<a> tags). Then it should process the list one
element at a time and print the link. Use the URL https://cse.sc.edu/ .
Finally it should print a count of the number of links.
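A sketch of 3_1.py under those constraints (the "html.parser" argument is an assumption; any parser BeautifulSoup supports would do):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("https://cse.sc.edu/")
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")          # build the list of all <a> tags
    for link in links:                  # process one element at a time
        print(link)
    print("Number of links:", len(links))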
- Modify the previous program to obtain 3_2.py so that it
a) writes to the file "allLinks.txt"
b) writes only the URL, i.e. the value of the href
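One way 3_2.py might look, assuming links without an href attribute are simply skipped:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urlopen("https://cse.sc.edu/"), "html.parser")
    links = soup.find_all("a")
    with open("allLinks.txt", "w") as out:
        for link in links:
            href = link.get("href")      # only the URL, not the whole tag
            if href:                     # skip anchors with no href (assumption)
                out.write(href + "\n")
    print("Number of links:", len(links))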
- Run 5-getAllExternalLinks.py from chapter 3 on the URL https://cse.sc.edu/ .
Modify the code to handle the exceptions that occur by logging them,
then ignoring them and continuing to handle the other links.
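Without reproducing 5-getAllExternalLinks.py itself, the log-and-continue pattern it asks for looks roughly like this; the safe_visit and fetch names are hypothetical stand-ins for the book's page-getting code:

    import logging
    from urllib.error import HTTPError, URLError

    logging.basicConfig(filename="errors.log", level=logging.WARNING)

    def safe_visit(url, fetch):
        """Fetch one link; log any failure and return None so the caller moves on."""
        try:
            return fetch(url)
        except (HTTPError, URLError) as exc:
            logging.warning("skipping %s: %s", url, exc)
            return None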
- Modify 5-getAllExternalLinks.py to check the website from the previous problem for "Bad Links" (404 is sufficient).
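A sketch of the bad-link check; the is_bad_link name is hypothetical, and only the 404 case is handled, per the "404 is sufficient" note:

    from urllib.request import urlopen
    from urllib.error import HTTPError

    def is_bad_link(url):
        """Return True if the URL comes back 404."""
        try:
            urlopen(url)
            return False
        except HTTPError as exc:
            return exc.code == 404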
- HW 4 Due Feb 14, 11:55PM
- Copy the table from the Master Schedule of CSCE courses online (^A ^C to select all, then copy) and paste (^V) into Excel, then save it as a CSV file, sched.csv.
- Write a program, table.py, to grab this same table.
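A sketch of what table.py could do; the function name is mine, the schedule URL is not given here, and it assumes the first <table> on the page is the schedule:

    import csv
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def save_first_table(url, csv_name="sched.csv"):
        """Scrape the first <table> on the page into a CSV file."""
        soup = BeautifulSoup(urlopen(url), "html.parser")
        table = soup.find("table")              # assumes the schedule is the first table
        with open(csv_name, "w", newline="") as f:
            writer = csv.writer(f)
            for row in table.find_all("tr"):
                cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
                writer.writerow(cells)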
- Use Requests to login to the CSE site (https://cse.sc.edu/user/login?destination=node) and then use BeautifulSoup to prettify the page returned.
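A sketch of the Requests login followed by prettify; the form field names in the payload are assumptions about what the login form expects and should be checked against the page's actual HTML:

    import requests
    from bs4 import BeautifulSoup

    LOGIN_URL = "https://cse.sc.edu/user/login?destination=node"

    # Field names below are assumptions; inspect the form to confirm them.
    payload = {"name": "yourLogin", "pass": "yourPassword"}

    session = requests.Session()
    response = session.post(LOGIN_URL, data=payload)
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.prettify())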
- HW Yahoo login - Due Monday March 27 @ 11:55PM
- Create a Yahoo email account with your USC login
as the account name (yourLogin@yahoo.com). Add 1234... to make
it a new name for Yahoo if necessary.
Use your USC login as the password,
so I can test your ability to log in.
Hard code this login and password into your programs.
- Write a utility function "dump_html(page, tag)" that
uses the BeautifulSoup function prettify() on the "page"
and writes the result to the file "output_"+tag.
Include this function in later programs.
There is no dropbox for this question or the first question.
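A sketch of dump_html; it assumes "page" may be either a BeautifulSoup object or raw HTML text, and that "html.parser" is an acceptable parser:

    from bs4 import BeautifulSoup

    def dump_html(page, tag):
        """Prettify "page" and write it to the file "output_"+tag."""
        if not isinstance(page, BeautifulSoup):     # accept raw HTML too (assumption)
            page = BeautifulSoup(page, "html.parser")
        with open("output_" + tag, "w") as f:
            f.write(page.prettify())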
- Use Selenium to login to your Yahoo account and dump_html the page
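A rough Selenium sketch; Yahoo's login is a two-step form, and the element IDs used here (login-username, login-signin, login-passwd) are assumptions that may have changed. It reuses dump_html from the previous problem, which the assignment says to include in later programs:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()                    # or webdriver.Chrome()
    driver.get("https://login.yahoo.com/")

    # Username first, then the password on a second screen (IDs are assumptions).
    driver.find_element(By.ID, "login-username").send_keys("yourLogin@yahoo.com")
    driver.find_element(By.ID, "login-signin").click()
    time.sleep(3)                                   # crude wait for the next page
    driver.find_element(By.ID, "login-passwd").send_keys("yourLogin")
    driver.find_element(By.ID, "login-signin").click()
    time.sleep(3)

    dump_html(driver.page_source, "yahoo_login")    # utility from the previous problem
    driver.quit()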
- Use scrapy to scrape the table (see next slide) from
http://finance.yahoo.com/quote/FB?p=FB and export to csv
- Use scrapy to extract the same info on FB, XOM, STX, NFLX, AMZN
Here start_requests builds URLs from the company symbol and
yields them, while parse scrapes the table.
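A sketch covering both scrapy problems above; the spider and item field names are mine, and the CSS selector for the summary table is an assumption about Yahoo's markup, which changes often:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "yahoo_quotes"
        symbols = ["FB", "XOM", "STX", "NFLX", "AMZN"]

        def start_requests(self):
            # Build one request per company symbol, as the assignment describes.
            for sym in self.symbols:
                url = "http://finance.yahoo.com/quote/%s?p=%s" % (sym, sym)
                yield scrapy.Request(url, callback=self.parse, meta={"symbol": sym})

        def parse(self, response):
            # Selector below is an assumption about the summary table's markup.
            for row in response.css("table tr"):
                cells = row.css("td::text").getall()
                if len(cells) == 2:
                    yield {"symbol": response.meta["symbol"],
                           "field": cells[0], "value": cells[1]}

Saving this as, say, quotes_spider.py and running "scrapy runspider quotes_spider.py -o quotes.csv" would produce the CSV export.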
- Where is Coach K? - details to come later. Due 3/29 or so.
URL:
http://www.cse.sc.edu/~matthews/Courses/513/Homework.html