TDM 20200: Project 03 — 2024
Motivation: Web scraping is the process of extracting content from websites. It typically goes hand in hand with parsing or processing the data. Scraping data from websites has always been a popular topic in The Data Mine. We will continue to use the website books.toscrape.com to practice our scraping skills.
Context: In the previous projects, we gently introduced XML and XPath to parse an XML document, and we covered some basic web scraping skills using "BeautifulSoup". In this project, we will learn some basic skills for scraping with another library, "selenium".
Scope: Python, web scraping, BeautifulSoup, selenium
Readings and Resources
Questions
In this project, we will re-create the results from Project 2, but we will use Selenium instead of Beautiful Soup.
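As a starting point, the usual Selenium workflow (start a browser, load a page, read from it, close the browser) can be sketched as follows. This is a minimal sketch, assuming Firefox and its driver are available in your environment; the helper names `headless_firefox` and `fetch_page` are ours for illustration and are not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def headless_firefox():
    """Yield a headless Firefox driver and always close it afterwards.

    Assumption: Firefox and geckodriver are installed; swap in another
    browser if your environment differs.
    """
    # Imported here so this file can be loaded even where selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        yield driver
    finally:
        driver.quit()  # release the browser even if scraping fails


def fetch_page(driver, url):
    """Load url in the given driver and return the page's HTML source."""
    driver.get(url)
    return driver.page_source
```

With a working browser setup, `with headless_firefox() as driver: html = fetch_page(driver, "https://books.toscrape.com")` would load the homepage and return its source. Unlike Beautiful Soup, which parses HTML you have already downloaded, Selenium drives a real browser, so remember to close it when you are done.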
Question 1 (2 points)
- Please use selenium to get and display the HTML source code of the website https://books.toscrape.com.
- Review the website's HTML source code. What is the title of that webpage?
Question 2 (2 points)
- Please use the selenium library to get and display all categories' names from the homepage of the website.
Question 3 (2 points)
- Now, instead of only getting the names of the categories, get all of the category links from the homepage as well.
- Update the code from Question 3a to get only the link for the "Romance" category.
romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html
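Once Selenium has returned a list of link elements, pairing each element's visible text with its link is a small step. The sketch below assumes elements that expose `.text` and `get_attribute("href")`, as Selenium's web elements do; `names_and_links` and `find_link` are hypothetical helper names, and choosing the right selector for the page is left to you.

```python
def names_and_links(elements):
    """Return (visible text, href) pairs for a list of link elements."""
    pairs = []
    for element in elements:
        name = element.text.strip()            # visible text, whitespace removed
        href = element.get_attribute("href")   # link target, or None if absent
        if name and href:                      # skip empty or non-link elements
            pairs.append((name, href))
    return pairs


def find_link(pairs, wanted_name):
    """Return the href whose visible text equals wanted_name, or None."""
    for name, href in pairs:
        if name == wanted_name:
            return href
    return None
```

With a live driver, the elements would come from a call such as `driver.find_elements(By.CSS_SELECTOR, "a")`, after `from selenium.webdriver.common.by import By`. The selector `"a"` here is deliberately generic and will match more than the category links.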
Question 4 (2 points)
- Starting from the first page of the "Romance" category, https://books.toscrape.com/catalogue/category/books/romance_8/index.html, please get the titles of all of the books on that page.
- Find the "next" pagination link on the first page of the "Romance" category. Then get all of the book titles from the second page of the "Romance" category.
You will need to extract the "href" attribute of the "a" tag.
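The hint above can be sketched as follows. In Selenium, `get_attribute("href")` on a link element typically returns the target already resolved to an absolute URL, while the raw page source may hold only a relative value. The helper below handles both cases with the standard library's `urljoin`; `link_target` is a hypothetical name, and the relative value `page-2.html` mentioned afterwards is an assumed example, not something taken from the assignment.

```python
from urllib.parse import urljoin


def link_target(element, page_url):
    """Return the absolute URL that a link element points to, or None.

    element  -- any object with get_attribute("href"), e.g. a Selenium element
    page_url -- URL of the page the element was found on
    """
    href = element.get_attribute("href")
    if not href:
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against the page they appeared on.
    return urljoin(page_url, href)
```

For example, resolving the relative value `page-2.html` against the Romance URL above gives `https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html`, which can then be passed to `driver.get` to load the second page.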
Question 5 (2 points)
- In Project 2 and Project 3, we used two different libraries, "BeautifulSoup" and "Selenium", to accomplish very similar tasks. Please briefly outline the similarities and differences between these two libraries (BeautifulSoup versus Selenium).
Project 03 Assignment Checklist
- Jupyter Lab notebook with your code, comments and output for the assignment
  - firstname-lastname-project03.ipynb
- Python file with code and comments for the assignment
  - firstname-lastname-project03.py
- Submit files through Gradescope
Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.