Parsing HTML with BeautifulSoup
To get the information you want from web crawling, you need to extract it from the HTML you have collected.
BeautifulSoup is a Python package that makes this task easy: it parses (analyzes and extracts data from) HTML, such as pages fetched with requests.
Parsing HTML Data and Extracting Necessary Information
Using BeautifulSoup, you can convert an HTML document into a Python object, making it easy to navigate and manipulate each element of the document with Python code.
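As a minimal sketch of what "converting an HTML document into a Python object" looks like, the example below parses a small inline HTML string instead of a live page. The HTML snippet and the variable names are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

# Parse the string into a navigable tree of Python objects
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1            # first <h1> element, as a Tag object
paragraph = soup.find("p")   # first <p> element

print(heading.name)          # the tag's name
print(heading.text)          # the text inside the tag
print(paragraph["class"])    # class is multi-valued, so it comes back as a list
```

Each element is an ordinary Python object, so its name, text, and attributes can be read like any other value.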
Let's go over how to parse HTML data and extract the necessary information using BeautifulSoup.
Parsing HTML with BeautifulSoup
First, you need to convert the HTML data fetched from a web page into a BeautifulSoup object.
You can fetch the HTML with the requests package and then create a BeautifulSoup object to parse it, as follows:
    import requests
    from bs4 import BeautifulSoup

    # URL to request
    url = 'https://www.codefriends.net'

    # Fetch HTML data with a GET request
    response = requests.get(url)

    # Create BeautifulSoup object and parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title tag of the HTML
    title = soup.title.text

    # Print the page title
    print(f"Page Title: {title}")
The above code stores a BeautifulSoup object containing the parsed HTML in the soup variable and extracts the document title with soup.title.text.
soup.title returns the first <title> tag of the HTML document, and .text extracts the text inside that tag.
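One detail worth keeping in mind: soup.title is None when the document has no <title> tag, so soup.title.text raises an AttributeError on such pages. The sketch below shows one way to guard against that. The helper name page_title and both HTML strings are hypothetical, chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical documents: one with a <title>, one without
with_title = BeautifulSoup(
    "<html><head><title> My Page </title></head></html>", "html.parser"
)
without_title = BeautifulSoup(
    "<html><body><p>No title here</p></body></html>", "html.parser"
)


def page_title(soup):
    """Return the stripped title text, or None if there is no <title> tag.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if soup.title is None:
        return None
    # get_text(strip=True) removes leading and trailing whitespace
    return soup.title.get_text(strip=True)


print(page_title(with_title))
print(page_title(without_title))
```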
Extracting Required Information
Various methods can be used to extract information, as shown below.
- Finding elements by tag name: Locate specific tags in the HTML document.
    # Find all <a> tags
    links = soup.find_all('a')

    # Print all links
    for link in links:
        print(link.get('href'))
- Finding elements by class name: Locate elements by a specific class name.
    # Find all <div> tags with class="example"
    divs = soup.find_all('div', class_='example')

    # Print the text of all <div> tags
    for div in divs:
        print(div.text)
- Finding elements by ID: Locate an element by a specific ID.
    # Find the element with id="main-content"
    main_content = soup.find(id='main-content')

    # find() returns None when nothing matches, so check before using the result
    if main_content is not None:
        print(main_content.text)
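The three lookups above can be tried together without any network access. The following is a small sketch on a made-up HTML fragment; the element names, classes, and IDs are assumptions for illustration. It also shows two behaviors that are easy to miss: class_ matches any one of an element's classes, and find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration
html = """
<div id="main-content">
  <div class="example">First block</div>
  <div class="example highlighted">Second block</div>
  <div class="other">Unrelated block</div>
  <a href="/about">About</a>
  <a>No href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# By tag name: .get('href') returns None for an <a> without that attribute
hrefs = [a.get("href") for a in soup.find_all("a")]

# By class name: an element with class="example highlighted" also matches
example_texts = [
    div.get_text(strip=True) for div in soup.find_all("div", class_="example")
]

# By ID: find() returns a single element, or None if nothing matches
main = soup.find(id="main-content")
missing = soup.find(id="does-not-exist")

print(hrefs)
print(example_texts)
print(main is not None, missing is None)
```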
Extracting Article Titles and Links from a Web Page
Below is an example of extracting article titles and links from an actual web page:
    import requests
    from bs4 import BeautifulSoup

    # URL of the web page to be scraped
    url = 'https://news.ycombinator.com/'

    # Fetch HTML data with a GET request
    response = requests.get(url)

    # Create BeautifulSoup object and parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract all article titles and links
    articles = soup.find_all('a')

    # Print article titles and links
    for article in articles:
        # Extract article title and link
        title = article.text
        # Link URL
        link = article.get('href')
        # Print title and link
        print(f"Title: {title}, Link: {link}")
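The above code searches for all <a> tags on the Y Combinator news page (Hacker News) and prints the text and href of each one. Because it matches every link on the page, the output includes navigation and other links as well as article titles.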
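Since find_all('a') returns every anchor on a page, some filtering is usually needed before the results are useful. The sketch below is one possible approach, not the lesson's own method: it skips anchors that have no href or no visible text and resolves relative links against a base URL. The function name extract_links and the sample HTML are hypothetical; the sample does not reproduce the real markup of any site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (title, absolute_url) pairs for anchors that have text and an href.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # href=True keeps only <a> tags that actually carry an href attribute
    for anchor in soup.find_all("a", href=True):
        title = anchor.get_text(strip=True)
        if not title:
            continue  # skip anchors with no visible text
        # Turn relative links such as "item?id=1" into absolute URLs
        results.append((title, urljoin(base_url, anchor["href"])))
    return results


# Made-up HTML standing in for a fetched page
sample_html = """
<a href="item?id=1">First story</a>
<a href="https://example.com/post">External story</a>
<a href="/empty"></a>
<a>No link</a>
"""

links = extract_links(sample_html, "https://news.ycombinator.com/")
for title, link in links:
    print(f"Title: {title}, Link: {link}")
```

To use it on a live page, pass response.text from a requests call as the html argument and the page URL as base_url.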
As seen, using BeautifulSoup allows you to easily analyze the structure of a web page to extract the desired data.