Parsing HTML with BeautifulSoup

To get the information you want from web crawling, you need to extract it from the HTML you have collected.

BeautifulSoup is a Python package for parsing HTML, that is, analyzing its structure and extracting data from it. It is commonly used on HTML fetched with the requests package.


Parsing HTML Data and Extracting Necessary Information

Using BeautifulSoup, you can convert an HTML document into a Python object, making it easy to navigate and manipulate each element of the document with Python code.
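As a minimal sketch of what converting HTML into a Python object looks like, the snippet below parses a small hard-coded HTML string rather than a downloaded page. The sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; a real page would be fetched with requests
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: the first tag of each kind is available as an attribute
heading = soup.h1
print(heading.name)  # the tag name
print(heading.text)  # the text inside the tag

# Manipulate: replace the text and the class attribute of the <p> tag
paragraph = soup.p
paragraph.string = "Updated paragraph"
paragraph["class"] = "lead"
print(paragraph)
```

Working on an inline string like this is a convenient way to try out BeautifulSoup calls without sending any requests.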

Let's go over how to parse HTML data and extract the necessary information using BeautifulSoup.


Parsing HTML with BeautifulSoup

First, you need to convert the HTML data fetched from a web page into a BeautifulSoup object.

The following code fetches HTML data with the requests package and then creates a BeautifulSoup object to parse it:

Parsing HTML with BeautifulSoup
import requests
from bs4 import BeautifulSoup

# URL to request
url = 'https://www.codefriends.net'

# Fetch HTML data with a GET request
response = requests.get(url)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title tag of the HTML
title = soup.title.text

# Print the page title
print(f"Page Title: {title}")

The above code stores a BeautifulSoup object with the parsed HTML data in the soup variable, and extracts the title of the HTML document using soup.title.text.

soup.title returns the first <title> tag of the HTML document as a tag object, and .text extracts the text inside that tag.
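One detail worth knowing is that soup.title is None when a document has no <title> tag, so calling .text on it would raise an AttributeError. The sketch below shows one way to guard against that. The helper name page_title and both sample strings are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the <title> text, or None if the document has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title  # a tag object, or None if there is no <title>
    if title_tag is None:
        return None
    # Strip surrounding whitespace from the title text
    return title_tag.text.strip()


# Hypothetical sample documents
with_title = "<html><head><title> My Page </title></head></html>"
without_title = "<html><body><p>No title here</p></body></html>"

print(page_title(with_title))
print(page_title(without_title))
```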


Extracting Required Information

Various methods can be used to extract information, as shown below.

  1. Finding elements by tag name: Locate specific tags in the HTML document.
Finding elements by tag name
# Find all <a> tags
links = soup.find_all('a')

# Print all links
for link in links:
    print(link.get('href'))

  2. Finding elements by class name: Locate elements by a specific class name.
Finding elements by class name
# Find all <div> tags with class="example"
divs = soup.find_all('div', class_='example')

# Print the text of all <div> tags
for div in divs:
    print(div.text)

  3. Finding elements by ID: Locate an element by a specific ID.
Finding elements by ID
# Find element with id="main-content"
main_content = soup.find(id='main-content')

# Print the text of the selected element (find returns None if nothing matches)
if main_content is not None:
    print(main_content.text)

Extracting Article Titles and Links from a Web Page

Below is an example of extracting article titles and links from an actual web page:

Extracting Article Titles and Links
import requests
from bs4 import BeautifulSoup

# URL of the web page to be scraped
url = 'https://news.ycombinator.com/'

# Fetch HTML data with a GET request
response = requests.get(url)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all article titles and links
articles = soup.find_all('a')

# Print article titles and links
for article in articles:
    # Extract article title and link
    title = article.text
    # Link URL
    link = article.get('href')
    # Print title and link
    print(f"Title: {title}, Link: {link}")

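The above code finds every <a> tag on the Hacker News front page (news.ycombinator.com) and prints its text and href. Because it matches all links, the output includes navigation links as well as article titles; to keep only the articles, narrow the search with a more specific tag or class.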
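A rough sketch of narrowing the search to article links is shown below. It runs on a hard-coded HTML string, not the live site, and the markup is an assumption: the titleline class and the sample links are illustrative only, and the real page structure may differ or change, so inspect the page before relying on any selector. urljoin from the standard library turns relative links into absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com/"

# Hypothetical markup: article links wrapped in <span class="titleline">,
# mixed with navigation links that should be skipped
html = """
<a href="newest">new</a>
<span class="titleline"><a href="https://example.com/post">An example post</a></span>
<span class="titleline"><a href="item?id=1">A discussion thread</a></span>
<a href="login">login</a>
"""
soup = BeautifulSoup(html, "html.parser")

articles = []
# Select only <a> tags that are direct children of span.titleline
for a in soup.select("span.titleline > a"):
    href = a.get("href")
    if not href:
        continue  # skip anchors without a link target
    # Resolve relative links against the base URL
    articles.append((a.text.strip(), urljoin(BASE_URL, href)))

for title, link in articles:
    print(f"Title: {title}, Link: {link}")
```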

As seen, using BeautifulSoup allows you to easily analyze the structure of a web page to extract the desired data.

Mission

What is the most appropriate method to fill in the blank below?

When parsing HTML data using the BeautifulSoup package and extracting the necessary information, the method to extract the title of the HTML document is ____.
soup.title.text
soup.find_all('title')
soup.get_title()
soup.find('head').title
