Fetching Wikipedia Homepage Information with Python
Wikipedia is an online encyclopedia collaboratively built by people around the world.
In this lesson, we'll use Python code to collect specific information from a Wikipedia page.
Using the BeautifulSoup and requests libraries, we can fetch the title and description of the Wikipedia homepage, as shown below.
Step 1: Import Required Libraries
    import requests
    from bs4 import BeautifulSoup
The above code performs the following tasks:
- Uses the import keyword to load the requests library for HTTP communication
- Uses the from keyword to load the bs4 package, which parses webpage data, and imports the BeautifulSoup class from it
Step 2: Fetch HTML from URL and Store It in a Variable
Use requests to fetch the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable, as shown below:
    # Wikipedia homepage URL
    url = "https://www.wikipedia.org"

    # Fetch HTML from the URL using the requests library
    response = requests.get(url)

    # Set the encoding of the fetched HTML to UTF-8
    response.encoding = 'utf-8'

    # Store the fetched HTML in the soup variable
    soup = BeautifulSoup(response.text, 'html.parser')
The above code performs the following tasks:
- Stores the Wikipedia homepage URL in the url variable
- Fetches HTML from the URL using requests.get(url)
- Sets the encoding of the response to UTF-8 using response.encoding = 'utf-8'
- Parses the fetched HTML using BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
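The fetch-and-parse step above assumes the request succeeds. As a minimal sketch of a more defensive version, the hypothetical helper fetch_soup below (the name and the 10-second default timeout are assumptions, not part of the lesson) adds a timeout and stops on HTTP error responses before parsing:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    fetch_soup and its default timeout are illustrative assumptions.
    """
    # A timeout keeps the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()

    # Same encoding and parser choices as in the lesson code
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://www.wikipedia.org") should then give a soup object that can be used in the same way as the soup variable in the lesson.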
Step 3: Extract Title and Description Information
Extract the desired information from the soup variable as shown below:
    # Extract h1 (heading 1, title) from the webpage
    h1_title = soup.find('h1').text

    # Extract p (paragraph) tag from the webpage
    p_description = soup.find('p').text
The above code performs the following tasks:
- Uses soup.find('h1').text to find the h1 tag in the soup variable, extracts the title, and stores it in the h1_title variable
- Uses soup.find('p').text to find the p tag in the soup variable, extracts the description, and stores it in the p_description variable
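One detail worth knowing about the extraction above: find returns None when no matching tag exists, so calling .text on the result raises an AttributeError on pages without an h1 or p tag. The sketch below shows one way to guard against that, using a small sample HTML string in place of a real page. The sample HTML and the first_text helper are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for a fetched page
sample_html = """
<html>
  <body>
    <h1> Sample Title </h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = soup.find(tag_name)  # find() returns None when no tag matches
    if tag is None:
        return default
    # get_text(strip=True) removes leading and trailing whitespace
    return tag.get_text(strip=True)


h1_title = first_text(soup, "h1")
p_description = first_text(soup, "p")  # only the first p tag is used
h2_subtitle = first_text(soup, "h2", default="(not found)")
```

Note that find only returns the first match, so p_description holds the text of the first paragraph even though the sample contains two.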
Finally, use the print function to display the title and description fetched from the URL.
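The printing step might look like the sketch below. The values here are placeholders standing in for the h1_title and p_description variables from Step 3, and the format_summary helper and its "Title:" and "Description:" labels are assumptions rather than the lesson's exact output format.

```python
def format_summary(title, description):
    """Build the two-line text to display (labels are illustrative)."""
    return f"Title: {title}\nDescription: {description}"


# Placeholder values standing in for the results of Step 3
h1_title = "Example title"
p_description = "Example description."

print(format_summary(h1_title, p_description))
```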
Practice
Click the Run Code button on the right-hand side to see the scraping results.
The first execution of the code may take some time.
You can also change the url value in the code (e.g., to https://www.codefriends.net) to fetch information from other webpages.
Which library is used for parsing HTML when web scraping with Python?
requests
BeautifulSoup
urllib
selenium