Lecture

Scraping Wikipedia Homepage Information with Python

Wikipedia is an online encyclopedia created by people worldwide. 📘

In this lesson, we will learn how to collect specific information from a Wikipedia page using Python code.

Using the BeautifulSoup and requests libraries, you can extract the title and description from the Wikipedia homepage as shown below.


Step 1: Import Necessary Libraries

Importing requests and BeautifulSoup libraries
    import requests
    from bs4 import BeautifulSoup

This code performs the following:

  • Uses the import keyword to load the requests library for HTTP communication

  • Uses the from ... import form to load the BeautifulSoup class from the bs4 package, which parses HTML for web scraping


Step 2: Retrieve and Store HTML from the URL

Use requests to retrieve the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable as follows.

Fetching HTML from Wikipedia homepage
    # Wikipedia homepage URL
    url = "https://www.wikipedia.org"

    # Fetch HTML from the URL using the requests library
    response = requests.get(url)

    # Set the encoding of the fetched HTML to UTF-8
    response.encoding = 'utf-8'

    # Parse the fetched HTML and store the result in the soup variable
    soup = BeautifulSoup(response.text, 'html.parser')

This code performs the following:

  • Stores the Wikipedia homepage URL in the url variable

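  • Fetches HTML from the URL using requests.get(url) and stores the response in the response variable

  • Sets response.encoding to 'utf-8' so that response.text is decoded as UTF-8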

  • Parses the fetched HTML with BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
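The fetch above assumes the request always succeeds. A slightly more defensive variant is sketched below, assuming you want a timeout and an error on failed responses. The function name fetch_soup and the User-Agent string are hypothetical choices for illustration, not part of the lesson's code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object."""
    # Hypothetical User-Agent string (an assumption; adjust or remove it)
    headers = {"User-Agent": "my-scraping-lesson/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Decode the body as UTF-8, as in the step above
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://www.wikipedia.org") should return a soup object equivalent to the one built above, provided the network request succeeds.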


Step 3: Extract Title and Description Information

Extract desired information from the soup variable as shown below.

Extracting title and description from Wikipedia homepage
    # Extract h1 (heading 1, title) from the webpage
    h1_title = soup.find('h1').text

    # Extract p (paragraph) tag from the webpage
    p_description = soup.find('p').text

This code performs the following:

  • Finds the first h1 tag with soup.find('h1'), reads its text with .text, and stores the title in the h1_title variable

  • Finds the first p tag with soup.find('p'), reads its text with .text, and stores the description in the p_description variable

Finally, use the print function to display the extracted title and description from the URL.
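As a minimal sketch of that final step, the snippet below runs the same extraction on a small inline HTML string and prints both values, so it can be tried without a network connection. The sample_html content is made up for illustration and is not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; not the real Wikipedia markup
sample_html = """
<html>
  <body>
    <h1>Wikipedia</h1>
    <p>The Free Encyclopedia</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Same extraction as in Step 3
h1_title = soup.find("h1").text
p_description = soup.find("p").text

# Display the extracted values
print("Title:", h1_title)
print("Description:", p_description)
```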


Practice

Press the Run Code button on the right to see the scraping results. The first execution may take some time.

You can also change the value of the url variable (e.g., to https://www.codefriends.net) to fetch information from other web pages.
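Other pages may not contain an h1 or p tag at all. In that case soup.find returns None, and reading .text from it raises an AttributeError. One way to guard against this is sketched below; first_text is a hypothetical helper name, and the HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name, default="(not found)"):
    """Return the text of the first matching tag, or a default if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return default
    # Join nested text pieces with a space and trim surrounding whitespace
    return tag.get_text(" ", strip=True)


# Made-up page with a heading but no paragraph
html = "<html><body><h1> Example <span>Site</span></h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(first_text(soup, "h1"))
print(first_text(soup, "p"))
```

Using get_text with a separator also keeps words from running together when the heading contains nested tags.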
