Extracting Desired Information from Wikipedia

In this lesson, we will learn how to use Python to crawl data from the "Internet" article on Wikipedia.

Specifically, we will extract the title and the first few paragraphs of the page's content, and learn how to handle UTF-8 data properly.


Fetching the Web Page

First, we will use the requests package to fetch the "Internet" page from Wikipedia.

The requests.get() function retrieves the HTML source of the page.

import requests

# Set the URL
url = 'https://en.wikipedia.org/wiki/Internet'

# Fetch the web page
response = requests.get(url)

# Check the response status code
print("status_code:", response.status_code)
  • The url variable stores the address of the page to be crawled.

  • requests.get(url) fetches the HTML source of the given URL.

  • response.status_code is used to check if the request was successful. A status code of 200 means the request was successful.
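
If the request fails, the code above will continue with an error page in hand. As a hedged sketch (not part of the lesson's original code), requests can instead raise an exception on error responses via its built-in raise_for_status(); the timeout value below is an assumed choice:

import requests

url = 'https://en.wikipedia.org/wiki/Internet'

# timeout=10 is an arbitrary safeguard; raise_for_status() raises
# requests.HTTPError for any 4xx/5xx response.
response = requests.get(url, timeout=10)
response.raise_for_status()
print("status_code:", response.status_code)  # reached only on success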


Parsing HTML and Extracting the Title

We will parse the HTML and extract the title of the page.

We will use BeautifulSoup to analyze the HTML structure.

from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title of the page
title = soup.find('h1', id='firstHeading').text
print("title:", title)
  • BeautifulSoup(response.text, 'html.parser') parses the HTML source.

  • soup.find('h1', id='firstHeading').text extracts the title of the page.
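
Note that find() returns None when no matching element exists, so the chained .text raises AttributeError if Wikipedia's markup ever changes. A minimal guard, sketched under the assumption that soup has already been built as above:

heading = soup.find('h1', id='firstHeading')
if heading is not None:
    title = heading.text
else:
    # Fallback for illustration: use the <title> tag if the heading is missing
    title = soup.title.text if soup.title else "(title not found)"
print("title:", title)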


Extracting the Content

Next, we will retrieve all <p> tags from the content and then extract the first 5 paragraphs.

# Retrieve all <p> tags from the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')

# Select only the first 5 <p> tags
paragraphs = all_paragraphs[:5]

# Combine the extracted paragraphs into a single text
content = "\n".join([p.text for p in paragraphs])
  • soup.find('div', class_='mw-parser-output').find_all('p') retrieves all <p> tags from the content.

  • paragraphs = all_paragraphs[:5] selects the first 5 <p> tags.

  • "\n".join([p.text for p in paragraphs]) combines the selected paragraphs into a single text.


Handling UTF-8 Encoding Issues and Output

To properly output the crawled UTF-8 data, we will address encoding issues.

# Handle UTF-8 encoding issues
print("content:", content.encode('utf-8').decode('utf-8'))
  • content.encode('utf-8').decode('utf-8') round-trips the string through UTF-8 bytes and back, confirming the text encodes cleanly so it displays properly.

Encoding converts a string into bytes (data composed of 0s and 1s) and decoding converts bytes into a string.
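
To see the string/bytes round trip in isolation, here is a small example (independent of the crawler) using a word with a non-ASCII character:

text = "Zürich"                       # str: a sequence of Unicode characters
data = text.encode('utf-8')           # bytes: b'Z\xc3\xbcrich'
print(type(text).__name__)            # str
print(type(data).__name__)            # bytes
print(data.decode('utf-8') == text)   # True: decoding restores the string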

Once you run the code, the title and content of the Wikipedia Internet page will be displayed.
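
For reference, the snippets from this lesson assemble into a single runnable script:

import requests
from bs4 import BeautifulSoup

# Fetch the web page
url = 'https://en.wikipedia.org/wiki/Internet'
response = requests.get(url)
print("status_code:", response.status_code)

# Parse the HTML and extract the title
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('h1', id='firstHeading').text
print("title:", title)

# Extract the first 5 paragraphs of the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')
paragraphs = all_paragraphs[:5]
content = "\n".join([p.text for p in paragraphs])

# Output the UTF-8 content
print("content:", content.encode('utf-8').decode('utf-8'))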

Mission

What is the most appropriate content to fill in the blank below?

____ is used to fetch the HTML source of a specific URL.

  • requests.get()
  • BeautifulSoup()
  • urllib.request()
  • selenium.webdriver()
