Extracting Desired Information from Wikipedia
In this lesson, we will learn how to use Python to crawl data from the Internet page on Wikipedia. Specifically, we will extract the title and the first few paragraphs of the page's content, and learn how to properly handle UTF-8 data.
Fetching the Web Page
First, we will use the requests package to fetch the Internet page on Wikipedia. The requests.get() function retrieves the HTML source of the page.
import requests

# Set the URL
url = 'https://en.wikipedia.org/wiki/Internet'

# Fetch the web page
response = requests.get(url)

# Check the response status code
print("status_code:", response.status_code)
- The url variable stores the address of the page to be crawled.
- requests.get(url) fetches the HTML source of the given URL.
- response.status_code is used to check if the request was successful. A status code of 200 means the request was successful (a more defensive version of this check is sketched after this list).
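In practice, it helps to fail fast when the request does not succeed. Here is a minimal sketch; the timeout value and the try/except style are our own choices rather than part of the lesson's code:

import requests

url = 'https://en.wikipedia.org/wiki/Internet'

try:
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
except requests.RequestException as e:
    print("Request failed:", e)
else:
    print("status_code:", response.status_code)  # 200 on success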
Parsing HTML and Extracting the Title
We will parse the HTML and extract the title of the page, using BeautifulSoup to analyze the HTML structure.
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title of the page
title = soup.find('h1', id='firstHeading').text
print("title:", title)
- BeautifulSoup(response.text, 'html.parser') parses the HTML source.
- soup.find('h1', id='firstHeading').text extracts the title of the page (a None-safe variant is sketched after this list).
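Note that soup.find() returns None when no matching tag exists, so chaining .text directly raises an AttributeError if Wikipedia's layout ever changes. A minimal sketch of a guarded version, assuming the same soup object as above:

# find() returns None when the tag is missing, so guard before using .text
heading = soup.find('h1', id='firstHeading')
if heading is not None:
    print("title:", heading.text)
else:
    print("h1#firstHeading not found; the page layout may have changed")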
Extracting the Content
Next, we will retrieve all <p> tags from the content and then extract the first 5 paragraphs.
# Retrieve all <p> tags from the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')

# Select only the first 5 <p> tags
paragraphs = all_paragraphs[:5]

# Combine the extracted paragraphs into a single text
content = "\n".join([p.text for p in paragraphs])
- soup.find('div', class_='mw-parser-output').find_all('p') retrieves all <p> tags from the content.
- paragraphs = all_paragraphs[:5] selects the first 5 <p> tags.
- "\n".join([p.text for p in paragraphs]) combines the selected paragraphs into a single text (a variant that skips empty paragraphs is sketched after this list).
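On some Wikipedia pages the first <p> tags inside mw-parser-output are empty or whitespace-only, so slicing blindly can pick up blank paragraphs. Here is a small sketch that filters them out first; this filtering step is our own addition, not part of the lesson's code:

# Keep only paragraphs that contain visible text before slicing
non_empty = [p for p in all_paragraphs if p.text.strip()]
paragraphs = non_empty[:5]
content = "\n".join(p.text for p in paragraphs)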
Handling UTF-8 Encoding Issues and Output
To properly output the crawled UTF-8 data, we will address encoding issues.
# Handle UTF-8 encoding issues
print("content:", content.encode('utf-8').decode('utf-8'))
content.encode('utf-8').decode('utf-8') round-trips the string through UTF-8: encoding converts the string into bytes (data composed of 0s and 1s), and decoding converts those bytes back into a string, confirming that the crawled text is valid UTF-8 and can be displayed properly.
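To make the round trip concrete, here is a small sketch; the sample string is our own, chosen because it contains a non-ASCII character:

# Encoding turns a string into bytes; decoding turns bytes back into a string
text = "Wikipédia"                 # contains the non-ASCII character é

encoded = text.encode('utf-8')     # str -> bytes
print(encoded)                     # b'Wikip\xc3\xa9dia'

decoded = encoded.decode('utf-8')  # bytes -> str
print(decoded == text)             # True: the round trip is lossless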
Once you run the code, the title and content of the Wikipedia Internet page will be displayed.