Crawling Latest Trending Articles from Wikipedia
Use the find_all method of BeautifulSoup to crawl significant events from Wikipedia's Current Events section.
Example Code Explanation
import requests
from bs4 import BeautifulSoup


def crawl_wikipedia_current_events_first_10_titles():
    url = "https://en.wikipedia.org/wiki/Portal:Current_events"
    response = requests.get(url)

    if response.status_code != 200:
        print("Response failed", response.status_code)
        return None

    soup = BeautifulSoup(response.content, "html.parser")

    # Locate the div tag containing the contents of the Current Events section
    current_events_section = soup.find("div", {"id": "mw-content-text"})

    # Find all li tags within the div tag
    list_items = current_events_section.find_all("li") if current_events_section else []

    # Extract text inside li tags and store them in a list
    titles = [item.get_text(strip=True) for item in list_items[:10]]

    return titles
- Requesting a Web Page: Use requests.get(url) to request the content of a specific URL.
- Checking Response Status: Verify whether the request was successful by inspecting response.status_code.
- Creating a BeautifulSoup Object and Parsing Data: Use BeautifulSoup(response.content, "html.parser") to parse the HTML content.
- Extracting Data from a Specific Section: Locate all li tags within a particular section of the webpage (e.g., 'Current Events'), and extract the first 10 entries.
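The parsing and extraction steps can also be tried offline, without sending a request. The sketch below is only an illustration: SAMPLE_HTML is a made-up stand-in for a downloaded page (it is not Wikipedia's real markup), and extract_titles is a hypothetical helper that mirrors the last two steps of the example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content; not Wikipedia's actual markup.
SAMPLE_HTML = """
<html><body>
  <div id="mw-content-text">
    <ul>
      <li>First event</li>
      <li>Second event</li>
      <li>Third event</li>
    </ul>
  </div>
  <div id="footer"><ul><li>Footer link</li></ul></div>
</body></html>
"""


def extract_titles(html, limit=10):
    """Return the text of the first `limit` li tags inside div#mw-content-text."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", {"id": "mw-content-text"})
    if section is None:
        # The container was not found, so there is nothing to extract.
        return []
    # find_all only searches inside `section`, so the footer li is ignored.
    return [li.get_text(strip=True) for li in section.find_all("li")[:limit]]


titles = extract_titles(SAMPLE_HTML, limit=2)
print(titles)
```

Because find_all is called on the located div rather than on the whole document, list items outside that container (here, the footer link) are never returned.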
Practice Exercises
- Use the above code to extract the latest event titles from Wikipedia's 'Current Events' section.
- Experiment with targeting different webpages and sections to practice data extraction techniques.
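For the second exercise, one possible starting point is a helper that takes the container and the tag to collect as parameters. This is a minimal sketch under assumptions: extract_items, PAGE, and the ids "news" and "sports" are invented for illustration, and a real page will need its own ids and tags, found by inspecting its HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two sections; replace with HTML from a real response.
PAGE = """
<div id="news">
  <h2>Top <b>stories</b></h2>
  <ul><li>Alpha</li><li>Beta</li><li>Gamma</li></ul>
</div>
<div id="sports">
  <ul><li>Match report</li></ul>
</div>
"""


def extract_items(html, container_id, item_tag="li", limit=10):
    """Collect text from up to `limit` `item_tag` elements inside the element with `container_id`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        return []
    # limit= stops the search early instead of slicing the full result list.
    # A " " separator keeps words apart when an item contains nested tags.
    return [
        el.get_text(" ", strip=True)
        for el in container.find_all(item_tag, limit=limit)
    ]


print(extract_items(PAGE, "news", limit=2))
print(extract_items(PAGE, "sports"))
print(extract_items(PAGE, "news", item_tag="h2"))
```

Passing a separator to get_text matters for items with nested tags such as links or bold text: with strip=True alone, the pieces are joined with no space between them.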