Lecture

Crawling Latest Trending Articles from Wikipedia

Use BeautifulSoup's find_all method to extract significant events from Wikipedia's Current Events page.


Example Code Explanation

Extracting the First 10 Trending Article Titles
import requests
from bs4 import BeautifulSoup


def crawl_wikipedia_current_events_first_10_titles():
    url = "https://en.wikipedia.org/wiki/Portal:Current_events"
    response = requests.get(url)

    if response.status_code != 200:
        print("Response failed", response.status_code)
        return None

    soup = BeautifulSoup(response.content, "html.parser")

    # Locate the div tag containing the contents of the Current Events section
    current_events_section = soup.find("div", {"id": "mw-content-text"})

    # Find all li tags within the div tag
    list_items = current_events_section.find_all("li") if current_events_section else []

    # Extract text inside li tags and store them in a list
    titles = [item.get_text(strip=True) for item in list_items[:10]]

    return titles

  1. Requesting a Web Page: Use requests.get(url) to request the content of a specific URL.

  2. Checking Response Status: Verify whether the request was successful by inspecting response.status_code. The example treats any code other than 200 as a failure, prints it, and returns None.

  3. Creating a BeautifulSoup Object and Parsing Data: Use BeautifulSoup(response.content, "html.parser") to parse the HTML content.

  4. Extracting Data from a Specific Section: Locate all li tags within a particular section of the webpage (e.g., 'Current Events'), and extract the first 10 entries.
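Steps 3 and 4 can be tried without a network request by parsing a small HTML string. The following is a minimal sketch: the sample markup and the helper name first_n_list_items are made up for illustration and do not reflect Wikipedia's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page
SAMPLE_HTML = """
<div id="mw-content-text">
  <ul>
    <li>Event A</li>
    <li> Event B </li>
    <li>Event C</li>
  </ul>
</div>
<div id="footer"><ul><li>Not an event</li></ul></div>
"""


def first_n_list_items(html, container_id, n):
    """Return the text of the first n <li> tags inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    # find returns the first matching tag, or None when nothing matches
    container = soup.find("div", {"id": container_id})
    if container is None:
        return []
    # find_all returns every matching descendant in document order
    return [li.get_text(strip=True) for li in container.find_all("li")[:n]]


print(first_n_list_items(SAMPLE_HTML, "mw-content-text", 2))  # ['Event A', 'Event B']
print(first_n_list_items(SAMPLE_HTML, "missing-id", 2))       # []
```

Because the search starts from the container rather than from the whole document, the li tag in the footer div is never considered.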


Practice Exercises

  • Use the above code to extract the latest event titles from Wikipedia's 'Current Events' section.

  • Experiment with targeting different webpages and sections to practice data extraction techniques.
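For the second exercise, one possible starting point is a helper that takes the HTML and a CSS selector as parameters, so the same code can target different pages and sections. This is only a sketch: extract_texts and the sample page are hypothetical, and it uses BeautifulSoup's select method instead of find_all. In practice the html argument could be response.content from a successful request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only to demonstrate the helper
PAGE = """
<div class="news"><h2>Top <b>stories</b></h2><ul><li>One</li><li>Two</li></ul></div>
<div class="sports"><h2>Scores</h2><ul><li>Three</li></ul></div>
"""


def extract_texts(html, css_selector, limit=10):
    """Return the text of up to `limit` tags matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(css_selector)
    # A " " separator keeps words apart when text is split across nested tags
    return [tag.get_text(" ", strip=True) for tag in matches[:limit]]


print(extract_texts(PAGE, "div.news li"))  # ['One', 'Two']
print(extract_texts(PAGE, "h2"))           # ['Top stories', 'Scores']
```

Passing a separator to get_text is a design choice: with strip=True alone, the first heading would come out as "Topstories" because the pieces of text are joined without a space.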
