Wikipedia Article Crawling
This document shows how to extract the title and the first paragraph of a Wikipedia article using Python's requests and BeautifulSoup libraries.
Step 1
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
In this step, requests.get(url) retrieves the HTML content of the page at url, where url holds the address of the Wikipedia article. BeautifulSoup then parses response.text with Python's built-in 'html.parser', and the parsed content is stored in the soup object, which gives easy access to the HTML elements.
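The fetch-and-parse step can be made a little more defensive. The following is a minimal sketch, not part of the original lesson: the function names `parse_html` and `fetch_soup`, the example URL, the 10-second timeout, and the User-Agent string are all assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example URL; any Wikipedia article address would do.
EXAMPLE_URL = "https://en.wikipedia.org/wiki/Web_scraping"


def parse_html(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    The User-Agent value is a made-up placeholder; sending a
    descriptive one is a courtesy to the site being crawled.
    """
    headers = {"User-Agent": "wiki-crawl-example/0.1 (educational script)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.text)
```

Calling `fetch_soup(EXAMPLE_URL)` would return the same kind of soup object as the snippet above. The timeout keeps the script from hanging indefinitely if the server does not respond.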
Step 2
page_title = soup.find('title').text
soup.find('title') locates the <title> tag of the HTML document, and its .text attribute returns the text inside the tag, which is the page's title. If the document has no <title> tag, find returns None, and accessing .text on it raises an AttributeError.
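Since `find` returns `None` when no matching tag exists, the title lookup can be wrapped in a small helper. This is a sketch under that assumption; the name `get_page_title` is hypothetical and does not appear in the original lesson.

```python
from bs4 import BeautifulSoup


def get_page_title(soup):
    """Return the text of the <title> tag, or None if the tag is missing."""
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    # get_text(strip=True) removes leading and trailing whitespace.
    return title_tag.get_text(strip=True)


# Small inline document used only for demonstration.
demo = BeautifulSoup(
    "<html><head><title> Web scraping - Wikipedia </title></head></html>",
    "html.parser",
)
print(get_page_title(demo))
```

Returning `None` rather than raising lets the caller decide how to report a page without a title.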
Step 3
first_valid_paragraph = None
for paragraph in soup.find_all('p'):
    if 'mw-empty-elt' not in paragraph.get('class', []):
        first_valid_paragraph = paragraph.text.strip()
        break
The loop iterates over all <p> tags and stops at the first one that does not carry the 'mw-empty-elt' class. The 'mw-empty-elt' class indicates an empty paragraph, so such paragraphs are skipped to reach the first paragraph with actual content. paragraph.get('class', []) returns an empty list for a paragraph with no class attribute, which keeps the membership test safe.
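The filtering logic can be tried without any network access by running it on a small hand-written HTML snippet. The sketch below assumes a hypothetical helper name, `first_valid_paragraph_text`, and made-up sample markup that imitates an empty placeholder paragraph followed by real content.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: one empty placeholder paragraph, then real text.
SAMPLE_HTML = """
<div>
  <p class="mw-empty-elt">
  </p>
  <p>Web scraping is a way of extracting data from websites.</p>
  <p>A second paragraph that should not be returned.</p>
</div>
"""


def first_valid_paragraph_text(soup):
    """Return the text of the first <p> without the 'mw-empty-elt' class."""
    for paragraph in soup.find_all("p"):
        # The class attribute is a list of names; default to [] if absent.
        if "mw-empty-elt" not in paragraph.get("class", []):
            return paragraph.text.strip()
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(first_valid_paragraph_text(soup))
```

Returning from inside the loop plays the same role as the `break` in the step above: iteration stops as soon as a usable paragraph is found.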
Step 4
print(f"Page Title: {page_title}\n")
if first_valid_paragraph:
    print(f"First Paragraph: {first_valid_paragraph}\n")
else:
    print("No valid first paragraph found.\n")
Finally, the extracted page title and the first valid paragraph are printed. If a valid first paragraph was found, its content is displayed; otherwise, the message "No valid first paragraph found." is shown.
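The four steps can be combined into one script. The following is one possible arrangement, not the lesson's own code: the function names `extract_summary`, `format_report`, and `crawl` are hypothetical, and the 10-second timeout is an assumption. Parsing and formatting are kept separate from the network call so they can be checked on a plain HTML string.

```python
import requests
from bs4 import BeautifulSoup


def extract_summary(html):
    """Return (page_title, first_valid_paragraph) from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("title")
    page_title = title_tag.text.strip() if title_tag is not None else None

    first_valid_paragraph = None
    for paragraph in soup.find_all("p"):
        if "mw-empty-elt" not in paragraph.get("class", []):
            first_valid_paragraph = paragraph.text.strip()
            break
    return page_title, first_valid_paragraph


def format_report(page_title, first_valid_paragraph):
    """Build the same text that the print statements in Step 4 produce."""
    lines = [f"Page Title: {page_title}\n"]
    if first_valid_paragraph:
        lines.append(f"First Paragraph: {first_valid_paragraph}\n")
    else:
        lines.append("No valid first paragraph found.\n")
    return "\n".join(lines)


def crawl(url):
    """Fetch a page over the network and print its title and first paragraph."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    print(format_report(*extract_summary(response.text)))
```

Calling `crawl` with the address of a Wikipedia article would print the report. Nothing is fetched until `crawl` is called, so importing or pasting the script has no side effects.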