import requests
from bs4 import BeautifulSoup

def crawl_wikipedia_page(url):
    # Fetch HTML from URL using requests
    response = requests.get(url)

    # Create BeautifulSoup object to parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get page title (<title> tag content)
    page_title = soup.find('title').text
    print(f"Page Title: {page_title}\n")

    # Get first valid paragraph
    # Traverse all <p> tags and find the first paragraph that doesn't have 'mw-empty-elt' class
    first_valid_paragraph = None
    for paragraph in soup.find_all('p'):
        if 'mw-empty-elt' not in paragraph.get('class', []):
            first_valid_paragraph = paragraph.text.strip()
            break

    print('-' * 40)

    if first_valid_paragraph:
        print(f"First Paragraph: {first_valid_paragraph}\n")
    else:
        print("No valid first paragraph found.\n")

# URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/Internet"

# Call the function
crawl_wikipedia_page(url)

# Wikipedia Article Crawling

This lesson shows how to extract the title and the first paragraph of a Wikipedia article using Python's `requests` and `BeautifulSoup` libraries.

**Step 1**
 ```python title="Retrieving and Parsing HTML"
 response = requests.get(url)
 soup = BeautifulSoup(response.text, 'html.parser')
 ```
 In this step, `requests.get(url)` retrieves the HTML content from the given URL. `BeautifulSoup` then parses that HTML and stores the result in the `soup` object, which gives easy access to the page's HTML elements.
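
Network requests can fail (bad URL, timeout, HTTP error status), so it can help to guard the fetch. The following is a minimal sketch, not part of the lesson code: the helper name `fetch_soup` and the 10-second `timeout` are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for url, or None if the request fails.

    fetch_soup is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup("https://en.wikipedia.org/wiki/Internet")` would return a `soup` object on success, while a malformed URL or an unreachable server would yield `None` rather than an unhandled exception.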

 

**Step 2**
 ```python title="Extracting Page Title"
 page_title = soup.find('title').text
 ```
 `soup.find('title')` locates the `<title>` tag of the HTML document, and the `.text` attribute returns the text content of that tag, which is the page's title.
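
One caveat: `soup.find()` returns `None` when no matching tag exists, so calling `.text` on the result would raise an `AttributeError` for a page without a `<title>`. A small sketch of a more defensive version follows; the function name `get_page_title`, the default string, and the sample HTML snippets are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

def get_page_title(soup, default="(no title)"):
    """Return the <title> text, or a default if the tag is missing (hypothetical helper)."""
    title_tag = soup.find('title')
    if title_tag is None:
        return default
    # get_text(strip=True) removes surrounding whitespace from the tag's text
    return title_tag.get_text(strip=True)

# Hand-written HTML snippets used only for illustration
with_title = BeautifulSoup(
    "<html><head><title> Internet - Wikipedia </title></head></html>", 'html.parser'
)
without_title = BeautifulSoup(
    "<html><body><p>No head here</p></body></html>", 'html.parser'
)

print(get_page_title(with_title))
print(get_page_title(without_title))
```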

 

**Step 3**
 ```python title="Extracting First Valid Paragraph"
 first_valid_paragraph = None
 for paragraph in soup.find_all('p'):
     if 'mw-empty-elt' not in paragraph.get('class', []):
         first_valid_paragraph = paragraph.text.strip()
         break
 ```
 By iterating over all `<p>` tags, the first paragraph without the `mw-empty-elt` class is found. The `mw-empty-elt` class indicates an empty paragraph, so it is skipped to find the first paragraph with actual content.
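
To see the class filter in isolation, the sketch below runs it against a small hand-written HTML string instead of a live page. The `sample_html` content and the function name `find_first_valid_paragraph` are assumptions for illustration, and the extra check for non-empty text goes slightly beyond the lesson code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real Wikipedia pages are far larger
sample_html = """
<p class="mw-empty-elt"></p>
<p class="mw-empty-elt extra"></p>
<p>  The Internet is a global system of networks.  </p>
<p>Second paragraph.</p>
"""

def find_first_valid_paragraph(soup):
    """Return the text of the first <p> that is not marked empty, or None."""
    for paragraph in soup.find_all('p'):
        # BeautifulSoup returns the class attribute as a list of class names;
        # the [] default covers <p> tags that have no class attribute at all
        classes = paragraph.get('class', [])
        text = paragraph.get_text().strip()
        # Added assumption: also skip paragraphs whose text is blank
        if 'mw-empty-elt' not in classes and text:
            return text
    return None

soup = BeautifulSoup(sample_html, 'html.parser')
result = find_first_valid_paragraph(soup)
print(result)
```

Because `paragraph.get('class', [])` yields a list, the `in` test matches whole class names, so a paragraph with `class="mw-empty-elt extra"` is skipped as well.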

 

**Step 4**
 ```python title="Outputting Results"
 print(f"Page Title: {page_title}\n")
 if first_valid_paragraph:
     print(f"First Paragraph: {first_valid_paragraph}\n")
 else:
     print("No valid first paragraph found.\n")
 ```
 Finally, the extracted page title and the first valid paragraph are printed. If a valid first paragraph is present, its content is displayed; if not, a "No valid first paragraph found." message is shown.
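
If the results need to be reused (saved to a file, checked in a test) rather than only printed, one option is to build the output as a string first. This is a sketch of that design choice; `format_result` is a hypothetical helper, not part of the lesson code.

```python
def format_result(page_title, first_paragraph):
    """Build the report text from the extracted values (hypothetical helper)."""
    lines = [f"Page Title: {page_title}", "-" * 40]
    if first_paragraph:
        lines.append(f"First Paragraph: {first_paragraph}")
    else:
        # Same fallback message as the lesson code
        lines.append("No valid first paragraph found.")
    return "\n".join(lines)

# Example values chosen only for illustration
print(format_result("Sample Page", "Hello, world!"))
print(format_result("Sample Page", None))
```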

 

## Practice

Click the _`Run Code`_ button on the right to see the crawling results or modify the code!
