# Extracting Desired Information from Wikipedia

In this lesson, we will learn how to use Python to crawl data from the `Internet` page on Wikipedia.

Specifically, we will extract the `title` and certain sections from the `content` of the page, and learn how to properly handle `UTF-8 data`.

 

## Fetching the Web Page

First, we will use the `requests` package to fetch the `Internet` page on Wikipedia.

The `requests.get()` function sends an HTTP GET request and returns a response object that contains the HTML source of the page.

```python title="Fetching the Web Page"
import requests

# Set the URL
url = 'https://en.wikipedia.org/wiki/Internet'

# Fetch the web page
response = requests.get(url)

# Check the response status code
print("status_code:", response.status_code)
```

- The `url` variable stores the address of the page to be crawled.

- `requests.get(url)` fetches the HTML source of the given URL.

- `response.status_code` holds the HTTP status code of the response. A value of 200 means the request succeeded.
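
Status codes other than 200 carry meaning too. As a rough sketch, the standard library's `http.HTTPStatus` can turn a numeric code into a readable phrase. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one the standard library knows about
        phrase = "Unknown"
    # Codes in the 2xx range indicate success
    outcome = "success" if 200 <= code < 300 else "not a success code"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

In the lesson's code you would call `describe_status(response.status_code)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.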

 

## Parsing HTML and Extracting the Title

We will parse the HTML and extract the title of the page.

We will use `BeautifulSoup` to analyze the HTML structure.

```python title="Parsing HTML and Extracting the Title"
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title of the page
title = soup.find('h1', id='firstHeading').text
print("title:", title)
```

- `BeautifulSoup(response.text, 'html.parser')` parses the HTML source.

- `soup.find('h1', id='firstHeading').text` extracts the title of the page.
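
Note that `find` returns `None` when no element matches, in which case `.text` raises an `AttributeError`. Below is a minimal sketch of a guard, assuming a hypothetical helper named `text_or_default`. The `SimpleNamespace` object only stands in for a BeautifulSoup tag so that the snippet runs without a network request.

```python
from types import SimpleNamespace


def text_or_default(element, default=""):
    """Return element.text, or default when the lookup found nothing (None)."""
    if element is None:
        return default
    return element.text.strip()


# Stand-in for the result of soup.find('h1', id='firstHeading')
found = SimpleNamespace(text=" Internet ")
missing = None  # what find() returns when nothing matches

print("title:", text_or_default(found))
print("title:", text_or_default(missing, default="(no title found)"))
```

With the lesson's code, this would become `title = text_or_default(soup.find('h1', id='firstHeading'))`.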

 

## Extracting the Content

Next, we will retrieve all `<p>` tags from the content and then extract the first 5 paragraphs.

```python title="Extracting the Content"
# Retrieve all <p> tags from the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')

# Select only the first 5 <p> tags
paragraphs = all_paragraphs[:5]

# Combine the extracted paragraphs into a single text
content = "\n".join([p.text for p in paragraphs])
```

- `soup.find('div', class_='mw-parser-output').find_all('p')` retrieves all `<p>` tags from the content.

- `paragraphs = all_paragraphs[:5]` selects the first 5 `<p>` tags.

- `"\n".join([p.text for p in paragraphs])` combines the selected paragraphs into a single text.
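
Some `<p>` elements on a page may be empty or contain only whitespace, so the first five tags are not always five readable paragraphs. The sketch below filters out blanks before slicing. It works on plain strings (hypothetical sample data standing in for the `p.text` values), and `first_paragraphs` is an illustrative name rather than anything from BeautifulSoup.

```python
def first_paragraphs(texts, limit=5):
    """Join the first `limit` non-blank paragraph texts with newlines."""
    cleaned = [t.strip() for t in texts if t.strip()]
    # Slicing never raises, even if fewer than `limit` items remain
    return "\n".join(cleaned[:limit])


# Hypothetical paragraph texts, as [p.text for p in all_paragraphs] might yield
sample = ["\n", "First paragraph.\n", "   ", "Second paragraph.\n", "Third paragraph.\n"]

content = first_paragraphs(sample, limit=2)
print(content)
```

In the lesson's code, this could be called as `first_paragraphs([p.text for p in all_paragraphs])`.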

 

## Handling UTF-8 Encoding Issues and Output

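Finally, we print the extracted text. Python 3 strings are already Unicode, so the encode/decode step below is a round trip that returns the same string; it is included to show how the two operations pair up.

```python title="Handling UTF-8 Encoding Issues"
# Round-trip the text through UTF-8 bytes and back, then print it
print("content:", content.encode('utf-8').decode('utf-8'))
```

- `content.encode('utf-8').decode('utf-8')` converts the string to UTF-8 bytes and straight back, so the printed text is identical to `content`. If characters look garbled, the usual cause is that the response was decoded with the wrong encoding; checking or setting `response.encoding` before reading `response.text` addresses that.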

Encoding converts a string into bytes (data composed of 0s and 1s) and decoding converts bytes into a string.
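
To make encoding and decoding concrete, here is a small sketch that uses a sample string chosen purely for illustration. It also shows what goes wrong when bytes are decoded with a different encoding from the one used to encode them.

```python
text = "café"  # sample string with one non-ASCII character

# Encoding: str -> bytes
data = text.encode("utf-8")
print(data)

# Decoding with the same encoding recovers the original string
restored = data.decode("utf-8")
print(restored == text)

# Decoding with a mismatched encoding produces garbled text (mojibake)
garbled = data.decode("latin-1")
print(garbled)
```

The four-character string becomes five bytes because `é` takes two bytes in UTF-8; reading those two bytes as Latin-1 yields two separate characters instead of one.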

Once you run the code, the title and content of the Wikipedia `Internet` page will be displayed.

The `requests.get(url)` function sends an HTTP GET request to the given URL and returns a response object; the page's HTML source is available as `response.text`.
