# Parsing HTML with BeautifulSoup

Web crawling only becomes useful once you **extract the information you need** from the collected HTML data.

`BeautifulSoup` is a Python package that makes this task easy: it is used to `parse` (analyze and extract data from) the HTML fetched with `requests`.

<br />

## Parsing HTML Data and Extracting Necessary Information

Using *BeautifulSoup*, you can convert an HTML document into a Python object, making it easy to navigate and manipulate each element of the document with Python code.
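As a minimal sketch of what "navigating the document as a Python object" looks like, the example below parses a small, made-up HTML string (the markup and variable names are assumptions for illustration, not part of any real page) and reads a tag's name, text, and attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The first matching tag is reachable as an attribute of the parsed object
heading = soup.h1
paragraph = soup.p

print(heading.name)        # the tag's name
print(heading.text)        # the text inside the tag
print(paragraph['class'])  # attribute lookup; class values come back as a list
```

Parsing an inline string like this is a convenient way to try out BeautifulSoup without sending a network request.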

Let's go over how to `parse HTML data` and extract the necessary information using *BeautifulSoup*.

<br />

## Parsing HTML with BeautifulSoup

First, you need to convert the HTML data fetched from a web page into a *BeautifulSoup* object.

The following code fetches HTML with the `requests` package and then creates a `BeautifulSoup` object to parse it:

```python title="Parsing HTML with BeautifulSoup"
import requests
from bs4 import BeautifulSoup

# URL to request
url = 'https://www.codefriends.net'

# Fetch HTML data with a GET request
response = requests.get(url)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title tag of the HTML
title = soup.title.text

# Print the page title
print(f"Page Title: {title}")
```

The code above stores a *BeautifulSoup* object holding the parsed HTML in the `soup` variable and extracts the document title with `soup.title.text`.

`soup.title` returns the `<title>` tag of the HTML document, and `.text` extracts the text inside that tag.
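One detail worth keeping in mind: when a document has no `<title>` tag, `soup.title` is `None`, so `soup.title.text` raises an `AttributeError`. The sketch below shows one possible way to guard against that. The helper name `get_page_title` and the sample HTML strings are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup


def get_page_title(html):
    """Return the stripped <title> text, or None if the document has no <title>."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.title  # a Tag object, or None when no <title> exists
    if title_tag is None:
        return None
    # get_text(strip=True) removes leading and trailing whitespace
    return title_tag.get_text(strip=True)


# Hypothetical inputs: one document with a title, one without
with_title = get_page_title('<html><head><title> Home </title></head></html>')
without_title = get_page_title('<html><body><p>No title here</p></body></html>')

print(with_title)
print(without_title)
```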

<br />

## Extracting Required Information

BeautifulSoup provides several ways to locate elements, as shown below.

1. *Finding elements by tag name*: Locate specific tags in the HTML document.

```python title="Finding elements by tag name"
# Find all <a> tags
links = soup.find_all('a')

# Print all links
for link in links:
    print(link.get('href'))
```

<br />

2. *Finding elements by class name*: Locate elements by a specific class name.

```python title="Finding elements by class name"
# Find all <div> tags with class="example"
divs = soup.find_all('div', class_='example')

# Print the text of all <div> tags
for div in divs:
    print(div.text)
```

<br />

3. *Finding elements by ID*: Locate an element by a specific ID.

```python title="Finding elements by ID"
# Find element with id="main-content"
main_content = soup.find(id='main-content')

# Print the text of the selected element
print(main_content.text)
```
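The three lookups above can be tried together on a small inline document. This is a sketch under stated assumptions: the HTML, the class name `example`, and the IDs `main-content` and `sidebar` are invented for illustration. It also shows two defensive habits: passing `href=True` to skip `<a>` tags that have no `href`, and checking for `None` before using the result of `find()`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<div id="main-content">
  <div class="example">First box</div>
  <div class="example highlight">Second box</div>
  <a href="/about">About</a>
  <a href="https://example.com/docs">Docs</a>
  <a>No link target</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# By tag name: href=True keeps only <a> tags that have an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

# By class name: matches any <div> whose class list contains "example"
box_texts = [div.get_text(strip=True)
             for div in soup.find_all('div', class_='example')]

# By ID: find() returns None when nothing matches, so check before using it
sidebar = soup.find(id='sidebar')
sidebar_text = sidebar.get_text(strip=True) if sidebar is not None else ''

print(hrefs)
print(box_texts)
print(repr(sidebar_text))
```

Without the `None` check, calling `.text` on a missing element would raise an `AttributeError`, which is a common source of crashes when a page's layout changes.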

<br />

## Extracting Article Titles and Links from a Web Page

Below is an example of extracting article titles and links from an actual web page:

```python title="Extracting Article Titles and Links"
import requests
from bs4 import BeautifulSoup

# URL of the web page to be scraped
url = 'https://news.ycombinator.com/'

# Fetch HTML data with a GET request
response = requests.get(url)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find every <a> tag on the page (article links as well as navigation links)
articles = soup.find_all('a')

# Print the text and URL of each link
for article in articles:
    # Link text (the article title, for article links)
    title = article.text

    # Link URL
    link = article.get('href')

    # Print title and link
    print(f"Title: {title}, Link: {link}")
```

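The code above collects every `a` tag on the Hacker News front page (news.ycombinator.com) and prints each link's text and URL. Because it matches all links, the output includes navigation links as well as article titles.

As shown, BeautifulSoup makes it easy to analyze the structure of a web page and extract the data you want.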
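Since `find_all('a')` returns every link on a page, a common refinement is to narrow the match with a CSS selector via `select()`. The sketch below runs on a made-up HTML snippet: the `story` class, the `base_url` value, and the markup are assumptions for illustration and do not describe the structure of any real site. It also uses `urljoin` from the standard library to turn relative links into absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and HTML, used only for illustration
base_url = 'https://news.example.com/'
html = """
<nav><a href="/login">Login</a></nav>
<ul>
  <li class="story"><a href="https://example.com/post-1">First post</a></li>
  <li class="story"><a href="item?id=2">Second post</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Select only <a> tags with an href that sit directly inside <li class="story">
articles = []
for a in soup.select('li.story > a[href]'):
    title = a.get_text(strip=True)
    link = urljoin(base_url, a['href'])  # resolve relative links against the base
    articles.append((title, link))

for title, link in articles:
    print(f"Title: {title}, Link: {link}")
```

To apply this to a real page, inspect its HTML in the browser's developer tools first and adjust the selector to the tags and classes actually used there.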

To extract the title of an HTML document, access the `title` attribute of the BeautifulSoup object. `soup.title.text` returns the text within the `<title>` tag of the HTML document.

### What is the most appropriate method to fill in the blank below?
