```python
from bs4 import BeautifulSoup

# HTML document
html_doc = """
<html>
  <head>
    <title>Web Page Title</title>
  </head>
  <body>
    <div>
      <h1>Main Heading</h1>
      <p>Paragraph in body</p>
      <a href="http://example.com">Hyperlink</a>
    </div>
  </body>
</html>
"""

# Create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract and print key elements from the HTML document
title = soup.title.text
header = soup.h1.text
paragraph = soup.p.text
link = soup.a['href']

print(f"title: {title}")
print('-' * 20)
print(f"h1: {header}")
print('-' * 20)
print(f"p: {paragraph}")
print('-' * 20)
print(f"a: {link}")
```

# What is HTML Parsing?

`HTML Parsing` is the process of reading data from an HTML document, analyzing its structure, and making it usable within a program.

By parsing HTML, you can extract and manipulate specific elements from a webpage.

<br />

## Parsing an HTML Document

1. `Creating a BeautifulSoup Object`

   - Create a `BeautifulSoup` object with the HTML document you want to parse.
   - This object allows you to access and manipulate HTML elements.

   ```python title="Creating a BeautifulSoup Object"
   from bs4 import BeautifulSoup

   html_doc = "<html><head><title>Hello World</title></head><body>...</body></html>"
   soup = BeautifulSoup(html_doc, 'html.parser')
   ```

2. `Understanding Document Structure`

   - An HTML document is composed of a hierarchical structure of tags.

   - Common tags include `<html>`, `<head>`, `<body>`, `<div>`, `<span>`, and `<p>`. Tags can be nested inside one another, so the document forms a tree.
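
To make the hierarchy concrete, the sketch below walks up and down the tag tree of a small document. The HTML string and the variable names are made up for illustration; `.parents` and `.children` are standard BeautifulSoup navigation attributes.

```python
from bs4 import BeautifulSoup

# Illustrative document (no whitespace between tags, so every child is a tag)
html_doc = "<html><head><title>Hello World</title></head><body><div><p>Nested paragraph</p></div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

p_tag = soup.find("p")

# Walk upward from <p> toward the document root, collecting tag names
ancestors = [parent.name for parent in p_tag.parents]

# Direct children of the <html> tag
html_children = [child.name for child in soup.html.children]

print(ancestors)
print(html_children)
```

The last entry in `ancestors` is the `BeautifulSoup` object itself, which sits above `<html>` as the root of the tree.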

<br />

## Methods for Extracting Key Elements

1. `Finding Specific Tags`

   - Use the `find()` and `find_all()` methods to search for specific tags.

   - `find()` returns the first matching tag, or `None` if nothing matches, while `find_all()` returns a list of all matching tags, which is empty if nothing matches.

   ```python title="Finding Specific Tags"
   # Finding the first <p> tag
   first_p = soup.find('p')

   # Finding all <a> tags
   all_links = soup.find_all('a')
   ```
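
A slightly fuller sketch of the same two methods, using a made-up HTML snippet, shows what they return when a tag is missing and how a search can be narrowed by CSS class with the `class_` keyword.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the class name "intro" is an assumption for this example
html_doc = """
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
<a href="http://example.com">Example</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")                      # first <p> only
all_p = soup.find_all("p")                    # every <p>
intro_p = soup.find_all("p", class_="intro")  # only <p class="intro">

missing = soup.find("table")                  # no <table> in the document
missing_all = soup.find_all("table")          # empty result, not an error

# Check for None before using the result of find()
if missing is None:
    print("No <table> found")
```

Because `find()` can return `None`, checking the result before reading `.text` or an attribute avoids an `AttributeError` or `TypeError` on pages that lack the expected tag.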

<br />

2. `Extracting Tag Content`

   - Use the `.text` attribute of a tag object to extract its text content, including the text of any nested tags.

   ```python title="Extracting Tag Content"
   # Text content of the first <p> tag
   text = first_p.text
   ```
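
As a small illustration, assuming a paragraph that contains a nested `<b>` tag and extra whitespace, `.text` returns the combined text exactly as written, and Python's `str.strip()` trims the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Illustrative paragraph with a nested tag and padding spaces
html_doc = "<p>  Hello, <b>world</b>!  </p>"
soup = BeautifulSoup(html_doc, "html.parser")

p_tag = soup.find("p")

# .text joins the text of the tag and all of its descendants
raw_text = p_tag.text

# Trim leading and trailing whitespace with a plain string method
clean_text = raw_text.strip()

print(repr(raw_text))
print(repr(clean_text))
```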

<br />

3. `Accessing Tag Attributes`

   - Access tag attributes by treating the tag object like a dictionary.

   - For example, `tag['href']` returns the value of the `href` attribute of an `<a href="url">` tag.

   ```python title="Accessing Tag Attributes"
   # Value of the href attribute from the first <a> tag
   href_value = all_links[0]['href']
   ```
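
Dictionary-style access raises a `KeyError` when the attribute is absent. The sketch below, built on a made-up pair of links, shows the safer `.get()` lookup and the `.attrs` dictionary that holds all of a tag's attributes.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the second <a> deliberately has no href
html_doc = '<a href="http://example.com" class="external link">Example</a><a>No href</a>'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all("a")

href_value = links[0]["href"]         # dictionary-style access
missing_href = links[1].get("href")   # None instead of a KeyError
fallback = links[1].get("href", "#")  # default value when missing

# All attributes of a tag as a dict; class is multi-valued, so it is a list
all_attrs = links[0].attrs

try:
    links[1]["href"]
except KeyError:
    print("second link has no href")
```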

<br />

## Practice

Click the _`Run Code`_ button on the right, then try modifying the code and see how the parsing results change!
