```python title="Practice code"
from bs4 import BeautifulSoup

# Sample HTML document
html_doc = """
<div class="content">
  <p class="intro">Welcome to the CodeFiends!</p>
  <div class="section">
    <p>First section paragraph.</p>
    <p>Second section paragraph with <a href="http://example.com">a link</a>.</p>
  </div>
  <div class="footer">
    <p>Contact information: <a href="mailto:contact@example.com">contact@example.com</a></p>
  </div>
</div>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the p tags that are direct children of the div with class 'content'
print("Content Paragraphs:")
content_paragraphs = soup.select('div.content > p')

for p in content_paragraphs:
    print(p.text)

print('-' * 10)

# Extract every p tag inside the div with class 'section', showing the link when there is one
print("Section Paragraphs:")
section_paragraphs = soup.select('div.section p')

for p in section_paragraphs:
    if p.a:
        print(f"{p.text} (Link: {p.a['href']})")
    else:
        print(p.text)

print('-' * 10)

# Extract the href value of every a tag inside the div with class 'footer'
print("Footer Links:")
footer_links = soup.select('div.footer a')

for a in footer_links:
    print(a['href'])
```

# Working with Nested HTML Elements

HTML elements are nested when one element is contained inside another.

```html title="Nested elements"
<div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
```

In the example above, two `<p>` elements are nested inside the `<div>` element.

Handling nested elements like these is an important skill in web crawling.

<br />

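## Navigating Nested Elements

1. `Understand parent-child relationships`

   - HTML elements can have parent-child relationships.

   - For example, a `<p>` inside a `<div>` is a child element of that `<div>`.

2. `Find elements along a specific path`

   - Chain `find()` or `find_all()` calls to reach an element at a specific path.

   - Example: `soup.find('div').find('p')` finds the first `<p>` inside the first `<div>`.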
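
The parent-child relationship and the chained lookup described above can be sketched as follows. The HTML snippet and variable names are made up for illustration. Because `find()` returns `None` when nothing matches, the sketch guards the chained call rather than assuming the `<div>` exists.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html_doc = """
<div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Chained lookup: the first <p> inside the first <div>
first_div = soup.find('div')
first_p = first_div.find('p') if first_div is not None else None

if first_p is not None:
    print(first_p.text)         # First paragraph.
    print(first_p.parent.name)  # div

    # Direct child tags of the <div>; recursive=False skips deeper descendants
    child_names = [child.name for child in first_div.find_all(recursive=False)]
    print(child_names)          # ['p', 'p']
```
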

<br />

### Example: Extracting Nested Elements

```python title="Extracting nested elements"
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text">
    First paragraph.
    <span>Text within a span</span>
  </p>
  <p class="inner-text">Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract every p tag inside the first div
for p in soup.find('div').find_all('p'):
    print(p.text)
```
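
In the example above, `p.text` for the first paragraph also contains the text of the nested `<span>`, along with the line breaks and indentation from the HTML source. As a minimal sketch, `get_text()` with a separator and `strip=True` is one way to tidy that output; the `cleaned` list is an illustrative name, not part of the lesson's code.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text">
    First paragraph.
    <span>Text within a span</span>
  </p>
  <p class="inner-text">Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

cleaned = []
for p in soup.find('div').find_all('p'):
    # Join the text pieces with a space and trim the whitespace around each piece
    cleaned.append(p.get_text(' ', strip=True))

print(cleaned)
# ['First paragraph. Text within a span', 'Second paragraph.']

# Go one level deeper: the <span> nested inside the first <p>
span = soup.find('p').find('span')
print(span.text)  # Text within a span
```
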

<br />

## Extracting by Attribute

- Use a tag's class, ID, or other attributes to extract specific elements.

- Example: `soup.find_all('a', class_='external_link')` finds every `<a>` tag whose class is 'external_link'.
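
The sketch below shows attribute-based extraction by class, by ID, and by an arbitrary attribute. The HTML and the names `links`, `internal_link`, and `data-category` are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the id, class, and data-* names are illustrative assumptions
html_doc = """
<div id="links">
  <a class="external_link" href="http://example.com">Example</a>
  <a class="internal_link" href="/about">About</a>
  <a class="external_link" href="http://example.org" data-category="docs">Docs</a>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# By class: the keyword is class_ because class is reserved in Python
external = soup.find_all('a', class_='external_link')
external_hrefs = [a['href'] for a in external]
print(external_hrefs)  # ['http://example.com', 'http://example.org']

# By ID
container = soup.find(id='links')
print(container.name)  # div

# By any other attribute, passed through the attrs dictionary
docs_links = soup.find_all('a', attrs={'data-category': 'docs'})
docs_texts = [a.text for a in docs_links]
print(docs_texts)  # ['Docs']

# tag.get() returns None for a missing attribute instead of raising KeyError
missing = external[0].get('data-category')
print(missing)  # None
```
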

<br />

## Using CSS Selectors

- BeautifulSoup supports CSS selectors through the `select()` method.

- Example: `soup.select('div.content > p.paragraph')` finds each `<p>` with class 'paragraph' that is a direct child of a `<div>` with class 'content'.
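
To make the difference between the child combinator (`>`) and the descendant combinator (a space) concrete, here is a small sketch. The HTML is an assumption written for this comparison; `select_one()` is included as a convenient way to get only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML written for this comparison
html_doc = """
<div class="content">
  <p class="paragraph">First paragraph in content.</p>
  <div class="inner-content">
    <p class="paragraph">Inner paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# '>' matches direct children only
direct = [p.text for p in soup.select('div.content > p.paragraph')]
print(direct)  # ['First paragraph in content.']

# A space matches descendants at any depth
nested = [p.text for p in soup.select('div.content p.paragraph')]
print(nested)  # ['First paragraph in content.', 'Inner paragraph.']

# select_one() returns the first match, or None when nothing matches
first_inner = soup.select_one('div.inner-content p')
print(first_inner.text)  # Inner paragraph.
```
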

<br />

## Example: Extracting Data from a Complex Structure

```python title="Extracting data from a complex structure"
from bs4 import BeautifulSoup

html_doc = """
<div class="content">
  <p class="paragraph">First paragraph in content.</p>
  <div class="inner-content">
    <p>Inner paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract every p tag inside the div with class 'content', at any depth
content_paragraphs = soup.select('div.content p')
for p in content_paragraphs:
    print(p.text)
```

<br />

## Practice

Click the _`Run Code`_ button on the right side of the screen to check the crawling results, or try modifying the code!
