# What Is BeautifulSoup?

`BeautifulSoup` is a Python library that makes web crawling easier by **parsing** HTML files and **extracting** data from them.

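Parsing is the process of analyzing a web page's HTML document so that the data you want can be extracted from it. BeautifulSoup handles this analysis for you and exposes the result as objects you can search and navigate.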
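To make the two steps concrete, here is a minimal sketch that first parses a string and then extracts data from the result. The HTML snippet and the variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet used only for this illustration
html = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"

# Parsing: turn the raw string into a tree of tag objects
soup = BeautifulSoup(html, "html.parser")

# Extraction: read data back out of the parsed tree
paragraph = soup.p              # first <p> tag in the document
print(paragraph.name)           # tag name
print(paragraph["class"])       # class attribute (a list, because class may hold several values)
print(paragraph.get_text())     # all text inside the tag, including nested tags
print(paragraph.b.string)       # text of the nested <b> tag
```

The parsed `soup` object keeps the nesting of the original document, which is why `paragraph.b` can reach the `<b>` tag inside the paragraph.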

 

## Features and Characteristics of BeautifulSoup

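1. `Support for multiple parsers`

   - BeautifulSoup supports several parsers for processing HTML and XML documents.

   - The most commonly used parsers are `html.parser` (part of the Python standard library) and `lxml`.

2. `Simple data extraction`

   - You can easily search by tag name, ID, class, and other criteria.

   - You can extract text, attribute values, and other elements of a web page.

3. `Handling complex HTML structures`

   - Nested tags and complex HTML structures can be traversed easily to extract the data you need.

   - The hierarchical relationships between tags (parent, child, sibling) help you pinpoint exactly where the data lives.

4. `Flexible search methods`

   - You can search for data in several ways, including CSS selectors and regular expressions.

   - You can also combine multiple conditions to find data that matches a specific pattern.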
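The search features listed above can be sketched as follows. The menu markup is a made-up example, and `html.parser` is used so that nothing beyond `beautifulsoup4` needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, written only to exercise the search features
html = """
<div id="main">
  <ul class="menu">
    <li class="item"><a href="/home">Home</a></li>
    <li class="item active"><a href="/blog">Blog</a></li>
    <li class="item"><a href="https://example.com/docs">Docs</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by ID and by tag + class (class_ matches any one of a tag's classes)
main_div = soup.find(id="main")
items = soup.find_all("li", class_="item")

# Use the tag hierarchy: walk from the first link up to its ancestors
first_link = soup.find("a")
parent_name = first_link.parent.name            # the enclosing <li>
grandparent_name = first_link.parent.parent.name  # the enclosing <ul>

# CSS selector: links that are direct children of the active menu item
active_links = soup.select("ul.menu li.active > a")

# Combined conditions: <a> tags whose href matches a regular expression
absolute_links = soup.find_all("a", href=re.compile(r"^https?://"))

print(main_div.name, len(items))
print(parent_name, grandparent_name)
print(active_links[0].get_text())
print([a["href"] for a in absolute_links])
```

`find` returns the first match (or `None`), while `find_all` and `select` return lists, so it is worth checking for an empty result before indexing into it on real pages.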

 

## Usage

```python title="BeautifulSoup library usage example"
from bs4 import BeautifulSoup

# Sample HTML document
html_doc = """
<html>
  <head>
    <title>The Codefriends' story</title>
  </head>
  <body>
    The Codefriends' story
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    ...
  </body>
</html>
"""

# Create a BeautifulSoup object using the built-in html.parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the text of the HTML <title> tag
title = soup.title.text
print('Title:', title)  # Output: Title: The Codefriends' story

print('-' * 10)

# Extract the href attribute value of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))
```
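As a possible next step, the same loop can collect several fields per link instead of printing only `href`. This is a sketch under assumptions: the markup is a shortened variant of the sample above, with the third link's `href` deliberately removed to show what `get` does when an attribute is missing, and the `links` list is a name introduced here.

```python
from bs4 import BeautifulSoup

# Shortened variant of the sample document; the last link has no href on purpose
html_doc = """
<p>
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Build one dictionary per link with its id, visible text, and href
links = []
for a in soup.find_all("a", class_="sister"):
    links.append({
        "id": a.get("id"),
        "text": a.get_text(strip=True),
        "href": a.get("href"),  # get() returns None if the attribute is absent
    })

for entry in links:
    print(entry)
```

Using `a.get("href")` rather than `a["href"]` is a deliberate choice here: the subscript form raises `KeyError` for a missing attribute, whereas `get` returns `None` and lets the loop continue.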

 

## Practice

Click the _`Run Code`_ button on the right side of the screen to see the crawling result, then try modifying the code yourself!
