import requests
from bs4 import BeautifulSoup

def crawl_wikipedia_current_events_first_10_titles():
    url = "https://ko.wikipedia.org/wiki/위키백과:요즘_화제"
    headers = {
        "User-Agent": "CodeFriendsAcademy/1.0 (https://www.codefriends.net; educational crawling example)"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print("Request failed", response.status_code)
        return []

    response.encoding = "utf-8"
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the div tag that holds the content of the '요즘 화제' (Current Events) page
    current_events_section = soup.find("div", {"class": "current-events-content"})
    if current_events_section is None:
        current_events_section = soup.find("div", {"id": "mw-content-text"})

    # Find every li tag inside that div
    list_items = current_events_section.find_all("li") if current_events_section else []

    # Extract the text of each li tag, skip empty items, and keep the first 10
    texts = (item.get_text(" ", strip=True) for item in list_items)
    titles = [text for text in texts if text][:10]

    return titles


# Fetch the first 10 item titles from the '요즘 화제' (Current Events) page
current_events_first_10_titles = crawl_wikipedia_current_events_first_10_titles()

for title in current_events_first_10_titles:
    print(title)
    print('-' * 40)

# Crawling Wikipedia's Current Events

Let's use `BeautifulSoup`'s `find_all` method to crawl the main events from the Korean Wikipedia '요즘 화제' (Current Events) page.

<br />

## Example Code Walkthrough

```python title="Extract the first 10 current-event titles"
import requests
from bs4 import BeautifulSoup

def crawl_wikipedia_current_events_first_10_titles():
    url = "https://ko.wikipedia.org/wiki/위키백과:요즘_화제"

    response = requests.get(url)
    if response.status_code != 200:
        print("Request failed", response.status_code)
        return []  # return an empty list so callers can still iterate over the result

    soup = BeautifulSoup(response.content, "html.parser")

    # Find the div tag that holds the content of the '요즘 화제' (Current Events) page
    current_events_section = soup.find("div", {"id": "mw-content-text"})

    # Find every li tag inside that div
    list_items = current_events_section.find_all("li") if current_events_section else []

    # Extract the text of the first 10 li tags into a list
    titles = [item.get_text(strip=True) for item in list_items[:10]]

    return titles
```
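
The function above calls `requests.get(url)` with default settings. As a minimal sketch, assuming you want to identify your crawler and bound the wait time, the request can carry a descriptive `User-Agent` header and a timeout. The names `USER_AGENT`, `build_session`, and `fetch_html`, and the example.com contact URL, are hypothetical and not part of the lesson code.

```python
import requests

# Hypothetical identification string; replace it with your own project name and contact
USER_AGENT = "MyCrawlerExample/0.1 (https://example.com/contact; educational use)"

def build_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session

def fetch_html(session, url, timeout=10):
    """Fetch a page and return its HTML text, or None if the request fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises an exception for 4xx/5xx status codes
    except requests.RequestException as error:
        print("Request failed:", error)
        return None
    return response.text
```

`fetch_html(build_session(), url)` returns the page text on success or `None` on failure, and the text can then be passed to `BeautifulSoup` for parsing.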

<br />

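1. `Request the web page`: `requests.get(url)` requests the content of the given URL.

2. `Check the response status`: `response.status_code` is inspected to confirm that the request succeeded (status code 200).

3. `Create the BeautifulSoup object and parse the data`: `BeautifulSoup(response.content, "html.parser")` parses the HTML content.

4. `Extract data from a specific section`: The code finds every `li` tag inside the page's content container (the `div` whose `id` is `mw-content-text`) and keeps the first 10 items.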
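
Steps 3 and 4 can be tried without a network connection. The sketch below runs the same `find` / `find_all` / `get_text` sequence on a small HTML string. `SAMPLE_HTML` and `extract_titles` are hypothetical names made up for illustration, and the sample markup only imitates the structure of a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
SAMPLE_HTML = """
<div id="mw-content-text">
  <ul>
    <li>First <a href="#">event</a></li>
    <li>   </li>
    <li>Second event</li>
    <li>Third event</li>
  </ul>
</div>
"""

def extract_titles(html, limit=10):
    """Return the text of the first `limit` non-empty li tags in the content div."""
    soup = BeautifulSoup(html, "html.parser")

    # Locate the container div; return an empty list if it is missing
    section = soup.find("div", {"id": "mw-content-text"})
    if section is None:
        return []

    # Collect the text of every li tag, dropping items that are empty after stripping
    texts = (item.get_text(" ", strip=True) for item in section.find_all("li"))
    return [text for text in texts if text][:limit]

print(extract_titles(SAMPLE_HTML, limit=2))  # ['First event', 'Second event']
```

Separating parsing from downloading in this way makes the extraction logic easy to check against known input before pointing it at a live page.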

<br />

## Practice Exercises

- Use the code above to extract the latest event titles from Wikipedia's '요즘 화제' (Current Events) page.

- Practice data extraction by targeting a variety of other web pages and sections.
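
For the second exercise, one possible starting point is a helper that takes the target as a parameter instead of hard-coding it. The sketch below assumes the elements you want can be described with a CSS selector and uses BeautifulSoup's `select` method. The function name `extract_texts` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_texts(html, selector, limit=10):
    """Return the text of the first `limit` non-empty elements matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (element.get_text(" ", strip=True) for element in soup.select(selector))
    return [text for text in texts if text][:limit]

# Hypothetical markup with two different sections to target
html = """
<section class="news"><h2>Headline A</h2><h2>Headline B</h2></section>
<ul id="links"><li>One</li><li>Two</li><li>Three</li></ul>
"""

print(extract_texts(html, "section.news h2"))      # ['Headline A', 'Headline B']
print(extract_texts(html, "#links li", limit=2))   # ['One', 'Two']
```

Changing only the selector string is enough to retarget the helper, so the same function can be reused while experimenting with different pages.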
