import requests
from bs4 import BeautifulSoup

# URL of the Django repository
url = "https://github.com/django/django"

# Send a GET request to the URL with requests
response = requests.get(url)

# Store the HTML content of the response in html_content
html_content = response.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Get the text of the element whose id is pull-requests-repo-tab-count
count = soup.find(id="pull-requests-repo-tab-count").get_text()

print("Pull Request Count:", count)

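# Crawling the Pull Request Count in Real Time

In this lesson, we crawl the number of `pull requests` from the Django repository page on GitHub and print it to the screen.

For reference, a pull request is a proposal to merge changes into a repository, often one owned by another user.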

 

**Step 1**
 ```python title="Fetching the web page HTML"
 response = requests.get(url)
 html_content = response.text
 ```
 - `requests.get(url)`: Sends an HTTP GET request to the given URL and returns the response. Here the URL is Django's GitHub repository page.
 - `response.text`: Returns the body of the response from `requests.get` as a string, which in this case is the page's HTML.

 

**Step 2**
 ```python title="Parsing the HTML"
 soup = BeautifulSoup(html_content, "html.parser")
 ```
 - `BeautifulSoup(html_content, "html.parser")`: Parses the fetched HTML (`html_content`) with `BeautifulSoup`, using Python's built-in `html.parser`. The resulting `soup` object makes it easy to search for and access elements in the HTML document.
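
To see what parsing provides without touching the network, the sketch below parses a small hand-written HTML string. The snippet and its values are made up for illustration and do not reproduce GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration (not GitHub's real markup)
sample_html = """
<html>
  <head><title>Sample Repository</title></head>
  <body>
    <span id="pull-requests-repo-tab-count">42</span>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Once parsed, elements can be reached by tag name or by attribute
page_title = soup.title.get_text()
count_element = soup.find(id="pull-requests-repo-tab-count")

print(page_title)                # Sample Repository
print(count_element.get_text())  # 42
```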

 

**Step 3**
 ```python title="Extracting the information"
 count = soup.find(id="pull-requests-repo-tab-count").get_text()
 ```
 - `soup.find(id="pull-requests-repo-tab-count")`: Finds the element whose ID is `pull-requests-repo-tab-count` in the parsed HTML. On a GitHub repository page, this is the ID of the element that shows the pull request count.
 - `.get_text()`: Extracts the text content of the element found, which here is the pull request count.
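
`soup.find` returns `None` when no element matches, for example if the page layout changes, and calling `.get_text()` on `None` raises an `AttributeError`. A minimal sketch of a more defensive extraction follows; the helper name `extract_text_by_id` and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_text_by_id(html, element_id):
    """Return the stripped text of the element with the given id, or None if absent.

    Hypothetical helper for illustration, not part of the lesson code.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        # No matching element: report that instead of raising AttributeError
        return None
    return element.get_text(strip=True)


# Made-up HTML used only to exercise the helper
sample_html = '<span id="pull-requests-repo-tab-count"> 42 </span>'

found = extract_text_by_id(sample_html, "pull-requests-repo-tab-count")
missing = extract_text_by_id(sample_html, "no-such-id")

print(found)    # 42
print(missing)  # None
```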

 

Note: Before crawling, check the target website's `robots.txt` file and terms of service, and follow their rules.
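
One way to check `robots.txt` rules in code is Python's standard `urllib.robotparser` module. The sketch below parses a made-up set of rules so it runs offline; the rules, the `my-crawler` user agent name, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for illustration
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules allow fetching the URL
allowed = parser.can_fetch("my-crawler", "https://example.com/public/page")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")

print(allowed)  # True
print(blocked)  # False
```

To check a real site, `set_url()` followed by `read()` downloads and parses that site's `robots.txt` instead of the sample rules. Note that `robots.txt` does not cover a site's terms of service, which still need to be read separately.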

 

## Practice Exercises

- Run the code above with the URLs of several different GitHub repositories.

- Practice extracting data from other HTML tags by changing which tag the code targets.
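
As a starting point for the second exercise, the sketch below targets elements by tag name and class instead of by ID. The HTML, class names, and values are made up for illustration and are not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration
sample_html = """
<div class="repo">
  <h1 class="repo-name">example-project</h1>
  <a class="topic" href="/topics/python">python</a>
  <a class="topic" href="/topics/web">web</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match for a tag name and class
repo_name = soup.find("h1", class_="repo-name").get_text(strip=True)

# find_all() returns every match as a list
topics = [a.get_text(strip=True) for a in soup.find_all("a", class_="topic")]

# select() accepts a CSS selector; attributes are read like dictionary keys
topic_links = [a["href"] for a in soup.select("a.topic")]

print(repo_name)    # example-project
print(topics)       # ['python', 'web']
print(topic_links)  # ['/topics/python', '/topics/web']
```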
