Exploring Web Crawling on BBC News
Web Crawling refers to the technique of automatically exploring websites and collecting the necessary data. It commonly involves using an automated program called a crawler to fetch the content of web pages (i.e., their HTML code) and then analyze that code to collect the required data.
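To make the idea concrete, below is a minimal sketch of a crawler, written with the requests and BeautifulSoup libraries introduced later in this lesson. The starting URL and the five-page limit are illustrative choices for the sketch, and it omits real-world concerns such as error handling and robots.txt.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Illustrative starting point; any page with links would work
start_url = "https://www.bbc.com/news"
visited = set()
to_visit = [start_url]

# Visit a handful of pages, collecting new links from each one
while to_visit and len(visited) < 5:
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Queue every link found on this page, converted to an absolute URL
    for link in soup.find_all("a", href=True):
        next_url = urljoin(url, link["href"])
        if next_url.startswith("http"):  # skip mailto:, javascript:, etc.
            to_visit.append(next_url)

print("Pages explored:", len(visited))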
Difference Between Web Crawling and Web Scraping
Web Crawling and Web Scraping are often used interchangeably, but technically they mean different things. Extracting specific content from a particular web page, as the sample code in this lesson does, is an example of Web Scraping. Unlike web crawling, web scraping does not automatically explore multiple web pages; it targets a single web page or a specific piece of data. In summary, web crawling is the process of automatically exploring multiple web pages to collect data, whereas web scraping is the process of extracting specific content from individual web pages, as the short sketch below illustrates.
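For contrast with the crawler sketch above, here is a minimal scraping sketch: it fetches a single page and extracts one specific piece of data (the page's <title> element) without following any links. The URL is illustrative.

import requests
from bs4 import BeautifulSoup

# Fetch a single page; no other pages are explored
response = requests.get("https://www.bbc.com/news")
soup = BeautifulSoup(response.text, "html.parser")

# Target one specific piece of data: the page's <title> element
print(soup.title.text if soup.title else "No <title> found")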
However, this course will mainly use the term web crawling, as it covers both exploring multiple web pages and collecting data.
BBC News Web Crawling Practice
The code in the practice screen scrapes (technically speaking) the article headlines in real time from the BBC News website.
To fetch and analyze the HTML code of a web page using Python, the requests and BeautifulSoup libraries are commonly used (both are typically installed with pip install requests beautifulsoup4). The following courses will detail how these libraries are used and what code needs to be written to extract the desired information.
import requests
from bs4 import BeautifulSoup

# BBC News website URL
url = "https://www.bbc.com/news"
response = requests.get(url)

# Check if the request was successful
print("status_code:", response.status_code)

# Parse the HTML data
soup = BeautifulSoup(response.text, "html.parser")

# Extract up to 10 article headlines from the page using h2 tags
titles = soup.find_all('h2', limit=10)

# Print the extracted headlines
for title in titles:
    print(title.text.strip())
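Here find_all('h2', limit=10) returns at most the first ten <h2> elements found in the HTML. Note that this relies on BBC News marking its headlines up with <h2> tags; if the site's markup changes, the tag (or an added class filter) may need adjusting.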
Press the green ▶︎ Run button in the code editor and check out the article headlines crawled in real time from the BBC News website! 🙂