Lecture

JavaScript and Dynamic Web Crawling

Web pages are made up of three components: HTML, CSS, and JavaScript.

HTML defines the structure of the web page, while CSS defines the style of the web page.

JavaScript is a language that makes web pages dynamic.

The term dynamic means the content of the web page can change in response to interactions with the user or certain events.

For instance, using JavaScript, you can display new content when a user clicks a button, or load additional content as the user scrolls.

Such dynamic content is not present when the web page is first loaded but is generated dynamically as JavaScript is executed in the web browser.


The Limits of BeautifulSoup

BeautifulSoup parses HTML to extract data.

However, BeautifulSoup does not execute JavaScript. It only parses the HTML it is given, so content that JavaScript generates in the browser never reaches it.


Example of Crawling Code that Doesn't Work with BeautifulSoup

Let's look at code that attempts to fetch the current temperature and perceived temperature from a weather website using BeautifulSoup.

Attempting Crawling with BeautifulSoup
import requests
from bs4 import BeautifulSoup

# Weather website URL
url = 'https://www.weather.example.com/current'

# Sending request to the page
response = requests.get(url)

# Parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = soup.find('span', class_='tmp')
feels_like_element = soup.find('span', class_='feels')

# Extracting text
temperature = temperature_element.text.strip() if temperature_element else 'N/A'
feels_like = feels_like_element.text.strip() if feels_like_element else 'N/A'

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")

This code tries to fetch weather information from a weather website using BeautifulSoup, but soup.find returns None for both temperature_element and feels_like_element, so the script prints N/A for both values.

This is because requests.get only downloads the HTML as the server sends it, before any JavaScript has run, so those elements are not there for BeautifulSoup to find.
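To see why the lookup comes back empty, it helps to look at what a server might send before any script runs. The sketch below is only an illustration: the HTML string is made up (the lecture does not show the real page's markup, so the weather container and the script body are assumptions), and it uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without a network connection or extra packages. It collects the class of every span in the initial HTML.

```python
from html.parser import HTMLParser

# Hypothetical HTML as a server might send it, before JavaScript runs.
# The spans with the 'tmp' and 'feels' classes do not exist yet; the
# script would create them later, inside the browser.
INITIAL_HTML = """
<html>
  <body>
    <div id="weather"></div>
    <script>
      // Runs only in a browser, after the page has loaded.
      const tmp = document.createElement('span');
      tmp.className = 'tmp';
      document.getElementById('weather').appendChild(tmp);
    </script>
  </body>
</html>
"""


class SpanClassCollector(HTMLParser):
    """Records the class attribute of every <span> start tag."""

    def __init__(self):
        super().__init__()
        self.span_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_classes.append(dict(attrs).get("class"))


collector = SpanClassCollector()
collector.feed(INITIAL_HTML)

found_tmp = "tmp" in collector.span_classes
found_feels = "feels" in collector.span_classes
print(f"span.tmp present before JavaScript runs: {found_tmp}")
print(f"span.feels present before JavaScript runs: {found_feels}")
```

Both lines print False: the parser sees the script only as text and never runs it, which is the same situation BeautifulSoup is in when it receives response.text.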


Dynamic Web Crawling with Selenium

The weather website uses JavaScript to dynamically display weather information, so BeautifulSoup alone can't successfully fetch that data.

Selenium, by contrast, automates a real browser, so you can read the web page after JavaScript has run and extract the required data.

Note: To run this code on your computer, you need to install the Selenium library with the command pip install selenium.


Dynamic Web Crawling with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome browser
driver = webdriver.Chrome()

# Open the weather forecast page
url = "https://www.weather.example.com/current"
driver.get(url)

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'feels')

# Extracting text
temperature = temperature_element.text
feels_like = feels_like_element.text

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")

# Closing WebDriver
driver.quit()

Detailed Code Explanation

  • driver = webdriver.Chrome(): Opens a Chrome browser window and creates a driver object that controls it.

  • driver.get(url): Navigates to the specified URL (weather website).

  • temperature_element = driver.find_element(By.CLASS_NAME, 'tmp'): Finds the element with the tmp class and stores it in temperature_element.

  • feels_like_element = driver.find_element(By.CLASS_NAME, 'feels'): Finds the element with the feels class and stores it in feels_like_element.

  • temperature = temperature_element.text: Extracts the text from temperature_element and stores it in temperature.

  • feels_like = feels_like_element.text: Extracts the text from feels_like_element and stores it in feels_like.

  • driver.quit(): Closes the browser and ends the WebDriver session.


Example Output
Today's temperature: 30.4℃
Perceived temperature: Feels like(30.6℃)

Using Selenium in this way allows you to crawl content that is dynamically generated with JavaScript.

Mission

Which tool can be used to crawl web content that is dynamically generated with JavaScript?

BeautifulSoup

requests

Selenium

pandas
