
JavaScript and Dynamic Web Crawling

Web pages are typically built from three technologies: HTML, CSS, and JavaScript.

HTML defines the structure of the web page, while CSS defines the style of the web page.

JavaScript is a language that makes web pages dynamic.

Here, "dynamic" means that the content of the web page can change in response to user interaction or other events.

For instance, using JavaScript, you can display new content when a user clicks a button, or load additional content as the user scrolls.

Such content is not part of the HTML that the server initially sends; it is generated as JavaScript runs in the web browser.

The Limits of BeautifulSoup

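BeautifulSoup parses the HTML it is given so that you can extract data from it.

However, it does not execute JavaScript, and neither does the requests library that downloads the page. Content generated by JavaScript therefore never appears in the HTML that BeautifulSoup sees.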

Example of Crawling Code that Doesn't Work with BeautifulSoup

Let's look at code that attempts to fetch the current temperature and perceived temperature from a weather website using BeautifulSoup.

Attempting Crawling with BeautifulSoup

import requests
from bs4 import BeautifulSoup

# Weather website URL
url = 'https://www.weather.example.com/current'

# Sending request to the page
response = requests.get(url)

# Parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = soup.find('span', class_='tmp')
feels_like_element = soup.find('span', class_='feels')

# Extracting text
temperature = temperature_element.text.strip() if temperature_element else 'N/A'
feels_like = feels_like_element.text.strip() if feels_like_element else 'N/A'

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")

This code requests the page and parses the response with BeautifulSoup, but soup.find returns None for both temperature_element and feels_like_element, so the script prints N/A for both values.

This is because requests only downloads the HTML as the server sends it, before any JavaScript has run, and the elements do not exist in that HTML yet.
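
To make this concrete, the sketch below parses a made-up HTML snippet in which a script is supposed to create the tmp element. RAW_HTML, ClassCollector, and classes_in are hypothetical names used only for illustration, and the markup is an assumption about how such a page could look, not the actual weather site. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without network access or extra packages; like BeautifulSoup, the parser reads the markup but never executes the script.

```python
from html.parser import HTMLParser

# Hypothetical markup a server might return before any JavaScript runs.
RAW_HTML = """
<html>
  <body>
    <h1 class="title">Current weather</h1>
    <div id="weather"></div>
    <script>
      const span = document.createElement("span");
      span.className = "tmp";
      span.textContent = "30.4";
      document.getElementById("weather").appendChild(span);
    </script>
  </body>
</html>
"""


class ClassCollector(HTMLParser):
    """Records the class names of every start tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names that appear in the static markup."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


found = classes_in(RAW_HTML)
print("title" in found)  # the static heading is part of the markup
print("tmp" in found)    # the span would only exist after the script runs
```

The heading's class is found because it is written directly in the markup, while tmp is not, because the element that carries it would only be created once a browser runs the script.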

Dynamic Web Crawling with Selenium

The weather website uses JavaScript to dynamically display weather information, so BeautifulSoup alone can't successfully fetch that data.

But by using Selenium, which allows you to interact with browsers, you can capture the web page after JavaScript has run and extract the required data.

Note: To run this code on your computer, install the Selenium library with the command pip install selenium. You also need the Chrome browser installed, since the code opens Chrome.

Dynamic Web Crawling with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome browser
driver = webdriver.Chrome()

# Open the weather forecast page
url = "https://www.weather.example.com/current"
driver.get(url)

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'feels')

# Extracting text
temperature = temperature_element.text
feels_like = feels_like_element.text

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")

# Closing WebDriver
driver.quit()

Detailed Code Explanation

driver = webdriver.Chrome(): Launches a Chrome browser and creates a driver object that controls it.
driver.get(url): Navigates to the specified URL (weather website).
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp'): Finds the element with the tmp class and stores it in temperature_element.
feels_like_element = driver.find_element(By.CLASS_NAME, 'feels'): Finds the element with the feels class and stores it in feels_like_element.
temperature = temperature_element.text: Extracts the text from temperature_element and stores it in temperature.
feels_like = feels_like_element.text: Extracts the text from feels_like_element and stores it in feels_like.
driver.quit(): Closes the browser and ends the WebDriver session.

Example Output

Today's temperature: 30.4℃
Perceived temperature: Feels like(30.6℃)

Using Selenium in this way allows you to crawl content that is dynamically generated with JavaScript.
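
One practical caveat: if a page loads its data asynchronously, the elements may not exist yet at the moment you look for them, and the lookup raises an error. A common remedy is to retry the lookup until it succeeds or a time limit passes. The sketch below shows that polling idea in plain Python so it can run without a browser; wait_until and FakePage are hypothetical names, and FakePage merely simulates an element that appears on the third lookup.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions raised by condition() are treated as "not ready yet".
    Raises TimeoutError if nothing truthy is returned within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # for example, the element has not been rendered yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose temperature appears on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def temperature_text(self):
        self.lookups += 1
        if self.lookups < 3:
            raise LookupError("element not rendered yet")
        return "30.4℃"


page = FakePage()
temperature = wait_until(page.temperature_text, timeout=2.0, poll_interval=0.01)
print(f"Today's temperature: {temperature} (after {page.lookups} lookups)")
```

With a real driver, the condition could be something like lambda: driver.find_element(By.CLASS_NAME, 'tmp').text. Selenium also ships its own WebDriverWait helper for this purpose, which is usually preferable to a hand-written loop.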

Mission


Which tool can be used to crawl web content that is dynamically generated with JavaScript?

BeautifulSoup

requests

Selenium

pandas
