JavaScript and Dynamic Web Crawling
Web pages are made up of three components: HTML, CSS, and JavaScript.
HTML defines the structure of the web page, while CSS defines the style of the web page.
JavaScript is a language that makes web pages dynamic.
The term dynamic means the content of the web page can change in response to interactions with the user or certain events.
For instance, using JavaScript, you can display new content when a user clicks a button, or load additional content as the user scrolls.
Such content is not present in the HTML when the web page is first loaded; it is generated as JavaScript is executed in the web browser.
The Limits of BeautifulSoup
BeautifulSoup parses HTML to extract data.
However, it does not execute JavaScript, so content generated dynamically with JavaScript cannot be fetched with BeautifulSoup.
Example of Crawling Code that Doesn't Work with BeautifulSoup
Let's look at code that attempts to fetch the current temperature and perceived temperature from a weather website using BeautifulSoup.
import requests
from bs4 import BeautifulSoup

# Weather website URL
url = 'https://www.weather.example.com/current'

# Sending request to the page
response = requests.get(url)

# Parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = soup.find('span', class_='tmp')
feels_like_element = soup.find('span', class_='feels')

# Extracting text
temperature = temperature_element.text.strip() if temperature_element else 'N/A'
feels_like = feels_like_element.text.strip() if feels_like_element else 'N/A'

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")
This code tries to fetch weather information from a weather website using BeautifulSoup, but soup.find returns None for both temperature_element and feels_like_element, so the script prints N/A for both values.
This is because requests fetches only the HTML the server sends, before any JavaScript is executed, and the elements are not in that HTML yet.
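The behavior can be reproduced offline. The sketch below is a minimal illustration, not the weather site's real markup: INITIAL_HTML, RENDERED_HTML, the script path, and the read_temperature helper are all hypothetical. It parses the page as a server might send it and again as it might look after a browser has run the script.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML as the server sends it: an empty container plus a script
# that would fill in the weather values once a browser executes it.
INITIAL_HTML = """
<html><body>
  <div id="weather"></div>
  <script src="/static/weather.js"></script>
</body></html>
"""

# Hypothetical DOM after a browser has executed that script.
RENDERED_HTML = """
<html><body>
  <div id="weather">
    <span class="tmp">30.4℃</span>
    <span class="feels">Feels like(30.6℃)</span>
  </div>
</body></html>
"""


def read_temperature(html):
    """Return the text of the 'tmp' span, or 'N/A' if the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("span", class_="tmp")
    return element.text.strip() if element else "N/A"


print(read_temperature(INITIAL_HTML))   # N/A
print(read_temperature(RENDERED_HTML))  # 30.4℃
```

BeautifulSoup behaves the same way in both calls; the only difference is whether the span is already in the HTML it receives.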
Dynamic Web Crawling with Selenium
The weather website uses JavaScript to dynamically display weather information, so BeautifulSoup alone can't successfully fetch that data.
But by using Selenium, which allows you to interact with browsers, you can capture the web page after JavaScript has run and extract the required data.
Note: To run this code on your computer, you need to install the Selenium library with the command pip install selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome browser
driver = webdriver.Chrome()

# Open the weather forecast page
url = "https://www.weather.example.com/current"
driver.get(url)

# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'feels')

# Extracting text
temperature = temperature_element.text
feels_like = feels_like_element.text

# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")

# Closing WebDriver
driver.quit()
Detailed Code Explanation
- driver = webdriver.Chrome(): Opens the Chrome browser and creates a driver object.
- driver.get(url): Navigates to the specified URL (the weather website).
- temperature_element = driver.find_element(By.CLASS_NAME, 'tmp'): Finds the element with the tmp class and stores it in temperature_element.
- feels_like_element = driver.find_element(By.CLASS_NAME, 'feels'): Finds the element with the feels class and stores it in feels_like_element.
- temperature = temperature_element.text: Extracts the text from temperature_element and stores it in temperature.
- feels_like = feels_like_element.text: Extracts the text from feels_like_element and stores it in feels_like.
- driver.quit(): Closes the browser and ends the WebDriver session.
Example output:
Today's temperature: 30.4℃
Perceived temperature: Feels like(30.6℃)
Using Selenium in this way allows you to crawl content that is dynamically generated with JavaScript.
Which tool can be used to crawl web content that is dynamically generated with JavaScript?
BeautifulSoup
requests
Selenium
pandas