Crawling U.S. Stock Market Indices with Selenium
In this lesson, we’ll practice real-world web crawling by applying the Selenium knowledge we’ve learned so far.
The practice code demonstrates how to use Selenium to extract table data from the Americas section of the Yahoo Finance website, and how to organize the data for output using the pandas library.
Note: Web crawling may fail if the HTML or CSS structure of the website changes. In such cases, the code must be updated accordingly.
Let’s go through the code step by step.
1. Import Required Libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
- selenium: A library for web automation and scraping. It allows interaction with web page elements.
- pandas: A library for handling data in table formats. It's useful for data analysis, similar to working with Excel.
- time: A built-in Python module that provides time-related functions (imported here, though the snippets below only need it if you add fixed delays such as time.sleep()).
2. Launch WebDriver and Navigate to Website
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')
- webdriver.Chrome(): Launches the Chrome WebDriver so the browser can be controlled automatically. It opens a visible browser window.
- driver.get(URL): Navigates to the specified URL. Here, it opens the Yahoo Finance 'Markets' page.
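If you don't need to watch the browser while it works, Chrome can also be started without a visible window. A minimal sketch using Selenium's standard Options class (the --headless=new flag assumes a reasonably recent Chrome):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome with no visible window (recent Chrome versions)
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/markets/')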
3. Wait for Page to Load
wait = WebDriverWait(driver, 10)
- WebDriverWait(driver, 10): Waits up to 10 seconds for an element to appear. This prevents errors caused by running the code before the page has fully loaded.
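The same wait object can be reused with other expected conditions. As a small sketch, EC.title_contains() waits until the page title includes a given substring (that the Yahoo Finance title actually contains 'Markets' is an assumption here):

# Reuse the 10-second wait with a different condition;
# 'Markets' is assumed to appear in the page title
wait.until(EC.title_contains('Markets'))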
4. Locate the 'Americas' Section
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))
- wait.until(): Waits until the h3 tag containing the text 'Americas' appears on the page. EC.presence_of_element_located() checks that the element is present in the DOM.
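If the element never appears within the 10-second timeout, wait.until() raises a TimeoutException. A sketch of one way to handle it, using the exception class from selenium.common.exceptions:

from selenium.common.exceptions import TimeoutException

try:
    americas_section = wait.until(
        EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']"))
    )
except TimeoutException:
    # The section did not appear in time; the page layout may have changed
    driver.quit()
    raise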
5. Scroll and Locate the Table within the Section
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")
- ActionChains(driver): Automates actions such as mouse movements and clicks. Here, it scrolls to the 'Americas' section.
- find_element(By.XPATH, ...): Finds the ancestor section element that contains the 'Americas' table, then locates the table inside it.
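As an alternative to ActionChains, you can scroll an element into view with a small piece of JavaScript via driver.execute_script(); a minimal sketch:

# Ask the browser itself to scroll the 'Americas' heading into view
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", americas_section)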
6. Extract Table Data
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")
- table.find_elements(): Extracts the table headers and the data rows. th refers to a table header cell; tr refers to a table row.
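Before building the row dictionaries, it can help to print what was extracted to confirm the locators still match the page; for example:

print(headers)    # column names, e.g. 'Symbol', 'Price', ... (actual names depend on the page)
print(len(rows))  # number of data rows found in the table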
7. Store Data in a List
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}
    for i in range(len(headers)):
        header = headers[i]
        column_value = columns[i].text
        row_data[header] = column_value
    table_data.append(row_data)
- Extracts the column data (td) for each row (tr) and stores it in a dictionary, with the headers as keys and the cell text as values; each row's dictionary is appended to table_data.
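The same loop can be written more compactly with zip(), which pairs each header with the cell in the same position; a sketch equivalent to the loop above:

table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    # zip() pairs headers and cells by position, so the dict comprehension
    # builds the same {header: cell text} mapping as the explicit loop
    table_data.append({header: col.text for header, col in zip(headers, columns)})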
8. Convert Data to a Pandas DataFrame and Display
df = pd.DataFrame(table_data)
df_filtered = df[['Symbol', 'Price']]
print(df_filtered)
- pd.DataFrame(): Converts the extracted list of dictionaries into a pandas DataFrame.
- df[['Symbol', 'Price']]: Keeps only the Symbol and Price columns for a clean display of the data.
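If you want to keep the results rather than just print them, the DataFrame can be written to a file; a minimal sketch (the filename is only an example):

# Save the filtered table to CSV; 'americas_indices.csv' is a placeholder name
df_filtered.to_csv('americas_indices.csv', index=False)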
9. Close the Browser
driver.quit()
- driver.quit(): Closes the browser to release resources after completing all tasks.
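To guarantee the browser is closed even if an error occurs partway through, the scraping steps can be wrapped in try/finally; a sketch of the pattern:

driver = webdriver.Chrome()
try:
    driver.get('https://finance.yahoo.com/markets/')
    # ... scraping steps from the sections above ...
finally:
    driver.quit()  # always release the browser, even if an exception was raised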
Run the code and check the results.