Lecture

Crawling U.S. Stock Market Indices with Selenium

In this lesson, we’ll practice real-world web crawling by applying the Selenium knowledge we’ve learned so far.

The practice code uses Selenium to extract the table in the Americas section of the Yahoo Finance Markets page, then organizes the data with the pandas library and prints it.

Note: Web crawling may fail if the HTML or CSS structure of the website changes. In such cases, the code must be updated accordingly.

Let’s go through the code step by step.

1. Import Required Libraries

Importing Libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

selenium: A library for web automation and scraping. It allows interaction with web page elements.
pandas: A library for handling tabular data. It is useful for spreadsheet-style analysis, much like working in Excel.
time: A Python built-in module that provides time-related functions. It is imported here but not used in the steps below, because WebDriverWait handles the waiting.

2. Launch WebDriver and Navigate to Website

Launching WebDriver and Navigating to Website

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')

webdriver.Chrome(): Launches the Chrome WebDriver to control the browser automatically. It opens a browser window.
driver.get(URL): Navigates to the specified URL. Here, it opens the Yahoo Finance ‘Markets’ page.

3. Wait for Page to Load

Waiting for the Page to Load

wait = WebDriverWait(driver, 10)

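WebDriverWait(driver, 10): Creates a wait object with a 10-second timeout. It does not pause the script by itself; the waiting happens when wait.until() is called in the next step. This prevents errors caused by looking for an element before the page has finished loading.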
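
To make the timeout behavior concrete, here is a minimal sketch of the polling idea behind an explicit wait. It is a simplified illustration in plain Python, not Selenium's own implementation: the function name wait_until, the poll_interval parameter, and the fake condition are all hypothetical, and Selenium raises its own TimeoutException rather than the built-in TimeoutError used here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Hypothetical helper: a simplified stand-in for the idea behind
    WebDriverWait(...).until(...), not Selenium's own code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Fake "page" whose element only becomes available on the third check.
attempts = {"count": 0}


def element_is_ready():
    attempts["count"] += 1
    return "h3 element" if attempts["count"] >= 3 else None


found = wait_until(element_is_ready, timeout=2.0, poll_interval=0.01)
print(found, "after", attempts["count"], "checks")
```

The call returns as soon as the condition succeeds, so a fast page does not cost the full timeout. That is the main advantage over a fixed pause.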

4. Locate the 'Americas' Section

Locating the Section

americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))

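wait.until(): Blocks until the given condition is met or the 10-second timeout expires, in which case Selenium raises a TimeoutException. EC.presence_of_element_located() checks that the element exists in the page's DOM; it does not check that the element is visible. The XPath //h3[text()='Americas'] matches an h3 tag whose text is exactly ‘Americas’.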

5. Scroll and Locate the Table within the Section

Finding the Table in the Section

actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()

parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")

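ActionChains(driver): Builds a chain of low-level actions such as mouse movements and clicks. Here, move_to_element() moves the mouse pointer to the ‘Americas’ heading, which also brings it into view, and perform() executes the queued actions.
find_element(By.XPATH): The first call climbs from the heading to its ancestor section element whose data-testid attribute contains 'world-indices'. The second call finds the first table inside that section.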

6. Extract Table Data

Extracting Data from the Table

headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")

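table.find_elements(): Returns a list of every element that matches the XPath. The first call collects the header cells and reads their text; the second collects the rows in the table body.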
- th: Refers to table headers.
- tr: Refers to table rows.
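
The two XPath expressions can be tried without a browser. The sketch below uses Python's built-in xml.etree.ElementTree in place of Selenium, so the method is findall() rather than find_elements(), and ElementTree supports only a subset of XPath. The table markup and its values are hypothetical placeholders, not the real Yahoo Finance page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table markup with placeholder values (not real quotes).
SAMPLE_TABLE = """
<table>
  <thead>
    <tr><th>Symbol</th><th>Name</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>AAA</td><td>Index A</td><td>100.00</td></tr>
    <tr><td>BBB</td><td>Index B</td><td>200.00</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(SAMPLE_TABLE)

# ".//th" matches every header cell anywhere below the table element.
headers = [th.text for th in table.findall(".//th")]

# ".//tbody/tr" matches only rows that sit directly inside tbody,
# so the header row inside thead is skipped.
rows = table.findall(".//tbody/tr")

print(headers)
print(len(rows))
```

Restricting the row search to tbody is what keeps the header row out of the data.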

7. Store Data in a List

Storing Table Data as a List of Dictionaries

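table_data = []
for row in rows:
    # Each td is one cell in the current row
    columns = row.find_elements(By.XPATH, ".//td")
    # Pair each header with the cell in the same position; zip stops at the
    # shorter sequence, so a row with fewer cells does not raise IndexError
    row_data = {header: column.text for header, column in zip(headers, columns)}
    table_data.append(row_data)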

For each row (tr), the loop reads the text of every cell (td) and stores it in a dictionary, using the matching header as the key. Each dictionary is appended to table_data, so the result is a list with one dictionary per row.
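
The shape of the result is easier to see with plain strings. This sketch assumes the header and cell texts have already been read from the page; all values are hypothetical placeholders. It pairs headers and cells with zip, which is one possible choice, and includes a deliberately short row to show how zip behaves when cells are missing.

```python
# Placeholder header and cell texts standing in for the .text values
# read from the th and td elements (not real market data).
headers = ["Symbol", "Name", "Price"]
rows_of_cells = [
    ["AAA", "Index A", "100.00"],
    ["BBB", "Index B", "200.00"],
    ["CCC"],  # a short row, e.g. a spacer or partially rendered row
]

table_data = []
for cells in rows_of_cells:
    # Pair each header with the cell in the same position.
    # zip stops at the shorter sequence, so the short row yields one key.
    row_data = dict(zip(headers, cells))
    table_data.append(row_data)

for row_data in table_data:
    print(row_data)
```

Indexing cells by position would instead raise an IndexError on the short row. Which behavior is preferable depends on whether incomplete rows should be kept or treated as an error.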

8. Convert Data to a Pandas DataFrame and Display

Converting to DataFrame and Displaying

df = pd.DataFrame(table_data)
df_filtered = df[['Symbol', 'Price']]
print(df_filtered)

pd.DataFrame(): Converts the extracted data into a Pandas DataFrame.
df[['Symbol', 'Price']]: Selects only the Symbol and Price columns for a cleaner display. If the page's column headers change and either name is missing, this line raises a KeyError.
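
The conversion and column selection can be checked on their own with a small list of dictionaries. The rows below are hypothetical placeholders in the same shape as table_data, not real market data.

```python
import pandas as pd

# Placeholder rows in the same shape as table_data (not real market data).
table_data = [
    {"Symbol": "AAA", "Name": "Index A", "Price": "100.00"},
    {"Symbol": "BBB", "Name": "Index B", "Price": "200.00"},
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(table_data)

# Double brackets pass a list of column names and return a new DataFrame.
df_filtered = df[["Symbol", "Price"]]

print(df_filtered)
```

Values read from a web page arrive as text, so the Price column here holds strings rather than numbers.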

9. Close the Browser

Closing the Browser

driver.quit()

driver.quit(): Closes every browser window and ends the WebDriver session, releasing resources once all tasks are complete.
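
If an earlier step raises an exception, for example because the page structure changed, the script never reaches driver.quit() and the browser stays open. One way to guard against that is a try/finally block. The sketch below is an assumption about how this could be organized, not part of the lesson code: run_with_driver and FakeDriver are hypothetical names, and the fake driver stands in for Chrome so the example runs without a browser.

```python
def run_with_driver(make_driver, task):
    """Create a driver, run task(driver), and always call driver.quit().

    Hypothetical helper: make_driver is any zero-argument callable that
    returns an object with a quit() method, such as webdriver.Chrome.
    """
    driver = make_driver()
    try:
        return task(driver)
    finally:
        # Runs whether task() returned normally or raised an exception.
        driver.quit()


class FakeDriver:
    """Stand-in for a real browser driver so the sketch runs without Chrome."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()


def failing_task(driver):
    raise RuntimeError("simulated crawling failure")


try:
    run_with_driver(lambda: fake, failing_task)
except RuntimeError as error:
    print("task failed:", error)

print("browser closed:", fake.closed)
```

With a real browser, the same helper could be called as run_with_driver(webdriver.Chrome, task), where task is a function containing steps 2 through 8.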

Mission

Run the code and check the results.
