Real-Time Stock Data Collection from Yahoo Finance

The ever-changing stock market! If you want to collect real-time stock data and store it periodically, how can you do it?

In this lesson, we will explore how to dynamically extract stock data from Yahoo Finance using Selenium.

Extracting Dynamic Data

Dynamic data is content that JavaScript, which acts as the brain of a web page, generates in the browser. It can appear after the page first loads or change in response to specific user actions.

Such data usually cannot be fetched with Requests and BeautifulSoup alone, because Requests downloads only the initial HTML and never runs the page's JavaScript.

Selenium, however, drives a real browser that does run the JavaScript, so we can read the data once it has been rendered.
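To make the limitation concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for what a plain HTTP client would download, and the `SpanText` class and the `price` id are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP client would receive it.
# The price is filled in later by the script, which only a browser runs.
STATIC_HTML = """
<html><body>
  <h3>Americas</h3>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "5,000.00";</script>
</body></html>
"""


class SpanText(HTMLParser):
    """Collect the text inside the <span id="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.text += data


parser = SpanText()
parser.feed(STATIC_HTML)
print(repr(parser.text))  # the span is empty because no JavaScript ran
```

The parser sees the script only as text, so the span stays empty. A browser controlled by Selenium would execute the script and expose the filled-in value.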

Let's go through the code step by step.

If any part of the code is difficult to understand, feel free to ask our AI Tutor for help.

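1. Import the Necessary Packages

selenium: Drives a real browser so that JavaScript-rendered data can be read.
pandas (imported as pd): Organizes and processes data in tabular form.
webdriver (from selenium): Launches and controls web browsers.
By (from selenium.webdriver.common.by): Specifies how to locate elements on the web page.
ActionChains (from selenium.webdriver.common.action_chains): Performs mouse and keyboard actions on the web page.
WebDriverWait (from selenium.webdriver.support.ui): Repeatedly checks a condition until it is met or a timeout expires.
EC (expected_conditions, from selenium.webdriver.support): Provides ready-made conditions, such as an element being present, for WebDriverWait to check.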
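The lesson does not show the import statements themselves, so the following is a sketch of how they could look. The availability check around them is an addition for illustration (the `REQUIRED` and `missing` names are hypothetical); it lets the snippet run even where the packages are not installed yet.

```python
import importlib.util

# Third-party packages the rest of the lesson relies on
REQUIRED = ("selenium", "pandas")
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    # Nothing is imported in this case; install the packages and rerun
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    print("All imports succeeded")
```

In a script where both packages are known to be installed, the six import lines can simply sit at the top of the file without the check.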

2. Open a Web Browser

Open Web Browser

# Launch Chrome WebDriver to open a browser window
driver = webdriver.Chrome()

# Navigate to the 'Markets' page on Yahoo Finance
driver.get('https://finance.yahoo.com/markets/')

Launch the Chrome browser and navigate to the 'Markets' page on Yahoo Finance.

For reference, Selenium also supports other browsers, such as Firefox (webdriver.Firefox()) and Edge (webdriver.Edge()).

3. Wait Until the Page is Fully Loaded

Wait for Page Load

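# Create a wait object that checks for a condition for up to 10 seconds
wait = WebDriverWait(driver, 10)

WebDriverWait(driver, 10) does not pause the script by itself. It creates a wait object whose until() method, used in the next step, repeatedly checks a condition and raises a TimeoutException if the condition is still not met after 10 seconds.

This matters because elements rendered by JavaScript may not exist yet when driver.get() returns, so the script has to give them time to appear.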
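The idea behind such a wait can be sketched without a browser. The `poll_until` helper below is hypothetical and is not Selenium's implementation; the 0.5-second default interval mirrors Selenium's default polling frequency, and `element_ready` is a stand-in for a condition on a real page.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page element that only "appears" on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = poll_until(element_ready, timeout=2.0, interval=0.01)
print(found, attempts["count"])
```

Unlike a fixed sleep, polling returns as soon as the condition holds, so the script waits no longer than it has to.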

4. Find the 'Americas' Section and Scroll to It

Find Americas Section

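# Wait until the h3 tag with the exact text 'Americas' is present on the page
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))

# Create an ActionChains object to queue mouse actions
actions = ActionChains(driver)

# Move the mouse to the 'Americas' heading
actions.move_to_element(americas_section).perform()

XPath is an expression language for locating particular elements in XML documents, and it works on HTML pages as well.

For example, any h3 element whose text is exactly 'Americas' can be expressed as //h3[text()='Americas']. The leading // searches the whole document rather than only the top level.

XPath is one of several methods Selenium offers for locating elements on a web page, alongside options such as By.ID and By.CSS_SELECTOR.

move_to_element moves the mouse pointer to the middle of the 'Americas' heading. In Chrome this typically also scrolls that part of the page into view.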
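The same kind of lookup can be tried offline with the standard library. This is only a sketch: the markup is a made-up stand-in for the real page, and ElementTree supports just a subset of XPath, so the predicate `[.='Americas']` is used where the lesson writes `[text()='Americas']`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page
PAGE = """
<html><body>
  <section><h3>Europe</h3></section>
  <section><h3>Americas</h3></section>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# './/h3' searches every level below the root, like a leading '//'.
# ElementTree has no text() function, so [.='Americas'] plays the role
# of [text()='Americas'] from the lesson's expression.
matches = root.findall(".//h3[.='Americas']")
print([h3.text for h3 in matches])

# Without a predicate, every h3 is returned
print([h3.text for h3 in root.findall(".//h3")])
```

The predicate in square brackets is what narrows the result from all h3 headings to the single one we want.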

5. Find the Table in the 'Americas' Section

Find Index Data Table

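# Find the enclosing section (data-testid containing 'world-indices') of the 'Americas' heading
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")

# Find the table inside that section
table = parent_section.find_element(By.XPATH, ".//table")

Starting from the "Americas" heading, find the section element that encloses it, and then the table element inside that section.

"./ancestor::section[contains(@data-testid, 'world-indices')]" walks upward from the heading (.) through its ancestors and matches a section element whose data-testid attribute contains 'world-indices'. The section does not have to be the direct parent of the heading.

".//table" then searches downward from that section. The leading . keeps the search inside the section instead of the whole page.

This table contains the data we need (e.g., index names, prices, etc.).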
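What the ancestor axis does can be imitated by hand. The sketch below uses made-up markup, and the `data-testid` value and the `nearest_ancestor_section` helper are hypothetical. ElementTree has no ancestor axis, so the helper walks upward through a parent map and returns the closest matching section.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger
PAGE = """
<body>
  <section data-testid="world-indices-americas">
    <div><h3>Americas</h3></div>
    <table><tbody><tr><td>row</td></tr></tbody></table>
  </section>
</body>
"""
root = ET.fromstring(PAGE.strip())

# ElementTree elements do not know their parent, so build a child -> parent map
parent_of = {child: parent for parent in root.iter() for child in parent}


def nearest_ancestor_section(element, marker):
    """Walk upward until a <section> whose data-testid contains `marker`."""
    node = parent_of.get(element)
    while node is not None:
        if node.tag == "section" and marker in node.get("data-testid", ""):
            return node
        node = parent_of.get(node)
    return None


heading = root.find(".//h3[.='Americas']")
section = nearest_ancestor_section(heading, "world-indices")
table = section.find(".//table")  # search only inside the section
print(section.get("data-testid"), table.tag)
```

Note that the heading sits inside a div here, so the section is a grandparent. That is exactly the case where an ancestor search succeeds and a direct-parent lookup would not.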

6. Collect Table Headers and Data

Extract Table Headers and Data

# Extract headers from the table
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]

# Extract rows from the table
rows = table.find_elements(By.XPATH, ".//tbody/tr")

table.find_elements(By.XPATH, ".//th") locates every th tag within the table, and the list comprehension stores the text of each one in the headers list.

th (table header) tags hold the column names (e.g., "Name", "Price") of the table.

".//tbody/tr" selects each row (tr, table row) in the table body. Header rows normally sit in thead, so rows contains only data rows.
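The two expressions can be checked on a small table without a browser. This is a sketch with invented symbols and prices, parsed with the standard library's ElementTree in place of Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with made-up values, shaped like the one in the lesson
TABLE = """
<table>
  <thead><tr><th>Symbol</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>Index A</td><td>100.00</td></tr>
    <tr><td>BBB</td><td>Index B</td><td>200.00</td></tr>
  </tbody>
</table>
"""
table = ET.fromstring(TABLE.strip())

# Column names come from the th cells
headers = [th.text for th in table.findall(".//th")]

# Data rows are the tr elements directly under tbody
rows = table.findall(".//tbody/tr")

print(headers)
print(len(rows))
```

The header row lives in thead, so it is not counted among the two data rows.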

7. Extract the Values from Each Row and Save

Extract and Save Row Data

# Initialize a list to store data
table_data = []

# Build one dictionary per row and add it to the list
for row in rows:
    # Extract the cells (td tags) of this row
    columns = row.find_elements(By.XPATH, ".//td")

    row_data = {}  # Initialize an empty dictionary

    # Pair each header with the cell in the same position;
    # zip stops at the shorter list, so a short row cannot raise IndexError
    for header, column in zip(headers, columns):
        # Get the text of the cell
        column_value = column.text

        # Add to the dictionary with the header as key and column_value as value
        row_data[header] = column_value

    # Add the data to the list
    table_data.append(row_data)

Iterate over each row to extract data within the td tags (cell values).

Save this data in the row_data variable as a dictionary, where the key is header and the value is column_value.

Finally, append the dictionary to the table_data list.
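The header-to-cell pairing can be seen in isolation with plain lists. The values below are invented, and the second row is deliberately one cell short to show one way of handling rows that do not match the header count.

```python
headers = ["Symbol", "Name", "Price"]

# Made-up cell texts; the second row is deliberately one cell short
rows = [
    ["AAA", "Index A", "100.00"],
    ["BBB", "Index B"],
]

# zip pairs items by position and stops at the shorter sequence
table_data = [dict(zip(headers, cells)) for cells in rows]

print(table_data[0])
print(table_data[1])
```

Indexing both lists by position would fail on the short row, whereas zip simply leaves the missing key out of that row's dictionary.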

8. Convert to DataFrame Using pandas

Convert to DataFrame

# Convert the extracted data to a pandas DataFrame
df = pd.DataFrame(table_data)

# Keep only the 'Symbol' and 'Price' columns
df_filtered = df[['Symbol', 'Price']]

# Print the filtered data
print(df_filtered)

Convert the extracted data to a pandas.DataFrame and keep only the 'Symbol' and 'Price' columns.

The data will be stored in the data frame as shown below. Because Selenium returns the text of each cell, both columns hold strings.

Symbol	Price
...	...
...	...
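Since the cell values arrive as text, numeric work on the prices needs a conversion first. The sketch below assumes prices are displayed with commas as thousands separators, which may not hold for every locale; `parse_price` is a hypothetical helper and the values are made up.

```python
def parse_price(text):
    """Convert a price string such as '5,123.45' to a float."""
    return float(text.replace(",", ""))


# Made-up values in the assumed display format
prices = ["5,123.45", "987.60"]
print([parse_price(p) for p in prices])
```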

9. Select and Sort 'Symbol' and 'Price' Columns

Select and Sort Columns

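# Select the 'Symbol' and 'Price' columns, then sort the rows by 'Symbol'
df_filtered = df[['Symbol', 'Price']].sort_values('Symbol')

# Print the sorted data
print(df_filtered)

Select the 'Symbol' and 'Price' columns from the data frame and sort the rows alphabetically by the 'Symbol' column using sort_values.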
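To see what selecting and sorting do to the rows, here is a plain-Python analogue that runs without pandas. The rows are invented and have the same shape as table_data. In pandas, the sort itself is done with DataFrame.sort_values.

```python
# Made-up rows in the same shape as table_data
table_data = [
    {"Symbol": "CCC", "Name": "Index C", "Price": "300.00"},
    {"Symbol": "AAA", "Name": "Index A", "Price": "100.00"},
    {"Symbol": "BBB", "Name": "Index B", "Price": "200.00"},
]

# Select: keep only the two columns of interest
selected = [{"Symbol": r["Symbol"], "Price": r["Price"]} for r in table_data]

# Sort: order the rows by the 'Symbol' value
ordered = sorted(selected, key=lambda r: r["Symbol"])

for r in ordered:
    print(r["Symbol"], r["Price"])
```

The symbols are compared as strings, so the order is alphabetical, and each price stays attached to its own symbol.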
