Real-Time Stock Data Collection from Yahoo Finance
The ever-changing stock market! If you want to collect real-time stock data and store it periodically, how can you do it? In this lesson, we will explore how to dynamically extract stock data from Yahoo Finance using Selenium.
Extracting Dynamic Data
Dynamic data is generated by JavaScript, which acts as the brain of a web page. This data may be generated after the user visits the site, or change in response to specific user actions. Such dynamic data cannot be fetched using BeautifulSoup and Requests alone, because they only see the initial HTML response. However, Selenium allows us to execute JavaScript on web pages and retrieve dynamic data.
Let's go through the code step by step. If any part of the code is difficult to understand, feel free to ask our AI Tutor for help.
1. Importing Necessary Packages
- `selenium`: Fetches dynamic data from web pages.
- `pandas`: Organizes and processes data in tabular form.
- `webdriver`: Controls web browsers using Selenium.
- `By`: Specifies how to locate elements on the web page.
- `ActionChains`: Performs mouse and keyboard actions on the web page.
- `WebDriverWait`: Waits up to a timeout for a condition to be met (used in step 3).
- `EC` (`expected_conditions`): Defines the conditions to wait for, such as an element appearing on the web page.
2. Open a Web Browser
```python
# Launch Chrome WebDriver to open a browser window
driver = webdriver.Chrome()

# Navigate to the 'Markets' page on Yahoo Finance
driver.get('https://finance.yahoo.com/markets/')
```
Launch the Chrome browser and navigate to the 'Markets' page on Yahoo Finance. For reference, Selenium supports various browsers, such as Chrome and Firefox.
3. Wait Until the Page is Fully Loaded
```python
# Create a waiter with a maximum wait time of 10 seconds
wait = WebDriverWait(driver, 10)
```
Create a `WebDriverWait` object with a maximum wait time of 10 seconds. On its own, this line does not wait; it is used together with `wait.until(...)` (as in the next step) to pause until a condition is met. It is important to allow time for the page to load, as elements may take some time to be ready.
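Under the hood, `WebDriverWait` works by polling: it repeatedly evaluates a condition (every 0.5 seconds by default) until the condition returns a truthy value or the timeout expires. A simplified, browser-free sketch of that idea (not Selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    # Poll the condition until it returns a truthy value or time runs out,
    # mimicking how WebDriverWait.until behaves conceptually
    end = time.time() + timeout
    while time.time() < end:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within timeout")

# Example: the condition succeeds on the third poll
calls = {"count": 0}
def page_ready():
    calls["count"] += 1
    return "loaded" if calls["count"] >= 3 else None

print(wait_until(page_ready, timeout=5, poll=0.01))  # loaded
```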
4. Find the 'Americas' Section and Scroll to It
```python
# Find the h3 tag with the text 'Americas'
americas_section = wait.until(
    EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']"))
)

# Scroll to the 'Americas' section by moving the mouse to it
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
```
XPath is an expression language used to locate particular elements in XML (and HTML) documents, and it is one of the methods Selenium supports for locating elements on a web page. For example, the `h3` element with the text 'Americas' can be expressed as `//h3[text()='Americas']`. `move_to_element` then scrolls the screen to the 'Americas' section.
5. Find the Table in the 'Americas' Section
```python
# Find the ancestor section of the 'Americas' heading that contains the table
parent_section = americas_section.find_element(
    By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]"
)

# Find the table within that section
table = parent_section.find_element(By.XPATH, ".//table")
```
Find the `table` element within the parent `section` tag that contains the "Americas" heading. `"./ancestor::section[contains(@data-testid, 'world-indices')]"` is the XPath that locates that ancestor `section` element. This table contains the data we need (e.g., index names, prices, etc.).
6. Collect Table Headers and Data
```python
# Extract headers from the table
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]

# Extract rows from the table body
rows = table.find_elements(By.XPATH, ".//tbody/tr")
```
`table.find_elements(By.XPATH, ".//th")` locates the `th` tags within the table to extract the headers. `th` (table header) tags represent the column names (e.g., "Name", "Price") in the table; we store these headers in a list. We then use `tbody/tr` to extract each data row (`tr`, table row) from the table body.
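Selenium's WebElements expose an element's visible text through a `.text` attribute, which is what the header comprehension above relies on. A minimal stand-in (no browser required, with hypothetical header names) shows the same pattern:

```python
# Stand-in for Selenium WebElements, which expose their text via `.text`
class FakeCell:
    def __init__(self, text):
        self.text = text

# Hypothetical header cells, as find_elements(By.XPATH, ".//th") might return
th_elements = [FakeCell("Symbol"), FakeCell("Price"), FakeCell("Change")]

# Same comprehension as in the lesson code
headers = [header.text for header in th_elements]
print(headers)  # ['Symbol', 'Price', 'Change']
```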
7. Extract 'Name' and 'Price' values from Each Row and Save
```python
# Initialize a list to store the row data
table_data = []

# Extract column data for each row and add it to the list
for row in rows:
    # Extract the cell (td) elements for this row
    columns = row.find_elements(By.XPATH, ".//td")

    row_data = {}  # Initialize an empty dictionary

    # Assume the 'headers' and 'columns' lists are of the same length
    for i in range(len(headers)):
        header = headers[i]             # i-th header (the key)
        column_value = columns[i].text  # text of the i-th cell (the value)

        # Add to the dictionary with header as key and column_value as value
        row_data[header] = column_value

    # Add the row dictionary to the list
    table_data.append(row_data)
```
Iterate over each row to extract the cell values from its `td` tags. Save this data in the `row_data` dictionary, where each key is a `header` and each value is the corresponding `column_value`. Finally, append the dictionary to the `table_data` list.
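The indexed loop above can also be written with Python's built-in `zip`, which pairs each header with its cell value. A small sketch with hypothetical values:

```python
headers = ["Symbol", "Price"]
values = ["^GSPC", "5,123.45"]  # hypothetical cell texts

# Build the row dictionary by pairing headers with values
row_data = {}
for header, column_value in zip(headers, values):
    row_data[header] = column_value

# Equivalent one-liner
assert row_data == dict(zip(headers, values))
print(row_data)  # {'Symbol': '^GSPC', 'Price': '5,123.45'}
```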
8. Convert to DataFrame Using pandas
```python
# Convert the extracted data to a pandas DataFrame
df = pd.DataFrame(table_data)

# Keep only the 'Symbol' and 'Price' columns
df_filtered = df[['Symbol', 'Price']]

# Print the filtered data
print(df_filtered)
```
Convert the extracted data to a `pandas.DataFrame`. The data will be stored in the data frame as shown below.
| Symbol | Price |
|---|---|
| ... | ... |
| ... | ... |
9. Select and Sort 'Symbol' and 'Price' Columns
```python
# Select the 'Symbol' and 'Price' columns and sort by 'Symbol'
df_filtered = df[['Symbol', 'Price']].sort_values('Symbol')
```

Select the 'Symbol' and 'Price' columns from the data frame, and sort the result by the 'Symbol' column using `sort_values`.
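To see the selection and sorting without a live browser, here is a minimal, self-contained sketch using hypothetical sample rows in place of the scraped `table_data`:

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped table_data
table_data = [
    {"Symbol": "^IXIC", "Price": "16,340.87"},
    {"Symbol": "^GSPC", "Price": "5,123.41"},
    {"Symbol": "^DJI", "Price": "38,722.69"},
]

df = pd.DataFrame(table_data)

# Select the two columns and sort alphabetically by 'Symbol'
df_filtered = df[["Symbol", "Price"]].sort_values("Symbol")
print(df_filtered)
```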