
Sending Data Scraped from Wikipedia via Email

In this assignment, you will scrape date information for significant historical events from Wikipedia, then send a CSV file containing the historical events and their dates as an email attachment.

Understanding how such a program works opens up a wide range of applications: crawling large datasets, processing them into various formats, and delivering the results by email.


Converting Crawling Results to CSV

First, let's look at how to convert the crawled results into a CSV file.


1. Importing Necessary Libraries

Importing libraries for static web crawling
import pandas as pd
import requests
from bs4 import BeautifulSoup
  • pandas : A library used for reading and processing data, often used with Excel and CSV files.

  • requests : A library used for sending requests to and receiving responses from web pages.

  • BeautifulSoup : A library used for parsing the HTML code of web pages to extract the desired information.
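
As a quick illustration of how requests and BeautifulSoup work together, the sketch below fetches a single page and prints its title. The URL is only an example and is not part of the assignment itself.

import requests
from bs4 import BeautifulSoup

# Fetch one page (the URL here is only an example)
response = requests.get('https://en.wikipedia.org/wiki/Moon_landing')

# Parse the returned HTML and read the <title> tag
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text)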


2. Reading the Excel File

Reading Excel file using read_excel
df = pd.read_excel('input_file.xlsx')
  • 'input_file.xlsx' : The path to the Excel file. This file contains the numbers and names of historical events.

  • pd.read_excel : The read_excel function from pandas is used to read the Excel file as a DataFrame.
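
If you do not have input_file.xlsx at hand, a small sample file with the structure described above can be created as in the sketch below (writing .xlsx files requires the openpyxl package). The column names and event names are only examples; only the 'HistoricalEvent' column is actually used in the later steps.

import pandas as pd

# Create a tiny sample input file (column names and events are only examples)
sample = pd.DataFrame({
    'Number': [1, 2],
    'HistoricalEvent': ['Moon_landing', 'French_Revolution']
})
sample.to_excel('input_file.xlsx', index=False)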


3. Creating a Function to Extract Date Information from Wikipedia

Function to extract date information from Wikipedia
def extract_date(event_name):
    # Wikipedia page URL
    url = base_url + event_name
    # Sending a web request
    response = requests.get(url)
    # If the request is successful
    if response.status_code == 200:
        # Parsing the HTML
        soup = BeautifulSoup(response.content, 'html.parser')
        # Date information is typically found in an infobox
        infobox = soup.find('table', {'class': 'infobox'})
        # Finding the 'Date' entry in the infobox
        if infobox:
            # Checking if a 'Date' entry exists
            date_tag = infobox.find('th', string='Date')
            # If the 'Date' entry exists
            if date_tag:
                # Extracting date information from the next sibling tag
                date_value = date_tag.find_next_sibling('td')
                # If date information exists
                if date_value:
                    # Returning the date information
                    return date_value.text.strip()
        # If date information is not found
        return 'No date information'
    # If the web request fails
    else:
        return 'Page error'
  • requests.get : Sends a web request to the given URL and receives a response.

  • BeautifulSoup : Parses the HTML code of the response.

  • infobox : The table (infobox) containing the event's summary information; the 'Date' row is looked up inside it and its value is returned.

  • return : Returns 'No date information' if the date is not found, and 'Page error' if the page cannot be loaded.
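
Note that the function relies on base_url, which is not defined in this lesson. Below is a minimal sketch that assumes it points at English Wikipedia article URLs, followed by a quick standalone check of the function; the event name is only an example, and the exact output depends on the current layout of that page's infobox.

# Assumed base URL for English Wikipedia articles (not shown in the lesson itself)
base_url = 'https://en.wikipedia.org/wiki/'

# Quick standalone check of the function
print(extract_date('French_Revolution'))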


4. Applying the Function to the DataFrame

Extracting date information
df['Date'] = df['HistoricalEvent'].apply(extract_date)
  • df['HistoricalEvent'] : The 'HistoricalEvent' column in the Excel file. This contains the names of each event.

  • apply(extract_date) : Applies the extract_date function to each event name to extract the date, and stores the result in a new 'Date' column.
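
If apply is unfamiliar, the sketch below shows the equivalent explicit loop and then inspects the first few rows; it assumes the df and extract_date from the previous steps.

# apply() above is roughly equivalent to this explicit loop
dates = []
for event_name in df['HistoricalEvent']:
    dates.append(extract_date(event_name))
df['Date'] = dates

# Check that the new 'Date' column was filled in
print(df.head())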


5. Outputting the Results

Outputting crawling results in CSV format
print(df[['HistoricalEvent', 'Date']].to_csv(index=False))
  • df[['HistoricalEvent', 'Date']] : Selects only the 'HistoricalEvent' and extracted 'Date' columns.

  • to_csv(index=False) : Converts the selected data to CSV format and prints it. index=False means excluding the index (which indicates the position of each row in the DataFrame) from the output.
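
Printing the CSV text only displays it on screen. To produce an actual file that can later be attached to an email, you can pass a file path to to_csv, as in the sketch below; the file name is only an example.

# Write the same data to a CSV file (the file name is only an example)
df[['HistoricalEvent', 'Date']].to_csv('events.csv', index=False)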


In the next lesson, we will learn how to send the crawled CSV data via email.
