Regular Expressions for Web Crawling
Regular expressions are tools used to extract information that matches a specific pattern from string data.
They are commonly used to find, replace, or validate specific patterns within strings.
In this lesson, we will explore the basic concepts of regular expressions and introduce you to methods for filtering necessary information from crawling data.
Basic Syntax of Regular Expressions
Regular expressions define specific patterns by combining various symbols and characters.
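For example, the regular expression "^\d{3}-\d{3}-\d{4}$" matches a string that consists entirely of a phone number in the form "123-456-7890". Because of the ^ and $ anchors, it does not match a phone number embedded in a longer sentence.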
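To see how the ^ and $ anchors affect that pattern, here is a minimal sketch. The helper name is_phone_number is made up for illustration and is not part of the lesson.

```python
import re

# Anchored pattern from the text above.
ANCHORED = r"^\d{3}-\d{3}-\d{4}$"


def is_phone_number(value: str) -> bool:
    """Return True if the string as a whole looks like 123-456-7890.

    Hypothetical helper for illustration. Note that `$` also matches
    just before a single trailing newline.
    """
    return re.search(ANCHORED, value) is not None


print(is_phone_number("123-456-7890"))           # True
print(is_phone_number("Call 123-456-7890 now"))  # False: extra text around it
print(is_phone_number("123-45-67890"))           # False: wrong digit grouping
```

Dropping the anchors turns the same pattern into one that can find a number anywhere inside a longer string.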
Common symbols and characters used in regular expressions include:
- . : Matches any single character (except a newline, by default).
- ^ : Indicates the start of the string.
- $ : Indicates the end of the string.
- * : Matches zero or more repetitions of the preceding element.
- + : Matches one or more repetitions of the preceding element.
- [] : Matches any one of the characters inside the brackets.
- \d : Matches any digit.
- \w : Matches any alphanumeric character or underscore.
- \s : Matches any whitespace character.
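The symbols above can be tried out one at a time. The following is a small sketch using Python's built-in re module. The sample strings and the dictionary keys are invented for illustration.

```python
import re

# One tiny example per symbol; each value is the list of matches found.
results = {
    "dot":        re.findall(r"c.t", "cat cut ct"),             # ['cat', 'cut']
    "start":      re.findall(r"^\d+", "42 apples and 7 pears"), # ['42']
    "end":        re.findall(r"\d+$", "order 42 of 99"),        # ['99']
    "star":       re.findall(r"ab*", "a ab abbb"),              # ['a', 'ab', 'abbb']
    "plus":       re.findall(r"ab+", "a ab abbb"),              # ['ab', 'abbb']
    "brackets":   re.findall(r"[aeiou]", "regex"),              # ['e', 'e']
    "digit":      re.findall(r"\d", "Room 42"),                 # ['4', '2']
    "word":       re.findall(r"\w+", "user_1 ok!"),             # ['user_1', 'ok']
    "whitespace": re.split(r"\s+", "a  b\tc"),                  # ['a', 'b', 'c']
}

for name, found in results.items():
    print(f"{name}: {found}")
```

Comparing the "star" and "plus" rows shows the difference between the two: b* also accepts an "a" with no "b" after it, while b+ requires at least one.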
Using Regular Expressions in Python
In Python, you can handle regular expressions using the re module.
The re module provides functions for string searching, matching, and substitution. It is part of Python's standard library, so no separate installation is needed.
    import re

    # Regular expression pattern
    pattern = r'\d{3}-\d{3}-\d{4}'

    # Text to search within
    text = "Customer service contact: Please reach out to 123-456-7890."

    # Search for the first match; the result is a match object, or None if nothing matches
    match = re.search(pattern, text)

    # Check if the pattern matches
    if match:
        # Output the found number: 123-456-7890
        print(f"Found number: {match.group()}")
    else:
        print("No number found.")
This code searches for a phone number pattern within the string and prints it out when found.
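Besides re.search(), the re module also covers the other tasks mentioned earlier: finding every match, replacing, and validating. Here is a brief sketch. The sample text and the "[hidden]" placeholder are made up for illustration.

```python
import re

PHONE = r"\d{3}-\d{3}-\d{4}"

# Hypothetical sample text with two phone numbers.
text = "Sales: 123-456-7890, Support: 987-654-3210."

# Find: re.findall() returns every non-overlapping match as a list of strings.
all_numbers = re.findall(PHONE, text)

# Replace: re.sub() returns a new string with each match substituted.
masked = re.sub(PHONE, "[hidden]", text)

# Validate: re.fullmatch() succeeds only if the whole string fits the pattern.
is_valid = re.fullmatch(PHONE, "123-456-7890") is not None

print(all_numbers)  # ['123-456-7890', '987-654-3210']
print(masked)       # Sales: [hidden], Support: [hidden].
print(is_valid)     # True
```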
Extracting Email Addresses from HTML Data Using Regular Expressions
When you want to extract email addresses from a specific web page, you can fetch the page, extract its text, and apply a regular expression as follows.
    import re
    import requests
    from bs4 import BeautifulSoup

    # URL to crawl
    url = 'https://www.example.com/'

    # Fetch HTML content
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract text from HTML
    text = soup.get_text()

    # Regular expression pattern: find email addresses
    email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

    # Find email addresses
    emails = re.findall(email_pattern, text)

    # Output the extracted email addresses
    for email in emails:
        print(f"Found email address: {email}")
The code above uses the re.findall() function, which returns a list of all substrings that match the regular expression, and then prints each email address found in the page text.
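The same pattern can be tried without a network request. The sketch below runs it on a hard-coded HTML snippet and removes duplicates. The extract_emails helper and the sample HTML are assumptions for illustration, and the pattern is a simplified one that may accept some strings that are not valid addresses and miss some that are.

```python
import re

# Same simplified email pattern as in the lesson.
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"


def extract_emails(html: str) -> list[str]:
    """Return unique email-like strings from raw HTML, in first-seen order.

    Hypothetical helper: it searches the raw markup, so addresses that
    appear only inside attributes such as href="mailto:..." are found too.
    """
    found = re.findall(EMAIL_PATTERN, html)
    # dict.fromkeys() keeps insertion order while dropping duplicates.
    return list(dict.fromkeys(found))


# Made-up sample page for illustration.
sample_html = """
<p>Contact: support@example.com or sales@example.com</p>
<a href="mailto:support@example.com">Email support</a>
"""

print(extract_emails(sample_html))
# ['support@example.com', 'sales@example.com']
```

Searching the raw HTML instead of the extracted text is a design choice: text extraction discards attribute values, so an address that appears only in a mailto link would not be in the text.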