Lecture

Regular Expressions for Web Crawling

Regular expressions are tools used to extract information that matches a specific pattern from string data.

They are commonly used to find, replace, or validate specific patterns within strings.

In this lesson, we will cover the basic concepts of regular expressions and show how to use them to filter the information you need out of crawled data.


Basic Syntax of Regular Expressions

Regular expressions define specific patterns by combining various symbols and characters.

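For example, the regular expression "^\d{3}-\d{3}-\d{4}$" matches a string that consists entirely of a phone number in the form "123-456-7890": three digits, a hyphen, three digits, a hyphen, and four digits.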
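As a rough illustration of what the ^ and $ anchors change, the sketch below compares the anchored pattern with the same pattern without anchors. The helper names is_phone_number and find_phone_number are made up for this example and are not part of the re module.

```python
import re

# Anchored pattern from the text: the whole string must be a phone number.
ANCHORED = r"^\d{3}-\d{3}-\d{4}$"
# Same pattern without anchors: it can match a number inside longer text.
UNANCHORED = r"\d{3}-\d{3}-\d{4}"


def is_phone_number(s):
    """Return True only if the entire string looks like 123-456-7890."""
    return re.search(ANCHORED, s) is not None


def find_phone_number(s):
    """Return the first phone-like substring, or None if there is none."""
    m = re.search(UNANCHORED, s)
    return m.group() if m else None


print(is_phone_number("123-456-7890"))             # True
print(is_phone_number("Call 123-456-7890 now"))    # False
print(find_phone_number("Call 123-456-7890 now"))  # 123-456-7890
```

The anchored form suits validating a single field, while the unanchored form suits pulling numbers out of crawled text.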

Common symbols and characters used in regular expressions include:

  • . : Matches any single character except a newline (by default).

  • ^ : Indicates the start of the string.

  • $ : Indicates the end of the string.

  • * : Matches zero or more repetitions of the preceding element.

  • + : Matches one or more repetitions of the preceding element.

  • [] : Matches any one of the characters inside the brackets.

  • \d : Matches any digit.

  • \w : Matches any alphanumeric character or underscore.

  • \s : Matches any whitespace character.
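The symbols listed above can be tried out on a small string. The following is a minimal sketch; the sample sentence and variable names are invented for illustration.

```python
import re

sample = "Item A1 costs 250 won, item_B2 costs 1200 won."

digits = re.findall(r"\d+", sample)         # runs of one or more digits
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscores
codes = re.findall(r"[AB]\d", sample)       # 'A' or 'B' followed by one digit
dotted = re.findall(r"c.sts", sample)       # '.' stands in for any one character
starts = bool(re.search(r"^Item", sample))  # '^' anchors at the start
ends = bool(re.search(r"won\.$", sample))   # '$' anchors at the end; '\.' is a literal dot

# '*' allows zero repetitions of the preceding element; '+' needs at least one.
star_ok = re.fullmatch(r"ab*", "a") is not None
plus_ok = re.fullmatch(r"ab+", "a") is not None

print(digits)            # ['1', '250', '2', '1200']
print(codes)             # ['A1', 'B2']
print(dotted)            # ['costs', 'costs']
print(starts, ends)      # True True
print(star_ok, plus_ok)  # True False
```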


Using Regular Expressions in Python

In Python, you can handle regular expressions using the re module.

The re module provides functions for searching, matching, and substituting strings. It is part of Python's standard library, so no separate installation is needed.

Using Regular Expressions in Python
```python
import re

# Regular expression pattern
pattern = r'\d{3}-\d{3}-\d{4}'

# Text to search within
text = "Customer service contact: Please reach out to 123-456-7890."

# Store the matched pattern string in match
match = re.search(pattern, text)

# Check if the pattern matches
if match:
    # Output the found number: 123-456-7890
    print(f"Found number: {match.group()}")
else:
    print("No number found.")
```

This code searches for a phone number pattern within the string and prints it out when found.
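Besides re.search(), the re module also covers matching and substitution. The sketch below is one possible illustration using re.compile(), findall(), sub(), and match(); the sample text and the mask string are assumptions chosen for the example.

```python
import re

# Compile once so the same pattern can be reused for several operations.
phone = re.compile(r"\d{3}-\d{3}-\d{4}")

text = "Main: 123-456-7890, fax: 987-654-3210."

# findall() returns every non-overlapping match as a list of strings.
all_numbers = phone.findall(text)

# sub() replaces every match with the given replacement string.
masked = phone.sub("XXX-XXX-XXXX", text)

# match() only succeeds if the pattern matches at the very start of the string.
at_start = phone.match(text)

print(all_numbers)  # ['123-456-7890', '987-654-3210']
print(masked)       # Main: XXX-XXX-XXXX, fax: XXX-XXX-XXXX.
print(at_start)     # None
```

Because the text begins with "Main", match() returns None here even though search() would find the first number.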


Extracting Email Addresses from HTML Data Using Regular Expressions

When you want to extract email addresses from a specific web page, you can use the following regular expression.

Extracting Email Addresses Using Regular Expressions
```python
import re
import requests
from bs4 import BeautifulSoup

# URL to crawl
url = 'https://www.example.com/'

# Fetch HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract text from HTML
text = soup.get_text()

# Regular expression pattern: find email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find email addresses
emails = re.findall(email_pattern, text)

# Output the extracted email addresses
for email in emails:
    print(f"Found email address: {email}")
```

In the code above, re.findall() returns a list of every non-overlapping substring of the page text that matches the email pattern, and the loop prints each address in that list.
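Crawled pages often repeat the same address, so it can help to de-duplicate the results. The sketch below applies the same email pattern to a hard-coded sample string, so it runs without network access. The helper extract_unique_emails and the sample text are hypothetical, and lowercasing the whole address is a simplifying assumption.

```python
import re

# Same email pattern as in the lesson, compiled for reuse.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_unique_emails(text):
    """Return email-like strings in first-seen order, lowercased and de-duplicated."""
    found = (m.lower() for m in EMAIL_PATTERN.findall(text))
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(found))


# Stand-in for text extracted from a crawled page.
sample_text = (
    "Contact sales@example.com or support@example.com. "
    "Press inquiries: Sales@Example.com"
)

print(extract_unique_emails(sample_text))
# ['sales@example.com', 'support@example.com']
```

Note that this pattern is a practical approximation, not a full validator: it can miss unusual but valid addresses and accept some invalid ones.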

Mission

In regular expressions, \d matches any digit.

True
False
