# Regular Expressions for Web Crawling

`Regular expressions` are tools used to **extract information that matches a specific pattern** from string data.

They are commonly used to find, replace, or validate specific patterns within strings.

In this lesson, we will cover the basic concepts of regular expressions and show how to filter the information you need out of crawled data.

<br />

## Basic Syntax of Regular Expressions

Regular expressions define specific patterns by combining various symbols and characters.

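For example, the regular expression `"^\d{3}-\d{3}-\d{4}$"` matches a string that consists entirely of a phone number in the form "123-456-7890". Without the `^` and `$` anchors, the same pattern can also find such a number inside a longer string.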
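To make the effect of the anchors concrete, here is a minimal sketch that compares the pattern `^\d{3}-\d{3}-\d{4}$` with an unanchored version. The sample strings and variable names are made up for illustration.

```python
import re

anchored = r'^\d{3}-\d{3}-\d{4}$'
unanchored = r'\d{3}-\d{3}-\d{4}'

# Hypothetical sample strings
exact = "123-456-7890"
embedded = "Call 123-456-7890 today"

# The anchored pattern matches only when the whole string is a phone number
anchored_exact = re.search(anchored, exact) is not None
anchored_embedded = re.search(anchored, embedded) is not None

# The unanchored pattern also finds a number inside a longer string
unanchored_embedded = re.search(unanchored, embedded) is not None

print(anchored_exact, anchored_embedded, unanchored_embedded)
```

The anchored form is the better fit for validating a single input value, while the unanchored form suits searching through crawled text.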

Common symbols and characters used in regular expressions include:

- `.` : Matches any single character except a newline.

- `^` : Matches the start of the string.

- `$` : Matches the end of the string.

- `*` : Matches zero or more repetitions of the preceding element.

- `+` : Matches one or more repetitions of the preceding element.

- `{n}` : Matches exactly n repetitions of the preceding element.

- `[]` : Matches any one of the characters inside the brackets.

- `\d` : Matches any digit.

- `\w` : Matches any word character (a letter, digit, or underscore).

- `\s` : Matches any whitespace character.
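The symbols in the list above can be tried out on a short string. The following is a small sketch, not an exhaustive reference; the sample sentence and variable names are invented for this example.

```python
import re

# Hypothetical sample text
sample = "Order 66 shipped to Room 4 on 2024-01-15"

# \d+ : one or more digits in a row
digit_runs = re.findall(r'\d+', sample)

# [A-Z]\w* : an uppercase letter followed by zero or more word characters
capitalized = re.findall(r'[A-Z]\w*', sample)

# . : "R", any two characters, then "m"
four_chars = re.findall(r'R..m', sample)

# ^ and $ : anchor the pattern to the start or end of the string
starts_with_order = re.search(r'^Order', sample) is not None
ends_with_two_digits = re.search(r'\d{2}$', sample) is not None

# \s+ : split on runs of whitespace
words = re.split(r'\s+', sample)

print(digit_runs)
print(capitalized)
print(four_chars)
print(starts_with_order, ends_with_two_digits)
print(words)
```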

<br />

## Using Regular Expressions in Python

In Python, you can work with regular expressions using the `re` module.

The `re` module provides functions for searching, matching, and substituting strings. It is part of Python's standard library, so no separate installation is needed.

```python title="Using Regular Expressions in Python"
import re

# Regular expression pattern
pattern = r'\d{3}-\d{3}-\d{4}'

# Text to search within
text = "Customer service contact: Please reach out to 123-456-7890."

# Search for the first match; returns a match object, or None if nothing matches
match = re.search(pattern, text)

# Check if the pattern matches
if match:
    # Output the found number: 123-456-7890
    print(f"Found number: {match.group()}")
else:
    print("No number found.")
```

This code searches for a phone number pattern within the string and prints it out when found.
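`re.search()` stops at the first match. When a text may contain several numbers, or when the matches should be replaced rather than printed, `re.findall()` and `re.sub()` are the usual choices. The sketch below assumes a made-up sample sentence and a placeholder string `[hidden]` chosen only for illustration.

```python
import re

pattern = r'\d{3}-\d{3}-\d{4}'

# Hypothetical text containing two phone numbers
text = "Sales: 123-456-7890, Support: 987-654-3210."

# findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(pattern, text)

# sub replaces every match with the given replacement string
masked = re.sub(pattern, '[hidden]', text)

print(all_numbers)
print(masked)
```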

<br />

## Extracting Email Addresses from HTML Data Using Regular Expressions

When you want to extract email addresses from a specific web page, you can use the following regular expression.

```python title="Extracting Email Addresses Using Regular Expressions"
import re
import requests
from bs4 import BeautifulSoup

# URL to crawl
url = 'https://www.example.com/'

# Fetch HTML content (stop early on a timeout or an HTTP error)
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Extract text from HTML
text = soup.get_text()

# Regular expression pattern: find email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find email addresses
emails = re.findall(email_pattern, text)

# Output the extracted email addresses
for email in emails:
    print(f"Found email address: {email}")
```

The code above uses `re.findall()` to return a list of every substring that matches the pattern, then prints each email address found in the page text.
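Crawled pages often repeat the same address several times. As a rough sketch, the same email pattern can be tested offline on a plain string and the results deduplicated. The sample text is invented, and lowercasing addresses before comparing them is an assumption made here for simplicity, not a rule that fits every mail system.

```python
import re

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Hypothetical text, standing in for the output of soup.get_text()
sample_text = (
    "Contact: info@example.com, sales@example.co.kr. "
    "Press: INFO@example.com or info@example.com"
)

# Every match, in order of appearance, including repeats
emails = re.findall(email_pattern, sample_text)

# Assumption: treat addresses as case-insensitive, then drop repeats
# while keeping the order of first appearance
unique_emails = list(dict.fromkeys(email.lower() for email in emails))

print(emails)
print(unique_emails)
```

Note that the trailing period after `sales@example.co.kr` is not captured, because the pattern requires at least two letters after the final dot.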

In regular expressions, `\d` matches any single digit such as 0 through 9. For example, `\d{3}` matches exactly three consecutive digits.
