Legal and Ethical Responsibilities in Web Crawling
Many websites restrict or prohibit crawling through their terms of service or robots.txt (a file that tells crawlers which parts of a site they may access). Therefore, when performing web crawling, it's crucial to be aware of both legal and ethical responsibilities.
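For illustration, a site's robots.txt is a plain-text file served at the root of the domain. The rules below are a hypothetical example, not taken from any real site:

```
User-agent: *
Disallow: /private/
Crawl-delay: 2
```

Here, `User-agent: *` means the rules apply to all crawlers, `Disallow: /private/` asks them to skip that path, and `Crawl-delay: 2` requests a two-second pause between requests.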
Legal Considerations
- Copyright Law: Most website content is protected by copyright. When crawling and using website data, be mindful of copyright laws to avoid violations. Extra caution is required, especially when using the collected data for commercial purposes or public distribution.
- Data Protection Laws: Many countries have strict regulations on collecting and using personal information. If web crawling involves personal data collection, you must comply with the relevant data protection laws.
- Terms of Service: A website's terms of service outline how its data can be used. Many sites include clauses that restrict or prohibit crawling, so it's important to review them beforehand.
Ethical Considerations
- Minimize Server Load: Crawling can strain website servers. Excessive crawling may cause server overloads, disrupting normal operations. To prevent this, adjust crawling frequency appropriately and minimize server impact.
- Adherence to robots.txt: A website's robots.txt file designates pages that crawlers should not access. For ethical crawling, you must adhere to the instructions in this file.
- Transparency in Data Use: When using collected data, be transparent about the source and method of collection. Additionally, avoid data manipulation or misinformation.
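The first two points above can be sketched in code. This is a minimal example using Python's standard `urllib.robotparser` module; the `example.com` URLs and the robots.txt rules fed to the parser are illustrative assumptions, and in practice you would load the real file over the network:

```python
import time
import urllib.robotparser

# Parse a robots.txt ruleset. The rules are supplied inline so the sketch
# runs without network access; against a real site you would instead call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

def fetch_allowed(url, user_agent="*"):
    """Return True only if robots.txt permits crawling this URL."""
    return rp.can_fetch(user_agent, url)

print(fetch_allowed("https://example.com/index.html"))  # True
print(fetch_allowed("https://example.com/private/a"))   # False

# Respect the site's requested delay between requests to reduce server load.
delay = rp.crawl_delay("*") or 1
time.sleep(delay)  # pause before issuing the next request
```

Checking `can_fetch` before every request and honoring `crawl_delay` addresses both robots.txt adherence and server-load minimization in one loop.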
Practice
Click the Run Code button on the right-hand side of the screen to review the crawling results or edit the code!