Lecture

Considerations When Conducting Web Crawling

Web crawling (often grouped with web scraping) is a highly useful method for automatically collecting data from the internet, but it also carries several legal and ethical responsibilities.


Legal Responsibilities of Web Crawling

Many websites prohibit crawling to prevent server overload and clearly state these restrictions in their terms of service.

Ignoring these rules and proceeding with crawling can lead to legal disputes.

Moreover, if you plan to use the collected data for commercial purposes, you must comply with relevant laws, such as copyright laws.


Always Check the robots.txt File

Common rules for web crawlers are typically defined in a website's robots.txt file.

This file lives at the root of a website, at the path https://website.com/robots.txt (for example, https://en.wikipedia.org/robots.txt), and specifies which pages web crawlers can and cannot access.

Here is a simple example of a robots.txt file:

Example robots.txt
User-agent: *
Disallow: /private/
Allow: /public/

In this example, all crawlers are prohibited from accessing the /private/ directory but are allowed to access the /public/ directory.
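You can evaluate rules like these programmatically with Python's standard-library urllib.robotparser module. A minimal sketch, parsing the example rules above (the URLs and the "*" user agent are illustrative):

```python
from urllib import robotparser

# The example robots.txt rules shown above.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) returns True if that agent may crawl the URL.
print(parser.can_fetch("*", "https://website.com/public/page.html"))   # True
print(parser.can_fetch("*", "https://website.com/private/page.html"))  # False
```

In a real crawler you would point the parser at the live file with set_url("https://website.com/robots.txt") followed by read(), rather than parsing an inline string.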

Adhering to the robots.txt file is a fundamental ethical practice in web crawling.

Ignoring this file and collecting all data from a website goes against the website operator's intentions and, depending on the jurisdiction and the site's terms of service, may expose you to legal liability.
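Besides Allow and Disallow, many robots.txt files declare a Crawl-delay, a widely used (though non-standard) directive that asks crawlers to pause between requests, which helps prevent the server overload mentioned earlier. A sketch of reading it with urllib.robotparser, assuming a hypothetical bot name "ExampleBot":

```python
from urllib import robotparser

# Example rules including a Crawl-delay directive (illustrative values).
rules = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() falls back to the "*" entry when no specific one matches.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 5
```

A polite crawler would call time.sleep(delay) between successive requests to the same site; if crawl_delay() returns None, choosing a conservative default delay is still good practice.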

Mission

Which of the following must be verified to determine if web crawling is allowed?

terms.txt

privacy.txt

robots.txt

config.txt
