
Data Extraction with Retrieval-Augmented Generation (RAG)

Generative AI is extremely useful for extracting and organizing data from the web. With the advent of Retrieval-Augmented Generation (RAG), models can now search external sources, allowing them to return more accurate and up-to-date information.

In this lesson, we will learn how to effectively extract data from the web using generative AI.


How to Extract Data from the Web Using RAG

To find out what historical events occurred on today's date, let's extract the "On This Day" section from the English Wikipedia main page.

When extracting data using RAG in ChatGPT, you can write a prompt like the following:

Practice Example
Please extract the "On This Day" section from the English Wikipedia main page.
URL: https://en.wikipedia.org/wiki/Main_Page
List the years in descending order and format the list using bullet points with dashes (-).
Example of AI Extraction Results (July 8, 2024)
On This Day (July 8):
- 1994: North Korean leader Kim Il-sung passed away
- 1967: The CIA revealed the Phoenix Program
- 1889: The first issue of The Wall Street Journal was published
- 1833: The Treaty of Hünkâr İskelesi between Russia and the Ottoman Empire was signed
- 1497: Vasco da Gama began his voyage to India
[Source: English Wikipedia Main Page](https://en.wikipedia.org/wiki/Main_Page)

As you can see, when extracting data with RAG, it is important to specify the URL from which the data should be extracted, or to explicitly ask the model to search the web document and retrieve the results.

Additionally, when organizing extracted data, listing items in a bullet-point format (using characters such as - or · for lists) or in table form can enhance data readability.

Therefore, it's recommended to specify the output format at the end of the prompt.
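
Although the lesson demonstrates this with a ChatGPT prompt, the same extraction can be sketched in code for reference. The sketch below is a minimal example, not the lesson's official method: it assumes the "On this day" box on the Wikipedia Main Page is marked with the element id mp-otd (an assumption about the current page layout) and uses the third-party requests and BeautifulSoup libraries.

Python Sketch: Fetching the "On This Day" Section
# Minimal sketch, assuming the "On this day" box uses the element id "mp-otd".
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(URL, headers={"User-Agent": "rag-lesson-example/0.1"})
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
otd = soup.find(id="mp-otd")  # assumed element id; the page layout may change

if otd is None:
    print("Could not find the 'On this day' section; the page layout may have changed.")
else:
    for item in otd.find_all("li"):
        # Print each event as a dash-prefixed bullet, matching the prompt's requested format.
        print("-", item.get_text(" ", strip=True))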


Important Considerations When Using RAG to Search for Data

When a RAG (Retrieval-Augmented Generation) system extracts data from the internet, it must adhere to the robots.txt file.

robots.txt is a standard file that provides instructions to web crawlers and other web robots about which parts of a site can be crawled or indexed.

Simple robots.txt Example
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /admin/

Here, User-agent: * signifies all web robots, and the Disallow directives specify areas that should not be crawled or have data extracted from them.

The RAG system should comply with these Disallow directives and avoid extracting data from areas such as /private/, /tmp/, and /admin/.
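
These rules can also be checked programmatically before any data is collected. The sketch below is a minimal example using Python's built-in urllib.robotparser; the user agent "*" and the Wikipedia URLs are illustrative choices, not requirements.

Python Sketch: Checking robots.txt Before Crawling
# Minimal sketch: ask the site's robots.txt whether a URL may be crawled.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://en.wikipedia.org/robots.txt")
parser.read()  # download and parse the site's robots.txt

url = "https://en.wikipedia.org/wiki/Main_Page"
if parser.can_fetch("*", url):
    # can_fetch() returns True only if the rules allow this user agent to crawl the URL.
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)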

In the case of the Wikipedia example, the data collection directives specified in robots.txt must be followed.

Public institutions and websites like Wikipedia that operate for public benefit often allow data collection to disseminate useful information. However, most private websites restrict indiscriminate data collection through robots.txt files.

If you're unfamiliar with website structures and find it difficult to understand robots.txt, it's advisable to have the AI confirm whether the URL is safe for data extraction before proceeding.


Notice

Currently, OpenAI does not provide RAG support for external services such as CodeFriends.

Therefore, within the practice environment, you cannot explore external web documents starting with https://.

For actual RAG practice, please proceed using ChatGPT.

Mission

When extracting data using RAG, you must refer to and comply with the sitemap.xml file.

True
False
