
Data Extraction with Retrieval-Augmented Generation (RAG)

Generative AI is extremely useful for extracting and organizing data from the web. With the advent of Retrieval-Augmented Generation (RAG), models can now search external sources, allowing them to return more accurate and up-to-date information.

In this lesson, we will learn how to effectively extract data from the web using generative AI.


How to Extract Data from the Web Using RAG

To find out what historical events occurred on today's date, let's extract the "On This Day" section from the English Wikipedia main page.

When extracting data using RAG in ChatGPT, you can write a prompt like the following:

Practice Example
Please extract the "On This Day" section from the English Wikipedia main page.
URL: https://en.wikipedia.org/wiki/Main_Page
List the years in descending order and format the list using bullet points with dashes (-).
Example of AI Extraction Results (July 8, 2024)
On This Day (July 8):
- 1994: North Korean leader Kim Il-sung passed away
- 1967: The CIA revealed the Phoenix Program
- 1889: The first issue of The Wall Street Journal was published
- 1833: The Treaty of Hünkâr İskelesi between Russia and the Ottoman Empire was signed
- 1497: Vasco da Gama began his voyage to India
[Source: English Wikipedia Main Page](https://en.wikipedia.org/wiki/Main_Page)

As you can see, when extracting data with RAG, it is important to specify the URL from which the data should be extracted, or to explicitly ask the model to search the web document and retrieve the results.

Additionally, when organizing extracted data, listing items in a bullet-point format (using characters such as - or · for lists) or in table form can enhance data readability.

Therefore, it's recommended to specify the output format at the end of the prompt.
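
Although the lesson demonstrates this with a ChatGPT prompt, the same extraction can be sketched in code for reference. The sketch below is a minimal example, not the lesson's official method: it assumes the "On this day" box on the Wikipedia Main Page is marked with the element id mp-otd (an assumption about the current page layout) and uses the third-party requests and BeautifulSoup libraries.

Python Sketch: Fetching the "On This Day" Section
# Minimal sketch, assuming the "On this day" box uses the element id "mp-otd".
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Main_Page"

response = requests.get(URL, headers={"User-Agent": "rag-lesson-example/0.1"})
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
otd = soup.find(id="mp-otd")  # assumed element id; the page layout may change

if otd is None:
    print("Could not find the 'On this day' section; the page layout may have changed.")
else:
    for item in otd.find_all("li"):
        # Print each event as a dash-prefixed bullet, matching the prompt's requested format.
        print("-", item.get_text(" ", strip=True))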


Important Considerations When Using RAG to Search for Data

When a RAG (Retrieval-Augmented Generation) system extracts data from the internet, it must adhere to the robots.txt file.

robots.txt is a standard file that provides instructions to web crawlers and other web robots about which parts of a site can be crawled or indexed.

Simple robots.txt Example
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /admin/

Here, User-agent: * signifies all web robots, and the Disallow directives specify areas that should not be crawled or have data extracted from them.

The RAG system should comply with these Disallow directives and avoid extracting data from areas such as /private/, /tmp/, and /admin/.
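
These rules can also be checked programmatically before any data is collected. The sketch below is a minimal example using Python's built-in urllib.robotparser; the user agent "*" and the Wikipedia URLs are illustrative choices, not requirements.

Python Sketch: Checking robots.txt Before Crawling
# Minimal sketch: ask the site's robots.txt whether a URL may be crawled.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://en.wikipedia.org/robots.txt")
parser.read()  # download and parse the site's robots.txt

url = "https://en.wikipedia.org/wiki/Main_Page"
if parser.can_fetch("*", url):
    # can_fetch() returns True only if the rules allow this user agent to crawl the URL.
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)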

In the case of the Wikipedia example, the data collection directives specified in robots.txt must be followed.

Public institutions and websites like Wikipedia that operate for public benefit often allow data collection to disseminate useful information. However, most private websites restrict indiscriminate data collection through robots.txt files.

If you're unfamiliar with website structures and find it difficult to understand robots.txt, it's advisable to have the AI confirm whether the URL is safe for data extraction before proceeding.


Notice

Currently, OpenAI does not provide RAG support for external services such as CodeFriends.

Therefore, within the practice environment, you cannot explore external web documents starting with https://.

For actual RAG practice, please proceed using ChatGPT.

Mission

When extracting data using RAG, you must refer to and comply with the sitemap.xml file.

True
False
