Lecture

The Dataset: An Essential for Taming AI

A dataset is a collection of data gathered and organized for a specific purpose, such as training and validating an AI model.

The JSONL file created for fine-tuning in the previous lesson is also one form of a dataset.
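To make the JSONL point concrete, here is a minimal sketch of reading such a file as a dataset. JSONL stores one JSON object per line. The `prompt` and `completion` keys and the sample text are assumptions for illustration only; the file from the previous lesson may use different keys.

```python
import json

# Hypothetical JSONL content: one JSON object per line.
# The "prompt" and "completion" keys are assumed for illustration.
jsonl_text = """\
{"prompt": "What is a dataset?", "completion": "A collection of data organized for a purpose."}
{"prompt": "What is a label?", "completion": "The answer attached to an example."}
"""

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl(jsonl_text)
print(len(records))          # number of examples in the dataset
print(records[0]["prompt"])  # first example's input text
```

Each parsed line becomes one example, so the whole file behaves like a list of records.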


What kind of data is included in a dataset?

A dataset can contain a variety of data forms, including tables, images, text, and time-series data.

  • Tabular Data: Table-formatted data made up of rows and columns, such as Excel files (.xlsx) and CSV files (.csv).

  • Image Data: Consists of image files like PNG and JPG, mainly utilized in computer vision.

  • Text Data: Data in the form of documents, sentences, and words, widely utilized in Natural Language Processing (NLP).

  • Time-Series Data: Data recorded at successive points in time, such as stock prices and temperature readings.
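The data types listed above can be sketched with Python's standard library alone. This is a rough illustration, not a recommended loading pipeline: all values are made up, and image data is left out because reading image files usually needs a third-party library.

```python
import csv
import io
from datetime import date

# Tabular data: rows and columns, shown here as CSV text (hypothetical values).
csv_text = "name,height_cm\nAda,170\nGrace,165\n"
table = list(csv.DictReader(io.StringIO(csv_text)))

# Time-series data: each value is paired with the date it was recorded
# (hypothetical temperatures, deliberately out of order).
temperatures = [
    (date(2021, 1, 3), 2.0),
    (date(2021, 1, 1), -1.5),
    (date(2021, 1, 2), 0.5),
]
temperatures.sort(key=lambda pair: pair[0])  # put readings in time order

# Text data: a sentence split into words on whitespace.
sentence = "How are you feeling today?"
words = sentence.split()

print(table[0]["name"], table[0]["height_cm"])
print(temperatures[0])
print(words)
```

Note that `csv.DictReader` returns every cell as a string, so numeric columns such as `height_cm` still need converting before a model can use them.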


What is the general structure of a dataset?

Most datasets consist of the following three parts:

  • Feature: Data that is input into the AI model and serves as the focus of learning. In a chatbot model, the user's 'question' could be a feature, while in an image classification model, the 'photo' could be a feature.

  • Label: The answer or expected result for each example in the dataset. If a photo contains a cat, the label of that photo would be 'cat'.

  • Metadata: It's like a manual for the dataset, providing additional information such as the source of the data and when it was created.


Features | Label | Metadata
Image file path: /images/cat.jpg | Cat | File size: 3 MB, Capture date: 2021-01-15, Source: User Upload
Text: "How are you feeling today?" | Feeling inquiry | Length: 26 characters, Author: Admin, Creation date: 2021-02-01
Numeric data: [2, 14, 15, 23] | Sum of sequence: 54 | Data type: Integer array, Input date: 2021-03-22
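The feature, label, and metadata structure can be sketched in code as follows. The `Example` class and the `split_features_labels` helper are hypothetical names chosen for illustration, not part of any library; the sample values come from the table rows above.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    """One dataset entry (hypothetical structure for illustration)."""
    feature: object   # input given to the model
    label: str        # expected answer for that input
    metadata: dict = field(default_factory=dict)  # extra info about the entry

dataset = [
    Example("/images/cat.jpg", "Cat",
            {"source": "User Upload", "capture_date": "2021-01-15"}),
    Example("How are you feeling today?", "Feeling inquiry",
            {"author": "Admin", "creation_date": "2021-02-01"}),
]

def split_features_labels(examples):
    """Separate model inputs from their answers; metadata is not used for training here."""
    features = [ex.feature for ex in examples]
    labels = [ex.label for ex in examples]
    return features, labels

features, labels = split_features_labels(dataset)
print(features)
print(labels)
```

In this sketch only the features and labels are passed on for training, while the metadata stays attached to each example for tracing where the data came from.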

Commonly Used Datasets

  • MNIST Dataset: A dataset composed of handwritten digit images, frequently used in the field of computer vision.

  • Iris Dataset: A tabular dataset used for predicting Iris flower species.

  • IMDB Review Dataset: A dataset of movie review texts used for sentiment analysis and other applications.

Mission

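Which of the following words best fills in the blank?

A collection of data gathered and organized for AI model training and validation is called a ____.

  • Dataset
  • Label
  • Metadata
  • Feature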
