How to Split a Dataset
To evaluate if an AI model functions well for its intended purpose, the dataset should be divided into training set
, validation set
, and test set
.
In this lesson, we'll explore how to split datasets effectively for optimal learning.
How Should We Split?
There are various ways to split a dataset, but it's generally divided using the following ratios:
-
Training Set: About 60-80% of the entire dataset
-
Validation Set: About 10-20% of the entire dataset
-
Test Set: About 10-20% of the entire dataset
For instance, you might use 70% of the entire dataset as the training set, 15% as the validation set, and the remaining 15% as the test set.
Example of Dataset Splitting: Student Exam Scores
If you have exam score data for 100 students, you can split the data as follows:
Training Set (70%):
- Use the exam scores of 70 students to train the AI model.
- The AI model learns patterns in the scores through this data.
Validation Set (15%):
- During AI model training, use the exam scores of 15 students to evaluate the model's performance.
- Ensure the model isn't overly specialized on the training data (Overfitting) and that it learns enough from the training data (Underfitting).
Test Set (15%):
- Use the remaining exam scores of 15 students to test the model's final performance.
- Assess how well the model predicts new data using this set.
Methods of Dataset Splitting
Random Sampling
Select data randomly to split into training, validation, and test sets.
-
Advantages: It's simple to implement, and all data has an equal chance of being selected.
-
Disadvantages: There's a risk that the data's important characteristics might not be evenly distributed across the sets. For example, there may be too many or too few instances of a specific category.
Stratified Sampling
Select data by considering key characteristics of the dataset to ensure each stratum (or group) is proportionally represented. For instance, in a disease prediction model, you can maintain the male-to-female ratio to ensure that the dataset's relevant characteristics are reflected uniformly across all sets.
-
Advantages: Allows you to maintain the important characteristics of the dataset while splitting.
-
Disadvantages: It can be complex to implement and requires a precise understanding of the dataset's characteristics.
데이터셋 분할 방법 중 층화 샘플링(Stratified Sampling)의 주요 특징은 무엇인가요?
구현이 매우 간단하다
모든 데이터가 선택될 확률이 동일하다
데이터셋의 주요 특성을 고려하여 균등하게 분할한다
항상 랜덤 샘플링보다 빠르다
Lecture
AI Tutor
Publish
Design
Upload
Notes
Favorites
Help