lesson1Title

lesson2Title

lesson3Title

lesson4Title

lesson5Title

lesson6Title

lesson7Title

lesson8Title

lesson9Title

lesson10Title

lesson11Title

lesson12Title

aiFineTuningBasicsChapter2Title

lesson13Title

aiFineTuningBasicsChapter1Title

aiFineTuningBasicsChapter3Title

# How to Split a Dataset

To evaluate if an AI model functions well for its intended purpose, the dataset should be divided into `training set`, `validation set`, and `test set`.

In this lesson, we'll explore how to split datasets effectively for optimal learning.

 

## How Should We Split?

There are various ways to split a dataset, but it's generally divided using the following ratios:

- *Training Set*: About 60-80% of the entire dataset

- *Validation Set*: About 10-20% of the entire dataset

- *Test Set*: About 10-20% of the entire dataset

For instance, you might use 70% of the entire dataset as the **training set**, 15% as the **validation set**, and the remaining 15% as the **test set**.

 

### Example of Dataset Splitting: Student Exam Scores

If you have exam score data for 100 students, you can split the data as follows:

 

*Training Set (70%)*:
 - Use the exam scores of 70 students to train the AI model.
 - The AI model learns patterns in the scores through this data.

 

*Validation Set (15%)*:
 - During AI model training, use the exam scores of 15 students to evaluate the model's performance.
 - Ensure the model isn't overly specialized on the training data (Overfitting) and that it learns enough from the training data (Underfitting).

 

*Test Set (15%)*:
 - Use the remaining exam scores of 15 students to test the model's final performance.
 - Assess how well the model predicts new data using this set.

 

## Methods of Dataset Splitting

### Random Sampling

Select data randomly to split into training, validation, and test sets.

- Advantages: It's simple to implement, and all data has an equal chance of being selected.

- Disadvantages: There's a risk that the data's important characteristics might not be evenly distributed across the sets. For example, there may be too many or too few instances of a specific category.

 

### Stratified Sampling

Select data by considering key characteristics of the dataset to ensure each stratum (or group) is proportionally represented. For instance, in a disease prediction model, you can maintain the male-to-female ratio to ensure that the dataset's relevant characteristics are reflected uniformly across all sets.

- Advantages: Allows you to maintain the important characteristics of the dataset while splitting.

- Disadvantages: It can be complex to implement and requires a precise understanding of the dataset's characteristics.

### Stratified sampling selects data from each stratum (or group) in the dataset while maintaining the same proportion. This method ensures that key characteristics of the dataset are evenly represented across all subsets.