Lecture

What is Imbalanced Data?

Imbalanced data occurs when certain labels in a dataset are significantly more or less prevalent than others.

These label categories are referred to as classes, and an imbalance between classes can severely degrade model performance.

For instance, imagine creating an email spam filter using AI.

Suppose the training data consists of 10,000 emails, out of which 9,500 are legitimate emails and only 500 are spam.

If you train the AI model on this data as it is, the model will tend to predict that almost every email is legitimate. Because legitimate emails dominate the training data, the model never adequately learns the minority class of spam emails.
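To see why this is a problem, consider a baseline that simply labels every email as legitimate. On the 9,500 / 500 split above it reaches 95% accuracy while catching zero spam. A minimal sketch (the labels and the "model" are illustrative only):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 0 = legitimate, 1 = spam (9,500 vs. 500 emails)
y_true = np.array([0] * 9500 + [1] * 500)

# A "model" that always predicts the majority class (legitimate)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches no spam at all
```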


Solutions to Imbalanced Data

1. Data Resampling

Undersampling

This involves reducing the amount of data from the majority class to balance the training dataset. However, there is a risk of losing important information.
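A minimal sketch of random undersampling using scikit-learn's resample utility; the feature matrix and labels below are placeholders, not data from the lecture.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical features X and labels y (0 = majority, 1 = minority)
X = np.random.rand(10000, 5)
y = np.array([0] * 9500 + [1] * 500)

X_major, y_major = X[y == 0], y[y == 0]
X_minor, y_minor = X[y == 1], y[y == 1]

# Randomly drop majority samples until both classes are the same size
X_major_down, y_major_down = resample(
    X_major, y_major, replace=False, n_samples=len(y_minor), random_state=42
)

X_balanced = np.vstack([X_major_down, X_minor])
y_balanced = np.concatenate([y_major_down, y_minor])
print(np.bincount(y_balanced))  # [500 500]
```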

Oversampling

This involves duplicating or generating more data for the minority class to balance the training dataset.
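The same utility can oversample instead, drawing minority samples with replacement until they match the majority count. (Libraries such as imbalanced-learn also offer synthetic oversampling methods like SMOTE.) The data below is again a placeholder.

```python
import numpy as np
from sklearn.utils import resample

X = np.random.rand(10000, 5)
y = np.array([0] * 9500 + [1] * 500)

X_minor, y_minor = X[y == 1], y[y == 1]

# Duplicate minority samples (sampling with replacement) up to the majority size
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=9500, random_state=42
)

X_balanced = np.vstack([X[y == 0], X_minor_up])
y_balanced = np.concatenate([y[y == 0], y_minor_up])
print(np.bincount(y_balanced))  # [9500 9500]
```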


2. Data Augmentation

Increase diversity by generating new samples for the minority class. For instance, with image data, you can create new samples through rotation, scaling, and cropping.
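A minimal sketch of image augmentation with torchvision transforms, assuming an image-classification setting; the specific transforms and parameters are illustrative.

```python
from torchvision import transforms

# Randomly rotate, crop/scale, and flip minority-class images to create new variants
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # rotation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # scaling + cropping
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # apply to each PIL image of the minority class
```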


3. Use Appropriate Evaluation Metrics

In scenarios of data imbalance, using precision, recall, and F1 score as evaluation metrics is more appropriate than simple accuracy.

Precision

The ratio of true positive predictions among all positive predictions. For example, the proportion of transactions predicted as fraud that are actually fraudulent.

Recall

The ratio of true positive predictions to all actual positive instances. For example, the proportion of actual fraud cases successfully predicted by the model.

F1 Score

The harmonic mean of precision and recall, measuring the balance between the two metrics.
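A short sketch of computing these metrics with scikit-learn, using the fraud example; the labels and predictions are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = fraud, 0 = legitimate transaction
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision = TP / (TP + FP): of the transactions flagged as fraud, how many really are
print(precision_score(y_true, y_pred))  # 2 / 3 ≈ 0.67

# Recall = TP / (TP + FN): of the actual fraud cases, how many were caught
print(recall_score(y_true, y_pred))     # 2 / 4 = 0.50

# F1 = harmonic mean of precision and recall
print(f1_score(y_true, y_pred))         # ≈ 0.57
```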


4. Algorithm Adjustment

Use algorithms that can handle imbalanced data, or adjust model training weights to emphasize the importance of the minority class.

Class Weights

Assign higher weights to the minority class so that the model places more emphasis on it.
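Many scikit-learn estimators accept a class_weight parameter; a minimal sketch, where the explicit weights and the commented training data are illustrative.

```python
from sklearn.linear_model import LogisticRegression

# "balanced" weights each class inversely proportional to its frequency,
# so mistakes on the rare class are penalized more heavily
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Explicit weights also work, e.g. treat each minority sample as 19x as important
# model = LogisticRegression(class_weight={0: 1, 1: 19})

# model.fit(X_train, y_train)  # hypothetical training data
```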

Ensemble Methods

Ensemble methods involve combining multiple models to form a stronger model. Even if individual models give slightly different predictions, combining these predictions can yield more accurate and reliable results.
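A minimal sketch of a voting ensemble in scikit-learn; the choice of base models is illustrative, and class_weight shows how an ensemble can be combined with the weighting idea above.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Three different models vote on each prediction ("hard" majority voting)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ("tree", DecisionTreeClassifier(class_weight="balanced")),
        ("rf", RandomForestClassifier(class_weight="balanced", n_estimators=100)),
    ],
    voting="hard",
)

# ensemble.fit(X_train, y_train)      # hypothetical training data
# y_pred = ensemble.predict(X_test)
```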

Mission

Oversampling is a method of balancing the training dataset by duplicating or generating data for the minority class.

True
False
