What is Imbalanced Data?
Imbalanced data occurs when some labels appear far more often than others in a dataset. Each distinct label value is called a class, and a severe imbalance between classes can significantly degrade model performance.
For instance, imagine creating an email spam filter using AI.
Suppose the training data consists of 10,000 emails, out of which 9,500 are legitimate emails and only 500 are spam.
If you train the model on this data as it is, it will tend to predict that almost every email is legitimate. Because the vast majority of training examples are legitimate, the model never adequately learns the minority spam class.
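To see why this is a problem, here is a minimal sketch using the hypothetical 9,500/500 split above: a trivial "model" that always predicts the majority class still scores 95% accuracy while catching zero spam.

```python
# Hypothetical dataset from the example: 9,500 legitimate, 500 spam.
labels = ["legit"] * 9500 + ["spam"] * 500

# A majority-class baseline: predict "legit" for every email.
predictions = ["legit"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"Accuracy: {accuracy:.0%}")  # 95%, yet not a single spam email is caught
```

High accuracy here is misleading, which is also why Section 3 below recommends other evaluation metrics.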
Solutions to Imbalanced Data
1. Data Resampling
Undersampling
This involves reducing the amount of data from the majority class to balance the training dataset. However, there is a risk of losing important information.
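A minimal sketch of random undersampling, again assuming the hypothetical 9,500/500 split: majority-class samples are randomly discarded until both classes are the same size.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

majority = [("legit", i) for i in range(9500)]  # (label, sample id)
minority = [("spam", i) for i in range(500)]

# Keep only as many majority samples as there are minority samples.
balanced = random.sample(majority, len(minority)) + minority
print(len(balanced))  # 1000 samples, 500 per class
```

Note that 9,000 legitimate emails are thrown away here, which is exactly the information-loss risk mentioned above.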
Oversampling
This involves duplicating or generating more data for the minority class to balance the training dataset.
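A minimal sketch of random oversampling on the same hypothetical split: minority samples are drawn with replacement (i.e., duplicated) until the classes match. More sophisticated variants, such as SMOTE, synthesize new minority samples instead of copying them.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

majority = ["legit"] * 9500
minority = ["spam"] * 500

# Sample the minority class with replacement until it matches the majority.
oversampled_minority = random.choices(minority, k=len(majority))
balanced = majority + oversampled_minority
print(len(balanced))  # 19,000 samples, 9,500 per class
```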
2. Data Augmentation
Enhance diversity by generating new data for the minority class. For instance, with image data, you can create new data through rotation, scaling, and cropping.
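As a minimal illustration of the image transformations mentioned above, the sketch below treats an image as a 2-D grid of pixel values and generates two new variants (a 90-degree rotation and a horizontal flip) from one original; real pipelines typically use a library such as torchvision or Pillow instead.

```python
def rotate90(img):
    # Rotate a 2-D pixel grid 90 degrees clockwise.
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    # Mirror each row left-to-right.
    return [row[::-1] for row in img]

image = [[1, 2],
         [3, 4]]          # a tiny 2x2 "image" for illustration
augmented = [rotate90(image), hflip(image)]
print(augmented)  # two new samples derived from one original
```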
3. Use Appropriate Evaluation Metrics
In scenarios of data imbalance, using precision, recall, and F1 score as evaluation metrics is more appropriate than simple accuracy.
Precision
The ratio of true positive predictions among all positive predictions. For example, the proportion of transactions predicted as fraud that are actually fraudulent.
Recall
The ratio of true positive predictions to all actual positive instances. For example, the proportion of actual fraud cases successfully predicted by the model.
F1 Score
The harmonic mean of precision and recall, measuring the balance between the two metrics.
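The three metrics above can be computed directly from confusion-matrix counts. The numbers below (40 true positives, 10 false positives, 60 false negatives) are hypothetical fraud-detection counts chosen only for illustration.

```python
tp, fp, fn = 40, 10, 60  # hypothetical confusion-matrix counts

precision = tp / (tp + fp)                        # 40 / 50  = 0.80
recall = tp / (tp + fn)                           # 40 / 100 = 0.40
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note how the harmonic mean pulls the F1 score (about 0.53) toward the weaker of the two metrics, penalizing a model that trades recall away for precision.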
4. Algorithm Adjustment
Use algorithms that can handle imbalanced data, or adjust model training weights to emphasize the importance of the minority class.
Class Weights
Assign higher weights to the minority class so that the model places more emphasis on it.
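One common weighting heuristic is n_samples / (n_classes * class_count), the same formula scikit-learn uses for class_weight="balanced". A minimal sketch on the hypothetical 9,500/500 split:

```python
from collections import Counter

labels = ["legit"] * 9500 + ["spam"] * 500
counts = Counter(labels)
n = len(labels)

# Inverse-frequency weights: rare classes receive larger weights.
weights = {cls: n / (len(counts) * c) for cls, c in counts.items()}
print(weights)  # spam gets roughly 19x the weight of legit
```

During training, each sample's loss is multiplied by its class weight, so misclassifying a spam email costs the model far more than misclassifying a legitimate one.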
Ensemble Methods
Ensemble methods involve combining multiple models to form a stronger model. Even if individual models give slightly different predictions, combining these predictions can yield more accurate and reliable results.
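The simplest way to combine classifiers is majority voting: each model predicts a label for every sample, and the most common prediction wins. A minimal sketch with three hypothetical models:

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one list of labels per model, aligned by sample index.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical per-sample predictions from three different models.
model_a = ["spam", "legit", "spam"]
model_b = ["spam", "spam", "legit"]
model_c = ["legit", "spam", "legit"]

print(majority_vote([model_a, model_b, model_c]))
```

Even though each individual model disagrees on some samples, the combined vote smooths out their individual mistakes.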
