The purpose of this blog is to explain techniques for handling class-imbalanced data in machine learning. Imbalanced classes put the "accuracy" of an ML model at risk: a model can score high accuracy simply by predicting the majority class every time, while missing most of the minority-class examples. This is a surprisingly common problem, occurring in datasets with a disproportionate ratio of observations in each class. Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise.
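To make the risk to accuracy concrete, here is a minimal sketch. The 950-to-50 class split and the function name `evaluate_majority_baseline` are assumptions made up for illustration. The sketch scores a "model" that always predicts the most frequent class.

```python
from collections import Counter

def evaluate_majority_baseline(labels, positive=1):
    """Score a dummy model that always predicts the most frequent class."""
    majority_class, _ = Counter(labels).most_common(1)[0]
    predictions = [majority_class] * len(labels)
    # Accuracy: fraction of predictions that match the true label.
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    # Recall on the positive (minority) class: found positives / all positives.
    positives = sum(y == positive for y in labels)
    found = sum(p == positive and y == positive
                for p, y in zip(predictions, labels))
    recall = found / positives if positives else 0.0
    return accuracy, recall

# Hypothetical imbalanced data set: 950 negatives (0) and 50 positives (1).
labels = [0] * 950 + [1] * 50
accuracy, recall = evaluate_majority_baseline(labels)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

The dummy model looks strong on accuracy yet never finds a single minority example, which is why accuracy alone can mislead on imbalanced data.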
The main objective of balancing classes is to either increase the frequency of the minority class or decrease the frequency of the majority class, so that both classes end up with approximately the same number of instances.
The prominent resampling techniques are as follows:
Random Under-Sampling: Random under-sampling aims to balance the class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
Advantages: It can improve run time and ease storage problems by reducing the number of training samples when the training data set is huge.
Disadvantages: It can discard potentially useful information that could be important for building rule classifiers. The sample chosen by random under-sampling may also be biased, and therefore not an accurate representation of the population, which can lead to inaccurate results on the actual test data set.
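Random under-sampling can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function name `random_under_sample`, the `seed` parameter, and the toy 90-to-10 data set are all assumptions.

```python
import random
from collections import Counter

def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        kept = rng.sample(group, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: 90 majority rows (class 0), 10 minority rows (class 1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10
under_rows, under_labels = random_under_sample(rows, labels)
print(Counter(under_labels))  # each class now has 10 rows
```

Note that 80 of the 90 majority rows are thrown away here, which is exactly the information loss described under the disadvantages.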
Random Over-Sampling: Random over-sampling increases the number of instances in the minority class by randomly replicating them, giving the minority class a higher representation in the sample.
Advantages: Unlike under-sampling, this method leads to no information loss. It often outperforms under-sampling.
Disadvantages: It increases the likelihood of over-fitting, since it replicates the minority class events.
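A comparable sketch for random over-sampling follows. Again, the function name `random_over_sample`, the `seed` parameter, and the toy data are assumptions chosen for illustration.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the class itself.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: 90 majority rows (class 0), 10 minority rows (class 1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10
over_rows, over_labels = random_over_sample(rows, labels)
print(Counter(over_labels))  # each class now has 90 rows
```

All original rows survive, but the 80 added minority rows are exact copies of the 10 originals, which is the source of the over-fitting risk.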
Synthetic Minority Over-sampling Technique (SMOTE) for imbalanced data: This technique is used to avoid the over-fitting that occurs when exact replicas of minority instances are added to the main data set. A subset of data is taken from the minority class as an example, and new synthetic instances similar to it are created. These synthetic instances are then added to the original data set, and the new data set is used as a sample to train the classification models.
Advantages: It mitigates the over-fitting caused by random over-sampling, because synthetic examples are generated rather than replicated. There is also no loss of useful information.
Disadvantages: While generating synthetic examples, SMOTE does not take into consideration neighboring examples from other classes. This can increase the overlap between classes and introduce additional noise.
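The idea of creating synthetic instances can be sketched as interpolation between a minority point and one of its nearest minority neighbours. This is a simplified illustration under stated assumptions, not the reference SMOTE algorithm: base points are picked at random, distances are Euclidean, and the function name `smote_sketch`, its parameters, and the toy points are hypothetical.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: squared_distance(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point somewhere on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority_points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [1.5, 1.5]]
new_points = smote_sketch(minority_points, n_synthetic=10, k=3)
print(len(new_points))  # 10 synthetic points
```

Only minority points are consulted when choosing neighbours, so a synthetic point can land in a region occupied by the majority class. That is the class-overlap drawback noted above.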