Deep Imbalanced Regression – Complete Guide
Data imbalance is common in real-world applications. Class imbalance in classification is well studied, and there are many established remedies such as reweighting, biased sampling, and meta-learning. Non-uniformity and imbalance also occur in regression problems, but the issues they cause receive far less attention. Moreover, deep learning models are affected differently by imbalanced continuous targets (regression) than by imbalanced categorical targets (classification).
An ideally balanced classification problem has an equal number of examples for each class. Likewise, an ideally balanced regression problem has its target variable evenly distributed. In practice, however, target values are abundant in some ranges and scarce in others. To address this, Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, and Dina Katabi of the Massachusetts Institute of Technology, together with Hao Wang of Rutgers University, introduced Deep Imbalanced Regression (DIR) to perform regression tasks efficiently with deep learning models on imbalanced data.
How Does Deep Imbalanced Regression Work?
Deep Imbalanced Regression (DIR) learns continuous targets from imbalanced real-world datasets and feeds them to a deep learning model. DIR is performed in two ways:
- Label distribution smoothing (LDS)
- Feature distribution smoothing (FDS)
DIR applies a symmetric kernel that exploits the similarity between adjacent target values to smooth the distribution of the labels or the features.
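The label-smoothing idea can be sketched as convolving the empirical label histogram with a symmetric Gaussian kernel, so that nearby target values share statistical strength. A minimal NumPy version (the bin count, kernel size, and bandwidth below are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def lds_effective_density(labels, n_bins=100, kernel_size=5, sigma=2.0):
    """Estimate an 'effective' label density by convolving the
    empirical histogram with a symmetric Gaussian kernel."""
    hist, _ = np.histogram(labels, bins=n_bins)
    # Build a symmetric Gaussian kernel spanning kernel_size bins
    half = (kernel_size - 1) // 2
    x = np.arange(-half, half + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Convolution lets adjacent target bins inform each other
    return np.convolve(hist, kernel, mode="same")

# Toy example: ages clustered around 30, sparse elsewhere
ages = np.concatenate([np.random.normal(30, 5, 1000),
                       np.random.uniform(1, 100, 50)])
density = lds_effective_density(ages)
```

The smoothed density can then replace the raw bin counts wherever the training procedure needs a label-frequency estimate, e.g. for reweighting the loss.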
The feature distribution smoothing approach follows a simple, easy-to-integrate procedure. An encoder produces latent feature representations. For each target bin, the mean and variance of these representations are computed, along with the covariances between feature dimensions. An exponential moving average (EMA) tracks these feature statistics across training, and a symmetric kernel (k) smooths them across neighbouring target bins according to the imbalance found in the target.
The EMA ensures that the overall shape of the original feature distribution is retained. The calibrated feature representations are passed back to the neural network. Training proceeds much as usual, apart from an additional pass through the feature-smoothing layer.
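The statistics-tracking step above can be sketched as follows: keep an EMA of per-target-bin feature means, then smooth those statistics across adjacent bins with a kernel. This is a rough sketch under stated assumptions (bin count, momentum, and kernel parameters are invented for illustration; the paper also calibrates variances and covariances, which are handled analogously):

```python
import numpy as np

class RunningStats:
    """Exponential moving average (EMA) of per-bin feature means,
    updated once per epoch as in feature distribution smoothing."""
    def __init__(self, n_bins, feat_dim, momentum=0.9):
        self.momentum = momentum
        self.means = np.zeros((n_bins, feat_dim))

    def update(self, epoch_means):
        self.means = (self.momentum * self.means
                      + (1 - self.momentum) * epoch_means)

def smooth_bin_statistics(means, kernel_size=5, sigma=2.0):
    """Smooth per-target-bin statistics with a Gaussian kernel so
    statistics of adjacent target bins inform each other."""
    half = (kernel_size - 1) // 2
    x = np.arange(-half, half + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Convolve each feature dimension independently along the bin axis
    return np.stack([np.convolve(means[:, d], kernel, mode="same")
                     for d in range(means.shape[1])], axis=1)

# Toy usage: 10 target bins, 4-dimensional features
stats = RunningStats(n_bins=10, feat_dim=4)
stats.update(np.random.randn(10, 4))      # statistics from one epoch
smoothed_means = smooth_bin_statistics(stats.means)
```

In the full method, the smoothed statistics are used to re-standardize each sample's features before they continue through the network.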
Comparative Analysis of Deep Imbalanced Regression
Five well-known datasets are processed with DIR to organize benchmarks. The datasets come from computer vision, healthcare, and natural language processing, and all have highly imbalanced continuous target variables.
- IMDB-WIKI-DIR (organized from IMDB-WIKI dataset with age as target)
- AgeDB-DIR (organized from AgeDB dataset with age as target)
- STS-B-DIR (organized from semantic textual similarity test with text similarity score as target)
- NYUD2-DIR (organized from NYU Depth Dataset V2 with depth as target)
- SHHS-DIR (organized from SHHS dataset with health score as target)
Each of the above datasets is divided into training, validation, and test sets, ready to use with an appropriate deep learning architecture. The Deep Imbalanced Regression benchmarks are built on these datasets. A ResNet-50 baseline is used for the IMDB-WIKI-DIR and AgeDB-DIR datasets. A BiLSTM with GloVe word embeddings is the baseline for the STS-B-DIR dataset. A ResNet-50-based encoder-decoder architecture models the NYUD2-DIR dataset, and a CNN-RNN architecture with a ResNet block models the SHHS-DIR dataset. With DIR, these models outperform their vanilla counterparts on the imbalanced targets.
Deep imbalanced regression on the IMDB-WIKI dataset
The requirements are PyTorch 1.6, tensorboard_logger, NumPy, pandas, SciPy, tqdm, matplotlib, PIL, and wget. Training also requires CUDA-compatible GPUs; the commands below assume four GPUs.
Install wget and tensorboard_logger using the following command.
!pip install wget tensorboard_logger
Download the source files that lead to end-to-end dataset preparation, model building, training, and evaluation.
# Download source code
!git clone https://github.com/YyzHarry/imbalanced-regression.git
Check the contents of the source file.
!ls -p imbalanced-regression/
Change the current directory to the imdb-wiki-dir directory to continue.
Download and preprocess the original IMDB-WIKI data.
Train the vanilla architecture for deep imbalanced regression without reweighting using the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ --reweight none
Inverse reweighting can be included during training by running the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ --reweight inverse
Alternatively, inverse square root reweighting can be included during training by running the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ --reweight sqrt_inv
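Conceptually, the `--reweight` options weight each sample's loss by the inverse (or inverse square root) of its label-bin frequency. A hedged sketch of that idea (the function name, bin count, and normalization are illustrative choices, not the repository's exact implementation):

```python
import numpy as np

def reweight(labels, scheme="inverse", n_bins=100):
    """Per-sample loss weights from label-bin frequency.
    scheme: 'none', 'inverse', or 'sqrt_inv' (mirroring the CLI flags)."""
    hist, edges = np.histogram(labels, bins=n_bins)
    # Map each label back to its histogram bin
    bin_idx = np.clip(np.digitize(labels, edges[1:-1]), 0, n_bins - 1)
    freq = hist[bin_idx].astype(float)
    if scheme == "none":
        weights = np.ones_like(freq)
    elif scheme == "inverse":
        weights = 1.0 / freq
    elif scheme == "sqrt_inv":
        weights = 1.0 / np.sqrt(freq)
    else:
        raise ValueError(scheme)
    # Normalize so the average weight is 1 (keeps the loss scale comparable)
    return weights * len(weights) / weights.sum()

# 90 samples at age 30, 10 samples spread over the full range
labels = np.concatenate([np.full(90, 30.0), np.linspace(1, 100, 10)])
w_inv = reweight(labels, "inverse")
w_sqrt = reweight(labels, "sqrt_inv")
```

Rare labels receive larger weights than frequent ones, and `sqrt_inv` is a gentler version of `inverse`, which is why it is often the more stable default.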
Enable full deep imbalanced regression training with label distribution smoothing (LDS) and feature distribution smoothing (FDS) using the following command.
%%bash
python train.py --reweight sqrt_inv --lds --lds_kernel gaussian --lds_ks 5 --lds_sigma 2 --fds --fds_kernel gaussian --fds_ks 5 --fds_sigma 2
Users can also opt for pre-trained models; the official pre-trained checkpoints are provided by the authors. Guidelines for preparing and training each benchmark dataset are available in the corresponding directories of the official repository.
It is observed that label distribution smoothing and feature distribution smoothing give the best results when applied together.
This article discussed the newly introduced Deep Imbalanced Regression (DIR), which handles datasets with highly imbalanced continuous target variables. We discussed the label distribution smoothing and feature distribution smoothing approaches and the concepts behind them. We discussed the benchmark datasets and the architectures that use the two smoothing approaches. Finally, we explored the PyTorch implementation for dataset preparation and end-to-end training on the IMDB-WIKI-DIR dataset.