Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Gender classification algorithms have important applications in many domains.
- Research has shown that algorithms trained on biased benchmark databases can result in algorithmic bias.
- Little research has been done on gender classification algorithms’ bias towards gender minorities.
- Surveys were conducted on existing benchmark databases for facial recognition and gender classification tasks.
- Current benchmark databases lack representation of gender minority subgroups.
- A baseline model was trained on an augmented benchmark database to increase classification accuracy and mitigate algorithmic biases.
- An ensemble model achieved an overall accuracy score of 90.39%.
Paper Content
Introduction 1.gender classi cation algorithm and its applications
- Automated gender classification systems recognize gender based on facial characteristics
- Applications of gender classification algorithms include video surveillance, law enforcement, demographic research, online advertising, and human-computer interaction
- Gender information is a type of biometrics data
- Gender identification can be used as a pre-processing step to reduce search time in a large-scale database
- Gender classification is used in social media design as part of digital monetization efforts
- Gender is used in HCI to make software systems appear more human-like
Algorithms bias in gender classi cation
- Ethics in machine learning algorithms has been gaining attention
- Dark-skinned females are misclassified at a high rate
- African-Americans are more likely to be subjected to law enforcement searches
- Facial recognition systems used by law enforcement have lower accuracy rates for people labeled female and black
- Machine learning models can amplify existing gender bias
Unrepresentative benchmark databases
- Biased databases can lead to accuracy disparities in prediction tasks.
- Many databases are built using facial detection algorithms, which can introduce bias.
- Color FERET and LFW databases have limited numbers of dark-skinned individuals and are mostly male and white.
- It is unclear how facial recognition systems perform on under-represented subgroups.
Benchmark databases in the fairness domain
- Fairface is a facial image dataset with 108K images to reduce racial bias
- Fairface has a balanced representation of 7 race groups
- Models trained on Fairface dataset reported higher accuracy rates
- PPB is a database to achieve better intersectional representation of gender and skin type
- PPB has 1270 unique identities and a balanced composition of light-skinned and dark-skinned individuals
- DiF is a dataset with one million facial images annotated with various statistical analysis measures
E lgbtq population and eir distinct characteristics in gender expression
- 2017 study used deep neural networks to detect white males’ sexuality
- Study implied LGBTQ population have distinct characteristics from heterosexual population
- 2016 study estimated proportion of Americans who identify as transgender to be between 0.5-0.6%
- Transgender facial recognition is challenging due to gender transformation
- Lack of representation of LGBTQ population in benchmark databases leads to false sense of progress on gender classification and facial recognition tasks
Limitations of binary gender classi cation
- Gender is often assumed to be binary and based on physiology.
- Misgendering is using language to describe someone that does not align with their gender identity.
- Misgendering can have negative consequences on mental health.
- Being misgendered by AGR systems is worse than being misgendered by humans.
Methods
- Current benchmark databases lack representation of dark-skin individuals and LGBTQ population
- Hypothesize that assembling a racially-balanced, LGBTQ and non-binary gender inclusive database will improve gender classification algorithms’ accuracy on minority faces
- Plan to make assembled non-binary database publicly available
Creation of inclusive benchmark database
- Created an inclusive benchmark database with images from different online platforms
- Contains 12,000 images of 168 unique identities, including 21 LGBTQ identities (9%) and 29 white male, 25 white female, 23 Asian male, 23 Asian female, 33 African male, and 35 African female identities
Creation of non-binary gender benchmark database
- Assembled a non-binary gender benchmark database
- 2000 images of 67 unique identities
E baseline model trained on a biased benchmark database
- Adience is a benchmark facial image database for machine learning research tasks on age and gender classification.
- Adience contains images of individuals of various appearances, postures, noise levels, and lighting conditions.
- Adience was largely collected through online photography sharing platforms such as Flickr.
- Adience is highly skewed in its racial representation due to systematic failures on facial detection of dark-skin individuals.
Labeling process (adience benchmark database
- Adience benchmark dataset already had gender labels.
- Skin-type labels were added by one author.
- Data augmentation increased percentage of dark-skin male and female.
Training baseline model. a er resizing each image to 227
- Split augmented Adience benchmark dataset into 23,004 training samples, 1,279 validation samples and 1,278 testing samples
- Baseline model trained on augmented Adience images using 6 convolutional layers, each followed by a rectified linear operation and pooling layer
- First four layers followed by batch normalization and dropouts
- First two convolutional layers contain 96 7x7 kernels, third and fourth contain 256 5x5 kernels, fifth and final contain 384 3x3 kernels
- Two fully-connected layers added, each containing 512 neurons
Baseline model accuracy evaluation.
- Baseline model achieved 94.37% accuracy on Adience benchmark test set
- Model performs well for binary population, except for children
Extension of binary gender classi er
- Baseline model accuracy dropped to 51.67% when tested on 3000 images from inclusive and non-binary databases.
- Baseline model failed to predict gender on non-binary people.
- To extend binary gender classifier, supplied 1500 images from inclusive database and 1019 images from non-binary database for training.
Oversampling.
- Increased percentage of male population from 29.89% to 36.51%
- Training set contains 2791 images
- Accuracy rates of training sets are 10% higher than validation set, implying overfitting
- Used ensemble techniques to make final prediction outcomes more robust
Logistic regression
- Constructed logistic regression model by assembling 3 models
- Achieved accuracy score of 96.03% on training set
- Achieved accuracy score of 90.39% on test set
Adaboost ensemble.
- Constructed an adaboost model using 3 models
- Horizontally stacked predicted class output to create a matrix
- Applied grid search to cross validate parameters
- Trained adaboost mega-model with accuracy score of 96.03%
- Applied trained model to test set with accuracy score of 90.24%
Evaluation metric
- Selection rate is used to measure algorithmic bias
- If selection rate is below 80%, it is an indication of disparate impact
- Transfer learning techniques can be used to remove disparate impact
- Adaboost ensemble model achieved a selection rate of 98.46%, a significant increase from baseline model
- Inclusive databases are important for accurate labeling and fairness in decision making
Results and discussion
- Increased accuracy of gender classification from baseline model
- Mitigated algorithmic bias with machine learning techniques
Limitations and future work 4.1 limitations
- Developed a non-binary gender classification system
- Noted that people of color and LGBTQ population are more likely to be misclassified
Future work: modeling gender as a continuum
- Gender is a complex socio-cultural construct and an internal identity
- Gender expressions vary in different social contexts
- Around 20% of millennials identify as non-binary
- Future work will model gender identity as a continuum with ‘femininity’ and ‘masculinity’ as two endpoints