Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Gender classification algorithms have important applications in many domains.
Research has shown that algorithms trained on biased benchmark databases can result in algorithmic bias.
Little research has been done on gender classification algorithms’ bias towards gender minorities.
Surveys were conducted on existing benchmark databases for facial recognition and gender classification tasks.
Current benchmark databases lack representation of gender minority subgroups.
A baseline model was trained on an augmented benchmark database to increase classification accuracy and mitigate algorithmic biases.
An ensemble model achieved an overall accuracy score of 90.39%.

Paper Content

Introduction: gender classification algorithm and its applications

Automated gender classification systems recognize gender based on facial characteristics
Applications of gender classification algorithms include video surveillance, law enforcement, demographic research, online advertising, and human-computer interaction
Gender information is a type of biometric data
Gender identification can be used as a pre-processing step to reduce search time in a large-scale database
Gender classification is used in social media design as part of digital monetization efforts
Gender is used in HCI to make software systems appear more human-like

Algorithmic bias in gender classification

Ethics in machine learning algorithms has been gaining attention
Dark-skinned females are misclassified at a high rate
African-Americans are more likely to be subjected to law enforcement searches
Facial recognition systems used by law enforcement have lower accuracy rates for people labeled female and black
Machine learning models can amplify existing gender bias

Unrepresentative benchmark databases

Biased databases can lead to accuracy disparities in prediction tasks.
Many databases are built using facial detection algorithms, which can introduce bias.
Color FERET and LFW databases have limited numbers of dark-skinned individuals and are mostly male and white.
It is unclear how facial recognition systems perform on under-represented subgroups.

Benchmark databases in the fairness domain

Fairface is a facial image dataset with 108K images to reduce racial bias
Fairface has a balanced representation of 7 race groups
Models trained on Fairface dataset reported higher accuracy rates
PPB is a database to achieve better intersectional representation of gender and skin type
PPB has 1270 unique identities and a balanced composition of light-skinned and dark-skinned individuals
DiF is a dataset with one million facial images annotated with various statistical analysis measures

The LGBTQ population and their distinct characteristics in gender expression

A 2017 study used deep neural networks to detect white males’ sexual orientation
The study implied that the LGBTQ population has characteristics distinct from those of the heterosexual population
A 2016 study estimated the proportion of Americans who identify as transgender to be between 0.5% and 0.6%
Transgender facial recognition is challenging because facial appearance changes over the course of gender transition
The lack of representation of the LGBTQ population in benchmark databases leads to a false sense of progress on gender classification and facial recognition tasks

Limitations of binary gender classification

Gender is often assumed to be binary and based on physiology.
Misgendering is using language to describe someone that does not align with their gender identity.
Misgendering can have negative consequences on mental health.
Being misgendered by automatic gender recognition (AGR) systems is reported to be worse than being misgendered by humans.

Methods

Current benchmark databases lack representation of dark-skinned individuals and the LGBTQ population
The authors hypothesize that assembling a racially balanced, LGBTQ-inclusive and non-binary-inclusive database will improve gender classification algorithms’ accuracy on minority faces
The authors plan to make the assembled non-binary database publicly available

Creation of inclusive benchmark database

The authors created an inclusive benchmark database with images from different online platforms
The database contains 12,000 images of 168 unique identities (29 white male, 25 white female, 23 Asian male, 23 Asian female, 33 African male, and 35 African female), including 21 LGBTQ identities (9%)

Creation of non-binary gender benchmark database

The authors assembled a non-binary gender benchmark database
It contains 2,000 images of 67 unique identities

The baseline model trained on a biased benchmark database

Adience is a benchmark facial image database for machine learning research tasks on age and gender classification.
Adience contains images of individuals of various appearances, postures, noise levels, and lighting conditions.
Adience was largely collected through online photography sharing platforms such as Flickr.
Adience is highly skewed in its racial representation due to systematic failures of facial detection on dark-skinned individuals.

Labeling process (Adience benchmark database)

Adience benchmark dataset already had gender labels.
Skin-type labels were added by one author.
Data augmentation increased the percentage of dark-skinned male and female subjects.

Training baseline model. After resizing each image to 227

The augmented Adience benchmark dataset was split into 23,004 training samples, 1,279 validation samples and 1,278 testing samples
The baseline model was trained on augmented Adience images using 6 convolutional layers, each followed by a rectified linear operation and a pooling layer
The first four layers are followed by batch normalization and dropout
The first two convolutional layers contain 96 7x7 kernels, the third and fourth contain 256 5x5 kernels, and the fifth and sixth contain 384 3x3 kernels
Two fully-connected layers were added, each containing 512 neurons
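To get a feel for the size of the convolutional stack described above, the following minimal sketch lists the six layers as plain data and counts their trainable parameters. It assumes 3 input channels (RGB) and one bias per kernel, neither of which is stated in the summary; `ConvSpec` and `conv_parameter_counts` are hypothetical names used only for illustration. The fully-connected layers are left out because the size of the flattened feature map depends on strides and padding that the summary does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    """One convolutional layer: number of kernels and square kernel size."""
    out_channels: int
    kernel_size: int


# Kernel counts and sizes as listed in the summary.
CONV_LAYERS = [
    ConvSpec(96, 7),
    ConvSpec(96, 7),
    ConvSpec(256, 5),
    ConvSpec(256, 5),
    ConvSpec(384, 3),
    ConvSpec(384, 3),
]


def conv_parameter_counts(layers, in_channels=3):
    """Return weights + biases per layer. in_channels=3 is an assumption (RGB)."""
    counts = []
    for layer in layers:
        weights = layer.out_channels * in_channels * layer.kernel_size ** 2
        counts.append(weights + layer.out_channels)  # one bias per kernel (assumed)
        in_channels = layer.out_channels  # the next layer sees this layer's output
    return counts


counts = conv_parameter_counts(CONV_LAYERS)
for index, count in enumerate(counts, start=1):
    print(f"conv{index}: {count:,} parameters")
print(f"total: {sum(counts):,}")
```

Under these assumptions most of the parameters sit in the fourth and sixth layers, where both the input and output channel counts are large.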

Baseline model accuracy evaluation

Baseline model achieved 94.37% accuracy on Adience benchmark test set
Model performs well for binary population, except for children

Extension of binary gender classifier

Baseline model accuracy dropped to 51.67% when tested on 3,000 images from the inclusive and non-binary databases.
Baseline model failed to predict gender on non-binary people.
To extend the binary gender classifier, the authors supplied 1,500 images from the inclusive database and 1,019 images from the non-binary database for training.

Oversampling

Oversampling increased the percentage of the male population from 29.89% to 36.51%
The resulting training set contains 2,791 images
Training-set accuracy was 10% higher than validation-set accuracy, implying overfitting
Ensemble techniques were used to make the final predictions more robust
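The summary does not say which oversampling method was used. As one plausible illustration, the sketch below performs random oversampling: it duplicates randomly chosen minority-class samples until that class reaches a target share of the data. The function name, the string labels, and the toy data are all assumptions made for this example.

```python
import math
import random


def oversample_minority(samples, labels, minority_label, target_fraction, seed=0):
    """Duplicate random minority samples until they make up target_fraction."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")

    n_total, n_minority = len(samples), len(minority)
    # Solve (n_minority + k) / (n_total + k) >= target_fraction for k.
    needed = (target_fraction * n_total - n_minority) / (1 - target_fraction)
    n_extra = max(0, math.ceil(needed))

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(samples) + extra, list(labels) + [minority_label] * n_extra


# Toy data: 3 "male" and 7 "female" file names (hypothetical).
samples = ["m1", "m2", "m3", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]
labels = ["male"] * 3 + ["female"] * 7

new_samples, new_labels = oversample_minority(samples, labels, "male", 0.5)
print(len(new_samples), new_labels.count("male") / len(new_labels))
```

Because duplicated images add no new information, random oversampling can itself encourage overfitting, which is consistent with the gap between training and validation accuracy noted above.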

Logistic regression

Constructed a logistic regression ensemble by combining 3 models
Achieved accuracy score of 96.03% on training set
Achieved accuracy score of 90.39% on test set
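One common way to combine several models with logistic regression is stacking: the base models' outputs become the input features of a small logistic regression meta-learner. The summary does not describe how the three models were combined, so the sketch below is an assumption throughout. It uses predicted class labels (0 or 1) as features, trains with plain batch gradient descent, and runs on toy data; all function names are hypothetical.

```python
import math


def stack_predictions(*model_outputs):
    """One row per sample, one column per base model."""
    return [list(row) for row in zip(*model_outputs)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.5, epochs=5000):
    """Batch gradient descent on the mean logistic loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, target in zip(X, y):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(X, weights, bias):
    return [1 if sigmoid(bias + sum(w * x for w, x in zip(weights, row))) >= 0.5 else 0
            for row in X]


# Toy outputs of three hypothetical base models on eight samples.
model_a = [0, 0, 0, 0, 1, 1, 1, 1]
model_b = [0, 0, 1, 1, 0, 0, 1, 1]
model_c = [0, 1, 0, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 1, 0, 1, 1, 1]  # here the truth equals the majority vote

stacked = stack_predictions(model_a, model_b, model_c)
weights, bias = fit_logistic_regression(stacked, true_labels)
ensemble_predictions = predict(stacked, weights, bias)
print(ensemble_predictions)
```

In practice the meta-learner would be fitted on held-out predictions rather than on the data the base models were trained on, so that it does not simply learn to trust an overfitted base model.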

AdaBoost ensemble

Constructed an AdaBoost model using 3 models
Horizontally stacked the predicted class outputs to create a matrix
Applied grid search to cross-validate parameters
Trained the AdaBoost mega-model, reaching an accuracy score of 96.03%
Applied the trained model to the test set, reaching an accuracy score of 90.24%
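The steps above can be sketched in a few lines of plain Python: stack the three models' predicted classes into a matrix, then boost over that matrix. The weak learner used here (a stump that trusts or inverts a single column) is an assumption, since the summary does not name the base estimator, and the grid search over parameters is omitted for brevity. All names and the toy data are hypothetical.

```python
import math


def to_sign(value):
    """Map class labels {0, 1} to {-1, +1}, the convention AdaBoost uses."""
    return 1 if value == 1 else -1


def fit_adaboost(X, y, n_rounds=3):
    """Discrete AdaBoost with single-column decision stumps."""
    signs = [to_sign(label) for label in y]
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for column in range(len(X[0])):
            for polarity in (1, -1):  # trust the column, or invert it
                votes = [polarity * to_sign(row[column]) for row in X]
                error = sum(w for w, v, s in zip(weights, votes, signs) if v != s)
                if best is None or error < best[0]:
                    best = (error, column, polarity, votes)
        error, column, polarity, votes = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, column, polarity))
        # Raise the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * s * v) for w, s, v in zip(weights, signs, votes)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, X):
    scores = [sum(a * p * to_sign(row[c]) for a, c, p in ensemble) for row in X]
    return [1 if score >= 0 else 0 for score in scores]


# Toy predicted classes from three hypothetical base models on eight samples.
base_a = [0, 0, 0, 0, 1, 1, 1, 1]
base_b = [0, 0, 1, 1, 0, 0, 1, 1]
base_c = [0, 1, 0, 1, 0, 1, 0, 1]
targets = [0, 0, 0, 1, 0, 1, 1, 1]

prediction_matrix = [list(row) for row in zip(base_a, base_b, base_c)]  # horizontal stack
boosted = fit_adaboost(prediction_matrix, targets)
boosted_predictions = predict_adaboost(boosted, prediction_matrix)
print(boosted_predictions)
```

Each round picks the column that is most accurate under the current sample weights, so the final vote weights reflect how much each base model adds beyond the ones already chosen.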

Evaluation metric

Selection rate is used to measure algorithmic bias
If selection rate is below 80%, it is an indication of disparate impact
Transfer learning techniques can be used to remove disparate impact
The AdaBoost ensemble model achieved a selection rate of 98.46%, a significant increase from the baseline model
Inclusive databases are important for accurate labeling and fairness in decision making
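The 80% threshold above matches the widely used four-fifths rule, under which the lowest group selection rate is compared with the highest. The summary does not spell out how the selection rate was computed, so the sketch below shows one common formulation as an assumption: the selection rate of a group is the share of its members that receive the positive prediction, and the ratio of the lowest to the highest rate is checked against 0.8. Function names, group labels, and data are hypothetical.

```python
def selection_rate(predictions, positive_label=1):
    """Share of predictions equal to the positive label."""
    return sum(1 for p in predictions if p == positive_label) / len(predictions)


def disparate_impact_ratio(predictions, groups, positive_label=1):
    """Return (lowest rate / highest rate, per-group selection rates)."""
    by_group = {}
    for prediction, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(prediction)
    rates = {g: selection_rate(p, positive_label) for g, p in by_group.items()}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group received the positive label")
    return min(rates.values()) / highest, rates


# Toy predictions for two hypothetical groups of four people each.
toy_predictions = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

ratio, rates = disparate_impact_ratio(toy_predictions, toy_groups)
print(rates, round(ratio, 4))
print("disparate impact" if ratio < 0.8 else "no disparate impact by the 80% rule")
```

A ratio close to 1 means the groups are selected at nearly the same rate, which is how a value such as 98.46% would be read under this formulation.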

Results and discussion

Increased accuracy of gender classification from baseline model
Mitigated algorithmic bias with machine learning techniques

Limitations and future work: limitations

Developed a non-binary gender classification system
Noted that people of color and LGBTQ population are more likely to be misclassified

Future work: modeling gender as a continuum

Gender is a complex socio-cultural construct and an internal identity
Gender expressions vary in different social contexts
Around 20% of millennials identify as non-binary
Future work will model gender identity as a continuum with ‘femininity’ and ‘masculinity’ as two endpoints
