Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Natural Language Processing (NLP) has been used for Automated Essay Scoring (AES) in the English language.
AES in Hindi and other low-resource languages has not been explored.
This study reproduces and compares state-of-the-art methods for AES in the Hindi domain.
Classical feature-based Machine Learning (ML) and advanced end-to-end models, including LSTM Networks and Fine-Tuned Transformer Architecture, are employed.
Results are comparable to those in the English language domain.
Hindi, being a low-resource language, lacks a dedicated essay-scoring corpus.
Models are trained and evaluated using translated English essays and empirically measured on a small-scale, real-world Hindi corpus.
An in-depth analysis is conducted to discuss prompt-specific behavior of different language models implemented.

Paper Content

Introduction

Academic assessments have used short-text response and essay writing tasks to judge language learning abilities.
Automated essay scoring (AES) uses NLP and ML to evaluate essays.
AES is used in standardized tests and to reduce workload.
AES scores are either holistic or trait-based.
Most research has focused on the English language domain, with few studies focusing on other languages.
Research in AES for Indic languages is negligible.
Hindi is the most spoken language in India.
English data translated to Hindi is used to train language models.
Dataset, rubric and source code are publicly available.

Related work

Research for AES in English has spanned decades
Many studies have treated AES as a regression or text classification problem
Classical machine learning techniques like linear regression, support vector regression, and sequential minimal optimization are typically used for regression-based AES
Logistic regression and Bayesian network classifiers are used for classification-based AES
For ranking, SVM ranking and LambdaMART have been used
These approaches use custom linguistic features such as errors in grammar, readability features, length and syntactic features
Progress in Deep Neural Networks, like Convolutional Neural Networks, Recurrent Neural Networks, and Long Short Term Memory networks, has allowed for better performance in AES systems
Pre-trained language models, built using transformer architecture, such as GPT, BERT, and XLNet have pushed forward the limits of language understanding and generation
Attempts have been made in the past to develop AES for languages other than English
The lack of dedicated essay corpora is a common feature among most of these studies
The ASAP corpus has benchmarked a variety of state-of-the-art English language models
Due to the ASAP corpus’ ubiquity in AES research, we decided to focus our analysis on a Hindi-translated version of the ASAP dataset
Translation challenges the validity and quality of the corpus itself
We verified random subsets from the translated corpus with the help of both skilled bilingual speakers and expert academics
We built our own Hindi language corpus consisting of 126 real-world essays
The scoring for the responses was in accordance with a comprehensive rubric
An expert panel of three Hindi academics performed the scoring
Scorers have high inter-rater reliability
We average the three scores to obtain the final score

Feature extraction

Essay Length
Average Sentence Length
Average Word Length
Semantic Overlap and Coherence
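
The surface features listed above can be sketched roughly as follows. This is a minimal illustration rather than the paper's implementation: the function name `surface_features`, splitting sentences on the danda (।) and on Latin terminators, whitespace tokenization, and approximating overlap as the mean Jaccard similarity between adjacent sentences are all assumptions made for the example.

```python
import re

def surface_features(essay: str) -> dict:
    """Compute simple length and overlap features for one essay (illustrative)."""
    # Assumption: a sentence ends at the Hindi danda or at one of . ! ?
    sentences = [s.strip() for s in re.split(r"[।.!?]+", essay) if s.strip()]
    tokenized = [s.split() for s in sentences]
    words = [w for sent in tokenized for w in sent]

    # Jaccard overlap of the word sets of each pair of adjacent sentences
    overlaps = []
    for prev, cur in zip(tokenized, tokenized[1:]):
        a, b = set(prev), set(cur)
        overlaps.append(len(a & b) / len(a | b))

    return {
        "essay_length": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # len(w) counts Unicode code points, so Devanagari vowel signs count too
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "adjacent_overlap": sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }

features = surface_features("राम घर गया। राम बाजार गया।")
```

A feature-based model would take a vector of such values per essay as input. A coherence measure based on embeddings would need a different similarity function than the word-set overlap used here.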

Neural approaches

Neural networks used for end-to-end machine learning
Four approaches for AES using neural methods
Initial padding and tokenization process
Word embeddings obtained using fastText
BiLSTM, CNN, CNN + LSTM + Attention Mechanism, and SKIPFLOW models used
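
The tokenization and padding step can be pictured with a small sketch. It assumes whitespace tokenization, a vocabulary built from the training essays, and two reserved indices (`PAD_ID` and `UNK_ID`); these names and choices are hypothetical and the fastText embedding lookup that would follow is not shown.

```python
PAD_ID, UNK_ID = 0, 1  # hypothetical reserved indices for padding and unknown words

def build_vocab(essays):
    """Assign an integer id (starting at 2) to every distinct token."""
    vocab = {}
    for essay in essays:
        for token in essay.split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab

def encode(essay, vocab, max_len):
    """Convert an essay to a fixed-length list of ids, truncating or padding."""
    ids = [vocab.get(tok, UNK_ID) for tok in essay.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

vocab = build_vocab(["a b", "b c"])
padded = encode("a c z", vocab, max_len=5)   # "z" is unseen, so it maps to UNK_ID
```

Fixed-length id sequences like `padded` are what an embedding layer followed by a BiLSTM or CNN would consume in batches.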

Fine-tuned transformers

Transformers use an attention mechanism to handle sequences of ordered data
Transformers use non-sequential processing, allowing them to take fewer time steps than LSTMs and RNNs
Transformers do not suffer from long-term dependency problems
Multilingual language models use BERT-based architecture and its variants
IndicBERT and MuRIL are two prominent multilingual language models
Five multilingual transformer language models are fine-tuned
Input is tokenized with a [CLS] token and [SEP] tokens
Hidden layer sequences are passed to a simple feed-forward network to obtain a numerical score
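
As a rough sketch of the last two points, the snippet below wraps a token list in `[CLS]` and `[SEP]` and applies a tiny scoring head to encoder outputs. It is only an illustration: the summary does not say which hidden vector the head reads or which activation it uses, so taking the vector at the `[CLS]` position and applying a single linear layer plus sigmoid are assumptions, as are the function names. In practice the tokenizer shipped with each pre-trained model would add the special tokens.

```python
import math

def add_special_tokens(tokens, max_len):
    """Wrap tokens as [CLS] ... [SEP], truncating so the total is at most max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    return ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]

def regression_head(hidden_states, weights, bias):
    """Map encoder outputs to a score in (0, 1).

    Assumption: the head reads the vector at position 0 ([CLS]) and applies
    one linear layer followed by a sigmoid.
    """
    cls_vector = hidden_states[0]
    logit = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

sequence = add_special_tokens(["a", "b", "c"], max_len=4)
score = regression_head([[0.0, 0.0], [1.0, 1.0]], weights=[0.5, -0.5], bias=0.0)
```

During fine-tuning, the head's weights and the encoder's parameters would be updated together against the normalized essay scores.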

Experiments

Described experimental procedure
Described evaluation metric
Presented empirical results
Compared results with published results on prominent English AES models

Experimental setup

Experiments conducted in accordance with Taghipour and Ng (2016)
Training prompts together is challenging due to different genres and scoring rubrics
Text pre-processed to filter out stopwords, named-entities and mentions
Custom feature scores normalized to improve model stability
Neural LSTM and CNN models trained for 100 epochs with a learning rate of 1e-4
AdamW optimizer used to fine-tune pre-trained multilingual transformer models
60/20/20 split for train, validation and test sets
Scores re-scaled to original prompt-specific scale for prediction
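
The split and the score re-scaling can be sketched as below. This is a minimal example under stated assumptions: the helper names, the fixed shuffle seed, and min-max normalization to [0, 1] during training are illustrative choices, and the score range 2 to 12 in the usage line is used only as an example of a prompt-specific scale.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle and split items into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def normalize(score, low, high):
    """Map a raw score from [low, high] to [0, 1] for training."""
    return (score - low) / (high - low)

def rescale(pred, low, high):
    """Map a model output back to the prompt's integer score scale."""
    clipped = min(max(pred, 0.0), 1.0)   # keep predictions inside the valid range
    return round(clipped * (high - low) + low)

train, val, test = split_60_20_20(range(10))
restored = rescale(normalize(8, 2, 12), 2, 12)
```

Re-scaling before evaluation matters because agreement metrics such as QWK are computed on the original integer ratings, not on the normalized values the model is trained on.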

Evaluation metric

QWK is a measure used to evaluate and compare AES methods
QWK score typically ranges from 0 (agreement no better than chance) to 1 (perfect agreement), and becomes negative if agreement is lower than expected by chance
QWK score is calculated using a weight matrix and a normalized expected count matrix
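
A minimal sketch of the calculation described above, using the standard definition of quadratic weighted kappa. The function and argument names are hypothetical, and the example does not guard against the degenerate case where the weighted expected disagreement is zero (for instance, when the scale has a single rating).

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1

    # Observed count matrix: rows index rater_a, columns index rater_b
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2        # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total    # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

Here `perfect` is 1.0 because the two raters never disagree, while `reversed_order` is negative because the ratings disagree more than chance would predict.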

Results and comparison

13 models implemented on ASAP-Hindi dataset
Among feature-based models, Linear Regression outperformed the others on prompt 1, and XGBoost on prompts 2 and 8
Neural models did not outperform other models, but had higher averages than feature-based models
SKIPFLOW gave highest average score amongst neural models
Fine-tuned multilingual transformers gave higher results across all prompts
Fine-tuned XLM-R model outperformed all other models on 4 of 8 ASAP-Hindi prompts
Fine-tuned Indic transformers did not perform comparably well
Fine-tuned mBERT model gave highest QWK score on organic prompt

Analysis and discussion

Results on prompts 4, 5, and 6 are higher than on other prompts
Models perform better on source-dependent prompts due to consistency in syntax, coherence, and source material
Generalization of ideas is more difficult on persuasive and expository prompts
Prompts with inconsistent rubrics give worse results
Large pre-trained language models outperform feature-based approaches
Indic language models perform poorly, possibly due to lack of trainable parameters and syntactic localization

Conclusion and future work

Implemented and analyzed various methods for AES in Hindi
Introduced single-prompt corpus of student-written essays in Hindi
Obtained competitive results compared to benchmark and state-of-the-art methods in English AES
Explained and analyzed results obtained using models
Future research directions possible
Plan to extend work by scaling corpus and proposing architectures
Most favorable results on fine-tuned pre-trained multilingual transformer language models
Plan to try different training optimization methods
Introducing linguistic knowledge may bring further improvement
Multilingual essay evaluation approach for more diversity in essay writing
