Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Natural Language Processing (NLP) has been used for Automated Essay Scoring (AES) in English language.
  • AES in Hindi and other low-resource languages has not been explored.
  • This study reproduces and compares state-of-the-art methods for AES in the Hindi domain.
  • Classical feature-based Machine Learning (ML) and advanced end-to-end models, including LSTM Networks and Fine-Tuned Transformer Architecture, are employed.
  • Results are comparable to those in the English language domain.
  • Hindi being a low-resource language, lacks a dedicated essay-scoring corpus.
  • Models are trained and evaluated using translated English essays and empirically measured on a small-scale, real-world Hindi corpus.
  • An in-depth analysis is conducted to discuss prompt-specific behavior of different language models implemented.

Paper Content

Introduction

  • Academic assessments have used short-text response and essay writing tasks to judge language learning abilities.
  • Automated essay scoring (AES) uses NLP and ML to evaluate essays.
  • AES is used in standardized tests and to reduce workload.
  • AES scores are either holistic or trait-based.
  • Most research has focused on English language domain, with few focusing on other languages.
  • Research in AES for Indic languages is negligible.
  • Hindi is the most spoken language in India.
  • English data translated to Hindi is used to train language models.
  • Dataset, rubric and source code are publicly available.
  • Research for AES in English has spanned decades
  • Many studies have treated AES as a regression and text classification problem
  • Classical machine learning techniques like linear regression, support vector regression, and sequential minimal optimization are used typically for regression-based AES
  • Logistic regression and Bayesian network classification utilize classification approaches
  • For ranking, SVM ranking and LambdaMART have been used
  • These approaches use custom linguistic features such as errors in grammar, readability features, length and syntactic features
  • Progress in Deep Neural Networks, like Convolutional Neural Networks, Recurrent Neural Networks, and Long Short Term Memory networks, allow for better performance in AES systems
  • Pre-trained language models, built using transformer architecture, such as GPT, BERT, and XLNet have pushed forward the limits of language understanding and generation
  • Attempts have been made in the past to develop AES for languages other than English
  • The lack of dedicated essay corpora is a common feature among most of these studies
  • The ASAP corpus has benchmarked a variety of state-of-the-art English language models
  • Due to the ASAP corpus’ ubiquity in AES research, we decided to focus our analysis on a Hindi-translated version of the ASAP dataset
  • Translation challenges the validity and quality of the corpus itself
  • We verified random subsets from the translated corpus with the help of both skilled bilingual speakers and expert academics
  • We built our own Hindi language corpus consisting of 126 real-world essays
  • The scoring for the responses was in accordance with a comprehensive rubric
  • An expert panel of three Hindi academics performed the scoring
  • Scorers have high inter-rater reliability
  • We average the three scores to obtain the final score

Feature extraction

  • Essay Length
  • Average Sentence Length
  • Average Word Length
  • Semantic Overlap and Coherence

Neural approaches

  • Neural networks used for end-to-end machine learning
  • Four approaches for AES using neural methods
  • Initial padding and tokenization process
  • Word embeddings obtained using fastText
  • BiLSTM, CNN, CNN + LSTM + Attention Mechanism, and SKIPFLOW models used

Fine-tuned transformers

  • Transformers use an attention mechanism to handle sequences of ordered data
  • Transformers use non-sequential processing, allowing them to take fewer time steps than LSTMs and RNNs
  • Transformers do not suffer from long-term dependency problems
  • Multilingual language models use BERT-based architecture and its variants
  • IndicBERT and MuRIL are two prominent multilingual language models
  • Five multilingual transformer language models are fine-tuned
  • Input is tokenized with a [CLS] token and [SEP] tokens
  • Hidden layer sequences are passed to a simple feed-forward network to obtain a numerical score

Experiments

  • Described experimental procedure
  • Described evaluation metric
  • Presented empirical results
  • Compared results with published results on prominent English AES models

Experimental setup

  • Experiments conducted in accordance with Taghipour and Ng (2016)
  • Training prompts together is challenging due to different genres and scoring rubrics
  • Text pre-processed to filter out stopwords, named-entities and mentions
  • Custom feature scores normalized to improve model stability
  • Neural LSTM and CNN models trained for 100 epochs with a learning rate of 1e-4
  • AdamW optimizer used to fine-tune pre-trained multilingual transformer models
  • 60/20/20 split for train, validation and test sets
  • Scores re-scaled to original prompt-specific scale for prediction

Evaluation metric

  • QWK is a measure used to evaluate and compare AES methods
  • QWK score ranges from 0 to 1, and can be negative if agreement is lower than expected by chance
  • QWK score is calculated using a weight matrix and a normalized expected count matrix

Results and comparison

  • 13 models implemented on ASAP-Hindi dataset
  • Linear Regression and XGBoost outperformed other models on prompt 1 and prompts 2 and 8 respectively
  • Neural models did not outperform other models, but had higher averages than feature-based models
  • SKIPFLOW gave highest average score amongst neural models
  • Fine-tuned multilingual transformers gave higher results across all prompts
  • Fine-tuned XLM-R model outperformed all other models on 4 of 8 ASAP-Hindi prompts
  • Fine-tuned Indic transformers did not perform comparably well
  • Fine-tuned mBERT model gave highest QWK score on organic prompt

Analysis and discussion

  • Results on prompts 4, 5, and 6 are higher than other prompts
  • Source-dependent prompts perform better due to consistency in syntax, coherence, and source material
  • Generalization of ideas is more difficult on persuasive and expository prompts
  • Prompts with inconsistent rubrics give worse results
  • Large pre-trained language models outperform feature-based approaches
  • Indic language models perform poorly, possibly due to lack of trainable parameters and syntactic localization

Conclusion and future work

  • Implemented and analyzed various methods for AES in Hindi
  • Introduced single-prompt corpus of student-written essays in Hindi
  • Obtained competitive results compared to benchmark and state-of-the-art methods in English AES
  • Explained and analyzed results obtained using models
  • Future research directions possible
  • Plan to extend work by scaling corpus and proposing architectures
  • Most favorable results on fine-tuned pre-trained multilingual transformer language models
  • Plan to try different training optimization methods
  • Introducing linguistic knowledge may bring further improvement
  • Multilingual essay evaluation approach for more diversity in essay writing