Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Labeling systems have been proposed to help monitor work on the UN SDGs.
  • Systems differ in their specificity and sensitivity and have systematic biases.
  • An ensemble model that pools labeling systems can exceed the performance of all currently available systems.

Paper Content


  • Compare seven labeling systems across several metrics and three labeled data sets
  • Assess biases of labeling systems toward different SDGs
  • Assess susceptibility of labeling systems to produce false positives based on text length
  • Assess potential of ensemble models to address limitations of individual labeling systems

SDG labeling systems differ in their sensitivity-specificity trade-offs

  • Compared seven labeling systems to generate predicted labels for documents from three labeled data sets
  • Data sets differ in number of words per document and number of SDGs they were evaluated for
  • Titles data set had 63% of documents judged to contain one SDG
  • Abstracts data set had 80.2% of documents assigned an SDG
  • News articles data set had 100% of documents assigned one or more SDGs
  • Visual inspection suggests all systems showed reasonable accuracy in terms of recovering the “true” SDG labels assigned by experts
  • Compared systems quantitatively using common metrics (sensitivity, specificity, accuracy, F1 score)
  • Systems differ in trade-off between sensitivity and specificity
  • Results point to labeling systems being differently conservative
  • Difficult to identify single best-performing model
  • Limitation of available data sets with small proportion of non-SDG-related documents
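
The four metrics named above can be computed from a confusion matrix. The following is a minimal sketch, assuming binary labels (1 = document relates to a given SDG, 0 = it does not); the toy labels and the two "systems" are hypothetical and are not taken from the paper's data.

```python
def binary_metrics(y_true, y_pred):
    """Return sensitivity, specificity, accuracy, and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # share of true SDG documents detected
        "specificity": ratio(tn, tn + fp),   # share of non-SDG documents left unlabeled
        "accuracy": ratio(tp + tn, len(y_true)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical expert labels and predictions from two made-up systems.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
conservative = [1, 0, 0, 0, 0, 0, 0, 1]  # rarely assigns a label
liberal = [1, 1, 1, 1, 1, 0, 0, 1]       # assigns labels readily

for name, preds in (("conservative", conservative), ("liberal", liberal)):
    print(name, binary_metrics(y_true, preds))
```

In this toy case both systems reach the same accuracy (0.75) while trading sensitivity against specificity in opposite directions, which is why a single metric makes it hard to name a best system.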

Biases in SDG labeling systems distort SDG profiles

  • Biases in SDG labeling systems can lead to inaccurate representation of the prevalence and importance of different SDGs.
  • Biased systems may create an unfair advantage or disadvantage for certain organizations or groups.
  • Biased systems may create confusion or mistrust among stakeholders.
  • We estimated SDG-specific biases by comparing the relative frequency of SDGs between predicted and observed labels.
  • We found Auckland to have the highest average correlation, followed by SDGO, Elsevier, SIRIS, Aurora, and OSDG.
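
The comparison of relative SDG frequencies described above can be sketched as follows. This is an illustration under assumptions, not the paper's exact procedure: the function names, the three SDG codes, and the label lists are hypothetical, and the Pearson correlation between the two profiles is used here as one plausible measure of how well a predicted profile tracks the observed one.

```python
from collections import Counter
from math import sqrt


def sdg_shares(labels, sdgs):
    """Relative frequency of each SDG among a flat list of assigned labels."""
    counts = Counter(labels)
    total = sum(counts[s] for s in sdgs)
    return {s: (counts[s] / total if total else 0.0) for s in sdgs}


def profile_bias(observed, predicted, sdgs):
    """Predicted share minus observed share; positive means over-representation."""
    obs = sdg_shares(observed, sdgs)
    pred = sdg_shares(predicted, sdgs)
    return {s: pred[s] - obs[s] for s in sdgs}


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical labels: expert-assigned versus system-assigned.
sdgs = ["SDG-03", "SDG-04", "SDG-13"]
observed = ["SDG-03"] * 4 + ["SDG-04"] * 4 + ["SDG-13"] * 2
predicted = ["SDG-03"] * 6 + ["SDG-04"] * 3 + ["SDG-13"] * 1

bias = profile_bias(observed, predicted, sdgs)
obs_profile = [sdg_shares(observed, sdgs)[s] for s in sdgs]
pred_profile = [sdg_shares(predicted, sdgs)[s] for s in sdgs]
fidelity = pearson(obs_profile, pred_profile)

print(bias)      # per-SDG over- or under-representation
print(fidelity)  # agreement between predicted and observed profiles
```

In this made-up example the system over-represents SDG-03 and under-represents the other two goals, so its predicted profile would overstate how much of the corpus concerns SDG-03.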

SDG labeling systems can produce many false positives when applied to large text sources

  • Existing validation data sets may not accurately reflect the performance of SDG labeling systems in real-world applications.
  • Existing data sets only include a small proportion of documents unrelated to the SDGs, which can lead to an overestimation of accuracy.
  • Evaluations of the labeling systems were conducted using existing and synthetic data sets.
  • The tendency of systems to produce false positives is in line with their level of conservatism.
  • The tendency to detect SDGs increases considerably when the length of the texts increases.
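
One way to see why detections rise with text length is a simple probabilistic sketch. It rests on strong assumptions that are not from the paper: every word independently triggers a spurious query match with the same small probability, and the per-word rate used below is a hypothetical value chosen only for illustration.

```python
def false_positive_probability(n_words, p_per_word):
    """Chance of at least one spurious match in a text of n_words words.

    Assumes independent words with a constant per-word match rate,
    which real text does not strictly satisfy.
    """
    return 1.0 - (1.0 - p_per_word) ** n_words


P_PER_WORD = 0.001  # hypothetical spurious-match rate per word

lengths = [10, 100, 1000, 10000]
probabilities = [false_positive_probability(n, P_PER_WORD) for n in lengths]

for n, prob in zip(lengths, probabilities):
    print(f"{n:>6} words: P(at least one false positive) = {prob:.3f}")
```

Under these assumptions the probability grows monotonically with length, so a system with a higher per-word match rate (a less conservative one) reaches a high false-positive rate at shorter text lengths.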

Trained ensemble models alleviate the shortcomings of existing labeling systems

  • Existing SDG labeling systems have shortcomings
  • Labeling systems differ in conservatism and accuracy
  • False positives increase with document length
  • Ensemble models of SDG labeling systems are trained on the outputs of six publicly available labeling systems plus document length
  • Ensemble model achieves higher accuracy than individual labeling systems
  • Ensemble model has even performance across data sets
  • Profile bias and fidelity of ensemble model outperform individual systems
  • Feature importance varies across SDGs
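
The idea of training a model on the systems' outputs plus document length can be sketched with a deliberately simple stand-in. The paper's ensembles use tree-based learners; the sketch below instead uses a lookup table that memorizes the majority expert label for each observed pattern of votes and length, purely to stay dependency-free. The three systems, the length threshold, and all training rows are hypothetical.

```python
from collections import Counter, defaultdict


def make_features(votes, n_words, long_threshold=500):
    """Combine each system's 0/1 vote with a coarse document-length flag."""
    return tuple(votes) + (int(n_words >= long_threshold),)


def train_lookup_ensemble(rows, labels):
    """Learn the majority expert label for every observed feature pattern."""
    tallies = defaultdict(Counter)
    for row, label in zip(rows, labels):
        tallies[row][label] += 1
    table = {row: c.most_common(1)[0][0] for row, c in tallies.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen patterns
    return table, default


def predict(model, row):
    table, default = model
    return table.get(row, default)


# Hypothetical training data: (votes of systems A, B, C), word count, expert label.
training = [
    ((1, 1, 1), 120, 1),
    ((1, 1, 0), 90, 1),
    ((1, 0, 0), 80, 0),    # only the liberal system A fires
    ((1, 1, 0), 2000, 0),  # long text: two systems fire, experts disagree
    ((1, 1, 1), 3000, 1),
    ((0, 0, 0), 100, 0),
    ((1, 0, 0), 1500, 0),
    ((0, 1, 1), 150, 1),
    ((0, 0, 0), 900, 0),
]

rows = [make_features(votes, n) for votes, n, _ in training]
labels = [label for _, _, label in training]
model = train_lookup_ensemble(rows, labels)

short_doc = predict(model, make_features((1, 1, 0), 100))
long_doc = predict(model, make_features((1, 1, 0), 2500))
print(short_doc, long_doc)
```

The same pair of votes leads to different ensemble decisions for a short and a long text, which is the kind of interaction between system outputs and document length that a trained ensemble can exploit. A lookup table only fits patterns it has seen, so it says nothing about out-of-sample accuracy; a real ensemble would need a learner that generalizes and a held-out evaluation.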


  • Aimed to compare existing automated labeling systems to identify work on SDGs from text
  • Compared seven systems using various text sources and metrics
  • Systems differ in accuracy and performance varies across text sources
  • Systems have different biases that can impact overall profile of SDGs identified
  • Reliance on automated systems for ranking institutions’ contributions to SDGs requires reliable systems
  • Ensemble models can overcome limitations of individual labeling systems
  • Ensemble models can achieve higher performance and are less susceptible to biases
  • Recommend use of ensemble approaches as best practice when drawing conclusions about SDGs based on automated methods


  • Systems based on Lucene-style queries vary in complexity
  • SDSN and OSDG systems are least complex, only use OR operations
  • SIRIS and Elsevier systems are more complex, use AND operations
  • Aurora system is most complex, uses NEAR operations
  • Data sets used include Aurora, OSDG Community, SDG Knowledge Hub, Disneyland reviews, cooking recipes, math lectures, and synthetic data
  • Algorithms used to train ensembles of SDG labeling systems are random forest and extreme gradient boosting
  • Cases in data sets are initially weighted by 1/N; the weights of synthetic cases are then multiplied by a factor k
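
The three query operators mentioned above (OR, AND, NEAR) can be illustrated on tokenized text. This is a minimal sketch and not the Lucene implementation or any system's actual query set: the tokenizer, the function names, the example sentence, and the search terms are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_or(tokens, terms):
    """OR: at least one term occurs anywhere in the text."""
    return any(term in tokens for term in terms)


def match_and(tokens, terms):
    """AND: every term occurs somewhere in the text."""
    return all(term in tokens for term in terms)


def match_near(tokens, a, b, max_distance):
    """NEAR: terms a and b occur within max_distance token positions of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


text = ("Rising poverty in coastal regions is linked to "
        "climate change and falling fish stocks.")
tokens = tokenize(text)

or_hit = match_or(tokens, ["poverty", "hunger"])             # one term suffices
and_hit = match_and(tokens, ["poverty", "hunger"])           # "hunger" is absent
near_hit = match_near(tokens, "climate", "change", 1)        # adjacent tokens
far_miss = match_near(tokens, "poverty", "climate", 3)       # seven tokens apart

print(or_hit, and_hit, near_hit, far_miss)
```

OR-only queries match most readily, while AND and NEAR add constraints that a text must satisfy, which is consistent with more complex query systems behaving more conservatively.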