Abstract

  • Companies struggle to develop and deploy AI models to production systems.
  • Continuous pipelines for AI are an active research area.
  • This paper includes a Multivocal Literature Review and semi-structured interviews.
  • Paper provides and compares terminologies for DevOps, CI/CD, MLOps, lifecycle management, and CD4ML.
  • Paper provides a list of potential triggers for reiterating the pipeline.
  • Paper presents a consolidated pipeline with four stages: Data Handling, Model Learning, Software Development, and System Operations.
  • Paper maps challenges regarding pipeline implementation, adaption, and usage for the continuous development of AI.

Paper Content

Introduction

  • Increase in data and computing power allows exploring AI
  • ML and DL enable new intelligent products and services
  • Quality assurance needed to guarantee safe and reliable behaviour
  • Non-determinism leads to uncertainty
  • Automated end-to-end CI/CD pipelines needed for AI development

Continuous software engineering and DevOps

  • Continuous software engineering is a continuous development lifecycle
  • CI focuses on integrating code changes and ensuring software quality
  • CD delivers tested software features to a staging or test environment
  • Continuous Deployment deploys software to production environment
  • CI/CD for AI automates deployment process for AI models
  • DevOps is a continuous software development approach
  • MLOps expands DevOps for ML based applications
  • End-to-end lifecycle management handles tasks for continuous development of AI
  • CD4ML is the technical implementation of MLOps concept
  • Published works focus on pipelines for continuous development of AI
  • Table 1 lists related studies with similar methodology
  • Related studies have limited primary sources
  • Focus on specific application contexts
  • Different research questions investigated
  • Karamitsos et al. focus on traditional DevOps principles
  • John et al. propose MLOps maturity model
  • Lwakatare et al. propose levels of automation
  • Fredriksson et al. focus on approaches to label data
  • Testi et al. summarize different types of MLOps pipelines
  • Kolltveit et al. focus on operationalizing the model
  • Kreuzberger et al. cover principles and professions
  • Lorenzoni et al. identify software engineering processes
  • Xie et al. focus on characteristics of the lifecycle
  • DataOps addressed by Rodriguez et al., Munappy et al., and Ereth

Triggers:

  • No information on triggers of pipeline for continuous development in related SLRs and MLRs
  • Challenges related to development of AI in general, not pipeline itself
  • Organizational, ML system, and operational challenges for adopting MLOps
  • Non-technical challenges such as expectation management, trust, transparency
  • General challenges regarding ML system development, such as data management, modelling and operationalisation

Methodology

  • Used three research methods: MLR, taxonomy creation strategy, and qualitative analysis
  • MLR proposed by Garousi et al., taxonomy development method based on Usman et al., and interviews’ deductive category definition based on Mayring

MLR

  • MLR was executed to provide overview of pipelines for continuous development of AI
  • Grey literature is essential because using a lifecycle pipeline for AI is an emerging research topic
  • 37 formal sources were derived as start data set
  • 79 papers were derived between 15th of April and 30th of May 2021
  • 151 relevant sources were identified, 79 formal and 72 informal
  • Descriptive qualitative synthesis was executed to extract necessary categories
  • Open and axial coding was used to break down, examine, compare, conceptualize, and categorize information

Taxonomy creation

  • Categorized data based on revised taxonomy creation strategy proposed by Usman et al.
  • Defined units of classes/categories based on DevOps phases
  • Used qualitative approach to add extracted information to respective class/category
  • Used facet approach for classification structure type to easily adapt taxonomy if further research is done
  • Identified facets: Data Handling, Model Learning, Software Development and System Operations

Qualitative analysis

  • Qualitative approach used to check derived information from literature
  • Interviews conducted to explore individual experiences
  • Stratified sampling technique used to select interview participants
  • Nine participants identified
  • Interviews conducted between 30th of July and 10th of September 2021
  • Qualitative content analysis used to analyse interviews

Results

  • Different terminologies are elaborated
  • Triggers to start/restart the pipeline are identified
  • Taxonomy is created to describe the pipeline for continuous development of AI
  • Challenges regarding implementation, adaption and usage of pipelines are explored

Terminologies

  • DevOps for AI, CI/CD for AI, MLOps, (End-to-End) Lifecycle Management, and CD4ML are terms related to automation for the continuous development and improvement of AI models.
  • MLOps is an extension of DevOps.
  • MLOps takes into consideration the company’s culture and how cross-functional teams collaborate.
  • Continuous Training (CT) is a new practice that uses collected feedback and production data.
  • MLOps can be divided into three different levels of maturity.
  • (End-to-End) Lifecycle Management uses concepts from software development to cope with many model iterations.
  • CD4ML is a software engineering approach that produces machine learning applications.

Triggers

  • Four trigger types discussed: feedback and alert systems, orchestration service and schedule, repository, and other triggers
  • AI models need to be iteratively adapted and retrained
  • Context-specific triggers depending on AI model, business requirements, and retraining strategies
  • Triggers may take into consideration optimal threshold for benefits of updated model
  • Trial and error process to minimize resource consumption
  • Triggers combining different approaches may be feasible
  • Collected feedback and alerts may be used to trigger pipeline
  • Data events may be potential trigger
  • Monitoring system monitors and collects data from production to trigger pipeline
  • Data drifts occur when distribution of data set changes
  • Data updates may happen due to schema updates
  • Triggers may occur periodically or when specific threshold is attained
  • Model updates may be triggered due to deterioration of model’s performance
  • Event streaming platforms may be used to semi-automate triggering process
  • Repository updates used as traditional triggers
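
The trigger types listed above (alerts on data drift and performance deterioration, plus a schedule) can be combined in one decision function. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `TriggerConfig`, `mean_shift`, and `should_retrain`, all threshold values, and the use of a standardized mean shift as the drift signal are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass
class TriggerConfig:
    # All thresholds are hypothetical and would be tuned per use case.
    max_drift: float = 0.5                   # allowed mean shift, in reference std devs
    min_accuracy: float = 0.90               # alert when monitored accuracy falls below this
    max_age: timedelta = timedelta(days=30)  # scheduled retraining interval


def mean_shift(reference, production):
    """Crude drift signal: shift of the production mean in reference std devs."""
    spread = pstdev(reference) or 1.0  # avoid division by zero for constant data
    return abs(mean(production) - mean(reference)) / spread


def should_retrain(reference, production, accuracy, last_trained, now,
                   cfg=TriggerConfig()):
    """Return the list of trigger reasons; an empty list means no retraining."""
    reasons = []
    if mean_shift(reference, production) > cfg.max_drift:
        reasons.append("data_drift")
    if accuracy < cfg.min_accuracy:
        reasons.append("performance_drop")
    if now - last_trained > cfg.max_age:
        reasons.append("schedule")
    return reasons


if __name__ == "__main__":
    now = datetime(2021, 9, 10)
    print(should_retrain([1.0, 2.0, 3.0, 4.0, 5.0], [4.0, 5.0, 6.0],
                         accuracy=0.95,
                         last_trained=now - timedelta(days=10),
                         now=now))
```

Returning the reasons rather than a boolean keeps the combination of triggers visible, which may help when tuning thresholds by trial and error to limit resource consumption.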

Data handling

  • Data Handling covers the end-to-end lifecycle of data curation
  • Quality of AI model depends on data availability, quality, and preprocessing techniques
  • Data pipeline manipulates initial data via tasks such as preprocessing, testing, versioning, and documentation
  • Parallelized computing procedures can reduce total execution time by 7%
  • Data collection includes data injection, preparation, labelling and feature extraction
  • Data preparation depends on type of data
  • Labelling is necessary for supervised learning
  • Feature extraction or feature engineering is necessary
  • Data and feature validation validates batches of data
  • Data quality can be measured via 6 dimensions
  • Automatic feasibility analysis can be used to identify if data is sufficient
  • Unit tests verify data ingestion
  • Data needs to be stored and versioned
  • Documentation includes guidelines with concrete actions
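
Validating batches of data before they reach model learning could look like the sketch below. It is an illustration only: the rule format `ColumnRule`, the function `validate_batch`, and the example columns are assumptions, and the checks cover just completeness, type, and value range rather than the quality dimensions the paper refers to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnRule:
    # Hypothetical rule format: expected Python type and an optional numeric range.
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_batch(rows, schema):
    """Return a list of human-readable problems found in a batch of records."""
    problems = []
    for i, row in enumerate(rows):
        for name, rule in schema.items():
            value = row.get(name)
            if value is None:
                problems.append(f"row {i}: missing '{name}'")
            elif not isinstance(value, rule.dtype):
                problems.append(f"row {i}: '{name}' has type {type(value).__name__}")
            elif rule.min_value is not None and value < rule.min_value:
                problems.append(f"row {i}: '{name}' below {rule.min_value}")
            elif rule.max_value is not None and value > rule.max_value:
                problems.append(f"row {i}: '{name}' above {rule.max_value}")
    return problems


if __name__ == "__main__":
    schema = {"age": ColumnRule(int, 0, 120), "country": ColumnRule(str)}
    batch = [{"age": 34, "country": "DE"}, {"age": -1, "country": "SE"}, {"age": 50}]
    for problem in validate_batch(batch, schema):
        print(problem)
```

A unit test for data ingestion can then assert that a known-good sample batch yields an empty problem list.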

Model learning

  • Task allows comparing models to find the best one
  • Standard Keras metrics calculate accuracy, precision, recall
  • Pipeline should support decisions on model design, selection of components, algorithm, feature selection, hyper-parameter setting, data set split
  • AutoML takes over design decisions
  • Pipeline implementation depends on architecture, distribution, amount of computation resources, context-specific algorithms
  • TFX and MLFlow provide necessary implementations or support for AI libraries
  • Pipeline should provide faultless, reliable and secure software
  • Fail-safe measures need to be introduced via tests
  • Reusing the same test data sets repeatedly can lead to overfitting
  • Metrics used to test for overfitting
  • Automated tests not always possible, experts validate model decisions
  • Goal is to optimize specific metric throughout pipeline’s lifecycles
  • Pipeline compares statistical evaluation metrics with metrics from previous model versions
  • Traditional quality metrics used, such as F1 scores, precision and accuracy
  • Learnability metric adapted for using it during CI/CD of AI models
  • Ethics-by-design metric for model quality assurance in continuous development pipeline
  • Feature explainability used to identify how much each feature contributes to final prediction
  • Checks for bias not included in pipeline’s model quality assurance due to non-critical domains
  • Pipeline-specific model improvement techniques not identified
  • Model compression, pruning, hardening, hyper-parameter optimization used
  • Metadata necessary to govern modelling lifecycle
  • Model versioning captures versioned model artifacts and model dependencies
  • Model versions are not always stored, due to resource constraints or lack of interest
  • Alternatives to standard version control systems used
  • Model documentation necessary to receive certifications
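
Comparing evaluation metrics of a new model against a previous version, as described above, can be sketched in plain Python. The functions `classification_metrics` and `improves_on`, the choice of F1 as the default comparison metric, and the `min_gain` parameter are illustrative assumptions; a real pipeline would more likely rely on the metric implementations of its ML framework.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def improves_on(candidate, baseline, metric="f1", min_gain=0.0):
    """True if the candidate beats the stored baseline by more than min_gain."""
    return candidate[metric] - baseline[metric] > min_gain


if __name__ == "__main__":
    new = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
    previous = {"f1": 0.70}  # hypothetical metrics stored with the previous version
    print(new, improves_on(new, previous))
```

Requiring a minimum gain is one way to avoid replacing a model for differences that are within evaluation noise.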

Software development

  • TFX does not handle software development-specific tasks
  • ML Metadata stores properties of trained models and data sets
  • TFX versions artifacts produced by components
  • Automated software-level quality assurance tests check correct behaviour of system
  • Automated stress and robustness tests measure operational metrics
  • Packaged models and quality assurance results stored
  • Pipeline and tasks versioned
  • Documentation of development stage proposed

System operations

  • System Operations handles deployment and monitoring of AI model
  • 4 criteria must be fulfilled before deployment
  • Pipeline identifies best model, but research gap on how to compare model versions fairly
  • Model must fulfill user-defined deployment criteria
  • Human involvement may be required
  • Model deployed on different environments
  • Continuous experimentation allows gathering user feedback
  • Monitoring systems collect input and output data
  • Monitor model performance and traditional software monitoring aspects
  • Collect data to identify impact on business outcomes
  • AI applications may be deployed in different environments
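
A deployment gate that combines user-defined criteria with optional human sign-off might be sketched as follows. The criteria names, the report fields, and the function `deployment_decision` are hypothetical; they do not reproduce the specific criteria listed in the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A criterion receives the evaluation report and returns True if it is satisfied.
Criterion = Callable[[Dict[str, float]], bool]


def deployment_decision(
    report: Dict[str, float],
    criteria: Dict[str, Criterion],
    approved_by_human: Optional[bool] = None,
) -> Tuple[str, List[str]]:
    """Return a decision and the names of any failed checks."""
    failed = [name for name, check in criteria.items() if not check(report)]
    if failed:
        return "reject", failed
    if approved_by_human is None:
        # Automated checks passed, but a human reviewer has not decided yet.
        return "await_approval", []
    if not approved_by_human:
        return "reject", ["human_review"]
    return "deploy", []


if __name__ == "__main__":
    criteria = {
        "accuracy": lambda r: r["accuracy"] >= 0.90,
        "latency": lambda r: r["p95_latency_ms"] <= 200,
    }
    report = {"accuracy": 0.93, "p95_latency_ms": 120}
    print(deployment_decision(report, criteria, approved_by_human=True))
```

Keeping the criteria as named callables lets each team define its own thresholds without changing the gate itself.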

Challenges

  • Transform data into universal format
  • Automatic feature labelling
  • Identify which data source augments features
  • Identify how data preparation affects model quality
  • Unclear data anomaly alerts
  • Correct & reasonably strict alerts
  • Automatic versioning methods for data sets and associated model
  • Store information on data governance
  • Elastic scaling to provide right amount of hardware
  • Adapt to horizontal or vertical scaling
  • Efficiently distribute resources

Discussion

  • Pipeline may vary due to resource constraints, difficulties in specific implementations and cost/benefit analysis
  • Execution order of tasks depends on context, such as organizational policies
  • TFX combines components to enable a flexible and customizable pipeline
  • TFX components mapped to proposed framework in Figure 6
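
Because the execution order of tasks depends on context, one way to keep a pipeline flexible is to treat the four stages as interchangeable functions that are run in a configurable order. The sketch below is a toy illustration with placeholder stage bodies; the function names, the dictionary-based context, and `run_pipeline` are assumptions and are unrelated to how TFX composes its components.

```python
def data_handling(ctx):
    # Placeholder for collection, validation, and versioning of data.
    return {**ctx, "data": [x for x in ctx["raw"] if x is not None]}


def model_learning(ctx):
    # Placeholder "model": the mean of the cleaned data.
    return {**ctx, "model": sum(ctx["data"]) / len(ctx["data"])}


def software_development(ctx):
    # Placeholder for packaging and software-level tests.
    return {**ctx, "package": {"model": ctx["model"], "version": ctx.get("version", 1)}}


def system_operations(ctx):
    # Placeholder for deployment and monitoring.
    return {**ctx, "deployed": True}


DEFAULT_ORDER = [data_handling, model_learning, software_development, system_operations]


def run_pipeline(ctx, stages=DEFAULT_ORDER):
    """Run the given stages in order and record which ones were executed."""
    executed = []
    for stage in stages:
        ctx = stage(ctx)
        executed.append(stage.__name__)
    return {**ctx, "executed": executed}


if __name__ == "__main__":
    print(run_pipeline({"raw": [1.0, None, 3.0]}))
    # Re-entering at model learning, e.g. after a retraining trigger on existing data.
    print(run_pipeline({"data": [2.0, 4.0]}, stages=DEFAULT_ORDER[1:]))
```

Passing a different `stages` list is enough to skip, reorder, or restart from a later stage, which mirrors the idea that a trigger may reiterate only part of the pipeline.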

Threats to validity

Conclusion

  • Main goal was to identify and compare five terms related to pipelines for continuous development of AI
  • Nine semi-structured interviews and 151 sources were used
  • Five terms identified: DevOps for AI, CI/CD for AI, MLOps, end-to-end lifecycle management, and CD4ML
  • Four stages of pipeline identified: Data Handling, Model Learning, Software Development, and System Operations
  • Challenges identified: data collection and integration, data transformation, data preparation, model quality assurance, versioning, packaging, deployment, monitoring, environment and infrastructure handling, flexibility, customizability, and fault tolerance
  • Future work planned to explore, evaluate, and compare pipeline concepts with available platforms