Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Companies struggle to develop and deploy AI models to production systems.
Continuous pipelines for AI are an active research area.
This paper includes a Multivocal Literature Review and semi-structured interviews.
The paper provides and compares terminologies for DevOps, CI/CD, MLOps, lifecycle management, and CD4ML.
The paper provides a list of potential triggers for reiterating the pipeline.
The paper presents a consolidated pipeline with four stages: Data Handling, Model Learning, Software Development, and System Operations.
The paper maps challenges regarding pipeline implementation, adaptation, and usage for the continuous development of AI.

Paper Content

Introduction

Increase in data and computing power allows exploring AI
ML and DL enable new intelligent products and services
Quality assurance needed to guarantee safe and reliable behaviour
Non-determinism leads to uncertainty
Automated end-to-end CI/CD pipelines needed for AI development

Background and related work

Continuous software engineering and DevOps

Continuous software engineering is a continuous development lifecycle
CI focuses on integrating code changes and ensuring software quality
CD delivers tested software features to a staging or test environment
Continuous Deployment deploys software to production environment
CI/CD for AI automates deployment process for AI models
DevOps is a continuous software development approach
MLOps expands DevOps for ML based applications
End-to-end lifecycle management handles tasks for continuous development of AI
CD4ML is the technical implementation of MLOps concept

Related work

Published works focus on pipelines for continuous development of AI
Table 1 lists related studies with similar methodology
Related studies have limited primary sources
Related studies focus on specific application contexts
Related studies investigate different research questions
Karamitsos et al. focus on traditional DevOps principles
John et al. propose MLOps maturity model
Lwakatare et al. propose levels of automation
Fredriksson et al. focus on approaches to label data
Testi et al. summarize different types of MLOps pipelines
Kolltveit et al. focus on operationalizing the model
Kreuzberger et al. cover principles and professions
Lorenzoni et al. identify software engineering processes
Xie et al. focus on characteristics of the lifecycle
DataOps addressed by Rodriguez et al., Munappy et al., and Ereth

Triggers:

No information on triggers of pipeline for continuous development in related SLRs and MLRs
Challenges related to development of AI in general, not pipeline itself
Organizational, ML system, and operational challenges for adopting MLOps
Non-technical challenges such as expectation management, trust, transparency
General challenges regarding ML system development, such as data management, modelling and operationalisation

Methodology

Used three research methods: MLR, taxonomy creation strategy, and qualitative analysis
MLR proposed by Garousi et al., taxonomy development method based on Usman et al., and interviews’ deductive category definition based on Mayring

MLR

MLR was executed to provide overview of pipelines for continuous development of AI
Grey literature is essential because using a lifecycle pipeline for AI is an emerging research topic
37 formal sources were derived as start data set
79 papers were derived between 15th of April and 30th of May 2021
151 relevant sources were identified, 79 formal and 72 informal
Descriptive qualitative synthesis was executed to extract necessary categories
Open and axial coding was used to break down, examine, compare, conceptualize, and categorize information

Taxonomy creation

Categorized data based on revised taxonomy creation strategy proposed by Usman et al.
Defined units of classes/categories based on DevOps phases
Used qualitative approach to add extracted information to respective class/category
Used facet approach for classification structure type to easily adapt taxonomy if further research is done
Identified facets: Data Handling, Model Learning, Software Development and System Operations

Qualitative analysis

Qualitative approach used to check derived information from literature
Interviews conducted to explore individual experiences
Stratified sampling technique used to select interview participants
Nine participants identified
Interviews conducted between 30th of July and 10th of September 2021
Qualitative content analysis used to analyse interviews

Results

Different terminologies are elaborated
Triggers to start/restart the pipeline are identified
Taxonomy is created to describe the pipeline for continuous development of AI
Challenges regarding implementation, adaptation and usage of pipelines are explored

Terminologies

DevOps for AI, CI/CD for AI, MLOps, (End-to-End) Lifecycle Management, and CD4ML are terms related to automation for the continuous development and improvement of AI models.
MLOps is an extension of DevOps.
MLOps takes into consideration the company’s culture and how cross-functional teams collaborate.
Continuous Training (CT) is a new practice that uses collected feedback and production data.
MLOps can be divided into three different levels of maturity.
(End-to-End) Lifecycle Management uses concepts from software development to cope with many model iterations.
CD4ML is a software engineering approach that produces machine learning applications.

Triggers

Four trigger types discussed: feedback and alert systems, orchestration service and schedule, repository, and other triggers
AI models need to be iteratively adapted and retrained
Context-specific triggers depending on AI model, business requirements, and retraining strategies
Triggers may take into consideration optimal threshold for benefits of updated model
Trial and error process to minimize resource consumption
Triggers combining different approaches may be feasible
Collected feedback and alerts may be used to trigger pipeline
Data events may be potential trigger
Monitoring system monitors and collects data from production to trigger pipeline
Data drifts occur when distribution of data set changes
Data updates may happen due to schema updates
Triggers may occur periodically or when specific threshold is attained
Model updates may be triggered due to deterioration of model’s performance
Event streaming platforms may be used to semi-automate triggering process
Repository updates used as traditional triggers
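
The trigger types listed above can be combined in a single check. The following is a minimal sketch of that idea, not the paper's implementation: the `PipelineState` fields, the function name `pipeline_triggers`, and all threshold defaults are hypothetical and would need to be tuned per model and business context.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineState:
    """Hypothetical snapshot of what a monitoring system might report."""
    last_run: datetime          # when the pipeline last completed
    baseline_accuracy: float    # accuracy measured at deployment time
    current_accuracy: float     # accuracy measured on recent production data
    drift_score: float          # any drift statistic, higher means more drift
    repository_changed: bool    # new commit in the code or data repository


def pipeline_triggers(state, now, max_age=timedelta(days=30),
                      max_accuracy_drop=0.05, max_drift=0.2):
    """Return the names of all triggers that fire for the given state.

    The thresholds are illustrative assumptions, not recommended values.
    """
    fired = []
    if state.repository_changed:
        fired.append("repository")      # traditional CI-style trigger
    if now - state.last_run >= max_age:
        fired.append("schedule")        # periodic retraining
    if state.baseline_accuracy - state.current_accuracy > max_accuracy_drop:
        fired.append("performance")     # model deterioration
    if state.drift_score > max_drift:
        fired.append("data_drift")      # input distribution changed
    return fired
```

Returning every trigger that fired, rather than a single boolean, keeps the reason for a pipeline run visible, which may help during the trial-and-error tuning of thresholds mentioned above.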

Data handling

Data Handling covers the end-to-end lifecycle of data curation
Quality of AI model depends on data availability, quality, and preprocessing techniques
Data pipeline manipulates initial data via tasks such as preprocessing, testing, versioning, and documentation
Parallelized computing procedures can reduce total execution time by 7%
Data collection includes data injection, preparation, labelling and feature extraction
Data preparation depends on type of data
Labelling is necessary for supervised learning
Feature extraction or feature engineering is necessary
Data and feature validation validates batches of data
Data quality can be measured via 6 dimensions
Automatic feasibility analysis can be used to identify if data is sufficient
Unit tests verify data ingestion
Data needs to be stored and versioned
Documentation includes guidelines with concrete actions
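
Validating a batch of data before it enters the pipeline could look like the sketch below. It is an illustration only: the schema format, the function name `validate_batch`, and the choice of completeness as the reported quality measure are assumptions, since the summary does not name the six quality dimensions.

```python
def validate_batch(rows, schema):
    """Check each row against a schema of {column: expected_type}.

    Returns (completeness, errors): completeness is the share of expected
    values that are present, errors is a list of human-readable problems.
    """
    errors = []
    expected = len(rows) * len(schema)
    present = 0
    for index, row in enumerate(rows):
        for column, expected_type in schema.items():
            value = row.get(column)
            if value is None:
                errors.append(f"row {index}: missing '{column}'")
                continue
            present += 1
            if not isinstance(value, expected_type):
                errors.append(
                    f"row {index}: '{column}' is not {expected_type.__name__}"
                )
    # An empty batch is treated as complete by convention (an assumption).
    completeness = present / expected if expected else 1.0
    return completeness, errors
```

A unit test for data ingestion can then assert that a known-good sample batch yields no errors and full completeness.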

Model learning

Model Learning allows comparing models to find the best one
Standard Keras metrics calculate accuracy, precision, recall
Pipeline should support decisions on model design, selection of components, algorithm, feature selection, hyper-parameter setting, data set split
AutoML takes over design decisions
Pipeline implementation depends on architecture, distribution, amount of computation resources, context-specific algorithms
TFX and MLFlow provide necessary implementations or support for AI libraries
Pipeline should provide faultless, reliable and secure software
Fail-safe measures need to be introduced via tests
Reusing the same test data sets too often can lead to overfitting
Metrics used to test for overfitting
Automated tests not always possible, experts validate model decisions
Goal is to optimize specific metric throughout pipeline’s lifecycles
Pipeline compares statistical evaluation metrics with metrics from previous model versions
Traditional quality metrics used, such as F1 scores, precision and accuracy
Learnability metric adapted for using it during CI/CD of AI models
Ethics-by-design metric for model quality assurance in continuous development pipeline
Feature explainability used to identify how much each feature contributes to final prediction
Checks for bias not included in pipeline’s model quality assurance due to non-critical domains
Pipeline-specific model improvement techniques not identified
Model compression, pruning, hardening, hyper-parameter optimization used
Metadata necessary to govern modelling lifecycle
Model versioning captures versions of model artifacts and model dependencies
Model versions are not always stored, due to resource constraints or lack of interest
Alternatives to standard version control systems used
Model documentation necessary to receive certifications
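
Comparing a candidate model's evaluation metrics with those of the previous version can be sketched as follows. This is a plain-Python illustration for binary labels, not the Keras metrics mentioned above; the function names and the `min_gain` parameter are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def improves_on(candidate, previous, metric="f1", min_gain=0.0):
    """True if the candidate beats the previous version by more than min_gain."""
    return candidate[metric] - previous[metric] > min_gain
```

Both models should be evaluated on the same held-out data for the comparison to be meaningful; how to compare versions fairly when the data changes between runs is not settled by this sketch.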

Software development

TFX does not handle software development-specific tasks
ML Metadata stores properties of trained models and data sets
TFX versions artifacts produced by components
Automated software-level quality assurance tests check correct behaviour of system
Automated stress and robustness tests measure operational metrics
Packaged models and quality assurance results stored
Pipeline and tasks versioned
Documentation of development stage proposed
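
One simple way to version a packaged model together with its metadata is a content hash. The sketch below is an assumption for illustration and does not use TFX or ML Metadata; `artifact_version` and the metadata keys in the usage example are hypothetical.

```python
import hashlib
import json


def artifact_version(model_bytes, metadata):
    """Derive a deterministic version id from a model and its metadata.

    The same bytes and metadata always give the same id, so any change to
    the model, its dependencies, or its data set reference yields a new one.
    """
    digest = hashlib.sha256()
    digest.update(model_bytes)
    # sort_keys makes the serialization independent of dict ordering
    digest.update(json.dumps(metadata, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]  # shortened for readability
```

For example, `artifact_version(b"weights", {"data_version": "v3", "framework": "keras"})` gives an id that can be stored next to the quality assurance results for that package.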

System operations

System Operations handles deployment and monitoring of AI model
4 criteria must be fulfilled before deployment
Pipeline identifies best model, but research gap on how to compare model versions fairly
Model must fulfill user-defined deployment criteria
Human involvement may be required
Model deployed on different environments
Continuous experimentation allows gathering user feedback
Monitoring systems collect input and output data
Monitor model performance and traditional software monitoring aspects
Collect data to identify impact on business outcomes
AI applications may be deployed in different environments
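
A deployment gate based on user-defined criteria, with an optional human approval step, might be sketched like this. The function name, the criteria format (metric name mapped to a minimum value), and the approval flags are assumptions; the summary does not list the four criteria, so none are hard-coded here.

```python
def deployment_decision(metrics, criteria,
                        requires_approval=False, approved=False):
    """Decide whether a model may be deployed.

    metrics:  {name: value} reported by the pipeline
    criteria: {name: minimum acceptable value}, defined by the user
    Returns (deploy, reasons), where reasons explains every blocking issue.
    """
    reasons = []
    for name, minimum in criteria.items():
        value = metrics.get(name)
        if value is None:
            reasons.append(f"metric '{name}' not reported")
        elif value < minimum:
            reasons.append(f"{name}={value} below minimum {minimum}")
    if requires_approval and not approved:
        reasons.append("awaiting human approval")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision makes it easier to document why a model version was held back.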

Challenges

Transform data into universal format
Automatic feature labelling
Identify which data source augments features
Identify how data preparation affects model quality
Unclear data anomaly alerts
Correct & reasonably strict alerts
Automatic versioning methods for data sets and associated model
Store information on data governance
Elastic scaling to provide right amount of hardware
Adapt to horizontal or vertical scaling
Efficiently distribute resources

Discussion

Pipeline may vary due to resource constraints, difficulties in specific implementations and cost/benefit analysis
Execution order of tasks depends on context, such as organizational policies
TFX combines components to enable a flexible and customizable pipeline
TFX components mapped to proposed framework in Figure 6

Threats to validity

Conclusion

Main goal was to identify and compare five terms related to pipelines for continuous development of AI
Nine semi-structured interviews and 151 sources were used
Five terms identified: DevOps for AI, CI/CD for AI, MLOps, end-to-end lifecycle management, and CD4ML
Four stages of pipeline identified: Data Handling, Model Learning, Software Development, and System Operations
Challenges identified: data collection and integration, data transformation, data preparation, model quality assurance, versioning, packaging, deployment, monitoring, environment and infrastructure handling, flexibility, customizability, and fault tolerance
Future work planned to explore, evaluate, and compare pipeline concepts with available platforms
