Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Companies struggle to develop and deploy AI models to production systems.
- Continuous pipelines for AI are an active research area.
- This paper includes a Multivocal Literature Review and semi-structured interviews.
- Paper provides and compares terminologies for DevOps, CI/CD, MLOps, lifecycle management, and CD4ML.
- Paper provides a list of potential triggers for reiterating the pipeline.
- Paper presents a consolidated pipeline with four stages: Data Handling, Model Learning, Software Development, and System Operations.
- Paper maps challenges regarding pipeline implementation, adaption, and usage for the continuous development of AI.
Paper Content
Introduction
- Increase in data and computing power allows exploring AI
- ML and DL enable new intelligent products and services
- Quality assurance needed to guarantee safe and reliable behaviour
- Non-determinism leads to uncertainty
- Automated end-to-end CI/CD pipelines needed for AI development
Background and related work
Continuous software engineering and devops
- Continuous software engineering is a continuous development lifecycle
- CI focuses on integrating code changes and ensuring software quality
- CD delivers tested software features to a staging or test environment
- Continuous Deployment deploys software to production environment
- CI/CD for AI automates deployment process for AI models
- DevOps is a continuous software development approach
- MLOps expands DevOps for ML based applications
- End-to-end lifecycle management handles tasks for continuous development of AI
- CD4ML is the technical implementation of MLOps concept
Related work
- Published works focus on pipelines for continuous development of AI
- Table 1 lists related studies with similar methodology
- Related studies have limited primary sources
- Focus on specific application contexts
- Different research questions investigated
- Karamitsos et al. focus on traditional DevOps principles
- John et al. propose MLOps maturity model
- Lwakatare et al. propose levels of automation
- Fredriksson et al. focus on approaches to label data
- Testi et al. summarize different types of MLOps pipelines
- Kolltveit et al. focus on operationalizing the model
- Kreuzberger et al. cover principles and professions
- Lorenzoni et al. identify software engineering processes
- Xie et al. focus on characteristics of the lifecycle
- DataOps addressed by Rodriguez et al., Munappy et al., and Ereth
Triggers:
- No information on triggers of pipeline for continuous development in related SLRs and MLRs
- Challenges related to development of AI in general, not pipeline itself
- Organizational, ML system, and operational challenges for adopting MLOps
- Non-technical challenges such as expectation management, trust, transparency
- General challenges regarding ML system development, such as data management, modelling and operationalisation
Methodology
- Used three research methods: MLR, taxonomy creation strategy, and qualitative analysis
- MLR proposed by Garousi et al., taxonomy development method based on Usman et al., and interviews’ deductive category definition based on Mayring
Mlr
- MLR was executed to provide overview of pipelines for continuous development of AI
- Grey literature is essential because using a lifecycle pipeline for AI is an emerging research topic
- 37 formal sources were derived as start data set
- 79 papers were derived between 15th of April and 30th of May 2021
- 151 relevant sources were identified, 79 formal and 72 informal
- Descriptive qualitative synthesis was executed to extract necessary categories
- Open and axial coding was used to break down, examine, compare, conceptualize, and categorize information
Taxonomy creation
- Categorized data based on revised taxonomy creation strategy proposed by Usman et al.
- Defined units of classes/categories based on DevOps phases
- Used qualitative approach to add extracted information to respective class/category
- Used facet approach for classification structure type to easily adapt taxonomy if further research is done
- Identified facets: Data Handling, Model Learning, Software Development and System Operations
Qualitative analysis
- Qualitative approach used to check derived information from literature
- Interviews conducted to explore individual experiences
- Stratified sampling technique used to select interview participants
- Nine participants identified
- Interviews conducted between 30th of July and 10th of September 2021
- Qualitative content analysis used to analyse interviews
Results
- Different terminologies are elaborated
- Triggers to start/restart the pipeline are identified
- Taxonomy is created to describe the pipeline for continuous development of AI
- Challenges regarding implementation, adaption and usage of pipelines are explored
Terminologies
- DevOps for AI, CI/CD for AI, MLOps, (End-to-End) Lifecycle Management, and CD4ML are terms related to automation for the continuous development and improvement of AI models.
- MLOps is an extension of DevOps.
- MLOps takes into consideration the company’s culture and how cross-functional teams collaborate.
- Continuous Training (CT) is a new practice that uses collected feedback and production data.
- MLOps can be divided into three different levels of maturity.
- (End-to-End) Lifecycle Management uses concepts from software development to cope with many model iterations.
- CD4ML is a software engineering approach that produces machine learning applications.
Triggers
- Four trigger types discussed: feedback and alert systems, orchestration service and schedule, repository, and other triggers
- AI models need to be iteratively adapted and retrained
- Context-specific triggers depending on AI model, business requirements, and retraining strategies
- Triggers may take into consideration optimal threshold for benefits of updated model
- Trial and error process to minimize resource consumption
- Triggers combining different approaches may be feasible
- Collected feedback and alerts may be used to trigger pipeline
- Data events may be potential trigger
- Monitoring system monitors and collects data from production to trigger pipeline
- Data drifts occur when distribution of data set changes
- Data updates may happen due to schema updates
- Triggers may occur periodically or when specific threshold is attained
- Model updates may be triggered due to deterioration of model’s performance
- Event streaming platforms may be used to semi-automate triggering process
- Repository updates used as traditional triggers
Data handling
- Data Handling covers the end-to-end lifecycle of data curation
- Quality of AI model depends on data availability, quality, and preprocessing techniques
- Data pipeline manipulates initial data via tasks such as preprocessing, testing, versioning, and documentation
- 7% of total execution time can be reduced due to parallelized computing procedures
- Data collection includes data injection, preparation, labelling and feature extraction
- Data preparation depends on type of data
- Labelling is necessary for supervised learning
- Feature extraction or feature engineering is necessary
- Data and feature validation validates batches of data
- Data quality can be measured via 6 dimensions
- Automatic feasibility analysis can be used to identify if data is sufficient
- Unit tests verify data ingestion
- Data needs to be stored and versioned
- Documentation includes guidelines with concrete actions
Model learning
- Task allows comparing models to find the best one
- Standard Keras metrics calculate accuracy, precision, recall
- Pipeline should support decisions on model design, selection of components, algorithm, feature selection, hyper-parameter setting, data set split
- AutoML takes over design decisions
- Pipeline implementation depends on architecture, distribution, amount of computation resources, context-specific algorithms
- TFX and MLFlow provide necessary implementations or support for AI libraries
- Pipeline should provide faultless, reliable and secure software
- Fail-safe measures need to be introduced via tests
- Test data sets used more often can lead to overfitting
- Metrics used to test for overfitting
- Automated tests not always possible, experts validate model decisions
- Goal is to optimize specific metric throughout pipeline’s lifecycles
- Pipeline compares statistical evaluation metrics with metrics from previous model versions
- Traditional quality metrics used, such as F1 scores, precision and accuracy
- Learnability metric adapted for using it during CI/CD of AI models
- Ethics-by-design metric for model quality assurance in continuous development pipeline
- Feature explainability used to identify how much each feature contributes to final prediction
- Checks for bias not included in pipeline’s model quality assurance due to non-critical domains
- Pipeline-specific model improvement techniques not identified
- Model compression, pruning, hardening, hyper-parameter optimization used
- Metadata necessary to govern modelling lifecycle
- Model versioning captures version model artifacts and model dependencies
- Model versions stored due to resource restraints or lack of interest
- Alternatives to standard version control systems used
- Model documentation necessary to receive certifications
Software development
- TFX does not handle software development-specific tasks
- ML Metadata stores properties of trained models and data sets
- TFX versions artifacts produced by components
- Automated software-level quality assurance tests check correct behaviour of system
- Automated stress and robustness tests measure operational metrics
- Packaged models and quality assurance results stored
- Pipeline and tasks versioned
- Documentation of development stage proposed
System operations
- System Operations handles deployment and monitoring of AI model
- 4 criteria must be fulfilled before deployment
- Pipeline identifies best model, but research gap on how to compare model versions fairly
- Model must fulfill user-defined deployment criteria
- Human involvement may be required
- Model deployed on different environments
- Continuous experimentation allows gathering user feedback
- Monitoring systems collect input and output data
- Monitor model performance and traditional software monitoring aspects
- Collect data to identify impact on business outcomes
- AI applications may be deployed in different environments
Challenges
- Transform data into universal format
- Automatic feature labelling
- Identify which data source augments features
- Identify how data preparation effects model quality
- Unclear data anomaly alerts
- Correct & reasonably strict alerts
- Automatic versioning methods for data sets and associated model
- Store information on data governance
- Elastic scaling to provide right amount of hardware
- Adapt to horizontal or vertical scaling
- Efficiently distribute resources
Discussion
- Pipeline may vary due to resource constraints, difficulties in specific implementations and cost/benefit analysis
- Execution order of tasks depends on context, such as organizational policies
- TFX combines components to enable a flexible and customizable pipeline
- TFX components mapped to proposed framework in Figure 6
Threats to validity
Conclusion
- Main goal was to identify and compare five terms related to pipelines for continuous development of AI
- Nine semi-structured interviews and 151 sources were used
- Five terms identified: DevOps for AI, CI/CD for AI, MLOps, end-to-end lifecycle management, and CD4ML
- Four stages of pipeline identified: Data Handling, Model Learning, Software Development, and System Operations
- Challenges identified: data collection and integration, data transformation, data preparation, model quality assurance, versioning, packaging, deployment, monitoring, environment and infrastructure handling, flexibility, customizability, and fault tolerance
- Future work planned to explore, evaluate, and compare pipeline concepts with available platforms