How to Plan for LLM deployment

7 min readSep 19, 2023

The field of LLM (Language Model) has gained significant prominence in recent times. In my previous post, I presented a tutorial on creating a basic Q&A chatbot tailored to specific data domains. Nevertheless, when it comes to deploying such an LLM application in a production environment, numerous factors warrant careful consideration. In this article, I aim to provide an outline for planning the deployment of an LLM application. While it is important to note that there may not exist a definitive solution for every aspect, the following points serve as a valuable starting point for engaging in further discussions.

Objective

In the initial stages, it is crucial to establish a clear objective for the application. This involves identifying

the specific use cases and functionalities required,
the primary benefit that the application aims to achieve. Is the focus on cost savings, process automation, or revenue generation?

Ideally, quantifying the benefit would be advantageous as it enables a comprehensive evaluation of whether the potential gains outweigh the associated risks.

Risk management

From an application standpoint, the LLM model can be likened to a black box, as the internals is beyond our direct control. It is important, therefore, to prioritize risk management when deploying any model. Despite its potential flaws, acknowledging and addressing the inherent uncertainties within the model is vital to ensure a successful deployment process.

An important initial consideration is whether the application is intended for customer-facing purposes or solely for internal use. Due to the inherent non-deterministic nature of LLM, employing it to directly respond to customer queries entails a high level of risk. To mitigate this risk, incorporating a human review process as the final step can help prevent undesired responses. In scenarios involving LLM agent automation, human intervention at the final stage can serve as a safeguard against catastrophic actions. However, it is essential to note that relying on human intervention also imposes limitations on the scalability of the application.
Impact analysis: Assessing the potential financial or reputation loss in the event of model failure, and devising effective mitigation strategies, is of utmost importance. It is crucial to consider various failure cases that the LLM may encounter, including:

(i) Provision of irrelevant or inaccurate information,

(ii) Generation of biased, discriminatory, or offensive responses,

(iii) Unintentional leakage of confidential information.

3. Model Testing Risk:

(i) Due to the inherent variability of LLM responses, verifying testing results can pose a challenge. One approach to address this issue involves conducting multiple iterations of each test case to derive a sample mean of accuracy, along with a confidence interval. While this methodology helps mitigate uncertainties, it is important to acknowledge that absolute certainty cannot be guaranteed. Ultimately, as the application owner, one must accept a residual level of risk. Nevertheless, diligent efforts can be made to minimize this risk to the greatest extent possible.

(ii) Given the free-form and unstructured nature of input and output in LLM applications, human judgment may be required to ascertain the correctness and expectedness of outcomes. Consequently, this reliance on human judgment imposes limitations on the scalability of test cases, as well as the feasibility of encompassing all possible scenarios within the scope of testing. In such situations, leveraging public benchmark datasets can prove beneficial in constructing comprehensive test cases. By utilizing these datasets, the testing process can be augmented, enabling a broader coverage of scenarios and facilitating more robust evaluation of the LLM application.

Data source

When it comes to ML models, data serves as the bedrock upon which they are built. Consequently, carefully considering the data source for the LLM application is of utmost importance. The quality, relevance, and reliability of the data directly impact the performance and effectiveness of the model. It is crucial to identify and select appropriate data sources that align with the specific objectives and requirements of the LLM application.

Data classification: It is essential to assess whether the dataset contains sensitive or confidential information. In such cases, determining the necessity for data masking techniques becomes crucial to ensure data privacy and compliance with security protocols.
Data Pipeline: A well-defined data pipeline is necessary to facilitate the acquisition of data. Considerations include determining the data source, integrating with the system housing the data, and establishing a robust process for data cleaning and preprocessing. This involves eliminating noise, duplicates, and irrelevant information to enhance the quality and relevance of the dataset.
Frequency: Understanding the frequency at which data is received is vital. Data can be obtained either in real-time or in batch form, and this distinction impacts the design and implementation of the data processing pipeline.
Data size: Evaluating the size of the dataset and its physical location is essential. This assessment aids in estimating the network bandwidth required and the transfer time necessary for efficient data handling and processing.
Data labeling: Fine-tuning a pre-trained model necessitates a labeled dataset. Therefore, careful consideration must be given to how each data item will be labeled. Establishing effective and consistent labeling methodologies is crucial to ensure the accuracy and reliability of the training process.

Design

During the application design phase of an LLM application, several crucial considerations must be taken into account:

Selecting a suitable pre-trained LLM model: With numerous pre-trained LLM models available in the public domain, such as those offered by Hugging Face, the task is to carefully choose the most appropriate model that aligns with the specific requirements and objectives of the application.
Determining when to fine-tune a pre-trained model or utilize prompt engineering: It is essential to evaluate whether fine-tuning a pre-trained model or employing prompt engineering techniques would yield optimal results for the given application context.
Configuring parameter values: Decisions regarding parameter values, such as the model temperature, need careful consideration as they directly impact the behavior and output of the LLM model.
Selecting an appropriate embedding model: The choice of an embedding model plays a vital role in representing the underlying semantic information of the data accurately. Careful evaluation and comparison should guide the selection process.
Identifying an optimal Vector Database for storing document embeddings: The selection of a suitable Vector Database to store document embeddings should be based on performance benchmarks and considerations such as scalability, retrieval speed, and compatibility with the overall system architecture.

All of the aforementioned decisions should be grounded in thorough model evaluation and performance testing to ensure that the chosen approaches align with the desired outcomes and demonstrate satisfactory performance levels.

Model evaluation

Evaluating LLM or GAI models using traditional metrics like accuracy, precision, or recall is inadequate. Assessing the relevance and correctness of responses often requires subjective human judgment. This reliance on human assessment significantly limits the scale of test cases and introduces higher risks into the evaluation process.
Additional test cases should be incorporated to proactively mitigate the potential generation of discriminatory or offensive responses and the inadvertent disclosure of confidential information. These test cases should focus on identifying and addressing vulnerabilities related to these sensitive aspects.
The adoption of a Red-Blue approach can prove beneficial in managing risks associated with LLM or GAI models. In this approach, the Red team assumes the responsibility of injecting boundary cases and undesirable queries to challenge the system and attempt to elicit undesirable responses or data leakage. Conversely, the Blue team is tasked with mitigating these risks by implementing robust safeguards and measures to ensure the system’s resilience and integrity. This collaborative approach helps identify and address vulnerabilities through rigorous testing and proactive risk management.

Production Deployment

Estimating infrastructure costs: In addition to the resources needed for running the model inference API and Vector DB, it is crucial to consider the potential requirement for additional hardware when conducting model fine-tuning. This includes factoring in the associated costs for acquiring and maintaining the necessary hardware resources.
Tracking model training and testing: Implementing a comprehensive tracking system is essential to monitor and record key information during model training and testing. This includes documenting hyper-parameters for each test, as well as capturing relevant metrics to assess the performance and effectiveness of the training and testing processes.
Model repository: Establishing a centralized model repository is highly recommended. This repository serves as a centralized hub for storing and organizing model artifacts. It facilitates easy tracing of the training and testing results of specific models, as well as enables efficient model fallback when required.
Deployment pipeline: To ensure enhanced control and governance over the model inference code, it is advisable to develop it within a standardized framework and template, such as Mlflow. Deploying the code through a Helm chart onto a Kubernetes cluster as an immutable image further enhances stability and reproducibility. This deployment pipeline approach promotes streamlined and reliable deployment processes, facilitating efficient management of the model inference system.

Model monitoring and feedback

Post-production rollout, continuous monitoring of the model is imperative to ensure its ongoing performance and effectiveness.

User feedback mechanism: Implementing a user feedback mechanism is crucial to promptly identify any undesired outcomes or issues encountered by users. This allows for timely alerts and facilitates swift action to address any concerns.
Regular review: Logging and reviewing user inputs and corresponding model responses should be a routine practice. Due to potentially high traffic volumes, random samples can be taken for review purposes. By computing accuracy as the sample mean with a confidence interval, any performance deterioration can be promptly detected and flagged.
Auto-monitoring: Employing an additional NLP classification model can aid in automatically detecting responses containing discriminative or offensive language. Additionally, scanning responses for sensitive information leakage helps maintain data privacy and safeguard against inadvertent disclosures.
Error analysis: In the event of discovering undesired behaviors or issues, conducting thorough error analysis becomes crucial. This involves analyzing the root causes and identifying any patterns that contribute to the observed anomalies. This process enables the identification of areas that require improvement and informs subsequent actions.
Versioning and A/B testing: When introducing major upgrades to the LLM model, performing A/B testing is advisable. This comparative analysis allows for evaluating the performance and impact of different model versions or configurations. By carefully assessing the results of A/B testing, informed decisions can be made regarding the selection and deployment of model versions or configurations that yield the best outcomes.

How to Plan for LLM deployment

Objective

Risk management

Data source

Design

Model evaluation

Model monitoring and feedback

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Matthew Leung

No responses yet