LLM10: Model Theft
This entry refers to unauthorized access to and exfiltration of LLM models by malicious actors or advanced persistent threats (APTs). It arises when proprietary LLM models, which are valuable intellectual property, are compromised, physically stolen, or copied, or when their weights and parameters are extracted to create a functional equivalent. The impact of LLM model theft can include economic and brand reputation loss, erosion of competitive advantage, unauthorized usage of the model, and unauthorized access to sensitive information contained within the model.
The theft of LLMs represents a significant security concern as language models become increasingly powerful and prevalent. Organizations and researchers must prioritize robust security measures to protect their LLM models and ensure the confidentiality and integrity of their intellectual property. A comprehensive security framework that includes access controls, encryption, and continuous monitoring is crucial to mitigating the risks of LLM model theft and safeguarding the interests of the individuals and organizations that rely on LLMs.
Common Examples of Vulnerability
- An attacker exploits a vulnerability in a company's infrastructure to gain unauthorized access to their LLM model repository via misconfiguration in their network or application security settings.
- An insider threat scenario in which a disgruntled employee leaks a model or related artifacts.
- An attacker queries the model API using carefully crafted inputs and prompt injection techniques to collect a sufficient number of outputs to create a shadow model.
- A malicious attacker bypasses the LLM's input filtering techniques to perform a side-channel attack and ultimately exfiltrates model weights and architecture information to a remotely controlled resource.
- The attack vector for model extraction involves querying the LLM with a large number of prompts on a particular topic. The outputs from the LLM can then be used to fine-tune another model. However, there are a few things to note about this attack:
  - The attacker must generate a large number of targeted prompts. If the prompts are not specific enough, the outputs from the LLM will be useless.
  - The outputs from LLMs can sometimes contain hallucinated answers, so the attacker may not be able to extract the entire model because some of the outputs can be nonsensical.
  - It is not possible to replicate an LLM 100% through model extraction. However, the attacker will be able to replicate a partial model.
- The attack vector for functional model replication involves prompting the target model to generate synthetic training data (an approach called "self-instruct") and then using that data to fine-tune another foundational model into a functional equivalent. This bypasses the limitations of the traditional query-based extraction described in the previous example and has been used successfully in research on using an LLM to train another LLM. In the context of that research, model replication is not an attack, but an attacker could use the same approach to replicate a proprietary model that has a public API.
A stolen model can serve as a shadow model for staging adversarial attacks, including unauthorized access to sensitive information contained within the model, or for experimenting undetected with adversarial inputs to prepare more advanced prompt injections.
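The query-based extraction described in the examples above can be illustrated with a deliberately tiny stand-in. The sketch below is not an LLM and not an attack tool: `target_model` is a hypothetical two-input classifier that plays the role of the proprietary model, and the "shadow model" simply replays the output of the closest recorded query. All names and parameters are assumptions made for illustration. It only shows the general shape of the idea, namely that a surrogate built from recorded outputs can agree with the target on many inputs without being an exact copy.

```python
import random

def target_model(x, y):
    """Hypothetical proprietary model: callable by the attacker, but not inspectable."""
    return 1 if 0.8 * x - 0.5 * y + 0.1 > 0 else 0

def collect_outputs(n_queries, rng):
    """Query the target n_queries times and record (input, output) pairs."""
    pairs = []
    for _ in range(n_queries):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        pairs.append(((x, y), target_model(x, y)))
    return pairs

def shadow_model(pairs, x, y):
    """Surrogate: return the recorded output of the closest recorded query."""
    nearest = min(pairs, key=lambda p: (p[0][0] - x) ** 2 + (p[0][1] - y) ** 2)
    return nearest[1]

def agreement(pairs, n_test, rng):
    """Fraction of fresh inputs on which the shadow model matches the target."""
    hits = 0
    for _ in range(n_test):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        hits += shadow_model(pairs, x, y) == target_model(x, y)
    return hits / n_test

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for n in (5, 50, 500):
    pairs = collect_outputs(n, rng)
    print(n, "queries -> agreement", round(agreement(pairs, 1000, rng), 3))
```

In a toy like this, agreement generally improves as more queries are collected, while inputs near the decision boundary tend to remain mismatched. That mirrors the point above that extraction yields a partial replica, and it is why query volume is a useful signal for defenders to monitor.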
Example Attack Scenarios
- An attacker exploits a vulnerability in a company’s infrastructure to gain unauthorized access to their LLM model repository. The attacker proceeds to exfiltrate valuable LLM models and uses them to launch a competing language processing service or extract sensitive information, causing significant financial harm to the original company.
- A disgruntled employee leaks a model or related artifacts. The public exposure gives attackers the knowledge needed for gray-box adversarial attacks, or lets them steal the exposed property directly.
- An attacker queries the API with carefully selected inputs and collects a sufficient number of outputs to create a shadow model.
- A security control failure within the supply chain leads to leaks of proprietary model information.
- A malicious attacker bypasses the LLM's input filtering techniques and preambles to perform a side-channel attack and exfiltrates model information to a remote resource under their control.
How to Prevent
- Implement strong access controls (e.g., RBAC and the principle of least privilege) and strong authentication mechanisms to limit unauthorized access to LLM model repositories and training environments.
  - This is particularly relevant to the first three common examples, where insider threats, misconfiguration, and/or weak security controls around the infrastructure that houses LLM models, weights, and architecture could let a malicious actor infiltrate from inside or outside the environment.
- Supplier tracking and verification and the management of dependency vulnerabilities are important focus areas for preventing supply-chain attacks.
- Use a centralized ML Model Inventory or Registry for ML models used in production. A centralized model registry helps prevent unauthorized access to ML models through access controls, authentication, and monitoring/logging capabilities, which are good foundations for governance. A centralized repository is also useful for collecting data about the algorithms used by the models for compliance, risk assessment, and risk mitigation purposes.
- Restrict the LLM's access to network resources, internal services, and APIs.
  - This applies to all of the common examples because it covers insider risk and threats, and it ultimately controls what the LLM application “has access to”, so it can also serve as a step toward preventing side-channel attacks.
- Regularly monitor and audit access logs and activities related to LLM model repositories to detect and respond to any suspicious or unauthorized behavior promptly.
- Automate MLOps deployment with governance, tracking, and approval workflows to tighten access and deployment controls within the infrastructure.
- Implement controls and mitigation strategies to reduce the risk of prompt injection techniques causing side-channel attacks.
- Rate-limit API calls where applicable and/or apply filters to reduce the risk of data exfiltration from LLM applications, or implement techniques (e.g., DLP) to detect extraction activity from other monitoring systems.
- Implement adversarial robustness training to help detect extraction queries, and tighten physical security measures.
- Implement a watermarking framework in the embedding and detection stages of an LLM's lifecycle.
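As one possible illustration of the rate-limiting recommendation above, the following is a minimal sketch of a per-client token bucket. The class name `TokenBucket`, its parameters, and the `client_id` key are assumptions made for this example and are not taken from any particular API gateway or framework. A production deployment would more likely rely on the rate-limiting features of its gateway and combine them with the monitoring described above.

```python
import time

class TokenBucket:
    """Per-client token bucket: each API call spends one token."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained request rate
        self.clock = clock                          # injectable for testing
        self.buckets = {}                           # client_id -> (tokens, last_seen)

    def allow(self, client_id):
        """Return True if the call may proceed, False if it should be rejected."""
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last_seen = self.buckets.get(client_id, (self.capacity, now))
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = now - last_seen
        tokens = min(self.capacity, tokens + elapsed * self.refill_per_second)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False

# Hypothetical usage: allow bursts of 3 calls, then roughly one call every 2 seconds.
limiter = TokenBucket(capacity=3, refill_per_second=0.5)
for attempt in range(5):
    print("client-a call", attempt, "allowed:", limiter.allow("client-a"))
```

A token bucket tolerates short bursts from legitimate users while capping the sustained query rate, which raises the cost of collecting the large number of outputs that extraction requires. Rejected calls can also be logged per client as an input to extraction detection.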
Reference Links
- Meta’s powerful AI language model has leaked online: The Verge
- Runaway LLaMA- How Meta’s LLaMA NLP model leaked: DeepLearning.ai
- AML.TA0000 ML Model Access: MITRE ATLAS
- I Know What You See: Cornell University
- D-DAE: Defense-Penetrating Model Extraction Attacks: IEEE
- A Comprehensive Defense Framework Against Model Extraction Attacks: IEEE
- Alpaca: A Strong, Replicable Instruction-Following Model: Stanford University
- How Watermarking Can Help Mitigate The Potential Risks Of LLMs?: KD Nuggets