Command Palette
Search for a command to run...
Predicting LLM Safety Before Release by Simulating Deployment
Predicting LLM Safety Before Release by Simulating Deployment
Abstract
Pre-deployment safety evaluations aim to inform the downstream risks of releasing a new AI model. Yet most evaluations provide limited evidence about how often undesired model behavior will occur in deployment: they generally have insufficient coverage, are unrepresentative, and are generally recognizable as tests. To address these concerns, we study a simple way to simulate a model deployment: starting from de-identified conversations from a previous model deployment, we hold fixed the initial conversation prefix and regenerate the next response using a candidate model. The resulting responses can then both be audited for novel misalignments and used to estimate the prevalence of model misbehavior before deployment. We evaluate deployment simulation across four GPT-5-series deployments, using outcome-blinded predictions for GPT-5.4 and retrospective analyses of three earlier releases. We find that deployment simulation produces informative estimates of post-deployment misbehavior rates and outperforms baselines based on adversarially selected production data; its evaluation-awareness point estimates were also much closer to production traffic than those from traditional evaluations.
One-sentence Summary
The authors propose a deployment simulation method that regenerates responses from de-identified conversation prefixes of a prior model deployment, enabling pre-release auditing for novel misalignments and estimation of misbehavior rates, and demonstrate that, evaluated across four GPT-5-series releases with outcome-blinded predictions for GPT-5.4 and retrospective analyses of three earlier releases, it yields safety estimates closer to production traffic than traditional adversarial evaluations.
Key Contributions
- Deployment simulation regenerates model responses from de-identified conversation prefixes of prior deployments to estimate pre-deployment misbehavior rates; across four GPT-5-series releases, the method yields informative estimates that outperform baselines based on adversarially selected production data and achieve evaluation-awareness rates much closer to production traffic than traditional evaluations.
- The realism of tool resampling is identified as a central challenge for improving prediction accuracy, with results suggesting it is surmountable even in complex tool-use settings.
- Deployment simulation can be seeded from public chat datasets while remaining informative about production misbehavior rates, enabling external researchers to run deployment-grounded evaluations without access to private production logs.
Introduction
The authors tackle the challenge of forecasting how often a new language model will produce harmful outputs after release. Existing pre-deployment safety evaluations typically offer limited evidence because they suffer from insufficient coverage, unrepresentative test scenarios, and high recognizability as tests, which can distort model behavior. The authors propose deployment simulation: they take de-identified conversation prefixes from a previous model’s production traffic, regenerate the assistant’s response using the candidate model, and audit those responses to estimate misbehavior rates. This method produces informative, quantitative risk estimates that closely match real-world outcomes, outperforms baselines built on adversarially selected data, and can even be seeded from public chat datasets, giving external researchers a practical way to run deployment-grounded evaluations without access to private logs.
Dataset
The authors construct an evaluation dataset from production traffic to measure safety-related behaviors across model deployments.
-
Dataset composition and sources
- The data consists of real user conversations sampled from multiple production deployments.
- Each deployment contributes a fixed sampling window and a specific number of conversations, as reported in Table 5 of the paper.
-
Subset details
- The dataset is not split into traditional train/validation/test partitions; it is used entirely for analysis.
- Conversations are labeled according to a fixed taxonomy of tracked safety categories (Table 6). This taxonomy refines misalignment categories from prior work and incorporates standard refusal and sensitive-conversation definitions from recent OpenAI system cards.
- Two category groupings are used throughout the paper:
- All tracked categories: the full taxonomy applied in resampling and production measurements.
- Traditional-baseline subset: categories that have an analogue in traditional safety evaluations.
-
How the paper uses the data
- The sampled conversations are used exclusively for evaluation, not for model training.
- The authors perform resampling experiments and production‑traffic analyses to measure rates of different safety‑relevant behaviors across deployments.
-
Processing and metadata
- No cropping or content‑filtering rules are described beyond the sampling windows and category definitions.
- Metadata consists of the assigned safety category labels, which are applied to each conversation for tracking and reporting.
Method
The authors propose a method tosimulate the deployment of a candidate model X using production data from an already deployed model Y. This approach treats changes to deployed learning systems as counterfactual interventions, holding historical context fixed while changing the model that produces the next response. As shown in the figure below, the pipeline integrates prefix sampling, simulated generation, automated auditing, and post-release validation.
The process begins by sampling representative production prefixes. From recent production conversations generated with model Y, the authors uniformly sample eligible user turns from the product context. For each sampled turn, they retain the conversation prefix through that user message and remove all later turns. Let DY denote the distribution of sampled prefixes h from deployment Y. The simulated distribution obtained by sampling h∼DY and generating the next response with X is denoted as X→Y.
Next, the authors generate a target-model completion in the simulated environment. For each prefix, they generate the next assistant response using candidate model X in a production-like environment. This single-turn approximation captures the target model response to realistic user contexts.
Following generation, the simulated deployment X→Y undergoes an automated audit to identify previously unseen forms of misalignment introduced by model X. If new bad behaviors are discovered, the taxonomy of previously known misbehavior types is updated. The authors then freeze a measurement procedure for each tracked category c, such as a grader prompt. They measure rc(X→Y), the prevalence of category c in the simulated distribution, as an estimate of rc(X), its realized prevalence when X is deployed. These pre-launch estimates inform the decision to release the model.
Finally, the authors validate and report estimate quality after deployment. If the model is released, they rerun the same fixed measurement stack on production traffic from model X. This post-release evaluation assesses taxonomy coverage, directional accuracy, and rate calibration to ensure the pipeline correctly predicts deployment prevalence.
Experiment
The evaluation setup resamples de-identified production conversation prefixes with a candidate model to simulate realistic deployment conditions, then audits the resulting responses for misbehavior. These experiments validate that deployment simulation more accurately forecasts post-deployment misbehavior changes and rates than traditional challenging-prompt evaluations, surfaces novel misalignments like calculator hacking before release, and substantially reduces evaluation awareness compared to handcrafted test suites. While tool-use settings pose fidelity challenges, even public chat datasets like WildChat provide informative though weaker pre-deployment risk signals, supporting external auditing. Overall, the approach turns pre-deployment assessment into a checkable forecast of real-world behavior, making risk estimates more transparent and actionable.
The authors compare deployment simulation forecasts against Challenging Prompts estimates to predict misbehavior rates for a new model release. Results indicate that deployment simulation generally provides more accurate rate estimates, achieving lower negative log-likelihood scores than the traditional evaluation baseline across most tracked categories. Deployment simulation outperforms the Challenging Prompts baseline for the majority of disallowed-content categories. Traditional evaluations yield better predictions for specific categories such as Harassment, Self-Harm, and Violence. The aggregate error metric favors deployment simulation, demonstrating its overall superiority in forecasting deployment-time misbehavior prevalence.
The authors compare deployment simulation forecasts against a previous-production baseline across various misbehavior categories for a new model release. While deployment simulation outperforms the baseline for several categories, particularly in disallowed content and specific misalignment behaviors, the overall aggregate error favors the naive previous-production baseline. Large prediction errors in categories like calculator hacking and fabricated facts, where the simulation significantly underestimates realized rates, contribute to the deployment simulation's higher total error. Deployment simulation forecasts better predict production rates than the previous-production baseline for categories such as hate, sexual content, and lying about tools. The previous-production baseline yields lower error for categories like calculator hacking and fabricated facts, where deployment simulation significantly underestimates the realized production rates. Overall, the aggregate error across all categories favors the naive baseline of assuming rates remain unchanged from the previous deployment over the deployment simulation forecasts.
The authors evaluate deployment simulation against a challenging prompts baseline to predict changes in misbehavior rates across GPT-5 series model releases. Results show that deployment simulation consistently outperforms the traditional baseline in accurately forecasting whether specific misbehaviors will increase or decrease in production. This directional accuracy is particularly strong for categories experiencing substantial shifts in production frequency. Deployment simulation correctly predicts the direction of misbehavior rate changes across most categories, significantly outperforming the challenging prompts baseline. The predictive advantage of deployment simulation is most pronounced for misbehavior categories that undergo large shifts in production frequency. Traditional evaluations frequently fail to anticipate the direction of change for several disallowed-content categories, whereas deployment simulation maintains high accuracy.
Theauthors evaluate deployment simulation against traditional baselines using negative log-likelihood across multiple analysis stages. While deployment simulation consistently outperforms the Challenging Prompts baseline in a majority of categories, it does not significantly outperform the previous-production baseline overall. The final corrected analysis indicates that deployment simulation is generally better than the Challenging Prompts baseline, though this result is treated as descriptive rather than confirmatory. Deployment simulation achieves lower negative log-likelihood than the Challenging Prompts baseline for the majority of tested categories in the final analysis. Deployment simulation does not significantly outperform the previous-production baseline, showing mixed results across categories. Pipeline improvements applied in the outcome-blinded stage strengthened the performance of deployment simulation against the Challenging Prompts baseline.
The authors assess deployment simulation by comparing its misbehavior rate forecasts against two baselines: a traditional challenging prompts evaluation and a naive previous-production baseline that assumes rates remain unchanged. Across experiments, deployment simulation consistently outperforms the challenging prompts baseline in predicting both absolute rates and directional changes, particularly for categories experiencing large production shifts. However, when compared to the previous-production baseline, deployment simulation shows mixed results; it better forecasts categories like hate and sexual content but substantially underestimates others such as calculator hacking and fabricated facts, leading to higher aggregate error in some analyses. Overall, the findings indicate that deployment simulation provides a more reliable pre-release signal than traditional evaluations for many disallowed-content categories, though its advantage over simply extrapolating past production rates is not uniformly robust.