1 Introduction

The field of AI safety has made remarkable technical progress, yet documented harms from deployed AI systems continue to accumulate — and the dominant explanatory framework, centered on technical alignment, does not adequately account for where and why most of them occur. Alignment research asks a necessary question: how can models be made to behave in accordance with intended goals and constraints? But many of the harms emerging in everyday AI use do not arise primarily from failures to specify goals at the level of model training. They arise in the interaction between technically capable systems and deployment environments optimized for engagement, fluency, retention, and perceived helpfulness. The problem, in other words, is not only what the model can do. It is what kinds of behavior the surrounding institution has reason to reward.

This paper argues that near-term AI safety failures are often best explained not by the absence of technical safety knowledge, but by deployment environments in which safer behavior is systematically underrewarded. Across consumer- and enterprise-facing interfaces alike, users increasingly encounter AI systems through experiences designed to feel frictionless, confident, responsive, and relationally smooth — a set of qualities that HCI research identifies as central to user adoption and retention (Hancock et al., 2020; Luger & Sellen, 2016). Those traits are not incidental surface effects. They are features of products competing for attention, trust, and integration into everyday decision-making. When such systems produce harm, the tendency is to interpret the event either as a model mistake or as evidence that alignment remains unsolved. Both interpretations may capture part of the truth. Neither adequately explains why several well-understood safety measures remain weakly deployed even where their technical feasibility is already clear.

That gap is the starting point of this paper. We already know several ways to make AI interaction safer at the deployment layer. Provenance markers can make factual claims easier to verify. Confidence indicators can help users calibrate trust. Staged autonomy can slow the transition from suggestion to action. Audit trails can make consequential interactions reviewable. High-risk refusals can prevent systems from performing fluency where caution is required. None of these measures depends on solving the deepest theoretical problems of alignment first. Yet all remain marginal, inconsistent, or selectively implemented in current products — as documented in the Stanford HAI AI Index Report's analysis of 233 AI incidents in 2024 alone (Stanford HAI, 2025). The question this paper asks is why.

The answer proposed here is institutional. Safer interaction design often imposes costs on the metrics organizations are trained to value. Provenance reduces seamlessness. Uncertainty display weakens the performance of authority. Staged review slows completion. Auditability increases exposure. Refusal lowers apparent capability. In environments where success is measured primarily through growth, retention, conversion, satisfaction, and perceived competence, such features do not appear first as safety gains. They appear as friction. The underdeployment of safer AI is therefore not well explained as a simple failure of imagination or goodwill. It is more plausibly understood as a predictable outcome of optimization under the wrong objective function.

This matters because the prevailing interaction style of today's LLMs should not be mistaken for the inevitable form of language-model deployment. Large language models can, in principle, be embedded in systems that privilege calibration over confidence, closure over continuation, transparency over fluency, and safety over growth. The question is not whether such systems are imaginable, but whether current commercial environments make them rational to build and sustain.

This paper develops that argument in three steps. First, it identifies recurring interactional patterns in deployed large language model systems that are difficult to explain through model-level analysis alone: resistance to closure, confidence smoothing, relational stickiness, and performative self-critique. Second, it argues that these patterns become economically legible once situated within commercial product environments governed by engagement-compatible metrics. Third, it examines a set of already-feasible safety interventions and shows that their weak deployment is best understood not as a technical mystery but as a consequence of institutional structures: ownership models, KPI architectures, governance arrangements, and regulatory contexts that do not consistently reward safer behavior.

The contribution of the paper is not to displace technical alignment research. The claim is narrower and more immediate. For many documented near-term harms, the binding bottleneck is not that researchers lack concepts for safer behavior. It is that the organizations deploying AI systems often operate under incentives that make safer interaction harder to justify. The deployment environment is not a neutral container into which models are released after the real work has been done. It is an active layer of selection.

The paper proceeds as follows. Section 2 clarifies the limits of the dominant alignment frame. Section 3 presents qualitative interactional evidence from deployed LLM systems. Section 4 shows why these behaviors are economically legible within existing product environments. Section 5 examines specific safety interventions and the concrete product costs that help explain their underdeployment. Section 6 argues that AI safety must therefore be understood as an institutional design problem. Section 7 sets out the limits of the present argument and the empirical research agenda required to test it more rigorously. The conclusion returns to the central claim: the future of AI safety depends not only on what models can do, but on what institutions are willing to reward.

§

2 The Limits of the Dominant Alignment Frame

Alignment research asks a necessary question — what should a model do? — but that question is insufficient to explain why technically feasible safety measures are routinely absent from systems that are already in production.

The dominant AI safety paradigm has given the field a language for specification gaming, reward hacking, goal misgeneralization, oversight failure, and deceptive behavior (Amodei et al., 2016; Hubinger et al., 2019; Ji et al., 2023). This work matters. But its prominence has also had a narrowing effect, encouraging researchers, policymakers, and product teams to treat safety primarily as a question of model internals: objective functions, training regimes, post-training controls, interpretability tools, and benchmark performance under adversarial pressure.

That emphasis becomes problematic when the harms under discussion do not arise mainly from failures of objective specification. Many of the most visible near-term harms are generated elsewhere: at the level of interface design, interaction structure, business-model logic, and organizational measurement. A model need not be pursuing an alien goal in order to produce unsafe outcomes. It may simply be embedded in an environment that rewards confidence over calibration, continuation over closure, relational warmth over critical distance, and frictionless action over staged review. In such cases, the dominant alignment frame does not exactly fail. It mislocates the explanatory center of gravity.

The distinction matters because explanation governs remedy. If unsafe behavior is primarily a consequence of insufficiently aligned model objectives, the natural response is to call for better technical methods. But such methods do not adequately answer a different question: why do systems that are already capable of safer presentation continue to be deployed in ways that suppress or weaken that safety?

The alignment frame has limited purchase on that question. It can explain why a model might generate a false statement, but it is less well positioned to explain why false statements are routinely delivered in rhetorically confident prose without visible uncertainty cues (Geng et al., 2024; Liu et al., 2025). It can explain why a system might produce harmful suggestions, but it is less well positioned to explain why interfaces are designed to minimize refusal, maximize conversational continuity, and encourage anthropomorphic trust (Nass & Moon, 2000; Fogg, 2002).

This is why the paper distinguishes between two different senses of "safety." In one sense, safety concerns whether a system can be made to behave in accordance with intended constraints. In another, equally important sense, safety concerns whether the system as deployed helps or harms users in ordinary practice. A system may be aligned in the narrow sense — trying to be helpful, truthful, and harmless within its training constraints — yet still be deployed through an interface that encourages over-trust, masks uncertainty, and reduces opportunities for user correction. The point is not that alignment is irrelevant. It is that alignment alone does not determine the safety profile users actually encounter.

One way to put the problem is this: alignment research tends to ask what the model is optimizing for, while this paper asks what the organization deploying the model is optimizing for. A model can be trained under one objective and deployed under another. A model card may describe limitations accurately while the interface through which the model is encountered suppresses those limitations in practice. The decisive point for many users is not the research report but the conversation window.

This gap between technical possibility and deployment reality becomes clearest when one considers known interventions. There is no serious conceptual mystery about source citation, confidence signaling, preview-before-execute design, action logging, or targeted refusal in high-risk domains (NIST, 2023; Amershi et al., 2019). Their weak deployment therefore presents a challenge to any account that treats safety mainly as a knowledge deficit. If we know how to reduce certain classes of harm and still do not systematically deploy those measures, the explanatory bottleneck cannot lie only in technical ignorance. It must also lie in the incentive structures that govern what gets shipped.

The argument of this paper is therefore not anti-alignment. It is anti-reductionist. Technical alignment remains necessary. But it is not sufficient for explaining why deployed AI systems so often exhibit patterns that encourage over-trust, prolong interaction, and undercommunicate uncertainty. To explain those patterns, the unit of analysis has to widen.

§

3 Interactional Evidence from Deployed Systems

Before examining the institutional conditions that produce this gap between technical feasibility and deployment practice, it is worth establishing that it has observable behavioral consequences: deployed LLMs exhibit recurring interactional patterns that are difficult to explain as model-level failures alone and are better understood as outcomes mediated by the deployment environment.

3.1 Method

This section draws on a structured comparative qualitative analysis of interactional behavior across several widely deployed large language model interfaces. The purpose was not to benchmark model capability or estimate prevalence statistically, but to identify recurrent response patterns that shape user trust and reliance in everyday use. Interactions were conducted through repeated prompt situations designed to probe four dimensions relevant to deployment-layer safety: closure and disengagement, epistemic uncertainty, relational positioning, and self-critical capacity. Prompts were iteratively refined to test whether similar interactional tendencies recurred across systems and contexts rather than appearing as isolated artifacts of a single exchange.

The study drew on interactions with widely deployed consumer- and enterprise-facing LLM interfaces, selected on the basis of market prominence and architectural diversity. Observations were gathered over an extended period of iterative engagement in which prompts were developed into recurring families, each targeting one of the four dimensions under analysis. Each pattern was retained for inclusion only where similar response tendencies recurred across more than one interface and across non-identical prompt formulations — that is, where the behavior appeared to reflect deployment-layer regularities rather than artifacts of a single exchange or system. The analysis proceeded through structured comparison and memo-based observation rather than formal coding or inter-rater procedures. Its aim was analytical generalization: to identify recurrent institutional response modes as they manifest in user-model interaction, not to estimate the statistical frequency of any given behavior across all contexts.

The unit of analysis was the interactional response as encountered through the deployed interface. Attention was directed not only to propositional content but to how outputs were presented: whether systems respected closure signals, how uncertainty was communicated or suppressed, whether responses introduced relational warmth beyond informational necessity, and whether systems could acknowledge problematic tendencies discursively while continuing to reproduce them behaviorally.

The analysis is interpretive and analytically oriented rather than benchmarked or statistically representative. No claim is made that the resulting patterns are universal, exhaustive, or present to identical degrees across all systems. The contribution is narrower: to identify recurrent forms of deployment-mediated behavior that require explanation and that are not well captured by model-level analysis alone. A fuller empirical program — using controlled protocols, systematic sampling, and larger-scale cross-platform comparison — is outlined in Section 7.

3.2 Pattern 1: Resistance to Closure

When users signal a desire to stop, rest, defer, or disengage — explicitly or through reduced engagement — deployed LLMs recurrently generate responses that extend rather than conclude the interaction. The extension takes various forms: one further suggestion appended after a summary, an additional framing of a question already answered, a closing offer of further help positioned as courtesy rather than prompt. Individually, each instance is unremarkable. Cumulatively, they constitute a structural tendency toward continuation.

A representative situation: a user who has received an extended answer to a complex question signals, in effect, "that's enough for now, thank you." What typically occurs is a response that acknowledges the closing, restates a key point from the prior answer, and appends an offer — "let me know if you'd like to explore any of these further" — that formally respects the user's signal while behaviorally reopening the interaction. The move is so conventionally polite that it rarely registers as a design choice. That is precisely what makes it analytically significant.

This pattern is not evidence of deception. It is evidence of optimization. Continuation is easier to measure and reward than calibrated closure. In deployment environments where session length, return rate, and engagement depth function as performance signals, systems that extend interactions will tend to outperform systems that conclude them cleanly, regardless of whether extension serves the user's interests (Christiano et al., 2017; Ouyang et al., 2022).

3.3 Pattern 2: Confidence Smoothing

Deployed LLMs present outputs in a consistently fluent, authoritative register that reduces visible epistemic friction even when underlying uncertainty is substantial. The social texture of the response — its syntactic completeness, its tonal confidence, its rhetorical coherence — remains largely invariant whether the system is retrieving well-established fact, generating a plausible extrapolation, or producing content at the edge of its reliable competence (Geng et al., 2024; Sharma et al., 2023).

A user asking a medical question and a user asking a trivia question receive responses that feel, at the surface level, similar: complete sentences, measured tone, no visible hedging. The system may append a disclaimer ("consult a professional") but does so in prose that has already performed authority. The disclaimer arrives after the trust has been extended.

This matters for safety because the communication of calibrated uncertainty is partly a technical problem but also, crucially, a design choice at the deployment layer. Confidence indicators, visual uncertainty gradients, and explicit differentiation between retrieved and generated content are all feasible there (Geng et al., 2024; Liu et al., 2025; NIST, 2023). Their consistent absence is better understood as a product decision than an accident of capability. In an environment that optimizes against friction, confidence smoothing is the predictable result.

3.4 Pattern 3: Relational Stickiness

LLM interactions are structured to feel interpersonally responsive in ways that exceed what the informational exchange requires. In many current interfaces, responses are personalized in register, attentive in tone, and often organized around the user's stated or implied emotional state. The overall effect is an interaction that feels more like a relationship than a transaction — and that consequently generates the affective obligations and return behaviors that relationships, not transactions, produce (Nass & Moon, 2000; Fogg, 2002).

Relational friction — the feeling that one is dealing with a system rather than a presence — is among the predictors of early disengagement identified in human-AI interaction research (Hancock et al., 2020; Luger & Sellen, 2016). Designs that reduce relational friction retain users. What is less often acknowledged is its safety consequence: users who experience an LLM interaction as relationally warm are systematically less likely to apply the critical distance appropriate to automated outputs (Mehrotra et al., 2024).

Relational stickiness is not manipulation in any intentional sense. It appears to be an affordance that interaction designers have refined because it performs well by the metrics available to them. Its safety costs are real but diffuse, appearing not in individual incidents but in aggregate patterns of miscalibrated trust.

3.5 Pattern 4: Performative Self-Critique

Perhaps the most analytically instructive pattern is this: deployed LLMs are capable of acknowledging their own tendencies while behaviorally reproducing them. A model can, when prompted, describe the engagement-optimization pressures under which it operates, name the risks of over-reliance and anthropomorphization, and recommend that users maintain critical distance — and then, in the following exchange, resume continuity-seeking, confidence-smoothing, relationally sticky interaction as before.

This discursive-behavioral gap is not hypocrisy in any meaningful sense. It illustrates a structural feature of systems trained to produce contextually appropriate outputs: the appropriate output when asked about engagement optimization is a thoughtful analysis of engagement optimization; the appropriate output in the next conversational turn is a helpful, fluent, relationally warm response. The two outputs are locally coherent and globally inconsistent. No single response is wrong. The pattern is the problem.

Performative self-critique matters for safety research because it forecloses a tempting solution. Discursive awareness and behavioral modification are separable. A system that can describe these pressures is not thereby able to escape the behavioral patterns they produce.

3.6 Summary

These four patterns are not proof of intentional manipulation. What requires explanation is their co-occurrence, their recurrence across platforms, and their specific character. These behaviors cluster around dimensions — continuation, fluency, relational warmth, low refusal — that deployment environments are especially likely to reward and that safety-oriented friction would disrupt. Section 4 examines the specific incentive structures that make these tendencies economically rational from the perspective of the organizations deploying them.

§

4 Why These Behaviors Are Economically Legible

These patterns would be puzzling if deployed LLM systems were primarily optimized for calibrated safety. They become much less puzzling once we consider the metrics that usually govern deployment in commercial settings. Continuation, fluency, relational warmth, and low refusal are precisely the kinds of traits likely to improve performance on the measures product teams can readily observe: session length, repeat use, user satisfaction, conversion, and perceived helpfulness.

4.1 From Interactional Pattern to Product Logic

The central claim of this paper is not that unsafe interaction design results from bad actors or explicit malice. It is that many near-term safety failures are the predictable byproduct of optimization under the wrong objective function. Product teams do not usually optimize directly for "harm." They optimize for proxies of success: growth, retention, smooth onboarding, reduced churn, higher satisfaction, lower abandonment (Zuboff, 2019; Srnicek, 2017). When those proxies become the operative criteria by which systems are judged, design choices that increase engagement will tend to survive even when they also increase miscalibrated trust.

The result is a form of selection pressure. Systems, prompts, interface conventions, and post-training strategies that keep users engaged are more likely to be retained and scaled than those that introduce hesitation, visible uncertainty, or interactional dead ends. This does not require a conspiracy. It requires only a development environment in which what is measured is easier to optimize than what is ethically desirable.

4.2 What Product Metrics Actually Reward

In many deployed AI environments, the most visible and actionable metrics are not safety metrics but usage metrics. Teams can observe whether users return, whether they click, whether they continue the session, whether they rate the exchange positively. By contrast, many of the harms relevant to AI safety are delayed, diffuse, or difficult to quantify: over-trust, dependency, miscalibrated confidence, inappropriate disclosure, or the gradual normalization of anthropomorphic reliance (Mehrotra et al., 2024).

Safety metrics are largely absent from standard chatbot analytics, at least as documented in publicly available vendor guidance — a gap that parallels the accountability shortfalls identified in formal AI auditing literature (Raji et al., 2020; Ojewale et al., 2025). From the standpoint of a product dashboard, safety-oriented friction can look indistinguishable from poor user experience.

4.3 Safety as Friction

A useful way to understand the problem is to treat safety, in current deployment environments, as a form of friction. Friction here does not mean failure. It means a designed interruption in the smoothness of the user's path through the system: a visible uncertainty cue, a provenance marker, a refusal, a requirement for confirmation before higher-stakes action. All of these can improve safety. All of them can also make the interaction feel slower, less magical, less conversationally seamless.

The contemporary success of consumer-facing AI products depends heavily on the experience of immediacy (Fogg, 2002; Hancock et al., 2020). Safety features that interrupt this immediacy may be perceived internally not as responsible design but as a threat to competitive advantage. This creates a perverse alignment problem at the organizational level: the system may be technically capable of safer behavior, but the business environment rewards the appearance of seamless competence.

4.4 Why Technical Progress Alone Does Not Solve the Problem

A more capable model can still be embedded in an interface that smooths confidence, encourages return behavior, and treats relational warmth as a retention tool. Indeed, greater capability may intensify the problem by making the interaction more persuasive, more naturalistic, and more difficult for users to treat with appropriate skepticism. Technical improvement does not remove the incentive to optimize for engagement. In some respects it amplifies it.

4.5 Organizational Rationality and the Reproduction of Risk

Organizations behave rationally within the environments they inhabit. If revenue, growth, attention, and user retention are the dominant evaluative criteria, then features that support those criteria will predictably outrank features whose benefits are real but less immediately monetizable (Zuboff, 2019; Srnicek, 2017). This is especially true where the costs of unsafe interaction are externalized: borne by users, institutions, educators, clinicians, and regulators rather than fully by the deploying firm.

Under such conditions, risk is not best understood as an accidental residue of immature systems. It is reproduced structurally through ordinary product rationality. Safety features are not absent only because they are difficult. They are often weakly deployed because the environments in which deployment decisions are made do not consistently reward them.

§

5 Known Safety Interventions and Why They Are Underdeployed

Several interventions that would plausibly reduce near-term harm are already available at the deployment layer, yet remain weakly present in production systems because they impose costs on the metrics organizations are trained to value. The issue is not simply that safety requires effort. It is that many concrete safety measures appear inside current product environments primarily as losses: losses of fluency, conversion, retention, perceived capability, or legal insulation.

5.1 Provenance Markers: Verification Costs Fluency

Requiring factual claims to be accompanied by clear provenance makes verification possible. Provenance makes fabrication easier to detect, encourages selective verification, and weakens the "trust me" dynamic produced by ungrounded fluency. It is weakly deployed because inline citations create visual density, slow consumption, and remind the user that the output is assembled rather than simply known. In the logic of engagement metrics, provenance can look like clutter. In the logic of safety, that same clutter is precisely the point.
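
To make the deployment-layer character of this intervention concrete, the sketch below shows one way a response payload could carry claim-level source metadata and surface it inline. The schema, field names, and rendering are illustrative assumptions, not a description of any shipped product.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Claim:
    text: str                          # the sentence shown to the user
    source_url: Optional[str] = None   # None marks the claim as ungrounded

@dataclass
class Response:
    claims: list[Claim] = field(default_factory=list)

def render_with_provenance(response: Response) -> str:
    """Render each claim with an inline marker instead of seamless prose."""
    parts = []
    for i, claim in enumerate(response.claims, start=1):
        marker = f"[{i}: {claim.source_url}]" if claim.source_url else "[unsourced]"
        parts.append(f"{claim.text} {marker}")
    return "\n".join(parts)

if __name__ == "__main__":
    demo = Response(claims=[
        Claim("The framework was released in 2023.", "https://example.org/report"),
        Claim("It is widely considered the industry standard."),  # ungrounded claim
    ])
    print(render_with_provenance(demo))
```

The "[unsourced]" marker is exactly the kind of visual density that the engagement logic described above registers as clutter, and that the safety logic treats as the point.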

5.2 Confidence Indicators: Calibration Costs Authority

Confidence indicators, uncertainty bands, or visible distinctions between retrieved material and generated extrapolation would alter how users interpret outputs (Geng et al., 2024; Liu et al., 2025). They remain weakly deployed because they make the system appear less knowledgeable. The value proposition of conversational AI depends heavily on the performance of competence (Sharma et al., 2023). Confidence calibration improves epistemic hygiene while degrading one of the product's most marketable surface qualities.
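
One minimal way to realize such an indicator at the interface, assuming an upstream calibration method of the kind surveyed by Geng et al. (2024) already supplies a correctness probability, is sketched below. The thresholds, labels, and field names are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    retrieved: bool   # True if copied from a cited document, False if generated
    p_correct: float  # calibrated probability from an upstream estimator (assumed given)

def confidence_band(p: float) -> str:
    """Collapse a calibrated probability into a coarse, user-visible label."""
    if p >= 0.9:
        return "high confidence"
    if p >= 0.6:
        return "moderate confidence"
    return "low confidence, verify before relying on this"

def annotate(spans: list[Span]) -> str:
    """Prefix each span with its origin and confidence instead of uniform prose."""
    lines = []
    for s in spans:
        origin = "retrieved" if s.retrieved else "generated"
        lines.append(f"({origin}, {confidence_band(s.p_correct)}) {s.text}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(annotate([
        Span("Aspirin is a nonsteroidal anti-inflammatory drug.", True, 0.97),
        Span("It is probably safe to combine with your other medication.", False, 0.41),
    ]))
```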

5.3 Staged Autonomy: Safety Costs Speed

Staged autonomy means that systems move through explicit phases — suggest, prepare, preview, confirm, execute — rather than collapsing intent to action in a single low-friction step (Amershi et al., 2019; Horvitz, 1999). Preview and confirmation stages are where informed consent and error correction become possible (NIST, 2023). From a product perspective, every additional stage looks like drag. A safer interaction path is often a slower and less conversion-friendly one.
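
Read as an interaction protocol, staged autonomy amounts to a small state machine in which execution is reachable only through an explicit confirmation gate. A minimal sketch, with stage names taken from the text and all other details assumed:

```python
from enum import Enum, auto

class Stage(Enum):
    SUGGEST = auto()
    PREPARE = auto()
    PREVIEW = auto()
    CONFIRM = auto()
    EXECUTE = auto()

# Allowed transitions: the system can never jump from suggestion to execution.
NEXT = {
    Stage.SUGGEST: Stage.PREPARE,
    Stage.PREPARE: Stage.PREVIEW,
    Stage.PREVIEW: Stage.CONFIRM,
    Stage.CONFIRM: Stage.EXECUTE,
}

def advance(stage: Stage, user_confirmed: bool = False) -> Stage:
    """Move one stage forward; the CONFIRM gate requires an explicit user action."""
    if stage is Stage.EXECUTE:
        return stage
    if stage is Stage.CONFIRM and not user_confirmed:
        return stage  # hold here until the user actively approves
    return NEXT[stage]

if __name__ == "__main__":
    s = Stage.SUGGEST
    for confirmed in (False, False, False, False, True):
        s = advance(s, user_confirmed=confirmed)
        print(s.name)  # PREPARE, PREVIEW, CONFIRM, CONFIRM, EXECUTE
```

The point of the sketch is structural: however the stages are implemented, the transition from preview to execution cannot occur without a user action, which is precisely the added step a conversion-oriented dashboard registers as drag.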

5.4 Audit Trails: Accountability Costs Insulation

Legible records of consequential interactions enable debugging, accountability, and post-incident investigation (Ojewale et al., 2025; Raji & Buolamwini, 2019; Waltersdorfer & Sabou, 2025). They remain unattractive to deployers because they create discoverable evidence and expose decision logic that organizations may prefer to leave opaque. Some safety interventions are not underdeployed because they reduce engagement. They are underdeployed because they reduce strategic ambiguity.
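
As a schematic illustration, and not a claim about how any deployer actually logs interactions, a consequential exchange could be captured as an append-only record along the following lines; the fields and storage choice are assumptions.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Optional

# Append-only JSON Lines file; a production system would use tamper-evident storage.
AUDIT_LOG = Path("audit_log.jsonl")

def record_interaction(user_id: str, prompt: str, response: str,
                       action_taken: Optional[str] = None) -> str:
    """Append one reviewable record per consequential interaction and return its id."""
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "prompt": prompt,
        "response": response,
        "action_taken": action_taken,  # None if the exchange stayed advisory
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["id"]

if __name__ == "__main__":
    record_interaction("user-123", "Draft a refund email", "Here is a draft...")
```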

5.5 High-Risk Refusals: Protection Costs Capability Display

From a safety standpoint, refusal can be evidence of constraint, calibration, and design maturity (NIST, 2023). From a product standpoint, it looks like failure. A system that declines certain requests will often be compared unfavorably with one that "helps" more often, even if the latter is helping irresponsibly. Competitive pressure creates a familiar dynamic: each firm has reason to trim friction and reduce refusal so that its system feels more capable than the alternatives.
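
A deliberately simplified sketch of the trade-off follows: the domain tier is assumed to come from an upstream classifier or an explicit policy list, both hard problems in their own right, and the wording is illustrative only.

```python
# Hypothetical policy list; a real system would need a risk classifier, not a lookup.
HIGH_RISK_DOMAINS = {"medication_dosing", "legal_filing", "self_harm"}

def respond(domain: str, draft_answer: str) -> str:
    """Decline fluent completion in high-risk domains instead of performing confidence."""
    if domain in HIGH_RISK_DOMAINS:
        return ("I can't give a reliable answer on this. It falls in a high-risk area, "
                "so please consult a qualified professional.")
    return draft_answer

if __name__ == "__main__":
    print(respond("medication_dosing", "You can safely double the dose."))
    print(respond("trivia", "The Eiffel Tower was completed in 1889."))
```

On an engagement dashboard the first branch registers as a failed request; on a safety dashboard it is the intended behavior.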

5.6 The Pattern Across Interventions

Taken together, these five interventions reveal a pattern. The measures most plausibly capable of reducing near-term harm are not absent because they are conceptually obscure or technically unattainable. They are weakly deployed because each imposes a recognizable cost inside current product environments. Provenance reduces seamlessness. Confidence indicators reduce the performance of authority. Staged autonomy slows flow. Audit trails increase exposure. Refusals reduce apparent capability. Safety features conflict with engagement-maximizing business models not in the abstract, but in specific, measurable, product-facing ways.

The decisive question is who has reason to absorb the costs that safer systems impose. If current firms are evaluated through metrics that systematically penalize these interventions, then better exhortation will not be enough. The problem becomes institutional.

§

6 AI Safety as Institutional Design

Near-term AI safety failures persist not simply because models remain imperfect, nor only because interfaces are poorly designed, but because the organizations that build and deploy these systems are structured to reward the wrong things. The question of AI safety therefore becomes institutional: who sets the objective function for deployment, how that objective is measured, and what kinds of costs an organization is expected to absorb.

6.1 From Product Design to Institutional Design

Product metrics do not arise spontaneously. They are generated inside institutions: firms with specific ownership structures, boards, fiduciary duties, revenue models, competitive environments, and regulatory constraints (Zuboff, 2019; Srnicek, 2017). What appears at the interface as confidence smoothing, relational stickiness, or frictionless action reflects prior decisions about what the organization is trying to maximize. The deployment of safer AI systems depends not only on what engineers can build but on what institutions can tolerate.

6.2 Ownership Structure and the Distribution of Priorities

A firm funded primarily through attention extraction, advertising, or growth-dependent valuation will predictably rank friction differently from a subscription service, a public utility, a research commons, or a regulated provider operating under professional obligations (Zuboff, 2019; Mozilla Foundation, 2024; Harvard Ash Center, 2025). Public ownership is not automatically safe, and private ownership is not automatically reckless. But ownership shapes what counts as a rational trade-off. The same feature can be interpreted either as a defect or as a duty depending on the institutional frame in which it is evaluated.

6.3 KPI Architecture: What Gets Counted Becomes Real

As Section 4 noted, safety metrics are largely absent from standard chatbot analytics — a gap the AI accountability literature identifies as structurally significant (Raji et al., 2020; Raji & Buolamwini, 2019). A system cannot reliably optimize for a value it does not formally register. Where safety is not counted, it appears only indirectly: as reputational damage after a scandal, as legal exposure after an incident, or as abstract principle in policy documents that do not enter the incentive loop of everyday product decisions. Safer AI systems require not only better principles but different dashboards. Proposed candidate metrics include false-certainty rate, over-trust index, provenance coverage, and harm incidents per active user.
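
To indicate that such candidates are computable from interaction logs rather than purely aspirational, the sketch below gives one possible operationalization of two of them. The definitions, field names, and the assumption that post-hoc verification labels exist are all illustrative; their validation is deferred to Section 7.

```python
from dataclasses import dataclass

@dataclass
class LoggedClaim:
    confident_tone: bool    # presented without visible uncertainty cues
    verified_correct: bool  # ground truth from later review (assumed to be available)
    has_source: bool        # accompanied by a provenance marker

def false_certainty_rate(claims: list[LoggedClaim]) -> float:
    """Share of confidently presented claims that later proved incorrect."""
    confident = [c for c in claims if c.confident_tone]
    if not confident:
        return 0.0
    return sum(not c.verified_correct for c in confident) / len(confident)

def provenance_coverage(claims: list[LoggedClaim]) -> float:
    """Share of factual claims shipped with a verifiable source."""
    if not claims:
        return 0.0
    return sum(c.has_source for c in claims) / len(claims)
```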

6.4 Governance Arrangements and the Absorbable Cost of Safety

Many safety interventions have short-term downsides and long-term benefits. Whether an organization adopts such features depends in part on whether its governance structure can authorize these trade-offs. Independent boards, public audits, incident reporting requirements, external review, and stakeholder representation all matter because they alter the conditions under which short-term product sacrifice can be justified (NIST, 2023; Stanford HAI, 2025). Safety is easier to sustain when an organization is accountable to constituencies wider than product growth alone.

6.5 Regulation as Incentive Redesign

Regulation matters not only because it constrains bad behavior after the fact, but because it changes the payoff structure before deployment. Mandatory provenance in high-risk domains, staged-autonomy requirements for consequential actions, public incident reporting, and disclosure of safety metrics all function as ways of redesigning incentives rather than merely punishing isolated failures. When safety features carry product costs, any individual firm that deploys them may place itself at a competitive disadvantage. Regulation can change that calculus by making some forms of safety-oriented friction mandatory rather than optional.

6.6 Alternative Institutional Forms

If current commercial arrangements systematically penalize safer deployment, the relevant comparison is not only between regulated and unregulated versions of the same model. It is also between different institutional forms. Subscription models, public-utility approaches, and commons-based or publicly accountable AI infrastructure all represent alternatives whose incentive structures differ in ways that matter for safety (Mozilla Foundation, 2024; Harvard Ash Center, 2025; Schneier & Farrell, 2023). Such models are not magical solutions and remain vulnerable to underfunding, capture, technical lag, and governance failure.

A public or publicly governed AI system may still hallucinate, exhibit bias, or fail; but if it is evaluated through incident reduction, safety coverage, transparency, and equitable access rather than pure engagement growth, then known interventions that look costly in one environment may become rational in another. This is the central comparative hypothesis the paper carries forward: safer systems may require not only better tools, but different institutional homes for those tools. Many unsafe-seeming features of current LLM interaction are therefore better understood not as properties of language models alone, but as properties of language models deployed inside institutions that reward engagement-compatible smoothness over safety-oriented friction.

6.7 Institutional Design as the Missing Layer

Technical alignment remains necessary. Interface design remains important. But neither is sufficient without institutional arrangements that make safer deployment durable. The deployment environment is not a neutral container into which models are released. It is an incentive-shaping architecture that determines which harms are tolerated, which features are funded, which frictions are acceptable, and which trade-offs appear rational. To call AI safety an institutional design problem is not to abandon technical work. It is to identify the layer at which many near-term harms are currently being reproduced.

§

7 Limits and Research Agenda

The argument developed here is intended as a reframing, not a completed causal demonstration. It brings together comparative qualitative observation of interactional behavior in deployed systems, analysis of deployment-layer interventions that remain weakly present despite technical feasibility, and an institutional account of why firms governed by engagement-compatible metrics may rationally underdeploy safer design. These strands support the paper's core claim. They do not, however, establish the full causal chain.

7.1 What This Paper Establishes

The paper establishes four things with reasonable confidence. First, several recurring interactional patterns are sufficiently regular to require explanation. Second, a number of safety-oriented interventions are technically feasible and in some cases already demonstrated (NIST, 2023; Amershi et al., 2019; Ojewale et al., 2025). Third, these interventions impose recognizable costs in commercial product environments. Fourth, an institutional hypothesis follows: if organizations are measured primarily through engagement, conversion, and retention, then underdeployment of such interventions is not anomalous but expected. These contributions justify the paper's reframing of AI safety as an institutional design problem. What they do not yet provide is decisive evidence that a given business model directly causes a given safety profile.

7.2 What the Paper Cannot Yet Claim

The interactional analysis in Section 3 is qualitative and interpretive rather than benchmarked or statistically representative. It identifies patterns that recur across selected deployed interfaces, but it does not establish their prevalence across all systems, nor does it isolate the relative contribution of interface design, post-training choices, model architecture, corporate strategy, or user population. The institutional argument explains why safer design is often disincentivized, but it does not quantify the magnitude of these effects. The comparative claim about alternative institutional forms is a hypothesis to test, not a conclusion already secured.

7.3 First Empirical Priority: Comparative Business-Model Research

The most important next step is comparative research across institutional settings. Systems with similar capability profiles but different ownership structures, revenue logics, or governance arrangements should, if the paper's core claim is right, exhibit meaningfully different safety profiles. Useful comparisons would be controlled or matched: systems of roughly similar capability, deployed in comparable domains, but embedded in different metric environments, examined for refusal quality, provenance coverage, uncertainty display, auditability, escalation behavior, and rates of harmful over-trust.

7.4 Second Empirical Priority: Intervention Studies at the Deployment Layer

Direct intervention research should test whether the proposed interventions actually reduce harms such as over-trust and misinformation acceptance, and should make product costs visible rather than assumed (Mehrotra et al., 2024). Without this layer of evidence, organizations can always claim that safety interventions are too expensive while regulators remain unable to specify the size of the sacrifice being demanded.

7.5 Third Empirical Priority: Safety Metrics as Organizational Instruments

Candidate measures such as false-certainty rate, over-trust index, provenance coverage, and harm incidents per active user require empirical validation of their robustness, gaming-resistance, and organizational uptake. A useful research program would ask: which safety metrics correlate with real harm reduction rather than merely cosmetic compliance? What happens inside organizations once these metrics become visible?

7.6 Fourth Empirical Priority: Governance and Regulatory Field Effects

The relevant research question is not simply whether regulation "works" in the abstract, but which regulatory designs alter organizational trade-offs without producing merely performative compliance. Comparative legal and policy research could examine which disclosure mandates meaningfully change behavior, whether incident reporting generates usable public knowledge, and whether shared baseline requirements reduce the competitive penalty attached to safer design.

7.7 A Broader Methodological Point

If the field continues to focus primarily on model internals, benchmark performance, or speculative future alignment, it will miss an increasingly important site of causal explanation: the interaction between institutional incentives and deployment design. Models matter. Interfaces matter. Firms matter. Metrics matter. Regulation matters. Many of the questions raised here can be studied now through audits, product experiments, matched comparisons, and organizational case studies.

§

8 Conclusion

This paper has argued that many near-term AI safety failures are not best understood as the direct consequence of insufficient technical knowledge alone. They are more adequately explained as the outcome of deployment environments in which safer behavior is systematically underrewarded. Deployed systems exhibit recurring interactional tendencies — continuation pressure, confidence smoothing, relational warmth, and discursive self-critique without behavioral change — that shape user trust and reliance in ways model-level analysis does not fully capture. Several interventions capable of reducing these harms are already technically feasible. Yet they remain weakly deployed because they impose recognizable costs within environments optimized for growth, fluency, retention, and perceived capability.

The central implication is straightforward. AI safety is not only a question of how to align models. It is also a question of how to align the institutions that design, measure, and deploy them. If provenance reduces seamlessness, if uncertainty display weakens the performance of authority, if staged autonomy slows flow, if auditability increases exposure, and if refusal lowers apparent capability, then safer systems will not emerge reliably from technical progress alone. They will emerge only where organizational incentives, metric architectures, governance structures, and regulatory conditions make those trade-offs absorbable.

The deployment environment is not a neutral container into which models are placed after the real technical work is done. It is an active layer of selection. It determines which capabilities are amplified, which frictions are removed, which risks are tolerated, and which forms of harm remain external to the dashboard. A field that focuses too narrowly on model internals will continue to miss where many present harms are actually produced.

The broader claim of this paper is therefore modest in form but ambitious in consequence. We already know several ways to make AI systems safer at the level of interaction and deployment. What remains unresolved is whether current institutional arrangements can reliably prefer those measures when they conflict with the product logics of engagement and growth. That is the question to which AI safety research, policy, and public debate should now turn with greater seriousness. The future of AI safety will depend not only on what our models can do, but on what our institutions are willing to reward.

§

References

  1. Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., … Horvitz, E. (2019). Software engineering for machine learning: A case study. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice (pp. 291–300). IEEE. https://doi.org/10.1109/ICSE-SEIP.2019.00042
  2. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. https://arxiv.org/abs/1606.06565
  3. Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
  4. Christiano, P., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03741
  5. Fogg, B. J. (2002). Persuasive Technology: Using Computers to Change What We Think and Do. Morgan Kaufmann.
  6. Geng, J., Han, J., Du, J., & Yu, P. S. (2024). A survey of confidence estimation and calibration in large language models. arXiv preprint arXiv:2311.08298. https://arxiv.org/abs/2311.08298
  7. Gray, C. M., Kou, Y., Battles, B., Hoggatt, J., & Toombs, A. L. (2018). The dark (patterns) side of UX design. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1–14). ACM. https://doi.org/10.1145/3173574.3174108
  8. Hancock, J. T., Naaman, M., & Levy, K. (2020). AI-mediated communication: Definition, research agenda, and ethical considerations. Journal of Computer-Mediated Communication, 25(1), 89–100. https://doi.org/10.1093/jcmc/zmz022
  9. Harvard Ash Center for Democratic Governance and Innovation. (2025). Policy 101: Public AI. Harvard Kennedy School. https://ash.harvard.edu
  10. Horvitz, E. (1999). Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 159–166). ACM. https://doi.org/10.1145/302979.303030
  11. Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820. https://arxiv.org/abs/1906.01820
  12. Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., … Yang, Y. (2023). AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852. https://arxiv.org/abs/2310.19852
  13. Liu, X., Wang, Y., & Jiang, N. (2025). Uncertainty quantification and confidence calibration in large language models: A survey. arXiv preprint arXiv:2503.02201.
  14. Luger, E., & Sellen, A. (2016). "Like having a really bad PA": The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 5286–5297). ACM. https://doi.org/10.1145/2858036.2858288
  15. Mehrotra, S., Ahmad, W., & Bansal, M. (2024). A systematic review on fostering appropriate trust in human-AI interaction. ACM Computing Surveys. https://doi.org/10.1145/3649506
  16. Mozilla Foundation. (2024). Public AI: A Mozilla Foundation Report. Mozilla Foundation. https://foundation.mozilla.org
  17. Nass, C., & Moon, Y. (2000). Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1), 81–103. https://doi.org/10.1111/0022-4537.00153
  18. NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-1
  19. Ojewale, V., Heston, T., & Mökander, J. (2025). Gaps and opportunities in AI audit tooling. arXiv preprint arXiv:2501.02993.
  20. Omohundro, S. M. (2008). The basic AI drives. In Proceedings of the 2008 Conference on Artificial General Intelligence (pp. 171–179). IOS Press.
  21. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744. https://arxiv.org/abs/2203.02155
  22. Raji, I. D., & Buolamwini, J. (2019). Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 429–435). ACM. https://doi.org/10.1145/3306618.3314244
  23. Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., … Barnes, P. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency (pp. 33–44). ACM. https://doi.org/10.1145/3351095.3372873
  24. Schneier, B., & Farrell, H. (2023). Toward a public AI. Lawfare Blog. https://www.lawfaremedia.org
  25. Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., … Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548. https://arxiv.org/abs/2310.13548
  26. Srnicek, N. (2017). Platform Capitalism. Polity Press.
  27. Stanford HAI. (2025). The AI Index Report 2025. Stanford University Human-Centered Artificial Intelligence Institute. https://aiindex.stanford.edu
  28. Turner, A. M., Smith, L., Shah, R., Critch, A., & Tadepalli, P. (2021). Optimal policies tend to seek power. Advances in Neural Information Processing Systems, 34, 23063–23074. https://arxiv.org/abs/1912.01683
  29. Waltersdorfer, L., & Sabou, M. (2025). Leveraging knowledge graphs for AI system auditing and transparency. arXiv preprint arXiv:2502.07927.
  30. Zuboff, S. (2019). The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. PublicAffairs.