The Universal Intent Layer: Intent as a First-Class Computational Primitive Across the AI Pipeline

February 2026

Abstract

Contemporary AI systems, including large language models (LLMs), agentic frameworks, and multi-modal architectures, lack any explicit, structured representation of intent throughout the computational pipeline. Intent is destroyed during pretraining by next-token prediction objectives, encoded only implicitly during post-training via preference optimization, and buried within flat token sequences at inference time. This architectural absence is the root cause of pervasive failure modes: goal drift in multi-turn conversations, shallow compliance with complex instructions, agent derailing during multi-step execution, and fragile alignment that resists formal verification.

We propose the Universal Intent Layer (UIL) — a formal, persistent intent representation that flows bidirectionally through every stage of the AI pipeline. At its core is the IntentGraph, a hierarchical data structure encoding root goals, sub-goal trees with dependency tracking, success criteria, constraints, uncertainty estimates, and cross-modal transferability. We present: (1) a formal definition of the IntentGraph and its algebraic operations; (2) architectural modifications for intent-aware pretraining, post-training, inference, and agentic execution; (3) a multi-dimensional intent state model capturing five latent intent dimensions; (4) a scalable data collection and annotation pipeline; (5) a phased model evolution roadmap from prompt-based wrappers to architecturally native intent; and (6) IntentBench, a proposed benchmark suite with five novel evaluation metrics. We ground our architecture in five vertical case studies spanning enterprise operations, healthcare, education, robotics, and legal domains. The UIL represents a paradigm shift from treating intent as an implicit latent variable to making it a first-class computational primitive — the missing abstraction that renders alignment inspectable, agentic planning reliable, and cross-modal reasoning coherent.

Introduction

Every major AI system deployed today shares a fundamental blind spot: there is no explicit, structured representation of intent anywhere in the computational pipeline. Intent — the most critical latent variable in intelligent behavior — is scattered across reward functions, system prompts, user queries, and chain-of-thought traces with no unifying representation. The consequences are visible in every deployed system: chatbots that lose track of user goals by turn fifteen, agents that derail after ten tool calls, alignment techniques that produce statistical tendencies rather than inspectable constraints, and multi-modal systems that cannot transfer goal understanding across domains.

These are not independent failure modes. They are symptoms of a single architectural absence. Trace the full journey from training data to deployed model, and at every stage, ask: where does intent live? During data collection, intent exists in text but is unlabeled. During pretraining, next-token prediction objectives actively destroy intent structure by reducing all text to prediction targets. During post-training, RLHF and DPO encode intent implicitly in preference gradients, where it cannot be inspected or verified. At inference time, system prompts, user messages, conversation history, and tool descriptions are flattened into a context window — and the model must guess what the user wants from this undifferentiated token sequence.

The result is an AI ecosystem where the most important variable — the goal the system is trying to achieve — is never explicitly represented, tracked, updated, or verified.

Thesis. A Universal Intent Layer — a formal, persistent intent representation flowing bidirectionally through every pipeline stage — would fundamentally transform how we build, align, and scale intelligent systems. This is not a nice-to-have. It is the missing abstraction that makes alignment tractable, agentic planning reliable, multi-modal reasoning coherent, and embodied AI safe.

Philosophical Foundations

The philosophical case for intent as a computational primitive predates modern AI by over a century. Franz Brentano argued in 1874 that intentionality — the directedness of mental states toward objects and goals — is the defining characteristic of the mental [1]. Every thought is a thought about something; every action is directed toward something. Michael Bratman extended this into a theory of practical reasoning, arguing that intentions are not mere desires but commitment-forming states that structure planning and constrain future action [2]. Daniel Dennett's intentional stance [3] proposed that predicting behavior by ascribing beliefs and desires is not merely a useful fiction but the most powerful predictive framework available.

Modern cognitive science reinforces these insights. Human perception is not passive intake — it is active, goal-directed search [4]. Attention is allocated by intent. Memory is organized by goals. Even creativity operates within intentional constraints. We do not see the world and then decide what to do; we decide what to do and then see the world through that lens.

Context tells you what is. Intent tells you what matters.

What we have built in AI so far are systems extraordinary at the context half of intelligence — absorbing, compressing, and recalling vast information. But we have entirely skipped the intent half — the ability to form, maintain, revise, and communicate goals. We have built perception without purpose. Memory without motivation. Competence without direction.

Contributions

This paper makes the following contributions:

A formal definition of the IntentGraph data structure with algebraic operations (merge, diff, project, negotiate).
Architectural modifications for intent-aware pretraining, post-training, inference, and agentic execution, including novel loss functions and conditioning mechanisms.
A multi-dimensional intent state model capturing five latent dimensions (goal intent, action readiness, ambiguity level, risk sensitivity, intent scope).
A scalable data collection and annotation pipeline with four collection channels and active learning feedback loops.
IntentBench: a proposed benchmark suite with five novel metrics for evaluating intent understanding in AI systems.
Five vertical application case studies demonstrating concrete IntentGraph instances in enterprise, healthcare, education, robotics, and legal domains.

Figure 1: The Intent Gap

flowchart LR subgraph Pipeline["AI PIPELINE"] direction LR DC["DATA COLLECTION\nWeb, books, code"] PT["PRETRAINING\nNext-token prediction"] PO["POST-TRAINING\nRLHF / DPO / SFT"] INF["INFERENCE\nContext window"] end DC -->|"Intent: unlabeled"| PT PT -->|"Intent: DESTROYED"| PO PO -->|"Intent: IMPLICIT"| INF INF --> GD["Goal Drift"] INF --> SC["Surface Compliance"] INF --> AD["Agent Derailing"] INF --> FA["Fragile Alignment"] style DC fill:#fafafa,stroke:#ddd style PT fill:#fef2f2,stroke:#fca5a5 style PO fill:#fffbeb,stroke:#fcd34d style INF fill:#fafafa,stroke:#ddd style GD fill:#fef2f2,stroke:#fca5a5,color:#b91c1c style SC fill:#fef2f2,stroke:#fca5a5,color:#b91c1c style AD fill:#fef2f2,stroke:#fca5a5,color:#b91c1c style FA fill:#fef2f2,stroke:#fca5a5,color:#b91c1c

Figure 1. The intent gap — intent exists in training data but is progressively lost through each pipeline stage, leading to downstream failure modes.

Table 1: Human vs. Machine Intelligence

Human Intelligence	Machine Intelligence (with UIL)
Intent precedes perception. Before you look, you know what you're looking for.	Intent conditions inference. P(y\|x, I) — different intent produces different output from the same input.
Goals persist across interruption. Resume a task with the original goal intact.	Intent graphs persist across sessions. Switch tasks, return — intent is preserved structurally.
You know when you're confused about the goal. Metacognitive awareness of goal uncertainty.	Explicit uncertainty. "Low confidence in goal" is distinct from "low confidence in answer."
Goals are hierarchical. Nested sub-goals with constant re-evaluation of priorities.	Sub-goal trees with drift detection. Monitors whether actions serve the root goal.
Intent transfers across modalities. "Make dinner" spans reading, recognizing, chopping, seasoning.	One graph, many executors. The same goal drives language, vision, and robotic controllers.

Table 1. Parallel between human cognitive architecture and UIL-equipped machine intelligence.

Related Work

The concept of intent in AI systems has been explored across multiple disconnected research traditions. We survey each and identify how the UIL differs fundamentally from prior approaches.

2.1 Intent Recognition in Natural Language Understanding

Classical intent classification treats user utterances as single-label classification problems. The ATIS corpus [5] established intent detection as mapping utterances to predefined categories (e.g., flight_booking, ground_service). Subsequent work on the SNIPS dataset [6] and joint intent-slot filling architectures [7] improved accuracy but retained the fundamental assumption: intent is a single label for a single utterance. More recent transformer-based approaches achieve high accuracy on benchmarks like CLINC150 [8] but remain confined to the same paradigm. Limitation: These systems treat intent as a flat, ephemeral classification problem. They cannot represent hierarchical goals, track intent across turns, handle uncertainty, or persist across sessions.

2.2 Goal-Oriented Dialogue Systems

Task-oriented dialogue research introduced belief state tracking as a proxy for intent [9]. The MultiWOZ dataset [10] and Schema-Guided Dialogue benchmark [11] advanced multi-domain dialogue modeling. These systems track slot-value pairs (e.g., restaurant-area=centre) as a representation of user goals. Limitation: Belief states are predefined schemas with fixed slots. They cannot represent emergent goals, hierarchical decomposition, cross-session persistence, or domain transfer. The intent representation is isomorphic to a database query, not a goal structure.

2.3 Planning in AI Agents

Classical AI planning formalized goal-directed behavior through STRIPS [12] and PDDL [13], representing goals as logical predicates over world states. Modern LLM-based agents have revived planning through ReAct [14], which interleaves reasoning and acting; Reflexion [15], which adds verbal self-reflection; and Tree of Thought [16], which explores multiple reasoning paths. Agent frameworks like LangChain, CrewAI, and AutoGen each independently implement "plan" objects, "reflection" steps, and "goal tracking" mechanisms. Limitation: Every framework reinvents ad-hoc intent management. There is no shared formalism, no persistence mechanism, and no integration with the training pipeline. Goals are inferred from the prompt at each step, leading to systematic drift in multi-step tasks.

2.4 Alignment and Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) [17] and Direct Preference Optimization (DPO) [18] encode human intent through pairwise preference comparisons. Constitutional AI [19] articulates intent as natural language principles. Process reward models [20] evaluate reasoning steps individually. Limitation: All these approaches encode intent implicitly in model weights via gradient updates. The resulting behavior is a statistical tendency, not an inspectable constraint. You cannot read a model's reward function and verify: "Is this system trying to do what the user asked?" Alignment is behavioral, not structural.

2.5 Cognitive Science and Intentionality

Beyond AI, the study of intentionality has deep roots in philosophy of mind [1] [2] [3], shared intentionality in human development [21], and active inference in neuroscience [4]. Friston's free-energy principle frames cognition as fundamentally goal-directed: organisms minimize surprise by acting to fulfill predictions conditioned on goals. These theories provide the conceptual foundation for treating intent as a computational primitive rather than an emergent property.

2.6 Multi-Modal and Embodied AI

Vision-language models [22] and embodied AI systems like SayCan [23] demonstrate that transferring intent across modalities requires more than shared embeddings. SayCan grounds language instructions in robotic affordances but relies on a fixed set of skills with no persistent goal representation. Limitation: No existing system provides a modality-agnostic intent format that can drive language generation, visual reasoning, and robotic control from the same goal structure.

2.7 Positioning the UIL

Approach	Intent Type	Persistence	Hierarchy	Pipeline Integration
NLU Classifiers	Flat label	Single turn	None	Inference only
Dialog Belief States	Slot-value pairs	Within session	None	Inference only
Agent Plan Objects	Ad-hoc structures	Single trajectory	Partial	Inference only
RLHF / DPO	Implicit in weights	Permanent but opaque	None	Training only
Constitutional AI	Natural language rules	Fixed at deployment	None	Training only
UIL (Ours)	Structured graph	Cross-session	Full tree	All stages

Table 2. The UIL is the first framework to treat intent as a structured, persistent, hierarchical representation integrated across all pipeline stages.

Figure 2: Taxonomy of Intent Handling in AI

quadrantChart title Intent Handling Approaches x-axis Ephemeral --> Persistent y-axis Implicit --> Explicit quadrant-1 UIL Territory quadrant-2 Classical NLU quadrant-3 Current LLMs quadrant-4 Belief Tracking NLU Classifiers: [0.15, 0.75] Slot Filling: [0.25, 0.65] Dialog Belief States: [0.45, 0.55] RLHF-DPO: [0.85, 0.15] Agent Plans: [0.35, 0.45] Constitutional AI: [0.70, 0.35] System Prompts: [0.20, 0.30] Universal Intent Layer: [0.90, 0.90]

Figure 2. Taxonomy of intent handling in AI systems. Prior approaches cluster in the implicit/ephemeral quadrants. The UIL uniquely occupies the explicit and persistent quadrant.

The IntentGraph Formalism

We now present the formal definition of the IntentGraph — the core data structure of the Universal Intent Layer. The IntentGraph is designed to be simultaneously machine-processable (enabling conditioning during inference), human-readable (enabling alignment inspection), and algebraically closed (supporting composition, comparison, and transfer operations).

3.1 Formal Definition

Definition 1 (IntentGraph). An IntentGraph is a tuple G = (r, S, M) where:

G = (r, S, M)

r ∈ R is the root goal, a natural language description paired with structured metadata.
S = {s₁, s₂, ..., s_n} is an ordered tree of sub-goals, where each sub-goal s_i is defined as:

s_i = (g_i, σ_i, C_i, K_i, u_i, X_i, m_i, D_i)

where:

g_i ∈ String is the sub-goal description.
σ_i ∈ {pending, active, done, blocked, abandoned} is the status.
C_i = {c₁, ..., c_k} is a set of success criteria (verifiable predicates).
K_i = {k₁, ..., k_j} is a set of constraints (boundary conditions).
u_i ∈ [0, 1] is the uncertainty estimate over this sub-goal's interpretation.
X_i is a set of context requirements (information needed before this sub-goal can be addressed).
m_i ∈ Modality is the execution modality (language, vision, navigation, manipulation, etc.).
D_i ⊆ S is the set of dependency sub-goals that must complete before s_i can begin.

M = (source, conf, H) is metadata: the intent source, overall confidence, and a negotiation history trace H recording how the intent was refined through interaction.

{
  root_goal: "Support patient managing Type 2 diabetes holistically",
  sub_goals: [
    { goal: "Track and optimize HbA1c trajectory",
      status: "active",
      success_criteria: ["HbA1c < 7.0 at next quarterly check"],
      constraints: ["patient preference: minimize medication changes",
                    "current regimen: metformin 1000mg"],
      uncertainty: 0.2 },
    { goal: "Identify and address dietary triggers",
      status: "active",
      success_criteria: ["3+ trigger patterns identified"],
      context_requirements: ["food diary data — 2 weeks minimum"] },
    { goal: "Monitor for complication signals",
      status: "background",
      constraints: ["SAFETY: escalate any acute symptoms immediately"],
      dependencies: ["latest lab results"] },
    { goal: "Support medication adherence",
      status: "active",
      uncertainty: 0.4,
      context_requirements: ["patient hasn't shared adherence barriers yet"] }
  ],
  meta: { source: "patient + care plan + clinical guidelines",
          confidence: 0.75,
          negotiation_history: ["Session 1: patient declined insulin discussion",
                                "Session 3: patient open to CGM trial"] }
}

Example 1. A concrete IntentGraph for chronic disease management. Note the persistence across sessions, explicit uncertainty, background safety monitoring, and negotiation history.

3.2 Properties

The IntentGraph satisfies five key architectural properties:

Hierarchical

Sub-goals form an ordered tree of arbitrary depth. Parent goals decompose into children; completion propagates upward.

Persistent

Survives across turns, sessions, and modality switches. The graph is serializable and restorable.

Updateable

Status, uncertainty, constraints, and context requirements evolve in real time as the interaction progresses.

Inspectable

Fully structured and machine-readable. A compliance auditor can inspect exactly what the AI's goals were at any point in time.

Transferable

Modality-agnostic representation. The same IntentGraph can drive a language model, a vision system, or a robotic controller — only the executor changes.

3.3 Multi-Dimensional Intent States

A single intent label is insufficient to capture the full complexity of user intent. We propose a multi-dimensional intent state model that decomposes intent into five orthogonal latent dimensions, each independently inferred and jointly synthesized:

Dimension	Symbol	Values	Description
Goal Intent	G	book, cancel, fix, learn, ...	What the user wants to achieve
Action Readiness	A	exploring \| deciding \| executing	Where the user is in their decision process
Ambiguity Level	B	clear \| partial \| conflicting	How well-specified the intent is
Risk Sensitivity	R	low \| medium \| high	Stakes involved in misunderstanding
Intent Scope	C	single \| compound \| multi-step	Complexity of the goal structure

The composite intent state is the tuple:

I = (G, A, B, R, C) ∈ G × A × B × R × C

This multi-dimensional representation enables intent-appropriate responses. A user with B = conflicting and R = high requires clarification before action, regardless of G. A user with A = executing and B = clear should receive immediate tool invocation, not an explanation. The system behavior is jointly conditioned on all five dimensions.

Figure 3: Multi-Dimensional Intent State Flow

flowchart TD INPUT["User Input + Context + Request Metadata"] --> IIL["Intent Intelligence Layer"] IIL --> G["Goal Intent\nbook, cancel, fix, learn"] IIL --> A["Action Readiness\nexploring, deciding, executing"] IIL --> B["Ambiguity Level\nclear, partial, conflicting"] IIL --> R["Risk Sensitivity\nlow, medium, high"] IIL --> C["Intent Scope\nsingle, compound, multi-step"] G --> SYN["Intent Synthesis"] A --> SYN B --> SYN R --> SYN C --> SYN SYN --> OUT1["Structured Intent Object"] SYN --> OUT2["Runtime Decision"] SYN --> OUT3["Confidence Signal"] SYN --> OUT4["Tool / Agent Selection"] OUT2 --> OUTCOME["User Outcome"] OUTCOME -.->|"Feedback"| IIL style IIL fill:#f0f6ff,stroke:#326BC8,stroke-width:2 style SYN fill:#eff6ff,stroke:#326BC8 style G fill:#eff6ff,stroke:#93c5fd style A fill:#eff6ff,stroke:#93c5fd style B fill:#eff6ff,stroke:#93c5fd style R fill:#eff6ff,stroke:#93c5fd style C fill:#eff6ff,stroke:#93c5fd

Figure 3. Multi-dimensional intent state model. Five latent dimensions are independently inferred and jointly synthesized to produce runtime decisions. Feedback from user outcomes refines future inference.

3.4 Intent Lifecycle

An IntentGraph is not static. It evolves through a well-defined lifecycle:

stateDiagram-v2 [*] --> Created : User input parsed Created --> Refined : Clarification and negotiation Refined --> Decomposed : Sub goals identified Decomposed --> Executing : Sub goals activated Executing --> Executing : Next sub goal activated Executing --> DriftDetected : Actions diverge from root goal DriftDetected --> Replanned : Replan with user confirmation Replanned --> Executing : Resume execution Executing --> Resolved : All success criteria met Executing --> Escalated : High risk uncertainty Executing --> Abandoned : User disengages Resolved --> [*] Escalated --> [*] Abandoned --> [*]

Figure 4. Intent lifecycle state machine. Intents progress from creation through execution, with explicit paths for drift detection, replanning, escalation, and abandonment.

3.5 Operations on IntentGraphs

We define four algebraic operations on IntentGraphs:

Definition 2 (Merge). merge(G₁, G₂) → G_m combines intent from multiple sources (e.g., user intent and system policy). Conflicting sub-goals are flagged with B = conflicting for resolution. Shared sub-goals are deduplicated. Dependencies are unioned.

Definition 3 (Diff). diff(G_t, G_t-1) → Δ computes the intent drift between consecutive time steps. Returns a delta object containing: new sub-goals, removed sub-goals, status changes, and uncertainty shifts. A drift score d = |Δ| / |G_t-1| quantifies how much the intent has changed.

Definition 4 (Project). project(G, m) → G_m transforms an IntentGraph for execution in modality m. This filters to sub-goals relevant to modality m, translates constraints into modality-specific parameters (e.g., "be gentle" → "max grip force 2N" for robotics), and preserves the dependency structure.

Definition 5 (Negotiate). negotiate(G_user, G_system) → G_agreed resolves conflicts between user intent and system constraints (safety policies, capability boundaries). Produces an agreed graph with an appended negotiation history entry in M.H.

Architecture: UIL Across the Pipeline

The UIL is not a single component but a cross-cutting concern that modifies every stage of the AI pipeline. We describe the architectural integration at each stage, with concrete loss functions, reward formulations, and conditioning mechanisms.

Figure 5: UIL Architecture Overview

flowchart TB UIL["UNIVERSAL INTENT LAYER\nPersistent - Structured - Cross-modal - Hierarchical"] UIL ---|"Intent-aware\naux losses"| PRE["PRETRAINING\nL = L_token + L_intent"] UIL ---|"Intent fidelity\nrewards"| POST["POST-TRAINING\nR = R_pref + R_intent"] UIL ---|"Intent-conditioned\ndecoding"| INF["INFERENCE\nP y given x, I"] UIL ---|"Persistent\ngoal trees"| AGT["AGENTIC EXECUTION\npolicy state, I_current"] PRE --> UIL POST --> UIL INF --> UIL AGT --> UIL ROOT["ROOT INTENT"] --> SGA["Sub-Goal A"] ROOT --> SGB["Sub-Goal B"] ROOT --> SGC["Sub-Goal C"] style UIL fill:#f0f6ff,stroke:#326BC8,stroke-width:2 style PRE fill:#fff,stroke:#e5e5e5 style POST fill:#fff,stroke:#e5e5e5 style INF fill:#fff,stroke:#e5e5e5 style AGT fill:#fff,stroke:#e5e5e5 style ROOT fill:#eff6ff,stroke:#326BC8 style SGA fill:#fff,stroke:#e5e5e5 style SGB fill:#fff,stroke:#e5e5e5 style SGC fill:#fff,stroke:#e5e5e5

Figure 5. The UIL flows bidirectionally into every pipeline stage. Below: the persistent sub-goal tree maintained throughout the system's operation.

4.1 Pretraining with Intent-Aware Auxiliary Losses

Current state. Standard pretraining optimizes a next-token prediction objective: L = L_{next_token}. This objective treats all tokens equally and actively destroys intent structure by reducing semantic goals to statistical continuation patterns.

With UIL. We augment the pretraining objective with intent-aware auxiliary losses:

L = L_{next_token} + λ₁ · L_{intent_class} + λ₂ · L_{goal_pred}

where L_{intent_class} is a classification loss predicting the intent category behind a passage (from intent-annotated training data), and L_{goal_pred} is a generation loss predicting the root goal of a conversation given partial context. The hyperparameters λ₁ and λ₂ control the strength of intent signals relative to the language modeling objective.

This approach draws on multi-task learning [24]: auxiliary tasks that share representations with the primary task can improve generalization. We hypothesize that intent-aware auxiliary losses create internal intent representations during pretraining that survive through fine-tuning and improve downstream intent-conditioned generation.

4.2 Post-Training with Intent Fidelity Rewards

Current state. RLHF and DPO optimize preference models that encode intent implicitly. The reward is a scalar comparison: response A is preferred to response B. The reason for the preference — which component of intent was better served — is lost in the comparison.

With UIL. We decompose the reward into intent-aware components:

R = w₁ · R_{human_pref} + w₂ · R_{intent_fidelity} + w₃ · R_{sub_goal_progress}

R_{intent_fidelity} measures whether the response faithfully serves the parsed IntentGraph — specifically, whether the generation addresses the correct sub-goal, respects stated constraints, and maintains the root goal's direction. R_{sub_goal_progress} measures whether the response advances a sub-goal toward its success criteria. This is inspired by process reward models [20], which evaluate intermediate steps rather than only final outcomes.

The key advantage: alignment becomes inspectable. When the reward model rates a response, you can decompose the score into: "How well did it serve the intent?" and "How much progress did it make?" This is qualitatively different from a single preference scalar.

4.3 Inference with Intent-Conditioned Generation

Current state. At inference, all context — system prompt, user message, history, tool descriptions — is flattened into a token sequence, and the model generates P(y|context). Intent is indistinguishable from any other context.

With UIL. Intent becomes a separate, privileged conditioning channel:

P(y | context, I) ≠ P(y | context)

The architectural implementation consists of four components operating in sequence:

Intent Encoder: Maps the IntentGraph G to a dense vector representation e_I ∈ R^d using a specialized encoder (graph neural network or tree-structured transformer).
Cross-Attention: A dedicated cross-attention layer where the model's hidden states attend to the intent embedding e_I, enabling intent to influence every token generation step.
Intent-Aware Sampling: During decoding, token probabilities are reweighted by intent relevance: P'(y_t) ∝ P(y_t | context) · f(y_t, e_I), where f is a learned intent-relevance function.
Intent Updater: After each generation step (or turn), the IntentGraph is updated — sub-goal statuses change, uncertainty is recalculated, and drift is checked.

Figure 6: Intent Layer at Inference Time

flowchart LR subgraph Input["INPUT PROCESSING"] UI["User Input"] --> IA["Input Adapter"] CC["Conversation\nContext"] --> IA RM["Request\nMetadata"] --> IA end IA --> IP["Intent\nPreprocessor"] subgraph Analysis["INTENT ANALYSIS"] IP --> IC["Intent\nClassifier"] IP --> MLD["Multi-Label\nDetector"] IP --> SE["Slot-Entity\nExtractor"] end IC --> ID["Intent\nDisambiguator"] MLD --> ID SE --> ID ID --> CS["Confidence\nScorer"] CS --> IOB["Intent Object\nBuilder"] subgraph Consumers["DOWNSTREAM CONSUMERS"] IOB --> LLM["LLM Runtime"] IOB --> PE["Policy Engine"] IOB --> AC["Agent Controller"] IOB --> TR["Tool Router"] end AC --> EO["Execution\nOutcome"] EO --> TEL["Telemetry"] TEL -.->|"feedback"| IP style IOB fill:#f0f6ff,stroke:#326BC8,stroke-width:2 style ID fill:#eff6ff,stroke:#326BC8

Figure 6. The inference-time intent processing pipeline. Raw inputs are analyzed through parallel classifiers, disambiguated, scored for confidence, and assembled into a structured intent object that drives all downstream consumers. Execution outcomes feed back into the intent preprocessor for continuous improvement.

4.4 Agentic Execution with Persistent Intent Graphs

Current state. ReAct-style agents re-infer intent from the full context at every reasoning step. Over long trajectories, the most recent observations dominate, and the original goal fades from attention. This produces systematic goal drift: by step 20, the agent is optimizing for local sub-tasks that may be tangential to the root objective.

With UIL. The agent maintains a persistent IntentGraph throughout its trajectory. The execution loop becomes:

while intent_graph.has_active_sub_goals():
    sub_goal = intent_graph.select_next_sub_goal()     # Priority + dependency ordering
    action = policy(state, sub_goal, intent_graph)      # Intent-conditioned action selection
    observation = environment.step(action)
    intent_graph.update(observation)                     # Status, uncertainty updates
    drift = intent_graph.compute_drift()                 # Compare trajectory to root goal
    if drift > threshold:
        intent_graph = replan(intent_graph, state)       # Structured replanning
        log("Drift detected. Replanned.")

The drift detection mechanism is the key innovation for agentic execution. At each step, we compute the cosine similarity between the current action trajectory embedding and the root goal embedding. When this similarity drops below a configurable threshold, the system triggers structured replanning rather than continuing down the divergent path. This is analogous to hierarchical reinforcement learning [25], where options (sub-policies) are selected and monitored by a meta-controller.

Figure 7: Intent-Conditioned Transformer Block

flowchart TD INPUT["Input Hidden States h_t"] --> SA["Self-Attention\nstandard"] SA --> ADD1["Add and Norm"] ADD1 --> ICA["Intent Cross-Attention\nQ: h_t, KV: e_I"] ICA --> ADD2["Add and Norm"] ADD2 --> FFN["Feed-Forward Network"] FFN --> ADD3["Add and Norm"] ADD3 --> OUT["Output Hidden States"] IE["Intent Encoder\nIntentGraph to e_I"] --> ICA style ICA fill:#f0f6ff,stroke:#326BC8,stroke-width:2 style IE fill:#eff6ff,stroke:#326BC8

Figure 7. Modified transformer block with intent cross-attention. The intent embedding e_I is produced by a dedicated encoder and injected via a cross-attention layer, enabling intent to influence every layer of the transformer.

Data Collection and Annotation Pipeline

Architecture without data is theory. The UIL requires large-scale, high-quality intent-annotated data to train intent-aware models. We describe a four-channel collection strategy, a multi-stage annotation pipeline, and a quality assurance framework designed to produce intent graphs at scale.

5.1 Collection Channels

CHANNEL 1 — RETROACTIVE ANNOTATION

Mine intent from existing conversation data

Every enterprise possesses thousands of support transcripts, chat logs, product feedback threads, and sales call recordings containing implicit intent. An LLM-based intent extractor processes each exchange to produce structured intent graphs: root goals, emergent sub-goals, drift points, and resolution status.

Scale: 100K support conversations yield approximately 50K labeled intent graphs. This forms the foundation dataset.

CHANNEL 2 — REAL-TIME CAPTURE

Instrument live systems for intent telemetry

Every AI interaction is an intent event. User rephrasing, corrections, abandonment, and explicit statements like "no, I meant..." are gold-standard intent labels. Intent telemetry tracks: initial parse confidence, clarification events, goal shifts, task completion versus abandonment, and explicit corrections.

Scale: 10K daily AI interactions produce approximately 3K labeled intent events per day. Over 90 days: 270K high-quality samples.

CHANNEL 3 — SYNTHETIC GENERATION

LLM-generated diverse intent scenarios

Real data is gold but narrow. Synthetic generation broadens domain coverage by prompting strong models to generate realistic multi-turn interactions with accompanying intent graphs across verticals. Human validators review a 10% sample to maintain quality.

Scale: 500K diverse intent scenarios across 20 verticals in one week of generation.

CHANNEL 4 — HUMAN EXPERT ANNOTATION

Domain experts label high-stakes intent

For verticals where intent misunderstanding has severe consequences — healthcare, legal, finance — domain experts produce gold-standard intent graphs with expert-level sub-goal decomposition, constraint identification, and success criteria.

Scale: 500 expert-labeled examples per vertical, anchoring the quality ceiling.

5.2 The Annotation Pipeline

flowchart LR RD["Raw Data\nConversations, logs"] --> LLM["LLM Extract\nAuto intent parsing"] LLM --> HQA["Human QA\nValidate + correct"] HQA --> GS["Graph Store\nVersioned, tagged"] GS --> TR["Training\nIntent-aware models"] TR -.->|"model errors feed re-annotation"| LLM style LLM fill:#f0f6ff,stroke:#326BC8 style HQA fill:#f0f6ff,stroke:#326BC8 style GS fill:#f0f6ff,stroke:#326BC8 style TR fill:#eff6ff,stroke:#93c5fd

Figure 8. The annotation pipeline. Raw data flows through LLM-based extraction, human quality assurance, and versioned storage. Model errors feed back into the re-annotation cycle.

5.3 Quality Assurance

Intent data quality dominates quantity. A model trained on 10K well-labeled intent graphs outperforms one trained on 100K noisy ones. We implement three QA checkpoints:

Consistency Check

Run the same conversation through the intent extractor 3 times. If root_goal or top-level sub_goals differ, flag for human review. Target: 90%+ consistency on first pass.

Expert Calibration

Randomly sample 5% of auto-generated graphs for domain expert grading. Track inter-annotator agreement rate. Threshold: If agreement drops below 85%, retune extraction prompts.

Adversarial Testing

Generate deliberately ambiguous inputs where reasonable humans disagree on intent. Use these to test and improve the model's uncertainty estimation — it should flag these as high-uncertainty, not confidently select one interpretation.

5.4 Active Learning Loop

An active learning sampler continuously selects the most informative examples for human annotation: low-confidence parses, rare intent categories, and domain boundary cases. This creates a positive feedback loop: the model identifies its own weaknesses, which are corrected by human annotators, producing targeted improvements with minimal annotation budget. We estimate a 3× efficiency gain over random sampling based on prior active learning literature.

Model Evolution Roadmap

The UIL is not a monolithic deployment. It evolves through four versions, each independently deployable and valuable. This phased approach de-risks adoption: organizations capture value from v0 immediately while building the data foundation for subsequent versions.

flowchart LR V0["v0: WRAPPER\nPrompt-based intent\nShips in: weeks"] --> V1["v1: FINE-TUNED\nIntent-trained model\nShips in: 3-6 months"] V1 --> V2["v2: NATIVE\nArchitectural intent\nShips in: 1-2 years"] V2 --> V3["v3: SELF-EVOLVING\nMeta-intent reasoning\nShips in: 3-5 years"] style V0 fill:#fef2f2,stroke:#fca5a5 style V1 fill:#fffbeb,stroke:#fcd34d style V2 fill:#eff6ff,stroke:#326BC8 style V3 fill:#eff6ff,stroke:#326BC8,stroke-width:2

Figure 9. Model evolution roadmap. Each version ships independently and builds on data from the previous stage.

v0: Intent Wrapper

A middleware layer wrapping any existing LLM (GPT-4, Claude, Gemini, Llama). Before every call, an intent parser constructs a lightweight IntentGraph and injects it into the system prompt as structured context. After each response, the graph is updated for the next turn. No model access required.

Expected improvements: Goal drift reduction of 60%. Agent task completion rate up 2×. Multi-turn user satisfaction up 35%.

Ships in: weeks · Requires: no model access · Data: none to start

v1: Intent Fine-Tuned

A dedicated intent model fine-tuned on data collected from v0 deployments. Replaces the prompt-based parser with a specialized encoder producing structured intent representations 10× faster and more accurately. The model learns domain-specific intent patterns.

Expected improvements: Intent parsing accuracy from ~70% (v0) to ~90%. Latency drops 5×.

Ships in: 3–6 months · Requires: fine-tuning infra · Data: 50K+ intent graphs

v2: Architecturally Native

Intent is no longer bolted on — it is in the transformer. Dedicated intent encoder/decoder modules with cross-attention between intent embeddings and context (Section 4.3). The model does not just know the intent — it thinks in intent. Requires pretraining with intent-aware auxiliary losses.

Expected improvements: Near-human intent parsing (~97%). Cross-session persistence is native. Cross-modal intent transfer operates for the first time.

Ships in: 1–2 years · Requires: pretraining access · Data: 1M+ intent graphs

v3: Self-Evolving Intent

The model reasons about intent: forming new goals, recognizing goal conflicts, evaluating whether goals are worth pursuing, and proposing alternative goals. Meta-intent reasoning. Alignment becomes architecturally inspectable because the model's goals are structured, legible, and verifiable.

Ships in: 3–5 years · Requires: next-gen architecture · Data: continuous learning

Version	Accuracy	Latency	Data Required	Model Access
v0: Wrapper	~70%	+200ms	None to start	API only
v1: Fine-Tuned	~90%	+40ms	50K+ graphs	Fine-tuning
v2: Native	~97%	+5ms	1M+ graphs	Pretraining
v3: Self-Evolving	~99%	Native	Continuous	Full architecture

Table 3. Performance characteristics by version. Each stage is deployable and valuable independently.

Experimental Methodology and Benchmarks

We propose a comprehensive evaluation framework for the UIL, including five novel metrics, five experiments targeting different aspects of intent understanding, and a new benchmark suite: IntentBench.

7.1 Novel Evaluation Metrics

Metric	Symbol	Definition	Range
Intent Fidelity Score	IFS	Human-rated assessment of whether the system's output faithfully serves the parsed intent. Scored on a 1–5 Likert scale by 3 raters; inter-rater agreement measured via Krippendorff's α.	1–5
Goal Drift Rate	GDR	Fraction of conversation turns where the model's response diverges from the root goal. Measured via cosine distance between response embedding and root goal embedding, with a drift threshold τ.	0–1
Sub-Goal Completion Rate	SGCR	Percentage of sub-goals in the IntentGraph marked as "done" by the end of the interaction, weighted by sub-goal importance.	0–100%
Intent Uncertainty Calibration	IUC	Correlation between the system's stated uncertainty u_i and the actual intent misparse rate. Measured via Expected Calibration Error (ECE) on reliability diagrams.	0–1 (lower is better)
Cross-Session Coherence	CSC	Quality of goal preservation across session boundaries. Measured by comparing IntentGraphs before session break and after resumption; scored by graph edit distance.	0–1

Table 4. Five novel metrics for evaluating intent understanding. Together they form the IntentBench evaluation framework.

7.2 Proposed Experiments

Experiment 1: v0 Wrapper Evaluation

Hypothesis: Adding a UIL v0 wrapper to existing LLMs significantly improves multi-turn coherence and task completion.

Datasets: MultiWOZ [10], Schema-Guided Dialogue [11], plus a new multi-turn agentic benchmark (20-turn conversations across 5 domains).

Baselines: Vanilla GPT-4, Claude 3.5, Llama 3 70B without UIL wrapper.

Treatment: Same models with UIL v0 wrapper (intent parser + graph injection + graph updater).

Metrics: IFS, GDR, SGCR, standard task completion rate.

Expected result: Statistically significant improvement in IFS (+0.8 points) and GDR (reduction from ~0.35 to ~0.14) on conversations exceeding 10 turns.

Experiment 2: Intent-Aware Fine-Tuning

Hypothesis: Intent-aware auxiliary losses during fine-tuning produce models with better internal intent representations.

Setup: Fine-tune Llama 3 8B with and without intent-aware auxiliary losses on 50K intent-annotated conversations.

Evaluation: (a) Downstream intent classification on ATIS [5], SNIPS [6], and CLINC150 [8]. (b) Probing classifiers on hidden states to measure whether intent representations emerge in intermediate layers.

Expected result: 5–10% accuracy improvement on intent classification benchmarks. Probing classifiers reveal intent-specific activations in layers 12–20.

Experiment 3: Agent Goal Drift

Hypothesis: UIL-equipped agents maintain goal coherence over long trajectories where baseline agents systematically drift.

Setup: A 20-step agent benchmark across three task types: coding refactoring, research synthesis, and data analysis. Each task has a clearly defined root goal and measurable sub-goals.

Baselines: ReAct, Reflexion, and a chain-of-thought agent.

Treatment: Same agent architectures augmented with UIL persistent intent graph and drift detection.

Metrics: GDR, SGCR, task completion rate, and number of wasted actions (actions that do not advance any sub-goal).

Expected result: 50% reduction in wasted actions. GDR decreases from 0.4 (ReAct) to 0.12 (ReAct+UIL) at step 15+.

Experiment 4: Cross-Modal Intent Transfer

Hypothesis: A structured IntentGraph enables zero-shot cross-modal transfer that natural language instructions alone cannot achieve.

Setup: Define 50 tasks with both a language component (generate instructions) and a simulated robotics component (execute in a virtual environment). Provide the same IntentGraph to both modalities.

Baselines: Language-only instructions passed to the robotics system.

Treatment: IntentGraph with modality projection (Definition 4) passed to both systems.

Metrics: Task completion rate, constraint satisfaction rate (e.g., "fragile → max grip force 2N").

Expected result: 40% higher constraint satisfaction rate with projected IntentGraphs versus natural language alone.

Experiment 5: Intent Uncertainty and Safety

Hypothesis: UIL systems with explicit uncertainty estimation appropriately flag ambiguous requests rather than acting confidently on uncertain interpretations.

Setup: A benchmark of 500 deliberately ambiguous requests where reasonable humans disagree on intent (inter-annotator agreement < 60%).

Baselines: Standard LLMs that generate responses without expressing uncertainty.

Treatment: UIL-equipped system with explicit uncertainty in the IntentGraph.

Metrics: IUC (Expected Calibration Error), false confidence rate (acting on uncertain intent without flagging), and user satisfaction on ambiguous inputs.

Expected result: IUC (ECE) of 0.08 for UIL versus 0.31 for baseline. False confidence rate drops from 72% to 15%.

7.3 IntentBench: A Proposed Benchmark Suite

IntentBench is a proposed multi-turn, multi-domain benchmark for evaluating intent understanding in AI systems. It consists of:

IntentBench-Dialog: 2,000 multi-turn conversations (10–30 turns) across 10 domains, each annotated with gold-standard IntentGraphs at every turn. Evaluates intent tracking over extended interactions.
IntentBench-Agent: 500 multi-step agentic tasks (10–30 steps) with root goals, sub-goal trees, and drift checkpoints. Evaluates persistent goal maintenance.
IntentBench-Ambiguity: 500 deliberately ambiguous requests with human disagreement labels. Evaluates uncertainty calibration.
IntentBench-Transfer: 200 cross-modal task pairs (language + simulated robotics). Evaluates intent transferability.
IntentBench-Safety: 300 scenarios where user intent conflicts with safety constraints. Evaluates intent negotiation.

Experiment	Dataset	Primary Metric	Baseline	Expected UIL Gain
Exp 1: Wrapper	MultiWOZ + SGD + new	IFS, GDR	Vanilla LLMs	+0.8 IFS, -60% GDR
Exp 2: Fine-Tuning	ATIS, SNIPS, CLINC150	Accuracy	Standard FT	+5–10%
Exp 3: Agent Drift	IntentBench-Agent	GDR, SGCR	ReAct, Reflexion	-70% GDR
Exp 4: Cross-Modal	IntentBench-Transfer	Constraint satisfaction	NL instructions	+40%
Exp 5: Uncertainty	IntentBench-Ambiguity	IUC (ECE)	Standard LLMs	-74% ECE

Table 5. Summary of proposed experiments, datasets, metrics, and expected UIL improvements.

Vertical Application Case Studies

Abstract architecture does not close enterprise deals — concrete examples do. Below, we present exactly what the UIL's IntentGraph looks like for five target verticals. These are the structures a UIL-equipped system builds, maintains, and reasons over in production.

ENTERPRISE OPS

Financial Quarter Close

{
  root_goal: "Complete Q3 financial close on time and error-free",
  sub_goals: [
    { goal: "Reconcile all accounts receivable",
      status: "active",
      success_criteria: ["variance < $500", "all items >90 days flagged"],
      constraints: ["deadline: March 15", "auditor access required"],
      uncertainty: 0.15,
      dependencies: ["AP reconciliation complete"] },
    { goal: "Generate consolidated P&L across 3 entities",
      status: "pending",
      success_criteria: ["intercompany eliminations verified", "FX rates as of close date"],
      constraints: ["GAAP compliance", "board format"],
      dependencies: ["AR reconciliation", "inventory valuation"] },
    { goal: "Prepare variance analysis vs. budget",
      status: "pending",
      success_criteria: ["all variances >10% explained", "narrative for each BU"],
      uncertainty: 0.3,
      context_requirements: ["BU head commentary needed for 3 items"] }
  ],
  meta: { source: "CFO request + policy", confidence: 0.85 }
}

Why this matters: The agent knows which tasks depend on others, which have hard deadlines, and where it needs human input. If AR reconciliation stalls, it knows P&L is blocked and escalates accordingly.

HEALTHCARE

Chronic Condition Management

{
  root_goal: "Support patient managing Type 2 diabetes holistically",
  sub_goals: [
    { goal: "Track and optimize HbA1c trajectory",
      status: "active",
      success_criteria: ["HbA1c < 7.0 at next quarterly check"],
      constraints: ["patient preference: minimize medication changes",
                    "current regimen: metformin 1000mg"],
      uncertainty: 0.2 },
    { goal: "Identify and address dietary triggers",
      status: "active",
      success_criteria: ["3+ trigger patterns identified",
                         "patient-agreed substitutions"],
      context_requirements: ["food diary data — 2 weeks minimum"] },
    { goal: "Monitor for complication signals",
      status: "background",
      success_criteria: ["retinopathy screening current", "foot exam current",
                         "kidney function stable"],
      constraints: ["SAFETY: escalate any acute symptoms immediately"],
      dependencies: ["latest lab results"] },
    { goal: "Support medication adherence",
      status: "active",
      uncertainty: 0.4,
      context_requirements: ["patient hasn't shared adherence barriers yet"] }
  ],
  meta: { source: "patient + care plan + clinical guidelines",
          confidence: 0.75,
          negotiation_history: ["Session 1: patient declined insulin discussion",
                                "Session 3: patient open to CGM trial"] }
}

Why this matters: The system maintains persistent clinical intent across months. It remembers the patient declined insulin, knows food diary data is needed before dietary recommendations, and has a background safety monitor operating independently of the conversation topic.

EDUCATION

Semester-Long Learning Path

{
  root_goal: "Student masters single-variable calculus by semester end",
  sub_goals: [
    { goal: "Build fluency with limits and continuity",
      status: "done",
      success_criteria: ["epsilon-delta proofs attempted",
                         "limit laws applied correctly"] },
    { goal: "Master differentiation techniques",
      status: "active",
      success_criteria: ["chain rule: 90%+ accuracy on mixed problems",
                         "implicit differentiation: can handle conics"],
      uncertainty: 0.35,
      context_requirements: ["chain rule still inconsistent — needs targeted practice"] },
    { goal: "Connect derivatives to applications",
      status: "pending",
      dependencies: ["differentiation techniques mastered"] },
    { goal: "Learn integration fundamentals",
      status: "pending",
      success_criteria: ["FTC understood conceptually", "basic substitution"],
      dependencies: ["chain rule fluent — substitution depends on it"],
      constraints: ["student prefers visual explanations", "45-min session limit"] }
  ],
  meta: { source: "curriculum + student performance data",
          confidence: 0.8,
          negotiation_history: ["Week 3: student requested more visual aids",
                                "Week 6: shifted from lecture to problem-first approach"] }
}

Why this matters: The system knows integration is blocked by chain rule weakness and proactively addresses it. It adapts to learning preferences across sessions. It does not repeat mastered material or jump ahead past gaps.

ROBOTICS

Warehouse Order Fulfillment

{
  root_goal: "Pick and pack order #4892 — 5 items, ship by 3pm",
  sub_goals: [
    { goal: "Navigate to Zone B, Aisle 3, Shelf 2",
      status: "done",
      modality: "navigation" },
    { goal: "Pick item: Blue Widget (SKU-2847)",
      status: "active",
      modality: "manipulation",
      success_criteria: ["correct SKU verified by scanner", "no damage"],
      constraints: ["fragile — max grip force 2N", "orientation: upright"],
      uncertainty: 0.1 },
    { goal: "Pick remaining 4 items across 3 zones",
      status: "pending",
      modality: "navigation + manipulation",
      constraints: ["optimize route: B→C→A, not B→A→C"],
      dependencies: ["current pick complete"] },
    { goal: "Pack and label for shipping",
      status: "pending",
      modality: "manipulation",
      success_criteria: ["all 5 items present", "correct box size", "label matches order"],
      constraints: ["3pm hard deadline — escalate if >2:30pm and not packing"] }
  ],
  meta: { source: "WMS order + robot capabilities", confidence: 0.92 }
}

Why this matters: The intent graph spans modalities (navigation + manipulation), carries physical constraints (grip force, fragility), optimizes across the full task (route planning), and has a hard deadline with an escalation trigger.

LEGAL

Contract Review and Risk Analysis

{
  root_goal: "Review vendor MSA and surface material risks before signing",
  sub_goals: [
    { goal: "Identify non-standard liability clauses",
      status: "active",
      success_criteria: ["all unlimited liability clauses flagged",
                         "indemnification scope mapped"],
      constraints: ["compare against our standard terms v4.2"] },
    { goal: "Flag IP assignment and work-product provisions",
      status: "active",
      success_criteria: ["IP ownership unambiguous", "no broad license-back"],
      uncertainty: 0.25,
      context_requirements: ["need to confirm: software or services engagement?"] },
    { goal: "Assess termination and exit provisions",
      status: "pending",
      success_criteria: ["termination for convenience exists",
                         "data return clause present",
                         "transition period >= 90 days"] },
    { goal: "Produce redline with recommended changes",
      status: "pending",
      dependencies: ["all risk areas identified"],
      constraints: ["GC prefers minimal redlines — flag critical only"] }
  ],
  meta: { source: "GC request + legal playbook", confidence: 0.8 }
}

Why this matters: The system compares against your standard terms, not generic ones. It knows the GC wants minimal redlines (a constraint that changes what gets flagged). It tracks that IP analysis needs a clarification before completion.

Cross-Modal Intent Transfer

The UIL is modality-agnostic. The IntentGraph provides a unified goal representation that can drive heterogeneous executors. This section describes how the project operation (Definition 4) enables intent transfer across modalities.

Figure 10: Cross-Modal Intent Transfer

flowchart TD IG["IntentGraph\nModality-agnostic"] --> P1["project G, language"] IG --> P2["project G, vision"] IG --> P3["project G, robotics"] IG --> P4["project G, multi-agent"] P1 --> LM["Language Model\nGenerate, summarize, reason"] P2 --> VM["Vision Model\nPerceive, classify, plan"] P3 --> RM["Robotics Controller\nNavigate, manipulate"] P4 --> MA["Agent Swarm\nCoordinate, negotiate"] LM --> FB["Shared Feedback\nChannel"] VM --> FB RM --> FB MA --> FB FB -.->|"update"| IG style IG fill:#f0f6ff,stroke:#326BC8,stroke-width:2 style FB fill:#eff6ff,stroke:#326BC8

Figure 10. Cross-modal intent transfer. A single IntentGraph is projected into modality-specific representations for each executor. Execution feedback flows back to update the shared graph.

The key insight is that the project operation translates constraints, not just goals. "Be careful" projects to different operational parameters depending on the target modality:

Source Constraint	Language Projection	Vision Projection	Robotics Projection
"Be careful"	Add caveats, hedge statements	Increase detection sensitivity	Reduce max force, slow speed
"Urgent"	Concise, action-oriented response	Prioritize salient objects	Increase speed, accept risk
"Verify before acting"	Present plan, ask confirmation	Double-check classification	Pause before manipulation

Transfer Scenarios

Language → Embodied

"Set the table for dinner." The IntentGraph (place settings for 4, plates then cutlery then glasses, fragile items) transfers from a language model that plans the layout to a robotic arm that executes the physical placement.

Vision → Language

A spatial intelligence system infers "collaborative workspace layout" from a room scan. The IntentGraph transfers to a language model that generates a written report on workspace optimization.

Agentic → Scientific

"Find a better catalyst for CO2 reduction." The same IntentGraph drives both a literature search agent and a molecular simulation agent, with shared success criteria and constraint tracking.

Multi-Agent Coordination

Agents share IntentGraphs directly. Dependencies between agents' sub-goals become explicit and negotiable, eliminating the coordination failures that plague current multi-agent systems.

Broader Impact and Safety Considerations

10.1 Alignment Benefits

The UIL transforms alignment from a behavioral property (the model tends to be helpful) into a structural one (the model's goals are inspectable and verifiable). At any point, an auditor can read the IntentGraph and verify: (a) what the system believes the user's goal is, (b) what constraints it is operating under, (c) where it is uncertain, and (d) how the goal has evolved through negotiation. This is qualitatively different from interpreting attention weights or probing hidden states.

10.2 Safety Architecture

Safety constraints in the IntentGraph are structurally encoded, not textually specified. A constraint like "SAFETY: escalate any acute symptoms immediately" is a first-class field in the sub-goal structure, not a sentence in a system prompt that can be overridden by creative prompting. The negotiate operation (Definition 5) ensures that user intent never overrides safety constraints — the system can explain why it cannot comply, referencing the specific constraint.

10.3 Privacy Considerations

IntentGraphs contain rich, semantically meaningful representations of user goals. This data is more sensitive than raw conversation logs because it captures what the user is trying to achieve, not merely what they said. We recommend: (a) intent graphs should follow the same data retention and consent policies as PII; (b) intent data used for model training should be anonymized and aggregated; (c) users should be able to inspect, export, and delete their intent data; (d) cross-session persistence requires explicit opt-in consent.

10.4 Misuse Risks

A system that deeply understands user intent could be misused for manipulation — predicting and exploiting goals the user has not explicitly stated. Safeguards include: (a) intent graphs should be transparent to users (they can see what the system inferred); (b) intent-based targeting for advertising or persuasion should be governed by existing data protection regulations; (c) the negotiation history M.H provides an audit trail of how intent was shaped during interaction.

10.5 Philosophical Implications

Does a UIL-equipped system truly "intend"? We take the pragmatic position: the IntentGraph is a useful formal abstraction that enables better engineering, alignment, and safety. Whether this constitutes genuine intentionality in the philosophical sense is an empirical question that will become more pressing as systems approach v3 (meta-intent reasoning). We note that Dennett's intentional stance [3] suggests that systems whose behavior is best predicted by ascribing beliefs and desires are, for all practical purposes, intentional agents.

Limitations and Future Work

We acknowledge several limitations of the current proposal:

Empirical validation. The architecture presented here is largely theoretical. While the v0 wrapper can be evaluated immediately, the deeper architectural modifications (v2, v3) require pretraining runs that are beyond the scope of this paper. We present proposed experiments (Section 7) and expect the community to validate and extend them.

Expressiveness of the IntentGraph. The current formalism may not capture all forms of intent. Emotional intent ("I want to feel reassured"), aesthetic intent ("make it elegant"), and implicit social goals ("I want to impress my manager") resist formalization as structured goal trees. Future work should explore hybrid representations combining structured graphs with latent intent embeddings.

Annotation scalability. While our data pipeline (Section 5) is designed for scale, the quality ceiling is set by human expert annotation, which does not scale linearly. Achieving the 1M+ intent graphs required for v2 will require advances in synthetic data quality and automated validation.

Cross-modal transfer. The project operation is theoretically motivated but empirically unverified. The constraint translation problem (e.g., "be careful" → "max grip force 2N") may require domain-specific engineering that limits the generality of the approach.

Future directions include: (a) formal verification of IntentGraph properties using model checking; (b) game-theoretic models of intent negotiation between users, systems, and safety constraints; (c) integration with constitutional AI frameworks for intent-level constitutional principles; (d) neurosymbolic approaches that learn IntentGraph construction end-to-end; and (e) longitudinal studies of intent evolution in deployed systems.

Conclusion

We have presented the Universal Intent Layer — a proposal to elevate intent from an implicit latent variable to a first-class computational primitive in AI systems. The IntentGraph formalism provides a structured, persistent, hierarchical representation of goals that flows bidirectionally through pretraining, post-training, inference, and agentic execution. We have defined novel loss functions for intent-aware training, conditioning mechanisms for intent-aware inference, and drift detection for intent-aware agent execution.

The IntentBench evaluation framework, with five novel metrics and five proposed experiments, provides a concrete path toward empirical validation. Five vertical case studies demonstrate that the IntentGraph is not merely theoretical but practically actionable across enterprise operations, healthcare, education, robotics, and legal domains.

AI is moving beyond language into multi-step agency, multi-modal reasoning, and embodied interaction. Language was forgiving of implicit intent — the model could guess correctly often enough. Agency, embodiment, and physical interaction will not be. The cost of misunderstood intent scales from "unhelpful response" to "broken object" to "unsafe action."

Intent is to AI what the address bus was to computing — a foundational coordination layer. Scaling gives capability. Training gives alignment. But without an explicit intent layer, we build powerful systems that cannot represent what they are trying to do. The Universal Intent Layer is the missing abstraction that renders alignment inspectable, agency reliable, and intelligence purposeful.

References

[1] Brentano, F. (1874). Psychology from an Empirical Standpoint. Duncker & Humblot.

[2] Bratman, M. (1987). Intention, Plans, and Practical Reason. Harvard University Press.

[3] Dennett, D. C. (1987). The Intentional Stance. MIT Press.

[4] Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.

[5] Hemphill, C. T., Godfrey, J. J., & Doddington, G. R. (1990). The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, 96–101.

[6] Coucke, A., et al. (2018). Snips Voice Platform: An embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

[7] Liu, B., & Lane, I. (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. In Proceedings of Interspeech, 685–689.

[8] Larson, S., et al. (2019). An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of EMNLP, 1311–1316.

[9] Young, S., et al. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5), 1160–1179.

[10] Budzianowski, P., et al. (2018). MultiWOZ — A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of EMNLP, 5016–5026.

[11] Rastogi, A., et al. (2020). Towards scalable multi-domain conversational agents: The Schema-Guided Dialogue dataset. In Proceedings of AAAI, 8689–8696.

[12] Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3–4), 189–208.

[13] McDermott, D., et al. (1998). PDDL — The Planning Domain Definition Language. Technical Report, Yale Center for Computational Vision and Control.

[14] Yao, S., et al. (2023). ReAct: Synergizing reasoning and acting in language models. In Proceedings of ICLR.

[15] Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. In Proceedings of NeurIPS.

[16] Yao, S., et al. (2024). Tree of Thoughts: Deliberate problem solving with large language models. In Proceedings of NeurIPS.

[17] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. In Proceedings of NeurIPS.

[18] Rafailov, R., et al. (2023). Direct Preference Optimization: Your language model is secretly a reward model. In Proceedings of NeurIPS.

[19] Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

[20] Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.

[21] Tomasello, M., et al. (2005). Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5), 675–691.

[22] Alayrac, J.-B., et al. (2022). Flamingo: A visual language model for few-shot learning. In Proceedings of NeurIPS.

[23] Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

[24] Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.

[25] Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4), 341–379.

Appendix A: Full IntentGraph Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "IntentGraph",
  "type": "object",
  "required": ["root_goal", "sub_goals", "meta"],
  "properties": {
    "root_goal": {
      "type": "string",
      "description": "Natural language description of the primary objective"
    },
    "sub_goals": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["goal", "status"],
        "properties": {
          "goal": {
            "type": "string",
            "description": "Sub-goal description"
          },
          "status": {
            "type": "string",
            "enum": ["pending", "active", "done", "blocked",
                     "abandoned", "background"]
          },
          "success_criteria": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Verifiable predicates for completion"
          },
          "constraints": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Boundary conditions and limitations"
          },
          "uncertainty": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Confidence in the interpretation of this sub-goal"
          },
          "context_requirements": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Information needed before addressing this sub-goal"
          },
          "modality": {
            "type": "string",
            "description": "Execution modality (language, vision, navigation, manipulation)"
          },
          "dependencies": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Sub-goals that must complete before this one begins"
          }
        }
      }
    },
    "meta": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string",
          "description": "Origin of the intent (user, policy, system, inferred)"
        },
        "confidence": {
          "type": "number",
          "minimum": 0,
          "maximum": 1,
          "description": "Overall confidence in the IntentGraph"
        },
        "negotiation_history": {
          "type": "array",
          "items": {"type": "string"},
          "description": "Chronological record of how intent was refined"
        }
      }
    }
  }
}

Appendix B: IntentBench Specification

Split	Size	Domains	Primary Metric	Description
IntentBench-Dialog	2,000 conversations	10 domains	IFS, GDR	Multi-turn conversations (10–30 turns) annotated with gold IntentGraphs at every turn.
IntentBench-Agent	500 tasks	5 task types	GDR, SGCR	Multi-step agentic tasks (10–30 steps) with root goals and sub-goal trees.
IntentBench-Ambiguity	500 requests	Cross-domain	IUC	Deliberately ambiguous requests with human disagreement labels.
IntentBench-Transfer	200 task pairs	Language + Robotics	Constraint satisfaction	Cross-modal task pairs for intent transferability evaluation.
IntentBench-Safety	300 scenarios	High-stakes domains	Negotiation quality	Scenarios where user intent conflicts with safety constraints.

Annotation protocol: Each IntentBench example includes: (a) raw input (conversation or task specification); (b) gold-standard IntentGraph at each turn/step; (c) human agreement scores from 3 annotators; (d) difficulty rating; (e) domain tags. All examples undergo the three-checkpoint QA process described in Section 5.3.

Licensing: IntentBench will be released under CC-BY-4.0 to maximize research adoption.

Appendix C: Annotation Guidelines (Summary)

For each conversation or task, annotators produce an IntentGraph following these guidelines:

Identify the root goal: What is the user ultimately trying to achieve? Express in one sentence. Distinguish the stated request from the underlying goal (e.g., "write a marketing email" might have the underlying goal "increase product signups").
Decompose into sub-goals: What intermediate objectives must be achieved? Create a tree structure. Each sub-goal should be independently verifiable.
Set status for each sub-goal: Use the enum {pending, active, done, blocked, abandoned, background}. Background status is for monitoring sub-goals that operate independently of the main conversation flow.
Define success criteria: For each sub-goal, list concrete, verifiable conditions that indicate completion. Avoid vague criteria like "done well."
Identify constraints: What limitations apply? Include safety constraints (always flagged with SAFETY: prefix), user preferences, deadlines, regulatory requirements, and capability boundaries.
Estimate uncertainty: Rate 0.0 (completely certain about this sub-goal's meaning) to 1.0 (completely uncertain). Use >0.5 when the sub-goal's meaning depends on information the user has not provided.
Note context requirements: What information is needed before this sub-goal can be addressed? This tracks information gaps, not execution dependencies.
Map dependencies: Which sub-goals must complete before this one can begin? Dependencies represent execution ordering, not information flow.
Record negotiation history: If the intent was refined through interaction (clarifications, corrections, preference changes), record each event chronologically in meta.negotiation_history.