{
  "access": "public",
  "type": "reference",
  "format": "markdown",
  "title": "Semio: A Semantic Interface Layer for Tool-Oriented AI Systems",
  "url": "https://app.datagrout.ai/labs/semio",
  "summary": "Modern AI agents face a fundamental interoperability problem: tool surfaces are syntactically heterogeneous but semantically similar. A customer record in Salesforce, HubSpot, and Stripe represents the same conceptual entity but exposes different schemas, field names, and access patterns. Current approaches rely on LLM reasoning to bridge these gaps, resulting in probabilistic failures, high token costs, and brittle integrations.\n\nSemio introduces a semantic interface layer that enables deterministic tool composition through typed contracts. Rather than requiring agents to reason about schema differences at runtime, Semio provides a declarative type system where tools announce their semantic capabilities and adapters handle structural transformations. This approach reduces integration fragility, lowers inference overhead, and enables formal verification of multi-step workflows.\n\nThe system operates as a compatibility substrate: tools declare inputs and outputs using semantic types (e.g., `billing.customer@1`), the planner reasons about type compatibility, and adapters bridge structural differences when needed. Identity anchors (keys) enable cross-system entity resolution without forcing schema normalization.",
  "topics": [
    "semantic-types",
    "interoperability",
    "tool-integration",
    "prolog"
  ],
  "content_markdown": "## Abstract\n\nModern AI agents face a fundamental interoperability problem: tool surfaces are syntactically heterogeneous but semantically similar. A customer record in Salesforce, HubSpot, and Stripe represents the same conceptual entity but exposes different schemas, field names, and access patterns. Current approaches rely on LLM reasoning to bridge these gaps, resulting in probabilistic failures, high token costs, and brittle integrations.\n\nSemio introduces a semantic interface layer that enables deterministic tool composition through typed contracts. Rather than requiring agents to reason about schema differences at runtime, Semio provides a declarative type system where tools announce their semantic capabilities and adapters handle structural transformations. This approach reduces integration fragility, lowers inference overhead, and enables formal verification of multi-step workflows.\n\nThe system operates as a compatibility substrate: tools declare inputs and outputs using semantic types (e.g., `billing.customer@1`), the planner reasons about type compatibility, and adapters bridge structural differences when needed. Identity anchors (keys) enable cross-system entity resolution without forcing schema normalization.\n\n---\n\n## Problem Landscape\n\n### Tool Sprawl and Schema Fragmentation\n\nEnterprise systems expose thousands of API endpoints across hundreds of services. 
Each system evolved independently, resulting in:\n\n- **Incompatible schemas** - A \"customer\" in one system has different fields than another\n- **Naming variations** - `customer_id`, `customerId`, `external_id`, `cust_ref` all mean \"customer identifier\"\n- **Type mismatches** - Dates as strings, ISO timestamps, or Unix epochs depending on the API\n- **Identity fragmentation** - No canonical way to reference the same entity across systems\n\n### Manual Integration Brittleness\n\nTraditional integration approaches require hand-coded glue logic for every tool pair. This results in:\n\n- **O(N^2) integration complexity** - Every new tool requires adapters for every existing tool\n- **Maintenance burden** - API changes break existing workflows\n- **Hidden assumptions** - Implicit schema mappings that fail silently\n- **No reusability** - Integration logic cannot be shared or composed\n\n### LLM Probabilistic Failures\n\nUsing LLM reasoning alone to bridge integration gaps introduces:\n\n- **Non-deterministic behavior** - Same request produces different results\n- **Token waste** - Extensive context needed to describe schemas and mappings\n- **Silent failures** - Type mismatches discovered at execution time, not plan time\n- **Hallucination risk** - LLMs invent plausible but incorrect field mappings\n\n---\n\n## Design Principles\n\nSemio is built on five core principles:\n\n### 1. Semantic Over Syntactic Compatibility\n\nTools declare what they produce and consume semantically, not syntactically. A tool outputs `billing.invoice@1`, not \"an object with fields id, amount, customer_id.\"\n\n### 2. Partial Type Coverage\n\nNot all fields must be declared. Types can be partially specified, with core fields annotated and extended fields available but not semantically indexed. This balances expressiveness with maintenance burden.\n\n### 3. Identity-First Interoperability\n\nCross-system composition requires identity anchors. 
Semio defines key kinds (email, id, external_id) that enable entity resolution without forcing globally unique identifiers.\n\n### 4. Declarative Tool Contracts\n\nTools announce their capabilities through typed contracts. The planner reasons about compatibility without executing tools. This enables pre-execution verification and cost estimation.\n\n### 5. Safety Through Structural Constraints\n\nType mismatches are caught at plan time, not execution time. Adapters are explicitly modeled and verified, preventing silent data corruption.\n\n---\n\n## Semio Model Overview\n\n### Semantic Types\n\nTypes follow a versioned naming convention:\n\n```\n<family>.<entity>@<version>\n```\n\nExamples:\n- `crm.account@1` - CRM account record\n- `billing.invoice@1` - Billing invoice\n- `core.email@1` - Email address (primitive)\n- `crm.account.list@1` - Collection of accounts\n\nEach type declares:\n\n- **Family** - Domain grouping (crm, billing, hr, etc.)\n- **Label** - Human-readable name\n- **Keys** - Identity anchors available on this type\n- **Fields** - Named properties with types and tiers\n- **Containers** - List and page variants\n\n### Type Lifting and Duck Typing\n\nVendor-specific data is automatically recognized as semantic types through structural matching. 
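To make the matching rule concrete, here is a minimal sketch (Python is used purely for exposition; the field map, function names, and required-field lists are hypothetical, not the platform's API):

```python
# Illustrative sketch of structural type matching ("type lifting"), not the
# platform's implementation. A record is lifted to a semantic type when,
# after vendor-field normalization, it provides every required field.

# Hypothetical vendor-to-semantic field mapping for Salesforce leads.
VENDOR_FIELD_MAP = {"Id": "id", "Email": "email", "Company": "company"}

def normalize(record: dict) -> dict:
    """Rename vendor-specific fields to their semantic names."""
    return {VENDOR_FIELD_MAP.get(k, k.lower()): v for k, v in record.items()}

def satisfies(record: dict, required: list) -> bool:
    """Duck-typing check: does the record provide all required fields?"""
    data = normalize(record)
    return all(field in data and data[field] is not None for field in required)

# A Salesforce-shaped record satisfies crm.lead@1 (required: id, email) ...
lead = {"Id": "00Q5G00000ABC123", "Email": "user@example.com"}
print(satisfies(lead, ["id", "email"]))                  # True
# ... but a record missing "Email" does not.
print(satisfies({"Id": "00Q5G00000ABC123"}, ["id", "email"]))  # False
```

In the real system the vendor type (e.g., `salesforce.Lead`) is preserved alongside the lifted semantic type, as the example below shows.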
Objects that provide all required fields of a type implicitly satisfy that type.\n\nExample:\n```elixir\n# Salesforce returns:\n%{\"Id\" => \"00Q...\", \"Email\" => \"user@example.com\", \"Company\" => \"Acme Corp\"}\n\n# Automatically recognized as crm.lead@1 (has required \"id\" and \"email\")\n# Vendor type (salesforce.Lead) preserved alongside semantic type\n# Usable anywhere crm.lead@1 is accepted\n```\n\n**Cross-type matching uses shared anchors** (like `email`), not type-specific IDs:\n```elixir\n# Lead: {id: \"00Q...\", email: \"user@example.com\"}\n# Account: {id: \"001...\", email: \"user@example.com\"}\n# Match via \"email\" anchor, not \"id\" (different ID namespaces)\n```\n\n### Keys: Identity Anchors\n\nKeys enable cross-system entity resolution:\n\n```yaml\nkeys: [email, id, external_id]\n```\n\nA tool that outputs `crm.account@1` with keys `[email, id]` can provide input to a tool requiring `billing.customer@1` if an adapter bridges the type difference and the two types share a common key (e.g., `email`).\n\n### Fields and Tiers\n\nFields are categorized into strategic tiers that guide planning and optimization:\n\n- **Core** - Essential properties required for basic operations (id, name, email)\n- **Useful** - Valuable fields that enhance workflows but aren't strictly required (company, status, owner)\n- **PII** - Fields containing personally identifiable information requiring redaction (email, phone)\n- **Index** - Fields optimized for search and lookup operations (email, company)\n\nExample type definition with tiers:\n\n```json\n{\n  \"$id\": \"crm.lead@1\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"id\": { \"$ref\": \"core.entity_ref@1\" },\n    \"name\": { \"type\": \"string\" },\n    \"company\": { \"type\": \"string\" },\n    \"email\": { \"type\": \"string\", \"format\": \"email\" },\n    \"status\": { \n      \"type\": \"string\", \n      \"enum\": [\"new\", \"working\", \"qualified\", \"unqualified\", \"other\"] \n    },\n    \"owner\": { 
\"$ref\": \"core.user_ref@1\" },\n    \"created_at\": { \"type\": \"string\", \"format\": \"date-time\" }\n  },\n  \"required\": [\"id\", \"email\"],\n  \"x_tiers\": {\n    \"core\": [\"id\", \"name\", \"email\"],\n    \"useful\": [\"company\", \"status\", \"owner\"],\n    \"pii\": [\"email\"],\n    \"index\": [\"email\", \"company\"]\n  }\n}\n```\n\nThis tiering system enables cost-aware planning: core fields are always fetched, useful fields are included when credits permit, and PII fields require policy clearance.\n\n### Adapters: Type Bridges\n\nAdapters transform one type to another:\n\n```yaml\nadapter:\n  from: crm.account@1\n  to: billing.customer@1\n  anchor: email\n  confidence: 0.9\n  cost: 1.0\n```\n\nConfidence represents how reliably an adapter bridges two types. A direct field mapping with no information loss scores 1.0, while a lossy or heuristic mapping (e.g., inferring a billing customer from a CRM lead where not all fields carry over) scores lower. The planner treats adapters as weighted edges in a type graph, using confidence and cost as search criteria when discovering transformation paths.\n\n### Tool Contracts\n\nTools declare their semantic surface with rich metadata:\n\n```yaml\ntool: salesforce.query_accounts\ninputs:\n  - name: email\n    type: core.email@1\n    required: true\noutputs:\n  - type: crm.account@1\n    mode: one\nprovides_keys: [id, email, external_id]\nrequires_keys: [email]\nsupports_select: true\nselect_fields: [id, name, email, company, status, owner, created_at]\njmespath_selector: \"records[0]\"\n```\n\n**Key contract features:**\n\n- **Field selection** - Tools can specify which semantic fields they support (`supports_select`, `select_fields`)\n- **JMESPath selectors** - Integration-specific paths for extracting typed fields from responses\n- **Identity anchors** - Which keys are provided for cross-system resolution\n- **Enrichment capability** - Whether tool can augment partial data\n\n**JMESPath Field 
Mapping:**\n\nIntegrations declare how to map vendor-specific JSON fields to semantic types using JMESPath selectors. JMESPath is a JSON query language (like XPath for JSON) that extracts data from complex API responses.\n\n**Example - Salesforce SOQL Query:**\n\nAPI returns:\n```json\n{\n  \"totalSize\": 1,\n  \"done\": true,\n  \"records\": [{\n    \"Id\": \"00Q5G00000ABC123\",\n    \"Email\": \"user@example.com\",\n    \"Company\": \"Acme Corp\",\n    \"Status\": \"New\"\n  }]\n}\n```\n\nSemantic field mappings for `crm.lead@1`:\n```json\n{\n  \"id\": \"records[0].Id\",\n  \"email\": \"records[0].Email\",\n  \"company\": \"records[0].Company\",\n  \"status\": \"records[0].Status | lowercase(@)\"\n}\n```\n\n(Note: `lowercase` is not a JMESPath built-in; the standard function library includes no case-conversion functions, so integrations register it as a custom extension function.)\n\n**Result after extraction:**\n```json\n{\n  \"id\": \"00Q5G00000ABC123\",\n  \"email\": \"user@example.com\",\n  \"company\": \"Acme Corp\",\n  \"status\": \"new\"\n}\n```\n\nThis normalized data satisfies `crm.lead@1` and can be used in cross-system workflows without additional transformation.\n\n**Common patterns:**\n- `records[0]` - First item from paginated list\n- `data.items[*]` - All items from nested array\n- `response.user.{id: id, email: email}` - Multi-field projection\n- `results[?active=='true']` - Filtered selection\n\n**Enrichment annotations:**\n\nTools can declare their ability to augment partial data. 
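As an illustrative sketch of how such declarations can be consulted (Python purely for exposition; the fact shapes and the `get_lead_details` entry are hypothetical, not the platform's API):

```python
# Illustrative sketch, not the platform's implementation: given a typed
# record with missing ("hole") fields, pick a declared enrichment tool
# that can fill them using a key the record already provides.

# Hypothetical enrichment facts: tool -> (input type, lookup keys, added fields)
ENRICHERS = {
    "get_lead_details": ("crm.lead@1", {"id"}, {"status", "owner"}),
}

def find_enricher(record_type: str, have: set, need: set):
    """Return a tool that fills (need - have), or None if nothing is missing."""
    holes = need - have
    if not holes:
        return None  # record is already complete
    for tool, (typ, keys, adds) in ENRICHERS.items():
        if typ == record_type and keys <= have and holes <= adds:
            return tool
    raise LookupError(f"no enrichment tool fills {holes}")

# Plan needs status; record has only id and email -> enrich via the id key.
print(find_enricher("crm.lead@1", {"id", "email"}, {"id", "email", "status"}))
# -> get_lead_details
```

In the actual system these declarations live as structured facts in the symbolic planner's knowledge base rather than in application code, but a planner can apply selection logic of roughly this shape.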
Enrichment capabilities are declared as structured facts in the planning knowledge base, specifying which tools can augment which types, what keys they require for lookup, and which fields they add.\n\nWhen the planner encounters incomplete data (e.g., only has `id` and `email` but needs `status`), it automatically searches for enrichment tools that can fill the gaps using available keys.\n\nThis contract system enables:\n- **Static validation** - Verify inputs before execution\n- **Capability discovery** - Find tools by semantic output\n- **Cost estimation** - Calculate credit costs before running\n- **Formal verification** - Prove workflow correctness\n- **Automatic enrichment** - Fill missing fields using available data\n\n---\n\n## Example Workflow Walkthrough\n\n### Scenario: Invoice Generation from CRM Lead\n\n**Goal**: Generate an invoice for a customer using only their email address.\n\n**Available Types**:\n- User provides: `core.email@1`\n- Goal requires: `billing.invoice@1`\n\n### Step 1: Discovery\n\nThe planner searches for tools that can bridge the gap:\n\n```\nhave: [core.email@1]\nwant: billing.invoice@1\n```\n\nDiscovery finds:\n1. `salesforce.get_lead` - Outputs `crm.lead@1`, requires `email` key\n2. `adapter: crm.lead@1 -> billing.customer@1` - Bridges CRM to billing domain\n3. 
`stripe.create_invoice` - Outputs `billing.invoice@1`, requires `billing.customer@1`\n\n### Step 2: Type Path Construction\n\nThe planner constructs a typed path:\n\n```\ncore.email@1 \n  -> [tool: salesforce.get_lead] \n  -> crm.lead@1 {email, id}\n  -> [adapter: crm.lead@1 -> billing.customer@1] \n  -> billing.customer@1 {email, id}\n  -> [tool: stripe.create_invoice]\n  -> billing.invoice@1 {id, amount, customer_id}\n```\n\n### Step 3: Adapter Bridging\n\nThe adapter `crm.lead@1 -> billing.customer@1` uses the `email` anchor:\n\n```yaml\nfrom: crm.lead@1\nto: billing.customer@1\nanchor: email  # Both types provide email\ntransform:\n  - map: lead.email -> customer.email\n  - map: lead.id -> customer.external_id\n```\n\n### Step 4: Execution\n\nThe plan executes deterministically:\n1. Call `salesforce.get_lead(email: \"user@example.com\")` -> Returns lead record\n2. Apply adapter -> Transform lead fields to customer fields\n3. Call `stripe.create_invoice(customer: {email, external_id})` -> Returns invoice\n\nThe entire workflow was verified at plan time. No LLM reasoning needed during execution.\n\n---\n\n## Automatic Data Enrichment\n\n### The Enrichment Problem\n\nPlans often encounter incomplete data. A workflow receives a lead with only `{id, email}` but needs `status` to proceed. Traditional approaches fail here or require manual intervention.\n\n### Discovery-Driven Enrichment\n\nSemio's planner automatically detects \"holes\" in data and searches for enrichment tools. Hole-filling is integrated into the planning search itself, not a separate post-processing pass, so enrichment steps are discovered, costed, and validated alongside primary tool calls within the same search space.\n\n**Enrichment discovery:**\n\n1. 
**Detect missing fields** - Plan requires `crm.lead@1` with `[id, email, status]`\n2. **Current data** - Have `crm.lead@1` with `[id, email]` (missing `status`)\n3. **Search enrichment tools** - Find tools that accept `crm.lead@1` (with `id` key) and add `status`\n4. **Inject enrichment step** - Automatically insert `get_lead_details` before the step that needs `status`\n\n**Example plan with automatic enrichment:**\n\n```\nStep 1: get_lead(email) -> crm.lead@1 {id, email}\nStep 2: [AUTO-ENRICHMENT] get_lead_details(id) -> crm.lead@1 {id, email, status, owner}\nStep 3: check_lead_status(status) -> ...\n```\n\n**Benefits:**\n- **No manual patching** - Planner fills gaps automatically\n- **Key-based lookup** - Uses available keys (id, email) for enrichment\n- **Cost-aware** - Enrichment steps included in cost estimate\n- **Deterministic** - Same missing fields -> same enrichment strategy\n\n### Field Selection and Projection\n\nTools declare which semantic fields they support for selective retrieval. The planner uses this information to request minimal fields, optimize API calls, and identify which fields require additional enrichment lookups.\n\n---\n\n## Integration Surface\n\n### Tool Authors: Declaring Contracts\n\nTool developers annotate their endpoints with semantic type declarations specifying outputs, identity keys, required keys, and output mode. The annotation format integrates with the language's existing metadata system.\n\n### Adapter Configuration\n\nPlatform operators define adapters between semantically equivalent types:\n\n```yaml\nadapters:\n  - from: crm.account@1\n    to: billing.customer@1\n    anchor: email\n    confidence: 0.95\n    cost: 1.0\n    rationale: \"Both represent customer entities\"\n```\n\n### Platform Resolution\n\nThe Semio engine:\n1. Indexes all tool contracts into a semantic graph\n2. Resolves types to families (e.g., `crm.*` types)\n3. Discovers adapter chains via heuristic search over the semantic graph\n4. 
Evaluates plans across cost, latency, and risk objectives\n5. Computes Pareto frontier of non-dominated solutions\n6. Validates key availability for each transformation\n7. Returns optimal plans with multi-objective metrics\n\n### Why Symbolic Planning\n\nPlan generation over the type graph uses logic programming (Prolog) rather than LLM reasoning. This is a deliberate architectural choice. The planning problem (exhaustive search over typed facts with backtracking, unification, and constraint propagation) maps directly to capabilities that logic programming provides natively and that LLMs approximate probabilistically.\n\nSymbolic planning offers properties that are difficult to achieve with neural approaches alone: deterministic outputs (same inputs produce the same plan), exhaustive search (all valid plans are found, not just the first plausible guess), and proof generation (the planner can explain *why* a plan is valid through its derivation trace). These properties are prerequisites for formal verification via Cognitive Trust Certificates.\n\nThe LLM's role is constrained to intent parsing (natural language to structured query) and optional result ranking. The planning itself is symbolic. This separation is explored in detail in *The Symbolic Backbone: Why Agent Systems Need Logic Programming*.\n\n### Prior Art\n\nTyped service composition has precedent in the semantic web services literature. Projects such as OWL-S and WSMO explored similar ideas (typed service contracts, semantic matching, and automated composition) during the 2000s. These efforts produced valuable theoretical foundations but failed to achieve practical adoption, largely due to the knowledge acquisition bottleneck: manually authoring ontologies and service descriptions was prohibitively expensive.\n\nThe neuro-symbolic approach resolves this bottleneck. 
LLMs can infer tool semantics from documentation and API schemas, automatically generating the typed contracts that semantic web systems required humans to author. The symbolic planning layer then operates over these contracts with the same rigor the earlier systems intended, but without the manual overhead that prevented their adoption.\n\n---\n\n## Governance Integration\n\nSemio's typed contracts participate in the broader governance stack. Side effect classifications, PII field annotations, and budget constraints declared in tool contracts are consumed by:\n\n- **Semantic Guards** — pre-execution policy enforcement based on tool metadata (see *Runtime Policy Enforcement for Autonomous AI Systems*)\n- **Dynamic Redaction** — PII-annotated fields trigger automatic masking in outputs\n- **Cognitive Trust Certificates** — plans generated through the type graph are validated for cycle-freedom, type safety, policy compliance, and input consumption before execution (see *Cognitive Trust Certificates: Verifiable Execution Proofs for Autonomous Systems*)\n- **Credit accounting** — cost estimates are computed from the plan and included in execution receipts (see *Credit System: Economic Primitives for Autonomous Systems*)\n\n---\n\n## Implications for Agent Architectures\n\n### Reduced Hallucination Risk\n\nBy moving schema reasoning from runtime (LLM) to plan time (symbolic), Semio eliminates a major source of agent errors. The LLM describes intent; the type system handles compatibility.\n\n### Deterministic Composition\n\nGiven the same inputs and available tools, Semio produces the same plan. This predictability is critical for production systems where non-determinism creates operational risk.\n\n### Lower Inference Overhead\n\nCompact type representations reduce prompt size. 
Instead of including full API schemas in context, the planner sees:\n\n```yaml\ntool: salesforce.query_accounts\nout: crm.account@1\nkeys: [email, id]\n```\n\nThis fits thousands of tools in a single prompt.\n\n### Planner Compatibility\n\nSemio's type graph integrates with existing planners:\n- **Prolog-based** - Native support for typed facts and rules\n- **LLM-based** - Types as structured prompts\n- **Hybrid** - Symbolic planning with LLM refinement\n\n### Scalable Orchestration\n\nAdding a new tool requires:\n1. Declare its semantic contract\n2. Optionally define adapters to existing types\n3. Index into the graph\n\nNo N^2 integration work. The planner automatically discovers new composition paths.\n\n---\n\n## Future Work\n\n### Federated Type Registries\n\nCurrently, types are platform-defined. A federated registry would enable shared semantic definitions across organizations, community-contributed adapters, and standardized industry types.\n\n### Ecosystem Tooling\n\nPotential areas include type inference from OpenAPI specs, adapter validation frameworks, and cross-platform interoperability standards.\n\n---\n\n## Appendix: Cross-System Type Definitions\n\nThe invoice generation walkthrough above relies on three interoperating type definitions: `crm.account@1` (CRM domain, keyed on id and email), `billing.customer@1` (Billing domain, keyed on id, email, and external_id), and an adapter that bridges them via the shared `email` anchor.\n\nEach type definition specifies: required and optional properties, field tiers (core, useful, PII, index), identity keys for cross-system resolution, and JSON Schema compatibility for validation. 
Adapter contracts specify the source and target types, the anchor key used for identity continuity, transformation logic, confidence score, cost, and tier preservation rules.\n\nThe specific JSON schemas, tier assignments, and adapter transformation specifications are part of the operational implementation.\n\n---\n\n*This document describes the conceptual architecture of Semio. Implementation details and optimization strategies are not included to protect operational IP while enabling conceptual understanding.*\n",
  "last_updated": "2026-01-15T00:00:00Z"
}