
Data Ingestion

Squad’s Universal Semantic Encoding Pipeline (USEP) is a source-agnostic ingestion system that accepts content from any source (documents, files, URLs, or connected platforms) and transforms it into a unified knowledge graph. Rather than building separate pipelines for each data source, USEP normalises everything into a single representation and extracts structured knowledge through an intelligent, staged process.

Core Concepts

Episodes

The fundamental unit of ingested content is the Episode: a paragraph-level segment of your source material stored with full provenance metadata. When you ingest a document, Squad parses it into episodes, each preserving exactly what was written and where it came from. Episodes are the foundation that all downstream extraction and retrieval builds on.

This source-agnostic design means that a PDF uploaded from a local drive, a page imported from Notion, and a document pulled from SharePoint all produce the same Episode structure with no source-specific configuration required.
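The exact Episode schema isn’t published here, but a minimal sketch (with hypothetical field names) illustrates the idea of a verbatim, paragraph-level segment carrying its own provenance:

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class Episode:
    """Paragraph-level segment with full provenance (illustrative shape)."""
    text: str     # verbatim content, never rewritten
    source: str   # originating document, URL, or connector item
    section: str  # structural position, e.g. the enclosing heading
    index: int    # paragraph order within the source

    @property
    def content_hash(self) -> str:
        # stable identity, useful for deduplication checks
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

ep = Episode(text="USEP normalises everything into a single representation.",
             source="docs/ingestion.md", section="Data Ingestion", index=0)
```

Whatever the real field names, the key property is the same: the text is stored exactly as written, alongside enough metadata to trace it back to its origin.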

Two Operating Modes

USEP operates in two distinct modes depending on your needs:

Discovery Mode (default)

Best for: Getting started quickly, exploring new data, general-purpose knowledge retrieval.

When no domain ontology is provided, Squad indexes your content immediately with zero configuration. Documents are parsed and stored as episodes with lightweight linguistic indexing: noun phrases, co-occurrence relationships, and hierarchical topic communities are extracted using fast natural language processing alone.

Retrieval happens at query time: when you ask a question, Squad decomposes it into sub-queries, matches concepts against the indexed structure, and synthesises answers with full source citations. This approach keeps ingestion near-instant while deferring the cost of deep extraction until you actually need it.

Ontology-Fed Mode (custom)

Best for: Domains with an established vocabulary where extraction precision matters.

When a custom domain ontology schema is supplied, its user-defined entity types, relationships, and domain vocabulary constrain extraction throughout the pipeline: the selected framework drives entity typing and relationship extraction at every stage, tightening precision for domain-specific content.

How Ingestion Works

Source Normalisation

Squad accepts content in a wide range of formats. All inputs are normalised to clean text with preserved structural markers (headings, sections, lists) before entering the extraction pipeline.

Supported formats include: PDF, DOCX, PPTX, Markdown, CSV, HTML, plain text, and images.

Before any content reaches the pipeline, it passes through pre-pipeline guardrails: file-level security checks including format validation and size limits. Content that fails these checks is rejected and logged; it never enters the extraction process.
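As an illustration of the guardrail and normalisation steps, here is a minimal sketch; the allowed-format set, size limit, and function names are assumptions for illustration, not Squad's actual values:

```python
ALLOWED = {".pdf", ".docx", ".pptx", ".md", ".html", ".txt", ".csv", ".png"}
MAX_BYTES = 50 * 1024 * 1024   # illustrative size limit, not Squad's real one

def guardrail_check(filename: str, payload: bytes) -> bool:
    """Pre-pipeline, file-level checks: format validation and size limits."""
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return suffix in ALLOWED and 0 < len(payload) <= MAX_BYTES

def normalise(raw: str) -> list[str]:
    """Split clean text into paragraph segments, keeping structural markers."""
    # heading and list markers survive as plain-text prefixes on each segment
    return [block.strip() for block in raw.split("\n\n") if block.strip()]

segments = normalise("# Overview\n\nFirst paragraph.\n\nSecond paragraph.")
```

Content that fails `guardrail_check` would be logged and rejected before `normalise` ever runs, mirroring the order described above.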

Staged Extraction

Rather than relying on a single extraction method, Squad uses a tiered cascade: a series of increasingly sophisticated stages where each stage handles what the previous one couldn’t. Early stages are fast and inexpensive; later stages bring in more powerful (and costlier) analysis only when needed.

USEP Operating Models & Ingestion Pipeline
[Diagram: data sources (files in the supported formats, REST API, CLI, platform UI, connectors) pass through source normalisation (with content hashing and deduplication) and content-policy checks (PII detection, content guidelines), then into the selected operating model (default Discovery Mode with the POLE+O framework, or custom Ontology-Fed Mode with a domain ontology schema). The selected framework drives entity typing and relationship extraction through the lightweight NLP indexing phase (linguistic indexing, co-occurrence network, Leiden community detection; no LLM calls), the tiered extraction cascade (pattern recognition, domain recognition, knowledge graph lookup, deep extraction, entity resolution), and finally the three-layer storage model: Layer 0 Episodes, Layer 1 Mentions, Layer 2 Entities. Ingest → Encode → Bind → Store.]
  1. Normalisation: Raw content is cleaned, standardised, and split into logical segments while preserving structural context.

  2. Content Policy: Text is checked against data policies (PII detection, content guidelines) before any extraction occurs. Rejected content is logged and excluded.

  3. Pattern Recognition: Fast, deterministic recognition identifies known entity types (people, organisations, locations, dates) and structural patterns (links, references, identifiers) without any AI model involvement.

  4. Domain Recognition: Zero-shot recognition catches domain-specific entities that general pattern recognition misses (technical terms, methodologies, domain concepts) and maps them to your configured entity types.

  5. Knowledge Graph Lookup: Detected entities are checked against the existing graph. If an entity is already known, it’s linked directly; no further analysis is needed. This means the system gets faster and cheaper as your knowledge graph grows.

  6. Deep Extraction: Only entities and relationships that earlier stages couldn’t resolve are passed to an AI model for analysis. This handles novel concepts, implicit relationships, and ambiguous references that require contextual understanding.

  7. Memory Consolidation: Extracted knowledge is written into the three-layer storage model with full source provenance, confidence scores, and cross-references.
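The cascade's routing logic can be sketched as follows; the regex pattern, type labels, and function are illustrative stand-ins, not the production stages:

```python
import re

def run_cascade(mentions: list[str], known_graph: dict[str, str]) -> dict[str, str]:
    """Route mentions through tiers; only leftovers reach the AI model stage."""
    resolved: dict[str, str] = {}
    deep_queue: list[str] = []
    for m in mentions:
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", m):   # deterministic pattern stage
            resolved[m] = "Date"
        elif m in known_graph:                      # knowledge graph lookup
            resolved[m] = known_graph[m]
        else:                                       # escalate to deep extraction
            deep_queue.append(m)
    # stand-in for the AI model call that the earlier stages avoided
    resolved.update({m: "Concept" for m in deep_queue})
    return resolved

out = run_cascade(["2024-05-01", "Acme Ltd", "free energy principle"],
                  known_graph={"Acme Ltd": "Organisation"})
```

The economics follow directly from this structure: as `known_graph` grows, more mentions short-circuit at the lookup tier and the expensive final stage fires less often.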

Three-Layer Storage

All ingested knowledge is stored in a structured, three-layer model inspired by Fuzzy-Trace Theory from cognitive science, preserving both the verbatim detail and the durable meaning of every piece of content.

Layer 0: Episodes

Verbatim content segments with full source provenance. This layer is never deduplicated: every paragraph from every source is preserved exactly as it appeared, linked back to its origin.

Layer 1: Mentions

Individual entity mentions and extractions, anchored in their source episodes. Each mention carries a confidence score reflecting how it was detected, positional anchoring in the source text, and temporal metadata.

Layer 2: Entities

Canonical, deduplicated entities representing real-world concepts. A single “Active Inference” entity might be referenced by dozens of mentions across multiple sources; Layer 2 unifies them into one authoritative node.

All Layer 2 entities are classified using the POLE+O framework: a well-established intelligence ontology extended with cognitive primitives. The foundational types (Person, Organisation, Location, Event, Object) provide domain-agnostic coverage across any industry, while cognitive extensions (Concept, Tool, Procedure, Fact) capture abstract knowledge and operational patterns that emerge during use.

Rather than requiring a rigid predefined schema, domain-specific entity types crystallise organically as data is ingested. Squad detects stable clusters in the entity space and creates new type classifications automatically, adapting to your domain without manual configuration. A defence logistics deployment develops different entity types from a financial services one; the ontology evolves to match.
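A minimal sketch of how Layer 1 mentions bind to a Layer 2 entity (the field names here are hypothetical, not Squad's schema):

```python
from dataclasses import dataclass

@dataclass
class Mention:
    """Layer 1: an extraction anchored in its source episode."""
    surface: str       # the text exactly as it appeared
    episode_id: int    # anchor back into Layer 0
    confidence: float  # reflects which cascade stage detected it

@dataclass
class Entity:
    """Layer 2: canonical, deduplicated, POLE+O-typed."""
    name: str
    types: list[str]
    mention_ids: list[int]

# Several surface forms across sources collapse onto one canonical node
mentions = [Mention("Active Inference", episode_id=1, confidence=0.95),
            Mention("active inference", episode_id=7, confidence=0.90),
            Mention("AIF", episode_id=12, confidence=0.70)]
entity = Entity("Active Inference", types=["Concept"], mention_ids=[0, 1, 2])
```

Because every mention keeps its episode anchor, any canonical entity can be traced back through Layer 1 to the exact verbatim paragraphs it came from.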

Knowledge Graph Ontology
[Diagram: three-layer storage with POLE+O entity classification. Layer 0 (source provenance, verbatim trace): Document nodes (file, hash, provenance) HAS Episode nodes (content, section, source), with table rows as HAS_ROW children. Layer 1 (extraction, per-source mentions): Mention nodes (surface form, type, confidence, span) ANCHORED_IN episodes, split into high-confidence and needs-review bands. Layer 2 (canonical entities, deduplicated across sources): Entity nodes (canonical name, types, embedding) linked by REFERS_TO from mentions, CO_OCCURS between co-occurring entities, and soft, reversible MERGED_INTO edges; classified by POLE+O (Person, Organisation, Location, Event, Object) plus cognitive types (Concept, Tool, Procedure, Fact), with domain types crystallising organically. Alongside: the NLP-only co-occurrence network of NounPhrase nodes (CONTAINS_PHRASE from episodes, CO_OCCURS_WITH edges) grouped IN_COMMUNITY into Leiden hierarchical communities, leaf (specific topics) → mid (domains) → top (themes), enabling local, regional, and global query scoping.]

Co-Occurrence Indexing

During ingestion, Squad builds a co-occurrence network alongside the three-layer storage model. When entities appear together within the same text segment, they are linked with weighted edges reflecting their proximity, forming a dense web of semantic associations that captures how concepts relate in your domain.

This co-occurrence structure serves two purposes:

  • At retrieval time, the co-occurrence concept network, enriched with hierarchical community detection via the Leiden algorithm, powers Squad’s Retrieval Graph. The resulting community hierarchy enables everything from precise local lookups to broad thematic queries, at a fraction of the cost of traditional graph traversal.

  • Over time, frequently co-occurring entities can have their implicit relationship promoted to an explicit edge in the knowledge graph through consolidation processes, strengthening the semantic layer as the system matures.

The co-occurrence network is built entirely from natural language processing: no LLM calls or embedding computation required. This makes it extremely fast to construct and keeps indexing costs equivalent to standard vector search.
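Conceptually, building the weighted edges is a counting exercise over segments, as in this sketch (not Squad's implementation):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(segments: list[list[str]]) -> Counter:
    """Weighted edges: count how often two phrases share a text segment."""
    edges: Counter = Counter()
    for phrases in segments:
        # sorted() gives each pair a canonical orientation; set() drops repeats
        for a, b in combinations(sorted(set(phrases)), 2):
            edges[(a, b)] += 1
    return edges

segments = [["active inference", "free energy", "markov blanket"],
            ["active inference", "free energy"],
            ["markov blanket", "free energy"]]
edges = build_cooccurrence(segments)
```

Pure counting like this needs no model calls at all, which is why the index is cheap to construct; community detection then runs over the resulting weighted graph.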

Entity Resolution

When entities are extracted from multiple sources, they often refer to the same real-world concept in different ways: abbreviations, typos, alternate names, or varying levels of specificity. Squad’s entity resolution process unifies these into canonical representations through progressive consolidation.

Resolution works in two phases:

  • Deterministic resolution handles the majority of cases through exact normalised matching, fuzzy string similarity, and cross-source link detection. This phase is fast, free, and resolves roughly 75% of duplicates with near-perfect precision.

  • Semantic resolution handles the remaining ambiguous cases through a multi-strategy approach. Candidate pairs are identified using embedding similarity, structural co-occurrence patterns, and type constraints, then evaluated for equivalence using contextual analysis. Confidence bands control the outcome: high-confidence matches are merged automatically, medium-confidence cases are flagged for human review, and low-confidence pairs are kept as separate entities.

Rather than running a single resolution pass, Squad uses progressive consolidation: multiple passes with progressively relaxed thresholds. High-confidence merges happen first, building context that improves accuracy for harder cases in subsequent passes. This annealing approach consistently outperforms single-pass methods, and the system gets more accurate as the knowledge graph grows: co-occurrence patterns and enriched entity profiles provide stronger signals for each successive round.

All merges are recorded as soft merges: reversible relationships that preserve the original entities and can be undone if a merge turns out to be incorrect. Administrators can review and correct the system’s decisions without data loss, and when confidence is high enough, soft merges can be hardened into permanent consolidations.
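A toy sketch of progressive consolidation with reversible soft merges, using plain string similarity in place of Squad's multi-strategy matching (thresholds, helper names, and the merge bookkeeping are all illustrative):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consolidate(names: list[str], thresholds: list[float]) -> dict[str, str]:
    """Progressive passes with relaxed thresholds; merges stay soft (reversible)."""
    soft_merges: dict[str, str] = {}        # absorbed name -> canonical name
    for threshold in thresholds:            # strict pass first, looser later
        survivors = [n for n in names if n not in soft_merges]
        for i, a in enumerate(survivors):
            for b in survivors[i + 1:]:
                if b not in soft_merges and similarity(a, b) >= threshold:
                    soft_merges[b] = a      # original entity is preserved
    return soft_merges

merges = consolidate(["Acme Ltd", "ACME Ltd", "Acme Limited"], [0.95, 0.75])
```

The exact-match pass absorbs "ACME Ltd" first; the relaxed second pass then catches "Acme Limited". Because merges are recorded as a mapping rather than destructive rewrites, undoing one is just deleting its entry.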

After resolution, entities are enriched with structured metadata from external knowledge bases: descriptions, alternate names, geographic coordinates, and domain-specific identifiers. Enrichment produces richer representations that improve both future retrieval and subsequent resolution rounds.

Ingestion in Practice

Running Ingestion

Ingestion can be triggered through the Squad CLI or the platform API. At its simplest, point Squad at a file or directory:

# Ingest a single document (Discovery Mode)
usep ingest document.pdf
# Ingest with a domain ontology (Ontology-Fed Mode)
usep ingest document.pdf --ontology schema.json
# Ingest and build the full index structure
usep lazygraph document.pdf

Progress is streamed in real time: each pipeline stage reports its status as content moves through extraction.

Monitoring and Verification

After ingestion completes, you can verify what was extracted:

  • Entity counts and types: Review the entities, relationships, and episodes created from your content.
  • Confidence distributions: Understand how much was resolved by fast pattern matching versus deep extraction.
  • Source provenance: Trace any entity or fact back to its original source document and paragraph.
  • Graph visualisation: Explore the resulting knowledge graph interactively. See Graph Visualisation.

Incremental Updates

Squad tracks what has already been ingested. When content is updated at the source, only the changed portions are reprocessed, avoiding redundant extraction and keeping the knowledge graph current without full re-ingestion.
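Change detection of this kind can be sketched with content hashing; the scheme below is an assumption for illustration, not Squad's actual mechanism:

```python
import hashlib

def digest(segment: str) -> str:
    return hashlib.sha256(segment.encode("utf-8")).hexdigest()

def segments_to_reprocess(old: list[str], new: list[str]) -> list[str]:
    """Diff by content hash: only changed or added paragraphs re-enter the pipeline."""
    known = {digest(s) for s in old}
    return [s for s in new if digest(s) not in known]

old_doc = ["Intro paragraph.", "Details paragraph."]
new_doc = ["Intro paragraph.", "Details paragraph, revised.", "New appendix."]
todo = segments_to_reprocess(old_doc, new_doc)
```

Unchanged paragraphs hash to known values and are skipped, so re-ingesting a large document after a one-paragraph edit costs roughly one paragraph of extraction.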

CLI Reference

USEP provides a CLI for ingestion and index management. All commands connect to Neo4j and operate on the knowledge graph directly.

Core Pipeline

| Command | Description |
| --- | --- |
| usep ingest <path> | Ingest a file or directory into Episode nodes. Supports --dry-run, --max-episodes, and --ontology options. |
| usep lazygraph <path> | Run the full LazyGraphRAG pipeline: ingest, extract noun phrases, detect communities, and build the co-occurrence index in one step. Supports --skip-ingest to run only the indexing stages. |

Index Building

These commands run individual stages of the LazyGraphRAG pipeline. Use them when you need fine-grained control over the indexing process.

| Command | Description |
| --- | --- |
| usep noun-phrases | Extract NounPhrase nodes from Episode text for the co-occurrence network. |
| usep communities | Detect hierarchical communities in the NounPhrase co-occurrence graph using the Leiden algorithm. Supports --max-levels and --gamma options. |

Entity Extraction & Resolution

| Command | Description |
| --- | --- |
| usep encode | Extract entities from Episode nodes and write Layer 1 Mentions. |
| usep bind | Resolve Layer 1 Mentions into Layer 2 Entities and create CO_OCCURS edges. |
| usep enrich | Enrich Entity nodes with descriptions from external knowledge bases. |
| usep harden | Finalise all soft merges into permanent merges. |
| usep undo-merges | Reverse all soft merges, restoring absorbed entities. |
| usep review | Export entities flagged for review to JSON (optionally push to Notion). |

Evaluation

| Command | Description |
| --- | --- |
| usep benchmark-query | Generate test queries from Episode content using LLM-powered AutoQ. |
| usep benchmark-eval | Run a gold test set through the retrieval pipeline and score with LLM-as-judge. |