Knowledge Graph (Draft)

This document is a preliminary sketch. The graph layer will be refined when implementation begins. See the legalscript spec (separate repository) for the full vision.

Scope

The knowledge graph is a separate system built ON TOP of the corpus. It is not part of the corpus schema. It lives in a separate PostgreSQL schema (graph.*) and compiles into self-contained SQLite packages for runtime navigation.

Relationship to corpus

corpus.*                              graph.*                          compiled/
  documents (source texts)     →      concepts (legal concepts)     →  fr.civil.sqlite
  edges (citations)            →      annotations (LLM/human)       →  fr.travail.sqlite
                               →      edges (concept relations)     →  fr.conso.sqlite
                               →      compilations (metadata)

The corpus stores source data (legislation, decisions, guidance). It is precious: re-ingesting it takes days.
The graph stores derived knowledge (concepts, annotations). It is reproducible: recompiling it takes hours.
Compiled packages are self-contained SQLite for runtime. No PostgreSQL dependency at query time.
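
As a rough illustration of the compile step, the sketch below writes one package from in-memory rows that stand in for graph.concepts and graph.edges. The function name compile_package, the reduced column set, and the rule that an edge is kept only when both endpoints are inside the package are all assumptions made for this example; the draft does not yet specify the package layout.

```python
import sqlite3

def compile_package(prefix, concepts, edges, path=":memory:"):
    """Write the concepts under `prefix`, and the edges between them, to SQLite.

    concepts: iterable of (id, parent_id, title) tuples
    edges:    iterable of (source_id, target_id, kind) tuples
    The table layout is hypothetical, not the draft's final package format.
    """
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE concepts (
            id        TEXT PRIMARY KEY,
            parent_id TEXT,
            title     TEXT NOT NULL
        );
        CREATE TABLE edges (
            source_id TEXT NOT NULL,
            target_id TEXT NOT NULL,
            kind      TEXT NOT NULL,
            UNIQUE (source_id, target_id, kind)
        );
    """)
    # Keep the prefix itself and anything below it in the thematic path.
    kept = [c for c in concepts if c[0] == prefix or c[0].startswith(prefix + ".")]
    ids = {c[0] for c in kept}
    db.executemany("INSERT INTO concepts VALUES (?, ?, ?)", kept)
    # Assumption: edges leaving the package are dropped to keep it self-contained.
    db.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [e for e in edges if e[0] in ids and e[1] in ids],
    )
    db.commit()
    return db
```

Passing a file name such as fr.civil.sqlite as path would produce a file that can be queried with no PostgreSQL connection, which is the property the packages are meant to have.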

Graph schema (draft)

<syntaxhighlight lang="sql"> CREATE SCHEMA IF NOT EXISTS graph;

CREATE TABLE graph.concepts (

   id                text PRIMARY KEY,    -- thematic path: fr.civil.contrat.formation.consentement.vice.dol
   jurisdiction      text NOT NULL,
   parent_id         text,                -- parent concept (extends)
   title             text NOT NULL,
   concept_type      text NOT NULL,       -- qualifiable, standard_ouvert, principe_directeur, procedural, bareme
   defining_articles jsonb,               -- references to corpus documents
   metadata          jsonb DEFAULT '{}',
   created_at        timestamptz DEFAULT now()

);

CREATE TABLE graph.annotations (

   id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   doc_id          text NOT NULL,         -- references corpus.documents.id
   annotation_type text NOT NULL,         -- structure, defines, illustrates, condition, qualify, framework...
   concept_id      text,                  -- references graph.concepts.id
   version         int NOT NULL,
   parent_version  int,
   method          text NOT NULL,         -- stub, llm, human, jurist
   confidence      text NOT NULL,         -- stub, memory_only, source_checked, cross_validated, disputed
   author          text,
   prompt_hash     text,
   content         jsonb NOT NULL,
   created_at      timestamptz DEFAULT now()

);

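-- PostgreSQL does not allow function calls in table-level UNIQUE constraints.
-- Use a unique index instead. coalesce() maps a NULL concept_id to '' so that
-- annotations without a concept are also covered by the uniqueness check.
CREATE UNIQUE INDEX idx_ann_unique
   ON graph.annotations (doc_id, annotation_type, coalesce(concept_id, ''), version);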

CREATE TABLE graph.edges (

   id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   source_id   text NOT NULL,
   target_id   text NOT NULL,
   kind        text NOT NULL,
   properties  jsonb DEFAULT '{}',
   UNIQUE (source_id, target_id, kind)

);

CREATE TABLE graph.compilations (

   id            text PRIMARY KEY,
   version       int NOT NULL,
   compiled_at   timestamptz NOT NULL,
   source_commit text,
   quality       jsonb,
   dependencies  jsonb,
   artifact_path text

); </syntaxhighlight>
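
The unique index on graph.annotations matters because a unique index treats NULL values as distinct by default, so two annotations with a NULL concept_id would otherwise never collide. The sketch below reproduces that behaviour in an in-memory SQLite database, used here only as a convenient stand-in for PostgreSQL. The table is reduced to the four indexed columns, and make_db, try_insert, and the doc-1 identifier are hypothetical.

```python
import sqlite3

def make_db():
    """Create a reduced annotations table with the coalesce-based unique index."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE annotations (
            doc_id          TEXT NOT NULL,
            annotation_type TEXT NOT NULL,
            concept_id      TEXT,
            version         INTEGER NOT NULL
        )
    """)
    db.execute("""
        CREATE UNIQUE INDEX idx_ann_unique
        ON annotations (doc_id, annotation_type, coalesce(concept_id, ''), version)
    """)
    return db

def try_insert(db, row):
    """Insert one row; return False if the unique index rejects it."""
    try:
        db.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this index, inserting ("doc-1", "structure", None, 1) twice fails the second time, while the same key with version 2 is accepted.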

Why separate from corpus

The graph has different structural needs:

Annotations have versioning (version + parent_version), confidence, method, prompt_hash as first-class columns
Concepts have concept_type and defining_articles
No body/body_search/content_fts/language — annotations are structured JSONB, not searchable text
Different write patterns: corpus is append-mostly, graph is recompiled in bulk

Forcing annotations into corpus.documents would leave 5 of 14 columns always NULL. That would not be an optimization; it would be a forced fit.
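
To show what first-class versioning allows, the sketch below follows parent_version links to recover the history of one annotation. It assumes that parent_version points to an earlier version with the same (doc_id, annotation_type, concept_id) key, which the draft implies but does not state. The Annotation class and the lineage function are hypothetical helpers, not part of the schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Annotation:
    doc_id: str
    annotation_type: str
    concept_id: Optional[str]
    version: int
    parent_version: Optional[int]
    method: str        # stub, llm, human, jurist
    confidence: str    # stub, memory_only, source_checked, ...

def lineage(annotations, doc_id, annotation_type, concept_id, version):
    """Return version numbers from `version` back to the root of its chain."""
    key = (doc_id, annotation_type, concept_id)
    by_version = {
        a.version: a
        for a in annotations
        if (a.doc_id, a.annotation_type, a.concept_id) == key
    }
    chain = []
    current = version
    # Stop at the root, at a missing parent, or if a cycle is detected.
    while current is not None and current in by_version and current not in chain:
        chain.append(current)
        current = by_version[current].parent_version
    return chain
```

A chain such as stub, then llm, then human review comes back newest first, so the method and confidence of each step can be read in order.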

Corpus guarantees for the graph

The corpus schema guarantees properties the graph depends on:

Stable IDs — corpus.documents.id never changes after ingestion. The graph references documents by ID.
Permanent article identity — tags.cid groups temporal versions of the same article. The graph needs this to link concepts to articles across renumbering.
Immutable body — body does not change after initial ingestion. Future annotation anchoring (character offsets) depends on this stability.
No reverse dependency — corpus never references graph. The graph depends on corpus, not the inverse.
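
The permanent article identity guarantee can be pictured as a grouping of document IDs by cid. The sketch below assumes each corpus document is available as a dictionary with an id and a tags mapping that may contain a cid key; that shape is a guess based on the tags.cid notation above, and versions_by_cid is a hypothetical helper.

```python
from collections import defaultdict

def versions_by_cid(documents):
    """Group document IDs by tags.cid; documents without a cid are skipped."""
    groups = defaultdict(list)
    for doc in documents:
        cid = doc.get("tags", {}).get("cid")
        if cid is not None:
            groups[cid].append(doc["id"])
    return dict(groups)
```

A concept that references an article through its cid would then reach every temporal version of that article, including versions ingested after a renumbering.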
