Corpus/Graph
Knowledge Graph (Draft)
This document is a preliminary sketch. The graph layer will be refined when implementation begins. See the legalscript spec (separate repository) for the full vision.
Scope
The knowledge graph is a separate system built ON TOP of the corpus. It is not part of the corpus schema. It lives in a separate PostgreSQL schema (graph.*) and compiles into self-contained SQLite packages for runtime navigation.
Relationship to corpus
corpus.*                       graph.*                          compiled/
documents (source texts)   →   concepts (legal concepts)    →   fr.civil.sqlite
edges (citations)          →   annotations (LLM/human)      →   fr.travail.sqlite
                           →   edges (concept relations)    →   fr.conso.sqlite
                           →   compilations (metadata)
- The corpus stores source data (legislation, decisions, guidance). Precious, days to re-ingest.
- The graph stores derived knowledge (concepts, annotations). Reproducible, hours to recompile.
- Compiled packages are self-contained SQLite for runtime. No PostgreSQL dependency at query time.
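The compile step described above could look roughly like the following. This is a minimal sketch, not the project's compiler: it assumes the graph rows have already been exported from PostgreSQL as plain tuples, and the function names (`compile_package`, `children`), the reduced column set, the titles, and the `erreur` sibling concept are all hypothetical.

```python
import sqlite3

def compile_package(concepts, edges, path=":memory:"):
    """Copy already-exported graph rows into a standalone SQLite database."""
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE concepts (
            id TEXT PRIMARY KEY,
            parent_id TEXT,
            title TEXT NOT NULL
        );
        CREATE TABLE edges (
            source_id TEXT NOT NULL,
            target_id TEXT NOT NULL,
            kind TEXT NOT NULL,
            UNIQUE (source_id, target_id, kind)
        );
        """
    )
    db.executemany("INSERT INTO concepts VALUES (?, ?, ?)", concepts)
    db.executemany("INSERT INTO edges VALUES (?, ?, ?)", edges)
    db.commit()
    return db

def children(db, concept_id):
    """Runtime navigation: direct children of a concept, read from SQLite only."""
    rows = db.execute(
        "SELECT id FROM concepts WHERE parent_id = ? ORDER BY id", (concept_id,)
    )
    return [row[0] for row in rows]

VICE = "fr.civil.contrat.formation.consentement.vice"
concepts = [
    (VICE, None, "Vice du consentement"),   # illustrative title
    (VICE + ".dol", VICE, "Dol"),           # illustrative title
    (VICE + ".erreur", VICE, "Erreur"),     # hypothetical sibling concept
]
edges = [(VICE + ".dol", VICE, "extends")]  # "extends" is an illustrative kind

db = compile_package(concepts, edges)       # pass a file path to write a package
print(children(db, VICE))
```

Once the package is written, the lookup touches only the SQLite connection, which is the property the third bullet relies on.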
Graph schema (draft)
<syntaxhighlight lang="sql">
CREATE SCHEMA IF NOT EXISTS graph;
CREATE TABLE graph.concepts (
id text PRIMARY KEY, -- thematic path: fr.civil.contrat.formation.consentement.vice.dol
jurisdiction text NOT NULL,
parent_id text, -- parent concept (extends)
title text NOT NULL,
concept_type text NOT NULL, -- qualifiable, standard_ouvert, principe_directeur, procedural, bareme
defining_articles jsonb, -- references to corpus documents
metadata jsonb DEFAULT '{}',
created_at timestamptz DEFAULT now()
);
CREATE TABLE graph.annotations (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
doc_id text NOT NULL, -- references corpus.documents.id
annotation_type text NOT NULL, -- structure, defines, illustrates, condition, qualify, framework...
concept_id text, -- references graph.concepts.id
version int NOT NULL,
parent_version int,
method text NOT NULL, -- stub, llm, human, jurist
confidence text NOT NULL, -- stub, memory_only, source_checked, cross_validated, disputed
author text,
prompt_hash text,
content jsonb NOT NULL,
created_at timestamptz DEFAULT now()
);
-- PostgreSQL does not allow function calls in table-level UNIQUE constraints.
-- Use a unique index instead:
CREATE UNIQUE INDEX idx_ann_unique
ON graph.annotations (doc_id, annotation_type, coalesce(concept_id, ''), version);
CREATE TABLE graph.edges (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
source_id text NOT NULL,
target_id text NOT NULL,
kind text NOT NULL,
properties jsonb DEFAULT '{}',
UNIQUE (source_id, target_id, kind)
);
CREATE TABLE graph.compilations (
id text PRIMARY KEY,
version int NOT NULL,
compiled_at timestamptz NOT NULL,
source_commit text,
quality jsonb,
dependencies jsonb,
artifact_path text
); </syntaxhighlight>
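The reason for the `coalesce` in `idx_ann_unique` is that a unique index treats NULL values as distinct by default, so two annotations with a NULL `concept_id` would otherwise never collide. The sketch below illustrates that behaviour. It uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made so the example runs without a server), a reduced column set, and a hypothetical helper `try_insert` with a made-up document ID.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE annotations (
        doc_id TEXT NOT NULL,
        annotation_type TEXT NOT NULL,
        concept_id TEXT,              -- nullable, as in the draft schema
        version INTEGER NOT NULL
    );
    -- Same shape as idx_ann_unique: NULL concept_id is folded to ''
    CREATE UNIQUE INDEX idx_ann_unique
        ON annotations (doc_id, annotation_type, coalesce(concept_id, ''), version);
    """
)

def try_insert(row):
    """Return True if the row was accepted, False if the index rejected it."""
    try:
        db.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert(("doc-1", "structure", None, 1))         # hypothetical doc ID
duplicate = try_insert(("doc-1", "structure", None, 1))     # same key, NULL concept
next_version = try_insert(("doc-1", "structure", None, 2))  # new version is fine
print(first, duplicate, next_version)
```

With a plain `UNIQUE (doc_id, annotation_type, concept_id, version)` the second insert would be accepted, because the two NULLs would not compare equal.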
Why separate from corpus
The graph has different structural needs:
- Annotations have versioning (version + parent_version), confidence, method, prompt_hash as first-class columns
- Concepts have concept_type and defining_articles
- No body/body_search/content_fts/language — annotations are structured JSONB, not searchable text
- Different write patterns: corpus is append-mostly, graph is recompiled in bulk
Forcing annotations into corpus.documents would leave 5 of 14 columns always NULL. A table in which a third of the columns can never be filled is a sign that the two kinds of data do not share a schema.
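To make the versioning columns concrete, here is one way the `version` / `parent_version` pair could be read back as a revision history. This is a sketch under an assumption the draft does not spell out: that `parent_version` names the earlier version, for the same document, annotation type, and concept, from which an annotation was derived. The `Annotation` class and `lineage` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Annotation:
    version: int
    parent_version: Optional[int]  # None for the first version
    method: str                    # stub, llm, human, jurist
    confidence: str                # stub, memory_only, source_checked, ...

def lineage(annotations, version):
    """Walk parent_version links from `version` back to the first revision."""
    by_version = {a.version: a for a in annotations}
    chain, seen = [], set()
    current = version
    # The `seen` set guards against a malformed cycle in parent_version.
    while current is not None and current in by_version and current not in seen:
        seen.add(current)
        chain.append(by_version[current])
        current = by_version[current].parent_version
    return chain

history = [
    Annotation(1, None, "stub", "stub"),
    Annotation(2, 1, "llm", "memory_only"),
    Annotation(3, 2, "human", "source_checked"),
]

for step in lineage(history, 3):
    print(step.version, step.method, step.confidence)
```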
Corpus guarantees for the graph
The corpus schema guarantees properties the graph depends on:
- Stable IDs — corpus.documents.id never changes after ingestion. The graph references documents by ID.
- Permanent article identity: tags.cid groups temporal versions of the same article. The graph needs this to link concepts to articles across renumbering.
- Immutable body: body does not change after initial ingestion. Future annotation anchoring (character offsets) depends on this stability.
- No reverse dependency: corpus never references graph. The graph depends on corpus, not the inverse.
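As an illustration of the article-identity guarantee, the sketch below groups document IDs by their `cid` tag, so that a concept linked to one version of an article can reach the others. The row shape (a dict with `id` and a `tags` mapping containing `cid`), the function name `group_by_cid`, and all IDs are assumptions made for the example; the actual corpus schema may store tags differently.

```python
from collections import defaultdict

def group_by_cid(documents):
    """Map each cid to the IDs of all temporal versions of that article."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["tags"]["cid"]].append(doc["id"])
    return dict(groups)

# Hypothetical rows: two versions of one article, one version of another.
documents = [
    {"id": "doc-1", "tags": {"cid": "cid-A"}},
    {"id": "doc-2", "tags": {"cid": "cid-A"}},
    {"id": "doc-3", "tags": {"cid": "cid-B"}},
]

versions = group_by_cid(documents)
print(versions["cid-A"])
```

Because document IDs are stable and the corpus never points back into the graph, a mapping like this can be rebuilt from the corpus alone on every recompilation.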