Corpus/Graph

From Dura Lex Wiki

Knowledge Graph (Draft)


This document is a preliminary sketch. The graph layer will be refined when implementation begins. See the legalscript spec (separate repository) for the full vision.

Scope


The knowledge graph is a separate system built ON TOP of the corpus. It is not part of the corpus schema. It lives in a separate PostgreSQL schema (graph.*) and compiles into self-contained SQLite packages for runtime navigation.

Relationship to corpus

corpus.*                              graph.*                          compiled/
  documents (source texts)     →      concepts (legal concepts)     →  fr.civil.sqlite
  edges (citations)            →      annotations (LLM/human)       →  fr.travail.sqlite
                               →      edges (concept relations)     →  fr.conso.sqlite
                               →      compilations (metadata)
  • The corpus stores source data (legislation, decisions, guidance). Precious, days to re-ingest.
  • The graph stores derived knowledge (concepts, annotations). Reproducible, hours to recompile.
  • Compiled packages are self-contained SQLite for runtime. No PostgreSQL dependency at query time.
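
As an illustration of the last point, here is a minimal Python sketch of what "self-contained SQLite for runtime" could look like. It is not the project's compiler. The in-memory `CONCEPTS` rows, the `compile_package` and `list_titles` helpers, the three-column `concepts` table, and the rule that a package holds every concept whose ID starts with its prefix are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile

# Hypothetical rows standing in for graph.concepts: (id, parent_id, title).
CONCEPTS = [
    ("fr.civil.contrat", None, "Contrat"),
    ("fr.civil.contrat.formation", "fr.civil.contrat", "Formation du contrat"),
    ("fr.travail.licenciement", None, "Licenciement"),
]


def compile_package(rows, prefix, path):
    """Write the concepts under `prefix` into a standalone SQLite file."""
    selected = [r for r in rows if r[0] == prefix or r[0].startswith(prefix + ".")]
    con = sqlite3.connect(path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE concepts ("
                "id TEXT PRIMARY KEY, parent_id TEXT, title TEXT NOT NULL)"
            )
            con.executemany("INSERT INTO concepts VALUES (?, ?, ?)", selected)
    finally:
        con.close()
    return len(selected)


def list_titles(path):
    """Runtime side: read the package using SQLite only, no PostgreSQL."""
    con = sqlite3.connect(path)
    try:
        return [t for (t,) in con.execute("SELECT title FROM concepts ORDER BY id")]
    finally:
        con.close()


workdir = tempfile.mkdtemp()
package_path = os.path.join(workdir, "fr.civil.sqlite")
written = compile_package(CONCEPTS, "fr.civil", package_path)
titles = list_titles(package_path)
```

The runtime function only needs the file path, which is the property the bullet describes: once compiled, the package can be shipped and queried without any database server.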

Graph schema (draft)


<syntaxhighlight lang="sql">
CREATE SCHEMA IF NOT EXISTS graph;

CREATE TABLE graph.concepts (

   id                text PRIMARY KEY,    -- thematic path: fr.civil.contrat.formation.consentement.vice.dol
   jurisdiction      text NOT NULL,
   parent_id         text,                -- parent concept (extends)
   title             text NOT NULL,
   concept_type      text NOT NULL,       -- qualifiable, standard_ouvert, principe_directeur, procedural, bareme
   defining_articles jsonb,               -- references to corpus documents
   metadata          jsonb DEFAULT '{}',
   created_at        timestamptz DEFAULT now()

);

CREATE TABLE graph.annotations (

   id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   doc_id          text NOT NULL,         -- references corpus.documents.id
   annotation_type text NOT NULL,         -- structure, defines, illustrates, condition, qualify, framework...
   concept_id      text,                  -- references graph.concepts.id
   version         int NOT NULL,
   parent_version  int,
   method          text NOT NULL,         -- stub, llm, human, jurist
   confidence      text NOT NULL,         -- stub, memory_only, source_checked, cross_validated, disputed
   author          text,
   prompt_hash     text,
   content         jsonb NOT NULL,
   created_at      timestamptz DEFAULT now()

);

-- PostgreSQL does not allow function calls in table-level UNIQUE constraints.
-- Use a unique index instead:
CREATE UNIQUE INDEX idx_ann_unique
   ON graph.annotations (doc_id, annotation_type, coalesce(concept_id, ''), version);

CREATE TABLE graph.edges (

   id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   source_id   text NOT NULL,
   target_id   text NOT NULL,
   kind        text NOT NULL,
   properties  jsonb DEFAULT '{}',
   UNIQUE (source_id, target_id, kind)

);

CREATE TABLE graph.compilations (

   id            text PRIMARY KEY,
   version       int NOT NULL,
   compiled_at   timestamptz NOT NULL,
   source_commit text,
   quality       jsonb,
   dependencies  jsonb,
   artifact_path text

);
</syntaxhighlight>
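
The unique index on graph.annotations wraps concept_id in coalesce because NULLs are treated as distinct in a plain unique index, so two annotations without a concept would never collide. The sketch below reproduces that behaviour with Python's built-in sqlite3 as a stand-in for PostgreSQL. It assumes a SQLite build with expression indexes (3.9 or later), coalesces NULL to an empty string, trims the table to the indexed columns, and uses the made-up document ID `doc-1`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE annotations (
        id              INTEGER PRIMARY KEY,
        doc_id          TEXT NOT NULL,
        annotation_type TEXT NOT NULL,
        concept_id      TEXT,
        version         INTEGER NOT NULL
    )
""")
# Expression index: a NULL concept_id is compared as '' instead of as NULL.
con.execute("""
    CREATE UNIQUE INDEX idx_ann_unique
        ON annotations (doc_id, annotation_type, coalesce(concept_id, ''), version)
""")

INSERT = (
    "INSERT INTO annotations (doc_id, annotation_type, concept_id, version) "
    "VALUES (?, ?, ?, ?)"
)


def try_insert(row):
    """Return True if the row was stored, False if the unique index rejected it."""
    try:
        with con:
            con.execute(INSERT, row)
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert(("doc-1", "structure", None, 1))         # accepted
duplicate = try_insert(("doc-1", "structure", None, 1))     # same key, rejected
next_version = try_insert(("doc-1", "structure", None, 2))  # new version, accepted
```

Without the coalesce, the second insert would succeed, because its NULL concept_id would not compare equal to the first row's NULL.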

Why separate from corpus


The graph has different structural needs:

  • Annotations have versioning (version + parent_version), confidence, method, prompt_hash as first-class columns
  • Concepts have concept_type and defining_articles
  • No body/body_search/content_fts/language columns: annotations are structured JSONB, not searchable text
  • Different write patterns: corpus is append-mostly, graph is recompiled in bulk

Forcing annotations into corpus.documents would leave 5 of 14 columns always NULL. That is not schema reuse; it is shoehorning a different structure into the wrong table.
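
To make the versioning columns concrete, here is a hedged Python sketch of how version and parent_version could be used to trace where an annotation came from. It assumes that parent_version names the version a revision was derived from, within the same (doc_id, annotation_type, concept_id) key. This draft does not spell that out, so treat it as one possible reading. The `lineage` helper and the sample rows are hypothetical.

```python
def lineage(annotations, version):
    """Follow parent_version links from `version` back to the root version."""
    by_version = {a["version"]: a for a in annotations}
    chain = []
    current = version
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle in parent_version chain at version {current}")
        chain.append(current)
        current = by_version[current]["parent_version"]
    return chain


# Hypothetical revisions of one annotation; `method` values come from the schema comment.
history = [
    {"version": 1, "parent_version": None, "method": "stub"},
    {"version": 2, "parent_version": 1, "method": "llm"},
    {"version": 3, "parent_version": 2, "method": "human"},
    {"version": 4, "parent_version": 2, "method": "llm"},  # second branch from v2
]

human_chain = lineage(history, 3)
llm_chain = lineage(history, 4)
```

Under this reading, versions 3 and 4 share the ancestry `2, 1` but diverge afterwards, which a single version counter alone could not express.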

Corpus guarantees for the graph


The corpus schema guarantees properties the graph depends on:

  1. Stable IDs: corpus.documents.id never changes after ingestion. The graph references documents by ID.
  2. Permanent article identity: tags.cid groups temporal versions of the same article. The graph needs this to link concepts to articles across renumbering.
  3. Immutable body: body does not change after initial ingestion. Future annotation anchoring (character offsets) depends on this stability.
  4. No reverse dependency: corpus never references graph. The graph depends on corpus, not the inverse.
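
The immutable-body guarantee matters because a character-offset anchor carries no text of its own. The short Python sketch below shows the failure mode it prevents. The `Anchor` type, the `resolve` helper, and the sample sentence are all invented for illustration, since the anchoring format is not yet specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor: a half-open character range inside one document body."""
    doc_id: str
    start: int
    end: int


def resolve(anchor, body):
    """Return the text an anchor points at in the given body."""
    return body[anchor.start:anchor.end]


# Made-up body text, not taken from any real corpus document.
body = "The seller must inform the buyer before the sale."
phrase = "inform the buyer"
start = body.index(phrase)
anchor = Anchor(doc_id="doc-1", start=start, end=start + len(phrase))

original_hit = resolve(anchor, body)

# If the body were edited after ingestion, the same offsets would drift.
edited_body = "Note: " + body
drifted_hit = resolve(anchor, edited_body)
```

With an unchanged body the anchor resolves to the annotated phrase. After even a small edit, the same offsets silently select different text, which is why offsets are only safe on top of an immutable body.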