Corpus/Schema

From Dura Lex Wiki
Jump to navigation Jump to search

Corpus Schema

[edit | edit source]

Technical specification for the Dura Lex unified corpus schema. Reference for developers and AI agents.

Overview

[edit | edit source]

One documents table with promoted columns and a JSONB tags column, inspired by OpenStreetMap's model: a small fixed schema handles structure, while a flexible key-value store handles everything jurisdiction-specific.

Validated across 25 jurisdictions on 5 continents. Benchmark on 3.4M rows: 7x faster than the previous 16-table normalized schema.

Four tables:

Table Purpose
corpus.documents All legal content: legislation, case law, regulatory filings, administrative notices
corpus.edges Relationships between documents: citations, amendments, hierarchy, consolidation
corpus.tag_stats Aggregated tag counts per kind/jurisdiction, refreshed post-ingest
corpus.source_metadata Data source registry: publishers, licenses, ingestion state, freshness

Schema: corpus.documents

[edit | edit source]

14 columns.

<syntaxhighlight lang="sql"> CREATE SCHEMA IF NOT EXISTS corpus;

CREATE TABLE corpus.documents (

   id            text PRIMARY KEY,
   kind          text NOT NULL,
   jurisdiction  text NOT NULL,
   language      text,
   source        text NOT NULL,
   date          date,
   date_end      date,
   parent_id     text,
   title         text,
   body          text,
   body_search   text,
   tags          jsonb NOT NULL DEFAULT '{}',
   content_fts   tsvector,
   ingested_at   timestamptz DEFAULT now()

); </syntaxhighlight>

Column definitions

[edit | edit source]

id -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always lowercase: {jurisdiction}.{source_id} (e.g., fr.legiarti000006902764, eu.eurlextext32016r0679). Identifiers with built-in jurisdiction context keep their original form (ECLI:FR:CCASS:2024:CR00123, [2024] UKSC 1). For non-legal registries where the source ID could collide, prefix with {jurisdiction}.{source}. (e.g., fr.company.443061841, gb.company.12345678). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md).

kind -- Structural classification. Exactly 6 values (see Kind Taxonomy below): legislation, decision, record, notice, section, chunk. This column answers "what shape is this document?" not "what legal category is it?" -- legal category belongs in tags.type.

jurisdiction -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (fr, gb, de, br, jp). ISO 3166-2 for subdivisions (us-ca, gb-sct, es-ct, de-by). Extensions for supranational areas: eu (European law — EU institutions and Council of Europe), int (international treaties). A search for a jurisdiction includes its subdivisions: searching us includes us-ca. No default value -- every document must declare its jurisdiction explicitly.

language -- ISO 639-1 (fr, en, de, ar, zh, ja, ko). Nullable. When NULL, the FTS trigger infers language from jurisdiction. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (be) but different languages.

source -- Data source identifier. Examples: legi, cass, hmrc_manuals, cendoj, boe, legifrance_jorf, knesset. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction.

date -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: decision_date for decisions, valid_from for legislation, publication_date for notices, creation_date for records. The semantic meaning is implicit from kind -- not stored separately.

date_end -- End date. valid_until for legislation, expiration for notices, closure for records. NULL for most decisions (they don't expire). NULL does not mean "still in force" -- it means the source did not provide an end date.

parent_id -- Hierarchical link to another document ID. Used for structure trees (code -> article), sub-documents (decision -> chunk), and entity hierarchies (company -> director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint.

title -- Display title. Free text. May be NULL for chunks and sections that inherit their parent's title.

body -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, body is a clean stub like <p>Source: <a href="...">PDF officiel</a></p> and the extracted text goes to body_search. body is what get_document returns.

body_search -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly — get_document never returns it. Nullable. When NULL, the FTS trigger falls back to body via coalesce(body_search, body). Used for: (a) PDF-sourced documents where body is a clean stub and body_search holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading tags.summary and tags.headnote_classification directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See QUALITY for tags.content_quality levels.

tags -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: {"type": "loi", "nid": "JORFTEXT000041865244", "nature": "LOI"} for French legislation, {"chamber": "2e_civ", "solution": "cassation"} for French cassation decisions, {"sic_code": "6411", "status": "active"} for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via @>, ?, and jsonb_path_ops GIN index.

content_fts -- tsvector column, populated by a trigger with three weights: A=title, B=tags.summary+tags.headnote_classification (read from JSONB), C=coalesce(body_search, body) with HTML tags stripped. Uses the appropriate language configuration (derived from language or jurisdiction). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See FTS for full trigger details.

ingested_at -- Timestamp of first ingestion. DEFAULT now(). Not updated on re-ingestion -- use tags for versioning metadata if needed.


Indexes

[edit | edit source]

<syntaxhighlight lang="sql"> -- B-tree indexes on promoted columns CREATE INDEX idx_doc_kind ON corpus.documents (kind); CREATE INDEX idx_doc_jurisdiction ON corpus.documents (jurisdiction); CREATE INDEX idx_doc_source ON corpus.documents (source); CREATE INDEX idx_doc_date ON corpus.documents (date); CREATE INDEX idx_doc_date_end ON corpus.documents (date_end) WHERE date_end IS NOT NULL; CREATE INDEX idx_doc_parent ON corpus.documents (parent_id) WHERE parent_id IS NOT NULL; CREATE INDEX idx_doc_kind_juris ON corpus.documents (kind, jurisdiction); CREATE INDEX idx_doc_kind_source ON corpus.documents (kind, source);

-- GIN index on tags CREATE INDEX idx_doc_tags ON corpus.documents USING GIN (tags jsonb_path_ops);

-- Partial GIN indexes per kind (FTS isolation) CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation'; CREATE INDEX idx_doc_fts_decision ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision'; CREATE INDEX idx_doc_fts_record ON corpus.documents USING GIN (content_fts) WHERE kind = 'record'; CREATE INDEX idx_doc_fts_notice ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice'; </syntaxhighlight>

Why partial GIN indexes per kind instead of table partitioning

[edit | edit source]

Partitioning documents by kind would require kind in the primary key (PostgreSQL requires the partition key in the PK). That means id alone can no longer be a PK -- every FK, every edge reference, every external link would need to carry (id, kind) instead of just id. The complexity cascades through the entire system.

Partial GIN indexes achieve 80% of the performance benefit with 0% of the complexity. When a query filters WHERE kind = 'legislation' AND content_fts @@ ..., PostgreSQL uses the partial index idx_doc_fts_legislation which contains only legislation tsvectors. The planner skips the other kinds entirely. Index size is smaller, scans are faster, and the PK stays a simple text column.

No partial index for section and chunk because they rarely have body/body_search content worth full-text indexing.


Schema: corpus.edges

[edit | edit source]

<syntaxhighlight lang="sql"> CREATE TABLE corpus.edges (

   id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   source_id   text NOT NULL,
   target_id   text,
   kind        text NOT NULL,
   reference   text,
   properties  jsonb DEFAULT '{}',
   UNIQUE (source_id, target_id, kind)

);

CREATE INDEX idx_edge_source ON corpus.edges (source_id); CREATE INDEX idx_edge_target ON corpus.edges (target_id) WHERE target_id IS NOT NULL; CREATE INDEX idx_edge_kind ON corpus.edges (kind); </syntaxhighlight>

Column definitions

[edit | edit source]

id -- Synthetic identity, bigint GENERATED ALWAYS AS IDENTITY. Edges have no natural key.

source_id -- The document that originates the relationship. Not a foreign key.

target_id -- The document that receives the relationship. Nullable: an unresolved reference (a citation to a document not yet ingested, or a reference to an external system) has target_id = NULL and reference containing the raw citation text. Once the target is ingested or resolved, target_id is backfilled.

kind -- Relationship type. Open vocabulary. Examples: cites, amends, repeals, implements, transposes, overrules, affirms, commences, language_variant. See EDGE-TYPES for the full taxonomy.

reference -- Raw citation text as found in the source document. Preserved even after resolution for traceability. Example: "article L. 121-1 du code de la consommation".

properties -- JSONB metadata on the relationship. Standard provenance keys (see ADR: edge pipeline architecture, design-decisions/2026-04-22-edge-pipeline-architecture.md):

  • extraction: "publisher_provided" (source has a structured link) or "id_resolved" (textual reference resolved by the reference resolver).
  • source_type: identifies the source system (legi_lien, cellar_sparql, hudoc_appno, cellar_nim).
  • raw_link_type: source-native link type preserved for traceability (e.g., "TRANSPOSITION", "cdm:resource_legal_amends_resource_legal").

Type-specific examples: {"scope": "partial", "articles": ["1", "2", "3"]} for partial amendments, {"commencement_date": "2024-01-01", "territorial_extent": "england_and_wales"} for UK effects, {"reservation": "sous reserve de..."} for conditional constitutionality decisions.

Quarantine model

[edit | edit source]

Edges with unresolved targets are stored with target_id = NULL and the raw citation in reference. They are visible to MCP and queryable for monitoring. The partial unique index idx_edge_unique_without_target ON (source_id, reference, kind) WHERE target_id IS NULL prevents duplicate quarantine entries.

Ownership: an edge is owned by the jurisdiction prefix of its source_id. Drop-and-reingest for jurisdiction X deletes edges WHERE source_id LIKE '{x}.%'. Cross-jurisdiction quarantine edges (e.g., EU NIM → FR target) survive the target jurisdiction's reingest.

Reconciliation: after each jurisdiction ingest, a post-ingest pass attempts to resolve quarantine edges whose reference matches the freshly ingested jurisdiction. Resolved → UPDATE target_id. Still unresolved → remains in quarantine.

Why no foreign keys

[edit | edit source]

At 50M+ edges, FK checks on every insert become a measurable bottleneck. Sources are ingested independently and in parallel -- a French decision citing a EU directive may be ingested before the directive itself. FKs would either block ingestion order or require two-pass inserts.

Orphan cleanup runs periodically (post-ingest batch job), not on every write.


Schema: corpus.tag_stats

[edit | edit source]

<syntaxhighlight lang="sql"> CREATE TABLE corpus.tag_stats (

   kind          text NOT NULL,
   jurisdiction  text NOT NULL,
   tag_key       text NOT NULL,
   tag_value     text NOT NULL,
   count         integer DEFAULT 0,
   PRIMARY KEY (kind, jurisdiction, tag_key, tag_value)

); </syntaxhighlight>

Refreshed post-ingest, incrementally per source. Provides instant faceted counts without scanning documents.

Must handle both scalar and array tag values. For scalar tags, count directly. For array tags (e.g., tags.themes = ["droit civil", "responsabilite"]), use jsonb_array_elements_text to explode the array and count each element individually.


Kind taxonomy

[edit | edit source]

6 structural values. These classify document shape, not legal category.

Kind What it is What it is NOT Examples
legislation Creates, interprets, or defines a rule. Normative or quasi-normative content. Not a resolution of a specific dispute. Statute, regulation, decree, circular, directive, fatwa, ministerial order, regulatory handbook
decision Resolves a specific dispute or case. Issued by a court or quasi-judicial body. Not a general rule, even if it sets precedent. Judgment, ruling, order, arbitral award, AG opinion
record Factual registration in an official registry. Not normative, not adjudicative. Company filing, trademark registration, patent, land title, birth certificate
notice Official announcement or publication. Time-bound, informational. Not normative long-term. Gazette entry, public consultation, tender, regulatory announcement
section Structural subdivision of another document. Has a parent_id. Not a standalone document. Part, title, chapter, article, schedule, annex
chunk Processing artifact. A fragment of a document created for search or embedding. Not an editorial unit. Paragraph-level split for RAG, embedding chunk

Disambiguation rule

[edit | edit source]

When a document is ambiguous:

  1. Does it resolve a specific dispute between parties? -> decision
  2. Does it create, interpret, or define a general rule? -> legislation
  3. Still ambiguous? -> legislation

The legal category goes in tags.type (e.g., "type": "sumula_vinculante", "type": "tesis_jurisprudencial").

Stress-test: ambiguous documents across jurisdictions

[edit | edit source]
Document Jurisdiction Kind Rationale
Sumula vinculante BR legislation Binding precedent abstract, creates a general rule. Not resolving a specific dispute.
Tesis jurisprudencial MX legislation Abstracted legal thesis from repeated rulings. General rule, not a specific dispute.
Judicial interpretation (司法解释) CN legislation Supreme People's Court quasi-legislation. Creates binding rules of general application.
Dictamen CGR CL legislation Comptroller General opinion. Interprets administrative law generally.
AG Opinion (CJEU) EU decision Issued in the context of a specific case, advises on its resolution.
FCA Handbook GB legislation Financial Conduct Authority regulatory rules. Normative, not case-specific.
Fatwa SA legislation Religious-legal ruling of general application. Interprets Islamic law.
Administrative guidance (通達 tsutatsu) JP legislation Ministry directive to subordinate agencies. General rule, not dispute resolution.
Restatement section US legislation Scholarly synthesis of common law rules. General rule formulation.
Grundsatzentscheidung DE decision Federal Court principle decision. Resolves a specific case, even if it sets precedent.
Arret de principe FR decision Cour de cassation principle ruling. Resolves a specific dispute.
Sentenza interpretativa di rigetto IT decision Constitutional Court interpretive rejection. Resolves a specific constitutional question.
Advisory Opinion (ICJ) INT decision Issued on a specific question referred by a UN organ.
Pleno no jurisdiccional ES decision Supreme Court plenary agreement. Resolves doctrinal conflict arising from cases.
Bepaling van algemene strekking NL legislation General binding regulation. Normative, not case resolution.
Constitutional Court communique TR legislation Press release summarizing a decision. Informational/normative summary.
Practice Direction GB legislation Court-issued procedural rules. General application.
Reglement grand-ducal LU legislation Grand-ducal regulation. Normative.
Acordao vinculante STF BR decision Binding Supreme Court judgment. Resolves a specific case.
Royal Decree (مرسوم ملكي) SA legislation Royal legislative act. Creates general rules.
Verfassungsbeschwerde decision DE decision Constitutional complaint resolution. Specific dispute.
Recurso extraordinario BR decision Extraordinary appeal. Resolves a specific case.
EU Regulation EU legislation Directly applicable legislation. General normative act.
OHADA Uniform Act INT legislation Harmonized commercial law. General normative act.
Tax ruling (rescrit) FR legislation Tax authority interpretation. General application (even if triggered by a specific request).

Benchmark results

[edit | edit source]

Measured on 3.4M rows, PostgreSQL 16, single-node, 16 GB RAM.

Query Corpus schema Previous 16-table schema Speedup
Tag filter + date range 0.9 ms B-tree scan, instant
FTS + tag filter (broad, 111K results) 4.4 s 31 s 7x faster (BitmapAnd vs Seq Scan)
Tag-only + date 1.4 ms B-tree scan, instant
FTS narrow (17K results, warm) 565 ms Partial GIN index
Get by ID 0.3 ms 0.3 ms PK lookup
Tag discovery (GROUP BY court, 3.4M rows) 2.7 s Seq scan — justifies tag_stats

The FTS speedup comes from BitmapAnd of GIN (FTS) + GIN (tags) indexes. The previous schema used a Parallel Seq Scan because the planner could not combine B-tree (court column) with GIN (FTS) efficiently.


Operational notes

[edit | edit source]

No FK on edges or parent_id. Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job.

Partial GIN indexes per kind. See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity.

WHERE IS DISTINCT FROM in upserts. Use IS DISTINCT FROM instead of != in ON CONFLICT ... DO UPDATE SET ... WHERE clauses. != returns NULL when comparing with NULL (and the row is not updated), while IS DISTINCT FROM treats NULL as a comparable value. Critical for columns like date_end, body_search, language that are frequently NULL.

Autovacuum tuning. Set autovacuum_vacuum_scale_factor = 0.01 on corpus.documents. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control.

Estimated counts at scale. COUNT(*) OVER() becomes expensive above 100M rows. Use EXPLAIN-based row estimates or pg_class.reltuples for approximate counts in paginated APIs. Exact counts only when explicitly requested.

CJK full-text search. PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. content_fts is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path.

Partitioning at CN scale. At 130M+ rows (realistic for a full Chinese legal corpus), partition documents by jurisdiction. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries.


corpus.source_metadata

[edit | edit source]

Tracks data sources, their static descriptive metadata, and dynamic statistics recomputed at the end of each ingest cycle.

<syntaxhighlight lang="sql"> CREATE TABLE corpus.source_metadata (

   source_key             text PRIMARY KEY,
   parent_source_key      text,
   jurisdiction           text NOT NULL,
   name                   text NOT NULL,
   description            text NOT NULL,
   kind                   text NOT NULL DEFAULT 'legislation',
   publisher              text NOT NULL,
   publisher_url          text NOT NULL,
   license                text NOT NULL,
   license_url            text NOT NULL,
   language               text NOT NULL DEFAULT 'fr',
   coverage_start_year    integer,
   document_count         integer,
   freshest_document_date date,
   last_file              text,
   source_updated         date,
   ingested_at            timestamptz

); </syntaxhighlight>

Used by the MCP server (guidelines, data freshness display) and the portal. Sub-sources (e.g., bodacc_pcl) have parent_source_key set.

Static vs dynamic columns

[edit | edit source]
  • Static (written by sync_source_registry() at init from SOURCE_REGISTRY): source_key, parent_source_key, jurisdiction, name, description, kind, publisher, publisher_url, license, license_url, language.
  • Dynamic (written by refresh_corpus_stats() at the end of each ingest cycle, including the dedup steps inside run_post_ingest()): coverage_start_year (= MIN(date) per source), document_count (= COUNT(*) per source), freshest_document_date (= MAX(date) per source). Top-level parent rows that have sub-sources are populated by a SUM/MIN/MAX rollup over their children.
  • Operational (written by state.set_ingest_state() per ingest run): last_file, source_updated, ingested_at. These track pipeline state, not content state, and are not displayed in MCP responses.

sync_source_registry() MUST NOT touch dynamic columns — they would be reset to NULL on every init. The INSERT and ON CONFLICT DO UPDATE SET clauses only list static columns.


Required PostgreSQL extensions

[edit | edit source]

<syntaxhighlight lang="sql"> CREATE EXTENSION IF NOT EXISTS unaccent; -- accent-insensitive FTS -- Optional (for CJK jurisdictions): -- CREATE EXTENSION IF NOT EXISTS pgroonga; -- CJK full-text search </syntaxhighlight>


Complete DDL execution order

[edit | edit source]
  1. CREATE SCHEMA corpus;
  2. CREATE EXTENSION IF NOT EXISTS unaccent;
  3. CREATE TABLE corpus.documents (...); (14 columns)
  4. CREATE TABLE corpus.edges (...);
  5. CREATE TABLE corpus.tag_stats (...);
  6. CREATE TABLE corpus.source_metadata (...);
  7. CREATE FUNCTION corpus.normalize_for_fts(...); (see FTS)
  8. CREATE FUNCTION corpus.update_content_fts(...); (see FTS)
  9. CREATE TRIGGER trg_content_fts ...;
  10. All indexes (B-tree, GIN, partial GIN)
  11. ANALYZE corpus.documents;

For the graph schema (future): CREATE SCHEMA graph; + tables from GRAPH.


ID format convention

[edit | edit source]

Document IDs use source-native identifiers, always lowercase. Rules:

  • All document IDs carry a lowercase jurisdiction prefix: {jurisdiction}.{source_id} (e.g., fr.legiarti000006902764, fr.juritext000041701651, fr.cetatext000046783006, eu.eurlextext32016r0679).
  • Identifiers with built-in jurisdiction context (ECLI:*, CELEX:*, [year] UKSC) may omit the prefix.
  • For non-legal registries where the source ID could collide, use {jurisdiction}.{source}. (e.g., fr.company.443061841, gb.company.12345678).
  • Decision: the ingest plugin author determines the prefix format, documented in the plugin's tag schema.

Bidirectional edges convention

[edit | edit source]

Some edge types are logically bidirectional (language_variant, corresponds_to, associated_text, joined_with, pilot_follower). The convention:

  • Store ONE edge, not two.
  • Pick a canonical direction (alphabetical by source_id, or the "primary" document in the pair).
  • Queries for bidirectional edges check both directions: WHERE (source_id = X OR target_id = X) AND kind = 'language_variant'.

Jurisdiction subdivision mapping in FTS trigger

[edit | edit source]

The trigger resolves language from coalesce(language, jurisdiction). For subdivision jurisdictions like gb-sct, es-ct, de-by:

  • If language is set (recommended for subdivisions), it is used directly.
  • If language is NULL, the trigger extracts the country prefix: split_part(jurisdiction, '-', 1)gb. Since gb is not a language code in the mapping (the mapping uses ISO 639-1 codes like en), it falls through to simple. Set language = 'en' explicitly for English-language jurisdictions like gb, gb-sct, us.

This extraction is in the FTS trigger code (see FTS).


Migration strategy

[edit | edit source]

Migration from the current 16-table schema is incremental and reversible:

  1. Create corpus.* tables alongside existing tables (coexistence).
  2. Migrate one kind at a time with SQL INSERT...SELECT scripts (decision first — largest and most exercised).
  3. Benchmark each migrated kind against the equivalent old-table queries.
  4. Adapt the MCP server to read from corpus.documents (dual-read during transition).
  5. Switch MCP tools one by one (search first, then get, then browse).
  6. When all tools read from corpus.*, drop old tables.
  7. Migrate citationscorpus.edges (same process).
  8. Refresh corpus.tag_stats.

Rollback: during coexistence, the old tables are still the source of truth. If migration fails, revert MCP to old tables.