Corpus/Schema
Corpus Schema
[edit | edit source]Technical specification for the Dura Lex unified corpus schema. Reference for developers and AI agents.
Overview
[edit | edit source]One documents table with promoted columns and a JSONB tags column, inspired by OpenStreetMap's model: a small fixed schema handles structure, while a flexible key-value store handles everything jurisdiction-specific.
Validated across 25 jurisdictions on 5 continents. Benchmark on 3.4M rows: 7x faster than the previous 16-table normalized schema.
Four tables:
| Table | Purpose |
|---|---|
corpus.documents |
All legal content: legislation, case law, regulatory filings, administrative notices |
corpus.edges |
Relationships between documents: citations, amendments, hierarchy, consolidation |
corpus.tag_stats |
Aggregated tag counts per kind/jurisdiction, refreshed post-ingest |
corpus.source_metadata |
Data source registry: publishers, licenses, ingestion state, freshness |
Schema: corpus.documents
[edit | edit source]14 columns.
<syntaxhighlight lang="sql"> CREATE SCHEMA IF NOT EXISTS corpus;
CREATE TABLE corpus.documents (
id text PRIMARY KEY,
kind text NOT NULL,
jurisdiction text NOT NULL,
language text,
source text NOT NULL,
date date,
date_end date,
parent_id text,
title text,
body text,
body_search text,
tags jsonb NOT NULL DEFAULT '{}',
content_fts tsvector,
ingested_at timestamptz DEFAULT now()
); </syntaxhighlight>
Column definitions
[edit | edit source]id -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always lowercase: {jurisdiction}.{source_id} (e.g., fr.legiarti000006902764, eu.eurlextext32016r0679). Identifiers with built-in jurisdiction context keep their original form (ECLI:FR:CCASS:2024:CR00123, [2024] UKSC 1). For non-legal registries where the source ID could collide, prefix with {jurisdiction}.{source}. (e.g., fr.company.443061841, gb.company.12345678). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md).
kind -- Structural classification. Exactly 6 values (see Kind Taxonomy below): legislation, decision, record, notice, section, chunk. This column answers "what shape is this document?" not "what legal category is it?" -- legal category belongs in tags.type.
jurisdiction -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (fr, gb, de, br, jp). ISO 3166-2 for subdivisions (us-ca, gb-sct, es-ct, de-by). Extensions for supranational areas: eu (European law — EU institutions and Council of Europe), int (international treaties). A search for a jurisdiction includes its subdivisions: searching us includes us-ca. No default value -- every document must declare its jurisdiction explicitly.
language -- ISO 639-1 (fr, en, de, ar, zh, ja, ko). Nullable. When NULL, the FTS trigger infers language from jurisdiction. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (be) but different languages.
source -- Data source identifier. Examples: legi, cass, hmrc_manuals, cendoj, boe, legifrance_jorf, knesset. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction.
date -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: decision_date for decisions, valid_from for legislation, publication_date for notices, creation_date for records. The semantic meaning is implicit from kind -- not stored separately.
date_end -- End date. valid_until for legislation, expiration for notices, closure for records. NULL for most decisions (they don't expire). NULL does not mean "still in force" -- it means the source did not provide an end date.
parent_id -- Hierarchical link to another document ID. Used for structure trees (code -> article), sub-documents (decision -> chunk), and entity hierarchies (company -> director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint.
title -- Display title. Free text. May be NULL for chunks and sections that inherit their parent's title.
body -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, body is a clean stub like <p>Source: <a href="...">PDF officiel</a></p> and the extracted text goes to body_search. body is what get_document returns.
body_search -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly — get_document never returns it. Nullable. When NULL, the FTS trigger falls back to body via coalesce(body_search, body). Used for: (a) PDF-sourced documents where body is a clean stub and body_search holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading tags.summary and tags.headnote_classification directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See QUALITY for tags.content_quality levels.
tags -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: {"type": "loi", "nid": "JORFTEXT000041865244", "nature": "LOI"} for French legislation, {"chamber": "2e_civ", "solution": "cassation"} for French cassation decisions, {"sic_code": "6411", "status": "active"} for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via @>, ?, and jsonb_path_ops GIN index.
content_fts -- tsvector column, populated by a trigger with three weights: A=title, B=tags.summary+tags.headnote_classification (read from JSONB), C=coalesce(body_search, body) with HTML tags stripped. Uses the appropriate language configuration (derived from language or jurisdiction). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See FTS for full trigger details.
ingested_at -- Timestamp of first ingestion. DEFAULT now(). Not updated on re-ingestion -- use tags for versioning metadata if needed.
Indexes
[edit | edit source]<syntaxhighlight lang="sql"> -- B-tree indexes on promoted columns CREATE INDEX idx_doc_kind ON corpus.documents (kind); CREATE INDEX idx_doc_jurisdiction ON corpus.documents (jurisdiction); CREATE INDEX idx_doc_source ON corpus.documents (source); CREATE INDEX idx_doc_date ON corpus.documents (date); CREATE INDEX idx_doc_date_end ON corpus.documents (date_end) WHERE date_end IS NOT NULL; CREATE INDEX idx_doc_parent ON corpus.documents (parent_id) WHERE parent_id IS NOT NULL; CREATE INDEX idx_doc_kind_juris ON corpus.documents (kind, jurisdiction); CREATE INDEX idx_doc_kind_source ON corpus.documents (kind, source);
-- GIN index on tags CREATE INDEX idx_doc_tags ON corpus.documents USING GIN (tags jsonb_path_ops);
-- Partial GIN indexes per kind (FTS isolation) CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation'; CREATE INDEX idx_doc_fts_decision ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision'; CREATE INDEX idx_doc_fts_record ON corpus.documents USING GIN (content_fts) WHERE kind = 'record'; CREATE INDEX idx_doc_fts_notice ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice'; </syntaxhighlight>
Why partial GIN indexes per kind instead of table partitioning
[edit | edit source]Partitioning documents by kind would require kind in the primary key (PostgreSQL requires the partition key in the PK). That means id alone can no longer be a PK -- every FK, every edge reference, every external link would need to carry (id, kind) instead of just id. The complexity cascades through the entire system.
Partial GIN indexes achieve 80% of the performance benefit with 0% of the complexity. When a query filters WHERE kind = 'legislation' AND content_fts @@ ..., PostgreSQL uses the partial index idx_doc_fts_legislation which contains only legislation tsvectors. The planner skips the other kinds entirely. Index size is smaller, scans are faster, and the PK stays a simple text column.
No partial index for section and chunk because they rarely have body/body_search content worth full-text indexing.
Schema: corpus.edges
[edit | edit source]<syntaxhighlight lang="sql"> CREATE TABLE corpus.edges (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
source_id text NOT NULL,
target_id text,
kind text NOT NULL,
reference text,
properties jsonb DEFAULT '{}',
UNIQUE (source_id, target_id, kind)
);
CREATE INDEX idx_edge_source ON corpus.edges (source_id); CREATE INDEX idx_edge_target ON corpus.edges (target_id) WHERE target_id IS NOT NULL; CREATE INDEX idx_edge_kind ON corpus.edges (kind); </syntaxhighlight>
Column definitions
[edit | edit source]id -- Synthetic identity, bigint GENERATED ALWAYS AS IDENTITY. Edges have no natural key.
source_id -- The document that originates the relationship. Not a foreign key.
target_id -- The document that receives the relationship. Nullable: an unresolved reference (a citation to a document not yet ingested, or a reference to an external system) has target_id = NULL and reference containing the raw citation text. Once the target is ingested or resolved, target_id is backfilled.
kind -- Relationship type. Open vocabulary. Examples: cites, amends, repeals, implements, transposes, overrules, affirms, commences, language_variant. See EDGE-TYPES for the full taxonomy.
reference -- Raw citation text as found in the source document. Preserved even after resolution for traceability. Example: "article L. 121-1 du code de la consommation".
properties -- JSONB metadata on the relationship. Standard provenance keys (see ADR: edge pipeline architecture, design-decisions/2026-04-22-edge-pipeline-architecture.md):
extraction:"publisher_provided"(source has a structured link) or"id_resolved"(textual reference resolved by the reference resolver).source_type: identifies the source system (legi_lien,cellar_sparql,hudoc_appno,cellar_nim).raw_link_type: source-native link type preserved for traceability (e.g.,"TRANSPOSITION","cdm:resource_legal_amends_resource_legal").
Type-specific examples: {"scope": "partial", "articles": ["1", "2", "3"]} for partial amendments, {"commencement_date": "2024-01-01", "territorial_extent": "england_and_wales"} for UK effects, {"reservation": "sous reserve de..."} for conditional constitutionality decisions.
Quarantine model
[edit | edit source]Edges with unresolved targets are stored with target_id = NULL and the raw citation in reference. They are visible to MCP and queryable for monitoring. The partial unique index idx_edge_unique_without_target ON (source_id, reference, kind) WHERE target_id IS NULL prevents duplicate quarantine entries.
Ownership: an edge is owned by the jurisdiction prefix of its source_id. Drop-and-reingest for jurisdiction X deletes edges WHERE source_id LIKE '{x}.%'. Cross-jurisdiction quarantine edges (e.g., EU NIM → FR target) survive the target jurisdiction's reingest.
Reconciliation: after each jurisdiction ingest, a post-ingest pass attempts to resolve quarantine edges whose reference matches the freshly ingested jurisdiction. Resolved → UPDATE target_id. Still unresolved → remains in quarantine.
Why no foreign keys
[edit | edit source]At 50M+ edges, FK checks on every insert become a measurable bottleneck. Sources are ingested independently and in parallel -- a French decision citing a EU directive may be ingested before the directive itself. FKs would either block ingestion order or require two-pass inserts.
Orphan cleanup runs periodically (post-ingest batch job), not on every write.
Schema: corpus.tag_stats
[edit | edit source]<syntaxhighlight lang="sql"> CREATE TABLE corpus.tag_stats (
kind text NOT NULL, jurisdiction text NOT NULL, tag_key text NOT NULL, tag_value text NOT NULL, count integer DEFAULT 0, PRIMARY KEY (kind, jurisdiction, tag_key, tag_value)
); </syntaxhighlight>
Refreshed post-ingest, incrementally per source. Provides instant faceted counts without scanning documents.
Must handle both scalar and array tag values. For scalar tags, count directly. For array tags (e.g., tags.themes = ["droit civil", "responsabilite"]), use jsonb_array_elements_text to explode the array and count each element individually.
Kind taxonomy
[edit | edit source]6 structural values. These classify document shape, not legal category.
| Kind | What it is | What it is NOT | Examples |
|---|---|---|---|
legislation |
Creates, interprets, or defines a rule. Normative or quasi-normative content. | Not a resolution of a specific dispute. | Statute, regulation, decree, circular, directive, fatwa, ministerial order, regulatory handbook |
decision |
Resolves a specific dispute or case. Issued by a court or quasi-judicial body. | Not a general rule, even if it sets precedent. | Judgment, ruling, order, arbitral award, AG opinion |
record |
Factual registration in an official registry. | Not normative, not adjudicative. | Company filing, trademark registration, patent, land title, birth certificate |
notice |
Official announcement or publication. Time-bound, informational. | Not normative long-term. | Gazette entry, public consultation, tender, regulatory announcement |
section |
Structural subdivision of another document. Has a parent_id. |
Not a standalone document. | Part, title, chapter, article, schedule, annex |
chunk |
Processing artifact. A fragment of a document created for search or embedding. | Not an editorial unit. | Paragraph-level split for RAG, embedding chunk |
Disambiguation rule
[edit | edit source]When a document is ambiguous:
- Does it resolve a specific dispute between parties? ->
decision - Does it create, interpret, or define a general rule? ->
legislation - Still ambiguous? ->
legislation
The legal category goes in tags.type (e.g., "type": "sumula_vinculante", "type": "tesis_jurisprudencial").
Stress-test: ambiguous documents across jurisdictions
[edit | edit source]| Document | Jurisdiction | Kind | Rationale |
|---|---|---|---|
| Sumula vinculante | BR | legislation |
Binding precedent abstract, creates a general rule. Not resolving a specific dispute. |
| Tesis jurisprudencial | MX | legislation |
Abstracted legal thesis from repeated rulings. General rule, not a specific dispute. |
| Judicial interpretation (司法解释) | CN | legislation |
Supreme People's Court quasi-legislation. Creates binding rules of general application. |
| Dictamen CGR | CL | legislation |
Comptroller General opinion. Interprets administrative law generally. |
| AG Opinion (CJEU) | EU | decision |
Issued in the context of a specific case, advises on its resolution. |
| FCA Handbook | GB | legislation |
Financial Conduct Authority regulatory rules. Normative, not case-specific. |
| Fatwa | SA | legislation |
Religious-legal ruling of general application. Interprets Islamic law. |
| Administrative guidance (通達 tsutatsu) | JP | legislation |
Ministry directive to subordinate agencies. General rule, not dispute resolution. |
| Restatement section | US | legislation |
Scholarly synthesis of common law rules. General rule formulation. |
| Grundsatzentscheidung | DE | decision |
Federal Court principle decision. Resolves a specific case, even if it sets precedent. |
| Arret de principe | FR | decision |
Cour de cassation principle ruling. Resolves a specific dispute. |
| Sentenza interpretativa di rigetto | IT | decision |
Constitutional Court interpretive rejection. Resolves a specific constitutional question. |
| Advisory Opinion (ICJ) | INT | decision |
Issued on a specific question referred by a UN organ. |
| Pleno no jurisdiccional | ES | decision |
Supreme Court plenary agreement. Resolves doctrinal conflict arising from cases. |
| Bepaling van algemene strekking | NL | legislation |
General binding regulation. Normative, not case resolution. |
| Constitutional Court communique | TR | legislation |
Press release summarizing a decision. Informational/normative summary. |
| Practice Direction | GB | legislation |
Court-issued procedural rules. General application. |
| Reglement grand-ducal | LU | legislation |
Grand-ducal regulation. Normative. |
| Acordao vinculante STF | BR | decision |
Binding Supreme Court judgment. Resolves a specific case. |
| Royal Decree (مرسوم ملكي) | SA | legislation |
Royal legislative act. Creates general rules. |
| Verfassungsbeschwerde decision | DE | decision |
Constitutional complaint resolution. Specific dispute. |
| Recurso extraordinario | BR | decision |
Extraordinary appeal. Resolves a specific case. |
| EU Regulation | EU | legislation |
Directly applicable legislation. General normative act. |
| OHADA Uniform Act | INT | legislation |
Harmonized commercial law. General normative act. |
| Tax ruling (rescrit) | FR | legislation |
Tax authority interpretation. General application (even if triggered by a specific request). |
Benchmark results
[edit | edit source]Measured on 3.4M rows, PostgreSQL 16, single-node, 16 GB RAM.
| Query | Corpus schema | Previous 16-table schema | Speedup |
|---|---|---|---|
| Tag filter + date range | 0.9 ms | — | B-tree scan, instant |
| FTS + tag filter (broad, 111K results) | 4.4 s | 31 s | 7x faster (BitmapAnd vs Seq Scan) |
| Tag-only + date | 1.4 ms | — | B-tree scan, instant |
| FTS narrow (17K results, warm) | 565 ms | — | Partial GIN index |
| Get by ID | 0.3 ms | 0.3 ms | PK lookup |
| Tag discovery (GROUP BY court, 3.4M rows) | 2.7 s | — | Seq scan — justifies tag_stats |
The FTS speedup comes from BitmapAnd of GIN (FTS) + GIN (tags) indexes. The previous schema used a Parallel Seq Scan because the planner could not combine B-tree (court column) with GIN (FTS) efficiently.
Operational notes
[edit | edit source]No FK on edges or parent_id. Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job.
Partial GIN indexes per kind. See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity.
WHERE IS DISTINCT FROM in upserts. Use IS DISTINCT FROM instead of != in ON CONFLICT ... DO UPDATE SET ... WHERE clauses. != returns NULL when comparing with NULL (and the row is not updated), while IS DISTINCT FROM treats NULL as a comparable value. Critical for columns like date_end, body_search, language that are frequently NULL.
Autovacuum tuning. Set autovacuum_vacuum_scale_factor = 0.01 on corpus.documents. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control.
Estimated counts at scale. COUNT(*) OVER() becomes expensive above 100M rows. Use EXPLAIN-based row estimates or pg_class.reltuples for approximate counts in paginated APIs. Exact counts only when explicitly requested.
CJK full-text search. PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. content_fts is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path.
Partitioning at CN scale. At 130M+ rows (realistic for a full Chinese legal corpus), partition documents by jurisdiction. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries.
corpus.source_metadata
[edit | edit source]Tracks data sources, their static descriptive metadata, and dynamic statistics recomputed at the end of each ingest cycle.
<syntaxhighlight lang="sql"> CREATE TABLE corpus.source_metadata (
source_key text PRIMARY KEY, parent_source_key text, jurisdiction text NOT NULL, name text NOT NULL, description text NOT NULL, kind text NOT NULL DEFAULT 'legislation', publisher text NOT NULL, publisher_url text NOT NULL, license text NOT NULL, license_url text NOT NULL, language text NOT NULL DEFAULT 'fr', coverage_start_year integer, document_count integer, freshest_document_date date, last_file text, source_updated date, ingested_at timestamptz
); </syntaxhighlight>
Used by the MCP server (guidelines, data freshness display) and the portal. Sub-sources (e.g., bodacc_pcl) have parent_source_key set.
Static vs dynamic columns
[edit | edit source]- Static (written by
sync_source_registry()at init fromSOURCE_REGISTRY):source_key,parent_source_key,jurisdiction,name,description,kind,publisher,publisher_url,license,license_url,language. - Dynamic (written by
refresh_corpus_stats()at the end of each ingest cycle, including the dedup steps insiderun_post_ingest()):coverage_start_year(=MIN(date)per source),document_count(=COUNT(*)per source),freshest_document_date(=MAX(date)per source). Top-level parent rows that have sub-sources are populated by a SUM/MIN/MAX rollup over their children. - Operational (written by
state.set_ingest_state()per ingest run):last_file,source_updated,ingested_at. These track pipeline state, not content state, and are not displayed in MCP responses.
sync_source_registry() MUST NOT touch dynamic columns — they would be reset to NULL on every init. The INSERT and ON CONFLICT DO UPDATE SET clauses only list static columns.
Required PostgreSQL extensions
[edit | edit source]<syntaxhighlight lang="sql"> CREATE EXTENSION IF NOT EXISTS unaccent; -- accent-insensitive FTS -- Optional (for CJK jurisdictions): -- CREATE EXTENSION IF NOT EXISTS pgroonga; -- CJK full-text search </syntaxhighlight>
Complete DDL execution order
[edit | edit source]CREATE SCHEMA corpus;CREATE EXTENSION IF NOT EXISTS unaccent;CREATE TABLE corpus.documents (...);(14 columns)CREATE TABLE corpus.edges (...);CREATE TABLE corpus.tag_stats (...);CREATE TABLE corpus.source_metadata (...);CREATE FUNCTION corpus.normalize_for_fts(...);(see FTS)CREATE FUNCTION corpus.update_content_fts(...);(see FTS)CREATE TRIGGER trg_content_fts ...;- All indexes (B-tree, GIN, partial GIN)
ANALYZE corpus.documents;
For the graph schema (future): CREATE SCHEMA graph; + tables from GRAPH.
ID format convention
[edit | edit source]Document IDs use source-native identifiers, always lowercase. Rules:
- All document IDs carry a lowercase jurisdiction prefix:
{jurisdiction}.{source_id}(e.g.,fr.legiarti000006902764,fr.juritext000041701651,fr.cetatext000046783006,eu.eurlextext32016r0679). - Identifiers with built-in jurisdiction context (ECLI:*, CELEX:*, [year] UKSC) may omit the prefix.
- For non-legal registries where the source ID could collide, use
{jurisdiction}.{source}.(e.g.,fr.company.443061841,gb.company.12345678). - Decision: the ingest plugin author determines the prefix format, documented in the plugin's tag schema.
Bidirectional edges convention
[edit | edit source]Some edge types are logically bidirectional (language_variant, corresponds_to, associated_text, joined_with, pilot_follower). The convention:
- Store ONE edge, not two.
- Pick a canonical direction (alphabetical by source_id, or the "primary" document in the pair).
- Queries for bidirectional edges check both directions:
WHERE (source_id = X OR target_id = X) AND kind = 'language_variant'.
Jurisdiction subdivision mapping in FTS trigger
[edit | edit source]The trigger resolves language from coalesce(language, jurisdiction). For subdivision jurisdictions like gb-sct, es-ct, de-by:
- If
languageis set (recommended for subdivisions), it is used directly. - If
languageis NULL, the trigger extracts the country prefix:split_part(jurisdiction, '-', 1)→gb. Sincegbis not a language code in the mapping (the mapping uses ISO 639-1 codes likeen), it falls through tosimple. Setlanguage = 'en'explicitly for English-language jurisdictions likegb,gb-sct,us.
This extraction is in the FTS trigger code (see FTS).
Migration strategy
[edit | edit source]Migration from the current 16-table schema is incremental and reversible:
- Create
corpus.*tables alongside existing tables (coexistence). - Migrate one kind at a time with SQL INSERT...SELECT scripts (decision first — largest and most exercised).
- Benchmark each migrated kind against the equivalent old-table queries.
- Adapt the MCP server to read from
corpus.documents(dual-read during transition). - Switch MCP tools one by one (search first, then get, then browse).
- When all tools read from
corpus.*, drop old tables. - Migrate
citations→corpus.edges(same process). - Refresh
corpus.tag_stats.
Rollback: during coexistence, the old tables are still the source of truth. If migration fails, revert MCP to old tables.