<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Corpus%2FSchema</id>
	<title>Corpus/Schema - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.dura-lex.org/index.php?action=history&amp;feed=atom&amp;title=Corpus%2FSchema"/>
	<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Corpus/Schema&amp;action=history"/>
	<updated>2026-04-23T05:46:05Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.dura-lex.org/index.php?title=Corpus/Schema&amp;diff=40&amp;oldid=prev</id>
		<title>Nicolas: Import from duralex/spec/CORPUS-SCHEMA.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)</title>
		<link rel="alternate" type="text/html" href="https://wiki.dura-lex.org/index.php?title=Corpus/Schema&amp;diff=40&amp;oldid=prev"/>
		<updated>2026-04-23T02:05:28Z</updated>

		<summary type="html">&lt;p&gt;Import from duralex/spec/CORPUS-SCHEMA.md — faithful conversion to wikitext (via create-page on MediaWiki MCP Server)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Corpus Schema =&lt;br /&gt;
&lt;br /&gt;
Technical specification for the Dura Lex unified corpus schema. Reference for developers and AI agents.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
One &amp;lt;code&amp;gt;documents&amp;lt;/code&amp;gt; table with promoted columns and a JSONB &amp;lt;code&amp;gt;tags&amp;lt;/code&amp;gt; column, inspired by OpenStreetMap&amp;#039;s model: a small fixed schema handles structure, while a flexible key-value store handles everything jurisdiction-specific.&lt;br /&gt;
&lt;br /&gt;
Validated across 25 jurisdictions on 5 continents. Benchmark on 3.4M rows: 7x faster than the previous 16-table normalized schema.&lt;br /&gt;
&lt;br /&gt;
Four tables:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Table !! Purpose&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt; || All legal content: legislation, case law, regulatory filings, administrative notices&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;corpus.edges&amp;lt;/code&amp;gt; || Relationships between documents: citations, amendments, hierarchy, consolidation&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;corpus.tag_stats&amp;lt;/code&amp;gt; || Aggregated tag counts per kind/jurisdiction, refreshed post-ingest&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;corpus.source_metadata&amp;lt;/code&amp;gt; || Data source registry: publishers, licenses, ingestion state, freshness&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Schema: corpus.documents ==&lt;br /&gt;
&lt;br /&gt;
14 columns.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE SCHEMA IF NOT EXISTS corpus;&lt;br /&gt;
&lt;br /&gt;
CREATE TABLE corpus.documents (&lt;br /&gt;
    id            text PRIMARY KEY,&lt;br /&gt;
    kind          text NOT NULL,&lt;br /&gt;
    jurisdiction  text NOT NULL,&lt;br /&gt;
    language      text,&lt;br /&gt;
    source        text NOT NULL,&lt;br /&gt;
    date          date,&lt;br /&gt;
    date_end      date,&lt;br /&gt;
    parent_id     text,&lt;br /&gt;
    title         text,&lt;br /&gt;
    body          text,&lt;br /&gt;
    body_search   text,&lt;br /&gt;
    tags          jsonb NOT NULL DEFAULT &amp;#039;{}&amp;#039;,&lt;br /&gt;
    content_fts   tsvector,&lt;br /&gt;
    ingested_at   timestamptz DEFAULT now()&lt;br /&gt;
);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Column definitions ===&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;id&amp;#039;&amp;#039;&amp;#039; -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always &amp;#039;&amp;#039;&amp;#039;lowercase&amp;#039;&amp;#039;&amp;#039;: &amp;lt;code&amp;gt;{jurisdiction}.{source_id}&amp;lt;/code&amp;gt; (e.g., &amp;lt;code&amp;gt;fr.legiarti000006902764&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;eu.eurlextext32016r0679&amp;lt;/code&amp;gt;). Identifiers with built-in jurisdiction context keep their original form (&amp;lt;code&amp;gt;ECLI:FR:CCASS:2024:CR00123&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;[2024] UKSC 1&amp;lt;/code&amp;gt;). For non-legal registries where the source ID could collide, prefix with &amp;lt;code&amp;gt;{jurisdiction}.{source}.&amp;lt;/code&amp;gt; (e.g., &amp;lt;code&amp;gt;fr.company.443061841&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;gb.company.12345678&amp;lt;/code&amp;gt;). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md).&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;kind&amp;#039;&amp;#039;&amp;#039; -- Structural classification. Exactly 6 values (see Kind Taxonomy below): &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;record&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;notice&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;section&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;chunk&amp;lt;/code&amp;gt;. This column answers &amp;quot;what shape is this document?&amp;quot; not &amp;quot;what legal category is it?&amp;quot; -- legal category belongs in &amp;lt;code&amp;gt;tags.type&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;jurisdiction&amp;#039;&amp;#039;&amp;#039; -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (&amp;lt;code&amp;gt;fr&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;gb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;de&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;br&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;jp&amp;lt;/code&amp;gt;). ISO 3166-2 for subdivisions (&amp;lt;code&amp;gt;us-ca&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;gb-sct&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;es-ct&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;de-by&amp;lt;/code&amp;gt;). Extensions for supranational areas: &amp;lt;code&amp;gt;eu&amp;lt;/code&amp;gt; (European law — EU institutions and Council of Europe), &amp;lt;code&amp;gt;int&amp;lt;/code&amp;gt; (international treaties). A search for a jurisdiction includes its subdivisions: searching &amp;lt;code&amp;gt;us&amp;lt;/code&amp;gt; includes &amp;lt;code&amp;gt;us-ca&amp;lt;/code&amp;gt;. No default value -- every document must declare its jurisdiction explicitly.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;language&amp;#039;&amp;#039;&amp;#039; -- ISO 639-1 (&amp;lt;code&amp;gt;fr&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;en&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;de&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ar&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;zh&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ja&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ko&amp;lt;/code&amp;gt;). Nullable. When NULL, the FTS trigger infers language from &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt;. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (&amp;lt;code&amp;gt;be&amp;lt;/code&amp;gt;) but different languages.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;source&amp;#039;&amp;#039;&amp;#039; -- Data source identifier. Examples: &amp;lt;code&amp;gt;legi&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;cass&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;hmrc_manuals&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;cendoj&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;boe&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;legifrance_jorf&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;knesset&amp;lt;/code&amp;gt;. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;date&amp;#039;&amp;#039;&amp;#039; -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: &amp;lt;code&amp;gt;decision_date&amp;lt;/code&amp;gt; for decisions, &amp;lt;code&amp;gt;valid_from&amp;lt;/code&amp;gt; for legislation, &amp;lt;code&amp;gt;publication_date&amp;lt;/code&amp;gt; for notices, &amp;lt;code&amp;gt;creation_date&amp;lt;/code&amp;gt; for records. The semantic meaning is implicit from &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt; -- not stored separately.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;date_end&amp;#039;&amp;#039;&amp;#039; -- End date. &amp;lt;code&amp;gt;valid_until&amp;lt;/code&amp;gt; for legislation, expiration for notices, closure for records. NULL for most decisions (they don&amp;#039;t expire). NULL does not mean &amp;quot;still in force&amp;quot; -- it means the source did not provide an end date.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;parent_id&amp;#039;&amp;#039;&amp;#039; -- Hierarchical link to another document ID. Used for structure trees (code -&amp;gt; article), sub-documents (decision -&amp;gt; chunk), and entity hierarchies (company -&amp;gt; director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;title&amp;#039;&amp;#039;&amp;#039; -- Display title. Free text. May be NULL for chunks and sections that inherit their parent&amp;#039;s title.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;body&amp;#039;&amp;#039;&amp;#039; -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is a clean stub like &amp;lt;code&amp;gt;&amp;amp;lt;p&amp;amp;gt;Source: &amp;amp;lt;a href=&amp;quot;...&amp;quot;&amp;amp;gt;PDF officiel&amp;amp;lt;/a&amp;amp;gt;&amp;amp;lt;/p&amp;amp;gt;&amp;lt;/code&amp;gt; and the extracted text goes to &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt;. &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is what &amp;lt;code&amp;gt;get_document&amp;lt;/code&amp;gt; returns.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;body_search&amp;#039;&amp;#039;&amp;#039; -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly — &amp;lt;code&amp;gt;get_document&amp;lt;/code&amp;gt; never returns it. Nullable. When NULL, the FTS trigger falls back to &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; via &amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt;. Used for: (a) PDF-sourced documents where &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt; is a clean stub and &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading &amp;lt;code&amp;gt;tags.summary&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;tags.headnote_classification&amp;lt;/code&amp;gt; directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See [[Corpus/Quality|QUALITY]] for &amp;lt;code&amp;gt;tags.content_quality&amp;lt;/code&amp;gt; levels.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;tags&amp;#039;&amp;#039;&amp;#039; -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: &amp;lt;code&amp;gt;{&amp;quot;type&amp;quot;: &amp;quot;loi&amp;quot;, &amp;quot;nid&amp;quot;: &amp;quot;JORFTEXT000041865244&amp;quot;, &amp;quot;nature&amp;quot;: &amp;quot;LOI&amp;quot;}&amp;lt;/code&amp;gt; for French legislation, &amp;lt;code&amp;gt;{&amp;quot;chamber&amp;quot;: &amp;quot;2e_civ&amp;quot;, &amp;quot;solution&amp;quot;: &amp;quot;cassation&amp;quot;}&amp;lt;/code&amp;gt; for French cassation decisions, &amp;lt;code&amp;gt;{&amp;quot;sic_code&amp;quot;: &amp;quot;6411&amp;quot;, &amp;quot;status&amp;quot;: &amp;quot;active&amp;quot;}&amp;lt;/code&amp;gt; for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via &amp;lt;code&amp;gt;@&amp;amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;jsonb_path_ops&amp;lt;/code&amp;gt; GIN index.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;content_fts&amp;#039;&amp;#039;&amp;#039; -- &amp;lt;code&amp;gt;tsvector&amp;lt;/code&amp;gt; column, populated by a trigger with three weights: A=title, B=&amp;lt;code&amp;gt;tags.summary&amp;lt;/code&amp;gt;+&amp;lt;code&amp;gt;tags.headnote_classification&amp;lt;/code&amp;gt; (read from JSONB), C=&amp;lt;code&amp;gt;coalesce(body_search, body)&amp;lt;/code&amp;gt; with HTML tags stripped. Uses the appropriate language configuration (derived from &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt;). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See [[Corpus/FTS|FTS]] for full trigger details.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;ingested_at&amp;#039;&amp;#039;&amp;#039; -- Timestamp of first ingestion. &amp;lt;code&amp;gt;DEFAULT now()&amp;lt;/code&amp;gt;. Not updated on re-ingestion -- use &amp;lt;code&amp;gt;tags&amp;lt;/code&amp;gt; for versioning metadata if needed.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Indexes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
-- B-tree indexes on promoted columns&lt;br /&gt;
CREATE INDEX idx_doc_kind          ON corpus.documents (kind);&lt;br /&gt;
CREATE INDEX idx_doc_jurisdiction   ON corpus.documents (jurisdiction);&lt;br /&gt;
CREATE INDEX idx_doc_source        ON corpus.documents (source);&lt;br /&gt;
CREATE INDEX idx_doc_date          ON corpus.documents (date);&lt;br /&gt;
CREATE INDEX idx_doc_date_end      ON corpus.documents (date_end)   WHERE date_end IS NOT NULL;&lt;br /&gt;
CREATE INDEX idx_doc_parent        ON corpus.documents (parent_id)  WHERE parent_id IS NOT NULL;&lt;br /&gt;
CREATE INDEX idx_doc_kind_juris    ON corpus.documents (kind, jurisdiction);&lt;br /&gt;
CREATE INDEX idx_doc_kind_source   ON corpus.documents (kind, source);&lt;br /&gt;
&lt;br /&gt;
-- GIN index on tags&lt;br /&gt;
CREATE INDEX idx_doc_tags          ON corpus.documents USING GIN (tags jsonb_path_ops);&lt;br /&gt;
&lt;br /&gt;
-- Partial GIN indexes per kind (FTS isolation)&lt;br /&gt;
CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;legislation&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_decision  ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;decision&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_record    ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;record&amp;#039;;&lt;br /&gt;
CREATE INDEX idx_doc_fts_notice    ON corpus.documents USING GIN (content_fts) WHERE kind = &amp;#039;notice&amp;#039;;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Why partial GIN indexes per kind instead of table partitioning ===&lt;br /&gt;
&lt;br /&gt;
Partitioning &amp;lt;code&amp;gt;documents&amp;lt;/code&amp;gt; by &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt; would require &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt; in the primary key (PostgreSQL requires the partition key in the PK). That means &amp;lt;code&amp;gt;id&amp;lt;/code&amp;gt; alone can no longer be a PK -- every FK, every edge reference, every external link would need to carry &amp;lt;code&amp;gt;(id, kind)&amp;lt;/code&amp;gt; instead of just &amp;lt;code&amp;gt;id&amp;lt;/code&amp;gt;. The complexity cascades through the entire system.&lt;br /&gt;
&lt;br /&gt;
Partial GIN indexes achieve 80% of the performance benefit with 0% of the complexity. When a query filters &amp;lt;code&amp;gt;WHERE kind = &amp;#039;legislation&amp;#039; AND content_fts @@ ...&amp;lt;/code&amp;gt;, PostgreSQL uses the partial index &amp;lt;code&amp;gt;idx_doc_fts_legislation&amp;lt;/code&amp;gt; which contains only legislation tsvectors. The planner skips the other kinds entirely. Index size is smaller, scans are faster, and the PK stays a simple &amp;lt;code&amp;gt;text&amp;lt;/code&amp;gt; column.&lt;br /&gt;
&lt;br /&gt;
No partial index for &amp;lt;code&amp;gt;section&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;chunk&amp;lt;/code&amp;gt; because they rarely have &amp;lt;code&amp;gt;body&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt; content worth full-text indexing.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Schema: corpus.edges ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE TABLE corpus.edges (&lt;br /&gt;
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,&lt;br /&gt;
    source_id   text NOT NULL,&lt;br /&gt;
    target_id   text,&lt;br /&gt;
    kind        text NOT NULL,&lt;br /&gt;
    reference   text,&lt;br /&gt;
    properties  jsonb DEFAULT &amp;#039;{}&amp;#039;,&lt;br /&gt;
    UNIQUE (source_id, target_id, kind)&lt;br /&gt;
);&lt;br /&gt;
&lt;br /&gt;
CREATE INDEX idx_edge_source ON corpus.edges (source_id);&lt;br /&gt;
CREATE INDEX idx_edge_target ON corpus.edges (target_id) WHERE target_id IS NOT NULL;&lt;br /&gt;
CREATE INDEX idx_edge_kind   ON corpus.edges (kind);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Column definitions ===&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;id&amp;#039;&amp;#039;&amp;#039; -- Synthetic identity, &amp;lt;code&amp;gt;bigint GENERATED ALWAYS AS IDENTITY&amp;lt;/code&amp;gt;. Edges have no natural key.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;source_id&amp;#039;&amp;#039;&amp;#039; -- The document that originates the relationship. Not a foreign key.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;target_id&amp;#039;&amp;#039;&amp;#039; -- The document that receives the relationship. Nullable: an unresolved reference (a citation to a document not yet ingested, or a reference to an external system) has &amp;lt;code&amp;gt;target_id = NULL&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;reference&amp;lt;/code&amp;gt; containing the raw citation text. Once the target is ingested or resolved, &amp;lt;code&amp;gt;target_id&amp;lt;/code&amp;gt; is backfilled.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;kind&amp;#039;&amp;#039;&amp;#039; -- Relationship type. Open vocabulary. Examples: &amp;lt;code&amp;gt;cites&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;amends&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;repeals&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;implements&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;transposes&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;overrules&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;affirms&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;commences&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;language_variant&amp;lt;/code&amp;gt;. See [[Corpus/Edge types|EDGE-TYPES]] for the full taxonomy.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;reference&amp;#039;&amp;#039;&amp;#039; -- Raw citation text as found in the source document. Preserved even after resolution for traceability. Example: &amp;lt;code&amp;gt;&amp;quot;article L. 121-1 du code de la consommation&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;properties&amp;#039;&amp;#039;&amp;#039; -- JSONB metadata on the relationship. Standard provenance keys (see ADR: edge pipeline architecture, design-decisions/2026-04-22-edge-pipeline-architecture.md):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;extraction&amp;lt;/code&amp;gt;: &amp;lt;code&amp;gt;&amp;quot;publisher_provided&amp;quot;&amp;lt;/code&amp;gt; (source has a structured link) or &amp;lt;code&amp;gt;&amp;quot;id_resolved&amp;quot;&amp;lt;/code&amp;gt; (textual reference resolved by the reference resolver).&lt;br /&gt;
* &amp;lt;code&amp;gt;source_type&amp;lt;/code&amp;gt;: identifies the source system (&amp;lt;code&amp;gt;legi_lien&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;cellar_sparql&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;hudoc_appno&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;cellar_nim&amp;lt;/code&amp;gt;).&lt;br /&gt;
* &amp;lt;code&amp;gt;raw_link_type&amp;lt;/code&amp;gt;: source-native link type preserved for traceability (e.g., &amp;lt;code&amp;gt;&amp;quot;TRANSPOSITION&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;cdm:resource_legal_amends_resource_legal&amp;quot;&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Type-specific examples: &amp;lt;code&amp;gt;{&amp;quot;scope&amp;quot;: &amp;quot;partial&amp;quot;, &amp;quot;articles&amp;quot;: [&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;3&amp;quot;]}&amp;lt;/code&amp;gt; for partial amendments, &amp;lt;code&amp;gt;{&amp;quot;commencement_date&amp;quot;: &amp;quot;2024-01-01&amp;quot;, &amp;quot;territorial_extent&amp;quot;: &amp;quot;england_and_wales&amp;quot;}&amp;lt;/code&amp;gt; for UK effects, &amp;lt;code&amp;gt;{&amp;quot;reservation&amp;quot;: &amp;quot;sous reserve de...&amp;quot;}&amp;lt;/code&amp;gt; for conditional constitutionality decisions.&lt;br /&gt;
&lt;br /&gt;
=== Quarantine model ===&lt;br /&gt;
&lt;br /&gt;
Edges with unresolved targets are stored with &amp;lt;code&amp;gt;target_id = NULL&amp;lt;/code&amp;gt; and the raw citation in &amp;lt;code&amp;gt;reference&amp;lt;/code&amp;gt;. They are visible to MCP and queryable for monitoring. The partial unique index &amp;lt;code&amp;gt;idx_edge_unique_without_target ON (source_id, reference, kind) WHERE target_id IS NULL&amp;lt;/code&amp;gt; prevents duplicate quarantine entries.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Ownership:&amp;#039;&amp;#039;&amp;#039; an edge is owned by the jurisdiction prefix of its &amp;lt;code&amp;gt;source_id&amp;lt;/code&amp;gt;. Drop-and-reingest for jurisdiction X deletes edges WHERE &amp;lt;code&amp;gt;source_id LIKE &amp;#039;{x}.%&amp;#039;&amp;lt;/code&amp;gt;. Cross-jurisdiction quarantine edges (e.g., EU NIM → FR target) survive the target jurisdiction&amp;#039;s reingest.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Reconciliation:&amp;#039;&amp;#039;&amp;#039; after each jurisdiction ingest, a post-ingest pass attempts to resolve quarantine edges whose reference matches the freshly ingested jurisdiction. Resolved → &amp;lt;code&amp;gt;UPDATE target_id&amp;lt;/code&amp;gt;. Still unresolved → remains in quarantine.&lt;br /&gt;
&lt;br /&gt;
=== Why no foreign keys ===&lt;br /&gt;
&lt;br /&gt;
At 50M+ edges, FK checks on every insert become a measurable bottleneck. Sources are ingested independently and in parallel -- a French decision citing a EU directive may be ingested before the directive itself. FKs would either block ingestion order or require two-pass inserts.&lt;br /&gt;
&lt;br /&gt;
Orphan cleanup runs periodically (post-ingest batch job), not on every write.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Schema: corpus.tag_stats ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE TABLE corpus.tag_stats (&lt;br /&gt;
    kind          text NOT NULL,&lt;br /&gt;
    jurisdiction  text NOT NULL,&lt;br /&gt;
    tag_key       text NOT NULL,&lt;br /&gt;
    tag_value     text NOT NULL,&lt;br /&gt;
    count         integer DEFAULT 0,&lt;br /&gt;
    PRIMARY KEY (kind, jurisdiction, tag_key, tag_value)&lt;br /&gt;
);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Refreshed post-ingest, incrementally per source. Provides instant faceted counts without scanning &amp;lt;code&amp;gt;documents&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Must handle both scalar and array tag values. For scalar tags, count directly. For array tags (e.g., &amp;lt;code&amp;gt;tags.themes = [&amp;quot;droit civil&amp;quot;, &amp;quot;responsabilite&amp;quot;]&amp;lt;/code&amp;gt;), use &amp;lt;code&amp;gt;jsonb_array_elements_text&amp;lt;/code&amp;gt; to explode the array and count each element individually.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Kind taxonomy ==&lt;br /&gt;
&lt;br /&gt;
6 structural values. These classify document shape, not legal category.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Kind !! What it is !! What it is NOT !! Examples&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Creates, interprets, or defines a rule. Normative or quasi-normative content. || Not a resolution of a specific dispute. || Statute, regulation, decree, circular, directive, fatwa, ministerial order, regulatory handbook&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Resolves a specific dispute or case. Issued by a court or quasi-judicial body. || Not a general rule, even if it sets precedent. || Judgment, ruling, order, arbitral award, AG opinion&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;record&amp;lt;/code&amp;gt; || Factual registration in an official registry. || Not normative, not adjudicative. || Company filing, trademark registration, patent, land title, birth certificate&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;notice&amp;lt;/code&amp;gt; || Official announcement or publication. Time-bound, informational. || Not normative long-term. || Gazette entry, public consultation, tender, regulatory announcement&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;section&amp;lt;/code&amp;gt; || Structural subdivision of another document. Has a &amp;lt;code&amp;gt;parent_id&amp;lt;/code&amp;gt;. || Not a standalone document. || Part, title, chapter, article, schedule, annex&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;chunk&amp;lt;/code&amp;gt; || Processing artifact. A fragment of a document created for search or embedding. || Not an editorial unit. || Paragraph-level split for RAG, embedding chunk&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Disambiguation rule ===&lt;br /&gt;
&lt;br /&gt;
When a document is ambiguous:&lt;br /&gt;
&lt;br /&gt;
# Does it resolve a specific dispute between parties? -&amp;gt; &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt;&lt;br /&gt;
# Does it create, interpret, or define a general rule? -&amp;gt; &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt;&lt;br /&gt;
# Still ambiguous? -&amp;gt; &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The legal category goes in &amp;lt;code&amp;gt;tags.type&amp;lt;/code&amp;gt; (e.g., &amp;lt;code&amp;gt;&amp;quot;type&amp;quot;: &amp;quot;sumula_vinculante&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;type&amp;quot;: &amp;quot;tesis_jurisprudencial&amp;quot;&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
=== Stress-test: ambiguous documents across jurisdictions ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Document !! Jurisdiction !! Kind !! Rationale&lt;br /&gt;
|-&lt;br /&gt;
| Sumula vinculante || BR || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Binding precedent abstract, creates a general rule. Not resolving a specific dispute.&lt;br /&gt;
|-&lt;br /&gt;
| Tesis jurisprudencial || MX || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Abstracted legal thesis from repeated rulings. General rule, not a specific dispute.&lt;br /&gt;
|-&lt;br /&gt;
| Judicial interpretation (司法解释) || CN || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Supreme People&amp;#039;s Court quasi-legislation. Creates binding rules of general application.&lt;br /&gt;
|-&lt;br /&gt;
| Dictamen CGR || CL || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Comptroller General opinion. Interprets administrative law generally.&lt;br /&gt;
|-&lt;br /&gt;
| AG Opinion (CJEU) || EU || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Issued in the context of a specific case, advises on its resolution.&lt;br /&gt;
|-&lt;br /&gt;
| FCA Handbook || GB || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Financial Conduct Authority regulatory rules. Normative, not case-specific.&lt;br /&gt;
|-&lt;br /&gt;
| Fatwa || SA || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Religious-legal ruling of general application. Interprets Islamic law.&lt;br /&gt;
|-&lt;br /&gt;
| Administrative guidance (通達 tsutatsu) || JP || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Ministry directive to subordinate agencies. General rule, not dispute resolution.&lt;br /&gt;
|-&lt;br /&gt;
| Restatement section || US || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Scholarly synthesis of common law rules. General rule formulation.&lt;br /&gt;
|-&lt;br /&gt;
| Grundsatzentscheidung || DE || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Federal Court principle decision. Resolves a specific case, even if it sets precedent.&lt;br /&gt;
|-&lt;br /&gt;
| Arret de principe || FR || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Cour de cassation principle ruling. Resolves a specific dispute.&lt;br /&gt;
|-&lt;br /&gt;
| Sentenza interpretativa di rigetto || IT || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Constitutional Court interpretive rejection. Resolves a specific constitutional question.&lt;br /&gt;
|-&lt;br /&gt;
| Advisory Opinion (ICJ) || INT || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Issued on a specific question referred by a UN organ.&lt;br /&gt;
|-&lt;br /&gt;
| Pleno no jurisdiccional || ES || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Supreme Court plenary agreement. Resolves doctrinal conflict arising from cases.&lt;br /&gt;
|-&lt;br /&gt;
| Bepaling van algemene strekking || NL || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || General binding regulation. Normative, not case resolution.&lt;br /&gt;
|-&lt;br /&gt;
| Constitutional Court communique || TR || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Press release summarizing a decision. Informational/normative summary.&lt;br /&gt;
|-&lt;br /&gt;
| Practice Direction || GB || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Court-issued procedural rules. General application.&lt;br /&gt;
|-&lt;br /&gt;
| Reglement grand-ducal || LU || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Grand-ducal regulation. Normative.&lt;br /&gt;
|-&lt;br /&gt;
| Acordao vinculante STF || BR || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Binding Supreme Court judgment. Resolves a specific case.&lt;br /&gt;
|-&lt;br /&gt;
| Royal Decree (مرسوم ملكي) || SA || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Royal legislative act. Creates general rules.&lt;br /&gt;
|-&lt;br /&gt;
| Verfassungsbeschwerde decision || DE || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Constitutional complaint resolution. Specific dispute.&lt;br /&gt;
|-&lt;br /&gt;
| Recurso extraordinario || BR || &amp;lt;code&amp;gt;decision&amp;lt;/code&amp;gt; || Extraordinary appeal. Resolves a specific case.&lt;br /&gt;
|-&lt;br /&gt;
| EU Regulation || EU || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Directly applicable legislation. General normative act.&lt;br /&gt;
|-&lt;br /&gt;
| OHADA Uniform Act || INT || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Harmonized commercial law. General normative act.&lt;br /&gt;
|-&lt;br /&gt;
| Tax ruling (rescrit) || FR || &amp;lt;code&amp;gt;legislation&amp;lt;/code&amp;gt; || Tax authority interpretation. General application (even if triggered by a specific request).&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Benchmark results ==&lt;br /&gt;
&lt;br /&gt;
Measured on 3.4M rows, PostgreSQL 16, single-node, 16 GB RAM.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Query !! Corpus schema !! Previous 16-table schema !! Speedup&lt;br /&gt;
|-&lt;br /&gt;
| Tag filter + date range || 0.9 ms || — || B-tree scan, instant&lt;br /&gt;
|-&lt;br /&gt;
| FTS + tag filter (broad, 111K results) || 4.4 s || 31 s || 7x faster (BitmapAnd vs Seq Scan)&lt;br /&gt;
|-&lt;br /&gt;
| Tag-only + date || 1.4 ms || — || B-tree scan, instant&lt;br /&gt;
|-&lt;br /&gt;
| FTS narrow (17K results, warm) || 565 ms || — || Partial GIN index&lt;br /&gt;
|-&lt;br /&gt;
| Get by ID || 0.3 ms || 0.3 ms || PK lookup&lt;br /&gt;
|-&lt;br /&gt;
| Tag discovery (GROUP BY court, 3.4M rows) || 2.7 s || — || Seq scan — justifies tag_stats&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The FTS speedup comes from BitmapAnd of GIN (FTS) + GIN (tags) indexes. The previous schema used a Parallel Seq Scan because the planner could not combine B-tree (court column) with GIN (FTS) efficiently.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Operational notes ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;No FK on edges or parent_id.&amp;#039;&amp;#039;&amp;#039; Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Partial GIN indexes per kind.&amp;#039;&amp;#039;&amp;#039; See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;WHERE IS DISTINCT FROM in upserts.&amp;#039;&amp;#039;&amp;#039; Use &amp;lt;code&amp;gt;IS DISTINCT FROM&amp;lt;/code&amp;gt; instead of &amp;lt;code&amp;gt;!=&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;ON CONFLICT ... DO UPDATE SET ... WHERE&amp;lt;/code&amp;gt; clauses. &amp;lt;code&amp;gt;!=&amp;lt;/code&amp;gt; returns NULL when comparing with NULL (and the row is not updated), while &amp;lt;code&amp;gt;IS DISTINCT FROM&amp;lt;/code&amp;gt; treats NULL as a comparable value. Critical for columns like &amp;lt;code&amp;gt;date_end&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;body_search&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; that are frequently NULL.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Autovacuum tuning.&amp;#039;&amp;#039;&amp;#039; Set &amp;lt;code&amp;gt;autovacuum_vacuum_scale_factor = 0.01&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt;. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Estimated counts at scale.&amp;#039;&amp;#039;&amp;#039; &amp;lt;code&amp;gt;COUNT(*) OVER()&amp;lt;/code&amp;gt; becomes expensive above 100M rows. Use &amp;lt;code&amp;gt;EXPLAIN&amp;lt;/code&amp;gt;-based row estimates or &amp;lt;code&amp;gt;pg_class.reltuples&amp;lt;/code&amp;gt; for approximate counts in paginated APIs. Exact counts only when explicitly requested.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;CJK full-text search.&amp;#039;&amp;#039;&amp;#039; PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. &amp;lt;code&amp;gt;content_fts&amp;lt;/code&amp;gt; is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Partitioning at CN scale.&amp;#039;&amp;#039;&amp;#039; At 130M+ rows (realistic for a full Chinese legal corpus), partition &amp;lt;code&amp;gt;documents&amp;lt;/code&amp;gt; by &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt;. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== corpus.source_metadata ==&lt;br /&gt;
&lt;br /&gt;
Tracks data sources, their static descriptive metadata, and dynamic statistics recomputed at the end of each ingest cycle.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE TABLE corpus.source_metadata (&lt;br /&gt;
    source_key             text PRIMARY KEY,&lt;br /&gt;
    parent_source_key      text,&lt;br /&gt;
    jurisdiction           text NOT NULL,&lt;br /&gt;
    name                   text NOT NULL,&lt;br /&gt;
    description            text NOT NULL,&lt;br /&gt;
    kind                   text NOT NULL DEFAULT &amp;#039;legislation&amp;#039;,&lt;br /&gt;
    publisher              text NOT NULL,&lt;br /&gt;
    publisher_url          text NOT NULL,&lt;br /&gt;
    license                text NOT NULL,&lt;br /&gt;
    license_url            text NOT NULL,&lt;br /&gt;
    language               text NOT NULL DEFAULT &amp;#039;fr&amp;#039;,&lt;br /&gt;
    coverage_start_year    integer,&lt;br /&gt;
    document_count         integer,&lt;br /&gt;
    freshest_document_date date,&lt;br /&gt;
    last_file              text,&lt;br /&gt;
    source_updated         date,&lt;br /&gt;
    ingested_at            timestamptz&lt;br /&gt;
);&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Used by the MCP server (guidelines, data freshness display) and the portal. Sub-sources (e.g., &amp;lt;code&amp;gt;bodacc_pcl&amp;lt;/code&amp;gt;) have &amp;lt;code&amp;gt;parent_source_key&amp;lt;/code&amp;gt; set.&lt;br /&gt;
&lt;br /&gt;
=== Static vs dynamic columns ===&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Static&amp;#039;&amp;#039;&amp;#039; (written by &amp;lt;code&amp;gt;sync_source_registry()&amp;lt;/code&amp;gt; at init from &amp;lt;code&amp;gt;SOURCE_REGISTRY&amp;lt;/code&amp;gt;): &amp;lt;code&amp;gt;source_key&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;parent_source_key&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;jurisdiction&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;name&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;description&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;kind&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;publisher&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;publisher_url&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;license&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;license_url&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt;.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Dynamic&amp;#039;&amp;#039;&amp;#039; (written by &amp;lt;code&amp;gt;refresh_corpus_stats()&amp;lt;/code&amp;gt; at the end of each ingest cycle, including the dedup steps inside &amp;lt;code&amp;gt;run_post_ingest()&amp;lt;/code&amp;gt;): &amp;lt;code&amp;gt;coverage_start_year&amp;lt;/code&amp;gt; (= &amp;lt;code&amp;gt;MIN(date)&amp;lt;/code&amp;gt; per source), &amp;lt;code&amp;gt;document_count&amp;lt;/code&amp;gt; (= &amp;lt;code&amp;gt;COUNT(*)&amp;lt;/code&amp;gt; per source), &amp;lt;code&amp;gt;freshest_document_date&amp;lt;/code&amp;gt; (= &amp;lt;code&amp;gt;MAX(date)&amp;lt;/code&amp;gt; per source). Top-level parent rows that have sub-sources are populated by a SUM/MIN/MAX rollup over their children.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Operational&amp;#039;&amp;#039;&amp;#039; (written by &amp;lt;code&amp;gt;state.set_ingest_state()&amp;lt;/code&amp;gt; per ingest run): &amp;lt;code&amp;gt;last_file&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;source_updated&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ingested_at&amp;lt;/code&amp;gt;. These track pipeline state, not content state, and are not displayed in MCP responses.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sync_source_registry()&amp;lt;/code&amp;gt; MUST NOT touch dynamic columns — they would be reset to NULL on every init. The INSERT and &amp;lt;code&amp;gt;ON CONFLICT DO UPDATE SET&amp;lt;/code&amp;gt; clauses only list static columns.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Required PostgreSQL extensions ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;sql&amp;quot;&amp;gt;&lt;br /&gt;
CREATE EXTENSION IF NOT EXISTS unaccent;    -- accent-insensitive FTS&lt;br /&gt;
-- Optional (for CJK jurisdictions):&lt;br /&gt;
-- CREATE EXTENSION IF NOT EXISTS pgroonga;  -- CJK full-text search&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Complete DDL execution order ==&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE SCHEMA corpus;&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE EXTENSION IF NOT EXISTS unaccent;&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE TABLE corpus.documents (...);&amp;lt;/code&amp;gt; (14 columns)&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE TABLE corpus.edges (...);&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE TABLE corpus.tag_stats (...);&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE TABLE corpus.source_metadata (...);&amp;lt;/code&amp;gt;&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE FUNCTION corpus.normalize_for_fts(...);&amp;lt;/code&amp;gt; (see [[Corpus/FTS|FTS]])&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE FUNCTION corpus.update_content_fts(...);&amp;lt;/code&amp;gt; (see [[Corpus/FTS|FTS]])&lt;br /&gt;
# &amp;lt;code&amp;gt;CREATE TRIGGER trg_content_fts ...;&amp;lt;/code&amp;gt;&lt;br /&gt;
# All indexes (B-tree, GIN, partial GIN)&lt;br /&gt;
# &amp;lt;code&amp;gt;ANALYZE corpus.documents;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the graph schema (future): &amp;lt;code&amp;gt;CREATE SCHEMA graph;&amp;lt;/code&amp;gt; + tables from [[Corpus/Graph|GRAPH]].&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== ID format convention ==&lt;br /&gt;
&lt;br /&gt;
Document IDs use source-native identifiers, always &amp;#039;&amp;#039;&amp;#039;lowercase&amp;#039;&amp;#039;&amp;#039;. Rules:&lt;br /&gt;
&lt;br /&gt;
* All document IDs carry a lowercase jurisdiction prefix: &amp;lt;code&amp;gt;{jurisdiction}.{source_id}&amp;lt;/code&amp;gt; (e.g., &amp;lt;code&amp;gt;fr.legiarti000006902764&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;fr.juritext000041701651&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;fr.cetatext000046783006&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;eu.eurlextext32016r0679&amp;lt;/code&amp;gt;).&lt;br /&gt;
* Identifiers with built-in jurisdiction context (ECLI:*, CELEX:*, [year] UKSC) may omit the prefix.&lt;br /&gt;
* For non-legal registries where the source ID could collide, use &amp;lt;code&amp;gt;{jurisdiction}.{source}.&amp;lt;/code&amp;gt; (e.g., &amp;lt;code&amp;gt;fr.company.443061841&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;gb.company.12345678&amp;lt;/code&amp;gt;).&lt;br /&gt;
* Decision: the ingest plugin author determines the prefix format, documented in the plugin&amp;#039;s tag schema.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Bidirectional edges convention ==&lt;br /&gt;
&lt;br /&gt;
Some edge types are logically bidirectional (&amp;lt;code&amp;gt;language_variant&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;corresponds_to&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;associated_text&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;joined_with&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;pilot_follower&amp;lt;/code&amp;gt;). The convention:&lt;br /&gt;
&lt;br /&gt;
* Store ONE edge, not two.&lt;br /&gt;
* Pick a canonical direction (alphabetical by source_id, or the &amp;quot;primary&amp;quot; document in the pair).&lt;br /&gt;
* Queries for bidirectional edges check both directions: &amp;lt;code&amp;gt;WHERE (source_id = X OR target_id = X) AND kind = &amp;#039;language_variant&amp;#039;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Jurisdiction subdivision mapping in FTS trigger ==&lt;br /&gt;
&lt;br /&gt;
The trigger resolves language from &amp;lt;code&amp;gt;coalesce(language, jurisdiction)&amp;lt;/code&amp;gt;. For subdivision jurisdictions like &amp;lt;code&amp;gt;gb-sct&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;es-ct&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;de-by&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* If &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; is set (recommended for subdivisions), it is used directly.&lt;br /&gt;
* If &amp;lt;code&amp;gt;language&amp;lt;/code&amp;gt; is NULL, the trigger extracts the country prefix: &amp;lt;code&amp;gt;split_part(jurisdiction, &amp;#039;-&amp;#039;, 1)&amp;lt;/code&amp;gt; → &amp;lt;code&amp;gt;gb&amp;lt;/code&amp;gt;. Since &amp;lt;code&amp;gt;gb&amp;lt;/code&amp;gt; is not a language code in the mapping (the mapping uses ISO 639-1 codes like &amp;lt;code&amp;gt;en&amp;lt;/code&amp;gt;), it falls through to &amp;lt;code&amp;gt;simple&amp;lt;/code&amp;gt;. Set &amp;lt;code&amp;gt;language = &amp;#039;en&amp;#039;&amp;lt;/code&amp;gt; explicitly for English-language jurisdictions like &amp;lt;code&amp;gt;gb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;gb-sct&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;us&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This extraction is in the FTS trigger code (see [[Corpus/FTS|FTS]]).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Migration strategy ==&lt;br /&gt;
&lt;br /&gt;
Migration from the current 16-table schema is incremental and reversible:&lt;br /&gt;
&lt;br /&gt;
# Create &amp;lt;code&amp;gt;corpus.*&amp;lt;/code&amp;gt; tables alongside existing tables (coexistence).&lt;br /&gt;
# Migrate one kind at a time with SQL INSERT...SELECT scripts (decision first — largest and most exercised).&lt;br /&gt;
# Benchmark each migrated kind against the equivalent old-table queries.&lt;br /&gt;
# Adapt the MCP server to read from &amp;lt;code&amp;gt;corpus.documents&amp;lt;/code&amp;gt; (dual-read during transition).&lt;br /&gt;
# Switch MCP tools one by one (search first, then get, then browse).&lt;br /&gt;
# When all tools read from &amp;lt;code&amp;gt;corpus.*&amp;lt;/code&amp;gt;, drop old tables.&lt;br /&gt;
# Migrate &amp;lt;code&amp;gt;citations&amp;lt;/code&amp;gt; → &amp;lt;code&amp;gt;corpus.edges&amp;lt;/code&amp;gt; (same process).&lt;br /&gt;
# Refresh &amp;lt;code&amp;gt;corpus.tag_stats&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Rollback: during coexistence, the old tables are still the source of truth. If migration fails, revert MCP to old tables.&lt;br /&gt;
&lt;br /&gt;
[[Category:Corpus]]&lt;/div&gt;</summary>
		<author><name>Nicolas</name></author>
	</entry>
</feed>