Editing Corpus/Schema

= Corpus Schema =

Technical specification for the Dura Lex unified corpus schema. Reference for developers and AI agents.

== Overview ==

One <code>documents</code> table with promoted columns and a JSONB <code>tags</code> column, inspired by OpenStreetMap's model: a small fixed schema handles structure, while a flexible key-value store handles everything jurisdiction-specific.

Validated across 25 jurisdictions on 5 continents. Benchmark on 3.4M rows: 7x faster than the previous 16-table normalized schema.

Four tables:

{| class="wikitable"
! Table !! Purpose
|-
| <code>corpus.documents</code> || All legal content: legislation, case law, regulatory filings, administrative notices
|-
| <code>corpus.edges</code> || Relationships between documents: citations, amendments, hierarchy, consolidation
|-
| <code>corpus.tag_stats</code> || Aggregated tag counts per kind/jurisdiction, refreshed post-ingest
|-
| <code>corpus.source_metadata</code> || Data source registry: publishers, licenses, ingestion state, freshness
|}

----

== Schema: corpus.documents ==

14 columns.

<syntaxhighlight lang="sql">
CREATE SCHEMA IF NOT EXISTS corpus;

CREATE TABLE corpus.documents (
    id            text PRIMARY KEY,
    kind          text NOT NULL,
    jurisdiction  text NOT NULL,
    language      text,
    source        text NOT NULL,
    date          date,
    date_end      date,
    parent_id     text,
    title         text,
    body          text,
    body_search   text,
    tags          jsonb NOT NULL DEFAULT '{}',
    content_fts   tsvector,
    ingested_at   timestamptz DEFAULT now()
);
</syntaxhighlight>

=== Column definitions ===

'''id''' -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always '''lowercase''': <code>{jurisdiction}.{source_id}</code> (e.g., <code>fr.legiarti000006902764</code>, <code>eu.eurlextext32016r0679</code>). Identifiers with built-in jurisdiction context keep their original form (<code>ECLI:FR:CCASS:2024:CR00123</code>, <code>[2024] UKSC 1</code>). For non-legal registries where the source ID could collide, prefix with <code>{jurisdiction}.{source}.</code> (e.g., <code>fr.company.443061841</code>, <code>gb.company.12345678</code>). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md).

'''kind''' -- Structural classification. Exactly 6 values (see Kind Taxonomy below): <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code>. This column answers "what shape is this document?" not "what legal category is it?" -- legal category belongs in <code>tags.type</code>.

'''jurisdiction''' -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (<code>fr</code>, <code>gb</code>, <code>de</code>, <code>br</code>, <code>jp</code>). ISO 3166-2 for subdivisions (<code>us-ca</code>, <code>gb-sct</code>, <code>es-ct</code>, <code>de-by</code>). Extensions for supranational areas: <code>eu</code> (European law — EU institutions and Council of Europe), <code>int</code> (international treaties). A search for a jurisdiction includes its subdivisions: searching <code>us</code> includes <code>us-ca</code>. No default value -- every document must declare its jurisdiction explicitly.

'''language''' -- ISO 639-1 (<code>fr</code>, <code>en</code>, <code>de</code>, <code>ar</code>, <code>zh</code>, <code>ja</code>, <code>ko</code>). Nullable. When NULL, the FTS trigger infers language from <code>jurisdiction</code>. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (<code>be</code>) but different languages.

'''source''' -- Data source identifier. Examples: <code>legi</code>, <code>cass</code>, <code>hmrc_manuals</code>, <code>cendoj</code>, <code>boe</code>, <code>legifrance_jorf</code>, <code>knesset</code>. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction.

'''date''' -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: <code>decision_date</code> for decisions, <code>valid_from</code> for legislation, <code>publication_date</code> for notices, <code>creation_date</code> for records. The semantic meaning is implicit from <code>kind</code> -- not stored separately.

'''date_end''' -- End date. <code>valid_until</code> for legislation, expiration for notices, closure for records. NULL for most decisions (they don't expire). NULL does not mean "still in force" -- it means the source did not provide an end date.

'''parent_id''' -- Hierarchical link to another document ID. Used for structure trees (code -> article), sub-documents (decision -> chunk), and entity hierarchies (company -> director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint.

'''title''' -- Display title. Free text. May be NULL for chunks and sections that inherit their parent's title.

'''body''' -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, <code>body</code> is a clean stub like <code>&lt;p&gt;Source: &lt;a href="..."&gt;PDF officiel&lt;/a&gt;&lt;/p&gt;</code> and the extracted text goes to <code>body_search</code>. <code>body</code> is what <code>get_document</code> returns.

'''body_search''' -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly — <code>get_document</code> never returns it. Nullable. When NULL, the FTS trigger falls back to <code>body</code> via <code>coalesce(body_search, body)</code>. Used for: (a) PDF-sourced documents where <code>body</code> is a clean stub and <code>body_search</code> holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading <code>tags.summary</code> and <code>tags.headnote_classification</code> directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See [[Corpus/Quality|QUALITY]] for <code>tags.content_quality</code> levels.

'''tags''' -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: <code>{"type": "loi", "nid": "JORFTEXT000041865244", "nature": "LOI"}</code> for French legislation, <code>{"chamber": "2e_civ", "solution": "cassation"}</code> for French cassation decisions, <code>{"sic_code": "6411", "status": "active"}</code> for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via <code>@&gt;</code>, <code>?</code>, and <code>jsonb_path_ops</code> GIN index.

'''content_fts''' -- <code>tsvector</code> column, populated by a trigger with three weights: A=title, B=<code>tags.summary</code>+<code>tags.headnote_classification</code> (read from JSONB), C=<code>coalesce(body_search, body)</code> with HTML tags stripped. Uses the appropriate language configuration (derived from <code>language</code> or <code>jurisdiction</code>). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See [[Corpus/FTS|FTS]] for full trigger details.

'''ingested_at''' -- Timestamp of first ingestion. <code>DEFAULT now()</code>. Not updated on re-ingestion -- use <code>tags</code> for versioning metadata if needed.

----

== Indexes ==

<syntaxhighlight lang="sql">
-- B-tree indexes on promoted columns
CREATE INDEX idx_doc_kind          ON corpus.documents (kind);
CREATE INDEX idx_doc_jurisdiction   ON corpus.documents (jurisdiction);
CREATE INDEX idx_doc_source        ON corpus.documents (source);
CREATE INDEX idx_doc_date          ON corpus.documents (date);
CREATE INDEX idx_doc_date_end      ON corpus.documents (date_end)   WHERE date_end IS NOT NULL;
CREATE INDEX idx_doc_parent        ON corpus.documents (parent_id)  WHERE parent_id IS NOT NULL;
CREATE INDEX idx_doc_kind_juris    ON corpus.documents (kind, jurisdiction);
CREATE INDEX idx_doc_kind_source   ON corpus.documents (kind, source);

-- GIN index on tags
CREATE INDEX idx_doc_tags          ON corpus.documents USING GIN (tags jsonb_path_ops);

-- Partial GIN indexes per kind (FTS isolation)
CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation';
CREATE INDEX idx_doc_fts_decision  ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision';
CREATE INDEX idx_doc_fts_record    ON corpus.documents USING GIN (content_fts) WHERE kind = 'record';
CREATE INDEX idx_doc_fts_notice    ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice';
</syntaxhighlight>

=== Why partial GIN indexes per kind instead of table partitioning ===

Partitioning <code>documents</code> by <code>kind</code> would require <code>kind</code> in the primary key (PostgreSQL requires the partition key in the PK). That means <code>id</code> alone can no longer be a PK -- every FK, every edge reference, every external link would need to carry <code>(id, kind)</code> instead of just <code>id</code>. The complexity cascades through the entire system.

Partial GIN indexes achieve 80% of the performance benefit with 0% of the complexity. When a query filters <code>WHERE kind = 'legislation' AND content_fts @@ ...</code>, PostgreSQL uses the partial index <code>idx_doc_fts_legislation</code> which contains only legislation tsvectors. The planner skips the other kinds entirely. Index size is smaller, scans are faster, and the PK stays a simple <code>text</code> column.

No partial index for <code>section</code> and <code>chunk</code> because they rarely have <code>body</code>/<code>body_search</code> content worth full-text indexing.

----

== Schema: corpus.edges ==

<syntaxhighlight lang="sql">
CREATE TABLE corpus.edges (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_id   text NOT NULL,
    target_id   text,
    kind        text NOT NULL,
    reference   text,
    properties  jsonb DEFAULT '{}',
    UNIQUE (source_id, target_id, kind)
);

CREATE INDEX idx_edge_source ON corpus.edges (source_id);
CREATE INDEX idx_edge_target ON corpus.edges (target_id) WHERE target_id IS NOT NULL;
CREATE INDEX idx_edge_kind   ON corpus.edges (kind);
</syntaxhighlight>

=== Column definitions ===

'''id''' -- Synthetic identity, <code>bigint GENERATED ALWAYS AS IDENTITY</code>. Edges have no natural key.

'''source_id''' -- The document that originates the relationship. Not a foreign key.

'''target_id''' -- The document that receives the relationship. Nullable: an unresolved reference (a citation to a document not yet ingested, or a reference to an external system) has <code>target_id = NULL</code> and <code>reference</code> containing the raw citation text. Once the target is ingested or resolved, <code>target_id</code> is backfilled.

'''kind''' -- Relationship type. Open vocabulary. Examples: <code>cites</code>, <code>amends</code>, <code>repeals</code>, <code>implements</code>, <code>transposes</code>, <code>overrules</code>, <code>affirms</code>, <code>commences</code>, <code>language_variant</code>. See [[Corpus/Edge types|EDGE-TYPES]] for the full taxonomy.

'''reference''' -- Raw citation text as found in the source document. Preserved even after resolution for traceability. Example: <code>"article L. 121-1 du code de la consommation"</code>.

'''properties''' -- JSONB metadata on the relationship. Standard provenance keys (see ADR: edge pipeline architecture, design-decisions/2026-04-22-edge-pipeline-architecture.md):

* <code>extraction</code>: <code>"publisher_provided"</code> (source has a structured link) or <code>"id_resolved"</code> (textual reference resolved by the reference resolver).
* <code>source_type</code>: identifies the source system (<code>legi_lien</code>, <code>cellar_sparql</code>, <code>hudoc_appno</code>, <code>cellar_nim</code>).
* <code>raw_link_type</code>: source-native link type preserved for traceability (e.g., <code>"TRANSPOSITION"</code>, <code>"cdm:resource_legal_amends_resource_legal"</code>).

Type-specific examples: <code>{"scope": "partial", "articles": ["1", "2", "3"]}</code> for partial amendments, <code>{"commencement_date": "2024-01-01", "territorial_extent": "england_and_wales"}</code> for UK effects, <code>{"reservation": "sous reserve de..."}</code> for conditional constitutionality decisions.

=== Quarantine model ===

Edges with unresolved targets are stored with <code>target_id = NULL</code> and the raw citation in <code>reference</code>. They are visible to MCP and queryable for monitoring. The partial unique index <code>idx_edge_unique_without_target ON (source_id, reference, kind) WHERE target_id IS NULL</code> prevents duplicate quarantine entries.

'''Ownership:''' an edge is owned by the jurisdiction prefix of its <code>source_id</code>. Drop-and-reingest for jurisdiction X deletes edges WHERE <code>source_id LIKE '{x}.%'</code>. Cross-jurisdiction quarantine edges (e.g., EU NIM → FR target) survive the target jurisdiction's reingest.

'''Reconciliation:''' after each jurisdiction ingest, a post-ingest pass attempts to resolve quarantine edges whose reference matches the freshly ingested jurisdiction. Resolved → <code>UPDATE target_id</code>. Still unresolved → remains in quarantine.

=== Why no foreign keys ===

At 50M+ edges, FK checks on every insert become a measurable bottleneck. Sources are ingested independently and in parallel -- a French decision citing a EU directive may be ingested before the directive itself. FKs would either block ingestion order or require two-pass inserts.

Orphan cleanup runs periodically (post-ingest batch job), not on every write.

----

== Schema: corpus.tag_stats ==

<syntaxhighlight lang="sql">
CREATE TABLE corpus.tag_stats (
    kind          text NOT NULL,
    jurisdiction  text NOT NULL,
    tag_key       text NOT NULL,
    tag_value     text NOT NULL,
    count         integer DEFAULT 0,
    PRIMARY KEY (kind, jurisdiction, tag_key, tag_value)
);
</syntaxhighlight>

Refreshed post-ingest, incrementally per source. Provides instant faceted counts without scanning <code>documents</code>.

Must handle both scalar and array tag values. For scalar tags, count directly. For array tags (e.g., <code>tags.themes = ["droit civil", "responsabilite"]</code>), use <code>jsonb_array_elements_text</code> to explode the array and count each element individually.

----

== Kind taxonomy ==

6 structural values. These classify document shape, not legal category.

{| class="wikitable"
! Kind !! What it is !! What it is NOT !! Examples
|-
| <code>legislation</code> || Creates, interprets, or defines a rule. Normative or quasi-normative content. || Not a resolution of a specific dispute. || Statute, regulation, decree, circular, directive, fatwa, ministerial order, regulatory handbook
|-
| <code>decision</code> || Resolves a specific dispute or case. Issued by a court or quasi-judicial body. || Not a general rule, even if it sets precedent. || Judgment, ruling, order, arbitral award, AG opinion
|-
| <code>record</code> || Factual registration in an official registry. || Not normative, not adjudicative. || Company filing, trademark registration, patent, land title, birth certificate
|-
| <code>notice</code> || Official announcement or publication. Time-bound, informational. || Not normative long-term. || Gazette entry, public consultation, tender, regulatory announcement
|-
| <code>section</code> || Structural subdivision of another document. Has a <code>parent_id</code>. || Not a standalone document. || Part, title, chapter, article, schedule, annex
|-
| <code>chunk</code> || Processing artifact. A fragment of a document created for search or embedding. || Not an editorial unit. || Paragraph-level split for RAG, embedding chunk
|}

=== Disambiguation rule ===

When a document is ambiguous:

# Does it resolve a specific dispute between parties? -> <code>decision</code>
# Does it create, interpret, or define a general rule? -> <code>legislation</code>
# Still ambiguous? -> <code>legislation</code>

The legal category goes in <code>tags.type</code> (e.g., <code>"type": "sumula_vinculante"</code>, <code>"type": "tesis_jurisprudencial"</code>).

=== Stress-test: ambiguous documents across jurisdictions ===

{| class="wikitable"
! Document !! Jurisdiction !! Kind !! Rationale
|-
| Sumula vinculante || BR || <code>legislation</code> || Binding precedent abstract, creates a general rule. Not resolving a specific dispute.
|-
| Tesis jurisprudencial || MX || <code>legislation</code> || Abstracted legal thesis from repeated rulings. General rule, not a specific dispute.
|-
| Judicial interpretation (司法解释) || CN || <code>legislation</code> || Supreme People's Court quasi-legislation. Creates binding rules of general application.
|-
| Dictamen CGR || CL || <code>legislation</code> || Comptroller General opinion. Interprets administrative law generally.
|-
| AG Opinion (CJEU) || EU || <code>decision</code> || Issued in the context of a specific case, advises on its resolution.
|-
| FCA Handbook || GB || <code>legislation</code> || Financial Conduct Authority regulatory rules. Normative, not case-specific.
|-
| Fatwa || SA || <code>legislation</code> || Religious-legal ruling of general application. Interprets Islamic law.
|-
| Administrative guidance (通達 tsutatsu) || JP || <code>legislation</code> || Ministry directive to subordinate agencies. General rule, not dispute resolution.
|-
| Restatement section || US || <code>legislation</code> || Scholarly synthesis of common law rules. General rule formulation.
|-
| Grundsatzentscheidung || DE || <code>decision</code> || Federal Court principle decision. Resolves a specific case, even if it sets precedent.
|-
| Arret de principe || FR || <code>decision</code> || Cour de cassation principle ruling. Resolves a specific dispute.
|-
| Sentenza interpretativa di rigetto || IT || <code>decision</code> || Constitutional Court interpretive rejection. Resolves a specific constitutional question.
|-
| Advisory Opinion (ICJ) || INT || <code>decision</code> || Issued on a specific question referred by a UN organ.
|-
| Pleno no jurisdiccional || ES || <code>decision</code> || Supreme Court plenary agreement. Resolves doctrinal conflict arising from cases.
|-
| Bepaling van algemene strekking || NL || <code>legislation</code> || General binding regulation. Normative, not case resolution.
|-
| Constitutional Court communique || TR || <code>legislation</code> || Press release summarizing a decision. Informational/normative summary.
|-
| Practice Direction || GB || <code>legislation</code> || Court-issued procedural rules. General application.
|-
| Reglement grand-ducal || LU || <code>legislation</code> || Grand-ducal regulation. Normative.
|-
| Acordao vinculante STF || BR || <code>decision</code> || Binding Supreme Court judgment. Resolves a specific case.
|-
| Royal Decree (مرسوم ملكي) || SA || <code>legislation</code> || Royal legislative act. Creates general rules.
|-
| Verfassungsbeschwerde decision || DE || <code>decision</code> || Constitutional complaint resolution. Specific dispute.
|-
| Recurso extraordinario || BR || <code>decision</code> || Extraordinary appeal. Resolves a specific case.
|-
| EU Regulation || EU || <code>legislation</code> || Directly applicable legislation. General normative act.
|-
| OHADA Uniform Act || INT || <code>legislation</code> || Harmonized commercial law. General normative act.
|-
| Tax ruling (rescrit) || FR || <code>legislation</code> || Tax authority interpretation. General application (even if triggered by a specific request).
|}

----

== Benchmark results ==

Measured on 3.4M rows, PostgreSQL 16, single-node, 16 GB RAM.

{| class="wikitable"
! Query !! Corpus schema !! Previous 16-table schema !! Speedup
|-
| Tag filter + date range || 0.9 ms || — || B-tree scan, instant
|-
| FTS + tag filter (broad, 111K results) || 4.4 s || 31 s || 7x faster (BitmapAnd vs Seq Scan)
|-
| Tag-only + date || 1.4 ms || — || B-tree scan, instant
|-
| FTS narrow (17K results, warm) || 565 ms || — || Partial GIN index
|-
| Get by ID || 0.3 ms || 0.3 ms || PK lookup
|-
| Tag discovery (GROUP BY court, 3.4M rows) || 2.7 s || — || Seq scan — justifies tag_stats
|}

The FTS speedup comes from BitmapAnd of GIN (FTS) + GIN (tags) indexes. The previous schema used a Parallel Seq Scan because the planner could not combine B-tree (court column) with GIN (FTS) efficiently.

----

== Operational notes ==

'''No FK on edges or parent_id.''' Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job.

'''Partial GIN indexes per kind.''' See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity.

'''WHERE IS DISTINCT FROM in upserts.''' Use <code>IS DISTINCT FROM</code> instead of <code>!=</code> in <code>ON CONFLICT ... DO UPDATE SET ... WHERE</code> clauses. <code>!=</code> returns NULL when comparing with NULL (and the row is not updated), while <code>IS DISTINCT FROM</code> treats NULL as a comparable value. Critical for columns like <code>date_end</code>, <code>body_search</code>, <code>language</code> that are frequently NULL.

'''Autovacuum tuning.''' Set <code>autovacuum_vacuum_scale_factor = 0.01</code> on <code>corpus.documents</code>. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control.

'''Estimated counts at scale.''' <code>COUNT(*) OVER()</code> becomes expensive above 100M rows. Use <code>EXPLAIN</code>-based row estimates or <code>pg_class.reltuples</code> for approximate counts in paginated APIs. Exact counts only when explicitly requested.

'''CJK full-text search.''' PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. <code>content_fts</code> is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path.

'''Partitioning at CN scale.''' At 130M+ rows (realistic for a full Chinese legal corpus), partition <code>documents</code> by <code>jurisdiction</code>. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries.

----

== corpus.source_metadata ==

Tracks data sources, their static descriptive metadata, and dynamic statistics recomputed at the end of each ingest cycle.

<syntaxhighlight lang="sql">
CREATE TABLE corpus.source_metadata (
    source_key             text PRIMARY KEY,
    parent_source_key      text,
    jurisdiction           text NOT NULL,
    name                   text NOT NULL,
    description            text NOT NULL,
    kind                   text NOT NULL DEFAULT 'legislation',
    publisher              text NOT NULL,
    publisher_url          text NOT NULL,
    license                text NOT NULL,
    license_url            text NOT NULL,
    language               text NOT NULL DEFAULT 'fr',
    coverage_start_year    integer,
    document_count         integer,
    freshest_document_date date,
    last_file              text,
    source_updated         date,
    ingested_at            timestamptz
);
</syntaxhighlight>

Used by the MCP server (guidelines, data freshness display) and the portal. Sub-sources (e.g., <code>bodacc_pcl</code>) have <code>parent_source_key</code> set.

=== Static vs dynamic columns ===

* '''Static''' (written by <code>sync_source_registry()</code> at init from <code>SOURCE_REGISTRY</code>): <code>source_key</code>, <code>parent_source_key</code>, <code>jurisdiction</code>, <code>name</code>, <code>description</code>, <code>kind</code>, <code>publisher</code>, <code>publisher_url</code>, <code>license</code>, <code>license_url</code>, <code>language</code>.
* '''Dynamic''' (written by <code>refresh_corpus_stats()</code> at the end of each ingest cycle, including the dedup steps inside <code>run_post_ingest()</code>): <code>coverage_start_year</code> (= <code>MIN(date)</code> per source), <code>document_count</code> (= <code>COUNT(*)</code> per source), <code>freshest_document_date</code> (= <code>MAX(date)</code> per source). Top-level parent rows that have sub-sources are populated by a SUM/MIN/MAX rollup over their children.
* '''Operational''' (written by <code>state.set_ingest_state()</code> per ingest run): <code>last_file</code>, <code>source_updated</code>, <code>ingested_at</code>. These track pipeline state, not content state, and are not displayed in MCP responses.

<code>sync_source_registry()</code> MUST NOT touch dynamic columns — they would be reset to NULL on every init. The INSERT and <code>ON CONFLICT DO UPDATE SET</code> clauses only list static columns.

----

== Required PostgreSQL extensions ==

<syntaxhighlight lang="sql">
CREATE EXTENSION IF NOT EXISTS unaccent;    -- accent-insensitive FTS
-- Optional (for CJK jurisdictions):
-- CREATE EXTENSION IF NOT EXISTS pgroonga;  -- CJK full-text search
</syntaxhighlight>

----

== Complete DDL execution order ==

# <code>CREATE SCHEMA corpus;</code>
# <code>CREATE EXTENSION IF NOT EXISTS unaccent;</code>
# <code>CREATE TABLE corpus.documents (...);</code> (14 columns)
# <code>CREATE TABLE corpus.edges (...);</code>
# <code>CREATE TABLE corpus.tag_stats (...);</code>
# <code>CREATE TABLE corpus.source_metadata (...);</code>
# <code>CREATE FUNCTION corpus.normalize_for_fts(...);</code> (see [[Corpus/FTS|FTS]])
# <code>CREATE FUNCTION corpus.update_content_fts(...);</code> (see [[Corpus/FTS|FTS]])
# <code>CREATE TRIGGER trg_content_fts ...;</code>
# All indexes (B-tree, GIN, partial GIN)
# <code>ANALYZE corpus.documents;</code>

For the graph schema (future): <code>CREATE SCHEMA graph;</code> + tables from [[Corpus/Graph|GRAPH]].

----

== ID format convention ==

Document IDs use source-native identifiers, always '''lowercase'''. Rules:

* All document IDs carry a lowercase jurisdiction prefix: <code>{jurisdiction}.{source_id}</code> (e.g., <code>fr.legiarti000006902764</code>, <code>fr.juritext000041701651</code>, <code>fr.cetatext000046783006</code>, <code>eu.eurlextext32016r0679</code>).
* Identifiers with built-in jurisdiction context (ECLI:*, CELEX:*, [year] UKSC) may omit the prefix.
* For non-legal registries where the source ID could collide, use <code>{jurisdiction}.{source}.</code> (e.g., <code>fr.company.443061841</code>, <code>gb.company.12345678</code>).
* Decision: the ingest plugin author determines the prefix format, documented in the plugin's tag schema.

----

== Bidirectional edges convention ==

Some edge types are logically bidirectional (<code>language_variant</code>, <code>corresponds_to</code>, <code>associated_text</code>, <code>joined_with</code>, <code>pilot_follower</code>). The convention:

* Store ONE edge, not two.
* Pick a canonical direction (alphabetical by source_id, or the "primary" document in the pair).
* Queries for bidirectional edges check both directions: <code>WHERE (source_id = X OR target_id = X) AND kind = 'language_variant'</code>.

----

== Jurisdiction subdivision mapping in FTS trigger ==

The trigger resolves language from <code>coalesce(language, jurisdiction)</code>. For subdivision jurisdictions like <code>gb-sct</code>, <code>es-ct</code>, <code>de-by</code>:

* If <code>language</code> is set (recommended for subdivisions), it is used directly.
* If <code>language</code> is NULL, the trigger extracts the country prefix: <code>split_part(jurisdiction, '-', 1)</code> → <code>gb</code>. Since <code>gb</code> is not a language code in the mapping (the mapping uses ISO 639-1 codes like <code>en</code>), it falls through to <code>simple</code>. Set <code>language = 'en'</code> explicitly for English-language jurisdictions like <code>gb</code>, <code>gb-sct</code>, <code>us</code>.

This extraction is in the FTS trigger code (see [[Corpus/FTS|FTS]]).

----

== Migration strategy ==

Migration from the current 16-table schema is incremental and reversible:

# Create <code>corpus.*</code> tables alongside existing tables (coexistence).
# Migrate one kind at a time with SQL INSERT...SELECT scripts (decision first — largest and most exercised).
# Benchmark each migrated kind against the equivalent old-table queries.
# Adapt the MCP server to read from <code>corpus.documents</code> (dual-read during transition).
# Switch MCP tools one by one (search first, then get, then browse).
# When all tools read from <code>corpus.*</code>, drop old tables.
# Migrate <code>citations</code> → <code>corpus.edges</code> (same process).
# Refresh <code>corpus.tag_stats</code>.

Rollback: during coexistence, the old tables are still the source of truth. If migration fails, revert MCP to old tables.

[[Category:Corpus]]