Editing
Corpus/Schema
Jump to navigation
Jump to search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
= Corpus Schema = Technical specification for the Dura Lex unified corpus schema. Reference for developers and AI agents. == Overview == One <code>documents</code> table with promoted columns and a JSONB <code>tags</code> column, inspired by OpenStreetMap's model: a small fixed schema handles structure, while a flexible key-value store handles everything jurisdiction-specific. Validated across 25 jurisdictions on 5 continents. Benchmark on 3.4M rows: 7x faster than the previous 16-table normalized schema. Four tables: {| class="wikitable" ! Table !! Purpose |- | <code>corpus.documents</code> || All legal content: legislation, case law, regulatory filings, administrative notices |- | <code>corpus.edges</code> || Relationships between documents: citations, amendments, hierarchy, consolidation |- | <code>corpus.tag_stats</code> || Aggregated tag counts per kind/jurisdiction, refreshed post-ingest |- | <code>corpus.source_metadata</code> || Data source registry: publishers, licenses, ingestion state, freshness |} ---- == Schema: corpus.documents == 14 columns. <syntaxhighlight lang="sql"> CREATE SCHEMA IF NOT EXISTS corpus; CREATE TABLE corpus.documents ( id text PRIMARY KEY, kind text NOT NULL, jurisdiction text NOT NULL, language text, source text NOT NULL, date date, date_end date, parent_id text, title text, body text, body_search text, tags jsonb NOT NULL DEFAULT '{}', content_fts tsvector, ingested_at timestamptz DEFAULT now() ); </syntaxhighlight> === Column definitions === '''id''' -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always '''lowercase''': <code>{jurisdiction}.{source_id}</code> (e.g., <code>fr.legiarti000006902764</code>, <code>eu.eurlextext32016r0679</code>). Identifiers with built-in jurisdiction context keep their original form (<code>ECLI:FR:CCASS:2024:CR00123</code>, <code>[2024] UKSC 1</code>). For non-legal registries where the source ID could collide, prefix with <code>{jurisdiction}.{source}.</code> (e.g., <code>fr.company.443061841</code>, <code>gb.company.12345678</code>). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md). '''kind''' -- Structural classification. Exactly 6 values (see Kind Taxonomy below): <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code>. This column answers "what shape is this document?" not "what legal category is it?" -- legal category belongs in <code>tags.type</code>. '''jurisdiction''' -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (<code>fr</code>, <code>gb</code>, <code>de</code>, <code>br</code>, <code>jp</code>). ISO 3166-2 for subdivisions (<code>us-ca</code>, <code>gb-sct</code>, <code>es-ct</code>, <code>de-by</code>). Extensions for supranational areas: <code>eu</code> (European law β EU institutions and Council of Europe), <code>int</code> (international treaties). A search for a jurisdiction includes its subdivisions: searching <code>us</code> includes <code>us-ca</code>. No default value -- every document must declare its jurisdiction explicitly. '''language''' -- ISO 639-1 (<code>fr</code>, <code>en</code>, <code>de</code>, <code>ar</code>, <code>zh</code>, <code>ja</code>, <code>ko</code>). Nullable. When NULL, the FTS trigger infers language from <code>jurisdiction</code>. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (<code>be</code>) but different languages. '''source''' -- Data source identifier. Examples: <code>legi</code>, <code>cass</code>, <code>hmrc_manuals</code>, <code>cendoj</code>, <code>boe</code>, <code>legifrance_jorf</code>, <code>knesset</code>. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction. '''date''' -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: <code>decision_date</code> for decisions, <code>valid_from</code> for legislation, <code>publication_date</code> for notices, <code>creation_date</code> for records. The semantic meaning is implicit from <code>kind</code> -- not stored separately. '''date_end''' -- End date. <code>valid_until</code> for legislation, expiration for notices, closure for records. NULL for most decisions (they don't expire). NULL does not mean "still in force" -- it means the source did not provide an end date. '''parent_id''' -- Hierarchical link to another document ID. Used for structure trees (code -> article), sub-documents (decision -> chunk), and entity hierarchies (company -> director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint. '''title''' -- Display title. Free text. May be NULL for chunks and sections that inherit their parent's title. '''body''' -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, <code>body</code> is a clean stub like <code><p>Source: <a href="...">PDF officiel</a></p></code> and the extracted text goes to <code>body_search</code>. <code>body</code> is what <code>get_document</code> returns. '''body_search''' -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly β <code>get_document</code> never returns it. Nullable. When NULL, the FTS trigger falls back to <code>body</code> via <code>coalesce(body_search, body)</code>. Used for: (a) PDF-sourced documents where <code>body</code> is a clean stub and <code>body_search</code> holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading <code>tags.summary</code> and <code>tags.headnote_classification</code> directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See [[Corpus/Quality|QUALITY]] for <code>tags.content_quality</code> levels. '''tags''' -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: <code>{"type": "loi", "nid": "JORFTEXT000041865244", "nature": "LOI"}</code> for French legislation, <code>{"chamber": "2e_civ", "solution": "cassation"}</code> for French cassation decisions, <code>{"sic_code": "6411", "status": "active"}</code> for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via <code>@></code>, <code>?</code>, and <code>jsonb_path_ops</code> GIN index. '''content_fts''' -- <code>tsvector</code> column, populated by a trigger with three weights: A=title, B=<code>tags.summary</code>+<code>tags.headnote_classification</code> (read from JSONB), C=<code>coalesce(body_search, body)</code> with HTML tags stripped. Uses the appropriate language configuration (derived from <code>language</code> or <code>jurisdiction</code>). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See [[Corpus/FTS|FTS]] for full trigger details. '''ingested_at''' -- Timestamp of first ingestion. <code>DEFAULT now()</code>. Not updated on re-ingestion -- use <code>tags</code> for versioning metadata if needed. ---- == Indexes == <syntaxhighlight lang="sql"> -- B-tree indexes on promoted columns CREATE INDEX idx_doc_kind ON corpus.documents (kind); CREATE INDEX idx_doc_jurisdiction ON corpus.documents (jurisdiction); CREATE INDEX idx_doc_source ON corpus.documents (source); CREATE INDEX idx_doc_date ON corpus.documents (date); CREATE INDEX idx_doc_date_end ON corpus.documents (date_end) WHERE date_end IS NOT NULL; CREATE INDEX idx_doc_parent ON corpus.documents (parent_id) WHERE parent_id IS NOT NULL; CREATE INDEX idx_doc_kind_juris ON corpus.documents (kind, jurisdiction); CREATE INDEX idx_doc_kind_source ON corpus.documents (kind, source); -- GIN index on tags CREATE INDEX idx_doc_tags ON corpus.documents USING GIN (tags jsonb_path_ops); -- Partial GIN indexes per kind (FTS isolation) CREATE INDEX idx_doc_fts_legislation ON corpus.documents USING GIN (content_fts) WHERE kind = 'legislation'; CREATE INDEX idx_doc_fts_decision ON corpus.documents USING GIN (content_fts) WHERE kind = 'decision'; CREATE INDEX idx_doc_fts_record ON corpus.documents USING GIN (content_fts) WHERE kind = 'record'; CREATE INDEX idx_doc_fts_notice ON corpus.documents USING GIN (content_fts) WHERE kind = 'notice'; </syntaxhighlight> === Why partial GIN indexes per kind instead of table partitioning === Partitioning <code>documents</code> by <code>kind</code> would require <code>kind</code> in the primary key (PostgreSQL requires the partition key in the PK). That means <code>id</code> alone can no longer be a PK -- every FK, every edge reference, every external link would need to carry <code>(id, kind)</code> instead of just <code>id</code>. The complexity cascades through the entire system. Partial GIN indexes achieve 80% of the performance benefit with 0% of the complexity. When a query filters <code>WHERE kind = 'legislation' AND content_fts @@ ...</code>, PostgreSQL uses the partial index <code>idx_doc_fts_legislation</code> which contains only legislation tsvectors. The planner skips the other kinds entirely. Index size is smaller, scans are faster, and the PK stays a simple <code>text</code> column. No partial index for <code>section</code> and <code>chunk</code> because they rarely have <code>body</code>/<code>body_search</code> content worth full-text indexing. ---- == Schema: corpus.edges == <syntaxhighlight lang="sql"> CREATE TABLE corpus.edges ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, source_id text NOT NULL, target_id text, kind text NOT NULL, reference text, properties jsonb DEFAULT '{}', UNIQUE (source_id, target_id, kind) ); CREATE INDEX idx_edge_source ON corpus.edges (source_id); CREATE INDEX idx_edge_target ON corpus.edges (target_id) WHERE target_id IS NOT NULL; CREATE INDEX idx_edge_kind ON corpus.edges (kind); </syntaxhighlight> === Column definitions === '''id''' -- Synthetic identity, <code>bigint GENERATED ALWAYS AS IDENTITY</code>. Edges have no natural key. '''source_id''' -- The document that originates the relationship. Not a foreign key. '''target_id''' -- The document that receives the relationship. Nullable: an unresolved reference (a citation to a document not yet ingested, or a reference to an external system) has <code>target_id = NULL</code> and <code>reference</code> containing the raw citation text. Once the target is ingested or resolved, <code>target_id</code> is backfilled. '''kind''' -- Relationship type. Open vocabulary. Examples: <code>cites</code>, <code>amends</code>, <code>repeals</code>, <code>implements</code>, <code>transposes</code>, <code>overrules</code>, <code>affirms</code>, <code>commences</code>, <code>language_variant</code>. See [[Corpus/Edge types|EDGE-TYPES]] for the full taxonomy. '''reference''' -- Raw citation text as found in the source document. Preserved even after resolution for traceability. Example: <code>"article L. 121-1 du code de la consommation"</code>. '''properties''' -- JSONB metadata on the relationship. Standard provenance keys (see ADR: edge pipeline architecture, design-decisions/2026-04-22-edge-pipeline-architecture.md): * <code>extraction</code>: <code>"publisher_provided"</code> (source has a structured link) or <code>"id_resolved"</code> (textual reference resolved by the reference resolver). * <code>source_type</code>: identifies the source system (<code>legi_lien</code>, <code>cellar_sparql</code>, <code>hudoc_appno</code>, <code>cellar_nim</code>). * <code>raw_link_type</code>: source-native link type preserved for traceability (e.g., <code>"TRANSPOSITION"</code>, <code>"cdm:resource_legal_amends_resource_legal"</code>). Type-specific examples: <code>{"scope": "partial", "articles": ["1", "2", "3"]}</code> for partial amendments, <code>{"commencement_date": "2024-01-01", "territorial_extent": "england_and_wales"}</code> for UK effects, <code>{"reservation": "sous reserve de..."}</code> for conditional constitutionality decisions. === Quarantine model === Edges with unresolved targets are stored with <code>target_id = NULL</code> and the raw citation in <code>reference</code>. They are visible to MCP and queryable for monitoring. The partial unique index <code>idx_edge_unique_without_target ON (source_id, reference, kind) WHERE target_id IS NULL</code> prevents duplicate quarantine entries. '''Ownership:''' an edge is owned by the jurisdiction prefix of its <code>source_id</code>. Drop-and-reingest for jurisdiction X deletes edges WHERE <code>source_id LIKE '{x}.%'</code>. Cross-jurisdiction quarantine edges (e.g., EU NIM β FR target) survive the target jurisdiction's reingest. '''Reconciliation:''' after each jurisdiction ingest, a post-ingest pass attempts to resolve quarantine edges whose reference matches the freshly ingested jurisdiction. Resolved β <code>UPDATE target_id</code>. Still unresolved β remains in quarantine. === Why no foreign keys === At 50M+ edges, FK checks on every insert become a measurable bottleneck. Sources are ingested independently and in parallel -- a French decision citing a EU directive may be ingested before the directive itself. FKs would either block ingestion order or require two-pass inserts. Orphan cleanup runs periodically (post-ingest batch job), not on every write. ---- == Schema: corpus.tag_stats == <syntaxhighlight lang="sql"> CREATE TABLE corpus.tag_stats ( kind text NOT NULL, jurisdiction text NOT NULL, tag_key text NOT NULL, tag_value text NOT NULL, count integer DEFAULT 0, PRIMARY KEY (kind, jurisdiction, tag_key, tag_value) ); </syntaxhighlight> Refreshed post-ingest, incrementally per source. Provides instant faceted counts without scanning <code>documents</code>. Must handle both scalar and array tag values. For scalar tags, count directly. For array tags (e.g., <code>tags.themes = ["droit civil", "responsabilite"]</code>), use <code>jsonb_array_elements_text</code> to explode the array and count each element individually. ---- == Kind taxonomy == 6 structural values. These classify document shape, not legal category. {| class="wikitable" ! Kind !! What it is !! What it is NOT !! Examples |- | <code>legislation</code> || Creates, interprets, or defines a rule. Normative or quasi-normative content. || Not a resolution of a specific dispute. || Statute, regulation, decree, circular, directive, fatwa, ministerial order, regulatory handbook |- | <code>decision</code> || Resolves a specific dispute or case. Issued by a court or quasi-judicial body. || Not a general rule, even if it sets precedent. || Judgment, ruling, order, arbitral award, AG opinion |- | <code>record</code> || Factual registration in an official registry. || Not normative, not adjudicative. || Company filing, trademark registration, patent, land title, birth certificate |- | <code>notice</code> || Official announcement or publication. Time-bound, informational. || Not normative long-term. || Gazette entry, public consultation, tender, regulatory announcement |- | <code>section</code> || Structural subdivision of another document. Has a <code>parent_id</code>. || Not a standalone document. || Part, title, chapter, article, schedule, annex |- | <code>chunk</code> || Processing artifact. A fragment of a document created for search or embedding. || Not an editorial unit. || Paragraph-level split for RAG, embedding chunk |} === Disambiguation rule === When a document is ambiguous: # Does it resolve a specific dispute between parties? -> <code>decision</code> # Does it create, interpret, or define a general rule? -> <code>legislation</code> # Still ambiguous? -> <code>legislation</code> The legal category goes in <code>tags.type</code> (e.g., <code>"type": "sumula_vinculante"</code>, <code>"type": "tesis_jurisprudencial"</code>). === Stress-test: ambiguous documents across jurisdictions === {| class="wikitable" ! Document !! Jurisdiction !! Kind !! Rationale |- | Sumula vinculante || BR || <code>legislation</code> || Binding precedent abstract, creates a general rule. Not resolving a specific dispute. |- | Tesis jurisprudencial || MX || <code>legislation</code> || Abstracted legal thesis from repeated rulings. General rule, not a specific dispute. |- | Judicial interpretation (εΈζ³θ§£ι) || CN || <code>legislation</code> || Supreme People's Court quasi-legislation. Creates binding rules of general application. |- | Dictamen CGR || CL || <code>legislation</code> || Comptroller General opinion. Interprets administrative law generally. |- | AG Opinion (CJEU) || EU || <code>decision</code> || Issued in the context of a specific case, advises on its resolution. |- | FCA Handbook || GB || <code>legislation</code> || Financial Conduct Authority regulatory rules. Normative, not case-specific. |- | Fatwa || SA || <code>legislation</code> || Religious-legal ruling of general application. Interprets Islamic law. |- | Administrative guidance (ιι tsutatsu) || JP || <code>legislation</code> || Ministry directive to subordinate agencies. General rule, not dispute resolution. |- | Restatement section || US || <code>legislation</code> || Scholarly synthesis of common law rules. General rule formulation. |- | Grundsatzentscheidung || DE || <code>decision</code> || Federal Court principle decision. Resolves a specific case, even if it sets precedent. |- | Arret de principe || FR || <code>decision</code> || Cour de cassation principle ruling. Resolves a specific dispute. |- | Sentenza interpretativa di rigetto || IT || <code>decision</code> || Constitutional Court interpretive rejection. Resolves a specific constitutional question. |- | Advisory Opinion (ICJ) || INT || <code>decision</code> || Issued on a specific question referred by a UN organ. |- | Pleno no jurisdiccional || ES || <code>decision</code> || Supreme Court plenary agreement. Resolves doctrinal conflict arising from cases. |- | Bepaling van algemene strekking || NL || <code>legislation</code> || General binding regulation. Normative, not case resolution. |- | Constitutional Court communique || TR || <code>legislation</code> || Press release summarizing a decision. Informational/normative summary. |- | Practice Direction || GB || <code>legislation</code> || Court-issued procedural rules. General application. |- | Reglement grand-ducal || LU || <code>legislation</code> || Grand-ducal regulation. Normative. |- | Acordao vinculante STF || BR || <code>decision</code> || Binding Supreme Court judgment. Resolves a specific case. |- | Royal Decree (Ω Ψ±Ψ³ΩΩ Ω ΩΩΩ) || SA || <code>legislation</code> || Royal legislative act. Creates general rules. |- | Verfassungsbeschwerde decision || DE || <code>decision</code> || Constitutional complaint resolution. Specific dispute. |- | Recurso extraordinario || BR || <code>decision</code> || Extraordinary appeal. Resolves a specific case. |- | EU Regulation || EU || <code>legislation</code> || Directly applicable legislation. General normative act. |- | OHADA Uniform Act || INT || <code>legislation</code> || Harmonized commercial law. General normative act. |- | Tax ruling (rescrit) || FR || <code>legislation</code> || Tax authority interpretation. General application (even if triggered by a specific request). |} ---- == Benchmark results == Measured on 3.4M rows, PostgreSQL 16, single-node, 16 GB RAM. {| class="wikitable" ! Query !! Corpus schema !! Previous 16-table schema !! Speedup |- | Tag filter + date range || 0.9 ms || β || B-tree scan, instant |- | FTS + tag filter (broad, 111K results) || 4.4 s || 31 s || 7x faster (BitmapAnd vs Seq Scan) |- | Tag-only + date || 1.4 ms || β || B-tree scan, instant |- | FTS narrow (17K results, warm) || 565 ms || β || Partial GIN index |- | Get by ID || 0.3 ms || 0.3 ms || PK lookup |- | Tag discovery (GROUP BY court, 3.4M rows) || 2.7 s || β || Seq scan β justifies tag_stats |} The FTS speedup comes from BitmapAnd of GIN (FTS) + GIN (tags) indexes. The previous schema used a Parallel Seq Scan because the planner could not combine B-tree (court column) with GIN (FTS) efficiently. ---- == Operational notes == '''No FK on edges or parent_id.''' Foreign keys on multi-million row tables with parallel ingestion from independent sources create ordering constraints and lock contention that outweigh referential safety. Orphan detection runs as a periodic batch job. '''Partial GIN indexes per kind.''' See the indexes section. Keeps FTS index sizes small, avoids partitioning complexity. '''WHERE IS DISTINCT FROM in upserts.''' Use <code>IS DISTINCT FROM</code> instead of <code>!=</code> in <code>ON CONFLICT ... DO UPDATE SET ... WHERE</code> clauses. <code>!=</code> returns NULL when comparing with NULL (and the row is not updated), while <code>IS DISTINCT FROM</code> treats NULL as a comparable value. Critical for columns like <code>date_end</code>, <code>body_search</code>, <code>language</code> that are frequently NULL. '''Autovacuum tuning.''' Set <code>autovacuum_vacuum_scale_factor = 0.01</code> on <code>corpus.documents</code>. The default (0.2) means autovacuum triggers after 20% of rows are dead -- on a 3.4M row table, that is 680K dead rows before cleanup. At 0.01, it triggers at 34K dead rows, keeping bloat and index performance under control. '''Estimated counts at scale.''' <code>COUNT(*) OVER()</code> becomes expensive above 100M rows. Use <code>EXPLAIN</code>-based row estimates or <code>pg_class.reltuples</code> for approximate counts in paginated APIs. Exact counts only when explicitly requested. '''CJK full-text search.''' PostgreSQL built-in FTS does not support Chinese, Japanese, or Korean tokenization. <code>content_fts</code> is NULL for CJK documents. Use pgroonga (all three languages) or zhparser (Chinese only) as an extension. CJK search queries must use a different code path. '''Partitioning at CN scale.''' At 130M+ rows (realistic for a full Chinese legal corpus), partition <code>documents</code> by <code>jurisdiction</code>. This is the one case where partitioning is justified: queries almost always filter by jurisdiction, and partition pruning eliminates entire shards. The composite PK concern is manageable when partitioning by a column that is already always present in queries. ---- == corpus.source_metadata == Tracks data sources, their static descriptive metadata, and dynamic statistics recomputed at the end of each ingest cycle. <syntaxhighlight lang="sql"> CREATE TABLE corpus.source_metadata ( source_key text PRIMARY KEY, parent_source_key text, jurisdiction text NOT NULL, name text NOT NULL, description text NOT NULL, kind text NOT NULL DEFAULT 'legislation', publisher text NOT NULL, publisher_url text NOT NULL, license text NOT NULL, license_url text NOT NULL, language text NOT NULL DEFAULT 'fr', coverage_start_year integer, document_count integer, freshest_document_date date, last_file text, source_updated date, ingested_at timestamptz ); </syntaxhighlight> Used by the MCP server (guidelines, data freshness display) and the portal. Sub-sources (e.g., <code>bodacc_pcl</code>) have <code>parent_source_key</code> set. === Static vs dynamic columns === * '''Static''' (written by <code>sync_source_registry()</code> at init from <code>SOURCE_REGISTRY</code>): <code>source_key</code>, <code>parent_source_key</code>, <code>jurisdiction</code>, <code>name</code>, <code>description</code>, <code>kind</code>, <code>publisher</code>, <code>publisher_url</code>, <code>license</code>, <code>license_url</code>, <code>language</code>. * '''Dynamic''' (written by <code>refresh_corpus_stats()</code> at the end of each ingest cycle, including the dedup steps inside <code>run_post_ingest()</code>): <code>coverage_start_year</code> (= <code>MIN(date)</code> per source), <code>document_count</code> (= <code>COUNT(*)</code> per source), <code>freshest_document_date</code> (= <code>MAX(date)</code> per source). Top-level parent rows that have sub-sources are populated by a SUM/MIN/MAX rollup over their children. * '''Operational''' (written by <code>state.set_ingest_state()</code> per ingest run): <code>last_file</code>, <code>source_updated</code>, <code>ingested_at</code>. These track pipeline state, not content state, and are not displayed in MCP responses. <code>sync_source_registry()</code> MUST NOT touch dynamic columns β they would be reset to NULL on every init. The INSERT and <code>ON CONFLICT DO UPDATE SET</code> clauses only list static columns. ---- == Required PostgreSQL extensions == <syntaxhighlight lang="sql"> CREATE EXTENSION IF NOT EXISTS unaccent; -- accent-insensitive FTS -- Optional (for CJK jurisdictions): -- CREATE EXTENSION IF NOT EXISTS pgroonga; -- CJK full-text search </syntaxhighlight> ---- == Complete DDL execution order == # <code>CREATE SCHEMA corpus;</code> # <code>CREATE EXTENSION IF NOT EXISTS unaccent;</code> # <code>CREATE TABLE corpus.documents (...);</code> (14 columns) # <code>CREATE TABLE corpus.edges (...);</code> # <code>CREATE TABLE corpus.tag_stats (...);</code> # <code>CREATE TABLE corpus.source_metadata (...);</code> # <code>CREATE FUNCTION corpus.normalize_for_fts(...);</code> (see [[Corpus/FTS|FTS]]) # <code>CREATE FUNCTION corpus.update_content_fts(...);</code> (see [[Corpus/FTS|FTS]]) # <code>CREATE TRIGGER trg_content_fts ...;</code> # All indexes (B-tree, GIN, partial GIN) # <code>ANALYZE corpus.documents;</code> For the graph schema (future): <code>CREATE SCHEMA graph;</code> + tables from [[Corpus/Graph|GRAPH]]. ---- == ID format convention == Document IDs use source-native identifiers, always '''lowercase'''. Rules: * All document IDs carry a lowercase jurisdiction prefix: <code>{jurisdiction}.{source_id}</code> (e.g., <code>fr.legiarti000006902764</code>, <code>fr.juritext000041701651</code>, <code>fr.cetatext000046783006</code>, <code>eu.eurlextext32016r0679</code>). * Identifiers with built-in jurisdiction context (ECLI:*, CELEX:*, [year] UKSC) may omit the prefix. * For non-legal registries where the source ID could collide, use <code>{jurisdiction}.{source}.</code> (e.g., <code>fr.company.443061841</code>, <code>gb.company.12345678</code>). * Decision: the ingest plugin author determines the prefix format, documented in the plugin's tag schema. ---- == Bidirectional edges convention == Some edge types are logically bidirectional (<code>language_variant</code>, <code>corresponds_to</code>, <code>associated_text</code>, <code>joined_with</code>, <code>pilot_follower</code>). The convention: * Store ONE edge, not two. * Pick a canonical direction (alphabetical by source_id, or the "primary" document in the pair). * Queries for bidirectional edges check both directions: <code>WHERE (source_id = X OR target_id = X) AND kind = 'language_variant'</code>. ---- == Jurisdiction subdivision mapping in FTS trigger == The trigger resolves language from <code>coalesce(language, jurisdiction)</code>. For subdivision jurisdictions like <code>gb-sct</code>, <code>es-ct</code>, <code>de-by</code>: * If <code>language</code> is set (recommended for subdivisions), it is used directly. * If <code>language</code> is NULL, the trigger extracts the country prefix: <code>split_part(jurisdiction, '-', 1)</code> β <code>gb</code>. Since <code>gb</code> is not a language code in the mapping (the mapping uses ISO 639-1 codes like <code>en</code>), it falls through to <code>simple</code>. Set <code>language = 'en'</code> explicitly for English-language jurisdictions like <code>gb</code>, <code>gb-sct</code>, <code>us</code>. This extraction is in the FTS trigger code (see [[Corpus/FTS|FTS]]). ---- == Migration strategy == Migration from the current 16-table schema is incremental and reversible: # Create <code>corpus.*</code> tables alongside existing tables (coexistence). # Migrate one kind at a time with SQL INSERT...SELECT scripts (decision first β largest and most exercised). # Benchmark each migrated kind against the equivalent old-table queries. # Adapt the MCP server to read from <code>corpus.documents</code> (dual-read during transition). # Switch MCP tools one by one (search first, then get, then browse). # When all tools read from <code>corpus.*</code>, drop old tables. # Migrate <code>citations</code> β <code>corpus.edges</code> (same process). # Refresh <code>corpus.tag_stats</code>. Rollback: during coexistence, the old tables are still the source of truth. If migration fails, revert MCP to old tables. [[Category:Corpus]]
Summary:
Please note that all contributions to Dura Lex Wiki are considered to be released under the Creative Commons Attribution-ShareAlike (see
Dura Lex Wiki:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Navigation menu
Personal tools
Not logged in
Talk
Contributions
Create account
Log in
Namespaces
Page
Discussion
English
Views
Read
Edit
Edit source
View history
More
Search
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Special pages
Tools
What links here
Related changes
Page information