Editing
Corpus/Schema
(section)
Jump to navigation
Jump to search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Schema: corpus.documents == 14 columns. <syntaxhighlight lang="sql"> CREATE SCHEMA IF NOT EXISTS corpus; CREATE TABLE corpus.documents ( id text PRIMARY KEY, kind text NOT NULL, jurisdiction text NOT NULL, language text, source text NOT NULL, date date, date_end date, parent_id text, title text, body text, body_search text, tags jsonb NOT NULL DEFAULT '{}', content_fts tsvector, ingested_at timestamptz DEFAULT now() ); </syntaxhighlight> === Column definitions === '''id''' -- Text primary key. Uses source-native identifiers with a jurisdiction prefix, always '''lowercase''': <code>{jurisdiction}.{source_id}</code> (e.g., <code>fr.legiarti000006902764</code>, <code>eu.eurlextext32016r0679</code>). Identifiers with built-in jurisdiction context keep their original form (<code>ECLI:FR:CCASS:2024:CR00123</code>, <code>[2024] UKSC 1</code>). For non-legal registries where the source ID could collide, prefix with <code>{jurisdiction}.{source}.</code> (e.g., <code>fr.company.443061841</code>, <code>gb.company.12345678</code>). Never synthetic UUIDs -- the ID must be meaningful and traceable to the source. See ADR: lowercase document IDs (design-decisions/2026-04-09-lowercase-document-ids.md). '''kind''' -- Structural classification. Exactly 6 values (see Kind Taxonomy below): <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code>. This column answers "what shape is this document?" not "what legal category is it?" -- legal category belongs in <code>tags.type</code>. '''jurisdiction''' -- Geographic/political area, not institution. ISO 3166-1 alpha-2 for countries (<code>fr</code>, <code>gb</code>, <code>de</code>, <code>br</code>, <code>jp</code>). ISO 3166-2 for subdivisions (<code>us-ca</code>, <code>gb-sct</code>, <code>es-ct</code>, <code>de-by</code>). Extensions for supranational areas: <code>eu</code> (European law β EU institutions and Council of Europe), <code>int</code> (international treaties). A search for a jurisdiction includes its subdivisions: searching <code>us</code> includes <code>us-ca</code>. No default value -- every document must declare its jurisdiction explicitly. '''language''' -- ISO 639-1 (<code>fr</code>, <code>en</code>, <code>de</code>, <code>ar</code>, <code>zh</code>, <code>ja</code>, <code>ko</code>). Nullable. When NULL, the FTS trigger infers language from <code>jurisdiction</code>. Decoupled from jurisdiction because many countries are multilingual: Belgium (3 languages), Switzerland (4), South Africa (11), India (23+). A Belgian law in Dutch and the same law in French are two separate documents with the same jurisdiction (<code>be</code>) but different languages. '''source''' -- Data source identifier. Examples: <code>legi</code>, <code>cass</code>, <code>hmrc_manuals</code>, <code>cendoj</code>, <code>boe</code>, <code>legifrance_jorf</code>, <code>knesset</code>. One source maps to one ingestion pipeline. Multiple sources can feed the same jurisdiction. '''date''' -- Primary date, always ISO 8601 (converted from Hijri calendar, Japanese era dates, etc. at ingestion). Semantics vary by kind: <code>decision_date</code> for decisions, <code>valid_from</code> for legislation, <code>publication_date</code> for notices, <code>creation_date</code> for records. The semantic meaning is implicit from <code>kind</code> -- not stored separately. '''date_end''' -- End date. <code>valid_until</code> for legislation, expiration for notices, closure for records. NULL for most decisions (they don't expire). NULL does not mean "still in force" -- it means the source did not provide an end date. '''parent_id''' -- Hierarchical link to another document ID. Used for structure trees (code -> article), sub-documents (decision -> chunk), and entity hierarchies (company -> director appointments). Text, not a foreign key -- the parent may live in a different source or not yet be ingested. No FK constraint. '''title''' -- Display title. Free text. May be NULL for chunks and sections that inherit their parent's title. '''body''' -- Clean displayable content. HTML or plain text suitable for direct rendering to a user. Immutable after initial ingestion. Never contains noisy OCR or raw PDF binary. For PDF-only sources, <code>body</code> is a clean stub like <code><p>Source: <a href="...">PDF officiel</a></p></code> and the extracted text goes to <code>body_search</code>. <code>body</code> is what <code>get_document</code> returns. '''body_search''' -- Indexable text representation used by FTS. Can be noisy (raw PDF extraction, OCR output) because users never see it directly β <code>get_document</code> never returns it. Nullable. When NULL, the FTS trigger falls back to <code>body</code> via <code>coalesce(body_search, body)</code>. Used for: (a) PDF-sourced documents where <code>body</code> is a clean stub and <code>body_search</code> holds the pdfminer-extracted text; (b) HTML documents with irregular references that need an LLM-cleaned indexable version. Summary and headnote FTS indexing is handled separately by the trigger reading <code>tags.summary</code> and <code>tags.headnote_classification</code> directly from JSONB (weight B). See ADR-010 (design-decisions/2026-04-08-body-search-inversion.md) for the current semantic and ADR-008 (design-decisions/2026-04-05-fts-three-weights.md) for the trigger weight system. See [[Corpus/Quality|QUALITY]] for <code>tags.content_quality</code> levels. '''tags''' -- JSONB. Everything jurisdiction-specific or type-specific that does not deserve a promoted column. Examples: <code>{"type": "loi", "nid": "JORFTEXT000041865244", "nature": "LOI"}</code> for French legislation, <code>{"chamber": "2e_civ", "solution": "cassation"}</code> for French cassation decisions, <code>{"sic_code": "6411", "status": "active"}</code> for UK company records. No enforced schema -- each source defines its own tag vocabulary. Queried via <code>@></code>, <code>?</code>, and <code>jsonb_path_ops</code> GIN index. '''content_fts''' -- <code>tsvector</code> column, populated by a trigger with three weights: A=title, B=<code>tags.summary</code>+<code>tags.headnote_classification</code> (read from JSONB), C=<code>coalesce(body_search, body)</code> with HTML tags stripped. Uses the appropriate language configuration (derived from <code>language</code> or <code>jurisdiction</code>). NULL for CJK languages (Chinese, Japanese, Korean) which require pgroonga or zhparser instead of PostgreSQL built-in FTS. See [[Corpus/FTS|FTS]] for full trigger details. '''ingested_at''' -- Timestamp of first ingestion. <code>DEFAULT now()</code>. Not updated on re-ingestion -- use <code>tags</code> for versioning metadata if needed. ----
Summary:
Please note that all contributions to Dura Lex Wiki are considered to be released under the Creative Commons Attribution-ShareAlike (see
Dura Lex Wiki:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Navigation menu
Personal tools
Not logged in
Talk
Contributions
Create account
Log in
Namespaces
Page
Discussion
English
Views
Read
Edit
Edit source
View history
More
Search
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Special pages
Tools
What links here
Related changes
Page information