
= Tag Conventions =

Tags are the primary mechanism for jurisdiction-specific metadata. They live in the <code>tags</code> JSONB column on <code>corpus.documents</code>. This document defines the shared vocabulary — tag keys that have the same meaning across jurisdictions.

Jurisdiction-specific tags (e.g., <code>idcc</code> for French labor conventions, <code>foral</code> for Spanish Basque tax law) are documented in each jurisdiction's plugin, not here.

== General rules ==

* '''Enum / classification tag values MUST be lowercase.''' Values from controlled vocabularies (type, binding, enforcement_status, content_quality, importance_level, legal_branch, etc.) must always be lowercase. No mixed-case, no uppercase. Enforced at ingest time. This ensures deterministic matching and avoids case-sensitivity bugs in filters, discover mode, and tag_stats aggregation.

* '''Data values are stored as-is, queried case-insensitively.''' Values that represent real-world data — company names (denominations like "DANONE"), legal form labels ("SAS", "SARL"), role labels ("Président"), tribunal names, case numbers, ECLI identifiers, NOR codes — are stored in their original case. The MCP search tools already lowercase tag values at query time for case-insensitive matching.
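The two rules above can be sketched as follows. This is a minimal illustration, not the actual ingest code: <code>ENUM_KEYS</code>, <code>check_enum_case</code> and <code>matches_data_value</code> are hypothetical names, and the key list only repeats the examples given in the first rule.

```python
# Hypothetical sketch: the key set repeats the controlled-vocabulary examples above.
ENUM_KEYS = {
    "type", "binding", "enforcement_status",
    "content_quality", "importance_level", "legal_branch",
}


def check_enum_case(tags: dict) -> list:
    """Return the enum keys whose string values are not fully lowercase."""
    offenders = []
    for key in sorted(tags.keys() & ENUM_KEYS):
        value = tags[key]
        # legal_branch is an array; treat scalars as one-item lists.
        values = value if isinstance(value, list) else [value]
        # Only strings can violate the rule; booleans and numbers are skipped.
        if any(isinstance(v, str) and v != v.lower() for v in values):
            offenders.append(key)
    return offenders


def matches_data_value(stored: str, query: str) -> bool:
    """Compare a data value stored in its original case with a query, ignoring case."""
    return stored.lower() == query.lower()
```

With this sketch, <code>check_enum_case({"type": "Statute"})</code> reports <code>type</code>, while a data value such as <code>"DANONE"</code> is never checked and still matches the query <code>"danone"</code>.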

----

== Legal classification ==

=== tags.type (required) ===

The legal nature of the document. This tag carries the distinction that the old per-table schema expressed through separate tables.

Values for <code>kind=legislation</code>:

<pre>
statute, decree, regulation, enforcement_decree, ordinance, order,
collective_agreement, guidance, circular, manual, instruction,
fatwa, tesis, sumula_vinculante, tsūtatsu, judicial_interpretation,
dictamen, acordada, auto_acordado, normative_document,
inquiry_report, bill, debate, question, committee_report,
constitutional_amendment, provisional_measure, technical_standard,
code_of_practice, regulatory_handbook
</pre>

Values for <code>kind=decision</code>:

<pre>
judgment, order, advisory_opinion, enforcement_decision,
constitutional_review, preliminary_ruling, arbitral_award,
tutela, amparo, advance_ruling, administrative_decision
</pre>

Values for <code>kind=record</code>:

<pre>
company, person, property, patent, establishment
</pre>

Values for <code>kind=notice</code>:

<pre>
insolvency, registration, modification, filing, deregistration,
capital_change, merger, liquidation, gazette_publication
</pre>

=== tags.binding (optional) ===

Whether the document creates binding legal obligations.

Values: <code>true</code>, <code>false</code>

=== tags.binding_scope (optional, when binding=true) ===

Who is bound.

Values: <code>erga_omnes</code>, <code>administration</code>, <code>inter_partes</code>, <code>applicant_only</code>, <code>sector</code>

=== tags.legal_tradition (optional) ===

For jurisdictions with mixed legal systems (UAE, ZA, NG).

Values: <code>civil_law</code>, <code>common_law</code>, <code>sharia</code>, <code>customary</code>, <code>mixed</code>, <code>roman_dutch</code>

----

== Document form ==

=== tags.document_form (required for all documents) ===

Editorial/administrative form of the document. Jurisdiction-agnostic. Distinguishes the authoritative text from its editorial variants, notices, opinions, and corrections. See ADR: document_form tag (design-decisions/2026-04-22-document-form-tag.md) for naming rationale.

Values:

{| class="wikitable"
! Value !! Description !! Examples
|-
| <code>canonical_text</code> || Authoritative full text (default) || Judgment, statute, regulation, collective agreement
|-
| <code>editorial_summary</code> || Summary by documentalist/editor || CELEX <code>_res</code>, headnote, syllabus, Leitsatz, massima
|-
| <code>gazette_notice</code> || Publication notice in official gazette || EU OJ-C notice, JORF avis, Federal Register notice
|-
| <code>case_registration</code> || Notice of new case filing || CELEX CN/TN/FN, certified question (US), ICJ memorial
|-
| <code>corrigendum</code> || Correction/erratum || JORF rectificatif, EUR-Lex corrigendum
|-
| <code>consolidated_version</code> || Consolidated text incorporating amendments || EUR-Lex consolidated, UK consolidation act
|-
| <code>amendment</code> || Amending instrument || Avenant (ACCO/KALI), amending regulation
|-
| <code>dissenting_opinion</code> || Dissenting opinion by judge(s) || ECHR, ICJ, SCOTUS, BVerfG Sondervotum
|-
| <code>concurring_opinion</code> || Concurring opinion by judge(s) || ECHR, common law
|-
| <code>separate_opinion</code> || Separate opinion (neutral) || ICJ Art.57, ECHR
|-
| <code>declaration</code> || Brief declaration without detailed reasoning || ICJ, ECHR
|-
| <code>per_curiam</code> || Unsigned opinion attributed to the court || SCOTUS, UKSC
|-
| <code>memorandum_opinion</code> || Brief opinion without detailed reasoning || US federal courts
|}

Default: <code>canonical_text</code> when not specified.

Title prefix: when <code>document_form != canonical_text</code>, prefix the title with the English bracketed label: <code>[Editorial Summary]</code>, <code>[OJ Notice]</code>, <code>[Case Filing]</code>, <code>[Corrigendum]</code>, <code>[Consolidated]</code>, <code>[Amendment]</code>, <code>[Dissenting Opinion]</code>, <code>[Concurring Opinion]</code>, <code>[Separate Opinion]</code>, <code>[Declaration]</code>, <code>[Per Curiam]</code>, <code>[Memorandum]</code>.
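The default and title-prefix rules might be applied as in the sketch below. It assumes that the labels map to the table values in the order listed, that label and title are separated by a single space, and that an already-prefixed title is left alone; <code>TITLE_PREFIXES</code> and <code>prefixed_title</code> are hypothetical names.

```python
# Assumption: labels are paired with document_form values in the order listed above.
TITLE_PREFIXES = {
    "editorial_summary": "[Editorial Summary]",
    "gazette_notice": "[OJ Notice]",
    "case_registration": "[Case Filing]",
    "corrigendum": "[Corrigendum]",
    "consolidated_version": "[Consolidated]",
    "amendment": "[Amendment]",
    "dissenting_opinion": "[Dissenting Opinion]",
    "concurring_opinion": "[Concurring Opinion]",
    "separate_opinion": "[Separate Opinion]",
    "declaration": "[Declaration]",
    "per_curiam": "[Per Curiam]",
    "memorandum_opinion": "[Memorandum]",
}


def prefixed_title(title: str, tags: dict) -> str:
    """Return the display title, prefixed when the form is not canonical_text."""
    form = tags.get("document_form", "canonical_text")  # documented default
    if form == "canonical_text":
        return title
    label = TITLE_PREFIXES[form]  # KeyError signals an unknown document_form
    if title.startswith(label):
        return title  # assumption: avoid double-prefixing on re-ingest
    return f"{label} {title}"
```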

----

== Temporal metadata ==

=== tags.enforcement_status (for kind=legislation with versioning) ===

Values:
* <code>in_force</code> — currently applicable
* <code>deferred_enforcement</code> — adopted but not yet in force, will be on a future date
* <code>deferred_repeal</code> — currently in force, repeal scheduled for a future date
* <code>repealed</code> — explicitly repealed by a subsequent act
* <code>superseded</code> — historical version, a newer version of the same article exists
* <code>never_in_force</code> — modified before its effective date, never applied
* <code>expired</code> — lapsed by its own terms (sunset clause, fixed-duration text)
* <code>annulled</code> — struck down by a court (e.g. constitutional review) — typically retroactive
* <code>transferred</code> — content moved to a different article or location (renumbering)
* <code>denounced</code> — collective agreement repudiated by one of the parties
* <code>disjoined</code> — version split into multiple separate articles
* <code>conditional</code> — in force only under a specific interpretation (constitutional reservation)
* <code>pending</code> — emergency decree or provisional measure awaiting ratification

=== tags.in_force (boolean shortcut) ===

Values: <code>true</code>, <code>false</code>

=== tags.status (for provisional instruments: BR medida provisoria, AR DNU) ===

Values: <code>in_force</code>, <code>pending_ratification</code>, <code>converted</code>, <code>rejected</code>, <code>lapsed</code>

=== tags.last_modified (optional, ISO date) ===

Date the source last modified this document editorially. Distinct from <code>date</code> (legal effect date) and from ingestion timestamps. Populated when the source exposes it (BOFiP <code>bodgfip:date_modification</code>, Judilibre <code>update_date</code>, KALI <code>DATE_PUBLI</code>).

=== tags.date_published (optional, ISO date) ===

Date the document was officially published (e.g. in the Journal Officiel, Official Gazette, EUR-Lex). Distinct from <code>date</code> (legal effect date) which may be later (deferred enforcement). Populated when source exposes it and it differs from <code>date</code>.

----

== Non-Gregorian calendars ==

The promoted <code>date</code> column always holds the ISO date. The original calendar representation is kept in the tags below.

=== tags.date_hijri ===

Islamic (Hijri) calendar date. Format: <code>"YYYY-MM-DD"</code> in Hijri.

Used by: SA, AE, other Middle Eastern jurisdictions.

=== tags.date_era_jp ===

Japanese imperial era year. Format: era letter + 2-digit year (e.g., <code>"R06"</code> = Reiwa 6 = 2024).

Used by: JP.

=== tags.date_roc_tw ===

Republic of China (Minguo) year. Format: 3-digit year (e.g., <code>"112"</code> = 2023).

Used by: TW.
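As a rough illustration of how the era-based formats relate to the ISO year, the sketch below converts <code>date_era_jp</code> and <code>date_roc_tw</code> values to Gregorian years. Only the <code>R</code> letter appears in this document; the other era letters (<code>M</code>, <code>T</code>, <code>S</code>, <code>H</code>) and both function names are assumptions made for the example.

```python
import re

# Gregorian year preceding year 1 of each era (e.g. Reiwa 1 = 2019).
# Assumption: eras other than Reiwa use the letters M, T, S, H.
ERA_OFFSETS = {"M": 1867, "T": 1911, "S": 1925, "H": 1988, "R": 2018}


def era_jp_to_gregorian_year(value: str) -> int:
    """Convert an era letter + 2-digit year (e.g. "R06") to a Gregorian year."""
    match = re.fullmatch(r"([MTSHR])(\d{2})", value)
    if match is None or int(match.group(2)) == 0:
        raise ValueError(f"not a valid date_era_jp value: {value!r}")
    return ERA_OFFSETS[match.group(1)] + int(match.group(2))


def roc_tw_to_gregorian_year(value: str) -> int:
    """Convert a 3-digit Minguo year (e.g. "112") to a Gregorian year."""
    if re.fullmatch(r"\d{3}", value) is None:
        raise ValueError(f"not a valid date_roc_tw value: {value!r}")
    return int(value) + 1911
```

The conversion is year-level only; the exact day still comes from the ISO <code>date</code> column.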

----

== Content quality ==

=== tags.content_quality (required for all documents) ===

The quality level of the text in <code>body</code>.

Values (progression order):

<pre>
ocr_raw              — OCR extraction, errors likely
ocr_cleaned          — OCR cleaned by regex or LLM
native_raw           — native text (PDF text layer, .doc), structure lost
native_structured    — structured source (HTML Legifrance, XML Akoma Ntoso, API JSON)
machine_reviewed     — reviewed/corrected by LLM
human_reviewed       — reviewed by a human
jurist_reviewed      — validated by a jurist (highest confidence)
</pre>

Only the current level is stored. History is not tracked.

Critical for:

* Knowledge graph trust chains (annotations on <code>ocr_raw</code> docs have lower confidence)
* User-facing warnings ("this text may contain OCR errors")
* Quality pipeline prioritization
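Because the values form a progression, consumers can compare levels by rank. The sketch below shows one way to do that; the helper names are hypothetical, and treating both OCR levels as warranting a user-facing warning is an assumption, not a documented rule.

```python
# Levels in the progression order listed above (lowest confidence first).
QUALITY_LEVELS = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]


def quality_rank(level: str) -> int:
    """Return the position of a level in the progression (0 = lowest)."""
    return QUALITY_LEVELS.index(level)  # ValueError for unknown levels


def is_at_least(level: str, minimum: str) -> bool:
    """True when `level` is at or above `minimum` in the progression."""
    return quality_rank(level) >= quality_rank(minimum)


def needs_ocr_warning(tags: dict) -> bool:
    """Assumption: any OCR-derived level triggers the OCR warning."""
    return tags["content_quality"] in {"ocr_raw", "ocr_cleaned"}
```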

----

== Sub-jurisdiction ==

=== tags.sub_jurisdiction ===

For legal enclaves or devolved entities within a jurisdiction that have their own legal system.

Examples: <code>difc</code>, <code>adgm</code> (UAE free zones with English common law courts).

The <code>jurisdiction</code> column carries the country; this tag carries the enclave.

----

== Internal tags ==

Tags prefixed with <code>_</code> are internal: never displayed to users, excluded from discover mode, excluded from <code>tag_stats</code>. Used for computed values needed by the search engine. Populated at ingestion time by jurisdiction plugins.
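A display or statistics layer could hide internal tags with a filter like the one below. This is a minimal sketch; <code>public_tags</code> is a hypothetical helper name.

```python
def public_tags(tags: dict) -> dict:
    """Return a copy of `tags` without internal keys (those starting with "_")."""
    return {key: value for key, value in tags.items() if not key.startswith("_")}
```

For example, <code>_importance_level_default</code> is dropped while <code>importance_level</code> is kept.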

----

== Quality flags ==

=== tags.need_fixing ===

Marker that the document has a known data quality issue that should be revisited later. The value is the ''category'' of the problem so we can group records by what needs to be done:

* <code>date</code> — one or more date columns were nullified at ingest time because the source published an aberrant value (DILA sentinels like <code>2999-01-01</code> in the wrong column, INPI <code>9999-12-31</code>, manual typos like <code>5489-12-30</code> or <code>3023-04-03</code>). The original raw value is lost; recovering it requires re-fetching from the source. Used by <code>duralex.ingest.date_validation.validate_date</code>.

The tag is intentionally a single string and not an array — one flag per document is enough for now. If we ever need to track multiple issue categories on the same document, switch to an array.
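The nullify-and-flag behavior described above could look roughly like this. It is not the actual <code>validate_date</code> implementation, whose signature is not documented here; the function name and the plausibility bounds are assumptions chosen only so that the quoted sentinel values fall outside the range.

```python
from datetime import date
from typing import Optional

# Assumption: illustrative bounds, not the thresholds used by the real validator.
MIN_PLAUSIBLE = date(1000, 1, 1)
MAX_PLAUSIBLE = date(2200, 1, 1)


def nullify_aberrant_date(value: Optional[date], tags: dict) -> Optional[date]:
    """Return `value` if plausible; otherwise return None and flag the document."""
    if value is None:
        return None
    if MIN_PLAUSIBLE <= value <= MAX_PLAUSIBLE:
        return value
    # The raw value is dropped; only the category of the problem is recorded.
    tags["need_fixing"] = "date"
    return None
```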

----

== Precedent, publication, and importance ==

=== tags.is_precedent (boolean) ===

Formally designated as binding precedent.

Used by: FI (ennakkopäätös), PL (uchwala zasada prawna), CN (guiding case).

=== tags.official_grade (string, kind=decision) ===

Official publication/importance grade assigned by the jurisdiction. Values are jurisdiction-specific, documented per plugin. Absent when the source has no grading system.

The grade is relative to the issuing court — a high grade from a first-instance court is not equivalent to the same grade from a supreme court. The real importance depends on the (court, grade) pair.

Known values by jurisdiction:

{| class="wikitable"
! Jurisdiction !! Source !! Values (highest → lowest)
|-
| FR admin (CE, TdC) || JADE <code>PUBLI_RECUEIL</code> || <code>A</code>, <code>B</code>
|-
| FR admin (cour_administrative_appel, tribunal_administratif) || JADE <code>PUBLI_RECUEIL</code>, CE opendata <code>Code_Publication</code> || <code>R</code>, <code>C+</code>, <code>C</code>, <code>D</code>, <code>Z</code>
|-
| FR cass (post-2021) || Judilibre <code>publication</code> || <code>rapport</code>, <code>bulletin</code>, <code>diffuse</code>, <code>non_diffuse</code>
|-
| FR cass (pre-2021) || Judilibre <code>publication</code>, DILA XML || <code>rapport</code>, <code>bulletin</code>, <code>bulletin_information</code>, <code>internet</code>, <code>diffuse</code>
|-
| ECHR || HUDOC <code>importance</code> || <code>1</code>, <code>2</code>, <code>3</code>, <code>4</code>
|-
| CJEU || CELLAR <code>erecueil</code> + CELEX sector || <code>ecr_grand_chamber</code>, <code>ecr_chamber</code>, <code>oj_only</code>, <code>unpublished</code>
|-
| DE || juris <code>Dokumenttyp</code> || <code>amtlicher_leitsatz</code>, <code>orientierungssatz</code>, <code>redaktioneller_leitsatz</code>
|-
| CH || bger.ch || <code>atf</code>, <code>atf_partial</code>, <code>online_only</code>
|-
| US || CourtListener || <code>precedential</code>, <code>non_precedential</code>
|-
| CN || SPC database || <code>guiding</code>, <code>reference</code>, <code>typical</code>, <code>gazette</code>, <code>ordinary</code>
|-
| MX || SJF || <code>jurisprudencia</code>, <code>precedente</code>, <code>tesis_aislada</code>
|-
| IT || CED || <code>sezioni_unite_massima</code>, <code>massima</code>, <code>no_massima</code>
|-
| UK || ICLR || <code>law_reports</code>, <code>wlr_2_3</code>, <code>wlr_1</code>, <code>wlr_4</code>, <code>all_er</code>, <code>unreported</code>
|-
| FR fond (cour_appel, tribunal_judiciaire, tribunal_commerce) || Judilibre <code>particularInterest</code> || <code>particular_interest</code>
|-
| FR financial (CdC, CRC, CDBF, CAF) || Légifrance <code>publicationRecueil</code> || <code>recueil</code>
|}

=== tags.importance_level (string, computed) ===

Harmonized importance score derived from <code>official_grade</code> by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction search and FTS ranking. Applies to <code>kind=decision</code> and <code>kind=legislation</code> (for authoritative guidance like BOFiP fiscal doctrine). Absent when no importance signal is available. BOFiP documents are always <code>high_importance</code> (no per-document grading from the source).

Values (ascending importance): <code>minimal_importance</code>, <code>low_importance</code>, <code>medium_importance</code>, <code>high_importance</code>, <code>highest_importance</code>.

=== tags._importance_level_default (string, internal) ===

Internal tag for FTS ranking, never displayed. Always set for documents that have importance signals (<code>kind=decision</code>, <code>kind=legislation</code> with authoritative guidance). Equals <code>importance_level</code> when available, else derived from <code>court_level</code>:

* <code>supreme</code>, <code>constitutional</code>, <code>supranational</code> → <code>medium_importance</code>
* <code>appellate</code> → <code>low_importance</code>
* <code>first_instance</code> → <code>minimal_importance</code>
* <code>court_level</code> absent or null → <code>unknown_importance</code>
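The fallback above can be expressed as a small lookup. This sketch assumes that an unrecognized <code>court_level</code> is treated like an absent one; the function name is hypothetical.

```python
# Fallback mapping from court_level, as listed above.
COURT_LEVEL_DEFAULTS = {
    "supreme": "medium_importance",
    "constitutional": "medium_importance",
    "supranational": "medium_importance",
    "appellate": "low_importance",
    "first_instance": "minimal_importance",
}


def importance_level_default(tags: dict) -> str:
    """Compute _importance_level_default from importance_level or court_level."""
    level = tags.get("importance_level")
    if level:
        return level  # an explicit importance_level always wins
    # Assumption: unknown court_level values fall back like absent ones.
    return COURT_LEVEL_DEFAULTS.get(tags.get("court_level"), "unknown_importance")
```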

=== tags.formation_solemnity (string, computed, kind=decision) ===

Standardized bench type derived from <code>formation</code> by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction comparison and display. Absent when formation is unknown.

Values (ascending solemnity): <code>single_judge</code>, <code>reduced_bench</code>, <code>standard_bench</code>, <code>combined_chambers</code>, <code>grand_bench</code>, <code>full_court</code>.

=== tags.court_level (string, kind=decision) ===

Position of the issuing court in the judicial hierarchy. Cross-jurisdiction handle for filtering by instance. Distinct from <code>court</code> (the name) and from <code>formation_solemnity</code> (the bench that heard the case). Null (absent) for non-judicial authorities (CNIL, CADA, AMF, Défenseur des droits) and for sui generis bodies that don't fit the hierarchy (Tribunal des conflits).

Values:

* <code>first_instance</code> — trial courts (FR tribunal judiciaire, tribunal de commerce, tribunal administratif, conseil de prud'hommes; EU General Court, Civil Service Tribunal)
* <code>appellate</code> — appeal courts (FR cour d'appel, cour administrative d'appel, tribunal supérieur d'appel)
* <code>supreme</code> — apex ordinary courts (FR Cour de cassation, Conseil d'État)
* <code>constitutional</code> — constitutional review bodies (FR Conseil constitutionnel, DE BVerfG, IT Corte Costituzionale)
* <code>supranational</code> — international/supranational courts (CJEU Court of Justice, ECHR, ICJ)

See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md) for the per-jurisdiction mapping rationale.

----

== Collective bargaining ==

=== tags.bargaining_level (string, kind=legislation, type=collective_agreement) ===

Level at which a collective agreement was negotiated. ILO/OECD standard terminology. Applies to collective agreements and amendments (<code>avenants</code>).

Values:

* <code>enterprise</code> — company-level (FR <code>entreprise</code>, including group and establishment agreements)
* <code>sectoral</code> — industry/branch-level (FR <code>branche</code>)
* <code>inter_sectoral</code> — cross-industry, national (FR <code>interprofessionnel</code>, e.g. ANI)
* <code>territorial</code> — geographically scoped (regional, departmental)

Renames the former FR-specific <code>level</code> key.

----

== Administrative subclassification ==

=== tags.subcategory (string, optional) ===

Subclassification finer than <code>type</code>. T2: key is shared across jurisdictions, value vocabularies are controlled per plugin and listed in that plugin's documentation. Used primarily on <code>kind=notice</code> (BODACC event types) and on records (RCS filing subtypes). Values are English lowercase, slugified — translated from source classifications so they are portable across tools.

Per-plugin value lists live in <code>duralex-ingest-fr/docs/FR-TAGS.md</code> and equivalents.

----

== Source-native classifications (T3) ==

Some tag keys hold values that mirror the source's own taxonomy rather than a Dura Lex enum. These keys are documented per plugin, not here, and are intended for intra-jurisdiction precision filters. Values are slugified at ingest time (lowercase, no accents, spaces → underscores) so that queries remain deterministic.

=== tags.nature (string) ===

Source-native document classification (e.g. FR <code>arret</code>, <code>ordonnance</code>, <code>loi_organique</code>, <code>qpc</code>). T3 source-fidelity — values are plugin-specific. For cross-jurisdiction search, prefer <code>type</code> and <code>document_form</code>; use <code>nature</code> for jurisdiction-local precision (e.g. distinguishing QPC from an ordinary Conseil constitutionnel decision).

Slugification (accent removal + space → underscore) is applied centrally at ingest time by <code>duralex.ingest.tag_normalization.normalize_tag_value</code>.
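For readers unfamiliar with the transformation, the sketch below reproduces the described steps (lowercase, accent removal, spaces to underscores) with the standard library. It is an approximation under those stated rules, not the code of <code>normalize_tag_value</code>; the function name is hypothetical.

```python
import unicodedata


def slugify_tag_value(value: str) -> str:
    """Lowercase, strip accents, and replace spaces with underscores."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower().replace(" ", "_")
```

For example, a source label <code>"Loi organique"</code> becomes <code>loi_organique</code> and <code>"Arrêt"</code> becomes <code>arret</code>.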

Known tech debt: EU <code>nature</code> values are French-origin slugs because CJEU ingestion sources French documents first. See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md).

----

== Legal branch ==

=== tags.legal_branch (array, kind=legislation only) ===

Branch(es) of law. Populated at ingest time from a deterministic code→branch mapping. For decisions, LLM-based enrichment is planned but not yet available.

Values: <code>civil</code>, <code>criminal</code>, <code>administrative</code>, <code>commercial</code>, <code>social</code>, <code>tax</code>, <code>constitutional</code>, <code>environmental</code>, <code>consumer</code>, <code>ip</code>, <code>public_procurement</code>, <code>health</code>, <code>family</code>, <code>real_estate</code>, <code>digital</code>

----

== Translation ==

=== tags.translation_of ===

ID of the original document this is a translation of. Present only on translated documents, not on originals.

=== tags.translation_quality ===

Quality of the translation. Present only on documents that are translations or language variants.

Values:
* <code>official</code> — equally authoritative version (EU texts in 24 languages, BE/FI/CH bilingual laws). No single original exists.
* <code>official_translation</code> — official translation without equal legal force
* <code>machine_translated</code> — LLM translation, not verified
* <code>human_reviewed</code> — translation reviewed by bilingual human

=== tags.translation_pending ===

ISO 639-1 code of the target language for which a translation is pending. Set on documents whose original language differs from the user's primary language and that have not yet been translated. Removed once a <code>language_variant</code> edge is created linking to the translated version.

Used by the LLM enrichment script (<code>enrich_translations.py</code>) to find candidates for machine translation:
<syntaxhighlight lang="sql">
SELECT id, body, body_search FROM corpus.documents
WHERE jurisdiction='eu' AND tags->>'translation_pending' = 'fr'
</syntaxhighlight>

=== ID convention for language variants ===

* '''EU texts''' (all versions equally authoritative): every version has a language suffix. No version is "the original."
<pre>
eu.celex:32016r0679:fr    language=fr, tags.translation_quality=official
eu.celex:32016r0679:en    language=en, tags.translation_quality=official
eu.celex:32016r0679:de    language=de, tags.translation_quality=official
</pre>

* '''National texts with translations''' (one original, others are translations): the original has no suffix. Translations have <code>:lang</code> suffix and <code>tags.translation_of</code> pointing to the original.
<pre>
fr.legiarti000006902764       language=fr  (original, no suffix)
fr.legiarti000006902764:en    language=en, tags.translation_of=fr.legiarti000006902764
</pre>

All language variants are linked by <code>language_variant</code> edges. The edge target is chosen by convention (alphabetical or first ingested) — no version is inherently canonical for equally authoritative texts.

----

== Identifiers ==

=== tags.text_id (kind=legislation and chunk) ===

Cross-language canonical identifier of the underlying legal work. Equal across all language variants of the same text — does NOT carry the <code>:lang</code> suffix that the document <code>id</code> column carries. Used by reference resolvers to find a text regardless of language.

Example:
<pre>
id = "eu.eurlextext32016r0679:fr"   tags.text_id = "eu.eurlextext32016r0679"
id = "eu.eurlextext32016r0679:en"   tags.text_id = "eu.eurlextext32016r0679"
</pre>

Both rows match a search by <code>text_id = "eu.eurlextext32016r0679"</code>. The search engine then narrows by user language via the <code>language</code> filter on <code>TagQuery</code>.

For articles and sections, <code>tags.text_id</code> points to the parent text (also without <code>:lang</code> suffix), enabling cross-language navigation: an article in the EN version of GDPR has the same <code>text_id</code> as the corresponding article in the FR version.
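The relationship between a document <code>id</code> and its language-independent form can be illustrated by stripping the <code>:lang</code> suffix. The sketch assumes the suffix is always a two-letter lowercase ISO 639-1 code at the very end of the id; <code>split_language_suffix</code> is a hypothetical helper, and for articles and sections the actual <code>text_id</code> is the parent text's identifier, which this function does not compute.

```python
import re
from typing import Optional, Tuple

# Assumption: the language suffix is ":" + two lowercase letters at the end of the id.
_LANG_SUFFIX = re.compile(r":([a-z]{2})$")


def split_language_suffix(doc_id: str) -> Tuple[str, Optional[str]]:
    """Split a document id into (id without :lang suffix, language or None)."""
    match = _LANG_SUFFIX.search(doc_id)
    if match is None:
        return doc_id, None  # original national text: no suffix
    return doc_id[: match.start()], match.group(1)
```

Both <code>eu.eurlextext32016r0679:fr</code> and <code>eu.eurlextext32016r0679:en</code> reduce to the same base, which is the value a text-level <code>text_id</code> carries.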

=== tags.eli ===

ELI (European Legislation Identifier) URI when provided by source. Not all jurisdictions support ELI.

=== tags.celex ===

CELEX identifier for EU documents (EUR-Lex primary key). Format: sector digit + 4-digit year + type letter(s) + ordinal (e.g. <code>32016R0679</code> for GDPR). Redundant with the document ID (<code>EU.CELEX:32016R0679</code>) but useful for tag-based filtering and discovery.

=== tags.ecli (kind=decision) ===

ECLI (European Case Law Identifier). Stored upper-case with whitespace stripped (ECLIs are case-insensitive per the EU spec but published mixed-case across sources). Normalization happens at ingest time via <code>duralex.ingest.tag_normalization.normalize_ecli</code> so dedup queries can match cross-source via a single partial expression index.
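The described normalization (upper-case, whitespace stripped) amounts to something like the following. This is a sketch of the rule as stated, not the body of <code>normalize_ecli</code>; the function name and the sample identifier are illustrative.

```python
def normalize_ecli_sketch(raw: str) -> str:
    """Remove all whitespace and upper-case the identifier."""
    # str.split() with no argument splits on any run of whitespace.
    return "".join(raw.split()).upper()
```

Two sources publishing <code>ecli:eu:c:2014:317</code> and <code>ECLI:EU:C:2014:317</code> then yield the same stored value, which is what lets a single index serve cross-source dedup.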

=== tags.case_number (array, kind=decision) ===

Display form of the case number(s), as published by the source. Multiple aliases allowed. Used for human-readable presentation only.

=== tags.case_number_normalized (array, kind=decision) ===

Indexable form of the case number(s) for cross-source dedup matching. Built at ingest time by <code>duralex.ingest.tag_normalization.normalize_case_number_list</code>, which strips dots and whitespace while preserving dashes, slashes and letters. This matches the historical <code>replace(., '.', '')</code> semantics and avoids cross-court false positives such as CASS <code>22-13456</code> colliding with CAPP <code>22/13456</code>. Backed by the partial GIN <code>jsonb_path_ops</code> index <code>idx_doc_decision_case_number_normalized</code> for <code>@&gt;</code> containment queries.
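The stripping rule can be sketched as below. The function names are hypothetical and the deduplication of aliases is an assumption added for the example; the real <code>normalize_case_number_list</code> may behave differently in details not described here.

```python
def normalize_case_number_sketch(raw: str) -> str:
    """Drop dots and whitespace; keep dashes, slashes, letters and digits."""
    return "".join(ch for ch in raw if ch != "." and not ch.isspace())


def normalize_case_numbers_sketch(raws: list) -> list:
    """Normalize each alias, dropping duplicates while preserving order."""
    seen = []
    for raw in raws:
        normalized = normalize_case_number_sketch(raw)
        if normalized not in seen:  # assumption: aliases are deduplicated
            seen.append(normalized)
    return seen
```

Because separators other than dots are preserved, <code>22-13.456</code> normalizes to <code>22-13456</code> and stays distinct from <code>22/13456</code>.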

----

== Original document reference ==

=== tags.original_url ===

URL or path to the original document (PDF in external storage).

=== tags.original_format ===

Original format: <code>pdf</code>, <code>xml</code>, <code>html</code>, <code>json</code>, <code>docx</code>, <code>odt</code>

----

== Document structure (for kind=chunk) ==

=== tags.part ===

The structural section of the parent document this chunk represents.

Values: <code>visa</code>, <code>facts</code>, <code>moyens</code>, <code>reasoning</code>, <code>ruling</code>, <code>ag_opinion</code>, <code>conclusie</code>, <code>dispositif</code>

=== tags.position ===

Ordinal position within parent (for ordering chunks and sections).

----

== Article versioning ==

=== tags.cid ===

Permanent article identity — groups temporal versions of the same article across renumbering. Corresponds to LEGI XML CID in France; other jurisdictions provide equivalent identifiers (e.g., BWB number in NL, SFS number in SE).

Used by: the compiler to group versions, the <code>at_date</code> TagQuery to select the correct temporal version, the knowledge graph to link concepts to articles.

See [[Corpus/Temporal|TEMPORAL]] for temporal versioning details.

----

== Structure type (for kind=section) ==

=== tags.structure_type ===

Differentiates navigation trees.

Values: <code>legislation</code>, <code>labor</code>, <code>doctrine</code>, <code>official_journal</code> (and future jurisdiction-specific values)

[[Category:Corpus]]