Tag Conventions
Tags are the primary mechanism for jurisdiction-specific metadata. They live in the tags JSONB column on corpus.documents. This document defines the shared vocabulary: tag keys that have the same meaning across jurisdictions.
Jurisdiction-specific tags (e.g., idcc for French labor conventions, foral for Spanish Basque tax law) are documented in each jurisdiction's plugin, not here.
General rules
- Enum / classification tag values MUST be lowercase. Values from controlled vocabularies (type, binding, enforcement_status, content_quality, importance_level, legal_branch, etc.) must always be lowercase: no mixed case, no uppercase. This is enforced at ingest time and ensures deterministic matching, avoiding case-sensitivity bugs in filters, discover mode, and tag_stats aggregation.
- Data values are stored as-is, queried case-insensitively. Values that represent real-world data — company names (denominations like "DANONE"), legal form labels ("SAS", "SARL"), role labels ("Président"), tribunal names, case numbers, ECLI identifiers, NOR codes — are stored in their original case. The MCP search tools already lowercase tag values at query time for case-insensitive matching.
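The two rules above can be sketched as follows. This is a minimal illustration, not the project's ingest code: the key set is copied from the list above (which ends in "etc.", so it is incomplete), and the function names `check_enum_value` and `matches` are hypothetical.

```python
# Keys whose values come from controlled vocabularies (copied from the rule
# above; the real list is longer).
ENUM_KEYS = {
    "type", "binding", "enforcement_status",
    "content_quality", "importance_level", "legal_branch",
}


def check_enum_value(key: str, value: str) -> str:
    """Hypothetical ingest-time check: enum values must already be lowercase."""
    if key in ENUM_KEYS and value != value.lower():
        raise ValueError(f"tags.{key} must be lowercase, got {value!r}")
    return value


def matches(stored: str, query: str) -> bool:
    """Case-insensitive comparison for data values kept in their original case."""
    return stored.casefold() == query.casefold()
```

With this sketch, `check_enum_value("type", "Statute")` raises, while a data value such as "DANONE" passes through unchanged and still matches a query for "danone".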
Legal classification
tags.type (required)
The legal nature of the document. This does the work that the old per-table schema did.
Values for kind=legislation:
statute, decree, regulation, enforcement_decree, ordinance, order, collective_agreement, guidance, circular, manual, instruction, fatwa, tesis, sumula_vinculante, tsūtatsu, judicial_interpretation, dictamen, acordada, auto_acordado, normative_document, inquiry_report, bill, debate, question, committee_report, constitutional_amendment, provisional_measure, technical_standard, code_of_practice, regulatory_handbook
Values for kind=decision:
judgment, order, advisory_opinion, enforcement_decision, constitutional_review, preliminary_ruling, arbitral_award, tutela, amparo, advance_ruling, administrative_decision
Values for kind=record:
company, person, property, patent, establishment
Values for kind=notice:
insolvency, registration, modification, filing, deregistration, capital_change, merger, liquidation, gazette_publication
tags.binding (optional)
Whether the document creates binding legal obligations.
Values: true, false
tags.binding_scope (optional, when binding=true)
Who is bound.
Values: erga_omnes, administration, inter_partes, applicant_only, sector
tags.legal_tradition (optional)
For jurisdictions with mixed legal systems (UAE, ZA, NG).
Values: civil_law, common_law, sharia, customary, mixed, roman_dutch
Document form
tags.document_form (required for all documents)
Editorial/administrative form of the document. Jurisdiction-agnostic. Distinguishes the authoritative text from its editorial variants, notices, opinions, and corrections. See ADR: document_form tag (design-decisions/2026-04-22-document-form-tag.md) for naming rationale.
Values:
| Value | Description | Examples |
|---|---|---|
| canonical_text | Authoritative full text (default) | Judgment, statute, regulation, collective agreement |
| editorial_summary | Summary by documentalist/editor | CELEX _res, headnote, syllabus, Leitsatz, massima |
| gazette_notice | Publication notice in official gazette | EU OJ-C notice, JORF avis, Federal Register notice |
| case_registration | Notice of new case filing | CELEX CN/TN/FN, certified question (US), ICJ memorial |
| corrigendum | Correction/erratum | JORF rectificatif, EUR-Lex corrigendum |
| consolidated_version | Consolidated text incorporating amendments | EUR-Lex consolidated, UK consolidation act |
| amendment | Amending instrument | Avenant (ACCO/KALI), amending regulation |
| dissenting_opinion | Dissenting opinion by judge(s) | ECHR, ICJ, SCOTUS, BVerfG Sondervotum |
| concurring_opinion | Concurring opinion by judge(s) | ECHR, common law |
| separate_opinion | Separate opinion (neutral) | ICJ Art.57, ECHR |
| declaration | Brief declaration without detailed reasoning | ICJ, ECHR |
| per_curiam | Unsigned opinion attributed to the court | SCOTUS, UKSC |
| memorandum_opinion | Brief opinion without detailed reasoning | US federal courts |
Default: canonical_text when not specified.
Title prefix: when document_form != canonical_text, prefix the title with the English bracketed label: [Editorial Summary], [OJ Notice], [Case Filing], [Corrigendum], [Consolidated], [Amendment], [Dissenting Opinion], [Concurring Opinion], [Separate Opinion], [Declaration], [Per Curiam], [Memorandum].
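The default and title-prefix rules can be sketched as a lookup. The pairing below assumes the labels are listed in the same order as the values in the table; the function name `display_title` and the single space between label and title are assumptions made for illustration.

```python
from typing import Optional

# Label per document_form value, paired in the order both lists are given above.
TITLE_LABELS = {
    "editorial_summary": "[Editorial Summary]",
    "gazette_notice": "[OJ Notice]",
    "case_registration": "[Case Filing]",
    "corrigendum": "[Corrigendum]",
    "consolidated_version": "[Consolidated]",
    "amendment": "[Amendment]",
    "dissenting_opinion": "[Dissenting Opinion]",
    "concurring_opinion": "[Concurring Opinion]",
    "separate_opinion": "[Separate Opinion]",
    "declaration": "[Declaration]",
    "per_curiam": "[Per Curiam]",
    "memorandum_opinion": "[Memorandum]",
}


def display_title(title: str, document_form: Optional[str] = None) -> str:
    """Prefix the title unless the form is (or defaults to) canonical_text."""
    form = document_form or "canonical_text"
    if form == "canonical_text":
        return title
    # Assumption: label and title are separated by one space.
    return f"{TITLE_LABELS[form]} {title}"
```

An unknown document_form raises `KeyError` in this sketch, which keeps vocabulary drift visible rather than silently producing an unprefixed title.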
Temporal metadata
tags.enforcement_status (for kind=legislation with versioning)
Values:
- in_force: currently applicable
- deferred_enforcement: adopted but not yet in force, will be on a future date
- deferred_repeal: currently in force, repeal scheduled for a future date
- repealed: explicitly repealed by a subsequent act
- superseded: historical version, a newer version of the same article exists
- never_in_force: modified before its effective date, never applied
- expired: lapsed by its own terms (sunset clause, fixed-duration text)
- annulled: struck down by a court (e.g. constitutional review), typically retroactive
- transferred: content moved to a different article or location (renumbering)
- denounced: collective agreement repudiated by one of the parties
- disjoined: version split into multiple separate articles
- conditional: in force only under a specific interpretation (constitutional reservation)
- pending: emergency decree or provisional measure awaiting ratification
tags.in_force (boolean shortcut)
Values: true, false
tags.status (for provisional instruments: BR medida provisória, AR DNU)
Values: in_force, pending_ratification, converted, rejected, lapsed
tags.last_modified (optional, ISO date)
Date the source last modified this document editorially. Distinct from date (legal effect date) and from ingestion timestamps. Populated when the source exposes it (BOFiP bodgfip:date_modification, Judilibre update_date, KALI DATE_PUBLI).
tags.date_published (optional, ISO date)
Date the document was officially published (e.g. in the Journal Officiel, Official Gazette, EUR-Lex). Distinct from date (legal effect date), which may be later (deferred enforcement). Populated when the source exposes it and it differs from date.
Non-Gregorian calendars
The promoted date column always holds the ISO date. The original calendar representation is kept in tags.
tags.date_hijri
Islamic (Hijri) calendar date. Format: "YYYY-MM-DD" in Hijri.
Used by: SA, AE, other Middle Eastern jurisdictions.
tags.date_era_jp
Japanese imperial era year. Format: era letter + 2-digit year (e.g., "R06" = Reiwa 6 = 2024).
Used by: JP.
tags.date_roc_tw
Republic of China (Minguo) year. Format: 3-digit year (e.g., "112" = 2023).
Used by: TW.
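The year arithmetic behind the two examples above ("R06" = 2024, "112" = 2023) can be sketched as below. The function names are hypothetical, and the era letters other than R (M, T, S, H) are an assumption: the text only shows R. Hijri dates are left out because converting them requires a full calendar algorithm rather than a fixed offset.

```python
# First Gregorian year of each modern Japanese era, keyed by era letter.
# Only "R" appears in the text above; the other letters are assumed.
ERA_FIRST_YEAR = {"M": 1868, "T": 1912, "S": 1926, "H": 1989, "R": 2019}

# Minguo year 1 is Gregorian 1912, hence the fixed offset.
ROC_OFFSET = 1911


def era_jp_to_gregorian_year(value: str) -> int:
    """Convert e.g. 'R06' to the Gregorian year (era year 1 is the era's first year)."""
    era, era_year = value[0], int(value[1:])
    return ERA_FIRST_YEAR[era] + era_year - 1


def roc_tw_to_gregorian_year(value: str) -> int:
    """Convert e.g. '112' to the Gregorian year."""
    return int(value) + ROC_OFFSET
```

Both helpers return a year only; the full ISO date still comes from the promoted date column.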
Content quality
tags.content_quality (required for all documents)
The quality level of the text in body.
Values (progression order):
- ocr_raw: OCR extraction, errors likely
- ocr_cleaned: OCR cleaned by regex or LLM
- native_raw: native text (PDF text layer, .doc), structure lost
- native_structured: structured source (HTML Legifrance, XML Akoma Ntoso, API JSON)
- machine_reviewed: reviewed/corrected by LLM
- human_reviewed: reviewed by a human
- jurist_reviewed: validated by a jurist (highest confidence)
Only the current level is stored. History is not tracked.
Critical for:
- Knowledge graph trust chains (annotations on ocr_raw docs have lower confidence)
- User-facing warnings ("this text may contain OCR errors")
- Quality pipeline prioritization
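Because the levels form a progression, consumers can compare them by position. The sketch below is illustrative: the function names are hypothetical, and the choice to show the OCR warning for both ocr_raw and ocr_cleaned is an assumption, not something the text specifies.

```python
# Levels in the progression order listed above, lowest first.
QUALITY_ORDER = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]


def at_least(level: str, minimum: str) -> bool:
    """True when `level` is at or above `minimum` in the progression."""
    return QUALITY_ORDER.index(level) >= QUALITY_ORDER.index(minimum)


def needs_ocr_warning(level: str) -> bool:
    """Assumption: warn for every level that still derives from OCR output."""
    return level in ("ocr_raw", "ocr_cleaned")
```

A quality pipeline could, for instance, select documents where `at_least(level, "native_raw")` is false to prioritize OCR cleanup.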
Sub-jurisdiction
tags.sub_jurisdiction
For legal enclaves or devolved entities within a jurisdiction that have their own legal system.
Examples: difc, adgm (UAE free zones with English common law courts).
The jurisdiction column carries the country; this tag carries the enclave.
Internal tags
Tags prefixed with _ are internal: never displayed to users, excluded from discover mode, excluded from tag_stats. Used for computed values needed by the search engine. Populated at ingestion time by jurisdiction plugins.
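A display or aggregation layer can apply the underscore rule with a simple key filter. This is a sketch; `public_tags` is a hypothetical helper name.

```python
def public_tags(tags: dict) -> dict:
    """Drop internal tags (keys starting with an underscore) before display or stats."""
    return {key: value for key, value in tags.items() if not key.startswith("_")}
```

For example, a document carrying both `importance_level` and `_importance_level_default` would expose only the former.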
Quality flags
tags.need_fixing
Marker that the document has a known data quality issue that should be revisited later. The value is the category of the problem so we can group records by what needs to be done:
- date: one or more date columns were nullified at ingest time because the source published an aberrant value (DILA sentinels like 2999-01-01 in the wrong column, INPI 9999-12-31, manual typos like 5489-12-30 or 3023-04-03). The original raw value is lost; recovering it requires re-fetching from the source. Used by duralex.ingest.date_validation.validate_date.
The tag is intentionally a single string and not an array — one flag per document is enough for now. If we ever need to track multiple issue categories on the same document, switch to an array.
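The nullify-and-flag behavior described for the date category might look like the sketch below. It does not reproduce duralex.ingest.date_validation.validate_date, whose signature is not documented here: the function name, the plausibility window, and the treatment of unparseable values are all assumptions.

```python
from datetime import date
from typing import Optional, Tuple

# Plausibility window: an assumption for illustration, not the project's bounds.
MIN_YEAR, MAX_YEAR = 1000, 2200


def sanitize_date_sketch(raw: Optional[str]) -> Tuple[Optional[str], bool]:
    """Return (value, flagged). Aberrant values become None with flagged=True."""
    if raw is None:
        return None, False
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        # Assumption: unparseable values are treated like aberrant ones.
        return None, True
    if not MIN_YEAR <= parsed.year <= MAX_YEAR:
        return None, True
    return raw, False


# Usage: nullify the column and record the issue category on the document.
tags = {}
value, flagged = sanitize_date_sketch("2999-01-01")
if flagged:
    tags["need_fixing"] = "date"
```

Since the tag is a single string, flagging a second category on the same document would overwrite the first, which is the limitation noted above.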
Precedent, publication, and importance
tags.is_precedent (boolean)
Formally designated as binding precedent.
Used by: FI (ennakkopäätös), PL (uchwala zasada prawna), CN (guiding case).
tags.official_grade (string, kind=decision)
[edit | edit source]Official publication/importance grade assigned by the jurisdiction. Values are jurisdiction-specific, documented per plugin. Absent when the source has no grading system.
The grade is relative to the issuing court — a high grade from a first-instance court is not equivalent to the same grade from a supreme court. The real importance depends on the (court, grade) pair.
Known values by jurisdiction:
| Jurisdiction | Source | Values (highest → lowest) |
|---|---|---|
| FR admin (CE, TdC) | JADE PUBLI_RECUEIL | A, B |
| FR admin (cour_administrative_appel, tribunal_administratif) | JADE PUBLI_RECUEIL, CE opendata Code_Publication | R, C+, C, D, Z |
| FR cass (post-2021) | Judilibre publication | rapport, bulletin, diffuse, non_diffuse |
| FR cass (pre-2021) | Judilibre publication, DILA XML | rapport, bulletin, bulletin_information, internet, diffuse |
| ECHR | HUDOC importance | 1, 2, 3, 4 |
| CJEU | CELLAR erecueil + CELEX sector | ecr_grand_chamber, ecr_chamber, oj_only, unpublished |
| DE | juris Dokumenttyp | amtlicher_leitsatz, orientierungssatz, redaktioneller_leitsatz |
| CH | bger.ch | atf, atf_partial, online_only |
| US | CourtListener | precedential, non_precedential |
| CN | SPC database | guiding, reference, typical, gazette, ordinary |
| MX | SJF | jurisprudencia, precedente, tesis_aislada |
| IT | CED | sezioni_unite_massima, massima, no_massima |
| UK | ICLR | law_reports, wlr_2_3, wlr_1, wlr_4, all_er, unreported |
| FR fond (cour_appel, tribunal_judiciaire, tribunal_commerce) | Judilibre particularInterest | particular_interest |
| FR financial (CdC, CRC, CDBF, CAF) | Légifrance publicationRecueil | recueil |
tags.importance_level (string, computed)
Harmonized importance score derived from official_grade by jurisdiction plugins. This is NOT an official classification: it is computed by Dura Lex for cross-jurisdiction search and FTS ranking. Applies to kind=decision and kind=legislation (for authoritative guidance like BOFiP fiscal doctrine). Absent when no importance signal is available. BOFiP documents are always high_importance (no per-document grading from the source).
Values (ascending importance): minimal_importance, low_importance, medium_importance, high_importance, highest_importance.
tags._importance_level_default (string, internal)
Internal tag for FTS ranking, never displayed. Always set for documents that have importance signals (kind=decision, kind=legislation with authoritative guidance). Equals importance_level when available, else derived from court_level:
- supreme, constitutional, supranational → medium_importance
- appellate → low_importance
- first_instance → minimal_importance
- court_level absent or null → unknown_importance
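The fallback rule above can be sketched as a dictionary lookup. The function name is hypothetical, and mapping an unrecognized court_level value to unknown_importance is an assumption; the text only covers the absent or null case.

```python
# Fallback mapping from court_level, as listed above.
COURT_LEVEL_DEFAULT = {
    "supreme": "medium_importance",
    "constitutional": "medium_importance",
    "supranational": "medium_importance",
    "appellate": "low_importance",
    "first_instance": "minimal_importance",
}


def importance_level_default(tags: dict) -> str:
    """Prefer importance_level; otherwise derive a default from court_level."""
    level = tags.get("importance_level")
    if level:
        return level
    # Absent, null, or (by assumption) unrecognized court_level falls through.
    return COURT_LEVEL_DEFAULT.get(tags.get("court_level"), "unknown_importance")
```

A plugin would store the result under `_importance_level_default` at ingestion time.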
tags.formation_solemnity (string, computed, kind=decision)
Standardized bench type derived from formation by jurisdiction plugins. This is NOT an official classification: it is computed by Dura Lex for cross-jurisdiction comparison and display. Absent when formation is unknown.
Values (ascending solemnity): single_judge, reduced_bench, standard_bench, combined_chambers, grand_bench, full_court.
tags.court_level (string, kind=decision)
Position of the issuing court in the judicial hierarchy. Cross-jurisdiction handle for filtering by instance. Distinct from court (the name) and from formation_solemnity (the bench that heard the case). Null (absent) for non-judicial authorities (CNIL, CADA, AMF, Défenseur des droits) and for sui generis bodies that don't fit the hierarchy (Tribunal des conflits).
Values:
- first_instance: trial courts (FR tribunal judiciaire, tribunal de commerce, tribunal administratif, conseil de prud'hommes; EU General Court, Civil Service Tribunal)
- appellate: appeal courts (FR cour d'appel, cour administrative d'appel, tribunal supérieur d'appel)
- supreme: apex ordinary courts (FR Cour de cassation, Conseil d'État)
- constitutional: constitutional review bodies (FR Conseil constitutionnel, DE BVerfG, IT Corte Costituzionale)
- supranational: international/supranational courts (CJEU Court of Justice, ECHR, ICJ)
See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md) for the per-jurisdiction mapping rationale.
Collective bargaining
tags.bargaining_level (string, kind=legislation, type=collective_agreement)
Level at which a collective agreement was negotiated. ILO/OECD standard terminology. Applies to collective agreements and amendments (avenants).
Values:
- enterprise: company-level (FR entreprise, including group and establishment agreements)
- sectoral: industry/branch-level (FR branche)
- inter_sectoral: cross-industry, national (FR interprofessionnel, e.g. ANI)
- territorial: geographically scoped (regional, departmental)
Renames the former FR-specific level key.
Administrative subclassification
tags.subcategory (string, optional)
Subclassification finer than type. T2: the key is shared across jurisdictions, while value vocabularies are controlled per plugin and listed in that plugin's documentation. Used primarily on kind=notice (BODACC event types) and on records (RCS filing subtypes). Values are English, lowercase, and slugified; they are translated from source classifications so they are portable across tools.
Per-plugin value lists live in duralex-ingest-fr/docs/FR-TAGS.md and equivalents.
Source-native classifications (T3)
Some tag keys hold values that mirror the source's own taxonomy rather than a Dura Lex enum. These keys are documented per plugin, not here, and are intended for intra-jurisdiction precision filters. Values are slugified at ingest time (lowercase, no accents, spaces → underscores) so that queries remain deterministic.
tags.nature (string)
Source-native document classification (e.g. FR arret, ordonnance, loi_organique, qpc). This is a T3 source-fidelity tag, so values are plugin-specific. For cross-jurisdiction search, prefer type and document_form; use nature for jurisdiction-local precision (e.g. distinguishing QPC from an ordinary Conseil constitutionnel decision).
Slugification (accent removal + space → underscore) is applied centrally at ingest time by duralex.ingest.tag_normalization.normalize_tag_value.
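The slugification rule (lowercase, no accents, spaces → underscores) can be sketched with the standard library. This is not the body of normalize_tag_value, which is not shown here; the function name is hypothetical, and collapsing runs of whitespace and trimming the ends are assumptions.

```python
import unicodedata


def slugify_tag_value(value: str) -> str:
    """Lowercase, strip accents, and join whitespace-separated words with underscores."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption: runs of whitespace collapse and leading/trailing space is dropped.
    return "_".join(without_accents.lower().split())
```

Under these assumptions, "Arrêt" becomes "arret" and "Loi organique" becomes "loi_organique", matching the example values given for tags.nature.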
Known tech debt: EU nature values are French-origin slugs because CJEU ingestion sources French documents first. See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md).
Legal branch
tags.legal_branch (array, kind=legislation only)
Branch(es) of law. Populated at ingest from a deterministic code→branch mapping. For decisions, LLM-based enrichment is planned (future).
Values: civil, criminal, administrative, commercial, social, tax, constitutional, environmental, consumer, ip, public_procurement, health, family, real_estate, digital
Translation
tags.translation_of
ID of the original document this is a translation of. Present only on translated documents, not on originals.
tags.translation_quality
Quality of the translation. Present only on documents that are translations or language variants.
Values:
- official: equally authoritative version (EU texts in 24 languages, BE/FI/CH bilingual laws). No single original exists.
- official_translation: official translation without equal legal force
- machine_translated: LLM translation, not verified
- human_reviewed: translation reviewed by a bilingual human
tags.translation_pending
ISO 639-1 code of the target language for which a translation is pending. Set on documents whose original language differs from the user's primary language and that have not yet been translated. Removed once a language_variant edge is created linking to the translated version.
Used by the LLM enrichment script (enrich_translations.py) to find candidates for machine translation:
<syntaxhighlight lang="sql">
-- EU documents still awaiting a French translation
SELECT id, body, body_search FROM corpus.documents
WHERE jurisdiction = 'eu' AND tags->>'translation_pending' = 'fr';
</syntaxhighlight>
ID convention for language variants
- EU texts (all versions equally authoritative): every version has a language suffix. No version is "the original."
    eu.celex:32016r0679:fr    language=fr, tags.translation_quality=official
    eu.celex:32016r0679:en    language=en, tags.translation_quality=official
    eu.celex:32016r0679:de    language=de, tags.translation_quality=official
- National texts with translations (one original, others are translations): the original has no suffix. Translations have a :lang suffix and tags.translation_of pointing to the original.
    fr.legiarti000006902764       language=fr (original, no suffix)
    fr.legiarti000006902764:en    language=en, tags.translation_of=fr.legiarti000006902764
All language variants are linked by language_variant edges. The edge target is chosen by convention (alphabetical or first ingested) — no version is inherently canonical for equally authoritative texts.
Identifiers
tags.text_id (kind=legislation and chunk)
Cross-language canonical identifier of the underlying legal work. Equal across all language variants of the same text: it does NOT carry the :lang suffix that the document id column carries. Used by reference resolvers to find a text regardless of language.
Example:
    id = "eu.eurlextext32016r0679:fr"    tags.text_id = "eu.eurlextext32016r0679"
    id = "eu.eurlextext32016r0679:en"    tags.text_id = "eu.eurlextext32016r0679"
Both rows match a search by text_id = "eu.eurlextext32016r0679". The search engine then narrows by user language via the language filter on TagQuery.
For articles and sections, tags.text_id points to the parent text (also without :lang suffix), enabling cross-language navigation: an article in the EN version of GDPR has the same text_id as the corresponding article in the FR version.
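For a text-level document, the relation between id and text_id in the example above amounts to dropping the trailing language suffix. The sketch below assumes the suffix is always a colon followed by a two-letter lowercase code; `strip_lang_suffix` is a hypothetical name. It does not apply to articles and sections, whose text_id points to the parent text and cannot be derived from their own id.

```python
import re

# Assumption: the language suffix is ":" plus a two-letter lowercase code,
# and it only ever appears at the very end of the id.
_LANG_SUFFIX = re.compile(r":[a-z]{2}$")


def strip_lang_suffix(doc_id: str) -> str:
    """Return the id without its trailing :lang suffix, if it has one."""
    return _LANG_SUFFIX.sub("", doc_id)
```

Ids that contain a colon elsewhere keep it, because only a suffix at the end of the string is removed; ids without a suffix are returned unchanged.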
tags.eli
ELI (European Legislation Identifier) URI when provided by the source. Not all jurisdictions support ELI.
tags.celex
CELEX identifier for EU documents (EUR-Lex primary key). Format: sector digit + 4-digit year + type letter(s) + ordinal (e.g. 32016R0679 for GDPR). Redundant with the document ID (EU.CELEX:32016R0679) but useful for tag-based filtering and discovery.
tags.ecli (kind=decision)
ECLI (European Case Law Identifier). Stored upper-case with whitespace stripped (ECLIs are case-insensitive per the EU spec but published mixed-case across sources). Normalization happens at ingest time via duralex.ingest.tag_normalization.normalize_ecli so dedup queries can match cross-source via a single partial expression index.
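The stated normalization (upper-case, whitespace stripped) fits in one line. This sketch is not the actual normalize_ecli implementation, which may handle more cases; the function name is hypothetical.

```python
def normalize_ecli_sketch(value: str) -> str:
    """Remove all whitespace and upper-case the identifier."""
    return "".join(value.split()).upper()
```

Two sources publishing the same ECLI with different casing or stray spaces then produce the same stored value, which is what lets a single index serve dedup lookups.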
tags.case_number (array, kind=decision)
Display form of the case number(s), as published by the source. Multiple aliases allowed. Used for human-readable presentation only.
tags.case_number_normalized (array, kind=decision)
Indexable form of the case number(s) for cross-source dedup matching. Built at ingest time by duralex.ingest.tag_normalization.normalize_case_number_list which strips dots and whitespace while preserving dashes, slashes and letters (matches the historical replace(., '.', ) semantics so we don't introduce cross-court false positives like CASS 22-13456 colliding with CAPP 22/13456). Backed by partial GIN jsonb_path_ops index idx_doc_decision_case_number_normalized for @> containment queries.
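The strip-dots-and-whitespace rule can be sketched as below. This is not the real normalize_case_number_list: the function names are hypothetical, and dropping empty results and duplicates while keeping first-seen order is an assumption.

```python
import re
from typing import Iterable, List

# Dots and whitespace are removed; dashes, slashes, letters and digits survive.
_DOTS_AND_SPACES = re.compile(r"[.\s]+")


def normalize_case_number_sketch(value: str) -> str:
    """Strip dots and whitespace from a single case number."""
    return _DOTS_AND_SPACES.sub("", value)


def normalize_case_number_list_sketch(values: Iterable[str]) -> List[str]:
    """Normalize each alias; drop empties and duplicates (assumption), keep order."""
    result: List[str] = []
    for value in values:
        normalized = normalize_case_number_sketch(value)
        if normalized and normalized not in result:
            result.append(normalized)
    return result
```

Because the dash and slash are preserved, "22-13.456" normalizes to "22-13456" while "22/13456" stays as it is, so the two never collide in a containment query.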
Original document reference
tags.original_url
URL or path to the original document (PDF in external storage).
tags.original_format
Original format: pdf, xml, html, json, docx, odt
Document structure (for kind=chunk)
tags.part
The structural section of the parent document this chunk represents.
Values: visa, facts, moyens, reasoning, ruling, ag_opinion, conclusie, dispositif
tags.position
Ordinal position within parent (for ordering chunks and sections).
Article versioning
tags.cid
Permanent article identity: groups temporal versions of the same article across renumbering. Corresponds to LEGI XML CID in France; other jurisdictions provide equivalent identifiers (e.g., BWB number in NL, SFS number in SE).
Used by: the compiler to group versions, the at_date TagQuery to select the correct temporal version, the knowledge graph to link concepts to articles.
See TEMPORAL for temporal versioning details.
Structure type (for kind=section)
tags.structure_type
Differentiates navigation trees.
Values: legislation, labor, doctrine, official_journal (and future jurisdiction-specific values)