Corpus/Tag conventions

From Dura Lex Wiki

Tag Conventions

Tags are the primary mechanism for jurisdiction-specific metadata. They live in the tags JSONB column on corpus.documents. This document defines the shared vocabulary — tag keys that have the same meaning across jurisdictions.

Jurisdiction-specific tags (e.g., idcc for French labor conventions, foral for Spanish Basque tax law) are documented in each jurisdiction's plugin, not here.

General rules

  • Enum / classification tag values MUST be lowercase. Values from controlled vocabularies (type, binding, enforcement_status, content_quality, importance_level, legal_branch, etc.) are always lowercase: no mixed case, no uppercase. This is enforced at ingest time. It keeps matching deterministic and avoids case-sensitivity bugs in filters, discover mode, and tag_stats aggregation.
  • Data values are stored as-is, queried case-insensitively. Values that represent real-world data — company names (denominations like "DANONE"), legal form labels ("SAS", "SARL"), role labels ("Président"), tribunal names, case numbers, ECLI identifiers, NOR codes — are stored in their original case. The MCP search tools already lowercase tag values at query time for case-insensitive matching.

tags.type (required)

The legal nature of the document. This does the work that the old per-table schema did.

Values for kind=legislation:

statute, decree, regulation, enforcement_decree, ordinance, order,
collective_agreement, guidance, circular, manual, instruction,
fatwa, tesis, sumula_vinculante, tsūtatsu, judicial_interpretation,
dictamen, acordada, auto_acordado, normative_document,
inquiry_report, bill, debate, question, committee_report,
constitutional_amendment, provisional_measure, technical_standard,
code_of_practice, regulatory_handbook

Values for kind=decision:

judgment, order, advisory_opinion, enforcement_decision,
constitutional_review, preliminary_ruling, arbitral_award,
tutela, amparo, advance_ruling, administrative_decision

Values for kind=record:

company, person, property, patent, establishment

Values for kind=notice:

insolvency, registration, modification, filing, deregistration,
capital_change, merger, liquidation, gazette_publication

tags.binding (optional)

Whether the document creates binding legal obligations.

Values: true, false

tags.binding_scope (optional, when binding=true)

Who is bound.

Values: erga_omnes, administration, inter_partes, applicant_only, sector

tags.legal_tradition (optional)

For jurisdictions with mixed legal systems (UAE, ZA, NG).

Values: civil_law, common_law, sharia, customary, mixed, roman_dutch


Document form

tags.document_form (required for all documents)

Editorial/administrative form of the document. Jurisdiction-agnostic. Distinguishes the authoritative text from its editorial variants, notices, opinions, and corrections. See ADR: document_form tag (design-decisions/2026-04-22-document-form-tag.md) for naming rationale.

Values:

Value | Description | Examples
------|-------------|---------
canonical_text | Authoritative full text (default) | Judgment, statute, regulation, collective agreement
editorial_summary | Summary by documentalist/editor | CELEX _res, headnote, syllabus, Leitsatz, massima
gazette_notice | Publication notice in official gazette | EU OJ-C notice, JORF avis, Federal Register notice
case_registration | Notice of new case filing | CELEX CN/TN/FN, certified question (US), ICJ memorial
corrigendum | Correction/erratum | JORF rectificatif, EUR-Lex corrigendum
consolidated_version | Consolidated text incorporating amendments | EUR-Lex consolidated, UK consolidation act
amendment | Amending instrument | Avenant (ACCO/KALI), amending regulation
dissenting_opinion | Dissenting opinion by judge(s) | ECHR, ICJ, SCOTUS, BVerfG Sondervotum
concurring_opinion | Concurring opinion by judge(s) | ECHR, common law
separate_opinion | Separate opinion (neutral) | ICJ Art.57, ECHR
declaration | Brief declaration without detailed reasoning | ICJ, ECHR
per_curiam | Unsigned opinion attributed to the court | SCOTUS, UKSC
memorandum_opinion | Brief opinion without detailed reasoning | US federal courts

Default: canonical_text when not specified.

Title prefix: when document_form != canonical_text, prefix the title with the English bracketed label: [Editorial Summary], [OJ Notice], [Case Filing], [Corrigendum], [Consolidated], [Amendment], [Dissenting Opinion], [Concurring Opinion], [Separate Opinion], [Declaration], [Per Curiam], [Memorandum].
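
The title-prefix rule can be sketched as a lookup table. This is only an illustration: the helper name prefixed_title is hypothetical, the pairing of each label with its value is inferred from the order in which both lists are given, and the single space between label and title is an assumption.

```python
# Labels paired with document_form values by list order (an inference,
# not a mapping stated explicitly in this page).
TITLE_PREFIXES = {
    "editorial_summary": "[Editorial Summary]",
    "gazette_notice": "[OJ Notice]",
    "case_registration": "[Case Filing]",
    "corrigendum": "[Corrigendum]",
    "consolidated_version": "[Consolidated]",
    "amendment": "[Amendment]",
    "dissenting_opinion": "[Dissenting Opinion]",
    "concurring_opinion": "[Concurring Opinion]",
    "separate_opinion": "[Separate Opinion]",
    "declaration": "[Declaration]",
    "per_curiam": "[Per Curiam]",
    "memorandum_opinion": "[Memorandum]",
}


def prefixed_title(title: str, tags: dict) -> str:
    """Hypothetical helper: prepend the label for non-canonical forms."""
    # A missing tag is treated as canonical_text, per the default rule.
    form = tags.get("document_form", "canonical_text")
    prefix = TITLE_PREFIXES.get(form)
    # Assumption: label and title are separated by one space.
    return f"{prefix} {title}" if prefix else title
```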


Temporal metadata

tags.enforcement_status (for kind=legislation with versioning)

Values:

  • in_force — currently applicable
  • deferred_enforcement — adopted but not yet in force, will be on a future date
  • deferred_repeal — currently in force, repeal scheduled for a future date
  • repealed — explicitly repealed by a subsequent act
  • superseded — historical version, a newer version of the same article exists
  • never_in_force — modified before its effective date, never applied
  • expired — lapsed by its own terms (sunset clause, fixed-duration text)
  • annulled — struck down by a court (e.g. constitutional review) — typically retroactive
  • transferred — content moved to a different article or location (renumbering)
  • denounced — collective agreement repudiated by one of the parties
  • disjoined — version split into multiple separate articles
  • conditional — in force only under a specific interpretation (constitutional reservation)
  • pending — emergency decree or provisional measure awaiting ratification

tags.in_force (boolean shortcut)

Values: true, false

tags.status (for provisional instruments: BR medida provisoria, AR DNU)

Values: in_force, pending_ratification, converted, rejected, lapsed

tags.last_modified (optional, ISO date)

Date the source last modified this document editorially. Distinct from date (legal effect date) and from ingestion timestamps. Populated when the source exposes it (BOFiP bodgfip:date_modification, Judilibre update_date, KALI DATE_PUBLI).

tags.date_published (optional, ISO date)

Date the document was officially published (e.g. in the Journal Officiel, Official Gazette, EUR-Lex). Distinct from date (legal effect date) which may be later (deferred enforcement). Populated when source exposes it and it differs from date.


Non-Gregorian calendars

The promoted date column always holds the ISO date. The original calendar representation is kept in tags.

tags.date_hijri

Islamic (Hijri) calendar date. Format: "YYYY-MM-DD" in Hijri.

Used by: SA, AE, other Middle Eastern jurisdictions.

tags.date_era_jp

Japanese imperial era year. Format: era letter + 2-digit year (e.g., "R06" = Reiwa 6 = 2024).

Used by: JP.

tags.date_roc_tw

Republic of China (Minguo) year. Format: 3-digit year (e.g., "112" = 2023).

Used by: TW.
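
As a rough illustration of how the era-based formats relate to the ISO year, the sketch below converts date_era_jp and date_roc_tw values to Gregorian years. The function names are hypothetical, and only the letter "R" appears in this page; the other era letters (M, T, S, H) are an assumption about how earlier eras would be encoded.

```python
# Gregorian year in which each era's year 1 falls.
# Only "R" is documented above; the other letters are assumed.
ERA_START = {"M": 1868, "T": 1912, "S": 1926, "H": 1989, "R": 2019}


def era_jp_to_gregorian_year(value: str) -> int:
    """Convert an era letter + 2-digit year (e.g. "R06") to a Gregorian year."""
    era, year = value[0], int(value[1:])
    return ERA_START[era] + year - 1


def roc_tw_to_gregorian_year(value: str) -> int:
    """Convert a Minguo year (e.g. "112") to a Gregorian year."""
    return int(value) + 1911


print(era_jp_to_gregorian_year("R06"))  # 2024
print(roc_tw_to_gregorian_year("112"))  # 2023
```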


Content quality

tags.content_quality (required for all documents)

The quality level of the text in body.

Values (progression order):

ocr_raw              — OCR extraction, errors likely
ocr_cleaned          — OCR cleaned by regex or LLM
native_raw           — native text (PDF text layer, .doc), structure lost
native_structured    — structured source (HTML Legifrance, XML Akoma Ntoso, API JSON)
machine_reviewed     — reviewed/corrected by LLM
human_reviewed       — reviewed by a human
jurist_reviewed      — validated by a jurist (highest confidence)

Only the current level is stored. History is not tracked.

Critical for:

  • Knowledge graph trust chains (annotations on ocr_raw docs have lower confidence)
  • User-facing warnings ("this text may contain OCR errors")
  • Quality pipeline prioritization
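
Because the levels form a progression, consumers can compare them by rank. The sketch below shows one way to do that. Both helper names are hypothetical, and treating ocr_cleaned as still warranting an OCR warning is an assumption, not a rule stated here.

```python
# Levels in the progression order listed above, lowest confidence first.
QUALITY_ORDER = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]
QUALITY_RANK = {level: rank for rank, level in enumerate(QUALITY_ORDER)}


def quality_at_least(tags: dict, minimum: str) -> bool:
    """Hypothetical helper: True if the document's level is >= minimum."""
    return QUALITY_RANK[tags["content_quality"]] >= QUALITY_RANK[minimum]


def may_contain_ocr_errors(tags: dict) -> bool:
    # Assumption: both OCR-derived levels trigger the user-facing warning.
    return tags["content_quality"] in ("ocr_raw", "ocr_cleaned")
```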

Sub-jurisdiction

tags.sub_jurisdiction

For legal enclaves or devolved entities within a jurisdiction that have their own legal system.

Examples: difc, adgm (UAE free zones with English common law courts).

The jurisdiction column carries the country; this tag carries the enclave.


Internal tags

Tags prefixed with _ are internal: never displayed to users, excluded from discover mode, excluded from tag_stats. Used for computed values needed by the search engine. Populated at ingestion time by jurisdiction plugins.
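
A display layer could apply the underscore rule with a one-line filter. The helper name public_tags is hypothetical; the sketch only illustrates the prefix convention.

```python
def public_tags(tags: dict) -> dict:
    """Hypothetical helper: drop internal tags (keys starting with "_")."""
    return {key: value for key, value in tags.items() if not key.startswith("_")}


doc_tags = {"type": "judgment", "_importance_level_default": "low_importance"}
print(public_tags(doc_tags))  # {'type': 'judgment'}
```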


Quality flags

tags.need_fixing

Marker that the document has a known data quality issue that should be revisited later. The value is the category of the problem so we can group records by what needs to be done:

  • date — one or more date columns were nullified at ingest time because the source published an aberrant value (DILA sentinels like 2999-01-01 in the wrong column, INPI 9999-12-31, manual typos like 5489-12-30 or 3023-04-03). The original raw value is lost; recovering it requires re-fetching from the source. Used by duralex.ingest.date_validation.validate_date.

The tag is intentionally a single string and not an array — one flag per document is enough for now. If we ever need to track multiple issue categories on the same document, switch to an array.
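
The nullify-and-flag behaviour can be sketched as follows. This is not the actual duralex.ingest.date_validation.validate_date, whose signature and criteria are not given in this page: the function name, the in-place tag update, and the 100-year threshold for "aberrant" are all assumptions chosen so that the sentinel examples above are caught.

```python
from datetime import date
from typing import Optional

# Assumed threshold: anything this far in the future is treated as aberrant.
MAX_YEARS_AHEAD = 100


def nullify_aberrant_date(value: Optional[date], tags: dict, today: date) -> Optional[date]:
    """Hypothetical sketch: return None and flag the document if the date is aberrant."""
    if value is not None and value.year > today.year + MAX_YEARS_AHEAD:
        tags["need_fixing"] = "date"  # single string, one flag per document
        return None
    return value
```

For example, with today set to 2026-01-01, date(2999, 1, 1) is replaced by None and the document's tags gain need_fixing = "date", while date(2020, 5, 1) passes through unchanged.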


Precedent, publication, and importance

tags.is_precedent (boolean)

Formally designated as binding precedent.

Used by: FI (ennakkopäätös), PL (uchwala zasada prawna), CN (guiding case).

tags.official_grade (string, kind=decision)

Official publication/importance grade assigned by the jurisdiction. Values are jurisdiction-specific, documented per plugin. Absent when the source has no grading system.

The grade is relative to the issuing court — a high grade from a first-instance court is not equivalent to the same grade from a supreme court. The real importance depends on the (court, grade) pair.

Known values by jurisdiction:

Jurisdiction | Source | Values (highest → lowest)
-------------|--------|--------------------------
FR admin (CE, TdC) | JADE PUBLI_RECUEIL | A, B
FR admin (cour_administrative_appel, tribunal_administratif) | JADE PUBLI_RECUEIL, CE opendata Code_Publication | R, C+, C, D, Z
FR cass (post-2021) | Judilibre publication | rapport, bulletin, diffuse, non_diffuse
FR cass (pre-2021) | Judilibre publication, DILA XML | rapport, bulletin, bulletin_information, internet, diffuse
ECHR | HUDOC importance | 1, 2, 3, 4
CJEU | CELLAR erecueil + CELEX sector | ecr_grand_chamber, ecr_chamber, oj_only, unpublished
DE | juris Dokumenttyp | amtlicher_leitsatz, orientierungssatz, redaktioneller_leitsatz
CH | bger.ch | atf, atf_partial, online_only
US | CourtListener | precedential, non_precedential
CN | SPC database | guiding, reference, typical, gazette, ordinary
MX | SJF | jurisprudencia, precedente, tesis_aislada
IT | CED | sezioni_unite_massima, massima, no_massima
UK | ICLR | law_reports, wlr_2_3, wlr_1, wlr_4, all_er, unreported
FR fond (cour_appel, tribunal_judiciaire, tribunal_commerce) | Judilibre particularInterest | particular_interest
FR financial (CdC, CRC, CDBF, CAF) | Légifrance publicationRecueil | recueil

tags.importance_level (string, computed)

Harmonized importance score derived from official_grade by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction search and FTS ranking. Applies to kind=decision and kind=legislation (for authoritative guidance like BOFiP fiscal doctrine). Absent when no importance signal is available. BOFiP documents are always high_importance (no per-document grading from the source).

Values (ascending importance): minimal_importance, low_importance, medium_importance, high_importance, highest_importance.

tags._importance_level_default (string, internal)

Internal tag for FTS ranking, never displayed. Always set for documents that have importance signals (kind=decision, kind=legislation with authoritative guidance). Equals importance_level when available, else derived from court_level:

  • supreme, constitutional, supranational → medium_importance
  • appellate → low_importance
  • first_instance → minimal_importance
  • court_level absent or null → unknown_importance
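
The fallback rule above can be sketched as a small function. The function name is hypothetical, and mapping an unrecognized court_level value to unknown_importance is an assumption (the rule only covers an absent or null value).

```python
# Fallback from court_level, as listed above.
COURT_LEVEL_DEFAULT = {
    "supreme": "medium_importance",
    "constitutional": "medium_importance",
    "supranational": "medium_importance",
    "appellate": "low_importance",
    "first_instance": "minimal_importance",
}


def importance_level_default(tags: dict) -> str:
    """Hypothetical helper: compute _importance_level_default for a document."""
    if tags.get("importance_level"):
        return tags["importance_level"]
    # Absent, null, or (by assumption) unrecognized court_level.
    return COURT_LEVEL_DEFAULT.get(tags.get("court_level"), "unknown_importance")
```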

tags.formation_solemnity (string, computed, kind=decision)

Standardized bench type derived from formation by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction comparison and display. Absent when formation is unknown.

Values (ascending solemnity): single_judge, reduced_bench, standard_bench, combined_chambers, grand_bench, full_court.

tags.court_level (string, kind=decision)

Position of the issuing court in the judicial hierarchy. Cross-jurisdiction handle for filtering by instance. Distinct from court (the name) and from formation_solemnity (the bench that heard the case). Null (absent) for non-judicial authorities (CNIL, CADA, AMF, Défenseur des droits) and for sui generis bodies that don't fit the hierarchy (Tribunal des conflits).

Values:

  • first_instance — trial courts (FR tribunal judiciaire, tribunal de commerce, tribunal administratif, conseil de prud'hommes; EU General Court, Civil Service Tribunal)
  • appellate — appeal courts (FR cour d'appel, cour administrative d'appel, tribunal supérieur d'appel)
  • supreme — apex ordinary courts (FR Cour de cassation, Conseil d'État)
  • constitutional — constitutional review bodies (FR Conseil constitutionnel, DE BVerfG, IT Corte Costituzionale)
  • supranational — international/supranational courts (CJEU Court of Justice, ECHR, ICJ)

See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md) for the per-jurisdiction mapping rationale.


Collective bargaining

tags.bargaining_level (string, kind=legislation, type=collective_agreement)

Level at which a collective agreement was negotiated. ILO/OECD standard terminology. Applies to collective agreements and amendments (avenants).

Values:

  • enterprise — company-level (FR entreprise, including group and establishment agreements)
  • sectoral — industry/branch-level (FR branche)
  • inter_sectoral — cross-industry, national (FR interprofessionnel, e.g. ANI)
  • territorial — geographically scoped (regional, departmental)

Renames the former FR-specific level key.


Administrative subclassification

tags.subcategory (string, optional)

Subclassification finer than type. T2: key is shared across jurisdictions, value vocabularies are controlled per plugin and listed in that plugin's documentation. Used primarily on kind=notice (BODACC event types) and on records (RCS filing subtypes). Values are English lowercase, slugified — translated from source classifications so they are portable across tools.

Per-plugin value lists live in duralex-ingest-fr/docs/FR-TAGS.md and equivalents.


Source-native classifications (T3)

Some tag keys hold values that mirror the source's own taxonomy rather than a Dura Lex enum. These keys are documented per plugin, not here, and are intended for intra-jurisdiction precision filters. Values are slugified at ingest time (lowercase, no accents, spaces → underscores) so that queries remain deterministic.

tags.nature (string)

Source-native document classification (e.g. FR arret, ordonnance, loi_organique, qpc). T3 source-fidelity — values are plugin-specific. For cross-jurisdiction search, prefer type and document_form; use nature for jurisdiction-local precision (e.g. distinguishing QPC from an ordinary Conseil constitutionnel decision).

Slugification (accent removal + space → underscore) is applied centrally at ingest time by duralex.ingest.tag_normalization.normalize_tag_value.
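
A minimal sketch of the slugification described above (lowercase, accents removed, spaces turned into underscores), using only the standard library. It is not the real normalize_tag_value, whose exact behaviour is not documented here; the function name slugify_tag_value is hypothetical.

```python
import unicodedata


def slugify_tag_value(value: str) -> str:
    """Hypothetical sketch: lowercase, strip accents, spaces -> underscores."""
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower().replace(" ", "_")


print(slugify_tag_value("Arrêt"))          # arret
print(slugify_tag_value("Loi organique"))  # loi_organique
```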

Known tech debt: EU nature values are French-origin slugs because CJEU ingestion sources French documents first. See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md).


tags.legal_branch (array, kind=legislation only)

Branch(es) of law. For legislation, populated at ingest from a deterministic code→branch mapping. For decisions, LLM-based enrichment is planned but not yet available.

Values: civil, criminal, administrative, commercial, social, tax, constitutional, environmental, consumer, ip, public_procurement, health, family, real_estate, digital


Translation

tags.translation_of

ID of the original document this is a translation of. Present only on translated documents, not on originals.

tags.translation_quality

Quality of the translation. Present only on documents that are translations or language variants.

Values:

  • official — equally authoritative version (EU texts in 24 languages, BE/FI/CH bilingual laws). No single original exists.
  • official_translation — official translation without equal legal force
  • machine_translated — LLM translation, not verified
  • human_reviewed — translation reviewed by bilingual human

tags.translation_pending

ISO 639-1 code of the target language for which a translation is pending. Set on documents whose original language differs from the user's primary language and that have not yet been translated. Removed once a language_variant edge is created linking to the translated version.

Used by the LLM enrichment script (enrich_translations.py) to find candidates for machine translation:

<syntaxhighlight lang="sql">
SELECT id, body, body_search
FROM corpus.documents
WHERE jurisdiction = 'eu'
  AND tags->>'translation_pending' = 'fr'
</syntaxhighlight>

ID convention for language variants

  • EU texts (all versions equally authoritative): every version has a language suffix. No version is "the original."
eu.celex:32016r0679:fr    language=fr, tags.translation_quality=official
eu.celex:32016r0679:en    language=en, tags.translation_quality=official
eu.celex:32016r0679:de    language=de, tags.translation_quality=official
  • National texts with translations (one original, others are translations): the original has no suffix. Translations have :lang suffix and tags.translation_of pointing to the original.
fr.legiarti000006902764       language=fr  (original, no suffix)
fr.legiarti000006902764:en    language=en, tags.translation_of=fr.legiarti000006902764

All language variants are linked by language_variant edges. The edge target is chosen by convention (alphabetical or first ingested) — no version is inherently canonical for equally authoritative texts.
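
The suffix convention can be illustrated with a small parser that separates the base ID from the optional :lang suffix. The function name is hypothetical, and the sketch assumes the suffix is always a two-letter lowercase code.

```python
import re

# Assumption: the language suffix is a two-letter lowercase code at the very end.
_LANG_SUFFIX = re.compile(r"^(?P<base>.+):(?P<lang>[a-z]{2})$")


def split_language_suffix(doc_id: str):
    """Hypothetical helper: return (base_id, lang), with lang None if no suffix."""
    match = _LANG_SUFFIX.match(doc_id)
    if match:
        return match["base"], match["lang"]
    return doc_id, None


print(split_language_suffix("eu.celex:32016r0679:fr"))
# ('eu.celex:32016r0679', 'fr')
print(split_language_suffix("fr.legiarti000006902764"))
# ('fr.legiarti000006902764', None)
```

The pattern anchors on the last colon, so the colon inside an EU base ID such as eu.celex:32016r0679 is not mistaken for a language separator.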


Identifiers

tags.text_id (kind=legislation and chunk)

Cross-language canonical identifier of the underlying legal work. Equal across all language variants of the same text — does NOT carry the :lang suffix that the document id column carries. Used by reference resolvers to find a text regardless of language.

Example:

id = "eu.eurlextext32016r0679:fr"   tags.text_id = "eu.eurlextext32016r0679"
id = "eu.eurlextext32016r0679:en"   tags.text_id = "eu.eurlextext32016r0679"

Both rows match a search by text_id = "eu.eurlextext32016r0679". The search engine then narrows by user language via the language filter on TagQuery.

For articles and sections, tags.text_id points to the parent text (also without :lang suffix), enabling cross-language navigation: an article in the EN version of GDPR has the same text_id as the corresponding article in the FR version.

tags.eli

ELI (European Legislation Identifier) URI when provided by source. Not all jurisdictions support ELI.

tags.celex

CELEX identifier for EU documents (EUR-Lex primary key). Format: sector digit + 4-digit year + type letter(s) + ordinal (e.g. 32016R0679 for GDPR). Redundant with the document ID (EU.CELEX:32016R0679) but useful for tag-based filtering and discovery.
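
The format described above can be checked with a regular expression. This sketch covers only that shape (sector digit, 4-digit year, one or two type letters, numeric ordinal); CELEX numbers seen in practice take other shapes that it would reject. The names CELEX_RE and parse_celex are hypothetical.

```python
import re

# Covers only the shape described above; other CELEX shapes are not handled.
CELEX_RE = re.compile(
    r"^(?P<sector>\d)(?P<year>\d{4})(?P<type>[A-Z]{1,2})(?P<ordinal>\d+)$"
)


def parse_celex(celex: str):
    """Hypothetical helper: split a CELEX number into its parts, or return None."""
    match = CELEX_RE.match(celex.upper())
    return match.groupdict() if match else None


print(parse_celex("32016R0679"))
# {'sector': '3', 'year': '2016', 'type': 'R', 'ordinal': '0679'}
```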

tags.ecli (kind=decision)

ECLI (European Case Law Identifier). Stored upper-case with whitespace stripped (ECLIs are case-insensitive per the EU spec but published mixed-case across sources). Normalization happens at ingest time via duralex.ingest.tag_normalization.normalize_ecli so dedup queries can match cross-source via a single partial expression index.
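
The normalization described above (upper-case, whitespace stripped) amounts to something like the following. It is a sketch, not the real normalize_ecli, which may do more; the function name is hypothetical.

```python
def normalize_ecli_sketch(raw: str) -> str:
    """Hypothetical sketch: remove all whitespace and upper-case the ECLI."""
    return "".join(raw.split()).upper()


print(normalize_ecli_sketch(" ecli:eu:c:2014:317 "))  # ECLI:EU:C:2014:317
```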

tags.case_number (array, kind=decision)

Display form of the case number(s), as published by the source. Multiple aliases allowed. Used for human-readable presentation only.

tags.case_number_normalized (array, kind=decision)

Indexable form of the case number(s) for cross-source dedup matching. Built at ingest time by duralex.ingest.tag_normalization.normalize_case_number_list, which strips dots and whitespace while preserving dashes, slashes and letters. This matches the historical replace(…, '.', '') semantics, so it does not introduce cross-court false positives such as CASS 22-13456 colliding with CAPP 22/13456. Backed by the partial GIN jsonb_path_ops index idx_doc_decision_case_number_normalized for @> containment queries.
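
A sketch of the normalization rule, under stated assumptions: the function name is hypothetical, and dropping duplicates while keeping order is an added assumption rather than documented behaviour of normalize_case_number_list.

```python
def normalize_case_numbers(case_numbers: list) -> list:
    """Hypothetical sketch: strip dots and whitespace, keep dashes, slashes, letters."""
    normalized = []
    for raw in case_numbers:
        compact = "".join(raw.split()).replace(".", "")
        # Assumption: empty results and duplicates are dropped, order preserved.
        if compact and compact not in normalized:
            normalized.append(compact)
    return normalized


print(normalize_case_numbers(["22-13.456", "22/13 456"]))
# ['22-13456', '22/13456']
```

Because dashes and slashes survive, the two inputs stay distinct after normalization, which is the property that avoids the cross-court collision mentioned above.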


Original document reference

tags.original_url

URL or path to the original document (PDF in external storage).

tags.original_format

Original format: pdf, xml, html, json, docx, odt


Document structure (for kind=chunk)

tags.part

The structural section of the parent document this chunk represents.

Values: visa, facts, moyens, reasoning, ruling, ag_opinion, conclusie, dispositif

tags.position

Ordinal position within parent (for ordering chunks and sections).


Article versioning

tags.cid

Permanent article identity — groups temporal versions of the same article across renumbering. Corresponds to LEGI XML CID in France; other jurisdictions provide equivalent identifiers (e.g., BWB number in NL, SFS number in SE).

Used by: the compiler to group versions, the at_date TagQuery to select the correct temporal version, the knowledge graph to link concepts to articles.

See TEMPORAL for temporal versioning details.


Structure type (for kind=section)

tags.structure_type

Differentiates navigation trees.

Values: legislation, labor, doctrine, official_journal (and future jurisdiction-specific values)