= Tag Conventions =

Tags are the primary mechanism for jurisdiction-specific metadata. They live in the <code>tags</code> JSONB column on <code>corpus.documents</code>.

This document defines the shared vocabulary — tag keys that have the same meaning across jurisdictions. Jurisdiction-specific tags (e.g., <code>idcc</code> for French labor conventions, <code>foral</code> for Spanish Basque tax law) are documented in each jurisdiction's plugin, not here.

== General rules ==

* '''Enum / classification tag values MUST be lowercase.''' Values from controlled vocabularies (type, binding, enforcement_status, content_quality, importance_level, legal_branch, etc.) must always be lowercase. No mixed-case, no uppercase. Enforced at ingest time. This ensures deterministic matching and avoids case-sensitivity bugs in filters, discover mode, and tag_stats aggregation.
* '''Data values are stored as-is, queried case-insensitively.''' Values that represent real-world data — company names (denominations like "DANONE"), legal form labels ("SAS", "SARL"), role labels ("Président"), tribunal names, case numbers, ECLI identifiers, NOR codes — are stored in their original case. The MCP search tools already lowercase tag values at query time for case-insensitive matching.

----

== Legal classification ==

=== tags.type (required) ===

The legal nature of the document. This does the work that the old per-table schema did.
Values for <code>kind=legislation</code>:

<pre>
statute, decree, regulation, enforcement_decree, ordinance, order,
collective_agreement, guidance, circular, manual, instruction, fatwa,
tesis, sumula_vinculante, tsūtatsu, judicial_interpretation, dictamen,
acordada, auto_acordado, normative_document, inquiry_report, bill,
debate, question, committee_report, constitutional_amendment,
provisional_measure, technical_standard, code_of_practice,
regulatory_handbook
</pre>

Values for <code>kind=decision</code>:

<pre>
judgment, order, advisory_opinion, enforcement_decision,
constitutional_review, preliminary_ruling, arbitral_award, tutela,
amparo, advance_ruling, administrative_decision
</pre>

Values for <code>kind=record</code>:

<pre>
company, person, property, patent, establishment
</pre>

Values for <code>kind=notice</code>:

<pre>
insolvency, registration, modification, filing, deregistration,
capital_change, merger, liquidation, gazette_publication
</pre>

=== tags.binding (optional) ===

Whether the document creates binding legal obligations.

Values: <code>true</code>, <code>false</code>

=== tags.binding_scope (optional, when binding=true) ===

Who is bound.

Values: <code>erga_omnes</code>, <code>administration</code>, <code>inter_partes</code>, <code>applicant_only</code>, <code>sector</code>

=== tags.legal_tradition (optional) ===

For jurisdictions with mixed legal systems (UAE, ZA, NG).

Values: <code>civil_law</code>, <code>common_law</code>, <code>sharia</code>, <code>customary</code>, <code>mixed</code>, <code>roman_dutch</code>

----

== Document form ==

=== tags.document_form (required for all documents) ===

Editorial/administrative form of the document. Jurisdiction-agnostic. Distinguishes the authoritative text from its editorial variants, notices, opinions, and corrections.

See ADR: document_form tag (design-decisions/2026-04-22-document-form-tag.md) for naming rationale.

Values:

{| class="wikitable"
! Value !! Description !! Examples
|-
| <code>canonical_text</code> || Authoritative full text (default) || Judgment, statute, regulation, collective agreement
|-
| <code>editorial_summary</code> || Summary by documentalist/editor || CELEX <code>_res</code>, headnote, syllabus, Leitsatz, massima
|-
| <code>gazette_notice</code> || Publication notice in official gazette || EU OJ-C notice, JORF avis, Federal Register notice
|-
| <code>case_registration</code> || Notice of new case filing || CELEX CN/TN/FN, certified question (US), ICJ memorial
|-
| <code>corrigendum</code> || Correction/erratum || JORF rectificatif, EUR-Lex corrigendum
|-
| <code>consolidated_version</code> || Consolidated text incorporating amendments || EUR-Lex consolidated, UK consolidation act
|-
| <code>amendment</code> || Amending instrument || Avenant (ACCO/KALI), amending regulation
|-
| <code>dissenting_opinion</code> || Dissenting opinion by judge(s) || ECHR, ICJ, SCOTUS, BVerfG Sondervotum
|-
| <code>concurring_opinion</code> || Concurring opinion by judge(s) || ECHR, common law
|-
| <code>separate_opinion</code> || Separate opinion (neutral) || ICJ Art.57, ECHR
|-
| <code>declaration</code> || Brief declaration without detailed reasoning || ICJ, ECHR
|-
| <code>per_curiam</code> || Unsigned opinion attributed to the court || SCOTUS, UKSC
|-
| <code>memorandum_opinion</code> || Brief opinion without detailed reasoning || US federal courts
|}

Default: <code>canonical_text</code> when not specified.

Title prefix: when <code>document_form != canonical_text</code>, prefix title with English bracketed label: <code>[Editorial Summary]</code>, <code>[OJ Notice]</code>, <code>[Case Filing]</code>, <code>[Corrigendum]</code>, <code>[Consolidated]</code>, <code>[Amendment]</code>, <code>[Dissenting Opinion]</code>, <code>[Concurring Opinion]</code>, <code>[Separate Opinion]</code>, <code>[Declaration]</code>, <code>[Per Curiam]</code>, <code>[Memorandum]</code>.
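The title-prefix rule can be illustrated with a minimal Python sketch. It assumes that the bracketed labels are listed in the same order as the non-default rows of the values table, so each label pairs with the value at the same position. The names <code>TITLE_PREFIX_BY_FORM</code> and <code>prefixed_title</code> are hypothetical and not part of the Dura Lex codebase.

```python
# Hypothetical helper: pairs each document_form value with its bracketed
# label, assuming labels and table rows are listed in the same order.
TITLE_PREFIX_BY_FORM = {
    "editorial_summary": "[Editorial Summary]",
    "gazette_notice": "[OJ Notice]",
    "case_registration": "[Case Filing]",
    "corrigendum": "[Corrigendum]",
    "consolidated_version": "[Consolidated]",
    "amendment": "[Amendment]",
    "dissenting_opinion": "[Dissenting Opinion]",
    "concurring_opinion": "[Concurring Opinion]",
    "separate_opinion": "[Separate Opinion]",
    "declaration": "[Declaration]",
    "per_curiam": "[Per Curiam]",
    "memorandum_opinion": "[Memorandum]",
}


def prefixed_title(title: str, tags: dict) -> str:
    """Return the title, prefixed when document_form is not canonical_text."""
    # A missing tag falls back to the documented default.
    form = tags.get("document_form", "canonical_text")
    if form == "canonical_text":
        return title
    if form not in TITLE_PREFIX_BY_FORM:
        raise ValueError(f"unknown document_form: {form!r}")
    return f"{TITLE_PREFIX_BY_FORM[form]} {title}"
```

Rejecting unknown values is a choice made for this sketch; an ingest pipeline might prefer to log them and continue.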
----

== Temporal metadata ==

=== tags.enforcement_status (for kind=legislation with versioning) ===

Values:
* <code>in_force</code> — currently applicable
* <code>deferred_enforcement</code> — adopted but not yet in force, will be on a future date
* <code>deferred_repeal</code> — currently in force, repeal scheduled for a future date
* <code>repealed</code> — explicitly repealed by a subsequent act
* <code>superseded</code> — historical version, a newer version of the same article exists
* <code>never_in_force</code> — modified before its effective date, never applied
* <code>expired</code> — lapsed by its own terms (sunset clause, fixed-duration text)
* <code>annulled</code> — struck down by a court (e.g. constitutional review) — typically retroactive
* <code>transferred</code> — content moved to a different article or location (renumbering)
* <code>denounced</code> — collective agreement repudiated by one of the parties
* <code>disjoined</code> — version split into multiple separate articles
* <code>conditional</code> — in force only under a specific interpretation (constitutional reservation)
* <code>pending</code> — emergency decree or provisional measure awaiting ratification

=== tags.in_force (boolean shortcut) ===

Values: <code>true</code>, <code>false</code>

=== tags.status (for provisional instruments: BR medida provisoria, AR DNU) ===

Values: <code>in_force</code>, <code>pending_ratification</code>, <code>converted</code>, <code>rejected</code>, <code>lapsed</code>

=== tags.last_modified (optional, ISO date) ===

Date the source last modified this document editorially. Distinct from <code>date</code> (legal effect date) and from ingestion timestamps. Populated when the source exposes it (BOFiP <code>bodgfip:date_modification</code>, Judilibre <code>update_date</code>, KALI <code>DATE_PUBLI</code>).

=== tags.date_published (optional, ISO date) ===

Date the document was officially published (e.g. in the Journal Officiel, Official Gazette, EUR-Lex).
Distinct from <code>date</code> (legal effect date) which may be later (deferred enforcement). Populated when source exposes it and it differs from <code>date</code>.

----

== Non-Gregorian calendars ==

ISO date in the promoted <code>date</code> column. Original calendar representation in tags.

=== tags.date_hijri ===

Islamic (Hijri) calendar date. Format: <code>"YYYY-MM-DD"</code> in Hijri. Used by: SA, AE, other Middle Eastern jurisdictions.

=== tags.date_era_jp ===

Japanese imperial era year. Format: era letter + 2-digit year (e.g., <code>"R06"</code> = Reiwa 6 = 2024). Used by: JP.

=== tags.date_roc_tw ===

Republic of China (Minguo) year. Format: 3-digit year (e.g., <code>"112"</code> = 2023). Used by: TW.

----

== Content quality ==

=== tags.content_quality (required for all documents) ===

The quality level of the text in <code>body</code>.

Values (progression order):

<pre>
ocr_raw           — OCR extraction, errors likely
ocr_cleaned       — OCR cleaned by regex or LLM
native_raw        — native text (PDF text layer, .doc), structure lost
native_structured — structured source (HTML Legifrance, XML Akoma Ntoso, API JSON)
machine_reviewed  — reviewed/corrected by LLM
human_reviewed    — reviewed by a human
jurist_reviewed   — validated by a jurist (highest confidence)
</pre>

Only the current level is stored. History is not tracked.

Critical for:
* Knowledge graph trust chains (annotations on <code>ocr_raw</code> docs have lower confidence)
* User-facing warnings ("this text may contain OCR errors")
* Quality pipeline prioritization

----

== Sub-jurisdiction ==

=== tags.sub_jurisdiction ===

For legal enclaves or devolved entities within a jurisdiction that have their own legal system. Examples: <code>difc</code>, <code>adgm</code> (UAE free zones with English common law courts). The <code>jurisdiction</code> column carries the country; this tag carries the enclave.
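Returning to <code>content_quality</code>: because its values form an ordered progression, consumers can compare levels by rank instead of by string. The following Python sketch shows one way to do that. The names <code>CONTENT_QUALITY_ORDER</code>, <code>is_at_least</code>, and <code>prioritize_for_review</code> are hypothetical, and sorting the lowest-quality documents first is an assumption about how a quality pipeline might prioritize its work.

```python
# Levels in the progression order given under tags.content_quality.
CONTENT_QUALITY_ORDER = [
    "ocr_raw",
    "ocr_cleaned",
    "native_raw",
    "native_structured",
    "machine_reviewed",
    "human_reviewed",
    "jurist_reviewed",
]
_RANK = {value: rank for rank, value in enumerate(CONTENT_QUALITY_ORDER)}


def is_at_least(value: str, threshold: str) -> bool:
    """True when `value` is at or above `threshold` in the progression."""
    return _RANK[value] >= _RANK[threshold]


def prioritize_for_review(docs: list[dict]) -> list[dict]:
    """Sort documents so the lowest-quality text comes first (assumed policy)."""
    return sorted(docs, key=lambda doc: _RANK[doc["tags"]["content_quality"]])
```

For example, <code>is_at_least("ocr_cleaned", "native_raw")</code> is false, so a consumer requiring at least native text would skip that document.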
----

== Internal tags ==

Tags prefixed with <code>_</code> are internal: never displayed to users, excluded from discover mode, excluded from <code>tag_stats</code>. Used for computed values needed by the search engine. Populated at ingestion time by jurisdiction plugins.

----

== Quality flags ==

=== tags.need_fixing ===

Marker that the document has a known data quality issue that should be revisited later. The value is the ''category'' of the problem so we can group records by what needs to be done:

* <code>date</code> — one or more date columns were nullified at ingest time because the source published an aberrant value (DILA sentinels like <code>2999-01-01</code> in the wrong column, INPI <code>9999-12-31</code>, manual typos like <code>5489-12-30</code> or <code>3023-04-03</code>). The original raw value is lost; recovering it requires re-fetching from the source. Used by <code>duralex.ingest.date_validation.validate_date</code>.

The tag is intentionally a single string and not an array — one flag per document is enough for now. If we ever need to track multiple issue categories on the same document, switch to an array.

----

== Precedent, publication, and importance ==

=== tags.is_precedent (boolean) ===

Formally designated as binding precedent. Used by: FI (ennakkopäätös), PL (uchwala zasada prawna), CN (guiding case).

=== tags.official_grade (string, kind=decision) ===

Official publication/importance grade assigned by the jurisdiction. Values are jurisdiction-specific, documented per plugin. Absent when the source has no grading system.

The grade is relative to the issuing court — a high grade from a first-instance court is not equivalent to the same grade from a supreme court. The real importance depends on the (court, grade) pair.

Known values by jurisdiction:

{| class="wikitable"
! Jurisdiction !! Source !! Values (highest → lowest)
|-
| FR admin (CE, TdC) || JADE <code>PUBLI_RECUEIL</code> || <code>A</code>, <code>B</code>
|-
| FR admin (cour_administrative_appel, tribunal_administratif) || JADE <code>PUBLI_RECUEIL</code>, CE opendata <code>Code_Publication</code> || <code>R</code>, <code>C+</code>, <code>C</code>, <code>D</code>, <code>Z</code>
|-
| FR cass (post-2021) || Judilibre <code>publication</code> || <code>rapport</code>, <code>bulletin</code>, <code>diffuse</code>, <code>non_diffuse</code>
|-
| FR cass (pre-2021) || Judilibre <code>publication</code>, DILA XML || <code>rapport</code>, <code>bulletin</code>, <code>bulletin_information</code>, <code>internet</code>, <code>diffuse</code>
|-
| ECHR || HUDOC <code>importance</code> || <code>1</code>, <code>2</code>, <code>3</code>, <code>4</code>
|-
| CJEU || CELLAR <code>erecueil</code> + CELEX sector || <code>ecr_grand_chamber</code>, <code>ecr_chamber</code>, <code>oj_only</code>, <code>unpublished</code>
|-
| DE || juris <code>Dokumenttyp</code> || <code>amtlicher_leitsatz</code>, <code>orientierungssatz</code>, <code>redaktioneller_leitsatz</code>
|-
| CH || bger.ch || <code>atf</code>, <code>atf_partial</code>, <code>online_only</code>
|-
| US || CourtListener || <code>precedential</code>, <code>non_precedential</code>
|-
| CN || SPC database || <code>guiding</code>, <code>reference</code>, <code>typical</code>, <code>gazette</code>, <code>ordinary</code>
|-
| MX || SJF || <code>jurisprudencia</code>, <code>precedente</code>, <code>tesis_aislada</code>
|-
| IT || CED || <code>sezioni_unite_massima</code>, <code>massima</code>, <code>no_massima</code>
|-
| UK || ICLR || <code>law_reports</code>, <code>wlr_2_3</code>, <code>wlr_1</code>, <code>wlr_4</code>, <code>all_er</code>, <code>unreported</code>
|-
| FR fond (cour_appel, tribunal_judiciaire, tribunal_commerce) || Judilibre <code>particularInterest</code> || <code>particular_interest</code>
|-
| FR financial (CdC, CRC, CDBF, CAF) || Légifrance <code>publicationRecueil</code> || <code>recueil</code>
|}

=== tags.importance_level (string, computed) ===

Harmonized importance score derived from <code>official_grade</code> by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction search and FTS ranking.

Applies to <code>kind=decision</code> and <code>kind=legislation</code> (for authoritative guidance like BOFiP fiscal doctrine). Absent when no importance signal is available. BOFiP documents are always <code>high_importance</code> (no per-document grading from the source).

Values (ascending importance): <code>minimal_importance</code>, <code>low_importance</code>, <code>medium_importance</code>, <code>high_importance</code>, <code>highest_importance</code>.

=== tags._importance_level_default (string, internal) ===

Internal tag for FTS ranking, never displayed. Always set for documents that have importance signals (<code>kind=decision</code>, <code>kind=legislation</code> with authoritative guidance).

Equals <code>importance_level</code> when available, else derived from <code>court_level</code>:
* <code>supreme</code>, <code>constitutional</code>, <code>supranational</code> → <code>medium_importance</code>
* <code>appellate</code> → <code>low_importance</code>
* <code>first_instance</code> → <code>minimal_importance</code>
* <code>court_level</code> absent or null → <code>unknown_importance</code>

=== tags.formation_solemnity (string, computed, kind=decision) ===

Standardized bench type derived from <code>formation</code> by jurisdiction plugins. NOT an official classification — computed by Dura Lex for cross-jurisdiction comparison and display. Absent when formation is unknown.

Values (ascending solemnity): <code>single_judge</code>, <code>reduced_bench</code>, <code>standard_bench</code>, <code>combined_chambers</code>, <code>grand_bench</code>, <code>full_court</code>.
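The fallback rule described under <code>tags._importance_level_default</code> can be sketched in Python as follows. This is an illustration, not the plugin implementation: the function name <code>importance_level_default</code> is hypothetical, and treating an unrecognized <code>court_level</code> like a missing one is an assumption, since the rule above only covers absent or null values.

```python
# Fallback mapping copied from the bullet list under
# tags._importance_level_default.
COURT_LEVEL_FALLBACK = {
    "supreme": "medium_importance",
    "constitutional": "medium_importance",
    "supranational": "medium_importance",
    "appellate": "low_importance",
    "first_instance": "minimal_importance",
}


def importance_level_default(tags: dict) -> str:
    """Compute the internal ranking tag from a document's tags."""
    level = tags.get("importance_level")
    if level:
        # The harmonized score wins whenever a plugin computed one.
        return level
    # Absent, null, or (by assumption) unrecognized court_level values
    # all fall through to unknown_importance.
    return COURT_LEVEL_FALLBACK.get(tags.get("court_level"), "unknown_importance")
```

A first-instance decision without a grade therefore ranks as <code>minimal_importance</code>, while a decision from a non-judicial authority with no <code>court_level</code> ranks as <code>unknown_importance</code>.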
=== tags.court_level (string, kind=decision) ===

Position of the issuing court in the judicial hierarchy. Cross-jurisdiction handle for filtering by instance. Distinct from <code>court</code> (the name) and from <code>formation_solemnity</code> (the bench that heard the case).

Null (absent) for non-judicial authorities (CNIL, CADA, AMF, Défenseur des droits) and for sui generis bodies that don't fit the hierarchy (Tribunal des conflits).

Values:
* <code>first_instance</code> — trial courts (FR tribunal judiciaire, tribunal de commerce, tribunal administratif, conseil de prud'hommes; EU General Court, Civil Service Tribunal)
* <code>appellate</code> — appeal courts (FR cour d'appel, cour administrative d'appel, tribunal supérieur d'appel)
* <code>supreme</code> — apex ordinary courts (FR Cour de cassation, Conseil d'État)
* <code>constitutional</code> — constitutional review bodies (FR Conseil constitutionnel, DE BVerfG, IT Corte Costituzionale)
* <code>supranational</code> — international/supranational courts (CJEU Court of Justice, ECHR, ICJ)

See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md) for the per-jurisdiction mapping rationale.

----

== Collective bargaining ==

=== tags.bargaining_level (string, kind=legislation, type=collective_agreement) ===

Level at which a collective agreement was negotiated. ILO/OECD standard terminology. Applies to collective agreements and amendments (<code>avenants</code>).

Values:
* <code>enterprise</code> — company-level (FR <code>entreprise</code>, including group and establishment agreements)
* <code>sectoral</code> — industry/branch-level (FR <code>branche</code>)
* <code>inter_sectoral</code> — cross-industry, national (FR <code>interprofessionnel</code>, e.g. ANI)
* <code>territorial</code> — geographically scoped (regional, departmental)

Renames the former FR-specific <code>level</code> key.
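As an illustration of the rename, the Python sketch below migrates a tag dictionary from the old key to the new one. It rests on an assumption that is not confirmed above: that the former <code>level</code> key stored the French source terms. The names <code>FR_LEVEL_TO_BARGAINING_LEVEL</code> and <code>rename_level_tag</code> are hypothetical. No French term is given for <code>territorial</code>, so the sketch passes unmapped values through unchanged.

```python
# French terms taken from the value list above; that the old `level` key
# actually held these terms is an assumption of this sketch.
FR_LEVEL_TO_BARGAINING_LEVEL = {
    "entreprise": "enterprise",
    "branche": "sectoral",
    "interprofessionnel": "inter_sectoral",
}


def rename_level_tag(tags: dict) -> dict:
    """Return a copy of `tags` with `level` renamed to `bargaining_level`."""
    migrated = {key: value for key, value in tags.items() if key != "level"}
    if "level" in tags:
        old_value = tags["level"]
        # Unmapped values (including ones that are already English) pass through.
        new_value = FR_LEVEL_TO_BARGAINING_LEVEL.get(old_value, old_value)
        # Never overwrite a bargaining_level that is already set.
        migrated.setdefault("bargaining_level", new_value)
    return migrated
```

The function returns a new dictionary rather than mutating its input, which keeps the original tags available for comparison during a migration.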
----

== Administrative subclassification ==

=== tags.subcategory (string, optional) ===

Subclassification finer than <code>type</code>. T2: key is shared across jurisdictions, value vocabularies are controlled per plugin and listed in that plugin's documentation. Used primarily on <code>kind=notice</code> (BODACC event types) and on records (RCS filing subtypes).

Values are English lowercase, slugified — translated from source classifications so they are portable across tools. Per-plugin value lists live in <code>duralex-ingest-fr/docs/FR-TAGS.md</code> and equivalents.

----

== Source-native classifications (T3) ==

Some tag keys hold values that mirror the source's own taxonomy rather than a Dura Lex enum. These keys are documented per plugin, not here, and are intended for intra-jurisdiction precision filters. Values are slugified at ingest time (lowercase, no accents, spaces → underscores) so that queries remain deterministic.

=== tags.nature (string) ===

Source-native document classification (e.g. FR <code>arret</code>, <code>ordonnance</code>, <code>loi_organique</code>, <code>qpc</code>). T3 source-fidelity — values are plugin-specific. For cross-jurisdiction search, prefer <code>type</code> and <code>document_form</code>; use <code>nature</code> for jurisdiction-local precision (e.g. distinguishing QPC from an ordinary Conseil constitutionnel decision).

Slugification (accent removal + space → underscore) is applied centrally at ingest time by <code>duralex.ingest.tag_normalization.normalize_tag_value</code>.

Known tech debt: EU <code>nature</code> values are French-origin slugs because CJEU ingestion sources French documents first. See ADR: Tag tier architecture (design-decisions/2026-04-22-tag-tier-architecture.md).

----

== Legal branch ==

=== tags.legal_branch (array, kind=legislation only) ===

Branch(es) of law. Populated at ingest from code→branch mapping (deterministic). Decisions: LLM-enriched (future).
Values: <code>civil</code>, <code>criminal</code>, <code>administrative</code>, <code>commercial</code>, <code>social</code>, <code>tax</code>, <code>constitutional</code>, <code>environmental</code>, <code>consumer</code>, <code>ip</code>, <code>public_procurement</code>, <code>health</code>, <code>family</code>, <code>real_estate</code>, <code>digital</code>

----

== Translation ==

=== tags.translation_of ===

ID of the original document this is a translation of. Present only on translated documents, not on originals.

=== tags.translation_quality ===

Quality of the translation. Present only on documents that are translations or language variants.

Values:
* <code>official</code> — equally authoritative version (EU texts in 24 languages, BE/FI/CH bilingual laws). No single original exists.
* <code>official_translation</code> — official translation without equal legal force
* <code>machine_translated</code> — LLM translation, not verified
* <code>human_reviewed</code> — translation reviewed by bilingual human

=== tags.translation_pending ===

ISO 639-1 code of the target language for which a translation is pending. Set on documents whose original language differs from the user's primary language and that have not yet been translated. Removed once a <code>language_variant</code> edge is created linking to the translated version.

Used by the LLM enrichment script (<code>enrich_translations.py</code>) to find candidates for machine translation:

<syntaxhighlight lang="sql">
SELECT id, body, body_search
FROM corpus.documents
WHERE jurisdiction = 'eu'
  AND tags->>'translation_pending' = 'fr'
</syntaxhighlight>

=== ID convention for language variants ===

* '''EU texts''' (all versions equally authoritative): every version has a language suffix. No version is "the original."
<pre>
eu.celex:32016r0679:fr   language=fr, tags.translation_quality=official
eu.celex:32016r0679:en   language=en, tags.translation_quality=official
eu.celex:32016r0679:de   language=de, tags.translation_quality=official
</pre>

* '''National texts with translations''' (one original, others are translations): the original has no suffix. Translations have <code>:lang</code> suffix and <code>tags.translation_of</code> pointing to the original.

<pre>
fr.legiarti000006902764      language=fr (original, no suffix)
fr.legiarti000006902764:en   language=en, tags.translation_of=fr.legiarti000006902764
</pre>

All language variants are linked by <code>language_variant</code> edges. The edge target is chosen by convention (alphabetical or first ingested) — no version is inherently canonical for equally authoritative texts.

----

== Identifiers ==

=== tags.text_id (kind=legislation and chunk) ===

Cross-language canonical identifier of the underlying legal work. Equal across all language variants of the same text — does NOT carry the <code>:lang</code> suffix that the document <code>id</code> column carries. Used by reference resolvers to find a text regardless of language.

Example:

<pre>
id = "eu.eurlextext32016r0679:fr"   tags.text_id = "eu.eurlextext32016r0679"
id = "eu.eurlextext32016r0679:en"   tags.text_id = "eu.eurlextext32016r0679"
</pre>

Both rows match a search by <code>text_id = "eu.eurlextext32016r0679"</code>. The search engine then narrows by user language via the <code>language</code> filter on <code>TagQuery</code>.

For articles and sections, <code>tags.text_id</code> points to the parent text (also without <code>:lang</code> suffix), enabling cross-language navigation: an article in the EN version of GDPR has the same <code>text_id</code> as the corresponding article in the FR version.

=== tags.eli ===

ELI (European Legislation Identifier) URI when provided by source. Not all jurisdictions support ELI.

=== tags.celex ===

CELEX identifier for EU documents (EUR-Lex primary key).
Format: sector digit + 4-digit year + type letter(s) + ordinal (e.g. <code>32016R0679</code> for GDPR). Redundant with the document ID (<code>EU.CELEX:32016R0679</code>) but useful for tag-based filtering and discovery.

=== tags.ecli (kind=decision) ===

ECLI (European Case Law Identifier). Stored upper-case with whitespace stripped (ECLIs are case-insensitive per the EU spec but published mixed-case across sources). Normalization happens at ingest time via <code>duralex.ingest.tag_normalization.normalize_ecli</code> so dedup queries can match cross-source via a single partial expression index.

=== tags.case_number (array, kind=decision) ===

Display form of the case number(s), as published by the source. Multiple aliases allowed. Used for human-readable presentation only.

=== tags.case_number_normalized (array, kind=decision) ===

Indexable form of the case number(s) for cross-source dedup matching. Built at ingest time by <code>duralex.ingest.tag_normalization.normalize_case_number_list</code> which strips dots and whitespace while preserving dashes, slashes and letters (matches the historical <code>replace(., '.', '')</code> semantics so we don't introduce cross-court false positives like CASS <code>22-13456</code> colliding with CAPP <code>22/13456</code>).

Backed by partial GIN <code>jsonb_path_ops</code> index <code>idx_doc_decision_case_number_normalized</code> for <code>@></code> containment queries.

----

== Original document reference ==

=== tags.original_url ===

URL or path to the original document (PDF in external storage).

=== tags.original_format ===

Original format: <code>pdf</code>, <code>xml</code>, <code>html</code>, <code>json</code>, <code>docx</code>, <code>odt</code>

----

== Document structure (for kind=chunk) ==

=== tags.part ===

The structural section of the parent document this chunk represents.
Values: <code>visa</code>, <code>facts</code>, <code>moyens</code>, <code>reasoning</code>, <code>ruling</code>, <code>ag_opinion</code>, <code>conclusie</code>, <code>dispositif</code>

=== tags.position ===

Ordinal position within parent (for ordering chunks and sections).

----

== Article versioning ==

=== tags.cid ===

Permanent article identity — groups temporal versions of the same article across renumbering. Corresponds to LEGI XML CID in France; other jurisdictions provide equivalent identifiers (e.g., BWB number in NL, SFS number in SE).

Used by: the compiler to group versions, the <code>at_date</code> TagQuery to select the correct temporal version, the knowledge graph to link concepts to articles.

See [[Corpus/Temporal|TEMPORAL]] for temporal versioning details.

----

== Structure type (for kind=section) ==

=== tags.structure_type ===

Differentiates navigation trees.

Values: <code>legislation</code>, <code>labor</code>, <code>doctrine</code>, <code>official_journal</code> (and future jurisdiction-specific values)

[[Category:Corpus]]