Development/Adding a source

From Dura Lex Wiki
Revision as of 02:07, 23 April 2026 by Nicolas (talk | contribs) (Create guide for adding a new data source to the ingest pipeline (via create-page on MediaWiki MCP Server))

Adding a source

This guide explains how to add a new data source to the Dura Lex ingest pipeline. A "source" is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.

Overview

Adding a source involves these steps:

  1. Register the source in the source registry
  2. Implement a downloader (and optionally a parser)
  3. Follow tag conventions
  4. Follow edge conventions
  5. Document the source on the wiki

Step 1: Register the source

Every source must be registered in the jurisdiction's source_registry.py module. Registration happens at import time via the register_source() function from duralex.ingest.source_registry.

Location

  • French sources: duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py
  • EU sources: duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py
  • New jurisdictions: create duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py

Required fields

The register_source() function requires the following fields:

{| class="wikitable"
! Field !! Type !! Description !! Example
|-
| source_key || str (positional) || Unique short identifier for the source || "cada"
|-
| jurisdiction || str || ISO-style jurisdiction code || "fr", "eu"
|-
| name || str || Human-readable source name || "CADA"
|-
| description || str || What this source contains || "Avis et conseils de la Commission d'accès aux documents administratifs"
|-
| kind || str || One of the 6 structural kinds: legislation, decision, record, notice, section, chunk || "decision"
|-
| publisher || str || Organization that publishes the data || "CADA", "DILA"
|-
| publisher_url || str || URL where the data can be found || "https://www.data.gouv.fr/fr/datasets/..."
|-
| license || str || License name || "Licence Ouverte 2.0"
|-
| license_url || str || URL of the license || "https://www.etalab.gouv.fr/licence-ouverte-open-licence/"
|-
| language || str || ISO 639-1 language code || "fr", "en"
|-
| date_bounds || DateBounds || Valid date range for this source || see below
|}

DateBounds

DateBounds validates that document dates are within a plausible range. Two factory methods are available:

  • DateBounds.strict(min_year=NNNN) — for sources where documents cannot have future dates (decisions, records, notices). Max year defaults to current year + small margin.
  • DateBounds.permissive(min_year=NNNN, max_year=NNNN) — for legislation with legitimate forward entry-into-force or expiration dates.
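
To make the difference between the two factories concrete, here is a minimal stand-in. It is not the actual duralex.ingest.date_validation code: the class name YearBounds, the contains() method, and the one-year margin are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class YearBounds:
    """Hypothetical stand-in for DateBounds: an inclusive range of years."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, margin: int = 1) -> "YearBounds":
        # Assumption: the "small margin" is one year past the current year.
        return cls(min_year, date.today().year + margin)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "YearBounds":
        # Legislation may legitimately carry far-future dates.
        return cls(min_year, max_year)

    def contains(self, value: date) -> bool:
        """Return True when the date's year falls inside the range."""
        return self.min_year <= value.year <= self.max_year


decisions = YearBounds.strict(min_year=1900)
legislation = YearBounds.permissive(min_year=1789, max_year=2200)

print(decisions.contains(date(1850, 1, 1)))    # False: before min_year
print(legislation.contains(date(2150, 1, 1)))  # True: forward date allowed
```

A decision dated well into the future would be rejected by the strict bounds, while the permissive bounds accept it up to the explicit max_year.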

Example: simple source registration

<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

Sub-sources

If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:

<syntaxhighlight lang="python">
from duralex.ingest.source_registry import register_sub_source

# Parent must be registered first
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    register_sub_source(sub_key, "judilibre")
</syntaxhighlight>
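
The inheritance rule can be pictured with a toy in-memory registry. This is only a sketch: the REGISTRY dict, the demo_ function names, and the "parent" key are hypothetical, and the real registry stores more fields than shown here.

```python
from __future__ import annotations

# Hypothetical in-memory registry used only for this illustration.
REGISTRY: dict[str, dict[str, str]] = {}


def demo_register_source(source_key: str, **metadata: str) -> None:
    """Store the metadata of a top-level source."""
    REGISTRY[source_key] = dict(metadata)


def demo_register_sub_source(sub_key: str, parent_key: str) -> None:
    """Register a sub-source that copies its parent's metadata."""
    if parent_key not in REGISTRY:
        # Mirrors the rule that the parent must be registered first.
        raise KeyError(f"parent source {parent_key!r} is not registered")
    REGISTRY[sub_key] = {**REGISTRY[parent_key], "parent": parent_key}


demo_register_source("judilibre", jurisdiction="fr", kind="decision", language="fr")
demo_register_sub_source("judilibre_cc", "judilibre")

print(REGISTRY["judilibre_cc"]["kind"])    # decision
print(REGISTRY["judilibre_cc"]["parent"])  # judilibre
```

Registering a sub-source before its parent fails immediately, which is why the order of calls in source_registry.py matters.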

Step 2: Implement a downloader

BaseDownloader (for DILA-style tar.gz archives)

If your source publishes data as .tar.gz archives on an HTTP index page (like DILA sources), subclass BaseDownloader from duralex.ingest.sources.base_downloader.

BaseDownloader provides:

  • HTTP retry with exponential backoff and 429/Retry-After support
  • Listing remote files from an HTML index page
  • Downloading with partial-download safety (.tmp files + atomic rename)
  • Extracting .tar.gz archives with prefix stripping
  • State tracking (last downloaded diff filename) for incremental updates
  • Freemium base file + incremental diff pattern
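
As an illustration of the first bullet, the delay between retries could be computed as below. This is a generic sketch, not BaseDownloader's actual code: the function name, the one-second base, and the 60-second cap are assumptions, and only the numeric (seconds) form of the Retry-After header is handled.

```python
from __future__ import annotations


def retry_delay(attempt: int, retry_after: str | None = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # A Retry-After header given in seconds overrides the computed delay.
        return float(retry_after)
    # Exponential backoff: base, 2*base, 4*base, ... never above the cap.
    return min(base * (2 ** attempt), cap)


for attempt in range(4):
    print(attempt, retry_delay(attempt))    # delays 1.0, 2.0, 4.0, 8.0
print(retry_delay(2, retry_after="30"))     # 30.0
```

When a 429 response carries a Retry-After value, the server's figure wins over the local schedule; otherwise the delay doubles on each attempt until it reaches the cap.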

Constructor parameters:

{| class="wikitable"
! Parameter !! Description
|-
| dataset_name || Identifier for this dataset (e.g. "cass", "legi")
|-
| data_directory || Root directory for extracted data and archives
|-
| base_url || URL of the HTTP index listing .tar.gz files
|-
| get_state || Callback returning the last processed filename, or None
|-
| set_state || Callback to persist the last processed filename
|-
| archive_prefix_markers || Directory names marking the start of useful content in tar archives (e.g. ["JURI", "CETA"])
|}
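
The role of archive_prefix_markers can be sketched with a small path helper. The function name and the sample archive layout are made up for illustration, and keeping the marker directory in the result is an assumption about how the real prefix stripping behaves.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def strip_archive_prefix(member_name: str, markers: list[str]) -> str | None:
    """Return the member path starting at the first marker directory, or None."""
    parts = PurePosixPath(member_name).parts
    for index, part in enumerate(parts):
        if part in markers:
            # Drop everything before the marker; keep the marker itself.
            return str(PurePosixPath(*parts[index:]))
    return None  # no marker found: the member is not useful content


markers = ["JURI", "CETA"]
print(strip_archive_prefix("20240101/juri/JURI/TEXT/file.xml", markers))
# JURI/TEXT/file.xml
print(strip_archive_prefix("20240101/README.txt", markers))
# None
```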

The run() method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. run() does not write state: call commit_state() once the ingest has succeeded.

Inline downloader (for non-archive sources)

For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an "inline source".

The CADA downloader is a good example of an inline source. Key patterns:

<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import Path

# IngestErrors is the pipeline's error collector. This guide does not give its
# import path, so import it from wherever it is defined in duralex.ingest.


class CadaDownloader:
    """Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents."""

    def __init__(self, dsn: str, data_directory: Path) -> None:
        self.dsn = dsn
        self._cache_directory = data_directory / "cada"

    def run(self, errors: IngestErrors | None = None) -> list[str]:
        """Download, parse, and insert. Returns []."""
        # 1. Check upstream for changes (compare last_modified)
        # 2. Download or use cached file
        # 3. Parse rows into corpus.documents records
        # 4. Insert in batches using insert_batch()
        # 5. Clean up orphaned records
        # 6. Update ingest state
        ...
</syntaxhighlight>

Key conventions for inline downloaders:

  • Accept dsn (PostgreSQL connection string) and data_directory (root for cached files) in the constructor
  • Accept an optional errors: IngestErrors | None parameter in run() for error tracking
  • Cache downloaded files to disk with a .tmp extension during download and atomic rename on completion
  • Store a sidecar metadata file (e.g., .meta) for change detection
  • Use get_ingest_state() / set_ingest_state() from duralex.ingest.state to track whether the upstream data has changed
  • Use insert_batch() from duralex.ingest.database.batch_writer for bulk insertion
  • Return an empty list (inline sources handle their own insertion)
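
The caching conventions above (a .tmp file, an atomic rename, and a sidecar .meta file) can be sketched with the standard library alone. The helper names and the JSON layout of the sidecar are assumptions, not the CADA downloader's actual code.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def save_atomically(target: Path, payload: bytes, last_modified: str) -> None:
    """Write payload via a .tmp file, rename it, then record a .meta sidecar."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target.with_name(target.name + ".tmp")
    tmp_path.write_bytes(payload)
    os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    meta_path = target.with_name(target.name + ".meta")
    meta_path.write_text(json.dumps({"last_modified": last_modified}))


def is_cache_fresh(target: Path, last_modified: str) -> bool:
    """Return True when the cached file matches the upstream timestamp."""
    meta_path = target.with_name(target.name + ".meta")
    if not (target.exists() and meta_path.exists()):
        return False
    return json.loads(meta_path.read_text()).get("last_modified") == last_modified


with tempfile.TemporaryDirectory() as directory:
    cached = Path(directory) / "cada" / "cada.csv"
    save_atomically(cached, b"placeholder", "2024-01-01T00:00:00")
    fresh_same = is_cache_fresh(cached, "2024-01-01T00:00:00")
    fresh_changed = is_cache_fresh(cached, "2024-02-01T00:00:00")
    tmp_left_behind = cached.with_name("cada.csv.tmp").exists()

print(fresh_same, fresh_changed, tmp_left_behind)  # True False False
```

Because the final name only appears after the rename, an interrupted download leaves a .tmp file behind rather than a truncated cache file.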

Document record format

Each record inserted into corpus.documents must have:

{| class="wikitable"
! Field !! Description
|-
| id || Globally unique document ID, prefixed with jurisdiction (e.g. "fr.cada_20240001")
|-
| kind || One of: legislation, decision, record, notice, section, chunk
|-
| jurisdiction || Jurisdiction code (e.g. "fr", "eu")
|-
| language || ISO 639-1 language code
|-
| source || Source key matching the registered source
|-
| date || Primary date (ISO 8601 format YYYY-MM-DD), or None
|-
| date_end || End date for documents with a date range, or None
|-
| parent_id || Parent document ID for hierarchical sources, or None
|-
| title || Human-readable title
|-
| body || Clean displayable content (HTML or formatted text). Immutable after ingestion.
|-
| body_search || Indexable text for FTS (can be noisy: PDF OCR, etc.), or None to use body
|-
| tags || JSON string of metadata tags (see below)
|}
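
A quick sanity check on a record before insertion might look like the sketch below. The record_problems helper is hypothetical and covers only the rules stated in the table (ID prefix, allowed kinds, date format); the pipeline's own validation may differ.

```python
from __future__ import annotations

import re

ALLOWED_KINDS = {"legislation", "decision", "record", "notice", "section", "chunk"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def record_problems(record: dict[str, object]) -> list[str]:
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    jurisdiction = record.get("jurisdiction")
    if not str(record.get("id", "")).startswith(f"{jurisdiction}."):
        problems.append("id must be prefixed with the jurisdiction")
    if record.get("kind") not in ALLOWED_KINDS:
        problems.append("unknown kind")
    for field in ("date", "date_end"):
        value = record.get(field)
        if value is not None and not ISO_DATE.match(str(value)):
            problems.append(f"{field} must be YYYY-MM-DD or None")
    return problems


record = {
    "id": "fr.cada_20240001",
    "kind": "decision",
    "jurisdiction": "fr",
    "language": "fr",
    "source": "cada",
    "date": "2024-01-15",
    "date_end": None,
    "parent_id": None,
    "title": "Illustrative title",
    "body": "Illustrative body",
    "body_search": None,
    "tags": '{"type": "administrative_decision", "content_quality": "native_structured"}',
}
print(record_problems(record))  # []
```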

Step 3: Follow tag conventions

Tags are the primary metadata mechanism. Every document gets a tags JSON object. Refer to Corpus/Tag conventions for the full shared vocabulary.

Mandatory tags

  • type — legal document type (e.g. "law", "decree", "judgment", "administrative_decision"). See the tag conventions for the full list.
  • content_quality — quality of the body content (e.g. "native_structured", "ocr_raw", "metadata_only")
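
A downloader can check its own output for the two mandatory tags before inserting. The helper below is a hypothetical sketch that works on the tags JSON string; it does not validate the values against the shared vocabulary.

```python
import json

MANDATORY_TAGS = ("type", "content_quality")


def missing_mandatory_tags(tags_json: str) -> list:
    """Return the mandatory tag names that are absent or empty."""
    tags = json.loads(tags_json)
    return [name for name in MANDATORY_TAGS if not tags.get(name)]


complete = json.dumps({"type": "judgment", "content_quality": "native_structured"})
partial = json.dumps({"type": "judgment", "court": "cada"})

print(missing_mandatory_tags(complete))  # []
print(missing_mandatory_tags(partial))   # ['content_quality']
```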

Common optional tags

  • nature — source-specific document nature
  • court — court identifier (for decisions)
  • authority — issuing authority
  • case_number — case/dossier numbers (as a list)
  • solution — outcome/disposition
  • summary — headnote or summary
  • headnote_classification — subject classification
  • _importance_level_default — default importance for ranking

Tag construction pattern

<syntaxhighlight lang="python">
import json


def _tags(**kwargs: object) -> str:
    """Build a tags JSON string, stripping None values."""
    return json.dumps(
        {k: v for k, v in kwargs.items() if v is not None},
        ensure_ascii=False,
    )


# Usage (number and outcome come from the parsed source row)
tags = _tags(
    type="administrative_decision",
    content_quality="native_structured",
    court="cada",
    case_number=[number],
    solution=outcome or None,
)
</syntaxhighlight>

Step 4: Follow edge conventions

If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the corpus.edges table. Refer to Corpus/Edge types for the full taxonomy of ~75 edge types.

Common edge types for new sources:

  • cites — document A cites document B
  • amends — document A amends document B
  • repeals — document A repeals document B
  • implements — national law implements an EU directive
  • transposes — national law transposes an EU directive
  • consolidates — consolidated version of a text
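
As a rough picture of what an edge carries, the sketch below models one as a small value object. The field names source_id, target_id, and edge_type are assumptions, since this guide does not describe the corpus.edges columns, and the target ID in the example is invented. Only the six types listed above are accepted here; the full taxonomy is on Corpus/Edge types.

```python
from dataclasses import dataclass

COMMON_EDGE_TYPES = {
    "cites", "amends", "repeals", "implements", "transposes", "consolidates",
}


@dataclass(frozen=True)
class Edge:
    """Hypothetical in-memory edge: document A --edge_type--> document B."""

    source_id: str
    target_id: str
    edge_type: str

    def __post_init__(self) -> None:
        # Reject types outside the short list used in this sketch.
        if self.edge_type not in COMMON_EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type!r}")


edge = Edge("fr.cada_20240001", "fr.example_target", "cites")
print(edge.edge_type)  # cites
```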

Step 5: Document the source

Create a wiki page at Sources/{jurisdiction}/{source_name} documenting:

  • What the source contains
  • Publisher and license
  • Data format (XML, CSV, JSON, API)
  • Update frequency
  • Known quirks or data quality issues
  • Volume (approximate number of documents)
  • Coverage dates

Complete example: CADA source

The CADA (Commission d'accès aux documents administratifs) source is a good reference implementation for a simple inline source.

Registration

In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py:

<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

Downloader

In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py:

  • Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr
  • Caches the CSV to disk with change detection via last_modified timestamp
  • Parses each CSV row into a corpus.documents record with structured tags
  • Inserts in batches of 5000 using insert_batch()
  • Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)
  • Tracks ingest state so unchanged upstream data is skipped
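
The batching and orphan-cleanup steps can be sketched as follows. Both helpers are hypothetical. The guide does not say exactly what the 30,000 threshold measures, so the sketch takes one possible reading: cleanup is refused when more than that many records would be deleted in a single run.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from itertools import islice

BATCH_SIZE = 5000
ORPHAN_SAFETY_THRESHOLD = 30_000


def batches(records: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def find_orphans(existing_ids: set[str], seen_ids: set[str],
                 limit: int = ORPHAN_SAFETY_THRESHOLD) -> set[str]:
    """Return IDs present in the database but absent upstream, within the limit."""
    orphans = existing_ids - seen_ids
    if len(orphans) > limit:
        # Assumed reading of the safety threshold: abort rather than mass-delete.
        raise RuntimeError(f"{len(orphans)} orphans exceeds safety limit {limit}")
    return orphans


rows = ({"id": f"fr.cada_{n}"} for n in range(12_001))
print([len(batch) for batch in batches(rows)])  # [5000, 5000, 2001]
print(find_orphans({"fr.cada_1", "fr.cada_2"}, {"fr.cada_1"}))  # {'fr.cada_2'}
```

Each batch would then be handed to insert_batch(); a guard of this kind protects against a truncated upstream file wiping most of the source.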

Record construction

Each CADA record maps to:

  • id: "fr.cada_{number}"
  • kind: "decision"
  • source: "cada"
  • title: "CADA {year} n°{number} — {subject}"
  • tags: includes type, content_quality, nature, court, authority, theme, case_number, solution, summary, headnote_classification
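
The ID and title templates above translate directly into f-strings. The function name and the sample values are illustrative; how the number, year, and subject are read from the CSV is not covered here.

```python
from __future__ import annotations


def cada_identity(number: str, year: int, subject: str) -> dict[str, str]:
    """Build the fixed fields of a CADA record from its parsed values."""
    return {
        "id": f"fr.cada_{number}",
        "kind": "decision",
        "source": "cada",
        "title": f"CADA {year} n°{number} — {subject}",
    }


fields = cada_identity("20240001", 2024, "Sample subject")
print(fields["id"])     # fr.cada_20240001
print(fields["title"])  # CADA 2024 n°20240001 — Sample subject
```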

Existing French sources

For reference, the following sources are currently registered for France:

{| class="wikitable"
! Source key !! Name !! Kind !! Publisher
|-
| cass || Cour de cassation || decision || DILA
|-
| inca || INCA || decision || DILA
|-
| capp || Cours d'appel || decision || DILA
|-
| jade || JADE || decision || DILA
|-
| constit || Conseil constitutionnel || decision || DILA
|-
| cnil || CNIL || decision || DILA
|-
| legi || LEGI || legislation || DILA
|-
| kali || KALI || legislation || DILA
|-
| acco || ACCO || legislation || DILA
|-
| jorf || JORF || legislation || DILA
|-
| ce_opendata || Justice administrative (open data) || decision || Conseil d'État
|-
| cada || CADA || decision || CADA
|-
| bofip || BOFiP || legislation || DGFiP
|-
| judilibre || Judilibre || decision || Cour de cassation
|-
| jufi || Juridictions financières || decision || DILA
|-
| rne || Registre national des entreprises || record || INPI
|-
| bodacc || BODACC || notice || DILA
|}