Adding a source

This guide explains how to add a new data source to the Dura Lex ingest pipeline. A "source" is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.

Overview

Adding a source involves these steps:

Register the source in the source registry
Implement a downloader (and optionally a parser)
Follow tag conventions
Follow edge conventions
Document the source on the wiki

Step 1: Register the source

Every source must be registered in the jurisdiction's source_registry.py module. Registration happens at import time via the register_source() function from duralex.ingest.source_registry.

Location

French sources: duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py
EU sources: duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py
New jurisdictions: create duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py

Required fields

The register_source() function requires the following fields:

Field	Type	Description	Example
`source_key`	`str` (positional)	Unique short identifier for the source	`"cada"`
`jurisdiction`	`str`	ISO-style jurisdiction code	`"fr"`, `"eu"`
`name`	`str`	Human-readable source name	`"CADA"`
`description`	`str`	What this source contains	`"Avis et conseils de la Commission d'accès aux documents administratifs"`
`kind`	`str`	One of the 6 structural kinds: `legislation`, `decision`, `record`, `notice`, `section`, `chunk`	`"decision"`
`publisher`	`str`	Organization that publishes the data	`"CADA"`, `"DILA"`
`publisher_url`	`str`	URL where the data can be found	`"https://www.data.gouv.fr/fr/datasets/..."`
`license`	`str`	License name	`"Licence Ouverte 2.0"`
`license_url`	`str`	URL of the license	`"https://www.etalab.gouv.fr/licence-ouverte-open-licence/"`
`language`	`str`	ISO 639-1 language code	`"fr"`, `"en"`
`date_bounds`	`DateBounds`	Valid date range for this source	see below

DateBounds

DateBounds validates that document dates are within a plausible range. Two factory methods are available:

DateBounds.strict(min_year=NNNN) — for sources where documents cannot have future dates (decisions, records, notices). Max year defaults to current year + small margin.
DateBounds.permissive(min_year=NNNN, max_year=NNNN) — for legislation with legitimate forward entry-into-force or expiration dates.
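
The difference between the two factories can be pictured with a small stand-in class. This is only an illustrative sketch: `YearBounds`, its fields, the `contains()` method, and the one-year margin are assumptions made for this example, not the actual `DateBounds` implementation in `duralex.ingest.date_validation`.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class YearBounds:
    """Hypothetical stand-in for DateBounds (not the real class)."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, today: date | None = None) -> "YearBounds":
        # Assumed margin: one year past the current year.
        current_year = (today or date.today()).year
        return cls(min_year=min_year, max_year=current_year + 1)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "YearBounds":
        # The caller states the upper bound explicitly, which allows
        # forward entry-into-force or expiration dates.
        return cls(min_year=min_year, max_year=max_year)

    def contains(self, value: date) -> bool:
        """Return True when the date's year falls inside the bounds."""
        return self.min_year <= value.year <= self.max_year
```

Under these assumptions, a decision dated several years in the future would be rejected by the strict bounds, while a legislation source with permissive bounds could accept it.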

Example: simple source registration

<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

Sub-sources

If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:

<syntaxhighlight lang="python">
from duralex.ingest.source_registry import register_sub_source

# Parent must be registered first
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    register_sub_source(sub_key, "judilibre")
</syntaxhighlight>

Step 2: Implement a downloader

BaseDownloader (for DILA-style tar.gz archives)

If your source publishes data as .tar.gz archives on an HTTP index page (like DILA sources), subclass BaseDownloader from duralex.ingest.sources.base_downloader.

BaseDownloader provides:

HTTP retry with exponential backoff and 429/Retry-After support
Listing remote files from an HTML index page
Downloading with partial-download safety (.tmp files + atomic rename)
Extracting .tar.gz archives with prefix stripping
State tracking (last downloaded diff filename) for incremental updates
Freemium base file + incremental diff pattern
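
To make the retry behaviour in the first item concrete, here is a minimal sketch of how a wait time could be computed from the attempt number and a `Retry-After` header. The function name `retry_delay`, the one-second base, and the 60-second cap are assumptions for illustration; BaseDownloader's actual values and internals are not documented here.

```python
from __future__ import annotations


def retry_delay(
    attempt: int,
    retry_after: str | None = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: honours a numeric Retry-After header when the
    server sends one (typically with HTTP 429), otherwise doubles the
    delay on every attempt up to `cap`.
    """
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch ignores
            # that form and falls back to exponential backoff.
            pass
    return min(base * (2 ** attempt), cap)
```

With these defaults, successive attempts wait 1, 2, 4, 8 seconds and so on, unless the server asks for a specific delay.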

Constructor parameters:

Parameter	Description
`dataset_name`	Identifier for this dataset (e.g. `"cass"`, `"legi"`)
`data_directory`	Root directory for extracted data and archives
`base_url`	URL of the HTTP index listing `.tar.gz` files
`get_state`	Callback returning the last processed filename, or `None`
`set_state`	Callback to persist the last processed filename
`archive_prefix_markers`	Directory names marking the start of useful content in tar archives (e.g. `["JURI", "CETA"]`)
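
The prefix stripping driven by `archive_prefix_markers` can be sketched as follows. This is an assumption about the behaviour (keep the path from the first marker directory onward, skip members without a marker), written as a standalone hypothetical helper rather than the code BaseDownloader actually runs.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def strip_archive_prefix(member_name: str, markers: list[str]) -> str | None:
    """Return the member path starting at the first marker directory.

    Hypothetical helper: returns None when no marker is present, so the
    caller can skip that archive member.
    """
    parts = PurePosixPath(member_name).parts
    for index, part in enumerate(parts):
        if part in markers:
            return str(PurePosixPath(*parts[index:]))
    return None
```

For example, with markers `["JURI", "CETA"]`, a member stored under some dated top-level directory would be extracted relative to its `JURI` directory.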

The run() method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. run() does not write state: call commit_state() after a successful ingest, so that a failed ingest is retried from the same point on the next run.
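
The `get_state` and `set_state` callbacks can be backed by any persistent store. The sketch below uses a JSON file purely for illustration; `make_state_callbacks` and the file layout are hypothetical, and a real source would more likely rely on the pipeline's own state storage.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def make_state_callbacks(
    state_file: Path, dataset_name: str
) -> tuple[Callable[[], str | None], Callable[[str], None]]:
    """Return (get_state, set_state) callbacks persisting to a JSON file."""

    def _load() -> dict[str, str]:
        if state_file.exists():
            return json.loads(state_file.read_text(encoding="utf-8"))
        return {}

    def get_state() -> str | None:
        # None means "nothing processed yet", i.e. a first run.
        return _load().get(dataset_name)

    def set_state(filename: str) -> None:
        state = _load()
        state[dataset_name] = filename
        state_file.write_text(json.dumps(state), encoding="utf-8")

    return get_state, set_state
```

The two callables would then be passed as the `get_state` and `set_state` constructor parameters. Because run() leaves state untouched, `set_state` is only reached once commit_state() is called after the ingest has succeeded.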

Inline downloader (for non-archive sources)

For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an "inline source".

The CADA downloader is a good example of an inline source. Key patterns:

<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import Path

# IngestErrors is the pipeline's error collector; its import path is not
# shown in this guide.


class CadaDownloader:
    """Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents."""

    def __init__(self, dsn: str, data_directory: Path) -> None:
        self.dsn = dsn
        self._cache_directory = data_directory / "cada"

    def run(self, errors: IngestErrors | None = None) -> list[str]:
        """Download, parse, and insert. Returns []."""
        # 1. Check upstream for changes (compare last_modified)
        # 2. Download or use cached file
        # 3. Parse rows into corpus.documents records
        # 4. Insert in batches using insert_batch()
        # 5. Clean up orphaned records
        # 6. Update ingest state
        ...
</syntaxhighlight>

Key conventions for inline downloaders:

Accept dsn (PostgreSQL connection string) and data_directory (root for cached files) in the constructor
Accept an optional errors: IngestErrors | None parameter in run() for error tracking
Cache downloaded files to disk: write to a .tmp file during the download, then rename it atomically on completion
Store a sidecar metadata file (e.g., .meta) for change detection
Use get_ingest_state() / set_ingest_state() from duralex.ingest.state to track whether the upstream data has changed
Use insert_batch() from duralex.ingest.database.batch_writer for bulk insertion
Return an empty list (inline sources handle their own insertion)
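
The caching and change-detection conventions above can be sketched with the standard library alone. The helper names and the JSON layout of the `.meta` sidecar are assumptions for illustration, not the CADA downloader's actual code.

```python
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Iterable


def write_atomically(target: Path, chunks: Iterable[bytes]) -> None:
    """Write chunks to `<target>.tmp`, then rename over `target`."""
    tmp = target.with_name(target.name + ".tmp")
    target.parent.mkdir(parents=True, exist_ok=True)
    with tmp.open("wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
    # os.replace is atomic when both paths are on the same filesystem,
    # so readers never see a half-written file.
    os.replace(tmp, target)


def write_meta(target: Path, last_modified: str) -> None:
    """Record the upstream last_modified value next to the cached file."""
    meta = target.with_name(target.name + ".meta")
    meta.write_text(json.dumps({"last_modified": last_modified}), encoding="utf-8")


def upstream_changed(target: Path, last_modified: str) -> bool:
    """Return True when the cache is missing or its sidecar differs."""
    meta = target.with_name(target.name + ".meta")
    if not target.exists() or not meta.exists():
        return True
    stored = json.loads(meta.read_text(encoding="utf-8"))
    return stored.get("last_modified") != last_modified
```

Writing the sidecar only after the rename succeeds means an interrupted download is simply retried on the next run.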

Document record format

Each record inserted into corpus.documents must have:

Field	Description
`id`	Globally unique document ID, prefixed with jurisdiction (e.g. `"fr.cada_20240001"`)
`kind`	One of: `legislation`, `decision`, `record`, `notice`, `section`, `chunk`
`jurisdiction`	Jurisdiction code (e.g. `"fr"`, `"eu"`)
`language`	ISO 639-1 language code
`source`	Source key matching the registered source
`date`	Primary date (ISO 8601 format `YYYY-MM-DD`), or `None`
`date_end`	End date for documents with a date range, or `None`
`parent_id`	Parent document ID for hierarchical sources, or `None`
`title`	Human-readable title
`body`	Clean displayable content (HTML or formatted text). Immutable after ingestion.
`body_search`	Indexable text for FTS (can be noisy: PDF OCR, etc.), or `None` to use `body`
`tags`	JSON string of metadata tags (see below)
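
As an illustration of the table above, the following sketch assembles one record as a plain dictionary. `build_record` is a hypothetical helper, and the `{jurisdiction}.{source}_{local_id}` ID scheme is an assumption generalized from the `"fr.cada_20240001"` example; check how existing sources build their IDs before reusing it.

```python
from __future__ import annotations

import json
from datetime import date

KINDS = {"legislation", "decision", "record", "notice", "section", "chunk"}


def build_record(
    *,
    jurisdiction: str,
    source: str,
    local_id: str,
    kind: str,
    language: str,
    title: str,
    body: str,
    tags: dict[str, object],
    doc_date: date | None = None,
    date_end: date | None = None,
    parent_id: str | None = None,
    body_search: str | None = None,
) -> dict[str, object]:
    """Build one corpus.documents record (illustrative only)."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return {
        "id": f"{jurisdiction}.{source}_{local_id}",  # assumed ID scheme
        "kind": kind,
        "jurisdiction": jurisdiction,
        "language": language,
        "source": source,
        "date": doc_date.isoformat() if doc_date else None,
        "date_end": date_end.isoformat() if date_end else None,
        "parent_id": parent_id,
        "title": title,
        "body": body,
        "body_search": body_search,  # None means "index body as-is"
        "tags": json.dumps(tags, ensure_ascii=False),
    }
```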

Step 3: Follow tag conventions

Tags are the primary metadata mechanism. Every document gets a tags JSON object. Refer to Corpus/Tag conventions for the full shared vocabulary.

Mandatory tags

type — legal document type (e.g. "law", "decree", "judgment", "administrative_decision"). See the tag conventions for the full list.
content_quality — quality of the body content (e.g. "native_structured", "ocr_raw", "metadata_only")
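
A downloader can guard against forgetting these two tags with a small check before serializing. `missing_mandatory_tags` is a hypothetical helper written for this guide, not part of the pipeline.

```python
from __future__ import annotations

MANDATORY_TAGS = ("type", "content_quality")


def missing_mandatory_tags(tags: dict[str, object]) -> list[str]:
    """Return the names of mandatory tags that are absent or empty."""
    return [name for name in MANDATORY_TAGS if tags.get(name) in (None, "")]
```

An empty result means both mandatory tags are set; anything else is worth reporting as an ingest error rather than inserting the document silently.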

Common optional tags

nature — source-specific document nature
court — court identifier (for decisions)
authority — issuing authority
case_number — case/dossier numbers (as a list)
solution — outcome/disposition
summary — headnote or summary
headnote_classification — subject classification
_importance_level_default — default importance for ranking

Tag construction pattern

<syntaxhighlight lang="python">
import json


def _tags(**kwargs: object) -> str:
    """Build a tags JSON string, stripping None values."""
    return json.dumps(
        {k: v for k, v in kwargs.items() if v is not None},
        ensure_ascii=False,
    )


# Usage (number and outcome come from the parsed source row)
tags = _tags(
    type="administrative_decision",
    content_quality="native_structured",
    court="cada",
    case_number=[number],
    solution=outcome or None,
)
</syntaxhighlight>

Step 4: Follow edge conventions

If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the corpus.edges table. Refer to Corpus/Edge types for the full taxonomy of ~75 edge types.

Common edge types for new sources:

cites — document A cites document B
amends — document A amends document B
repeals — document A repeals document B
implements — national law implements an EU directive
transposes — national law transposes an EU directive
consolidates — consolidated version of a text
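
As a rough sketch, cross-references found while parsing can be turned into edge rows like this. The column names (`source_id`, `target_id`, `edge_type`) are assumptions, since the schema of corpus.edges is not described in this guide, and the type check covers only the six types listed above rather than the full taxonomy.

```python
from __future__ import annotations

COMMON_EDGE_TYPES = {
    "cites", "amends", "repeals", "implements", "transposes", "consolidates",
}


def build_edges(
    source_id: str, target_ids: list[str], edge_type: str
) -> list[dict[str, str]]:
    """Build de-duplicated edge rows from one document to its targets."""
    if edge_type not in COMMON_EDGE_TYPES:
        raise ValueError(f"unknown edge type: {edge_type!r}")
    seen: set[str] = set()
    edges: list[dict[str, str]] = []
    for target_id in target_ids:
        # Skip self-references and repeated mentions of the same target.
        if target_id == source_id or target_id in seen:
            continue
        seen.add(target_id)
        edges.append(
            {"source_id": source_id, "target_id": target_id, "edge_type": edge_type}
        )
    return edges
```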

Step 5: Document the source

Create a wiki page at Sources/{jurisdiction}/{source_name} documenting:

What the source contains
Publisher and license
Data format (XML, CSV, JSON, API)
Update frequency
Known quirks or data quality issues
Volume (approximate number of documents)
Coverage dates

Complete example: CADA source

The CADA (Commission d'accès aux documents administratifs) source is a good reference implementation for a simple inline source.

Registration

In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py:

<syntaxhighlight lang="python">
register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

Downloader

In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py:

Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr
Caches the CSV to disk with change detection via last_modified timestamp
Parses each CSV row into a corpus.documents record with structured tags
Inserts in batches of 5000 using insert_batch()
Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)
Tracks ingest state so unchanged upstream data is skipped
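
The batching and orphan-cleanup steps can be sketched as two small helpers. Both are hypothetical, and the threshold logic reflects one plausible reading of the safety rule (refuse to delete when more than 30,000 records would disappear); the real downloader may apply it differently.

```python
from __future__ import annotations

from typing import Iterable, Iterator

BATCH_SIZE = 5000
ORPHAN_SAFETY_THRESHOLD = 30_000


def batched(
    records: Iterable[dict], size: int = BATCH_SIZE
) -> Iterator[list[dict]]:
    """Yield lists of at most `size` records, ready for insert_batch()."""
    batch: list[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def orphans_to_delete(
    existing_ids: set[str],
    seen_ids: set[str],
    threshold: int = ORPHAN_SAFETY_THRESHOLD,
) -> set[str]:
    """Return IDs present in the database but absent from this run."""
    orphans = existing_ids - seen_ids
    if len(orphans) > threshold:
        # A large orphan count more likely signals a truncated or
        # malformed upstream file than a genuine mass removal.
        raise RuntimeError(f"refusing to delete {len(orphans)} records")
    return orphans
```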

Record construction

Each CADA record maps to:

id: "fr.cada_{number}"
kind: "decision"
source: "cada"
title: "CADA {year} n°{number} — {subject}"
tags: includes type, content_quality, nature, court, authority, theme, case_number, solution, summary, headnote_classification

Existing French sources

For reference, the following sources are currently registered for France:

Source key	Name	Kind	Publisher
`cass`	Cour de cassation	decision	DILA
`inca`	INCA	decision	DILA
`capp`	Cours d'appel	decision	DILA
`jade`	JADE	decision	DILA
`constit`	Conseil constitutionnel	decision	DILA
`cnil`	CNIL	decision	DILA
`legi`	LEGI	legislation	DILA
`kali`	KALI	legislation	DILA
`acco`	ACCO	legislation	DILA
`jorf`	JORF	legislation	DILA
`ce_opendata`	Justice administrative (open data)	decision	Conseil d'État
`cada`	CADA	decision	CADA
`bofip`	BOFiP	legislation	DGFiP
`judilibre`	Judilibre	decision	Cour de cassation
`jufi`	Juridictions financières	decision	DILA
`rne`	Registre national des entreprises	record	INPI
`bodacc`	BODACC	notice	DILA
