Adding a source
This guide explains how to add a new data source to the Dura Lex ingest pipeline. A "source" is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.
Overview
Adding a source involves these steps:
- Register the source in the source registry
- Implement a downloader (and optionally a parser)
- Follow the tag conventions
- Follow the edge conventions
- Document the source on the wiki
Step 1: Register the source
Every source must be registered in the jurisdiction's source_registry.py module. Registration happens at import time via the register_source() function from duralex.ingest.source_registry.
Location
- French sources: duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py
- EU sources: duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py
- New jurisdictions: create duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py
Required fields
The register_source() function requires the following fields:
| Field | Type | Description | Example |
|---|---|---|---|
| source_key | str (positional) | Unique short identifier for the source | "cada" |
| jurisdiction | str | ISO-style jurisdiction code | "fr", "eu" |
| name | str | Human-readable source name | "CADA" |
| description | str | What this source contains | "Avis et conseils de la Commission d'accès aux documents administratifs" |
| kind | str | One of the 6 structural kinds: legislation, decision, record, notice, section, chunk | "decision" |
| publisher | str | Organization that publishes the data | "CADA", "DILA" |
| publisher_url | str | URL where the data can be found | "https://www.data.gouv.fr/fr/datasets/..." |
| license | str | License name | "Licence Ouverte 2.0" |
| license_url | str | URL of the license | "https://www.etalab.gouv.fr/licence-ouverte-open-licence/" |
| language | str | ISO 639-1 language code | "fr", "en" |
| date_bounds | DateBounds | Valid date range for this source | see below |
DateBounds
DateBounds validates that document dates are within a plausible range. Two factory methods are available:
- DateBounds.strict(min_year=NNNN): for sources where documents cannot have future dates (decisions, records, notices). Max year defaults to the current year plus a small margin.
- DateBounds.permissive(min_year=NNNN, max_year=NNNN): for legislation with legitimate forward entry-into-force or expiration dates.
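To make the difference between the two factories concrete, the following is a minimal stand-in and not the real DateBounds class from duralex.ingest.date_validation. The class name SketchDateBounds, the contains() method, and the one-year margin are assumptions chosen for illustration; the actual margin and validation API may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SketchDateBounds:
    """Hypothetical stand-in for DateBounds; not the real implementation."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, margin_years: int = 1) -> "SketchDateBounds":
        # Assumption: the "small margin" is modelled here as one year.
        return cls(min_year=min_year, max_year=date.today().year + margin_years)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "SketchDateBounds":
        # The caller states the upper bound explicitly, so far-future dates can pass.
        return cls(min_year=min_year, max_year=max_year)

    def contains(self, value: date) -> bool:
        """Return True when the year of the date falls inside the bounds."""
        return self.min_year <= value.year <= self.max_year


decisions = SketchDateBounds.strict(min_year=1900)
legislation = SketchDateBounds.permissive(min_year=1800, max_year=2100)
far_future = date(date.today().year + 20, 1, 1)
print(decisions.contains(far_future))    # False
print(legislation.contains(date(2090, 1, 1)))  # True
```

In this sketch a decision dated twenty years ahead is rejected, while a legislative text with a forward expiration date inside the declared range is accepted.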
Example: simple source registration
<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>
Sub-sources
If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:
<syntaxhighlight lang="python">
from duralex.ingest.source_registry import register_sub_source

# Parent must be registered first
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    register_sub_source(sub_key, "judilibre")
</syntaxhighlight>
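The inheritance rule and the "parent first" requirement can be pictured with a toy in-memory registry. This is only a sketch: SketchSource, SKETCH_REGISTRY, and the sketch_ functions are hypothetical names, only three metadata fields are modelled, and raising KeyError for a missing parent is an assumption about behaviour rather than a description of the real registry.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical, reduced view of a registered source."""

    source_key: str
    jurisdiction: str
    kind: str
    parent_key: Optional[str] = None


SKETCH_REGISTRY: Dict[str, SketchSource] = {}


def sketch_register_source(source_key: str, *, jurisdiction: str, kind: str) -> None:
    SKETCH_REGISTRY[source_key] = SketchSource(source_key, jurisdiction, kind)


def sketch_register_sub_source(sub_key: str, parent_key: str) -> None:
    parent = SKETCH_REGISTRY.get(parent_key)
    if parent is None:
        # Assumption: an unregistered parent is treated as an error.
        raise KeyError(f"parent source {parent_key!r} must be registered first")
    # Copy every metadata field from the parent, changing only the keys.
    SKETCH_REGISTRY[sub_key] = replace(parent, source_key=sub_key, parent_key=parent_key)


sketch_register_source("judilibre", jurisdiction="fr", kind="decision")
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    sketch_register_sub_source(sub_key, "judilibre")

print(SKETCH_REGISTRY["judilibre_ca"].kind)  # decision
```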
Step 2: Implement a downloader
BaseDownloader (for DILA-style tar.gz archives)
If your source publishes data as .tar.gz archives on an HTTP index page (like DILA sources), subclass BaseDownloader from duralex.ingest.sources.base_downloader.
BaseDownloader provides:
- HTTP retry with exponential backoff and 429/Retry-After support
- Listing remote files from an HTML index page
- Downloading with partial-download safety (.tmp files + atomic rename)
- Extracting .tar.gz archives with prefix stripping
- State tracking (last downloaded diff filename) for incremental updates
- Freemium base file + incremental diff pattern
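As an illustration of the first item in the list above, here is one common way to compute the wait between retries when a server may answer 429 with a Retry-After header. It is a sketch of the general technique, not the code BaseDownloader uses; the function name retry_delay and the default base and cap values are assumptions.

```python
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base_seconds: float = 1.0,
    max_seconds: float = 60.0,
) -> float:
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds: honour the server's request.
            return max(0.0, float(retry_after))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch falls back to backoff.
            pass
    # Exponential backoff: base, 2x base, 4x base, ... capped at max_seconds.
    return min(base_seconds * (2 ** attempt), max_seconds)


print([retry_delay(n) for n in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print(retry_delay(2, retry_after="30"))    # 30.0
```

A caller would sleep for the returned duration between attempts and give up after a fixed number of tries.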
Constructor parameters:
| Parameter | Description |
|---|---|
| dataset_name | Identifier for this dataset (e.g. "cass", "legi") |
| data_directory | Root directory for extracted data and archives |
| base_url | URL of the HTTP index listing .tar.gz files |
| get_state | Callback returning the last processed filename, or None |
| set_state | Callback to persist the last processed filename |
| archive_prefix_markers | Directory names marking the start of useful content in tar archives (e.g. ["JURI", "CETA"]) |
The run() method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. run() does not write state; call commit_state() after a successful ingest.
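The ordering of run() and commit_state() matters: if state were committed before the ingest finished, a failed ingest could leave files unprocessed on the next run. The sketch below shows the calling pattern with a stand-in object. FakeDownloader and ingest_then_commit are hypothetical names; only the run() and commit_state() method names come from this page.

```python
from typing import Callable, List


class FakeDownloader:
    """Hypothetical stand-in exposing only run() and commit_state()."""

    def __init__(self, paths: List[str]) -> None:
        self._paths = paths
        self.committed = False

    def run(self) -> List[str]:
        return list(self._paths)

    def commit_state(self) -> None:
        self.committed = True


def ingest_then_commit(downloader, ingest_files: Callable[[List[str]], None]) -> int:
    """Run the download, ingest the files, and commit state only on success."""
    paths = downloader.run()
    ingest_files(paths)  # an exception raised here skips commit_state()
    downloader.commit_state()
    return len(paths)


downloader = FakeDownloader(["a.xml", "b.xml"])
print(ingest_then_commit(downloader, lambda paths: None))  # 2
print(downloader.committed)  # True
```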
Inline downloader (for non-archive sources)
For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an "inline source".
The CADA downloader is a good example of an inline source. Key patterns:
<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import Path

# IngestErrors is the pipeline's error collector; its import is omitted here.


class CadaDownloader:
    """Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents."""

    def __init__(self, dsn: str, data_directory: Path) -> None:
        self.dsn = dsn
        self._cache_directory = data_directory / "cada"

    def run(self, errors: IngestErrors | None = None) -> list[str]:
        """Download, parse, and insert. Returns []."""
        # 1. Check upstream for changes (compare last_modified)
        # 2. Download or use cached file
        # 3. Parse rows into corpus.documents records
        # 4. Insert in batches using insert_batch()
        # 5. Clean up orphaned records
        # 6. Update ingest state
        ...
</syntaxhighlight>
Key conventions for inline downloaders:
- Accept dsn (PostgreSQL connection string) and data_directory (root for cached files) in the constructor
- Accept an optional errors: IngestErrors | None parameter in run() for error tracking
- Cache downloaded files to disk with a .tmp extension during download and atomic rename on completion
- Store a sidecar metadata file (e.g., .meta) for change detection
- Use get_ingest_state() / set_ingest_state() from duralex.ingest.state to track whether the upstream data has changed
- Use insert_batch() from duralex.ingest.database.batch_writer for bulk insertion
- Return an empty list (inline sources handle their own insertion)
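The caching conventions above (a .tmp file, an atomic rename, and a sidecar .meta file) can be sketched with the standard library alone. The helper names save_with_sidecar and is_cache_current, the JSON layout of the sidecar, and the sample CSV content are assumptions for illustration; the real downloaders may store different metadata.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def save_with_sidecar(destination: Path, chunks: Iterable[bytes], last_modified: str) -> None:
    """Write the payload to a .tmp file, rename it atomically, then record metadata."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    temporary = destination.with_name(destination.name + ".tmp")
    with open(temporary, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(temporary, destination)
    sidecar = destination.with_name(destination.name + ".meta")
    # Assumption: the sidecar is a small JSON object; the real layout may differ.
    sidecar.write_text(json.dumps({"last_modified": last_modified}), encoding="utf-8")


def is_cache_current(destination: Path, upstream_last_modified: str) -> bool:
    """Return True when the cached file exists and matches the upstream timestamp."""
    sidecar = destination.with_name(destination.name + ".meta")
    if not (destination.exists() and sidecar.exists()):
        return False
    stored = json.loads(sidecar.read_text(encoding="utf-8"))
    return stored.get("last_modified") == upstream_last_modified


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "cada" / "cada.csv"
    save_with_sidecar(target, [b"id;objet\n", b"1;exemple\n"], "2024-01-01T00:00:00")
    print(is_cache_current(target, "2024-01-01T00:00:00"))  # True
    print(is_cache_current(target, "2024-02-01T00:00:00"))  # False
```

Because the final file only appears after the rename, an interrupted download leaves a .tmp file behind instead of a truncated cache entry.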
Document record format
Each record inserted into corpus.documents must have:
| Field | Description |
|---|---|
| id | Globally unique document ID, prefixed with jurisdiction (e.g. "fr.cada_20240001") |
| kind | One of: legislation, decision, record, notice, section, chunk |
| jurisdiction | Jurisdiction code (e.g. "fr", "eu") |
| language | ISO 639-1 language code |
| source | Source key matching the registered source |
| date | Primary date (ISO 8601 format YYYY-MM-DD), or None |
| date_end | End date for documents with a date range, or None |
| parent_id | Parent document ID for hierarchical sources, or None |
| title | Human-readable title |
| body | Clean displayable content (HTML or formatted text). Immutable after ingestion. |
| body_search | Indexable text for FTS (can be noisy: PDF OCR, etc.), or None to use body |
| tags | JSON string of metadata tags (see below) |
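A record carrying the fields in the table can be pictured as a plain dictionary. The sketch below assumes a dict representation and a helper named build_record, both hypothetical; the exact type that insert_batch() expects is not described on this page. The sample title and body are placeholder values.

```python
import json
from typing import Any, Dict, Optional


def build_record(number: str, title: str, body: str, decision_date: Optional[str]) -> Dict[str, Any]:
    """Assemble one corpus.documents record for a CADA-style decision."""
    return {
        "id": f"fr.cada_{number}",   # jurisdiction prefix + source-specific part
        "kind": "decision",
        "jurisdiction": "fr",
        "language": "fr",
        "source": "cada",            # must match the registered source key
        "date": decision_date,       # ISO 8601 (YYYY-MM-DD) or None
        "date_end": None,
        "parent_id": None,
        "title": title,
        "body": body,
        "body_search": None,         # None means the body is used for search
        "tags": json.dumps(
            {"type": "administrative_decision", "content_quality": "native_structured"},
            ensure_ascii=False,
        ),
    }


record = build_record("20240001", "Placeholder title", "Placeholder body", "2024-01-15")
print(record["id"])  # fr.cada_20240001
```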
Step 3: Follow tag conventions
Tags are the primary metadata mechanism. Every document gets a tags JSON object. Refer to Corpus/Tag conventions for the full shared vocabulary.
Mandatory tags
- type: legal document type (e.g. "law", "decree", "judgment", "administrative_decision"). See the tag conventions for the full list.
- content_quality: quality of the body content (e.g. "native_structured", "ocr_raw", "metadata_only")
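A downloader can check for the two mandatory tags before inserting a record. The helper below is a hypothetical sketch (the name missing_mandatory_tags is not part of the pipeline) that reports which mandatory tags are absent or empty in a tags JSON string.

```python
import json
from typing import List

MANDATORY_TAGS = ("type", "content_quality")


def missing_mandatory_tags(tags_json: str) -> List[str]:
    """Return the mandatory tag names that are absent or empty in the JSON string."""
    tags = json.loads(tags_json)
    return [name for name in MANDATORY_TAGS if not tags.get(name)]


complete = json.dumps({"type": "judgment", "content_quality": "ocr_raw"})
partial = json.dumps({"type": "judgment", "court": "cada"})
print(missing_mandatory_tags(complete))  # []
print(missing_mandatory_tags(partial))   # ['content_quality']
```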
Common optional tags
- nature: source-specific document nature
- court: court identifier (for decisions)
- authority: issuing authority
- case_number: case/dossier numbers (as a list)
- solution: outcome/disposition
- summary: headnote or summary
- headnote_classification: subject classification
- _importance_level_default: default importance for ranking
Tag construction pattern
<syntaxhighlight lang="python">
import json


def _tags(**kwargs: object) -> str:
    """Build a tags JSON string, stripping None values."""
    return json.dumps(
        {k: v for k, v in kwargs.items() if v is not None},
        ensure_ascii=False,
    )


# Usage (number and outcome are illustrative placeholder values)
number = "20240001"
outcome = ""
tags = _tags(
    type="administrative_decision",
    content_quality="native_structured",
    court="cada",
    case_number=[number],
    solution=outcome or None,  # an empty string becomes None and is dropped
)
</syntaxhighlight>
Step 4: Follow edge conventions
If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the corpus.edges table. Refer to Corpus/Edge types for the full taxonomy of ~75 edge types.
Common edge types for new sources:
- cites: document A cites document B
- amends: document A amends document B
- repeals: document A repeals document B
- implements: national law implements an EU directive
- transposes: national law transposes an EU directive
- consolidates: consolidated version of a text
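An edge can be thought of as a typed, directed link between two document IDs. The sketch below is illustrative only: the columns of corpus.edges are not described on this page, so the SketchEdge field names (source_id, target_id, edge_type) are assumptions, the target ID is a made-up placeholder, and the check covers only the six common types listed above rather than the full taxonomy.

```python
from dataclasses import dataclass

# Only the common edge types listed above; the full taxonomy is larger.
COMMON_EDGE_TYPES = frozenset(
    {"cites", "amends", "repeals", "implements", "transposes", "consolidates"}
)


@dataclass(frozen=True)
class SketchEdge:
    """Hypothetical edge: document `source_id` relates to document `target_id`."""

    source_id: str
    target_id: str
    edge_type: str

    def __post_init__(self) -> None:
        if self.edge_type not in COMMON_EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type!r}")


# "fr.example_text" is a placeholder ID, not a real document.
edge = SketchEdge("fr.cada_20240001", "fr.example_text", "cites")
print(edge.edge_type)  # cites
```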
Step 5: Document the source
Create a wiki page at Sources/{jurisdiction}/{source_name} documenting:
- What the source contains
- Publisher and license
- Data format (XML, CSV, JSON, API)
- Update frequency
- Known quirks or data quality issues
- Volume (approximate number of documents)
- Coverage dates
Complete example: CADA source
The CADA (Commission d'accès aux documents administratifs) source is a good reference implementation for a simple inline source.
Registration
In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py:
<syntaxhighlight lang="python">
register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>
Downloader
In duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py:
- Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr
- Caches the CSV to disk with change detection via last_modified timestamp
- Parses each CSV row into a corpus.documents record with structured tags
- Inserts in batches of 5000 using insert_batch()
- Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)
- Tracks ingest state so unchanged upstream data is skipped
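Two of the steps above lend themselves to a short sketch: splitting records into fixed-size batches, and guarding orphan cleanup with a threshold. The helper names batched and orphans_to_delete are hypothetical. The sketch also assumes one particular reading of the safety threshold, namely that cleanup is refused when more than 30,000 records would be deleted; the actual CADA downloader may apply the threshold differently.

```python
from typing import Iterable, Iterator, List, Set, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 5000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


def orphans_to_delete(existing_ids: Set[str], seen_ids: Set[str], max_deletions: int = 30_000) -> Set[str]:
    """Return IDs present in the database but absent upstream, unless too many."""
    orphans = existing_ids - seen_ids
    if len(orphans) > max_deletions:
        # Assumed reading of the threshold: refuse suspiciously large cleanups.
        raise RuntimeError(f"refusing to delete {len(orphans)} records")
    return orphans


print([len(batch) for batch in batched(range(12), size=5)])  # [5, 5, 2]
print(sorted(orphans_to_delete({"a", "b", "c"}, {"a"})))     # ['b', 'c']
```

A truncated or empty upstream file would otherwise make almost every existing record look orphaned, which is the kind of accidental mass deletion the threshold is meant to prevent.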
Record construction
Each CADA record maps to:
- id: "fr.cada_{number}"
- kind: "decision"
- source: "cada"
- title: "CADA {year} n°{number} — {subject}"
- tags: includes type, content_quality, nature, court, authority, theme, case_number, solution, summary, headnote_classification
Existing French sources
For reference, the following sources are currently registered for France:
| Source key | Name | Kind | Publisher |
|---|---|---|---|
| cass | Cour de cassation | decision | DILA |
| inca | INCA | decision | DILA |
| capp | Cours d'appel | decision | DILA |
| jade | JADE | decision | DILA |
| constit | Conseil constitutionnel | decision | DILA |
| cnil | CNIL | decision | DILA |
| legi | LEGI | legislation | DILA |
| kali | KALI | legislation | DILA |
| acco | ACCO | legislation | DILA |
| jorf | JORF | legislation | DILA |
| ce_opendata | Justice administrative (open data) | decision | Conseil d'État |
| cada | CADA | decision | CADA |
| bofip | BOFiP | legislation | DGFiP |
| judilibre | Judilibre | decision | Cour de cassation |
| jufi | Juridictions financières | decision | DILA |
| rne | Registre national des entreprises | record | INPI |
| bodacc | BODACC | notice | DILA |