= Adding a source =

This guide explains how to add a new data source to the Dura Lex ingest pipeline. A "source" is a public institutional data feed (legislation, case law, company records, official notices) that gets downloaded, parsed, and inserted into the corpus.

== Overview ==

Adding a source involves five steps:

# Register the source in the source registry
# Implement a downloader (and optionally a parser)
# Follow the tag conventions
# Follow the edge conventions
# Document the source on the wiki

== Step 1: Register the source ==

Every source must be registered in the jurisdiction's <code>source_registry.py</code> module. Registration happens at import time via the <code>register_source()</code> function from <code>duralex.ingest.source_registry</code>.

=== Location ===

* French sources: <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py</code>
* EU sources: <code>duralex-ingest/duralex-ingest-eu/src/duralex/ingest/eu/source_registry.py</code>
* New jurisdictions: create <code>duralex-ingest/duralex-ingest-{jur}/src/duralex/ingest/{jur}/source_registry.py</code>

=== Required fields ===

The <code>register_source()</code> function requires the following fields:

{| class="wikitable"
! Field !! Type !! Description !! Example
|-
| <code>source_key</code> || <code>str</code> (positional) || Unique short identifier for the source || <code>"cada"</code>
|-
| <code>jurisdiction</code> || <code>str</code> || ISO-style jurisdiction code || <code>"fr"</code>, <code>"eu"</code>
|-
| <code>name</code> || <code>str</code> || Human-readable source name || <code>"CADA"</code>
|-
| <code>description</code> || <code>str</code> || What this source contains || <code>"Avis et conseils de la Commission d'accès aux documents administratifs"</code>
|-
| <code>kind</code> || <code>str</code> || One of the 6 structural kinds: <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code> || <code>"decision"</code>
|-
| <code>publisher</code> || <code>str</code> || Organization that publishes the data || <code>"CADA"</code>, <code>"DILA"</code>
|-
| <code>publisher_url</code> || <code>str</code> || URL where the data can be found || <code>"https://www.data.gouv.fr/fr/datasets/..."</code>
|-
| <code>license</code> || <code>str</code> || License name || <code>"Licence Ouverte 2.0"</code>
|-
| <code>license_url</code> || <code>str</code> || URL of the license || <code>"https://www.etalab.gouv.fr/licence-ouverte-open-licence/"</code>
|-
| <code>language</code> || <code>str</code> || ISO 639-1 language code || <code>"fr"</code>, <code>"en"</code>
|-
| <code>date_bounds</code> || <code>DateBounds</code> || Valid date range for this source || see below
|}

=== DateBounds ===

<code>DateBounds</code> validates that document dates are within a plausible range. Two factory methods are available:

* <code>DateBounds.strict(min_year=NNNN)</code> — for sources where documents cannot have future dates (decisions, records, notices). Max year defaults to current year + small margin.
* <code>DateBounds.permissive(min_year=NNNN, max_year=NNNN)</code> — for legislation with legitimate forward entry-into-force or expiration dates.
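The difference between the two factories can be illustrated with a minimal sketch. <code>SketchDateBounds</code> is a hypothetical stand-in, not the real <code>DateBounds</code> class, and the one-year margin used in <code>strict()</code> is an assumption; the actual margin is whatever <code>duralex.ingest.date_validation</code> defines.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SketchDateBounds:
    """Hypothetical stand-in for DateBounds, for illustration only."""

    min_year: int
    max_year: int

    @classmethod
    def strict(cls, min_year: int, margin_years: int = 1) -> "SketchDateBounds":
        # Assumption: the "small margin" is one year past the current year.
        return cls(min_year, date.today().year + margin_years)

    @classmethod
    def permissive(cls, min_year: int, max_year: int) -> "SketchDateBounds":
        # The caller states the upper bound explicitly.
        return cls(min_year, max_year)

    def contains(self, value: date) -> bool:
        return self.min_year <= value.year <= self.max_year


decisions = SketchDateBounds.strict(min_year=1900)
legislation = SketchDateBounds.permissive(min_year=1800, max_year=2100)
far_future = date(date.today().year + 20, 1, 1)
</syntaxhighlight>

With these bounds, <code>decisions.contains(far_future)</code> is false, while <code>legislation.contains(date(2050, 1, 1))</code> is true.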

=== Example: simple source registration ===

<syntaxhighlight lang="python">
from duralex.ingest.date_validation import DateBounds
from duralex.ingest.source_registry import register_source

register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

=== Sub-sources ===

If your source has logical sub-feeds (e.g., different court levels within a single API), register them as sub-sources. Sub-sources inherit all metadata from their parent:

<syntaxhighlight lang="python">
from duralex.ingest.source_registry import register_sub_source

# Parent must be registered first
for sub_key in ("judilibre_cc", "judilibre_ca", "judilibre_tj", "judilibre_tcom"):
    register_sub_source(sub_key, "judilibre")
</syntaxhighlight>

== Step 2: Implement a downloader ==

=== BaseDownloader (for DILA-style tar.gz archives) ===

If your source publishes data as <code>.tar.gz</code> archives on an HTTP index page (like DILA sources), subclass <code>BaseDownloader</code> from <code>duralex.ingest.sources.base_downloader</code>.

<code>BaseDownloader</code> provides:
* HTTP retry with exponential backoff and 429/Retry-After support
* Listing remote files from an HTML index page
* Downloading with partial-download safety (<code>.tmp</code> files + atomic rename)
* Extracting <code>.tar.gz</code> archives with prefix stripping
* State tracking (last downloaded diff filename) for incremental updates
* Freemium base file + incremental diff pattern
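As an illustration of the retry behavior listed above, the following sketch computes the wait before each retry. It is not the <code>BaseDownloader</code> code: <code>retry_delay</code>, its default values, and the 60-second cap are assumptions, and only the delta-seconds form of <code>Retry-After</code> is handled (the header may also carry an HTTP date).

<syntaxhighlight lang="python">
from __future__ import annotations


def retry_delay(
    attempt: int,
    retry_after: str | None = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # The server asked for a specific pause (429 with Retry-After: <seconds>).
        return min(float(retry_after.strip()), cap)
    # Otherwise double the wait on every attempt, up to the cap.
    return min(base * (2 ** attempt), cap)
</syntaxhighlight>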

Constructor parameters:

{| class="wikitable"
! Parameter !! Description
|-
| <code>dataset_name</code> || Identifier for this dataset (e.g. <code>"cass"</code>, <code>"legi"</code>)
|-
| <code>data_directory</code> || Root directory for extracted data and archives
|-
| <code>base_url</code> || URL of the HTTP index listing <code>.tar.gz</code> files
|-
| <code>get_state</code> || Callback returning the last processed filename, or <code>None</code>
|-
| <code>set_state</code> || Callback to persist the last processed filename
|-
| <code>archive_prefix_markers</code> || Directory names marking the start of useful content in tar archives (e.g. <code>["JURI", "CETA"]</code>)
|}
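One way to picture <code>archive_prefix_markers</code>: everything before the first marker directory in an archive member path is dropped. The sketch below is an assumption about that behavior, not the actual extraction code; <code>strip_archive_prefix</code> and the sample path are hypothetical.

<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import PurePosixPath


def strip_archive_prefix(member_name: str, markers: list[str]) -> str | None:
    """Return the member path starting at the first marker directory."""
    parts = PurePosixPath(member_name).parts
    for index, part in enumerate(parts):
        if part in markers:
            return str(PurePosixPath(*parts[index:]))
    return None  # No marker found: the member is not useful content.


# Hypothetical archive member name, for illustration only.
stripped = strip_archive_prefix(
    "20240101-120000/juri/cass/JURI/TEXT/example.xml", ["JURI", "CETA"]
)
</syntaxhighlight>

Here <code>stripped</code> is <code>"JURI/TEXT/example.xml"</code>.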

The <code>run()</code> method is the main entry point. It downloads the freemium base (on first run) and/or incremental diffs, extracts them, and returns a list of extracted XML file paths. <code>run()</code> does not write state: call <code>commit_state()</code> after a successful ingest, so that a failed ingest does not advance the recorded position.
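The split between <code>run()</code> and <code>commit_state()</code> can be sketched with a toy class. <code>SketchDownloader</code> is hypothetical: it takes the remote file list as an argument and assumes archive filenames sort chronologically, which the real <code>BaseDownloader</code> may handle differently.

<syntaxhighlight lang="python">
from __future__ import annotations

from typing import Callable


class SketchDownloader:
    """Hypothetical illustration of the run()/commit_state() split."""

    def __init__(
        self,
        get_state: Callable[[], str | None],
        set_state: Callable[[str], None],
    ) -> None:
        self._get_state = get_state
        self._set_state = set_state
        self._pending: str | None = None

    def run(self, remote_files: list[str]) -> list[str]:
        last = self._get_state()
        # Assumption: filenames sort chronologically.
        new_files = sorted(f for f in remote_files if last is None or f > last)
        if new_files:
            self._pending = new_files[-1]  # remembered, not persisted yet
        return new_files

    def commit_state(self) -> None:
        # Persist only once the caller has ingested the files successfully.
        if self._pending is not None:
            self._set_state(self._pending)
            self._pending = None


state: dict[str, str] = {}
downloader = SketchDownloader(lambda: state.get("last"), lambda v: state.update(last=v))
files = downloader.run(["a_01.tar.gz", "a_02.tar.gz"])
# ... ingest `files` here; if that raises, state stays untouched ...
downloader.commit_state()
</syntaxhighlight>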

=== Inline downloader (for non-archive sources) ===

For sources that do not follow the tar.gz archive pattern (e.g., CSV files, API endpoints), write a standalone downloader class that handles its own parsing and DB insertion. This is called an "inline source".

The CADA downloader is a good example of an inline source. Key patterns:

<syntaxhighlight lang="python">
from __future__ import annotations

from pathlib import Path

# IngestErrors is the pipeline's error collector; its import is omitted here.


class CadaDownloader:
    """Downloads CADA CSV from data.gouv.fr and ingests into corpus.documents."""

    def __init__(self, dsn: str, data_directory: Path) -> None:
        self.dsn = dsn
        self._cache_directory = data_directory / "cada"

    def run(self, errors: IngestErrors | None = None) -> list[str]:
        """Download, parse, and insert. Returns an empty list."""
        # 1. Check upstream for changes (compare last_modified)
        # 2. Download or use cached file
        # 3. Parse rows into corpus.documents records
        # 4. Insert in batches using insert_batch()
        # 5. Clean up orphaned records
        # 6. Update ingest state
        ...
</syntaxhighlight>

Key conventions for inline downloaders:

* Accept <code>dsn</code> (PostgreSQL connection string) and <code>data_directory</code> (root for cached files) in the constructor
* Accept an optional <code>errors: IngestErrors | None</code> parameter in <code>run()</code> for error tracking
* Write each download to a <code>.tmp</code> file and atomically rename it on completion, so that a partial download never replaces the cached file
* Store a sidecar metadata file (e.g., <code>.meta</code>) for change detection
* Use <code>get_ingest_state()</code> / <code>set_ingest_state()</code> from <code>duralex.ingest.state</code> to track whether the upstream data has changed
* Use <code>insert_batch()</code> from <code>duralex.ingest.database.batch_writer</code> for bulk insertion
* Return an empty list (inline sources handle their own insertion)
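The caching conventions above (temporary file, atomic rename, sidecar metadata) can be sketched with the standard library alone. The helper names and the JSON layout of the <code>.meta</code> file are assumptions made for this example, not the format used by the existing downloaders.

<syntaxhighlight lang="python">
import json
import os
from pathlib import Path


def write_atomically(target: Path, payload: bytes, last_modified: str) -> None:
    """Write payload via a .tmp file, then record a .meta sidecar."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target.with_name(target.name + ".tmp")
    tmp_path.write_bytes(payload)
    os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    meta_path = target.with_name(target.name + ".meta")
    meta_path.write_text(json.dumps({"last_modified": last_modified}), encoding="utf-8")


def is_cache_fresh(target: Path, upstream_last_modified: str) -> bool:
    """True when the cached file exists and its sidecar matches upstream."""
    meta_path = target.with_name(target.name + ".meta")
    if not (target.exists() and meta_path.exists()):
        return False
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return meta.get("last_modified") == upstream_last_modified
</syntaxhighlight>

A downloader would call <code>is_cache_fresh()</code> first and download only when it returns <code>False</code>.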

=== Document record format ===

Each record inserted into <code>corpus.documents</code> must have:

{| class="wikitable"
! Field !! Description
|-
| <code>id</code> || Globally unique document ID, prefixed with jurisdiction (e.g. <code>"fr.cada_20240001"</code>)
|-
| <code>kind</code> || One of: <code>legislation</code>, <code>decision</code>, <code>record</code>, <code>notice</code>, <code>section</code>, <code>chunk</code>
|-
| <code>jurisdiction</code> || Jurisdiction code (e.g. <code>"fr"</code>, <code>"eu"</code>)
|-
| <code>language</code> || ISO 639-1 language code
|-
| <code>source</code> || Source key matching the registered source
|-
| <code>date</code> || Primary date (ISO 8601 format <code>YYYY-MM-DD</code>), or <code>None</code>
|-
| <code>date_end</code> || End date for documents with a date range, or <code>None</code>
|-
| <code>parent_id</code> || Parent document ID for hierarchical sources, or <code>None</code>
|-
| <code>title</code> || Human-readable title
|-
| <code>body</code> || Clean displayable content (HTML or formatted text). Immutable after ingestion.
|-
| <code>body_search</code> || Indexable text for FTS (can be noisy: PDF OCR, etc.), or <code>None</code> to use <code>body</code>
|-
| <code>tags</code> || JSON string of metadata tags (see below)
|}
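As a sketch of the table above, the following builds one record as a plain dictionary and runs a few sanity checks before insertion. <code>check_record</code> is a hypothetical helper and the sample values are illustrative; the pipeline may enforce these rules differently.

<syntaxhighlight lang="python">
import json

REQUIRED_FIELDS = (
    "id", "kind", "jurisdiction", "language", "source", "date",
    "date_end", "parent_id", "title", "body", "body_search", "tags",
)
KINDS = {"legislation", "decision", "record", "notice", "section", "chunk"}


def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if record.get("kind") not in KINDS:
        problems.append(f"unknown kind: {record.get('kind')!r}")
    doc_id, jurisdiction = record.get("id"), record.get("jurisdiction")
    if doc_id and jurisdiction and not doc_id.startswith(f"{jurisdiction}."):
        problems.append("id is not prefixed with the jurisdiction")
    return problems


record = {
    "id": "fr.cada_20240001",
    "kind": "decision",
    "jurisdiction": "fr",
    "language": "fr",
    "source": "cada",
    "date": "2024-01-18",   # illustrative value
    "date_end": None,
    "parent_id": None,
    "title": "Example title",
    "body": "<p>Example body</p>",
    "body_search": None,     # None means: index `body` instead
    "tags": json.dumps(
        {"type": "administrative_decision", "content_quality": "native_structured"}
    ),
}
</syntaxhighlight>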

== Step 3: Follow tag conventions ==

Tags are the primary metadata mechanism. Every document gets a <code>tags</code> JSON object. Refer to [[Corpus/Tag conventions]] for the full shared vocabulary.

=== Mandatory tags ===

* <code>type</code> — legal document type (e.g. <code>"law"</code>, <code>"decree"</code>, <code>"judgment"</code>, <code>"administrative_decision"</code>). See the tag conventions for the full list.
* <code>content_quality</code> — quality of the body content (e.g. <code>"native_structured"</code>, <code>"ocr_raw"</code>, <code>"metadata_only"</code>)
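A quick way to catch a missing mandatory tag before insertion is sketched below. <code>missing_mandatory_tags</code> is a hypothetical helper, not part of the ingest package.

<syntaxhighlight lang="python">
import json

MANDATORY_TAGS = ("type", "content_quality")


def missing_mandatory_tags(tags_json: str) -> list[str]:
    """Return the mandatory tag names that are absent or empty."""
    tags = json.loads(tags_json)
    return [name for name in MANDATORY_TAGS if tags.get(name) in (None, "")]
</syntaxhighlight>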

=== Common optional tags ===

* <code>nature</code> — source-specific document nature
* <code>court</code> — court identifier (for decisions)
* <code>authority</code> — issuing authority
* <code>case_number</code> — case/dossier numbers (as a list)
* <code>solution</code> — outcome/disposition
* <code>summary</code> — headnote or summary
* <code>headnote_classification</code> — subject classification
* <code>_importance_level_default</code> — default importance for ranking

=== Tag construction pattern ===

<syntaxhighlight lang="python">
import json

def _tags(**kwargs: object) -> str:
    """Build a tags JSON string, stripping None values."""
    return json.dumps(
        {k: v for k, v in kwargs.items() if v is not None},
        ensure_ascii=False,
    )

# Usage: number and outcome are values parsed from the source row.
tags = _tags(
    type="administrative_decision",
    content_quality="native_structured",
    court="cada",
    case_number=[number],
    solution=outcome or None,
)
</syntaxhighlight>

== Step 4: Follow edge conventions ==

If your source contains cross-references to other documents (citations, amendments, transpositions, etc.), create edges in the <code>corpus.edges</code> table. Refer to [[Corpus/Edge types]] for the full taxonomy of ~75 edge types.

Common edge types for new sources:
* <code>cites</code> — document A cites document B
* <code>amends</code> — document A amends document B
* <code>repeals</code> — document A repeals document B
* <code>implements</code> — national law implements an EU directive
* <code>transposes</code> — national law transposes an EU directive
* <code>consolidates</code> — consolidated version of a text
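The sketch below shows one way to turn extracted references into edge rows while dropping duplicates and self-references. The column names (<code>source_id</code>, <code>target_id</code>, <code>edge_type</code>) and the <code>build_edges</code> helper are assumptions; check the actual <code>corpus.edges</code> schema and [[Corpus/Edge types]] before relying on them.

<syntaxhighlight lang="python">
from __future__ import annotations

from typing import NamedTuple

# Only the edge types listed above; the full taxonomy is larger.
KNOWN_EDGE_TYPES = {"cites", "amends", "repeals", "implements", "transposes", "consolidates"}


class Edge(NamedTuple):
    source_id: str  # hypothetical column names
    target_id: str
    edge_type: str


def build_edges(source_id: str, references: list[tuple[str, str]]) -> list[Edge]:
    """Turn (edge_type, target_id) pairs into de-duplicated Edge rows."""
    edges: list[Edge] = []
    seen: set[Edge] = set()
    for edge_type, target_id in references:
        if edge_type not in KNOWN_EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        edge = Edge(source_id, target_id, edge_type)
        # Skip self-references and repeated citations of the same target.
        if target_id != source_id and edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges
</syntaxhighlight>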

== Step 5: Document the source ==

Create a wiki page at <code>Sources/{jurisdiction}/{source_name}</code> documenting:

* What the source contains
* Publisher and license
* Data format (XML, CSV, JSON, API)
* Update frequency
* Known quirks or data quality issues
* Volume (approximate number of documents)
* Coverage dates

== Complete example: CADA source ==

The CADA (Commission d'accès aux documents administratifs) source is a good reference implementation for a simple inline source.

=== Registration ===

In <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/source_registry.py</code>:

<syntaxhighlight lang="python">
register_source(
    "cada",
    jurisdiction="fr",
    name="CADA",
    description="Avis et conseils de la Commission d'accès aux documents administratifs",
    kind="decision",
    publisher="CADA",
    publisher_url="https://www.data.gouv.fr/fr/datasets/53698f37a3a729239d2036a0/",
    license="Licence Ouverte 2.0",
    license_url="https://www.etalab.gouv.fr/licence-ouverte-open-licence/",
    language="fr",
    date_bounds=DateBounds.strict(min_year=1900),
)
</syntaxhighlight>

=== Downloader ===

In <code>duralex-ingest/duralex-ingest-fr/src/duralex/ingest/fr/sources/cada.py</code>:

* Downloads a consolidated CSV (~184 MB, ~60k records, 1984-present) from data.gouv.fr
* Caches the CSV to disk with change detection via <code>last_modified</code> timestamp
* Parses each CSV row into a <code>corpus.documents</code> record with structured tags
* Inserts in batches of 5,000 using <code>insert_batch()</code>
* Cleans up orphaned records (with a safety threshold of 30,000 to avoid accidental mass deletion)
* Tracks ingest state so unchanged upstream data is skipped
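Two of the steps above, batching and guarded orphan cleanup, can be sketched as pure functions. Both helpers are hypothetical; in particular, whether the real downloader raises or merely skips the cleanup when the threshold is exceeded is an assumption.

<syntaxhighlight lang="python">
from __future__ import annotations

from typing import Iterator


def batches(rows: list[dict], size: int = 5000) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def orphans_to_delete(
    existing_ids: set[str],
    seen_ids: set[str],
    max_deletions: int = 30_000,
) -> set[str]:
    """IDs present in the database but absent from the latest download."""
    orphans = existing_ids - seen_ids
    if len(orphans) > max_deletions:
        # Likely a truncated or broken upstream file, not real deletions.
        raise RuntimeError(
            f"refusing to delete {len(orphans)} records (limit {max_deletions})"
        )
    return orphans
</syntaxhighlight>

The threshold turns a silent mass deletion into a visible failure that someone has to look at.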

=== Record construction ===

Each CADA record maps to:
* <code>id</code>: <code>"fr.cada_{number}"</code>
* <code>kind</code>: <code>"decision"</code>
* <code>source</code>: <code>"cada"</code>
* <code>title</code>: <code>"CADA {year} n°{number} — {subject}"</code>
* <code>tags</code>: includes <code>type</code>, <code>content_quality</code>, <code>nature</code>, <code>court</code>, <code>authority</code>, <code>theme</code>, <code>case_number</code>, <code>solution</code>, <code>summary</code>, <code>headnote_classification</code>

== Existing French sources ==

For reference, the following sources are currently registered for France:

{| class="wikitable"
! Source key !! Name !! Kind !! Publisher
|-
| <code>cass</code> || Cour de cassation || decision || DILA
|-
| <code>inca</code> || INCA || decision || DILA
|-
| <code>capp</code> || Cours d'appel || decision || DILA
|-
| <code>jade</code> || JADE || decision || DILA
|-
| <code>constit</code> || Conseil constitutionnel || decision || DILA
|-
| <code>cnil</code> || CNIL || decision || DILA
|-
| <code>legi</code> || LEGI || legislation || DILA
|-
| <code>kali</code> || KALI || legislation || DILA
|-
| <code>acco</code> || ACCO || legislation || DILA
|-
| <code>jorf</code> || JORF || legislation || DILA
|-
| <code>ce_opendata</code> || Justice administrative (open data) || decision || Conseil d'État
|-
| <code>cada</code> || CADA || decision || CADA
|-
| <code>bofip</code> || BOFiP || legislation || DGFiP
|-
| <code>judilibre</code> || Judilibre || decision || Cour de cassation
|-
| <code>jufi</code> || Juridictions financières || decision || DILA
|-
| <code>rne</code> || Registre national des entreprises || record || INPI
|-
| <code>bodacc</code> || BODACC || notice || DILA
|}

[[Category:Development]]