Development Guide

Overview

This guide describes how to set up a development environment and extend the pipeline.

Repository Structure

src/
  adapters/
    cli/
      main.py                # CLI dispatcher
      pipeline_full.py       # Full pipeline execution logic (run_full)
      check_environment.py   # Environment diagnostic tool
      integration/           # Stage scripts
  domain/                    # Core data models and logic
scripts/                     # Utility and conversion scripts
data/
  integration/
    runs/                    # Versioned outputs (git-ignored)
docs/                        # MkDocs documentation
tests/                       # Unit and integration tests

Adding or modifying stages

Overview

The pipeline is composed of stages, each responsible for a major part of the integration workflow — such as blocking, enrichment, disambiguation, or merging.
Within a stage, there may exist multiple use cases, which represent different operational variants of that stage.

For example:

This structure allows new functionality to be added incrementally without rewriting existing code.

How code is organized

CLI adapter → Use case → Application services → Database/API adapters

| Layer | Purpose | Typical location | Notes |
| --- | --- | --- | --- |
| CLI adapters | Entry points that handle arguments, environment loading, and I/O paths. | src/adapters/cli/integration/ | Thin setup layer; no logic beyond parsing and orchestration. |
| Use cases | Define a specific workflow for a stage. Coordinate services and domain models to achieve one operational goal. | src/domain/use_cases/ | Each use case encapsulates one scenario (e.g., “merge entries,” “update one annotation”). |
| Application services | Implement the actual operations and interact with the database or APIs. | src/domain/services/ | Each service encapsulates a single capability (e.g., conflict detection, metadata enrichment). |
| Infrastructure adapters | Handle external connectivity (MongoDB, APIs, file system). | src/infrastructure/ | Include the database adapter and API clients. |

All services in this project are application services — they depend on the DatabaseAdapter or other clients.
This is by design, as each stage interacts with stored metadata or external APIs.
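
The layering described above can be sketched in a few lines of Python. This is only an illustration under assumptions: DatabaseAdapter is named in this guide, but its methods (find_entries, save_entry), the in-memory stand-in, and the ConflictDetectionService and MergeEntriesUseCase classes are hypothetical names chosen for the example, not the project's actual API.

```python
from __future__ import annotations

from typing import Protocol


class DatabaseAdapter(Protocol):
    """Infrastructure boundary. The method names are hypothetical."""

    def find_entries(self, name: str) -> list[dict]: ...

    def save_entry(self, entry: dict) -> None: ...


class InMemoryDatabaseAdapter:
    """Stand-in adapter for illustration; a real adapter would talk to MongoDB."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def find_entries(self, name: str) -> list[dict]:
        return [e for e in self.entries if e["name"] == name]

    def save_entry(self, entry: dict) -> None:
        self.entries.append(entry)


class ConflictDetectionService:
    """Application service: a single capability that depends on the adapter."""

    def __init__(self, db: DatabaseAdapter) -> None:
        self.db = db

    def has_conflict(self, entry: dict) -> bool:
        # A conflict here means: same name already stored with another version.
        existing = self.db.find_entries(entry["name"])
        return any(e["version"] != entry["version"] for e in existing)


class MergeEntriesUseCase:
    """Use case: combines services and the adapter into one workflow."""

    def __init__(self, db: DatabaseAdapter, conflicts: ConflictDetectionService) -> None:
        self.db = db
        self.conflicts = conflicts

    def execute(self, entry: dict) -> str:
        if self.conflicts.has_conflict(entry):
            return "conflict"
        self.db.save_entry(entry)
        return "merged"


if __name__ == "__main__":
    db = InMemoryDatabaseAdapter()
    use_case = MergeEntriesUseCase(db, ConflictDetectionService(db))
    print(use_case.execute({"name": "tool-a", "version": "1.0"}))
    print(use_case.execute({"name": "tool-a", "version": "2.0"}))
```

Because the service and the use case only see the adapter's interface, a different backend can be swapped in without touching them, which is the point of keeping connectivity in the infrastructure layer.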


What a use case is

A use case represents one way the system performs a task.
It defines how and when services are combined to fulfill a specific need.

You can think of it as a “recipe” built from reusable ingredients (services, models, adapters).

Each use case:


Adding a new stage or use case

  1. Create a new use case

  2. Add or reuse services

  3. Use domain models

  4. Integrate with the database and APIs

  5. Create or extend a CLI adapter

  6. (Optional) Add to the full pipeline

  7. Document and test
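
As a rough illustration of steps 1 and 5, the sketch below pairs a minimal use case with a thin CLI adapter built on argparse. Everything specific in it is an assumption made for the example: the UpdateOneAnnotationUseCase class, the --entry-id, --output-dir and --dry-run flags, and the default output directory are hypothetical and do not describe the project's real command line.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass
from pathlib import Path


@dataclass
class UpdateOneAnnotationUseCase:
    """Hypothetical use case; a real one would call application services."""

    dry_run: bool = False

    def execute(self, entry_id: str, output_dir: Path) -> dict:
        # Placeholder result: report what would be done instead of doing it.
        return {
            "entry_id": entry_id,
            "output_dir": str(output_dir),
            "written": not self.dry_run,
        }


def build_parser() -> argparse.ArgumentParser:
    """Argument parsing only; no business logic lives in the CLI adapter."""
    parser = argparse.ArgumentParser(description="Update one annotation (sketch).")
    parser.add_argument("--entry-id", required=True, help="Identifier of the entry.")
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("runs"),  # hypothetical default
        help="Directory for outputs.",
    )
    parser.add_argument("--dry-run", action="store_true", help="Do not write anything.")
    return parser


def main(argv: list[str] | None = None) -> dict:
    """Parse arguments, build the use case, and delegate to it."""
    args = build_parser().parse_args(argv)
    use_case = UpdateOneAnnotationUseCase(dry_run=args.dry_run)
    return use_case.execute(args.entry_id, args.output_dir)


if __name__ == "__main__":
    print(main())
```

Accepting argv as a parameter keeps the adapter easy to call from tests or from a dispatcher without going through the shell.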


About the infrastructure layer and database adapter

To add or replace the database:


Key principles for contributors

Testing

To run tests, go to the root directory of this repository and use:

PYTHONPATH=$(pwd) pytest -v -s tests/

This command runs all tests except those marked as "manual". To run the tests marked as "manual", use:

PYTHONPATH=$(pwd) pytest -v -s -m manual tests/

Logging

To add logging to a module, use:

import logging
logger = logging.getLogger("rs-etl-pipeline")

The logger configuration can be found in src/infrastructure/logging_config.py. INFO logs are written to the terminal, and logs of all other levels are written to a file (re_etl_pipeline.log).
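
For readers unfamiliar with splitting log output by level, the following is a minimal sketch of one way to get the behaviour described above using only the standard logging module. It is not a copy of src/infrastructure/logging_config.py: the configure_logging function, the message format, and the choice of stdout for the terminal handler are assumptions for illustration.

```python
import logging
import sys


def configure_logging(log_path: str = "re_etl_pipeline.log") -> logging.Logger:
    """Hypothetical setup: INFO to the terminal, every other level to a file."""
    logger = logging.getLogger("rs-etl-pipeline")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records away from the root logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Terminal handler: the filter lets only INFO records through.
    console = logging.StreamHandler(sys.stdout)
    console.addFilter(lambda record: record.levelno == logging.INFO)
    console.setFormatter(formatter)

    # File handler: the filter lets everything except INFO through.
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.addFilter(lambda record: record.levelno != logging.INFO)
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("shown in the terminal")
    log.warning("written to the log file")
```

Modules that call logging.getLogger("rs-etl-pipeline") after such a setup share the same handlers, so they need no configuration of their own.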