This guide describes how to set up a development environment and extend the pipeline. The repository is organized as follows:
```
src/
  adapters/
    cli/
      main.py               # CLI dispatcher
      pipeline_full.py      # Full pipeline execution logic (run_full)
      check_environment.py  # Environment diagnostic tool
      integration/          # Stage scripts
  domain/                   # Core data models and logic
  scripts/                  # Utility and conversion scripts
data/
  integration/
    runs/                   # Versioned outputs (git-ignored)
docs/                       # MkDocs documentation
tests/                      # Unit and integration tests
```
The pipeline is composed of stages, each responsible for a major part of the integration workflow — such as blocking, enrichment, disambiguation, or merging.
Within a stage, there may exist multiple use cases, which represent different operational variants of that stage.
For example, a merging stage might provide one use case that merges entries in bulk and another that updates a single annotation.
This structure allows new functionality to be added incrementally without rewriting existing code.
CLI adapter → Use case → Application services → Database/API adapters
| Layer | Purpose | Typical location | Notes |
|---|---|---|---|
| CLI adapters | Entry points that handle arguments, environment loading, and I/O paths. | `src/adapters/cli/integration/` | Thin setup layer; no logic beyond parsing and orchestration. |
| Use cases | Define a specific workflow for a stage; coordinate services and domain models to achieve one operational goal. | `src/domain/use_cases/` | Each use case encapsulates one scenario (e.g., “merge entries,” “update one annotation”). |
| Application services | Implement the actual operations and interact with the database or APIs. | `src/domain/services/` | Each service encapsulates a single capability (e.g., conflict detection, metadata enrichment). |
| Infrastructure adapters | Handle external connectivity (MongoDB, APIs, file system). | `src/infrastructure/` | Include the database adapter and API clients. |
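The layering above can be sketched end to end. All class and function names below are hypothetical, invented purely to illustrate how the layers call each other; none of them are the project's real APIs.

```python
from argparse import ArgumentParser

class InMemoryDatabaseAdapter:
    """Infrastructure adapter: stands in for the real MongoDB adapter."""
    def __init__(self):
        self.saved = []

    def save(self, entry):
        self.saved.append(entry)

class EnrichmentService:
    """Application service: one capability, persistence via the adapter."""
    def __init__(self, db_adapter):
        self.db = db_adapter

    def enrich(self, entry):
        enriched = {**entry, "enriched": True}
        self.db.save(enriched)
        return enriched

def run_enrichment(db_adapter, entries):
    """Use case: coordinates services to fulfil one operational goal."""
    service = EnrichmentService(db_adapter)
    return [service.enrich(e) for e in entries]

def main(argv=None):
    """CLI adapter: parse arguments, wire dependencies, delegate."""
    parser = ArgumentParser()
    parser.add_argument("--limit", type=int, default=10)
    args = parser.parse_args(argv)
    db = InMemoryDatabaseAdapter()
    entries = [{"name": f"tool-{i}"} for i in range(args.limit)]
    return run_enrichment(db, entries)
```

Note that only the CLI adapter knows about argument parsing, and only the infrastructure adapter knows how entries are stored; the use case in the middle stays free of both concerns.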
All services in this project are application services: they depend on the
`DatabaseAdapter` or other clients.
This is by design, as each stage interacts with stored metadata or external APIs.
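Because services depend only on the adapter's interface, a different backend can be dropped in by implementing the same method signatures. A minimal sketch using `typing.Protocol`; the method names and classes here are invented for illustration, not taken from the real `DatabaseAdapter`:

```python
from typing import Protocol

class DatabaseAdapterLike(Protocol):
    """Illustrative stand-in for the project's DatabaseAdapter protocol."""
    def find(self, query: dict) -> list: ...
    def save(self, entry: dict) -> None: ...

class FakeBackend:
    """Any class with matching method signatures satisfies the protocol."""
    def __init__(self):
        self._rows = []

    def find(self, query: dict) -> list:
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in query.items())]

    def save(self, entry: dict) -> None:
        self._rows.append(entry)

class ConflictDetectionService:
    """Application service: depends on the interface, not on a backend."""
    def __init__(self, db_adapter: DatabaseAdapterLike):
        self.db = db_adapter

    def has_duplicate(self, name: str) -> bool:
        # A name appearing more than once counts as a conflict.
        return len(self.db.find({"name": name})) > 1
```

The same pattern is what makes an in-memory backend useful in tests: the service under test never notices the swap.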
A use case represents one way the system performs a task.
It defines how and when services are combined to fulfill a specific need.
You can think of it as a “recipe” built from reusable ingredients (services, models, adapters).
Each use case lives in `src/domain/use_cases/` and receives its dependencies (database adapter, API clients, input/output paths) as arguments.
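A use case skeleton following that recipe might look like the sketch below. The entry shape, the file format, and the `fetch_metadata` method are assumptions made for the example; only the dependency-injection style mirrors the conventions in this guide.

```python
import json

def run_update_metadata(db_adapter, api_client, path_in, path_out):
    """One scenario: load entries, enrich each one, persist, write output."""
    with open(path_in) as fh:
        entries = json.load(fh)
    for entry in entries:
        # External lookups go through the injected client, never raw requests.
        entry["metadata"] = api_client.fetch_metadata(entry["name"])
        # Persistence stays behind the injected adapter.
        db_adapter.save(entry)
    with open(path_out, "w") as fh:
        json.dump(entries, fh, indent=2)
    return entries
```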
To add a new stage or use case:

1. Create a new use case
   - Add a module at `src/domain/use_cases/<stage_or_scenario>.py`.
   - Accept its dependencies (`db_adapter`, `api_client`, `path_in`, `path_out`) as arguments.
2. Add or reuse services
   - Implement each capability as a service in `src/domain/services/`.
3. Use domain models
   - Use the models in `src/domain/models/` to represent entities (software entries, conflicts, annotations, etc.).
4. Integrate with the database and APIs
   - Use the `DatabaseAdapter` (Mongo implementation by default) for persistence.
   - Use the clients in `src/infrastructure/` for web services (Observatory, Europe PMC, Semantic Scholar, etc.).
   - Avoid raw `requests` calls inside use cases; those belong in services or clients.
5. Create or extend a CLI adapter
   - Add `src/adapters/cli/integration/<stage>.py` or extend an existing one.
   - Parse arguments (`argparse`), load environment variables, and instantiate clients.
6. (Optional) Add to the full pipeline
   - Add a `_run([...])` command in `src/adapters/cli/pipeline_full.py` at the correct step.
   - Use the run directory (`<run_id>`) to store intermediate outputs.
7. Document and test
   - Update `docs/pipeline.md` (Stage Summary and Details).
   - Update `docs/installation.md` if new environment variables or tokens are required.

The database layer lives in `src/infrastructure/db/adapter.py` and defines a `DatabaseAdapter` protocol. To add or replace the database backend, implement a new adapter in `src/infrastructure/db/<backend>.py` using the same method signatures.

To run tests, go to the root directory of this repository and use:
```bash
PYTHONPATH=$(pwd) pytest -v -s tests/
```

The command above runs all tests except those marked as "manual". To run the tests marked as "manual", use:

```bash
PYTHONPATH=$(pwd) pytest -v -s -m manual tests/
```
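For reference, a "manual" test carries a pytest marker like the hypothetical example below (registering the marker in the pytest configuration avoids an unknown-mark warning):

```python
import pytest

def test_normalize_name():
    # Ordinary unit test: included in the default run.
    assert "Samtools".lower() == "samtools"

@pytest.mark.manual
def test_live_europe_pmc_lookup():
    # Manual test: selected only with `pytest -m manual`, e.g. because it
    # would hit a live external API. The body here is a placeholder.
    assert True
```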
To add logging, use:

```python
import logging

logger = logging.getLogger("rs-etl-pipeline")
```

The logger configuration can be found in `src/infrastructure/logging_config.py`. INFO-level logs are written to the terminal and all the rest to a file (re_etl_pipeline.log).