CLI Reference
Commands
rsetl run
Run the full ETL/integration pipeline, a partial pipeline, a single stage, or resume an existing run.
Options
- --tag TEXT: append a tag to the generated run ID.
- --resume-run TEXT: resume an existing run by run ID or run directory path.
- --no-merge: skip the merge stage.
- --no-human-updates: skip the human_updates stage.
- --remove-opeb-metrics: disable removal of OEB metrics.
- --dry-run-disambiguation: run disambiguation without creating conflict files or GitHub issues.
- --from-stage STAGE: start the pipeline from this stage.
- --until STAGE: run the pipeline until this stage, inclusive.
- --only STAGE: run only one stage.
- --python-exe PATH: Python executable for subprocesses. Default: python.
- --workdir PATH: working directory. Default: . (current directory).
- --runs-root PATH: root folder for run outputs. Default: data/integration/runs.
Stages
The pipeline supports the following stage names, in order:
1. transformation
2. license-normalization
3. grouping
4. remove_opeb_metrics
5. conflict_detection
6. simplify_blocks
7. json_to_jsonl
8. disambiguation
9. human_updates
10. merge
11. stats
12. fairsoft
13. similarity
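The way --from-stage, --until, and --only narrow the ordered stage list can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual implementation: select_stages is a hypothetical helper, the skip argument stands in for flags such as --no-merge and --no-human-updates, and rejecting --only together with --from-stage or --until is an assumed rule.

```python
# Stage names in pipeline order, as listed above.
STAGES = [
    "transformation",
    "license-normalization",
    "grouping",
    "remove_opeb_metrics",
    "conflict_detection",
    "simplify_blocks",
    "json_to_jsonl",
    "disambiguation",
    "human_updates",
    "merge",
    "stats",
    "fairsoft",
    "similarity",
]


def select_stages(from_stage=None, until=None, only=None, skip=()):
    """Return the stages to execute, in pipeline order (hypothetical helper)."""
    for name in (from_stage, until, only, *skip):
        if name is not None and name not in STAGES:
            raise ValueError(f"unknown stage: {name}")
    if only is not None:
        # Assumption: --only is not combined with a stage range.
        if from_stage is not None or until is not None:
            raise ValueError("--only cannot be combined with --from-stage/--until")
        return [only]
    start = STAGES.index(from_stage) if from_stage else 0
    # --until is inclusive, so the slice below ends one past its index.
    end = STAGES.index(until) if until else len(STAGES) - 1
    if start > end:
        raise ValueError("--from-stage must not come after --until")
    return [stage for stage in STAGES[start:end + 1] if stage not in skip]
```

For instance, select_stages(until="disambiguation") would end with the disambiguation stage itself, mirroring the inclusive behaviour documented for --until.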
Examples
# Standard full run
rsetl run
# Run with a tag
rsetl run --tag 2026Q2
# Run without merging results into the database
rsetl run --tag test-run --no-merge
# Run until disambiguation
rsetl run --until disambiguation --tag pre-human-review
# Run disambiguation in dry-run mode
rsetl run --only disambiguation \
--resume-run 20260428T090000Z-ab12cd-pre-human-review \
--dry-run-disambiguation
# Resume an existing run from human updates onward
rsetl run \
--resume-run 20260428T090000Z-ab12cd-pre-human-review \
--from-stage human_updates
# Resume an existing run and execute only FAIRsoft scoring
rsetl run \
--resume-run 20260428T090000Z-ab12cd-pre-human-review \
--only fairsoft
# (Re)compute similarity scores only
rsetl run --only similarity
Notes
- --tag cannot be used together with --resume-run.
- When --resume-run is used, the existing run directory and run ID are reused.
- Resumed runs append a new execution record to the existing manifest.json.
- When resuming, required input files for the selected stages must already exist.
- The remove_opeb_metrics, human_updates, and merge stages can be skipped with their corresponding options.
- --dry-run-disambiguation only affects the disambiguation stage.
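As a rough illustration of the resume behaviour described in these notes, the Python sketch below appends one execution record to an existing manifest.json while leaving the run ID untouched. The key names (run_id, executions, started_at, stages, options) and the helper append_execution are assumptions made for the example; the real manifest layout may differ.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_execution(run_dir, stages, options):
    """Append one execution record to <run_dir>/manifest.json (assumed layout)."""
    manifest_path = Path(run_dir) / "manifest.json"
    if not manifest_path.is_file():
        raise FileNotFoundError(f"cannot resume: {manifest_path} is missing")
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    record = {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "stages": list(stages),
        "options": dict(options),
    }
    # "executions" is an assumed key name for the execution history.
    manifest.setdefault("executions", []).append(record)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


# Demo on a throwaway directory: two resumed executions of the same run.
with tempfile.TemporaryDirectory() as tmp:
    seed = {"run_id": "demo", "executions": []}
    (Path(tmp) / "manifest.json").write_text(json.dumps(seed), encoding="utf-8")
    append_execution(tmp, ["fairsoft"], {"only": "fairsoft"})
    result = append_execution(tmp, ["similarity"], {"only": "similarity"})
```

After the demo, result still carries the original run ID and holds two execution records, one per resumed invocation.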
rsetl run-transformation
Run only the transformation step.
This command executes transformation independently and creates a versioned run directory with provenance metadata.
Options
- --tag TEXT: append a tag to the generated run ID.
- --sources TEXT: sources passed to the transformation step. Default: all.
- --python-exe PATH: Python executable for subprocesses. Default: python.
- --workdir PATH: working directory. Default: . (current directory).
- --runs-root PATH: root folder for run outputs. Default: data/integration/runs.
Examples
# Run transformation with default sources
rsetl run-transformation
# Run transformation with a custom tag
rsetl run-transformation --tag test1
# Run transformation for selected sources
rsetl run-transformation --sources biotools
# Use a custom output directory
rsetl run-transformation --runs-root /data/rso/runs
rsetl enrich-publications
Enrich publication metadata and citation counts using Europe PMC.
This command delegates to the publication enrichment adapter and can update MongoDB, write a JSONL cache, or run in inspection/dry-run modes depending on the options passed.
Options
- --collection TEXT: MongoDB collection name. Default: publicationsMetadataDev.
- --jsonl-path PATH: path to the JSONL cache/output file. Default: data/cache/publications_enrichment.jsonl.
- --progress-every N: print progress every N processed documents. Default: 1000.
- --limit N: maximum number of documents to process.
- --no-skip-seen: do not skip DOIs already present in the JSONL file.
- --no-skip-existing-europe-pmc-citations: process records even if they already contain Europe PMC citations.
- --no-write-cache: do not append results to the JSONL file.
- --no-update-db: do not update MongoDB.
- --dry-run: show configuration and exit without running enrichment.
Examples
# Run publication enrichment with default settings
rsetl enrich-publications
# Process only 100 records
rsetl enrich-publications --limit 100
# Print progress more frequently
rsetl enrich-publications --progress-every 100
# Run enrichment without updating MongoDB
rsetl enrich-publications --no-update-db
# Reprocess records even if Europe PMC citations already exist
rsetl enrich-publications --no-skip-existing-europe-pmc-citations
# Check the effective configuration without running
rsetl enrich-publications --dry-run
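The default behaviour of skipping DOIs already present in the JSONL file (which --no-skip-seen turns off) might look roughly like the Python sketch below. It assumes each cache line is a JSON object with a "doi" field; that field name and the helpers seen_dois and pending are hypothetical, not taken from the adapter.

```python
import json


def seen_dois(lines):
    """Collect DOIs from JSONL lines (assumes a "doi" field per record)."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupt line
        doi = record.get("doi") if isinstance(record, dict) else None
        if isinstance(doi, str) and doi:
            seen.add(doi.lower())  # DOI names are case-insensitive
    return seen


def pending(dois, seen):
    """Keep only the DOIs that are not in the cache yet, preserving order."""
    return [doi for doi in dois if doi.lower() not in seen]
```

An open file handle can be passed as lines, since iterating over a text file yields one line at a time.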
rsetl run-webavailability
Run the daily web availability update and ensure toolsDev URLs exist.
Additional arguments are passed through to the web availability job.
Example
rsetl run-webavailability
rsetl runs list
List available pipeline runs.
Shows a compact summary including run ID, update time, manifest availability, disambiguation output availability, resumability, execution count, and latest executed stages.
Options
- --workdir PATH: working directory. Default: . (current directory).
- --runs-root PATH: root folder for run outputs. Default: data/integration/runs.
- --json: output the run list as JSON.
Examples
rsetl runs list
rsetl runs list --json
rsetl runs show
Show details for a specific run.
By default this prints a human-readable summary with run metadata, latest execution information, paths, and latest options.
Arguments
- run_ref: run ID or full run directory path.
Options
- --workdir PATH: working directory. Default: . (current directory).
- --runs-root PATH: root folder for run outputs. Default: data/integration/runs.
- --json: output the full manifest as JSON.
Examples
rsetl runs show 20260428T090000Z-ab12cd-pre-human-review
rsetl runs show /data/rso/runs/20260428T090000Z-ab12cd-pre-human-review
rsetl runs show 20260428T090000Z-ab12cd-pre-human-review --json
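Because run_ref may be either a run ID or a full run directory path, the command has to tell the two apart. One plausible heuristic is sketched below in Python; resolve_run_dir is a hypothetical helper and the rule it applies (anything absolute or containing a path separator is a path) is an assumption, not the documented behaviour.

```python
from pathlib import Path


def resolve_run_dir(run_ref, workdir=".", runs_root="data/integration/runs"):
    """Map a run ID or run directory path to a run directory (assumed rule)."""
    candidate = Path(run_ref)
    # Assumption: absolute or multi-part references are directory paths.
    if candidate.is_absolute() or len(candidate.parts) > 1:
        return candidate
    # Otherwise treat it as a run ID below <workdir>/<runs_root>.
    # If runs_root is absolute, pathlib discards workdir when joining.
    return Path(workdir) / runs_root / run_ref
```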
rsetl runs latest
Show the latest run.
Options
- --workdir PATH: working directory. Default: . (current directory).
- --runs-root PATH: root folder for run outputs. Default: data/integration/runs.
- --json: output the full manifest as JSON.
Examples
rsetl runs latest
rsetl runs latest --json
rsetl check-env
Check environment variables and API connectivity.
Example
rsetl check-env
Environment configuration
The pipeline loads environment variables from a .env file if present.
Example .env:
# MongoDB
MONGO_HOST=localhost
MONGO_PORT=27017
MONGO_USER=user
MONGO_PWD=pass
MONGO_AUTH_SRC=admin
MONGO_DB=observatory
# Disambiguation
GITHUB_TOKEN=ghp_...
GITLAB_TOKEN=...
OPENROUTER_API_KEY=...
HUGGINGFACE_API_KEY=...
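To make the file format concrete, here is a simplified Python sketch of reading such a .env file and reporting unset MongoDB variables. It is not the loader the pipeline uses: parse_env and missing_keys are hypothetical helpers, and treating all six MONGO_* variables as required is an assumption for illustration.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Assumption: every MongoDB variable from the example is required.
REQUIRED_MONGO = (
    "MONGO_HOST",
    "MONGO_PORT",
    "MONGO_USER",
    "MONGO_PWD",
    "MONGO_AUTH_SRC",
    "MONGO_DB",
)


def missing_keys(values, required=REQUIRED_MONGO):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```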
Run outputs
Each full pipeline run creates a versioned directory under:
data/integration/runs/<timestamp>-<gitsha>(-tag)/
A latest symlink points to the most recent run.
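The directory name doubles as the run ID, so it can be split back into its parts. The Python sketch below assumes the timestamp is a compact UTC value and the git SHA is lowercase hexadecimal, both inferred from the example ID 20260428T090000Z-ab12cd-pre-human-review used on this page; parse_run_id is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# <timestamp>-<gitsha>(-tag); the exact field formats are inferred, not specified.
RUN_ID_RE = re.compile(
    r"^(?P<timestamp>\d{8}T\d{6}Z)-(?P<gitsha>[0-9a-f]+)(?:-(?P<tag>.+))?$"
)


def parse_run_id(run_id):
    """Split a run ID into timestamp, git short SHA, and optional tag."""
    match = RUN_ID_RE.match(run_id)
    if match is None:
        raise ValueError(f"unrecognised run ID: {run_id!r}")
    created = datetime.strptime(match["timestamp"], "%Y%m%dT%H%M%SZ")
    return {
        "timestamp": created.replace(tzinfo=timezone.utc),
        "gitsha": match["gitsha"],
        "tag": match["tag"],  # None when the run has no tag
    }
```

With a fixed-width timestamp prefix like this, sorting run IDs as plain strings also sorts them chronologically.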
Full pipeline runs write a manifest.json file containing:
- run ID and run directory;
- git short SHA;
- created and last-updated timestamps;
- paths to generated files;
- latest execution options;
- selected and executed stages;
- masked environment configuration;
- execution history.
Transformation-only runs write their own transformation manifest.
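The masked environment configuration stored in manifest.json implies that secrets are hidden before being written. A minimal Python sketch of such masking follows; the marker list, the "***" placeholder, and the helper mask_env are assumptions, and the pipeline's actual masking rules may differ.

```python
# Assumption: variables whose names contain these markers hold secrets.
SENSITIVE_MARKERS = ("TOKEN", "KEY", "PWD", "PASSWORD", "SECRET")


def mask_env(env):
    """Return a copy of env with non-empty secret values replaced by ***."""
    masked = {}
    for key, value in env.items():
        is_secret = any(marker in key.upper() for marker in SENSITIVE_MARKERS)
        # Empty values are kept so that unset variables remain visible.
        masked[key] = "***" if (is_secret and value) else value
    return masked
```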
Help
Use the built-in help commands for contextual usage:
rsetl --help
rsetl run --help
rsetl run-transformation --help
rsetl enrich-publications --help
rsetl run-webavailability --help
rsetl runs --help
rsetl runs list --help
rsetl runs show --help
rsetl runs latest --help
rsetl check-env --help