Skip to content

Pipeline Stages

Overview

The Research Software Observatory – Data Pipeline orchestrates the consolidation, enrichment, and integration of software metadata into analysis-ready datasets consumed by the Research Software Observatory.

Pipeline overview

Overview of the main and auxiliary pipelines of the Research Software Observatory. Raw data importers are external to this repository and act as upstream inputs.

The pipeline operates downstream of independent importer processes, which periodically collect and normalize metadata from external registries such as bio.tools, Bioconda, Galaxy, and others. These importer components are maintained separately and populate the raw data layer, which serves as the entry point of this pipeline.

The pipeline then performs:

  • Normalization and enrichment of software metadata
  • Integration and disambiguation of duplicate records
  • FAIRsoft score computation and statistics generation
  • Auxiliary enrichments such as publications and service availability

Execution model

In normal maintenance work, the pipeline is usually executed in one of two ways:

  1. Run the main pipeline until disambiguation
    This is the usual option when new identity conflicts may require human review. The pipeline generates the intermediate files, performs automated disambiguation, and creates the outputs needed for manual annotation.

bash rsetl run --until disambiguation --tag <TAG>

  1. Run the full pipeline
    This is used when the data can be processed end-to-end, or when no manual curation round is expected.

bash rsetl run --tag <TAG>

After human annotation has been completed and committed, the latest relevant run should be resumed from the human_updates stage. This reuses the existing run directory and applies the curator decisions before continuing with merge, statistics, and FAIRsoft scoring.

rsetl run --resume-run latest --from-stage human_updates

Alternatively, use the explicit run ID:

rsetl run --resume-run <RUN_ID> --from-stage human_updates

This is important because the human annotations correspond to the conflict files generated in that specific run. Starting a fresh run after annotation may produce different block or conflict identifiers and can make the annotations impossible to apply correctly.


Main pipeline stages

The full pipeline is composed of the following stages, in execution order:

# Stage CLI Module / Script Description External Services
1 Transformation src.adapters.cli.transformation.transformation Loads and normalizes raw metadata from MongoDB sources. MongoDB
2 License normalization src.adapters.cli.post_transformation.normalize_licenses Maps license information to standardized SPDX identifiers. MongoDB
3 Grouping / blocking & recovery src.adapters.cli.integration.group_and_recovery Groups records into candidate identity blocks using names and repository-like links. MongoDB
Remove OEB metrics scripts/utils/remove_oeb_metrics.py Removes redundant OpenEBench metric entries to reduce noise and processing time.
4 Conflict detection src.adapters.cli.integration.conflict_detection Identifies disconnected records within blocks as potential identity conflicts.
Simplify blocks scripts/utils/simplify_grouped_entries.py Reduces block structure to the minimal representation needed downstream.
JSON to JSONL conversion scripts/utils/json_to_jsonl.py Converts conflict and block files into JSONL format.
5 Disambiguation src.adapters.cli.integration.disambiguation Resolves conflicts using heuristics and LLM-based agreement scoring. Can generate manual-review issues for unresolved cases. OpenRouter, Hugging Face, GitHub, GitLab
6 Human updates src.adapters.cli.integration.update_disambiguation_after_human_resolution Integrates curator decisions from human_annotations/ into the disambiguation output. Git
7 Merge src.adapters.cli.integration.merge_entries Consolidates resolved records into unified software entries. MongoDB
8 Statistics src.adapters.cli.generate_stats Computes descriptive statistics for Observatory dashboards. MongoDB
9 FAIRsoft scoring src.adapters.cli.fair_scores Computes FAIRsoft indicators and scores for software entries. MongoDB
10 Similarity src.adapters.cli.similarity Embeds tool descriptions using gte-modernbert-base and precomputes the top-10 nearest neighbours per tool, stored in similaritiesDev to power "similar software" recommendations. MongoDB, HuggingFace (model download)

Practical maintenance workflow

Standard run with possible human curation

Use this when you expect some conflicts to require manual review.

rsetl run --until disambiguation --tag <TAG>

Then complete and commit the human annotations.

Once annotation is finished, resume the same run:

rsetl run --resume-run latest --from-stage human_updates

This executes:

  • human_updates
  • merge
  • stats
  • fairsoft
  • similarity

unless one of those stages is skipped with a command-line option.

Full automatic run

Use this when you want to execute the full pipeline in one pass.

rsetl run --tag <TAG>

Resume a specific run

Use this when the run to resume is not the latest one.

rsetl run --resume-run <RUN_ID> --from-stage human_updates

Run only one stage

Useful for debugging or recomputing a downstream step.

rsetl run --resume-run <RUN_ID> --only stats

Outputs

Primary outputs

The pipeline produces:

  • Merged software records, stored in MongoDB. The default target collection is tools.
  • Precomputed metrics and FAIRsoft scores, stored in MongoDB. The default target collection is computations.
  • Precomputed similarity scores, stored in MongoDB. The default target collection is similaritiesDev.

Run artifacts

Each execution generates a versioned run directory under:

data/integration/runs/<run_id>/

Typical contents:

├── grouped_entries.<run_id>.json
├── grouped_entries.no_opeb_metrics.<run_id>.json
├── conflicts.<run_id>.json
├── grouped_entries.simplified.<run_id>.json
├── conflicts.<run_id>.jsonl
├── grouped_entries.simplified.<run_id>.jsonl
├── disambiguation.<run_id>.jsonl
└── manifest.json

A latest symlink points to the most recent run directory.

The manifest.json file records:

  • run ID and run directory
  • git short SHA
  • created and last-updated timestamps
  • paths to generated files
  • selected and executed stages
  • latest execution options
  • masked environment configuration
  • execution history for resumed runs

Auxiliary pipelines

Auxiliary pipelines run independently from the main integration workflow.

Publications enrichment

Field Value
CLI rsetl enrich-publications
Module adapters.cli.enrich_publications
Description Enriches publication metadata, citation counts, and cited-by counts using Europe PMC.
Output Updates the MongoDB publication collection, publicationsMetadataDev by default, and optionally writes a JSONL cache.
External services Europe PMC API

For each publication the enrichment fetches:

  • Metadata — title, abstract, authors, journal, year, PMID, and source identifier.
  • citations — publications that cite this paper, grouped by year, sourced from the Europe PMC /{source}/{id}/citations endpoint.
  • citedBy — same dataset as citations, stored under a dedicated field for downstream consumers that distinguish between outgoing and incoming citation relationships.

Both citations and citedBy are stored as a list with a single entry of the form:

[
  {
    "source": "Europe PMC",
    "count": {
      "2020": 3,
      "2021": 7,
      "total": 10
    }
  }
]

Examples:

rsetl enrich-publications --limit 100
rsetl enrich-publications --progress-every 100
rsetl enrich-publications --no-update-db
rsetl enrich-publications --no-skip-existing-europe-pmc-citations

Web availability

Field Value
CLI rsetl run-webavailability
Module adapters.cli.web_availability
Description Checks availability of tool URLs and web services.
Output Updates the MongoDB collection used for service availability dashboards.
External services HTTP endpoints

Example:

rsetl run-webavailability

Notes for maintainers

  • The pipeline is stage-driven and resumable.
  • Use --until disambiguation when a human annotation round is expected.
  • After human annotation, resume the same run from human_updates; do not start a fresh run for the annotation integration step.
  • Use rsetl runs list and rsetl runs latest to identify available runs.
  • Use --only for targeted recomputation of individual stages.
  • Use --dry-run-disambiguation to test the disambiguation stage without creating conflict files or GitHub issues.
  • See CLI Reference for all available command-line options.
  • See Development Guide for extending or customizing stages.