A public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer (NSCLC) on first-line osimertinib.
OncoTraj harmonizes real-world clinical-genomic data into a single schema with frozen, leakage-audited splits, and ships reproducible baselines and an evaluation harness. Its purpose is to establish a public floor — and an honest ceiling — for what snapshot (single-timepoint) features can predict about acquired resistance.
Headline finding (read this first). On clean within-source evaluation, snapshot-feature models do not beat chance on any of the three tasks at this cohort size. The binding constraint is the data modality — single-timepoint tissue/ctDNA snapshots — not the model. Serial ctDNA trajectories, not a better algorithm, are the precondition for above-chance performance. That negative result is the contribution: it tells the field where the signal must come from. See
leaderboard.md for the full, CI-annotated numbers and the cross-source (v1.1) analysis.
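To make "does not beat chance" concrete, one common way to annotate a binary-task metric with a confidence interval is a percentile bootstrap over test patients. The following is a minimal sketch, assuming AUROC as the metric; `auroc` and `bootstrap_ci` are hypothetical names, not the harness's API, and the CI method behind the leaderboard may differ.

```python
import random


def auroc(labels, scores):
    """Rank-based AUROC: P(positive score > negative score), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUROC undefined
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Under this sketch, a model whose interval covers 0.5 is not distinguishable from chance at the given test-set size.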
v1 comprises 813 EGFR-mutant NSCLC patients on first-line osimertinib, harmonized from three real-world clinical-genomic sources:
| Source | n | Access |
|---|---|---|
| MSK-CHORD | 672 | cBioPortal (open) |
| FLAURA molecular-resistance supplement | 107 | Public journal supplements |
| AACR Project GENIE BPC (NSCLC) | 34 | Registered access |
Patient-level test split: n=122 (Task A landmark-evaluable n=110; Task B/C event-observed n=85).
- Task A — 12-month landmark progression (binary classification).
- Task B — time-to-progression (regression; reported with MAE and C-index).
- Task C — resistance mechanism (6-class classification).
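For readers unfamiliar with the Task B metrics, here is a minimal sketch of Harrell's C-index under right censoring, alongside MAE. It assumes the model predicts a time (larger means later predicted progression); the function names are hypothetical, and the harness may treat ties or censoring differently.

```python
def concordance_index(times, events, predicted_times):
    """Harrell's C for predicted times.

    A pair is comparable when the patient with the shorter observed time
    had an observed event (events[i] == 1). Pairs tied in observed time are
    skipped; pairs tied in prediction count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def mean_absolute_error(observed, predicted):
    """MAE; only well defined where the progression time was observed."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

A C-index of 0.5 corresponds to random ordering, which is the reference point for "chance" on this task.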
Baselines: majority/floor, logistic regression, random forest, XGBoost, LSTM, a small
Transformer, and a Cox proportional-hazards model (Task B). All baselines reuse
a single leak-safe feature pipeline; the leakage audit lives in
tests/test_taska_feature_leakage.py.
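As an illustration of what a leakage audit typically asserts, the sketch below checks two properties: no patient appears in more than one split, and no feature was measured after a cutoff day. All names and the day-offset representation are assumptions made for this example; it does not reproduce the contents of tests/test_taska_feature_leakage.py.

```python
def assert_disjoint_patients(splits):
    """splits: dict mapping split name -> iterable of patient IDs.

    Raises AssertionError if any patient ID occurs in two different splits.
    """
    seen = {}
    for name, ids in splits.items():
        for pid in ids:
            if pid in seen and seen[pid] != name:
                raise AssertionError(
                    f"patient {pid!r} appears in both {seen[pid]!r} and {name!r}"
                )
            seen[pid] = name


def assert_features_before_cutoff(feature_days, cutoff_day):
    """feature_days: dict mapping feature name -> measurement day
    (hypothetical convention: days relative to treatment start).

    Raises AssertionError if any feature was measured after cutoff_day.
    """
    late = sorted(f for f, day in feature_days.items() if day > cutoff_day)
    if late:
        raise AssertionError(f"features measured after cutoff: {late}")
```

Both helpers raise `AssertionError`, so they can be called directly from a pytest test function.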
uv sync
uv run pytest # test suite, incl. leakage audits
uv run ruff check

Evaluate a submission against the locked test split:
uv run oncotraj-eval --predictions yours.csv --split test \
--submission-id <your-name> --output eval_reports_v1/<your-name>.json
uv run oncotraj-eval --refresh # regenerate leaderboard.md

The prediction CSV schema is documented in
src/oncotraj/eval/report.py (SCHEMA_DOC).
The harmonized dataset is not redistributed in this repository. The underlying sources (AACR GENIE BPC, MSK-CHORD) carry Data Use Agreements that restrict redistribution of patient-level data. Instead, OncoTraj ships the parsers and a build script that regenerate the harmonized Parquet tables from sources you obtain under their respective terms:
# after placing source files under data/raw/ (see docs/DATASET_SPEC.md and DATA.md)
uv run python scripts/build_dataset.py
uv run python scripts/build_splits.py

See DATA.md for how to obtain each source, and
docs/DATASET_SPEC.md for the full harmonization
schema (one row per patient/variant/treatment/outcome, explicit missingness
sentinels, no silent imputation).
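The "explicit missingness sentinels, no silent imputation" rule can be pictured with a small sketch. The sentinel value, the set of raw tokens treated as missing, and the field names are all hypothetical; the actual conventions are those defined in docs/DATASET_SPEC.md.

```python
# Hypothetical sentinel; the real values are defined in docs/DATASET_SPEC.md.
MISSING = "__MISSING__"

# Assumed raw spellings of "no value" across sources (illustrative only).
RAW_MISSING_TOKENS = {"", "na", "n/a", "unknown", "none"}


def harmonize_field(raw):
    """Return the cleaned value, or the sentinel when no value was reported."""
    if raw is None:
        return MISSING
    text = str(raw).strip()
    if text.lower() in RAW_MISSING_TOKENS:
        return MISSING
    return text


def harmonize_record(raw_record, required_fields):
    """Emit every required field; absent fields get the sentinel, never a default."""
    return {f: harmonize_field(raw_record.get(f)) for f in required_fields}
```

The point of the design is that a downstream model sees an explicit marker rather than a plausible-looking filled-in value, so any imputation becomes a visible modeling choice.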
src/oncotraj/ # library: schemas, parsers, splits, metrics, eval, baselines
scripts/ # build_dataset, build_splits, train_*, make_figures
tests/ # pytest suite, incl. leakage audits
eval_reports/ # baseline metric reports (JSON)
eval_reports_v1/ # v1 + v1.1 cross-source reports (JSON)
leaderboard.md # auto-generated, CI-annotated results
docs/ # dataset spec and reproduction docs
data/, models/ # gitignored — built locally, not redistributed
A preprint describing OncoTraj is available. Citation details will be added here once the preprint is posted. In the meantime, please cite this repository.
Code: MIT (see LICENSE). Data are not included; any data you
build with these tools remain governed by the upstream sources' terms (AACR
GENIE, MSK-CHORD, and the respective journal supplements).