Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark

Note: The paper describes a previous version of the dataset and will be updated as soon as possible to reflect the changes made since then.


Abstract

Background

The global cost of drug discovery and development exceeds $200 billion annually, and clinical trial outcomes are critical both to the regulatory approval of new drugs and to patient outcomes. Despite their significance, large-scale, high-quality clinical trial outcome data are not readily available to the public, which limits advances in predictive modeling of trial outcomes.

Methods

We introduce the Clinical Trial Outcome (CTO) dataset, a fully reproducible, large-scale (around 125K drug and biologics trials), open-source dataset of clinical trial outcomes derived from a comprehensive knowledge base. This knowledge base integrates weakly supervised labels from multiple sources, including large language model (LLM) interpretations of publications, trial phase transitions, sentiment analysis from news, stock prices of trial sponsors, and other trial-related metrics.
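To make the idea of combining weakly supervised labels concrete, the following is a minimal sketch that assumes a simple majority vote over sources. The aggregation rule, the label encoding, the trial identifiers, and the function name `aggregate_votes` are all illustrative assumptions; the actual CTO pipeline may aggregate its sources differently.

```python
from collections import Counter

# Assumed label encoding for this sketch (not taken from the CTO code).
ABSTAIN = -1   # the source has no opinion on this trial
FAILURE = 0
SUCCESS = 1


def aggregate_votes(votes):
    """Majority vote over non-abstaining sources.

    Returns ABSTAIN when no source voted or when the top two labels tie.
    """
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Hypothetical votes, one entry per weak source, e.g. LLM reading of a
# publication, phase transition, news sentiment, sponsor stock price.
weak_votes = {
    "trial_a": [SUCCESS, SUCCESS, ABSTAIN, FAILURE],
    "trial_b": [FAILURE, ABSTAIN, ABSTAIN, FAILURE],
    "trial_c": [SUCCESS, FAILURE, ABSTAIN, ABSTAIN],
}

labels = {trial: aggregate_votes(v) for trial, v in weak_votes.items()}
print(labels)
```

In this toy example `trial_c` stays unlabeled because its two voting sources disagree. A learned label model that weights sources by their estimated accuracy is a common alternative to an unweighted vote.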

Results

Using our labeling pipeline, we generated high-quality trial outcome labels that demonstrate strong agreement with human annotations, achieving an F1 score of 94 for Phase 3 trials and 91 across all phases. Additionally, we provide monthly dataset updates reflecting the latest trial information, along with open-source code and a manually curated test set of 11,012 trials completed between 2020 and 2024.
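For readers who want to check agreement between automatic labels and human annotations themselves, the sketch below computes a binary F1 score from scratch. The label lists are made-up toy data, and treating "success" as the positive class is an assumption; the averaging scheme behind the reported scores is not specified here.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = trial succeeded, 0 = trial failed (hypothetical values).
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
auto_labels = [1, 1, 0, 0, 0, 1, 1, 1]

score = f1_score(human_labels, auto_labels)
print(f"F1 = {score:.2f}")
```

The function returns a value between 0 and 1; multiply by 100 to compare against scores quoted on a 0 to 100 scale.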

Conclusions

CTO provides an unprecedented resource for clinical research, designed to enhance the reproducibility and precision of predictive models in drug development. This publicly available dataset will support ongoing research in clinical trial outcomes, offering insights that could optimize the drug development process.

Usage Instructions

Citation

Gao, C., Pradeepkumar, J., Das, T., Thati, S., & Sun, J. (2024). Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark. arXiv preprint arXiv:2406.10292.

Other Material and Related Work

License

The dataset is licensed under the MIT license.