iMerit, Segmed, and Advocate Health have released what they say is the largest open-source, annotated breast tomosynthesis dataset available to date – designed to accelerate AI development for breast cancer detection and made available for free download.

The dataset includes imaging studies from 558 female patients, all using digital breast tomosynthesis (3D mammography) with biopsy-confirmed diagnoses: 271 malignant (48.6%) and 287 benign (51.4%) cases. With an average tumor size of just 1.34 cm and approximately 85% of lesions smaller than 2 cm, the dataset is specifically oriented toward early-stage detection – the stage at which five-year survival rates can exceed 99%.

All images were interpreted by U.S. board-certified, MQSA-certified radiologists and segmented by breast imaging specialists. The data is fully de-identified and compliant with both HIPAA and GDPR standards.

“By releasing this dataset openly, we hope to empower researchers worldwide to develop tools that can support radiologists, improve outcomes, and ultimately save lives,” said Dr. Sina Bari, VP of Healthcare and Life Science AI at iMerit.

One limitation worth noting: The dataset’s demographic representation skews heavily white (96%), with just 1% Black or African American, 1% Asian, and 1% multi-racial patients. Given that Black women have a 40% higher breast cancer mortality rate than white women in the U.S., AI models trained primarily on this data will need supplementary datasets to ensure equitable performance across populations.

The dataset is available in DICOM format with JSON annotations and is free to registered users – lowering the barrier for academic researchers, startups, and institutions working on breast cancer detection tools.
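For readers planning to work with the files, a first step is usually matching each DICOM image to its JSON annotation. The sketch below is only an illustration under stated assumptions: the announcement does not describe the directory layout or the annotation schema, so the `<name>.dcm` / `<name>.json` side-by-side naming and the `pair_studies` helper are hypothetical. It uses only the Python standard library and does not decode pixel data, which would need a DICOM reader such as pydicom.

```python
import json
from pathlib import Path


def pair_studies(root):
    """Pair each .dcm file under root with a same-named .json file, if present.

    Assumption (not confirmed by the release): annotations sit next to the
    image as <name>.json. The real dataset may be organized differently.
    """
    pairs = []
    for dcm_path in sorted(Path(root).rglob("*.dcm")):
        json_path = dcm_path.with_suffix(".json")
        annotation = None
        if json_path.exists():
            with json_path.open(encoding="utf-8") as fh:
                annotation = json.load(fh)
        # Images without an annotation are kept, with None in place of the JSON.
        pairs.append((dcm_path, annotation))
    return pairs
```

Calling `pair_studies("path/to/download")` would return a list of `(path, annotation)` tuples, which makes it easy to spot images that lack annotations before any training pipeline is built on top.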
