This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models. De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification. Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility. Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
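To illustrate the standardized, reproducible query protocol described above, here is a minimal sketch of how a temperature-0 MLLM request with a BI-RADS-guided chain-of-thought prompt might be assembled. The prompt wording, payload layout (OpenAI-style chat completions), and model name are illustrative assumptions, not the study's actual protocol.

```python
import json

# Hypothetical BI-RADS-guided chain-of-thought prompt (illustrative only;
# the study's actual prompt is not specified here).
BIRADS_COT_PROMPT = (
    "You are a breast ultrasound reader. Reason step by step: "
    "(1) describe glandular tissue composition, (2) decide mass vs. "
    "non-mass lesion, (3) list morphological descriptors (shape, margin, "
    "orientation, echo pattern), then (4) assign a final BI-RADS category."
)

def build_request(image_b64: str, model: str = "example-mllm") -> dict:
    """Assemble an OpenAI-style chat-completions payload carrying a
    base64-encoded B-mode image, with deterministic decoding settings."""
    return {
        "model": model,
        "temperature": 0,  # deterministic decoding for reproducibility
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": BIRADS_COT_PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    }

# Usage: serialize and send the same payload to each evaluated model's
# endpoint, so every MLLM sees an identical prompt and image encoding.
payload = build_request("iVBORw0K")  # placeholder image bytes
body = json.dumps(payload)
```

Fixing every request field except the model name is what makes cross-model comparisons fair; temperature 0 removes sampling variance between repeated runs.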
Diagnostic Accuracy for Pathological Diagnosis
Timeframe: At study completion, approximately 12 months
BI-RADS Classification Accuracy
Timeframe: At study completion, approximately 12 months
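The two primary endpoints above can be computed from per-case model outputs roughly as follows. This is a pure-Python sketch under assumed data shapes (integer BI-RADS categories, malignancy scores in [0, 1], labels 1 = malignant, 0 = benign), not the study's analysis code.

```python
def classification_accuracy(pred, true):
    """Fraction of cases where the model's BI-RADS category exactly
    matches the expert consensus category."""
    assert pred and len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(pred)

def auc(scores, labels):
    """Diagnostic AUC for benign-malignant differentiation via the
    Mann-Whitney U statistic: the probability that a randomly chosen
    malignant case scores higher than a randomly chosen benign case
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # malignant
    neg = [s for s, y in zip(scores, labels) if y == 0]  # benign
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with assumed values:
acc = classification_accuracy([3, 4, 5, 2], [3, 4, 4, 2])  # 0.75
roc = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])              # 1.0
```

In practice the study would likely use a standard library implementation (e.g. scikit-learn's `roc_auc_score`) with bootstrap confidence intervals; the rank-based formula here is the same quantity.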