This study evaluates whether large language models (LLMs) can reliably assess written clinical-reasoning case examinations completed by undergraduate physiotherapy students, compared with faculty assessment. In the course "Specific Methods in Physiotherapy" (third year of the Physiotherapy Degree), students solve complex clinical cases that require clinical reasoning, technical knowledge, and therapeutic decision-making. These cases are traditionally graded by faculty, a time-consuming process that may show inter-rater variability. A set of de-identified student case examinations will be assessed using the rubric currently applied in the course, which covers clarity and structure of clinical reasoning, integration of the biopsychosocial model (ICF and APTA frameworks), accuracy in identifying pain mechanisms, coherence between diagnosis, hypotheses, and treatment, originality and depth of analysis, and professional writing. Each examination will be scored independently by three LLMs (for example, Claude, ChatGPT, and Gemini), each receiving an identical standardized prompt that embeds the same rubric, and by faculty serving as the reference standard. To avoid overloading faculty, full double human grading may not be feasible; the human reference will therefore consist of expert faculty grading by one independent rater or, when resources allow, two independent raters. In contrast, paired assessment is fully implemented across the AI models: each examination is scored by several LLMs, and each model is queried in duplicate, allowing the study to estimate agreement between models and the test-retest stability of each model. The primary aim is to quantify agreement between LLM-generated scores and the faculty reference score. 
Secondary aims include agreement among the LLMs, test-retest reliability of each model, criterion-level agreement, the quality and usefulness of the qualitative feedback generated, the time and cost associated with each approach, and students' perceptions of the usefulness of human versus AI feedback. The findings will clarify the strengths and limitations of LLMs as supportive tools for formative assessment in health-professions education and will inform criteria for their responsible and effective use. No LLM output will affect students' official grades, which remain the sole responsibility of faculty.
Age range
18 Years
Sex
ALL
Agreement between LLM global scores and the faculty reference global score
Timeframe: Single cross-sectional assessment during the data-collection period (approximately 2 months)
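The record does not state which agreement statistic will be used for this outcome. As an illustration only, the sketch below shows two common options for paired continuous scores: Bland-Altman bias with 95% limits of agreement, and Lin's concordance correlation coefficient. The function names and the example scores are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def bland_altman(llm_scores, faculty_scores):
    """Return (bias, lower limit, upper limit) for paired scores.

    Bias is the mean of (LLM - faculty); the limits are bias +/- 1.96 SD
    of the differences (sample SD). Assumed statistic, not the study's.
    """
    if len(llm_scores) != len(faculty_scores) or len(llm_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(llm_scores, faculty_scores)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (vx + vy + (mx - my) ** 2)


if __name__ == "__main__":
    # Hypothetical global scores on a 0-10 scale, for illustration only.
    llm = [7.5, 6.0, 8.0, 5.5, 9.0]
    faculty = [7.0, 6.5, 8.0, 5.0, 8.5]
    bias, lower, upper = bland_altman(llm, faculty)
    print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
    print(f"Lin's CCC={lin_ccc(llm, faculty):.3f}")
```

The same two functions could be applied to the duplicate runs of a single model to describe test-retest stability, or to pairs of models to describe inter-model agreement. Criterion-level rubric scores on an ordinal scale may call for a different statistic, such as a weighted kappa.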