This study evaluates the performance of three contemporary multimodal large language models (MLLMs)—ChatGPT 4.0, Gemini Advanced 1.5 Pro, and Perplexity Pro—on the complex ophthalmic task of distinguishing benign choroidal nevi from malignant melanoma. The research benchmarks the diagnostic and treatment recommendation capabilities of these AI systems against the performance of two expert human ocular oncologists, providing a direct comparison of their clinical utility in a specialized, high-stakes medical domain.
The methodology employs a retrospective cross-sectional design, analyzing a dataset of 48 eyes from 47 patients with confirmed diagnoses. The ground truth for each case was rigorously established through biopsy for melanomas and at least 6 months of imaging stability for nevi. Both the MLLMs and human experts were presented with a rich, multimodal dataset including various ocular images (e.g., color fundus photography, optical coherence tomography) and, in some scenarios, relevant clinical text. Performance was systematically assessed across six structured prompts designed to test abilities ranging from multiple-choice diagnosis to treatment recommendations under different data conditions.
The key findings reveal a significant performance gap between current AI and human expertise. The human graders significantly outperformed all MLLMs in both accuracy and sensitivity (P < 0.005). Among the AI models, Gemini consistently demonstrated the highest performance, achieving a peak accuracy of 0.725 on treatment recommendations when provided with clinical information. However, the study uncovered critical, systematic biases: ChatGPT exhibited a tendency to over-diagnose malignancy, whereas Gemini tended to under-diagnose it. All models performed poorly when required to apply strict clinical guidelines, showing zero sensitivity for detecting melanoma in that context, highlighting a crucial failure in rule-based reasoning.
The study concludes that while MLLMs show some capability in analyzing complex medical images, they are not yet reliable for autonomous clinical decision-making in this field. Their moderate performance, coupled with predictable and dangerous error patterns, underscores the need for cautious integration and significant further development. The ability of some models to improve with additional clinical text suggests a potential future role as assistive tools, but their current limitations preclude their use as standalone diagnostic agents.
Overall, the evidence strongly supports the conclusion that current-generation MLLMs are unreliable for the autonomous diagnosis and management of pigmented choroidal lesions. This verdict is substantiated by two key findings: the statistically significant underperformance of all AI models compared to human experts (P < 0.005) and, more critically, the discovery of systematic, opposing diagnostic biases. The misclassification analysis (Tables 4-6) revealed that ChatGPT tends to over-diagnose malignancy while Gemini tends to under-diagnose it, both of which pose unacceptable risks to patient safety. These predictable failure modes outweigh the moderate performance gains seen in some models, indicating a fundamental lack of clinical readiness.
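To make the 'opposing biases' concrete, the sketch below computes sensitivity, specificity, and accuracy from a confusion matrix. The counts are hypothetical (chosen only so each arm sums to 48 eyes) and are not the paper's actual misclassification data from Tables 4-6; they simply show how an over-diagnosing model and an under-diagnosing model can reach similar moderate accuracy while failing in opposite, clinically distinct ways.

```python
# Minimal sketch of how over- vs under-diagnosis biases show up in the standard
# metrics. Counts are HYPOTHETICAL, chosen only to illustrate the pattern; they
# are not taken from Tables 4-6 of the paper.

def metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) with melanoma as the positive class."""
    sensitivity = tp / (tp + fn)                    # melanomas correctly flagged
    specificity = tn / (tn + fp)                    # nevi correctly called benign
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Over-diagnosis bias (ChatGPT-like pattern): benign nevi frequently labeled melanoma -> many FPs.
print("over-diagnosis :", metrics(tp=18, fn=2, fp=14, tn=14))   # high sensitivity, low specificity

# Under-diagnosis bias (Gemini-like pattern): melanomas frequently labeled nevus -> many FNs.
print("under-diagnosis:", metrics(tp=8, fn=12, fp=2, tn=26))    # low sensitivity, high specificity
```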
Major Limitations and Risks: The study's primary limitation is its small sample size (n=48), which constrains the statistical power and generalizability of the results, as evidenced by the wide confidence intervals in Table 3. This proof-of-concept design is suitable for demonstrating feasibility but not for establishing definitive performance benchmarks for clinical use. A second methodological risk, identified in the Methods section, is the use of a 6-month stability criterion to define benign nevi. This relatively short duration introduces the possibility that slow-growing melanomas were included in the 'ground truth' benign cohort, which could skew the measured specificity and accuracy of all tested systems, including the human experts.
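As a rough illustration of how wide those intervals are at this sample size, the sketch below computes a Wilson 95% confidence interval for the reported peak accuracy of 0.725, assuming (purely for illustration) that all 48 eyes contribute to the denominator; the actual per-prompt denominators in Table 3 may differ.

```python
# Minimal sketch of why n = 48 yields wide confidence intervals, using the
# Wilson score interval with z = 1.96. The proportion 0.725 is the reported
# peak accuracy; treating all 48 eyes as its denominator is an assumption
# made here only for illustration.
from math import sqrt

def wilson_ci(p_hat, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    halfwidth = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - halfwidth, center + halfwidth

low, high = wilson_ci(0.725, 48)
print(f"accuracy 0.725 on 48 eyes -> 95% CI ({low:.3f}, {high:.3f})")  # roughly (0.59, 0.83)
```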
Based on the demonstrated underperformance and significant safety risks, the implementation recommendation for these MLLMs in this clinical context is one of **Low confidence**. The retrospective, small-sample proof-of-concept design is insufficient to support adoption for autonomous decision-making. The most critical next step is a larger, prospective validation study with a diverse patient cohort and longer follow-up periods to confirm the stability of benign lesions. The key unanswered question is whether the observed systematic biases are inherent limitations of general-purpose MLLMs or if they can be mitigated through fine-tuning on specialized, curated medical datasets to create a more reliable clinical tool.
The abstract is exceptionally well-organized, following a clear Purpose, Methods, Results, and Conclusions structure. This format allows readers to quickly and efficiently understand the study's core components, from its initial hypothesis to its final implications, making the key information highly accessible.
The abstract effectively summarizes key quantitative findings, providing specific metrics that anchor the conclusions in data. Citing Gemini's peak accuracy (0.725) and the p-value for human superiority (P < 0.005) gives the reader a concise yet powerful understanding of the performance differences observed.
The abstract mentions 'multimodal large language models (MLLMs)' in the Purpose but only names the specific models (ChatGPT, Gemini, Perplexity) in the Methods section. Naming these widely recognized models in the opening 'Purpose' statement would provide immediate clarity and context for the reader. This is a low-impact change that would enhance readability and immediately signal the study's specific focus to those familiar with the AI landscape.
Implementation: Revise the 'Purpose' sentence to explicitly name the models being evaluated. For example: 'Purpose: To evaluate the diagnostic and treatment recommendation performance of three multimodal large language models (MLLMs)—ChatGPT 4.0, Gemini Advanced 1.5 Pro, and Perplexity Pro—in identifying and classifying retinal lesions...'
The conclusion is direct and accurate but could be strengthened by briefly acknowledging the study's context or scope (e.g., the specific cohort). While abstracts must be concise, adding a short qualifying phrase enhances scientific transparency and manages reader expectations by subtly hinting at the generalizability limitations discussed later in the paper. This is a medium-impact suggestion that improves the abstract's scientific rigor.
Implementation: Modify the 'Conclusions' sentence to include a brief qualifying phrase. For example: 'Conclusions: In this cohort, human graders outperform current MLLMs, which show only moderate ability to diagnose choroidal nevi or melanoma from imaging.' This frames the finding within the context of the study without undermining its significance.
The introduction effectively employs a classic "funnel" structure, starting with the broad context of AI in medicine, narrowing to its use in ophthalmology, then focusing on the specific technological advance of MLLMs, and finally pinpointing the precise clinical problem. This logical progression efficiently guides the reader and builds a strong rationale for the study.
The paper effectively establishes the clinical importance of the research problem by providing specific, impactful data. Citing the prevalence of choroidal nevi (4.7%), the annual transformation risk (1 in 8845), and the malignant nature of melanoma clearly communicates the high stakes involved in accurate diagnosis.
The introduction clearly states the study's aim but stops short of articulating a formal hypothesis. Adding an explicit hypothesis, such as predicting that MLLMs would underperform relative to human experts, would strengthen the scientific rigor of the introduction. This is a medium-impact suggestion that aligns the paper more closely with the conventional structure of hypothesis-driven research, setting clear expectations for the reader about the anticipated outcome.
Implementation: After stating the study's aim in the final paragraph, add a sentence articulating the hypothesis. For example: "Based on the nascent stage of this technology for complex medical imaging, we hypothesized that the MLLMs would exhibit lower diagnostic accuracy and sensitivity than expert ocular oncology specialists."
While the specific MLLMs are mentioned earlier as examples, the final aim statement refers to them generically as "these MLLMs." Explicitly naming the models (ChatGPT-4, Gemini, and Perplexity) within the final aim statement would provide greater clarity and precision, immediately informing the reader of the exact scope of the comparison. This is a low-impact change that enhances readability and removes any potential ambiguity.
Implementation: Revise the final paragraph to name the models. For example: "Our study aims to compare the accuracy of three leading MLLMs—ChatGPT-4, Gemini, and Perplexity—when analyzing retinal imaging for eyes exhibiting either choroidal nevus or melanoma."
The study's validity is significantly enhanced by its strict criteria for establishing the ground truth. Requiring biopsy confirmation for melanomas and at least 6 months of imaging stability for nevi creates a reliable and clinically relevant benchmark against which all MLLM and human performance can be accurately measured.
The methods section provides excellent detail on the specific MLLMs, including the underlying model for Perplexity Pro (Llama 3.1 Sonar), and the manufacturers of all imaging equipment. This high level of specificity is crucial for the study's reproducibility, allowing other researchers to understand the exact conditions of the experiment and potentially replicate it.
The authors demonstrate a thoughtful approach to experimental design by implementing specific strategies to mitigate bias for both AI and human evaluators. Using temporary chats to prevent AI learning and introducing distractor images for human graders are sophisticated controls that strengthen the credibility of the comparative results.
The methods state that a 6-month follow-up was used to confirm nevus stability, which serves as the ground truth for benign lesions. However, the discussion section later acknowledges that longer periods (e.g., 24 months) are sometimes used to exclude slow-growing melanomas. This is a medium-impact suggestion; adding a brief sentence in the Methods to justify the selection of the 6-month timeframe—whether it reflects a common clinical standard, a pragmatic constraint, or a specific study design choice—would enhance methodological transparency and proactively address potential questions about the ground truth criteria.
Implementation: In the paragraph detailing the inclusion criteria, after stating the 6-month follow-up period, add a sentence of justification. For example: '...over a minimum follow-up period of 6 months, a duration considered sufficient in many clinical settings to establish short-term stability.'
The inclusion criteria include the "availability of sufficient ocular imaging," which is a slightly ambiguous term. While the specific imaging modalities are listed later, explicitly defining what constitutes a 'sufficient' or complete set of images within the inclusion criteria would improve clarity and reproducibility. This is a low-impact suggestion that would make the selection criteria more concrete, ensuring readers understand whether every included case had, for example, all four imaging modalities (color fundus photography, autofluorescence, OCT, and ultrasound).
Implementation: Revise the inclusion criteria sentence to be more specific. For example: 'Inclusion criteria were... and availability of a sufficient set of ocular images, defined as having, at a minimum, color fundus photography and optical coherence tomography.'