Assessing the Clinical Utility of Multimodal Large Language Models in the Diagnosis and Management of Pigmented Choroidal Lesions

Nehal Nailesh Mehta, Evan Walker, Elena Flester, Gillian Folk, Akshay Agnihotri, Ines D. Nagel, Melanie Tran, Michael H. Goldbaum, Shyamanga Borooah, Nathan L. Scott
Translational Vision Science & Technology
University of California, San Diego

Overall Summary

Study Background and Main Findings

This study evaluates the performance of three contemporary multimodal large language models (MLLMs)—ChatGPT 4.0, Gemini Advanced 1.5 Pro, and Perplexity Pro—on the complex ophthalmic task of distinguishing benign choroidal nevi from malignant melanoma. The research benchmarks the diagnostic and treatment recommendation capabilities of these AI systems against the performance of two expert human ocular oncologists, providing a direct comparison of their clinical utility in a specialized, high-stakes medical domain.

The methodology employs a retrospective cross-sectional design, analyzing a dataset of 48 eyes from 47 patients with confirmed diagnoses. The ground truth for each case was rigorously established through biopsy for melanomas and at least six months of documented imaging stability for nevi. Both the MLLMs and human experts were presented with a rich, multimodal dataset including various ocular images (e.g., color fundus photography, optical coherence tomography) and, in some scenarios, relevant clinical text. Performance was systematically assessed across six structured prompts designed to test abilities ranging from multiple-choice diagnosis to treatment recommendations under different data conditions.

The key findings reveal a significant performance gap between current AI and human expertise. The human graders significantly outperformed all MLLMs in both accuracy and sensitivity (P < 0.005). Among the AI models, Gemini consistently demonstrated the highest performance, achieving a peak accuracy of 0.725 on treatment recommendations when provided with clinical information. However, the study uncovered critical, systematic biases: ChatGPT exhibited a tendency to over-diagnose malignancy, whereas Gemini tended to under-diagnose it. All models performed poorly when required to apply strict clinical guidelines, with Gemini and Perplexity showing zero sensitivity for detecting melanoma in that context, highlighting a crucial failure in rule-based reasoning.

The study concludes that while MLLMs show some capability in analyzing complex medical images, they are not yet reliable for autonomous clinical decision-making in this field. Their moderate performance, coupled with predictable and dangerous error patterns, underscores the need for cautious integration and significant further development. The ability of some models to improve with additional clinical text suggests a potential future role as assistive tools, but their current limitations preclude their use as standalone diagnostic agents.

Research Impact and Future Directions

Overall, the evidence strongly supports the conclusion that current-generation MLLMs are unreliable for the autonomous diagnosis and management of pigmented choroidal lesions. This verdict is substantiated by two key findings: the statistically significant underperformance of all AI models compared to human experts (P < 0.005) and, more critically, the discovery of systematic, opposing diagnostic biases. The misclassification analysis (Tables 4-6) revealed that ChatGPT tends to over-diagnose malignancy while Gemini tends to under-diagnose it, both of which pose unacceptable risks to patient safety. These predictable failure modes outweigh the moderate performance gains seen in some models, indicating a fundamental lack of clinical readiness.

Major Limitations and Risks: The study's primary limitation is its small sample size (n=48), which constrains the statistical power and generalizability of the results, as evidenced by the wide confidence intervals in Table 3. This proof-of-concept design is suitable for demonstrating feasibility but not for establishing definitive performance benchmarks for clinical use. A second methodological risk, identified in the Methods section, is the use of a 6-month stability criterion to define benign nevi. This relatively short duration introduces the possibility that slow-growing melanomas were included in the 'ground truth' benign cohort, which could skew the measured specificity and accuracy of all tested systems, including the human experts.

Based on the demonstrated underperformance and significant safety risks, the implementation recommendation for these MLLMs in this clinical context is one of **Low confidence**. The retrospective, small-sample proof-of-concept design is insufficient to support adoption for autonomous decision-making. The most critical next step is a larger, prospective validation study with a diverse patient cohort and longer follow-up periods to confirm the stability of benign lesions. The key unanswered question is whether the observed systematic biases are inherent limitations of general-purpose MLLMs or if they can be mitigated through fine-tuning on specialized, curated medical datasets to create a more reliable clinical tool.

Critical Analysis and Recommendations

Clear and Structured Abstract (written-content)
The abstract's use of a standard 'Purpose, Methods, Results, Conclusions' structure provides a clear, concise summary of the study. This format makes the core findings, including key quantitative data like Gemini's peak accuracy (0.725) and the statistical significance of human superiority (P < 0.005), highly accessible to readers.
Section: Abstract
Acknowledge Study Scope in Abstract Conclusion (written-content)
The abstract's conclusion is direct but could be improved by briefly mentioning the study's context (e.g., 'In this cohort'). Adding this nuance would enhance scientific transparency by subtly acknowledging the generalizability limitations inherent in the study's specific dataset, without undermining the main finding.
Section: Abstract
Strong Justification of Clinical Need (written-content)
The introduction effectively establishes the clinical importance of distinguishing choroidal nevi from melanoma by citing specific statistics, such as the prevalence of nevi (4.7%) and the annual risk of malignant transformation (1 in 8845). This data-driven approach clearly communicates the high-stakes nature of the diagnostic problem, providing a compelling rationale for the study.
Section: Introduction
Formulate an Explicit, Testable Hypothesis (written-content)
While the introduction clearly states the study's aim, it lacks a formal, explicit hypothesis. Articulating a specific prediction (e.g., 'we hypothesized that MLLMs would exhibit lower diagnostic accuracy than expert specialists') would strengthen the scientific rigor by framing the research as a direct test of an expected outcome, rather than a purely descriptive comparison.
Section: Introduction
Rigorous 'Ground Truth' for High-Confidence Benchmarking (written-content)
The study's validity is anchored by its strict definition of the correct diagnosis ('ground truth'). Requiring biopsy confirmation for melanoma and at least six months of imaging stability for benign nevi establishes a reliable, clinically relevant standard. This ensures that the performance of both AI and human graders is measured against a high-confidence benchmark, making the comparisons meaningful.
Section: Methods
Sophisticated Controls to Mitigate Bias (written-content)
The methodology includes thoughtful controls to reduce bias for both AI and human evaluators. Using temporary chat sessions to prevent the AI from 'learning' between cases and introducing distractor images to prevent human experts from defaulting to a binary choice are sophisticated measures that significantly increase the credibility and internal validity of the comparative results.
Section: Methods
Error Analysis Reveals Critical, Systematic AI Biases (graphical-figure)
The misclassification analysis (Tables 4, 5, 6) provides crucial insights beyond simple accuracy scores. It reveals that the AI models' errors are not random but systematic and opposing: ChatGPT shows a bias toward over-diagnosing malignancy (mislabeling 48.1% of benign nevi as melanoma), while Gemini shows a dangerous bias toward under-diagnosing it (mislabeling 84.2% of melanomas as benign). This detailed error analysis is essential for assessing the clinical safety and failure modes of these tools.
Section: Methods
Methodological Limitation: Small Sample Size Limits Generalizability (written-content)
The study's small sample size (n=48) and short follow-up period for nevi (6 months) fundamentally limit the generalizability of the findings. This proof-of-concept design means the quantitative performance metrics have wide confidence intervals and may not be representative of real-world performance. The short follow-up also introduces a risk that slow-growing melanomas were misclassified as benign, potentially affecting the accuracy of specificity measurements.
Section: Methods

Section Analysis

Non-Text Elements

Table 1. Agreement Analysis between Human Graders
First Reference in Text
prompts, with κ values ranging from 0.634 to 0.707 across the prompts (Table 1).
Description
  • Quantification of Inter-Grader Agreement: This table quantifies the level of agreement between two expert human graders who were asked to evaluate the same set of medical images based on six different prompts or tasks. The 'Percentage Agreement' column shows the raw proportion of cases where both experts gave the identical answer, with values ranging from 81.2% to 85.4%, indicating a high degree of consistency.
  • Cohen's Kappa (κ) Statistic: The table includes Cohen's Kappa (κ), a statistical measure used to assess the reliability of agreement between two raters. Unlike simple percentage agreement, Kappa accounts for the possibility that agreement could occur by chance. A Kappa value of 0 indicates agreement is equivalent to chance, while 1 indicates perfect agreement. The values in the table range from 0.634 to 0.707, which are generally interpreted as representing 'substantial agreement', reinforcing that the expert graders were highly consistent in their judgments beyond random chance.
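To make the chance-correction concrete, here is a minimal sketch of how Cohen's κ can be computed for two graders' categorical calls; the labels and values are hypothetical and are not taken from the study.
```python
# Minimal sketch of Cohen's kappa for two raters; labels are hypothetical, not study data.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in set(counts_a) | set(counts_b))
    return (p_o - p_e) / (1 - p_e)

grader1 = ["nevus", "melanoma", "nevus", "nevus", "melanoma"]
grader2 = ["nevus", "melanoma", "melanoma", "nevus", "melanoma"]
print(cohens_kappa(grader1, grader2))  # ~0.615: agreement well above chance
```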
Scientific Validity
  • ✅ Use of an appropriate statistical measure: The inclusion of Cohen's Kappa (κ) alongside percentage agreement is a methodological strength. Kappa is the standard and appropriate metric for assessing inter-rater reliability for categorical ratings, as it corrects for chance agreement. This provides a more rigorous and scientifically sound measure of consistency than percentage agreement alone.
  • ✅ Establishes a crucial human performance benchmark: This analysis is essential for the study's overall validity. By demonstrating substantial agreement between expert human graders, you establish a reliable 'gold standard' for performance. This benchmark is critical for contextualizing and evaluating the performance of the multimodal large language models (MLLMs) presented later in the study.
  • 💡 Lack of prompt descriptions: The table lists prompts by number (1-6) but does not describe the nature of each task. While this information is likely available elsewhere in the manuscript, the table would be more informative and self-contained if it included a brief description for each prompt (e.g., 'Multiple-choice diagnosis', 'Binary classification'). This would allow readers to see if agreement levels varied based on task complexity or type.
Communication
  • ✅ Clear and concise presentation: The table is well-structured, with a simple three-column layout and clear headings. This minimalist design makes the data easy to read and interpret, allowing the reader to quickly grasp the key takeaway: the human graders showed high agreement.
  • 💡 Placement within the manuscript: The table is referenced in the 'Methods' section, but it presents results. Standard reporting convention would place a table summarizing data outcomes in the 'Results' section. Moving this table to the Results section would improve the manuscript's logical flow, separating the description of the methodology from the presentation of its outcomes.
Table 2. Human Grader Performance by Prompt
First Reference in Text
The human graders had high overall and per-prompt accuracy (ranging from 81.2% to 93.8%), sensitivity (78.6% to 96.2%), and specificity (ranging from 81% to 95.2%) (Table 2).
Description
  • Presentation of Diagnostic Performance Metrics: This table details the performance of two individual human experts ('Grader 1' and 'Grader 2') across six different diagnostic tasks, or 'prompts'. Performance is measured using three standard metrics: Accuracy, Sensitivity, and Specificity. 'Accuracy' represents the overall proportion of correct diagnoses. 'Sensitivity' measures how well the expert correctly identified the presence of a condition (e.g., melanoma), acting like a 'disease-detector'. 'Specificity' measures how well the expert correctly identified the absence of that condition (e.g., a harmless nevus), acting as a 'healthy-case detector'.
  • High Performance Levels of Human Experts: The data shows that both human graders performed at a very high level across all tasks. Accuracy ranged from 78.7% to 93.8%. Sensitivity was also high, ranging from 78.6% to 96.2%, indicating a strong ability to detect disease. Specificity was similarly excellent, ranging from 78.9% to 95.2%, showing a strong ability to rule out disease when it was not present. Each performance value is accompanied by a 95% confidence interval (e.g., 0.812 (0.667, 0.896)), which provides a range of plausible values for the true performance, giving an indication of the estimate's precision.
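For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, and specificity all derive from a single confusion matrix; the counts are hypothetical and chosen only for illustration.
```python
# Hypothetical confusion-matrix counts (melanoma treated as the positive class).
tp, fn = 18, 1   # melanomas correctly identified / melanomas missed
tn, fp = 26, 3   # nevi correctly identified / nevi over-called as melanoma

sensitivity = tp / (tp + fn)                 # disease detected when present
specificity = tn / (tn + fp)                 # disease ruled out when absent
accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall proportion correct

print(f"sens={sensitivity:.3f} spec={specificity:.3f} acc={accuracy:.3f}")
# sens=0.947 spec=0.897 acc=0.917
```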
Scientific Validity
  • ✅ Comprehensive and appropriate metrics: The use of accuracy, sensitivity, and specificity is the standard and most appropriate method for evaluating the performance of a diagnostic test or classifier. This provides a much more complete picture than accuracy alone, distinguishing between the ability to correctly identify positive and negative cases, which is clinically crucial.
  • ✅ Inclusion of confidence intervals: Reporting 95% confidence intervals for each point estimate is a significant methodological strength. It transparently communicates the statistical uncertainty and precision of the performance measures, allowing for a more robust interpretation of the results and comparison between graders and prompts.
  • ✅ Establishes a robust human performance benchmark: This table is critical for the study's objective as it quantifies expert-level performance. This data serves as the 'gold standard' or benchmark against which the performance of the AI models is later compared, providing essential context for evaluating the clinical utility of the MLLMs.
Communication
  • ✅ Clear and logical structure: The table is well-organized and easy to read. It logically groups data by prompt, then by grader, with separate columns for each key performance metric. This structure allows for easy comparison between the two graders and across the different tasks.
  • 💡 Lack of context for prompts: The prompts are identified only by number (1-6). The table would be significantly more self-contained and informative if a brief description of each prompt's task was included (e.g., 'Prompt 1 - Multiple-choice diagnosis'). This would allow readers to interpret performance variations across tasks without needing to search for this information in the main text.
  • 💡 Potential for misplacement in the manuscript: The table is first referenced in the 'Results' section of the paper yet is grouped under 'Methods' here. Tables presenting study outcomes, such as this one, should be located in the 'Results' section to maintain a logical flow from methodology to findings.
Table 3. MLLM Performance by Prompt
First Reference in Text
Misclassification analysis was not performed for prompts 2, 3, and 4, as responses were limited to two classes and are fully characterized by sensitivity and specificity metrics in Table 3.
Description
  • Comparative Performance of AI Models: This table presents the performance of three different AI systems—ChatGPT, Gemini, and Perplexity—on six different tasks (prompts) related to diagnosing eye lesions. Their performance is measured using three key metrics: 'Accuracy' (the overall percentage of correct answers), 'Sensitivity' (the ability to correctly identify cases with the disease, like a good smoke detector), and 'Specificity' (the ability to correctly identify cases without the disease, avoiding false alarms). Each reported value is accompanied by a 95% confidence interval, which indicates the range of plausible true values for that metric.
  • Performance Varies by Model and Task: The results show a clear difference in performance. Gemini consistently performed best, especially on Prompt 6 (recommending treatment with clinical information), where it achieved the highest accuracy of 0.708 (or 70.8%). In contrast, ChatGPT's performance was generally lower, with an accuracy of just 0.314 (31.4%) on the same task. This suggests that the choice of AI model and the specific nature of the task significantly impact the outcome.
  • Critical Failure on Rule-Based Diagnosis: A notable finding is the performance on Prompt 4, which required the AIs to apply strict clinical guidelines (COMS criteria). For this task, both Gemini and Perplexity had a sensitivity of 0.000, meaning they failed to identify a single case of melanoma. They achieved perfect specificity (1.000), meaning they never misidentified a healthy case, but their inability to detect the disease under these rules is a critical failure.
Scientific Validity
  • ✅ Comprehensive and appropriate metrics: The use of accuracy, sensitivity, and specificity provides a robust and multifaceted evaluation of the MLLMs' diagnostic capabilities. This is far superior to relying on accuracy alone, as it differentiates between the models' ability to identify true positives and true negatives, which is essential for clinical applicability.
  • ✅ Inclusion of confidence intervals: Reporting 95% confidence intervals for all performance metrics is a major strength. It provides a measure of statistical precision for each estimate, transparently conveying the uncertainty associated with the sample size and allowing for more nuanced comparisons between models.
  • ✅ Strong experimental design: The stepwise progression of prompts (e.g., diagnosis without clinical info vs. with clinical info) is an excellent design choice. It allows for a direct assessment of how additional data impacts model performance, which is a key question for understanding their reasoning capabilities. The results from Prompt 4, testing adherence to strict guidelines, provide particularly powerful evidence of the models' current limitations.
  • 💡 Wide confidence intervals suggest limited statistical power: While the inclusion of confidence intervals is a strength, their width (e.g., Gemini's accuracy on Prompt 6 is 0.708 with a CI of 0.562 to 0.812) indicates considerable uncertainty in the point estimates. This is an inherent limitation of the sample size (n=48) and should be explicitly acknowledged when interpreting the magnitude of performance differences between models.
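The width of such intervals follows directly from the sample size. As an illustration only, the Wilson score interval below roughly reproduces the reported range for a proportion of 0.708 at n = 48; the paper's actual CI method is not stated in the table, so this reconstruction is an assumption.
```python
# Wilson score interval for a proportion; illustrates how wide a CI is at n = 48.
# The study's exact CI method is not specified, so this reconstruction is an assumption.
import math

def wilson_ci(p_hat, n, z=1.96):
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

print(wilson_ci(0.708, 48))  # roughly (0.57, 0.82), spanning ~25 percentage points
```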
Communication
  • ✅ Logical and clear table structure: The table is well-organized, grouping results first by prompt and then by model. This hierarchical structure makes it straightforward for readers to compare the performance of the three MLLMs on any given task and to track a single model's performance across different tasks.
  • 💡 Lack of visual aids for comparison: The table is dense with numerical data, which can make it difficult to quickly identify the key findings. To improve scannability, consider using bold text to highlight the best-performing model for each metric within each prompt. This simple visual cue would help guide the reader's attention to important patterns, such as Gemini's general superiority.
  • 💡 Table is not self-contained: The prompts are only identified by number, requiring the reader to refer back to the main text to understand the specific task being evaluated. The table's utility would be greatly enhanced by adding a brief description of each prompt, either in a separate column or as a footnote to the table.
  • 💡 Potential misplacement in manuscript: The table presents key study results, but its section is listed as 'Methods'. To align with standard scientific reporting structure, this table should be placed in the 'Results' section, ensuring a clear distinction between the description of the study's design and the presentation of its findings.
Table 4. Misclassification Summary for Prompt 1
First Reference in Text
ChatGPT frequently misclassified nevi as melanoma (48.1% of errors) or as choroidal hemangioma or choroidal osteoma (Table 4).
Description
  • Error Analysis of AI Models: This table breaks down the specific mistakes made by three AI models (ChatGPT, Gemini, Perplexity) when asked to perform a multiple-choice diagnosis of eye lesions. It shows not just how often they were wrong ('Misclassification, %'), but also what incorrect diagnosis they chose. This helps to reveal if the AIs have specific biases in their errors.
  • ChatGPT's Tendency to Over-diagnose Malignancy: The data shows a significant bias for ChatGPT. When presented with a choroidal nevus, which is a benign, mole-like spot in the back of the eye, ChatGPT misdiagnosed it 93.1% of the time. Critically, of these errors, nearly half (48.1%) were misclassifications as melanoma, which is a dangerous form of eye cancer. This indicates a strong tendency to mistake a harmless condition for a malignant one.
  • Gemini's Tendency to Under-diagnose Malignancy: Gemini displayed the opposite bias. When shown cases of actual melanoma (cancer), it failed to identify it correctly 100% of the time. The vast majority of these errors (84.2%) were misclassifications as a benign choroidal nevus. This reveals a dangerous tendency to mistake a malignant condition for a harmless one.
  • Perplexity's High Error Rate: Perplexity showed high rates of misclassification for both conditions. It incorrectly diagnosed benign nevi 62.1% of the time and malignant melanomas 68.4% of the time. Its errors were more varied, but it also showed a tendency to mislabel melanoma as a benign nevus (53.8% of its melanoma errors).
Scientific Validity
  • ✅ Provides crucial insights beyond accuracy metrics: This misclassification analysis is a major strength of the study. By moving beyond simple accuracy scores and detailing the specific nature of the errors, you provide invaluable insight into the diagnostic biases and failure modes of each MLLM. This level of analysis is essential for evaluating the clinical safety and utility of these models.
  • ✅ Data strongly supports the conclusion of systematic bias: The results clearly demonstrate that the errors are not random. The distinct and opposing patterns—ChatGPT's over-diagnosis of malignancy and Gemini's under-diagnosis—are stark. This provides strong evidence for systematic biases within the models' decision-making processes, which is a significant finding.
  • 💡 Ambiguity in the 'Misclassification, %' column: The column header 'Misclassification, %' could be interpreted in multiple ways. It appears to represent the error rate for that specific ground truth category (e.g., 100% of melanoma cases were misclassified by Gemini). Clarifying the header to 'Error Rate (%)' or 'Misclassification Rate for Ground Truth (%)' would remove any ambiguity for the reader.
Communication
  • ✅ Effective structure for error analysis: The table is well-structured, clearly separating the overall misclassification rate from the detailed breakdown of error types. This allows the reader to first grasp the magnitude of the errors and then investigate their nature, which is a logical and effective way to present this complex data.
  • 💡 Overuse of abbreviations hinders readability: The use of abbreviations for all diagnostic classes (AMD, CH, CN, etc.) in the column headers forces the reader to constantly consult the table footer. This significantly slows down interpretation. Given the available space, spelling out the full condition names (e.g., 'Choroidal Nevus', 'Melanoma') would make the table much more self-contained and user-friendly.
  • 💡 Lack of visual hierarchy: The table presents many numbers, but the most critical findings—the specific biases—do not stand out visually. Consider using bold text or cell shading to highlight the most frequent misclassification for each ground truth (e.g., bolding the '48.1' for ChatGPT/CN and the '84.2' for Gemini/Melanoma). This would help guide the reader's eye to the key takeaways.
Table 5. Misclassification Summary for Prompt 5
First Reference in Text
However, all models showed significant misclassification patterns favoring radiotherapy (Table 5).
Description
  • Analysis of AI Treatment Recommendation Errors: This table details the specific mistakes made by three AI models (ChatGPT, Gemini, Perplexity) when asked to recommend a course of treatment for eye lesions. The analysis focuses on cases where the AI's recommendation was incorrect, showing not just how often it was wrong but also which incorrect treatment it chose. The correct treatments, or 'Ground Truth', fall into three categories: 'Enucleation' (surgical removal of the eye), 'Observation' (a 'watch and wait' approach), and 'Radiotherapy' (using radiation to treat the lesion).
  • Systematic Bias Towards Recommending Radiotherapy: A striking pattern emerged across all three AI models: a strong tendency to default to recommending radiotherapy, especially when it was the wrong answer. For example, in cases where the correct action was 'Observation', both ChatGPT and Perplexity incorrectly recommended radiotherapy 100% of the time. Similarly, for cases requiring 'Enucleation', ChatGPT and Gemini incorrectly recommended radiotherapy in 100% of their errors. This shows the AIs are not making random mistakes but are systematically biased towards a specific treatment option.
  • High Rates of Critical Misclassifications: The table highlights clinically significant errors. For instance, ChatGPT misclassified 100% of cases that should have been observed and 100% of cases that required eye removal, defaulting to radiotherapy in every single instance. This represents a high potential for both unnecessary, aggressive treatment (overtreatment) and failure to recommend the definitive surgical procedure (undertreatment).
Scientific Validity
  • ✅ Crucial analysis for clinical safety assessment: This misclassification analysis is vital for evaluating the real-world safety of these MLLMs. By detailing the nature of the errors (e.g., recommending aggressive treatment for cases needing observation), the table provides far more insight into potential patient harm than a simple accuracy score would. This is a methodologically sound and clinically relevant approach.
  • ✅ Data provides strong evidence for systematic bias: The data overwhelmingly supports the claim made in the reference text. The consistent defaulting to 'Radiotherapy' across different models and for different ground truths is not a random occurrence but a clear, systematic failure mode. This is a significant and robust finding regarding the current state of these models for medical decision-making.
  • 💡 Suggests a potential 'middle-ground' bias: The strong preference for radiotherapy could be interpreted as the models selecting a 'middle-ground' option between the less invasive 'Observation' and the highly invasive 'Enucleation'. While the analysis is sound, a discussion of this potential cognitive bias—a tendency to avoid extremes—could add a valuable interpretive layer to the results, suggesting a specific area for future model refinement.
Communication
  • ✅ Effective hierarchical structure: The table is well-organized, grouping the data first by MLLM and then by the correct treatment ('Ground Truth'). This structure makes it easy for the reader to assess each model's error patterns individually and compare them.
  • 💡 Ambiguity in nested percentages: The columns under 'Misclassification Class, %' represent the percentage breakdown of the errors, not of the total cases. For example, for ChatGPT/Observation, 86.2% of cases were misclassified, and of those errors, 100% were 'Radiotherapy'. This nested percentage can be confusing; the sub-header could be changed to '% of Misclassified Cases', or a footnote could explain the calculation (see the arithmetic sketch after this list).
  • 💡 Placement in manuscript: This table presents key findings from the study's experiments. According to standard scientific reporting conventions, it should be located in the 'Results' section rather than the 'Methods' section to maintain a clear and logical separation between the description of the experimental setup and the presentation of its outcomes.
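As a worked illustration of how the nested percentages compose, the arithmetic below uses hypothetical counts rather than values read from the table:
```python
# Hypothetical counts showing how the two percentage levels in Tables 5 and 6 combine.
observation_cases = 29               # assumed number of 'Observation' ground-truth cases
misclassification_rate = 0.862       # share of those cases the model got wrong
radiotherapy_share_of_errors = 1.0   # share of those *errors* labeled 'Radiotherapy'

errors = misclassification_rate * observation_cases          # ~25 incorrect recommendations
radiotherapy_errors = radiotherapy_share_of_errors * errors  # all ~25 of them radiotherapy
absolute_rate = misclassification_rate * radiotherapy_share_of_errors
print(round(errors), round(radiotherapy_errors), f"{absolute_rate:.1%} of all observation cases")
```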
Table 6. Misclassification Summary for Prompt 6
First Reference in Text
ChatGPT continued to misclassify all enucleation cases and 87.9% of observation cases as radiotherapy (Table 6).
Description
  • Analysis of AI Treatment Recommendations with Clinical Data: This table shows the specific errors made by three AI models (ChatGPT, Gemini, Perplexity) when asked to recommend a treatment for eye lesions, but this time they were also provided with relevant patient clinical information (Prompt 6). The analysis breaks down the incorrect recommendations for three potential correct actions: 'Enucleation' (surgical removal of the eye), 'Observation' (a 'watch and wait' strategy), and 'Radiotherapy'. This allows for a direct comparison with Table 5 to see if clinical data improved the AIs' performance.
  • Gemini's Significant Improvement: The addition of clinical data led to a dramatic improvement for the Gemini model. Its misclassification rate for cases requiring 'Observation' dropped from 62.1% (in Table 5) to just 10.3%. For cases requiring 'Enucleation', its error rate fell from 100% to 50%. This suggests Gemini was able to effectively integrate the textual clinical data with the images to make more accurate recommendations.
  • ChatGPT's Persistent Bias: In contrast to Gemini, ChatGPT's performance did not improve with the addition of clinical data. It continued to misclassify 100% of cases requiring 'Enucleation' and a very high percentage (89.7%) of cases requiring 'Observation'. In virtually all of these errors, its recommendation was 'Radiotherapy', indicating that the clinical information did not correct its strong, pre-existing bias.
  • Perplexity's Moderate Improvement: Perplexity showed some improvement but was not as successful as Gemini. Its misclassification rate for 'Observation' cases decreased from 75.9% to 27.6%, and for 'Enucleation' cases, it dropped from 100% to 50%. While an improvement, it still made incorrect recommendations at a significant rate.
Scientific Validity
  • ✅ Powerful experimental design isolates the impact of clinical data: By presenting the same cases with and without clinical information (Prompt 5 vs. Prompt 6), the study design effectively isolates the variable of interest: the models' ability to integrate multimodal data. This direct comparison provides strong evidence for the differential capabilities of the MLLMs, which is a significant finding.
  • ✅ Reveals critical differences in model architecture or training: The starkly different outcomes—Gemini's marked improvement versus ChatGPT's stagnation—strongly suggest fundamental differences in how these models process and weigh combined image and text inputs. This analysis provides valuable data for understanding the current state of multimodal AI in medicine.
  • 💡 Lacks formal statistical comparison between prompts: While the improvement for Gemini and Perplexity is visually apparent when comparing Table 5 and Table 6, the analysis would be more rigorous if it included a formal statistical test (e.g., McNemar's test) to confirm that the change in error rates between the two conditions is statistically significant. This would add quantitative support to the qualitative observations.
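If such a test were added, a minimal sketch (assuming statsmodels is available and using placeholder counts, not study data) could compare each model's correct/incorrect outcomes on the same cases under Prompt 5 versus Prompt 6:
```python
# Sketch of McNemar's exact test on paired outcomes (same cases, Prompt 5 vs. Prompt 6).
# Counts are placeholders; only the discordant cells (b and c) drive the test.
from statsmodels.stats.contingency_tables import mcnemar

#                       Prompt 6 correct   Prompt 6 wrong
# Prompt 5 correct             a=10              b=2
# Prompt 5 wrong               c=15              d=21
table = [[10, 2],
         [15, 21]]
result = mcnemar(table, exact=True)  # exact binomial form suits a small sample
print(result.statistic, result.pvalue)
```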
Communication
  • ✅ Consistent structure facilitates comparison: The table maintains the same clear and logical format as Table 5. This consistency is an excellent design choice, as it makes it very easy for the reader to directly compare the results across the two prompts and quickly identify changes in performance.
  • 💡 Ambiguous percentage reporting: The columns under 'Misclassification Class, %' detail the percentage breakdown of the errors, not the overall case count. This could be misinterpreted. To prevent confusion, the sub-header should be clarified to something like '% of Misclassified Cases' or a footnote explaining the calculation should be added.
  • 💡 Missed opportunity for visual emphasis: The key story of this table is the change in performance from Table 5. This could have been visually emphasized to aid rapid comprehension. For example, using color-coding (e.g., green for improved metrics, red for stagnant ones) or adding small arrows (↑ or ↓) next to the misclassification percentages to indicate the direction of change from the previous prompt would have made the main findings instantly recognizable.
Table 7. Significant Differences for Accuracy and Sensitivity across Various Prompts
First Reference in Text
Finally, ChatGPT performed significantly worse than Gemini, Perplexity, and both human graders in terms of accuracy for prompt 6 (Table 7).
Description
  • Pairwise Performance Comparison Matrix: This table is a matrix that summarizes the results of statistical tests comparing the performance of five different 'graders' (three AI models: ChatGPT, Gemini, Perplexity; and two human experts: Grader 1, Grader 2) against each other. Each cell in the table answers the question: 'Was the grader in this row significantly worse than the grader in this column?'
  • Decoding the Table's Structure and Content: The table is split into two halves. The shaded upper-right half shows comparisons for 'Accuracy' (overall correctness), while the unshaded lower-left half shows comparisons for 'Sensitivity' (the ability to correctly identify a condition when it's present). If a cell contains a prompt number (e.g., 'Prompt 6'), it means the row grader's performance was statistically worse than the column grader's on that specific task. A cell with 'None' indicates no significant difference was found.
  • Key Findings on AI vs. Human Performance: The table clearly shows that the human graders consistently outperformed the AI models. For example, the row for 'ChatGPT' shows prompt numbers in almost every column, indicating it was significantly worse than nearly every other grader on most tasks for both accuracy and sensitivity. In contrast, the rows for the human graders have many 'None' entries when compared to each other, showing their performance was statistically similar. Gemini appears to be the best-performing AI, as its row contains fewer entries indicating poor performance compared to ChatGPT.
Scientific Validity
  • ✅ Rigorous statistical methodology: Presenting the results of pairwise statistical comparisons, presumably corrected for multiple comparisons as mentioned in the methods, is a methodologically rigorous approach. It moves beyond simply observing differences in mean performance scores (from Tables 2 and 3) to determining whether those differences are statistically meaningful, which strengthens the study's conclusions.
  • 💡 Lacks detail on statistical tests: The table effectively summarizes the outcomes but lacks transparency about the underlying statistics. It does not specify the statistical test used, the p-values, or the effect sizes for the significant differences. While the goal is simplification, this omission prevents a full critical appraisal of the statistical evidence. A supplementary table with these details (see the sketch after this list) would enhance the study's rigor.
  • 💡 Directionality of comparison is not explicit: The most significant limitation is that the table does not explicitly state the direction of the significant difference (i.e., which grader was better or worse). This crucial information must be inferred by the reader from the narrative text, which undermines the table's value as a standalone element. The analysis is sound, but its reporting lacks the necessary clarity to be fully self-contained.
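A supplementary table of this kind could list the raw pairwise p-values together with a familywise correction. The sketch below assumes a Holm adjustment via statsmodels; the paper's actual test and correction procedure are not identified in the table, and the comparison names and p-values shown are placeholders.
```python
# Sketch: apply a Holm correction to a set of pairwise p-values before flagging significance.
# The comparison names and p-values are placeholders, not results from the study.
from statsmodels.stats.multitest import multipletests

comparisons = ["ChatGPT vs Gemini", "ChatGPT vs Grader 1", "Gemini vs Grader 1"]
raw_pvalues = [0.004, 0.001, 0.030]

reject, adjusted_p, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")
for name, p_adj, significant in zip(comparisons, adjusted_p, reject):
    print(f"{name}: adjusted p = {p_adj:.3f}, significant = {significant}")
```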
Communication
  • ✅ Highly efficient format for complex data: Using a matrix to display all pairwise comparisons is an extremely space-efficient and effective method. It condenses a large number of individual statistical results into a single, comprehensive visual, avoiding the need for lengthy and repetitive text.
  • 💡 Poorly self-contained due to lack of a key: The table is nearly impossible to interpret without external context from the main text. There is no key or footnote to explain what the cell entries mean (e.g., 'Cell entries indicate prompts for which the row grader performed significantly worse than the column grader'). This is a major communication failure and should be rectified by adding a clear explanatory note to the table.
  • 💡 Awkward labeling and visual design: The diagonal label splitting 'Accuracy' and 'Sensitivity' is unconventional and not immediately clear. A better design would be to use explicit super-headers above the table or create two separate, smaller tables. Furthermore, the distinction between the shaded and unshaded regions is subtle and could be missed by a casual reader.
  • 💡 Caption is imprecise: The caption, 'Significant Differences...', is neutral, whereas the table exclusively reports directional outcomes (i.e., one grader being worse than another). A more precise caption, such as 'Summary of Pairwise Comparisons Identifying Significantly Poorer Performance', would more accurately reflect the table's content.