Skin Lesion Image Classification With Tree-Based Ensembles: Benchmarking Random Forest and Gradient Boosting

Sanman Pattnaik, Saphalya Pattnaik, Mohamed Khalid, Sagaya Joel Leo, Gur-Aziz Singh Sidhu
Cureus
Manipal Institute of Technology

Overall Summary

Study Background and Main Findings

This study investigates whether traditional, interpretable machine learning (ML) models can serve as a practical alternative to computationally intensive deep learning (DL) systems for classifying skin lesions. The primary objective was to benchmark two 'tree-based ensemble' models—Random Forest and Gradient Boosting—against a lightweight DL model for classifying four types of skin lesions: basal cell carcinoma (BCC), benign keratosis-like lesions (BKL), melanocytic nevi (MN), and melanoma. At its core, the study examines the trade-off between the high accuracy of 'black box' DL models and the efficiency and transparency offered by classical ML approaches.

The methodology involved using a public dataset of 8,000 dermoscopic images. In contrast to DL models, which learn features automatically from raw pixels, this study employed a traditional pipeline built on 'handcrafted' feature engineering. This process involved several steps: standardizing images through preprocessing, then programmatically extracting specific numerical descriptors related to texture (e.g., Local Binary Patterns), color (histograms in the LAB color space), and shape (e.g., Histogram of Oriented Gradients). These features were then used to train an ensemble model, which combines the predictions of Random Forest and Gradient Boosting classifiers to improve overall performance. The model's accuracy was evaluated on a hold-out test set and compared to MobileNetV2, a DL model optimized for efficiency.
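For concreteness, below is a minimal sketch of this kind of handcrafted-feature-plus-ensemble pipeline using scikit-image and scikit-learn. The library choices, feature parameters, and soft-voting combination are assumptions, since the paper does not specify its implementation; the estimator counts echo the 'best parameters' shown in Figure 6.

```python
import numpy as np
from skimage.color import rgb2gray, rgb2lab
from skimage.feature import hog, local_binary_pattern
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

def extract_features(image_rgb):
    """Concatenate texture (LBP), color (LAB), and shape (HOG) descriptors."""
    gray = rgb2gray(image_rgb)

    # Texture: uniform LBP yields P + 2 = 10 distinct codes for P=8, R=1.
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    # Color: per-channel histograms in the LAB color space.
    lab = rgb2lab(image_rgb)
    lab_hist = np.concatenate(
        [np.histogram(lab[..., c], bins=16, density=True)[0] for c in range(3)])

    # Shape/structure: Histogram of Oriented Gradients.
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))

    return np.concatenate([lbp_hist, lab_hist, hog_vec])

# Soft-voting ensemble of the two tree-based classifiers; n_estimators
# values (38, 48) mirror the tuning output reported in Figure 6.
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=38)),
                ("gb", GradientBoostingClassifier(n_estimators=48))],
    voting="soft")
# X = np.stack([extract_features(img) for img in images]); ensemble.fit(X, y)
```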

The study reports that its traditional ML approach achieved high performance, particularly in identifying the most critical lesion types. The Gradient Boosted model reached 89% accuracy, and the final ensemble demonstrated excellent F1-scores (a combined measure of precision and recall) for melanoma (0.92) and melanocytic nevi (0.98). However, the model struggled to distinguish between BCC and BKL, which have similar visual features. A key finding highlighted in the abstract is that the traditional models trained more than 10 times faster than the DL benchmark while achieving comparable performance in melanoma detection. The use of interpretability techniques also confirmed that the model's decisions were based on clinically relevant cues, such as irregular borders and pigmentation.

The paper concludes that for applications like skin lesion classification, traditional ML models present a compelling and viable alternative to deep learning. By offering a balance of strong diagnostic performance, significantly greater computational efficiency, and transparent decision-making, this approach is particularly well-suited for deployment in resource-constrained clinical settings or as point-of-care diagnostic aids where speed and interpretability are paramount.

Research Impact and Future Directions

Overall, the evidence only partially supports the study's conclusion that traditional machine learning models are a robust and practical alternative to deep learning for this task. The paper's primary strength lies in its clear demonstration of high classification performance for clinically critical lesions like melanoma (0.92 F1-score) and its compelling, albeit unsubstantiated, claim of a tenfold efficiency gain. However, the study's reliability is significantly undermined by a major methodological contradiction regarding the image processing pipeline (grayscale vs. color) and a key results figure (Figure 5) that visually contradicts the central premise of using an ensemble model. These unresolved issues weaken confidence in the reported findings.

Major Limitations and Risks: The study's conclusions are constrained by several high-impact issues. First, the methodological contradiction between the described grayscale preprocessing pipeline (Figure 2) and the stated use of color-based features (Methods text) makes the core methodology irreproducible and its results questionable. Second, the graphical evidence in the Results section (Figure 5) showing the ensemble model underperforming its components directly conflicts with the paper's main argument, suggesting a potential flaw in the ensembling process or the data visualization. Finally, the failure to substantiate the crucial '>10x faster' training claim from the abstract with data or discussion in the main text leaves the central comparison to deep learning as an unsupported assertion.

Based on this analysis, the implementation or adoption of this specific model is not recommended. The confidence in its reported performance is Low due to the unresolved methodological contradictions and conflicting evidence within the results. The study serves as a proof-of-concept, but its design limitations prevent it from making reliable claims of superiority or even equivalence to a deep learning benchmark. The most critical next step required to raise confidence would be to resolve the fundamental contradiction in the preprocessing methodology and either correct or explain the counterintuitive results in Figure 5. Following that, a rigorous validation on an external, multi-source clinical dataset would be necessary to assess the model's true generalizability and performance.

Critical Analysis and Recommendations

Clear and Quantitative Findings in Abstract (written-content)
The abstract presents key findings with specific, quantitative metrics, stating that the Gradient Boosted model achieved 89% accuracy and trained 'more than 10 times faster' than the deep learning benchmark. This quantitative clarity immediately establishes the study's primary claims and their significance, providing a strong, evidence-based summary that enhances the paper's credibility and impact from the outset.
Section: Abstract
Failure to Introduce Central Research Tension (written-content)
The introduction establishes the clinical need for automated diagnostics but fails to introduce the paper's central thesis: the trade-off between interpretable, efficient traditional ML and complex, resource-intensive deep learning. Introducing this core tension earlier would provide a stronger narrative hook and better align the introduction with the compelling framing presented in the abstract, improving the paper's overall cohesion.
Section: Introduction
Methodological Contradiction in Image Processing (graphical-figure)
Figure 2 depicts a preprocessing pipeline that converts all images to grayscale, discarding color information. However, the methods text and Figure 3 state that LAB color histograms, which require a color image, were extracted as key features. This is a significant methodological contradiction that undermines the study's reproducibility and raises questions about which pipeline was actually used to generate the final results.
Section: Materials And Methods
Omission of Essential Methodological Details (written-content)
The paper omits key experimental parameters, most notably the dataset split ratio (e.g., 80% train, 20% test), a fundamental detail for reproducibility. This omission prevents other researchers from accurately contextualizing the model's performance and replicating the experimental setup, weakening the methodological transparency of the study.
Section: Materials And Methods
Contradictory Evidence in Performance Graph (graphical-figure)
Figure 5, which illustrates model performance during hyperparameter tuning, shows the final ensemble model consistently performing worse than its individual components (Random Forest and Gradient Boosting). This visual evidence directly contradicts the fundamental premise of ensembling—that combining models should yield superior performance—and undermines the paper's primary conclusion that the ensemble approach is effective.
Section: Results
Unclear Labeling in Key Results Figure (graphical-figure)
The confusion matrix in Figure 7 uses numerical indices (0, 1, 2, 3) for its axes instead of the actual names of the skin lesion classes. This forces the reader to cross-reference the caption or text to interpret the results, violating the principle that a figure should be as self-contained as possible and hindering clear communication of the model's specific error patterns.
Section: Results
Failure to Discuss the Central Deep Learning Comparison (written-content)
The abstract makes a compelling case for the study by highlighting a >10x training speed advantage over a deep learning (DL) benchmark. However, the Discussion section completely fails to revisit or elaborate on this crucial comparative finding. This omission leaves the paper's central thesis underdeveloped and misses a critical opportunity to analyze the practical implications of the efficiency-vs-accuracy trade-off, which was the study's main narrative hook.
Section: Discussion

Section Analysis

Materials And Methods

Non-Text Elements

FIGURE 1: To address class imbalance and enhance generalizability, data...
Full Caption

FIGURE 1: To address class imbalance and enhance generalizability, data augmentation techniques, such as rotation, flipping, scaling, cropping, color jittering, Gaussian blur, and gamma correction, are applied to generate diverse variants of each image, simulating various orientations, spatial settings, and lighting conditions.

First Reference in Text
To address class imbalance and improve generalizability, data augmentation techniques are employed, generating 10 variants of each original image (Figure 1).
Description
  • Visual demonstration of data augmentation workflow: The figure illustrates a process called data augmentation, where an 'Original Image' of a skin lesion is systematically altered to create new, modified versions. This technique is used to artificially expand a dataset for training a machine learning model, helping it become more robust.
  • Sequence of image transformations: The figure displays a sequence of specific modifications applied to the original image. These include geometric changes like rotation and flipping, as well as alterations to appearance such as applying a 'Gaussian Blur' (a softening effect), adjusting contrast and brightness, and applying 'gamma correction' (a method to fine-tune an image's brightness levels). The workflow culminates in a 'Final Augmented Image' that incorporates these changes.
  • Purpose of generating image variants: According to the reference text, this process is used to generate 10 different variants for each original image. The goal, as stated in the caption, is to simulate various real-world conditions like different viewing angles, lighting, and spatial orientations. This helps the model learn to identify skin lesions more accurately, regardless of these variations, and addresses 'class imbalance,' a situation where the model has significantly more examples of one class than another.
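For illustration, here is a minimal sketch of how such a set of variants might be generated with Pillow. The transformation order, parameter ranges, and random application are assumptions, as the paper does not specify them; only the count of 10 variants comes from the reference text.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def augment(image: Image.Image, n_variants: int = 10) -> list:
    """Generate randomly perturbed variants of one lesion image."""
    variants = []
    for _ in range(n_variants):
        img = image.rotate(random.uniform(-30, 30))               # rotation
        if random.random() < 0.5:
            img = ImageOps.mirror(img)                            # flipping
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0, 1.5)))
        gamma = random.uniform(0.7, 1.3)                          # gamma correction
        img = img.point(lambda v: int(255 * (v / 255) ** gamma))
        variants.append(img)
    return variants
```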
Scientific Validity
  • ✅ Use of standard augmentation techniques: The augmentation techniques demonstrated—rotation, flipping, color jittering, and blurring—are standard and appropriate methods in the field of computer vision for medical image analysis. They effectively simulate plausible real-world variations and are well-established for improving model generalization and mitigating overfitting.
  • 💡 Ambiguity in the generation process: The figure depicts a single, linear sequence of transformations, culminating in one 'Final Augmented Image'. However, the text states that '10 variants' are generated. It is unclear if these 10 variants are created by applying this exact sequence with different parameters, or if transformations are applied randomly and independently. Clarifying the strategy for generating the 10 variants would improve methodological reproducibility.
  • 💡 Lack of parameter specification: The figure is illustrative but lacks quantitative details about the augmentation parameters. For instance, the range for rotation angles, the kernel size for the Gaussian blur, or the magnitude of brightness and contrast adjustments are not specified. Including these parameters, either in the figure, caption, or methods text, is crucial for allowing other researchers to replicate the study's methodology accurately.
Communication
  • ✅ Clear workflow visualization: The use of arrows to connect the sequential panels effectively communicates the step-by-step process of image augmentation. This makes the overall concept intuitive and easy to follow for the reader.
  • 💡 Poor image quality and illegible text: The resolution of the individual image panels is low, and the embedded text (e.g., pixel coordinates and RGB values) is pixelated and largely unreadable. This extraneous information clutters the figure without adding value due to its illegibility. It is recommended to remove these low-level details or significantly improve the figure's resolution.
  • 💡 Redundant in-figure title: The figure contains the title 'AUGMENTATION RESULTS' within the image itself. This is redundant, as the figure number and caption already serve this purpose. For a cleaner presentation, all descriptive text should be confined to the caption.
  • 💡 Discrepancy between caption and visual elements: The caption mentions 'scaling' and 'cropping' as augmentation techniques, but these operations are not explicitly visualized as distinct steps in the figure's workflow. This creates a minor inconsistency. Ensure that all techniques mentioned in the caption are either visually represented or that the caption is adjusted to match the figure's content precisely.
FIGURE 2: Preprocessing steps, including grayscale conversion, adaptive...
Full Caption

FIGURE 2: Preprocessing steps, including grayscale conversion, adaptive thresholding, sharpening, aspect ratio-preserving resizing, and padding, are applied to enhance image uniformity and quality while preserving lesion morphology for effective feature extraction.

First Reference in Text
All images undergo grayscale conversion to reduce computational complexity while emphasizing textural features over color variations, which allows the model to focus on morphological characteristics essential for lesion classification (Figure 2).
Description
  • Visual workflow of image preprocessing: The figure illustrates a sequence of image processing steps applied to an 'Original Image' of a skin lesion. This workflow transforms the image to standardize it for analysis by a machine learning model.
  • Specific transformation steps: The process includes several key transformations: 1) 'Grayscale Image' conversion, which removes color information, leaving only shades of gray. 2) 'Adaptive Thresholding', a technique that converts the image to black and white to highlight the lesion's boundaries by adapting to local lighting variations. 3) 'Sharpened Image', which enhances the edges and fine details. 4) 'Resized Image' and 'Padded Image to Target Size', where the image is scaled down and then placed on a black background to ensure all images have a uniform size without distorting the lesion's original shape.
  • Stated purpose of preprocessing: According to the caption, the goal of these steps is to improve the consistency ('uniformity') and quality of the images. This is done while preserving the original shape ('morphology') of the lesion, which is crucial for the subsequent step of 'feature extraction,' where the computer identifies and measures key characteristics of the lesion.
Scientific Validity
  • ✅ Use of standard preprocessing techniques: The methods shown, such as grayscale conversion, thresholding, and standardized resizing with padding, are standard and appropriate initial steps in many computer vision pipelines for medical imaging. They effectively reduce computational load and create uniform inputs for the model.
  • 💡 Questionable order of operations: The workflow shows sharpening being applied after adaptive thresholding. Adaptive thresholding produces a binary (black and white) image, so applying a sharpening filter to it has a negligible effect: there are no intermediate gray levels to enhance. Sharpening is typically applied to the grayscale image before thresholding to improve edge detection. This depicted order suggests a potential methodological flaw; a corrected ordering is sketched after this list.
  • 💡 Contradiction with feature extraction methods: This figure shows a pipeline that results in a grayscale image, explicitly discarding color. However, the 'Feature extraction' section states that LAB color histograms are computed. This creates a significant contradiction. If color features are used, the model cannot be exclusively trained on the output of this grayscale pipeline. The manuscript must clarify if there are parallel preprocessing pipelines (one for texture, one for color) or correct the description.
  • 💡 Lack of methodological detail: The figure is purely illustrative and omits critical parameters required for reproducibility. For instance, the block size and constant for adaptive thresholding, the kernel details for the sharpening filter, and the final target dimensions (e.g., 128x128 pixels) are not specified. These details are essential for validating and replicating the study.
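To make the ordering concern concrete, the sketch below reproduces the described pipeline in OpenCV with sharpening moved before thresholding. The block size, sharpening kernel, and target size are placeholder values, since the paper reports none of these parameters.

```python
import cv2
import numpy as np

def preprocess(image_bgr, target=128):
    """Grayscale, sharpen, threshold, then resize and pad to a square."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Sharpen BEFORE thresholding, while intermediate gray levels still exist.
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    sharp = cv2.filter2D(gray, -1, kernel)

    # Adaptive thresholding; blockSize=11 and C=2 are placeholder values.
    binary = cv2.adaptiveThreshold(sharp, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)

    # Aspect ratio-preserving resize, then pad to the target size.
    h, w = binary.shape
    scale = target / max(h, w)
    resized = cv2.resize(binary, (int(w * scale), int(h * scale)))
    pad_h, pad_w = target - resized.shape[0], target - resized.shape[1]
    return cv2.copyMakeBorder(resized, 0, pad_h, 0, pad_w,
                              cv2.BORDER_CONSTANT, value=0)
```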
Communication
  • ✅ Effective visualization of the transformation process: The diagram clearly illustrates the visual changes that the image undergoes at each step of the preprocessing pipeline. The use of arrows creates a logical and easy-to-follow workflow.
  • 💡 Poor image resolution and clutter: The resolution of the figure is low, rendering the small text annotations (e.g., pixel values) on the individual panels completely illegible. This text adds visual clutter without providing any useful information and should be removed for a cleaner presentation.
  • 💡 Redundant title within the figure: The title 'PREPROCESSING RESULTS' is embedded within the figure itself. This is redundant, as the figure number and caption already provide this context. Best practice is to keep all descriptive text in the caption.
  • 💡 Ambiguous overall pipeline representation: The figure presents this linear sequence as the definitive preprocessing pipeline. Given the later mention of color-based feature extraction, this representation is misleading. A more comprehensive diagram should be used to show how both textural (from grayscale) and color features are derived, perhaps by illustrating parallel processing paths.
FIGURE 3: After preprocessing and augmentation, feature extraction methods,...
Full Caption

FIGURE 3: After preprocessing and augmentation, feature extraction methods, including LBP, LAB color histograms, GLCM, and HOG, are applied to convert images into numerical representations that capture texture, color, and structural characteristics essential for skin lesion classification.

First Reference in Text
Local binary patterns (LBP) are employed to extract textural information by encoding local patterns of pixel intensity variations within the image, proving particularly effective in identifying subtle textural differences that are critical for accurate dermatological analysis and lesion characterization (Figure 3).
Description
  • Illustration of the feature extraction process: The figure provides a visual overview of 'feature extraction,' a process where an image is converted into a set of meaningful numbers (features) that a machine learning model can understand. It shows that multiple different techniques are used to capture various aspects of the skin lesion image.
  • Extraction of multiple feature types: The diagram illustrates the generation of four distinct sets of features: 1) 'LBP Histogram', which summarizes image texture using Local Binary Patterns. 2) 'LAB Color Histogram', which quantifies the color distribution in the perceptually uniform LAB color space. 3) 'GLCM Features', which are statistical measures of texture from a Gray-Level Co-occurrence Matrix, including contrast and homogeneity (a typical computation is sketched after this list). 4) 'HOG Features', representing shape information through Histograms of Oriented Gradients, though this is only mentioned textually.
  • Combination of features: The final step depicted is 'Combine', indicating that the numerical data derived from all four methods (LBP, LAB color, GLCM, and HOG) are aggregated into a single, comprehensive feature vector. This combined vector serves as the final numerical representation of the image for the classification model.
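For reference, GLCM descriptors such as contrast and homogeneity are commonly computed as follows with scikit-image. This is an assumed implementation; the paper does not name its tooling, distances, or angles.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_uint8):
    """Contrast, homogeneity, energy, and correlation from a GLCM.

    Expects an 8-bit grayscale image (values 0-255).
    """
    glcm = graycomatrix(gray_uint8, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    return np.concatenate([
        graycoprops(glcm, prop).ravel()
        for prop in ("contrast", "homogeneity", "energy", "correlation")
    ])
```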
Scientific Validity
  • ✅ Use of complementary feature sets: The selection of feature extraction methods is methodologically sound. Combining texture descriptors (LBP, GLCM), color information (LAB histograms), and shape features (HOG) provides a multi-faceted and robust representation of the skin lesion, which is a well-established approach in classical machine learning for image classification.
  • 💡 Lacks quantitative information: The figure is purely illustrative and lacks any quantitative detail. The histograms and bar charts are presented without scales, axis labels, or numerical values, making it impossible to interpret the actual feature distributions or values for the example image. This limits the figure's utility to that of a conceptual diagram.
  • 💡 Inconsistent with preprocessing workflow: The figure shows the extraction of LAB color features, which requires a color image. This contradicts the workflow in Figure 2, which depicted a preprocessing pipeline that converts images to grayscale, thereby discarding color information. The manuscript needs to clarify how these two seemingly incompatible processes are reconciled.
  • 💡 Incomplete visualization: While LBP, LAB color, and GLCM are visualized with corresponding plots, the HOG (Histogram of Oriented Gradients) feature is only mentioned in the 'Combine' step text. For completeness and consistency, a visual representation of the HOG feature extraction or the resulting feature vector should have been included.
Communication
  • ✅ Conceptually clear workflow: The figure effectively communicates the high-level concept of parallel feature extraction and subsequent combination. The layout logically flows from input images to their numerical representations, making the overall strategy easy to understand.
  • 💡 Poorly labeled and low-resolution plots: The plots are of low quality and lack essential components such as titles and labeled axes (e.g., what do the x and y axes on the histograms represent?). This makes them uninformative beyond their basic shape. The resolution is also too low to discern any detail.
  • 💡 Incorrect labels on subplots: The plots for the 'LAB Color Histogram' and 'GLCM Features' are incorrectly labeled as 'Figure 1'. This is a significant error that can cause confusion, and it should be corrected to ensure clarity and accuracy.
  • 💡 Redundant in-figure title: The title 'FEATURE EXTRACTION' is embedded within the figure. This is redundant, as the figure caption already provides this information. Removing the in-figure title would result in a cleaner and more professional presentation.
FIGURE 4: Prediction is provided with the percentage of the image classified as...
Full Caption

FIGURE 4: Prediction is provided with the percentage of the image classified as basal cell carcinoma, benign keratosis-like lesions, melanocytic nevi, and melanoma.

First Reference in Text
During testing, the GUI demonstrated rapid response times and consistent prediction accuracy, mirroring the results observed in model evaluation (Figure 4).
Description
  • Demonstration of a Graphical User Interface (GUI): The figure displays a screenshot of a software application's window, known as a Graphical User Interface (GUI). This interface is designed for a user to interact with the machine learning model. It shows an image of a skin lesion that has been analyzed by the system.
  • Model's prediction and probability scores: For the displayed lesion, the model's final prediction is 'Melanocytic Nevi (NV)'. The interface also provides 'Prediction Probabilities,' which are the model's confidence scores for each of the four possible diagnoses. The scores are: Melanocytic Nevi (63.05%), Melanoma (16.97%), Basal Cell Carcinoma (11.95%), and Benign Keratosis-like Lesions (8.02%). The highest probability determines the final predicted disease.
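The probability breakdown shown in the GUI corresponds to a classifier's predicted class probabilities. A minimal sketch of how such output might be produced follows; the class-index-to-name mapping and the scikit-learn-style model interface are assumptions.

```python
CLASS_NAMES = ["Basal Cell Carcinoma", "Benign Keratosis-like Lesions",
               "Melanocytic Nevi", "Melanoma"]  # assumed order for labels 0-3

def format_prediction(model, feature_vector):
    """Return the top prediction and per-class percentages, highest first."""
    probs = model.predict_proba([feature_vector])[0]
    ranked = sorted(zip(CLASS_NAMES, probs), key=lambda p: p[1], reverse=True)
    lines = [f"Predicted Disease: {ranked[0][0]}"]
    lines += [f"  {name}: {p:.2%}" for name, p in ranked]
    return "\n".join(lines)
```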
Scientific Validity
  • ✅ Display of probabilistic output: Presenting the full probability distribution for all classes is a significant strength. It provides more nuanced information than a single, absolute prediction. This allows a clinical user to see the model's uncertainty, for example, noting the non-trivial probability (16.97%) assigned to melanoma, which could inform further diagnostic steps.
  • 💡 Use of a single, anecdotal example: The figure shows a single, likely successful, prediction. This single data point is insufficient to support the reference text's claim of 'consistent prediction accuracy'. Scientific validation requires aggregated performance metrics across a large, unseen test dataset, not a cherry-picked example. The figure demonstrates functionality, not validated performance.
  • 💡 Lack of ground truth: The figure does not state the actual, medically confirmed diagnosis (the 'ground truth') for the lesion shown. Without this information, it is impossible to determine if the model's prediction of 'Melanocytic Nevi' is correct. As presented, the figure only illustrates what the GUI's output looks like, not whether it is accurate.
  • 💡 Inappropriate section placement: This figure, which demonstrates the output of the final model, is more appropriate for the 'Results' section. The 'Materials and Methods' section should focus on describing the methodology used to build and validate the model and GUI, rather than showcasing its final output on a specific example.
Communication
  • ✅ Clear and simple user interface: The GUI layout is clean and intuitive. The predicted disease is stated prominently, and the breakdown of probabilities is clearly listed, making the model's output easy to interpret for a user.
  • 💡 Under-informative caption: The caption merely describes what is already obvious from looking at the figure (that percentages are provided). A more effective caption would specify that this is an example output from the developed GUI for a sample image, contextualizing its purpose better.
  • 💡 Basic visual design: The visual design of the GUI is very basic, characteristic of a prototype (e.g., using a default Tkinter theme). While functional, a more polished and professional design would enhance the credibility and potential user adoption of the tool in a clinical or research setting.

Results

Non-Text Elements

FIGURE 5: This study trained and evaluated an ensemble model combining Random...
Full Caption

FIGURE 5: This study trained and evaluated an ensemble model combining Random Forest and Gradient Boosting classifiers to classify four skin lesion types, using a dataset split for training, testing, and hyperparameter tuning, resulting in strong performance demonstrated by classification reports, confusion matrices, and interpretive analysis.

First Reference in Text
During the hyperparameter tuning stage, five parameter combinations were explored for each classifier using a parameter distribution approach (Figure 5).
Description
  • Model performance during hyperparameter tuning: The line graph displays the 'Accuracy Score' for three different machine learning models over five trials, labeled 'Hyperparameter Combination Iteration'. This process, called hyperparameter tuning, is like trying different recipes to find the best settings for a model before its final training. The models are 'RandomForest', 'GradientBoosting', and an 'Ensemble' model, which combines the first two.
  • Accuracy scores and model comparison: The accuracy, which measures how often the models made a correct prediction, fluctuates between 70% (0.70) and 73% (0.73). The RandomForest and GradientBoosting models perform best, with their accuracies peaking at iteration 3, reaching approximately 72.8% and 72.5% respectively. The Ensemble model consistently shows the lowest accuracy, never rising above 71%.
Scientific Validity
  • 💡 Counterintuitive ensemble performance: The figure shows the ensemble model consistently underperforming its individual components. This is highly unexpected, as the primary purpose of ensembling is to achieve performance superior to that of any single constituent model. This result suggests either a flaw in the ensembling methodology (e.g., a naive combination strategy that degrades performance) or an error in the data generation for this plot. This finding directly contradicts the common understanding of ensemble methods and the paper's ultimate conclusion that the ensemble model is superior.
  • 💡 Lack of methodological transparency: The x-axis, 'Hyperparameter Combination Iteration', is not defined. The specific hyperparameter values tested in each of the five iterations are not provided, which makes the experiment impossible to reproduce. For scientific rigor, the parameters explored (e.g., number of estimators, learning rate) and their values at each iteration should be detailed; a reproducible search setup is sketched after this list.
  • 💡 Mismatch between figure content and caption: The caption provides a high-level summary of the entire study's results, mentioning classification reports and confusion matrices. However, the figure itself only shows a narrow, preliminary step: the hyperparameter tuning phase. The results depicted do not represent the final, optimized model's performance and do not support the caption's claim of 'strong performance'. The reference text is a much more accurate description of the figure's content.
  • ✅ Visualization of the tuning process: Despite its flaws, the figure appropriately visualizes that a hyperparameter tuning process was conducted. Showing the performance variation across different parameter sets is a standard and necessary step in developing a robust machine learning model.
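A reproducible version of such a search, sampling five parameter combinations over a parameter distribution, might look like the sketch below. The parameter ranges and cross-validation settings are assumptions; only n_iter=5 is suggested by the text, and `ensemble` is the VotingClassifier sketched earlier.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    ensemble,                 # VotingClassifier with 'rf' and 'gb' estimators
    param_distributions={
        "rf__n_estimators": randint(20, 60),
        "gb__n_estimators": randint(20, 60),
    },
    n_iter=5,                 # five sampled combinations, as in Figure 5
    scoring="accuracy",
    cv=3,
    random_state=42,          # fixing the seed makes the search reproducible
)
# search.fit(X_train, y_train); print(search.best_params_)
```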
Communication
  • ✅ Clear plot elements: The graph is easy to read. The axes are clearly labeled, the legend effectively distinguishes the three data series, and the use of distinct colors and markers for each line adheres to good data visualization practices.
  • 💡 Inaccurate and overly broad caption: The caption is misleading as it describes the entire study's outcome rather than the specific content of the figure. The caption should be revised to accurately reflect that the graph shows model accuracy during the hyperparameter tuning iterations. For example: 'Model accuracy comparison across five hyperparameter tuning iterations for Random Forest, Gradient Boosting, and the ensemble classifier.'
  • 💡 Contradictory visual narrative: The figure's primary visual message is that the ensemble model is the worst-performing of the three. This contradicts the main narrative of the paper, which concludes that the ensemble approach is effective. Such a direct contradiction between a figure and the text can severely confuse the reader and undermine the study's conclusions. The authors should address this discrepancy by either correcting the figure, providing a compelling explanation for the poor ensemble performance during tuning, or removing the figure.
  • 💡 Vague title: The main title 'TRAINING MODEL' is ambiguous. The subtitle, 'Model Accuracy Over Hyperparameter Tuning Iterations', is much more descriptive and should serve as the primary title for the figure to immediately inform the reader of its specific content.
FIGURE 6: The model shows excellent precision and recall for melanocytic nevi...
Full Caption

FIGURE 6: The model shows excellent precision and recall for melanocytic nevi and melanoma, reflected in high F1-scores (0.98 and 0.92), while basal cell carcinoma and benign keratosis-like lesions have moderate performance, resulting in balanced overall macro-average and weighted F1-scores of 0.79.

First Reference in Text
The classification report summarizes model performance across all four skin lesion classes using precision, recall, and F1-score as evaluation metrics (Figure 6).
Description
  • Model performance metrics: The figure is a screenshot of a 'classification report,' which details the performance of the machine learning model. It reports three key metrics for four different classes of skin lesions (labeled 0, 1, 2, and 3). 'Precision' indicates the accuracy of positive predictions (low false positives). 'Recall' measures the model's ability to find all relevant cases (low false negatives). The 'F1-score' is a combined measure that balances precision and recall.
  • Performance variation across classes: The model's performance varies significantly by class. Class 2 shows near-perfect performance with a precision of 1.00 and an F1-score of 0.98. Class 3 also performs well, with an F1-score of 0.92. In contrast, Classes 0 and 1 show more moderate performance, with F1-scores of 0.65 and 0.62, respectively.
  • Overall model performance: The report provides summary statistics for the entire model across all 16,000 test samples. The overall accuracy (the proportion of all correct predictions) is 0.79, or 79%. The 'macro avg' and 'weighted avg' F1-scores are also both 0.79, indicating a balanced, albeit imperfect, performance across the different classes when averaged.
  • Hyperparameter tuning results: The screenshot also includes information from the model tuning phase, showing the best parameters found ('gb_n_estimators': 48, 'rf_n_estimators': 38) and the corresponding test score achieved during that sampling process (0.7818).
Scientific Validity
  • ✅ Use of comprehensive evaluation metrics: Presenting a full classification report with precision, recall, and F1-score is a robust and appropriate method for evaluating a multi-class classifier, especially in a medical context. It provides a much more detailed and insightful view of performance than a single accuracy score would.
  • ✅ Data supports claims in caption: The numerical data in the report directly supports the summary provided in the caption. The F1-scores of 0.98 and 0.92 for classes 2 and 3, respectively, and the overall weighted F1-score of 0.79, are all accurately reported and interpreted.
  • 💡 Undefined class labels: A significant limitation is that the classes are only identified by numbers (0, 1, 2, 3). The figure itself does not map these numbers to the actual skin lesion types. The reader must rely entirely on the caption to understand which class corresponds to which disease, hindering the figure's ability to stand alone; a report with named classes is sketched after this list.
  • 💡 Inclusion of irrelevant information: The screenshot includes output from the hyperparameter search process (e.g., 'Best parameters', 'Best test score (sampled)'). This information pertains to the model development phase, not the final evaluation on the test set. Its inclusion is distracting and irrelevant to the main point of the figure, which is the final classification report.
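The undefined-label issue can be avoided at generation time, since scikit-learn's report accepts human-readable class names. A minimal sketch follows, with the label-to-disease mapping assumed from the caption.

```python
from sklearn.metrics import classification_report

TARGET_NAMES = ["Basal Cell Carcinoma", "Benign Keratosis-like Lesions",
                "Melanocytic Nevi", "Melanoma"]  # assumed order for labels 0-3

def print_labeled_report(y_true, y_pred):
    """Classification report with disease names instead of numeric labels."""
    print(classification_report(y_true, y_pred, target_names=TARGET_NAMES))
```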
Communication
  • 💡 Unprofessional presentation format: Using a raw screenshot of a command-line terminal is not a professional or effective way to present data in a scientific publication. The information should be transcribed into a properly formatted and clearly labeled table for improved readability and aesthetics.
  • 💡 Poor legibility and visual clarity: The text in the screenshot is low-resolution and slightly blurry. The monospaced font and plain text format lack visual hierarchy, making it harder to quickly identify key values. A formatted table would allow for better font choice, alignment, and the use of bolding to highlight important results.
  • ✅ Effective and accurate caption: The caption is well-written and serves as an excellent summary of the data. It successfully guides the reader to the most important findings in the report, highlighting the performance differences between classes and stating the key overall metrics, which greatly aids interpretation.
  • 💡 Figure is not self-contained: Due to the undefined numerical class labels, the figure cannot be fully understood without reading the caption. Best practice dictates that figures should be as self-contained as possible. A formatted table could easily solve this by including a column for the disease name corresponding to each class number.
FIGURE 7: The confusion matrix reveals strong diagonal dominance for...
Full Caption

FIGURE 7: The confusion matrix reveals strong diagonal dominance for melanocytic nevi and melanoma, indicating accurate classification, while notable misclassifications occur between basal cell carcinoma and benign keratosis-like lesions, highlighting feature overlap between these classes.

First Reference in Text
To further analyze classification accuracy, a confusion matrix was generated to visualize the comparison between predictions and actual class labels (Figure 7).
Description
  • Visualization of model predictions vs. actual labels: This figure displays a 'confusion matrix,' which is a grid that acts as a scorecard for the classification model. The vertical axis represents the true, correct category ('True label') for a skin lesion, while the horizontal axis shows what the model predicted ('Predicted label'). Each cell in the grid shows the number of images that fall into that combination of true vs. predicted category.
  • High accuracy for specific classes: The numbers along the main diagonal (from top-left to bottom-right) represent correct classifications. The model performed very well for class 2 and class 3, correctly identifying 3,869 and 3,851 images, respectively. These large numbers on the diagonal indicate high accuracy for these two categories.
  • Specific misclassification patterns: The cells off the main diagonal show the model's errors or 'confusion'. The most significant errors occurred between class 0 and class 1. Specifically, 1,520 images that were actually class 1 were incorrectly predicted as class 0, and 957 images that were truly class 0 were misclassified as class 1. This suggests the model found it difficult to distinguish between these two types of lesions.
Scientific Validity
  • ✅ Appropriate visualization for classification performance: A confusion matrix is the standard and most appropriate tool for visualizing the performance of a multi-class classification model. It provides a granular view of per-class accuracy and inter-class confusion patterns, offering much more insight than a single accuracy metric.
  • ✅ Data is consistent with other reported metrics: The raw counts in the matrix are mathematically consistent with the precision and recall values presented in the classification report (Figure 6). For example, the recall for class 2 (3869 / (28+99+3869+0)) is 0.97, which matches the report. This consistency across figures strengthens the validity of the reported results.
  • 💡 Undefined class labels limit interpretability: The primary scientific limitation is the use of numerical indices (0, 1, 2, 3) for the class labels on the axes. Without referring to the caption or other text, it is impossible to know which disease each number represents. The figure should be self-contained by explicitly labeling the axes with the names of the skin lesion types.
  • 💡 Lack of normalization: The matrix presents absolute counts. While informative, a normalized version (e.g., showing row-wise percentages) would also be beneficial. Normalization would make it easier to interpret the proportion of misclassifications for each class, independent of the total number of samples in that class, which is particularly useful for identifying biases in imbalanced datasets.
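Both suggestions, named axis labels and row-wise normalization, can be addressed in a single scikit-learn call. A minimal sketch follows, with the class ordering assumed from the caption.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

LESION_NAMES = ["BCC", "BKL", "Melanocytic Nevi", "Melanoma"]  # assumed order

def plot_normalized_cm(y_true, y_pred):
    """Row-normalized confusion matrix with named class labels."""
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        display_labels=LESION_NAMES,
        normalize="true",   # rows sum to 1: per-class recall on the diagonal
        cmap="Blues")
    plt.show()
```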
Communication
  • ✅ Effective use of a heatmap: The use of color intensity (a heatmap) is effective, as the dark blue on the diagonal immediately draws the reader's attention to the high number of correct classifications. This visual cue makes the 'strong diagonal dominance' mentioned in the caption instantly apparent.
  • ✅ Excellent and informative caption: The caption is a model of clarity. It not only describes the figure but also accurately interprets its key findings—the high performance for two classes and the specific confusion between the other two—providing valuable context and guiding the reader's analysis.
  • 💡 Axis labels should be descriptive, not numerical: The most critical communication issue is the use of '0, 1, 2, 3' for axis labels. To make the figure self-contained and immediately understandable, these numerical indices should be replaced with the actual names of the skin lesion classes (e.g., 'Basal cell carcinoma', 'Melanoma').
  • 💡 Minor formatting errors on Y-axis labels: The labels on the Y-axis have a trailing hyphen (e.g., '1-', '2-'). This is a minor typographical error that should be corrected for a more professional and clean presentation.
