Decision analysis framework for predicting no-shows to appointments using machine learning algorithms

Carolina Deina, Flavio S. Fogliatto, Giovani J. C. da Silveira, Michel J. Anzanello
BMC Health Services Research
Department of Industrial Engineering, Federal University of Rio Grande do Sul


Overall Summary

Study Background and Main Findings

This study addresses the significant problem of patient no-shows in healthcare by proposing and evaluating a novel decision analysis framework using machine learning. The primary objective is to accurately predict no-shows, particularly in the context of imbalanced datasets where actual no-shows are relatively rare. The research introduces Symbolic Regression (SR), an algorithm that discovers mathematical formulas from data, as a classification method, and Instance Hardness Threshold (IHT), a technique that balances datasets by removing hard-to-classify majority instances, as a resampling method. These are benchmarked against established algorithms like K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), and against other resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique), Random Under-Sampling (RUS), and NearMiss (NM).
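To make these method combinations concrete, the sketch below pairs the resampling techniques with the classification algorithms using scikit-learn and imbalanced-learn. It is illustrative only, not the authors' code; gplearn's SymbolicClassifier is assumed here as a stand-in for Symbolic Regression, since the paper does not name its implementation.

```python
# Illustrative sketch (not the authors' code): pairing the resampling techniques and
# classification algorithms discussed above. gplearn's SymbolicClassifier is assumed
# as a stand-in for Symbolic Regression.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss, InstanceHardnessThreshold
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from gplearn.genetic import SymbolicClassifier  # assumption: gplearn is installed

resamplers = {
    "SMOTE": SMOTE(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "NM": NearMiss(version=1),
    "IHT": InstanceHardnessThreshold(random_state=0),
}
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True, random_state=0),
    "SR": SymbolicClassifier(random_state=0),
}

def fit_combination(X_train, y_train, resampler_name, classifier_name):
    """Balance the training data with the chosen resampler, then fit a fresh classifier."""
    X_bal, y_bal = resamplers[resampler_name].fit_resample(X_train, y_train)
    model = clone(classifiers[classifier_name])  # clone so repeated runs stay independent
    return model.fit(X_bal, y_bal)
```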

A key methodological innovation is a comprehensive six-step analytical framework that includes a rigorous dual z-fold cross-validation process. This involves splitting the data for model training and testing in two nested stages, resulting in 100 independent simulation runs for each model configuration. This approach is designed to ensure robust assessment of model generalization (ability to perform on new data) and stability, which is often a challenge with imbalanced data. The framework was validated using two distinct datasets from Brazilian hospitals with no-show rates of 6.65% and 19.03%.
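The splitting structure of that dual cross-validation can be sketched as two nested stratified k-fold loops; with Z1 = Z2 = 10 this yields 100 evaluations. The sketch below shows only the splitting logic under that assumption. In the paper, the calibration portion is additionally resampled and submitted to feature selection before the inner split is used; the placeholder `evaluate` stands in for those per-run steps.

```python
# Minimal sketch of the dual z-fold splitting structure (Z1 = Z2 = 10 gives 100 runs).
# `evaluate` is a placeholder for the per-run resampling, feature selection, fitting,
# and scoring described in the paper; this is not the authors' implementation.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def dual_cross_validation(X, y, evaluate, z1=10, z2=10, seed=0):
    outer = StratifiedKFold(n_splits=z1, shuffle=True, random_state=seed)
    scores = []
    for cal_idx, val_idx in outer.split(X, y):            # calibration / validation split
        X_cal, y_cal = X[cal_idx], y[cal_idx]
        inner = StratifiedKFold(n_splits=z2, shuffle=True, random_state=seed)
        for tr_idx, te_idx in inner.split(X_cal, y_cal):  # train / test split within calibration
            scores.append(evaluate(X_cal[tr_idx], y_cal[tr_idx],   # train portion
                                   X_cal[te_idx], y_cal[te_idx],   # test portion
                                   X[val_idx], y[val_idx]))        # held-out validation portion
    return np.asarray(scores)  # 100 per-run scores when z1 = z2 = 10
```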

The findings indicate that the novel techniques, particularly IHT, demonstrated superior performance. When combined with various classification algorithms (including SR, KNN, and SVM), IHT consistently led to high sensitivity (the proportion of actual no-shows correctly identified) and Area Under the ROC Curve (AUC, a measure of how well the model distinguishes no-shows from shows), with sensitivity values exceeding 0.94 for SR/IHT combinations on the validation portions of both datasets. This high sensitivity is crucial for healthcare applications, as it helps minimize false negatives (failing to predict a no-show), thereby enabling more effective targeted interventions like patient reminders or optimized overbooking strategies.
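As a small illustration of the two metrics emphasized here, the snippet below computes sensitivity and AUC with scikit-learn on toy predictions, treating "no-show" as the positive class; the numbers are placeholders, not results from the paper.

```python
# Toy illustration of the two headline metrics, with 1 = no-show as the positive class.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true  = np.array([1, 0, 0, 1, 0, 1, 0, 0])                   # actual outcomes
y_pred  = np.array([1, 0, 1, 1, 0, 0, 0, 0])                   # hard class predictions
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.1, 0.4, 0.3, 0.2])   # predicted no-show probabilities

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # share of actual no-shows caught (2/3 here)
auc = roc_auc_score(y_true, y_score)                     # threshold-free ranking quality
print(f"sensitivity = {sensitivity:.2f}, AUC = {auc:.2f}")
```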

The study concludes that SR and IHT are promising methods for no-show prediction and emphasizes the critical importance of robust validation strategies, like the proposed dual cross-validation, when dealing with imbalanced datasets. Relying on few validation runs can lead to biased results and an inadequate understanding of model reliability. The research provides both theoretical contributions through its novel framework and techniques, and practical implications for improving healthcare resource management and patient care.

Research Impact and Future Directions

This research makes a significant contribution by proposing a methodologically robust framework for predicting patient no-shows, a persistent challenge in healthcare. The introduction of Symbolic Regression (SR) as a classification algorithm and Instance Hardness Threshold (IHT) as a resampling technique, both novel in this context, demonstrates considerable promise. The study's emphasis on a dual z-fold cross-validation process, yielding 100 simulation runs, sets a high standard for evaluating model performance and stability, particularly crucial when dealing with imbalanced datasets where no-shows are the minority.

The findings indicate that the IHT technique, which selectively removes hard-to-classify majority instances to improve class separation, consistently enhanced model performance across different algorithms, notably achieving high sensitivity (the ability to correctly identify patients who will not show up, e.g., >0.94 for SR/IHT on validation sets). This is practically vital as minimizing missed no-shows allows for more effective resource allocation and targeted interventions. The SR algorithm, which evolves mathematical models from data, also performed well, suggesting its utility for uncovering complex patterns in patient behavior.

However, while the paper demonstrates superior observed performance for SR and IHT combinations through extensive simulations, it's important to note that these claims of superiority are based on descriptive outcomes from the 100 replicates rather than formal statistical significance testing between methods. This is a limitation in the current analysis that tempers the certainty of comparative conclusions. The study design is a rigorous comparative evaluation of algorithmic performance within a novel framework using historical data, not an experimental trial assessing real-world impact. Therefore, its primary strength lies in advancing methodological best practices for no-show prediction and highlighting the potential of SR and IHT. Future research should incorporate formal statistical comparisons and prospective validation in clinical settings to confirm the practical efficacy and cost-effectiveness of these models.

Ultimately, the paper successfully underscores the pitfalls of relying on limited validation runs for imbalanced datasets and provides a valuable, adaptable framework for developing more reliable predictive models. This can empower healthcare managers to make more informed decisions, potentially leading to improved operational efficiency, reduced costs, and better patient access to care. The key takeaway is the critical need for methodological rigor in this domain and the promising avenues opened by the novel techniques explored.

Critical Analysis and Recommendations

Clear Problem Definition and Relevance (written-content)
The abstract clearly defines the no-show problem and the practical relevance of using machine learning for prediction. This is achieved by immediately stating the adverse effects of no-shows and how ML can enable strategies like overbooking and targeted reminders. This effectively establishes the research's importance and its direct application to improving healthcare service delivery. Healthcare managers can readily understand the value proposition of such predictive tools.
Section: Abstract
Explicit Statement of Novel Contributions (written-content)
The abstract explicitly states the study's novel academic contributions, identifying it as the first to propose Symbolic Regression (SR) and Instance Hardness Threshold (IHT) for predicting patient no-shows. This directness in highlighting the paper's unique position within the literature helps establish its specific advancements in the field, and researchers can quickly grasp the innovative aspects being introduced.
Section: Abstract
Enhance Link Between Framework Design and Generalizability in Abstract Conclusion (written-content)
The abstract's conclusion could more directly attribute the framework's specific design feature (dual z-fold cross-validation) to achieving better generalization and stability, which are mentioned as benefits in the Methods section. The missing explicit connection reduces the clarity of this methodological benefit. Explicitly linking this robust validation strategy to its intended benefits in the abstract would reinforce it as a key strength and practical outcome, enhancing the takeaway message about the framework's utility for producing reliable models, especially for imbalanced datasets. A minor textual modification could achieve this.
Section: Abstract
Comprehensive Problem Articulation and Impact (written-content)
The introduction thoroughly defines the no-show problem, detailing its multifaceted negative impacts on healthcare systems (resource inefficiency, increased costs) and patients (discontinuity of care, worsened outcomes). This comprehensive articulation is achieved by citing consequences such as each no-show depriving two patients of care (the absent patient and another who could have used the slot). This effectively establishes the research's significance and the urgent need for solutions, providing a strong foundation for the study. This helps readers understand the scale and importance of the problem being addressed.
Section: Introduction
Clear Identification of Research Gap and Novelty (written-content)
The paper explicitly identifies Symbolic Regression (SR) and Instance Hardness Threshold (IHT) as unexplored methods in no-show prediction, clearly stating a gap in current research. This is done by stating, 'To the best of our knowledge, no studies used...' these techniques. This clear identification of novelty effectively positions the paper's unique contributions to the field. This allows the research community to understand the specific advancements being proposed.
Section: Introduction
Enhance Rationale for SR and IHT Suitability in No-Show Context (written-content)
The introduction describes SR and IHT and their novelty but could more explicitly connect their specific advantages (e.g., SR's flexibility for non-linear relationships, IHT's handling of imbalance and noise) to the inherent challenges of no-show data. As written, the justification leaves the rationale for the method choice underdeveloped. Enhancing this rationale would strengthen the motivation for choosing these novel methods beyond their unexplored status and better prepare the reader for why they might outperform others. This could be achieved by adding sentences detailing how SR's structure-agnostic approach or IHT's targeted filtering mechanism is particularly suited to the complexities of no-show datasets.
Section: Introduction
Comprehensive and Structured Predictive Framework (written-content)
The paper outlines a clear, comprehensive, and logically sequenced six-step predictive framework, visualized in Figure 1. This structured approach details steps from data gathering to performance evaluation, including novel applications of cross-validation and resampling. This enhances the reproducibility and understanding of the complex analytical process, which is critical for scientific validation and adoption by other researchers. Practitioners can also better understand the workflow for implementing such a system.
Section: Method
Rigorous Validation Approach with Dual Cross-Validation (written-content)
The study employs a dual z-fold cross-validation approach (Z1=10 for calibration/validation split, Z2=10 for train/test split within calibration), leading to 100 evaluation runs. This rigorous method provides a robust assessment of model performance and stability, especially important for imbalanced datasets. This methodological strength ensures more reliable and generalizable findings compared to studies with fewer validation runs. This gives higher confidence in the reported performance metrics.
Section: Method
Novel Methodological Applications (IHT and SR) (written-content)
The study is the first to apply Instance Hardness Threshold (IHT) for resampling and Symbolic Regression (SR) for classification in no-show prediction. This is explicitly stated and addresses a gap in existing literature. This novelty is a significant contribution, potentially opening new avenues for improving predictive accuracy in this domain. Researchers are provided with new tools and approaches to tackle a persistent problem.
Section: Method
Figure 1: Robust and Comprehensive Framework Design (graphical-figure)
Figure 1 outlines a comprehensive six-step predictive framework, including dual cross-validation loops and explicit steps for data resampling and feature selection. This visual representation details a robust process from data input to performance evaluation. The design addresses key challenges like class imbalance and aims for model generalizability, which is crucial for developing reliable predictive tools in healthcare. This allows for a clear understanding of the entire analytical pipeline.
Section: Method
Discuss Computational Cost of the Extensive Framework (written-content)
Given the extensive six-step framework, particularly the 10x10 cross-validation loops combined with wrapper feature selection, the paper could benefit from a brief discussion of the overall computational resources or time involved. This practicality concern affects reproducibility. Providing this context would help researchers aiming to replicate or adapt this thorough methodology, offering insight into the feasibility and resource demands of such an approach. Adding a short paragraph on computational considerations would address this; a rough back-of-envelope count of the model fits involved is sketched below.
Section: Method
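As a rough illustration of the scale involved, the arithmetic below counts the model fits implied by the framework under stated assumptions; the per-fold number of feature subsets is hypothetical, since the paper does not report it.

```python
# Back-of-envelope count of model fits implied by the framework (illustrative assumptions only).
z1, z2 = 10, 10            # outer and inner fold counts reported in the paper
combinations = 3 * 4       # 3 classification algorithms x 4 resampling techniques
subsets_per_fold = 50      # hypothetical number of feature subsets tried by the wrapper step

fits = z1 * z2 * combinations * subsets_per_fold
print(f"approximate model fits per dataset: {fits:,}")  # 60,000 under these assumptions
```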
Superior and Stable Performance with IHT Resampling (written-content)
The results, presented in Tables 1-4, show that combinations involving the Instance Hardness Threshold (IHT) resampling technique consistently yielded high sensitivity (e.g., SR/IHT >0.94 on validation for both datasets) and AUC values. This was observed across two distinct datasets with different no-show rates, using a rigorous framework with 100 replicates. Achieving high sensitivity is paramount in healthcare to minimize false negatives (failing to identify a no-show), thus enabling more effective targeted interventions and resource optimization. This demonstrates the practical value of IHT in improving the identification of at-risk patients.
Section: Results
Strong Justification for Metric Focus (Sensitivity and AUC) (written-content)
The paper effectively justifies prioritizing sensitivity and AUC as key performance metrics by linking them to the higher cost of false negatives in healthcare. This rationale is clearly stated in the Results section. This contextual justification is important because it aligns the model evaluation with the practical needs of healthcare managers, where correctly identifying potential no-shows is often more critical than other metrics. This ensures that the model's success is measured by its ability to address the most costly errors.
Section: Results
Figures 2 & 4: Effective Visualization of Model Sensitivity and Stability (graphical-figure)
Figures 2 and 4 (boxplots of sensitivity) effectively visualize the distribution of sensitivity results across 100 replicates for all model combinations on the validation sets. These figures clearly show the median performance, variability (spread), and outliers for each model. This visual representation strongly supports claims about model stability and allows for easy comparison, highlighting that IHT-based models generally achieve higher and more stable sensitivity. This provides a transparent and robust way to assess and compare model reliability.
Section: Results
Explicitly Summarize IHT's Consistent Benefit Across Algorithms in Text (written-content)
While Tables 1-4 show IHT's strong performance, the main text could more explicitly synthesize the observation that IHT consistently elevates performance (especially sensitivity and AUC) across all three classification algorithms (KNN, SVM, SR) for both datasets. This is a clarity issue: the synthesis is currently left to the reader. Explicitly reinforcing the robustness of IHT as a resampling technique within the narrative summary would strengthen the message. Adding a summary sentence after presenting IHT's success for each dataset would achieve this.
Section: Results
Strong Rationale for Extensive Cross-Validation (written-content)
The paper provides a strong rationale for its extensive cross-validation strategy (100 simulations across two stages), contrasting it with approaches using fewer validation runs. This is detailed by explaining how 100 simulations reduce random variation across repetitions and strengthen robustness. This highlights the methodological rigor of the study in assessing model generalization and stability, which is particularly important for imbalanced datasets. This robust evaluation increases confidence in the study's findings and promotes better practices in the field.
Section: Discussion
Insightful Explanation for IHT's Effectiveness (written-content)
The discussion offers an insightful explanation for the superior performance of the Instance Hardness Threshold (IHT) technique, attributing its success to its ability to identify and remove challenging majority class instances. This is explained by stating IHT's removal of these instances improves class separation. This plausible mechanism helps readers understand why IHT is effective, particularly in imbalanced scenarios common in no-show prediction. This provides a theoretical underpinning for the empirical results.
Section: Discussion
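For readers who want to see this mechanism in practice, the sketch below applies imbalanced-learn's InstanceHardnessThreshold to synthetic data with a minority rate similar to dataset 1; the data and settings are illustrative, not the paper's.

```python
# Illustrative use of Instance Hardness Threshold undersampling on synthetic data.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import InstanceHardnessThreshold

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07],  # roughly 7% minority class
                           n_informative=5, random_state=0)

iht = InstanceHardnessThreshold(random_state=0)  # an internal classifier scores instance hardness
X_res, y_res = iht.fit_resample(X, y)            # hard-to-classify majority instances are removed

print("before:", Counter(y))      # imbalanced class counts
print("after: ", Counter(y_res))  # majority class reduced toward balance
```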
Recommend Statistical Testing for IHT/SR Superiority (written-content)
The study reports superior mean performance for SR/IHT combinations based on 100 replicates but does not include formal statistical tests (e.g., Friedman test, Wilcoxon signed-rank) to confirm whether these differences are statistically significant compared to other methods. The absence of formal statistical comparison weakens the superiority claims. While the observed advantage across 100 runs is compelling, statistical testing would move claims of superiority from observational to statistically validated evidence, significantly enhancing the rigor and impact of the conclusions. Future work should incorporate such tests to definitively establish the relative performance of the novel techniques; a sketch of how such tests could be applied to the replicate results is given below.
Section: Discussion
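One way to act on this recommendation, assuming the per-replicate metric values are retained, is sketched below with SciPy's Friedman and Wilcoxon signed-rank tests; the sensitivity arrays are simulated placeholders, not the paper's results.

```python
# Illustrative sketch of the recommended tests on per-replicate sensitivity values
# (simulated placeholders, not the paper's results).
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
sr_iht  = rng.normal(0.95, 0.02, 100).clip(0, 1)   # 100 replicates per method
knn_iht = rng.normal(0.91, 0.03, 100).clip(0, 1)
svm_iht = rng.normal(0.94, 0.02, 100).clip(0, 1)

# Friedman test: do the paired methods differ overall across replicates?
stat, p = friedmanchisquare(sr_iht, knn_iht, svm_iht)
print(f"Friedman: statistic={stat:.2f}, p={p:.4f}")

# Wilcoxon signed-rank test: pairwise, paired comparison between two methods
stat, p = wilcoxon(sr_iht, knn_iht)
print(f"Wilcoxon SR/IHT vs KNN/IHT: statistic={stat:.2f}, p={p:.4f}")
```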

Section Analysis

Method

Non-Text Elements

Fig. 1 Outline of the proposed framework
First Reference in Text
To provide information to help clinics adopt strategies to minimize problems related with no-shows, we propose a six-step predictive framework (Fig. 1).
Description
  • Six-step predictive framework: The figure depicts a six-step analytical framework designed for predicting no-shows to appointments.
  • Step 1, Data gathering and pre-processing: Raw data are collected and prepared (e.g., handling missing values), followed by data scaling, which standardizes the range of continuous initial data (e.g., patient age, waiting time), typically to a 0-1 range, so that no single feature dominates due to its scale.
  • Step 2, Complete sample stratification and z-fold cross-validation: The entire dataset is divided into Z1 folds (subsets). Stratification ensures that each fold maintains the original proportion of outcomes (e.g., no-shows vs. shows). Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample; here, for each of the Z1 iterations, one fold is set aside for validation and the rest for calibration.
  • Step 3, Data resampling: Applied to the calibration data to address imbalances, such as when no-show instances are much rarer than show instances, using techniques like undersampling (reducing the majority class) or oversampling (increasing the minority class).
  • Step 4, Balanced sample stratification and z-fold cross-validation: The resampled data from Step 3 are divided again, this time into Z2 folds, for train and test purposes within the model-building phase.
  • Step 5, Feature selection: Identifies the most relevant input variables (features) from the set of all features. The step iteratively generates feature subsets, applies a learning algorithm (a specific machine learning model such as K-Nearest Neighbors or Support Vector Machine), and evaluates them using the F1-score (a metric balancing precision and recall, indicating model accuracy for the minority class). The subset yielding the best F1-score is selected, and the process loops until the best feature subset is consolidated (a minimal sketch of such a wrapper loop follows this description).
  • Step 6, Performance evaluation: The best feature subset and the chosen learning algorithm are used to test the model on the validation data held out in Step 2, repeated for all Z1 folds. Finally, the best-performing model is determined, and its average performance results and model stability (consistency across different data subsets) are calculated.
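The wrapper loop in Step 5 can be illustrated with a simple greedy forward-selection sketch scored by F1 on the test split; the paper's actual subset-generation strategy may differ, so this is only indicative.

```python
# Simplified wrapper feature selection of the kind Step 5 describes: greedily add the
# feature that most improves F1 on the test split. (Illustrative only; the paper's
# subset-generation strategy may differ.)
from sklearn.base import clone
from sklearn.metrics import f1_score

def greedy_wrapper(estimator, X_train, y_train, X_test, y_test):
    remaining = list(range(X_train.shape[1]))
    selected, best_f1 = [], 0.0
    improved = True
    while improved and remaining:
        improved = False
        for f in list(remaining):
            cols = selected + [f]
            model = clone(estimator).fit(X_train[:, cols], y_train)
            score = f1_score(y_test, model.predict(X_test[:, cols]))
            if score > best_f1:                      # keep the single best addition this round
                best_f1, best_feature, improved = score, f, True
        if improved:
            selected.append(best_feature)
            remaining.remove(best_feature)
    return selected, best_f1
```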
Scientific Validity
  • ✅ Comprehensive and robust framework design: The proposed framework is comprehensive, addressing key stages in predictive modeling from data preprocessing to performance evaluation. The inclusion of two distinct cross-validation loops (Steps 2 and 4) is a robust approach aimed at ensuring model generalizability and stability.
  • ✅ Addresses key challenges in predictive modeling: The explicit inclusion of 'Data resampling' (Step 3) to handle imbalanced datasets and 'Feature selection' (Step 5) to identify relevant predictors are critical and appropriate steps for building effective classification models, especially in healthcare where class imbalance (e.g., no-shows vs. shows) is common.
  • ✅ Double cross-validation with stratification: The use of z-fold cross-validation twice (once on the complete sample for calibration/validation split, and again on the balanced calibration sample for train/test split) is a sophisticated approach. This should help in obtaining reliable estimates of model performance and reducing overfitting. The stratification at both cross-validation stages is also a good practice for imbalanced data.
  • ✅ Methodological soundness for the stated purpose: The framework appears methodologically sound for developing and evaluating machine learning models for no-show prediction. The iterative nature of feature selection (Step 5) based on F1-score is appropriate for optimizing models where the positive class (no-show) is often the minority and of primary interest.
  • 💡 Clarification on feature selection termination and algorithm specification: While the framework is detailed, the specific criteria for 'Best F1-score?' leading to the consolidation of the 'best feature subset' could be elaborated in the main text. For instance, is it the absolute best F1-score across all iterations, or is there a threshold or comparative process involved if multiple subsets yield similar scores? Also, the term 'Learning Algorithm' is generic; the paper should detail which algorithms are tested within this framework.
  • ✅ Unbiased final validation: The separation of a final 'Validation' set in Step 2, which is not touched during the resampling and feature selection (Steps 3-5 on the 'Calibration' set), is crucial for an unbiased assessment of the final model's performance. This ensures that the performance metrics reported in Step 6 are on data unseen during model tuning.
Communication
  • ✅ Logical flow and standard conventions: The flowchart is generally well-structured and uses standard conventions (rectangles for processes, diamonds for decisions), making the overall workflow easy to follow. The sequential numbering of the main steps (1 to 6) aids in understanding the progression.
  • 💡 Text size and readability: The use of color is minimal, relying on shades of grey and blue, which is good for avoiding distraction. However, the text within some boxes is quite small, particularly in the feature selection loop (Step 5) and the decision diamonds. The lines connecting elements are clear.
  • 💡 Clarity of internal loops and terminology: While the main steps are clear, the internal loops and decision points (e.g., 'z2 = Z2?', 'z1 = Z1?') could benefit from slightly more descriptive labels or a brief note in the caption explaining Z1 and Z2 represent the number of folds in the respective cross-validation stages. The term 'Learning Algorithm' is generic; specifying the types considered or linking to where they are detailed would enhance self-containment.
  • ✅ Effective overview of a complex process: The figure effectively outlines a complex multi-stage process. The visual separation of steps helps in breaking down the framework into manageable components. The flow from data input to performance evaluation is logically presented.
  • 💡 Suggestions for improvement: Consider increasing the font size for text within the flowchart boxes and decision diamonds to improve readability, especially for terms like 'Generate a subset', 'Learning Algorithm', and the conditions in the decision diamonds. If space is an issue, slightly rephrasing for conciseness might help. Adding a legend or a brief explanation for Z1 and Z2 in the caption or as a footnote to the figure would be beneficial.

Results

Non-Text Elements

Fig. 2 Boxplot of sensitivity results in the validation set for all prediction models
First Reference in Text
A boxplot of the sensitivity metric is presented to verify the stability of prediction models in the validation set considering cases correctly classified as no-shows (Fig. 2).
Description
  • Distribution of sensitivity for prediction models: The figure is a boxplot chart displaying the 'sensitivity' results for various prediction models on a 'validation set'. Sensitivity, in this context, measures how well a model correctly identifies patients who will actually miss their appointments (no-shows); a sensitivity of 1.0 means all no-shows were correctly predicted, while 0.0 means none were. Each boxplot represents a different combination of a classification algorithm (KNN: K-Nearest Neighbors, a method that classifies based on the majority class among its 'k' closest examples; SR: Symbolic Regression, a technique that searches for mathematical expressions to fit data; SVM: Support Vector Machine, an algorithm that finds an optimal boundary to separate classes) and a resampling technique (SMOTE, RUS, NM, IHT - these are methods to balance datasets where one class, like 'no-show', is much rarer than another). For each box: the horizontal line inside is the median sensitivity (the middle value); the box itself spans the interquartile range (IQR, the middle 50% of the data); the 'whiskers' (lines extending from the box) typically show the range of the data, excluding outliers; and individual points beyond the whiskers are outliers. The y-axis ranges from 0.0 to 1.0. Visually, some models, particularly those involving the IHT resampling technique (e.g., KNN IHT, SR IHT, SVM IHT), show median sensitivities around 0.9 or higher, with relatively compact boxes, indicating consistent high performance. Other combinations, such as KNN SMOTE or SR SMOTE, show much lower median sensitivities (around 0.1-0.2) and wider spreads, indicating poorer and more variable performance in correctly identifying no-shows.
  • Variability and stability of models: The spread of each boxplot (the height of the box and the length of the whiskers) indicates the stability or variability of the sensitivity metric for that particular model combination across multiple validation runs (the text mentions 100 replicates). For instance, the SVM IHT model shows a very tight box with its median line near the top, and very short whiskers, suggesting its high sensitivity is quite stable. In contrast, the SR RUS model shows a very wide box and long whiskers, with its median around 0.5, indicating highly variable sensitivity results across different runs.
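A boxplot of this kind can be reproduced from per-replicate sensitivity values with matplotlib, as in the sketch below; the values and model list are placeholders chosen only to mimic the pattern described above, not the paper's results.

```python
# Sketch of a sensitivity boxplot like Fig. 2, built from per-replicate values
# (placeholder data; the paper's actual results are summarized in Tables 1-4).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
models = ["KNN SMOTE", "KNN IHT", "SR SMOTE", "SR IHT", "SVM SMOTE", "SVM IHT"]
# 100 simulated sensitivity values per model, just to illustrate the plot
data = [rng.normal(m, s, 100).clip(0, 1)
        for m, s in [(0.15, 0.05), (0.90, 0.04), (0.20, 0.10),
                     (0.95, 0.03), (0.40, 0.08), (0.94, 0.03)]]

plt.boxplot(data)
plt.xticks(range(1, len(models) + 1), models, rotation=45, ha="right")
plt.ylabel("Sensitivity")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
```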
Scientific Validity
  • ✅ Appropriate visualization for distributional data: The use of boxplots is a statistically appropriate method for visualizing and comparing the distributions of a performance metric like sensitivity, especially when derived from multiple replicates or cross-validation folds. It effectively shows median performance, variability (IQR and range), and potential outliers.
  • ✅ Supports claims about model stability: The figure strongly supports the reference text's claim of verifying the stability of prediction models. The spread of each box (IQR) and the extent of the whiskers directly visualize the consistency (or lack thereof) of the sensitivity metric for each model configuration over the 100 replicates performed on the validation set.
  • ✅ Facilitates comparison of techniques: The figure allows for a clear comparison of the effectiveness of different resampling techniques (SMOTE, RUS, NM, IHT) when combined with various classification algorithms (KNN, SR, SVM). The visual evidence suggests that IHT consistently leads to higher and often more stable sensitivity across all three algorithms for this dataset.
  • 💡 Visual comparison without explicit statistical tests: While the boxplots themselves are informative, the figure does not inherently include statistical significance tests between the different model combinations. Such tests, if reported in the text, would complement the visual findings by quantifying the differences in sensitivity distributions.
  • ✅ Representation of outliers: The presence of outliers (e.g., for SR SMOTE, SR RUS, SVM RUS) is well-represented, providing a complete picture of the performance distribution, including atypical results from some validation runs.
Communication
  • ✅ Appropriate chart type: The use of a boxplot is appropriate for visually comparing the distributions of the sensitivity metric across multiple model configurations. It effectively conveys central tendency, spread, and outliers for each group.
  • ✅ Clear labeling: The y-axis is clearly labeled "Sensitivity" with a sensible range (0.0 to 1.0). The x-axis labels, while abbreviations (e.g., KNN SMOTE, SR IHT), are consistent with the terminology used in the paper and represent combinations of classification algorithms (KNN, SR, SVM) and resampling techniques (SMOTE, RUS, NM, IHT).
  • ✅ Visual design and clarity: The figure is relatively clean and uncluttered. The different shades used for the boxes help to distinguish them, although a legend clarifying if shades correspond to algorithm type or resampling technique systematically would be a minor improvement if not already obvious from context. The horizontal gridlines aid in reading sensitivity values.
  • 💡 X-axis label density: The x-axis labels are quite dense. Consider rotating the labels by 45 degrees or arranging them in a staggered two-line format if possible to improve readability and prevent potential overlap, especially if more models were to be added. Alternatively, grouping by primary algorithm (KNN, SR, SVM) and then by resampling technique might make comparisons within algorithm types easier.
  • ✅ Informative caption: The caption is informative, stating the metric (sensitivity), the dataset portion (validation set), and the scope (all prediction models). It aligns well with the visual content.
Fig. 3 Features selected by top models, occurrence frequency in 100 test set replicates
First Reference in Text
Figure 3 displays the most frequently selected features from the test set for combinations of KNN, SR, and SVM classification algorithms with the IHT resampling technique.
Description
  • Feature selection frequencies for different models: The figure is a table listing various 'Features' (input variables for predictive models, e.g., 'Month december', 'Age') in the first column. The subsequent three columns, labeled 'KNN IHT', 'SR IHT', and 'SVM IHT', represent different machine learning model configurations (K-Nearest Neighbors, Symbolic Regression, Support Vector Machine, all combined with the Instance Hardness Threshold resampling technique). The numerical values within these columns indicate the 'Frequency of occurrence' – specifically, how many times out of 100 test set replicates each feature was selected as important by that particular model. For instance, 'Month december' and 'Spring season' were selected in all 100 replicates (frequency 100) by all three listed models. 'Distance to the clinic' was selected 41 times by KNN IHT, 90 times by SR IHT, and 57 times by SVM IHT. 'Previous no-show in appointments' was selected 30 times by KNN IHT, 55 times by SR IHT, and 21 times by SVM IHT. The table includes a long list of features with varying selection frequencies, some as low as 3 (e.g., 'Cancer record' for KNN IHT) or 0 (e.g., 'Complete high school' for SVM IHT).
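Tallying such selection frequencies is straightforward once each replicate's selected feature list is stored, as in the sketch below; the feature names and selections shown are placeholders, not the paper's outputs.

```python
# Sketch of how feature-selection frequencies like those in Fig. 3 can be tallied
# across replicates (placeholder feature names and selections).
from collections import Counter

# Suppose each replicate yields the list of features selected by the wrapper step.
replicate_selections = [
    ["Month december", "Spring season", "Distance to the clinic"],
    ["Month december", "Spring season", "Previous no-show in appointments"],
    ["Month december", "Spring season", "Distance to the clinic"],
    # ... one list per replicate, 100 in total
]

frequency = Counter(f for selected in replicate_selections for f in selected)
for feature, count in frequency.most_common():
    print(f"{feature}: selected in {count} replicates")
```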
Scientific Validity
  • ✅ Valid approach for assessing feature importance and stability: Presenting the frequency of feature selection across multiple replicates is a valid method to assess the stability and relative importance of features for different models. It helps identify features that are consistently chosen as predictive.
  • ✅ Supports claims about frequently selected features: The figure clearly supports the reference text by showing which features were most frequently selected by the specified model combinations (KNN, SR, SVM with IHT). The data allows for identification of features consistently ranked high by these 'top models'.
  • ✅ Highlights algorithm-specific feature preferences: The table reveals interesting differences in feature selection patterns across the three algorithms, even when using the same resampling technique (IHT). For example, 'Distance to the clinic' has a much higher selection frequency for SR IHT (90) compared to KNN IHT (41), suggesting that the SR algorithm might find this feature more consistently useful. This provides insight into model-specific feature dependencies.
  • ✅ Comprehensive representation of selection frequencies: The inclusion of features with low selection frequencies (e.g., 3, 4, 5 occurrences out of 100) provides a comprehensive view but also indicates that many features are not consistently selected. This is an honest representation of the feature selection process outcomes.
  • ✅ Sufficient number of replicates: The basis of 100 replicates for determining selection frequency is a good number to provide a stable estimate of feature importance.
  • 💡 Scope of information presented: While the figure shows which features were selected and how often, it doesn't inherently explain why certain features were selected more by one algorithm than another, nor the directionality of their impact. This level of detail would typically be discussed in the text.
Communication
  • ✅ Appropriate format and clear headers: The tabular format is appropriate for presenting the frequency of selected features across different models. Column headers are clear.
  • ✅ Informative caption: The caption clearly explains what the figure represents: the frequency of features selected by top models (specified as KNN, SR, SVM with IHT in the reference text) in 100 test set replicates.
  • 💡 Feature ordering and readability: The list of features is quite extensive. To improve readability and immediate understanding of the most critical features, consider sorting the features based on their average frequency across all models, or by their frequency in one reference model (e.g., SR IHT, which seems to select 'Distance to the clinic' most often among the top few). Alternatively, grouping features by category (e.g., demographic, appointment-specific) if applicable, could also aid interpretation.
  • 💡 Enhance visual comparison: While numerical frequencies are provided, adding horizontal bars (like an embedded bar chart) next to each frequency value could offer a more immediate visual comparison of feature importance across the three models for each specific feature. This would make it easier to spot differences in selection patterns.
  • ✅ Descriptive feature names: The feature names are descriptive. Some are long and wrap lines, which is acceptable for clarity, but ensure consistent formatting throughout the table.
Table 1 Average predictive performance and standard deviations obtained from 100 replicates of dataset 1's test portion
First Reference in Text
The processing of Dataset 1 following the framework steps in Fig. 1 led to the results reported in Tables 1 (test set) and 2 (validation set).
Description
  • Overview of model performance evaluation: The table presents the average predictive performance of various machine learning models on the 'test portion' of 'dataset 1', based on 100 repeated experiments (replicates). Performance is evaluated using several standard metrics, with both the mean value and its standard deviation (in parentheses) reported. The models are combinations of a 'Resampling technique' (methods to adjust class distribution in imbalanced datasets) and a 'Classification algorithm'.
  • Techniques and algorithms tested: Four resampling techniques are compared: SMOTE (Synthetic Minority Oversampling Technique, which creates new minority class instances), RUS (Random Under-Sampling, which removes majority class instances), NM (NearMiss, an undersampling technique that selects majority class instances based on distance to minority class instances), and IHT (Instance Hardness Threshold, an undersampling technique that removes instances likely to be misclassified). Three classification algorithms are used: KNN (K-Nearest Neighbors, which classifies based on the majority vote of its closest neighbors), SVM (Support Vector Machine, which finds an optimal hyperplane to separate classes), and SR (Symbolic Regression, which evolves mathematical expressions to fit data).
  • Performance metrics reported: The metrics include AUC (Area Under the ROC Curve, a measure of overall model distinguishability, higher is better), Sensitivity (True Positive Rate, ability to correctly identify no-shows), Specificity (True Negative Rate, ability to correctly identify shows), NPV (Negative Predictive Value, probability that a predicted show is a true show), PPV (Positive Predictive Value, probability that a predicted no-show is a true no-show), F1-score (harmonic mean of precision and sensitivity, useful for imbalanced classes), and Accuracy (overall proportion of correct classifications). A worked sketch of how these metrics follow from a confusion matrix appears after this description.
  • Key performance highlights (IHT models): Models using the IHT resampling technique consistently show the highest performance across most key metrics, particularly AUC and Sensitivity. For example, KNN with IHT achieved an AUC of 0.9087 (0.032) and Sensitivity of 0.9122 (0.052). SR with IHT achieved the highest Sensitivity of 0.9582 (0.040) and NPV of 0.9728 (0.025). SVM with IHT also performed well, with an AUC of 0.9017 (0.027) and Sensitivity of 0.9447 (0.042).
  • Performance of other resampling techniques: In contrast, other resampling techniques like SMOTE, RUS, and NM generally resulted in lower performance, especially for Sensitivity when combined with certain algorithms. For instance, SR with SMOTE had a Sensitivity of 0.5034 (0.250) and SR with RUS had 0.5090 (0.416). The highest PPV was achieved by SR with SMOTE at 0.9056 (0.147). The highest Specificity was also for SR with SMOTE at 0.9076 (0.186).
  • Indication of model stability via standard deviations: The standard deviations reported alongside the means indicate the variability of the performance across the 100 replicates. Smaller standard deviations suggest more stable and reliable model performance. For example, for the IHT models, standard deviations for Sensitivity are generally low (e.g., 0.040 for SR IHT), indicating consistent performance.
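For reference, the sketch below shows how the metrics in this table follow from a single confusion matrix, with "no-show" coded as 1; the counts are illustrative, and in the paper each metric is additionally averaged over 100 replicates.

```python
# How the metrics reported in Tables 1-4 follow from a confusion matrix
# (illustrative values; 1 = no-show is the positive class).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                      # true positive rate
specificity = tn / (tn + fp)                      # true negative rate
ppv = tp / (tp + fp)                              # positive predictive value (precision)
npv = tn / (tn + fn)                              # negative predictive value
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(sensitivity, specificity, ppv, npv, f1, accuracy)
```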
Scientific Validity
  • ✅ Robust performance assessment: Reporting both mean and standard deviation from 100 replicates provides a robust assessment of model performance and stability. This is a strong methodological choice.
  • ✅ Comprehensive comparison of methods: The table comprehensively evaluates various combinations of classification algorithms and resampling techniques using a wide array of standard performance metrics. This allows for a thorough comparison and identification of superior approaches for the given dataset and problem.
  • ✅ Highlights effective techniques: The results clearly demonstrate the impact of different resampling techniques, with IHT showing marked superiority, especially in terms of sensitivity and AUC for this 'test portion' of dataset 1. This provides valuable insights into handling imbalanced data for no-show prediction.
  • 💡 Clarity of 'test portion' context within the framework: The term 'test portion' (from caption) and 'test set' (from reference text) should ideally correspond to the 'Test' set generated in Step 4 of the framework (Fig. 1), which is used for tuning/selecting the best feature subset and model parameters within the calibration phase. Table 2 is then the 'validation set' from Step 2 of Fig. 1, used for final unbiased evaluation. Ensuring this distinction is clear in the manuscript is important for interpreting the overall evaluation strategy.
  • ✅ Appropriate choice of performance metrics: The choice of metrics, including AUC, sensitivity, specificity, PPV, NPV, and F1-score, is appropriate for a classification problem with potentially imbalanced classes, providing a multi-faceted view of performance beyond simple accuracy.
  • ✅ Strong support for high sensitivity with IHT: The data strongly supports the claim that IHT combined with SR or SVM can achieve high sensitivity (above 0.94), which is crucial for minimizing missed no-show predictions. The PPV values, while lower than sensitivity for IHT models (around 0.80-0.86), are still reasonably high, indicating that a good proportion of predicted no-shows are actual no-shows.
Communication
  • ✅ Clear structure and organization: The table is well-structured with clear rows for resampling technique/classification algorithm combinations and columns for standard performance metrics. This organization facilitates comparison across different models.
  • ✅ Informative caption: The caption is informative, clearly stating the content (average predictive performance and standard deviations), the source of the data (100 replicates of dataset 1's test portion), and the metrics presented.
  • ✅ Effective use of bolding for best results: Using bold text to highlight the best performing value for each metric is an effective visual cue that helps the reader quickly identify top-performing models for specific criteria.
  • ✅ Concise use of abbreviations: Abbreviations for algorithms (KNN, SVM, SR) and resampling techniques (SMOTE, RUS, NM, IHT) are standard and make the table concise. It is assumed these are defined in the main text.
  • 💡 Information density: The table is quite dense with information. While comprehensive, consider if a summary figure or a smaller table highlighting only the IHT results (which appear superior) for key metrics could be provided in the main text or supplement for quicker assimilation of the primary findings, with this full table providing the exhaustive details.
  • ✅ Consistent decimal precision: The number of decimal places (typically three or four) is consistent and appropriate for these types of performance metrics.
Table 2 Average predictive performance and standard deviations obtained from 100 replicates of dataset 1's validation portion
First Reference in Text
The processing of Dataset 1 following the framework steps in Fig. 1 led to the results reported in Tables 1 (test set) and 2 (validation set).
Description
  • Overview of model performance on validation set: This table displays the average predictive performance and standard deviations of various machine learning models, this time evaluated on the 'validation portion' of 'dataset 1', derived from 100 replicates. The structure mirrors Table 1, showing combinations of 'Resampling technique' and 'Classification algorithm'.
  • Techniques, algorithms, and metrics: The same resampling techniques (SMOTE, RUS, NM, IHT) and classification algorithms (KNN, SVM, SR) as in Table 1 are evaluated. The performance metrics reported are also the same: AUC (Area Under the ROC Curve), Sensitivity (True Positive Rate), Specificity (True Negative Rate), NPV (Negative Predictive Value), PPV (Positive Predictive Value), F1-score, and Accuracy.
  • Key performance highlights on validation set: On this validation set, SR with IHT achieves the highest Sensitivity at 0.9537 (0.042) and the highest NPV at 0.9886 (0.009). KNN with IHT has the highest F1-score at 0.1670 (0.018) and the highest AUC at 0.6302 (0.037). The highest Specificity is achieved by SR with SMOTE at 0.9087 (0.191), which also yields the highest Accuracy at 0.8562 (0.161). The highest PPV is achieved by KNN with RUS at 0.0938 (0.011).
  • Comparison with test set performance (Table 1): Compared to Table 1 (test portion), many metrics show a notable decrease in performance on the validation set, particularly PPV, F1-score, and Accuracy. For instance, the best F1-score in Table 1 was 0.8822 (KNN IHT), while in Table 2 it is 0.1670 (KNN IHT). Similarly, the best PPV in Table 1 was 0.9056 (SR SMOTE), while in Table 2 it is 0.0938 (KNN RUS). Accuracy also drops from a high of 0.9079 (KNN IHT) in Table 1 to 0.8562 (SR SMOTE) in Table 2, with many models showing accuracies around 0.3-0.6. This suggests that the models, while performing well on the balanced test set (from Step 4 of the framework), struggle more on the likely imbalanced validation set (from Step 2).
  • Performance of IHT models on validation set: Despite the overall drop in some metrics, Sensitivity remains high for IHT models, with SR IHT (0.9537), SVM IHT (0.9463), and KNN IHT (0.8998) still demonstrating a strong ability to identify no-shows. NPV also remains very high for these models (e.g., 0.9886 for SR IHT), indicating that when the model predicts a 'show', it is very likely to be correct.
  • Stability of performance on validation set: The standard deviations, particularly for metrics like PPV and F1-score, are relatively small for most models, suggesting that the low performance in these areas is consistent across the 100 replicates on the validation set.
Scientific Validity
  • ✅ Assessment on a true validation set: Presenting results on a separate validation set (presumably untouched during model training, resampling, and feature selection on the calibration set) is crucial for assessing the true generalization capability of the models. This is a methodologically sound practice.
  • ✅ Robustness from multiple replicates: The use of 100 replicates provides confidence in the stability of the reported average performance metrics on this validation set.
  • ✅ Highlights challenges of class imbalance on validation: The significant drop in metrics like F1-score and PPV compared to Table 1 highlights the challenge of class imbalance when models are applied to data that reflects real-world distributions (assuming the validation set is not resampled). High sensitivity maintained by IHT models is a positive finding, but the low PPV suggests a high number of false positives (shows predicted as no-shows), which has practical implications (e.g., overbooking strategies might affect patients who do show up if based solely on these predictions).
  • 💡 Trade-off between sensitivity and precision: The results demonstrate that while IHT consistently improves sensitivity, the overall predictive power for identifying no-shows with high precision (PPV) is limited on this validation set. This suggests that while the models are good at capturing most no-shows, they also misclassify a significant number of actual shows as no-shows.
  • ✅ Realistic portrayal of model performance limitations: The table clearly shows that even the 'best' models according to metrics like F1-score and Accuracy on this validation set have relatively low absolute values (e.g., F1-score of 0.1670, Accuracy often below 0.5 for many combinations). This underscores the difficulty of the prediction task on this particular dataset's validation portion. The high specificity of SR SMOTE (0.9087) and its corresponding highest accuracy (0.8562) is interesting, suggesting it's very good at identifying true 'shows', but its sensitivity is very low (0.1195), meaning it misses most 'no-shows'.
  • 💡 Contextual discussion of 'best' metrics needed: The choice to bold the highest value for each metric is helpful, but it's important for the authors to discuss the practical significance of these 'best' values, especially when they are low in absolute terms (e.g., PPV). The discussion should focus on the most relevant metrics for the problem (likely sensitivity and perhaps NPV, given the context of minimizing missed no-shows).
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent and clear structure, similar to Table 1, which aids in comparing results between the test and validation sets. Rows define model combinations and columns define performance metrics.
  • ✅ Informative caption: The caption is unambiguous, clearly stating that these results pertain to the 'validation portion' of dataset 1, averaged over 100 replicates, and lists the performance metrics shown.
  • ✅ Effective use of bolding: The use of bold text to highlight the best-performing value for each metric is continued from Table 1, effectively guiding the reader's attention to key results.
  • ✅ Clear metric labeling and data presentation: The metrics are standard and clearly labeled. The presentation of mean (standard deviation) is a good practice for summarizing results from multiple replicates.
  • 💡 Contextual interpretation of performance drop: Given the significant drop in performance for some metrics (e.g., PPV, F1-score, Accuracy) compared to Table 1, especially due to class imbalance in the validation set, the discussion in the text should carefully address these differences and their implications. The table itself presents the data clearly, but the interpretation is crucial.
Table 3 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's test portion
First Reference in Text
Applying the framework steps in Fig. 1 to Dataset 2 led to the results reported in Tables 3 (test set) and 4 (validation set).
Description
  • Overview of model performance on Dataset 2 test portion: This table presents the average predictive performance and standard deviations for various machine learning models applied to the 'test portion' of 'dataset 2', based on 100 experimental replicates. It follows the same format as Table 1, allowing for comparison of the framework's application on a different dataset.
  • Techniques, algorithms, and metrics evaluated: The same set of resampling techniques (SMOTE, RUS, NM, IHT) and classification algorithms (KNN, SVM, SR) are evaluated. Performance is measured using the same metrics: AUC (Area Under the ROC Curve, indicating how well the model can distinguish between classes), Sensitivity (True Positive Rate, or the proportion of actual no-shows correctly identified), Specificity (True Negative Rate, or the proportion of actual shows correctly identified), NPV (Negative Predictive Value, the probability that a patient predicted to show up actually does), PPV (Positive Predictive Value, the probability that a patient predicted to no-show actually does not), F1-score (a balance between precision/PPV and sensitivity), and Accuracy (overall correctness of predictions).
  • Key performance highlights (IHT models on Dataset 2): For Dataset 2's test portion, models using the IHT resampling technique again generally outperform others, particularly in AUC and Sensitivity. SR with IHT achieves the highest AUC (0.9429 ± 0.048), Sensitivity (0.9425 ± 0.094), NPV (0.9570 ± 0.057), and F1-score (0.9349 ± 0.065), and ties for the highest Accuracy (0.9429 ± 0.045). KNN with IHT also shows strong performance with an AUC of 0.9399 (0.021) and Sensitivity of 0.9396 (0.038).
  • Performance highlights for SVM with IHT: SVM with IHT achieved the highest Specificity (0.9505 ± 0.037) and the highest PPV (0.9397 ± 0.040). This suggests that for Dataset 2's test set, SVM with IHT was particularly good at correctly identifying patients who would show up and ensuring that predicted no-shows were indeed no-shows.
  • Comparison with Dataset 1 test portion performance: Compared to Dataset 1 (Table 1), the performance on Dataset 2's test portion appears generally higher and more balanced across metrics for the IHT models. For instance, the F1-scores for IHT models are consistently above 0.92 for Dataset 2, whereas for Dataset 1 they were around 0.86-0.88. PPV values for IHT models are also higher for Dataset 2 (above 0.93) compared to Dataset 1 (around 0.80-0.86).
  • Stability of IHT models on Dataset 2: The standard deviations for the IHT models are generally low, indicating stable performance across the 100 replicates. For example, SR with IHT has a standard deviation of 0.094 for Sensitivity, and SVM with IHT has 0.037 for Specificity.
Scientific Validity
  • ✅ Testing framework on a second dataset: The application of the same analytical framework and evaluation methodology (100 replicates, comprehensive metrics) to a second dataset (Dataset 2) is a strong point, as it allows for an assessment of the framework's generalizability and the robustness of the findings regarding technique performance (e.g., IHT).
  • ✅ Strong performance of IHT on Dataset 2: The reported results, particularly the high performance of IHT-based models across multiple metrics (AUC, Sensitivity, F1-score, PPV often >0.93), suggest that the proposed methods are effective for this dataset's test portion. The consistency of IHT's strong performance across two datasets (comparing qualitatively with Table 1) strengthens the claims about its utility.
  • ✅ Balanced view through multiple metrics: The comprehensive set of metrics provides a balanced view of model performance. For example, while SR IHT has the highest sensitivity, SVM IHT has the highest specificity and PPV, allowing for nuanced model selection based on specific operational priorities (e.g., minimizing false alarms vs. capturing all no-shows).
  • ✅ Proper distinction of evaluation sets: The distinction between 'test set' (Table 3) and 'validation set' (Table 4, to be reviewed next) is maintained, which is critical for proper model evaluation. Table 3 results likely reflect performance on data used for internal tuning/model selection within the calibration phase of the outer cross-validation loop.
  • ✅ Realistic variation in algorithm performance: The variability in which specific algorithm (KNN, SR, or SVM) performs best for a given metric, even within the IHT group, suggests that there isn't a single universally superior algorithm, and the choice might depend on the dataset characteristics or the specific metric being optimized. This is a realistic outcome in machine learning.
  • ✅ Indication of stable high performance: The standard deviations are generally low for the top-performing IHT models, indicating that the high performance observed is consistent across the replicates for this test portion.
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent and clear structure, identical to Tables 1 and 2, facilitating comparison across datasets and model evaluation stages. Rows for model combinations and columns for performance metrics are well-defined.
  • ✅ Informative caption: The caption is precise, indicating the dataset (Dataset 2), the portion (test portion), the basis of results (100 replicates), and the nature of the data (average predictive performance and standard deviations).
  • ✅ Effective use of bolding: The use of bold text to highlight the best-performing value for each metric continues to be an effective visual aid for quickly identifying top results.
  • ✅ Concise abbreviations and clear labels: Standard abbreviations for algorithms and resampling techniques are used, maintaining conciseness. Performance metrics are clearly labeled.
  • ✅ Comprehensive and consistent presentation: The table is comprehensive and provides detailed results. The consistency in presentation with previous tables is a strength.
Table 4 Average predictive performance and standard deviations obtained from...
Full Caption

Table 4 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's validation portion

Figure/Table Image (Page 11)
Table 4 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's validation portion
First Reference in Text
Applying the framework steps in Fig. 1 to Dataset 2 led to the results reported in Tables 3 (test set) and 4 (validation set).
Description
  • Overview of model performance on Dataset 2 validation portion: This table presents the average predictive performance and standard deviations for various machine learning models, evaluated on the 'validation portion' of 'dataset 2'. The results are based on 100 experimental replicates. The structure is identical to previous performance tables, allowing for comparisons.
  • Techniques, algorithms, and metrics evaluated: The same set of resampling techniques (SMOTE: Synthetic Minority Oversampling Technique, RUS: Random Under-Sampling, NM: NearMiss, IHT: Instance Hardness Threshold) and classification algorithms (KNN: K-Nearest Neighbors, SVM: Support Vector Machine, SR: Symbolic Regression) are evaluated. Performance is measured using the same metrics: AUC (Area Under the ROC Curve, a measure of a model's ability to distinguish between classes), Sensitivity (True Positive Rate, the proportion of actual no-shows correctly identified), Specificity (True Negative Rate, the proportion of actual shows correctly identified), NPV (Negative Predictive Value, the probability that a patient predicted to show up actually attends), PPV (Positive Predictive Value, the probability that a patient predicted to miss the appointment actually misses it), F1-score (the harmonic mean of PPV and sensitivity), and Accuracy (overall proportion of correct classifications). A minimal sketch of how these metrics are computed follows this list.
  • Key performance highlights (IHT models): On this validation set for Dataset 2, SR with IHT achieves the highest Sensitivity (0.9434 ± 0.087), AUC (0.7734 ± 0.038), and NPV (0.9802 ± 0.020). KNN with IHT also shows high Sensitivity (0.9418 ± 0.035).
  • Performance highlights for SMOTE models: The highest Specificity is achieved by KNN with SMOTE (0.8864 ± 0.086), which also yields the highest Accuracy (0.8194 ± 0.055) and the highest PPV (0.5592 ± 0.115). SVM with SMOTE yields the highest F1-score (0.5327 ± 0.073).
  • Comparison with Dataset 2 test portion performance (Table 3): Compared to the test portion results for Dataset 2 (Table 3), there is a noticeable drop in performance for several metrics on this validation set, particularly PPV, F1-score, and Accuracy, especially for the IHT models. For instance, the best PPV for IHT models in Table 3 was above 0.93, whereas in Table 4, the best PPV for IHT models is around 0.34-0.37. Similarly, F1-scores for IHT models were above 0.92 in Table 3 but are around 0.50-0.52 in Table 4. This suggests that while models performed well on the balanced test set, their performance degrades on the (likely imbalanced) validation set, especially in terms of precision-related metrics.
  • Maintained high Sensitivity and NPV for IHT models: Despite the drop in some metrics, Sensitivity for the IHT models remains very high (around 0.91-0.94), indicating they are still effective at identifying most of the no-show instances. NPV also remains high for IHT models (e.g., 0.9802 for SR IHT).
  • Stability of performance on validation set: The standard deviations are generally low to moderate, indicating a degree of consistency in the performance across the 100 replicates on this validation set.
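For readers unfamiliar with these metrics, the sketch below shows one way to compute them from a binary confusion matrix and predicted scores using scikit-learn. The variable names (y_true, y_pred, y_score) are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: computing the metrics reported in Tables 1-4 from a model's
# predictions, assuming y_true, y_pred, y_score are available (class 1 = no-show).
from sklearn.metrics import confusion_matrix, roc_auc_score

def performance_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # precision for the no-show class
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    auc = roc_auc_score(y_true, y_score)  # uses predicted scores, not hard labels
    return dict(AUC=auc, Sensitivity=sensitivity, Specificity=specificity,
                NPV=npv, PPV=ppv, F1=f1, Accuracy=accuracy)
```

The averages and standard deviations reported in the table would then be obtained by repeating such a computation over the 100 replicates.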
Scientific Validity
  • ✅ Assessment on a true validation set: Evaluating models on a separate validation set, untouched during the main model training and selection phases (which used the 'test set' from Table 3, itself derived from the calibration portion), is a critical and methodologically sound step for assessing true generalization performance.
  • ✅ Robustness from multiple replicates: The use of 100 replicates provides a reliable basis for the reported average performance metrics and their standard deviations, lending credibility to the stability of these findings on the validation set.
  • ✅ Realistic assessment of generalization challenges: The results highlight the practical challenges of applying models trained/tuned on balanced data (implied for the 'test set' in Table 3 due to resampling in Step 3 of Fig. 1) to real-world, likely imbalanced validation data. The drop in precision-related metrics (PPV, F1-score) for IHT models, despite high sensitivity, is a key finding. This indicates that while most no-shows are caught, many actual shows might be misclassified as no-shows.
  • 💡 Divergent best-performing techniques on validation set: The superior performance of SMOTE-based models (KNN SMOTE for Specificity, PPV, Accuracy; SVM SMOTE for F1-score) on this validation set is noteworthy and contrasts with the general dominance of IHT on the test set (Table 3) and for Dataset 1. This suggests that the optimal resampling strategy can be dataset-dependent and also depends on the specific characteristics of the validation data distribution. The text discussion should explore this.
  • ✅ High sensitivity of IHT maintained, but with low PPV: The high sensitivity maintained by IHT models (SR IHT: 0.9434) is a significant positive, as correctly identifying no-shows is often a primary goal. However, the corresponding PPV for SR IHT is 0.3661, meaning that of all patients this model flags as no-shows, only about 36.6% actually fail to attend. This trade-off needs careful consideration in practical application; the sketch after this list shows how the class prevalence of the validation set largely accounts for the gap.
  • ✅ Comprehensive data supporting complexity of prediction task: The table provides a comprehensive picture, allowing for comparison across all tested combinations. The data supports the conclusion that predicting no-shows is a complex task, and performance can vary significantly depending on the dataset characteristics and the evaluation set used.
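The drop in PPV on the imbalanced validation set follows directly from Bayes' rule: at fixed sensitivity and specificity, PPV falls as the prevalence of no-shows falls. The sketch below reproduces the reported order of magnitude using the table's sensitivity (0.9434) and Dataset 2's no-show rate (19.03%); the specificity of 0.62 is an assumed value chosen only to make the arithmetic concrete, not a figure taken from the table.

```python
# Minimal sketch: PPV as a function of sensitivity, specificity, and prevalence.
# Sensitivity (0.9434) and prevalence (0.1903) come from the paper; the
# specificity value is an assumption for illustration only.
def ppv(sensitivity, specificity, prevalence):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

print(ppv(0.9434, 0.62, 0.1903))  # ~0.37: at 19% prevalence, most flagged
                                  # patients are false alarms despite high sensitivity
print(ppv(0.9434, 0.62, 0.5))     # ~0.71: the same classifier looks far more
                                  # precise when classes are balanced
```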
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent structure with previous performance tables (Tables 1-3), which is excellent for comparability. Clear row and column labeling is maintained.
  • ✅ Informative caption: The caption is precise, clearly identifying the dataset (Dataset 2), the specific portion (validation portion), the number of replicates (100), and the nature of the presented data (average predictive performance and standard deviations).
  • ✅ Effective use of bolding: The continued use of bold text to highlight the best-performing value for each metric is effective in guiding the reader to the top results for each criterion.
  • ✅ Standard metrics and consistent abbreviations: The performance metrics are standard and well-understood in the field. Abbreviations for models are consistently used.
  • 💡 Contextual interpretation of performance changes: Similar to Table 2, the performance on the validation set shows a drop in some metrics (e.g., PPV, Accuracy) compared to the test set (Table 3). The table clearly presents this, and the discussion in the main text should adequately address the reasons and implications, especially concerning class imbalance.
Fig. 4 Boxplot of sensitivity results in the validation set for all prediction...
Full Caption

Fig. 4 Boxplot of sensitivity results in the validation set for all prediction models

Figure/Table Image (Page 12)
Fig. 4 Boxplot of sensitivity results in the validation set for all prediction models
First Reference in Text
Classification algorithms KNN, SR, and SVM,
Description
  • Distribution of sensitivity scores for different models: The figure is a boxplot chart illustrating the 'sensitivity' results for various prediction models when applied to the 'validation set' (presumably of Dataset 2, given the sequence of figures). Sensitivity measures the model's ability to correctly identify true positive cases (i.e., actual no-shows); a sensitivity of 1.0 means all no-shows were correctly predicted. Each boxplot represents a specific combination of a classification algorithm (KNN: K-Nearest Neighbors; SR: Symbolic Regression; SVM: Support Vector Machine) and a resampling technique (SMOTE, RUS, NM, IHT, methods for balancing imbalanced datasets). For each box: the line inside marks the median sensitivity; the box itself covers the interquartile range (IQR, the middle 50% of the data); the 'whiskers' extend to show the data range excluding outliers; and individual points are outliers. The y-axis ranges from 0.2 to 1.0. A minimal plotting sketch reproducing this layout follows this list.
  • High performance of IHT models: Visually, models incorporating the IHT resampling technique (KNN IHT, SR IHT, SVM IHT) consistently show the highest median sensitivities, all appearing to be above or around 0.9. Their boxes are also relatively compact, particularly for KNN IHT and SR IHT, suggesting stable high performance in identifying no-shows. SVM IHT shows a slightly wider box but still a high median.
  • Varied performance of non-IHT models: Other model combinations exhibit more varied and generally lower median sensitivities. For example, KNN SMOTE has a median around 0.5, while SR RUS has a median around 0.7 but with a very wide spread and several outliers, indicating high variability. SVM NM shows a median sensitivity around 0.8 but also with a noticeable spread. Some models, like SR SMOTE, show medians around 0.6 with outliers extending to lower values.
  • Indication of model stability: The spread of each boxplot (height of the box and length of whiskers) indicates the stability of the sensitivity metric for that model across the 100 replicates mentioned in similar contexts (e.g., Table 4). Models like KNN IHT and SR IHT show relatively tight distributions, suggesting more consistent sensitivity. In contrast, SR RUS shows a very wide distribution, indicating less stable sensitivity results.
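A figure of this kind can be produced from the per-replicate sensitivity values with pandas and matplotlib, as in the minimal sketch below. The DataFrame layout and the placeholder numbers are assumptions for illustration; real values would be the 100 sensitivity results per model combination.

```python
# Minimal sketch of a Fig. 4-style boxplot. The DataFrame holds one column of
# per-replicate sensitivity values per model combination; the numbers below
# are placeholders, not values from the study.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
results = pd.DataFrame({
    "KNN IHT": np.clip(rng.normal(0.94, 0.03, 100), 0, 1),
    "SR IHT": np.clip(rng.normal(0.94, 0.05, 100), 0, 1),
    "SR RUS": np.clip(rng.normal(0.70, 0.10, 100), 0, 1),
})

ax = results.boxplot(figsize=(8, 4))   # one box per model combination
ax.set_ylabel("Sensitivity")
ax.set_ylim(0.2, 1.0)
plt.xticks(rotation=45, ha="right")    # keeps dense x-axis labels readable
plt.tight_layout()
plt.show()
```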
Scientific Validity
  • ✅ Appropriate visualization method: The use of boxplots to compare the distributions of sensitivity scores from multiple model replicates is a statistically sound and appropriate visualization method. It effectively communicates central tendency, dispersion, and the presence of outliers for each model configuration.
  • ✅ Supports textual claims on model performance and stability: The figure provides strong visual evidence for the claims made in the text (page 9) about the stability and performance of different models, particularly highlighting the superior and more stable sensitivity of IHT-based models on the validation set of Dataset 2.
  • ✅ Effective comparison of techniques and algorithms: The visualization clearly demonstrates the impact of different resampling techniques and classification algorithms on sensitivity. The consistent high performance of IHT across KNN, SR, and SVM algorithms for this metric is evident.
  • ✅ Robust view from multiple replicates: The figure, by showing the full distribution from 100 replicates, provides a more robust view of performance than a single point estimate. This is good scientific practice.
  • 💡 Visual comparison without explicit statistical tests: While visually compelling, the figure itself does not provide formal statistical tests of significance between the different model groups, and none appear to be reported in the accompanying text. Including such tests would strengthen the conclusions drawn from these visual comparisons; a minimal sketch of one possible comparison follows this list.
  • 💡 Incomplete reference text: The reference text provided ("Classification algorithms KNN, SR, and SVM,") is very brief and only partially describes the content. The figure actually shows these algorithms in combination with various resampling techniques. The caption is more complete.
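One way to complement the visual comparison is a non-parametric test on the per-replicate sensitivity distributions of two model configurations, as sketched below with SciPy. The arrays are placeholders, and because cross-validation replicates are not fully independent, the resulting p-values should be read as indicative rather than exact.

```python
# Minimal sketch: a non-parametric comparison of two models' per-replicate
# sensitivity distributions (e.g., SR IHT vs. SVM NM). The arrays here are
# placeholders; real values would be the 100 sensitivity results per model.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
sens_sr_iht = np.clip(rng.normal(0.94, 0.05, 100), 0, 1)
sens_svm_nm = np.clip(rng.normal(0.80, 0.08, 100), 0, 1)

stat, p_value = mannwhitneyu(sens_sr_iht, sens_svm_nm, alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
# Caveat: replicates drawn from repeated cross-validation are not fully
# independent, so treat these p-values as indicative only.
```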
Communication
  • ✅ Appropriate chart type: The boxplot is an appropriate choice for visualizing the distribution of sensitivity scores across different model configurations, allowing for easy comparison of median performance, spread, and outliers.
  • ✅ Clear labeling: The y-axis is clearly labeled "Sensitivity" and spans a relevant range (0.2 to 1.0). The x-axis labels, representing model combinations (e.g., KNN SMOTE, SR IHT), are consistent with previous figures and tables, though they are abbreviations.
  • ✅ Good visual design: The visual design is clean, with distinct boxes and clear differentiation between medians, quartiles, whiskers, and outliers. The use of color and horizontal gridlines aids readability.
  • 💡 X-axis label density: The x-axis labels are dense. Similar to Fig. 2, consider rotating the labels or using a staggered layout to improve readability and prevent any potential overlap, especially if more models were included.
  • ✅ Informative caption: The caption clearly states the metric (sensitivity), the dataset portion (validation set), and the scope (all prediction models). This aligns well with the visual content presented.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Supplementary Table S1. No-show modeling approaches in the literature.
Figure/Table Image (Page 15)