Decision analysis framework for predicting no-shows to appointments using machine learning algorithms

Carolina Deina, Flavio S. Fogliatto, Giovani J. C. da Silveira, Michel J. Anzanello
BMC Health Services Research
Department of Industrial Engineering, Federal University of Rio Grande do Sul


Overall Summary

Study Background and Main Findings

This study addresses the significant problem of patient no-shows in healthcare by proposing and evaluating a novel decision analysis framework using machine learning. The primary objective is to accurately predict no-shows, particularly in the context of imbalanced datasets where actual no-shows are relatively rare. The research introduces Symbolic Regression (SR), an algorithm that discovers mathematical formulas from data, as a classification method, and Instance Hardness Threshold (IHT), a technique that balances datasets by removing hard-to-classify majority instances, as a resampling method. These are benchmarked against established algorithms like K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), and against other resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique), Random Under-Sampling (RUS), and NearMiss (NM).
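To make these method combinations concrete, the sketch below pairs the resampling techniques with the classification algorithms using scikit-learn and imbalanced-learn. It is illustrative only, not the authors' code; gplearn's SymbolicClassifier is assumed here as a stand-in for Symbolic Regression, since the paper does not name its implementation.

```python
# Illustrative sketch (not the authors' code): pairing the resampling techniques and
# classification algorithms discussed above. gplearn's SymbolicClassifier is assumed
# as a stand-in for Symbolic Regression.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss, InstanceHardnessThreshold
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from gplearn.genetic import SymbolicClassifier  # assumption: gplearn is installed

resamplers = {
    "SMOTE": SMOTE(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "NM": NearMiss(version=1),
    "IHT": InstanceHardnessThreshold(random_state=0),
}
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True, random_state=0),
    "SR": SymbolicClassifier(random_state=0),
}

def fit_combination(X_train, y_train, resampler_name, classifier_name):
    """Balance the training data with the chosen resampler, then fit a fresh classifier."""
    X_bal, y_bal = resamplers[resampler_name].fit_resample(X_train, y_train)
    model = clone(classifiers[classifier_name])  # clone so repeated runs stay independent
    return model.fit(X_bal, y_bal)
```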

A key methodological innovation is a comprehensive six-step analytical framework that includes a rigorous dual z-fold cross-validation process. This involves splitting the data for model training and testing in two nested stages, resulting in 100 independent simulation runs for each model configuration. This approach is designed to ensure robust assessment of model generalization (ability to perform on new data) and stability, which is often a challenge with imbalanced data. The framework was validated using two distinct datasets from Brazilian hospitals with no-show rates of 6.65% and 19.03%.
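The splitting structure of that dual cross-validation can be sketched as two nested stratified k-fold loops; with Z1 = Z2 = 10 this yields 100 evaluations. The sketch below shows only the splitting logic under that assumption. In the paper, the calibration portion is additionally resampled and submitted to feature selection before the inner split is used; the placeholder `evaluate` stands in for those per-run steps.

```python
# Minimal sketch of the dual z-fold splitting structure (Z1 = Z2 = 10 gives 100 runs).
# `evaluate` is a placeholder for the per-run resampling, feature selection, fitting,
# and scoring described in the paper; this is not the authors' implementation.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def dual_cross_validation(X, y, evaluate, z1=10, z2=10, seed=0):
    outer = StratifiedKFold(n_splits=z1, shuffle=True, random_state=seed)
    scores = []
    for cal_idx, val_idx in outer.split(X, y):            # calibration / validation split
        X_cal, y_cal = X[cal_idx], y[cal_idx]
        inner = StratifiedKFold(n_splits=z2, shuffle=True, random_state=seed)
        for tr_idx, te_idx in inner.split(X_cal, y_cal):  # train / test split within calibration
            scores.append(evaluate(X_cal[tr_idx], y_cal[tr_idx],   # train portion
                                   X_cal[te_idx], y_cal[te_idx],   # test portion
                                   X[val_idx], y[val_idx]))        # held-out validation portion
    return np.asarray(scores)  # 100 per-run scores when z1 = z2 = 10
```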

The findings indicate that the novel techniques, particularly IHT, demonstrated superior performance. When combined with various classification algorithms (including SR, KNN, and SVM), IHT consistently led to high sensitivity (the proportion of actual no-shows correctly identified) and Area Under the ROC Curve (AUC, a measure of how well the model distinguishes no-shows from shows), with sensitivity values exceeding 0.94 for SR/IHT combinations on the validation portions of both datasets. This high sensitivity is crucial for healthcare applications, as it helps minimize false negatives (failing to predict a no-show), thereby enabling more effective targeted interventions like patient reminders or optimized overbooking strategies.
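As a small illustration of the two metrics emphasized here, the snippet below computes sensitivity and AUC with scikit-learn on toy predictions, treating "no-show" as the positive class; the numbers are placeholders, not results from the paper.

```python
# Toy illustration of the two headline metrics, with 1 = no-show as the positive class.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true  = np.array([1, 0, 0, 1, 0, 1, 0, 0])                   # actual outcomes
y_pred  = np.array([1, 0, 1, 1, 0, 0, 0, 0])                   # hard class predictions
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.1, 0.4, 0.3, 0.2])   # predicted no-show probabilities

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # share of actual no-shows caught (2/3 here)
auc = roc_auc_score(y_true, y_score)                     # threshold-free ranking quality
print(f"sensitivity = {sensitivity:.2f}, AUC = {auc:.2f}")
```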

The study concludes that SR and IHT are promising methods for no-show prediction and emphasizes the critical importance of robust validation strategies, like the proposed dual cross-validation, when dealing with imbalanced datasets. Relying on few validation runs can lead to biased results and an inadequate understanding of model reliability. The research provides both theoretical contributions through its novel framework and techniques, and practical implications for improving healthcare resource management and patient care.

Research Impact and Future Directions

This research makes a significant contribution by proposing a methodologically robust framework for predicting patient no-shows, a persistent challenge in healthcare. The introduction of Symbolic Regression (SR) as a classification algorithm and Instance Hardness Threshold (IHT) as a resampling technique, both novel in this context, demonstrates considerable promise. The study's emphasis on a dual z-fold cross-validation process, yielding 100 simulation runs, sets a high standard for evaluating model performance and stability, particularly crucial when dealing with imbalanced datasets where no-shows are the minority.

The findings indicate that the IHT technique, which selectively removes hard-to-classify majority instances to improve class separation, consistently enhanced model performance across different algorithms, notably achieving high sensitivity (the ability to correctly identify patients who will not show up, e.g., >0.94 for SR/IHT on validation sets). This is practically vital as minimizing missed no-shows allows for more effective resource allocation and targeted interventions. The SR algorithm, which evolves mathematical models from data, also performed well, suggesting its utility for uncovering complex patterns in patient behavior.

However, while the paper demonstrates superior observed performance for SR and IHT combinations through extensive simulations, it's important to note that these claims of superiority are based on descriptive outcomes from the 100 replicates rather than formal statistical significance testing between methods. This is a limitation in the current analysis that tempers the certainty of comparative conclusions. The study design is a rigorous comparative evaluation of algorithmic performance within a novel framework using historical data, not an experimental trial assessing real-world impact. Therefore, its primary strength lies in advancing methodological best practices for no-show prediction and highlighting the potential of SR and IHT. Future research should incorporate formal statistical comparisons and prospective validation in clinical settings to confirm the practical efficacy and cost-effectiveness of these models.

Ultimately, the paper successfully underscores the pitfalls of relying on limited validation runs for imbalanced datasets and provides a valuable, adaptable framework for developing more reliable predictive models. This can empower healthcare managers to make more informed decisions, potentially leading to improved operational efficiency, reduced costs, and better patient access to care. The key takeaway is the critical need for methodological rigor in this domain and the promising avenues opened by the novel techniques explored.

Critical Analysis and Recommendations

Clear Problem Definition and Relevance (written-content)
The abstract clearly defines the no-show problem and the practical relevance of using machine learning for prediction. This is achieved by immediately stating the adverse effects of no-shows and how ML can enable strategies like overbooking and targeted reminders. This effectively establishes the research's importance and its direct application to improving healthcare service delivery. Healthcare managers can readily understand the value proposition of such predictive tools.
Section: Abstract
Explicit Statement of Novel Contributions (written-content)
The abstract explicitly states the study's novel academic contributions, identifying it as the first to propose Symbolic Regression (SR) and Instance Hardness Threshold (IHT) for predicting patient no-shows. This directness in highlighting the paper's unique position within the literature helps establish its specific advancements in the field, and researchers can quickly grasp the innovative aspects being introduced.
Section: Abstract
Enhance Link Between Framework Design and Generalizability in Abstract Conclusion (written-content)
The abstract's conclusion could more directly attribute the framework's specific design feature (dual z-fold cross-validation) to achieving better generalization and stability, which are mentioned as benefits in the Methods section. The missing explicit connection reduces the clarity of this methodological benefit. Explicitly linking this robust validation strategy to its intended benefits in the abstract would reinforce it as a key strength and practical outcome, enhancing the takeaway message about the framework's utility for producing reliable models, especially for imbalanced datasets. A minor textual modification could achieve this.
Section: Abstract
Comprehensive Problem Articulation and Impact (written-content)
The introduction thoroughly defines the no-show problem, detailing its multifaceted negative impacts on healthcare systems (resource inefficiency, increased costs) and patients (discontinuity of care, worsened outcomes). This comprehensive articulation is achieved by citing consequences such as each no-show depriving two patients of care (the absent patient and another who could have used the slot). This effectively establishes the research's significance and the urgent need for solutions, providing a strong foundation for the study. This helps readers understand the scale and importance of the problem being addressed.
Section: Introduction
Clear Identification of Research Gap and Novelty (written-content)
The paper explicitly identifies Symbolic Regression (SR) and Instance Hardness Threshold (IHT) as unexplored methods in no-show prediction, clearly stating a gap in current research. This is done by stating, 'To the best of our knowledge, no studies used...' these techniques. This clear identification of novelty effectively positions the paper's unique contributions to the field. This allows the research community to understand the specific advancements being proposed.
Section: Introduction
Enhance Rationale for SR and IHT Suitability in No-Show Context (written-content)
The introduction describes SR and IHT and their novelty but could more explicitly connect their specific advantages (e.g., SR's flexibility for non-linear relationships, IHT's handling of imbalance and noise) to the inherent challenges of no-show data. As written, the justification leaves the rationale for the method choice underdeveloped. Enhancing this rationale would strengthen the motivation for choosing these novel methods beyond their unexplored status and better prepare the reader for why they might outperform others. This could be achieved by adding sentences detailing how SR's structure-agnostic approach or IHT's targeted filtering mechanism is particularly suited to the complexities of no-show datasets.
Section: Introduction
Comprehensive and Structured Predictive Framework (written-content)
The paper outlines a clear, comprehensive, and logically sequenced six-step predictive framework, visualized in Figure 1. This structured approach details steps from data gathering to performance evaluation, including novel applications of cross-validation and resampling. This enhances the reproducibility and understanding of the complex analytical process, which is critical for scientific validation and adoption by other researchers. Practitioners can also better understand the workflow for implementing such a system.
Section: Method
Rigorous Validation Approach with Dual Cross-Validation (written-content)
The study employs a dual z-fold cross-validation approach (Z1=10 for calibration/validation split, Z2=10 for train/test split within calibration), leading to 100 evaluation runs. This rigorous method provides a robust assessment of model performance and stability, especially important for imbalanced datasets. This methodological strength ensures more reliable and generalizable findings compared to studies with fewer validation runs. This gives higher confidence in the reported performance metrics.
Section: Method
Novel Methodological Applications (IHT and SR) (written-content)
The study is the first to apply Instance Hardness Threshold (IHT) for resampling and Symbolic Regression (SR) for classification in no-show prediction. This is explicitly stated and addresses a gap in existing literature. This novelty is a significant contribution, potentially opening new avenues for improving predictive accuracy in this domain. Researchers are provided with new tools and approaches to tackle a persistent problem.
Section: Method
Figure 1: Robust and Comprehensive Framework Design (graphical-figure)
Figure 1 outlines a comprehensive six-step predictive framework, including dual cross-validation loops and explicit steps for data resampling and feature selection. This visual representation details a robust process from data input to performance evaluation. The design addresses key challenges like class imbalance and aims for model generalizability, which is crucial for developing reliable predictive tools in healthcare. This allows for a clear understanding of the entire analytical pipeline.
Section: Method
Discuss Computational Cost of the Extensive Framework (written-content)
Given the extensive six-step framework, particularly the 10x10 cross-validation loops combined with wrapper feature selection, the paper could benefit from a brief discussion of the overall computational resources or time involved. This practicality concern affects reproducibility. Providing this context would help researchers aiming to replicate or adapt this thorough methodology, offering insight into the feasibility and resource demands of such an approach. Adding a short paragraph on computational considerations would address this; a rough back-of-envelope count of the model fits involved is sketched below.
Section: Method
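As a rough illustration of the scale involved, the arithmetic below counts the model fits implied by the framework under stated assumptions; the per-fold number of feature subsets is hypothetical, since the paper does not report it.

```python
# Back-of-envelope count of model fits implied by the framework (illustrative assumptions only).
z1, z2 = 10, 10            # outer and inner fold counts reported in the paper
combinations = 3 * 4       # 3 classification algorithms x 4 resampling techniques
subsets_per_fold = 50      # hypothetical number of feature subsets tried by the wrapper step

fits = z1 * z2 * combinations * subsets_per_fold
print(f"approximate model fits per dataset: {fits:,}")  # 60,000 under these assumptions
```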
Superior and Stable Performance with IHT Resampling (written-content)
The results, presented in Tables 1-4, show that combinations involving the Instance Hardness Threshold (IHT) resampling technique consistently yielded high sensitivity (e.g., SR/IHT >0.94 on validation for both datasets) and AUC values. This was observed across two distinct datasets with different no-show rates, using a rigorous framework with 100 replicates. Achieving high sensitivity is paramount in healthcare to minimize false negatives (failing to identify a no-show), thus enabling more effective targeted interventions and resource optimization. This demonstrates the practical value of IHT in improving the identification of at-risk patients.
Section: Results
Strong Justification for Metric Focus (Sensitivity and AUC) (written-content)
The paper effectively justifies prioritizing sensitivity and AUC as key performance metrics by linking them to the higher cost of false negatives in healthcare. This rationale is clearly stated in the Results section. This contextual justification is important because it aligns the model evaluation with the practical needs of healthcare managers, where correctly identifying potential no-shows is often more critical than other metrics. This ensures that the model's success is measured by its ability to address the most costly errors.
Section: Results
Figures 2 & 4: Effective Visualization of Model Sensitivity and Stability (graphical-figure)
Figures 2 and 4 (boxplots of sensitivity) effectively visualize the distribution of sensitivity results across 100 replicates for all model combinations on the validation sets. These figures clearly show the median performance, variability (spread), and outliers for each model. This visual representation strongly supports claims about model stability and allows for easy comparison, highlighting that IHT-based models generally achieve higher and more stable sensitivity. This provides a transparent and robust way to assess and compare model reliability.
Section: Results
Explicitly Summarize IHT's Consistent Benefit Across Algorithms in Text (written-content)
While Tables 1-4 show IHT's strong performance, the main text could more explicitly synthesize the observation that IHT consistently elevates performance (especially sensitivity and AUC) across all three classification algorithms (KNN, SVM, SR) for both datasets. This is a clarity issue: the synthesis is currently left to the reader. Explicitly reinforcing the robustness of IHT as a resampling technique within the narrative summary would strengthen the message. Adding a summary sentence after presenting IHT's success for each dataset would achieve this.
Section: Results
Strong Rationale for Extensive Cross-Validation (written-content)
The paper provides a strong rationale for its extensive cross-validation strategy (100 simulations across two stages), contrasting it with approaches using fewer validation runs. This is detailed by explaining how 100 simulations reduce random variation across repetitions and strengthen robustness. This highlights the methodological rigor of the study in assessing model generalization and stability, which is particularly important for imbalanced datasets. This robust evaluation increases confidence in the study's findings and promotes better practices in the field.
Section: Discussion
Insightful Explanation for IHT's Effectiveness (written-content)
The discussion offers an insightful explanation for the superior performance of the Instance Hardness Threshold (IHT) technique, attributing its success to its ability to identify and remove challenging majority class instances. This is explained by stating IHT's removal of these instances improves class separation. This plausible mechanism helps readers understand why IHT is effective, particularly in imbalanced scenarios common in no-show prediction. This provides a theoretical underpinning for the empirical results.
Section: Discussion
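For readers who want to see this mechanism in practice, the sketch below applies imbalanced-learn's InstanceHardnessThreshold to synthetic data with a minority rate similar to dataset 1; the data and settings are illustrative, not the paper's.

```python
# Illustrative use of Instance Hardness Threshold undersampling on synthetic data.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import InstanceHardnessThreshold

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07],  # roughly 7% minority class
                           n_informative=5, random_state=0)

iht = InstanceHardnessThreshold(random_state=0)  # an internal classifier scores instance hardness
X_res, y_res = iht.fit_resample(X, y)            # hard-to-classify majority instances are removed

print("before:", Counter(y))      # imbalanced class counts
print("after: ", Counter(y_res))  # majority class reduced toward balance
```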
Recommend Statistical Testing for IHT/SR Superiority (written-content)
The study reports superior mean performance for SR/IHT combinations based on 100 replicates but does not include formal statistical tests (e.g., Friedman test, Wilcoxon signed-rank) to confirm whether these differences are statistically significant compared to other methods. The absence of formal statistical comparison weakens the superiority claims. While the observed advantage across 100 runs is compelling, statistical testing would move claims of superiority from observational to statistically validated evidence, significantly enhancing the rigor and impact of the conclusions. Future work should incorporate such tests to definitively establish the relative performance of the novel techniques; a sketch of how such tests could be applied to the replicate results is given below.
Section: Discussion
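One way to act on this recommendation, assuming the per-replicate metric values are retained, is sketched below with SciPy's Friedman and Wilcoxon signed-rank tests; the sensitivity arrays are simulated placeholders, not the paper's results.

```python
# Illustrative sketch of the recommended tests on per-replicate sensitivity values
# (simulated placeholders, not the paper's results).
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
sr_iht  = rng.normal(0.95, 0.02, 100).clip(0, 1)   # 100 replicates per method
knn_iht = rng.normal(0.91, 0.03, 100).clip(0, 1)
svm_iht = rng.normal(0.94, 0.02, 100).clip(0, 1)

# Friedman test: do the paired methods differ overall across replicates?
stat, p = friedmanchisquare(sr_iht, knn_iht, svm_iht)
print(f"Friedman: statistic={stat:.2f}, p={p:.4f}")

# Wilcoxon signed-rank test: pairwise, paired comparison between two methods
stat, p = wilcoxon(sr_iht, knn_iht)
print(f"Wilcoxon SR/IHT vs KNN/IHT: statistic={stat:.2f}, p={p:.4f}")
```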

Section Analysis

Method

Non-Text Elements

Fig. 1 Outline of the proposed framework
First Reference in Text
To provide information to help clinics adopt strategies to minimize problems related with no-shows, we propose a six-step predictive framework (Fig. 1).
Description
  • Six-step predictive framework: The figure depicts a six-step analytical framework designed for predicting no-shows to appointments.
  • Step 1, Data gathering and pre-processing: Raw data are collected and prepared (e.g., handling missing values), followed by data scaling, which standardizes the range of continuous initial data (e.g., patient age, waiting time), typically to a 0-1 range, so that no single feature dominates due to its scale.
  • Step 2, Complete sample stratification and z-fold cross-validation: The entire dataset is divided into Z1 folds (subsets). Stratification ensures that each fold maintains the original proportion of outcomes (e.g., no-shows vs. shows). Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample; here, for each of the Z1 iterations, one fold is set aside for validation and the rest for calibration.
  • Step 3, Data resampling: Applied to the calibration data to address imbalances, such as when no-show instances are much rarer than show instances, using techniques like undersampling (reducing the majority class) or oversampling (increasing the minority class).
  • Step 4, Balanced sample stratification and z-fold cross-validation: The resampled data from Step 3 are divided again, this time into Z2 folds, for train and test purposes within the model-building phase.
  • Step 5, Feature selection: Identifies the most relevant input variables (features) from the set of all features. The step iteratively generates feature subsets, applies a learning algorithm (a specific machine learning model such as K-Nearest Neighbors or Support Vector Machine), and evaluates them using the F1-score (a metric balancing precision and recall, indicating model accuracy for the minority class). The subset yielding the best F1-score is selected, and the process loops until the best feature subset is consolidated (a minimal sketch of such a wrapper loop follows this description).
  • Step 6, Performance evaluation: The best feature subset and the chosen learning algorithm are used to test the model on the validation data held out in Step 2, repeated for all Z1 folds. Finally, the best-performing model is determined, and its average performance results and model stability (consistency across different data subsets) are calculated.
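The wrapper loop in Step 5 can be illustrated with a simple greedy forward-selection sketch scored by F1 on the test split; the paper's actual subset-generation strategy may differ, so this is only indicative.

```python
# Simplified wrapper feature selection of the kind Step 5 describes: greedily add the
# feature that most improves F1 on the test split. (Illustrative only; the paper's
# subset-generation strategy may differ.)
from sklearn.base import clone
from sklearn.metrics import f1_score

def greedy_wrapper(estimator, X_train, y_train, X_test, y_test):
    remaining = list(range(X_train.shape[1]))
    selected, best_f1 = [], 0.0
    improved = True
    while improved and remaining:
        improved = False
        for f in list(remaining):
            cols = selected + [f]
            model = clone(estimator).fit(X_train[:, cols], y_train)
            score = f1_score(y_test, model.predict(X_test[:, cols]))
            if score > best_f1:                      # keep the single best addition this round
                best_f1, best_feature, improved = score, f, True
        if improved:
            selected.append(best_feature)
            remaining.remove(best_feature)
    return selected, best_f1
```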
Scientific Validity
  • ✅ Comprehensive and robust framework design: The proposed framework is comprehensive, addressing key stages in predictive modeling from data preprocessing to performance evaluation. The inclusion of two distinct cross-validation loops (Steps 2 and 4) is a robust approach aimed at ensuring model generalizability and stability.
  • ✅ Addresses key challenges in predictive modeling: The explicit inclusion of 'Data resampling' (Step 3) to handle imbalanced datasets and 'Feature selection' (Step 5) to identify relevant predictors are critical and appropriate steps for building effective classification models, especially in healthcare where class imbalance (e.g., no-shows vs. shows) is common.
  • ✅ Double cross-validation with stratification: The use of z-fold cross-validation twice (once on the complete sample for calibration/validation split, and again on the balanced calibration sample for train/test split) is a sophisticated approach. This should help in obtaining reliable estimates of model performance and reducing overfitting. The stratification at both cross-validation stages is also a good practice for imbalanced data.
  • ✅ Methodological soundness for the stated purpose: The framework appears methodologically sound for developing and evaluating machine learning models for no-show prediction. The iterative nature of feature selection (Step 5) based on F1-score is appropriate for optimizing models where the positive class (no-show) is often the minority and of primary interest.
  • 💡 Clarification on feature selection termination and algorithm specification: While the framework is detailed, the specific criteria for 'Best F1-score?' leading to the consolidation of the 'best feature subset' could be elaborated in the main text. For instance, is it the absolute best F1-score across all iterations, or is there a threshold or comparative process involved if multiple subsets yield similar scores? Also, the term 'Learning Algorithm' is generic; the paper should detail which algorithms are tested within this framework.
  • ✅ Unbiased final validation: The separation of a final 'Validation' set in Step 2, which is not touched during the resampling and feature selection (Steps 3-5 on the 'Calibration' set), is crucial for an unbiased assessment of the final model's performance. This ensures that the performance metrics reported in Step 6 are on data unseen during model tuning.
Communication
  • ✅ Logical flow and standard conventions: The flowchart is generally well-structured and uses standard conventions (rectangles for processes, diamonds for decisions), making the overall workflow easy to follow. The sequential numbering of the main steps (1 to 6) aids in understanding the progression.
  • 💡 Text size and readability: The use of color is minimal, relying on shades of grey and blue, which is good for avoiding distraction. However, the text within some boxes is quite small, particularly in the feature selection loop (Step 5) and the decision diamonds. The lines connecting elements are clear.
  • 💡 Clarity of internal loops and terminology: While the main steps are clear, the internal loops and decision points (e.g., 'z2 = Z2?', 'z1 = Z1?') could benefit from slightly more descriptive labels or a brief note in the caption explaining Z1 and Z2 represent the number of folds in the respective cross-validation stages. The term 'Learning Algorithm' is generic; specifying the types considered or linking to where they are detailed would enhance self-containment.
  • ✅ Effective overview of a complex process: The figure effectively outlines a complex multi-stage process. The visual separation of steps helps in breaking down the framework into manageable components. The flow from data input to performance evaluation is logically presented.
  • 💡 Suggestions for improvement: Consider increasing the font size for text within the flowchart boxes and decision diamonds to improve readability, especially for terms like 'Generate a subset', 'Learning Algorithm', and the conditions in the decision diamonds. If space is an issue, slightly rephrasing for conciseness might help. Adding a legend or a brief explanation for Z1 and Z2 in the caption or as a footnote to the figure would be beneficial.

Results

Non-Text Elements

Fig. 2 Boxplot of sensitivity results in the validation set for all prediction models
First Reference in Text
A boxplot of the sensitivity metric is presented to verify the stability of prediction models in the validation set considering cases correctly classified as no-shows (Fig. 2).
Description
  • Distribution of sensitivity for prediction models: The figure is a boxplot chart displaying the 'sensitivity' results for various prediction models on a 'validation set'. Sensitivity, in this context, measures how well a model correctly identifies patients who will actually miss their appointments (no-shows); a sensitivity of 1.0 means all no-shows were correctly predicted, while 0.0 means none were. Each boxplot represents a different combination of a classification algorithm (KNN: K-Nearest Neighbors, a method that classifies based on the majority class among its 'k' closest examples; SR: Symbolic Regression, a technique that searches for mathematical expressions to fit data; SVM: Support Vector Machine, an algorithm that finds an optimal boundary to separate classes) and a resampling technique (SMOTE, RUS, NM, IHT - these are methods to balance datasets where one class, like 'no-show', is much rarer than another). For each box: the horizontal line inside is the median sensitivity (the middle value); the box itself spans the interquartile range (IQR, the middle 50% of the data); the 'whiskers' (lines extending from the box) typically show the range of the data, excluding outliers; and individual points beyond the whiskers are outliers. The y-axis ranges from 0.0 to 1.0. Visually, some models, particularly those involving the IHT resampling technique (e.g., KNN IHT, SR IHT, SVM IHT), show median sensitivities around 0.9 or higher, with relatively compact boxes, indicating consistent high performance. Other combinations, such as KNN SMOTE or SR SMOTE, show much lower median sensitivities (around 0.1-0.2) and wider spreads, indicating poorer and more variable performance in correctly identifying no-shows.
  • Variability and stability of models: The spread of each boxplot (the height of the box and the length of the whiskers) indicates the stability or variability of the sensitivity metric for that particular model combination across multiple validation runs (the text mentions 100 replicates). For instance, the SVM IHT model shows a very tight box with its median line near the top, and very short whiskers, suggesting its high sensitivity is quite stable. In contrast, the SR RUS model shows a very wide box and long whiskers, with its median around 0.5, indicating highly variable sensitivity results across different runs.
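A boxplot of this kind can be reproduced from per-replicate sensitivity values with matplotlib, as in the sketch below; the values and model list are placeholders chosen only to mimic the pattern described above, not the paper's results.

```python
# Sketch of a sensitivity boxplot like Fig. 2, built from per-replicate values
# (placeholder data; the paper's actual results are summarized in Tables 1-4).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
models = ["KNN SMOTE", "KNN IHT", "SR SMOTE", "SR IHT", "SVM SMOTE", "SVM IHT"]
# 100 simulated sensitivity values per model, just to illustrate the plot
data = [rng.normal(m, s, 100).clip(0, 1)
        for m, s in [(0.15, 0.05), (0.90, 0.04), (0.20, 0.10),
                     (0.95, 0.03), (0.40, 0.08), (0.94, 0.03)]]

plt.boxplot(data)
plt.xticks(range(1, len(models) + 1), models, rotation=45, ha="right")
plt.ylabel("Sensitivity")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
```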
Scientific Validity
  • ✅ Appropriate visualization for distributional data: The use of boxplots is a statistically appropriate method for visualizing and comparing the distributions of a performance metric like sensitivity, especially when derived from multiple replicates or cross-validation folds. It effectively shows median performance, variability (IQR and range), and potential outliers.
  • ✅ Supports claims about model stability: The figure strongly supports the reference text's claim of verifying the stability of prediction models. The spread of each box (IQR) and the extent of the whiskers directly visualize the consistency (or lack thereof) of the sensitivity metric for each model configuration over the 100 replicates performed on the validation set.
  • ✅ Facilitates comparison of techniques: The figure allows for a clear comparison of the effectiveness of different resampling techniques (SMOTE, RUS, NM, IHT) when combined with various classification algorithms (KNN, SR, SVM). The visual evidence suggests that IHT consistently leads to higher and often more stable sensitivity across all three algorithms for this dataset.
  • 💡 Visual comparison without explicit statistical tests: While the boxplots themselves are informative, the figure does not inherently include statistical significance tests between the different model combinations. Such tests, if reported in the text, would complement the visual findings by quantifying the differences in sensitivity distributions.
  • ✅ Representation of outliers: The presence of outliers (e.g., for SR SMOTE, SR RUS, SVM RUS) is well-represented, providing a complete picture of the performance distribution, including atypical results from some validation runs.
Communication
  • ✅ Appropriate chart type: The use of a boxplot is appropriate for visually comparing the distributions of the sensitivity metric across multiple model configurations. It effectively conveys central tendency, spread, and outliers for each group.
  • ✅ Clear labeling: The y-axis is clearly labeled "Sensitivity" with a sensible range (0.0 to 1.0). The x-axis labels, while abbreviations (e.g., KNN SMOTE, SR IHT), are consistent with the terminology used in the paper and represent combinations of classification algorithms (KNN, SR, SVM) and resampling techniques (SMOTE, RUS, NM, IHT).
  • ✅ Visual design and clarity: The figure is relatively clean and uncluttered. The different shades used for the boxes help to distinguish them, although a legend clarifying if shades correspond to algorithm type or resampling technique systematically would be a minor improvement if not already obvious from context. The horizontal gridlines aid in reading sensitivity values.
  • 💡 X-axis label density: The x-axis labels are quite dense. Consider rotating the labels by 45 degrees or arranging them in a staggered two-line format if possible to improve readability and prevent potential overlap, especially if more models were to be added. Alternatively, grouping by primary algorithm (KNN, SR, SVM) and then by resampling technique might make comparisons within algorithm types easier.
  • ✅ Informative caption: The caption is informative, stating the metric (sensitivity), the dataset portion (validation set), and the scope (all prediction models). It aligns well with the visual content.
Fig. 3 Features selected by top models, occurrence frequency in 100 test set replicates
First Reference in Text
Figure 3 displays the most frequently selected features from the test set for combinations of KNN, SR, and SVM classification algorithms with the IHT resampling technique.
Description
  • Feature selection frequencies for different models: The figure is a table listing various 'Features' (input variables for predictive models, e.g., 'Month december', 'Age') in the first column. The subsequent three columns, labeled 'KNN IHT', 'SR IHT', and 'SVM IHT', represent different machine learning model configurations (K-Nearest Neighbors, Symbolic Regression, Support Vector Machine, all combined with the Instance Hardness Threshold resampling technique). The numerical values within these columns indicate the 'Frequency of occurrence' – specifically, how many times out of 100 test set replicates each feature was selected as important by that particular model. For instance, 'Month december' and 'Spring season' were selected in all 100 replicates (frequency 100) by all three listed models. 'Distance to the clinic' was selected 41 times by KNN IHT, 90 times by SR IHT, and 57 times by SVM IHT. 'Previous no-show in appointments' was selected 30 times by KNN IHT, 55 times by SR IHT, and 21 times by SVM IHT. The table includes a long list of features with varying selection frequencies, some as low as 3 (e.g., 'Cancer record' for KNN IHT) or 0 (e.g., 'Complete high school' for SVM IHT).
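Tallying such selection frequencies is straightforward once each replicate's selected feature list is stored, as in the sketch below; the feature names and selections shown are placeholders, not the paper's outputs.

```python
# Sketch of how feature-selection frequencies like those in Fig. 3 can be tallied
# across replicates (placeholder feature names and selections).
from collections import Counter

# Suppose each replicate yields the list of features selected by the wrapper step.
replicate_selections = [
    ["Month december", "Spring season", "Distance to the clinic"],
    ["Month december", "Spring season", "Previous no-show in appointments"],
    ["Month december", "Spring season", "Distance to the clinic"],
    # ... one list per replicate, 100 in total
]

frequency = Counter(f for selected in replicate_selections for f in selected)
for feature, count in frequency.most_common():
    print(f"{feature}: selected in {count} replicates")
```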
Scientific Validity
  • ✅ Valid approach for assessing feature importance and stability: Presenting the frequency of feature selection across multiple replicates is a valid method to assess the stability and relative importance of features for different models. It helps identify features that are consistently chosen as predictive.
  • ✅ Supports claims about frequently selected features: The figure clearly supports the reference text by showing which features were most frequently selected by the specified model combinations (KNN, SR, SVM with IHT). The data allows for identification of features consistently ranked high by these 'top models'.
  • ✅ Highlights algorithm-specific feature preferences: The table reveals interesting differences in feature selection patterns across the three algorithms, even when using the same resampling technique (IHT). For example, 'Distance to the clinic' has a much higher selection frequency for SR IHT (90) compared to KNN IHT (41), suggesting that the SR algorithm might find this feature more consistently useful. This provides insight into model-specific feature dependencies.
  • ✅ Comprehensive representation of selection frequencies: The inclusion of features with low selection frequencies (e.g., 3, 4, 5 occurrences out of 100) provides a comprehensive view but also indicates that many features are not consistently selected. This is an honest representation of the feature selection process outcomes.
  • ✅ Sufficient number of replicates: The basis of 100 replicates for determining selection frequency is a good number to provide a stable estimate of feature importance.
  • 💡 Scope of information presented: While the figure shows which features were selected and how often, it doesn't inherently explain why certain features were selected more by one algorithm than another, nor the directionality of their impact. This level of detail would typically be discussed in the text.
Communication
  • ✅ Appropriate format and clear headers: The tabular format is appropriate for presenting the frequency of selected features across different models. Column headers are clear.
  • ✅ Informative caption: The caption clearly explains what the figure represents: the frequency of features selected by top models (specified as KNN, SR, SVM with IHT in the reference text) in 100 test set replicates.
  • 💡 Feature ordering and readability: The list of features is quite extensive. To improve readability and immediate understanding of the most critical features, consider sorting the features based on their average frequency across all models, or by their frequency in one reference model (e.g., SR IHT, which seems to select 'Distance to the clinic' most often among the top few). Alternatively, grouping features by category (e.g., demographic, appointment-specific) if applicable, could also aid interpretation.
  • 💡 Enhance visual comparison: While numerical frequencies are provided, adding horizontal bars (like an embedded bar chart) next to each frequency value could offer a more immediate visual comparison of feature importance across the three models for each specific feature. This would make it easier to spot differences in selection patterns.
  • ✅ Descriptive feature names: The feature names are descriptive. Some are long and wrap lines, which is acceptable for clarity, but ensure consistent formatting throughout the table.
Table 1 Average predictive performance and standard deviations obtained from 100 replicates of dataset 1's test portion
First Reference in Text
The processing of Dataset 1 following the framework steps in Fig. 1 led to the results reported in Tables 1 (test set) and 2 (validation set).
Description
  • Overview of model performance evaluation: The table presents the average predictive performance of various machine learning models on the 'test portion' of 'dataset 1', based on 100 repeated experiments (replicates). Performance is evaluated using several standard metrics, with both the mean value and its standard deviation (in parentheses) reported. The models are combinations of a 'Resampling technique' (methods to adjust class distribution in imbalanced datasets) and a 'Classification algorithm'.
  • Techniques and algorithms tested: Four resampling techniques are compared: SMOTE (Synthetic Minority Oversampling Technique, which creates new minority class instances), RUS (Random Under-Sampling, which removes majority class instances), NM (NearMiss, an undersampling technique that selects majority class instances based on distance to minority class instances), and IHT (Instance Hardness Threshold, an undersampling technique that removes instances likely to be misclassified). Three classification algorithms are used: KNN (K-Nearest Neighbors, which classifies based on the majority vote of its closest neighbors), SVM (Support Vector Machine, which finds an optimal hyperplane to separate classes), and SR (Symbolic Regression, which evolves mathematical expressions to fit data).
  • Performance metrics reported: The metrics include AUC (Area Under the ROC Curve, a measure of overall model distinguishability, higher is better), Sensitivity (True Positive Rate, ability to correctly identify no-shows), Specificity (True Negative Rate, ability to correctly identify shows), NPV (Negative Predictive Value, probability that a predicted show is a true show), PPV (Positive Predictive Value, probability that a predicted no-show is a true no-show), F1-score (harmonic mean of precision and sensitivity, useful for imbalanced classes), and Accuracy (overall proportion of correct classifications). A worked sketch of how these metrics follow from a confusion matrix appears after this description.
  • Key performance highlights (IHT models): Models using the IHT resampling technique consistently show the highest performance across most key metrics, particularly AUC and Sensitivity. For example, KNN with IHT achieved an AUC of 0.9087 (0.032) and Sensitivity of 0.9122 (0.052). SR with IHT achieved the highest Sensitivity of 0.9582 (0.040) and NPV of 0.9728 (0.025). SVM with IHT also performed well, with an AUC of 0.9017 (0.027) and Sensitivity of 0.9447 (0.042).
  • Performance of other resampling techniques: In contrast, other resampling techniques like SMOTE, RUS, and NM generally resulted in lower performance, especially for Sensitivity when combined with certain algorithms. For instance, SR with SMOTE had a Sensitivity of 0.5034 (0.250) and SR with RUS had 0.5090 (0.416). The highest PPV was achieved by SR with SMOTE at 0.9056 (0.147). The highest Specificity was also for SR with SMOTE at 0.9076 (0.186).
  • Indication of model stability via standard deviations: The standard deviations reported alongside the means indicate the variability of the performance across the 100 replicates. Smaller standard deviations suggest more stable and reliable model performance. For example, for the IHT models, standard deviations for Sensitivity are generally low (e.g., 0.040 for SR IHT), indicating consistent performance.
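For reference, the sketch below shows how the metrics in this table follow from a single confusion matrix, with "no-show" coded as 1; the counts are illustrative, and in the paper each metric is additionally averaged over 100 replicates.

```python
# How the metrics reported in Tables 1-4 follow from a confusion matrix
# (illustrative values; 1 = no-show is the positive class).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                      # true positive rate
specificity = tn / (tn + fp)                      # true negative rate
ppv = tp / (tp + fp)                              # positive predictive value (precision)
npv = tn / (tn + fn)                              # negative predictive value
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(sensitivity, specificity, ppv, npv, f1, accuracy)
```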
Scientific Validity
  • ✅ Robust performance assessment: Reporting both mean and standard deviation from 100 replicates provides a robust assessment of model performance and stability. This is a strong methodological choice.
  • ✅ Comprehensive comparison of methods: The table comprehensively evaluates various combinations of classification algorithms and resampling techniques using a wide array of standard performance metrics. This allows for a thorough comparison and identification of superior approaches for the given dataset and problem.
  • ✅ Highlights effective techniques: The results clearly demonstrate the impact of different resampling techniques, with IHT showing marked superiority, especially in terms of sensitivity and AUC for this 'test portion' of dataset 1. This provides valuable insights into handling imbalanced data for no-show prediction.
  • 💡 Clarity of 'test portion' context within the framework: The term 'test portion' (from caption) and 'test set' (from reference text) should ideally correspond to the 'Test' set generated in Step 4 of the framework (Fig. 1), which is used for tuning/selecting the best feature subset and model parameters within the calibration phase. Table 2 is then the 'validation set' from Step 2 of Fig. 1, used for final unbiased evaluation. Ensuring this distinction is clear in the manuscript is important for interpreting the overall evaluation strategy.
  • ✅ Appropriate choice of performance metrics: The choice of metrics, including AUC, sensitivity, specificity, PPV, NPV, and F1-score, is appropriate for a classification problem with potentially imbalanced classes, providing a multi-faceted view of performance beyond simple accuracy.
  • ✅ Strong support for high sensitivity with IHT: The data strongly supports the claim that IHT combined with SR or SVM can achieve high sensitivity (above 0.94), which is crucial for minimizing missed no-show predictions. The PPV values, while lower than sensitivity for IHT models (around 0.80-0.86), are still reasonably high, indicating that a good proportion of predicted no-shows are actual no-shows.
Communication
  • ✅ Clear structure and organization: The table is well-structured with clear rows for resampling technique/classification algorithm combinations and columns for standard performance metrics. This organization facilitates comparison across different models.
  • ✅ Informative caption: The caption is informative, clearly stating the content (average predictive performance and standard deviations), the source of the data (100 replicates of dataset 1's test portion), and the metrics presented.
  • ✅ Effective use of bolding for best results: Using bold text to highlight the best performing value for each metric is an effective visual cue that helps the reader quickly identify top-performing models for specific criteria.
  • ✅ Concise use of abbreviations: Abbreviations for algorithms (KNN, SVM, SR) and resampling techniques (SMOTE, RUS, NM, IHT) are standard and make the table concise. It is assumed these are defined in the main text.
  • 💡 Information density: The table is quite dense with information. While comprehensive, consider if a summary figure or a smaller table highlighting only the IHT results (which appear superior) for key metrics could be provided in the main text or supplement for quicker assimilation of the primary findings, with this full table providing the exhaustive details.
  • ✅ Consistent decimal precision: The number of decimal places (typically three or four) is consistent and appropriate for these types of performance metrics.
Table 2 Average predictive performance and standard deviations obtained from 100 replicates of dataset 1's validation portion
First Reference in Text
The processing of Dataset 1 following the framework steps in Fig. 1 led to the results reported in Tables 1 (test set) and 2 (validation set).
Description
  • Overview of model performance on validation set: This table displays the average predictive performance and standard deviations of various machine learning models, this time evaluated on the 'validation portion' of 'dataset 1', derived from 100 replicates. The structure mirrors Table 1, showing combinations of 'Resampling technique' and 'Classification algorithm'.
  • Techniques, algorithms, and metrics: The same resampling techniques (SMOTE, RUS, NM, IHT) and classification algorithms (KNN, SVM, SR) as in Table 1 are evaluated. The performance metrics reported are also the same: AUC (Area Under the ROC Curve), Sensitivity (True Positive Rate), Specificity (True Negative Rate), NPV (Negative Predictive Value), PPV (Positive Predictive Value), F1-score, and Accuracy.
  • Key performance highlights on validation set: On this validation set, SR with IHT achieves the highest Sensitivity at 0.9537 (0.042) and the highest NPV at 0.9886 (0.009). KNN with IHT has the highest F1-score at 0.1670 (0.018) and the highest AUC at 0.6302 (0.037). The highest Specificity is achieved by SR with SMOTE at 0.9087 (0.191), which also yields the highest Accuracy at 0.8562 (0.161). The highest PPV is achieved by KNN with RUS at 0.0938 (0.011).
  • Comparison with test set performance (Table 1): Compared to Table 1 (test portion), many metrics show a notable decrease in performance on the validation set, particularly PPV, F1-score, and Accuracy. For instance, the best F1-score in Table 1 was 0.8822 (KNN IHT), while in Table 2 it is 0.1670 (KNN IHT). Similarly, the best PPV in Table 1 was 0.9056 (SR SMOTE), while in Table 2 it is 0.0938 (KNN RUS). Accuracy also drops from a high of 0.9079 (KNN IHT) in Table 1 to 0.8562 (SR SMOTE) in Table 2, with many models showing accuracies around 0.3-0.6. This suggests that the models, while performing well on the balanced test set (from Step 4 of the framework), struggle more on the likely imbalanced validation set (from Step 2).
  • Performance of IHT models on validation set: Despite the overall drop in some metrics, Sensitivity remains high for IHT models, with SR IHT (0.9537), SVM IHT (0.9463), and KNN IHT (0.8998) still demonstrating a strong ability to identify no-shows. NPV also remains very high for these models (e.g., 0.9886 for SR IHT), indicating that when the model predicts a 'show', it is very likely to be correct.
  • Stability of performance on validation set: The standard deviations, particularly for metrics like PPV and F1-score, are relatively small for most models, suggesting that the low performance in these areas is consistent across the 100 replicates on the validation set.
Scientific Validity
  • ✅ Assessment on a true validation set: Presenting results on a separate validation set (presumably untouched during model training, resampling, and feature selection on the calibration set) is crucial for assessing the true generalization capability of the models. This is a methodologically sound practice.
  • ✅ Robustness from multiple replicates: The use of 100 replicates provides confidence in the stability of the reported average performance metrics on this validation set.
  • ✅ Highlights challenges of class imbalance on validation: The significant drop in metrics like F1-score and PPV compared to Table 1 highlights the challenge of class imbalance when models are applied to data that reflects real-world distributions (assuming the validation set is not resampled). High sensitivity maintained by IHT models is a positive finding, but the low PPV suggests a high number of false positives (shows predicted as no-shows), which has practical implications (e.g., overbooking strategies might affect patients who do show up if based solely on these predictions).
  • 💡 Trade-off between sensitivity and precision: The results demonstrate that while IHT consistently improves sensitivity, the overall predictive power for identifying no-shows with high precision (PPV) is limited on this validation set. This suggests that while the models are good at capturing most no-shows, they also misclassify a significant number of actual shows as no-shows.
  • ✅ Realistic portrayal of model performance limitations: The table clearly shows that even the 'best' models according to metrics like F1-score and Accuracy on this validation set have relatively low absolute values (e.g., F1-score of 0.1670, Accuracy often below 0.5 for many combinations). This underscores the difficulty of the prediction task on this particular dataset's validation portion. The high specificity of SR SMOTE (0.9087) and its corresponding highest accuracy (0.8562) is interesting, suggesting it's very good at identifying true 'shows', but its sensitivity is very low (0.1195), meaning it misses most 'no-shows'.
  • 💡 Contextual discussion of 'best' metrics needed: The choice to bold the highest value for each metric is helpful, but it's important for the authors to discuss the practical significance of these 'best' values, especially when they are low in absolute terms (e.g., PPV). The discussion should focus on the most relevant metrics for the problem (likely sensitivity and perhaps NPV, given the context of minimizing missed no-shows).
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent and clear structure, similar to Table 1, which aids in comparing results between the test and validation sets. Rows define model combinations and columns define performance metrics.
  • ✅ Informative caption: The caption is unambiguous, clearly stating that these results pertain to the 'validation portion' of dataset 1, averaged over 100 replicates, and lists the performance metrics shown.
  • ✅ Effective use of bolding: The use of bold text to highlight the best-performing value for each metric is continued from Table 1, effectively guiding the reader's attention to key results.
  • ✅ Clear metric labeling and data presentation: The metrics are standard and clearly labeled. The presentation of mean (standard deviation) is a good practice for summarizing results from multiple replicates.
  • 💡 Contextual interpretation of performance drop: Given the significant drop in performance for some metrics (e.g., PPV, F1-score, Accuracy) compared to Table 1, especially due to class imbalance in the validation set, the discussion in the text should carefully address these differences and their implications. The table itself presents the data clearly, but the interpretation is crucial.
Table 3 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's test portion
First Reference in Text
Applying the framework steps in Fig. 1 to Dataset 2 led to the results reported in Tables 3 (test set) and 4 (validation set).
Description
  • Overview of model performance on Dataset 2 test portion: This table presents the average predictive performance and standard deviations for various machine learning models applied to the 'test portion' of 'dataset 2', based on 100 experimental replicates. It follows the same format as Table 1, allowing for comparison of the framework's application on a different dataset.
  • Techniques, algorithms, and metrics evaluated: The same set of resampling techniques (SMOTE, RUS, NM, IHT) and classification algorithms (KNN, SVM, SR) are evaluated. Performance is measured using the same metrics: AUC (Area Under the ROC Curve, indicating how well the model can distinguish between classes), Sensitivity (True Positive Rate, or the proportion of actual no-shows correctly identified), Specificity (True Negative Rate, or the proportion of actual shows correctly identified), NPV (Negative Predictive Value, the probability that a patient predicted to show up actually does), PPV (Positive Predictive Value, the probability that a patient predicted to no-show actually does not), F1-score (a balance between precision/PPV and sensitivity), and Accuracy (overall correctness of predictions).
  • Key performance highlights (IHT models on Dataset 2): For Dataset 2's test portion, models using the IHT resampling technique again generally outperform others, particularly in AUC and Sensitivity. SR with IHT achieves the highest AUC (0.9429 ± 0.048), Sensitivity (0.9425 ± 0.094), NPV (0.9570 ± 0.057), and F1-score (0.9349 ± 0.065), and ties for the highest Accuracy (0.9429 ± 0.045). KNN with IHT also shows strong performance with an AUC of 0.9399 (0.021) and Sensitivity of 0.9396 (0.038).
  • Performance highlights for SVM with IHT: SVM with IHT achieved the highest Specificity (0.9505 ± 0.037) and the highest PPV (0.9397 ± 0.040). This suggests that for Dataset 2's test set, SVM with IHT was particularly good at correctly identifying patients who would show up and ensuring that predicted no-shows were indeed no-shows.
  • Comparison with Dataset 1 test portion performance: Compared to Dataset 1 (Table 1), the performance on Dataset 2's test portion appears generally higher and more balanced across metrics for the IHT models. For instance, the F1-scores for IHT models are consistently above 0.92 for Dataset 2, whereas for Dataset 1 they were around 0.86-0.88. PPV values for IHT models are also higher for Dataset 2 (above 0.93) compared to Dataset 1 (around 0.80-0.86).
  • Stability of IHT models on Dataset 2: The standard deviations for the IHT models are generally low, indicating stable performance across the 100 replicates. For example, SR with IHT has a standard deviation of 0.094 for Sensitivity, and SVM with IHT has 0.037 for Specificity.
Scientific Validity
  • ✅ Testing framework on a second dataset: The application of the same analytical framework and evaluation methodology (100 replicates, comprehensive metrics) to a second dataset (Dataset 2) is a strong point, as it allows for an assessment of the framework's generalizability and the robustness of the findings regarding technique performance (e.g., IHT).
  • ✅ Strong performance of IHT on Dataset 2: The reported results, particularly the high performance of IHT-based models across multiple metrics (AUC, Sensitivity, F1-score, PPV often >0.93), suggest that the proposed methods are effective for this dataset's test portion. The consistency of IHT's strong performance across two datasets (comparing qualitatively with Table 1) strengthens the claims about its utility.
  • ✅ Balanced view through multiple metrics: The comprehensive set of metrics provides a balanced view of model performance. For example, while SR IHT has the highest sensitivity, SVM IHT has the highest specificity and PPV, allowing for nuanced model selection based on specific operational priorities (e.g., minimizing false alarms vs. capturing all no-shows).
  • ✅ Proper distinction of evaluation sets: The distinction between 'test set' (Table 3) and 'validation set' (Table 4, to be reviewed next) is maintained, which is critical for proper model evaluation. Table 3 results likely reflect performance on data used for internal tuning/model selection within the calibration phase of the outer cross-validation loop.
  • ✅ Realistic variation in algorithm performance: The variability in which specific algorithm (KNN, SR, or SVM) performs best for a given metric, even within the IHT group, suggests that there isn't a single universally superior algorithm, and the choice might depend on the dataset characteristics or the specific metric being optimized. This is a realistic outcome in machine learning.
  • ✅ Indication of stable high performance: The standard deviations are generally low for the top-performing IHT models, indicating that the high performance observed is consistent across the replicates for this test portion.
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent and clear structure, identical to Tables 1 and 2, facilitating comparison across datasets and model evaluation stages. Rows for model combinations and columns for performance metrics are well-defined.
  • ✅ Informative caption: The caption is precise, indicating the dataset (Dataset 2), the portion (test portion), the basis of results (100 replicates), and the nature of the data (average predictive performance and standard deviations).
  • ✅ Effective use of bolding: The use of bold text to highlight the best-performing value for each metric continues to be an effective visual aid for quickly identifying top results.
  • ✅ Concise abbreviations and clear labels: Standard abbreviations for algorithms and resampling techniques are used, maintaining conciseness. Performance metrics are clearly labeled.
  • ✅ Comprehensive and consistent presentation: The table is comprehensive and provides detailed results. The consistency in presentation with previous tables is a strength.
Table 4 Average predictive performance and standard deviations obtained from...
Full Caption

Table 4 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's validation portion

Figure/Table Image (Page 11)
Table 4 Average predictive performance and standard deviations obtained from 100 replicates of dataset 2's validation portion
First Reference in Text
Applying the framework steps in Fig. 1 to Dataset 2 led to the results reported in Tables 3 (test set) and 4 (validation set).
Description
  • Overview of model performance on Dataset 2 validation portion: This table presents the average predictive performance and standard deviations for various machine learning models, evaluated on the 'validation portion' of 'dataset 2'. The results are based on 100 experimental replicates. The structure is identical to previous performance tables, allowing for comparisons.
  • Techniques, algorithms, and metrics evaluated: The same set of resampling techniques (SMOTE: Synthetic Minority Oversampling Technique, RUS: Random Under-Sampling, NM: NearMiss, IHT: Instance Hardness Threshold) and classification algorithms (KNN: K-Nearest Neighbors, SVM: Support Vector Machine, SR: Symbolic Regression) are evaluated. Performance is measured using the same metrics: AUC (Area Under the ROC Curve, a measure of a model's ability to distinguish between classes), Sensitivity (True Positive Rate, the proportion of actual no-shows correctly identified), Specificity (True Negative Rate, the proportion of actual shows correctly identified), NPV (Negative Predictive Value, the probability that a patient predicted to show up actually attends), PPV (Positive Predictive Value, the probability that a patient predicted to miss the appointment actually misses it), F1-score (the harmonic mean of PPV and sensitivity), and Accuracy (overall proportion of correct classifications). A minimal sketch of how these metrics are computed follows this list.
  • Key performance highlights (IHT models): On this validation set for Dataset 2, SR with IHT achieves the highest Sensitivity (0.9434 ± 0.087), AUC (0.7734 ± 0.038), and NPV (0.9802 ± 0.020). KNN with IHT also shows high Sensitivity (0.9418 ± 0.035).
  • Performance highlights for SMOTE models: The highest Specificity is achieved by KNN with SMOTE (0.8864 ± 0.086), which also yields the highest Accuracy (0.8194 ± 0.055) and the highest PPV (0.5592 ± 0.115). SVM with SMOTE yields the highest F1-score (0.5327 ± 0.073).
  • Comparison with Dataset 2 test portion performance (Table 3): Compared to the test portion results for Dataset 2 (Table 3), there is a noticeable drop in performance for several metrics on this validation set, particularly PPV, F1-score, and Accuracy, especially for the IHT models. For instance, the best PPV for IHT models in Table 3 was above 0.93, whereas in Table 4, the best PPV for IHT models is around 0.34-0.37. Similarly, F1-scores for IHT models were above 0.92 in Table 3 but are around 0.50-0.52 in Table 4. This suggests that while models performed well on the balanced test set, their performance degrades on the (likely imbalanced) validation set, especially in terms of precision-related metrics.
  • Maintained high Sensitivity and NPV for IHT models: Despite the drop in some metrics, Sensitivity for the IHT models remains very high (around 0.91-0.94), indicating they are still effective at identifying most of the no-show instances. NPV also remains high for IHT models (e.g., 0.9802 for SR IHT).
  • Stability of performance on validation set: The standard deviations are generally low to moderate, indicating a degree of consistency in the performance across the 100 replicates on this validation set.
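For readers unfamiliar with these metrics, the sketch below shows one way to compute them from a binary confusion matrix and predicted scores using scikit-learn. The variable names (y_true, y_pred, y_score) are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: computing the metrics reported in Tables 1-4 from a model's
# predictions, assuming y_true, y_pred, y_score are available (class 1 = no-show).
from sklearn.metrics import confusion_matrix, roc_auc_score

def performance_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # precision for the no-show class
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    auc = roc_auc_score(y_true, y_score)  # uses predicted scores, not hard labels
    return dict(AUC=auc, Sensitivity=sensitivity, Specificity=specificity,
                NPV=npv, PPV=ppv, F1=f1, Accuracy=accuracy)
```

The averages and standard deviations reported in the table would then be obtained by repeating such a computation over the 100 replicates.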
Scientific Validity
  • ✅ Assessment on a true validation set: Evaluating models on a separate validation set, untouched during the main model training and selection phases (which used the 'test set' from Table 3, itself derived from the calibration portion), is a critical and methodologically sound step for assessing true generalization performance.
  • ✅ Robustness from multiple replicates: The use of 100 replicates provides a reliable basis for the reported average performance metrics and their standard deviations, lending credibility to the stability of these findings on the validation set.
  • ✅ Realistic assessment of generalization challenges: The results highlight the practical challenges of applying models trained/tuned on balanced data (implied for the 'test set' in Table 3 due to resampling in Step 3 of Fig. 1) to real-world, likely imbalanced validation data. The drop in precision-related metrics (PPV, F1-score) for IHT models, despite high sensitivity, is a key finding. This indicates that while most no-shows are caught, many actual shows might be misclassified as no-shows.
  • 💡 Divergent best-performing techniques on validation set: The superior performance of SMOTE-based models (KNN SMOTE for Specificity, PPV, Accuracy; SVM SMOTE for F1-score) on this validation set is noteworthy and contrasts with the general dominance of IHT on the test set (Table 3) and for Dataset 1. This suggests that the optimal resampling strategy can be dataset-dependent and also depends on the specific characteristics of the validation data distribution. The text discussion should explore this.
  • ✅ High sensitivity of IHT maintained, but with low PPV: The high sensitivity maintained by IHT models (SR IHT: 0.9434) is a significant positive, as correctly identifying no-shows is often a primary goal. However, the corresponding PPV for SR IHT is 0.3661, meaning that of all patients this model flags as no-shows, only about 36.6% actually fail to attend. This trade-off needs careful consideration in practical application; the sketch after this list shows how the class prevalence of the validation set largely accounts for the gap.
  • ✅ Comprehensive data supporting complexity of prediction task: The table provides a comprehensive picture, allowing for comparison across all tested combinations. The data supports the conclusion that predicting no-shows is a complex task, and performance can vary significantly depending on the dataset characteristics and the evaluation set used.
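The drop in PPV on the imbalanced validation set follows directly from Bayes' rule: at fixed sensitivity and specificity, PPV falls as the prevalence of no-shows falls. The sketch below reproduces the reported order of magnitude using the table's sensitivity (0.9434) and Dataset 2's no-show rate (19.03%); the specificity of 0.62 is an assumed value chosen only to make the arithmetic concrete, not a figure taken from the table.

```python
# Minimal sketch: PPV as a function of sensitivity, specificity, and prevalence.
# Sensitivity (0.9434) and prevalence (0.1903) come from the paper; the
# specificity value is an assumption for illustration only.
def ppv(sensitivity, specificity, prevalence):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

print(ppv(0.9434, 0.62, 0.1903))  # ~0.37: at 19% prevalence, most flagged
                                  # patients are false alarms despite high sensitivity
print(ppv(0.9434, 0.62, 0.5))     # ~0.71: the same classifier looks far more
                                  # precise when classes are balanced
```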
Communication
  • ✅ Consistent and clear structure: The table maintains a consistent structure with previous performance tables (Tables 1-3), which is excellent for comparability. Clear row and column labeling is maintained.
  • ✅ Informative caption: The caption is precise, clearly identifying the dataset (Dataset 2), the specific portion (validation portion), the number of replicates (100), and the nature of the presented data (average predictive performance and standard deviations).
  • ✅ Effective use of bolding: The continued use of bold text to highlight the best-performing value for each metric is effective in guiding the reader to the top results for each criterion.
  • ✅ Standard metrics and consistent abbreviations: The performance metrics are standard and well-understood in the field. Abbreviations for models are consistently used.
  • 💡 Contextual interpretation of performance changes: Similar to Table 2, the performance on the validation set shows a drop in some metrics (e.g., PPV, Accuracy) compared to the test set (Table 3). The table clearly presents this, and the discussion in the main text should adequately address the reasons and implications, especially concerning class imbalance.
Fig. 4 Boxplot of sensitivity results in the validation set for all prediction...
Full Caption

Fig. 4 Boxplot of sensitivity results in the validation set for all prediction models

Figure/Table Image (Page 12)
Fig. 4 Boxplot of sensitivity results in the validation set for all prediction models
First Reference in Text
Classification algorithms KNN, SR, and SVM,
Description
  • Distribution of sensitivity scores for different models: The figure is a boxplot chart illustrating the 'sensitivity' results for various prediction models when applied to the 'validation set' (presumably of Dataset 2, given the sequence of figures). Sensitivity measures the model's ability to correctly identify true positive cases (i.e., actual no-shows); a sensitivity of 1.0 means all no-shows were correctly predicted. Each boxplot represents a specific combination of a classification algorithm (KNN: K-Nearest Neighbors; SR: Symbolic Regression; SVM: Support Vector Machine) and a resampling technique (SMOTE, RUS, NM, IHT, methods for balancing imbalanced datasets). For each box: the line inside marks the median sensitivity; the box itself covers the interquartile range (IQR, the middle 50% of the data); the 'whiskers' extend to show the data range excluding outliers; and individual points are outliers. The y-axis ranges from 0.2 to 1.0. A minimal plotting sketch reproducing this layout follows this list.
  • High performance of IHT models: Visually, models incorporating the IHT resampling technique (KNN IHT, SR IHT, SVM IHT) consistently show the highest median sensitivities, all appearing to be above or around 0.9. Their boxes are also relatively compact, particularly for KNN IHT and SR IHT, suggesting stable high performance in identifying no-shows. SVM IHT shows a slightly wider box but still a high median.
  • Varied performance of non-IHT models: Other model combinations exhibit more varied and generally lower median sensitivities. For example, KNN SMOTE has a median around 0.5, while SR RUS has a median around 0.7 but with a very wide spread and several outliers, indicating high variability. SVM NM shows a median sensitivity around 0.8 but also with a noticeable spread. Some models, like SR SMOTE, show medians around 0.6 with outliers extending to lower values.
  • Indication of model stability: The spread of each boxplot (height of the box and length of whiskers) indicates the stability of the sensitivity metric for that model across the 100 replicates mentioned in similar contexts (e.g., Table 4). Models like KNN IHT and SR IHT show relatively tight distributions, suggesting more consistent sensitivity. In contrast, SR RUS shows a very wide distribution, indicating less stable sensitivity results.
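A figure of this kind can be produced from the per-replicate sensitivity values with pandas and matplotlib, as in the minimal sketch below. The DataFrame layout and the placeholder numbers are assumptions for illustration; real values would be the 100 sensitivity results per model combination.

```python
# Minimal sketch of a Fig. 4-style boxplot. The DataFrame holds one column of
# per-replicate sensitivity values per model combination; the numbers below
# are placeholders, not values from the study.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
results = pd.DataFrame({
    "KNN IHT": np.clip(rng.normal(0.94, 0.03, 100), 0, 1),
    "SR IHT": np.clip(rng.normal(0.94, 0.05, 100), 0, 1),
    "SR RUS": np.clip(rng.normal(0.70, 0.10, 100), 0, 1),
})

ax = results.boxplot(figsize=(8, 4))   # one box per model combination
ax.set_ylabel("Sensitivity")
ax.set_ylim(0.2, 1.0)
plt.xticks(rotation=45, ha="right")    # keeps dense x-axis labels readable
plt.tight_layout()
plt.show()
```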
Scientific Validity
  • ✅ Appropriate visualization method: The use of boxplots to compare the distributions of sensitivity scores from multiple model replicates is a statistically sound and appropriate visualization method. It effectively communicates central tendency, dispersion, and the presence of outliers for each model configuration.
  • ✅ Supports textual claims on model performance and stability: The figure provides strong visual evidence for the claims made in the text (page 9) about the stability and performance of different models, particularly highlighting the superior and more stable sensitivity of IHT-based models on the validation set of Dataset 2.
  • ✅ Effective comparison of techniques and algorithms: The visualization clearly demonstrates the impact of different resampling techniques and classification algorithms on sensitivity. The consistent high performance of IHT across KNN, SR, and SVM algorithms for this metric is evident.
  • ✅ Robust view from multiple replicates: The figure, by showing the full distribution from 100 replicates, provides a more robust view of performance than a single point estimate. This is good scientific practice.
  • 💡 Visual comparison without explicit statistical tests: While visually compelling, the figure itself does not provide formal statistical tests of significance between the different model groups, and none appear to be reported in the accompanying text. Including such tests would strengthen the conclusions drawn from these visual comparisons; a minimal sketch of one possible comparison follows this list.
  • 💡 Incomplete reference text: The reference text provided ("Classification algorithms KNN, SR, and SVM,") is very brief and only partially describes the content. The figure actually shows these algorithms in combination with various resampling techniques. The caption is more complete.
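One way to complement the visual comparison is a non-parametric test on the per-replicate sensitivity distributions of two model configurations, as sketched below with SciPy. The arrays are placeholders, and because cross-validation replicates are not fully independent, the resulting p-values should be read as indicative rather than exact.

```python
# Minimal sketch: a non-parametric comparison of two models' per-replicate
# sensitivity distributions (e.g., SR IHT vs. SVM NM). The arrays here are
# placeholders; real values would be the 100 sensitivity results per model.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
sens_sr_iht = np.clip(rng.normal(0.94, 0.05, 100), 0, 1)
sens_svm_nm = np.clip(rng.normal(0.80, 0.08, 100), 0, 1)

stat, p_value = mannwhitneyu(sens_sr_iht, sens_svm_nm, alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
# Caveat: replicates drawn from repeated cross-validation are not fully
# independent, so treat these p-values as indicative only.
```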
Communication
  • ✅ Appropriate chart type: The boxplot is an appropriate choice for visualizing the distribution of sensitivity scores across different model configurations, allowing for easy comparison of median performance, spread, and outliers.
  • ✅ Clear labeling: The y-axis is clearly labeled "Sensitivity" and spans a relevant range (0.2 to 1.0). The x-axis labels, representing model combinations (e.g., KNN SMOTE, SR IHT), are consistent with previous figures and tables, though they are abbreviations.
  • ✅ Good visual design: The visual design is clean, with distinct boxes and clear differentiation between medians, quartiles, whiskers, and outliers. The use of color and horizontal gridlines aids readability.
  • 💡 X-axis label density: The x-axis labels are dense. Similar to Fig. 2, consider rotating the labels or using a staggered layout to improve readability and prevent any potential overlap, especially if more models were included.
  • ✅ Informative caption: The caption clearly states the metric (sensitivity), the dataset portion (validation set), and the scope (all prediction models). This aligns well with the visual content presented.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Supplementary Table S1. No-show modeling approaches in the literature.
Figure/Table Image (Page 15)