Hospital Length-of-Stay Prediction Using Machine Learning Algorithms – A Literature Review

Guilherme Almeida, Fernanda Brito Correia, Ana Rosa Borges, Jorge Bernardino
Applied Sciences
Coimbra Institute of Engineering-ISEC, Polytechnic University of Coimbra

Overall Summary

Study Background and Main Findings

This paper presents a literature review focused on the application of machine learning (ML) algorithms for predicting hospital length of stay (LoS), a critical factor for efficient hospital management and patient care. The primary objective is to identify the most effective ML algorithms for building LoS predictive models by analyzing existing research concerning model types, performance metrics, dataset characteristics, and ethical considerations.

Methodologically, the review follows Kitchenham's protocol for systematic literature reviews. The authors conducted a bibliographic search primarily using Google Scholar, which initially yielded 604 articles. Through a multi-stage filtering process—based on language (English), publication date (post-2020), relevance of title and abstract to ML for LoS prediction, and study availability—this pool was narrowed to 12 core research articles for in-depth analysis. These selected papers were then scrutinized for the ML algorithms employed, the datasets utilized (e.g., MIMIC-II/III, institutional EHRs), data preprocessing techniques, reported performance metrics (such as Mean Absolute Error (MAE), R-squared (R2), and Accuracy), and discussions of challenges and ethical implications.

Key findings from the reviewed literature indicate that several ML algorithms show promise for LoS prediction. Notably, Neural Networks (NNs) demonstrated high accuracy in some studies (e.g., up to 94.74% accuracy reported in study [29] after optimization) and XGBoost also showed strong performance (e.g., an R2 score of 0.89 in study [38]). However, the review emphasizes that the performance of these algorithms is highly contingent on factors such as the quality and characteristics of the dataset, the extent of data preprocessing (including feature selection and handling missing values), and meticulous hyperparameter tuning. The importance of robust data management frameworks and adherence to ethical principles—particularly patient privacy (e.g., HIPAA, GDPR compliance), data security, and the mitigation of algorithmic bias—was also a recurrent theme.
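
The summary above attributes part of the Neural Network's reported accuracy gain (94.66% rising to 94.74% in study [29]) to grid or random search optimization. As a purely illustrative sketch on synthetic data (not code from any reviewed study), such hyperparameter tuning typically looks like the following with scikit-learn:

```python
# Illustrative only: grid-search tuning of a neural-network classifier on
# synthetic tabular data (a stand-in for an LoS classification task).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 4))
```

Whether such a search yields gains comparable to those reported in [29] depends entirely on the dataset and model; the snippet only illustrates the mechanism behind the tuning the reviewed studies describe.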

The paper concludes that while Neural Networks are a popular and often high-performing choice, no single ML algorithm is universally optimal for all LoS prediction scenarios. The selection of the most suitable algorithm ultimately depends on the specific context, including the nature of the available data, computational resources, the need for model interpretability, and the specific goals of the healthcare application. The review highlights the ongoing need for careful validation and consideration of practical implementation challenges when deploying ML models in clinical settings.

Research Impact and Future Directions

This literature review provides a valuable and comprehensive synthesis of the application of machine learning (ML) algorithms for predicting hospital length of stay (LoS). Its systematic approach, following Kitchenham's methodology, lends credibility to its overview of common algorithms, performance metrics, and prevailing challenges in the field. The paper effectively highlights the potential of advanced models like Neural Networks and XGBoost, while judiciously emphasizing that their performance is highly context-dependent and that practical implementation faces hurdles such as data quality, model interpretability, computational costs, and ethical considerations.

The review's main strength lies in its balanced perspective, discussing not only the technical aspects of ML models but also the critical importance of ethical frameworks, patient privacy, and bias mitigation. It correctly concludes that there is no universally superior algorithm for LoS prediction; rather, the optimal choice depends on specific dataset characteristics, available resources, and the clinical context. This nuanced conclusion is crucial for guiding healthcare practitioners and researchers in selecting and deploying ML solutions responsibly and effectively.

While the review successfully maps the current landscape, it also implicitly underscores the limitations within the primary research itself, such as the heterogeneity in datasets, methodologies, and reporting standards across studies. This makes definitive cross-study comparisons of algorithm effectiveness challenging. The review navigates this by focusing on trends and general capabilities rather than making absolute claims of algorithmic supremacy. The study design of this paper—a systematic literature review—is appropriate for its objective of summarizing existing knowledge and identifying research gaps. It reliably contributes an understanding of the current state-of-the-art, common practices, and challenges in ML for LoS prediction.

Ultimately, the paper serves as a useful guide for understanding the complexities involved in leveraging ML for healthcare efficiency. It underscores that future progress requires not only algorithmic advancements but also a focus on robust data governance, ethical oversight, and strategies for seamless clinical integration. The critical unanswered questions revolve around how to best bridge the gap from predictive accuracy in research settings to tangible, equitable improvements in real-world patient care and hospital management, particularly concerning the interpretability and trustworthiness of complex models.

Critical Analysis and Recommendations

Clear Statement of Purpose and Significance (written-content)
The abstract clearly states the paper's goal of identifying effective ML algorithms for LoS prediction and its significance for hospital management and patient care. This sets a clear context and communicates the research's value proposition effectively from the outset.
Section: Abstract
Comprehensive Scope Indicated (written-content)
The abstract outlines a comprehensive scope, including ML algorithms, metrics, challenges, data quality, and ethical considerations. This holistic approach suggests a thorough treatment of the topic, promising a well-rounded review.
Section: Abstract
Specify Target Audience or Context for Enhanced Relevance (written-content)
While the abstract outlines the paper's scope, explicitly mentioning the primary intended audience or specific contexts (e.g., general hospitals vs. specialized units) could enhance its focus and immediate applicability. This addition would help readers more quickly ascertain the paper's direct relevance to their work.
Section: Abstract
Clear Articulation of Purpose and Scope (written-content)
The introduction clearly establishes the paper's central theme (ML for LoS prediction) and its primary goal (identifying effective ML algorithms). This directness provides immediate clarity for the reader regarding the paper's focus.
Section: Introduction
Comprehensive Roadmap of Paper Content (written-content)
The introduction effectively outlines the paper's structure, informing readers about upcoming discussions on ML algorithms, features, healthcare outcomes, and ethical considerations. This sets clear expectations for the paper's content flow.
Section: Introduction
Explicitly Introduce Research Questions Earlier (written-content)
While the main objective is stated, explicitly formulating or thematically outlining the specific research questions (RQs) within the Introduction itself would sharpen the paper's focus from the outset. Although RQs are detailed later, their early integration would enhance the Introduction's role as a complete preparatory section.
Section: Introduction
Clear Categorization and Workflow Overview (written-content)
The section effectively categorizes ML algorithms and outlines a structured five-step application workflow. This provides a solid conceptual foundation and a clear understanding of the typical ML process for readers.
Section: Machine Learning Algorithms in Healthcare
Diverse Applications with Concrete Impact Example (written-content)
The paper illustrates diverse ML applications in healthcare and substantiates ML's transformative potential with a compelling clinical trial example on sepsis, showing significant reductions in LoS and mortality. This highlights tangible benefits to patient outcomes.
Section: Machine Learning Algorithms in Healthcare
Clarify Random Forest's Interaction with Preprocessing and Invariance (written-content)
The statement that Random Forest's effectiveness is due to its ability to 'capture complex interactions that require intensive preprocessing due to its invariance' is ambiguous. Clarifying the relationship between invariance, complex data, and preprocessing for Random Forest would improve understanding of the algorithm's characteristics, which is important as it's a key algorithm discussed for LoS prediction.
Section: Machine Learning Algorithms in Healthcare
Comprehensive Coverage of Core Ethical Themes (written-content)
The section comprehensively covers core ethical themes pertinent to ML in healthcare, including patient privacy, data management, and fairness/bias mitigation. This breadth ensures key ethical challenges are introduced and contextualized.
Section: Ethical Considerations
Detailed Articulation of Patient Privacy Risks (written-content)
The discussion on patient privacy is robust, detailing specific risks like data breaches and re-identification. This level of detail effectively highlights the complexities in safeguarding sensitive patient data.
Section: Ethical Considerations
Deepen Discussion on Informed Consent and Data Usage Specific to ML (written-content)
The section mentions 'informed consent, and data usage' as key issues but does not develop them with the same depth as other topics. A more dedicated discussion on the nuances of informed consent for ML applications (e.g., data reuse, model interpretability for patients) is warranted, as it's a foundational ethical principle in medicine and its absence represents an underdeveloped ethical dimension.
Section: Ethical Considerations
Systematic and Transparent Review Protocol (written-content)
The explicit adoption and detailed description of Kitchenham's methodology for the literature review provide a strong foundation of rigor and transparency. Clearly outlining the steps enhances the review's credibility by allowing readers to understand the systematic process undertaken.
Section: Materials and Methods
Clear Search and Selection Criteria (written-content)
The paper clearly defines its search strategy, including the search string, exclusion rationale, and multi-stage filtering process. This clarity allows for potential replication and demonstrates a focused approach to literature gathering.
Section: Materials and Methods
Numerical Inconsistency in Literature Review Flowchart (Figure 1) (graphical-figure)
Figure 1, illustrating the literature review process, contains a numerical inconsistency: after the stated exclusion steps, 14 articles should remain, yet the final count is 12. This discrepancy undermines the figure's clarity and the reported accuracy of the selection process, impacting the perceived rigor of the review methodology.
Section: Materials and Methods
Detail the Data Extraction Protocol from Selected Papers (written-content)
The 'Materials and Methods' section does not explicitly detail what specific data items were systematically extracted from each selected paper to answer the review's research questions. Providing an overview of the data extraction form or key data points sought would enhance transparency and methodological rigor, making it clearer how the synthesis in the Discussion is derived.
Section: Materials and Methods
Effective Use of Tabular Summaries for Comparative Data (written-content)
The Results section effectively uses Table 3 to present a comprehensive comparison of algorithms and their reported performance metrics from reviewed studies. This tabular format allows for a concise and accessible overview of diverse results, facilitating comparison.
Section: Results
Address Confusing MAE Values and Missing Units in Table 3 (graphical-figure)
Table 3 lists the MAE for Logistic Regression in study [28] as 198,379,877,732,011.9, a figure that is confusing because it appears to run together several values of vastly different scales without units or context; units are also missing for the other MAE/RMSE values. This ambiguity and omission hinder the interpretability and comparative value of the reported metrics, impacting the clarity of the results.
Section: Results
Address Inconsistency and Relocate Interpretative Text Regarding Study [39] (written-content)
The Results section includes interpretative statements and authorial conclusions regarding study [39] that contradict Table 3 (which lists the metrics for [39] as 'Not Specified'). Such interpretations and discussions of individual study findings, especially with internal inconsistencies, typically belong in the Discussion section, and the factual discrepancy needs to be resolved for credibility. This represents both a structural misplacement and a significant internal inconsistency.
Section: Results
Structured Performance Overview and Synthesis (written-content)
The discussion effectively categorizes and summarizes the performance of various ML algorithms based on the reviewed literature. This provides a clear comparative landscape for understanding their relative effectiveness in LoS prediction.
Section: Discussion
In-depth Critical Analysis of Algorithmic Limitations (written-content)
The section critically analyzes practical limitations of top-performing algorithms (e.g., Neural Networks, XGBoost), discussing data preprocessing demands, tuning sensitivity, computational costs, and interpretability issues. These are crucial considerations for real-world healthcare applications.
Section: Discussion
Direct and Comprehensive Answers to Research Questions (written-content)
The discussion systematically addresses each of the four research questions posed earlier in the paper. This provides clear, synthesized answers based on the literature review, fulfilling a key objective of the paper.
Section: Discussion
Emphasis on Context-Specific Algorithm Selection (written-content)
The discussion highlights that optimal algorithm choice is context-dependent (dataset characteristics, data quality, resources, application context). This promotes a nuanced understanding of ML model deployment rather than a universal solution.
Section: Discussion
Enhance Discussion on Bridging the Gap to Clinical Implementation (written-content)
While the critical analysis highlights challenges like model complexity and interpretability, expanding this to discuss actionable strategies or specific research avenues for bridging the gap to clinical implementation would be impactful. This involves technical, organizational, ethical, and workflow integration considerations, guiding future work more effectively.
Section: Discussion

Section Analysis

Materials and Methods

Non-Text Elements

Figure 1. Literature review process.
Figure/Table Image (Page 8)
Figure 1. Literature review process.
First Reference in Text
Finally, 7 articles were excluded because they were not available, and the search ended with n = 12 articles to be analyzed in this review. Figure 1 illustrates the process used in this literature review.
Description
  • Overall process depiction: The figure is a flowchart illustrating the systematic process of a literature review, detailing how an initial pool of 604 identified articles was narrowed down to 12 final articles for inclusion. This type of diagram is common in systematic reviews and often follows guidelines like PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), which standardizes the reporting of the review process.
  • Identification phase: The process begins with 604 records identified from databases (and 0 from registers). No records were removed for duplicates, by automation tools, or other pre-screening reasons according to the diagram.
  • Screening initiation: In the screening phase, 558 records were screened. This implies 46 records (604-558) were excluded prior to this screening, which the main text indicates were non-English articles.
  • Exclusion steps and numbers: Several exclusion criteria were applied sequentially to the 558 screened records: 243 records were excluded based on 'Language, Publication Date'; 263 records based on 'Title Analysis'; 31 records based on 'Abstract Analysis'; and 7 records because they were 'Not available'.
  • Numerical outcome of exclusions and final count discrepancy: Following the listed exclusions (243 + 263 + 31 + 7 = 544 exclusions), 14 articles (558 - 544 = 14) should mathematically remain. However, the diagram concludes with 12 studies included in the review, indicating a discrepancy of 2 articles that are not accounted for in the exclusion process depicted (a brief arithmetic check of these figures is sketched after this figure's analysis).
  • Zero-value intermediate steps: The flowchart includes several standard PRISMA-style boxes such as 'Reports sought for retrieval', 'Reports not retrieved', and 'Reports assessed for eligibility', all of which are reported with 'n = 0'. Similarly, a section for 'Reports excluded' with generic reasons (Reason 1, 2, 3) also shows 'n = 0' for each.
Scientific Validity
  • ✅ Transparency of selection process: The use of a flowchart to document the article selection process is a methodological strength, promoting transparency and reproducibility of the literature review, consistent with PRISMA guidelines.
  • 💡 Numerical inconsistency in article count: The numbers presented in the flowchart do not reconcile. After applying all stated exclusions (243 for Language/Pub Date, 263 for Title, 31 for Abstract, 7 for Not Available) to the 558 screened articles, 14 articles should remain (558 - 544 = 14). However, the figure states 12 articles were included. This mathematical inconsistency of 2 articles needs to be resolved to ensure the accuracy of the reported methodology. The main text also leads to this same inconsistency (21 articles after abstract analysis, minus 7 unavailable, should leave 14, not 12).
  • 💡 Implausible zero duplicates: The reporting of 'n=0' for 'Duplicate records removed' is highly improbable for a search yielding 604 initial records, especially if multiple sources or comprehensive search strategies were employed. This raises questions about the thoroughness of the deduplication process or its reporting.
  • 💡 Questionable zero values in intermediate PRISMA steps: The intermediate steps 'Reports sought for retrieval (n=0)', 'Reports not retrieved (n=0)', and 'Reports assessed for eligibility (n=0)' showing zero counts are atypical for a systematic review. These stages usually involve non-zero numbers as screened articles are progressed. If these steps were indeed not applicable or resulted in zero, it should be clarified; otherwise, the diagram may not accurately reflect the actual review process detailed in the text (e.g., abstract analysis leading to 21 articles implies these 21 were assessed for eligibility).
  • 💡 Missing explicit reason for initial record reduction in diagram: The reason for the initial drop from 604 identified records to 558 screened records (46 articles) is not explicitly stated in the diagram, though the text clarifies it was due to non-English language. For methodological completeness within the figure, this should be specified.
  • 💡 Unused template fields: The section 'Reports excluded: Reason 1 (n=0), Reason 2 (n=0), Reason 3 (n=0)' appears to be an unused part of a template. It does not provide useful information and should be removed or populated correctly if applicable, to avoid suggesting an incomplete or poorly adapted reporting format.
Communication
  • ✅ Clear overall structure: The flowchart format is a standard and effective way to visually represent the multi-stage filtering process of a literature review, making the overall workflow understandable at a glance.
  • ✅ Concise and accurate caption: The caption is concise and accurately describes the content of the figure.
  • 💡 Numerical discrepancy: The numerical inconsistency where the exclusion steps lead to 14 articles (558 screened - 243 - 263 - 31 - 7 = 14) while the final "Studies included in review" box states n=12 is confusing and undermines the figure's clarity. This discrepancy of 2 articles should be resolved by either correcting the numbers in the exclusion steps or the final count, or by adding a step that accounts for the removal of these 2 articles with a stated reason.
  • 💡 Unused or confusing zero-value boxes: The boxes "Reports sought for retrieval (n = 0)", "Reports not retrieved (n = 0)", "Reports assessed for eligibility (n = 0)", and the unused "Reports excluded: Reason 1 (n = 0), Reason 2 (n = 0), Reason 3 (n = 0)" are problematic. If these steps genuinely resulted in zero articles, this is highly unusual for a systematic review and may require explanation. If they are merely placeholders from a template (e.g., a PRISMA template) that do not reflect the actual process, they should be removed or adapted to accurately represent the study's methodology (e.g., mapping the abstract analysis stage to eligibility assessment). This would reduce clutter and improve clarity.
  • 💡 Missing clarification for initial reduction: The reason for the reduction from 604 identified records to 558 screened records (a difference of 46) is not explicitly stated within the diagram at that transition, although the main text clarifies these were non-English articles. Adding a brief note like "Non-English articles excluded" at this step in the diagram would enhance its self-containedness.
  • 💡 Font size consistency: The font size for the exclusion reasons under "Records excluded" (e.g., "Language, Publication Date") appears smaller than other text in the diagram. Consider increasing it slightly for better legibility.
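A minimal arithmetic check of the flow counts discussed above (values copied from the diagram as reported in this review) makes the two-article discrepancy explicit:

```python
# Arithmetic check of the Figure 1 flow counts as reported in this review.
identified = 604
screened = 558                    # 46 non-English records removed before screening
excluded = [243, 263, 31, 7]      # language/date, title, abstract, not available
remaining = screened - sum(excluded)
print(remaining)                  # -> 14, not the 12 the figure lists as included
```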
Table 1. Most-used ML algorithms in healthcare.
Figure/Table Image (Page 8)
Table 1. Most-used ML algorithms in healthcare.
First Reference in Text
The following Table 1 shows the most commonly used algorithms in healthcare.
Description
  • Table overview: Table 1 provides a summary of seven Machine Learning (ML) algorithms commonly applied in healthcare. For each algorithm, it specifies its type, an assessment of its frequency of use, and lists common applications.
  • Listed algorithms: The listed ML algorithms are: Linear Regression, Logistic Regression, Decision Trees, Random Forest, K-Nearest Neighbors, Support Vector Machines, and Naive Bayes.
  • Algorithm types: The 'Type of Algorithm' column categorizes each method. For example, Linear Regression is 'Supervised (Regression),' meaning it learns from data with known outcomes (supervised) to predict continuous numerical values (regression), like a patient's blood pressure. Logistic Regression is 'Supervised (Classification),' meaning it also learns from labeled data but predicts categories, such as whether a tumor is benign or malignant. Random Forest is 'Supervised (Ensemble Learning),' which means it combines multiple learning models (in this case, decision trees) to improve predictive performance. K-Nearest Neighbors is 'Supervised (Classification/Regression),' indicating its applicability to both types of prediction tasks. Support Vector Machines and Naive Bayes are listed as 'Supervised (Classification).' A minimal code sketch contrasting the regression and classification task types follows this table's analysis.
  • Frequency of use ratings: The 'Frequency of Use' is described qualitatively. Random Forest is rated 'Very High'. Linear Regression and Logistic Regression are rated 'High'. Decision Trees, K-Nearest Neighbors, and Naive Bayes are rated 'Moderate', while Support Vector Machines is rated 'Moderate to High'.
  • Example applications: The 'Common Applications' column provides examples for each algorithm. For instance, Linear Regression is used for 'Predictive modeling, trend analysis.' Logistic Regression is applied to 'Binary classification, medical diagnostics.' Random Forest is used for 'Fraud detection, customer churn prediction, genomics.' Support Vector Machines are used for 'Image recognition, text classification, bioinformatics.'
Scientific Validity
  • ✅ General overview of relevant algorithms: The table provides a useful, albeit high-level, overview of common ML algorithms relevant to healthcare, which can serve as a good starting point for readers less familiar with the field.
  • ✅ Accurate algorithm typing: The categorization of algorithms by type (e.g., Supervised Regression, Supervised Classification, Ensemble Learning) is generally accurate and standard in ML literature.
  • 💡 Unsubstantiated "Frequency of Use" claims: The "Frequency of Use" claims (High, Moderate, Very High) are not substantiated with evidence or citations within the table or the immediate reference text. Without knowing the methodology (e.g., a systematic count from a defined corpus of literature, expert consensus), these frequencies are subjective assertions. The validity of these claims is therefore questionable. The authors should clarify the basis for these frequency assessments.
  • 💡 Potential selectivity in "most-used" algorithms: The list of "most-used" algorithms might be selective and not fully representative of the entire healthcare ML landscape. The criteria for inclusion as "most-used" are not defined. For example, deep learning models (e.g., CNNs, RNNs) are increasingly used in healthcare, especially in imaging and sequential data analysis, but are not explicitly listed here, though 'Neural Network' is mentioned later in the paper.
  • 💡 Generality of listed applications: While the listed "Common Applications" are generally appropriate, they are broad. The table's purpose seems to be introductory, but for a scientific review, a more nuanced or specific list of applications, perhaps tied to particular healthcare challenges (e.g., specific disease prediction, resource allocation), could strengthen its contribution.
  • ✅ Serves as a contextual summary: The table provides a snapshot that is useful for context but does not, in itself, present novel research findings. Its value lies in summarizing existing knowledge for the reader.
Communication
  • ✅ Clear structure and readability: The table is well-structured with clear column headers (ML Algorithm, Type of Algorithm, Frequency of Use, Common Applications), making it easy to read and understand the information presented for each algorithm.
  • ✅ Concise presentation: The information is presented concisely, allowing for a quick overview of common ML algorithms in healthcare.
  • ✅ Accurate caption: The caption accurately describes the table's content.
  • 💡 Subjective "Frequency of Use" categories: The "Frequency of Use" column uses subjective categories (High, Moderate, Very High) without providing a scale, reference, or methodology for these classifications. This makes it difficult to interpret the relative frequencies objectively. Consider adding a footnote explaining the basis for these frequency ratings (e.g., based on the authors' literature search for this review, or citing a specific source that quantifies usage). Alternatively, if precise frequencies are unknown, using more nuanced qualitative descriptors or ranking could be considered, though a quantitative basis is always preferred.
  • 💡 Broadness of some listed applications: While the applications listed are generally correct, some are very broad (e.g., "Data mining" for Decision Trees). Providing slightly more specific or differentiated examples for each algorithm, where possible, could enhance the table's utility.
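To make the regression/classification distinction in Table 1 concrete, the following minimal sketch (synthetic data, not drawn from any reviewed study) contrasts a supervised regression model predicting LoS in days with a supervised classification model predicting a binary prolonged-stay label:

```python
# Illustrative only: supervised regression vs. classification on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # stand-in for clinical features
los_days = 3 + 2 * X[:, 0] + rng.normal(size=500)   # continuous target (days)
prolonged = (los_days > 4).astype(int)              # binary target (prolonged stay)

# Regression: predict the number of days.
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X, los_days, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(Xr_tr, yr_tr)
print("regression R2:", round(reg.score(Xr_te, yr_te), 3))

# Classification: predict whether the stay exceeds a threshold.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X, prolonged, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", round(clf.score(Xc_te, yc_te), 3))
```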

Results

Non-Text Elements

Table 2. Dataset overview.
Figure/Table Image (Page 11)
Table 2. Dataset overview.
First Reference in Text
Table 2 summarizes the datasets employed in these studies.
Description
  • Table overview: Table 2 lists the datasets used in the 12 primary research articles reviewed in this paper, identified by their reference numbers (e.g., [28], [29], etc.). For each study, the table specifies the name or type of dataset used and provides information on its size and the number of features (variables or characteristics used for analysis).
  • Use of public MIMIC datasets: Several studies utilized well-known public datasets. For instance, study [28] used MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care II), a database containing de-identified data from intensive care unit patients, with 4927 patients and 24 features. Studies [30] and [33] used MIMIC-III, a later version of this database; study [30] involved 8024 patient stays with 'various features', while study [33] listed its size as 'Not available' and its features as 'various features'.
  • Institutional or specific datasets: Other studies used institutional or specific datasets. Study [29] used data from 'UPT Puskesmas Arjasa Kangean' comprising 3055 patient records and 30 features. Study [35] used a 'Medical Institution Dataset' (size and features not specified). Study [37] used 'Femur Fracture Data' with 547 pre-DTAP (Diagnostic-Therapeutic-Assistance Pathway) and 562 post-DTAP records, with 'Various features'. Study [38] used 'Total Hip Arthroplasty' data with 792 samples and 13 features. Study [40] used 'Neurosurgical Treatment Cases' with 90,685 cases and 6 features.
  • Limited dataset details in some entries: For some studies, dataset details were limited. Study [32] used 'Electronic Health Records' with size and features 'Not specified'. Study [34] used an 'Unspecified' dataset with 350,393 records and 46 features. Study [36] used 'Electronic Health Records' with size 'Not specified' but 10 features. Study [39] used 'Patient Admission Data' with size and features 'Not specified'.
  • Variation in number of features: The number of features (input variables for the machine learning models) varies widely, from 6 features in study [40] to 46 features in study [34], with many studies reporting 'various features' or not specifying the number.
  • Variation in dataset size: The size of datasets also varies significantly, from 792 samples in study [38] to 350,393 records in study [34] and 90,685 cases in study [40].
Scientific Validity
  • ✅ Appropriate and necessary information for a review: Summarizing the datasets used in the reviewed studies is crucial for a literature review, as it provides context for the types of data, scale, and dimensionality that the machine learning models were applied to. This helps in understanding the scope and potential generalizability of the findings from those studies.
  • ✅ Good practice in summarizing dataset characteristics: The inclusion of both dataset source/name and its characteristics (size, features) is good practice, allowing readers to quickly grasp the nature of the data underpinning each study's results.
  • 💡 Limited detail impacts comparative utility: The frequent occurrence of 'Not specified', 'Not available', or 'various features' limits the scientific utility of the table for detailed comparison or meta-analysis. While this likely reflects limitations in the reporting of the original studies, it highlights a challenge in synthesizing information from diverse sources. The review itself acknowledges this by presenting the information as is.
  • ✅ Accurate reflection of source information (assumed): The table accurately reflects the information as cited from the original papers, which is essential for the integrity of a literature review. The authors of this review are summarizing, not generating new data here.
  • ✅ Demonstrates heterogeneity of data in the field: The diversity of datasets (public, institutional, specific conditions) and their varying sizes/features highlights the heterogeneous nature of research in hospital LoS prediction. This implicitly supports the review's aim to identify effective ML algorithms across different contexts.
  • 💡 Lacks information on data preprocessing/feature engineering: The table does not provide information on data preprocessing steps or specific feature engineering undertaken in the original studies, which are critical aspects of ML model development. While perhaps beyond the scope of a summary table, acknowledging this limitation of the summarized information could be useful context for the reader if not discussed elsewhere.
Communication
  • ✅ Clear structure and readability: The table is clearly structured with columns for 'Study', 'Dataset', and 'Size and Features', making it easy to locate information for each reviewed paper.
  • ✅ Accurate caption: The caption accurately reflects the content of the table.
  • 💡 Incomplete information for several entries: The use of 'Not specified' or 'Not available' for dataset size or features in several entries ([32], [34], [35], [36], [39]) limits the table's completeness and utility for comparative purposes. While this may reflect missing information in the original studies, it's a notable gap. If the information truly isn't in the source, this is a limitation of the source, not the table; however, if it was in the source and omitted, it's a table flaw. Assuming it reflects the source, this is a communication limitation inherited from the original studies.
  • 💡 Vagueness of 'various features': The term 'various features' is vague. While understandable if the exact number is large or complex to list, it reduces the specificity. If a range or predominant types of features were mentioned in the source papers, adding that detail could be beneficial.
  • 💡 Ambiguity in feature count for split dataset [37]: For study [37], the feature count is 'Various features' but the size is split into '547 (pre-DTAP), 562 (post-DTAP)'. It's unclear if 'Various features' applies to both pre- and post-DTAP groups or if feature sets differed. Clarification, if available in the source, would be helpful.
  • 💡 Inconsistent unit of size (patients vs. records vs. stays): Consistency in reporting patient numbers versus records would be ideal. For example, [28] lists 'patients', [29] lists 'patient records', [30] lists 'patient stays', [34] lists 'records'. While reflecting source terminology, a note explaining these distinctions or standardizing where possible could aid comparison.
Table 3. Comparation of algorithms and metrics.
Figure/Table Image (Page 14)
Table 3. Comparation of algorithms and metrics.
First Reference in Text
Table 3 presents the key performance metrics reported for each algorithm.
Description
  • Table overview: Table 3 summarizes the performance of various machine learning algorithms as reported in the 12 reviewed studies, identified by their reference numbers. For each study, it details the general method (e.g., Regression, Neural Network, Classification), the specific algorithm(s) used, the performance metric(s) reported (e.g., MAE, Accuracy, R2 Score), and the corresponding numerical values of these metrics.
  • Performance metrics for study [28] (Regression): Study [28] evaluated regression models. Logistic Regression showed a Mean Absolute Error (MAE) – an average of the absolute differences between predicted and actual values – of 198,379,877,732,011.9 (this large value with multiple commas is unusual and likely represents different scenarios or a concatenation of results, rather than a single MAE value). Ridge Regression, Lasso Regression, and ElasticNet had MAE values of 0.82131, 0.96865, and 0.95121, respectively.
  • Performance metrics for study [29] (Neural Network): Study [29] focused on Neural Networks, reporting Accuracy – the proportion of correct predictions – of 94.66% with default parameters, and 94.74% after Grid Search Optimization or Random Search Optimization.
  • Performance metrics for study [30] (Regression - R2 Score): Study [30] used regression algorithms, reporting R2 Score – a measure of how well the model's predictions approximate the actual outcomes, with 1 being a perfect fit. Random Forest achieved an R2 Score of 0.7780, XGBoost 0.7608, Gradient Boosting 0.7651, Logistic Regression 0.6466, and K-Nearest Neighbors 0.7306.
  • Performance metrics for study [32] (Survival Analysis): Study [32] employed Survival Analysis, achieving a Concordance Index – a measure of discrimination in survival models, indicating how well the model predicts the order of events – of 0.7.
  • Performance metrics for study [33] (Decision Trees): Study [33] used Decision Trees, reporting an Accuracy 'Above 80%'.
  • Performance metrics for study [34] (Regression/Neural Network - Accuracy): Study [34] reported dual Accuracy values for Random Forest (59.78% and 36.57%) and Neural Network (47.52% and 36.67%).
  • Performance metrics for study [35] (Classification - Accuracy): Study [35] used classification algorithms. Logistic Regression achieved an Accuracy of 80.54%, Modified Random Forest 81.09%, and Gradient Boosting 82.41%.
  • Performance metrics for study [36] (Classification - F1 Score): Study [36] also used classification, reporting F1 Score – a balance between precision and recall. Logistic Regression had an F1 Score of 0.59705, Decision Trees 0.59273, Neural Network 0.67248, Random Forest 0.66797, and Gradient Boosting 0.64848.
  • Performance metrics for study [37] (Regression - R2 Score, Std. Error): Study [37] used Multiple Linear Regression, reporting R2 (Pre-DTAP) of 0.63 and R2 (Post-DTAP) of 0.50. Standard Errors (Pre and Post) were 3.12 and 5.08, respectively.
  • Performance metrics for study [38] (Regression - RMSE, R2 Score): Study [38] used XGBoost for regression, achieving a Root Mean Squared Error (RMSE) – the square root of the average of squared differences between prediction and actual observation – of 2.03 and an R2 Score of 0.89.
  • Performance metrics for study [39] (Classification - Accuracy Not Specified): Study [39] used classification algorithms (Naive Bayes, Random Forest, Support Vector Machine), but the Accuracy values were 'Not Specified'.
  • Performance metrics for study [40] (Neural Network - MAE): Study [40] used a Neural Network (GPT-3), reporting MAE values of 2.37 days, 2 days, and 1.88 days (presumably for different model variations or evaluation subsets).
Scientific Validity
  • ✅ Appropriate goal of summarizing performance: The table attempts to consolidate key performance metrics from the reviewed studies, which is a fundamental component of a systematic literature review aiming to compare different approaches.
  • ✅ Reflects diversity of reported metrics: The inclusion of various metrics (MAE, Accuracy, R2, F1, Concordance Index, RMSE) reflects the diversity of evaluation approaches in the original studies and the different types of prediction tasks (regression vs. classification). This is methodologically sound as it represents what was reported.
  • 💡 Questionable validity of MAE values in study [28]: The MAE values reported for Logistic Regression in study [28] (198,379,877,732,011.9) are highly problematic and suggest a misunderstanding, misreporting, or lack of normalization/context from the original paper. Such large, disparate numbers without clear explanation or units are not interpretable as typical MAE values for LoS prediction. This significantly undermines the validity of this specific entry.
  • 💡 Missing units for error metrics limits interpretation: The lack of units for MAE and RMSE (e.g., days) makes it difficult to assess the practical significance of these error metrics. An RMSE of 2.03 is only meaningful in context (e.g., 2.03 days). This omission limits the scientific utility of these reported values.
  • 💡 'Not Specified' values prevent comparison: Reporting 'Not Specified' for accuracy in study [39] means no direct performance comparison can be made for these algorithms from this table. This is a limitation inherited from the source or the review's extraction process.
  • 💡 Lack of context for multiple performance values per algorithm: The presentation of multiple, distinct performance values for the same algorithm within a single study (e.g., study [29] Neural Network accuracies; study [34] dual accuracies; study [40] multiple MAEs) without clear differentiation of the conditions under which each was achieved (e.g., different hyperparameter settings, feature sets, or evaluation subsets) makes it hard to pinpoint a single definitive performance. While this reflects the source, the review could benefit from clarifying these distinctions if available in the original papers.
  • 💡 Inherent difficulty in cross-study comparison: Direct comparison of performance across different studies is inherently challenging due to variations in datasets (as shown in Table 2), specific feature sets, preprocessing techniques, and evaluation protocols used in the original papers. The table presents the data as is, but readers should be cautious about drawing strong comparative conclusions across studies without considering these underlying differences.
  • ✅ Useful compilation despite heterogeneity: The table serves as a useful compilation of reported results, but its scientific value for drawing definitive conclusions about algorithm superiority is limited by the heterogeneity and sometimes incomplete reporting of the primary studies.
Communication
  • ✅ Logical table structure: The table structure is logical, with columns for Study, Method, Algorithm, Metric, and Values, allowing for a systematic presentation of results from the reviewed literature.
  • 💡 Typographical error in caption: The caption has a typographical error; 'Comparation' should be 'Comparison'.
  • 💡 Confusing MAE values for study [28]: The presentation of multiple MAE values for Logistic Regression in study [28] (198,379,877,732,011.9) is confusing due to the vastly different scales and lack of units or context. It's unclear what these disparate numbers represent. Clarification or separation of these values with context is needed.
  • 💡 Ambiguous dual accuracy values for study [34]: For study [34], two accuracy values (59.78% and 36.57% for Random Forest; 47.52% and 36.67% for Neural Network) are listed without distinguishing what each value represents (e.g., different feature sets, different patient subgroups). This ambiguity hinders interpretation. Specify what each value corresponds to.
  • 💡 'Not Specified' values limit comparison: The entries 'Not Specified' for accuracy values in study [39] reduce the table's informativeness for those particular algorithms. While this may reflect the source material, it's a limitation in the presented comparison.
  • ✅ Appropriate inclusion of diverse metrics: The table mixes different types of metrics (e.g., R2 Score, Accuracy, MAE, F1 Score) which is appropriate given the different tasks (regression vs. classification) and reporting in original studies. However, this makes direct comparison across all studies challenging, which is an inherent difficulty in literature reviews.
  • 💡 Missing units for error metrics (MAE, RMSE): Units are missing for MAE and RMSE values (e.g., days for LoS prediction). Adding units would significantly improve the interpretability of these error metrics. For example, an MAE of '2.37' is meaningless without knowing if it's days, hours, etc. (a toy computation with explicit units is sketched after this table's analysis).
  • 💡 Ambiguity of 'Std. Error': The term 'Std. Error' in study [37] is ambiguous. It could refer to standard error of the mean, standard error of the estimate (for regression), etc. Specifying the exact type of standard error would enhance clarity.
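As a toy illustration of why units matter for the error metrics compiled in Table 3 (synthetic numbers, unrelated to any reviewed study), MAE, RMSE, and R2 for an LoS regression can be computed as follows:

```python
# Toy computation of the error metrics discussed above, with explicit units.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual_days = np.array([3.0, 5.0, 7.0, 2.0, 10.0])
predicted_days = np.array([4.0, 5.0, 6.0, 3.0, 8.0])

mae = mean_absolute_error(actual_days, predicted_days)           # in days
rmse = np.sqrt(mean_squared_error(actual_days, predicted_days))  # in days
r2 = r2_score(actual_days, predicted_days)                       # unitless
print(f"MAE = {mae:.2f} days, RMSE = {rmse:.2f} days, R2 = {r2:.2f}")
```

Reporting Table 3's figures with such units would make entries like an RMSE of 2.03 or an MAE of 2.37 directly interpretable.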

Discussion

Non-Text Elements

Table 4. Best-performing algorithms.
Figure/Table Image (Page 16)
Table 4. Best-performing algorithms.
First Reference in Text
Based on the summarized results in Table 4, we can analyze and rank the algorithms to determine the best-performing methods for predicting the LoS in hospitals.
Description
  • Table overview: Table 4 aims to highlight the best-performing machine learning algorithm from each of the 12 reviewed studies, along with the specific performance metric and its value that led to this designation. The studies are identified by their reference numbers.
  • Best performer from study [28]: For study [28], 'Regression' (specifically Ridge Regression, as identified in the discussion referencing Table 3) is listed with a metric of 'Ridge Regression' and a value of 0.82131 (this value corresponds to the MAE reported in Table 3 for Ridge Regression).
  • Best performer from study [29]: Study [29] identified 'Grid Search Optimization and Random Search Optimization' (applied to a Neural Network) as best, achieving an Accuracy – the proportion of correct predictions – of 94.74%.
  • Best performer from study [30]: From study [30], 'Random Forest' was best, with an R2 Score – a measure of how well predictions approximate actual outcomes, where 1 is a perfect fit – of 0.7780.
  • Best performer from study [32]: Study [32] reported 'Survival Analysis (Various models)' as best, with a Concordance Index – a measure of discrimination in survival models – of 0.7.
  • Best performer from study [33]: For study [33], 'Decision Tree' was best, achieving an Accuracy 'Above 80%'.
  • Best performer from study [34]: Study [34] listed 'Random Forest Regressor' with Accuracy values of 59.78% and 36.57%.
  • Best performer from study [35]: From study [35], 'Gradient Boosting' was best, with an Accuracy of 82.41%.
  • Best performer from study [36]: Study [36] identified 'Neural Network' as best, achieving an F1 Score – a balance between precision and recall – of 0.67248.
  • Best performer from study [37]: For study [37], 'Multiple Linear Regression (Pre-DTAP)' was best, with an R2 Score of 0.63.
  • Best performer from study [38]: Study [38] highlighted 'XGBoost' with an R2 Score of 0.89.
  • Best performer from study [39]: From study [39], 'Support Vector Machine' is listed as best, but its Accuracy is 'Not Specified'.
  • Best performer from study [40]: Study [40] identified 'GPT-3' as best, with a Mean Absolute Error (MAE) – an average of the absolute differences between predicted and actual values – of 2.37 days (units inferred from text, not in table).
Scientific Validity
  • ✅ Valuable synthesis goal: The table attempts to synthesize the top findings from each reviewed paper, which is a valuable step in a literature review to guide discussion towards impactful methods.
  • 💡 Potential oversimplification and undefined selection criteria for 'best': The selection of a single 'best-performing' algorithm per study simplifies comparison but can be an oversimplification if studies reported multiple high-performing models or if 'best' varied by metric. The criteria for this selection are not explicitly defined in the table, which could affect the interpretation of what 'best' signifies (e.g., highest accuracy, lowest error, best on a specific subgroup).
  • 💡 Inherent difficulty in comparing 'best' across heterogeneous studies and metrics: The direct comparison and ranking of these 'best-performing' algorithms across studies is inherently challenging due to the heterogeneity of datasets, metrics used (e.g., comparing an R2 of 0.89 to an Accuracy of 94.74% is not straightforward), and specific LoS prediction tasks. The table presents these top results, but any subsequent ranking based on this table must acknowledge these limitations.
  • 💡 Unsubstantiated 'best-performing' claim for study [39] due to missing metric value: The inclusion of study [39]'s Support Vector Machine with a 'Not Specified' accuracy value in a table of 'Best-performing algorithms' is problematic. If the performance metric is unknown, its claim as 'best-performing' is unsubstantiated within the context of this comparative table. The text states that for study [39], SVM was found to be the most accurate, but the metric was not provided in the original paper. This nuance is lost in the table.
  • ✅ Consistency with Table 3 for study [28] (assuming Ridge was best regression): The MAE value for study [28] (Regression, 0.82131) is taken from Table 3 where it was associated with Ridge Regression. The 'Best Algorithm' column just says 'Regression'. While consistent with Table 3 if Ridge Regression was indeed the best regression model in that study, the generalization to 'Regression' here is less specific.
  • 💡 Ambiguity in 'best-performing' status for study [34] with dual values: The dual accuracy values for study [34] (Random Forest Regressor: 59.78% and 36.57%) make it unclear which specific result led to its 'best-performing' status, or if both scenarios are considered. This ambiguity reduces the clarity of its 'best' designation.
  • ✅ Accurate extraction from Table 3 (mostly): The table accurately extracts the reported top metrics from Table 3 for most entries, serving as a filtered view. However, the interpretation that these can be easily ranked to find the single best method for LoS prediction overall is a strong claim that this table alone cannot fully support without extensive caveats about context-dependency.
Communication
  • ✅ Clear and logical structure: The table is clearly structured, making it easy to identify the study, the algorithm deemed best-performing from that study, the specific metric, and its value.
  • ✅ Accurate caption: The caption is concise and accurately reflects the table's intent to highlight top-performing algorithms from the reviewed studies.
  • 💡 Missing units for MAE: Similar to Table 3, units are missing for the MAE value (e.g., 2.37 days for study [40]). Adding units is crucial for the interpretability of this error metric. The text elsewhere mentions 'days', but the table should be self-contained.
  • 💡 Ambiguous dual accuracy values for study [34]: The dual accuracy values for study [34] (Random Forest Regressor: 59.78% and 36.57%) are presented without context or explanation for the two different figures. This ambiguity was also present in Table 3 and persists here, making it difficult to understand which value represents the 'best performance' or under what conditions. Specify what each value corresponds to.
  • 💡 Inclusion of 'Not Specified' value for best-performing algorithm: For study [39], the metric value for Support Vector Machine is 'Not Specified'. Including this entry in a 'Best-performing algorithms' table is contradictory if its performance isn't quantified. If the original paper claimed it was best without providing a specific metric value, this should be noted explicitly; otherwise, its inclusion here is confusing. Consider removing it or adding a footnote explaining why it's listed despite the missing value.
  • 💡 Lack of explicit selection criteria for 'best-performing': The selection criteria for what constitutes 'best-performing' from each study, especially when multiple algorithms or metrics were reported (as seen in Table 3), are not explicitly stated within or alongside Table 4. This makes it difficult to verify the selection. A brief note on selection criteria (e.g., 'highest reported accuracy/R2 for primary LoS prediction task') would improve transparency.