Hospital Length-of-Stay Prediction Using Machine Learning Algorithms – A Literature Review

Guilherme Almeida, Fernanda Brito Correia, Ana Rosa Borges, Jorge Bernardino
Applied Sciences
Coimbra Institute of Engineering-ISEC, Polytechnic University of Coimbra

Overall Summary

Study Background and Main Findings

This paper presents a literature review focused on the application of machine learning (ML) algorithms for predicting hospital length of stay (LoS), a critical factor for efficient hospital management and patient care. The primary objective is to identify the most effective ML algorithms for building LoS predictive models by analyzing existing research concerning model types, performance metrics, dataset characteristics, and ethical considerations.

Methodologically, the review follows Kitchenham's protocol for systematic literature reviews. The authors conducted a bibliographic search primarily using Google Scholar, which initially yielded 604 articles. Through a multi-stage filtering process—based on language (English), publication date (post-2020), relevance of title and abstract to ML for LoS prediction, and study availability—this pool was narrowed to 12 core research articles for in-depth analysis. These selected papers were then scrutinized for the ML algorithms employed, the datasets utilized (e.g., MIMIC-II/III, institutional EHRs), data preprocessing techniques, reported performance metrics (such as Mean Absolute Error (MAE), R-squared (R2), and Accuracy), and discussions of challenges and ethical implications.

Key findings from the reviewed literature indicate that several ML algorithms show promise for LoS prediction. Notably, Neural Networks (NNs) demonstrated high accuracy in some studies (e.g., up to 94.74% accuracy reported in study [29] after optimization) and XGBoost also showed strong performance (e.g., an R2 score of 0.89 in study [38]). However, the review emphasizes that the performance of these algorithms is highly contingent on factors such as the quality and characteristics of the dataset, the extent of data preprocessing (including feature selection and handling missing values), and meticulous hyperparameter tuning. The importance of robust data management frameworks and adherence to ethical principles—particularly patient privacy (e.g., HIPAA, GDPR compliance), data security, and the mitigation of algorithmic bias—was also a recurrent theme.
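
The summary above attributes part of the Neural Network's reported accuracy gain (94.66% rising to 94.74% in study [29]) to grid or random search optimization. As a purely illustrative sketch on synthetic data (not code from any reviewed study), such hyperparameter tuning typically looks like the following with scikit-learn:

```python
# Illustrative only: grid-search tuning of a neural-network classifier on
# synthetic tabular data (a stand-in for an LoS classification task).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 4))
```

Whether such a search yields gains comparable to those reported in [29] depends entirely on the dataset and model; the snippet only illustrates the mechanism behind the tuning the reviewed studies describe.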

The paper concludes that while Neural Networks are a popular and often high-performing choice, no single ML algorithm is universally optimal for all LoS prediction scenarios. The selection of the most suitable algorithm ultimately depends on the specific context, including the nature of the available data, computational resources, the need for model interpretability, and the specific goals of the healthcare application. The review highlights the ongoing need for careful validation and consideration of practical implementation challenges when deploying ML models in clinical settings.

Research Impact and Future Directions

This literature review provides a valuable and comprehensive synthesis of the application of machine learning (ML) algorithms for predicting hospital length of stay (LoS). Its systematic approach, following Kitchenham's methodology, lends credibility to its overview of common algorithms, performance metrics, and prevailing challenges in the field. The paper effectively highlights the potential of advanced models like Neural Networks and XGBoost, while judiciously emphasizing that their performance is highly context-dependent and that practical implementation faces hurdles such as data quality, model interpretability, computational costs, and ethical considerations.

The review's main strength lies in its balanced perspective, discussing not only the technical aspects of ML models but also the critical importance of ethical frameworks, patient privacy, and bias mitigation. It correctly concludes that there is no universally superior algorithm for LoS prediction; rather, the optimal choice depends on specific dataset characteristics, available resources, and the clinical context. This nuanced conclusion is crucial for guiding healthcare practitioners and researchers in selecting and deploying ML solutions responsibly and effectively.

While the review successfully maps the current landscape, it also implicitly underscores the limitations within the primary research itself, such as the heterogeneity in datasets, methodologies, and reporting standards across studies. This makes definitive cross-study comparisons of algorithm effectiveness challenging. The review navigates this by focusing on trends and general capabilities rather than making absolute claims of algorithmic supremacy. The study design of this paper—a systematic literature review—is appropriate for its objective of summarizing existing knowledge and identifying research gaps. It reliably contributes an understanding of the current state-of-the-art, common practices, and challenges in ML for LoS prediction.

Ultimately, the paper serves as a useful guide for understanding the complexities involved in leveraging ML for healthcare efficiency. It underscores that future progress requires not only algorithmic advancements but also a focus on robust data governance, ethical oversight, and strategies for seamless clinical integration. The critical unanswered questions revolve around how to best bridge the gap from predictive accuracy in research settings to tangible, equitable improvements in real-world patient care and hospital management, particularly concerning the interpretability and trustworthiness of complex models.

Critical Analysis and Recommendations

Clear Statement of Purpose and Significance (written-content)
The abstract clearly states the paper's goal of identifying effective ML algorithms for LoS prediction and its significance for hospital management and patient care. This sets a clear context and communicates the research's value proposition effectively from the outset.
Section: Abstract
Comprehensive Scope Indicated (written-content)
The abstract outlines a comprehensive scope, including ML algorithms, metrics, challenges, data quality, and ethical considerations. This holistic approach suggests a thorough treatment of the topic, promising a well-rounded review.
Section: Abstract
Specify Target Audience or Context for Enhanced Relevance (written-content)
While the abstract outlines the paper's scope, explicitly mentioning the primary intended audience or specific contexts (e.g., general hospitals vs. specialized units) could enhance its focus and immediate applicability. This addition would help readers more quickly ascertain the paper's direct relevance to their work.
Section: Abstract
Clear Articulation of Purpose and Scope (written-content)
The introduction clearly establishes the paper's central theme (ML for LoS prediction) and its primary goal (identifying effective ML algorithms). This directness provides immediate clarity for the reader regarding the paper's focus.
Section: Introduction
Comprehensive Roadmap of Paper Content (written-content)
The introduction effectively outlines the paper's structure, informing readers about upcoming discussions on ML algorithms, features, healthcare outcomes, and ethical considerations. This sets clear expectations for the paper's content flow.
Section: Introduction
Explicitly Introduce Research Questions Earlier (written-content)
While the main objective is stated, explicitly formulating or thematically outlining the specific research questions (RQs) within the Introduction itself would sharpen the paper's focus from the outset. Although RQs are detailed later, their early integration would enhance the Introduction's role as a complete preparatory section.
Section: Introduction
Clear Categorization and Workflow Overview (written-content)
The section effectively categorizes ML algorithms and outlines a structured five-step application workflow. This provides a solid conceptual foundation and a clear understanding of the typical ML process for readers.
Section: Machine Learning Algorithms in Healthcare
Diverse Applications with Concrete Impact Example (written-content)
The paper illustrates diverse ML applications in healthcare and substantiates ML's transformative potential with a compelling clinical trial example on sepsis, showing significant reductions in LoS and mortality. This highlights tangible benefits to patient outcomes.
Section: Machine Learning Algorithms in Healthcare
Clarify Random Forest's Interaction with Preprocessing and Invariance (written-content)
The statement that Random Forest's effectiveness is due to its ability to 'capture complex interactions that require intensive preprocessing due to its invariance' is ambiguous. Clarifying the relationship between invariance, complex data, and preprocessing for Random Forest would improve understanding of the algorithm's characteristics, which is important as it's a key algorithm discussed for LoS prediction.
Section: Machine Learning Algorithms in Healthcare
Comprehensive Coverage of Core Ethical Themes (written-content)
The section comprehensively covers core ethical themes pertinent to ML in healthcare, including patient privacy, data management, and fairness/bias mitigation. This breadth ensures key ethical challenges are introduced and contextualized.
Section: Ethical Considerations
Detailed Articulation of Patient Privacy Risks (written-content)
The discussion on patient privacy is robust, detailing specific risks like data breaches and re-identification. This level of detail effectively highlights the complexities in safeguarding sensitive patient data.
Section: Ethical Considerations
Deepen Discussion on Informed Consent and Data Usage Specific to ML (written-content)
The section mentions 'informed consent, and data usage' as key issues but does not develop them with the same depth as other topics. A more dedicated discussion on the nuances of informed consent for ML applications (e.g., data reuse, model interpretability for patients) is warranted, as it's a foundational ethical principle in medicine and its absence represents an underdeveloped ethical dimension.
Section: Ethical Considerations
Systematic and Transparent Review Protocol (written-content)
The explicit adoption and detailed description of Kitchenham's methodology for the literature review provide a strong foundation of rigor and transparency. Clearly outlining the steps enhances the review's credibility by allowing readers to understand the systematic process undertaken.
Section: Materials and Methods
Clear Search and Selection Criteria (written-content)
The paper clearly defines its search strategy, including the search string, exclusion rationale, and multi-stage filtering process. This clarity allows for potential replication and demonstrates a focused approach to literature gathering.
Section: Materials and Methods
Numerical Inconsistency in Literature Review Flowchart (Figure 1) (graphical-figure)
Figure 1, illustrating the literature review process, contains a numerical inconsistency: after the stated exclusion steps, 14 articles should remain, yet the final count is 12. This discrepancy undermines the figure's clarity and the reported accuracy of the selection process, impacting the perceived rigor of the review methodology.
Section: Materials and Methods
Detail the Data Extraction Protocol from Selected Papers (written-content)
The 'Materials and Methods' section does not explicitly detail what specific data items were systematically extracted from each selected paper to answer the review's research questions. Providing an overview of the data extraction form or key data points sought would enhance transparency and methodological rigor, making it clearer how the synthesis in the Discussion is derived.
Section: Materials and Methods
Effective Use of Tabular Summaries for Comparative Data (written-content)
The Results section effectively uses Table 3 to present a comprehensive comparison of algorithms and their reported performance metrics from reviewed studies. This tabular format allows for a concise and accessible overview of diverse results, facilitating comparison.
Section: Results
Address Confusing MAE Values and Missing Units in Table 3 (graphical-figure)
Table 3 lists the MAE for Logistic Regression in study [28] as 198,379,877,732,011.9, a figure that is confusing because it appears to run together several values of vastly different scales without units or context; units are also missing for the other MAE/RMSE values. This ambiguity and omission hinder the interpretability and comparative value of the reported metrics, impacting the clarity of the results.
Section: Results
Address Inconsistency and Relocate Interpretative Text Regarding Study [39] (written-content)
The Results section includes interpretative statements and authorial conclusions regarding study [39] that contradict Table 3 (which lists the metrics for [39] as 'Not Specified'). Such interpretations and discussions of individual study findings, especially with internal inconsistencies, typically belong in the Discussion section, and the factual discrepancy needs to be resolved for credibility. This represents both a structural misplacement and a significant internal inconsistency.
Section: Results
Structured Performance Overview and Synthesis (written-content)
The discussion effectively categorizes and summarizes the performance of various ML algorithms based on the reviewed literature. This provides a clear comparative landscape for understanding their relative effectiveness in LoS prediction.
Section: Discussion
In-depth Critical Analysis of Algorithmic Limitations (written-content)
The section critically analyzes practical limitations of top-performing algorithms (e.g., Neural Networks, XGBoost), discussing data preprocessing demands, tuning sensitivity, computational costs, and interpretability issues. These are crucial considerations for real-world healthcare applications.
Section: Discussion
Direct and Comprehensive Answers to Research Questions (written-content)
The discussion systematically addresses each of the four research questions posed earlier in the paper. This provides clear, synthesized answers based on the literature review, fulfilling a key objective of the paper.
Section: Discussion
Emphasis on Context-Specific Algorithm Selection (written-content)
The discussion highlights that optimal algorithm choice is context-dependent (dataset characteristics, data quality, resources, application context). This promotes a nuanced understanding of ML model deployment rather than a universal solution.
Section: Discussion
Enhance Discussion on Bridging the Gap to Clinical Implementation (written-content)
While the critical analysis highlights challenges like model complexity and interpretability, expanding this to discuss actionable strategies or specific research avenues for bridging the gap to clinical implementation would be impactful. This involves technical, organizational, ethical, and workflow integration considerations, guiding future work more effectively.
Section: Discussion

Section Analysis

Materials and Methods

Non-Text Elements

Figure 1. Literature review process.
Figure/Table Image (Page 8)
Figure 1. Literature review process.
First Reference in Text
Finally, 7 articles were excluded because they were not available, and the search ended with n = 12 articles to be analyzed in this review. Figure 1 illustrates the process used in this literature review.
Description
  • Overall process depiction: The figure is a flowchart illustrating the systematic process of a literature review, detailing how an initial pool of 604 identified articles was narrowed down to 12 final articles for inclusion. This type of diagram is common in systematic reviews and often follows guidelines like PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), which standardizes the reporting of the review process.
  • Identification phase: The process begins with 604 records identified from databases (and 0 from registers). No records were removed for duplicates, by automation tools, or other pre-screening reasons according to the diagram.
  • Screening initiation: In the screening phase, 558 records were screened. This implies 46 records (604-558) were excluded prior to this screening, which the main text indicates were non-English articles.
  • Exclusion steps and numbers: Several exclusion criteria were applied sequentially to the 558 screened records: 243 records were excluded based on 'Language, Publication Date'; 263 records based on 'Title Analysis'; 31 records based on 'Abstract Analysis'; and 7 records because they were 'Not available'.
  • Numerical outcome of exclusions and final count discrepancy: Following the listed exclusions (243 + 263 + 31 + 7 = 544 exclusions), 14 articles (558 - 544 = 14) should mathematically remain. However, the diagram concludes with 12 studies included in the review, indicating a discrepancy of 2 articles that are not accounted for in the exclusion process depicted (a brief arithmetic check of these figures is sketched after this figure's analysis).
  • Zero-value intermediate steps: The flowchart includes several standard PRISMA-style boxes such as 'Reports sought for retrieval', 'Reports not retrieved', and 'Reports assessed for eligibility', all of which are reported with 'n = 0'. Similarly, a section for 'Reports excluded' with generic reasons (Reason 1, 2, 3) also shows 'n = 0' for each.
Scientific Validity
  • ✅ Transparency of selection process: The use of a flowchart to document the article selection process is a methodological strength, promoting transparency and reproducibility of the literature review, consistent with PRISMA guidelines.
  • 💡 Numerical inconsistency in article count: The numbers presented in the flowchart do not reconcile. After applying all stated exclusions (243 for Language/Pub Date, 263 for Title, 31 for Abstract, 7 for Not Available) to the 558 screened articles, 14 articles should remain (558 - 544 = 14). However, the figure states 12 articles were included. This mathematical inconsistency of 2 articles needs to be resolved to ensure the accuracy of the reported methodology. The main text also leads to this same inconsistency (21 articles after abstract analysis, minus 7 unavailable, should leave 14, not 12).
  • 💡 Implausible zero duplicates: The reporting of 'n=0' for 'Duplicate records removed' is highly improbable for a search yielding 604 initial records, especially if multiple sources or comprehensive search strategies were employed. This raises questions about the thoroughness of the deduplication process or its reporting.
  • 💡 Questionable zero values in intermediate PRISMA steps: The intermediate steps 'Reports sought for retrieval (n=0)', 'Reports not retrieved (n=0)', and 'Reports assessed for eligibility (n=0)' showing zero counts are atypical for a systematic review. These stages usually involve non-zero numbers as screened articles are progressed. If these steps were indeed not applicable or resulted in zero, it should be clarified; otherwise, the diagram may not accurately reflect the actual review process detailed in the text (e.g., abstract analysis leading to 21 articles implies these 21 were assessed for eligibility).
  • 💡 Missing explicit reason for initial record reduction in diagram: The reason for the initial drop from 604 identified records to 558 screened records (46 articles) is not explicitly stated in the diagram, though the text clarifies it was due to non-English language. For methodological completeness within the figure, this should be specified.
  • 💡 Unused template fields: The section 'Reports excluded: Reason 1 (n=0), Reason 2 (n=0), Reason 3 (n=0)' appears to be an unused part of a template. It does not provide useful information and should be removed or populated correctly if applicable, to avoid suggesting an incomplete or poorly adapted reporting format.
Communication
  • ✅ Clear overall structure: The flowchart format is a standard and effective way to visually represent the multi-stage filtering process of a literature review, making the overall workflow understandable at a glance.
  • ✅ Concise and accurate caption: The caption is concise and accurately describes the content of the figure.
  • 💡 Numerical discrepancy: The numerical inconsistency where the exclusion steps lead to 14 articles (558 screened - 243 - 263 - 31 - 7 = 14) while the final "Studies included in review" box states n=12 is confusing and undermines the figure's clarity. This discrepancy of 2 articles should be resolved by either correcting the numbers in the exclusion steps or the final count, or by adding a step that accounts for the removal of these 2 articles with a stated reason.
  • 💡 Unused or confusing zero-value boxes: The boxes "Reports sought for retrieval (n = 0)", "Reports not retrieved (n = 0)", "Reports assessed for eligibility (n = 0)", and the unused "Reports excluded: Reason 1 (n = 0), Reason 2 (n = 0), Reason 3 (n = 0)" are problematic. If these steps genuinely resulted in zero articles, this is highly unusual for a systematic review and may require explanation. If they are merely placeholders from a template (e.g., a PRISMA template) that do not reflect the actual process, they should be removed or adapted to accurately represent the study's methodology (e.g., mapping the abstract analysis stage to eligibility assessment). This would reduce clutter and improve clarity.
  • 💡 Missing clarification for initial reduction: The reason for the reduction from 604 identified records to 558 screened records (a difference of 46) is not explicitly stated within the diagram at that transition, although the main text clarifies these were non-English articles. Adding a brief note like "Non-English articles excluded" at this step in the diagram would enhance its self-containedness.
  • 💡 Font size consistency: The font size for the exclusion reasons under "Records excluded" (e.g., "Language, Publication Date") appears smaller than other text in the diagram. Consider increasing it slightly for better legibility.
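A minimal arithmetic check of the flow counts discussed above (values copied from the diagram as reported in this review) makes the two-article discrepancy explicit:

```python
# Arithmetic check of the Figure 1 flow counts as reported in this review.
identified = 604
screened = 558                    # 46 non-English records removed before screening
excluded = [243, 263, 31, 7]      # language/date, title, abstract, not available
remaining = screened - sum(excluded)
print(remaining)                  # -> 14, not the 12 the figure lists as included
```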
Table 1. Most-used ML algorithms in healthcare.
Figure/Table Image (Page 8)
Table 1. Most-used ML algorithms in healthcare.
First Reference in Text
The following Table 1 shows the most commonly used algorithms in healthcare.
Description
  • Table overview: Table 1 provides a summary of seven Machine Learning (ML) algorithms commonly applied in healthcare. For each algorithm, it specifies its type, an assessment of its frequency of use, and lists common applications.
  • Listed algorithms: The listed ML algorithms are: Linear Regression, Logistic Regression, Decision Trees, Random Forest, K-Nearest Neighbors, Support Vector Machines, and Naive Bayes.
  • Algorithm types: The 'Type of Algorithm' column categorizes each method. For example, Linear Regression is 'Supervised (Regression),' meaning it learns from data with known outcomes (supervised) to predict continuous numerical values (regression), like a patient's blood pressure. Logistic Regression is 'Supervised (Classification),' meaning it also learns from labeled data but predicts categories, such as whether a tumor is benign or malignant. Random Forest is 'Supervised (Ensemble Learning),' which means it combines multiple learning models (in this case, decision trees) to improve predictive performance. K-Nearest Neighbors is 'Supervised (Classification/Regression),' indicating its applicability to both types of prediction tasks. Support Vector Machines and Naive Bayes are listed as 'Supervised (Classification).' A minimal code sketch contrasting the regression and classification task types follows this table's analysis.
  • Frequency of use ratings: The 'Frequency of Use' is described qualitatively. Random Forest is rated 'Very High'. Linear Regression and Logistic Regression are rated 'High'. Decision Trees, K-Nearest Neighbors, and Naive Bayes are rated 'Moderate', while Support Vector Machines is rated 'Moderate to High'.
  • Example applications: The 'Common Applications' column provides examples for each algorithm. For instance, Linear Regression is used for 'Predictive modeling, trend analysis.' Logistic Regression is applied to 'Binary classification, medical diagnostics.' Random Forest is used for 'Fraud detection, customer churn prediction, genomics.' Support Vector Machines are used for 'Image recognition, text classification, bioinformatics.'
Scientific Validity
  • ✅ General overview of relevant algorithms: The table provides a useful, albeit high-level, overview of common ML algorithms relevant to healthcare, which can serve as a good starting point for readers less familiar with the field.
  • ✅ Accurate algorithm typing: The categorization of algorithms by type (e.g., Supervised Regression, Supervised Classification, Ensemble Learning) is generally accurate and standard in ML literature.
  • 💡 Unsubstantiated "Frequency of Use" claims: The "Frequency of Use" claims (High, Moderate, Very High) are not substantiated with evidence or citations within the table or the immediate reference text. Without knowing the methodology (e.g., a systematic count from a defined corpus of literature, expert consensus), these frequencies are subjective assertions. The validity of these claims is therefore questionable. The authors should clarify the basis for these frequency assessments.
  • 💡 Potential selectivity in "most-used" algorithms: The list of "most-used" algorithms might be selective and not fully representative of the entire healthcare ML landscape. The criteria for inclusion as "most-used" are not defined. For example, deep learning models (e.g., CNNs, RNNs) are increasingly used in healthcare, especially in imaging and sequential data analysis, but are not explicitly listed here, though 'Neural Network' is mentioned later in the paper.
  • 💡 Generality of listed applications: While the listed "Common Applications" are generally appropriate, they are broad. The table's purpose seems to be introductory, but for a scientific review, a more nuanced or specific list of applications, perhaps tied to particular healthcare challenges (e.g., specific disease prediction, resource allocation), could strengthen its contribution.
  • ✅ Serves as a contextual summary: The table provides a snapshot that is useful for context but does not, in itself, present novel research findings. Its value lies in summarizing existing knowledge for the reader.
Communication
  • ✅ Clear structure and readability: The table is well-structured with clear column headers (ML Algorithm, Type of Algorithm, Frequency of Use, Common Applications), making it easy to read and understand the information presented for each algorithm.
  • ✅ Concise presentation: The information is presented concisely, allowing for a quick overview of common ML algorithms in healthcare.
  • ✅ Accurate caption: The caption accurately describes the table's content.
  • 💡 Subjective "Frequency of Use" categories: The "Frequency of Use" column uses subjective categories (High, Moderate, Very High) without providing a scale, reference, or methodology for these classifications. This makes it difficult to interpret the relative frequencies objectively. Consider adding a footnote explaining the basis for these frequency ratings (e.g., based on the authors' literature search for this review, or citing a specific source that quantifies usage). Alternatively, if precise frequencies are unknown, using more nuanced qualitative descriptors or ranking could be considered, though a quantitative basis is always preferred.
  • 💡 Broadness of some listed applications: While the applications listed are generally correct, some are very broad (e.g., "Data mining" for Decision Trees). Providing slightly more specific or differentiated examples for each algorithm, where possible, could enhance the table's utility.
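To make the regression/classification distinction in Table 1 concrete, the following minimal sketch (synthetic data, not drawn from any reviewed study) contrasts a supervised regression model predicting LoS in days with a supervised classification model predicting a binary prolonged-stay label:

```python
# Illustrative only: supervised regression vs. classification on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # stand-in for clinical features
los_days = 3 + 2 * X[:, 0] + rng.normal(size=500)   # continuous target (days)
prolonged = (los_days > 4).astype(int)              # binary target (prolonged stay)

# Regression: predict the number of days.
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X, los_days, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(Xr_tr, yr_tr)
print("regression R2:", round(reg.score(Xr_te, yr_te), 3))

# Classification: predict whether the stay exceeds a threshold.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X, prolonged, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", round(clf.score(Xc_te, yc_te), 3))
```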

Results

Non-Text Elements

Table 2. Dataset overview.
Figure/Table Image (Page 11)
Table 2. Dataset overview.
First Reference in Text
Table 2 summarizes the datasets employed in these studies.
Description
  • Table overview: Table 2 lists the datasets used in the 12 primary research articles reviewed in this paper, identified by their reference numbers (e.g., [28], [29], etc.). For each study, the table specifies the name or type of dataset used and provides information on its size and the number of features (variables or characteristics used for analysis).
  • Use of public MIMIC datasets: Several studies utilized well-known public datasets. For instance, study [28] used MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care II), a database containing de-identified data from intensive care unit patients, with 4927 patients and 24 features. Studies [30] and [33] used MIMIC-III, a later version of this database; study [30] involved 8024 patient stays with 'various features', while study [33] listed its size as 'Not available' and its features as 'various features'.
  • Institutional or specific datasets: Other studies used institutional or specific datasets. Study [29] used data from 'UPT Puskesmas Arjasa Kangean' comprising 3055 patient records and 30 features. Study [35] used a 'Medical Institution Dataset' (size and features not specified). Study [37] used 'Femur Fracture Data' with 547 pre-DTAP (Diagnostic-Therapeutic-Assistance Pathway) and 562 post-DTAP records, with 'Various features'. Study [38] used 'Total Hip Arthroplasty' data with 792 samples and 13 features. Study [40] used 'Neurosurgical Treatment Cases' with 90,685 cases and 6 features.
  • Limited dataset details in some entries: For some studies, dataset details were limited. Study [32] used 'Electronic Health Records' with size and features 'Not specified'. Study [34] used an 'Unspecified' dataset with 350,393 records and 46 features. Study [36] used 'Electronic Health Records' with size 'Not specified' but 10 features. Study [39] used 'Patient Admission Data' with size and features 'Not specified'.
  • Variation in number of features: The number of features (input variables for the machine learning models) varies widely, from 6 features in study [40] to 46 features in study [34], with many studies reporting 'various features' or not specifying the number.
  • Variation in dataset size: The size of datasets also varies significantly, from 792 samples in study [38] to 350,393 records in study [34] and 90,685 cases in study [40].
Scientific Validity
  • ✅ Appropriate and necessary information for a review: Summarizing the datasets used in the reviewed studies is crucial for a literature review, as it provides context for the types of data, scale, and dimensionality that the machine learning models were applied to. This helps in understanding the scope and potential generalizability of the findings from those studies.
  • ✅ Good practice in summarizing dataset characteristics: The inclusion of both dataset source/name and its characteristics (size, features) is good practice, allowing readers to quickly grasp the nature of the data underpinning each study's results.
  • 💡 Limited detail impacts comparative utility: The frequent occurrence of 'Not specified', 'Not available', or 'various features' limits the scientific utility of the table for detailed comparison or meta-analysis. While this likely reflects limitations in the reporting of the original studies, it highlights a challenge in synthesizing information from diverse sources. The review itself acknowledges this by presenting the information as is.
  • ✅ Accurate reflection of source information (assumed): The table accurately reflects the information as cited from the original papers, which is essential for the integrity of a literature review. The authors of this review are summarizing, not generating new data here.
  • ✅ Demonstrates heterogeneity of data in the field: The diversity of datasets (public, institutional, specific conditions) and their varying sizes/features highlights the heterogeneous nature of research in hospital LoS prediction. This implicitly supports the review's aim to identify effective ML algorithms across different contexts.
  • 💡 Lacks information on data preprocessing/feature engineering: The table does not provide information on data preprocessing steps or specific feature engineering undertaken in the original studies, which are critical aspects of ML model development. While perhaps beyond the scope of a summary table, acknowledging this limitation of the summarized information could be useful context for the reader if not discussed elsewhere.
Communication
  • ✅ Clear structure and readability: The table is clearly structured with columns for 'Study', 'Dataset', and 'Size and Features', making it easy to locate information for each reviewed paper.
  • ✅ Accurate caption: The caption accurately reflects the content of the table.
  • 💡 Incomplete information for several entries: The use of 'Not specified' or 'Not available' for dataset size or features in several entries ([32], [34], [35], [36], [39]) limits the table's completeness and utility for comparative purposes. While this may reflect missing information in the original studies, it's a notable gap. If the information truly isn't in the source, this is a limitation of the source, not the table; however, if it was in the source and omitted, it's a table flaw. Assuming it reflects the source, this is a communication limitation inherited from the original studies.
  • 💡 Vagueness of 'various features': The term 'various features' is vague. While understandable if the exact number is large or complex to list, it reduces the specificity. If a range or predominant types of features were mentioned in the source papers, adding that detail could be beneficial.
  • 💡 Ambiguity in feature count for split dataset [37]: For study [37], the feature count is 'Various features' but the size is split into '547 (pre-DTAP), 562 (post-DTAP)'. It's unclear if 'Various features' applies to both pre- and post-DTAP groups or if feature sets differed. Clarification, if available in the source, would be helpful.
  • 💡 Inconsistent unit of size (patients vs. records vs. stays): Consistency in reporting patient numbers versus records would be ideal. For example, [28] lists 'patients', [29] lists 'patient records', [30] lists 'patient stays', [34] lists 'records'. While reflecting source terminology, a note explaining these distinctions or standardizing where possible could aid comparison.
Table 3. Comparation of algorithms and metrics.
Figure/Table Image (Page 14)
Table 3. Comparation of algorithms and metrics.
First Reference in Text
Table 3 presents the key performance metrics reported for each algorithm.
Description
  • Table overview: Table 3 summarizes the performance of various machine learning algorithms as reported in the 12 reviewed studies, identified by their reference numbers. For each study, it details the general method (e.g., Regression, Neural Network, Classification), the specific algorithm(s) used, the performance metric(s) reported (e.g., MAE, Accuracy, R2 Score), and the corresponding numerical values of these metrics.
  • Performance metrics for study [28] (Regression): Study [28] evaluated regression models. Logistic Regression showed a Mean Absolute Error (MAE) – an average of the absolute differences between predicted and actual values – of 198,379,877,732,011.9 (this large value with multiple commas is unusual and likely represents different scenarios or a concatenation of results, rather than a single MAE value). Ridge Regression, Lasso Regression, and ElasticNet had MAE values of 0.82131, 0.96865, and 0.95121, respectively.
  • Performance metrics for study [29] (Neural Network): Study [29] focused on Neural Networks, reporting Accuracy – the proportion of correct predictions – of 94.66% with default parameters, and 94.74% after Grid Search Optimization or Random Search Optimization.
  • Performance metrics for study [30] (Regression - R2 Score): Study [30] used regression algorithms, reporting R2 Score – a measure of how well the model's predictions approximate the actual outcomes, with 1 being a perfect fit. Random Forest achieved an R2 Score of 0.7780, XGBoost 0.7608, Gradient Boosting 0.7651, Logistic Regression 0.6466, and K-Nearest Neighbors 0.7306.
  • Performance metrics for study [32] (Survival Analysis): Study [32] employed Survival Analysis, achieving a Concordance Index – a measure of discrimination in survival models, indicating how well the model predicts the order of events – of 0.7.
  • Performance metrics for study [33] (Decision Trees): Study [33] used Decision Trees, reporting an Accuracy 'Above 80%'.
  • Performance metrics for study [34] (Regression/Neural Network - Accuracy): Study [34] reported dual Accuracy values for Random Forest (59.78% and 36.57%) and Neural Network (47.52% and 36.67%).
  • Performance metrics for study [35] (Classification - Accuracy): Study [35] used classification algorithms. Logistic Regression achieved an Accuracy of 80.54%, Modified Random Forest 81.09%, and Gradient Boosting 82.41%.
  • Performance metrics for study [36] (Classification - F1 Score): Study [36] also used classification, reporting F1 Score – a balance between precision and recall. Logistic Regression had an F1 Score of 0.59705, Decision Trees 0.59273, Neural Network 0.67248, Random Forest 0.66797, and Gradient Boosting 0.64848.
  • Performance metrics for study [37] (Regression - R2 Score, Std. Error): Study [37] used Multiple Linear Regression, reporting R2 (Pre-DTAP) of 0.63 and R2 (Post-DTAP) of 0.50. Standard Errors (Pre and Post) were 3.12 and 5.08, respectively.
  • Performance metrics for study [38] (Regression - RMSE, R2 Score): Study [38] used XGBoost for regression, achieving a Root Mean Squared Error (RMSE) – the square root of the average of squared differences between prediction and actual observation – of 2.03 and an R2 Score of 0.89.
  • Performance metrics for study [39] (Classification - Accuracy Not Specified): Study [39] used classification algorithms (Naive Bayes, Random Forest, Support Vector Machine), but the Accuracy values were 'Not Specified'.
  • Performance metrics for study [40] (Neural Network - MAE): Study [40] used a Neural Network (GPT-3), reporting MAE values of 2.37 days, 2 days, and 1.88 days (presumably for different model variations or evaluation subsets).
Scientific Validity
  • ✅ Appropriate goal of summarizing performance: The table attempts to consolidate key performance metrics from the reviewed studies, which is a fundamental component of a systematic literature review aiming to compare different approaches.
  • ✅ Reflects diversity of reported metrics: The inclusion of various metrics (MAE, Accuracy, R2, F1, Concordance Index, RMSE) reflects the diversity of evaluation approaches in the original studies and the different types of prediction tasks (regression vs. classification). This is methodologically sound as it represents what was reported.
  • 💡 Questionable validity of MAE values in study [28]: The MAE values reported for Logistic Regression in study [28] (198,379,877,732,011.9) are highly problematic and suggest a misunderstanding, misreporting, or lack of normalization/context from the original paper. Such large, disparate numbers without clear explanation or units are not interpretable as typical MAE values for LoS prediction. This significantly undermines the validity of this specific entry.
  • 💡 Missing units for error metrics limits interpretation: The lack of units for MAE and RMSE (e.g., days) makes it difficult to assess the practical significance of these error metrics. An RMSE of 2.03 is only meaningful in context (e.g., 2.03 days). This omission limits the scientific utility of these reported values.
  • 💡 'Not Specified' values prevent comparison: Reporting 'Not Specified' for accuracy in study [39] means no direct performance comparison can be made for these algorithms from this table. This is a limitation inherited from the source or the review's extraction process.
  • 💡 Lack of context for multiple performance values per algorithm: The presentation of multiple, distinct performance values for the same algorithm within a single study (e.g., study [29] Neural Network accuracies; study [34] dual accuracies; study [40] multiple MAEs) without clear differentiation of the conditions under which each was achieved (e.g., different hyperparameter settings, feature sets, or evaluation subsets) makes it hard to pinpoint a single definitive performance. While this reflects the source, the review could benefit from clarifying these distinctions if available in the original papers.
  • 💡 Inherent difficulty in cross-study comparison: Direct comparison of performance across different studies is inherently challenging due to variations in datasets (as shown in Table 2), specific feature sets, preprocessing techniques, and evaluation protocols used in the original papers. The table presents the data as is, but readers should be cautious about drawing strong comparative conclusions across studies without considering these underlying differences.
  • ✅ Useful compilation despite heterogeneity: The table serves as a useful compilation of reported results, but its scientific value for drawing definitive conclusions about algorithm superiority is limited by the heterogeneity and sometimes incomplete reporting of the primary studies.
Communication
  • ✅ Logical table structure: The table structure is logical, with columns for Study, Method, Algorithm, Metric, and Values, allowing for a systematic presentation of results from the reviewed literature.
  • 💡 Typographical error in caption: The caption has a typographical error; 'Comparation' should be 'Comparison'.
  • 💡 Confusing MAE values for study [28]: The presentation of multiple MAE values for Logistic Regression in study [28] (198,379,877,732,011.9) is confusing due to the vastly different scales and lack of units or context. It's unclear what these disparate numbers represent. Clarification or separation of these values with context is needed.
  • 💡 Ambiguous dual accuracy values for study [34]: For study [34], two accuracy values (59.78% and 36.57% for Random Forest; 47.52% and 36.67% for Neural Network) are listed without distinguishing what each value represents (e.g., different feature sets, different patient subgroups). This ambiguity hinders interpretation. Specify what each value corresponds to.
  • 💡 'Not Specified' values limit comparison: The entries 'Not Specified' for accuracy values in study [39] reduce the table's informativeness for those particular algorithms. While this may reflect the source material, it's a limitation in the presented comparison.
  • ✅ Appropriate inclusion of diverse metrics: The table mixes different types of metrics (e.g., R2 Score, Accuracy, MAE, F1 Score) which is appropriate given the different tasks (regression vs. classification) and reporting in original studies. However, this makes direct comparison across all studies challenging, which is an inherent difficulty in literature reviews.
  • 💡 Missing units for error metrics (MAE, RMSE): Units are missing for MAE and RMSE values (e.g., days for LoS prediction). Adding units would significantly improve the interpretability of these error metrics. For example, an MAE of '2.37' is meaningless without knowing if it's days, hours, etc. (a toy computation with explicit units is sketched after this table's analysis).
  • 💡 Ambiguity of 'Std. Error': The term 'Std. Error' in study [37] is ambiguous. It could refer to standard error of the mean, standard error of the estimate (for regression), etc. Specifying the exact type of standard error would enhance clarity.
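As a toy illustration of why units matter for the error metrics compiled in Table 3 (synthetic numbers, unrelated to any reviewed study), MAE, RMSE, and R2 for an LoS regression can be computed as follows:

```python
# Toy computation of the error metrics discussed above, with explicit units.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual_days = np.array([3.0, 5.0, 7.0, 2.0, 10.0])
predicted_days = np.array([4.0, 5.0, 6.0, 3.0, 8.0])

mae = mean_absolute_error(actual_days, predicted_days)           # in days
rmse = np.sqrt(mean_squared_error(actual_days, predicted_days))  # in days
r2 = r2_score(actual_days, predicted_days)                       # unitless
print(f"MAE = {mae:.2f} days, RMSE = {rmse:.2f} days, R2 = {r2:.2f}")
```

Reporting Table 3's figures with such units would make entries like an RMSE of 2.03 or an MAE of 2.37 directly interpretable.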

Discussion

Non-Text Elements

Table 4. Best-performing algorithms.
Figure/Table Image (Page 16)
Table 4. Best-performing algorithms.
First Reference in Text
Based on the summarized results in Table 4, we can analyze and rank the algorithms to determine the best-performing methods for predicting the LoS in hospitals.
Description
  • Table overview: Table 4 aims to highlight the best-performing machine learning algorithm from each of the 12 reviewed studies, along with the specific performance metric and its value that led to this designation. The studies are identified by their reference numbers.
  • Best performer from study [28]: For study [28], 'Regression' (specifically Ridge Regression, as identified in the discussion referencing Table 3) is listed with a metric of 'Ridge Regression' and a value of 0.82131 (this value corresponds to the MAE reported in Table 3 for Ridge Regression).
  • Best performer from study [29]: Study [29] identified 'Grid Search Optimization and Random Search Optimization' (applied to a Neural Network) as best, achieving an Accuracy – the proportion of correct predictions – of 94.74%.
  • Best performer from study [30]: From study [30], 'Random Forest' was best, with an R2 Score – a measure of how well predictions approximate actual outcomes, where 1 is a perfect fit – of 0.7780.
  • Best performer from study [32]: Study [32] reported 'Survival Analysis (Various models)' as best, with a Concordance Index – a measure of discrimination in survival models – of 0.7.
  • Best performer from study [33]: For study [33], 'Decision Tree' was best, achieving an Accuracy 'Above 80%'.
  • Best performer from study [34]: Study [34] listed 'Random Forest Regressor' with Accuracy values of 59.78% and 36.57%.
  • Best performer from study [35]: From study [35], 'Gradient Boosting' was best, with an Accuracy of 82.41%.
  • Best performer from study [36]: Study [36] identified 'Neural Network' as best, achieving an F1 Score – a balance between precision and recall – of 0.67248.
  • Best performer from study [37]: For study [37], 'Multiple Linear Regression (Pre-DTAP)' was best, with an R2 Score of 0.63.
  • Best performer from study [38]: Study [38] highlighted 'XGBoost' with an R2 Score of 0.89.
  • Best performer from study [39]: From study [39], 'Support Vector Machine' is listed as best, but its Accuracy is 'Not Specified'.
  • Best performer from study [40]: Study [40] identified 'GPT-3' as best, with a Mean Absolute Error (MAE) – an average of the absolute differences between predicted and actual values – of 2.37 days (units inferred from text, not in table).
Scientific Validity
  • ✅ Valuable synthesis goal: The table attempts to synthesize the top findings from each reviewed paper, which is a valuable step in a literature review to guide discussion towards impactful methods.
  • 💡 Potential oversimplification and undefined selection criteria for 'best': The selection of a single 'best-performing' algorithm per study simplifies comparison but can be an oversimplification if studies reported multiple high-performing models or if 'best' varied by metric. The criteria for this selection are not explicitly defined in the table, which could affect the interpretation of what 'best' signifies (e.g., highest accuracy, lowest error, best on a specific subgroup).
  • 💡 Inherent difficulty in comparing 'best' across heterogeneous studies and metrics: The direct comparison and ranking of these 'best-performing' algorithms across studies is inherently challenging due to the heterogeneity of datasets, metrics used (e.g., comparing an R2 of 0.89 to an Accuracy of 94.74% is not straightforward), and specific LoS prediction tasks. The table presents these top results, but any subsequent ranking based on this table must acknowledge these limitations.
  • 💡 Unsubstantiated 'best-performing' claim for study [39] due to missing metric value: The inclusion of study [39]'s Support Vector Machine with a 'Not Specified' accuracy value in a table of 'Best-performing algorithms' is problematic. If the performance metric is unknown, its claim as 'best-performing' is unsubstantiated within the context of this comparative table. The text states that for study [39], SVM was found to be the most accurate, but the metric was not provided in the original paper. This nuance is lost in the table.
  • ✅ Consistency with Table 3 for study [28] (assuming Ridge was best regression): The MAE value for study [28] (Regression, 0.82131) is taken from Table 3 where it was associated with Ridge Regression. The 'Best Algorithm' column just says 'Regression'. While consistent with Table 3 if Ridge Regression was indeed the best regression model in that study, the generalization to 'Regression' here is less specific.
  • 💡 Ambiguity in 'best-performing' status for study [34] with dual values: The dual accuracy values for study [34] (Random Forest Regressor: 59.78% and 36.57%) make it unclear which specific result led to its 'best-performing' status, or if both scenarios are considered. This ambiguity reduces the clarity of its 'best' designation.
  • ✅ Accurate extraction from Table 3 (mostly): The table accurately extracts the reported top metrics from Table 3 for most entries, serving as a filtered view. However, the interpretation that these can be easily ranked to find the single best method for LoS prediction overall is a strong claim that this table alone cannot fully support without extensive caveats about context-dependency.
Communication
  • ✅ Clear and logical structure: The table is clearly structured, making it easy to identify the study, the algorithm deemed best-performing from that study, the specific metric, and its value.
  • ✅ Accurate caption: The caption is concise and accurately reflects the table's intent to highlight top-performing algorithms from the reviewed studies.
  • 💡 Missing units for MAE: Similar to Table 3, units are missing for the MAE value (e.g., 2.37 days for study [40]). Adding units is crucial for the interpretability of this error metric. The text elsewhere mentions 'days', but the table should be self-contained.
  • 💡 Ambiguous dual accuracy values for study [34]: The dual accuracy values for study [34] (Random Forest Regressor: 59.78% and 36.57%) are presented without context or explanation for the two different figures. This ambiguity was also present in Table 3 and persists here, making it difficult to understand which value represents the 'best performance' or under what conditions. Specify what each value corresponds to.
  • 💡 Inclusion of 'Not Specified' value for best-performing algorithm: For study [39], the metric value for Support Vector Machine is 'Not Specified'. Including this entry in a 'Best-performing algorithms' table is contradictory if its performance isn't quantified. If the original paper claimed it was best without providing a specific metric value, this should be noted explicitly; otherwise, its inclusion here is confusing. Consider removing it or adding a footnote explaining why it's listed despite the missing value.
  • 💡 Lack of explicit selection criteria for 'best-performing': The selection criteria for what constitutes 'best-performing' from each study, especially when multiple algorithms or metrics were reported (as seen in Table 3), are not explicitly stated within or alongside Table 4. This makes it difficult to verify the selection. A brief note on selection criteria (e.g., 'highest reported accuracy/R2 for primary LoS prediction task') would improve transparency.