This research investigates the application of machine learning to predict hospital Length of Stay (LoS), a crucial factor in healthcare cost estimation and resource management. Using a large, publicly available dataset from the New York State Statewide Planning and Research Cooperative System (SPARCS), containing 2.3 million de-identified patient records, the authors develop and evaluate various predictive models. A key aspect of their approach is the emphasis on model interpretability, ensuring that the results are understandable and actionable for healthcare professionals.
The study employs a robust methodology, including data pre-processing, feature engineering, and the development of both classification and regression models. Several machine learning algorithms are explored, including linear regression, random forests, and CatBoost. Performance is evaluated using metrics such as R-squared (R2) for regression and Area Under the Curve (AUC) for classification. The models are trained and tested on separate datasets to assess their generalizability. Feature importance is analyzed using SHAP (SHapley Additive exPlanations) values to understand the key drivers of LoS predictions.
The results demonstrate the effectiveness of machine learning in predicting LoS, with R2 scores of 0.82 for newborns using linear regression and 0.43 for non-newborns using CatBoost regression. Focusing on specific conditions, such as cardiovascular disease, further improves the predictive accuracy. Importantly, the study finds no evidence of bias based on demographic features like race, gender, or ethnicity. The authors also highlight the practical utility of interpretable models, such as decision trees, which provide clear, rule-based predictions based on key features like birth weight for newborns and diagnostic-related group classifications for non-newborns.
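The rule-based interpretability attributed to decision trees above can be made concrete with a minimal sketch. This is not the authors' model: the data below are synthetic, and the simulated relationship (low birth weight driving longer newborn stays) is only an illustrative assumption consistent with the review's description.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
# Hypothetical newborn data: birth weight (grams) as the sole predictor of LoS (days).
birth_weight = rng.uniform(1000, 4500, size=300).reshape(-1, 1)
# Assumed pattern: underweight newborns (< 2500 g) stay markedly longer.
los = np.where(birth_weight[:, 0] < 2500, 10.0, 3.0) + rng.normal(0, 0.5, 300)

# A shallow tree yields a small set of human-readable rules.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(birth_weight, los)
rules = export_text(tree, feature_names=["birth_weight"])
print(rules)
```

The printed rules take the form "birth_weight <= ..." with predicted LoS at each leaf, which is exactly the kind of clear, threshold-based output the review credits with practical utility for clinicians.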
The study concludes that machine learning offers a valuable tool for predicting LoS and informing healthcare management decisions. The authors advocate for open data and open-source methodologies to promote transparency and reproducibility in healthcare research. They also acknowledge limitations in the available data, such as the lack of detailed physiological information and co-morbidity data, and suggest directions for future research.
This study makes a valuable contribution to healthcare analytics by demonstrating the potential of machine learning for predicting Length of Stay (LoS), a critical factor in cost management and resource allocation. The focus on model interpretability, particularly through the use of decision trees, enhances the practical utility of the findings for healthcare providers and administrators. The authors' commitment to open data and open-source methodology further strengthens the study's impact, promoting transparency and reproducibility in this important area of research.
While the models achieve reasonable predictive accuracy, particularly for newborns, the limitations imposed by the available data (e.g., lack of physiological data, co-morbidity information) are acknowledged. The study's strength lies in its broad scope, analyzing LoS across a wide range of disease categories and a large patient population, providing a high-level system view. This broad perspective, combined with the emphasis on interpretable models, allows for the identification of key drivers of LoS and informs potential areas for targeted interventions or policy changes.
The study's design, using readily available public data, inherently limits the strength of causal claims that can be made. However, it successfully demonstrates the feasibility and potential of machine learning for LoS prediction, paving the way for future research using more granular data and more sophisticated modeling techniques. The authors' advocacy for open data and open-source methods, combined with their clear articulation of the study's limitations and potential future directions, positions this work as a valuable contribution to the growing field of healthcare analytics.
The abstract effectively sets the stage by explaining the importance of LoS prediction for healthcare costs and capacity planning, justifying the research need.
The abstract clearly points out the model's ability to handle a large number of diagnosis codes simultaneously, distinguishing it from studies focused on single diseases.
The abstract provides concrete R2 scores for different models and patient groups, allowing readers to quickly grasp the study's main quantitative outcomes.
The abstract consistently highlights the importance of model interpretability and explainability, and connects the research to tangible benefits for various healthcare stakeholders, which is crucial for translational research.
High impact. The Results section begins with 'The study yields promising results...'. While the R2 scores are mentioned shortly after, placing the most impressive R2 score (e.g., 0.82 for newborns) directly after 'promising results' would immediately convey the magnitude of the findings and increase the abstract's impact. This is a standard practice in abstracts to quickly grab reader attention with key quantitative outcomes.
Implementation: Revise the first sentence of the Results section to: 'The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS, with the best R2 scores achieved being 0.82 for newborns...'. Alternatively, 'The study demonstrates the effectiveness of machine learning in predicting LoS, achieving noteworthy R2 scores of 0.82 for newborns...'
Medium impact. The Background mentions using 'a large open health dataset' and later names SPARCS with '2.3 million de-identified patient records.' Mentioning the 2.3 million records when 'large open health dataset' is first introduced would immediately establish the study's substantial data foundation and strengthen the context. This provides a stronger initial impression of the study's scope.
Implementation: In the Background section, consider integrating the scale earlier, for example: 'An example is the release of open healthcare data to researchers... We leverage such a resource, the New York State Statewide Planning and Research Cooperative System (SPARCS) dataset, containing 2.3 million de-identified patient records, to predict...'
Low impact. The sentence 'For instance, birth-weight is employed for predicting LoS in newborns, while diagnostic-related group classification proves valuable for non-newborns' is clear but could be slightly more concise for an abstract. While not a major issue, minor conciseness improvements can enhance readability in a space-constrained abstract.
Implementation: Consider rephrasing to something like: 'Key predictive features included birth-weight for newborns and diagnostic-related group classification for non-newborns.' This maintains clarity while being slightly more direct.
The introduction effectively establishes the broader societal and governmental push for transparency, skillfully narrowing the focus to its critical importance in the healthcare sector, particularly concerning open data initiatives.
The paper clearly articulates the problem of predicting Length of Stay (LoS) and robustly justifies its significance for healthcare cost management, hospital capacity planning, and empowering patient decision-making.
The authors explicitly state the twofold goal of their research, which provides a clear roadmap for the reader and sets expectations for the paper's contributions regarding both system design and practical application.
The introduction effectively bolsters the credibility of the proposed research by citing a previous, concrete example of their system's utility in identifying and enabling interventions for mental health issues, showcasing real-world impact.
Medium impact. While the paper eventually clarifies LoS prediction as the core problem, introducing it more directly when the 'problem of interest' is first mentioned in paragraph six on page 2 would enhance immediate clarity and focus. Currently, the reader waits until the end of that paragraph to understand the specific application of the analytics system. Specifying LoS prediction earlier would ground the subsequent discussion on its significance more effectively.
Implementation: Revise the sentence 'Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients' to something like 'Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to predicting patient Length of Stay (LoS), a problem of critical interest to both healthcare providers and patients.'
Medium impact. The Introduction states, 'By using a linear regression model, we obtain an R2 value of 0.42 when we predict the LoS from a set of 23 patient features' (Page 3). While this offers a preview of results, briefly specifying the patient population this R2 value pertains to (e.g., general non-newborn population, as implied by abstract comparisons) or clarifying it as an illustrative baseline would prevent potential minor ambiguities. The abstract presents distinct R2 values for different models and patient groups (newborns vs. non-newborns), so contextualizing this specific 0.42 value within the Introduction would enhance precision before the detailed Results section.
Implementation: Consider adding a brief qualifier to the sentence. For example: 'By using a linear regression model for the general adult patient population, we obtain an R2 value of 0.42...' or 'As an illustrative baseline using a linear regression model, we obtain an R2 value of 0.42...'
Low to medium impact. The introduction effectively advocates for open data and then transitions to the LoS prediction problem. While the link is implicit (LoS prediction utilizes open data), more explicitly stating early on how the open data philosophy enables or is crucial for tackling LoS prediction on a large scale could create an even more cohesive narrative. This would strengthen the bridge between the general advocacy for open data (like SPARCS) and the specific research focus. This suggestion is for the Introduction section as it helps frame the motivation and necessity of the approach from the start.
Implementation: After discussing the availability of open health data via SPARCS and the need to process it (end of paragraph 3, page 2), consider inserting a bridging sentence. For example: 'This open access to large-scale datasets is particularly vital for developing robust predictive models for complex healthcare metrics, such as patient Length of Stay (LoS), which require diverse data for accuracy and generalizability.' Then, proceed with the paragraph on healthcare cost transparency.
The explicit statement of three core requirements—utilization of open-source platforms for replicability, creation of interpretable and explainable models, and a demonstrated understanding of how input features determine outcomes—provides a strong ethical and practical foundation for the methodology, which is particularly crucial in the healthcare AI domain.
The detailed account of data cleaning steps, such as handling missing values (discarding 1.55% of samples), removing specific admission types ('Unknown'), and the reasoned partitioning of data into 'newborns' and 'non-newborns' based on feature characteristics ('Birth Weight'), enhances the study's reproducibility and demonstrates meticulous data handling.
The paper details multiple feature encoding strategies, including distribution-dependent target encoding (replacing categorical data with the product of mean LoS and median LoS) and one-hot encoding, tailored to different model types (e.g., specific combined encoding for linear regression, target encoding for Random Forests, one-hot for multinomial logistic regression). This showcases a rigorous and adaptable approach to preparing categorical data for machine learning.
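The distribution-dependent target encoding described above (replacing each category with the product of the mean and median LoS observed for that category) can be sketched in a few lines. Column names and data here are illustrative, not taken from SPARCS.

```python
import pandas as pd

def target_encode(df, cat_col, target_col):
    """Replace each category with mean(target) * median(target) within
    that category, mirroring the encoding the review describes."""
    stats = df.groupby(cat_col)[target_col].agg(["mean", "median"])
    mapping = (stats["mean"] * stats["median"]).to_dict()
    return df[cat_col].map(mapping)

# Toy example with a hypothetical admission-type column.
df = pd.DataFrame({
    "Type of Admission": ["Emergency", "Elective", "Emergency", "Elective"],
    "LoS": [4, 2, 6, 2],
})
encoded = target_encode(df, "Type of Admission", "LoS")
# Emergency: mean 5 * median 5 = 25; Elective: mean 2 * median 2 = 4.
```

In practice such encodings should be fit on training folds only, so that target statistics do not leak from the test data into the features.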
The methodology consistently emphasizes explainability by restricting model choices to those known for better interpretability (e.g., linear regression, logistic regression, decision trees, random forests with controlled depth for classification) and by employing SHAP (SHapley Additive exPlanations) analysis for feature importance. This directly addresses the stated requirement for explainable AI in healthcare.
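For intuition about what SHAP analysis reports, note that for a plain linear model the SHAP value of feature i on sample x reduces to an exact closed form, coef_i * (x_i - mean(x_i)), so it can be computed without the shap library. The sketch below uses that special case on synthetic data; it illustrates the attribution idea, not the paper's tree-based SHAP computation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic target: feature 0 matters most, feature 2 not at all.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
# Exact SHAP values for a linear model: coef_i * (x_i - E[x_i]).
shap_values = model.coef_ * (X - X.mean(axis=0))        # shape (200, 3)

# Global importance, as in SHAP summary plots: mean |SHAP value| per feature.
importance = np.abs(shap_values).mean(axis=0)
```

A useful sanity check is the additivity property: each row of SHAP values sums to that sample's prediction minus the average prediction, which is what makes the attributions directly interpretable.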
The use of a 10% holdout test set, explicitly stated as unseen during the training phase, combined with tenfold cross-validation on the remaining 90% of the data for model training and parameter determination, constitutes a standard and robust approach for assessing model generalizability and mitigating overfitting.
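The evaluation protocol described above (a 10% holdout never seen in training, with tenfold cross-validation on the remaining 90%) can be sketched as follows. The dataset, model, and the 23-feature width are stand-ins for the paper's setup, not a reproduction of it.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the SPARCS features (23 columns, as in the paper).
X, y = make_regression(n_samples=500, n_features=23, noise=10.0, random_state=0)

# 10% holdout, kept untouched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

# Tenfold cross-validation on the remaining 90% for model selection.
cv_scores = cross_val_score(LinearRegression(), X_train, y_train,
                            cv=10, scoring="r2")

# Final fit, then a single evaluation on the unseen holdout.
model = LinearRegression().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)
```

The key property the review praises is that the holdout score is computed exactly once, after all modeling decisions, which is what makes it an honest estimate of generalization.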
Medium impact. The "Methods" section commences with the system architecture (Fig. 1) and the technological framework. Although the overall context of the paper implies LoS prediction, explicitly stating this as the primary application of the described methods at the very beginning of the "Methods" section would immediately orient the reader to the specific purpose of the subsequent detailed procedures. This would enhance the clarity and focus of this section, ensuring readers understand the goal before delving into the specifics of implementation.
Implementation: Insert a concise introductory sentence at the beginning of the "Methods" section on page 4, before the paragraph starting "We have designed the overall system architecture...". For example: "This section details the methodological approach employed to predict patient Length of Stay (LoS) using the SPARCS dataset, encompassing system design, data processing, feature engineering, and the development and evaluation of predictive models."
Low to medium impact. On page 6, the paper states that for the linear regression model, "we sampled a set of 6 categorical features..." which were then target encoded. While the features are listed, providing a brief rationale for the selection of these specific six features for this particular encoding strategy would enhance methodological transparency. Explaining if they were chosen based on preliminary importance, to represent diverse data aspects, or as an illustrative set would add a layer of justification to this feature engineering step. If the selection was primarily illustrative, a brief note to that effect would also be beneficial.
Implementation: After listing the 6 categorical features on page 6 ([‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, ‘APR MDC Code’]), add a short parenthetical note or a brief sentence explaining their selection basis. For instance: "...which we target encoded (these features were selected as they represent key clinical and administrative categories / based on preliminary analysis suggesting their potential relevance / as an illustrative set of commonly influential categorical variables) with the mean of the LoS..."
Medium impact. The paper mentions on page 8 that for classification models, the LoS was binned into "roughly balanced classes" (1 day, 2 days, 3 days, 4–6 days, >6 days), referencing Figs. 3 and 4 for distributional basis. While visual inspection and balance are noted, providing a more explicit quantitative or clinical rationale for these specific bin boundaries would strengthen the methodological rigor. Clarifying if these bins align with established clinical decision points, common LoS groupings in existing literature, or specific distributional quantiles beyond a general sense of balance would provide a more robust justification for this discretization strategy.
Implementation: On page 8, when introducing the LoS bins for classification models, expand slightly on the rationale. For example: "We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs. 3 and 4, and these bins were chosen to [mention if they align with clinical significance, literature standards, or specific quantile-based divisions that ensure adequate sample sizes per bin while maintaining clinical relevance]."
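The binning strategy discussed above (classes of 1, 2, 3, 4-6, and >6 days) maps directly onto a standard discretization call. The LoS values below are invented for illustration.

```python
import pandas as pd

los = pd.Series([1, 2, 2, 3, 5, 4, 8, 6, 10, 1])

# Bin edges mirroring the paper's classes: 1, 2, 3, 4-6, > 6 days.
bins = [0, 1, 2, 3, 6, float("inf")]
labels = ["1", "2", "3", "4-6", ">6"]
los_class = pd.cut(los, bins=bins, labels=labels)
```

Inspecting `los_class.value_counts()` on the real data would be one quick way to document the "roughly balanced" claim quantitatively.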
The paper employs a diverse suite of metrics (R2 score, p-value for regression; true positive rate, false negative rate, F1 score, Brier score, AUC, Delong test for classification) to thoroughly evaluate model performance across different tasks, providing a multifaceted and robust assessment.
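Most of the classification metrics listed above are available off the shelf; a minimal sketch on toy binary labels is shown below (the Delong test is not in scikit-learn and is omitted here). The labels and probabilities are fabricated for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)        # ranking quality of probabilities
brier = brier_score_loss(y_true, y_prob)   # calibration: mean squared error of probs
f1 = f1_score(y_true, y_pred)              # balance of precision and recall

# True positive rate and false negative rate from the hard predictions.
tp = ((y_pred == 1) & (y_true == 1)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()
tpr = tp / (tp + fn)
fnr = fn / (tp + fn)
```

Using both probability-based metrics (AUC, Brier) and threshold-based ones (F1, TPR/FNR) is what gives the paper's evaluation its multifaceted character: a model can rank well yet be poorly calibrated, and vice versa.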
The Results section is well-supported by numerous figures (SHAP plots, confusion matrices, density plots of actual vs. predicted LoS, scatter plots of regression fits, decision tree structures) and tables, which effectively visualize complex data, model outputs, and performance comparisons, significantly aiding reader comprehension.
The study rigorously investigates feature importance using SHAP analysis (Figs. 7-9) and explores model parsimony by testing models with minimal feature sets using CatBoost regression (Fig. 20). This provides valuable insights into the key drivers of LoS and the potential for simpler, efficient models.
The paper commendably presents separate analyses and models for newborn and non-newborn patient populations. This acknowledges the distinct predictive factors and LoS characteristics inherent to these groups, thereby enhancing the specificity, relevance, and potential applicability of the developed models.
The authors maintain transparency by clearly reporting the impact of different feature encoding schemes on regression results (Table 3) and by providing a dedicated table for model parameters and hyperparameters (Table 16), which enhances the study's reproducibility.
Medium impact. The Results section is dense, presenting findings from descriptive statistics, feature engineering, multiple classification and regression models, minimal feature set analysis, and decision trees. While individual subsections are clear, incorporating brief transitional sentences or a more explicit narrative to connect these diverse analytical components would enhance overall cohesion. This would help the reader synthesize the multifaceted findings more effectively as they progress through the section, rather than primarily relying on the Discussion for this synthesis.
Implementation: Introduce brief linking statements at the beginning of major subsections. For instance, transitioning from feature importance to classification: 'Building on the identified key predictive features, we subsequently assessed their performance in various classification models for LoS prediction.' Similarly, before regression: 'In parallel with classification, regression models were developed to predict continuous LoS values, providing a different perspective on model efficacy.' An introductory paragraph outlining the structure of the Results section could also be beneficial.
Low to medium impact. The Results section includes numerous tables detailing performance metrics (e.g., Tables 6-8, 10-15). While comprehensive, this extensive tabulation can make it challenging for readers to quickly grasp key comparative takeaways. For instance, AUC, Delong test results, and Brier scores are presented in six separate tables. Consolidating some of these, particularly for direct comparisons across models or patient groups (newborn vs. non-newborn), or providing a more focused textual summary of the main comparative findings from these tables, could improve readability and immediate impact within the Results section.
Implementation: Consider creating a summary table that highlights the most critical comparative metrics (e.g., best AUCs, significance from Delong tests, key Brier scores) for the main models across newborn/non-newborn categories. The more detailed individual tables (e.g., Tables 10-15) could then be referenced, with an option to move some to supplementary material if space is a concern. Alternatively, ensure the textual discussion accompanying these tables is highly targeted, explicitly stating the main comparative conclusions derived from them.
Low impact. Table 3 shows that a combination of 'One Hot and Target' encoding yields the best R2 score (0.42) for Linear Regression on non-newborn data. The text mentions this increases column dimensionality. While the Methods section details the encoding, briefly reiterating in the Results section, when discussing Table 3, the potential synergistic benefit of this combined approach (e.g., capturing both discrete category effects and LoS-related trends) would offer a more complete interpretation of this specific finding directly within its presentation context.
Implementation: When discussing the results of Table 3 (page 9 or 11), add a brief explanatory clause. For example: 'As shown in Table 3, the combination of one-hot and target encoding, which potentially captures both the distinct impact of individual categories and their overall learned association with LoS, achieved the highest R2 score of 0.42 for linear regression with non-newborn data.'
Table 2 This table depicts the frequency of occurrence of the top 20 APR DRG descriptions in the dataset.
Fig. 6 A 3-D plot showing the distribution of the LoS for the top-20 most frequently occurring APR DRG descriptions.
Table 3 The regression results produced by varying the encoding scheme and the model.