This research investigates the application of machine learning to predict hospital Length of Stay (LoS), a crucial factor in healthcare cost estimation and resource management. Using a large, publicly available dataset from the New York State Statewide Planning and Research Cooperative System (SPARCS), containing 2.3 million de-identified patient records, the authors develop and evaluate various predictive models. A key aspect of their approach is the emphasis on model interpretability, ensuring that the results are understandable and actionable for healthcare professionals.
The study employs a robust methodology, including data pre-processing, feature engineering, and the development of both classification and regression models. Several machine learning algorithms are explored, including linear regression, random forests, and CatBoost. Performance is evaluated using metrics such as R-squared (R2) for regression and Area Under the Curve (AUC) for classification. The models are trained and tested on separate datasets to assess their generalizability. Feature importance is analyzed using SHAP (SHapley Additive exPlanations) values to understand the key drivers of LoS predictions.
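To make this pipeline concrete, the following minimal sketch illustrates the kind of workflow the paper describes, combining CatBoost regression, an R2 score on held-out data, and SHAP values. The column names, file path, and hyperparameters are illustrative placeholders, not the authors' actual code (which they share on GitHub):

    # Minimal sketch of the described workflow; column names, the file path,
    # and hyperparameters are illustrative placeholders.
    import pandas as pd
    import shap
    from catboost import CatBoostRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sparcs.csv")  # hypothetical path to the SPARCS extract
    features = ["Age Group", "Type of Admission", "APR DRG Code",
                "APR Severity of Illness Code"]
    X, y = df[features], df["Length of Stay"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=0)

    model = CatBoostRegressor(cat_features=features, verbose=0)
    model.fit(X_train, y_train)
    print("R2 on held-out data:", r2_score(y_test, model.predict(X_test)))

    # SHAP values quantify how each feature drives individual predictions
    explainer = shap.TreeExplainer(model)
    shap.summary_plot(explainer.shap_values(X_test), X_test)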
The results demonstrate the effectiveness of machine learning in predicting LoS, with R2 scores of 0.82 for newborns using linear regression and 0.43 for non-newborns using CatBoost regression. Focusing on specific conditions, such as cardiovascular disease, further improves the predictive accuracy. Importantly, the study finds no evidence of bias based on demographic features like race, gender, or ethnicity. The authors also highlight the practical utility of interpretable models, such as decision trees, which provide clear, rule-based predictions based on key features like birth weight for newborns and diagnosis-related group classifications for non-newborns.
The study concludes that machine learning offers a valuable tool for predicting LoS and informing healthcare management decisions. The authors advocate for open data and open-source methodologies to promote transparency and reproducibility in healthcare research. They also acknowledge limitations in the available data, such as the lack of detailed physiological information and co-morbidity data, and suggest directions for future research.
This study makes a valuable contribution to healthcare analytics by demonstrating the potential of machine learning for predicting Length of Stay (LoS), a critical factor in cost management and resource allocation. The focus on model interpretability, particularly through the use of decision trees, enhances the practical utility of the findings for healthcare providers and administrators. The authors' commitment to open data and open-source methodology further strengthens the study's impact, promoting transparency and reproducibility in this important area of research.
While the models achieve reasonable predictive accuracy, particularly for newborns, the limitations imposed by the available data (e.g., lack of physiological data, co-morbidity information) are acknowledged. The study's strength lies in its broad scope, analyzing LoS across a wide range of disease categories and a large patient population, providing a high-level system view. This broad perspective, combined with the emphasis on interpretable models, allows for the identification of key drivers of LoS and informs potential areas for targeted interventions or policy changes.
The study's design, using readily available public data, inherently limits the strength of causal claims that can be made. However, it successfully demonstrates the feasibility and potential of machine learning for LoS prediction, paving the way for future research using more granular data and more sophisticated modeling techniques. The authors' advocacy for open data and open-source methods, combined with their clear articulation of the study's limitations and potential future directions, positions this work as a valuable contribution to the growing field of healthcare analytics.
The abstract effectively sets the stage by explaining the importance of LoS prediction for healthcare costs and capacity planning, justifying the research need.
The abstract clearly points out the model's ability to handle a large number of diagnosis codes simultaneously, distinguishing it from studies focused on single diseases.
The abstract provides concrete R2 scores for different models and patient groups, allowing readers to quickly grasp the study's main quantitative outcomes.
The abstract consistently highlights the importance of model interpretability and explainability, and connects the research to tangible benefits for various healthcare stakeholders, which is crucial for translational research.
High impact. The Results section begins with 'The study yields promising results...'. While the R2 scores are mentioned shortly after, placing the most impressive R2 score (e.g., 0.82 for newborns) directly after 'promising results' would immediately convey the magnitude of the findings and increase the abstract's impact. This is a standard practice in abstracts to quickly grab reader attention with key quantitative outcomes.
Implementation: Revise the first sentence of the Results section to: 'The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS, with the best R2 scores achieved being 0.82 for newborns...'. Alternatively, 'The study demonstrates the effectiveness of machine learning in predicting LoS, achieving noteworthy R2 scores of 0.82 for newborns...'
Medium impact. The Background mentions using 'a large open health dataset' and later names SPARCS with '2.3 million de-identified patient records.' Mentioning the 2.3 million records when 'large open health dataset' is first introduced would immediately establish the study's substantial data foundation and strengthen the context. This provides a stronger initial impression of the study's scope.
Implementation: In the Background section, consider integrating the scale earlier, for example: 'An example is the release of open healthcare data to researchers... We leverage such a resource, the New York State Statewide Planning and Research Cooperative System (SPARCS) dataset, containing 2.3 million de-identified patient records, to predict...'
Low impact. The sentence 'For instance, birth-weight is employed for predicting LoS in newborns, while diagnostic-related group classification proves valuable for non-newborns' is clear but could be slightly more concise for an abstract. While not a major issue, minor conciseness improvements can enhance readability in a space-constrained abstract.
Implementation: Consider rephrasing to something like: 'Key predictive features included birth-weight for newborns and diagnostic-related group classification for non-newborns.' This maintains clarity while being slightly more direct.
The introduction effectively establishes the broader societal and governmental push for transparency, skillfully narrowing the focus to its critical importance in the healthcare sector, particularly concerning open data initiatives.
The paper clearly articulates the problem of predicting Length of Stay (LoS) and robustly justifies its significance for healthcare cost management, hospital capacity planning, and empowering patient decision-making.
The authors explicitly state the twofold goal of their research, which provides a clear roadmap for the reader and sets expectations for the paper's contributions regarding both system design and practical application.
The introduction effectively bolsters the credibility of the proposed research by citing a previous, concrete example of their system's utility in identifying and enabling interventions for mental health issues, showcasing real-world impact.
Medium impact. While the paper eventually clarifies LoS prediction as the core problem, introducing it more directly when the 'problem of interest' is first mentioned in paragraph six on page 2 would enhance immediate clarity and focus. Currently, the reader waits until the end of that paragraph to understand the specific application of the analytics system. Specifying LoS prediction earlier would ground the subsequent discussion on its significance more effectively.
Implementation: Revise the sentence 'Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients' to something like 'Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to predicting patient Length of Stay (LoS), a problem of critical interest to both healthcare providers and patients.'
Medium impact. The Introduction states, 'By using a linear regression model, we obtain an R2 value of 0.42 when we predict the LoS from a set of 23 patient features' (Page 3). While this offers a preview of results, briefly specifying the patient population this R2 value pertains to (e.g., general non-newborn population, as implied by abstract comparisons) or clarifying it as an illustrative baseline would prevent potential minor ambiguities. The abstract presents distinct R2 values for different models and patient groups (newborns vs. non-newborns), so contextualizing this specific 0.42 value within the Introduction would enhance precision before the detailed Results section.
Implementation: Consider adding a brief qualifier to the sentence. For example: 'By using a linear regression model for the general adult patient population, we obtain an R2 value of 0.42...' or 'As an illustrative baseline using a linear regression model, we obtain an R2 value of 0.42...'
Low to medium impact. The introduction effectively advocates for open data and then transitions to the LoS prediction problem. While the link is implicit (LoS prediction utilizes open data), more explicitly stating early on how the open data philosophy enables or is crucial for tackling LoS prediction on a large scale could create an even more cohesive narrative. This would strengthen the bridge between the general advocacy for open data (like SPARCS) and the specific research focus. This suggestion is for the Introduction section as it helps frame the motivation and necessity of the approach from the start.
Implementation: After discussing the availability of open health data via SPARCS and the need to process it (end of paragraph 3, page 2), consider inserting a bridging sentence. For example: 'This open access to large-scale datasets is particularly vital for developing robust predictive models for complex healthcare metrics, such as patient Length of Stay (LoS), which require diverse data for accuracy and generalizability.' Then, proceed with the paragraph on healthcare cost transparency.
The explicit statement of three core requirements—utilization of open-source platforms for replicability, creation of interpretable and explainable models, and a demonstrated understanding of how input features determine outcomes—provides a strong ethical and practical foundation for the methodology, which is particularly crucial in the healthcare AI domain.
The detailed account of data cleaning steps, such as handling missing values (discarding 1.55% of samples), removing specific admission types ('Unknown'), and the reasoned partitioning of data into 'newborns' and 'non-newborns' based on feature characteristics ('Birth Weight'), enhances the study's reproducibility and demonstrates meticulous data handling.
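For illustration, these cleaning and partitioning steps could be reproduced along the following lines. This is a sketch, assuming the SPARCS column names 'Type of Admission' and 'Birth Weight', and that newborns are identified by a positive birth weight:

    import pandas as pd

    df = pd.read_csv("sparcs_2016.csv")  # hypothetical file name

    # Drop samples with missing values (reported as 1.55% of the dataset)
    df = df.dropna()

    # Remove records whose admission type is 'Unknown'
    df = df[df["Type of Admission"] != "Unknown"]

    # Partition on 'Birth Weight'; assumes newborns carry a positive value
    newborns = df[df["Birth Weight"] > 0]
    non_newborns = df[df["Birth Weight"] == 0]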
The paper details multiple feature encoding strategies, including distribution-dependent target encoding (replacing categorical data with the product of mean LoS and median LoS) and one-hot encoding, tailored to different model types (e.g., specific combined encoding for linear regression, target encoding for Random Forests, one-hot for multinomial logistic regression). This showcases a rigorous and adaptable approach to preparing categorical data for machine learning.
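A compact sketch of the two encoding strategies, following the description of the target encoding as the product of the per-category mean and median LoS (the function and column names are illustrative, not the authors' implementation):

    import pandas as pd

    df = pd.read_csv("sparcs_2016.csv")  # hypothetical file name

    def target_encode(df, col, target="Length of Stay"):
        # Replace each category with (mean LoS) * (median LoS) for that category
        stats = df.groupby(col)[target].agg(["mean", "median"])
        return df[col].map((stats["mean"] * stats["median"]).to_dict())

    df["APR MDC Code (encoded)"] = target_encode(df, "APR MDC Code")

    # One-hot encoding, e.g. for the multinomial logistic regression models
    one_hot = pd.get_dummies(df["Type of Admission"], prefix="admission")

    # In practice the encoding statistics should be computed on the training
    # folds only, so that the target does not leak into the test data.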
The methodology consistently emphasizes explainability by restricting model choices to those known for better interpretability (e.g., linear regression, logistic regression, decision trees, random forests with controlled depth for classification) and by employing SHAP analysis for feature importance. This directly addresses the stated requirement for explainable AI in healthcare.
The use of a 10% holdout test set, explicitly stated as unseen during the training phase, combined with tenfold cross-validation on the remaining 90% of the data for model training and parameter determination, constitutes a standard and robust approach for assessing model generalizability and mitigating overfitting.
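This evaluation protocol maps onto a few lines of scikit-learn. The sketch below is illustrative only; it assumes a prepared feature matrix X and target y, and the actual hyperparameters are those listed in the paper's Table 16:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score, train_test_split

    # Hold out 10% of the data, unseen during training and model selection
    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=0.10, random_state=42)

    # Tenfold cross-validation on the remaining 90% for training and tuning
    model = RandomForestRegressor(max_depth=6, random_state=42)  # illustrative depth
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")
    print(f"CV R2: {scores.mean():.2f} +/- {scores.std():.2f}")

    # Final fit and a single generalization estimate on the untouched holdout
    model.fit(X_train, y_train)
    print("Holdout R2:", model.score(X_holdout, y_holdout))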
Medium impact. The "Methods" section commences with the system architecture (Fig. 1) and the technological framework. Although the overall context of the paper implies LoS prediction, explicitly stating this as the primary application of the described methods at the very beginning of the "Methods" section would immediately orient the reader to the specific purpose of the subsequent detailed procedures. This would enhance the clarity and focus of this section, ensuring readers understand the goal before delving into the specifics of implementation.
Implementation: Insert a concise introductory sentence at the beginning of the "Methods" section on page 4, before the paragraph starting "We have designed the overall system architecture...". For example: "This section details the methodological approach employed to predict patient Length of Stay (LoS) using the SPARCS dataset, encompassing system design, data processing, feature engineering, and the development and evaluation of predictive models."
Low to medium impact. On page 6, the paper states that for the linear regression model, "we sampled a set of 6 categorical features..." which were then target encoded. While the features are listed, providing a brief rationale for the selection of these specific six features for this particular encoding strategy would enhance methodological transparency. Explaining if they were chosen based on preliminary importance, to represent diverse data aspects, or as an illustrative set would add a layer of justification to this feature engineering step. If the selection was primarily illustrative, a brief note to that effect would also be beneficial.
Implementation: After listing the 6 categorical features on page 6 ([‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, ‘APR MDC Code’]), add a short parenthetical note or a brief sentence explaining their selection basis. For instance: "...which we target encoded (these features were selected as they represent key clinical and administrative categories / based on preliminary analysis suggesting their potential relevance / as an illustrative set of commonly influential categorical variables) with the mean of the LoS..."
Medium impact. The paper mentions on page 8 that for classification models, the LoS was binned into "roughly balanced classes" (1 day, 2 days, 3 days, 4–6 days, >6 days), referencing Figs. 3 and 4 for distributional basis. While visual inspection and balance are noted, providing a more explicit quantitative or clinical rationale for these specific bin boundaries would strengthen the methodological rigor. Clarifying if these bins align with established clinical decision points, common LoS groupings in existing literature, or specific distributional quantiles beyond a general sense of balance would provide a more robust justification for this discretization strategy.
Implementation: On page 8, when introducing the LoS bins for classification models, expand slightly on the rationale. For example: "We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs. 3 and 4, and these bins were chosen to [mention if they align with clinical significance, literature standards, or specific quantile-based divisions that ensure adequate sample sizes per bin while maintaining clinical relevance]."
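Whatever rationale is ultimately given, the discretization itself is straightforward with pandas. A sketch, assuming a 'Length of Stay' column of integer day counts:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("sparcs_2016.csv")  # hypothetical file name

    # Bin LoS into the five classes used for classification:
    # (0,1] -> 1 day, (1,2] -> 2 days, (2,3] -> 3 days,
    # (3,6] -> 4-6 days, (6,inf) -> >6 days
    bins = [0, 1, 2, 3, 6, np.inf]
    labels = ["1 day", "2 days", "3 days", "4-6 days", ">6 days"]
    df["LoS class"] = pd.cut(df["Length of Stay"], bins=bins, labels=labels)
    print(df["LoS class"].value_counts())  # check the classes are roughly balanced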
The paper employs a diverse suite of metrics (R2 score, p-value for regression; true positive rate, false negative rate, F1 score, Brier score, AUC, DeLong test for classification) to thoroughly evaluate model performance across different tasks, providing a multifaceted and robust assessment.
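For reference, most of these metrics are available directly in scikit-learn; the DeLong test is the exception and requires a separate implementation. A sketch for the multiclass case, assuming integer labels y_true in 0..K-1 and a predicted-probability matrix proba:

    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score

    # y_true: integer class labels; proba: (n_samples, n_classes) probabilities
    y_pred = proba.argmax(axis=1)
    f1 = f1_score(y_true, y_pred, average="weighted")
    auc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")

    # Multiclass Brier score: mean squared error against one-hot labels
    onehot = np.eye(proba.shape[1])[y_true]
    brier = np.mean(np.sum((proba - onehot) ** 2, axis=1))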
The Results section is well-supported by numerous figures (SHAP plots, confusion matrices, density plots of actual vs. predicted LoS, scatter plots of regression fits, decision tree structures) and tables, which effectively visualize complex data, model outputs, and performance comparisons, significantly aiding reader comprehension.
The study rigorously investigates feature importance using SHAP analysis (Figs. 7-9) and explores model parsimony by testing models with minimal feature sets using CatBoost regression (Fig. 20). This provides valuable insights into the key drivers of LoS and the potential for simpler, efficient models.
The paper commendably presents separate analyses and models for newborn and non-newborn patient populations. This acknowledges the distinct predictive factors and LoS characteristics inherent to these groups, thereby enhancing the specificity, relevance, and potential applicability of the developed models.
The authors maintain transparency by clearly reporting the impact of different feature encoding schemes on regression results (Table 3) and by providing a dedicated table for model parameters and hyperparameters (Table 16), which enhances the study's reproducibility.
Medium impact. The Results section is dense, presenting findings from descriptive statistics, feature engineering, multiple classification and regression models, minimal feature set analysis, and decision trees. While individual subsections are clear, incorporating brief transitional sentences or a more explicit narrative to connect these diverse analytical components would enhance overall cohesion. This would help the reader synthesize the multifaceted findings more effectively as they progress through the section, rather than primarily relying on the Discussion for this synthesis.
Implementation: Introduce brief linking statements at the beginning of major subsections. For instance, transitioning from feature importance to classification: 'Building on the identified key predictive features, we subsequently assessed their performance in various classification models for LoS prediction.' Similarly, before regression: 'In parallel with classification, regression models were developed to predict continuous LoS values, providing a different perspective on model efficacy.' An introductory paragraph outlining the structure of the Results section could also be beneficial.
Low to medium impact. The Results section includes numerous tables detailing performance metrics (e.g., Tables 6-8, 10-15). While comprehensive, this extensive tabulation can make it challenging for readers to quickly grasp key comparative takeaways. For instance, AUC, DeLong test results, and Brier scores are presented in six separate tables. Consolidating some of these, particularly for direct comparisons across models or patient groups (newborn vs. non-newborn), or providing a more focused textual summary of the main comparative findings from these tables, could improve readability and immediate impact within the Results section.
Implementation: Consider creating a summary table that highlights the most critical comparative metrics (e.g., best AUCs, significance from DeLong tests, key Brier scores) for the main models across newborn/non-newborn categories. The more detailed individual tables (e.g., Tables 10-15) could then be referenced, with an option to move some to supplementary material if space is a concern. Alternatively, ensure the textual discussion accompanying these tables is highly targeted, explicitly stating the main comparative conclusions derived from them.
Low impact. Table 3 shows that a combination of 'One Hot and Target' encoding yields the best R2 score (0.42) for Linear Regression on non-newborn data. The text mentions this increases column dimensionality. While the Methods section details the encoding, briefly reiterating in the Results section, when discussing Table 3, the potential synergistic benefit of this combined approach (e.g., capturing both discrete category effects and LoS-related trends) would offer a more complete interpretation of this specific finding directly within its presentation context.
Implementation: When discussing the results of Table 3 (page 9 or 11), add a brief explanatory clause. For example: 'As shown in Table 3, the combination of one-hot and target encoding, which potentially captures both the distinct impact of individual categories and their overall learned association with LoS, achieved the highest R2 score of 0.42 for linear regression with non-newborn data.'
Table 2 Frequency of occurrence of the top 20 APR DRG descriptions in the dataset.
Fig. 6 A 3-D plot showing the distribution of the LoS for the top 20 most frequently occurring APR DRG descriptions.
Table 3 The regression results produced by varying the encoding scheme and the model.
Fig. 8 1-D SHAP plot for non-newborns, with features ordered from top to bottom by decreasing importance.
Fig. 9 A 2-D plot showing the relationship between SHAP values for one feature, "APR Severity of Illness Code", and the feature values themselves (non-newborns)
Fig. 10 A density plot showing the relationship between APR Severity of Illness Code and the LoS.
Fig. 11 A density plot showing the distribution of the birth weight values (in grams) versus the LoS.
Fig. 14 Density plot of the predicted length of stay versus the actual length of stay for the classifier model for non-newborns.
Fig. 15 Distribution of correctly predicted LoS values for each class used in our model.
Fig. 16 Scatter plot showing an instance of a linear regression fit to the data (newborns).
Fig. 18 Density plot of the predicted length of stay versus the actual length of stay for the classifier model for non-newborns.
Fig. 19 The three CCS diagnosis codes that produced the top three R² scores using linear regression.
Table 4 CCS diagnosis codes, descriptions, and R² scores for the top three CCS codes in Fig. 19.
Table 5 CCS diagnosis codes, descriptions, and R² scores for the lowest three CCS codes in Fig. 19.
Fig. 20 Results for models trained on minimal feature sets; the labels for each row on the left show combinations of different input features.
The discussion effectively highlights the significance of interpretable models, particularly the decision trees (Figs. 21 and 22), by explaining how key features like birth weight for newborns and APR DRG codes for non-newborns drive predictions, making the models understandable for healthcare providers and patients.
The paper candidly discusses the challenges in modeling LoS for certain conditions like schizophrenia and mood disorders, attributing poor model fit to limitations in the SPARCS dataset (e.g., lack of patient vitals, income level) and the inherent variability in treating these disorders. This transparency is commendable.
The discussion thoughtfully contextualizes the achieved R2 scores by comparing them with reported values in other healthcare studies (Bertsimas [70], Kshirsagar [71]), acknowledging the general difficulty in obtaining high R2 values in this domain, especially with large, diverse datasets.
A significant and reassuring finding is the lack of predictive power from features like race, gender, and ethnicity for LoS, suggesting an absence of systemic bias in the dataset concerning these demographic factors. This is well-contrasted with known biases in other fields like criminology.
The discussion thoroughly compares the performance of various regression and classification models (CatBoost, Random Forest, Logistic Regression) for both newborn and non-newborn populations, using multiple metrics (AUC, Brier score, DeLong test), providing a comprehensive overview of model efficacy.
The paper effectively situates its findings within the broader healthcare landscape by discussing policy implications (e.g., data collection for mental health, inclusion of ASA scores) and comparing its methodology and results with numerous other studies on LoS prediction, highlighting its unique contributions like model generality.
The discussion strongly advocates for open data, open-source methodologies, and reproducibility in healthcare research, referencing initiatives like OHDSI and the potential of applying their pipeline to emerging datasets like hospital pricing data. The commitment to sharing code on GitHub is a practical step towards this.
The "Limitations of our models" subsection clearly outlines the constraints imposed by the SPARCS dataset, such as the lack of detailed physiological data and information on co-morbidities, while also fairly stating the advantage of using large-scale population data for a high-level system view.
Medium impact. The Discussion effectively details the performance of various models (e.g., CatBoost being best for non-newborns, Linear Regression for newborns) and separately emphasizes the value of interpretable models like decision trees. However, a more explicit discussion bridging these two aspects—specifically, how the practical choice of a model (like CatBoost for its performance and pipeline simplification) balances against its interpretability compared to more transparent models like the presented decision trees—would enrich the practical guidance offered. This is pertinent to the Discussion's role in synthesizing findings for real-world application.
Implementation: After the statement about potentially using CatBoost for both patient groups for pipeline simplification, add a sentence or two that directly addresses the trade-off. For example: "While CatBoost demonstrates superior predictive performance for non-newborns, its inherent complexity relative to decision trees (as shown in Figs. 21-22) presents a trade-off. Hospital administrators and implementers should weigh the benefits of higher accuracy and streamlined processing against the value of the direct rule-based interpretability offered by simpler tree models when selecting a final system."
Low to medium impact. In the "Limitations of our models" section, it's stated that the approach provides a "high-level view of the operation of the healthcare system, which provides valuable insights." This is a valid point justifying the utility despite data granularity limitations. Expanding slightly on the nature of these valuable high-level insights would add more substance. This elaboration fits well within the Discussion as it helps qualify the study's contribution in light of its constraints.
Implementation: Following the quote, add a clause or sentence that specifies the types of insights. For example: "...which provides valuable insights, such as identifying broad-scale drivers of LoS across diverse patient populations (e.g., the consistent importance of APR DRG codes), highlighting systemic data gaps (like the absence of detailed psychiatric or physiological data), and pinpointing areas where resource allocation or policy interventions might yield significant impact at a population level."
Medium impact. The limitation regarding the inability to account for patient co-morbidities due to SPARCS de-identification is clearly stated. The Discussion section is an appropriate place to briefly touch upon potential, even if challenging, future research avenues or advocacy points to address this significant factor in LoS. This would strengthen the forward-looking aspect of the Discussion.
Implementation: After explaining the limitation, consider adding a sentence about future possibilities. For instance: "Future research could explore advanced statistical methods to infer the impact of co-morbidities from aggregated data if individual patient linkage remains restricted. Alternatively, this highlights the need to advocate for de-identification protocols that, while preserving privacy, might retain flags for common co-morbidity clusters, thereby enhancing the utility of such valuable public datasets for more nuanced healthcare analytics."
Fig. 21 A random forest tree that represents a best-fit model to the data for newborns.
Fig. 22 A random forest tree of depth 3 that represents a best-fit model to the data for non-newborns.
Table 6 Summary of the R² scores for the three regression models we investigated.
Table 7 Summary of the R² scores for the three regression models we investigated.
Table 8 Evaluation of multi-class classifier metrics for logistic regression for non-newborns.
Table 9 In the first scenario, we developed a multiclass classifier using logistic regression with the 2016 SPARCS dataset.
Table 10 AUC scores for the three classifiers we used: logistic regression, random forest, and CatBoost.
Table 11 AUC scores for the three classifiers we used: logistic regression, random forest, and CatBoost.
Table 14 Brier scores computed for the performance of the different classifier models we developed.
Table 15 Brier scores computed for the performance of the different classifier models we developed.
The Conclusion effectively reiterates the study's primary goal and situates it within the important context of enhancing government transparency and evidence-based policymaking through open data.
It provides a clear and concise summary of the paper's main practical achievement—the development of a robust machine learning pipeline for LoS prediction—highlighting its scale and direct relevance to healthcare costs.
The Conclusion strongly re-emphasizes the critical importance of model interpretability for adoption in healthcare, backing this up with specific model performance metrics and key predictive features identified.
The section clearly articulates the multifaceted desirable qualities of the presented approach, effectively linking them to tangible benefits for a range of key healthcare stakeholders.
High impact. The conclusion states, "The length of stay is directly related to costs." While accurate, briefly elaborating on how this direct relationship translates into actionable value for healthcare administrators (e.g., in terms of resource allocation, budgeting efficiencies) and policymakers (e.g., for developing cost-containment strategies or assessing system performance), as detailed in earlier sections like the Introduction and Discussion, would more powerfully underscore the practical importance of the LoS prediction capability within the concluding summary. This addition would strengthen the connection between the technical achievement and its real-world financial and operational utility, making the conclusion more impactful for those specific stakeholders.
Implementation: After the sentence "The length of stay is directly related to costs." on page 26, consider adding a concise clause or short sentence such as: ", making its accurate prediction crucial for efficient hospital resource management, strategic financial planning by administrators, and the development of cost-effective healthcare policies by governing bodies."
Medium impact. The conclusion rightly points out, "There is an opportunity to further improve performance for specific diseases," citing cardiovascular disease as an example where a better R2 score was achieved. To make this observation more forward-looking and constructive within the Conclusion section, a brief mention of potential avenues for achieving such improvements could be beneficial. Hinting at strategies like developing disease-specific feature sets or dedicated sub-models, which are alluded to in the Discussion, would provide a more complete thought and subtly guide future research perspectives stemming from this work.
Implementation: Following the sentence "If we restrict our analysis to cardiovascular disease, we obtain an improved R2 score of 0.62." on page 27, consider adding a phrase like: ", suggesting that future work focusing on disease-specific feature engineering or tailored sub-models could yield even more precise predictions for targeted conditions."