Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

Gerardo A. Flores, Alyssa H. Smith, Julia A. Fukuyama, Ashia C. Wilson
arXiv: 2506.14540v2
Massachusetts Institute of Technology

Overall Summary

Study Background and Main Findings

This paper addresses the critical issue of misaligned evaluation metrics in clinical machine learning. Current standards like accuracy and AUC-ROC fail to adequately capture clinically relevant factors such as model calibration (reliability of probability scores), robustness to distributional shifts (changes in data characteristics between development and deployment), and asymmetric error costs (where different types of errors have varying clinical consequences). The authors propose a new evaluation framework based on proper scoring rules, specifically leveraging the Schervish representation, which provides a theoretical basis for averaging cost-weighted performance across clinically relevant ranges of class balance. They derive an adjusted cross-entropy (log score) metric, termed the DCA log score, designed to be sensitive to clinical deployment conditions and prioritize calibrated and robust models.

The paper systematically critiques accuracy and AUC-ROC, demonstrating their limitations in handling label shift, cost asymmetry, and calibration. It introduces a taxonomy of set-based evaluation metrics and then presents the Schervish representation as a way to connect proper scoring rules to cost-weighted errors. The proposed framework adapts the Schervish representation to handle label shift and cost asymmetry by averaging performance over clinically relevant ranges of class balance. This leads to the development of the clipped cross-entropy and DCA log score, which are designed to be more sensitive to clinical priorities.
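As a loose illustration only (the paper derives its clipped cross-entropy and DCA log score formally; the interval bounds `lo` and `hi` below are hypothetical placeholders, not the paper's parameters), one way to make a log score insensitive to confidence outside a clinically relevant threshold range is to clip predictions before scoring:

```python
import numpy as np

def clipped_log_score(y, p, lo=0.05, hi=0.50):
    """Cross-entropy with predictions clipped to a decision-relevant
    interval [lo, hi]: confidence beyond the range of plausible
    treatment thresholds earns no additional credit."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), lo, hi)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# An overconfident prediction scores no better than one at the boundary:
same = clipped_log_score([1], [0.99]) == clipped_log_score([1], [0.50])
```

This sketch captures only the qualitative behavior (indifference to confidence outside the clipping interval); the paper's construction additionally averages cost-weighted performance over a range of class balances.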

The authors apply their framework to analyze racial disparities in in-hospital mortality prediction using the eICU dataset. They demonstrate how the clipped cross-entropy approach can decompose performance differences between subgroups into components attributable to the predictive mechanism versus label shift, and to calibration versus sharpness. This decomposition allows for a more nuanced understanding of performance disparities and highlights the potential for bias detection and mitigation.

The paper concludes by discussing limitations and future research directions. Challenges related to cost uncertainty, sampling variability, adaptive base rate estimation, and the semantics of cost parameterization are identified as areas requiring further investigation. The authors emphasize the need for evaluation methods that are both conceptually principled and practically actionable, aligning with the goals of evidence-based medicine and patient benefit in clinical decision support.

Research Impact and Future Directions

This paper offers a valuable contribution to the field of machine learning evaluation, particularly for clinical applications. By highlighting the shortcomings of standard metrics like accuracy and AUC-ROC in capturing clinically relevant factors such as calibration, label shift, and cost asymmetry, it motivates the adoption of more nuanced evaluation methods. The proposed framework, grounded in the Schervish representation and Decision Curve Analysis, provides practical tools like the clipped cross-entropy and DCA log score to address these limitations. The eICU case study demonstrates the framework's ability to decompose performance disparities between subgroups, offering actionable insights for model development and deployment.

However, the paper also acknowledges important limitations, including challenges related to cost uncertainty, sampling variability, and the semantics of cost parameterization. These limitations underscore the need for further research into tractable approximations, robust variance estimation techniques, and clearer guidelines for cost modeling. The paper's explicit discussion of these open problems not only strengthens its scientific rigor but also provides a roadmap for future work in this crucial area.

Overall, the paper advocates for a shift in evaluation practices towards methods that are both conceptually principled and practically actionable in clinical settings. By integrating decision theory, causal reasoning, and clinically relevant calibration metrics, it aims to improve the alignment of machine learning models with the goals of evidence-based medicine and patient benefit. This emphasis on real-world clinical utility makes the paper's contributions particularly relevant for practitioners and researchers working on machine learning applications in healthcare.

Critical Analysis and Recommendations

Clear Problem Statement and Solution Overview (written-content)
The abstract clearly identifies the core problem of inadequate evaluation metrics in clinical ML and succinctly presents the proposed solution and its theoretical basis, effectively motivating the work and highlighting its practical benefits.
Section: Abstract
Clarify Link Between Theory and Method (written-content)
While the abstract effectively summarizes the key aspects, clarifying the direct link between the Schervish representation and the proposed averaging method would enhance understanding for a broader audience.
Section: Abstract
Strong Motivation, Clear Principles, and Structured Outline (written-content)
The introduction effectively motivates the research by establishing the clinical importance of ML and articulating clear guiding principles for scoring functions. The explicit outline of contributions and paper structure enhances readability.
Section: Introduction
Explain Expected Value and Miscalibration Link (written-content)
While the introduction mentions using expected value calculations to measure miscalibration, briefly explaining this connection would improve accessibility for readers less familiar with decision theory.
Section: Introduction
Comprehensive Literature Review and Clear Differentiation (written-content)
The related work section provides a comprehensive and well-structured review, highlighting seminal works and clearly differentiating the paper's approach, particularly regarding the use of uniform intervals for dispersion.
Section: Related Work
Justify Intuitiveness of Uniform Intervals (written-content)
While the paper claims uniform intervals are more intuitive, briefly justifying this choice within the related work section would strengthen the argument.
Section: Related Work
Systematic Progression and Clear Motivation for Calibration (written-content)
The section systematically introduces advanced metrics, clearly articulating accuracy's limitations and motivating the need for calibration beyond decision optimization. The concise introduction of the Schervish representation effectively sets the stage.
Section: Accuracy: Calibration without Label Shift Uncertainty
Strengthen Link Between Label Shift and Calibration; Clarify "Coherent" Scores (written-content)
Explicitly linking label shift uncertainty to the need for broader calibration earlier in the section would improve the flow. A brief explanation of "coherent" scores would enhance understanding of PAMA/PAMNB.
Section: Accuracy: Calibration without Label Shift Uncertainty
Novel Interpretation and Comprehensive Critique of AUC-ROC (written-content)
The section provides a novel reinterpretation of AUC-ROC (Theorem 4.2) and a comprehensive critique of its limitations, including its disregard for calibration and reliance on a model-dependent averaging distribution.
Section: AUC-ROC: Label Shift Uncertainty without Calibration
Clarify Limited Robustness and Illustrate Calibration Impact (written-content)
Elaborating on the "limited form of robustness" offered by AUC-ROC and illustrating the practical consequences of its indifference to calibration would strengthen the argument.
Section: AUC-ROC: Label Shift Uncertainty without Calibration
Concrete Example and Quantitative Decomposition (written-content)
The eICU example effectively demonstrates the method's utility in dissecting subgroup performance disparities and critiques conventional metrics. The quantitative decomposition of disparity components is a key strength.
Section: Application to Subgroup Decomposition
Connect Schervish to Decomposition Power (written-content)
Explicitly linking the Schervish representation to the decomposition capabilities would strengthen the methodological narrative and highlight the framework's advantage.
Section: Application to Subgroup Decomposition
Comprehensive Summary, Limitations, and Future Directions (written-content)
The discussion effectively recaps the contributions, thoroughly acknowledges limitations, and proposes constructive future research directions, maintaining a strong connection to clinical relevance.
Section: Discussion
Strengthen Contribution-Limitation Link and Explain Forecasting vs. Classification (written-content)
A more explicit narrative bridge between contributions and limitations, and further explanation of the forecasting vs. classification tension, would enhance the discussion's cohesiveness.
Section: Discussion

Section Analysis

Accuracy: Calibration without Label Shift Uncertainty

Non-Text Elements

Table 1: Taxonomy of set-based evaluation metrics
Full Caption

Table 1: Taxonomy of set-based evaluation metrics. Each row represents a different approach to handling error costs, and each column represents a different approach to handling class balance.

First Reference in Text
Table 1: Taxonomy of set-based evaluation metrics.
Description
  • Table Purpose and Organizing Dimensions: Table 1 organizes different ways to measure how well a classification model performs, specifically for tasks where the outcome is one of two options (binary classification). It structures these methods based on two main considerations: how the measurement deals with 'error costs' and how it handles 'class balance'. 'Error costs' refer to the idea that making one type of mistake (e.g., saying someone is sick when they are healthy) might be more or less serious than making the other type of mistake (e.g., saying someone is healthy when they are sick). 'Class balance' refers to whether the dataset has an equal number of examples for each outcome (e.g., equal numbers of sick and healthy patients) or an unequal number (e.g., many more healthy patients than sick ones).
  • Row Categories: Handling Error Costs: The table has three rows, each representing a different approach to incorporating error costs into the evaluation: "Accuracy" (which typically treats all errors equally), "Weighted Accuracy" (which assigns different weights to correctly classifying each class, implicitly handling asymmetric error costs), and "Net Benefit" (a metric often used in medical decision making that quantifies the benefit of using a model considering the relative costs of true positives, false positives, etc.).
  • Column Categories: Handling Class Balance: The table has three columns, each representing a different approach to handling class balance: "Empirical" (likely meaning evaluation on the observed class distribution of the dataset), "Balanced" (evaluation as if the classes were balanced, e.g., 50/50), and "Prior-Adjusted Maximum" (evaluation that considers a target or deployment class prevalence, potentially different from the training data, and optimizes for it).
  • Specific Metrics Listed: The cells of the table contain acronyms for specific evaluation metrics that fit the intersection of the row and column categories. For example: - Under "Empirical" and "Accuracy", the metric is "Accuracy". - Under "Balanced" and "Accuracy", the metric is "BA" (Balanced Accuracy). - Under "Prior-Adjusted Maximum" and "Accuracy", the metric is "PAMA" (Prior-Adjusted Maximum Accuracy). Similar acronyms fill the table: WA (Weighted Accuracy), BWA (Balanced Weighted Accuracy), PAMWA (Prior-Adjusted Maximum Weighted Accuracy), NB (Net Benefit), BNB (Balanced Net Benefit), and PAMNB (Prior-Adjusted Maximum Net Benefit).
  • Equivalence Note for Balanced Metrics: A note below the table clarifies that when the "Balanced" approach to class balance is used (the middle column), the metrics for "Weighted Accuracy" (BWA) and "Net Benefit" (BNB) become equivalent. This means that under balanced conditions, these two ways of accounting for error costs yield the same or a proportionally scaled result.
Scientific Validity
  • ✅ Logical Taxonomy Structure: The proposed taxonomy provides a structured and logical way to categorize set-based evaluation metrics along two critical dimensions relevant to clinical machine learning: sensitivity to error costs and robustness to class imbalance/label shift. This is a useful conceptual framework.
  • ✅ Relevance of Included Metrics: The metrics listed (Accuracy, BA, PAMA, WA, BWA, PAMWA, NB, BNB, PAMNB) are generally recognized or are plausible extensions of existing metrics within the machine learning and medical decision-making literature. The categorization appears consistent with the typical properties of these metrics.
  • ✅ Valid Distinction of Class Balance Approaches: The distinction between empirical, balanced, and prior-adjusted maximum approaches to class balance is a valid and important one, reflecting different assumptions and goals in model evaluation.
  • ✅ Insightful Equivalence Note: The note about the equivalence of Weighted Accuracy and Net Benefit under balanced conditions is an insightful observation that might not be immediately obvious, adding to the utility of the table.
  • 💡 Scope of "Set-Based" Metrics: While the taxonomy is useful, it is focused on "set-based" metrics. The paper also discusses calibration and proper scoring rules. The table doesn't explicitly incorporate metrics that directly measure calibration (like ECE, Brier score variants that might fit here) or sharpness, which are central themes of the paper. This is not a flaw of the table itself for its stated purpose, but its scope should be understood in the context of the paper's broader aims. Perhaps the term "set-based" is meant to distinguish these from scoring rules applied to probabilistic outputs directly.
  • 💡 Dependence on External Definitions for Novel Metrics: The definitions or derivations of the "Prior-Adjusted Maximum" versions of these metrics (PAMA, PAMWA, PAMNB) are critical for their correct application and interpretation. The table itself doesn't provide this but relies on the text (e.g., Definition 3.3 for PAMA, Definition 3.4 for PAMNB). The validity of these specific novel metrics depends on those definitions being sound and well-justified.
Communication
  • ✅ Clear Structure: The tabular format is an effective way to organize and present a taxonomy of metrics based on two distinct dimensions (handling error costs and class balance).
  • ✅ Informative Caption: The caption clearly explains the organizing principle of the table: rows for error cost approaches and columns for class balance approaches. This helps the reader understand how to navigate the table.
  • 💡 Acronym Usage: The use of acronyms (BA, WA, BWA, NB, BNB, PAMA, PAMWA, PAMNB) is concise but requires the reader to either know them or for them to be defined in the text. For a table aiming to be somewhat self-contained, brief expansions or references to where they are defined would be beneficial. Suggestion: Consider adding a footnote or a brief parenthetical expansion for less common acronyms, or ensure they are defined immediately preceding or following the table's first mention.
  • ✅ Header Clarity: The column headers ("Empirical", "Balanced", "Prior-Adjusted Maximum") and row headers ("Accuracy", "Weighted Accuracy", "Net Benefit") are reasonably clear, though "Empirical" as a contrast to "Balanced" might benefit from a slightly more descriptive term if space permits (e.g., "Standard/Unadjusted").
  • ✅ Useful Interpretive Note: The note "Note that when balanced, the second and third rows are equivalent" is an important piece of information for correctly interpreting the table. Placing it directly below the table is appropriate.
Table 2: Value function for Accuracy
First Reference in Text
Table 2: Value function for Accuracy
Description
  • Table Purpose: Defining Value Function for Accuracy: Table 2 defines a 'value function' for the classification metric 'Accuracy'. A value function in this context assigns a numerical score (value) to each possible outcome of a prediction. The table is structured as a 2x2 grid, representing the four possible scenarios in a binary (two-outcome) classification problem.
  • Table Structure: True States vs. Predicted Actions: The columns represent the true state of an individual: 'y=1' indicates the individual truly has the condition (e.g., Syphilis), and 'y=0' indicates they do not (e.g., No Syphilis). The rows represent the model's prediction or the action taken based on it: 'ŷ=1' means the model predicts the condition is present (e.g., Treat for Syphilis), and 'ŷ=0' means the model predicts the condition is absent (e.g., Don't treat).
  • Cell Values: Scores for Prediction Outcomes: The cells of the table show the value assigned to each combination: - If the true state is 'y=1' (has Syphilis) and the prediction is 'ŷ=1' (Treat), the value is 1. This is a True Positive (correctly identifying a sick person). - If the true state is 'y=0' (No Syphilis) and the prediction is 'ŷ=1' (Treat), the value is 0. This is a False Positive (incorrectly diagnosing a healthy person as sick). - If the true state is 'y=1' (has Syphilis) and the prediction is 'ŷ=0' (Don't treat), the value is 0. This is a False Negative (incorrectly diagnosing a sick person as healthy). - If the true state is 'y=0' (No Syphilis) and the prediction is 'ŷ=0' (Don't treat), the value is 1. This is a True Negative (correctly identifying a healthy person).
  • Relation to Standard Accuracy: This value function directly corresponds to the standard definition of accuracy, where correct classifications (True Positives and True Negatives) are given a value of 1, and incorrect classifications (False Positives and False Negatives) are given a value of 0. Accuracy is then calculated as the sum of these values divided by the total number of instances.
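That correspondence can be checked mechanically; a minimal sketch (labels and predictions invented for illustration), assuming the Table 2 value function as described:

```python
import numpy as np

# Value function for accuracy (Table 2): 1 for TP and TN, 0 for FP and FN.
def v_accuracy(y, yhat):
    return 1.0 if y == yhat else 0.0

y    = np.array([1, 0, 1, 0, 0])
yhat = np.array([1, 0, 0, 0, 1])

# Averaging the per-instance values reproduces standard accuracy.
acc = np.mean([v_accuracy(a, b) for a, b in zip(y, yhat)])
# acc == np.mean(y == yhat) == 0.6
```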
Scientific Validity
  • ✅ Correct Representation of Accuracy's Value Function: The table accurately represents the standard value function (or utility/loss function where loss = 1 - value) underlying the calculation of simple accuracy. It assigns a value of 1 to correct classifications (TP, TN) and 0 to incorrect ones (FP, FN).
  • ✅ Appropriateness for Basic Accuracy: This representation is fundamental and appropriate for defining the most basic form of accuracy, which treats all types of errors (FP and FN) as equally costly and all types of correct classifications as equally beneficial.
  • ✅ Highlights Limitations of Standard Accuracy: The table implicitly highlights the limitations of standard accuracy that the paper aims to address: it does not differentiate between the costs of false positives and false negatives, nor does it inherently account for class imbalance. This makes it a good starting point for discussing more nuanced metrics.
  • ✅ Generalizability Beyond Example: The use of a specific clinical example ("Syphilis") is illustrative but does not change the fundamental mathematical definition of the value function for accuracy, which is general.
Communication
  • ✅ Clear Structure: The table uses a standard 2x2 contingency table format, which is familiar and easy to understand for representing outcomes of a binary classification.
  • ✅ Clear and Intuitive Labeling: The labels for true states (columns: "y=1 (Syphilis)", "y=0 (No Syphilis)") and predicted actions/labels (rows: "ŷ=1 (Treat)", "ŷ=0 (Don't treat)") are clear and use a concrete example (Syphilis) which aids intuition.
  • ✅ Unambiguous Values: The numerical values (1 for correct classification, 0 for incorrect) are standard for a simple accuracy value function and are unambiguous.
  • 💡 Notation V_{1/2}(y, ŷ): The notation V_{1/2}(y, ŷ) in the top-left cell, representing the value function, is specific. While V(y, ŷ) is common, the subscript '1/2' might require clarification if it has a specific meaning beyond this context (e.g., related to balanced classes or equal costs, which is implicit in standard accuracy). Suggestion: If '1/2' has a specific implication beyond standard accuracy (e.g., tied to balanced scenarios or specific normalization), briefly note it. Otherwise, V(y, ŷ) would suffice.
Table 3: Value function for Balanced Accuracy
First Reference in Text
Table 3: Value function for Balanced Accuracy
Description
  • Table Purpose: Value Function for Balanced Accuracy: Table 3 specifies a 'value function' used in the calculation of 'Balanced Accuracy'. A value function assigns a numerical score to each possible outcome of a classification. This table is for binary classification, where there are two possible true states and two possible predictions.
  • Table Structure: True States vs. Predicted Actions: The table is a 2x2 grid. The columns represent the true actual condition of an individual: 'y=1' (e.g., has Syphilis) and 'y=0' (e.g., No Syphilis). The rows represent the model's prediction or the resulting action: 'ŷ=1' (e.g., Treat for Syphilis) and 'ŷ=0' (e.g., Don't treat).
  • Cell Values and Role of π₀: The values in the cells are defined in terms of π₀ (pi-naught), which represents the empirical class prevalence of the positive class (y=1) in the dataset. That is, π₀ is the proportion of actual positive cases in the data being evaluated. - True Positive (y=1, ŷ=1): Correctly identifying a positive case. Value = 1 / (2π₀). - False Positive (y=0, ŷ=1): Incorrectly identifying a negative case as positive. Value = 0. - False Negative (y=1, ŷ=0): Incorrectly identifying a positive case as negative. Value = 0. - True Negative (y=0, ŷ=0): Correctly identifying a negative case. Value = 1 / (2(1-π₀)).
  • Mechanism of Balancing: This value function achieves 'balance' by weighting the contribution of correct predictions for each class by the inverse of that class's prevalence (scaled by 2). For instance, if the positive class is rare (small π₀), a correct positive prediction gets a higher weight (1/(2π₀)). This ensures that the performance on both the majority and minority classes contributes equally to the overall Balanced Accuracy score, which is typically calculated as the average of sensitivity (true positive rate) and specificity (true negative rate). Summing these values over all instances and averaging appropriately yields the Balanced Accuracy.
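The balancing mechanism can be verified numerically; a small sketch with an invented imbalanced dataset, assuming the Table 3 cell values as described:

```python
import numpy as np

def v_balanced(y, yhat, pi0):
    """Value function for balanced accuracy (Table 3): correct positives
    earn 1/(2*pi0), correct negatives 1/(2*(1-pi0)), errors earn 0."""
    if y == 1 and yhat == 1:
        return 1.0 / (2 * pi0)
    if y == 0 and yhat == 0:
        return 1.0 / (2 * (1 - pi0))
    return 0.0

y    = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # 2 positives, 6 negatives
yhat = np.array([1, 0, 0, 0, 0, 0, 1, 1])
pi0  = y.mean()                              # empirical prevalence = 0.25

ba = np.mean([v_balanced(a, b, pi0) for a, b in zip(y, yhat)])
sens, spec = 1 / 2, 4 / 6                    # TPR and TNR by inspection
# ba == (sens + spec) / 2
```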
Scientific Validity
  • ✅ Correct Mathematical Formulation: The table correctly represents the per-instance contributions to Balanced Accuracy. When these values are summed over all instances in a dataset with positive class prevalence π₀, and then averaged (divided by the total number of instances N), the result is equivalent to (Sensitivity + Specificity) / 2.
  • ✅ Appropriateness for Balanced Accuracy: This value function is appropriate for defining Balanced Accuracy, a metric specifically designed to address class imbalance issues by giving equal importance to performance on each class, regardless of their relative frequencies in the data.
  • ✅ Implicit Error Cost Handling: By assigning values of 0 to misclassifications (False Positives and False Negatives), this specific value function still implies that all errors are equally undesirable in terms of direct penalty, similar to standard accuracy. The balancing comes from up-weighting correct classifications of the minority class, rather than differential penalization of error types (which would be handled by metrics like Weighted Accuracy or Net Benefit).
  • ✅ Consistency with Paper's Definitions: The formulation is consistent with the paper's Definition B.2 of Balanced Accuracy, where these cell values correspond to the product of the basic 0/1 accuracy value function (V_{1/2} from Table 2) and the importance sampling weights (W(π₀ → 1/2; y)) used to reweight the original dataset to a perfectly balanced one.
Communication
  • ✅ Clear Structure: The 2x2 contingency table format is standard and easy to interpret for binary classification outcomes.
  • ✅ Intuitive Labeling: The labels for true states (e.g., "y=1 (Syphilis)") and predicted actions (e.g., "ŷ=1 (Treat)") are clear and the use of a concrete example aids understanding.
  • ✅ Explicit Weighting: The mathematical expressions in the cells, involving π₀, clearly show how the weighting is applied to achieve balance.
  • 💡 Definition of π₀: The term π₀ is used in the cell values. While it's a standard notation for empirical class prevalence in this paper (defined on page 3), briefly defining it in a footnote to the table or ensuring its definition is very proximal in the text would enhance the table's self-containedness for readers who jump to it. Suggestion: Add a footnote: "π₀ represents the empirical prevalence of the positive class (y=1) in the dataset."
  • ✅ Appropriate Header Notation: The table header V(y, ŷ) is simpler and appropriate here.
Table 4: Value function for Shifted Accuracy
First Reference in Text
Table 4: Value function for Shifted Accuracy
Description
  • Table Purpose: Value Function for Shifted Accuracy: Table 4 defines a 'value function' for a metric called 'Shifted Accuracy'. In classification, a value function assigns a numerical score to each possible outcome (e.g., correctly identifying a sick person, incorrectly identifying a healthy person). 'Shifted Accuracy' is a type of accuracy measurement that is adjusted to reflect a scenario where the prevalence (or frequency) of the condition in the real-world deployment setting (target prevalence, denoted by π) is different from its prevalence in the dataset used for evaluation (empirical prevalence, denoted by π₀).
  • Table Structure: True States vs. Predicted Actions: The table is a 2x2 grid. The columns represent the true condition of an individual: 'y=1' (e.g., has Syphilis) and 'y=0' (e.g., No Syphilis). The rows represent the model's prediction or action: 'ŷ=1' (e.g., Treat for Syphilis) and 'ŷ=0' (e.g., Don't treat).
  • Cell Values and Prevalence Adjustment Factors: The values in the cells represent the score assigned to each outcome, adjusted for the shift in prevalence: - True Positive (y=1, ŷ=1): Correctly identifying a positive case. The value is π/π₀. This means a correct positive identification is weighted by the ratio of the target prevalence to the empirical prevalence. - False Positive (y=0, ŷ=1): Incorrectly identifying a negative case as positive. The value is 0. - False Negative (y=1, ŷ=0): Incorrectly identifying a positive case as negative. The value is 0. - True Negative (y=0, ŷ=0): Correctly identifying a negative case. The value is (1-π)/(1-π₀). This means a correct negative identification is weighted by the ratio of the target negative class prevalence to the empirical negative class prevalence.
  • Mechanism of Shifting/Re-weighting: This value function essentially re-weights the standard accuracy outcomes. If the target prevalence of the positive class (π) is higher than in the evaluation data (π₀), then True Positives are given more weight. Conversely, if the target prevalence of the negative class (1-π) is higher than in the evaluation data (1-π₀), True Negatives are given more weight. Misclassifications (False Positives and False Negatives) are still assigned a value of 0.
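Averaging these per-instance values over the evaluation set estimates accuracy at the target prevalence; a small sketch (dataset and prevalence values invented for illustration), assuming the Table 4 cell values as described:

```python
import numpy as np

def v_shifted(y, yhat, pi, pi0):
    """Value function for shifted accuracy (Table 4): correct
    classifications reweighted from empirical prevalence pi0
    to target prevalence pi; errors earn 0."""
    if y == 1 and yhat == 1:
        return pi / pi0
    if y == 0 and yhat == 0:
        return (1 - pi) / (1 - pi0)
    return 0.0

y    = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # empirical prevalence 0.25
yhat = np.array([1, 0, 0, 0, 0, 0, 1, 1])
pi0, pi = y.mean(), 0.5                      # deployment prevalence 0.5

shifted = np.mean([v_shifted(a, b, pi, pi0) for a, b in zip(y, yhat)])
sens, spec = 1 / 2, 4 / 6
# shifted == pi * sens + (1 - pi) * spec
```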
Scientific Validity
  • ✅ Correct Formulation for Label Shift Adjustment: The table correctly represents the value function for an accuracy metric adjusted for label shift using importance weighting. The factors π/π₀ and (1-π)/(1-π₀) are standard importance weights for re-weighting class-conditional expectations when moving from a source distribution with prevalence π₀ to a target distribution with prevalence π, assuming P(X|Y) is invariant.
  • ✅ Appropriateness for Label Shift Evaluation: This approach is scientifically valid and appropriate for evaluating classifier performance under known or assumed label shift conditions, providing a more relevant estimate of performance in a target deployment environment.
  • ✅ Consistency with PAMA Definition: This value function, when applied to a dataset, directly leads to the calculation of Prior-Adjusted Maximum Accuracy (PAMA) as defined in Definition B.3 (and Definition 3.3) of the paper, where the standard 0/1 accuracy values are multiplied by these importance weights.
  • 💡 Error Cost Handling Unchanged from Standard Accuracy: It's important to note that this 'Shifted Accuracy' still assigns a value of 0 to both types of errors (False Positives and False Negatives), meaning it doesn't inherently differentiate between the costs of these errors, similar to standard accuracy. The adjustment is purely for the shift in class prevalences, not for asymmetric error costs.
Communication
  • ✅ Clear Structure: The 2x2 contingency table is a standard and clear way to present this type of value function.
  • ✅ Intuitive Labeling: The use of a concrete example (Syphilis) for true states and predicted actions aids in understanding the context.
  • ✅ Explicit Re-weighting Factors: The mathematical expressions in the cells, involving π and π₀, clearly demonstrate how the re-weighting is applied to adjust for the shift in class prevalence.
  • 💡 Definition of π and π₀: The symbols π (target class prevalence) and π₀ (empirical class prevalence) are crucial. While defined in the text (π on page 3, π₀ on page 3), adding a brief footnote to the table defining them would improve its self-containedness. Suggestion: Add a footnote: "π represents the target/deployment class prevalence of y=1; π₀ represents the empirical class prevalence of y=1 in the evaluation dataset."
Table 5: Value function for Balanced Weighted Accuracy
First Reference in Text
Table 5: Value function for Balanced Weighted Accuracy
Description
  • Table Purpose: Value Function for Balanced Weighted Accuracy: Table 5 outlines a 'value function' for a metric called 'Balanced Weighted Accuracy'. In classification, a value function assigns a numerical score to each possible outcome. This table applies to binary classification (two possible outcomes) and shows how scores are assigned when considering both the balance of classes in the dataset and the differing costs or importance of various correct/incorrect predictions.
  • Table Structure: True States vs. Predicted Actions: The table is structured as a 2x2 grid. The columns represent the true actual condition: 'y=1' (e.g., the patient has Syphilis) and 'y=0' (e.g., the patient does not have Syphilis). The rows represent the model's prediction or the action taken based on it: 'ŷ=1' (e.g., predict/treat for Syphilis) and 'ŷ=0' (e.g., predict/don't treat for Syphilis).
  • Cell Values and Key Parameters (π₀, c): The values in the cells are defined using π₀ (pi-naught), which is the observed proportion of positive cases (y=1) in the evaluation dataset, and 'c', a parameter between 0 and 1 related to the relative costs or importance of outcomes. - True Positive (y=1, ŷ=1): Correctly identifying a positive case. The value assigned is (1-c)/π₀. - False Positive (y=0, ŷ=1): Incorrectly predicting positive when it's negative. The value is 0. - False Negative (y=1, ŷ=0): Incorrectly predicting negative when it's positive. The value is 0. - True Negative (y=0, ŷ=0): Correctly identifying a negative case. The value assigned is c/(1-π₀).
  • Mechanism of Balancing and Weighting: This value function achieves 'balancing' by dividing by the class prevalences (π₀ for positives, 1-π₀ for negatives). This means that correctly identifying a case from a rare class gets a higher raw value, helping to ensure performance on minority classes isn't overlooked. It achieves 'weighting' through the 'c' parameter: (1-c) acts as a weight for the value of correctly identifying positives, and 'c' acts as a weight for the value of correctly identifying negatives. If, for example, correctly identifying positives is considered more important, '1-c' would be larger than 'c'. Misclassifications (False Positives and False Negatives) are assigned a value of 0 in this specific formulation.
Scientific Validity
  • βœ… Correct Combination of Balancing and Cost-Weighting Principles: The formulation presented in the table correctly combines principles of class re-weighting (using inverse prevalence: 1/Ο€β‚€ and 1/(1-Ο€β‚€)) for balancing, and cost-based weighting (using 1-c and c) for correct classifications. This is a scientifically valid approach to constructing a value function for a metric that considers both aspects.
  • βœ… Appropriateness for Balanced Weighted Accuracy: This value function is appropriate for defining a 'Balanced Weighted Accuracy' where the contributions of True Positives and True Negatives are adjusted both for their class sizes and for their relative importance as determined by the cost parameter 'c'. When averaged over the dataset Dπ₀, the per-instance values sum to (1βˆ’c)Β·TPR + cΒ·TNR (where TPR is True Positive Rate, TNR is True Negative Rate), which is a standard form for weighted accuracy.
  • πŸ’‘ Relationship to Paper's BWA Definition (Scaling Factor): The paper's Definition B.4 for BWA is Ξ£_{(x,y)∈D_{1/2}} (1βˆ’c)^y c^(1βˆ’y) V_{1/2}(y, ΞΊ(s(x), c)), which evaluates to 0.5Β·[(1βˆ’c)Β·TPR + cΒ·TNR]. The value function in Table 5, when averaged over the original dataset Dπ₀, yields (1βˆ’c)Β·TPR + cΒ·TNR. Therefore, the value function in Table 5 defines a quantity that is twice the BWA as formulated in Definition B.4. This scaling factor of 2 is a nuance but does not invalidate the relative contributions defined by the value function.
  • βœ… Logical Interpretation of Cost Parameter 'c': The interpretation of 'c' as a cost parameter is crucial. If 'c' is defined such that (1-c) reflects the importance or value of a True Positive and 'c' reflects the importance or value of a True Negative (consistent with contexts where c might be Cost(FP)/(Cost(FP)+Cost(FN))), then the weighting scheme is logical.
  • βœ… Error Handling Approach: Assigning a value of 0 to both False Positives and False Negatives means that this specific value function does not directly penalize errors differently; rather, it values correct classifications differently based on class and cost, and achieves balance. This is a valid approach, distinct from directly assigning different negative values (losses) to different error types.
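To make the averaging claim above concrete, here is a minimal NumPy sketch of Table 5's value function (the function name is hypothetical, not from the paper's code). Averaging the per-instance values over the evaluation dataset recovers (1βˆ’c)Β·TPR + cΒ·TNR:

```python
import numpy as np

def bwa_value(y_true, y_pred, c, pi0):
    """Per-instance value from Table 5: TP -> (1-c)/pi0,
    TN -> c/(1-pi0), both error types -> 0."""
    tp = (y_true == 1) & (y_pred == 1)
    tn = (y_true == 0) & (y_pred == 0)
    return tp * (1 - c) / pi0 + tn * c / (1 - pi0)

# Small worked example with 3 positives and 7 negatives.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # pi0 = 0.3
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 1])   # TPR = 2/3, TNR = 5/7
c = 0.4
pi0 = y_true.mean()
avg_value = bwa_value(y_true, y_pred, c, pi0).mean()
assert np.isclose(avg_value, (1 - c) * (2 / 3) + c * (5 / 7))
```

The identity holds for any dataset: dividing TP counts by NΒ·Ο€β‚€ (the number of positives) turns them into TPR, and likewise for TN counts and TNR.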
Communication
  • βœ… Clear Structure: The 2x2 contingency table format is a standard and clear way to present the value function for different classification outcomes.
  • βœ… Intuitive Labeling: The use of a concrete example (Syphilis) for true states (y=1, y=0) and predicted actions (Ε·=1, Ε·=0) aids in understanding the context of the classification task.
  • βœ… Explicit Weighting Factors: The mathematical expressions in the cells clearly show how both cost-weighting (via 'c') and class balance adjustment (via Ο€β‚€) are incorporated into the value of correct classifications.
  • πŸ’‘ Definition of Ο€β‚€ and 'c': The symbols Ο€β‚€ (empirical positive class prevalence) and 'c' (cost/threshold parameter) are crucial for understanding the table. While likely defined elsewhere in the paper, adding a brief footnote to the table defining them would improve its self-containedness and immediate interpretability. Suggestion: Add a footnote: "Ο€β‚€ represents the empirical prevalence of the positive class (y=1). 'c' is a cost parameter, where (1-c) can be seen as the relative importance/value of correctly identifying a positive case, and 'c' as the relative importance/value of correctly identifying a negative case."
Table 6: Value function for Weighted Accuracy
Figure/Table Image (Page 26)
Table 6: Value function for Weighted Accuracy
First Reference in Text
Table 6: Value function for Weighted Accuracy
Description
  • Table Purpose: Value Function for Weighted Accuracy: Table 6 defines a 'value function' for a metric called 'Weighted Accuracy'. A value function in classification assigns a numerical score to each possible outcome (e.g., correctly identifying a sick person). This table is for binary classification and specifies how scores are assigned when considering the differing importance or costs associated with correctly classifying positive versus negative cases, and also normalizing the metric.
  • Table Structure: True States vs. Predicted Actions: The table is a 2x2 grid. Columns represent the true actual condition: 'y=1' (e.g., has Syphilis) and 'y=0' (e.g., No Syphilis). Rows represent the model's prediction or action: 'Ε·=1' (e.g., Treat for Syphilis) and 'Ε·=0' (e.g., Don't treat).
  • Cell Values and Key Parameters (c, Ο€β‚€, D): The values in the cells are defined using 'c' (a cost parameter, 0 < c < 1) and Ο€β‚€ (pi-naught, the observed proportion of positive cases (y=1) in the evaluation dataset). The denominator in each cell, D = (1-c)Ο€β‚€ + c(1-Ο€β‚€), serves as a normalization factor. - True Positive (y=1, Ε·=1): Correctly identifying a positive case. Value = (1-c) / D. - False Positive (y=0, Ε·=1): Incorrectly predicting positive when it's negative. Value = 0. - False Negative (y=1, Ε·=0): Incorrectly predicting negative when it's positive. Value = 0. - True Negative (y=0, Ε·=0): Correctly identifying a negative case. Value = c / D.
  • Mechanism of Weighting and Normalization: This value function achieves 'weighting' through the 'c' parameter: (1-c) acts as a weight for the value of correctly identifying positives, and 'c' acts as a weight for the value of correctly identifying negatives. For instance, if correctly identifying positives is deemed more important, '1-c' would be larger than 'c' (meaning 'c' is small). Misclassifications (False Positives and False Negatives) are assigned a value of 0. The denominator 'D' normalizes the metric so that a perfect classifier (one that gets everything right according to these weights) achieves a score of 1, regardless of the class balance Ο€β‚€ or the cost 'c'. The text mentions this normalization: "normalized so that a perfect classifier achieves a score of 1 regardless of class balance."
Scientific Validity
  • βœ… Correct Formulation for Normalized Weighted Accuracy: The table correctly represents a value function for Weighted Accuracy where the weights for true positives and true negatives are (1-c) and c respectively, and the result is normalized by the expected score of a perfect classifier under these weights and the empirical class distribution Ο€β‚€. This normalization ensures the metric ranges from 0 to 1.
  • βœ… Appropriateness for Weighted and Normalized Evaluation: This formulation is appropriate for evaluating classifiers when different types of correct classifications have different utilities or when aiming for a score that is comparable across different class balances due to the normalization.
  • βœ… Correct Normalization Factor: The normalization factor D = (1-c)Ο€β‚€ + c(1-Ο€β‚€) is precisely the expected value achieved by a perfect classifier given the weights (1-c) for true positives and c for true negatives, and the empirical class distribution (Ο€β‚€ for positives, 1-Ο€β‚€ for negatives). Dividing by D ensures the maximum score is 1.
  • βœ… Consistency with Paper's Definition of WA: This value function is consistent with the definition of Weighted Accuracy (WA) in Definition B.5 of the paper, where the sum of these per-instance values gives the overall WA score.
  • βœ… Error Handling Focus: By assigning a value of 0 to both False Positives and False Negatives, this value function focuses on the weighted sum of correct classifications rather than differentially penalizing specific error types. The cost 'c' influences the relative value of a True Positive versus a True Negative.
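The normalization property noted above (a perfect classifier scores 1 regardless of class balance or cost) can be checked with a short NumPy sketch of Table 6's value function (function name hypothetical, not from the paper's code):

```python
import numpy as np

def wa_value(y_true, y_pred, c, pi0):
    """Per-instance value from Table 6: TP -> (1-c)/D, TN -> c/D,
    errors -> 0, where D = (1-c)*pi0 + c*(1-pi0) normalizes the score."""
    D = (1 - c) * pi0 + c * (1 - pi0)
    tp = (y_true == 1) & (y_pred == 1)
    tn = (y_true == 0) & (y_pred == 0)
    return (tp * (1 - c) + tn * c) / D

# A perfect classifier averages to exactly 1, whatever c or pi0:
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # pi0 = 0.3
pi0 = y_true.mean()
for c in (0.1, 0.5, 0.9):
    assert np.isclose(wa_value(y_true, y_true, c, pi0).mean(), 1.0)
```

Algebraically, a perfect classifier's average value is [(1βˆ’c)Ο€β‚€ + c(1βˆ’Ο€β‚€)]/D = D/D = 1, which is exactly the claim in the Description section.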
Communication
  • βœ… Clear Structure: The 2x2 contingency table format is a standard and clear way to present the value function.
  • βœ… Intuitive Labeling: The use of a concrete example (Syphilis) for true states and predicted actions aids in understanding.
  • βœ… Explicit Weighting and Normalization: The mathematical expressions in the cells clearly show how cost-weighting (via 'c') and normalization (via the denominator involving Ο€β‚€ and 'c') are applied.
  • πŸ’‘ Definition of Ο€β‚€ and 'c': The symbols Ο€β‚€ (empirical positive class prevalence) and 'c' (cost parameter) are central. While likely defined elsewhere, a brief footnote defining them would enhance the table's self-containedness. Suggestion: Add a footnote: "Ο€β‚€ represents the empirical prevalence of the positive class (y=1). 'c' is a cost parameter, where (1-c) can be seen as the relative cost/importance of a false negative error, and 'c' as the relative cost/importance of a false positive error, in some formulations. Here, it weights the value of true positives and true negatives."
Table 7: Value function for Net Benefit
Figure/Table Image (Page 27)
Table 7: Value function for Net Benefit
First Reference in Text
Table 7: Value function for Net Benefit
Description
  • Table Purpose: Value Function for Net Benefit: Table 7 defines a 'value function' for a metric called 'Net Benefit'. A value function in classification assigns a numerical score (benefit or cost) to each possible outcome. This table is for binary classification (two possible outcomes) and shows how these scores are assigned.
  • Table Structure: True States vs. Predicted Actions: The table is a 2x2 grid. The columns represent the true actual condition of an individual: 'y=1' (e.g., has Syphilis) and 'y=0' (e.g., No Syphilis). The rows represent the model's prediction or the action taken: 'Ε·=1' (e.g., Treat for Syphilis) and 'Ε·=0' (e.g., Don't treat).
  • Cell Values and Role of Cost Parameter 'c': The values in the cells represent the benefit associated with each outcome, defined in terms of a parameter 'c' (typically a cost ratio or decision threshold, 0 < c < 1): - True Positive (y=1, Ε·=1): correctly identifying a positive case earns a benefit of 1. - False Positive (y=0, Ε·=1): value 0. - False Negative (y=1, Ε·=0): value 0. - True Negative (y=0, Ε·=0): correctly identifying a negative case earns a benefit of c/(1-c), the odds implied by the threshold c. In this formulation errors carry no explicit penalty: the cost of a false positive appears only as the forgone benefit c/(1-c) of the true negative it replaces, and the cost of a false negative as the forgone benefit 1 of the missed true positive.
  • Specific Formulation Context: As stated on page 5, this is a variation of net benefit that "focuses on the benefit of true negatives rather than the costs of false positives in order to be more directly comparable to accuracy."
Scientific Validity
  • βœ… Consistency with Paper's Specific Net Benefit Definition: Standard Decision Curve Analysis (DCA) defines Net Benefit as (TP βˆ’ FPΒ·(p_t/(1βˆ’p_t)))/N, where p_t is the probability threshold, so false positives are penalized at the odds p_t/(1βˆ’p_t). The paper instead uses a variation (page 5): "We use a variation of net benefit that focuses on the benefit of true negatives rather than the costs of false positives in order to be more directly comparable to accuracy." Definition 3.2 gives Net Benefit(D, s, Ο„, c) = (1/|D|) Ξ£ (c/(1βˆ’c))^(1βˆ’y) Β· 1(y = ΞΊ(s(x), Ο„)). For y = 1, (c/(1βˆ’c))^0 = 1, so a true positive contributes 1; for y = 0, a true negative contributes c/(1βˆ’c); both error types contribute 0 because the indicator 1(y = ΞΊ(s(x), Ο„)) vanishes. These are exactly the values shown in Table 7 (TP = 1, FP = 0, FN = 0, TN = c/(1βˆ’c)), so the table is internally consistent with the authors' definition, even though it differs from the more common DCA formulation.
  • βœ… Appropriateness for Defined Variation: This specific formulation (TP=1, TN=c/(1-c), errors=0) is indeed a variation. It measures a weighted sum of correct classifications, where the weight of a True Negative is c/(1-c) relative to a True Positive's weight of 1. This is appropriate if the goal is to define a benefit metric that scales the value of true negatives based on the cost parameter 'c'.
  • βœ… Logical Use of Cost Parameter 'c': The parameter 'c' is typically related to a decision threshold or a cost ratio. If 'c' is the threshold probability for action, then c/(1-c) represents the odds. The value function correctly uses 'c' to modulate the benefit of a true negative relative to a true positive.
  • πŸ’‘ Non-Standard Formulation (Acknowledged by Authors): This definition of Net Benefit is explicitly stated as a variation by the authors to be "more directly comparable to accuracy" by focusing on positive contributions. While different from some standard DCA formulations that might show negative values for FPs, it is internally consistent with the authors' Definition 3.2.
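To see how Definition 3.2 maps onto Table 7's cell values, here is a minimal NumPy sketch (function name hypothetical, not from the paper's code):

```python
import numpy as np

def net_benefit_variation(y_true, y_pred, c):
    """Paper's Definition 3.2 (TN-benefit variation of Net Benefit):
    (1/|D|) * sum (c/(1-c))^(1-y) * 1(y == y_hat), i.e. Table 7's
    values: TP -> 1, TN -> c/(1-c), both error types -> 0."""
    correct = (y_true == y_pred)
    weight = (c / (1 - c)) ** (1 - y_true)   # 1 for y=1, c/(1-c) for y=0
    return (correct * weight).mean()

y_true = np.array([1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0])                       # 1 TP, 1 FN, 2 TN, 1 FP
score = net_benefit_variation(y_true, y_pred, c=0.25)    # c/(1-c) = 1/3
assert np.isclose(score, (1 + 2 * (1 / 3)) / 5)          # one TP, two TNs, |D| = 5
```

Both error types drop out of the sum because the indicator is zero for them, matching the table's zero-valued off-diagonal cells.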
Communication
  • βœ… Clear Structure: The 2x2 contingency table format is a standard and clear way to present the value function for different classification outcomes.
  • βœ… Intuitive Labeling: The use of a concrete example (Syphilis) for true states and predicted actions aids in understanding the context.
  • βœ… Clear Value Definitions: The values in the cells (1, 0, and c/(1-c)) clearly define the benefit/cost structure for this version of Net Benefit.
  • πŸ’‘ Definition of 'c': The parameter 'c' (cost ratio) is central to this table. While likely defined elsewhere in the paper (e.g., page 5, Definition 3.2), a brief footnote defining 'c' or its interpretation in this context (e.g., related to the trade-off between false positives and false negatives, or the threshold for decision making) would enhance the table's self-containedness. Suggestion: Add a footnote: "'c' represents a cost parameter or decision threshold, where c/(1-c) is the cost ratio of a false positive to a true negative's benefit (or relative cost of a false positive to the benefit of avoiding it)."
Table 8: Value function for Prior-Adjusted Maximum Weighted Accuracy
Figure/Table Image (Page 27)
Table 8: Value function for Prior-Adjusted Maximum Weighted Accuracy
First Reference in Text
Table 8: Value function for Prior-Adjusted Maximum Weighted Accuracy
Description
  • Table Purpose: Value Function for PAMWA: Table 8 specifies a 'value function' for a metric called 'Prior-Adjusted Maximum Weighted Accuracy' (PAMWA). This value function assigns a numerical score to each of the four possible outcomes in a binary classification task (True Positive, False Positive, False Negative, True Negative). The metric aims to evaluate model performance while simultaneously accounting for (1) a shift in class prevalence from an evaluation dataset (with positive class prevalence Ο€β‚€) to a target deployment setting (with positive class prevalence Ο€), (2) differing costs or importance of correctly classifying positives versus negatives (via a cost parameter 'c'), and (3) ensuring balanced consideration of performance on both classes.
  • Table Structure and Parameters: The table is a 2x2 grid where columns represent the true class (1 for positive, 0 for negative) and rows would represent the predicted class (though row labels are missing, they are implicitly Ε·=1 and Ε·=0). The cell entries define the value for each outcome based on Ο€, Ο€β‚€, and 'c'. Let D = (1-c)Ο€ + c(1-Ο€) be a normalization factor.
  • Cell Values for Outcomes: The values assigned are: - True Positive (y=1, predicted Ε·=1): Value = (1-c)Ο€ / (2Ο€β‚€ D). This rewards correctly identifying a positive case, weighted by (1-c), adjusted for the target prevalence Ο€, normalized by D, and further adjusted by the empirical prevalence Ο€β‚€ with a balancing factor of 1/2. - False Positive (y=0, predicted Ε·=1): Value = 0. - False Negative (y=1, predicted Ε·=0): Value = 0. - True Negative (y=0, predicted Ε·=0): Value = c(1-Ο€) / (2(1-Ο€β‚€) D). This rewards correctly identifying a negative case, weighted by 'c', adjusted for the target prevalence of negatives (1-Ο€), normalized by D, and further adjusted by the empirical prevalence of negatives (1-Ο€β‚€) with a balancing factor of 1/2.
  • Combined Adjustments in the Value Function: This value function essentially combines several adjustments: cost-weighting ((1-c) for TPs, c for TNs, applied to target prevalences), label shift adjustment (implicit in using Ο€ and Ο€β‚€), a balancing factor (the '2' in the denominator, common in balanced accuracy type metrics), and normalization (division by D). The goal is to create a sophisticated metric that reflects performance under specific target conditions and cost assumptions.
Scientific Validity
  • βœ… Sound Combination of Methodological Principles: The formulation combines importance weighting for label shift (from Ο€β‚€ to Ο€), cost-weighting for differential utility of true positives vs. true negatives (via 'c'), and a balancing component (the factor of 1/2 and division by source class prevalence Ο€β‚€ or 1-Ο€β‚€). This multi-faceted approach is scientifically sound for constructing a comprehensive evaluation metric tailored to specific deployment conditions and cost structures.
  • βœ… Correct Normalization Factor for Weighted Accuracy Component: The normalization factor D = (1-c)Ο€ + c(1-Ο€) correctly represents the expected weighted sum of correct classifications for a perfect classifier under the target prevalence Ο€ and cost weights (1-c, c). Dividing by D ensures that the weighted accuracy component is scaled to a maximum of 1 before other adjustments.
  • βœ… Consistency with PAMWA Metric Goal: This value function, when applied to instances from the empirical dataset Dπ₀ and summed, should yield the PAMWA score as intended by Definition B.7. The structure of the terms ((weighted value in target) Γ— (balancing/importance weight from source) / (target normalization)) is consistent with deriving such a prior-adjusted, weighted, and balanced metric.
  • βœ… Error Handling Approach (Focus on Positive Contributions): Assigning a value of 0 to both False Positives and False Negatives means that, like other accuracy-based metrics in this paper, errors are not directly penalized with negative values in this specific value function; rather, the focus is on the positively valued contributions of correct classifications, adjusted in multiple ways.
  • πŸ’‘ Sensitivity to Parameter Specification (Ο€, c): The complexity of the metric, while comprehensive, relies heavily on the accurate specification of Ο€ (target prevalence) and 'c' (cost parameter). Misspecification of these parameters could lead to misleading evaluations. The validity of the PAMWA score in practice depends on the reliability of these inputs.
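The combined adjustments can be checked numerically with a sketch of Table 8's value function (function name hypothetical, not from the paper's code). A useful property falls out of the algebra: for a perfect classifier, the source-prevalence weights cancel and the average value is exactly 1/2 for any choice of Ο€, Ο€β‚€, and c:

```python
import numpy as np

def pamwa_value(y_true, y_pred, c, pi, pi0):
    """Per-instance value from Table 8 (PAMWA):
    TP -> (1-c)*pi / (2*pi0*D), TN -> c*(1-pi) / (2*(1-pi0)*D),
    errors -> 0, with D = (1-c)*pi + c*(1-pi)."""
    D = (1 - c) * pi + c * (1 - pi)
    tp = (y_true == 1) & (y_pred == 1)
    tn = (y_true == 0) & (y_pred == 0)
    return (tp * (1 - c) * pi / (2 * pi0)
            + tn * c * (1 - pi) / (2 * (1 - pi0))) / D

# Perfect predictions average to 0.5 regardless of (pi, pi0, c),
# mirroring the factor-of-2 scaling noted for BWA above.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # pi0 = 0.2
pi0 = y_true.mean()
for pi in (0.05, 0.2, 0.5):
    for c in (0.2, 0.7):
        assert np.isclose(pamwa_value(y_true, y_true, c, pi, pi0).mean(), 0.5)
```

The cancellation works because the positive cells contribute (1βˆ’c)Ο€/(2D) and the negative cells c(1βˆ’Ο€)/(2D), which sum to D/(2D) = 1/2.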
Communication
  • βœ… Clear Structure: The 2x2 contingency table format is standard for defining value functions in binary classification, making the structure familiar.
  • βœ… Comprehensive Scope: The table attempts to show the combined effects of prior adjustment (for label shift), cost-weighting, and balancing. However, the resulting expressions are quite complex.
  • πŸ’‘ Parameter and Normalization Factor Definition: The cell values involve multiple parameters (Ο€, Ο€β‚€, c) and a derived normalization factor D (which is (1-c)Ο€ + c(1-Ο€)). For clarity, it would be beneficial to explicitly define D in a footnote or directly within the table if space allowed. Suggestion: Add a footnote: "Ο€ is the target positive class prevalence, Ο€β‚€ is the empirical positive class prevalence, c is the cost parameter. D = (1-c)Ο€ + c(1-Ο€) is a normalization factor based on the target distribution and costs."
  • πŸ’‘ Incomplete Row/Column Labeling: The column headers 'V', '1', '0' are terse. While '1' and '0' likely refer to true positive (y=1) and true negative (y=0) states, expanding 'V' to 'V(y,Ε·)' or similar, and explicitly labeling columns as 'y=1' and 'y=0' (as in other tables) would improve clarity. The row labels for predictions (Ε·=1, Ε·=0) are missing entirely. Suggestion: Add row labels for predictions (e.g., Ε·=1, Ε·=0) and ensure column labels clearly denote true states.
  • πŸ’‘ Cross-referencing to Text: Given the complexity of the expressions, ensuring these are cross-referenced clearly with their derivation or definition (Definition B.7 for PAMWA) in the main text is crucial for reader comprehension.
Table 9: Value function for Prior-Adjusted Maximum Net Benefit
Figure/Table Image (Page 28)
Table 9: Value function for Prior-Adjusted Maximum Net Benefit
First Reference in Text
Table 9: Value function for Prior-Adjusted Maximum Net Benefit
Description
  • Table Purpose: Value Function for PAMNB: Table 9 defines a 'value function' for a metric called 'Prior-Adjusted Maximum Net Benefit' (PAMNB). In the context of evaluating classification models, a value function assigns a numerical score (representing benefit or cost) to each of the four possible outcomes of a binary prediction: True Positive, False Positive, False Negative, and True Negative.
  • Table Structure: True States vs. Predicted Actions: The table is laid out as a 2x2 grid. The columns represent the true actual condition of an individual: 'y=1' (e.g., the individual has Syphilis) and 'y=0' (e.g., the individual does not have Syphilis). The rows represent the model's prediction or the action taken based on that prediction: 'Ε·=1' (e.g., predict/treat for Syphilis) and 'Ε·=0' (e.g., predict/don't treat for Syphilis).
  • Cell Values and Key Parameters (Ο€, Ο€β‚€, c): The values within the cells of the table define the score for each outcome, based on three key parameters: Ο€ (pi, the target or deployment prevalence of the positive class), Ο€β‚€ (pi-naught, the empirical or observed prevalence of the positive class in the evaluation dataset), and 'c' (a cost parameter or decision threshold, typically between 0 and 1). - True Positive (y=1, Ε·=1): Correctly identifying a positive case. The value assigned is Ο€/Ο€β‚€. - False Positive (y=0, Ε·=1): Incorrectly predicting positive when the true state is negative. The value assigned is 0. - False Negative (y=1, Ε·=0): Incorrectly predicting negative when the true state is positive. The value assigned is 0. - True Negative (y=0, Ε·=0): Correctly identifying a negative case. The value assigned is (c/(1-c)) * ((1-Ο€)/(1-Ο€β‚€)).
  • Mechanism: Combining Label Shift Adjustment and Cost-Sensitivity for True Negatives: This value function for PAMNB combines adjustments for label shift (the Ο€/Ο€β‚€ and (1-Ο€)/(1-Ο€β‚€) terms, which re-weight outcomes based on changes in class prevalence between evaluation and target settings) with a cost-sensitive valuation of True Negatives (the c/(1-c) term). Specifically, a True Positive is valued based on the ratio of target to empirical positive prevalence. A True Negative's value is scaled by the odds c/(1-c) and then by the ratio of target to empirical negative prevalence. Both types of errors (False Positives and False Negatives) are assigned a value of 0 in this particular formulation.
Scientific Validity
  • βœ… Correct Formulation for PAMNB: The table correctly represents the value function for Prior-Adjusted Maximum Net Benefit (PAMNB) as implied by Definition 3.4 in the paper. The terms Ο€/Ο€β‚€ and (1-Ο€)/(1-Ο€β‚€) are standard importance weights for label shift correction. The term c/(1-c) introduces cost-sensitivity, specifically to the valuation of True Negatives.
  • βœ… Appropriateness for Target Application: This value function is appropriate for scenarios where one needs to evaluate a classifier's net benefit under a specific target class prevalence (Ο€) different from the evaluation set's prevalence (Ο€β‚€), and where the benefit of a True Negative is scaled by a cost-related factor c/(1-c) relative to the (label-shift-adjusted) benefit of a True Positive.
  • βœ… Alignment with Paper's Goals: The formulation is consistent with the authors' stated aim of developing metrics that are sensitive to clinical deployment conditions, including distributional shifts (label shift) and asymmetric error costs (handled here by valuing TNs differently based on 'c').
  • βœ… Specific Approach to Error Valuation: Assigning a value of 0 to both False Positives and False Negatives means that this particular Net Benefit formulation focuses on the positive contributions of correct classifications, adjusted for label shift and cost-weighting of TNs, rather than directly assigning negative values (penalties) for errors. This is a specific choice in defining 'benefit'.
  • πŸ’‘ Dependence on Accurate Parameter Specification (Ο€, c): The practical utility and interpretation of the PAMNB score derived from this value function will heavily depend on the accurate estimation or specification of the target prevalence Ο€ and the cost parameter 'c'. Sensitivity analyses for these parameters would be important in real-world applications.
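The label-shift mechanics can be illustrated with a sketch of Table 9's value function (function name hypothetical, not from the paper's code). The importance weights Ο€/Ο€β‚€ and (1βˆ’Ο€)/(1βˆ’Ο€β‚€) transport the score from the evaluation prevalence Ο€β‚€ to the target prevalence Ο€: a perfect classifier evaluated on the Ο€β‚€ dataset averages to Ο€ + (c/(1βˆ’c))Β·(1βˆ’Ο€), its net benefit under the target class balance.

```python
import numpy as np

def pamnb_value(y_true, y_pred, c, pi, pi0):
    """Per-instance value from Table 9 (PAMNB):
    TP -> pi/pi0, TN -> (c/(1-c)) * (1-pi)/(1-pi0), errors -> 0."""
    tp = (y_true == 1) & (y_pred == 1)
    tn = (y_true == 0) & (y_pred == 0)
    return tp * pi / pi0 + tn * (c / (1 - c)) * (1 - pi) / (1 - pi0)

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # pi0 = 0.3
pi0 = y_true.mean()
pi, c = 0.1, 0.25
perfect = pamnb_value(y_true, y_true, c, pi, pi0).mean()
# Matches the target-prevalence net benefit of a perfect classifier:
assert np.isclose(perfect, pi + (c / (1 - c)) * (1 - pi))
```

The check works because the Ο€β‚€ and 1βˆ’Ο€β‚€ factors in the empirical class fractions cancel against the importance weights, leaving only target-prevalence quantities.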
Communication
  • βœ… Clear Structure: The 2x2 contingency table format is a standard and clear method for presenting value functions in binary classification contexts.
  • βœ… Intuitive Labeling: The use of a concrete example (Syphilis) for defining true states (y=1, y=0) and predicted actions (Ε·=1, Ε·=0) helps in making the table's context more intuitive.
  • βœ… Explicit Parameter Usage: The mathematical expressions in the cells explicitly show how adjustments for target prevalence (Ο€), empirical prevalence (Ο€β‚€), and cost/threshold parameter (c) are incorporated into the value function.
  • πŸ’‘ Definition of Parameters (Ο€, Ο€β‚€, c): The parameters Ο€ (target positive class prevalence), Ο€β‚€ (empirical positive class prevalence), and 'c' (cost parameter/threshold) are fundamental to understanding the table. While these are likely defined in the main text of the paper, adding a brief footnote to the table that defines these symbols would significantly improve its self-containedness and immediate comprehensibility for readers. Suggestion: Include a footnote such as: "Ο€ represents the target/deployment positive class prevalence; Ο€β‚€ is the empirical positive class prevalence in the evaluation dataset; 'c' is a cost parameter or decision threshold."

AUC-ROC: Label Shift Uncertainty without Calibration


Application to Subgroup Decomposition


Non-Text Elements

(C) African American patients (orange) have noticeably better AUC-ROC than...
Full Caption

(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue). However, we can decompose the difference in accuracy into (A) a difference in mechanism of prediction at equal class balances (i.e. same in-hospital mortality) and (B) a difference in the class balance at which accuracy is evaluated for the two groups.

Figure/Table Image (Page 11)
(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue). However, we can decompose the difference in accuracy into (A) a difference in mechanism of prediction at equal class balances (i.e. same in-hospital mortality) and (B) a difference in the class balance at which accuracy is evaluated for the two groups.
First Reference in Text
(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue).
Description
  • Graph Axes and Range: The graph displays "Accuracy" on its vertical axis, ranging from approximately 0.86 to 1.00, plotted against "Positive Class Fraction" on its horizontal axis, which spans from 0.01 to 0.10. The "Positive Class Fraction" likely represents the proportion of instances classified as positive by the model at varying decision thresholds, or the prevalence of the positive class in different data segments.
  • Comparative Curves: Two distinct curves are shown: an orange line representing data for African American patients and a blue line for Caucasian patients. Across the depicted range of positive class fractions, the orange line is generally positioned above the blue line, suggesting higher accuracy for the model's predictions concerning African American patients compared to Caucasian patients.
  • Overall AUC-ROC Finding (C): The caption indicates an overall finding (labeled C) that African American patients exhibit a noticeably better AUC-ROC. AUC-ROC, or Area Under the Receiver Operating Characteristic curve, is a common metric evaluating a model's ability to distinguish between positive and negative classes across all decision thresholds; a higher value (closer to 1.0) signifies better discriminatory power.
  • Decomposition Component (A): Mechanism Difference: The graph visually decomposes the observed accuracy difference. Component (A), described as "a difference in mechanism of prediction at equal class balances (i.e. same in-hospital mortality)," is marked on the graph as a vertical distance between the orange and blue curves at a specific positive class fraction (around 0.04). This highlights that at a comparable positive class fraction, the model achieves higher accuracy for African American patients.
  • Decomposition Component (B): Class Balance Difference: Component (B), termed "a difference in the class balance at which accuracy is evaluated for the two groups," is depicted as a horizontal segment between points on the two curves. This suggests that the groups might be evaluated under different prevailing class balances, or that a similar level of accuracy is achieved at different positive class fractions for the two groups.
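One simple reading of components (A) and (B): under label shift with fixed within-class performance, each group's accuracy is a linear function of the positive class fraction Ο€, acc(Ο€) = Ο€Β·TPR + (1βˆ’Ο€)Β·TNR, and the total accuracy gap then splits exactly into a mechanism term (compared at a common Ο€) and a class-balance term. The sketch below uses made-up rates, not values from the figure:

```python
def acc(pi, tpr, tnr):
    """Accuracy at positive-class fraction pi, holding TPR/TNR fixed."""
    return pi * tpr + (1 - pi) * tnr

# Hypothetical per-group rates and prevalences, for illustration only.
tpr_a, tnr_a, pi_a = 0.70, 0.99, 0.03   # group A (cf. orange curve)
tpr_b, tnr_b, pi_b = 0.65, 0.98, 0.06   # group B (cf. blue curve)

total_gap = acc(pi_a, tpr_a, tnr_a) - acc(pi_b, tpr_b, tnr_b)
mechanism = acc(pi_b, tpr_a, tnr_a) - acc(pi_b, tpr_b, tnr_b)  # (A): same class balance
balance = acc(pi_a, tpr_a, tnr_a) - acc(pi_b, tpr_a, tnr_a)    # (B): same mechanism
assert abs(total_gap - (mechanism + balance)) < 1e-12
```

The mechanism term corresponds to the vertical gap between the curves at a common Ο€, and the balance term to moving along one group's curve between the two groups' prevalences, matching the vertical and horizontal segments described in the caption.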
Scientific Validity
  • ✅ Decomposition Approach: The attempt to decompose overall performance differences (like AUC-ROC or accuracy) into more granular components related to prediction mechanisms and class balance effects across subgroups is a valuable analytical approach, potentially offering deeper insights than a single aggregate metric.
  • 💡 Link Between Plotted Accuracy and AUC-ROC: The caption states that point (C) refers to African American patients having "noticeably better AUC-ROC," yet the graph plots "Accuracy" against "Positive Class Fraction." The precise mathematical and conceptual linkage between this specific accuracy plot and the overall AUC-ROC claim is not explicitly detailed. While accuracy at various thresholds forms the basis of an ROC curve, the figure itself is not an ROC curve. This ambiguity could affect the interpretation of how the decomposition directly explains AUC-ROC differences.
  • 💡 Operationalization of "Positive Class Fraction": The x-axis term "Positive Class Fraction" and its operationalization in this context (e.g., is it based on true prevalence, predicted prevalence, or tied to specific threshold settings?) requires clearer definition to fully assess the validity of the decomposition, particularly for component (B) related to "class balance at which accuracy is evaluated."
  • 💡 Rigor of Decomposition Definitions: The conceptual distinction and empirical representation of component (A), the "difference in mechanism of prediction" (shown as a vertical gap), and component (B), the "difference in class balance" (shown as a horizontal segment), need rigorous definition. For instance, it is not immediately clear how the visual representation of 'A' isolates only the "mechanism of prediction" independent of other factors, or how 'B' precisely quantifies the impact of differing class balances from this plot.
  • 💡 Lack of Statistical Significance Assessment: The reference text and caption claim a "noticeably better" AUC-ROC and show higher accuracy curves. However, the graph lacks error bars, confidence intervals, or any reported statistical tests. Without these, it is not possible to determine whether the observed differences between the groups are statistically significant or could be due to chance, especially if sample sizes for the subgroups differ or are small.
  • 💡 Justification for X-axis Range: The analysis is presented for a specific range of "Positive Class Fraction" (0.01 to 0.10). The rationale for selecting this particular range and its relevance to the clinical context (e.g., typical prevalence rates for in-hospital mortality, or clinically meaningful threshold ranges) should be provided to support the generalizability and applicability of the findings.
  • 💡 Contextual Details for Standalone Interpretation: The paper mentions this analysis is on the eICU dataset for in-hospital mortality predictions using APACHE IV scores (page 11). While this context is available elsewhere, key details about the dataset, patient population, and specific model being evaluated would ideally be briefly summarized with the figure itself or in its immediate caption so the element can stand alone.
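The vertical-gap reading of component (A) can be made concrete: under label shift, a model's per-class rates are fixed, so accuracy at positive class fraction pi is the mixture pi·TPR + (1 − pi)·TNR, and two subgroup curves can be compared at equal class balance. A minimal sketch, with hypothetical TPR/TNR values not taken from the paper:

```python
# Accuracy at positive class fraction pi when per-class rates are fixed:
#     acc(pi) = pi * TPR + (1 - pi) * TNR
# so a vertical gap between two subgroup curves at the same pi isolates a
# "mechanism" difference. TPR/TNR values below are hypothetical.

def accuracy_at_prevalence(tpr: float, tnr: float, pi: float) -> float:
    """Accuracy when the positive class makes up fraction pi of cases."""
    return pi * tpr + (1.0 - pi) * tnr

group_a = {"tpr": 0.60, "tnr": 0.97}   # e.g., the orange curve
group_b = {"tpr": 0.55, "tnr": 0.95}   # e.g., the blue curve

for pi in (0.01, 0.04, 0.10):
    gap = (accuracy_at_prevalence(**group_a, pi=pi)
           - accuracy_at_prevalence(**group_b, pi=pi))
    print(f"pi={pi:.2f}: vertical gap at equal class balance = {gap:.4f}")
```

Comparing the two mixtures at a shared pi (around 0.04 in the figure) is one plausible operationalization of "same in-hospital mortality."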
Communication
  • ✅ Color Usage: The use of distinct colors (orange for African American patients, blue for Caucasian patients) effectively differentiates the two groups being compared.
  • ✅ Direct Labeling: The direct labeling of components (A and B) on the graph helps connect visual features to the explanations provided in the caption, aiding interpretation.
  • 💡 Accuracy vs. AUC-ROC Clarity: The primary source of confusion is the relationship between the plotted "Accuracy" (y-axis) and the "AUC-ROC" metric prominently mentioned in component (C) of the caption and the reference text. The graph itself is not an ROC curve. If this plot is intended to illustrate factors contributing to the AUC-ROC difference, this connection needs to be explicitly and clearly articulated. Suggestion: Clarify whether this graph represents accuracy values at different decision thresholds (where "Positive Class Fraction" might relate to predicted positive rates or prevalence in evaluated segments) which then contribute to the overall AUC-ROC, or whether it is a related but distinct accuracy analysis.
  • 💡 Caption Structure: The caption is quite dense and combines the main finding (C) with the decomposition (A, B). Suggestion: Improve readability by stating the main AUC-ROC finding (C) first, then clearly explaining how the plotted accuracy differences, visualized as (A) and (B), contribute to understanding this overall performance difference between the groups.
  • 💡 X-axis Scale Clarity: The x-axis, labeled "Positive Class Fraction," shows tick marks at 0.01 and 0.10. The spacing of intermediate, unlabeled ticks suggests a non-linear, possibly logarithmic, scale. This should be explicitly stated to avoid misinterpretation of the rate of change or distribution. Suggestion: Clearly indicate the scale of the x-axis (e.g., "Positive Class Fraction (log scale)").
  • 💡 Explicit Legend: While the colors are explained in the caption, adding an explicit legend directly on the graph (e.g., "African American", "Caucasian") would enhance immediate comprehension without requiring the reader to refer back to the text. Suggestion: Include an embedded legend in the graph.
  • 💡 Context of Parenthetical Remark: The caption for point (A) includes "(i.e. same in-hospital mortality)". This parenthetical remark implies that "Positive Class Fraction" might be linked to actual mortality rates (prevalence), or that the comparison is made at points of equivalent outcome prevalence. This connection should be made clearer in the main explanation of the axis or the decomposition. Suggestion: Elaborate on the meaning of "Positive Class Fraction" and how it relates to in-hospital mortality in the context of this graph.
(C) African American patients (orange) have noticeably better AUC-ROC than...
Full Caption

(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue). However, we can plot the accuracy of a perfectly recalibrated model (dashed lines), and then decompose the average accuracy using the calibration-sharpness framework [80, 28].

Figure/Table Image (Page 11)
(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue). However, we can plot the accuracy of a perfectly recalibrated model (dashed lines), and then decompose the average accuracy using the calibration-sharpness framework [80, 28].
First Reference in Text
(C) African American patients (orange) have noticeably better AUC-ROC than Caucasian patients (blue).
Description
  • Graph Axes and Plotted Metric: The graph presents "Accuracy" on the vertical axis (ranging approximately from 0.86 to 1.00) against "Positive Class Fraction" on the horizontal axis (from 0.01 to 0.10). "Positive Class Fraction" likely refers to the proportion of instances identified as positive, possibly varying with decision thresholds or representing different data segments by prevalence.
  • Plotted Lines: Original vs. Recalibrated Models: Four lines are plotted: Solid orange and blue lines represent the accuracy of the original predictive model for African American and Caucasian patients, respectively. Dashed orange and blue lines show the accuracy of a "perfectly recalibrated" version of the model for the same respective groups. A recalibrated model is one whose predicted probabilities are adjusted to better match observed frequencies.
  • Overall Performance (C) and AUC Labels: The caption and an annotation (C) indicate that African American patients (orange) have a noticeably better overall AUC-ROC than Caucasian patients (blue). AUC-ROC (Area Under the Receiver Operating Characteristic curve) is a measure of a model's ability to distinguish between classes; a higher value is better. On the graph, labels "AUC" point to the orange curves (near accuracy ~0.94-0.95) and blue curves (near accuracy ~0.91-0.92), presumably indicating these overall AUC-ROC values.
  • Miscalibration Loss (A and B): The graph visually decomposes accuracy based on a calibration-sharpness framework. The gap labeled 'A' represents the miscalibration loss for African American patients (the difference between the solid orange line and the higher dashed orange line). This gap 'A' is visibly larger than the gap 'B', which represents the miscalibration loss for Caucasian patients (difference between the solid blue line and the slightly higher dashed blue line).
  • Model Sharpness: The "sharpness" of the model for each group can be inferred from the height of the dashed (recalibrated model) lines. The dashed orange line (recalibrated model for African Americans) is generally higher than the dashed blue line (recalibrated model for Caucasians), suggesting the model has the potential for sharper (more confident and accurate when perfectly calibrated) predictions for African American patients.
  • Comparative Accuracy of Original Models: Overall, the solid orange line (original model, African Americans) is mostly above the solid blue line (original model, Caucasians), indicating higher observed accuracy for African Americans across the shown range of positive class fractions, despite a larger miscalibration loss ('A').
Scientific Validity
  • ✅ Decomposition Framework: The approach of decomposing model performance into calibration and sharpness components is a methodologically sound way to gain deeper insights beyond aggregate metrics like AUC-ROC or overall accuracy. Plotting the performance of a "perfectly recalibrated model" provides a clear theoretical baseline for evaluating calibration loss.
  • 💡 Link Between Accuracy Decomposition and AUC-ROC Claim: The caption states that point (C) refers to AUC-ROC differences, and the graph includes "AUC" labels. However, the y-axis plots "Accuracy." While accuracy across thresholds underlies AUC-ROC, the direct quantitative link showing how the decomposition of this specific accuracy plot explains the differences in overall AUC-ROC values is not explicitly established by the figure and its immediate caption. This requires careful articulation.
  • ✅ Support for Calibration/Sharpness Claims (with external text): The interpretation of "sharpness" (from the height of the dashed lines) and "miscalibration loss" (gaps A and B) is crucial. The text accompanying this figure (page 11, paragraph 2) clarifies these interpretations, aligning with the visual representation. The figure, in conjunction with this text, effectively supports the claim of higher sharpness but worse calibration for the African American patient group with this model.
  • 💡 Definition of "Positive Class Fraction": The term "Positive Class Fraction" on the x-axis needs a precise operational definition within this context (e.g., is it related to true class prevalence, predicted positive rates at varying thresholds, or something else?). This definition is critical for interpreting the accuracy variations and the meaning of the decomposition across this axis.
  • 💡 Absence of Statistical Significance Assessment: The figure lacks error bars, confidence intervals, or any reported statistical tests for the plotted accuracy curves or the derived AUC values and decomposition components (A, B, sharpness differences). Without these, it is impossible to assess whether the observed differences are statistically significant or could be attributable to sampling variability, especially if subgroup sample sizes are disparate or small.
  • 💡 Clarity on "Perfectly Recalibrated Model": The meaning of "perfectly recalibrated model" should ideally be briefly clarified (e.g., a model whose predicted probabilities match observed frequencies across all score ranges). The method for achieving this recalibration (e.g., isotonic regression, Platt scaling) could also be mentioned if it affects the interpretation of the dashed lines.
  • 💡 Justification for X-axis Range: The choice of the x-axis range (0.01 to 0.10 for Positive Class Fraction) should be justified in terms of its relevance to the clinical problem (e.g., typical prevalence of in-hospital mortality, or the range of clinically relevant decision thresholds).
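To make the "perfectly recalibrated model" baseline concrete, one simple finite-sample reading is to replace each predicted probability with the empirical event rate among similarly scored cases. This is a hedged sketch with made-up data; the paper may use a different estimator (e.g., isotonic regression), and `recalibrate` and `n_bins` are hypothetical names:

```python
# One finite-sample reading of "perfect recalibration": map each score to
# the empirical positive rate among cases with similar scores (here,
# fixed-width bins). Scores/labels are made up for illustration.
from collections import defaultdict

def recalibrate(scores, labels, n_bins=5):
    """Replace each score with the observed event rate in its score bin."""
    bins = defaultdict(list)
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append(y)
    rate = {b: sum(ys) / len(ys) for b, ys in bins.items()}
    return [rate[min(int(s * n_bins), n_bins - 1)] for s in scores]

scores = [0.05, 0.10, 0.30, 0.35, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1]
print(recalibrate(scores, labels))  # each score replaced by its bin's event rate
```

The dashed lines in the figure would then be the accuracy of these adjusted scores; the gap to the solid lines is the group's miscalibration loss.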
Communication
  • ✅ Color and Line Style: The use of distinct colors (orange for African American patients, blue for Caucasian patients) and line styles (solid for original model, dashed for recalibrated model) effectively differentiates the groups and model states.
  • ✅ Visual Annotations (A, B): The annotations 'A' and 'B' on the graph visually highlight the differences between the original and recalibrated model accuracies for the two patient groups, aiding in understanding the concept of miscalibration loss.
  • 💡 Accuracy vs. AUC-ROC Clarity: The graph plots "Accuracy" on the y-axis, while the primary finding (C) and the overlaid "AUC" labels refer to "AUC-ROC." This can be confusing. The caption mentions decomposing "average accuracy," but the relationship to the AUC-ROC values needs to be explicitly bridged. Suggestion: Clarify in the caption how the plotted accuracy decomposition (into calibration loss via A/B and sharpness via dashed lines) helps explain or contextualize the observed AUC-ROC differences (C).
  • 💡 Placement of "AUC" Labels: The labels "AUC" placed directly on the graph near the curves, presumably representing the overall AUC-ROC values for each group, are somewhat unconventional on an accuracy plot. Suggestion: State these AUC-ROC values explicitly in the caption or a legend entry for better clarity and to avoid potential misinterpretation as specific accuracy points.
  • 💡 X-axis Definition and Scale: The x-axis, labeled "Positive Class Fraction," ranges from 0.01 to 0.10. Its specific meaning (e.g., true prevalence, predicted positive rate at varying thresholds) and the nature of its scale (linear or logarithmic, as suggested by the tick spacing) should be clearly defined. Suggestion: Specify the scale of the x-axis and briefly explain what "Positive Class Fraction" represents in this context.
  • 💡 Explicit Legend: While the groups are color-coded, adding an explicit legend within the graph (e.g., "African American (Original)", "African American (Recalibrated)", etc.) would enhance immediate comprehension and make the figure more self-contained.
  • 💡 Caption Structure and Flow: The caption is dense and introduces multiple concepts (AUC-ROC difference, recalibrated models, calibration-sharpness framework). Suggestion: Structure the caption to first state the overall AUC-ROC finding (C), then clearly introduce the concept of recalibrated models, and finally explain how the visual elements (A, B, and dashed lines) represent the calibration and sharpness components of the decomposition.
Figure 1: Causal diagrams: (1) shows differences in performance are based both...
Full Caption

Figure 1: Causal diagrams: (1) shows differences in performance are based both on label shift and differences in mechanism between subgroups (2) if we can intervene on Y to hold it constant, then we can measure differences in performance based purely on differing performance of the model between subgroups.

Figure/Table Image (Page 41)
Figure 1: Causal diagrams: (1) shows differences in performance are based both on label shift and differences in mechanism between subgroups (2) if we can intervene on Y to hold it constant, then we can measure differences in performance based purely on differing performance of the model between subgroups.
First Reference in Text
Figure 1: Causal diagrams: (1) shows differences in performance are based both on label shift and differences in mechanism between subgroups (2) if we can intervene on Y to hold it constant, then we can measure differences in performance based purely on differing performance of the model between subgroups.
Description
  • Overall Structure: Two Causal Diagrams: Figure 1 presents two conceptual diagrams, labeled (1) and (2), which are types of causal diagrams. Causal diagrams use nodes (represented by letters D, Y, X, K) to denote variables or factors, and arrows to show presumed causal influences between them. The direction of the arrow indicates the direction of influence.
  • Diagram (1): Combined Effects: Diagram (1) illustrates a scenario where a factor 'D' (likely representing different data distributions or subgroups) influences both 'Y' (perhaps the true outcome or label) and 'X' (likely the input features for a model). 'Y' also influences 'X'. Finally, 'X' influences 'K' (possibly the model's prediction or performance). According to the caption, this diagram represents a situation where performance differences arise from both 'label shift' (changes in the distribution of Y, influenced by D) and differences in 'mechanism' (how X is generated or how X relates to K, also potentially influenced by D via its effect on X).
  • Diagram (2): Effect of Intervention on Y: Diagram (2) shows a modified scenario. Here, 'D' still influences 'Y', but the influence of 'D' directly on 'X' is absent. The relationship Y -> X and X -> K remains. The key difference is the introduction of a 'do(Y)' operator applied to 'Y'. The 'do-operator' is a concept from causal inference signifying an intervention where the variable 'Y' is forced to take on a specific value, or its distribution is set, independently of its usual causes (like D, in terms of its direct path to X). The caption states this diagram represents measuring performance differences based purely on the model's differing performance between subgroups when 'Y' is held constant through intervention.
  • Comparative Purpose: The comparison between diagram (1) and (2) is intended to conceptually separate the sources of performance differences: diagram (1) shows the combined effects, while diagram (2) isolates the effect of the model's behavior (X -> K) under a controlled 'Y', thereby removing the confounding influence of 'D' on 'X' that isn't mediated through 'Y'.
Scientific Validity
  • ✅ Appropriateness of Causal Diagrams: The use of directed acyclic graphs (DAGs) as causal diagrams is a standard and appropriate method for representing assumed causal relationships and for reasoning about concepts like confounding, intervention, and decomposition of effects. They provide a formal language for the concepts discussed.
  • ✅ Representation of Combined Effects: Diagram (1) effectively illustrates a scenario where a common cause 'D' (subgroup/distribution) can influence both the label distribution 'Y' (contributing to label shift) and the feature distribution 'X' or the feature-outcome relationship (contributing to mechanism differences affecting 'K'). The paths D -> Y -> X -> K and D -> X -> K show these combined influences.
  • ✅ Representation of Intervention: Diagram (2) correctly uses the 'do(Y)' notation to represent an intervention on 'Y'. This intervention, by definition, removes incoming arrows to 'Y' from its natural causes if 'Y' were being set exogenously; more relevantly here, it isolates the downstream effects from Y to K, effectively allowing comparison of X -> K across subgroups 'D' under a common distribution of 'Y'. This is a valid way to conceptualize isolating performance differences not attributable to label shift.
  • 💡 Abstract Nature and Dependence on Variable Definitions: The diagrams are abstract and rely on the reader understanding the mapping of D, Y, X, K to specific concepts in the authors' model evaluation framework. While the caption provides high-level interpretation, the scientific validity of the conclusions drawn from these specific causal structures depends on the precise definitions of these variables and the justification for the depicted arrows (or their absence) in the context of the problem being studied (e.g., model fairness, robustness to distribution shift). The diagrams themselves are plausible causal models for the described scenarios.
  • 💡 Specificity of 'Mechanism Differences': The term 'mechanism between subgroups' in the caption for diagram (1) is broad. The diagram shows D influencing X; this could represent differences in P(X|D) or P(X|Y,D). The diagram's D -> X is a direct path, while D -> Y -> X is an indirect path. Clarity on which specific 'mechanism' (e.g., feature distribution P(X|D), conditional feature distribution P(X|Y,D), or model P(K|X,D)) is being referred to would strengthen the link between the diagram and the concept.
  • ✅ Qualitative Conceptual Framework: The diagrams are qualitative. They set up a conceptual framework but do not, by themselves, provide quantitative evidence or test specific hypotheses. Their validity lies in their ability to correctly frame the problem for subsequent quantitative analysis, which appears to be their intended role here.
Communication
  • ✅ Simplicity of Notation: The use of simple, standard notation for nodes (letters) and directed arrows makes the causal relationships easy to follow at a basic level.
  • ✅ Clear Distinction of Diagrams: Numbering the diagrams (1) and (2) clearly distinguishes between the two scenarios being presented, which is helpful for comparative understanding.
  • ✅ Informative Caption: The caption clearly explains the purpose of each diagram, linking them to the concepts of label shift, mechanism differences, and intervention.
  • 💡 Undefined Nodes: The meaning of the nodes D, Y, X, and K is not defined within the figure or its immediate caption. While likely defined elsewhere in the text (e.g., D for dataset/distribution, Y for true label/outcome, X for features, K for model performance/prediction), their absence here reduces the figure's self-containedness. Suggestion: Add a brief legend or key defining D, Y, X, and K directly with the figure.
  • 💡 Clarity of 'do(Y)' Notation: The 'do(Y)' notation in diagram (2) is specific to causal inference and might not be immediately understood by all readers. Suggestion: While the caption explains its implication (intervening on Y), a brief parenthetical explanation of 'do(Y)' (e.g., "setting Y to a fixed value") could improve accessibility for a broader audience.
  • 💡 Visual Emphasis of Intervention Effect: The visual distinction between D influencing both Y and X in diagram (1) versus only Y in diagram (2) after the 'do' operation could be made more salient; the core change is the removal of the D -> X arrow implied by the intervention on Y. Suggestion: Consider subtly highlighting the arrows that are present/absent between the two diagrams to emphasize the effect of the intervention, perhaps with a brief note.
  • 💡 Figure Size and Legibility: The diagrams are quite small. Suggestion: Ensure the figure is rendered at a sufficient size in the final publication so that all labels and arrows are clearly legible without strain.
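The do(Y) intervention can be emulated on held-out predictions by scoring each subgroup under a common label distribution, so residual differences reflect the X -> K mechanism rather than label shift. A minimal sketch with made-up data; `reweighted_accuracy` is a hypothetical helper, not the authors' code:

```python
# Emulating do(Y): score a subgroup as if its labels were drawn with a
# common reference prevalence pi_ref, removing the label-shift component.
# The correctness/label data below are made up for illustration.

def reweighted_accuracy(correct, labels, pi_ref):
    """Accuracy if P(Y=1) were set to pi_ref by intervention."""
    pos = [c for c, y in zip(correct, labels) if y == 1]
    neg = [c for c, y in zip(correct, labels) if y == 0]
    acc_pos = sum(pos) / len(pos)   # accuracy on true positives
    acc_neg = sum(neg) / len(neg)   # accuracy on true negatives
    return pi_ref * acc_pos + (1 - pi_ref) * acc_neg

correct = [1, 0, 1, 1, 1, 0, 1, 1]   # was the model's decision right?
labels  = [1, 1, 0, 0, 0, 1, 0, 0]   # true outcomes
print(reweighted_accuracy(correct, labels, pi_ref=0.05))
```

Computing this quantity at the same pi_ref for each subgroup D gives a comparison under a shared distribution of Y, which is the point of diagram (2).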
Figure 2: The label shift effect is quite dramatic on the public subsample of...
Full Caption

Figure 2: The label shift effect is quite dramatic on the public subsample of EICU. In fact a great deal of this results from small sample size.

Figure/Table Image (Page 41)
Figure 2: The label shift effect is quite dramatic on the public subsample of EICU. In fact a great deal of this results from small sample size.
First Reference in Text
Figure 2: The label shift effect is quite dramatic on the public subsample of EICU.
Description
  • Overall Structure: Two Panes: Figure 2 consists of two separate plots, or panes, placed side-by-side. Both panes display "Accuracy" on the vertical axis against "Positive Class Fraction" on the horizontal axis, which ranges from 0.01 to 0.10. "Positive Class Fraction" likely refers to the proportion of cases considered positive, either based on true prevalence in data segments or predicted positive rates at varying model thresholds.
  • Left Pane: Public Subsample Analysis: The left pane shows accuracy values ranging from approximately 0.86 to 1.00. It features two curves, one orange and one blue (colors inferred from Figure 1 context, representing different patient subgroups). Arrows labeled ΔD→Y and ΔD→X point to differences between these curves. ΔD→Y likely signifies the portion of the accuracy difference attributed to 'label shift' – which is when the underlying frequency of the condition (e.g., disease) changes between groups or datasets. ΔD→X likely represents the difference due to 'mechanism shift' – changes in how patient characteristics (features) relate to the outcome, or differences in the feature distributions themselves, independent of the overall condition frequency.
  • Interpretation of Left Pane (from Caption): The caption states that the label shift effect (ΔD→Y) is "quite dramatic" in this left pane, which visualizes data from a "public subsample of EICU." eICU refers to a specific intensive care unit database. The caption also suggests that this dramatic effect is largely a result of "small sample size" in this subsample.
  • Right Pane: Finer Scale Analysis (likely Full Sample): The right pane displays a similar plot structure but with a much finer y-axis scale for Accuracy, ranging from approximately 0.914 to 0.922. This suggests it is showing a less 'dramatic' or more nuanced view of the accuracy differences. Although not explicitly stated in the caption for Figure 2, accompanying text (page 41, paragraph 3: "But on the full sample, at a much finer scale, the same effect is present.") implies this pane represents an analysis on the "full sample" of the EICU dataset.
  • Decomposition Components (ΔD→Y and ΔD→X): In both panes, the ΔD→Y and ΔD→X annotations are used to decompose the observed accuracy differences between the two plotted subgroup curves. The overall difference in performance between the subgroups is implied to be a combination of these two effects.
Scientific Validity
  • ✅ Decomposition Methodology: The conceptual approach of decomposing performance differences into components like label shift (ΔD→Y) and mechanism shift (ΔD→X) is a valuable method for understanding disparities or changes in model behavior across different subgroups or datasets. This provides more nuanced insights than a single aggregate performance metric.
  • ✅ Acknowledgment of Small Sample Size Impact: The caption's admission that the "dramatic" effect in the public subsample (left pane) is largely due to "small sample size" is a crucial piece of self-critique and highlights a potential limitation of drawing strong conclusions from that specific pane alone. This demonstrates awareness of statistical power issues.
  • ✅ Comparison with Larger Sample Analysis: The presentation of a comparative analysis on what is likely the "full sample" (right pane, based on surrounding text) alongside the "public subsample" allows for an assessment of how findings might change with more data, which is good practice.
  • 💡 Absence of Uncertainty Quantification: The figures lack any indication of statistical uncertainty, such as confidence intervals or error bars around the accuracy curves or the derived ΔD→Y and ΔD→X components. This is particularly important given the explicit mention of "small sample size" for the public subsample. Without them, it is difficult to ascertain whether the observed differences and decompositions are statistically robust or could be due to sampling variability.
  • 💡 Methodological Detail for Decomposition: The precise definitions and methods for calculating ΔD→Y and ΔD→X are not provided within the figure or its immediate caption. Understanding how these components are isolated and quantified is essential for evaluating the rigor of the decomposition.
  • 💡 Contextual Details of Analysis: While the caption mentions the EICU dataset, further details about the specific subgroups being compared (presumably African American and Caucasian patients, as in Figure 1), the prediction task (in-hospital mortality), and the model being evaluated would enhance the scientific context and interpretability of the figure if it were to be viewed in isolation.
  • 💡 Definition of X-axis and Link to Overall Accuracy: The term "Positive Class Fraction" on the x-axis needs explicit definition. If it represents varying decision thresholds, the relationship between this axis and the decomposition of overall accuracy (which is usually a single value, or averaged over thresholds for metrics like AUC) should be clarified.
  • 💡 Visual Support for Quantitative Claims (Right Pane): The surrounding text (page 41, paragraph 3) states for the full sample (likely the right pane): "ΔD→K > 0, but the difference is all from ΔD→Y, and in fact ΔD→X < 0." The visual representation in the right pane should clearly support this claim (e.g., ΔD→X pointing in the opposite direction or having a negative contribution, if the decomposition is additive). The current annotations are generic arrows; their direction and magnitude relative to each other are key to supporting this specific claim.
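One way the two arrows could form an additive decomposition is sketched below, using accuracy-versus-class-balance curves. All curves and class balances are hypothetical, with signs chosen to mirror the quoted full-sample pattern (positive total, negative mechanism term); this is an illustration, not the authors' definition:

```python
# One hypothetical additive split consistent with the annotated arrows.
# acc_g(pi) = pi * TPR_g + (1 - pi) * TNR_g is a subgroup's accuracy curve.
#   delta_total  = acc_1(pi_1) - acc_2(pi_2)   (observed gap)
#   delta_d_to_y = acc_1(pi_1) - acc_1(pi_2)   (label-shift part)
#   delta_d_to_x = acc_1(pi_2) - acc_2(pi_2)   (mechanism part, equal balance)
# so delta_total = delta_d_to_y + delta_d_to_x by construction.

def acc_curve(tpr, tnr):
    """Accuracy as a function of positive class fraction, per-class rates fixed."""
    return lambda pi: pi * tpr + (1 - pi) * tnr

acc_1 = acc_curve(tpr=0.55, tnr=0.970)   # hypothetical subgroup 1
acc_2 = acc_curve(tpr=0.60, tnr=0.975)   # hypothetical subgroup 2
pi_1, pi_2 = 0.03, 0.07                  # the groups' differing class balances

delta_total = acc_1(pi_1) - acc_2(pi_2)
delta_d_to_y = acc_1(pi_1) - acc_1(pi_2)
delta_d_to_x = acc_1(pi_2) - acc_2(pi_2)

# Positive total driven entirely by label shift, with a negative mechanism term.
print(delta_total, delta_d_to_y, delta_d_to_x)
```

Under this reading, the figure's horizontal arrow corresponds to evaluating one group's curve at the other group's class balance.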
Communication
  • ✅ Visual Annotations and Color Consistency: The use of distinct colors for the two demographic groups (presumably orange and blue, as in Figure 1, though not explicitly stated in the Figure 2 caption) is continued, which aids consistency if the groups are the same. The annotations ΔD→Y and ΔD→X directly on the plots are helpful for linking the visual components to the concepts of label shift and mechanism shift.
  • ✅ Side-by-Side Comparison: The figure presents two panes side-by-side, allowing for a direct visual comparison of the effects on the public subsample versus what appears to be a different dataset or analysis scale (likely the full sample, based on surrounding text).
  • 💡 Incomplete Caption for Multi-Pane Figure: The caption focuses solely on the "public subsample" (presumably the left pane) and mentions "small sample size" as a cause for the dramatic effect. However, it does not describe the right pane at all, so the figure is not fully self-contained from the caption alone. Suggestion: The caption should briefly introduce both panes and their respective contexts (e.g., "Left: public subsample showing a dramatic label shift effect due to small sample size. Right: full-sample analysis at a finer scale showing...").
  • 💡 Axis Clarity and Scale Definition: The axes are labeled "Accuracy" and "Positive Class Fraction," but the units or precise meaning of "Positive Class Fraction" (e.g., actual prevalence, predicted rate) are not defined. The x-axis scale (0.01 to 0.10) appears non-linear in both panes (possibly logarithmic), which should be explicitly stated. Suggestion: Clearly define "Positive Class Fraction" and specify whether the x-axis is on a log scale. Ensure y-axis scales are appropriate for the data range in each pane (left: 0.86-1.00; right: 0.914-0.922).
  • 💡 Identification of Compared Groups: The specific demographic groups represented by the orange and blue lines are not identified in the caption for Figure 2, though they are likely consistent with Figure 1 (African American and Caucasian patients). Suggestion: Reiterate the group definitions in the caption or add a legend to each pane for clarity.
  • 💡 Qualitative Language: The term "dramatic" used in the caption is qualitative. While the visual difference in the left pane may appear large, quantifying this effect or providing context for what constitutes a dramatic shift would be beneficial. Suggestion: If possible, provide a quantitative measure or comparison to contextualize "dramatic."
Figure 3: The difference in sharpness is fairly dramatic on the public...
Full Caption

Figure 3: The difference in sharpness is fairly dramatic on the public subsample of EICU, but swamped by the difference in calibration (which is far worse for African Americans).

Figure/Table Image (Page 42)
First Reference in Text
Description
  • Pane Context and Axes: This left pane of Figure 3 plots "Accuracy" (vertical axis, ranging from approximately 0.86 to 1.00) against "Positive Class Fraction" (horizontal axis, from 0.01 to 0.10) for a "public subsample of EICU." "Positive Class Fraction" likely indicates the proportion of cases considered positive, varying across the x-axis.
  • Plotted Lines: Actual vs. Recalibrated Models (Sharpness): Two sets of lines are shown: solid lines represent the actual accuracy of a model for two different demographic groups (orange and blue). Dashed lines represent the accuracy of "perfectly recalibrated" versions of the model for these same groups. A recalibrated model is one whose predicted probabilities are adjusted to accurately reflect observed event frequencies. The dashed lines, therefore, reflect the model's optimal performance if calibration were perfect, with their height indicating the model's "sharpness" (ability to make confident and distinct predictions).
  • ΔTOTAL: Total Actual Performance Difference: A purple shaded area labeled ΔTOTAL highlights the overall difference in accuracy between the actual performances of the two groups (i.e., the area between the solid orange and solid blue lines).
  • ΔS: Difference in Sharpness (Recalibrated Performance Difference): A green shaded area labeled ΔS represents the difference in accuracy between the perfectly recalibrated models for the two groups (i.e., the area between the dashed orange and dashed blue lines). This ΔS is described in the caption as the "difference in sharpness."
  • Caption Interpretation: Sharpness vs. Calibration Difference: The caption states that for this public subsample, the "difference in sharpness (ΔS) is fairly dramatic." Visually, ΔS occupies a noticeable portion of ΔTOTAL. However, the caption further states this is "swamped by the difference in calibration (which is far worse for African Americans)." This implies that the component of ΔTOTAL not accounted for by ΔS (i.e., ΔTOTAL - ΔS, representing the inter-group difference attributable to differential miscalibration) is larger than ΔS. The miscalibration for a single group is the gap between its solid line (actual) and its dashed line (recalibrated).
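The decomposition described above can be sketched numerically. Assuming, hypothetically, that "perfect recalibration" is approximated by replacing each score with its bin's empirical event rate, and that accuracy at prevalence π is the class-reweighted accuracy of a fixed 0.5 threshold, ΔTOTAL and ΔS could be computed roughly as follows (illustrative names and a crude binning recalibrator, not the paper's exact procedure):

```python
import numpy as np

def reweighted_accuracy(y, p, pi, threshold=0.5):
    """Accuracy of thresholding scores p, reweighted to prevalence pi."""
    pred = (np.asarray(p) >= threshold).astype(int)
    y = np.asarray(y)
    tpr = np.mean(pred[y == 1] == 1)
    tnr = np.mean(pred[y == 0] == 0)
    return pi * tpr + (1 - pi) * tnr

def bin_recalibrate(y, p, n_bins=10):
    """Replace each score with its bin's empirical event rate
    (a crude stand-in for 'perfect recalibration')."""
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    out = p.copy()
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            out[mask] = y[mask].mean()
    return out

def decompose(y_a, p_a, y_b, p_b, pis):
    """Average Delta_TOTAL and Delta_S over a prevalence range;
    the remainder is attributed to differential miscalibration."""
    d_total = np.mean([reweighted_accuracy(y_a, p_a, pi)
                       - reweighted_accuracy(y_b, p_b, pi) for pi in pis])
    ra, rb = bin_recalibrate(y_a, p_a), bin_recalibrate(y_b, p_b)
    d_s = np.mean([reweighted_accuracy(y_a, ra, pi)
                   - reweighted_accuracy(y_b, rb, pi) for pi in pis])
    return d_total, d_s, d_total - d_s
```

With toy data where group A is well calibrated and group B systematically under-scores its positives, the gap shows up almost entirely in the calibration term (d_total - d_s), mirroring the "swamped by calibration" reading of the figure.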
Scientific Validity
  • ✅ Decomposition Methodology: Decomposing the overall performance difference between groups (ΔTOTAL) into components related to sharpness differences (ΔS) and calibration differences (implicitly ΔTOTAL - ΔS, or by summing individual calibration losses) is a methodologically sound approach to understanding sources of disparity in model performance.
  • ✅ Conceptual Illustration: The visualization effectively illustrates the concepts: the dashed lines represent the best achievable accuracy if calibration were perfect (reflecting sharpness), and the gap between solid and dashed lines for each group (not explicitly shaded but inferable) represents miscalibration loss for that group. ΔS directly shows the difference in potential performance due to sharpness.
  • ✅ Visual Support for Caption Claims (Subsample): The caption's claim that the sharpness difference is "swamped by the difference in calibration" for the public subsample appears visually supported if (ΔTOTAL - ΔS) is interpreted as the calibration component of the inter-group difference, and this area is larger than ΔS. The further claim "(which is far worse for African Americans)" implies the gap between the solid orange and dashed orange lines is particularly large.
  • 💡 Absence of Uncertainty Quantification: The figure lacks any representation of statistical uncertainty (e.g., confidence bands around the curves or for the areas ΔTOTAL and ΔS). Given that this pane represents a "public subsample" where small sample size effects have been noted (Figure 2 caption), the robustness of these "dramatic" differences and the "swamping" effect cannot be statistically ascertained from the figure alone.
  • 💡 Methodological Detail for Area Calculation: The precise mathematical definitions for how ΔTOTAL and ΔS are calculated as areas, and how the "difference in calibration" component is derived from them to support the "swamped by" claim, should be explicitly provided. The accompanying text (page 42) defines ΔS as E[PAMNB(D_orange, π, s_orange, c) - PAMNB(D_blue, π, s_blue, c)] and ΔC based on individual calibration losses. It should be clear how the visualized areas relate to these definitions.
  • 💡 Contextual Details: Contextual details, such as the specific demographic groups, the prediction task, and the base model being evaluated, are important for full scientific interpretation and are assumed from prior figures/text.
Communication
  • ✅ Color and Line Style Usage: The use of distinct colors for the two demographic groups (orange and blue, presumably consistent with previous figures) and line styles (solid for actual model, dashed for recalibrated model) aids in distinguishing the plotted data series.
  • ✅ Visual Highlighting of Decomposition: The shaded areas labeled ΔTOTAL and ΔS visually highlight the components of the accuracy difference being discussed (total difference between actual model performances of the two groups, and the difference between their recalibrated model performances, respectively).
  • 💡 Axis Definition and Scale: The labels on the axes ("Accuracy", "Positive Class Fraction") are clear, but the specific meaning of "Positive Class Fraction" (e.g., true prevalence, predicted positive rate) and the nature of its scale (0.01 to 0.10, possibly logarithmic) should be explicitly defined for full clarity. Suggestion: Add a note specifying if the x-axis is on a log scale and briefly define "Positive Class Fraction" in this context.
  • 💡 Qualitative Language: The caption uses qualitative terms like "fairly dramatic" and "swamped by." While illustrative, providing quantitative context or values for ΔS and the calibration difference component (ΔTOTAL - ΔS) would strengthen the message. Suggestion: Consider adding key quantitative values for the depicted differences in the caption or text.
  • 💡 Group Identification: The demographic groups represented by orange and blue are not explicitly named in this figure's caption, relying on context from previous figures. Suggestion: Briefly restate the group definitions (e.g., "Orange: African Americans, Blue: Caucasians") for better self-containment.
  • ✅ Appropriate Y-axis Scale: The y-axis scale (0.86 to 1.00) is appropriate for showing the relatively large differences discussed for this subsample.
Figure 4: The difference in sharpness for clipped cross entropy is not exactly...
Full Caption

Figure 4: The difference in sharpness for clipped cross entropy is not exactly the same as the difference in sharpness measured by the AUC-ROC curve because, as mentioned earlier, the AUC-ROC weights different thresholds differently.

Figure/Table Image (Page 43)
First Reference in Text
Description
  • Pane Context and Axes: This left pane of Figure 4, titled "EICU," plots "Accuracy" on the vertical axis (ranging from 0.75 to 1.00) against "Positive Class Fraction" on the horizontal axis (from 0.01 to 0.10). It displays data for the "public subsample" of the EICU dataset, according to accompanying text.
  • Plotted Lines and Sharpness Representation: Two dotted lines are shown, representing the sharpness (likely accuracy of perfectly recalibrated models) for two demographic groups. The orange dotted line is consistently above the blue dotted line. "Sharpness" here refers to a model's ability to make confident and distinct predictions when perfectly calibrated.
  • AUC Values and Sharpness Comparison: An "AUC" (Area Under the Curve, likely referring to AUC-ROC of these recalibrated models) value is annotated near each curve. The orange curve has an associated AUC of approximately 0.90, while the blue curve has an AUC of approximately 0.85. This indicates higher sharpness for the group represented by the orange line in this subsample.
  • Caption's Context on Sharpness Measurement: The caption explains that differences in sharpness as visualized here (presumably related to the "clipped cross entropy" approach) might not be identical to sharpness differences measured solely by an AUC-ROC methodology because AUC-ROC weights different decision thresholds differently.
Scientific Validity
  • ✅ Visualization of Sharpness: Presenting sharpness via accuracy curves of recalibrated models is a valid way to visualize this aspect of model performance. Comparing these across demographic subgroups helps in understanding potential disparities in model quality beyond just raw predictive accuracy or overall AUC-ROC of the original models.
  • ✅ Support for Textual Claims (Public Subsample): The accompanying text (page 43, paragraph 2) states: "Sharpness across the full range of prevalences is a fair bit higher for African Americans in the public subsample..." Assuming orange represents African Americans, this pane visually supports that claim, showing the orange curve and its AUC (0.90) noticeably higher than the blue curve's AUC (0.85).
  • 💡 Absence of Uncertainty Quantification: The figure lacks any indication of statistical uncertainty (e.g., confidence intervals for the AUC values or confidence bands for the accuracy curves). This is particularly relevant for a "public subsample," where sample sizes might be limited, making it difficult to assess the statistical significance of the observed sharpness difference.
  • 💡 Methodological Clarity for Sharpness Curves: The precise method for obtaining these sharpness curves (i.e., the recalibrated model accuracy) and how "clipped cross entropy" relates to this visualization should be clear. If these curves represent sharpness under the "clipped cross entropy" framework, this should be explicit.
  • 💡 Illustrating the Weighting Difference: The caption's point about AUC-ROC weighting thresholds differently is a known characteristic of the metric. The figure, by showing both the curves and their AUCs, provides data that could be used to discuss this, but doesn't inherently demonstrate the weighting difference itself without further comparative analysis or explanation.
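The underlying point — that AUC-ROC and threshold-specific accuracy are different summaries of the same score distribution — can be illustrated with a toy example: AUC-ROC is invariant to any monotone rescaling of the scores, whereas accuracy at a fixed operating threshold is not. A self-contained sketch (toy data, not from the paper):

```python
import numpy as np

def auc_roc(y, s):
    """AUC-ROC as the Mann-Whitney probability that a random
    positive outscores a random negative (ties count 1/2)."""
    y, s = np.asarray(y), np.asarray(s)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def accuracy_at(y, s, threshold):
    """Accuracy at one fixed operating threshold."""
    return np.mean((np.asarray(s) >= threshold) == np.asarray(y))

# Two hypothetical models whose scores rank identically (same AUC)
# but sit differently relative to the fixed threshold 0.5.
y = np.array([1, 1, 0, 0])
s_a = np.array([0.90, 0.80, 0.20, 0.10])
s_b = np.array([0.60, 0.55, 0.52, 0.40])
```

Because both score sets rank every positive above every negative, both have AUC 1.0; yet at the 0.5 threshold model B misclassifies a negative. Any metric that averages over specific thresholds with its own weighting (as the clipped cross-entropy framework does over a prevalence range) will therefore generally disagree with AUC-ROC, which is the caption's point.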
Communication
  • ✅ Color and Line Style: The use of distinct colors for the two demographic groups (orange and blue, presumably African Americans and Caucasians, respectively, based on context from previous figures) and dotted line styles (indicating recalibrated models/sharpness) is consistent and aids differentiation.
  • ✅ AUC Annotation and Y-axis Scale: The direct annotation of "AUC" values near the respective curves provides a quantitative summary of the sharpness represented by these curves. The y-axis range (0.75 to 1.00) is appropriate for the data shown.
  • 💡 Axis and Group Clarity: The x-axis "Positive Class Fraction" (0.01 to 0.10) and its scale (potentially logarithmic) should be explicitly defined. The groups represented by orange and blue should be stated in the caption or a legend for this pane to be fully self-contained. Suggestion: Add a legend identifying the orange and blue lines and specify the x-axis scale.
  • 💡 Linking Pane to Caption's Core Argument: The caption's main point is a comparison of sharpness measurement methods. This specific pane illustrates sharpness for two groups in the public subsample. The connection between the visual (accuracy curves and AUCs) and the concept of "sharpness for clipped cross entropy" versus "sharpness measured by AUC-ROC curve" could be made more direct in how this pane contributes to that argument. Suggestion: Briefly note in the caption that this pane shows sharpness (likely via the paper's preferred method) and the AUCs are one summary, with the understanding that the difference might vary if purely using AUC-ROC methodology for sharpness comparison.
Figure 5: There's a significant gap between the performance on the public...
Full Caption

Figure 5: There's a significant gap between the performance on the public dataset for black and white patients, but it's well within the 95% confidence interval.

Figure/Table Image (Page 44)
First Reference in Text
Description
  • Pane Context and Axes: This left pane of Figure 5, titled "EICU," displays "Accuracy" on the vertical axis (scaled from 0.5 to 1.0) against "Positive Class Fraction" on the horizontal axis (scaled from 0.01 to 0.10). "Positive Class Fraction" likely represents the proportion of instances classified as positive by a model, or the prevalence of the positive class within different segments of data.
  • Central Performance Lines: Two central trend lines are plotted: an orange line and a blue line, representing the model's accuracy for black and white patients, respectively (as inferred from the caption). The orange line is generally positioned above the blue line, suggesting higher mean accuracy for black patients in this public dataset.
  • 95% Confidence Intervals: Shaded areas in light orange and light blue surround each respective central line. These shaded regions represent the 95% confidence intervals for the accuracy estimates, indicating the range within which the true accuracy likely falls with 95% confidence. These confidence intervals are notably wide and show considerable overlap between the two groups across most of the "Positive Class Fraction" range.
  • Observed Gap and Confidence Interval Overlap: The caption highlights a "significant gap" between the performance for black and white patients on this public dataset. Visually, the central orange line is consistently higher than the blue line. However, the caption also states this gap is "well within the 95% confidence interval," which refers to the substantial overlap of the shaded uncertainty regions.
Scientific Validity
  • ✅ Use of Confidence Intervals: The inclusion of 95% confidence intervals is a methodologically sound practice, essential for interpreting differences, especially when dealing with potentially small sample sizes as implied by "public dataset" and the wide CIs.
  • 💡 Interpretation of "Significant Gap" with Overlapping CIs: The caption's statement "There's a significant gap ... but it's well within the 95% confidence interval" presents a contradiction. A gap that is "well within the 95% confidence interval" (implying the CIs of the two groups' means overlap substantially, or the CI for the difference includes zero) typically means the observed difference is not statistically significant at the p < 0.05 level. The term "significant gap" usually implies statistical significance. This phrasing needs careful revision to accurately reflect statistical interpretation. If the gap is large in magnitude (effect size) but not statistically significant, this should be stated precisely.
  • 💡 Method for CI Calculation Missing: The method used to calculate the confidence intervals (e.g., bootstrapping, analytical methods based on assumptions about data distribution) is not specified. This information is important for assessing their validity.
  • ✅ Indication of Uncertainty: The width of the confidence intervals suggests considerable uncertainty in the accuracy estimates for this public dataset, which appropriately tempers conclusions about the observed gap. This supports the notion that findings from this subsample might not be robust.
  • 💡 Definition of "Positive Class Fraction": The operational definition of "Positive Class Fraction" and how accuracy varies across it (e.g., are these different thresholds, or different natural prevalences in subsets?) is crucial for a full scientific understanding of what is being compared.
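Since the CI method is unspecified, a common default would be a nonparametric percentile bootstrap over patients. A sketch of that (hypothetical) procedure, with illustrative names throughout:

```python
import numpy as np

def bootstrap_accuracy_ci(y, pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    One plausible method; the paper does not state how its
    intervals were computed, so this is an assumption.
    """
    rng = np.random.default_rng(seed)
    y, pred = np.asarray(y), np.asarray(pred)
    n = len(y)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        accs[b] = np.mean(pred[idx] == y[idx])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return np.mean(pred == y), (lo, hi)
```

Note that overlapping per-group intervals are only suggestive: the more direct test of the caption's claim is a bootstrap CI for the *difference* in accuracy, obtained by resampling both groups jointly in each replicate and taking quantiles of the per-replicate gap.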
Communication
  • ✅ Clarity of Confidence Intervals: The visualization of 95% confidence intervals as shaded areas around the central performance lines (orange and blue) is a clear and effective way to represent statistical uncertainty.
  • ✅ Pane Title: The title "EICU" clearly indicates the dataset context for this pane.
  • 💡 Caption Phrasing: "Significant Gap" vs. CI Overlap: The caption states there's a "significant gap" but it's "well within the 95% confidence interval." This phrasing is contradictory. If the confidence intervals of the two groups overlap substantially, as they appear to do, the difference is generally not considered statistically significant. Suggestion: Rephrase to clarify. For instance, if the gap is visually apparent but not statistically significant, state "An apparent gap... however, this difference is not statistically significant as indicated by the overlapping 95% confidence intervals." Or, if "significant" refers to effect size rather than statistical significance, this distinction must be made explicit.
  • 💡 Explicit Legend for Groups: The demographic groups represented by the orange and blue lines (presumably black and white patients, as per the caption) are not explicitly labeled within the legend of the graph itself. Suggestion: Add a direct legend (e.g., "Black patients (Orange)", "White patients (Blue)") to the pane for immediate clarity.
  • 💡 X-axis Scale Clarity: The x-axis, "Positive Class Fraction," ranges from 0.01 to 0.10. Its scale appears non-linear (possibly logarithmic), which should be explicitly stated to aid correct interpretation of the curve shapes and distances. Suggestion: Specify if the x-axis is on a logarithmic scale.
  • ✅ Y-axis Scale: The y-axis range (0.5 to 1.0) is appropriate for accuracy data; the plotted curves occupy roughly 0.6 to 0.95, which is acceptable.

Discussion

Key Aspects

Strengths

Suggestions for Improvement
