Sufficient is better than optimal for training neural networks

Irina Babayan, Hazhir Aliahmadi, Greg van Anders
Nature Communications
Department of Physics, Engineering Physics, and Astronomy, Queen's University, Kingston ON, K7L 3N6, Canada

Overall Summary

Study Background and Main Findings

This paper challenges the conventional wisdom of training neural networks through optimization, arguing that the relentless pursuit of a single 'optimal' solution paradoxically leads to poor performance on new data, a problem known as overfitting. The authors propose a fundamental paradigm shift from 'optimality' to 'sufficiency.' To achieve this, they introduce 'simmering,' a novel training method inspired by statistical physics. Instead of minimizing an error function to find one perfect set of model parameters, simmering treats the parameters as a system of particles and uses algorithms from molecular dynamics to simulate them at a constant, non-zero 'temperature.' This controlled thermal agitation prevents the model from settling into a single, overfit state and instead allows it to sample a diverse collection, or ensemble, of 'good enough' solutions.
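The constant-temperature sampling idea can be sketched with a simpler stand-in: overdamped Langevin dynamics on a toy regression loss. The paper itself uses Nosé-Hoover chain thermostats rather than Langevin noise, so the update rule, function names, and hyperparameter values below are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, X, y):
    """Gradient of mean-squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def langevin_sample(w, X, y, T=0.01, lr=1e-2, n_steps=2000, n_keep=200):
    """Sample an ensemble of parameter vectors at 'temperature' T.

    Each update is a gradient step plus thermal noise with standard
    deviation sqrt(2 * lr * T); the trailing n_keep iterates form the
    ensemble. (A stand-in for the paper's Nosé-Hoover chain scheme.)
    """
    ensemble = []
    for step in range(n_steps):
        noise = rng.normal(0.0, np.sqrt(2.0 * lr * T), size=w.shape)
        w = w - lr * loss_grad(w, X, y) + noise
        if step >= n_steps - n_keep:
            ensemble.append(w.copy())
    return np.array(ensemble)

# Toy data: y = 3x + noise; the sampled ensemble fluctuates around w = 3.
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)
ensemble = langevin_sample(np.zeros(1), X, y)
w_mean = ensemble.mean(axis=0)  # ensemble-average parameter, near 3.0
```

At T = 0 this reduces to plain gradient descent toward a single optimum; the nonzero temperature is what keeps the sampler exploring a neighbourhood of "good enough" parameter settings instead.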

The methodology is validated through a series of computational experiments. First, the authors demonstrate that simmering can act as a corrective tool, successfully 'retrofitting' and improving the performance of models that have already been overfit by standard optimization techniques. More significantly, when applied from the start of training ('ab initio'), simmering is shown to produce models that are inherently more robust and generalizable. A key feature of this ensemble-based approach is its natural ability to quantify prediction uncertainty, providing a measure of the model's confidence in its own outputs, a capability often lacking in single-solution models.

The paper's central claims are substantiated with strong quantitative results on standard benchmark tasks. In an image classification task using the CIFAR-10 dataset, the simmering method achieved a test accuracy of over 82% in 20 training epochs, significantly outperforming established techniques like dropout and ensembled early stopping, which did not exceed 76%. On a more complex Portuguese-to-English language translation task, simmering not only surpassed the accuracy of its competitors but did so in less than half the training time (21 epochs versus over 53). These findings suggest that shifting the training goal from finding an optimal solution to sampling a sufficient ensemble can lead to models that are both more accurate and more efficient.

Research Impact and Future Directions

Overall, the evidence presented strongly supports the paper's central thesis that 'sufficient is better than optimal' for training neural networks. The simmering method is demonstrated to be a viable, and often superior, alternative to standard optimization-based techniques. This conclusion is most strongly corroborated by the direct quantitative comparisons on CIFAR-10 and language translation tasks (Figure 3), where simmering achieved higher accuracy in equivalent or significantly less training time. However, the reliability of these claims is materially weakened by a critical methodological omission: the lack of statistical significance testing for these key results. Without formal statistical validation, it remains an unresolved issue whether the observed performance advantages are consistently reproducible or potentially attributable to random experimental variation.

Major Limitations and Risks: The most significant risk to the paper's conclusions is the previously mentioned lack of statistical testing in the Results section, which prevents a rigorous assessment of the method's claimed superiority. Second, the evidence supporting simmering's ability to produce more nuanced confidence estimates (Supplementary Figure 3) is anecdotal, based on a case study of only two images, which limits the generalizability of this claim. Finally, the Methods section introduces a novel algorithm with several unfamiliar, physics-based hyperparameters without providing practical guidance on their selection (Algorithm 1). This poses a substantial barrier to the method's adoption and reproducibility, as practitioners have no clear starting point for tuning the algorithm for new problems.

Based on the presented work, simmering can be recommended for adoption in research and exploratory settings with Medium confidence. The proof-of-concept study design provides compelling evidence of its potential on important benchmarks. However, the confidence level is constrained by the lack of statistical rigor and clear implementation guidelines. The single most critical next step to increase confidence would be a follow-up study that reproduces the key benchmark comparisons with multiple random seeds and applies appropriate statistical tests (e.g., paired t-tests) to formally validate the significance of the performance differences. Such a study would be essential to confirm that simmering's advantages are not just apparent, but statistically robust.
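The recommended validation step is straightforward to carry out. The sketch below runs a paired t-test on per-seed test accuracies using only NumPy; all accuracy numbers are made-up placeholders, not results from the paper, and the critical value is the standard two-sided 5% Student-t threshold for four degrees of freedom.

```python
import numpy as np

# Hypothetical per-seed test accuracies (placeholder numbers only);
# each index corresponds to one shared random seed / data split.
simmering = np.array([0.823, 0.818, 0.826, 0.821, 0.819])
baseline  = np.array([0.755, 0.751, 0.760, 0.748, 0.757])

# Paired t-test: analyze the per-seed differences directly.
d = simmering - baseline
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

T_CRIT = 2.776  # two-sided 5% critical value, Student t, df = 4
significant = abs(t_stat) > T_CRIT
```

A paired (rather than independent-samples) test is appropriate here because both methods would be run on the same seeds and splits, so per-seed differences remove shared variance.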

Critical Analysis and Recommendations

Clear Problem Framing and Memorable Concept (written-content)
The abstract effectively frames optimization-based training as 'misguided' and introduces 'simmering' as a memorable, physics-based alternative. This strong narrative immediately establishes the paper's significance and makes its core contribution easy to grasp, enhancing its impact on the reader.
Section: Abstract
Lack of Quantitative Evidence in Abstract (written-content)
The abstract makes strong qualitative claims of outperformance but omits specific numbers. Including a key quantitative result, such as the >6 percentage point accuracy improvement on CIFAR-10, would make the claims more concrete and compelling, significantly strengthening the paper's initial pitch.
Section: Abstract
Novel Interdisciplinary Approach (written-content)
The paper's proposal to use concepts from molecular dynamics (specifically, Nosé-Hoover chain thermostats) to solve a core machine learning problem is highly innovative. This cross-pollination of ideas provides a strong, principled foundation for the method, distinguishing it from more conventional, heuristic-based approaches.
Section: Introduction
Ambiguous Definition of Core Concept (written-content)
The paper's central thesis rests on the concept of 'sufficiency,' yet this term is introduced without a clear operational definition in the introduction. Adding a concise explanation (e.g., the systematic sampling of an ensemble of near-optimal models) would immediately ground the reader in the paper's core idea, enhancing clarity.
Section: Introduction
Strong Quantitative Benchmarking on Diverse Tasks (graphical-figure)
The results compellingly demonstrate simmering's superiority against standard methods (dropout, early stopping) on two distinct, challenging benchmarks: image classification (CIFAR-10) and language translation (Figure 3). This use of diverse and relevant tasks provides strong evidence for the method's broad applicability and practical advantages.
Section: Results
Methodological Limitation: Lack of Statistical Significance Testing (graphical-figure)
The central claims of outperformance in Figure 3 are based on point estimates of accuracy without any statistical tests to assess significance. This is a critical methodological omission, as it is impossible to determine if the observed performance gaps are real or simply the result of random chance, which weakens the certainty of the paper's primary conclusions.
Section: Results
Powerful Conceptual Framing of the Problem (written-content)
The discussion provides an insightful conceptual distinction between conventional models with 'anticipated' behavior and neural networks with 'emergent' behavior. This framing effectively justifies why traditional optimization is ill-suited for neural networks and a new paradigm is necessary, elevating the paper's contribution from a new method to a new way of thinking.
Section: Discussion
Missed Connection to Bayesian Inference (written-content)
The discussion connects simmering to physics and information geometry but fails to explicitly link it to the more familiar machine learning framework of Bayesian Neural Networks. Drawing this parallel would ground the work in an established context and make its principles of sampling and uncertainty quantification more accessible to a broader ML audience.
Section: Discussion
Principled Theoretical Grounding (written-content)
The simmering method is not presented as an ad-hoc heuristic but is rigorously derived from the first principles of statistical mechanics and information theory. This principled foundation provides a strong theoretical justification for the approach, lending significant scientific weight to the paper's claims.
Section: Methods
Practical Limitation: Lack of Hyperparameter Guidance (graphical-figure)
The algorithm introduces several new hyperparameters derived from physics (e.g., temperature T, thermostat masses Q) that are unfamiliar in a typical machine learning context. The paper provides no practical guidance on how these should be selected or tuned, creating a significant barrier to implementation and reproducibility for other researchers.
Section: Methods
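For orientation on what these hyperparameters do, here is a schematic single-thermostat Nosé-Hoover update (the paper uses thermostat chains, and nothing below is guidance taken from the paper): T sets the kinetic-energy level the thermostat drives the system toward, while Q sets how quickly the friction variable reacts.

```python
import numpy as np

def nose_hoover_step(w, p, xi, grad, T, Q, dt=1e-3):
    """One Euler step of a single Nosé-Hoover thermostat (schematic).

    w  : parameters            p : conjugate momenta (unit masses)
    xi : thermostat friction   T : target temperature
    Q  : thermostat 'mass' controlling how fast xi responds
    """
    n = p.size
    p = p + dt * (-grad(w) - xi * p)    # force minus thermostat drag
    w = w + dt * p
    xi = xi + dt * (p @ p - n * T) / Q  # drives sum(p^2) toward n * T
    return w, p, xi

# Illustration on a quadratic toy loss (grad(w) = w): when the momenta
# are 'hotter' than the target T, the friction xi increases (damping);
# when they are colder, xi decreases (injecting energy).
w0 = np.zeros(3)
grad = lambda w: w
_, _, xi_hot = nose_hoover_step(w0, 2.0 * np.ones(3), 0.0, grad, T=0.1, Q=1.0)
_, _, xi_cold = nose_hoover_step(w0, np.zeros(3), 0.0, grad, T=0.1, Q=1.0)
```

Larger Q makes xi respond more sluggishly (a 'heavier' thermostat); smaller Q couples the parameters to the heat bath more tightly. A sensitivity study over T and Q of this kind is exactly the missing practical guidance noted above.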

Section Analysis

Results


Non-Text Elements

Figure 1. Sufficient-training based retrofitting reduces overfitting in...
Full Caption

Figure 1. Sufficient-training based retrofitting reduces overfitting in optimized networks. Optimization-based training produces discrepancies in performance on training vs. test data (c.f. light blue and dark blue MSE curves, panel a) that manifest in discrepancies between model fits and underlying relationships (c.f. dark blue and green curves, respectively, in panel b). We apply simmering to retrofit the overfit network by gradually increasing temperature (c.f. grey lines in panel a), which reduces overfitting (panel c) before producing an ensemble of networks that yield model predictions that are nearly indistinguishable from the underlying data distribution (c.f. dark magenta and green curves, panel d). Analogous applications of simmering can be employed to retrofit classification problems (panel e) and regression problems (panel f). Panel e shows prediction accuracy for image classification (MNIST), event classification (HIGGS), and species classification (IRIS). Panel f shows fit quality (squared residual, R²) for regression problems including the sinusoidal fit shown in detail in panels a-d, as well as single- (S) and multivariate regression (M) of automotive mileage data (AUTO-MPG). In all cases, simmering reduces the overfitting produced by Adam (indicated by black arrows).

Figure/Table Image (Page 6)
First Reference in Text
Fig. 1a gives an example of this retrofitting procedure in the case of a standard curve fitting problem.
Description
  • Figure Overview: This multi-panel figure demonstrates a new method called 'simmering' for improving the performance of neural networks, which are a type of machine learning model. It shows how simmering can fix a common problem called 'overfitting,' where a model learns the training data too well, including its noise, and then fails to make accurate predictions on new, unseen data. The figure is divided into three main parts: panels a-d provide a step-by-step visual example of the process on a simple curve-fitting task, while panels e and f show summary results on more complex classification and regression problems, respectively.
  • Panels a-d: Retrofitting a Noisy Sine Wave: This series of plots illustrates the core concept. Panel 'a' tracks the model's error—specifically, the Mean Squared Error (MSE), a measure of the average squared difference between estimated values and actual values—over time. Initially, using a standard training method ('Adam'), the training error (light blue) drops to near zero, but the error on new 'test' data (dark blue) levels off, indicating overfitting. The 'simmering' process is then applied, which involves gradually increasing a 'temperature' parameter (grey line, rising from 0 to 0.05). During simmering, both training (pink) and test (magenta) errors remain low and close together. Panels 'b', 'c', and 'd' show the model's actual predictions at different stages. In 'b', the Adam-trained model (blue line) wiggles to hit the training data points perfectly but misses the true underlying sine wave (green line). In 'c' and 'd', as simmering proceeds, the model's prediction (magenta line) becomes smoother and eventually, in 'd', almost perfectly matches the true green line.
  • Panel e: Performance on Classification Tasks: This bar chart compares the performance of the standard Adam method versus the simmering method on three different classification tasks: identifying handwritten digits (MNIST), classifying particle physics events (HIGGS), and identifying flower species (IRIS). For all three tasks, simmering results in a higher test accuracy (the model's performance on unseen data) than the Adam method. For example, on the HIGGS dataset, Adam's test accuracy is approximately 0.65, while simmering improves it to about 0.68. The chart also shows that simmering reduces the gap between training performance and test performance, which is a key sign of reduced overfitting.
  • Panel f: Performance on Regression Tasks: This bar chart shows a similar comparison for regression tasks, which involve predicting a continuous value instead of a category. The tasks include the sine wave from the first panels, and predicting car mileage from single (S) or multiple (M) vehicle features (AUTO-MPG). Performance is measured by R-squared (R²), a statistic where 1.0 represents a perfect fit. In all cases, simmering improves the R² score on the test data. For instance, on the single-variable AUTO-MPG task, Adam achieves a test R² of about 0.58, whereas simmering boosts it to approximately 0.7.
Scientific Validity
  • ✅ Strong illustrative example: The use of a simple, intuitive sine wave fitting problem (panels a-d) is an excellent pedagogical choice to clearly demonstrate the mechanism and effect of the 'simmering' method in correcting a visibly overfit model. The visual progression from panel b to d provides compelling qualitative support for the method's efficacy.
  • ✅ Broad experimental validation: The inclusion of multiple standard benchmark datasets for both classification (MNIST, HIGGS, IRIS in panel e) and regression (SINE, AUTO-MPG in panel f) strengthens the claims of the method's general applicability across different problem domains.
  • 💡 Lack of uncertainty quantification: The bar charts in panels e and f present point estimates of performance metrics (Accuracy and R²). Without error bars or confidence intervals derived from multiple experimental runs (e.g., with different random seeds or data splits), the stability and statistical significance of the observed improvements cannot be assessed. It is unclear if the reported improvements are consistent or the result of a single favorable run.
  • 💡 Ambiguity in hyperparameters: In panel 'a', the temperature 'T' is shown to be increased to a final value of 0.05. The rationale for this specific value and the temperature schedule is not provided. The performance of the method may be sensitive to this hyperparameter, and a sensitivity analysis or justification for its selection would enhance the rigor of the demonstration.
  • 💡 Vague definition of 'Time' axis: The x-axis in panel 'a' is labeled 'Time'. This unit is ambiguous and should be specified as training epochs, iterations, or wall-clock time to allow for a proper interpretation of the training dynamics and computational cost.
Communication
  • ✅ Effective narrative structure: The figure is well-structured to tell a clear story. It starts with a detailed, step-by-step example of the problem and solution (a-d) and then zooms out to show summarized evidence of its broader applicability (e-f). This makes the central message easy to follow.
  • ✅ Consistent and clear color scheme: The color coding is used effectively and consistently throughout the figure: blue tones for the baseline 'Adam' optimizer, pink/magenta for the proposed 'Simmering' method, and green for the ground truth. This consistency aids in quick and accurate interpretation.
  • ✅ Useful visual cues: The inclusion of black arrows in panels e and f is a simple but highly effective visual device that immediately draws the reader's attention to the primary conclusion: that simmering improves test performance compared to the baseline in all cases shown.
  • 💡 Incomplete axis labeling: In panel 'a', the right-side y-axis corresponding to the grey temperature line is missing a label. It should be explicitly labeled 'Temperature (T)' to avoid ambiguity.
  • 💡 Suboptimal legend design: In panels e and f, the legend is split into two parts: one for the colors ('Adam' vs 'Simmering') and another for the shades ('Train' vs 'Test'). A consolidated legend, such as a 2x2 grid, could present this information more efficiently and reduce the cognitive load on the reader.
Figure 2. Ab initio sufficient training avoids overfitting and yields...
Full Caption

Figure 2. Ab initio sufficient training avoids overfitting and yields prediction uncertainty distributions. Ensembles of models sampled at finite temperature yield smooth decision boundaries (white lines in panel a) and average predictions (dark magenta curve in panel b) that are not skewed by noisy training data (star, triangle and square black markers in panel a, and round black markers in panel b). Test data (star, triangle and square grey markers in panel a, and round grey markers in panel b) are overlayed to show how the ensemble predictions (decision boundaries and average curve in panels a and b, respectively) generalize to unseen data. The background in panel a is shaded using a weighted average of the ensemble votes for each class at each point in the feature space, showing regions of confident ensemble prediction (regions of bright purple, teal, or orange in panel a) vs. uncertain prediction (intermediate coloured regions in panel a). Analogously, panel b shows the density of predicted curves (transparent magenta curves in panel b) around the ensemble average (dark magenta curve in panel b). For classification problems, panels c and d show the ensemble's decision-making confidence at different points in the data feature space via the proportion of ensemble votes for each class (c.f. panels c and d correspond to pink markers labelled c and d on panel a). For regression problems, we can compare the distributions of sampled predictions with the ensemble average at different input values (c.f. pink solution distribution and dark magenta point on panels e and f, sampled at two different inputs indicated in panel b) and assess how the data noise distribution affects predictions throughout the feature space. Ab initio sufficient training produces correspondingly sufficiently descriptive predictions alongside insight into the ensemble prediction process that is inaccessible with a singular, optimized model.

Figure/Table Image (Page 7)
First Reference in Text
Fig. 2 shows results from sufficiently trained neural networks in which simmering was deployed from the outset, without the need for optimization.
Description
  • Figure Overview: This six-panel figure illustrates how a machine learning training method called 'simmering', when used from the start ('ab initio'), can produce not just a single answer but also a measure of its own uncertainty. It demonstrates this on two common types of tasks: classification (panels a, c, d) and regression (panels b, e, f).
  • Classification Example (Panels a, c, d): Panel 'a' shows the model classifying Iris flowers into three species based on sepal and petal width. The white lines are 'decision boundaries,' the thresholds where the model's prediction changes from one species to another. The colored background indicates the model's confidence, with solid colors (purple, teal, orange) representing high confidence. Panels 'c' and 'd' quantify this confidence at two specific points marked on panel 'a'. At point 'c', the model is highly confident in its prediction of 'Virginica', with about 80% of its internal 'votes' going to that class. At point 'd', it is even more confident in predicting 'Setosa', with over 80% of votes for that class.
  • Regression Example (Panels b, e, f): Panel 'b' shows the model predicting a car's 'Miles per gallon' based on its 'Horsepower'. Instead of one single prediction line, it shows an ensemble of possibilities (many faint magenta lines) and their average (the solid dark magenta line). This cloud of lines visually represents the model's uncertainty. Panels 'e' and 'f' show this uncertainty as a probability distribution for two specific horsepower values. For a car with horsepower around 100 (point 'e'), the model predicts a miles per gallon value centered around 25, with a specific range of possibilities. For a car with horsepower around 180 (point 'f'), it predicts a value centered around 18, again with a distribution showing its uncertainty.
Scientific Validity
  • ✅ Excellent demonstration of uncertainty quantification: The figure's core strength is its clear and appropriate visualization of prediction uncertainty, a key claim of the paper. Using ensembles to generate distributions of outcomes (panels c-f) and visualizing prediction density (panel b) are standard and rigorous methods for this purpose, and they strongly support the assertion that the method yields more than a single point prediction.
  • ✅ Appropriate choice of illustrative datasets: The use of the classic Iris and Auto-MPG datasets is well-suited for this demonstration. Their low dimensionality allows for clear 2D visualization, enabling the audience to intuitively grasp the concepts of decision boundaries and regression uncertainty without being distracted by data complexity.
  • 💡 Lack of a baseline comparison: The figure effectively shows the properties of the 'simmering' method but does so in isolation. The central claim that it 'avoids overfitting' and is superior to 'a singular, optimized model' would be much more powerfully substantiated with a direct side-by-side comparison. For instance, showing the jagged, overfit decision boundary and single prediction line from a standard optimizer on the same plots would provide a crucial reference point and make the benefits of simmering immediately apparent.
  • 💡 Generalization is shown qualitatively, not quantitatively: The caption states that the predictions generalize to unseen test data (grey markers). While the grey markers in panels 'a' and 'b' appear to be well-described by the model's predictions, this is a qualitative assessment. Including quantitative performance metrics on the test set (e.g., test accuracy for panel 'a', test R² for panel 'b') directly within the figure would provide stronger, objective evidence to support this claim.
Communication
  • ✅ Highly effective panel linkage: The design choice to explicitly link points in the main plots (panels a, b) to detailed distributional plots (panels c, d, e, f) is an outstanding communication strategy. It makes the abstract concept of querying an ensemble for prediction uncertainty at a specific point concrete and easy to follow.
  • ✅ Intuitive visualization of uncertainty: The use of a shaded background in panel 'a' to represent classification confidence and a 'cloud' of transparent curves in panel 'b' to represent regression possibilities are excellent visual metaphors. They intuitively convey the concept of model uncertainty without requiring deep technical understanding.
  • 💡 Potentially confusing markers in panel a: In panel 'a', the training data (black markers) and test data (grey markers) use the same shapes (star, triangle, square) for different classes. While a legend is provided, this can be cognitively demanding. Consider using filled markers for training and open markers of the same shape/color for testing to create a clearer visual distinction between the two sets.
  • 💡 Redundant labeling: The class labels ('Setosa', 'Versicolor', 'Virginica') are present in the legend of panel 'a' and also as x-axis labels in panels 'c' and 'd'. To reduce clutter, the legend in panel 'a' could potentially be removed if the clusters are distinct enough, relying on the clearer labels in the bar charts.
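The per-input uncertainty readout shown in panels e and f can be reproduced in sketch form from any ensemble of sampled predictions. The numbers below are synthetic stand-ins (not the paper's Auto-MPG ensemble), chosen only to mimic the panel-e setting of predictions centred near 25 mpg.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the predictions of ~200 sampled models at one input
# (e.g. one horsepower value); shape (n_models,).
preds = rng.normal(loc=25.0, scale=1.5, size=200)

ensemble_mean = preds.mean()                # the single 'point' prediction
lo, hi = np.percentile(preds, [2.5, 97.5])  # 95% predictive interval
```

This is the key practical payoff of sampling an ensemble: the spread (lo, hi) is a direct uncertainty estimate that a single optimized model cannot provide without additional machinery.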
Figure 3. Simmering outperforms ensembled early stopping and dropout on the...
Full Caption

Figure 3. Simmering outperforms ensembled early stopping and dropout on the CIFAR-10²⁵ (panel a) and Portuguese-English TED talk transcript translation²⁷ (panel b) datasets. Simmering's ensemble prediction (rectangular marker) achieves both the highest accuracy and the most significant ensembling improvement (rectangular marker vs. round markers), with the latter indicating that the advantage of simmering extends beyond just ensembling. In contrast, the early stopped ensemble accuracy (rectangular marker) does not exceed that of its ensemble members (round markers) for both training tasks. For the CIFAR-10 dataset, we employed the ConvNet architecture²⁶, and all non-simmering cases learned via stochastic gradient descent. The early stopping ensemble consists of 100 independently optimized early stopped models, with an average training duration of 14.56 epochs. Dropout and ab initio simmering each trained for 20 epochs, and the models corresponding to the last 2000 weight updates contributed to the simmering ensemble. We used dropout's inference mode prediction as its ensemble prediction,³⁸ and aggregated the early stopping and simmering ensembles via majority voting. For the translation task, we trained a reduced version (described in Supplementary Methods) of the Transformer architecture presented in Ref.¹⁸ with a pre-trained BERT tokenizer,³⁵ and assessed accuracy via teacher-forced token prediction accuracy. We fixed the learning rate for all cases, and trained non-simmering cases with the Adam optimizer.¹⁶ The early stopping ensemble consists of 10 independently trained models, with an average training time of 53.1 epochs, aggregated with majority voting. We optimized a model with dropout for 60 epochs and used its inference mode as its ensemble prediction. Accuracy convergence curves for both training tasks are shown in Supplementary Figures 1-2, and additional implementation details for the comparisons are given in Supplementary Methods.
The simmering ensemble exceeded the test accuracy of all other cases after only 21 training epochs, with a majority-voted ensemble prediction from 200 models sampled during the last epoch. For equivalent training time, ab initio simmering produces more accurate predictions than other ensembled overfitting mitigation techniques on the CIFAR-10 dataset, while on the natural language processing task it both accelerates training and exceeds the accuracy of the other techniques.

Figure/Table Image (Page 8)
Figure 3. Simmering outperforms ensembled early stopping and dropout on the CIFAR-10²⁵ (panel a) and Portuguese-English TED talk transcript translation²⁷ (panel b) datasets. Simmering's ensemble prediction (rectangular marker) achieves both the highest accuracy and the most significant ensembling improvement (rectangular marker vs. round markers), with the latter indicating that the advantage of simmering extends beyond just ensembling. In contrast, the early stopped ensemble accuracy (rectangular marker) does not exceed that of its ensemble members (round markers) for both training tasks. For the CIFAR-10 dataset, we employed the ConvNet architecture ²⁶, and all non-simmering cases learned via stochastic gradient descent. The early stopping ensemble consists of 100 independently optimized early stopped models, with an average training duration of 14.56 epochs. Dropout and ab initio simmering each trained for 20 epochs, and the models corresponding to the last 2000 weight updates contributed to the simmering ensemble. We used dropout's inference mode prediction as its ensemble prediction, ³⁸ and aggregated the early stopping and simmering ensembles via majority voting. For the translation task, we trained a reduced version (described in Supplementary Methods) of the Transformer architecture presented in Ref. ¹⁸ with a pre-trained BERT tokenizer, ³⁵ and assessed accuracy via teacher-forced token prediction accuracy. We fixed the learning rate for all cases, and trained non-simmering cases with the Adam optimizer. ¹⁶ The early stopping ensemble consists of 10 independently trained models, with an average training time of 53.1 epochs, aggregated with majority voting. We optimized a model with dropout for 60 epochs and used its inference mode as its ensemble prediction. Accuracy convergence curves for both training tasks are shown in Supplementary Figures 1-2, and additional comparison implementation information is details in Supplementary Methods. 
The simmering ensemble exceeded the test accuracy of all other cases after only 21 training epochs, with a majority-voted ensemble prediction from 200 models sampled during the last epoch. For equivalent training time, ab initio simmering produces more accurate predictions than the other ensembled overfitting mitigation techniques on the CIFAR-10 dataset; on the natural language processing task, simmering both accelerates training and exceeds the accuracy of the other techniques.
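As a concrete illustration of the majority-voting aggregation and vote-proportion confidence described in the caption, a minimal sketch (function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np

def majority_vote(member_predictions: np.ndarray) -> np.ndarray:
    """Aggregate hard class predictions from an ensemble by majority vote.

    member_predictions: int array of shape (n_members, n_samples), each
    entry a predicted class index. Ties break toward the lowest class
    index (argmax's default behaviour).
    """
    n_classes = int(member_predictions.max()) + 1
    n_samples = member_predictions.shape[1]
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for member in member_predictions:  # one row per ensemble member
        counts[np.arange(n_samples), member] += 1
    return counts.argmax(axis=1)

def vote_confidence(member_predictions: np.ndarray) -> np.ndarray:
    """Proportion of ensemble votes received by the winning class."""
    n_members = member_predictions.shape[0]
    winners = majority_vote(member_predictions)
    return (member_predictions == winners).sum(axis=0) / n_members
```

For example, three members voting `[0, 1, 2]`, `[0, 1, 1]`, and `[1, 1, 2]` on three samples yield the ensemble prediction `[0, 1, 2]`, with unanimous confidence only on the second sample.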
First Reference in Text
Fig. 3 shows that, for both learning tasks, ab initio simmering achieves a higher test data accuracy than standard implementations of early stopping and dropout.
Description
  • Figure Overview: This figure presents two plots that compare a new machine learning training method called 'Simmering' against two standard methods, 'Early Stopping' and 'Dropout'. The comparison is performed on two distinct and challenging tasks: classifying images (panel a) and translating text (panel b). The plots show the accuracy of individual models (round dots) and the combined, or 'ensembled', accuracy of a group of models (rectangular bar).
  • Panel a: Image Classification (CIFAR-10): In this panel, 'Simmering' achieves the highest final ensemble accuracy of approximately 82% after 20 training cycles ('epochs'). This is notably higher than the ensemble accuracy for 'Early Stopping' (~73.5%) and 'Dropout' (~75.5%). A key observation is that for Simmering, the ensemble's accuracy is significantly better than any of its individual models (which range from 74-78%). In contrast, the ensemble accuracy for 'Early Stopping' is roughly the same as its best individual models, showing little benefit from combining them.
  • Panel b: Language Translation: This panel shows results for a text translation task. 'Simmering' again achieves the highest ensemble accuracy at about 42%, and it does so in only 21 epochs. The 'Early Stopping' method reaches about 41% accuracy but requires an average of 53.1 epochs. The 'Dropout' method achieves about 40% accuracy after 60 epochs. This demonstrates that Simmering is not only more accurate but also significantly more efficient, requiring less than half the training time of the other methods to achieve a better result.
Scientific Validity
  • ✅ Strong comparative analysis: The study design is robust in its comparison of the novel 'simmering' method against two highly relevant and widely used baseline techniques for mitigating overfitting (ensembled early stopping and dropout). This provides a strong context for evaluating the proposed method's performance.
  • ✅ Diverse and challenging benchmark tasks: The use of both a standard computer vision dataset (CIFAR-10) and a complex natural language processing task (translation) effectively demonstrates the potential versatility and broad applicability of the simmering method across different domains and model architectures (ConvNets and Transformers).
  • ✅ Clear demonstration of ensembling benefit: The figure correctly visualizes and the caption explicitly discusses a crucial aspect of ensemble methods: the performance gain of the aggregate model over its individual members. The finding that simmering yields a large ensembling improvement while early stopping does not is a significant result that is well-supported by the visual evidence.
  • 💡 Lack of statistical testing: While the plots show clear differences in the mean aggregate accuracies, the absence of statistical tests (e.g., t-tests or ANOVA with post-hoc analysis) makes it impossible to determine if these differences are statistically significant. The observed superiority of simmering could potentially be due to random chance.
  • 💡 Potential for unequal hyperparameter optimization: The caption details the training setup but does not state whether a systematic and equitable hyperparameter search was conducted for all three methods. The performance of dropout and early stopping is known to be sensitive to their respective hyperparameters. If the baselines were not optimally tuned, the comparison may unfairly favor the simmering method.
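One concrete way to act on the statistical-testing suggestion above would be McNemar's exact test, a standard paired test for comparing two classifiers evaluated on the same test set; a minimal sketch (the function name and example counts are illustrative):

```python
import math

def mcnemar_p(n01: int, n10: int) -> float:
    """Two-sided exact McNemar test for paired classifier comparison.

    n01: test samples classifier A gets right and classifier B gets wrong.
    n10: samples B gets right and A gets wrong.
    Under the null hypothesis of equal accuracy, the discordant pairs
    follow Binomial(n01 + n10, 0.5); returns the two-sided exact p-value.
    """
    n = n01 + n10
    k = min(n01, n10)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

With, say, 15 samples that only method A classifies correctly against 5 that only method B does, the difference is significant at the 5% level; an even 10-10 split is not.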
Communication
  • ✅ Effective visualization of ensemble performance: The use of jittered strip plots for individual ensemble members (round markers) combined with a distinct marker for the aggregate performance (rectangle) is an excellent way to simultaneously convey the distribution of member performance and the final ensemble result. This clearly illustrates the concept of ensembling and its benefits.
  • ✅ High information density in a compact format: Each panel successfully communicates multiple layers of information: a three-way comparison of methods, the performance distribution within ensembles, the final aggregated score, and the relative training cost (epochs). This is achieved without making the plots feel overly cluttered.
  • ✅ Exceptionally detailed and self-contained caption: The caption is comprehensive, providing extensive methodological details about the datasets, model architectures, training parameters, and aggregation methods. This allows the figure to stand on its own, a hallmark of excellent scientific communication.
  • 💡 Overplotting in panel a: In panel a, the large number of ensemble members for 'Early Stopping' (100) and 'Simmering' (2000 weight updates imply many models) leads to significant overplotting of the round markers, obscuring their true distribution. Consider replacing the strip plot with a violin plot or a box plot to better represent the distribution of member accuracies, especially for the 'Early Stopping' case.
Supplementary Figure 1. The training convergence curves corresponding to the...
Full Caption

Supplementary Figure 1. The training convergence curves corresponding to the CIFAR-10 training comparison results, also shown in Main Fig. 3, show that simmering (panel a) produces both the highest ensemble accuracy (red vs. green or blue rectangular marker, panels b and c respectively) and the most significant ensembling advantage (red round vs. rectangular markers in panel a) compared to early stopping (green round vs. rectangular markers in panel b), and dropout (blue round vs. rectangular markers in panel c). In all panels, lines denote the individual model accuracy averaged over all weight updates of a given epoch, and round markers denote the performance of individual models that are included in the ultimate ensemble.

Figure/Table Image (Page 26)
First Reference in Text
Supplementary Figure 1 shows the training performance curves that accompany the CIFAR-10 test accuracy results presented in Main Fig. 3.
Description
  • Figure Overview: This figure consists of three graphs (panels a, b, and c) that show how the accuracy of three different machine learning training methods evolves over time. Each panel corresponds to a method: 'simmering' (a), 'early stopping' (b), and 'dropout' (c). The goal is to see not just the final result, but how each method learns over 20 training cycles, known as 'epochs', on a complex image recognition task (CIFAR-10).
  • Visual Elements Explained: In each graph, the lines represent the average accuracy during a training epoch on both the training data (the information the model learns from) and validation data (a separate set to check for performance on new data). The round dots show the final accuracy of individual models that are selected to be part of a combined 'ensemble' or team. The single rectangular bar represents the final accuracy of this combined team after all training is complete.
  • Performance of Simmering (Panel a): The simmering method shows training accuracy quickly reaching 100% (light pink line). The individual models chosen for its ensemble (pink dots) have test accuracies around 80%. Critically, when these models are combined, their team accuracy (red rectangle) is higher, at approximately 82%. This demonstrates a clear 'ensembling advantage,' where the team is better than its individual members.
  • Performance of Early Stopping (Panel b): The early stopping method shows accuracies leveling off around 75%. The individual models in its ensemble (green dots) have a range of accuracies, mostly between 70% and 76%. The final team accuracy (green rectangle) is about 73.5%, which is right in the middle of the individual performances. This indicates that unlike simmering, combining these models provides no significant extra benefit.
  • Performance of Dropout (Panel c): The dropout method's accuracy steadily increases, with the final model performance (blue rectangle) reaching about 75.5%. This is better than early stopping but not as good as simmering. No individual models (dots) are shown because dropout's ensembling is an implicit part of how the method works, rather than an explicit combination of separate models.
Scientific Validity
  • ✅ Provides crucial mechanistic insight: This figure effectively supplements the main text results by visualizing the training dynamics. It provides strong evidence for the paper's claims by showing how simmering achieves superior performance, specifically by illustrating its unique and significant 'ensembling advantage' (panel a) compared to the lack thereof in the early stopping ensemble (panel b).
  • ✅ Direct support for key claims: The visual comparison between the distribution of individual model performances (round markers) and the final aggregate performance (rectangular marker) offers direct, compelling support for the caption's central claim that simmering's benefit 'extends beyond just ensembling'.
  • 💡 Methodological ambiguity in ensemble composition: The caption describes the early stopping ensemble as composed of 100 independently optimized models, which is a standard approach. However, the simmering ensemble is derived from the last 2000 weight updates of a single training run. These are not independent models, and this fundamental difference in how the ensembles are generated should be more prominently discussed, as it could affect the interpretation of the results and the diversity of the ensemble members.
  • 💡 Incomplete comparison of training dynamics: Panels 'a' and 'b' correctly include both training and validation accuracy curves, which are essential for diagnosing overfitting. Panel 'c' for the dropout method is missing the validation curve, which prevents a complete and direct comparison of how all three methods manage the gap between training and validation performance over time.
Communication
  • ✅ Clear comparative structure: The side-by-side layout of the three panels with identical axis scales and consistent visual encoding is highly effective for direct comparison of the different training methods' convergence behaviors.
  • ✅ Effective visual encoding of data types: The choice to represent epoch averages as lines, individual ensemble members as dots, and the final aggregate result as a distinct rectangular marker is an excellent visual strategy. It allows multiple layers of information to be conveyed clearly and intuitively within each panel.
  • 💡 Redundant legends: The legend is repeated in each of the three panels. To improve visual economy and reduce clutter, a single, shared legend for the entire figure would be more efficient.
  • 💡 Overplotting obscures data distribution: In panels 'a' and 'b', the large number of round markers are heavily overplotted, making it difficult to discern the true distribution and density of individual model accuracies. Consider using smaller, semi-transparent markers or alternative plot types like violin plots to provide a clearer representation of this data.
Supplementary Figure 2. The training convergence curves corresponding to the...
Full Caption

Supplementary Figure 2. The training convergence curves corresponding to the PT-EN translation training comparison results, also shown in Main Fig. 3, show that simmering (panel a) produces both the highest ensemble accuracy (red rectangular marker, all panels) and the most significant ensembling advantage (round vs. rectangular markers in panel a) in less than half of the training time (grey vs. coloured line, and red vs. green or blue rectangular marker in panels b and c) compared to training with an equivalent learning rate with early stopping (green lines and markers, panel b), and dropout (blue lines and markers, panel c). In all panels, lines denote individual-model accuracy averaged over all weight updates of a given epoch, and round markers denote the performance of individual models that are included in the ultimate ensemble. The simmering epoch-averaged validation curve and ensembled test accuracy (grey line and red rectangular marker, respectively, in panels b and c) are provided for convergence timescale and performance comparison.

Figure/Table Image (Page 27)
First Reference in Text
Supplementary Figure 2 shows the accuracy convergence curves that accompany the PT-EN translation test accuracy results (rectangular markers in Supplementary Fig. 2) shown in Main Fig. 3.
Description
  • Figure Overview: This figure presents three graphs that track the performance of different machine learning training methods over time on a language translation task. Each panel shows a different method: 'Simmering' (panel a), 'Early Stopping' (panel b), and 'Dropout' (panel c). The y-axis shows 'Categorical Accuracy,' a measure of how often the model makes the correct prediction, while the x-axis shows 'Epochs,' which are complete cycles through the training data.
  • Panel a: Simmering Performance: This panel shows that the Simmering method learns very quickly, reaching its peak accuracy of approximately 42% (red rectangle) in just over 15 epochs. The round dots, representing individual models that make up a final 'ensemble' or team, have accuracies around 40%. The final team accuracy is noticeably higher than any individual member, demonstrating a strong benefit from combining them.
  • Panel b: Early Stopping Performance: The Early Stopping method learns more slowly, taking over 50 epochs to reach its peak. The individual models in its ensemble (green dots) have accuracies that vary, with the best one reaching about 41%. The final team accuracy (green rectangle) is also about 41%, showing no significant improvement from combining the models. For comparison, Simmering's performance is overlaid (grey line and red rectangle), showing it achieves a better result in less than half the time.
  • Panel c: Dropout Performance: The Dropout method is the slowest, with accuracy still gradually increasing even at 60 epochs. Its final accuracy is approximately 40% (blue rectangle). Again, Simmering's superior speed and final accuracy are shown overlaid for direct comparison.
Scientific Validity
  • ✅ Powerful demonstration of efficiency: This figure provides critical evidence supporting the claims of efficiency made in the main text. By visualizing the full convergence curves, it compellingly demonstrates that simmering not only achieves higher final accuracy but does so in a fraction of the training time required by established methods like early stopping and dropout.
  • ✅ Strong evidence for ensembling advantage: The figure provides clear visual support for a key mechanistic claim. The noticeable gap between the individual model performances (round markers) and the aggregate performance (rectangle) in panel 'a' contrasts sharply with the lack of such a gap in panel 'b', robustly demonstrating that simmering's ensembling process is uniquely effective.
  • ✅ Direct and fair comparison: The overlay of the simmering results onto the plots for the other methods (panels b and c) is a methodologically sound choice that facilitates a direct and unambiguous comparison of both convergence speed and final performance, strengthening the paper's conclusions.
  • 💡 Ambiguity regarding learning rates: The caption states an 'equivalent learning rate' was used, but the optimal learning rate can differ significantly between optimizers and methods. Without clarification that a hyperparameter search was performed for each method to find its optimal learning rate, it's possible the comparison may be biased if the chosen rate was more favorable to simmering than to the baselines.
Communication
  • ✅ Effective use of overlays for comparison: Superimposing the simmering convergence curve and final result onto the baseline plots in panels b and c is an excellent communication strategy. It allows for immediate, direct comparison without requiring the reader to mentally map data across panels with different x-axis scales.
  • ✅ Consistent visual language: The figure maintains a consistent and clear visual encoding scheme across all panels (lines for averages, dots for individuals, rectangles for ensembles) and uses distinct colors for each method, which greatly aids in clarity and ease of interpretation.
  • 💡 Overly complex caption: The caption is extremely dense and contains a significant amount of methodological detail and cross-references to other panels and figures. While comprehensive, its complexity can be overwhelming. Suggest breaking the caption into a concise summary of the main finding, followed by a more detailed description of the visual elements, to improve readability.
  • 💡 Redundant legends: Each of the three panels includes its own identical legend. To reduce clutter and improve the figure's visual economy, a single, shared legend should be used for the entire figure.
Supplementary Figure 3. Ab initio simmering correctly classifies CIFAR-10 test...
Full Caption

Supplementary Figure 3. Ab initio simmering correctly classifies CIFAR-10 test images. Panel b shows an image with high pixel-intensity standard deviation and low Shannon entropy, which depicts an automobile with a clear foreground and the automobile in full view. Early stopping (green bars), dropout (blue bars) and ab initio simmering (pink bars) ensembles yield predictions with similarly high confidence (high proportion of ensemble votes). In contrast, panel a shows an image with low pixel-intensity standard deviation and high Shannon entropy, which depicts a truck at close range with low background contrast. Early stopping remains nearly equally confident (panel c, green bar), whereas simmering predicts the correct class with lower confidence, aligning with the difficulty of identifying the truck in the image. Simmering's second-most voted prediction bears relation to the correct class ("automobile" and "truck" share similar image features). Though dropout is also uncertain, it predicts the correct label with confidence nearly equivalent to that of most other classes. Thus, ab initio simmering predicts images of varying difficulty accurately, with confidence that can reflect image difficulty.

Figure/Table Image (Page 28)
First Reference in Text
Supplementary Fig. 3 shows how the three ensemble learning approaches predict two different CIFAR-10 test samples (Supplementary Fig. 3a-b) chosen based on relative image difficulty (pixel-intensity standard deviation and Shannon entropy).
Description
  • Figure Overview: This figure compares how three different machine learning methods ('Early stopping', 'Dropout', and 'Simmering') assess their own confidence when classifying two images of different difficulty. The figure presents an 'easy' image (a clear photo of an automobile, panel b) and a 'hard' image (a blurry, close-up of a truck, panel a), and then shows bar charts (panels d and c, respectively) of how a team, or 'ensemble', of models trained with each method 'voted' for the correct category.
  • The 'Easy' Image Case (Panels b & d): Panel b shows a clear image of a red automobile. Panel d shows that for this easy-to-identify image, all three methods are highly confident in the correct prediction. The proportion of ensemble votes for 'automobile' is nearly 1.0 (or 100%) for Early stopping (green), Dropout (blue), and Simmering (pink).
  • The 'Hard' Image Case (Panels a & c): Panel a shows a visually challenging image of a truck. Panel c shows how the models' confidence changes. The 'Early stopping' method remains extremely confident, with nearly 100% of its votes for 'truck', suggesting it is overconfident. The 'Dropout' method correctly predicts 'truck' but with low confidence (around 30% of votes), with its remaining votes spread thinly across many other incorrect categories. The 'Simmering' method also correctly predicts 'truck' but with a moderate confidence of about 45%. Notably, its second-highest vote (~20%) is for 'automobile', a visually and conceptually similar vehicle, suggesting its uncertainty is more nuanced and reasonable.
Scientific Validity
  • ✅ Innovative qualitative evaluation: The experimental design, which uses two images of varying difficulty to probe the nature of model confidence, is a clever and effective way to demonstrate a qualitative advantage of the proposed method. It moves beyond simple accuracy metrics to assess the reasonableness of the models' uncertainty, which is a critical aspect of trustworthy AI.
  • ✅ Strong support for nuanced confidence claims: The results compellingly support the central claim that simmering's confidence reflects image difficulty. The contrast between simmering's scaled confidence on the hard image versus early stopping's overconfidence and dropout's diffuse uncertainty provides strong evidence for the superiority of the simmering method in this context.
  • 💡 Evidence is anecdotal: The conclusion is drawn from a case study of only two images (N=2). While highly illustrative, this is anecdotal evidence. To make the claim more robust, it should be supported by a quantitative analysis across a larger dataset, demonstrating a statistical correlation between a formal image difficulty metric (like the mentioned Shannon entropy) and the prediction confidence (e.g., entropy of the prediction vector) for each method.
  • 💡 Objective difficulty metrics are not reported: The caption and reference text state the images were chosen based on 'pixel-intensity standard deviation and Shannon entropy'. However, these numerical values are not provided in the figure or caption. Reporting these metrics would objectify the 'easy' vs. 'hard' distinction and strengthen the methodological rigor.
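For reference, the two difficulty metrics named in the caption, and the prediction-entropy measure suggested above, are straightforward to compute; a minimal sketch assuming greyscale images stored as numpy arrays of 8-bit intensities (names are illustrative):

```python
import numpy as np

def pixel_std(image: np.ndarray) -> float:
    """Standard deviation of pixel intensities (higher ~ more contrast)."""
    return float(image.std())

def shannon_entropy(image: np.ndarray, n_bins: int = 256) -> float:
    """Shannon entropy (bits) of the pixel-intensity histogram."""
    counts, _ = np.histogram(image, bins=n_bins, range=(0, 255))
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins; 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def prediction_entropy(vote_proportions: np.ndarray) -> float:
    """Entropy of an ensemble's vote distribution (higher ~ less confident)."""
    p = vote_proportions[vote_proportions > 0]
    return float(-(p * np.log2(p)).sum())
```

Correlating `shannon_entropy` of each test image against `prediction_entropy` of each method's vote distribution over the full test set would turn the two-image case study into the quantitative analysis suggested above.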
Communication
  • ✅ Excellent figure organization: The layout is extremely effective. Placing each image directly above its corresponding bar chart creates a clear and intuitive link between the input (the image) and the output (the model's prediction confidence), allowing the reader to grasp the figure's message almost instantly.
  • ✅ Clear and consistent visual encoding: The use of color to represent the three different methods is consistent across both bar charts, making comparisons straightforward. The full list of class labels on the x-axis provides important context for interpreting the distribution of votes.
  • 💡 Missing legend: The legend defining the colors for the three methods is present in panel d but absent in panel c. For clarity and completeness, the legend should be included for panel c as well, or a single, figure-wide legend should be used.
  • 💡 Bar labels would enhance precision: The bar charts rely on the y-axis for interpretation of vote proportions. To improve readability and precision, consider adding the exact percentage value on top of or within each significant bar. This would make the quantitative differences between the methods' confidence levels more immediately apparent.


Non-Text Elements

Supplementary Figure 4. Simmering (panel b), an implementation of sufficient...
Full Caption

Supplementary Figure 4. Simmering (panel b), an implementation of sufficient training, produces a distinct training trajectory (black curve, panel b) from the Adam optimizer (black curve, panel a). The Adam optimizer (black curve with arrow indicating direction of traversal, panel a) traverses the training loss landscape (coloured contours in panel a) perpendicular to the loss contours, as expected in gradient-based optimization. In contrast, simmering travels along equivalent-loss (coloured contours in panel b) directions in parameter space, aligning with the information-geometric interpretation of sufficient training presented in Main Methods. The Adam trajectory (panel a) corresponds to the entire Adam training phase of Main Fig. 1a (beginning and end marked by the green and red markers respectively, panel a), and the simmering trajectory is a 100-epoch slice (endpoints indicated by green markers, panel b) of the simmering sampling (after the target temperature is reached) shown in Fig. 1a. The loss landscape and training trajectories have been projected onto the principal components (PCs) of the Adam and simmering training trajectories in panels a and b respectively.

Figure/Table Image (Page 29)
First Reference in Text
Since this PC points along the direction of equivalent loss in parameter space, Supplementary Fig. 4b shows that sufficient training automatically identifies and samples along sloppy (equivalent-loss) modes in model parameter space as described in Main Methods.
Description
  • Visualizing the Training Process: This figure visualizes how two different machine learning training methods search for the best set of model parameters. Imagine a map where the altitude represents the model's error, or 'loss'; this is called a 'loss landscape'. The goal of training is to find the lowest point. The colored lines are 'loss contours,' like on a topographic map, connecting points of equal error. The black line is the 'training trajectory,' showing the path the model's parameters take during training.
  • Panel a: The Adam Optimizer's Path: This panel shows the path taken by the standard 'Adam' optimizer. It starts at the green marker and follows a direct path toward the center of the contours (the lowest error point), ending at the red marker. This path is like a ball rolling down the steepest part of a hill, always moving perpendicular to the contour lines. The landscape is shown in its two most important directions of variation, called Principal Components (PCs), which simplifies the complex, high-dimensional space for visualization.
  • Panel b: The Simmering Method's Path: This panel shows the path of the 'Simmering' method. The landscape here is very different, with long, flat valleys where the error is almost the same. These are called 'sloppy modes' or 'equivalent-loss' directions. Instead of heading for a single lowest point, the simmering trajectory (black line) moves back and forth along the bottom of this valley, between the two green markers. It's not trying to find one perfect solution, but rather exploring a whole family of solutions that are all 'good enough'.
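The projection described here, reducing a trajectory of flattened parameter vectors to its two leading principal components, can be sketched as follows (a minimal illustration using numpy's SVD, not the authors' code):

```python
import numpy as np

def project_trajectory(weights: np.ndarray, n_components: int = 2):
    """Project a training trajectory onto its leading principal components.

    weights: array of shape (n_steps, n_params), one flattened parameter
    vector per saved training step.
    Returns (coords, components): coords has shape (n_steps, n_components)
    and gives the 2-D path to plot; components has shape
    (n_components, n_params) and spans the plotting plane.
    """
    centered = weights - weights.mean(axis=0)
    # Rows of vt are the principal directions of the centered trajectory.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    coords = centered @ components.T
    return coords, components
```

The loss contours in each panel would then be obtained by evaluating the loss on a grid of points `mean + a * components[0] + b * components[1]` in this plane.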
Scientific Validity
  • ✅ Powerful visualization of a core mechanism: The figure provides a compelling and intuitive visualization of the fundamental mechanistic difference between standard gradient-based optimization (Adam) and the proposed sufficient training method (Simmering). It successfully translates the abstract concept of sampling along 'sloppy modes' in a high-dimensional parameter space into a clear and interpretable 2D representation.
  • ✅ Appropriate use of dimensionality reduction: Projecting the high-dimensional trajectories and loss landscapes onto their principal components is a standard and appropriate technique for this type of visualization. It allows for a meaningful representation of the dominant modes of variation during training, which is the central focus of the figure.
  • 💡 Comparison is between different subspaces and training stages: The caption notes that the projections are onto the PCs of the respective trajectories, and that the Adam trajectory is the full optimization while the simmering one is a post-convergence slice. This means the figure compares two different views of two different processes, not two processes on the same fixed landscape. While this effectively highlights their distinct behaviors (optimization vs. sampling), it is not a direct, apples-to-apples comparison of their paths on an identical problem space. This nuance is crucial for correct interpretation and should be emphasized more clearly.
  • 💡 The claim of automatic identification is not fully demonstrated: The reference text claims the method 'automatically identifies' these sloppy modes. The figure shows the model traveling along a sloppy mode, but it does not explicitly demonstrate the identification process itself. For example, it doesn't show an initial search phase that then settles into this mode. The evidence supports the 'sampling along' part of the claim more strongly than the 'automatic identification' part.
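The dimensionality-reduction step discussed in these bullets can be illustrated with a minimal sketch. The helper below is hypothetical (not code from the paper): it projects a training trajectory, stored as one parameter vector per step, onto that trajectory's own first two principal components via SVD, which is the standard way to produce plots like panels a and b.

```python
import numpy as np

def project_trajectory(trajectory):
    """Project a parameter trajectory onto its first two principal components.

    trajectory: array of shape (n_steps, n_params), one row per training step.
    Returns the (n_steps, 2) projected path. Hypothetical helper for illustration.
    """
    centered = trajectory - trajectory.mean(axis=0)
    # SVD of the centered trajectory; rows of vt are the principal directions,
    # ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy example: a noisy path that mostly moves along one dominant direction
rng = np.random.default_rng(0)
steps = np.cumsum(rng.normal(size=(200, 1)), axis=0)       # 1-D random walk
traj = steps @ rng.normal(size=(1, 50)) + 0.01 * rng.normal(size=(200, 50))
path_2d = project_trajectory(traj)
print(path_2d.shape)  # (200, 2)
```

Because each trajectory is projected onto its own principal components, two such plots live in different subspaces, which is exactly the apples-to-oranges caveat about comparing the Adam and simmering panels.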
Communication
  • ✅ Highly effective side-by-side contrast: The juxtaposition of the two panels creates a stark and immediate visual contrast between the goal-oriented trajectory of Adam and the exploratory trajectory of Simmering. This design choice makes the core message of the figure easy to grasp at a glance.
  • ✅ Clear depiction of trajectory relative to contours: The visualization effectively illustrates the key geometric relationships: Adam's trajectory is perpendicular to its loss contours, while Simmering's is parallel to its contours. This is the central visual argument of the figure, and it is communicated successfully.
  • 💡 Disparate axis scales: The y-axis scales are vastly different between panel a (-2 to 2) and panel b (-0.08 to 0.08). While the qualitative shape of the trajectories is the main point, such a large difference in scale can make it difficult to intuitively compare the magnitude of parameter changes. A note in the caption explaining why the scales differ (e.g., due to the nature of the respective PCs) would be beneficial.
  • 💡 Lack of quantitative context for contours: The contour plots lack a color bar or labels indicating the loss values they represent. This makes it impossible to quantitatively compare the landscapes. For example, it is unclear if the range of loss values in the 'valley' of panel b is comparable to the loss values near the minimum in panel a. Adding a color bar would add valuable context.
Supplementary Table 1. Symbols used to describe simmering and its...
Full Caption

Supplementary Table 1. Symbols used to describe simmering and its implementation in Methods and Supplementary Methods, and their analogues in machine learning (ML) contexts.

Figure/Table Image (Page 30)
First Reference in Text
Supplementary Table 1 describes the notation used in Main Methods and Supplementary Methods.
Description
  • Table Purpose and Structure: This table acts as a glossary or a 'Rosetta Stone' to help readers understand the mathematical notation used in the paper. It is organized into three columns: 'Symbol' (the mathematical notation), 'Physical Terminology' (its meaning in the language of statistical physics), and 'Equivalent ML Terminology' (its meaning or closest equivalent in the language of machine learning). The table's purpose is to bridge the conceptual gap between these two fields, as the paper's 'simmering' method is derived from physics concepts.
  • Key Conceptual Translations: The table translates fundamental concepts. For instance, the symbol 'T', which means 'Temperature' in physics, is shown to be analogous to a 'Temperature' or regularization parameter in machine learning, a term that controls the trade-off between model complexity and fitting the data. Similarly, 'x', the vector of particle positions in physics, is directly equated with 'θ', the set of trainable parameters (weights and biases) in a neural network. The 'Training loss', L(x, D), is shown to be a shared concept in both fields.
  • Algorithm-Specific Notation: The table also defines more complex, algorithm-specific terms. For example, symbols like 'sk' and 'vsk' are described as the position and velocity of 'virtual thermostat particles' in the physics framework. Their machine learning equivalents are given as 'auxiliary thermostat states', which are extra variables used by the algorithm to control the training dynamics in a way that mimics a physical heat bath. This translation is crucial for understanding the mechanics of the simmering algorithm.
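The table's central analogy can be stated as a formula. Constant-temperature dynamics of the kind simmering uses are designed to sample parameters from a Boltzmann-type distribution over the training loss, which in ML language is a tempered posterior; this is a standard relation sketched here for orientation, not an entry reproduced from the table:

```latex
p(\theta \mid \mathcal{D}) \;\propto\; \exp\!\left(-\frac{L(\theta, \mathcal{D})}{T}\right)
```

In the table's notation, the particle positions x play the role of θ. As T → 0 the distribution concentrates on loss minimizers, recovering ordinary optimization; at T > 0 it spreads probability over the ensemble of 'good enough' solutions.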
Scientific Validity
  • ✅ Essential for interdisciplinary clarity: This table is a critical component for an interdisciplinary paper. By explicitly mapping the terminology from statistical physics to machine learning, it makes the novel methodology accessible to researchers from both fields, which is a significant strength and crucial for the paper's impact.
  • ✅ Enhances methodological rigor and reproducibility: A clear and unambiguous definition of all mathematical symbols is fundamental to scientific rigor. This table provides that clarity, ensuring that the algorithm described in the Methods section is well-defined and can be understood and implemented by other researchers, thereby promoting reproducibility.
  • 💡 Analogies may oversimplify some concepts: While the translations are generally excellent, some analogies could be perceived as oversimplifications. For example, equating the physical 'Temperature' (T) with a generic ML 'Temperature' or 'Tempered posterior' might obscure subtle but important differences in their theoretical underpinnings and practical application in different ML contexts (e.g., simulated annealing vs. Bayesian inference). While useful, the direct equivalence should be interpreted with this nuance in mind.
Communication
  • ✅ Excellent structure for a reference tool: The three-column layout is perfectly suited for the table's purpose. It allows a reader to quickly look up a symbol and understand its role from both a physical and a machine learning perspective. This format is highly effective and user-friendly.
  • ✅ Improves overall readability of the paper: By consolidating all notation definitions into a single, comprehensive table, the authors avoid cluttering the main text with repeated explanations. This greatly improves the flow and readability of the Methods section, as readers can easily refer back to this table whenever they encounter an unfamiliar symbol.
  • 💡 Could be improved with structural grouping: The symbols are listed in a generally logical sequence, but the table's scannability could be enhanced by adding subheadings to group related terms. For example, creating sections for 'Core Model Concepts' (e.g., parameters, loss), 'Thermodynamic Analogues' (e.g., temperature, free energy), and 'Algorithm-Specific Variables' (e.g., thermostat parameters) would provide a clearer structure and help readers navigate the content more efficiently.
Algorithm 1 Simmering algorithm for a generic learning problem described by a...
Full Caption

Algorithm 1 Simmering algorithm for a generic learning problem described by a differentiable objective function L which has parameters x, based on the NHC thermostat dynamics integration in Ref.⁷. In this integration scheme, particle positions are integrated over one half-step, i.e., half of the learning rate, at a time, and corresponding velocities are integrated forward by a full time-step at a time. The objective function parameters (neural network parameters) are treated as the positions of N one-dimensional particles which evolve over time according to Supplementary Equations 1-9. In this algorithm description, the first subscript of each variable indexes the parameters (i for the ith parameter, k for the kth even or odd NHC variable), and the second subscript indicates the timestep of the associated variable. Note that since the NHC particles are integrated differently based on their chain position (even or odd), the virtual particle index takes on values k = 1, . . . , NNHC/2. Unless otherwise indicated (e.g., with a particle index of all), operations are performed element-wise for all elements in the given vector(s). If no initialization is specified, assume that a vector's initial value is 0 for all elements.

Figure/Table Image (Page 33)
First Reference in Text
Algorithm 1 describes the implementation of simmering used to generate the results presented in this work.
Description
  • Algorithm's Purpose and Analogy: This is a pseudocode recipe for the 'Simmering' training method. It treats a neural network's parameters (its weights and biases) as if they were physical particles. The goal is not just to find the single best set of parameters, but to explore a range of 'good enough' parameters. It does this by simulating the particles moving in a 'heat bath' at a constant temperature, 'T'. This temperature introduces a controlled jiggle, preventing the parameters from settling into a single, potentially overfit, solution.
  • Key Inputs: The algorithm requires several inputs to run: the target temperature 'T', a learning rate 'Δt' (how big of a step to take at each iteration), the loss function 'L' (which measures the model's error), the initial parameters for the network 'x_all,0', and masses for the real and 'virtual' thermostat particles ('m', 'Q'), which control their inertia.
  • Core Calculation Loop: The main part of the algorithm is a loop that repeats for a set number of iterations. Inside the loop, it performs a series of calculations to update the 'position' (the value of a parameter) and 'velocity' (how fast the parameter is changing) for all the network's parameters. A key step (line 20) calculates the 'force' on each particle, which is derived from the negative gradient of the loss function. This is the same core calculation used in standard machine learning, representing the 'downhill' direction in the error landscape. The algorithm combines this force with influences from the simulated thermostat to update the parameters.
  • The Thermostat Mechanism: The algorithm uses a Nosé-Hoover chain (NHC) thermostat, a sophisticated method from computational physics to maintain a constant temperature. This involves a chain of 'virtual' particles (represented by variables 's' and 'v_s') that exchange energy with the real parameter-particles to add or remove 'heat' as needed, ensuring the system 'simmers' at the desired temperature 'T'.
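The thermostat mechanism described in these bullets can be illustrated with a deliberately simplified toy: a single Nosé-Hoover thermostat (not the full Nosé-Hoover chain, half-step integration, or Supplementary Equations 1-9 of Algorithm 1) applied to a quadratic loss. All names and parameter values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nose_hoover_step(x, v, s_v, grad_loss, T, dt, m=1.0, Q=1.0):
    """One step of simplified single-thermostat Nose-Hoover dynamics.

    x, v : parameter 'positions' and 'velocities'
    s_v  : scalar thermostat velocity coupling the system to the heat bath
    Didactic sketch only; Algorithm 1 uses a chain of thermostats and a
    half-step (Verlet-style) integration scheme instead of this Euler update.
    """
    n = x.size
    force = -grad_loss(x)                   # 'force' = negative loss gradient
    v = v + dt * (force / m - s_v * v)      # thermostat acts as adaptive friction
    x = x + dt * v
    # thermostat speeds up or slows down the system when the total kinetic
    # energy deviates from the equipartition target n*T
    s_v = s_v + dt * (m * np.sum(v**2) - n * T) / Q
    return x, v, s_v

# Toy quadratic loss L(x) = 0.5 * |x|^2; 'simmer' at T = 0.1
rng = np.random.default_rng(1)
x = rng.normal(size=8)
v = np.zeros(8)
s_v = 0.0
kinetic = []
for _ in range(5000):
    x, v, s_v = nose_hoover_step(x, v, s_v, lambda z: z, T=0.1, dt=0.01)
    kinetic.append(np.mean(v**2))
print(float(np.mean(kinetic[1000:])))
```

By equipartition, the per-parameter kinetic energy should fluctuate around T rather than decay to zero, which is what keeps the parameters jiggling instead of settling into a single minimum as in plain gradient descent with momentum.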
Scientific Validity
  • ✅ Promotes reproducibility: Providing a detailed, step-by-step algorithm in pseudocode is a best practice that is essential for scientific reproducibility. This allows other researchers to understand and implement the proposed method precisely as the authors intended.
  • ✅ Grounded in established numerical methods: The algorithm is built upon a solid theoretical foundation, using a Verlet integrator and a Nosé-Hoover chain thermostat. These are standard, well-vetted, and robust numerical techniques from computational physics for simulating molecular dynamics, which lends significant credibility to the method's implementation.
  • 💡 Lacks guidance on hyperparameter selection: The algorithm introduces several new hyperparameters (T, m, Q, NNHC) that are unfamiliar in a typical machine learning context. The pseudocode itself offers no guidance on how they should be set or tuned. The method's practical utility hinges on how sensitive training is to these parameters and how difficult good values are to find, neither of which is addressed here.
  • 💡 Assumes a specific integration scheme without justification: The caption mentions the use of a specific half-step/full-step integration scheme (a type of Verlet integrator). While this is a common choice in physics, the algorithm does not justify why this particular scheme was chosen over others, or discuss its stability and accuracy properties in the context of neural network loss landscapes.
Communication
  • ✅ Clear, modular structure: The algorithm is well-structured, breaking down complex calculations into smaller, named functions (e.g., `POSITION_FORWARD`, `ACCELERATION_S1`). This modularity makes the overall logic much easier to follow and understand compared to a single, monolithic block of equations.
  • ✅ Effective use of comments: The inline comments (marked with ▷) are highly effective. They act as signposts, explaining the purpose of key lines or blocks of code in plain language (e.g., '▷ Use loss gradient here!'), which is invaluable for a reader trying to parse the complex sequence of updates.
  • 💡 High notational density: The reliance on single-letter variables with multiple subscripts (e.g., `v_{s_{2k-1}, t+1/2}`) makes the algorithm mathematically precise but visually dense and difficult to read. While standard in physics, it presents a barrier to entry for a machine learning audience. Consider defining variables more descriptively in the 'Require' section.
  • 💡 Unclear variable scope: The caption notes that operations are element-wise unless specified (e.g., with subscript 'all'). However, in the main loop, this is not always explicit (e.g., line 20). It would improve clarity to consistently use the 'all' subscript or vector notation for operations intended to apply to all parameters, to avoid any ambiguity.