Accurate predictions on small data with a tabular foundation model

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, Frank Hutter
Nature
Machine Learning Lab, University of Freiburg, Freiburg, Germany

Overall Summary

Study Background and Main Findings

This paper addresses the long-standing challenge of applying deep learning effectively to tabular data (data organized in rows and columns, like spreadsheets), a domain historically dominated by traditional methods such as gradient-boosted decision trees (GBDTs). The researchers introduce the Tabular Prior-data Fitted Network (TabPFN), a novel approach positioned as a 'foundation model' for tabular data. Unlike typical models trained on specific datasets, TabPFN utilizes a transformer architecture (a type of neural network successful in language processing) pre-trained entirely on millions of synthetically generated datasets. This pre-training aims to imbue the model with a general understanding of tabular data patterns and structures.

The core methodology relies on In-Context Learning (ICL). At prediction time, TabPFN processes the entire new dataset (both training examples and points to predict) in a single forward pass. The model uses the provided training examples within its 'context window' to infer the underlying patterns and make predictions, effectively learning the prediction algorithm on-the-fly without needing retraining or parameter updates for each new dataset. This contrasts sharply with traditional methods that require explicit training and often extensive hyperparameter tuning for every new task.
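
The open-source tabpfn Python package exposes this workflow through a scikit-learn-style interface. The sketch below is a minimal, illustrative usage example that assumes this package is installed; class and method names follow its public interface and may change between releases. Here, fit() merely stores and preprocesses the training context, and predict_proba() performs the single in-context forward pass described above.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier  # assumes `pip install tabpfn`

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()           # no per-dataset training loop, no hyperparameter tuning
    clf.fit(X_train, y_train)          # stores the training set as the in-context examples
    proba = clf.predict_proba(X_test)  # one forward pass over context plus query points
    print(proba.shape)                 # (n_test_samples, n_classes)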

The key findings demonstrate that TabPFN achieves state-of-the-art performance on a wide range of benchmark datasets, specifically those with up to 10,000 samples and 500 features. Notably, its default configuration, which requires no tuning, outperforms GBDT ensembles and AutoML frameworks that were tuned for 4 hours, while being dramatically faster (roughly 2.8 seconds of combined fitting and prediction per classification dataset versus the 4-hour tuning budget given to baselines). The paper also showcases TabPFN's versatility beyond prediction, including capabilities for fine-tuning, synthetic data generation, density estimation (useful for identifying unusual data points), and learning reusable data representations (embeddings).

The main conclusion is that TabPFN represents a paradigm shift in tabular data modeling for small-to-medium datasets, offering substantial gains in speed and out-of-the-box accuracy. By leveraging large-scale synthetic data pre-training and ICL, it automates the discovery of effective prediction algorithms. While currently limited in scalability to larger datasets, TabPFN presents a powerful new tool with the potential to accelerate scientific discovery and decision-making in various fields where tabular data is prevalent.

Research Impact and Future Directions

This paper introduces TabPFN, a novel approach to modeling tabular data that represents a significant departure from traditional methods. By leveraging a transformer architecture pre-trained exclusively on millions of synthetically generated datasets, TabPFN effectively learns a general-purpose algorithm for tabular prediction. Its core innovation lies in using In-Context Learning (ICL), allowing the single pre-trained model to make predictions on new, unseen datasets (up to 10,000 samples and 500 features) extremely rapidly—often in seconds—without requiring dataset-specific retraining or extensive hyperparameter tuning.

The primary strength demonstrated is TabPFN's remarkable speed and out-of-the-box performance on small-to-medium datasets, where it consistently outperforms heavily tuned state-of-the-art baselines like gradient-boosted decision trees (GBDTs) and automated machine learning (AutoML) frameworks, achieving comparable or better accuracy with orders-of-magnitude less computation time (e.g., seconds vs. hours). This makes TabPFN a potentially powerful tool for rapid prototyping and analysis in scientific domains where such dataset sizes are common and computational resources or tuning expertise may be limited. The model also exhibits versatility, demonstrating capabilities in data generation, density estimation, and fine-tuning, positioning it as a multi-functional foundation model for tabular data.

However, the current iteration of TabPFN has clear limitations. Its primary constraint is scalability; performance degrades, and computational requirements become prohibitive for datasets significantly larger than 10,000 samples or 500 features. While interpretability is explored using SHAP, claims of learning 'simple' relationships require careful consideration, as visual assessment is subjective and may not fully capture underlying complexity compared to inherently simpler models. Furthermore, the reliance on a synthetic data prior means performance hinges on how well this prior captures the characteristics of real-world data distributions; its effectiveness on highly specialized or out-of-distribution datasets remains an open question. Future work appropriately focuses on addressing these limitations, particularly scaling, handling data drift, and developing more specialized priors.

Critical Analysis and Recommendations

Clear Problem Statement and Solution Introduction (written-content)
The abstract clearly states the problem (deep learning struggles with tabular data compared to GBDTs) and introduces TabPFN as a novel solution (transformer foundation model trained on synthetic data). + This effectively frames the research gap and the proposed contribution for readers, immediately highlighting the paper's context and novelty.
Section: Abstract
Incomplete Scope Definition in Abstract (written-content)
The abstract mentions the sample limit (up to 10,000) but omits the feature limit (up to 500, stated later). + Explicitly stating both limits upfront would provide a more complete picture of the model's intended operational range, helping readers quickly determine its relevance to their specific datasets.
Section: Abstract
Clear Explanation of Core Concepts (ICL/PFN) (written-content)
The introduction clearly explains the core concepts of In-Context Learning (ICL) and Prior-data Fitted Networks (PFNs) as applied to tabular data, differentiating the approach from standard supervised learning. + This clarifies the fundamental mechanism behind TabPFN—learning a general algorithm from synthetic data priors—which is crucial for understanding its novelty and operational paradigm.
Section: Introduction
Elaborate on ICL Mechanism for Algorithm Learning (written-content)
The introduction explains that TabPFN uses ICL trained on synthetic data but could more explicitly state how ICL enables algorithm learning (i.e., inferring the task-solving algorithm from the synthetic prior data within the context window). + A slightly deeper explanation would enhance understanding for readers less familiar with ICL's application beyond NLP, clarifying the link between synthetic data training and the model's ability to generalize algorithms.
Section: Introduction
Strong Quantitative Demonstration of Performance and Speed (graphical-figure)
Figure 4 clearly demonstrates TabPFN's superior performance (higher normalized metrics) and dramatic speed advantage (e.g., 2.8s vs 4h tuning for baselines) on benchmark datasets within its scope. + This provides compelling quantitative evidence for the paper's central claims regarding accuracy and efficiency, highlighting the practical benefit of the approach for rapid analysis.
Section: Results
Rigorous Quantitative Evaluation Protocol (written-content)
The Results section employs a rigorous quantitative evaluation using standard benchmarks (AutoML, OpenML-CTR23), relevant metrics, strong baselines, and a clear protocol (repetitions, splits, tuning budgets). + This methodological rigor lends significant credibility to the performance claims and allows for fair comparison with established methods.
Section: Results
Qualify Subjective Interpretability Claims (written-content)
The paper claims TabPFN learns 'simple, interpretable feature relationships' based on visual SHAP analysis (Extended Data Fig. 3), but visual assessment of simplicity is subjective. + Qualifying this claim by acknowledging the evidence type (visual SHAP) or adding quantitative interpretability metrics would provide a more nuanced and scientifically rigorous assessment of interpretability, crucial for high-stakes applications.
Section: Results
Clarify Scope of TabPFN 'Tuning' (written-content)
The Results compare 'default' and 'tuned' TabPFN without initially clarifying that 'tuning' refers only to inference-time hyperparameters (ensembling, preprocessing), not retraining the core model. + Explicitly stating this distinction upfront would prevent potential misinterpretation of TabPFN's operational paradigm (fixed pre-trained model) and the nature of performance gains from tuning.
Section: Results
Concise Summary and Significance Statement (written-content)
The conclusion concisely summarizes the core contribution (TabPFN leveraging ICL and synthetic data for superior performance/speed on small/medium datasets) and highlights the paradigm shift towards foundation models for tabular data. + This effectively reinforces the main takeaways and significance of the work for the reader.
Section: Conclusion
Link Future Work Explicitly to Limitations (written-content)
Future directions are listed but not explicitly framed as addressing specific, known limitations (e.g., scaling beyond 10k samples). + Directly linking future research goals to overcoming current model boundaries would provide a clearer rationale for why these specific directions are priorities, strengthening the narrative.
Section: Conclusion
Comprehensive and Rigorous Evaluation Protocol (written-content)
The Methods section provides a comprehensive and rigorous evaluation protocol, detailing benchmarks, baselines, metrics, cross-validation, normalization, and tuning procedures. + This high degree of transparency and methodological detail ensures the quantitative comparisons are well-defined and supports the reproducibility of the findings.
Section: Methods
Clear User Guidance and Scope Definition (written-content)
The Methods section offers clear practical guidance on when to use TabPFN (dataset size limits), its limitations (scaling), computational needs, and basic data preparation. + This enhances the usability of the model for potential adopters by setting realistic expectations and outlining practical considerations.
Section: Methods
Justify Choice of SCM Computational Modules (written-content)
The description of the Structural Causal Model (SCM) prior details the components used (NNs, trees, etc.) but doesn't explicitly state why this specific mix was chosen. + Briefly justifying the choice (e.g., to mimic diverse real-world data generation processes) would strengthen the rationale behind the synthetic data design.
Section: Methods

Section Analysis

Abstract

Introduction

Non-Text Elements

Fig. 1| Overview of the proposed method. a, The high-level overview of TabPFN...
Full Caption

Fig. 1| Overview of the proposed method. a, The high-level overview of TabPFN pre-training and usage. b, The TabPFN architecture. We train a model to solve more than 100 million synthetic tasks. Our architecture is an adaptation of the

Figure/Table Image (Page 2)
First Reference in Text
Figures 1 and 2 outline our approach:
Description
  • Two-Stage Workflow: Pre-training and Application: Panel A depicts the overall strategy of the TabPFN method. It shows a two-stage process. First, a neural network model, called TabPFN, is trained using a vast amount of synthetically generated datasets. A synthetic dataset consists of training data (Xtrain, Ytrain) and test data (Xtest, Ytest). The model learns by trying to predict the test target values (Ytest) given the rest of the data, and its internal settings (parameters, denoted by theta) are adjusted based on how well it performs across millions of these artificial tasks. The goal is to minimize a 'training loss', specifically the negative log-likelihood (-log q_theta(Ytest|...)), which measures how surprising the true test values are given the model's predictions. Second, once trained, this single TabPFN model can be directly applied to new, real-world datasets. It takes the training portion (Xtrain, Ytrain) of the real dataset as input context and makes predictions for the unseen test portion (Xtest) in a single step, without needing further parameter adjustments.
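
Written out, the pre-training objective sketched in this panel corresponds to the standard prior-data fitted network loss; the following LaTeX reconstruction is based on the caption's notation and the cited PFN literature, so the exact conditioning notation may differ slightly from the paper:

    \theta^{*} = \arg\min_{\theta}\,
      \mathbb{E}_{(X_{\mathrm{train}},\, y_{\mathrm{train}},\, X_{\mathrm{test}},\, y_{\mathrm{test}}) \sim p(\mathcal{D})}
      \left[ -\log q_{\theta}\!\left( y_{\mathrm{test}} \mid X_{\mathrm{test}},\, X_{\mathrm{train}},\, y_{\mathrm{train}} \right) \right]
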
Scientific Validity
  • Conceptual Soundness: The conceptual split between pre-training on synthetic data and inference on real-world data is a valid and increasingly common approach in machine learning, particularly for foundation models. The use of a prior defined by synthetic data generation is a sound principle rooted in Bayesian inference.
  • Alignment with Methodology: The diagram accurately represents the intended workflow of pre-training a general model and then applying it in-context to specific tasks, which aligns with the principles of In-Context Learning (ICL) and Prior-data Fitted Networks (PFNs) cited in the text.
  • High-Level Representation: The panel illustrates a high-level concept. The scientific validity hinges on the successful implementation and empirical validation detailed later in the paper, particularly the effectiveness of the synthetic data prior and the model's generalization.
Communication
  • Clarity of Workflow Stages: The diagram clearly illustrates the two distinct phases: pre-training on synthetic data and application to real-world data. The use of distinct visual flows for synthetic and real-world data application aids comprehension.
  • Conceptual Representation: The diagram effectively conveys the core concept of using a pre-trained model (TabPFN) for prediction tasks on new datasets without explicit retraining for each new dataset, leveraging the knowledge gained from synthetic data.
  • Lack of Quantitative Detail in Diagram: While illustrating the concept, the diagram lacks specific details about the nature of the synthetic data generation or the scale ('millions of datasets'), which are mentioned elsewhere but not visually quantified here.
Fig. 2| Overview of the TabPFN prior. a, For each dataset, we first sample...
Full Caption

Fig. 2| Overview of the TabPFN prior. a, For each dataset, we first sample high-level hyperparameters. b, Based on these hyperparameters, we construct a structural causal model that encodes the computational function generating the dataset. Each node holds a vector and each edge in the computational graph implements a function according to one of the connection types. In step 1, using random noise variables we generate initialization data, which is fed into the root nodes of the graphs and propagated through the computational graph

Figure/Table Image (Page 3)
First Reference in Text
Figures 1 and 2 outline our approach:
Description
  • Sampling High-Level Dataset Properties: Panel A illustrates the first phase in generating a synthetic dataset according to the TabPFN prior. Before creating the actual data rows and columns, the process starts by sampling several high-level characteristics, referred to as hyperparameters. These include parameters like the total number of data points (rows) the dataset will have, the number of features (columns), the complexity of the underlying structure generating the data (represented by the number of nodes and complexity of a graph), and the specific structure of this graph itself. These initial choices dictate the overall nature and difficulty of the synthetic dataset that will be generated in the subsequent steps.
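
To make this concrete, below is a deliberately simplified, hypothetical Python sketch of such a generative process (sample high-level hyperparameters, build a small random computational graph, propagate noise through it, then read off features and a target). It is illustrative only and is not the paper's actual prior implementation, which uses a richer set of edge functions and post-processing steps.

    import numpy as np

    rng = np.random.default_rng(0)

    # Step 0: sample high-level hyperparameters for one synthetic dataset.
    n_samples  = int(rng.integers(100, 2000))
    n_features = int(rng.integers(2, 20))
    n_nodes    = n_features + int(rng.integers(1, 10))   # size of the computational graph

    # Step 1: build a random DAG (edges only go from lower- to higher-indexed nodes).
    adj = np.triu(rng.random((n_nodes, n_nodes)) < 0.3, k=1)

    # Step 2: propagate random noise through the graph using random edge functions.
    edge_functions = [np.tanh, np.sin, lambda v: v, lambda v: np.maximum(v, 0.0)]
    node_values = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):
        parents = np.where(adj[:, j])[0]
        noise = rng.normal(size=n_samples)
        if len(parents) == 0:
            node_values[:, j] = noise                     # root nodes hold pure noise
        else:
            w = rng.normal(size=len(parents))
            f = edge_functions[rng.integers(len(edge_functions))]
            node_values[:, j] = f(node_values[:, parents] @ w) + 0.1 * noise

    # Step 3: read off some nodes as features and one remaining node as the target.
    feature_idx = rng.choice(n_nodes, size=n_features, replace=False)
    target_idx  = int(rng.choice([j for j in range(n_nodes) if j not in feature_idx]))
    X, y = node_values[:, feature_idx], node_values[:, target_idx]
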
Scientific Validity
  • Standard Practice in Synthetic Data Generation: Sampling high-level hyperparameters to control the characteristics of generated data is a standard and necessary step in procedural content generation and synthetic data creation. It allows for systematic exploration of different data regimes.
  • Ensuring Data Diversity: Controlling parameters like dataset size, feature count, and complexity ensures that the generated synthetic datasets cover a diverse range of scenarios, which is crucial for training a robust and generalizable foundation model like TabPFN.
Communication
  • Clarity of Initial Step: Panel A clearly depicts the initial step of sampling high-level parameters that define the characteristics of the synthetic dataset to be generated. The visual representation using abstract nodes and connections effectively conveys the concept of defining dataset properties before generation.
  • Specificity of Parameters: The specific parameters listed (number of data points, features, nodes, graph complexity) provide concrete examples of the high-level properties being controlled.

Results

Non-Text Elements

Fig. 3 | The behaviour of TabPFN and a set of baselines on simple functions. In...
Full Caption

Fig. 3 | The behaviour of TabPFN and a set of baselines on simple functions. In all plots, we use orange for the ground truth and blue for model predictions. a, Each column represents a different toy function, each having a single feature (along the x-axis) and a target (along the y-axis). TabPFN can model a lot of

Figure/Table Image (Page 4)
First Reference in Text
In Fig. 3a, we compare TabPFN with a diverse set of standard predictors, with all methods using default settings.
Description
  • Comparative Visualization Grid: Panel 3a presents a grid of plots designed to visually compare the behavior of the proposed TabPFN model against three baseline machine learning models (CatBoost, MLP, Linear) and the true underlying function. Each column represents a different simple mathematical relationship ('toy function') between a single input feature (plotted on the x-axis) and a target variable (plotted on the y-axis).
  • Variety of Toy Functions: The functions include smooth non-linear (sin(x) + x), simple non-linear (x^2), non-smooth (|x|), and discontinuous (step function) examples. Additionally, two columns show the step function with added noise: 'homoscedastic noise' where the noise level is constant, and 'heteroscedastic noise' where the noise level varies with the input feature.
  • Models Compared and Visual Representation: Each row in the grid corresponds to either the true function ('True function') or the predictions made by one of the models: TabPFN, CatBoost (a gradient-boosted decision tree model), MLP (Multilayer Perceptron, a standard type of neural network), and Linear (specifically, ridge regression, a linear model with regularization). The orange points/lines represent the true function or the data generated from it, while the blue points/lines show the predictions made by the respective model.
  • Qualitative Assessment Goal: The panel aims to illustrate qualitatively how well each model, using its default settings, captures the different types of relationships presented by these simple functions, highlighting strengths and weaknesses like handling non-linearity, discontinuities, and noise.
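
As an illustration of how such a comparison can be set up, here is a minimal sketch for one of the columns (the step function with heteroscedastic noise), fitting default scikit-learn baselines on the toy data; the TabPFN import is optional and assumes the open-source package is installed.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(300, 1))
    # Step function with heteroscedastic noise: the noise level grows with |x|.
    y = (x[:, 0] > 0).astype(float) + rng.normal(scale=0.05 + 0.2 * np.abs(x[:, 0]))

    models = {
        "linear (ridge)": Ridge(),
        "mlp": MLPRegressor(max_iter=2000, random_state=0),
    }
    try:  # optional: the pre-trained TabPFN regressor, if the package is installed
        from tabpfn import TabPFNRegressor
        models["tabpfn"] = TabPFNRegressor()
    except ImportError:
        pass

    x_grid = np.linspace(-1.0, 1.0, 500).reshape(-1, 1)
    predictions = {}
    for name, model in models.items():
        model.fit(x, y)                        # default settings, as in the figure
        predictions[name] = model.predict(x_grid)
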
Scientific Validity
  • Use of Toy Functions: Using simple, low-dimensional toy functions is a standard and valid approach for gaining qualitative insights into the inductive biases and failure modes of different machine learning models. It allows for easy visualization of the learned function.
  • Choice of Baselines: The chosen baseline models (Linear, MLP, CatBoost) represent distinct and relevant classes of algorithms commonly used for tabular data (linear, neural network, tree-based ensemble), providing a reasonable spectrum for comparison.
  • Use of Default Settings: Evaluating models with their default settings provides a baseline understanding of out-of-the-box performance, which is relevant for many users. However, it's acknowledged that performance could potentially change with tuning (as explored elsewhere in the paper).
  • Diversity of Function Characteristics: The inclusion of functions with different characteristics (smooth, non-smooth, discontinuous, noisy) allows for probing different aspects of model flexibility and robustness, which is scientifically valuable.
  • Limited Generalizability from Toy Examples: While illustrative, performance on these simple 1D functions may not directly translate to performance on complex, high-dimensional real-world tabular datasets. This panel serves as a qualitative illustration rather than a rigorous benchmark.
Communication
  • Grid Layout for Comparison: The grid layout effectively allows for direct visual comparison of different models (rows) across various simple functions (columns). This facilitates understanding the qualitative differences in model behavior.
  • Color Coding and Ground Truth: The consistent use of orange for ground truth and blue for model predictions is clear and aids quick interpretation. The inclusion of the ground truth in each plot provides a necessary reference.
  • Clarity of Function Labels: While the functions are simple, clear titles for each column (e.g., 'Sine + Linear', 'Quadratic', 'Absolute Value', 'Step Function', 'Homoscedastic Noise', 'Heteroscedastic Noise') would improve immediate readability over just mathematical notation or brief descriptions.
  • Clarity of Model Labels: Labeling the rows clearly with the model names (True function, TabPFN, CatBoost, MLP, Linear) is effective.
  • Effectiveness in Showing Model Differences: The visualization successfully highlights the distinct fitting characteristics, such as the linear model's inability to capture non-linearity, the MLP's struggle with the step function, CatBoost's piecewise constant nature, and TabPFN's flexibility.
Fig. 4| Comparison of TabPFN on our test benchmarks, containing datasets with...
Full Caption

Fig. 4 | Comparison of TabPFN on our test benchmarks, containing datasets with up to 10,000 samples and 500 features. Performance was normalized per dataset before aggregation using all baselines; intervals represent the 95% confidence interval. Wilcoxon P refers to the two-sided Wilcoxon signed-rank test P value (ref. 54). a, Average performance of the default as well as the tuned versions of TabPFN and our baselines. All methods are tuned for ROC AUC or RMSE, respectively, thus decreasing the representativeness of the secondary metrics.

Figure/Table Image (Page 5)
First Reference in Text
Figure 4a demonstrates the strong out-of-the-box performance of TabPFN compared with tuned and default configurations of XGBoost, CatBoost and a random forest.
Description
  • Benchmark Performance Summary: Panel 4a presents aggregated performance results for the proposed TabPFN model compared against several standard machine learning algorithms on benchmark datasets. The datasets used contain up to 10,000 data samples (rows) and 500 features (columns).
  • Task Separation (Classification/Regression): The panel is divided into two main sections: 'Classification' tasks (top row) and 'Regression' tasks (bottom row). Within each section, two primary metrics are shown.
  • Evaluation Metrics: For Classification, the metrics are 'Normalized ROC AUC' (Area Under the Receiver Operating Characteristic Curve, a measure of a model's ability to distinguish between classes, where higher is better) and 'Normalized accuracy' (the proportion of correct predictions). For Regression, the metrics are 'Normalized negative RMSE' (Root Mean Squared Error, a measure of prediction error magnitude, made negative so higher is better) and 'Normalized R2' (Coefficient of Determination, representing the proportion of variance explained by the model, higher is better).
  • Performance Normalization: Performance scores are 'normalized' per dataset before averaging. This means that, for each dataset, the scores of all compared methods are scaled so the best-performing method gets a score of 1.0 and the worst gets 0.0. The bars show the average of these normalized scores across all datasets in the benchmark. A code sketch of this normalization follows this list.
  • Algorithm Comparison (Default vs. Tuned): Each metric plot compares multiple algorithms (TabPFN, XGBoost, CatBoost, LightGBM, Random Forest, MLP, SVM, Linear models - identified by abbreviations). For each algorithm, two bars are often shown: 'Default' (using standard, untuned settings) and 'Tuned (4 h)' (after optimizing algorithm settings for 4 hours).
  • Confidence Intervals: Error bars on each bar represent the 95% confidence interval, indicating the statistical uncertainty in the average normalized performance.
  • Magnified Comparison Insets: Inset plots labeled 'Magnification' provide a zoomed-in view comparing TabPFN against the strongest baseline methods (like CatBoost, XGBoost, Random Forest) for each metric.
  • Key Visual Finding: Visually, the plots suggest that TabPFN (especially the default version) achieves higher average normalized scores compared to the baseline methods across most metrics shown.
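
A minimal sketch of the per-dataset normalization and aggregation described above (illustrative only; the paper's exact aggregation and confidence-interval procedure may differ):

    import numpy as np

    def normalize_per_dataset(scores):
        """scores: array of shape (n_datasets, n_methods), higher = better.
        Rescales each row so the best method gets 1.0 and the worst 0.0."""
        lo = scores.min(axis=1, keepdims=True)
        hi = scores.max(axis=1, keepdims=True)
        return (scores - lo) / np.where(hi > lo, hi - lo, 1.0)

    # Example with made-up ROC AUC values for 3 methods on 4 datasets.
    auc = np.array([[0.91, 0.88, 0.85],
                    [0.75, 0.74, 0.70],
                    [0.99, 0.97, 0.98],
                    [0.66, 0.69, 0.60]])
    mean_normalized = normalize_per_dataset(auc).mean(axis=0)  # one score per method
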
Scientific Validity
  • Benchmark Selection: The use of established benchmark datasets (AutoML Benchmark, OpenML-CTR23, detailed later) provides a solid foundation for comparison, assuming these benchmarks are relevant and diverse.
  • Choice of Evaluation Metrics: Evaluating across multiple relevant metrics (ROC AUC, Accuracy, R2, RMSE) provides a more comprehensive picture of performance than relying on a single metric.
  • Default vs. Tuned Comparison: Comparing both default and tuned performance is valuable. Default performance reflects ease-of-use, while tuned performance indicates potential capability given optimization effort. The 4-hour tuning budget is a practical constraint.
  • Normalization Method: Normalization allows aggregation across diverse datasets but obscures absolute performance differences and makes results dependent on the specific set of methods included in the normalization pool for each dataset.
  • Baseline Selection: The inclusion of state-of-the-art baselines, particularly gradient-boosted trees (XGBoost, CatBoost, LightGBM) which are known strong performers on tabular data, makes the comparison rigorous.
  • Statistical Rigor (Confidence Intervals): Reporting 95% confidence intervals based on multiple repetitions (10 runs, detailed later) is good practice for assessing the statistical significance and robustness of the observed performance differences.
  • Tuning Objective vs. Reported Metrics: The caption notes that tuning is performed for the primary metric (ROC AUC or RMSE), which may decrease the representativeness of secondary metrics (Accuracy or R2). This is an important caveat regarding the tuned results for secondary metrics.
  • Dataset Size Limitation: The focus on datasets up to 10,000 samples is a specific scope, and conclusions may not directly extend to much larger datasets without further evidence.
Communication
  • Clarity of Bar Chart Representation: The use of bar charts with error bars (95% CIs) is a standard and clear way to present aggregated performance metrics across multiple datasets.
  • Organization of Comparisons: Separating results for Classification and Regression tasks, and further splitting by 'Default' vs 'Tuned (4 h)' settings, aids in structured comparison.
  • Usefulness of Magnification Insets: The 'Magnification' inset plots are a useful addition, allowing for a clearer visual comparison between the top-performing methods (TabPFN and strong baselines like CatBoost/XGBoost) where differences might be small on the main chart's scale.
  • Visual Distinction of Models: Consistent color coding or distinct patterns for each algorithm across the different plots would enhance visual tracking, although the current labeling is adequate.
  • Clarity of Normalization Concept: The concept of 'Normalized' performance is crucial but requires careful reading of the caption/text to fully grasp that 1.0 is the best relative performance among the evaluated methods on a given dataset, not an absolute score. This could be potentially misinterpreted if the caption isn't read closely.
  • Labeling and Abbreviations: Axis labels are clear (e.g., 'Normalized ROC AUC', 'Normalized negative RMSE'). Abbreviations for models are defined in the caption legend.
Fig. 5 | Robustness across datasets and performance comparison with tuned...
Full Caption

Fig. 5 | Robustness across datasets and performance comparison with tuned ensembles. a, A comparison of modified datasets. We can see that TabPFN is not more vulnerable to the modifications compared with baselines. We also see that TabPFN reproduces the accuracy of CatBoost (default) with only half the training samples provided. Here we normalize scores per dataset (sharing one normalization across all modifications of one experiment) to avoid negative outliers. b, We split the test datasets by data characteristics and

Figure/Table Image (Page 6)
First Reference in Text
In Fig. 5a,b, we show the robustness of TabPFN to dataset characteristics that are traditionally hard to handle for neural-network-based approaches (refs. 14, 23).
Description
  • Robustness Experiments Overview: Panel 5a investigates the robustness of TabPFN (default version) compared with other default baseline models (CatBoost, MLP, Linear) when faced with common data-quality issues or reductions. It presents results from four types of experiments; a code sketch of these modifications follows this list.
  • Uninformative Features Test: The first experiment ('Uninformative features') adds features containing random noise (0%, 50%, 90% of total features) to the datasets to see how models handle irrelevant information.
  • Outlier Robustness Test: The second experiment ('Outlier factor') introduces outliers by multiplying a small fraction of data cells (2%, as stated in the text) by a random 'outlier factor' (ranging from 1 to 100) to test sensitivity to extreme values.
  • Reduced Sample Size Test: The third experiment ('Dropping samples') randomly removes a portion of the training samples (keeping 100%, 50%, or 25%) to assess performance with reduced data quantity.
  • Reduced Feature Set Test: The fourth experiment ('Dropping features') randomly removes a portion of the input features (keeping 100%, 50%, or 25%) to test robustness to missing input variables.
  • Performance Metric and Normalization: Performance is measured using a 'Normalized average performance' score, which combines normalized ROC AUC (for classification) and normalized negative RMSE (for regression) across the benchmark datasets. Scores are normalized within each experiment type (e.g., all results for 'Dropping samples' share one normalization) to focus on relative performance changes due to the modification. Higher bars indicate better relative performance.
  • Key Visual Findings: Visually, the results suggest that TabPFN's performance degradation under these modifications is often comparable to or less severe than that of the baseline methods, particularly CatBoost. For instance, with only 50% of samples, TabPFN maintains performance similar to CatBoost using 100% of samples.
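
An illustrative sketch of these dataset modifications, using hypothetical helper functions; the paper's exact corruption procedure (for example, which cells receive outliers) may differ in detail:

    import numpy as np

    rng = np.random.default_rng(0)

    def add_uninformative_features(X, fraction):
        """Append pure-noise columns until they make up `fraction` of all features."""
        n_new = int(round(fraction / (1.0 - fraction) * X.shape[1]))
        return np.hstack([X, rng.normal(size=(X.shape[0], n_new))])

    def inject_outliers(X, cell_fraction=0.02, max_factor=100.0):
        """Multiply a random subset of cells by a factor drawn from [1, max_factor]."""
        X = X.astype(float).copy()
        mask = rng.random(X.shape) < cell_fraction
        X[mask] *= rng.uniform(1.0, max_factor, size=int(mask.sum()))
        return X

    def subsample_rows(X, y, fraction_kept):
        """Randomly keep only a fraction of the training samples."""
        idx = rng.choice(len(X), size=int(fraction_kept * len(X)), replace=False)
        return X[idx], y[idx]

    def drop_features(X, fraction_kept):
        """Randomly keep only a fraction of the input features."""
        idx = rng.choice(X.shape[1], size=max(1, int(fraction_kept * X.shape[1])), replace=False)
        return X[:, idx]
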
Scientific Validity
  • Relevance of Robustness Tests: Testing robustness against uninformative features, outliers, and reduced data (samples/features) are standard and important evaluations for machine learning models, reflecting realistic data challenges.
  • Use of Default Configurations: Comparing default model configurations isolates the inherent robustness properties without the confounding factor of hyperparameter tuning specifically for corrupted data.
  • Normalization Strategy: The normalization approach described (per experiment type) is reasonable for comparing relative performance drops due to specific modifications, although combining normalized metrics from different task types (classification/regression) warrants caution in interpretation.
  • Baseline Selection for Robustness: The choice of baseline models (CatBoost, MLP, Linear) provides relevant comparisons, especially including MLP which is often considered sensitive to outliers and irrelevant features.
  • Choice of Modification Levels: The specific levels chosen for modifications (e.g., % dropped, outlier factor range) seem reasonable for illustrating trends, though the impact might vary with different levels or types of corruption.
  • Support for Claims: The conclusions drawn (e.g., TabPFN not being more vulnerable, performing well with half the samples) appear supported by the visual evidence presented in the bars, subject to the statistical uncertainty (not explicitly shown with error bars here, unlike Fig 4a).
Communication
  • Clarity of Grouped Bar Charts: The use of grouped bar charts clearly presents the performance under different data modifications side-by-side for each modification type.
  • Clarity of Modification Labels: Labeling the x-axis with the specific modification and its level (e.g., 'Uninformative features Fraction (%)', 'Outlier factor', 'Dropping samples Fraction kept (%)', 'Dropping features Fraction kept (%)') is clear and informative.
  • Clarity of Performance Metric Label: The y-axis label 'Normalized average performance (ROC AUC and negative RMSE)' clearly indicates the combined metric being presented, although averaging normalized scores from different metric types (classification AUC and regression RMSE) requires careful interpretation.
  • Model Identification: The legend implicitly identifies the models by their position/color within each group, which is understandable but could be made more explicit with a direct legend.
  • Effectiveness in Showing Robustness Trends: The panel effectively conveys the relative robustness of the models to these specific data corruptions, showing how performance degrades (or doesn't) as the corruption level increases.
Fig. 6 | Showcase of the application of TabPFN as tabular foundation model. a,...
Full Caption

Fig. 6 | Showcase of the application of TabPFN as tabular foundation model. a, b, On the German Credit Dataset, we perform data density estimation (a) and generation of new synthetic samples (b). c, We show our learned embeddings are useful representations of each sample on the handwritten digits dataset

Figure/Table Image (Page 7)
First Reference in Text
Figure 6d shows an example fine-tuning result.
Description
  • Fine-tuning Concept Demonstration: Panel 6d demonstrates the fine-tuning capability of TabPFN using a specific example involving sine wave data. Fine-tuning is a process where a pre-trained model is further trained on a smaller, specific dataset to adapt its knowledge to that particular task.
  • Fine-tuning Dataset Example: The top plot ('Fine-tuning data') shows a dataset generated from a sine wave function (y = sin(x) + offset). The blue dots represent the limited training samples provided for the fine-tuning process, while the orange line shows the underlying true sine curve for this specific dataset.
  • Prediction Before Fine-tuning: The middle plot ('Default TabPFN predictions') shows the predictions (blue line) made by the standard, pre-trained TabPFN model on this sine wave dataset before any fine-tuning. The orange line again represents the true curve for this dataset.
  • Prediction After Fine-tuning: The bottom plot ('Finetuned TabPFN predictions') shows the predictions (blue line) made by the TabPFN model after it has been fine-tuned on the specific sine wave dataset shown in the top plot. The orange line is the true curve.
  • Visual Comparison of Pre- vs. Post-Fine-tuning: By comparing the middle and bottom plots, the figure illustrates that fine-tuning allows the model to make more accurate predictions that better match the specific sine curve of the target dataset, compared to the predictions from the general-purpose default model.
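
A heavily simplified, hypothetical sketch of what such a fine-tuning loop could look like, assuming access to the pre-trained model as a differentiable PyTorch module with an in-context prediction interface; the model call, loss_fn, and sample_finetuning_batch below are placeholders, not the paper's actual fine-tuning API:

    import torch

    # Hypothetical interface: `model(X_tr, y_tr, X_te)` returns in-context predictions
    # for X_te; `loss_fn` compares them with y_te; `sample_finetuning_batch()` yields
    # one small dataset per step. None of these names come from the paper's code.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    for step in range(200):
        X_tr, y_tr, X_te, y_te = sample_finetuning_batch()
        preds = model(X_tr, y_tr, X_te)   # single in-context forward pass
        loss = loss_fn(preds, y_te)
        optimizer.zero_grad()
        loss.backward()                   # nudge the pre-trained weights
        optimizer.step()
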
Scientific Validity
  • Validity of Fine-tuning Approach: Fine-tuning is a standard and scientifically valid technique for adapting pre-trained neural network models (like transformers) to specific downstream tasks or data distributions. Demonstrating this capability is relevant for foundation models.
  • Experimental Design for Demonstration: Using related but distinct tasks (sine curves with different offsets, as detailed in Extended Data Fig. 4) is a suitable way to demonstrate knowledge transfer and adaptation through fine-tuning in a controlled setting.
  • Support for Fine-tuning Claim: The visual results presented, showing improved alignment of predictions with the target sine curve after fine-tuning, provide qualitative evidence supporting the claim that TabPFN can be successfully fine-tuned.
  • Proof-of-Concept Nature: This panel serves as a proof-of-concept illustration. The effectiveness and generalizability of fine-tuning across diverse real-world tabular tasks would require more extensive empirical evaluation (partially addressed by Extended Data Fig. 4).
  • Comparison to Traditional Methods: The ability to fine-tune distinguishes neural network-based models like TabPFN from traditional tree-based methods, which typically lack this capability, highlighting a potential advantage.
Communication
  • Clarity of Comparison Layout: The three-plot layout (Fine-tuning data, Default predictions, Finetuned predictions) provides a clear visual comparison of the model's behavior before and after fine-tuning.
  • Choice of Simple Visualization Task: Using a simple sine curve allows for easy visualization of the prediction accuracy and the shift learned during fine-tuning.
  • Distinction of Training Data and Ground Truth: Clearly distinguishing between training samples (dots) and the ground truth curve (line) in the top plot helps understand the fine-tuning setup.
  • Effectiveness in Showing Fine-tuning Impact: The visual difference between the default predictions (middle plot) and the finetuned predictions (bottom plot), showing the latter aligning better with the new ground truth, effectively communicates the positive impact of fine-tuning.
  • Labeling and Titles: Axis labels ('x', 'y') are minimal but sufficient for this illustrative example. Titles clearly label each plot's content.
Extended Data Fig. 1 | Performance comparison across additional dataset...
Full Caption

Extended Data Fig. 1 | Performance comparison across additional dataset characteristics, extending Fig. 5. This figure shows the relative performance of different methods when datasets are split based on specific attributes. Error bars represent 95% confidence intervals. While performance differences are

Figure/Table Image (Page 14)
First Reference in Text
Further ablations in Extended Data Fig. 1.
Description
  • Purpose and Models Compared: Extended Data Figure 1 presents a performance comparison of TabPFN (default) against default configurations of CatBoost, MLP, and Linear models, analyzing how performance varies across different types of datasets.
  • Dataset Characteristics Analyzed: The comparison is based on splitting the benchmark datasets into subgroups according to specific characteristics: number of target classes (binary vs. multiclass), class balance (balanced vs. imbalanced), the ratio of features to samples (two bins: 0.09%-0.61% and 2.13%-10.80%), and the presence of significant outliers (>10 standard deviations) in the target variable for regression tasks.
  • Performance Metric: Performance is reported as the 'Normalized Average Performance', which combines normalized ROC AUC scores from classification tasks and normalized negative RMSE scores from regression tasks. Normalization is done relative to the performance of all methods within each subgroup. Higher values indicate better relative performance.
  • Visualization Format: Each panel displays grouped bar charts, where each group corresponds to a specific model, and the bars within the group show performance on the different splits defined by the characteristic (e.g., '2 classes' vs '3+ classes'). Error bars indicate the 95% confidence intervals for the average normalized performance.
  • Key Observation: The figure aims to show that TabPFN's relative performance is generally consistent across these different dataset types, suggesting robustness to these characteristics, although minor variations exist.
Scientific Validity
  • Validity of Subgroup Analysis: Analyzing performance across subgroups based on dataset characteristics (like class balance, feature ratio, outliers) is a valid and informative way to understand potential strengths or weaknesses of different algorithms and assess robustness.
  • Relevance of Chosen Characteristics: The chosen characteristics (class number/balance, feature ratio, outliers) are relevant properties known to influence the performance of machine learning models, particularly for tabular data.
  • Use of Default Models: Using default configurations for all models ensures a fair comparison of out-of-the-box behavior across these different conditions.
  • Normalization and Metric Combination: The normalization approach allows comparison across diverse datasets and metrics, but averaging normalized scores from different metric types (AUC, RMSE) might obscure nuances and should be interpreted with caution regarding the combined scale.
  • Statistical Rigor: Reporting 95% confidence intervals adds statistical rigor, allowing assessment of the significance of observed performance differences between models or subgroups.
  • Consistency of Findings: The observation that performance differences are generally subtle across splits, as mentioned in the caption, seems consistent with the visual representation, suggesting TabPFN does not exhibit strong sensitivity to these specific characteristics relative to the baselines.
Communication
  • Visualization Method: The figure effectively uses grouped bar charts to compare the performance of different models within specific dataset subgroups.
  • Clarity of Dataset Subgroups: The characteristics chosen for splitting the datasets (# Classes, Class Balance, Feature/Sample Ratio, Target Outliers) are clearly labeled, allowing readers to understand the basis of comparison in each panel.
  • Metric Labeling: The y-axis label 'Normalized Average Performance (ROC AUC and Negative RMSE)' specifies the metric, although averaging normalized scores across different task types and metrics requires careful interpretation by the reader.
  • Indication of Uncertainty: Error bars representing 95% confidence intervals are included, which clearly communicates the statistical uncertainty associated with the average performance in each subgroup.
  • Ease of Comparison: The layout allows for easy comparison of TabPFN against baselines (MLP, Linear, CatBoost) under different data conditions.
Extended Data Fig. 2 | Performance comparisons of TabPFN and baselines on...
Full Caption

Extended Data Fig. 2 | Performance comparisons of TabPFN and baselines on additional benchmark datasets and with GPU support. (a) Classification performance on the Grinsztajn medium-sized benchmark with categorical features, across 7 datasets. (b) Classification performance on the Grinsztajn medium-sized benchmark with numerical features, across its 15 datasets. (c) Classification performance on the TabZilla benchmark, consisting of 102 datasets with fewer than 10,000 rows of data, 500 features, and 10 classes.

Figure/Table Image (Page 15)
First Reference in Text
As shown in Extended Data Fig. 2, similar to our primary benchmarks, TabPFN substantially outperformed all baselines on the benchmarks of refs.
Description
  • Performance Metric and Time Axis: Panel (a) displays the classification performance, measured by Normalized ROC AUC (Area Under the Receiver Operating Characteristic Curve, a metric evaluating discrimination ability where higher is better), as a function of the average time taken for model fitting and prediction.
  • Benchmark Dataset Subset: The comparison is performed on a specific subset of the Grinsztajn benchmark datasets – those characterized by having categorical features (features whose values fall into discrete categories). The caption states this involves 7 datasets.
  • Performance Normalization: Performance scores are normalized per dataset before aggregation, meaning 1.0 represents the best score achieved by any method on that dataset, and 0.0 the worst. The plot shows the average normalized score.
  • Compared Methods and Tuning Time: The plot compares TabPFN (default), TabPFN (PHE) (an enhanced version with Post Hoc Ensembling), AutoGluon (an automated machine learning framework often using ensembles), and CatBoost (a gradient boosting baseline). Performance is plotted against increasing time budgets allowed for hyperparameter tuning (ranging from 5 seconds to 14,400 seconds or 4 hours).
  • Key Visual Finding: The plot visually suggests that TabPFN achieves high performance very quickly (low time values) compared to the baselines, which require significantly more tuning time to reach comparable or lower performance levels on these specific datasets.
Scientific Validity
  • Use of Additional Benchmarks: Evaluating on additional, established benchmarks like the Grinsztajn suite strengthens the generalizability claims beyond the primary benchmarks used in Figure 4.
  • Focus on Categorical Features: Focusing specifically on datasets with categorical features addresses a common challenge in tabular data and tests model robustness to this data type.
  • Inclusion of AutoML Baseline: Comparing against AutoGluon, a strong AutoML baseline, provides a rigorous comparison point, especially regarding tuned performance.
  • Time Budget Analysis: Analyzing performance as a function of tuning time is a standard and informative way to assess the practical trade-off between computational cost and accuracy.
  • Normalization Context: The normalization method allows aggregation but is relative to the methods compared. The conclusions hold within the context of this comparison.
Communication
  • Clarity of Performance vs. Time Trade-off: The line plot clearly shows the trade-off between performance (Normalized ROC AUC) and computation time (Average Fit + Predict Time).
  • Use of Logarithmic Time Scale: Using a logarithmic scale for the x-axis (time) effectively visualizes performance across different time budgets, from seconds to hours.
  • Method Differentiation: Distinct markers and colors, along with a clear legend, differentiate the compared methods (TabPFN, TabPFN (PHE), AutoGluon, CatBoost).
  • Visualization of Uncertainty: Confidence bands (shaded areas, likely 95% CI as per caption of Fig 4) effectively communicate the variability or uncertainty in performance at different time points.
Extended Data Fig. 3 | Comparing SHAP (SHapley Additive exPlanations) summary...
Full Caption

Extended Data Fig. 3 | Comparing SHAP (SHapley Additive exPlanations) summary plots between TabPFN and baselines. We compare SHAP feature importance and impact for Logistic Regression, TabPFN, and CatBoost on the "Default of Credit Card Clients" dataset. The top features visualized are credit amount, age, and duration. Each point represents a single instance, with the color indicating the value of the checking status feature (blue for low, red for high), illustrating its interaction with the respective feature on the x-axis.

Figure/Table Image (Page 16)
First Reference in Text
Extended Data Fig. 3 compares the feature importance and impact for logistic regression, CatBoost and TabPFN.
Description
  • Interpretability Comparison using SHAP: Extended Data Figure 3 presents a comparison of model interpretability using SHAP (SHapley Additive exPlanations) summary plots. SHAP values quantify the contribution of each feature to the prediction for individual data instances.
  • Models and Dataset: The analysis compares three models: Logistic Regression (a simple linear model), TabPFN (the proposed model), and CatBoost (a complex tree-based model), applied to the 'Default of Credit Card Clients' dataset.
  • Features Analyzed: The figure displays plots for the three most important features identified by SHAP: 'credit_amount', 'age', and 'duration'. Each column corresponds to one of these features.
  • SHAP Value Visualization: Each plot is a scatter plot where every point represents a single client (instance) from the dataset. The horizontal position (x-axis) shows the actual value of the feature for that client (e.g., their age). The vertical position (y-axis) shows the calculated SHAP value for that feature for that client, indicating the feature's impact on the model's prediction (e.g., predicting default risk). Positive SHAP values push the prediction towards default, negative values away from it.
  • Interaction Effect Visualization: The color of each point represents the value of the 'checking_status' feature for that client (ranging from blue for low values to red for high values). This coloring helps visualize how the impact of the primary feature (on the x-axis) might change depending on the client's checking account status (an interaction effect).
  • Qualitative Comparison Goal: The figure aims to visually compare the complexity and nature of the relationships learned by each model. For example, Logistic Regression shows near-linear relationships, while CatBoost exhibits more complex, non-monotonic patterns, and TabPFN appears to learn relatively smooth, interpretable relationships.
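
A minimal sketch of producing such SHAP scatter plots with the shap library; the fitted classifier `model` and a pandas DataFrame `X` with columns such as 'age' and 'checking_status' are assumptions, and the paper's exact explainer settings are not specified here:

    import shap

    # Explain the positive-class probability so each sample has a single output.
    predict_pos = lambda data: model.predict_proba(data)[:, 1]
    explainer = shap.Explainer(predict_pos, X)   # model-agnostic (permutation) explainer
    shap_values = explainer(X)

    # SHAP value of 'age' per instance, coloured by 'checking_status' to show the
    # interaction, analogous to the panels in this figure.
    shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "checking_status"])
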
Scientific Validity
  • Methodological Soundness (SHAP): Using SHAP is a state-of-the-art, theoretically grounded method for explaining machine learning model predictions, making it a valid choice for comparing model interpretability.
  • Baseline Selection for Interpretability: Comparing TabPFN's interpretability against both a simple baseline (Logistic Regression) and a complex, high-performance baseline (CatBoost) provides valuable context.
  • Dataset Relevance: Applying the analysis to a real-world dataset ('Default of Credit Card Clients') where interpretability is often crucial enhances the practical relevance.
  • Visualization Technique Validity: Visualizing SHAP values against feature values, colored by another feature, is a standard and informative technique for understanding feature effects and interactions.
  • Consistency of Visuals and Claims: The qualitative conclusions drawn in the text (e.g., TabPFN learning simple, interpretable relationships while maintaining accuracy) appear consistent with the visual patterns shown in the plots.
Communication
  • Comparative Grid Layout: The grid layout comparing models (rows) across key features (columns) facilitates direct visual comparison of interpretability patterns.
  • Use of SHAP Summary Plots: Using SHAP summary plots is a standard and powerful way to visualize feature effects, though potentially complex for non-experts in ML interpretability. The explanation in the caption is helpful.
  • Visualization of Interactions: Color-coding points based on the 'checking_status' feature effectively highlights potential interaction effects, adding another layer of insight.
  • Labeling Clarity: Clear labeling of axes (feature names, SHAP value) and rows/columns (models/features) aids interpretation.
  • Effectiveness in Showing Model Differences: The figure successfully conveys qualitative differences in learned feature relationships, such as linearity vs. non-linearity and complexity.
Extended Data Fig. 4 | Finetuning TabPFN on 2-dimensional sine curve datasets....
Full Caption

Extended Data Fig. 4 | Finetuning TabPFN on 2-dimensional sine curve datasets. (a) Examples of 2D sine curve datasets with different offsets. (b) Finetuning loss curves for 50 runs with random train-test offsets. Colors indicate the offset between train and test. TabPFN shows positive transfer,

Figure/Table Image (Page 17)
First Reference in Text
Our analysis across 50 runs (Extended Data Fig. 4) shows that TabPFN successfully transfers knowledge even when labels differ significantly between fine-tuning and test tasks, with performance improving as distributions become more similar.
Description
  • Dataset Visualization: Panel (a) displays four examples of the synthetic 2-dimensional datasets used for the fine-tuning experiments. These datasets represent a function based on sine waves across two input dimensions (Dimension 1, Dimension 2).
  • Color Map Representation: Each plot uses a color map where the color intensity likely represents the output value of the 2D sine function at each point in the 2D input space.
  • Illustration of Data Offsets/Shifts: The key aspect illustrated is the 'offset' or phase shift between datasets. Two training datasets (A and B) are shown, along with corresponding test datasets that have a specific phase shift relative to the training sets (Pi/2 offset from A, Pi offset from B). This setup is designed to test the model's ability to adapt to related but different data distributions during fine-tuning.
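
Generating such 2D sine-curve datasets with a controlled offset is straightforward; a small illustrative sketch follows (the paper's exact target definition may differ):

    import numpy as np

    def make_sine_dataset(n=1000, offset=0.0, seed=0):
        """2D inputs; the target is a sine of both dimensions with a phase offset."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 2.0 * np.pi, size=(n, 2))
        y = np.sin(X[:, 0] + X[:, 1] + offset)
        return X, y

    X_train_a, y_train_a = make_sine_dataset(offset=0.0)
    X_test_a,  y_test_a  = make_sine_dataset(offset=np.pi / 2, seed=1)  # shifted test task
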
Scientific Validity
  • Controlled Synthetic Data: Using synthetic datasets with controlled variations (phase shifts in sine waves) is a valid approach for systematically studying the fine-tuning capabilities and robustness to distribution shifts in a simplified setting.
  • Relevance to Fine-tuning Research: The concept of training on one distribution (e.g., Dataset A) and testing/fine-tuning on a related but shifted distribution (e.g., Test Offset from A) directly addresses the important research question of knowledge transfer and adaptation.
  • Context for Quantitative Results: These specific examples provide a visual basis for understanding the experimental setup described quantitatively in panel (b).
Communication
  • Visualization Method: The use of color maps effectively visualizes the 2D sine function's output across the input space (Dimension 1, Dimension 2).
  • Illustration of Offsets: Presenting four distinct examples (Training Dataset A, Training Dataset B, Test Dataset Offset from A, Test Dataset Offset from B) clearly illustrates the concept of different offsets or shifts between training and testing distributions.
  • Labeling and Titles: Axes are clearly labeled as 'Dimension 1' and 'Dimension 2', and titles identify each plot's role (Training/Test dataset and its relationship to others).
  • Clarity of Phase Shift: The visual difference between, for example, 'Training Dataset A' and 'Test Dataset, Offset from A: Pi / 2' clearly shows the phase shift that the fine-tuning aims to adapt to.
Extended Data Table 1 | Aggregated results on the 29 AMLB classification...
Full Caption

Extended Data Table 1 | Aggregated results on the 29 AMLB classification Benchmark datasets

Figure/Table Image (Page 18)
Extended Data Table 1 | Aggregated results on the 29 AMLB classification Benchmark datasets
First Reference in Text
We show comparisons on a larger number of metrics in Extended Data Tables 1 and 2.
Description
  • Benchmark and Task: This table presents a detailed summary of performance results for various machine learning models on 29 classification datasets from the AutoML Benchmark (AMLB).
  • Models Compared: It compares TabPFN (in default, 4h tuned, and PHE 4h tuned versions) against several baseline algorithms including AutoGluon, XGBoost, CatBoost, LightGBM, Random Forest, SVM, MLP, and Logistic Regression. Both default and 4-hour tuned versions of most baselines are included.
  • Evaluation Metrics: Performance is evaluated using multiple metrics: ROC AUC (Area Under the Receiver Operating Characteristic Curve), Accuracy (Acc.), F1-score (harmonic mean of precision and recall), Cross-Entropy (CE, a loss function measuring prediction error), and Expected Calibration Error (ECE, measuring how well predicted probabilities match actual frequencies). Arrows indicate whether higher (↑) or lower (↓) values are better.
  • Result Categories: The table shows results in four main sections: 'Mean Normalized' scores (where scores are scaled 0-1 per dataset based on min/max performance across models), 'Mean' absolute scores (average raw metric values), 'Wins' (count of datasets where a model achieved the best score for a metric), and 'Mean Time' (average computation time in seconds).
  • Uncertainty Reporting: Uncertainty is reported using '±' notation, likely representing standard error or confidence intervals around the mean scores.
  • Key Numerical Findings: Key numerical results show TabPFN variants generally achieving the highest normalized scores (e.g., TabPFN PHE 4h tuned: 0.971 Normalized ROC AUC) and the most 'Wins' across metrics, often with substantially lower computation time than tuned baselines (e.g., default TabPFN time: 2.793 s vs. tuned CatBoost time: 14437.103 s).
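The 'Mean Normalized' aggregation described above amounts to a per-dataset min-max rescaling of each metric across models. A minimal sketch, assuming ties are mapped to 1 and lower-is-better metrics are flipped (the paper's exact convention may differ):

```python
# Hedged sketch: per-dataset min-max normalization of one metric across models.
import numpy as np

def normalize_scores(scores, higher_is_better=True):
    """Rescale one dataset's scores across models to [0, 1], with the best model at 1."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:                       # all models tie on this dataset
        return np.ones_like(scores)
    norm = (scores - lo) / (hi - lo)
    return norm if higher_is_better else 1.0 - norm

# Example: ROC AUC of four models on a single dataset (best model maps to 1.0).
print(normalize_scores([0.91, 0.87, 0.95, 0.89]))
```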
Scientific Validity
  • Benchmark Selection: The use of the AMLB benchmark suite ensures comparison on a standard, curated set of diverse classification tasks.
  • Metric Selection: Evaluating across a comprehensive set of standard classification metrics (ROC AUC, Acc, F1, CE, ECE) provides a robust assessment of model performance beyond just accuracy.
  • Default vs. Tuned Comparison: Comparing default and time-constrained (4h) tuned versions provides a fair assessment of both out-of-the-box usability and potential performance.
  • Baseline Selection: The inclusion of strong baselines like AutoGluon, CatBoost, and XGBoost makes the comparison rigorous.
  • Comprehensive Reporting: Reporting both normalized and absolute scores, along with win counts and timing, offers a multi-faceted view of the results.
  • Statistical Aggregation and Uncertainty: Aggregating results across 29 datasets provides statistical power, and reporting uncertainty (± values) is crucial for assessing the reliability of mean differences.
  • Methodological Rigor: The methodology aligns with standard practices in empirical machine learning research for comparing algorithm performance.
Communication
  • Table Structure and Clarity: The table is well-structured, clearly separating normalized mean scores, absolute mean scores, win counts, and mean time.
  • Metric Labeling: Column headers clearly indicate the metric (e.g., ROC, Acc., F1, CE, ECE) and whether higher (↑) or lower (↓) is better, aiding interpretation.
  • Model/Condition Labeling: Rows clearly identify the model and the condition (e.g., 'TabPFN (PHE, 4h tuned)', 'CatBoost (default)').
  • Inclusion of Normalized and Absolute Scores: Presenting both normalized and absolute mean scores provides complementary information, though the normalized scores are perhaps more central to the paper's comparison across diverse datasets.
  • Clarity of 'Wins' Summary: The 'Wins' columns provide a simple, interpretable summary of how often each method performed best for each metric.
  • Inclusion of Time Cost: Reporting mean time provides crucial context on computational cost.
  • Indication of Uncertainty: The use of '±' notation clearly indicates the reporting of uncertainty (likely standard error or derived from confidence intervals mentioned elsewhere).
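Of the metrics in this table, Expected Calibration Error is probably the least familiar. A minimal sketch of one common estimator, using equal-width confidence bins (the paper's exact binning scheme is not specified here):

```python
# Hedged sketch: Expected Calibration Error with equal-width confidence bins.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average of |accuracy - confidence| over confidence bins."""
    confidences = y_prob.max(axis=1)                 # predicted-class confidence
    predictions = y_prob.argmax(axis=1)
    accuracies = (predictions == y_true).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
print(expected_calibration_error(y_true, y_prob, n_bins=5))
```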
Extended Data Table 2 | Aggregated results on the 28 AMLB and OpenML-CTR23...
Full Caption

Extended Data Table 2 | Aggregated results on the 28 AMLB and OpenML-CTR23 regression Benchmark datasets

Figure/Table Image (Page 19)
Extended Data Table 2 | Aggregated results on the 28 AMLB and OpenML-CTR23 regression Benchmark datasets
First Reference in Text
We show comparisons on a larger number of metrics in Extended Data Tables 1 and 2.
Description
  • Benchmark and Task: This table summarizes the performance of machine learning models on 28 regression datasets drawn from the AutoML Benchmark (AMLB) and the OpenML-CTR23 benchmark suite.
  • Models Compared: It compares TabPFN (default, 4h tuned, PHE 4h tuned) with baselines including AutoGluon, XGBoost, CatBoost, LightGBM, Random Forest, SVM, Ridge regression (the linear-model counterpart of Logistic Regression for regression tasks), and MLP. Default and 4-hour tuned versions are included for most.
  • Evaluation Metrics: Performance is assessed using standard regression metrics: RMSE (Root Mean Squared Error, lower is better), Spearman correlation coefficient (Spearman, measures rank correlation, higher is better), R2 (Coefficient of Determination, proportion of variance explained, higher is better), and MAE (Mean Absolute Error, lower is better).
  • Result Categories: Results are presented in four sections: 'Mean Normalized' scores (scaled 0-1 per dataset), 'Mean' absolute scores, 'Wins' (count of datasets where a model performed best), and 'Mean Time' (average computation time).
  • Uncertainty Reporting: Mean scores are accompanied by '±' values, indicating the uncertainty (e.g., standard error).
  • Key Numerical Findings: Numerically, TabPFN variants, particularly the tuned versions, show top performance in normalized scores (e.g., TabPFN PHE 4h tuned: 0.022 Normalized RMSE, 0.983 Normalized R2) and win counts, while the default TabPFN offers competitive performance with minimal time cost (4.745 s) compared to tuned baselines (often >14000 s).
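The four regression metrics reported here are all standard and easy to reproduce. A quick sketch with scikit-learn and SciPy, using toy values:

```python
# Hedged sketch: the four regression metrics from the table on a toy prediction vector.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # lower is better
mae = mean_absolute_error(y_true, y_pred)             # lower is better
r2 = r2_score(y_true, y_pred)                         # higher is better
rho = spearmanr(y_true, y_pred).correlation           # higher is better
print(rmse, mae, r2, rho)
```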
Scientific Validity
  • Benchmark Selection: Using datasets from established benchmarks (AMLB, OpenML-CTR23) ensures evaluation on relevant and diverse regression tasks.
  • Metric Selection: The selection of metrics (RMSE, Spearman, R2, MAE) covers different aspects of regression performance, including error magnitude, rank correlation, and variance explained.
  • Default vs. Tuned Comparison: Comparing default and tuned configurations provides insights into both immediate usability and optimized potential.
  • Baseline Selection: The set of baselines includes strong, commonly used algorithms for tabular regression, ensuring a rigorous comparison.
  • Comprehensive Reporting: The comprehensive reporting format (normalized, absolute, wins, time) allows for thorough analysis.
  • Statistical Aggregation and Uncertainty: Aggregation across 28 datasets and reporting uncertainty provides statistical robustness to the conclusions drawn.
  • Methodological Rigor: The experimental procedure adheres to standard practices for empirical comparison of regression algorithms.
Communication
  • Table Structure and Clarity: The table follows a clear structure, analogous to Extended Data Table 1, separating normalized mean, absolute mean, win counts, and mean time for regression tasks.
  • Metric Labeling: Column headers clearly define the regression metrics (RMSE, Spearman correlation, R2, MAE) and indicate the desired direction for optimization (↓ for errors, ↑ for correlation/R2).
  • Model/Condition Labeling: Rows unambiguously identify the model and its configuration (e.g., default, 4h tuned, PHE 4h tuned).
  • Inclusion of Normalized and Absolute Scores: Presenting both normalized and absolute mean scores gives a fuller picture, accommodating comparisons across diverse datasets (normalized) and understanding typical performance levels (absolute).
  • Clarity of 'Wins' Summary: The 'Wins' columns offer an intuitive summary of relative superiority for each metric.
  • Inclusion of Time Cost: Inclusion of mean computation time is essential for evaluating the practical trade-offs.
  • Indication of Uncertainty: Uncertainty is clearly indicated using '±' notation.
Extended Data Table 6 | Performance on Kaggle Data Science Challenges
Figure/Table Image (Page 23)
Extended Data Table 6 | Performance on Kaggle Data Science Challenges
First Reference in Text
Moreover, we show in Extended Data Table 6 that default TabPFN outperforms default CatBoost on all five Kaggle competitions with less than 10,000 training samples from the latest completed Tabular Playground Series.
Description
  • Purpose and Models Compared: This table compares the performance of the default (untuned) TabPFN model against the default configuration of CatBoost, a strong baseline model, on five specific data science challenges from the Kaggle platform.
  • Data Source: Kaggle Competitions: The challenges are identified by their 'Episode' number (3, 5, 9, 22, 26) from the Tabular Playground Series Season 3, a series of competitions focused on tabular data.
  • Problem Types: The table lists the type of machine learning problem for each competition: Binary Classification, Ordinal Regression, Regression, and Multiclass Classification (two instances).
  • Performance Scores and Metrics: For each competition, the table shows the performance score achieved by default CatBoost and default TabPFN using the specific evaluation metric defined for that competition. The metrics include ROC AUC (Area Under the Receiver Operating Characteristic Curve), Quadratic Weighted Kappa (for ordinal regression), RMSE (Root Mean Squared Error), Micro-averaged F1-Score, and Log loss.
  • Key Numerical Findings: Numerical results show TabPFN achieving better scores than CatBoost on all five competitions according to their respective primary metrics. For example, on Episode 3 (Binary Classification), TabPFN scored 0.868 ROC AUC compared to CatBoost's 0.841. On Episode 9 (Regression), TabPFN achieved an RMSE of 12.238 versus CatBoost's 12.506 (lower is better for RMSE).
  • Selection Criteria and Comparison Context: The footnote clarifies that these competitions were selected because they used datasets with fewer than 10,000 training samples and 500 features, fitting the target domain of TabPFN, and that the comparison uses the raw competition data without specialized feature engineering or ensembling.
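The comparison style in this table (library defaults, no feature engineering, one shared split) is straightforward to emulate. The sketch below uses a synthetic stand-in for the competition data and assumes the public tabpfn package's TabPFNClassifier; it is not the authors' evaluation code.

```python
# Hedged sketch: default-vs-default comparison of CatBoost and TabPFN on one split.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification      # synthetic stand-in for a Kaggle table
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier                   # assumption: public tabpfn package

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

cat = CatBoostClassifier(verbose=0).fit(X_tr, y_tr)   # library defaults, no tuning
pfn = TabPFNClassifier().fit(X_tr, y_tr)              # library defaults, no tuning

print("CatBoost ROC AUC:", roc_auc_score(y_va, cat.predict_proba(X_va)[:, 1]))
print("TabPFN   ROC AUC:", roc_auc_score(y_va, pfn.predict_proba(X_va)[:, 1]))
```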
Scientific Validity
  • Use of Competition Datasets: Evaluating on real-world competition datasets from Kaggle provides a practical test of model performance on challenging, independently curated tasks.
  • Default vs. Default Comparison: Comparing default configurations provides a fair assessment of out-of-the-box performance, relevant for users seeking quick solutions.
  • Use of Official Competition Metrics: Using the official competition metric for each challenge ensures the evaluation aligns with the intended measure of success for that specific problem.
  • Alignment with Study Scope: Selecting competitions that fit the specified size constraints (<10k samples, <500 features) aligns the evaluation with TabPFN's claimed domain of strength.
  • Support for Claims: The finding that default TabPFN consistently outperforms default CatBoost across these five diverse tasks provides strong evidence supporting the claims made in the main text, although it's based on a limited sample of competitions.
  • Contextualization of Results: The explicit mention in the footnote that this comparison uses raw data without advanced techniques used by top Kaggle competitors sets realistic expectations and clarifies the scope of the comparison.
Communication
  • Table Structure: The table format is clear and concise, presenting the core comparison effectively.
  • Clarity of Headers: Column headers (Competition, Problem type, CatBoost (default), TabPFN (default), Metric) are unambiguous.
  • Metric Specification: Specifying the exact metric used for each competition (e.g., ROC AUC, Kappa, RMSE, F1-Score, Log loss) and its optimization direction (arrows) is crucial and well-executed.
  • Footnote Clarity and Context: The footnote provides essential context regarding the source of the competitions (Tabular Playground Series Season 3), selection criteria (size constraints), and the nature of the comparison (default models, raw data).

Conclusion

Key Aspects

Strengths

Suggestions for Improvement

Methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Extended Data Table 3 | List of test datasets used for primary evaluation of...
Full Caption

Extended Data Table 3 | List of test datasets used for primary evaluation of classification tasks

Figure/Table Image (Page 20)
Extended Data Table 3 | List of test datasets used for primary evaluation of classification tasks
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Scope: This table provides a comprehensive list of the 29 datasets used for evaluating classification performance in the main study (specifically, those contributing to the results in Extended Data Table 1).
  • Information Provided: For each dataset, the table lists its common name, its unique identifier on the OpenML platform (OpenML ID, a public repository for machine learning data and experiments), the scientific or application domain it comes from (e.g., Census, Healthcare, Finance, Biology), and several key properties.
  • Key Dataset Statistics: The listed properties include the number of features (columns or input variables, ranging from 4 to 308), the number of samples (rows or data instances, ranging from 690 to 9873), the number of target classes (distinct output categories to predict, mostly 2 but up to 10), and the number of categorical features (features with discrete, non-numeric values, ranging from 0 to 180).
  • Dataset Source and Selection Criteria: The footnote indicates these datasets are sourced from the AutoML Benchmark and selected based on size constraints (fewer than 10,000 samples and 500 features), ensuring relevance to the study's focus on small-to-medium data.
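Because each dataset is listed with its OpenML ID, any entry can be pulled down in one call. A minimal sketch; data_id=31 (the German credit dataset) is used only as a familiar example and is not necessarily part of this particular list:

```python
# Hedged sketch: loading an evaluation dataset by the OpenML ID given in the table.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=31, as_frame=True, return_X_y=True)  # example ID only
print(X.shape, y.nunique())   # (samples, features) and number of target classes
```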
Scientific Validity
  • Reproducibility and Transparency: Listing the specific datasets used, along with their OpenML IDs, is crucial for reproducibility and transparency, allowing other researchers to replicate the experiments.
  • Dataset Diversity: The datasets span a wide range of domains, feature counts, sample sizes, target numbers, and prevalence of categorical features, supporting the claim of evaluating performance on diverse real-world tabular data.
  • Use of Standard Benchmark: Using datasets from a recognized benchmark (AutoML Benchmark) adds credibility, as these datasets are typically curated for relevance and quality.
  • Alignment with Study Scope: The selection criteria (size constraints) align with the paper's stated focus on datasets where TabPFN is claimed to excel (up to 10,000 samples).
  • Methodological Detail: Providing this level of detail about the evaluation data is essential methodological information.
Communication
  • Table Structure and Headers: The table is clearly structured with informative column headers (Name, OpenML ID, Domain, Features, Samples, Targets, Categorical Feats.).
  • Inclusion of OpenML IDs: Providing the OpenML ID for each dataset is excellent practice, as it allows readers to easily find and access the exact datasets used, enhancing transparency and reproducibility.
  • Domain Information: Listing the domain for each dataset helps convey the diversity of the benchmark suite.
  • Dataset Characteristics: Including key dataset characteristics (number of features, samples, targets, categorical features) provides valuable context about the scale and nature of the problems evaluated.
  • Footnote Clarity: The footnote clarifies the source and selection criteria (AutoML Benchmark, <10k samples, <500 features), which is helpful context.
Extended Data Table 4 | List of test datasets used for primary evaluation of...
Full Caption

Extended Data Table 4 | List of test datasets used for primary evaluation of regression tasks

Figure/Table Image (Page 21)
Extended Data Table 4 | List of test datasets used for primary evaluation of regression tasks
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Listing Regression Datasets: This table details the 28 specific datasets used for the primary evaluation of regression models, complementing Extended Data Table 3 which listed classification datasets.
  • Information Provided per Dataset: For each dataset, the table provides its common name, its unique identifier on the OpenML platform (OpenML ID - a code used to find the exact dataset on a public online repository for machine learning data), its application domain (e.g., Marine Biology, Economics, Real Estate, Materials Science), and key statistical properties.
  • Key Dataset Statistics: The statistical properties listed are: the number of features (input variables, ranging from 3 for 'quake' to 376 for 'Mercedes_Benz_Greener_Manufacturing'), the number of samples (data instances, ranging from 240 for 'tecator' to 10,000 for 'grid_stability'), and the number of categorical features (non-numeric inputs, ranging from 0 to 43).
  • Dataset Source and Selection Criteria: The footnote clarifies that these datasets are sourced from the AutoML (AMLB) and OpenML-CTR23 benchmarks and were selected because they have fewer than 10,000 samples and 500 features, aligning with the study's focus.
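The characteristics columns of this table (samples, features, categorical features) can be recomputed for any loaded dataset. A small sketch with a toy frame standing in for a fetched OpenML dataset:

```python
# Hedged sketch: recomputing the table's per-dataset statistics from a pandas DataFrame.
import pandas as pd

# Toy stand-in; in practice X would come from fetch_openml(data_id=...) as a DataFrame.
X = pd.DataFrame({"x1": [0.1, 0.2, 0.3],
                  "x2": pd.Categorical(["a", "b", "a"]),
                  "x3": [1, 2, 3]})

n_samples, n_features = X.shape
n_categorical = X.select_dtypes(include=["category", "object"]).shape[1]
print(n_samples, n_features, n_categorical)   # 3 3 1
```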
Scientific Validity
  • Reproducibility and Transparency: Providing a complete list of datasets with unique identifiers (OpenML IDs) is fundamental for ensuring the scientific reproducibility of the regression experiments.
  • Dataset Diversity: The datasets cover a broad range of domains and vary significantly in their number of features, samples, and presence of categorical variables, indicating a diverse and challenging benchmark suite for regression.
  • Use of Standard Benchmarks: Sourcing datasets from established benchmarks (AMLB, OpenML-CTR23) ensures that the evaluation is performed on tasks recognized as relevant by the machine learning community.
  • Alignment with Study Scope: The selection criteria (size constraints) are consistent with the paper's focus on small-to-medium tabular data, making the evaluation relevant to the claims about TabPFN's performance regime.
  • Methodological Detail: This table provides essential methodological information, detailing the specific data used for the regression evaluations reported in Extended Data Table 2.
Communication
  • Table Structure and Headers: The table is well-organized with clear column headers (Name, OpenML ID, Domain, Features, Samples, Categorical Features), facilitating easy lookup of dataset information.
  • Inclusion of OpenML IDs: Including the OpenML ID is crucial for enabling readers to directly access the datasets, significantly enhancing transparency and reproducibility.
  • Domain Information: Specifying the application domain provides context on the variety of real-world problems represented in the benchmark.
  • Dataset Characteristics Summary: Listing the number of features, samples, and categorical features gives a quick overview of the characteristics and potential challenges of each dataset.
  • Footnote Clarity: The footnote clearly states the source benchmarks and selection criteria, providing necessary context for the dataset list.
Extended Data Table 5 | Hyperparameter defaults and search space for TabPFN and...
Full Caption

Extended Data Table 5 | Hyperparameter defaults and search space for TabPFN and our baselines

Figure/Table Image (Page 22)
Extended Data Table 5 | Hyperparameter defaults and search space for TabPFN and our baselines
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Hyperparameter Specification: This table details the hyperparameters used for the TabPFN model and the baseline machine learning algorithms evaluated in the study. Hyperparameters are settings that are configured before the model training process begins, controlling the model's structure or learning behavior (e.g., the complexity of a decision tree, the learning rate of a neural network).
  • TabPFN Hyperparameters (Panel a): Panel (a) lists hyperparameters specific to TabPFN, showing the default settings used for classification and regression tasks, and the search space explored during hyperparameter optimization (tuning). Parameters include choices about preprocessing (e.g., 'Use the random forest preprocessing'), ensemble configuration (e.g., 'Number of predictions to average'), and internal model settings (e.g., 'Softmax temperature').
  • Baseline Hyperparameters (Panels b, c): Panels (b) and (c) list the hyperparameters and their search spaces for the baseline models: MLP (Multi-Layer Perceptron), Random Forest, SVM (Support Vector Machine), CatBoost, XGBoost, and LightGBM. For each baseline, key hyperparameters controlling their behavior are listed (e.g., for Random Forest: 'n_estimators', 'max_features', 'max_depth'; for SVM: 'C', 'gamma', 'kernel'; for tree boosting models like XGBoost: 'learning_rate', 'max_depth', 'subsample').
  • Hyperparameter Search Spaces: For each hyperparameter in the baselines, the table specifies the range or set of values explored during the tuning process (the 'Search Space'). This defines the possible configurations considered when optimizing the models for the 'Tuned (4h)' results shown elsewhere.
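The U{...}/logU(...) notation used in the table maps directly onto standard distribution objects. A minimal sketch of a baseline search space in that style, tuned with randomized search over XGBoost; the ranges and protocol here are illustrative assumptions, not the paper's actual search space:

```python
# Hedged sketch: a U{...}/logU(...)-style search space explored via randomized search.
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

search_space = {
    "learning_rate": loguniform(1e-3, 0.3),   # logU(1e-3, 0.3)
    "max_depth": randint(2, 12),              # U{2, ..., 11}
    "subsample": uniform(0.5, 0.5),           # U(0.5, 1.0): loc=0.5, scale=0.5
    "n_estimators": randint(100, 2000),
}

search = RandomizedSearchCV(XGBRegressor(), search_space, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```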
Scientific Validity
  • Reproducibility and Transparency: Providing detailed hyperparameter defaults and search spaces is absolutely essential for the reproducibility of the experimental results. Without this information, replicating the 'Default' or 'Tuned' performance figures would be impossible.
  • Reasonableness of Search Spaces: The chosen search spaces for the baseline models appear reasonable and cover commonly tuned hyperparameters for these algorithms, using standard distribution types (uniform, log-uniform) appropriate for different parameter types.
  • Importance of Defaults: The default parameters listed for TabPFN represent the authors' recommended starting point, crucial for evaluating its out-of-the-box performance.
  • Coverage of Key Hyperparameters: The selection of hyperparameters to tune for each baseline aligns with standard practices in machine learning; key parameters known to influence performance are included.
  • Methodological Detail: This table provides critical methodological detail necessary to understand the experimental setup for both default performance evaluation and hyperparameter optimization.
Communication
  • Table Organization: The table is clearly divided into sections for TabPFN (a) and the different baseline models (b, c), making it easy to locate information for a specific algorithm.
  • Clarity of Information: Listing parameters, their default values (where applicable), and the search space used for tuning is clear and standard practice.
  • Notation for Search Spaces: The notation used for search spaces (e.g., U{...}, logU(...), specific value lists) is common in hyperparameter optimization literature but might require familiarity for interpretation (U likely means Uniform distribution, logU means Log-Uniform).
  • Clarity of Parameter Names: Parameter names are generally standard for the respective libraries (e.g., 'learning_rate', 'max_depth', 'n_estimators', 'C', 'gamma').
  • Clarity of TabPFN Section: Panel (a) clearly distinguishes between defaults for classifier vs. regressor TabPFN models and the unified search space.