A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

Assaf Shmuel, Oren Glickman, Teddy Lazebnik
arXiv: 2408.14817v1
Department of Computer Science, Bar Ilan University, Israel

Overall Summary

Study Background and Main Findings

This paper addresses the tendency of deep learning (DL) models to underperform traditional machine learning (ML) methods, particularly tree-based ensembles, on tabular datasets (data organized in rows and columns). The authors conducted a comprehensive benchmark involving 111 datasets and 20 different models, covering both regression (predicting a continuous value) and classification (predicting a category) tasks. The datasets varied in size and included those with and without categorical features (features that take on values from a limited set of categories).

The benchmark results confirmed the general superiority of ML models, especially tree-based ensembles like CatBoost and LightGBM, on tabular data. However, the study also found that DL models performed better on certain types of datasets. A key contribution of the paper is the development of a predictive model that can identify scenarios where DL models are likely to outperform traditional ML models, achieving 86.1% accuracy. This predictive model was built using a meta-learning approach, where each dataset was represented by a set of characteristics ('meta-features'), such as the number of rows and columns, statistical properties like kurtosis (a measure of how 'tailed' a distribution is), and relationships between features and the target variable.
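
The paper's exact meta-learning pipeline is not reproduced here, but its shape can be illustrated with a minimal sketch, assuming one row of meta-features per dataset (the study uses the 20 features listed in Table 8) and a binary label recording whether a DL model beat the best ML model on that dataset. The column names and values below are hypothetical placeholders, not the authors' data.

```python
# Minimal sketch of the meta-learning setup (illustrative values only):
# one row per dataset, columns are meta-features, label = did DL beat ML?
import pandas as pd
from sklearn.linear_model import LogisticRegression

meta = pd.DataFrame({
    "n_rows":            [4720, 150000, 800, 23000, 560, 9100],
    "n_cols":            [13, 45, 60, 9, 80, 22],
    "avg_kurtosis":      [6.4, 1.2, 92.0, 3.1, 48.5, 0.7],
    "is_classification": [1, 0, 1, 1, 1, 0],
    "dl_outperforms":    [0, 0, 1, 0, 1, 0],   # hypothetical labels
})

X = meta.drop(columns="dl_outperforms")
y = meta["dl_outperforms"]

# The study reports ~86% accuracy for this kind of meta-model over 111
# datasets; with a toy table the fitted coefficients are only illustrative.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, clf.coef_[0].round(3))))
```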

The meta-analysis revealed that DL models tend to perform better on smaller datasets with a larger number of columns, and on datasets with high kurtosis values. The study also found that DL models have a relative advantage in classification tasks compared to regression tasks. These findings were supported by both a logistic regression model and a symbolic regression model, which generated an explicit formula for predicting DL outperformance based on dataset characteristics. To further investigate the impact of dataset size, the authors conducted an experiment where they downsampled larger datasets to 1000 training samples. While some DL models improved in ranking on these smaller datasets, tree-based ensembles generally continued to dominate.

The authors discuss their findings in the context of previous benchmarks, noting that their results generally align with prior observations about the relative performance of ML and DL models on tabular data. They acknowledge that dataset size is just one of many factors influencing model performance and emphasize the importance of considering other dataset characteristics, such as kurtosis and task type, when choosing between ML and DL models. They also highlight the strong performance of AutoGluon, an automated ML framework that combines both ML and DL models, suggesting its potential as a robust solution for tabular data problems.

Research Impact and Future Directions

This comprehensive benchmark significantly advances our understanding of machine learning (ML) and deep learning (DL) model performance on tabular data. By evaluating 20 diverse models across 111 datasets, the study confirms the general superiority of traditional ML, especially tree-based ensembles, while also identifying specific conditions where DL models can excel. The development of a predictive model, achieving 86.1% accuracy in identifying scenarios favoring DL, is a key contribution, offering practitioners a valuable tool for informed model selection.

While the benchmark's breadth is a strength, the diversity of datasets might introduce noise and limit the depth of analysis within specific data types. Additionally, the focus on accuracy, while common, should be interpreted cautiously, especially for potentially imbalanced datasets where other metrics like AUC or F1-score offer a more robust evaluation. Future work could explore feature engineering's impact and extend the benchmark to other tabular data tasks like time-series analysis.

Despite these limitations, the study provides valuable insights into the complex interplay of dataset characteristics and model performance. The identification of task type (classification vs. regression) and kurtosis as statistically significant predictors of DL's relative success offers actionable guidance. The mechanistic explanations for these findings, while plausible, warrant further investigation to confirm the proposed underlying mechanisms. Overall, this benchmark establishes a strong foundation for future research and provides practitioners with a data-driven approach to model selection for tabular data, moving beyond the 'safe bet' of traditional ML towards a more nuanced understanding of when DL can offer superior performance.

Critical Analysis and Recommendations

Clear Problem Definition and Objective (written-content)
The abstract clearly articulates the research gap (DL underperformance on tabular data) and the study's objective (comprehensive benchmark). This clarity immediately orients the reader to the paper's purpose and significance.
Section: Abstract
Emphasis on Benchmark Comprehensiveness (written-content)
The abstract emphasizes the benchmark's scale (111 datasets, 20 models) and diversity (regression/classification tasks, varying scales, categorical variables). This breadth strengthens the potential generalizability and impact of the findings.
Section: Abstract
Novel Predictive Outcome with Quantified Performance (written-content)
The abstract highlights the novel predictive model and quantifies its performance (86.1% accuracy, 0.78 AUC). This tangible outcome adds practical value and demonstrates a clear contribution.
Section: Abstract
Clear Problem Statement and Motivation (written-content)
The introduction effectively establishes the central problem (ML generally outperforming DL on tabular data) and motivates the need for the study. This clear framing immediately engages the reader.
Section: 1 Introduction
Effective Summary of Prior Work (written-content)
The introduction concisely summarizes relevant prior work, providing context and demonstrating awareness of the existing literature. This background sets the stage for the paper's contributions.
Section: 1 Introduction
Missing Data Handling (written-content)
The paper lacks details on how missing values in datasets were handled. This omission hinders reproducibility and could introduce bias, as different imputation strategies can significantly affect model performance.
Section: 2 Experimental setup
Justification for Meta-Learning Target Metrics (written-content)
The justification for using only RMSE (regression) and AUC (classification) for the meta-learning target variable, while other metrics were collected, is missing. This choice directly impacts the meta-learning outcome and should be explicitly rationalized.
Section: 2 Experimental setup
Dataset Comparison Table (Table 1) (graphical-figure)
Table 1 effectively compares the current study's dataset characteristics with previous benchmarks, highlighting its comprehensiveness and diversity. This contextualization strengthens the paper's contribution.
Section: 2 Experimental setup
Descriptive Statistics Table (Table 2) (graphical-figure)
Table 2 provides descriptive statistics of the dataset variables, demonstrating the diversity of the benchmark. This information is crucial for understanding the scope and generalizability of the findings.
Section: 2 Experimental setup
Logistic Regression Results (Table 6) (graphical-figure)
Table 6 presents the logistic regression coefficients for predicting DL outperformance. The inclusion of p-values allows for assessing statistical significance, strengthening the conclusions about influential factors.
Section: 3 Results
Heatmaps of DL Outperformance Probability (Figure 3) (graphical-figure)
Figure 3 visualizes the impact of various factors on DL outperformance probability. These heatmaps provide valuable insights into complex interactions, though the interpretation requires careful consideration of probability thresholds.
Section: 3 Results
Effective Synthesis with Prior Work (written-content)
The discussion effectively synthesizes the findings with prior work, providing context and highlighting agreements or disagreements. This strengthens the paper's contribution to the field.
Section: 4 Discussion
Mechanistic Explanations for Findings (written-content)
The discussion provides plausible explanations for the observed influence of factors like task type and kurtosis on DL performance. These mechanistic interpretations enhance understanding, though further validation is needed.
Section: 4 Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

1 Introduction

Key Aspects

Strengths

Suggestions for Improvement

2 Experimental setup

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 1: Comparison with Previous Studies
Figure/Table Image (Page 3)
First Reference in Text
None of the datasets in this benchmark allowed a perfect prediction with all models. In Table 1 we compare the characteristics of our datasets with previous works.
Description
  • Study Comparison Overview: The table compares the current study ('Ours') with five previous studies (referenced as [1], [26], [2], [3], [29]) across several characteristics of the datasets and models used. For 'Ours', 20 models were evaluated.
  • Number of Datasets: The current study ('Ours') utilized 54 classification datasets and 57 regression datasets, totaling 111 datasets. This is compared to, for example, study [1] which used 22 classification and 33 regression datasets, and study [29] which used 181 classification and 119 regression datasets.
  • Median Dataset Size: The median dataset size for 'Ours' is 5k (presumably 5,000 instances/rows). This is smaller than study [1] (17k), [2] (11k), [3] (20k), and [29] (12k), but larger than [26] (2k).
  • Median Number of Features: The median number of features (columns in a dataset) in 'Ours' is 13. This is comparable to study [1] (13), but lower than [26] (21), [2] (32), [3] (21), and [29] (21).
  • Median Number of Categorical Features: The median number of categorical features (features whose values are from a limited, fixed set of categories, like 'color' with values 'red', 'blue', 'green') in 'Ours' is 4. This is higher than all listed previous studies, where the median ranged from 0 to 2.
  • Meta-Analysis Inclusion: The 'Meta-Analysis' column indicates whether the study performed a subsequent analysis on the results to understand when certain models perform better. 'Ours' is marked 'Yes', study [26] is 'Partial**', and the others are 'No'. A meta-analysis in this context refers to studying the study results themselves to find patterns.
Scientific Validity
  • ✅ Contextualization of the study: The table provides a useful context by comparing the scope of the current work (number of datasets, models, types of features) against several previous benchmarking studies. This helps in positioning the contribution of the paper.
  • 💡 Representativeness of Previous Studies: The selection of previous studies ([1], [26], [2], [3], [29]) appears relevant as they are also benchmarking studies in the domain. However, without knowing the broader literature, it's hard to ascertain if this is a fully representative set or if there's any selection bias towards studies that make 'Ours' look more comprehensive in certain aspects.
  • ✅ Relevance of Comparison Metrics: The chosen metrics for comparison (number of models, datasets, dataset size, feature counts, inclusion of categorical features, meta-analysis) are pertinent for evaluating the comprehensiveness of a benchmarking study in machine learning on tabular data.
  • ✅ Highlighting a distinguishing factor (Meta-Analysis): The claim that the current study performs a 'Meta-Analysis' ('Yes') while others did 'No' or 'Partial' highlights a distinguishing factor. The footnote for 'Partial**' for study [26] clarifies its scope, which is good. The strength of this claim depends on the depth and novelty of the meta-analysis performed in 'Ours', which isn't detailed in the table itself.
  • ✅ Supports claims of distinct characteristics: The table data supports the notion that 'Ours' includes a notable number of datasets with a higher median of categorical features compared to the cited studies, and uniquely claims a full meta-analysis. It also shows 'Ours' uses a median dataset size of 5k, which is smaller than most others listed, and a median feature count of 13, also smaller or comparable. This provides a balanced view, not just highlighting areas of greater scope.
  • 💡 Clarification of units ('k'): The term 'k' for dataset size is standard but confirming it means thousands of rows/instances would be ideal, perhaps in a general methods section if not in the caption.
Communication
  • ✅ Clear tabular format: The table effectively uses a standard tabular format to present a comparative overview, making it easy to contrast the current study with previous works across several key characteristics.
  • 💡 Column header clarity: The column headers are generally understandable. However, 'Median Dataset Size' could be more specific, e.g., 'Median Dataset Size (Rows)' or 'Median Dataset Size (Instances)', although 'k' typically implies thousands of rows in this context. 'Median # of Categorical' is clear but could be written as 'Median # Categorical Features' for full explicitness.
  • ✅ Use of footnotes for clarification: The footnotes '*' and '**' are present and explained below the table, which is good practice. Ensuring the footnote text is immediately discoverable and clearly phrased is important.
  • 💡 Clarity of 'Meta-Analysis' column entries: The 'Meta-Analysis' column uses terms like 'No', 'Partial**', and 'Yes'. While '**' is explained, a brief definition of what constitutes a 'Yes' or 'Partial' meta-analysis directly in the caption or as a general footnote for this column could enhance self-containedness, even if detailed elsewhere.
  • ✅ Good integration with text: The table is directly referenced and its purpose (comparing characteristics with previous works) is stated in the text, aiding comprehension.
Table 2: Descriptive Statistics of the Dataset Variables
Figure/Table Image (Page 4)
First Reference in Text
Overall, we included 111 datasets in this study: 57 regression datasets and 54 classification datasets. Table 2 summarizes the main parameters of the datasets.
Description
  • Overall Purpose: The table presents descriptive statistics for various characteristics of the 111 datasets used in the study. These characteristics are treated as variables themselves, describing the properties of each dataset. A minimal sketch of compiling such a summary appears after this list.
  • Dataset Size (Rows): The 'Number of Rows' (individual records or samples in a dataset) varies greatly, with a mean of 18,576, a median of 4,720, a minimum of 43, and a maximum of 245,057. This indicates a wide range of dataset sizes in terms of sample count.
  • Dataset Dimensionality (Columns): The 'Number of Columns' (features or attributes in a dataset) has a mean of 24.16, a median of 12.5, ranging from a minimum of 4 to a maximum of 267. This shows diversity in the dimensionality of the datasets.
  • Feature Types (Numerical and Categorical): Datasets on average have 14.25 'Numerical Columns' (features with continuous or discrete numbers, like age or temperature) with a median of 7, and 9.91 'Categorical Columns' (features with a fixed set of categories, like 'color' with values 'red', 'blue') with a median of 4. Some datasets have no numerical (min 0) or no categorical columns (min 0).
  • Kurtosis of Features: 'Kurtosis' is a statistical measure that describes the 'tailedness' of the probability distribution of a feature – essentially, whether the data are heavy-tailed (more outliers) or light-tailed (fewer outliers) relative to a normal distribution. The datasets show a very wide range of kurtosis values, with a mean of 348.01 and a median of 6.44, but a minimum of -2711.83 and a maximum of 8901.75. The large standard deviation (1129.66) also highlights this variability.
  • Inter-Feature Correlation: The 'Average Correlation Between Features' indicates the average linear relationship strength between different input features within a dataset. The mean is 0.08 (weak positive correlation on average), with a median of 0.06. Values range from -0.16 to 0.62, suggesting some datasets have moderately correlated features while others have very little inter-feature correlation.
  • Feature Entropy: 'Average Entropy' measures the average randomness or unpredictability of the features in a dataset. A higher entropy suggests more diverse values. The mean entropy is 7.70, with a median of 7.96, ranging from 2.45 to 14.17.
  • Feature-Target Correlation (Pearson): The 'Average Pearson to Target Feature' shows the average linear relationship between each input feature and the outcome variable (the 'target' the model tries to predict). The mean is 0.11 and the median is 0.09, suggesting that, on average, individual features have a weak linear relationship with the target. The range is from -0.19 to 0.44.
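
As a concrete illustration (this is not the authors' code, and the values are placeholders), a Table 2-style summary can be compiled by collecting one meta-feature row per dataset and calling pandas' describe():

```python
# Minimal sketch: one row per dataset, then describe() yields the
# mean/std/min/quartiles/max columns of a Table 2-style summary.
import pandas as pd

per_dataset = pd.DataFrame({
    "number_of_rows":    [43, 4720, 245057, 1200, 8800],
    "number_of_columns": [4, 13, 267, 31, 9],
    "avg_kurtosis":      [-1.1, 6.44, 8901.75, 2.3, 15.0],
})

summary = per_dataset.describe().T  # count, mean, std, min, 25%, 50%, 75%, max
print(summary[["mean", "std", "min", "25%", "50%", "75%", "max"]].round(2))
```
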
Scientific Validity
  • ✅ Methodological Soundness: Providing descriptive statistics of the dataset meta-features is crucial for understanding the landscape of data used in the benchmark. This allows readers to assess the diversity and characteristics of the datasets upon which model performances are compared.
  • ✅ Supports Claim of Dataset Diversity: The wide ranges observed for most variables (e.g., Number of Rows, Kurtosis) strongly support the claim of using 'diverse tabular datasets' and highlight the heterogeneity of the benchmark.
  • ✅ Comprehensive Meta-Features: The inclusion of statistics like Kurtosis, Average Entropy, and correlations provides deeper insights into data properties beyond simple counts of rows and columns. These are relevant meta-features for characterizing datasets in machine learning research.
  • ✅ Appropriate Level of Aggregation: The statistics are presented for the collection of 111 datasets. It's clear these are summary statistics across datasets, not within a single dataset. This is appropriate for characterizing the benchmark suite itself.
  • 💡 Consideration of Extreme Kurtosis Values: The extreme minimum and maximum values for Kurtosis (-2711.83 and 8901.75) are notable. The negative value in particular warrants a check: excess kurtosis is bounded below by -2 for any distribution (and Pearson kurtosis by 1), so an average of -2711.83 suggests a data anomaly or a computation/aggregation artifact rather than a genuine distributional property. The extremely large maximum, if genuine, further underscores dataset diversity, but both extremes deserve a brief comment in the text.
  • ✅ Alignment with Stated Purpose: The table effectively summarizes the 'main parameters of the datasets' as stated in the reference text. The chosen parameters are indeed key characteristics for tabular data.
Communication
  • ✅ Clear and Standard Format: The table is well-structured with clear row and column labels, making it easy to understand the statistical properties being presented for the dataset variables.
  • ✅ Comprehensive Statistical Summary: The statistical measures provided (Mean, STD, Min, 25%, Median, 75%, Max) are standard and offer a comprehensive summary of the distribution of each dataset characteristic.
  • ✅ Appropriate Terminology for Target Audience: Variable names like 'Number of Rows', 'Number of Columns' are self-explanatory. Terms like 'Kurtosis', 'Average Entropy', and 'Average Pearson to Target Feature' are standard in data science but might require some background knowledge for a broader audience; however, for the target audience of a machine learning paper, they are appropriate.
  • ✅ Appropriate Numerical Precision: The precision of the numerical values (e.g., two decimal places for means and medians of counts) is generally appropriate. For Kurtosis, the large range and standard deviation are notable, and the precision seems fine.
  • ✅ Consistent with Reference Text: The table effectively summarizes the characteristics of the 111 datasets used in the study, as stated in the reference text, providing a good overview of their diversity.
Table 8: The meta-feature vector representing a dataset.
Figure/Table Image (Page 16)
First Reference in Text
Table 8 mostly adopted from [41] and contains 20 features computationally useful for meta-learning tasks.
Description
  • Purpose and Origin of Meta-Features: Table 8 lists the 20 'meta-features' that are used to create a numerical profile, or 'vector', for each dataset in the study. Meta-features are characteristics calculated from a dataset that describe its properties. This vector is then used in 'meta-learning', which is essentially learning from the characteristics of previous learning tasks to improve future learning. The table is adopted primarily from a source cited as [41]. A minimal sketch of extracting several of these meta-features appears after this list.
  • Basic Dataset Properties: The meta-features cover various aspects of a dataset. Basic properties include 'Row count' (number of samples), 'Column count' (number of original features), 'Columns after One Hot Encoding' (number of features after converting categorical data into a numerical format where each category becomes a new binary feature), 'Numerical Features' (count of features with number values), and 'Categorical Features' (count of features with category values).
  • Task Type: Task type is captured by 'Classification/Regression', indicating whether the dataset is for predicting a category or a continuous value.
  • Statistical Properties of Features: Statistical properties of the features within a dataset are included, such as 'Cancor' (Canonical correlation, a measure of association between two sets of variables, here likely the best single combination of features), 'Kurtosis' (a measure of the 'tailedness' or outlier proneness of feature distributions), 'Average Entropy' (average randomness or information content of features), and 'Standard Deviation Entropy'.
  • Feature-Target Relationships: Relationships between features and the target variable (the variable to be predicted) are measured by 'Average Pearson to Target Feature' (average linear correlation between each feature and the target) and 'Standard Deviation Pearson To Target Feature'.
  • Inter-Feature Relationships: Inter-feature relationships are described by 'Average Correlation Between Features'.
  • Feature Variability and Anomaly Measures: Measures of feature variability and potential issues include 'Average Asymmetry Of Features' (average skewness), 'Average Coefficient of Variation' (standard deviation divided by the mean, a normalized measure of dispersion), 'Standard Deviation Coefficient of Variation', 'Average Coefficient of Anomaly' (mean divided by standard deviation, potentially highlighting features with low variability relative to their mean), and 'Standard Deviation Coefficient of Anomaly'.
  • Dimensionality/Complexity Measure: A dimensionality reduction related feature is 'PCA' (Principal Component Analysis), specifically 'The number of PCA components required to explain 99% of the variance in the data', indicating the intrinsic complexity or dimensionality of the dataset.
  • Sources of Meta-Features: The sources for these meta-features are cited, primarily from references [42], [43], [44], with 'Row Over Column' from [49] and 'PCA' from [50].
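
As an illustration, the sketch below computes several of the listed meta-features for a single dataset. It is not the authors' implementation: the aggregation choices (how per-feature statistics are averaged), the entropy definition, and the handling of missing values are assumptions.

```python
# Minimal sketch of a per-dataset meta-feature vector (assumed definitions).
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew, entropy
from sklearn.decomposition import PCA

def meta_features(df: pd.DataFrame, target: str) -> dict:
    y = df[target]
    X = df.drop(columns=target)
    num = X.select_dtypes(include="number")          # numerical features only

    # Number of PCA components needed to explain 99% of the variance.
    pca = PCA(n_components=0.99).fit(num.fillna(0))

    upper = np.triu_indices(num.shape[1], k=1)       # upper triangle, no diagonal
    return {
        "row_count": len(df),
        "column_count": X.shape[1],
        "numerical_features": num.shape[1],
        "categorical_features": X.shape[1] - num.shape[1],
        "avg_kurtosis": float(np.mean(kurtosis(num, nan_policy="omit"))),
        "avg_asymmetry": float(np.mean(skew(num, nan_policy="omit"))),
        "avg_entropy": float(np.mean(
            [entropy(X[c].value_counts(normalize=True)) for c in X.columns])),
        "avg_corr_between_features": float(num.corr().values[upper].mean()),
        "avg_pearson_to_target": (float(num.corrwith(y).mean())
                                  if pd.api.types.is_numeric_dtype(y) else float("nan")),
        "pca_components_99pct": int(pca.n_components_),
        "row_over_column": len(df) / X.shape[1],
    }

# Example usage (hypothetical file and target column):
# vec = meta_features(pd.read_csv("some_dataset.csv"), target="label")
```
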
Scientific Validity
  • ✅ Comprehensive and Relevant Meta-Feature Set: The selection of 20 meta-features covering simple statistics, feature properties, feature relationships, and feature-target relationships is comprehensive and aligns with common practices in meta-learning research. These types of features have been shown to be useful for characterizing datasets.
  • ✅ Grounded in Existing Literature: Citing sources for the meta-features, especially noting that the set is 'mostly adopted from [41]', lends credibility and allows for verification or deeper understanding of the feature definitions and their established utility.
  • ✅ Computationally Feasible Features: The meta-features chosen are computationally feasible to extract from datasets, which is important for practical meta-learning applications, as supported by the reference text stating they are 'computationally useful'.
  • ✅ Diverse Range of Feature Types: The inclusion of diverse types of meta-features (e.g., simple counts, statistical moments, information-theoretic measures, correlation-based) provides a rich representation of dataset characteristics, which is beneficial for building effective meta-models.
  • 💡 Details of Calculation Methodology Assumed Elsewhere: While the table lists the features, the exact methodology for calculating some of them (e.g., how 'Average Asymmetry' or 'Average Coefficient of Anomaly' are aggregated across features if they are per-feature measures) is not detailed in the table itself. This would typically be in the methods section or the cited references.
  • ✅ Appropriateness for the Meta-Learning Task: The utility of these specific 20 features for the particular meta-learning task in this paper (predicting DL vs. ML performance) is ultimately demonstrated by the performance of the meta-model built using them (e.g., the logistic regression in Table 6). The table itself just defines the inputs to that meta-model.
  • ✅ Inclusion of Simple Shape Descriptors: The feature 'Row Over Column' from [49] is a simple yet often informative measure of dataset shape, which can influence algorithm choice.
Communication
  • ✅ Clear and Organized Structure: The table is well-organized with three clear columns: 'Name' (of the meta-feature), 'Description', and 'Source' (citation for the feature). This structure makes it easy to understand each meta-feature and its origin.
  • ✅ Informative Descriptions: The descriptions provided for each meta-feature are generally concise and informative, helping the reader understand what each feature represents.
  • ✅ Citation of Sources: Providing sources for the meta-features (references [41], [42], [43], [44], [49], [50]) is good practice, allowing readers to trace the origin and potentially find more detailed definitions or justifications for their use.
  • ✅ Consistent with Caption and Reference Text: The caption clearly states that this table defines the meta-feature vector, and the reference text confirms it lists 20 features useful for meta-learning.
  • ✅ Clarity of Feature Names via Descriptions: Some feature names like 'Cancor' or 'Row Over Column' are standard in meta-learning literature but might be less intuitive for a broader audience without the provided description. The descriptions do a good job of clarifying these.
  • ✅ Specificity of PCA Description: For the 'PCA' feature, the description 'The number of PCA components required to explain 99% of the variance in the data' is very clear and specific.

3 Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 3: Performance ranking of TE and DL models.
Figure/Table Image (Page 5)
First Reference in Text
Table 3 outlines the performance of the Tree-based Ensemble models (TE) and DL models on the 111 datasets.
Description
  • Model Categories and Scope: The table summarizes the performance of 14 different machine learning models, categorized into two groups: Tree-based Ensemble (TE) models and Deep Learning (DL) models. TE models, such as Random Forest or XGBoost, combine the predictions of multiple decision trees to improve accuracy and robustness. DL models, like ResNet or MLP (Multilayer Perceptron), are based on artificial neural networks with multiple layers, capable of learning complex patterns. The comparison is based on their performance across 111 datasets.
  • Performance Metrics Used: Performance is reported using four metrics: '# Best' (the number of datasets out of 111 where a model achieved the top performance), 'Average Rank' (the average performance rank of the model across all datasets, lower is better), 'Median Rank' (the median performance rank, also lower is better), and '# in Top 3 Models' (the number of datasets where the model ranked among the top three performers). A minimal sketch of computing these ranking metrics appears after this list.
  • Top Performing Model: CatBoost (TE): The TE model 'CatBoost' ranks first, being the best performer on 19 out of 111 datasets. It has an average rank of 4.9 and a median rank of 4. It appeared in the top 3 models for 50 datasets.
  • Strong Performance of Other TE Models: Other TE models like 'LightGBM' (15 #Best, Avg Rank 5, Median Rank 4, #Top3 47) and 'H2O-GBM' (13 #Best, Avg Rank 7, Median Rank 6, #Top3 28) also perform strongly, occupying the top ranks.
  • Highest Ranked DL Models: The first DL model to appear in the ranking is 'AutoGluon-DL', which is 5th overall. It was the best model on 11 datasets, with an average rank of 6.6, a median rank of 7, and was in the top 3 for 32 datasets. 'ResNet' is the next DL model, with 10 #Best, an average rank of 7.5, and a median rank of 8.
  • Lower Performing DL Models: Some DL models performed poorly. 'FT-Transformer' and 'TabNet' both achieved '# Best' on 0 datasets. 'TabNet' had the worst performance metrics among the listed models, with an average rank of 13.1 and a median rank of 14.
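
For concreteness, the sketch below shows how the four ranking columns could be derived from a per-dataset rank table (lower rank = better). It is not the authors' code; the model names and ranks are placeholders, and with only three toy models the '# in Top 3' column is trivially saturated.

```python
# Minimal sketch of the Table 3 columns from a per-dataset rank table.
import pandas as pd

# ranks.loc[dataset, model] = rank of that model on that dataset (1 = best)
ranks = pd.DataFrame(
    {"CatBoost": [1, 2, 1, 2], "LightGBM": [2, 1, 3, 1], "TabNet": [3, 3, 2, 3]},
    index=["ds1", "ds2", "ds3", "ds4"],
)

summary = pd.DataFrame({
    "# Best":            (ranks == 1).sum(),
    "Average Rank":      ranks.mean().round(1),
    "Median Rank":       ranks.median(),
    "# in Top 3 Models": (ranks <= 3).sum(),  # trivial here with only 3 models
}).sort_values("# Best", ascending=False)
print(summary)
```
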
Scientific Validity
  • ✅ Robustness through Multiple Ranking Metrics: The use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) provides a robust way to compare model performance, as relying on a single metric can sometimes be misleading. This multifaceted approach is a strength.
  • ✅ Supports Reference Text: The table clearly supports the reference text's claim that it outlines the performance of TE and DL models. The data presented directly reflects this comparison.
  • ✅ Clear Indication of TE Model Dominance: The primary finding suggested by the ordering (TE models dominating the top ranks) is evident from the '# Best' column. This provides initial evidence for the broader argument that TE models often outperform DL models on these tabular datasets.
  • ✅ Broad Dataset Coverage: The table aggregates performance across 111 diverse datasets. This large number of datasets increases the generalizability of the findings compared to benchmarks with fewer datasets.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: The specific evaluation metric (e.g., AUC, RMSE, accuracy) used to determine which model is '# Best' or to calculate ranks is not explicitly stated in the table itself or its immediate caption. While Section 2.3 mentions evaluation metrics for regression and classification, and Section 3.1 refers to critical difference diagrams based on RMSE and accuracy, it's not explicitly clear if '# Best' in Table 3 is based on a single metric for each task type or some combined/averaged score. This ambiguity slightly limits the direct interpretation of what 'best' precisely means without referring to other parts of the text.
  • 💡 Lack of Statistical Significance Indicators: The table doesn't include any measure of statistical significance for the differences in ranks or '# Best' counts. While critical difference diagrams are mentioned later, incorporating some indication of significance or confidence intervals for average/median ranks within this table could strengthen the claims about relative model performance.
  • ✅ Representative Model Selection: The selection of models for TE and DL categories seems representative of commonly used and state-of-the-art algorithms in these classes for tabular data.
Communication
  • ✅ Clear Structure and Headers: The table is well-structured with clear column headers (# Best, Average Rank, Median Rank, # in Top 3 Models) that are standard for reporting model performance comparisons. The grouping of models into TE (Tree-based Ensemble) and DL (Deep Learning) is also clearly indicated.
  • ✅ Effective Ranking and Multiple Metrics: The models are ordered by '# Best', which provides an intuitive primary ranking. Supplementing this with average and median ranks, and '# in Top 3 Models' offers a more nuanced view of performance.
  • ✅ Use of Standard Terminology: The abbreviations TE and DL are defined in the reference text, which helps in understanding the 'Group' column. Model names like CatBoost, LightGBM, ResNet, TabNet are standard in the field.
  • ✅ Efficient Information Presentation: The table is concise and presents a large amount of comparative information efficiently. The main message about TE models generally outperforming DL models (based on '# Best') is quickly discernible.
  • 💡 Clarification of Ranking Metric Basis: While the table itself is clear, the specific performance metric used to determine '# Best', average/median rank (e.g., RMSE for regression, AUC for classification, or an overall combined metric) is not stated within the table or its immediate caption. This information is crucial for full interpretation and is mentioned in the text (Section 2.3 and later in Section 3.1 that RMSE and Accuracy are used for critical difference diagrams, but the basis for Table 3's ranking isn't explicitly tied to one specific metric here). Adding a note like 'Ranking based on primary metric X for classification and Y for regression, then aggregated' would improve self-containedness.
Table 4: Performance ranking of all models.
Figure/Table Image (Page 6)
First Reference in Text
Table 4 extends Table 3 as it presents the performance of all 20 models on the 111 datasets.
Description
  • Expanded Scope and Model Categorization: Table 4 expands on Table 3 by presenting the performance ranking of all 20 models evaluated in the study across 111 datasets. Models are categorized into 'Other', 'DL' (Deep Learning), and 'TE' (Tree-based Ensemble) groups. The 'Other' group includes models that may not fit strictly into TE or DL, or are automated machine learning (AutoML) systems like AutoGluon which can combine various model types.
  • Consistent Performance Metrics: The performance metrics are consistent with Table 3: '# Best' (number of datasets where the model was top-ranked), 'Average Rank', 'Median Rank', and '# in Top 3 Models'.
  • Top Performer: AutoGluon: 'AutoGluon' (categorized as 'Other') is the top-performing model, achieving the best performance on 39 out of 111 datasets. It has an average rank of 4.8, a median rank of 4, and appeared in the top 3 models for 58 datasets.
  • Performance of SVM: 'SVM' (Support Vector Machine, categorized as 'Other') is second in terms of '# Best' (10 datasets), but has a significantly worse average rank (12.4) and median rank (14) compared to AutoGluon, indicating it performs exceptionally well on a few datasets but is not consistently highly ranked.
  • Performance of DL and TE Models: DL models like 'ResNet' (7 #Best, Avg Rank 9.7) and TE models like 'CatBoost' (7 #Best, Avg Rank 6.6) follow in performance. CatBoost has a better average and median rank than ResNet despite the same '# Best' count.
  • Lower Performing Models: Several models, particularly some DL models, performed poorly. 'TabNet' (DL) is ranked last, achieving '# Best' on 0 datasets, with an average rank of 17.2 and a median rank of 18. 'FT-Transformer' (DL) also had 0 '# Best' and poor ranks.
  • Inclusion of Simpler Model Types: The table includes other model types like 'gplearn' (genetic programming), 'LR' (Linear/Logistic Regression), and 'KNN' (K-Nearest Neighbors), all categorized under 'Other', showing their relative standings among the more complex ensemble and deep learning methods.
Scientific Validity
  • ✅ Comprehensive Model Comparison: Presenting results for all 20 models, including those beyond just TE and DL, provides a more complete picture of the benchmark and is a methodological strength.
  • ✅ Strong Support for Textual Claims: The data strongly supports the textual claim that AutoGluon is the best-performing model on average and by '# Best'. It also supports the observation about SVM's performance characteristics (high '# Best' but poor average rank).
  • ✅ Appropriate Categorization of Models: The inclusion of the 'Other' category is necessary for models like AutoGluon (an AutoML system using various models) and classical ML algorithms (SVM, LR, KNN). This categorization is appropriate.
  • ✅ Robust Multi-Metric Evaluation: The consistent use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) across all models allows for a robust and nuanced comparison, which is a significant strength.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: As with Table 3, the exact definition of the metric(s) used to determine which model is '# Best' or to calculate the ranks is not explicitly stated in the table. Clarifying if this is based on a single specific metric (e.g., AUC for classification, RMSE for regression) or an aggregation would enhance rigor.
  • 💡 Lack of Statistical Significance Indicators: The table does not include measures of statistical significance for the differences in ranks or '# Best' counts. While the paper mentions critical difference diagrams elsewhere, adding confidence intervals or significance indicators for average/median ranks here could further solidify the conclusions drawn from this table.
  • ✅ Significant Finding Regarding AutoML: The finding that AutoGluon, an ensemble of ML and DL models, outperforms individual specialized models is a significant result and aligns with trends in automated machine learning.
Communication
  • ✅ Consistent and Clear Structure: The table maintains a clear and consistent structure with Table 3, using identical column headers for performance metrics, which aids in comparability and understanding.
  • ✅ Useful Model Grouping: The inclusion of a 'Group' column (Other, DL, TE) helps categorize the diverse set of 20 models, providing an additional layer of information for interpreting the rankings.
  • ✅ Effective Multi-Metric Presentation: The primary ranking by '# Best' clearly highlights top performers like AutoGluon. The supplementary ranks (Average, Median, # in Top 3) offer a more nuanced perspective, as noted in the text regarding SVM's high '# Best' but poorer average rank.
  • 💡 Clarification for 'Other' Group: The term 'Other' for model groups like AutoGluon and SVM is functional. However, since AutoGluon is described in the text as an ensemble of DL and ML models, a brief note or reminder of this in the caption or as a footnote could enhance clarity for readers focusing solely on the table.
  • 💡 Specification of Ranking Metric Basis: Similar to Table 3, the precise performance metric(s) used to determine '# Best' and calculate ranks (e.g., specific metrics for classification/regression, or an aggregated score) is not explicitly stated in the table or its immediate caption. This would improve the table's self-containedness.
  • ✅ Efficiently Conveys Key Findings: The table effectively conveys the dominance of AutoGluon and the relative performance of other models, including the poor performance of TabNet, which is consistent with the narrative.
Figure 1: Critical difference diagram for regression tasks based on RMSE. The...
Full Caption

Figure 1: Critical difference diagram for regression tasks based on RMSE. The best performing model is Auto-Gluon as lower RMSE scores indicate better performance.

Figure/Table Image (Page 6)
First Reference in Text
The results are also summarized using critical difference diagrams based on RMSE for regressions tasks (Fig. 1) and on accuracy for classifications tasks (Fig. 2).
Description
  • Diagram Type and Purpose: Figure 1 is a critical difference (CD) diagram, a visualization used to compare the performance of multiple machine learning models across a set of tasks (in this case, regression tasks). The models are positioned along a horizontal axis according to their average rank, calculated based on their Root Mean Squared Error (RMSE) – a measure of the differences between predicted and actual values, where lower values are better. The diagram shows 12 models or model variants.
  • Best Performing Model and Caption Discrepancy: The caption names 'Auto-Gluon' as the best-performing model, and 'AutoGluon' occupies the best (leftmost) position in the diagram with a plotted value of 10.3509. A separate variant, 'AutoGluon-DL', appears further down the ordering with a value of 6.3684. Because the figure distinguishes the two variants while the caption does not, it is unclear whether 'AutoGluon' or 'AutoGluon-DL' is the intended best performer.
  • Worst Performing Model and Rank Interpretation: 'Decision Tree' occupies the worst (rightmost) position with a plotted value of 3.9123. The plotted values appear to be average ranks, but their scale is inverted relative to the usual CD-diagram convention, in which a lower average rank indicates better performance: here the best-positioned model carries the highest value (10.3509) and the worst-positioned model the lowest (3.9123). Given the caption's statement that lower RMSE indicates better performance, the safest reading is that the ordering runs from best (left) to worst (right) and that the numbers are average ranks on an inverted or transformed scale.
  • Statistical Significance Groupings: Horizontal black bars connect groups of models whose performance is not statistically significantly different from each other. For example, 'AutoGluon' (rank 10.3509) is not statistically significantly different from 'CatBoost' (rank 9.1930) or 'TPOT' (rank 8.3509). Similarly, 'Decision Tree' (rank 3.9123), 'AdaBoost' (rank 4.0175), 'SVM' (rank 4.1228), and 'H2O-DL' (rank 4.3158) form a group that are not statistically significantly different from each other at the bottom end of performance.
  • Relative Model Performance: Models such as 'CatBoost' (9.1930), 'TPOT' (8.3509), 'Random Forest' (7.9298), 'XGBoost' (7.7018), and 'H2O-GBM' (7.1579) occupy the better-performing half of the ordering. 'AutoGluon-DL' (6.3684) sits in the lower-performing half, ahead of 'LR' (Linear Regression, 4.5789) but behind 'H2O-GBM', which again underscores the ambiguity about which AutoGluon variant the caption designates as best. A minimal sketch of how the average ranks underlying such a diagram can be computed appears after this list.
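
Because the paper does not state which post-hoc test underlies the critical difference bars, the sketch below covers only the unambiguous ingredients: ranking models per dataset by RMSE (lower is better), averaging those ranks, and running the global Friedman test. Model names and RMSE values are placeholders.

```python
# Minimal sketch: rank models per dataset by RMSE (lower is better), then
# average the ranks; the Friedman test checks for any global difference.
import pandas as pd
from scipy.stats import friedmanchisquare

rmse = pd.DataFrame(
    {
        "AutoGluon":    [0.80, 1.10, 0.70, 0.95],
        "CatBoost":     [0.90, 1.00, 0.80, 1.05],
        "DecisionTree": [1.40, 1.60, 1.20, 1.50],
    },
    index=["ds1", "ds2", "ds3", "ds4"],
)

avg_rank = rmse.rank(axis=1, method="average").mean()  # lower = better
print(avg_rank.sort_values())

stat, p = friedmanchisquare(*[rmse[m] for m in rmse.columns])
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")
```
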
Scientific Validity
  • ✅ Appropriate Visualization Method: Critical difference diagrams are an appropriate and widely accepted method for visualizing and comparing the performance of multiple algorithms across multiple datasets, especially when incorporating statistical significance (e.g., via the Nemenyi test or similar Friedman post-hoc tests).
  • ✅ Appropriate Base Metric (RMSE): The use of RMSE as the base metric for ranking models on regression tasks is standard and appropriate, as it penalizes larger errors more heavily.
  • ✅ Supports Claims of Performance Differences: The diagram visually supports the claim that there are statistically significant differences in performance among the models. For instance, AutoGluon is shown to be significantly better than models like Decision Tree or AdaBoost.
  • 💡 Significant Discrepancy/Clarity Issue with Best Model and Rank Values: The caption states "The best performing model is Auto-Gluon", and 'AutoGluon' is positioned leftmost (best) with a value of 10.3509, while 'AutoGluon-DL' (6.3684) is positioned worse than H2O-GBM. The plotted values are therefore ambiguous: if higher numbers indicate better performance (unconventional for ranks), AutoGluon is best; if lower numbers are better (the CD-diagram convention), Decision Tree (3.9123) would be best, contradicting both the caption and the visual ordering. Two points need clarification: whether AutoGluon or AutoGluon-DL is the best-performing variant, and how the numerical values next to each model map onto RMSE-based average ranks.
  • 💡 Missing Details on Statistical Test and Alpha Level: The specific statistical test used to calculate the critical difference (e.g., Nemenyi test, Bonferroni-Dunn test) and the significance level (alpha) are not mentioned in the caption or immediately with the figure. This information is essential for a full interpretation of the statistical significance bars and should be provided in the methods or figure caption.
  • 💡 Discrepancy in Number of Models Shown vs. Total Models: The diagram only includes 12 models/variants. Table 4 lists 20 models. It's unclear if this CD diagram is for a subset of models or if some models from Table 4 were excluded for regression tasks. Clarification on model inclusion would be helpful.
Communication
  • ✅ Appropriate Chart Type: The critical difference diagram is a standard and effective way to visualize pairwise statistical comparisons of multiple classifiers/regressors over multiple datasets. The horizontal axis representing average rank is clear.
  • ✅ Legible Labels and Values: Model names are generally legible, though some are slightly small. The average rank values associated with each model are clearly displayed.
  • ✅ Informative Caption: The caption clearly states the metric (RMSE for regression), identifies the best performer (AutoGluon), and explains the direction of better performance (lower RMSE is better). This makes the figure largely self-contained.
  • ✅ Clear Indication of Statistical Significance: The horizontal bars connecting groups of models whose performance is not statistically significantly different are a key feature of CD diagrams. These are present and visually connect the relevant models.
  • ✅ Good Axis Scaling: The range of average ranks (from approximately 4 to 10.5) is well-represented on the axis. The tick marks (1 to 12) are appropriate.
  • ✅ Good Use of Monochrome: The figure is monochrome, which is fine for this type of diagram and ensures accessibility. The lines are distinct enough.
  • 💡 Specify Alpha Level for Significance: It would be beneficial to state the alpha level (e.g., α = 0.05) used for the statistical tests (presumably Nemenyi or similar post-hoc tests) that determine the critical difference, either in the caption or the methods section linked to this figure. This would provide full context for the significance bars.
Figure 2: Critical difference diagram for classification tasks based on...
Full Caption

Figure 2: Critical difference diagram for classification tasks based on accuracy. The best performing model is AutoGluon as higher accuracy scores indicate better performance.

Figure/Table Image (Page 6)
First Reference in Text
The results are also summarized using critical difference diagrams based on RMSE for regressions tasks (Fig. 1) and on accuracy for classifications tasks (Fig. 2).
Description
  • Diagram Type and Evaluation Basis: Figure 2 is a critical difference (CD) diagram used to compare the performance of 14 machine learning models on classification tasks. Models are ranked based on their average performance using the accuracy metric, where higher accuracy indicates better performance. In a CD diagram, models are positioned along a horizontal axis according to their average rank; a lower average rank value signifies better performance.
  • Best Performing Model: The model 'AutoGluon' is identified as the best-performing model for classification tasks based on accuracy. It is positioned on the far right of the diagram, indicating the best average rank, which is 4.3519.
  • Worst Performing Model: The model 'gplearn' is shown as the worst-performing among the depicted models, positioned on the far left with the highest average rank of 11.3981.
  • Statistical Significance Groupings: Horizontal black bars connect groups of models where the differences in their average ranks are not statistically significant at a certain confidence level. For example, 'AutoGluon' (rank 4.3519), 'CatBoost' (rank 5.8241), 'LightGBM' (rank 5.8333), 'H2O-GBM' (rank 5.9630), 'TPOT' (rank 6.0463), 'AutoGluon-DL' (rank 6.2037), and 'XGBoost' (rank 6.4074) form a group at the top where many pairwise differences are not statistically significant. AutoGluon is statistically significantly better than models like 'H2O-DL' (rank 7.8704) and those performing worse.
  • Relative Model Performance Overview: The diagram visually presents the relative standings of various models. For instance, traditional models like 'Decision Tree' (rank 10.3704) and 'KNN' (rank 9.3241) are shown to perform worse on average than ensemble methods like 'CatBoost' or AutoML tools like 'AutoGluon' for these classification tasks based on accuracy.
Scientific Validity
  • ✅ Appropriate Visualization Method: Critical difference diagrams are a statistically sound and appropriate method for comparing multiple algorithms over multiple datasets, as recommended in machine learning literature.
  • 💡 Choice of Metric (Accuracy): Accuracy is a common metric for classification tasks. However, its suitability can be limited on imbalanced datasets, where metrics like AUC or F1-score might be more informative. The paper should ideally address or acknowledge dataset balance if relying solely on accuracy for ranking.
  • ✅ Supports Caption Claim: The diagram visually supports the caption's claim that AutoGluon is the best performing model for classification tasks based on accuracy, as it holds the best average rank.
  • 💡 Missing Details on Statistical Test and Alpha Level: The specific statistical test (e.g., Friedman test followed by Nemenyi post-hoc test) and the significance level (alpha) used to determine the critical difference for the connecting bars are not explicitly stated in the caption or figure context. This information is crucial for full methodological transparency.
  • 💡 Discrepancy in Number of Models Shown: The diagram shows 14 models. Table 4 in the paper lists 20 models in total. It would be beneficial to clarify if the 6 models not shown here were not applicable to classification tasks, not evaluated, or excluded for other reasons.
  • ✅ Alignment with Stated Purpose: The diagram effectively summarizes the relative performance and statistical distinguishability of models for classification tasks, aligning with the purpose stated in the reference text.
Communication
  • ✅ Appropriate Chart Type: The use of a critical difference diagram is an established and effective method for visually comparing multiple algorithms across datasets while incorporating statistical significance.
  • ✅ Legible Labels and Values: Model names and their associated average rank values are clearly displayed and legible. The horizontal axis representing average rank is also clear.
  • ✅ Informative Caption: The caption is informative: it specifies the task (classification), the metric (accuracy), identifies the best performing model (AutoGluon), and clarifies that higher accuracy scores are better. This helps in understanding the diagram's orientation.
  • ✅ Clear Indication of Statistical Significance: The horizontal bars that connect groups of models whose performances are not statistically significantly different are a key feature and are clearly depicted, aiding in the interpretation of pairwise comparisons.
  • ✅ Good Use of Monochrome: The diagram is monochrome, which is generally good for accessibility and clarity in this type of visualization. The lines are distinct.
  • 💡 Specify Alpha Level for Significance: To enhance completeness, explicitly state the alpha level (e.g., α = 0.05) used for the statistical tests that determine the critical difference. This could be in the caption or the relevant methods section.
  • ✅ Consistent Visual Ordering: The arrangement of models from worst (left) to best (right) is consistent with the caption stating higher accuracy is better, with AutoGluon (best) positioned on the right with the lowest average rank value.
Table 5: Performance metrics of TE and DL models for small datasets (< 1000)...
Full Caption

Table 5: Performance metrics of TE and DL models for small datasets (< 1000) rows.

Figure/Table Image (Page 7)
First Reference in Text
Table 5 follows the same line but includes only datasets with less than 1000 rows (36).
Description
  • Focus on Small Datasets: Table 5 presents a performance comparison of Tree-based Ensemble (TE) models and Deep Learning (DL) models specifically on a subset of 'small datasets'. These are defined as datasets having fewer than 1000 rows (individual records). The analysis is based on 36 such datasets.
  • Consistent Performance Metrics: The performance metrics used are the same as in previous tables: '# Best' (number of the 36 small datasets where the model was top-ranked), 'Average Rank', 'Median Rank', and '# in Top 3 Models'.
  • Top Performer: H2O-GBM (TE): For these small datasets, the TE model 'H2O-GBM' is the top performer based on '# Best', achieving the best results on 6 out of 36 datasets. It has an average rank of 5.6 and a median rank of 5, appearing in the top 3 models for 12 datasets.
  • Second Best Performer: ResNet (DL): The DL model 'ResNet' is the second-best performer by '# Best', achieving top rank on 5 datasets. It has an average rank of 6.7 and a median rank of 7. It was in the top 3 for 8 datasets.
  • Other Notable Performers: Other models showing relatively good performance on small datasets include 'CatBoost' (TE, 4 #Best, Avg Rank 5.1), 'AutoGluon-DL' (DL, 4 #Best, Avg Rank 6.8), and 'MLP' (DL, 4 #Best, Avg Rank 6.3). Notably, CatBoost has a better average and median rank than ResNet, AutoGluon-DL, and MLP, despite fewer '# Best' instances than ResNet.
  • Lower Performing DL Models on Small Datasets: Similar to the overall ranking, some DL models like 'FT-Transformer' and 'TabNet' performed poorly on these small datasets as well, both achieving '# Best' on 0 datasets and having the highest (worst) average and median ranks among the listed models (TabNet: Avg Rank 13.9, Median Rank 14).
  • Mix of TE and DL Models Among Top Ranks: The table shows a mix of TE and DL models among the top performers on small datasets, with a TE model (H2O-GBM) leading, but closely followed by DL models like ResNet, AutoGluon-DL, and MLP in terms of '# Best'.
Scientific Validity
  • ✅ Valid Sub-Analysis on Small Datasets: Analyzing model performance on a subset of small datasets is a valid and interesting approach, as model behavior can differ significantly based on data size. This provides insights into which models are more robust or suitable when data is limited.
  • ✅ Clear Definition and Sufficient Sample for Subset: The definition of 'small datasets' as those with < 1000 rows is clearly stated, and the number of such datasets (36) is provided, which is a reasonable sample size for this sub-analysis.
  • ✅ Robust Multi-Metric Evaluation for Subset: The use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) remains a strength for this subset analysis, providing a nuanced view of performance.
  • ✅ Potentially Interesting Findings for Small Data Regime: The results, showing H2O-GBM (TE) leading but with DL models like ResNet and AutoGluon-DL also performing well, suggest that the performance gap between TE and DL models might be different or less pronounced on smaller datasets compared to the overall trend. This is an interesting finding.
  • ✅ Data Supports Textual Claims (with minor specificity note): The text mentions 'H2O led the chart'. The table specifies 'H2O-GBM'. This is a minor point but ensuring exact model names are used consistently between text and table is good practice. The data in the table supports the general claim about H2O-GBM's strong performance.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: As with the other performance tables, the specific underlying performance metric(s) used for ranking (e.g., RMSE for regression parts of the 36 datasets, accuracy for classification parts) is not explicitly stated in the table context. This detail is important for full interpretation.
  • 💡 Lack of Statistical Significance Indicators for Subset: The table does not include statistical significance indicators for the differences in ranks within this subset. While perhaps less critical for a sub-analysis table if covered elsewhere for the main results, it would add rigor (a sketch of one such test follows this list).
  • ✅ Consistent Findings for Poorly Performing Models: The continued poor performance of models like TabNet and FT-Transformer even on small datasets reinforces findings from the overall analysis about these specific DL models.
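One common way to add such rigor is a Friedman test across datasets, optionally followed by pairwise Wilcoxon signed-rank comparisons. The sketch below is illustrative only, using randomly generated ranks for three models in place of the paper's actual results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset ranks for three models on 36 small datasets
rng = np.random.default_rng(0)
h2o_gbm = rng.integers(1, 15, size=36)
resnet = rng.integers(1, 15, size=36)
tabnet = rng.integers(5, 20, size=36)

# Friedman test: are the models' rank distributions distinguishable
# across the same set of datasets?
stat, p_value = stats.friedmanchisquare(h2o_gbm, resnet, tabnet)
print(f"Friedman chi2 = {stat:.2f}, p = {p_value:.4f}")

# Pairwise follow-up, e.g. Wilcoxon signed-rank for one pair of models
w_stat, w_p = stats.wilcoxon(h2o_gbm, resnet)
print(f"Wilcoxon (H2O-GBM vs ResNet): W = {w_stat:.1f}, p = {w_p:.4f}")
```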
Communication
  • ✅ Consistent Table Structure: The table structure is consistent with previous performance tables (Table 3 and 4), using the same column headers (# Best, Average Rank, Median Rank, # in Top 3 Models) and model grouping (TE, DL). This consistency aids readability and comparison across different dataset subsets.
  • ✅ Clear Definition of Data Subset: The caption clearly specifies the subset of data being analyzed: 'small datasets (< 1000) rows'. The reference text further clarifies that this subset comprises 36 datasets.
  • ✅ Standard Terminology and Ranking: Model names and group abbreviations (TE, DL) are standard and should be familiar to the target audience. The ranking by '# Best' provides an immediate understanding of top performers within this subset.
  • ✅ Efficient Presentation for Subset Analysis: The table effectively presents a focused analysis on small datasets, allowing for quick comparison of how TE and DL models perform under this specific condition.
  • 💡 Clarification of Ranking Metric Basis for Subset: Similar to previous tables, the specific performance metric(s) (e.g., RMSE, accuracy) used to determine '# Best' and ranks for these 36 small datasets is not explicitly stated in the table or its immediate caption. Adding this detail would enhance the table's self-containedness.
  • ✅ Consistency with Textual Summary: The text mentions H2O leading the chart, followed by ResNet. The table shows H2O-GBM (a TE model) leading with 6 '# Best'. ResNet (DL) follows with 5 '# Best'. This is consistent. The statement that H2O led the chart could be more specific (H2O-GBM).
Table 6: Coefficients of a logistic regression for predicting the probability...
Full Caption

Table 6: Coefficients of a logistic regression for predicting the probability that DL outperforms ML.

Figure/Table Image (Page 8)
Table 6: Coefficients of a logistic regression for predicting the probability that DL outperforms ML.
First Reference in Text
We now present the results of the prediction of whether ML or DL will perform better in each dataset. Table 6 presents the coefficients of a logistic regression for this classification task.
Description
  • Purpose of the Logistic Regression Model: Table 6 displays the results of a logistic regression model. This type of model is used to predict the probability of a binary outcome – in this case, whether Deep Learning (DL) models are likely to outperform traditional Machine Learning (ML) models on a given dataset. The table lists various characteristics of datasets (Variables) and their associated 'Coefficients', 'Standard Errors', 'z-values', and 'P-values' (P>|z|).
  • Interpretation of Coefficients: A 'Coefficient' in logistic regression indicates the change in the log-odds of the outcome (DL outperforming ML) for a one-unit increase in the predictor variable, holding other variables constant. A positive coefficient suggests that an increase in the variable increases the likelihood of DL outperforming ML, while a negative coefficient suggests the opposite (a worked conversion to odds ratios appears after this list).
  • Statistical Terms Explained: The 'Standard Error' measures the variability of the coefficient estimate. The 'z-value' is the coefficient divided by its standard error, used to test if the coefficient is significantly different from zero. The 'P-value' (P>|z|) indicates the probability of observing the data if the coefficient were truly zero; a small p-value (typically < 0.05) suggests the coefficient is statistically significant.
  • Intercept Value: The 'Intercept' has a coefficient of -0.8751 (p=0.0002), indicating the baseline log-odds of DL outperforming ML when all other predictor variables are zero.
  • Impact of Task Type (Classification/Regression): The variable 'Classification/Regression' has a positive coefficient of 0.5563 (p=0.0317). This suggests that classification tasks (presumably coded as 1, with regression coded as 0) have higher odds of DL outperforming ML than regression tasks.
  • Impact of Kurtosis: 'Kurtosis', a measure of the 'tailedness' of a feature's distribution, has a positive coefficient of 0.8975 (p=0.0244). This implies that datasets with higher average kurtosis in their features are more likely to see DL models outperform ML models.
  • Impact of Feature-Target Correlation: 'Average Pearson to Target Feature' has a positive coefficient of 0.5247 (p=0.0787). This suggests a borderline significant positive association: datasets where features have a higher average Pearson correlation with the target variable might favor DL models.
  • Impact of PCA Components: 'PCA' (Principal Component Analysis, likely representing the number of components to explain variance) has a positive coefficient of 0.4624 (p=0.0808). This suggests a borderline significant positive association: datasets requiring more PCA components might favor DL models.
  • Non-Significant Variables: Variables like 'Row count' (coefficient -0.0195, p=0.9727) and 'Columns after One Hot Encoding' (a technique to convert categorical data into a numerical format; coefficient -0.5745, p=0.1457) do not show a statistically significant relationship with DL outperforming ML at the conventional α=0.05 level.
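For readers less used to log-odds, the coefficients above can be translated into odds ratios via exponentiation. The short example below uses the coefficient values quoted in this description; the interpretation in the final comment is the standard reading of an odds ratio, not an additional result from the paper.

```python
import math

# Coefficients as quoted from Table 6 in the description above
coefficients = {
    "Classification/Regression": 0.5563,
    "Kurtosis": 0.8975,
    "Average Pearson to Target Feature": 0.5247,
    "PCA": 0.4624,
}

for name, beta in coefficients.items():
    odds_ratio = math.exp(beta)
    print(f"{name}: coefficient={beta:+.4f} -> odds ratio={odds_ratio:.2f}")

# e.g. exp(0.5563) ~= 1.74: a classification task (vs. regression) multiplies the
# odds that DL outperforms ML by roughly 1.74, other variables held fixed.
```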
Scientific Validity
  • ✅ Appropriate Methodological Choice: Using logistic regression to model the probability of a binary outcome (DL outperforming ML vs. not) based on dataset characteristics (meta-features) is an appropriate methodological choice for this type of meta-analysis.
  • ✅ Inclusion of Significance Indicators: The inclusion of standard errors, z-values, and p-values allows for an assessment of the statistical significance of each predictor, which is crucial for interpreting the model.
  • ✅ Identification of Significant Predictors: The identification of 'Classification/Regression' and 'Kurtosis' as statistically significant predictors (at α=0.05) provides concrete, data-driven insights into factors that might influence the relative performance of DL vs. ML models. 'Average Pearson to Target Feature' and 'PCA' are borderline significant (p < 0.10) and could also be considered indicative.
  • 💡 Potential Concern: Events Per Variable (EPV): The number of observations (datasets, N=111) used to fit this logistic regression model is not explicitly stated in the table but is mentioned earlier in the paper. With around 20 predictor variables, the events per variable (EPV) ratio might be a concern if the number of instances where DL outperforms ML is small. This could affect model stability and generalizability. Authors should consider this or report on model diagnostics.
  • 💡 Considerations for Predictor Interpretation and Multicollinearity: The interpretation of coefficients for variables like 'Columns after One Hot Encoding' or 'Rows over Columns after One Hot Encoding' needs care, as these are derived features. Multicollinearity among predictors could also influence coefficient estimates and their standard errors. While not evident from the table alone, it's a standard consideration in regression modeling (a sketch of a variance-inflation-factor check follows this list).
  • 💡 Missing Overall Model Fit Statistics: The table presents coefficients, but measures of the overall model fit (e.g., pseudo-R-squared, Hosmer-Lemeshow test, AUC of this meta-model) are not included in this table. The text mentions an AUC of 0.78 for a different H2O-DL model on a restricted dataset for a similar task, and an AUC of 0.68 for a logistic regression on that restricted set. The overall fit of this specific logistic regression model (on all 111 datasets if that's the case) would be valuable information.
  • ✅ Reasonable Selection of Predictor Variables: The selection of meta-features as predictors seems reasonable, covering various aspects of dataset size, dimensionality, feature characteristics, and feature-target relationships.
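A straightforward way to screen for multicollinearity is to compute variance inflation factors (VIFs) for the meta-features. The sketch below is illustrative only: the meta-feature values are randomly generated stand-ins, and the column names merely echo a few predictors from Table 6.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical meta-feature matrix: one row per dataset, one column per predictor
rng = np.random.default_rng(42)
meta_features = pd.DataFrame({
    "row_count": rng.lognormal(8, 2, size=111),
    "columns_after_ohe": rng.lognormal(3, 1, size=111),
    "kurtosis": rng.normal(5, 3, size=111),
    "avg_pearson_to_target": rng.uniform(0, 1, size=111),
})

X = sm.add_constant(meta_features)  # VIF is computed against a model with an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # rule of thumb: VIF > 5-10 flags problematic collinearity
```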
Communication
  • ✅ Clear and Standard Structure: The table is well-structured with standard columns for logistic regression output: Variable, Coefficient, Std Error, z-value, and P>|z|. This makes it easy for those familiar with regression outputs to interpret.
  • ✅ Descriptive Variable Names: Variable names are generally descriptive (e.g., 'Row count', 'Kurtosis', 'Average Entropy'). Some, like 'Cancor', might be less universally known without prior context from the meta-feature description (Table 8 in the paper, though not provided here). 'PCA' is standard.
  • ✅ Clear Indication of Statistical Significance: The inclusion of p-values (P>|z|) allows for quick identification of statistically significant predictors at conventional alpha levels (e.g., 0.05).
  • ✅ Informative Caption: The caption clearly states the purpose of the logistic regression model and what the coefficients represent (predicting the probability that DL outperforms ML).
  • ✅ Appropriate Numerical Precision: The precision of the reported values (e.g., four decimal places for coefficients and p-values) is adequate for this type of analysis.
  • 💡 Explicit Definition of Positive Class: While the table is largely self-contained, a brief note defining the positive class (i.e., what outcome a positive coefficient predicts – presumably DL outperforming ML) in the caption could be helpful, though it's implied by the caption's wording.
  • 💡 Unconventional P-value Column Header: The p-value column header is rendered as 'P¿-z-', which is unconventional and likely a typesetting artifact. Standard notation is 'P>|z|' or 'p-value'; this should be corrected for clarity.
Figure 3: The effect of various factors on the probability that DL outperforms...
Full Caption

Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps are generated using the prediction of the logistic regression models. The scatter plot represents the actual observations of the datasets. (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance and number of rows.

Figure/Table Image (Page 9)
Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps are generated using the prediction of the logistic regression models. The scatter plot represents the actual observations of the datasets. (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance and number of rows.
First Reference in Text
Based on the predictions of the logistic regression model, we provide further insights into the most influential factors for TE versus DL performance. Fig. 3 presents heatmaps of four dataset's configurations and their influence on the probability that a DL model would outperform the ML model for a given dataset, including (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance.
Description
  • Overall Figure Structure and Content: Figure 3 consists of four panels (a, b, c, d), each displaying a heatmap. These heatmaps illustrate the predicted probability that Deep Learning (DL) models will outperform traditional Machine Learning (ML) models, based on a logistic regression model. The probability is shown as a color gradient, typically with reddish colors indicating higher probability and bluish colors lower probability. Overlaid on each heatmap are scatter points representing actual datasets: red points indicate datasets where DL outperformed ML, and blue points where ML outperformed DL (or performed equally well). A sketch of how such a heatmap can be generated from a fitted logistic regression is given after this list.
  • Panel (a): Columns vs. Rows: Panel (a) examines the impact of the 'Number of Columns' (features, on a logarithmic y-axis, roughly 1 to 100+) versus the 'Number of Rows' (samples, on a logarithmic x-axis, roughly 10 to 100,000+). The heatmap suggests a higher probability of DL outperforming ML for datasets with a small number of rows and a large number of columns. This effect appears to diminish as the number of rows increases. The probability scale seems to range from approximately 0.0 to 0.40. Most scatter points are blue, indicating ML often performs better.
  • Panel (b): Categorical vs. Numerical Features: Panel (b) shows the influence of '# Categorical Features' (features with discrete categories, on a logarithmic y-axis, roughly 0 to 100+) versus '# Numerical Features' (features with continuous values, on a logarithmic x-axis, roughly 0 to 100+). The heatmap indicates a higher probability of DL outperforming ML for datasets with a high number of categorical features, particularly when the number of numerical features is also high, or with a high number of numerical features when categorical features are few. The probability here can exceed 0.5 (up to ~0.9 according to the color bar), and there are more red scatter points in the higher probability regions.
  • Panel (c): X-Kurtosis vs. Rows: Panel (c) displays the effect of 'X-Kurtosis' (a statistical measure of the 'tailedness' of feature distributions, on a linear y-axis, roughly -2000 to 8000) versus '# Rows' (logarithmic x-axis). The heatmap suggests that higher X-Kurtosis values are associated with a higher probability of DL outperforming ML, relatively independent of the number of rows, although the effect seems stronger for fewer rows. The probability scale appears to range from about 0.2 to 0.8.
  • Panel (d): PCA Components vs. Rows: Panel (d) illustrates the role of 'PCA Components' (number of principal components needed to retain 99% variance, a measure of intrinsic dimensionality, on a logarithmic y-axis, roughly 0 to 100+) versus '# Rows' (logarithmic x-axis). The heatmap indicates a higher probability of DL outperforming ML for datasets with a larger number of PCA components, especially when the number of rows is small. The probability scale seems to range from about 0.2 to 0.8.
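For readers who want to reproduce this style of plot, the sketch below evaluates a fitted logistic regression over a 2-D grid of two meta-features and overlays the observed outcomes, in the spirit of panel (a). All data, variable names (X_meta, y_meta, meta_clf), and axis choices are hypothetical stand-ins, not the authors' pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Hypothetical meta-features: [log10(# rows), log10(# columns)] per dataset, and a
# 0/1 label indicating whether DL outperformed ML on that dataset.
rng = np.random.default_rng(0)
X_meta = np.column_stack([rng.uniform(1, 5, 111), rng.uniform(0.3, 2.5, 111)])
y_meta = (rng.random(111) < 0.3).astype(int)

meta_clf = LogisticRegression().fit(X_meta, y_meta)

# Evaluate predicted P(DL outperforms ML) on a grid of the two meta-features
rows_grid, cols_grid = np.meshgrid(np.linspace(1, 5, 200), np.linspace(0.3, 2.5, 200))
grid = np.column_stack([rows_grid.ravel(), cols_grid.ravel()])
proba = meta_clf.predict_proba(grid)[:, 1].reshape(rows_grid.shape)

plt.pcolormesh(rows_grid, cols_grid, proba, cmap="coolwarm", shading="auto")
plt.colorbar(label="Predicted probability of DL outperforming ML")
# Overlay the actual outcomes, as in the scatter points of Figure 3
plt.scatter(X_meta[:, 0], X_meta[:, 1], c=y_meta, cmap="coolwarm", edgecolors="k", s=15)
plt.xlabel("log10(# Rows)")
plt.ylabel("log10(# Columns)")
plt.show()
```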
Scientific Validity
  • ✅ Appropriate Visualization of Logistic Regression Predictions: Visualizing the output of a logistic regression model (which predicts probabilities) as a heatmap is an appropriate technique to understand the influence of pairs of input variables (dataset characteristics) on the outcome (probability of DL outperforming ML).
  • ✅ Comparison of Predicted Probabilities with Actual Outcomes: Overlaying the actual outcomes (scatter points where DL did or did not outperform ML) on the predicted probability heatmap allows for a qualitative assessment of the logistic regression model's predictive capability in different regions of the feature space. It helps to see if high-probability regions indeed have more 'DL wins' and vice-versa.
  • ✅ Relevant Factor Combinations Explored: The selection of variable pairs for each panel (rows/cols, numerical/categorical features, kurtosis/rows, PCA/rows) seems to target potentially influential factors based on common understanding in machine learning and findings from Table 6 (e.g., kurtosis, PCA).
  • ✅ Visuals Support Textual Interpretation for Panel (a): The textual interpretation in Section 3.2 (page 7) regarding panel (a) states: 'for a small number of rows, increasing the number of columns results in a higher probability that the DL model would outperform an ML model. However, this effect decreases relatively quickly as the number of rows increases.' This is well supported by the heatmap in panel (a). The text also notes 'the probability does not increase over 0.5 which indicates no configuration found where DL models would outperform ML models, on average.' The heatmap in (a) visually supports this, with max probability around 0.4.
  • ✅ Visuals Support Textual Interpretation for Panel (b) regarding feature counts: The textual interpretation for panel (b) suggests 'a smaller number of rows and a higher number of columns increase the probability that DL models outperform ML models. Interestingly, this heatmap reveals configurations where the probability is higher than 0.5.' Panel (b) does show regions with probability above 0.5, particularly at high categorical and/or numerical feature counts. However, the number of rows does not appear on this panel's axes, so the row-count part of the claim must draw on the other panels; panel (b) itself only relates the two feature counts.
  • 💡 Dependence on Underlying Meta-Model Performance: The heatmaps are based on the logistic regression model's predictions. The reliability of the patterns shown in the heatmaps is therefore dependent on the goodness-of-fit and predictive power of that underlying logistic regression model. While Table 6 provides coefficients, the overall performance of that specific meta-model (e.g., its AUC on all 111 datasets) is not directly presented alongside these visualizations, which would add context to how much confidence to place in these predicted probability surfaces (a sketch of a cross-validated AUC evaluation follows this list).
  • 💡 Interpretation of Probability Threshold: The interpretation that 'DL models would outperform ML models, on average' when probability is < 0.5 (as stated for panel a in the text) is a specific interpretation of the probability threshold. The heatmaps themselves just show the predicted probability; the decision threshold for claiming one 'outperforms' on average isn't inherent to the heatmap.
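One way to supply that context would be a cross-validated AUC of the meta-model on all 111 datasets. The sketch below illustrates such an evaluation with scikit-learn; the meta-feature matrix and labels are randomly generated placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 111-dataset meta-feature matrix and the
# binary label "DL outperformed ML" (see the earlier heatmap sketch).
rng = np.random.default_rng(0)
X_meta = rng.normal(size=(111, 20))           # ~20 meta-features per dataset
y_meta = (rng.random(111) < 0.3).astype(int)  # imbalanced outcome

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc_scores = cross_val_score(model, X_meta, y_meta, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc_scores.mean():.2f} +/- {auc_scores.std():.2f}")
```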
Communication
  • ✅ Effective Multi-Panel Layout: The multi-panel layout (a, b, c, d) effectively presents different two-way interactions of dataset characteristics on the predicted probability. Each panel is clearly labeled.
  • ✅ Appropriate Visualization Choices (Heatmap with Scatter Overlay): The use of heatmaps to show the predicted probability from the logistic regression model is appropriate, providing a visual gradient of the effect. Overlaying scatter plots of actual dataset outcomes (DL performs better vs. ML performs better) allows for a direct comparison between model predictions and reality.
  • ✅ Clear Legends and Color Scales: The color scale for the heatmap (probability) is consistent across panels where applicable (though the actual range of probabilities shown differs), and the legend for scatter points (DL vs. ML) is clear. Axis labels are generally clear, though some y-axis labels like '# Cols' or '# Categorical Features' are abbreviated but understandable in context.
  • ✅ Informative Overall Caption: The caption provides a good overview of the figure's content and clearly identifies what each panel represents. It also explains the source of the heatmaps and scatter plots.
  • ✅ Appropriate Use of Logarithmic Scales: Logarithmic scales are used for axes representing counts (rows, columns, features, PCA components) in panels (a), (b), and (d), which is appropriate for data spanning several orders of magnitude. Panel (c) uses a linear scale for X-Kurtosis, which seems appropriate given its range.
  • 💡 Precision of Color Bar Range Across Panels: The color bar label 'Predicted Probability of DL Outperforming ML' is clear. However, the exact numerical range of this probability seems to vary slightly between panels (e.g., panel (a) up to ~0.4, panel (b) up to ~0.9). While the color mapping is relative within each panel, ensuring the color bar accurately reflects the specific range displayed in each panel, or using a common scale if intended, would enhance precision.
  • 💡 Scatter Point Visibility in Dense Regions: The scatter points are somewhat small and can be dense in certain regions, potentially obscuring some individual points or the underlying heatmap in those areas. Slightly larger points or using transparency might improve visibility without excessive clutter.
Table 7: Comparison of model performance on original and sampled datasets
Figure/Table Image (Page 10)