A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

Assaf Shmuel, Oren Glickman, Teddy Lazebnik
arXiv: 2408.14817v1
Department of Computer Science, Bar Ilan University, Israel

Overall Summary

Study Background and Main Findings

This paper addresses the tendency of deep learning (DL) models to underperform traditional machine learning (ML) methods, particularly tree-based ensembles, on tabular datasets (data organized in rows and columns). The authors conducted a comprehensive benchmark involving 111 datasets and 20 different models, covering both regression (predicting a continuous value) and classification (predicting a category) tasks. The datasets varied in size and included those with and without categorical features (features that take on values from a limited set of categories).

The benchmark results confirmed the general superiority of ML models, especially tree-based ensembles like CatBoost and LightGBM, on tabular data. However, the study also found that DL models performed better on certain types of datasets. A key contribution of the paper is the development of a predictive model that can identify scenarios where DL models are likely to outperform traditional ML models, achieving 86.1% accuracy. This predictive model was built using a meta-learning approach, where each dataset was represented by a set of characteristics ('meta-features'), such as the number of rows and columns, statistical properties like kurtosis (a measure of how 'tailed' a distribution is), and relationships between features and the target variable.
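
The paper's exact meta-learning pipeline is not reproduced here, but its shape can be illustrated with a minimal sketch, assuming one row of meta-features per dataset (the study uses the 20 features listed in Table 8) and a binary label recording whether a DL model beat the best ML model on that dataset. The column names and values below are hypothetical placeholders, not the authors' data.

```python
# Minimal sketch of the meta-learning setup (illustrative values only):
# one row per dataset, columns are meta-features, label = did DL beat ML?
import pandas as pd
from sklearn.linear_model import LogisticRegression

meta = pd.DataFrame({
    "n_rows":            [4720, 150000, 800, 23000, 560, 9100],
    "n_cols":            [13, 45, 60, 9, 80, 22],
    "avg_kurtosis":      [6.4, 1.2, 92.0, 3.1, 48.5, 0.7],
    "is_classification": [1, 0, 1, 1, 1, 0],
    "dl_outperforms":    [0, 0, 1, 0, 1, 0],   # hypothetical labels
})

X = meta.drop(columns="dl_outperforms")
y = meta["dl_outperforms"]

# The study reports ~86% accuracy for this kind of meta-model over 111
# datasets; with a toy table the fitted coefficients are only illustrative.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, clf.coef_[0].round(3))))
```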

The meta-analysis revealed that DL models tend to perform better on smaller datasets with a larger number of columns, and on datasets with high kurtosis values. The study also found that DL models have a relative advantage in classification tasks compared to regression tasks. These findings were supported by both a logistic regression model and a symbolic regression model, which generated an explicit formula for predicting DL outperformance based on dataset characteristics. To further investigate the impact of dataset size, the authors conducted an experiment where they downsampled larger datasets to 1000 training samples. While some DL models improved in ranking on these smaller datasets, tree-based ensembles generally continued to dominate.

The authors discuss their findings in the context of previous benchmarks, noting that their results generally align with prior observations about the relative performance of ML and DL models on tabular data. They acknowledge that dataset size is just one of many factors influencing model performance and emphasize the importance of considering other dataset characteristics, such as kurtosis and task type, when choosing between ML and DL models. They also highlight the strong performance of AutoGluon, an automated ML framework that combines both ML and DL models, suggesting its potential as a robust solution for tabular data problems.

Research Impact and Future Directions

This comprehensive benchmark significantly advances our understanding of machine learning (ML) and deep learning (DL) model performance on tabular data. By evaluating 20 diverse models across 111 datasets, the study confirms the general superiority of traditional ML, especially tree-based ensembles, while also identifying specific conditions where DL models can excel. The development of a predictive model, achieving 86.1% accuracy in identifying scenarios favoring DL, is a key contribution, offering practitioners a valuable tool for informed model selection.

While the benchmark's breadth is a strength, the diversity of datasets might introduce noise and limit the depth of analysis within specific data types. Additionally, the focus on accuracy, while common, should be interpreted cautiously, especially for potentially imbalanced datasets where other metrics like AUC or F1-score offer a more robust evaluation. Future work could explore feature engineering's impact and extend the benchmark to other tabular data tasks like time-series analysis.

Despite these limitations, the study provides valuable insights into the complex interplay of dataset characteristics and model performance. The identification of task type (classification vs. regression) and kurtosis as statistically significant predictors of DL's relative success offers actionable guidance. The mechanistic explanations for these findings, while plausible, warrant further investigation to confirm the proposed underlying mechanisms. Overall, this benchmark establishes a strong foundation for future research and provides practitioners with a data-driven approach to model selection for tabular data, moving beyond the 'safe bet' of traditional ML towards a more nuanced understanding of when DL can offer superior performance.

Critical Analysis and Recommendations

Clear Problem Definition and Objective (written-content)
The abstract clearly articulates the research gap (DL underperformance on tabular data) and the study's objective (comprehensive benchmark). This clarity immediately orients the reader to the paper's purpose and significance.
Section: Abstract
Emphasis on Benchmark Comprehensiveness (written-content)
The abstract emphasizes the benchmark's scale (111 datasets, 20 models) and diversity (regression/classification tasks, varying scales, categorical variables). This breadth strengthens the potential generalizability and impact of the findings.
Section: Abstract
Novel Predictive Outcome with Quantified Performance (written-content)
The abstract highlights the novel predictive model and quantifies its performance (86.1% accuracy, 0.78 AUC). This tangible outcome adds practical value and demonstrates a clear contribution.
Section: Abstract
Clear Problem Statement and Motivation (written-content)
The introduction effectively establishes the central problem (ML generally outperforming DL on tabular data) and motivates the need for the study. This clear framing immediately engages the reader.
Section: 1 Introduction
Effective Summary of Prior Work (written-content)
The introduction concisely summarizes relevant prior work, providing context and demonstrating awareness of the existing literature. This background sets the stage for the paper's contributions.
Section: 1 Introduction
Missing Data Handling (written-content)
The paper lacks details on how missing values in datasets were handled. This omission hinders reproducibility and could introduce bias, as different imputation strategies can significantly affect model performance.
Section: 2 Experimental setup
Justification for Meta-Learning Target Metrics (written-content)
The justification for using only RMSE (regression) and AUC (classification) for the meta-learning target variable, while other metrics were collected, is missing. This choice directly impacts the meta-learning outcome and should be explicitly rationalized.
Section: 2 Experimental setup
Dataset Comparison Table (Table 1) (graphical-figure)
Table 1 effectively compares the current study's dataset characteristics with previous benchmarks, highlighting its comprehensiveness and diversity. This contextualization strengthens the paper's contribution.
Section: 2 Experimental setup
Descriptive Statistics Table (Table 2) (graphical-figure)
Table 2 provides descriptive statistics of the dataset variables, demonstrating the diversity of the benchmark. This information is crucial for understanding the scope and generalizability of the findings.
Section: 2 Experimental setup
Logistic Regression Results (Table 6) (graphical-figure)
Table 6 presents the logistic regression coefficients for predicting DL outperformance. The inclusion of p-values allows for assessing statistical significance, strengthening the conclusions about influential factors.
Section: 3 Results
Heatmaps of DL Outperformance Probability (Figure 3) (graphical-figure)
Figure 3 visualizes the impact of various factors on DL outperformance probability. These heatmaps provide valuable insights into complex interactions, though the interpretation requires careful consideration of probability thresholds.
Section: 3 Results
Effective Synthesis with Prior Work (written-content)
The discussion effectively synthesizes the findings with prior work, providing context and highlighting agreements or disagreements. This strengthens the paper's contribution to the field.
Section: 4 Discussion
Mechanistic Explanations for Findings (written-content)
The discussion provides plausible explanations for the observed influence of factors like task type and kurtosis on DL performance. These mechanistic interpretations enhance understanding, though further validation is needed.
Section: 4 Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

1 Introduction

Key Aspects

Strengths

Suggestions for Improvement

2 Experimental setup

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 1: Comparison with Previous Studies
Figure/Table Image (Page 3)
First Reference in Text
None of the datasets in this benchmark allowed a perfect prediction with all models. In Table 1 we compare the characteristics of our datasets with previous works.
Description
  • Study Comparison Overview: The table compares the current study ('Ours') with five previous studies (referenced as [1], [26], [2], [3], [29]) across several characteristics of the datasets and models used. For 'Ours', 20 models were evaluated.
  • Number of Datasets: The current study ('Ours') utilized 54 classification datasets and 57 regression datasets, totaling 111 datasets. This is compared to, for example, study [1] which used 22 classification and 33 regression datasets, and study [29] which used 181 classification and 119 regression datasets.
  • Median Dataset Size: The median dataset size for 'Ours' is 5k (presumably 5,000 instances/rows). This is smaller than study [1] (17k), [2] (11k), [3] (20k), and [29] (12k), but larger than [26] (2k).
  • Median Number of Features: The median number of features (columns in a dataset) in 'Ours' is 13. This is comparable to study [1] (13), but lower than [26] (21), [2] (32), [3] (21), and [29] (21).
  • Median Number of Categorical Features: The median number of categorical features (features whose values are from a limited, fixed set of categories, like 'color' with values 'red', 'blue', 'green') in 'Ours' is 4. This is higher than all listed previous studies, where the median ranged from 0 to 2.
  • Meta-Analysis Inclusion: The 'Meta-Analysis' column indicates whether the study performed a subsequent analysis on the results to understand when certain models perform better. 'Ours' is marked 'Yes', study [26] is 'Partial**', and the others are 'No'. A meta-analysis in this context refers to studying the study results themselves to find patterns.
Scientific Validity
  • ✅ Contextualization of the study: The table provides a useful context by comparing the scope of the current work (number of datasets, models, types of features) against several previous benchmarking studies. This helps in positioning the contribution of the paper.
  • 💡 Representativeness of Previous Studies: The selection of previous studies ([1], [26], [2], [3], [29]) appears relevant as they are also benchmarking studies in the domain. However, without knowing the broader literature, it's hard to ascertain if this is a fully representative set or if there's any selection bias towards studies that make 'Ours' look more comprehensive in certain aspects.
  • ✅ Relevance of Comparison Metrics: The chosen metrics for comparison (number of models, datasets, dataset size, feature counts, inclusion of categorical features, meta-analysis) are pertinent for evaluating the comprehensiveness of a benchmarking study in machine learning on tabular data.
  • ✅ Highlighting a distinguishing factor (Meta-Analysis): The claim that the current study performs a 'Meta-Analysis' ('Yes') while others did 'No' or 'Partial' highlights a distinguishing factor. The footnote for 'Partial**' for study [26] clarifies its scope, which is good. The strength of this claim depends on the depth and novelty of the meta-analysis performed in 'Ours', which isn't detailed in the table itself.
  • ✅ Supports claims of distinct characteristics: The table data supports the notion that 'Ours' includes a notable number of datasets with a higher median of categorical features compared to the cited studies, and uniquely claims a full meta-analysis. It also shows 'Ours' uses a median dataset size of 5k, which is smaller than most others listed, and a median feature count of 13, also smaller or comparable. This provides a balanced view, not just highlighting areas of greater scope.
  • 💡 Clarification of units ('k'): The term 'k' for dataset size is standard but confirming it means thousands of rows/instances would be ideal, perhaps in a general methods section if not in the caption.
Communication
  • ✅ Clear tabular format: The table effectively uses a standard tabular format to present a comparative overview, making it easy to contrast the current study with previous works across several key characteristics.
  • 💡 Column header clarity: The column headers are generally understandable. However, 'Median Dataset Size' could be more specific, e.g., 'Median Dataset Size (Rows)' or 'Median Dataset Size (Instances)', although 'k' typically implies thousands of rows in this context. 'Median # of Categorical' is clear but could be written as 'Median # Categorical Features' for full explicitness.
  • ✅ Use of footnotes for clarification: The footnotes '*' and '**' are present and explained below the table, which is good practice. Ensuring the footnote text is immediately discoverable and clearly phrased is important.
  • 💡 Clarity of 'Meta-Analysis' column entries: The 'Meta-Analysis' column uses terms like 'No', 'Partial**', and 'Yes'. While '**' is explained, a brief definition of what constitutes a 'Yes' or 'Partial' meta-analysis directly in the caption or as a general footnote for this column could enhance self-containedness, even if detailed elsewhere.
  • ✅ Good integration with text: The table is directly referenced and its purpose (comparing characteristics with previous works) is stated in the text, aiding comprehension.
Table 2: Descriptive Statistics of the Dataset Variables
Figure/Table Image (Page 4)
First Reference in Text
Overall, we included 111 datasets in this study: 57 regression datasets and 54 classification datasets. Table 2 summarizes the main parameters of the datasets.
Description
  • Overall Purpose: The table presents descriptive statistics for various characteristics of the 111 datasets used in the study. These characteristics are treated as variables themselves, describing the properties of each dataset. A minimal sketch of compiling such a summary appears after this list.
  • Dataset Size (Rows): The 'Number of Rows' (individual records or samples in a dataset) varies greatly, with a mean of 18,576, a median of 4,720, a minimum of 43, and a maximum of 245,057. This indicates a wide range of dataset sizes in terms of sample count.
  • Dataset Dimensionality (Columns): The 'Number of Columns' (features or attributes in a dataset) has a mean of 24.16, a median of 12.5, ranging from a minimum of 4 to a maximum of 267. This shows diversity in the dimensionality of the datasets.
  • Feature Types (Numerical and Categorical): Datasets on average have 14.25 'Numerical Columns' (features with continuous or discrete numbers, like age or temperature) with a median of 7, and 9.91 'Categorical Columns' (features with a fixed set of categories, like 'color' with values 'red', 'blue') with a median of 4. Some datasets have no numerical (min 0) or no categorical columns (min 0).
  • Kurtosis of Features: 'Kurtosis' is a statistical measure that describes the 'tailedness' of the probability distribution of a feature – essentially, whether the data are heavy-tailed (more outliers) or light-tailed (fewer outliers) relative to a normal distribution. The datasets show a very wide range of kurtosis values, with a mean of 348.01 and a median of 6.44, but a minimum of -2711.83 and a maximum of 8901.75. The large standard deviation (1129.66) also highlights this variability.
  • Inter-Feature Correlation: The 'Average Correlation Between Features' indicates the average linear relationship strength between different input features within a dataset. The mean is 0.08 (weak positive correlation on average), with a median of 0.06. Values range from -0.16 to 0.62, suggesting some datasets have moderately correlated features while others have very little inter-feature correlation.
  • Feature Entropy: 'Average Entropy' measures the average randomness or unpredictability of the features in a dataset. A higher entropy suggests more diverse values. The mean entropy is 7.70, with a median of 7.96, ranging from 2.45 to 14.17.
  • Feature-Target Correlation (Pearson): The 'Average Pearson to Target Feature' shows the average linear relationship between each input feature and the outcome variable (the 'target' the model tries to predict). The mean is 0.11 and the median is 0.09, suggesting that, on average, individual features have a weak linear relationship with the target. The range is from -0.19 to 0.44.
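
As a concrete illustration (this is not the authors' code, and the values are placeholders), a Table 2-style summary can be compiled by collecting one meta-feature row per dataset and calling pandas' describe():

```python
# Minimal sketch: one row per dataset, then describe() yields the
# mean/std/min/quartiles/max columns of a Table 2-style summary.
import pandas as pd

per_dataset = pd.DataFrame({
    "number_of_rows":    [43, 4720, 245057, 1200, 8800],
    "number_of_columns": [4, 13, 267, 31, 9],
    "avg_kurtosis":      [-1.1, 6.44, 8901.75, 2.3, 15.0],
})

summary = per_dataset.describe().T  # count, mean, std, min, 25%, 50%, 75%, max
print(summary[["mean", "std", "min", "25%", "50%", "75%", "max"]].round(2))
```
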
Scientific Validity
  • ✅ Methodological Soundness: Providing descriptive statistics of the dataset meta-features is crucial for understanding the landscape of data used in the benchmark. This allows readers to assess the diversity and characteristics of the datasets upon which model performances are compared.
  • ✅ Supports Claim of Dataset Diversity: The wide ranges observed for most variables (e.g., Number of Rows, Kurtosis) strongly support the claim of using 'diverse tabular datasets' and highlight the heterogeneity of the benchmark.
  • ✅ Comprehensive Meta-Features: The inclusion of statistics like Kurtosis, Average Entropy, and correlations provides deeper insights into data properties beyond simple counts of rows and columns. These are relevant meta-features for characterizing datasets in machine learning research.
  • ✅ Appropriate Level of Aggregation: The statistics are presented for the collection of 111 datasets. It's clear these are summary statistics across datasets, not within a single dataset. This is appropriate for characterizing the benchmark suite itself.
  • 💡 Consideration of Extreme Kurtosis Values: The extreme minimum and maximum values for Kurtosis (-2711.83 and 8901.75) are notable. The negative value in particular warrants a check: excess kurtosis is bounded below by -2 for any distribution (and Pearson kurtosis by 1), so an average of -2711.83 suggests a data anomaly or a computation/aggregation artifact rather than a genuine distributional property. The extremely large maximum, if genuine, further underscores dataset diversity, but both extremes deserve a brief comment in the text.
  • ✅ Alignment with Stated Purpose: The table effectively summarizes the 'main parameters of the datasets' as stated in the reference text. The chosen parameters are indeed key characteristics for tabular data.
Communication
  • ✅ Clear and Standard Format: The table is well-structured with clear row and column labels, making it easy to understand the statistical properties being presented for the dataset variables.
  • ✅ Comprehensive Statistical Summary: The statistical measures provided (Mean, STD, Min, 25%, Median, 75%, Max) are standard and offer a comprehensive summary of the distribution of each dataset characteristic.
  • ✅ Appropriate Terminology for Target Audience: Variable names like 'Number of Rows', 'Number of Columns' are self-explanatory. Terms like 'Kurtosis', 'Average Entropy', and 'Average Pearson to Target Feature' are standard in data science but might require some background knowledge for a broader audience; however, for the target audience of a machine learning paper, they are appropriate.
  • ✅ Appropriate Numerical Precision: The precision of the numerical values (e.g., two decimal places for means and medians of counts) is generally appropriate. For Kurtosis, the large range and standard deviation are notable, and the precision seems fine.
  • ✅ Consistent with Reference Text: The table effectively summarizes the characteristics of the 111 datasets used in the study, as stated in the reference text, providing a good overview of their diversity.
Table 8: The meta-feature vector representing a dataset.
Figure/Table Image (Page 16)
First Reference in Text
Table 8 mostly adopted from [41] and contains 20 features computationally useful for meta-learning tasks.
Description
  • Purpose and Origin of Meta-Features: Table 8 lists the 20 'meta-features' that are used to create a numerical profile, or 'vector', for each dataset in the study. Meta-features are characteristics calculated from a dataset that describe its properties. This vector is then used in 'meta-learning', which is essentially learning from the characteristics of previous learning tasks to improve future learning. The table is adopted primarily from a source cited as [41]. A minimal sketch of extracting several of these meta-features appears after this list.
  • Basic Dataset Properties: The meta-features cover various aspects of a dataset. Basic properties include 'Row count' (number of samples), 'Column count' (number of original features), 'Columns after One Hot Encoding' (number of features after converting categorical data into a numerical format where each category becomes a new binary feature), 'Numerical Features' (count of features with number values), and 'Categorical Features' (count of features with category values).
  • Task Type: Task type is captured by 'Classification/Regression', indicating whether the dataset is for predicting a category or a continuous value.
  • Statistical Properties of Features: Statistical properties of the features within a dataset are included, such as 'Cancor' (Canonical correlation, a measure of association between two sets of variables, here likely the best single combination of features), 'Kurtosis' (a measure of the 'tailedness' or outlier proneness of feature distributions), 'Average Entropy' (average randomness or information content of features), and 'Standard Deviation Entropy'.
  • Feature-Target Relationships: Relationships between features and the target variable (the variable to be predicted) are measured by 'Average Pearson to Target Feature' (average linear correlation between each feature and the target) and 'Standard Deviation Pearson To Target Feature'.
  • Inter-Feature Relationships: Inter-feature relationships are described by 'Average Correlation Between Features'.
  • Feature Variability and Anomaly Measures: Measures of feature variability and potential issues include 'Average Asymmetry Of Features' (average skewness), 'Average Coefficient of Variation' (standard deviation divided by the mean, a normalized measure of dispersion), 'Standard Deviation Coefficient of Variation', 'Average Coefficient of Anomaly' (mean divided by standard deviation, potentially highlighting features with low variability relative to their mean), and 'Standard Deviation Coefficient of Anomaly'.
  • Dimensionality/Complexity Measure: A dimensionality reduction related feature is 'PCA' (Principal Component Analysis), specifically 'The number of PCA components required to explain 99% of the variance in the data', indicating the intrinsic complexity or dimensionality of the dataset.
  • Sources of Meta-Features: The sources for these meta-features are cited, primarily from references [42], [43], [44], with 'Row Over Column' from [49] and 'PCA' from [50].
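
As an illustration, the sketch below computes several of the listed meta-features for a single dataset. It is not the authors' implementation: the aggregation choices (how per-feature statistics are averaged), the entropy definition, and the handling of missing values are assumptions.

```python
# Minimal sketch of a per-dataset meta-feature vector (assumed definitions).
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew, entropy
from sklearn.decomposition import PCA

def meta_features(df: pd.DataFrame, target: str) -> dict:
    y = df[target]
    X = df.drop(columns=target)
    num = X.select_dtypes(include="number")          # numerical features only

    # Number of PCA components needed to explain 99% of the variance.
    pca = PCA(n_components=0.99).fit(num.fillna(0))

    upper = np.triu_indices(num.shape[1], k=1)       # upper triangle, no diagonal
    return {
        "row_count": len(df),
        "column_count": X.shape[1],
        "numerical_features": num.shape[1],
        "categorical_features": X.shape[1] - num.shape[1],
        "avg_kurtosis": float(np.mean(kurtosis(num, nan_policy="omit"))),
        "avg_asymmetry": float(np.mean(skew(num, nan_policy="omit"))),
        "avg_entropy": float(np.mean(
            [entropy(X[c].value_counts(normalize=True)) for c in X.columns])),
        "avg_corr_between_features": float(num.corr().values[upper].mean()),
        "avg_pearson_to_target": (float(num.corrwith(y).mean())
                                  if pd.api.types.is_numeric_dtype(y) else float("nan")),
        "pca_components_99pct": int(pca.n_components_),
        "row_over_column": len(df) / X.shape[1],
    }

# Example usage (hypothetical file and target column):
# vec = meta_features(pd.read_csv("some_dataset.csv"), target="label")
```
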
Scientific Validity
  • ✅ Comprehensive and Relevant Meta-Feature Set: The selection of 20 meta-features covering simple statistics, feature properties, feature relationships, and feature-target relationships is comprehensive and aligns with common practices in meta-learning research. These types of features have been shown to be useful for characterizing datasets.
  • ✅ Grounded in Existing Literature: Citing sources for the meta-features, especially noting that the set is 'mostly adopted from [41]', lends credibility and allows for verification or deeper understanding of the feature definitions and their established utility.
  • ✅ Computationally Feasible Features: The meta-features chosen are computationally feasible to extract from datasets, which is important for practical meta-learning applications, as supported by the reference text stating they are 'computationally useful'.
  • ✅ Diverse Range of Feature Types: The inclusion of diverse types of meta-features (e.g., simple counts, statistical moments, information-theoretic measures, correlation-based) provides a rich representation of dataset characteristics, which is beneficial for building effective meta-models.
  • 💡 Details of Calculation Methodology Assumed Elsewhere: While the table lists the features, the exact methodology for calculating some of them (e.g., how 'Average Asymmetry' or 'Average Coefficient of Anomaly' are aggregated across features if they are per-feature measures) is not detailed in the table itself. This would typically be in the methods section or the cited references.
  • ✅ Appropriateness for the Meta-Learning Task: The utility of these specific 20 features for the particular meta-learning task in this paper (predicting DL vs. ML performance) is ultimately demonstrated by the performance of the meta-model built using them (e.g., the logistic regression in Table 6). The table itself just defines the inputs to that meta-model.
  • ✅ Inclusion of Simple Shape Descriptors: The feature 'Row Over Column' from [49] is a simple yet often informative measure of dataset shape, which can influence algorithm choice.
Communication
  • ✅ Clear and Organized Structure: The table is well-organized with three clear columns: 'Name' (of the meta-feature), 'Description', and 'Source' (citation for the feature). This structure makes it easy to understand each meta-feature and its origin.
  • ✅ Informative Descriptions: The descriptions provided for each meta-feature are generally concise and informative, helping the reader understand what each feature represents.
  • ✅ Citation of Sources: Providing sources for the meta-features (references [41], [42], [43], [44], [49], [50]) is good practice, allowing readers to trace the origin and potentially find more detailed definitions or justifications for their use.
  • ✅ Consistent with Caption and Reference Text: The caption clearly states that this table defines the meta-feature vector, and the reference text confirms it lists 20 features useful for meta-learning.
  • ✅ Clarity of Feature Names via Descriptions: Some feature names like 'Cancor' or 'Row Over Column' are standard in meta-learning literature but might be less intuitive for a broader audience without the provided description. The descriptions do a good job of clarifying these.
  • ✅ Specificity of PCA Description: For the 'PCA' feature, the description 'The number of PCA components required to explain 99% of the variance in the data' is very clear and specific.

3 Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 3: Performance ranking of TE and DL models.
Figure/Table Image (Page 5)
First Reference in Text
Table 3 outlines the performance of the Tree-based Ensemble models (TE) and DL models on the 111 datasets.
Description
  • Model Categories and Scope: The table summarizes the performance of 14 different machine learning models, categorized into two groups: Tree-based Ensemble (TE) models and Deep Learning (DL) models. TE models, such as Random Forest or XGBoost, combine the predictions of multiple decision trees to improve accuracy and robustness. DL models, like ResNet or MLP (Multilayer Perceptron), are based on artificial neural networks with multiple layers, capable of learning complex patterns. The comparison is based on their performance across 111 datasets.
  • Performance Metrics Used: Performance is reported using four metrics: '# Best' (the number of datasets out of 111 where a model achieved the top performance), 'Average Rank' (the average performance rank of the model across all datasets, lower is better), 'Median Rank' (the median performance rank, also lower is better), and '# in Top 3 Models' (the number of datasets where the model ranked among the top three performers). A minimal sketch of computing these ranking metrics appears after this list.
  • Top Performing Model: CatBoost (TE): The TE model 'CatBoost' ranks first, being the best performer on 19 out of 111 datasets. It has an average rank of 4.9 and a median rank of 4. It appeared in the top 3 models for 50 datasets.
  • Strong Performance of Other TE Models: Other TE models like 'LightGBM' (15 #Best, Avg Rank 5, Median Rank 4, #Top3 47) and 'H2O-GBM' (13 #Best, Avg Rank 7, Median Rank 6, #Top3 28) also perform strongly, occupying the top ranks.
  • Highest Ranked DL Models: The first DL model to appear in the ranking is 'AutoGluon-DL', which is 5th overall. It was the best model on 11 datasets, with an average rank of 6.6, a median rank of 7, and was in the top 3 for 32 datasets. 'ResNet' is the next DL model, with 10 #Best, an average rank of 7.5, and a median rank of 8.
  • Lower Performing DL Models: Some DL models performed poorly. 'FT-Transformer' and 'TabNet' both achieved '# Best' on 0 datasets. 'TabNet' had the worst performance metrics among the listed models, with an average rank of 13.1 and a median rank of 14.
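
For concreteness, the sketch below shows how the four ranking columns could be derived from a per-dataset rank table (lower rank = better). It is not the authors' code; the model names and ranks are placeholders, and with only three toy models the '# in Top 3' column is trivially saturated.

```python
# Minimal sketch of the Table 3 columns from a per-dataset rank table.
import pandas as pd

# ranks.loc[dataset, model] = rank of that model on that dataset (1 = best)
ranks = pd.DataFrame(
    {"CatBoost": [1, 2, 1, 2], "LightGBM": [2, 1, 3, 1], "TabNet": [3, 3, 2, 3]},
    index=["ds1", "ds2", "ds3", "ds4"],
)

summary = pd.DataFrame({
    "# Best":            (ranks == 1).sum(),
    "Average Rank":      ranks.mean().round(1),
    "Median Rank":       ranks.median(),
    "# in Top 3 Models": (ranks <= 3).sum(),  # trivial here with only 3 models
}).sort_values("# Best", ascending=False)
print(summary)
```
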
Scientific Validity
  • ✅ Robustness through Multiple Ranking Metrics: The use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) provides a robust way to compare model performance, as relying on a single metric can sometimes be misleading. This multifaceted approach is a strength.
  • ✅ Supports Reference Text: The table clearly supports the reference text's claim that it outlines the performance of TE and DL models. The data presented directly reflects this comparison.
  • ✅ Clear Indication of TE Model Dominance: The primary finding suggested by the ordering (TE models dominating the top ranks) is evident from the '# Best' column. This provides initial evidence for the broader argument that TE models often outperform DL models on these tabular datasets.
  • ✅ Broad Dataset Coverage: The table aggregates performance across 111 diverse datasets. This large number of datasets increases the generalizability of the findings compared to benchmarks with fewer datasets.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: The specific evaluation metric (e.g., AUC, RMSE, accuracy) used to determine which model is '# Best' or to calculate ranks is not explicitly stated in the table itself or its immediate caption. While Section 2.3 mentions evaluation metrics for regression and classification, and Section 3.1 refers to critical difference diagrams based on RMSE and accuracy, it's not explicitly clear if '# Best' in Table 3 is based on a single metric for each task type or some combined/averaged score. This ambiguity slightly limits the direct interpretation of what 'best' precisely means without referring to other parts of the text.
  • 💡 Lack of Statistical Significance Indicators: The table doesn't include any measure of statistical significance for the differences in ranks or '# Best' counts. While critical difference diagrams are mentioned later, incorporating some indication of significance or confidence intervals for average/median ranks within this table could strengthen the claims about relative model performance.
  • ✅ Representative Model Selection: The selection of models for TE and DL categories seems representative of commonly used and state-of-the-art algorithms in these classes for tabular data.
Communication
  • ✅ Clear Structure and Headers: The table is well-structured with clear column headers (# Best, Average Rank, Median Rank, # in Top 3 Models) that are standard for reporting model performance comparisons. The grouping of models into TE (Tree-based Ensemble) and DL (Deep Learning) is also clearly indicated.
  • ✅ Effective Ranking and Multiple Metrics: The models are ordered by '# Best', which provides an intuitive primary ranking. Supplementing this with average and median ranks, and '# in Top 3 Models' offers a more nuanced view of performance.
  • ✅ Use of Standard Terminology: The abbreviations TE and DL are defined in the reference text, which helps in understanding the 'Group' column. Model names like CatBoost, LightGBM, ResNet, TabNet are standard in the field.
  • ✅ Efficient Information Presentation: The table is concise and presents a large amount of comparative information efficiently. The main message about TE models generally outperforming DL models (based on '# Best') is quickly discernible.
  • 💡 Clarification of Ranking Metric Basis: While the table itself is clear, the specific performance metric used to determine '# Best', average/median rank (e.g., RMSE for regression, AUC for classification, or an overall combined metric) is not stated within the table or its immediate caption. This information is crucial for full interpretation and is mentioned in the text (Section 2.3 and later in Section 3.1 that RMSE and Accuracy are used for critical difference diagrams, but the basis for Table 3's ranking isn't explicitly tied to one specific metric here). Adding a note like 'Ranking based on primary metric X for classification and Y for regression, then aggregated' would improve self-containedness.
Table 4: Performance ranking of all models.
Figure/Table Image (Page 6)
First Reference in Text
Table 4 extends Table 3 as it presents the performance of all 20 models on the 111 datasets.
Description
  • Expanded Scope and Model Categorization: Table 4 expands on Table 3 by presenting the performance ranking of all 20 models evaluated in the study across 111 datasets. Models are categorized into 'Other', 'DL' (Deep Learning), and 'TE' (Tree-based Ensemble) groups. The 'Other' group includes models that may not fit strictly into TE or DL, or are automated machine learning (AutoML) systems like AutoGluon which can combine various model types.
  • Consistent Performance Metrics: The performance metrics are consistent with Table 3: '# Best' (number of datasets where the model was top-ranked), 'Average Rank', 'Median Rank', and '# in Top 3 Models'.
  • Top Performer: AutoGluon: 'AutoGluon' (categorized as 'Other') is the top-performing model, achieving the best performance on 39 out of 111 datasets. It has an average rank of 4.8, a median rank of 4, and appeared in the top 3 models for 58 datasets.
  • Performance of SVM: 'SVM' (Support Vector Machine, categorized as 'Other') is second in terms of '# Best' (10 datasets), but has a significantly worse average rank (12.4) and median rank (14) compared to AutoGluon, indicating it performs exceptionally well on a few datasets but is not consistently highly ranked.
  • Performance of DL and TE Models: DL models like 'ResNet' (7 #Best, Avg Rank 9.7) and TE models like 'CatBoost' (7 #Best, Avg Rank 6.6) follow in performance. CatBoost has a better average and median rank than ResNet despite the same '# Best' count.
  • Lower Performing Models: Several models, particularly some DL models, performed poorly. 'TabNet' (DL) is ranked last, achieving '# Best' on 0 datasets, with an average rank of 17.2 and a median rank of 18. 'FT-Transformer' (DL) also had 0 '# Best' and poor ranks.
  • Inclusion of Simpler Model Types: The table includes other model types like 'gplearn' (genetic programming), 'LR' (Linear/Logistic Regression), and 'KNN' (K-Nearest Neighbors), all categorized under 'Other', showing their relative standings among the more complex ensemble and deep learning methods.
Scientific Validity
  • ✅ Comprehensive Model Comparison: Presenting results for all 20 models, including those beyond just TE and DL, provides a more complete picture of the benchmark and is a methodological strength.
  • ✅ Strong Support for Textual Claims: The data strongly supports the textual claim that AutoGluon is the best-performing model on average and by '# Best'. It also supports the observation about SVM's performance characteristics (high '# Best' but poor average rank).
  • ✅ Appropriate Categorization of Models: The inclusion of the 'Other' category is necessary for models like AutoGluon (an AutoML system using various models) and classical ML algorithms (SVM, LR, KNN). This categorization is appropriate.
  • ✅ Robust Multi-Metric Evaluation: The consistent use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) across all models allows for a robust and nuanced comparison, which is a significant strength.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: As with Table 3, the exact definition of the metric(s) used to determine which model is '# Best' or to calculate the ranks is not explicitly stated in the table. Clarifying if this is based on a single specific metric (e.g., AUC for classification, RMSE for regression) or an aggregation would enhance rigor.
  • 💡 Lack of Statistical Significance Indicators: The table does not include measures of statistical significance for the differences in ranks or '# Best' counts. While the paper mentions critical difference diagrams elsewhere, adding confidence intervals or significance indicators for average/median ranks here could further solidify the conclusions drawn from this table.
  • ✅ Significant Finding Regarding AutoML: The finding that AutoGluon, an ensemble of ML and DL models, outperforms individual specialized models is a significant result and aligns with trends in automated machine learning.
Communication
  • ✅ Consistent and Clear Structure: The table maintains a clear and consistent structure with Table 3, using identical column headers for performance metrics, which aids in comparability and understanding.
  • ✅ Useful Model Grouping: The inclusion of a 'Group' column (Other, DL, TE) helps categorize the diverse set of 20 models, providing an additional layer of information for interpreting the rankings.
  • ✅ Effective Multi-Metric Presentation: The primary ranking by '# Best' clearly highlights top performers like AutoGluon. The supplementary ranks (Average, Median, # in Top 3) offer a more nuanced perspective, as noted in the text regarding SVM's high '# Best' but poorer average rank.
  • 💡 Clarification for 'Other' Group: The term 'Other' for model groups like AutoGluon and SVM is functional. However, since AutoGluon is described in the text as an ensemble of DL and ML models, a brief note or reminder of this in the caption or as a footnote could enhance clarity for readers focusing solely on the table.
  • 💡 Specification of Ranking Metric Basis: Similar to Table 3, the precise performance metric(s) used to determine '# Best' and calculate ranks (e.g., specific metrics for classification/regression, or an aggregated score) is not explicitly stated in the table or its immediate caption. This would improve the table's self-containedness.
  • ✅ Efficiently Conveys Key Findings: The table effectively conveys the dominance of AutoGluon and the relative performance of other models, including the poor performance of TabNet, which is consistent with the narrative.
Figure 1: Critical difference diagram for regression tasks based on RMSE. The...
Full Caption

Figure 1: Critical difference diagram for regression tasks based on RMSE. The best performing model is Auto-Gluon as lower RMSE scores indicate better performance.

Figure/Table Image (Page 6)
First Reference in Text
The results are also summarized using critical difference diagrams based on RMSE for regressions tasks (Fig. 1) and on accuracy for classifications tasks (Fig. 2).
Description
  • Diagram Type and Purpose: Figure 1 is a critical difference (CD) diagram, a visualization used to compare the performance of multiple machine learning models across a set of tasks (in this case, regression tasks). The models are positioned along a horizontal axis according to their average rank, calculated based on their Root Mean Squared Error (RMSE) – a measure of the differences between predicted and actual values, where lower values are better. The diagram shows 12 models or model variants.
  • Best Performing Model and Caption Discrepancy: The caption names 'Auto-Gluon' as the best-performing model, and 'AutoGluon' occupies the best (leftmost) position in the diagram with a plotted value of 10.3509. A separate variant, 'AutoGluon-DL', appears further down the ordering with a value of 6.3684. Because the figure distinguishes the two variants while the caption does not, it is unclear whether 'AutoGluon' or 'AutoGluon-DL' is the intended best performer.
  • Worst Performing Model and Rank Interpretation: 'Decision Tree' occupies the worst (rightmost) position with a plotted value of 3.9123. The plotted values appear to be average ranks, but their scale is inverted relative to the usual CD-diagram convention, in which a lower average rank indicates better performance: here the best-positioned model carries the highest value (10.3509) and the worst-positioned model the lowest (3.9123). Given the caption's statement that lower RMSE indicates better performance, the safest reading is that the ordering runs from best (left) to worst (right) and that the numbers are average ranks on an inverted or transformed scale.
  • Statistical Significance Groupings: Horizontal black bars connect groups of models whose performance is not statistically significantly different from each other. For example, 'AutoGluon' (rank 10.3509) is not statistically significantly different from 'CatBoost' (rank 9.1930) or 'TPOT' (rank 8.3509). Similarly, 'Decision Tree' (rank 3.9123), 'AdaBoost' (rank 4.0175), 'SVM' (rank 4.1228), and 'H2O-DL' (rank 4.3158) form a group that are not statistically significantly different from each other at the bottom end of performance.
  • Relative Model Performance: Models such as 'CatBoost' (9.1930), 'TPOT' (8.3509), 'Random Forest' (7.9298), 'XGBoost' (7.7018), and 'H2O-GBM' (7.1579) occupy the better-performing half of the ordering. 'AutoGluon-DL' (6.3684) sits in the lower-performing half, ahead of 'LR' (Linear Regression, 4.5789) but behind 'H2O-GBM', which again underscores the ambiguity about which AutoGluon variant the caption designates as best. A minimal sketch of how the average ranks underlying such a diagram can be computed appears after this list.
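
Because the paper does not state which post-hoc test underlies the critical difference bars, the sketch below covers only the unambiguous ingredients: ranking models per dataset by RMSE (lower is better), averaging those ranks, and running the global Friedman test. Model names and RMSE values are placeholders.

```python
# Minimal sketch: rank models per dataset by RMSE (lower is better), then
# average the ranks; the Friedman test checks for any global difference.
import pandas as pd
from scipy.stats import friedmanchisquare

rmse = pd.DataFrame(
    {
        "AutoGluon":    [0.80, 1.10, 0.70, 0.95],
        "CatBoost":     [0.90, 1.00, 0.80, 1.05],
        "DecisionTree": [1.40, 1.60, 1.20, 1.50],
    },
    index=["ds1", "ds2", "ds3", "ds4"],
)

avg_rank = rmse.rank(axis=1, method="average").mean()  # lower = better
print(avg_rank.sort_values())

stat, p = friedmanchisquare(*[rmse[m] for m in rmse.columns])
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")
```
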
Scientific Validity
  • ✅ Appropriate Visualization Method: Critical difference diagrams are an appropriate and widely accepted method for visualizing and comparing the performance of multiple algorithms across multiple datasets, especially when incorporating statistical significance (e.g., via the Nemenyi test or similar Friedman post-hoc tests).
  • ✅ Appropriate Base Metric (RMSE): The use of RMSE as the base metric for ranking models on regression tasks is standard and appropriate, as it penalizes larger errors more heavily.
  • ✅ Supports Claims of Performance Differences: The diagram visually supports the claim that there are statistically significant differences in performance among the models. For instance, AutoGluon is shown to be significantly better than models like Decision Tree or AdaBoost.
  • 💡 Significant Discrepancy/Clarity Issue with Best Model and Rank Values: The caption states "The best performing model is Auto-Gluon", and 'AutoGluon' is positioned leftmost (best) with a value of 10.3509, while 'AutoGluon-DL' (6.3684) is positioned worse than H2O-GBM. The plotted values are therefore ambiguous: if higher numbers indicate better performance (unconventional for ranks), AutoGluon is best; if lower numbers are better (the CD-diagram convention), Decision Tree (3.9123) would be best, contradicting both the caption and the visual ordering. Two points need clarification: whether AutoGluon or AutoGluon-DL is the best-performing variant, and how the numerical values next to each model map onto RMSE-based average ranks.
  • 💡 Missing Details on Statistical Test and Alpha Level: The specific statistical test used to calculate the critical difference (e.g., Nemenyi test, Bonferroni-Dunn test) and the significance level (alpha) are not mentioned in the caption or immediately with the figure. This information is essential for a full interpretation of the statistical significance bars and should be provided in the methods or figure caption.
  • 💡 Discrepancy in Number of Models Shown vs. Total Models: The diagram only includes 12 models/variants. Table 4 lists 20 models. It's unclear if this CD diagram is for a subset of models or if some models from Table 4 were excluded for regression tasks. Clarification on model inclusion would be helpful.
Communication
  • ✅ Appropriate Chart Type: The critical difference diagram is a standard and effective way to visualize pairwise statistical comparisons of multiple classifiers/regressors over multiple datasets. The horizontal axis representing average rank is clear.
  • ✅ Legible Labels and Values: Model names are generally legible, though some are slightly small. The average rank values associated with each model are clearly displayed.
  • ✅ Informative Caption: The caption clearly states the metric (RMSE for regression), identifies the best performer (AutoGluon), and explains the direction of better performance (lower RMSE is better). This makes the figure largely self-contained.
  • ✅ Clear Indication of Statistical Significance: The horizontal bars connecting groups of models whose performance is not statistically significantly different are a key feature of CD diagrams. These are present and visually connect the relevant models.
  • ✅ Good Axis Scaling: The range of average ranks (from approximately 4 to 10.5) is well-represented on the axis. The tick marks (1 to 12) are appropriate.
  • ✅ Good Use of Monochrome: The figure is monochrome, which is fine for this type of diagram and ensures accessibility. The lines are distinct enough.
  • 💡 Specify Alpha Level for Significance: It would be beneficial to state the alpha level (e.g., α = 0.05) used for the statistical tests (presumably Nemenyi or similar post-hoc tests) that determine the critical difference, either in the caption or the methods section linked to this figure. This would provide full context for the significance bars.
Figure 2: Critical difference diagram for classification tasks based on...
Full Caption

Figure 2: Critical difference diagram for classification tasks based on accuracy. The best performing model is AutoGluon as higher accuracy scores indicate better performance.

Figure/Table Image (Page 6)
First Reference in Text
The results are also summarized using critical difference diagrams based on RMSE for regressions tasks (Fig. 1) and on accuracy for classifications tasks (Fig. 2).
Description
  • Diagram Type and Evaluation Basis: Figure 2 is a critical difference (CD) diagram used to compare the performance of 14 machine learning models on classification tasks. Models are ranked based on their average performance using the accuracy metric, where higher accuracy indicates better performance. In a CD diagram, models are positioned along a horizontal axis according to their average rank; a lower average rank value signifies better performance.
  • Best Performing Model: The model 'AutoGluon' is identified as the best-performing model for classification tasks based on accuracy. It is positioned on the far right of the diagram, indicating the best average rank, which is 4.3519.
  • Worst Performing Model: The model 'gplearn' is shown as the worst-performing among the depicted models, positioned on the far left with the highest average rank of 11.3981.
  • Statistical Significance Groupings: Horizontal black bars connect groups of models where the differences in their average ranks are not statistically significant at a certain confidence level. For example, 'AutoGluon' (rank 4.3519), 'CatBoost' (rank 5.8241), 'LightGBM' (rank 5.8333), 'H2O-GBM' (rank 5.9630), 'TPOT' (rank 6.0463), 'AutoGluon-DL' (rank 6.2037), and 'XGBoost' (rank 6.4074) form a group at the top where many pairwise differences are not statistically significant. AutoGluon is statistically significantly better than models like 'H2O-DL' (rank 7.8704) and those performing worse.
  • Relative Model Performance Overview: The diagram visually presents the relative standings of various models. For instance, traditional models like 'Decision Tree' (rank 10.3704) and 'KNN' (rank 9.3241) are shown to perform worse on average than ensemble methods like 'CatBoost' or AutoML tools like 'AutoGluon' for these classification tasks based on accuracy.
Scientific Validity
  • ✅ Appropriate Visualization Method: Critical difference diagrams are a statistically sound and appropriate method for comparing multiple algorithms over multiple datasets, as recommended in machine learning literature.
  • 💡 Choice of Metric (Accuracy): Accuracy is a common metric for classification tasks. However, its suitability can be limited on imbalanced datasets, where metrics like AUC or F1-score might be more informative. The paper should ideally address or acknowledge dataset balance if relying solely on accuracy for ranking.
  • ✅ Supports Caption Claim: The diagram visually supports the caption's claim that AutoGluon is the best performing model for classification tasks based on accuracy, as it holds the best average rank.
  • 💡 Missing Details on Statistical Test and Alpha Level: The specific statistical test (e.g., Friedman test followed by Nemenyi post-hoc test) and the significance level (alpha) used to determine the critical difference for the connecting bars are not explicitly stated in the caption or figure context. This information is crucial for full methodological transparency.
  • 💡 Discrepancy in Number of Models Shown: The diagram shows 14 models. Table 4 in the paper lists 20 models in total. It would be beneficial to clarify if the 6 models not shown here were not applicable to classification tasks, not evaluated, or excluded for other reasons.
  • ✅ Alignment with Stated Purpose: The diagram effectively summarizes the relative performance and statistical distinguishability of models for classification tasks, aligning with the purpose stated in the reference text.
Communication
  • ✅ Appropriate Chart Type: The use of a critical difference diagram is an established and effective method for visually comparing multiple algorithms across datasets while incorporating statistical significance.
  • ✅ Legible Labels and Values: Model names and their associated average rank values are clearly displayed and legible. The horizontal axis representing average rank is also clear.
  • ✅ Informative Caption: The caption is informative: it specifies the task (classification), the metric (accuracy), identifies the best performing model (AutoGluon), and clarifies that higher accuracy scores are better. This helps in understanding the diagram's orientation.
  • ✅ Clear Indication of Statistical Significance: The horizontal bars that connect groups of models whose performances are not statistically significantly different are a key feature and are clearly depicted, aiding in the interpretation of pairwise comparisons.
  • ✅ Good Use of Monochrome: The diagram is monochrome, which is generally good for accessibility and clarity in this type of visualization. The lines are distinct.
  • 💡 Specify Alpha Level for Significance: To enhance completeness, explicitly state the alpha level (e.g., α = 0.05) used for the statistical tests that determine the critical difference. This could be in the caption or the relevant methods section.
  • ✅ Consistent Visual Ordering: The arrangement of models from worst (left) to best (right) is consistent with the caption stating higher accuracy is better, with AutoGluon (best) positioned on the right with the lowest average rank value.
Table 5: Performance metrics of TE and DL models for small datasets (< 1000)...
Full Caption

Table 5: Performance metrics of TE and DL models for small datasets (< 1000) rows.

Figure/Table Image (Page 7)
First Reference in Text
Table 5 follows the same line but includes only datasets with less than 1000 rows (36).
Description
  • Focus on Small Datasets: Table 5 presents a performance comparison of Tree-based Ensemble (TE) models and Deep Learning (DL) models specifically on a subset of 'small datasets'. These are defined as datasets having fewer than 1000 rows (individual records). The analysis is based on 36 such datasets.
  • Consistent Performance Metrics: The performance metrics used are the same as in previous tables: '# Best' (number of the 36 small datasets where the model was top-ranked), 'Average Rank', 'Median Rank', and '# in Top 3 Models'.
  • Top Performer: H2O-GBM (TE): For these small datasets, the TE model 'H2O-GBM' is the top performer based on '# Best', achieving the best results on 6 out of 36 datasets. It has an average rank of 5.6 and a median rank of 5, appearing in the top 3 models for 12 datasets.
  • Second Best Performer: ResNet (DL): The DL model 'ResNet' is the second-best performer by '# Best', achieving top rank on 5 datasets. It has an average rank of 6.7 and a median rank of 7. It was in the top 3 for 8 datasets.
  • Other Notable Performers: Other models showing relatively good performance on small datasets include 'CatBoost' (TE, 4 #Best, Avg Rank 5.1), 'AutoGluon-DL' (DL, 4 #Best, Avg Rank 6.8), and 'MLP' (DL, 4 #Best, Avg Rank 6.3). Notably, CatBoost has a better average and median rank than ResNet, AutoGluon-DL, and MLP, despite fewer '# Best' instances than ResNet.
  • Lower Performing DL Models on Small Datasets: Similar to the overall ranking, some DL models like 'FT-Transformer' and 'TabNet' performed poorly on these small datasets as well, both achieving '# Best' on 0 datasets and having the highest (worst) average and median ranks among the listed models (TabNet: Avg Rank 13.9, Median Rank 14).
  • Mix of TE and DL Models Among Top Ranks: The table shows a mix of TE and DL models among the top performers on small datasets, with a TE model (H2O-GBM) leading, but closely followed by DL models like ResNet, AutoGluon-DL, and MLP in terms of '# Best'.
Scientific Validity
  • ✅ Valid Sub-Analysis on Small Datasets: Analyzing model performance on a subset of small datasets is a valid and interesting approach, as model behavior can differ significantly based on data size. This provides insights into which models are more robust or suitable when data is limited.
  • ✅ Clear Definition and Sufficient Sample for Subset: The definition of 'small datasets' as those with < 1000 rows is clearly stated, and the number of such datasets (36) is provided, which is a reasonable sample size for this sub-analysis.
  • ✅ Robust Multi-Metric Evaluation for Subset: The use of multiple ranking metrics (# Best, Average Rank, Median Rank, # in Top 3) remains a strength for this subset analysis, providing a nuanced view of performance.
  • ✅ Potentially Interesting Findings for Small Data Regime: The results, showing H2O-GBM (TE) leading but with DL models like ResNet and AutoGluon-DL also performing well, suggest that the performance gap between TE and DL models might be different or less pronounced on smaller datasets compared to the overall trend. This is an interesting finding.
  • ✅ Data Supports Textual Claims (with minor specificity note): The text mentions 'H2O led the chart'. The table specifies 'H2O-GBM'. This is a minor point but ensuring exact model names are used consistently between text and table is good practice. The data in the table supports the general claim about H2O-GBM's strong performance.
  • 💡 Ambiguity of Underlying Performance Metric for Ranking: As with the other performance tables, the specific underlying performance metric(s) used for ranking (e.g., RMSE for regression parts of the 36 datasets, accuracy for classification parts) is not explicitly stated in the table context. This detail is important for full interpretation.
  • 💡 Lack of Statistical Significance Indicators for Subset: The table does not include statistical significance indicators for the differences in ranks within this subset. While perhaps less critical for a sub-analysis table if covered elsewhere for the main results, it would add rigor (a sketch of one such test follows this list).
  • ✅ Consistent Findings for Poorly Performing Models: The continued poor performance of models like TabNet and FT-Transformer even on small datasets reinforces findings from the overall analysis about these specific DL models.
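One common way to add such rigor is a Friedman test across datasets, optionally followed by pairwise Wilcoxon signed-rank comparisons. The sketch below is illustrative only, using randomly generated ranks for three models in place of the paper's actual results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset ranks for three models on 36 small datasets
rng = np.random.default_rng(0)
h2o_gbm = rng.integers(1, 15, size=36)
resnet = rng.integers(1, 15, size=36)
tabnet = rng.integers(5, 20, size=36)

# Friedman test: are the models' rank distributions distinguishable
# across the same set of datasets?
stat, p_value = stats.friedmanchisquare(h2o_gbm, resnet, tabnet)
print(f"Friedman chi2 = {stat:.2f}, p = {p_value:.4f}")

# Pairwise follow-up, e.g. Wilcoxon signed-rank for one pair of models
w_stat, w_p = stats.wilcoxon(h2o_gbm, resnet)
print(f"Wilcoxon (H2O-GBM vs ResNet): W = {w_stat:.1f}, p = {w_p:.4f}")
```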
Communication
  • ✅ Consistent Table Structure: The table structure is consistent with previous performance tables (Table 3 and 4), using the same column headers (# Best, Average Rank, Median Rank, # in Top 3 Models) and model grouping (TE, DL). This consistency aids readability and comparison across different dataset subsets.
  • ✅ Clear Definition of Data Subset: The caption clearly specifies the subset of data being analyzed: 'small datasets (< 1000) rows'. The reference text further clarifies that this subset comprises 36 datasets.
  • ✅ Standard Terminology and Ranking: Model names and group abbreviations (TE, DL) are standard and should be familiar to the target audience. The ranking by '# Best' provides an immediate understanding of top performers within this subset.
  • ✅ Efficient Presentation for Subset Analysis: The table effectively presents a focused analysis on small datasets, allowing for quick comparison of how TE and DL models perform under this specific condition.
  • 💡 Clarification of Ranking Metric Basis for Subset: Similar to previous tables, the specific performance metric(s) (e.g., RMSE, accuracy) used to determine '# Best' and ranks for these 36 small datasets is not explicitly stated in the table or its immediate caption. Adding this detail would enhance the table's self-containedness.
  • ✅ Consistency with Textual Summary: The text mentions H2O leading the chart, followed by ResNet. The table shows H2O-GBM (a TE model) leading with 6 '# Best'. ResNet (DL) follows with 5 '# Best'. This is consistent. The statement that H2O led the chart could be more specific (H2O-GBM).
Table 6: Coefficients of a logistic regression for predicting the probability...
Full Caption

Table 6: Coefficients of a logistic regression for predicting the probability that DL outperforms ML.

Figure/Table Image (Page 8)
Table 6: Coefficients of a logistic regression for predicting the probability that DL outperforms ML.
First Reference in Text
We now present the results of the prediction of whether ML or DL will perform better in each dataset. Table 6 presents the coefficients of a logistic regression for this classification task.
Description
  • Purpose of the Logistic Regression Model: Table 6 displays the results of a logistic regression model. This type of model is used to predict the probability of a binary outcome – in this case, whether Deep Learning (DL) models are likely to outperform traditional Machine Learning (ML) models on a given dataset. The table lists various characteristics of datasets (Variables) and their associated 'Coefficients', 'Standard Errors', 'z-values', and 'P-values' (P>|z|).
  • Interpretation of Coefficients: A 'Coefficient' in logistic regression indicates the change in the log-odds of the outcome (DL outperforming ML) for a one-unit increase in the predictor variable, holding other variables constant. A positive coefficient suggests that an increase in the variable increases the likelihood of DL outperforming ML, while a negative coefficient suggests the opposite (a worked conversion to odds ratios appears after this list).
  • Statistical Terms Explained: The 'Standard Error' measures the variability of the coefficient estimate. The 'z-value' is the coefficient divided by its standard error, used to test if the coefficient is significantly different from zero. The 'P-value' (P>|z|) indicates the probability of observing the data if the coefficient were truly zero; a small p-value (typically < 0.05) suggests the coefficient is statistically significant.
  • Intercept Value: The 'Intercept' has a coefficient of -0.8751 (p=0.0002), indicating the baseline log-odds of DL outperforming ML when all other predictor variables are zero.
  • Impact of Task Type (Classification/Regression): The variable 'Classification/Regression' has a positive coefficient of 0.5563 (p=0.0317). This suggests that classification tasks (presumably coded as 1, with regression coded as 0) have higher odds of DL outperforming ML than regression tasks.
  • Impact of Kurtosis: 'Kurtosis', a measure of the 'tailedness' of a feature's distribution, has a positive coefficient of 0.8975 (p=0.0244). This implies that datasets with higher average kurtosis in their features are more likely to see DL models outperform ML models.
  • Impact of Feature-Target Correlation: 'Average Pearson to Target Feature' has a positive coefficient of 0.5247 (p=0.0787). This suggests a borderline significant positive association: datasets where features have a higher average Pearson correlation with the target variable might favor DL models.
  • Impact of PCA Components: 'PCA' (Principal Component Analysis, likely representing the number of components to explain variance) has a positive coefficient of 0.4624 (p=0.0808). This suggests a borderline significant positive association: datasets requiring more PCA components might favor DL models.
  • Non-Significant Variables: Variables like 'Row count' (coefficient -0.0195, p=0.9727) and 'Columns after One Hot Encoding' (a technique to convert categorical data into a numerical format; coefficient -0.5745, p=0.1457) do not show a statistically significant relationship with DL outperforming ML at the conventional α=0.05 level.
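For readers less used to log-odds, the coefficients above can be translated into odds ratios via exponentiation. The short example below uses the coefficient values quoted in this description; the interpretation in the final comment is the standard reading of an odds ratio, not an additional result from the paper.

```python
import math

# Coefficients as quoted from Table 6 in the description above
coefficients = {
    "Classification/Regression": 0.5563,
    "Kurtosis": 0.8975,
    "Average Pearson to Target Feature": 0.5247,
    "PCA": 0.4624,
}

for name, beta in coefficients.items():
    odds_ratio = math.exp(beta)
    print(f"{name}: coefficient={beta:+.4f} -> odds ratio={odds_ratio:.2f}")

# e.g. exp(0.5563) ~= 1.74: a classification task (vs. regression) multiplies the
# odds that DL outperforms ML by roughly 1.74, other variables held fixed.
```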
Scientific Validity
  • ✅ Appropriate Methodological Choice: Using logistic regression to model the probability of a binary outcome (DL outperforming ML vs. not) based on dataset characteristics (meta-features) is an appropriate methodological choice for this type of meta-analysis.
  • ✅ Inclusion of Significance Indicators: The inclusion of standard errors, z-values, and p-values allows for an assessment of the statistical significance of each predictor, which is crucial for interpreting the model.
  • ✅ Identification of Significant Predictors: The identification of 'Classification/Regression' and 'Kurtosis' as statistically significant predictors (at α=0.05) provides concrete, data-driven insights into factors that might influence the relative performance of DL vs. ML models. 'Average Pearson to Target Feature' and 'PCA' are borderline significant (p < 0.10) and could also be considered indicative.
  • 💡 Potential Concern: Events Per Variable (EPV): The number of observations (datasets, N=111) used to fit this logistic regression model is not explicitly stated in the table but is mentioned earlier in the paper. With around 20 predictor variables, the events per variable (EPV) ratio might be a concern if the number of instances where DL outperforms ML is small. This could affect model stability and generalizability. Authors should consider this or report on model diagnostics.
  • 💡 Considerations for Predictor Interpretation and Multicollinearity: The interpretation of coefficients for variables like 'Columns after One Hot Encoding' or 'Rows over Columns after One Hot Encoding' needs care, as these are derived features. Multicollinearity among predictors could also influence coefficient estimates and their standard errors. While not evident from the table alone, it's a standard consideration in regression modeling (a sketch of a variance-inflation-factor check follows this list).
  • 💡 Missing Overall Model Fit Statistics: The table presents coefficients, but measures of the overall model fit (e.g., pseudo-R-squared, Hosmer-Lemeshow test, AUC of this meta-model) are not included in this table. The text mentions an AUC of 0.78 for a different H2O-DL model on a restricted dataset for a similar task, and an AUC of 0.68 for a logistic regression on that restricted set. The overall fit of this specific logistic regression model (on all 111 datasets if that's the case) would be valuable information.
  • ✅ Reasonable Selection of Predictor Variables: The selection of meta-features as predictors seems reasonable, covering various aspects of dataset size, dimensionality, feature characteristics, and feature-target relationships.
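A straightforward way to screen for multicollinearity is to compute variance inflation factors (VIFs) for the meta-features. The sketch below is illustrative only: the meta-feature values are randomly generated stand-ins, and the column names merely echo a few predictors from Table 6.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical meta-feature matrix: one row per dataset, one column per predictor
rng = np.random.default_rng(42)
meta_features = pd.DataFrame({
    "row_count": rng.lognormal(8, 2, size=111),
    "columns_after_ohe": rng.lognormal(3, 1, size=111),
    "kurtosis": rng.normal(5, 3, size=111),
    "avg_pearson_to_target": rng.uniform(0, 1, size=111),
})

X = sm.add_constant(meta_features)  # VIF is computed against a model with an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # rule of thumb: VIF > 5-10 flags problematic collinearity
```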
Communication
  • ✅ Clear and Standard Structure: The table is well-structured with standard columns for logistic regression output: Variable, Coefficient, Std Error, z-value, and P>|z|. This makes it easy for those familiar with regression outputs to interpret.
  • ✅ Descriptive Variable Names: Variable names are generally descriptive (e.g., 'Row count', 'Kurtosis', 'Average Entropy'). Some, like 'Cancor', might be less universally known without prior context from the meta-feature description (Table 8 in the paper, though not provided here). 'PCA' is standard.
  • ✅ Clear Indication of Statistical Significance: The inclusion of p-values (P>|z|) allows for quick identification of statistically significant predictors at conventional alpha levels (e.g., 0.05).
  • ✅ Informative Caption: The caption clearly states the purpose of the logistic regression model and what the coefficients represent (predicting the probability that DL outperforms ML).
  • ✅ Appropriate Numerical Precision: The precision of the reported values (e.g., four decimal places for coefficients and p-values) is adequate for this type of analysis.
  • 💡 Explicit Definition of Positive Class: While the table is largely self-contained, a brief note defining the positive class (i.e., what outcome a positive coefficient predicts – presumably DL outperforming ML) in the caption could be helpful, though it's implied by the caption's wording.
  • 💡 Unconventional P-value Column Header: The p-value column header is rendered as 'P¿-z-', which is unconventional and likely a typesetting artifact. Standard notation is 'P>|z|' or 'p-value'; this should be corrected for clarity.
Figure 3: The effect of various factors on the probability that DL outperforms...
Full Caption

Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps are generated using the prediction of the logistic regression models. The scatter plot represents the actual observations of the datasets. (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance and number of rows.

Figure/Table Image (Page 9)
Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps are generated using the prediction of the logistic regression models. The scatter plot represents the actual observations of the datasets. (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance and number of rows.
First Reference in Text
Based on the predictions of the logistic regression model, we provide further insights into the most influential factors for TE versus DL performance. Fig. 3 presents heatmaps of four dataset's configurations and their influence on the probability that a DL model would outperform the ML model for a given dataset, including (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance.
Description
  • Overall Figure Structure and Content: Figure 3 consists of four panels (a, b, c, d), each displaying a heatmap. These heatmaps illustrate the predicted probability that Deep Learning (DL) models will outperform traditional Machine Learning (ML) models, based on a logistic regression model. The probability is shown as a color gradient, typically with reddish colors indicating higher probability and bluish colors lower probability. Overlaid on each heatmap are scatter points representing actual datasets: red points indicate datasets where DL outperformed ML, and blue points where ML outperformed DL (or performed equally well). A sketch of how such a heatmap can be generated from a fitted logistic regression is given after this list.
  • Panel (a): Columns vs. Rows: Panel (a) examines the impact of the 'Number of Columns' (features, on a logarithmic y-axis, roughly 1 to 100+) versus the 'Number of Rows' (samples, on a logarithmic x-axis, roughly 10 to 100,000+). The heatmap suggests a higher probability of DL outperforming ML for datasets with a small number of rows and a large number of columns. This effect appears to diminish as the number of rows increases. The probability scale seems to range from approximately 0.0 to 0.40. Most scatter points are blue, indicating ML often performs better.
  • Panel (b): Categorical vs. Numerical Features: Panel (b) shows the influence of '# Categorical Features' (features with discrete categories, on a logarithmic y-axis, roughly 0 to 100+) versus '# Numerical Features' (features with continuous values, on a logarithmic x-axis, roughly 0 to 100+). The heatmap indicates a higher probability of DL outperforming ML for datasets with a high number of categorical features, particularly when the number of numerical features is also high, or with a high number of numerical features when categorical features are few. The probability here can exceed 0.5 (up to ~0.9 according to the color bar), and there are more red scatter points in the higher probability regions.
  • Panel (c): X-Kurtosis vs. Rows: Panel (c) displays the effect of 'X-Kurtosis' (a statistical measure of the 'tailedness' of feature distributions, on a linear y-axis, roughly -2000 to 8000) versus '# Rows' (logarithmic x-axis). The heatmap suggests that higher X-Kurtosis values are associated with a higher probability of DL outperforming ML, relatively independent of the number of rows, although the effect seems stronger for fewer rows. The probability scale appears to range from about 0.2 to 0.8.
  • Panel (d): PCA Components vs. Rows: Panel (d) illustrates the role of 'PCA Components' (number of principal components needed to retain 99% variance, a measure of intrinsic dimensionality, on a logarithmic y-axis, roughly 0 to 100+) versus '# Rows' (logarithmic x-axis). The heatmap indicates a higher probability of DL outperforming ML for datasets with a larger number of PCA components, especially when the number of rows is small. The probability scale seems to range from about 0.2 to 0.8.
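For readers who want to reproduce this style of plot, the sketch below evaluates a fitted logistic regression over a 2-D grid of two meta-features and overlays the observed outcomes, in the spirit of panel (a). All data, variable names (X_meta, y_meta, meta_clf), and axis choices are hypothetical stand-ins, not the authors' pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Hypothetical meta-features: [log10(# rows), log10(# columns)] per dataset, and a
# 0/1 label indicating whether DL outperformed ML on that dataset.
rng = np.random.default_rng(0)
X_meta = np.column_stack([rng.uniform(1, 5, 111), rng.uniform(0.3, 2.5, 111)])
y_meta = (rng.random(111) < 0.3).astype(int)

meta_clf = LogisticRegression().fit(X_meta, y_meta)

# Evaluate predicted P(DL outperforms ML) on a grid of the two meta-features
rows_grid, cols_grid = np.meshgrid(np.linspace(1, 5, 200), np.linspace(0.3, 2.5, 200))
grid = np.column_stack([rows_grid.ravel(), cols_grid.ravel()])
proba = meta_clf.predict_proba(grid)[:, 1].reshape(rows_grid.shape)

plt.pcolormesh(rows_grid, cols_grid, proba, cmap="coolwarm", shading="auto")
plt.colorbar(label="Predicted probability of DL outperforming ML")
# Overlay the actual outcomes, as in the scatter points of Figure 3
plt.scatter(X_meta[:, 0], X_meta[:, 1], c=y_meta, cmap="coolwarm", edgecolors="k", s=15)
plt.xlabel("log10(# Rows)")
plt.ylabel("log10(# Columns)")
plt.show()
```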
Scientific Validity
  • ✅ Appropriate Visualization of Logistic Regression Predictions: Visualizing the output of a logistic regression model (which predicts probabilities) as a heatmap is an appropriate technique to understand the influence of pairs of input variables (dataset characteristics) on the outcome (probability of DL outperforming ML).
  • ✅ Comparison of Predicted Probabilities with Actual Outcomes: Overlaying the actual outcomes (scatter points where DL did or did not outperform ML) on the predicted probability heatmap allows for a qualitative assessment of the logistic regression model's predictive capability in different regions of the feature space. It helps to see if high-probability regions indeed have more 'DL wins' and vice-versa.
  • ✅ Relevant Factor Combinations Explored: The selection of variable pairs for each panel (rows/cols, numerical/categorical features, kurtosis/rows, PCA/rows) seems to target potentially influential factors based on common understanding in machine learning and findings from Table 6 (e.g., kurtosis, PCA).
  • ✅ Visuals Support Textual Interpretation for Panel (a): The textual interpretation in Section 3.2 (page 7) regarding panel (a) states: 'for a small number of rows, increasing the number of columns results in a higher probability that the DL model would outperform an ML model. However, this effect decreases relatively quickly as the number of rows increases.' This is well supported by the heatmap in panel (a). The text also notes 'the probability does not increase over 0.5 which indicates no configuration found where DL models would outperform ML models, on average.' The heatmap in (a) visually supports this, with max probability around 0.4.
  • ✅ Visuals Support Textual Interpretation for Panel (b) regarding feature counts: The textual interpretation for panel (b) suggests 'a smaller number of rows and a higher number of columns increase the probability that DL models outperform ML models. Interestingly, this heatmap reveals configurations where the probability is higher than 0.5.' Panel (b) does show regions with probability above 0.5, particularly at high categorical and/or numerical feature counts. However, the number of rows does not appear on this panel's axes, so the row-count part of the claim must draw on the other panels; panel (b) itself only relates the two feature counts.
  • 💡 Dependence on Underlying Meta-Model Performance: The heatmaps are based on the logistic regression model's predictions. The reliability of the patterns shown in the heatmaps is therefore dependent on the goodness-of-fit and predictive power of that underlying logistic regression model. While Table 6 provides coefficients, the overall performance of that specific meta-model (e.g., its AUC on all 111 datasets) is not directly presented alongside these visualizations, which would add context to how much confidence to place in these predicted probability surfaces (a sketch of a cross-validated AUC evaluation follows this list).
  • 💡 Interpretation of Probability Threshold: The interpretation that 'DL models would outperform ML models, on average' when probability is < 0.5 (as stated for panel a in the text) is a specific interpretation of the probability threshold. The heatmaps themselves just show the predicted probability; the decision threshold for claiming one 'outperforms' on average isn't inherent to the heatmap.
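One way to supply that context would be a cross-validated AUC of the meta-model on all 111 datasets. The sketch below illustrates such an evaluation with scikit-learn; the meta-feature matrix and labels are randomly generated placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 111-dataset meta-feature matrix and the
# binary label "DL outperformed ML" (see the earlier heatmap sketch).
rng = np.random.default_rng(0)
X_meta = rng.normal(size=(111, 20))           # ~20 meta-features per dataset
y_meta = (rng.random(111) < 0.3).astype(int)  # imbalanced outcome

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc_scores = cross_val_score(model, X_meta, y_meta, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc_scores.mean():.2f} +/- {auc_scores.std():.2f}")
```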
Communication
  • ✅ Effective Multi-Panel Layout: The multi-panel layout (a, b, c, d) effectively presents different two-way interactions of dataset characteristics on the predicted probability. Each panel is clearly labeled.
  • ✅ Appropriate Visualization Choices (Heatmap with Scatter Overlay): The use of heatmaps to show the predicted probability from the logistic regression model is appropriate, providing a visual gradient of the effect. Overlaying scatter plots of actual dataset outcomes (DL performs better vs. ML performs better) allows for a direct comparison between model predictions and reality.
  • ✅ Clear Legends and Color Scales: The color scale for the heatmap (probability) is consistent across panels where applicable (though the actual range of probabilities shown differs), and the legend for scatter points (DL vs. ML) is clear. Axis labels are generally clear, though some y-axis labels like '# Cols' or '# Categorical Features' are abbreviated but understandable in context.
  • ✅ Informative Overall Caption: The caption provides a good overview of the figure's content and clearly identifies what each panel represents. It also explains the source of the heatmaps and scatter plots.
  • ✅ Appropriate Use of Logarithmic Scales: Logarithmic scales are used for axes representing counts (rows, columns, features, PCA components) in panels (a), (b), and (d), which is appropriate for data spanning several orders of magnitude. Panel (c) uses a linear scale for X-Kurtosis, which seems appropriate given its range.
  • 💡 Precision of Color Bar Range Across Panels: The color bar label 'Predicted Probability of DL Outperforming ML' is clear. However, the exact numerical range of this probability seems to vary slightly between panels (e.g., panel (a) up to ~0.4, panel (b) up to ~0.9). While the color mapping is relative within each panel, ensuring the color bar accurately reflects the specific range displayed in each panel, or using a common scale if intended, would enhance precision.
  • 💡 Scatter Point Visibility in Dense Regions: The scatter points are somewhat small and can be dense in certain regions, potentially obscuring some individual points or the underlying heatmap in those areas. Slightly larger points or using transparency might improve visibility without excessive clutter.
Table 7: Comparison of model performance on original and sampled datasets
Figure/Table Image (Page 10)