This paper addresses the tendency of deep learning (DL) models to underperform traditional machine learning (ML) methods, particularly tree-based ensembles, on tabular datasets (data organized in rows and columns). The authors conducted a comprehensive benchmark involving 111 datasets and 20 models, covering both regression (predicting a continuous value) and classification (predicting a category) tasks. The datasets varied in size and included those with and without categorical features (features that take values from a limited set of categories).
The benchmark results confirmed the general superiority of ML models, especially tree-based ensembles like CatBoost and LightGBM, on tabular data. However, the study also found that DL models performed better on certain types of datasets. A key contribution of the paper is the development of a predictive model that can identify scenarios where DL models are likely to outperform traditional ML models, achieving 86.1% accuracy. This predictive model was built using a meta-learning approach, where each dataset was represented by a set of characteristics ('meta-features'), such as the number of rows and columns, statistical properties like kurtosis (a measure of how 'tailed' a distribution is), and relationships between features and the target variable.
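To make the meta-learning setup more concrete, the following is a minimal sketch of how such meta-features could be computed for a single dataset. The function name and the specific feature set are illustrative assumptions, not the authors' implementation; their full meta-feature list is given in their Table 8.

```python
import pandas as pd
from scipy.stats import kurtosis

def extract_meta_features(X: pd.DataFrame, y: pd.Series) -> dict:
    """Compute a handful of illustrative meta-features for one dataset."""
    numeric = X.select_dtypes(include="number")
    meta = {
        "n_rows": len(X),
        "n_cols": X.shape[1],
        "n_numeric": numeric.shape[1],
        "n_categorical": X.shape[1] - numeric.shape[1],
        # Average kurtosis across numeric features ("tailedness").
        "x_kurtosis": float(kurtosis(numeric, axis=0, nan_policy="omit").mean()),
    }
    if pd.api.types.is_numeric_dtype(y):
        # Feature-target relationship; only meaningful for a numeric target.
        meta["mean_abs_corr_with_y"] = float(numeric.corrwith(y).abs().mean())
    return meta
```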
The meta-analysis revealed that DL models tend to perform better on smaller datasets with a larger number of columns, and on datasets with high kurtosis values. The study also found that DL models have a relative advantage in classification tasks compared to regression tasks. These findings were supported by both a logistic regression model and a symbolic regression model, which generated an explicit formula for predicting DL outperformance based on dataset characteristics. To further investigate the impact of dataset size, the authors conducted an experiment where they downsampled larger datasets to 1000 training samples. While some DL models improved in ranking on these smaller datasets, tree-based ensembles generally continued to dominate.
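As a rough illustration of the logistic-regression side of this meta-analysis, a minimal sketch might look like the following. It assumes a meta-dataset with one row per benchmark dataset, a binary `dl_wins` target, and hypothetical meta-feature column names; the authors' actual model specification may differ.

```python
import pandas as pd
import statsmodels.api as sm

def fit_meta_logit(meta_df: pd.DataFrame):
    """Fit a logistic regression predicting whether DL outperforms ML.

    meta_df: one row per benchmark dataset, with meta-feature columns and a
    binary 'dl_wins' target (1 if the best DL model beat the best ML model).
    Column names are illustrative assumptions.
    """
    features = ["n_rows", "n_cols", "x_kurtosis", "is_classification"]
    X = sm.add_constant(meta_df[features])
    model = sm.Logit(meta_df["dl_wins"], X).fit(disp=0)
    # model.params and model.pvalues correspond to the kind of coefficient
    # table reported in the paper's Table 6.
    return model
```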
The authors discuss their findings in the context of previous benchmarks, noting that their results generally align with prior observations about the relative performance of ML and DL models on tabular data. They acknowledge that dataset size is just one of many factors influencing model performance and emphasize the importance of considering other dataset characteristics, such as kurtosis and task type, when choosing between ML and DL models. They also highlight the strong performance of AutoGluon, an automated ML framework that combines both ML and DL models, suggesting its potential as a robust solution for tabular data problems.
This comprehensive benchmark significantly advances our understanding of machine learning (ML) and deep learning (DL) model performance on tabular data. By evaluating 20 diverse models across 111 datasets, the study confirms the general superiority of traditional ML, especially tree-based ensembles, while also identifying specific conditions where DL models can excel. The development of a predictive model, achieving 86.1% accuracy in identifying scenarios favoring DL, is a key contribution, offering practitioners a valuable tool for informed model selection.
While the benchmark's breadth is a strength, the diversity of datasets might introduce noise and limit the depth of analysis within specific data types. Additionally, the focus on accuracy, while common, should be interpreted cautiously, especially for potentially imbalanced datasets where other metrics like AUC or F1-score offer a more robust evaluation. Future work could explore feature engineering's impact and extend the benchmark to other tabular data tasks like time-series analysis.
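A toy example (purely illustrative, not drawn from the paper) of why accuracy alone can mislead on imbalanced data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# A classifier that always predicts the majority class looks strong on
# accuracy but collapses on F1; ranking metrics such as AUC likewise
# expose the lack of discriminative power.
y_true = np.array([0] * 95 + [1] * 5)   # 95% negatives, 5% positives
y_pred = np.zeros_like(y_true)          # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.95
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```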
Despite these limitations, the study provides valuable insights into the complex interplay of dataset characteristics and model performance. The identification of task type (classification vs. regression) and kurtosis as statistically significant predictors of DL's relative success offers actionable guidance. The mechanistic explanations offered for these findings are plausible but warrant further investigation before the proposed mechanisms can be confirmed. Overall, this benchmark establishes a strong foundation for future research and provides practitioners with a data-driven approach to model selection for tabular data, moving beyond the 'safe bet' of traditional ML towards a more nuanced understanding of when DL can offer superior performance.
The abstract effectively communicates the existing challenge of Deep Learning (DL) underperformance on tabular data compared to traditional Machine Learning (ML) methods. It clearly states the study's objective to conduct a comprehensive benchmark to better characterize the types of datasets where DL models excel, providing immediate context and purpose for the research.
The abstract highlights the extensive nature of the benchmark, involving 111 datasets and 20 different models, covering both regression and classification tasks, and datasets with varying scales and presence of categorical variables. This methodological breadth is a significant strength, suggesting that the findings will be robust and well-supported by diverse evidence.
A key contribution presented in the abstract is the development of a predictive model that can determine scenarios where DL models outperform alternatives, achieving a specific accuracy (86.1%) and AUC (0.78). This is a tangible and novel outcome that offers practical value to researchers and practitioners.
The abstract mentions presenting insights from the characterization. To enhance immediate impact and reader understanding, briefly hinting at the nature of these insights (e.g., related to dataset characteristics like size or statistical properties, or task types) could be beneficial. This would provide a slightly more concrete expectation of the paper's findings without significantly increasing length. This suggestion has a low to medium potential impact, primarily aiming to improve clarity and reader engagement by making the nature of the contributions more explicit upfront.
Implementation: Consider revising the sentence: "We present insights derived from this characterization and compare these findings to previous benchmarks." to something like: "We present insights derived from this characterization, such as the influence of dataset scale and task type on model performance, and compare these findings to previous benchmarks." This addition should be kept concise to fit the abstract's brevity.
The introduction effectively establishes the central theme: the general superiority of ML over DL for tabular data, while also acknowledging that this is not universally true. This immediately clarifies the research problem and motivates the study.
The introduction efficiently summarizes key findings from previous significant benchmarking studies, such as those by Arik and Pfister [11] on TabNet and Grinsztajn et al. [1]. This provides essential background and context for the current research, demonstrating an awareness of the existing literature and setting the stage for the paper's contributions.
The introduction effectively summarizes prior work but could more explicitly transition to the current study's specific contributions or the gap it aims to fill. While the abstract covers this, a brief statement at the end of the introduction would strengthen its role as a lead-in to the paper's main body. This would have a medium impact by enhancing narrative cohesion and immediately clarifying how this paper builds upon or diverges from the cited studies. This suggestion is appropriate for the introduction as it sets the stage for the rest of the paper.
Implementation: After summarizing the findings of Grinsztajn et al. [1] (e.g., "They also suggest that DL models are challenged by uninformative features."), add a sentence such as: "Building on these findings, and as detailed in the abstract, this paper introduces a more comprehensive benchmark to further investigate these nuances and identify specific dataset characteristics where DL models may hold an advantage, as outlined in the subsequent sections."
The paper clearly outlines its strategy for dataset selection, emphasizing diversity in size, domain, task type (regression/classification), presence of categorical features, and difficulty. This breadth is crucial for the generalizability of the benchmark findings. The inclusion of Table 1, comparing dataset characteristics with previous studies, effectively positions the current work and highlights its comprehensiveness.
The selection of 20 models, encompassing various tree-based ensembles, deep learning architectures (including AutoDL solutions), and other classical ML algorithms, provides a wide spectrum for comparison. Referencing the appendix for detailed model configurations is appropriate for maintaining conciseness in this section while ensuring reproducibility.
The evaluation strategy is well-defined, employing standard metrics for regression (RMSE, MAE, R2) and classification (Accuracy, AUC, F1 score) tasks, along with a 10-fold cross-validation approach. This adherence to established practices enhances the reliability and comparability of the results.
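For reference, a minimal sketch of such an evaluation loop could look like the following; the model and dataset here are arbitrary stand-ins, not the paper's setup, and the classification metrics (accuracy, AUC, F1) would use the analogous scikit-learn scorer names.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# 10-fold cross-validation with several regression metrics, mirroring the
# protocol described above (RMSE, MAE, R2).
X, y = load_diabetes(return_X_y=True)
scores = cross_validate(
    GradientBoostingRegressor(random_state=0),
    X, y, cv=10,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
)
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```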
The methodology for the meta-analysis profiling is formally and clearly described. The construction of the meta-dataset, including the definition of meta-features (referencing Table 8) and the binary target variable based on model performance, is well-explained, setting a solid foundation for the subsequent analysis aimed at predicting DL versus ML model superiority.
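A minimal sketch of how the binary target could be derived for one dataset is shown below. The model groupings and frame layout are illustrative assumptions; only the labeling rule (best model by RMSE for regression, AUC for classification) follows the paper's description.

```python
import pandas as pd

DL_MODELS = {"TabNet", "NODE", "MLP"}            # illustrative grouping
ML_MODELS = {"CatBoost", "LightGBM", "XGBoost"}  # illustrative grouping

def dl_outperforms(results: pd.DataFrame, metric: str, higher_is_better: bool) -> int:
    """Binary meta-target for one dataset.

    `results` is assumed to have columns ['model', metric] with one row per
    model evaluated on that dataset. Returns 1 if the single best model
    belongs to the DL group, 0 otherwise.
    """
    best_row = results.loc[results[metric].idxmax() if higher_is_better
                           else results[metric].idxmin()]
    return int(best_row["model"] in DL_MODELS)
```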
The 'Datasets' subsection (2.1) details criteria for dataset inclusion like size and categorical features, but it does not explicitly state how missing values within the selected datasets were handled prior to model training (e.g., imputation, removal of samples/features, or if models were expected to handle them). This information is crucial for reproducibility and understanding potential biases, as different missing data strategies can significantly impact model performance. Clarifying this would be a high-impact improvement for methodological completeness. This detail is fundamental to the experimental setup as it impacts the input data for all models.
Implementation: Add a sentence or two in Section 2.1 specifying the general strategy for handling missing values across the 111 datasets. For example: 'Datasets with missing values underwent [specific imputation method, e.g., mean/median imputation for numerical features and mode imputation for categorical ones] prior to model application,' or 'Models were selected/configured to handle missing data internally where applicable, and datasets with missing values were included as is if the models supported this.' If different strategies were used based on dataset or model, a brief summary would be helpful.
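For illustration only, one such strategy could be implemented as a preprocessing step along these lines; this is an assumption about a reasonable default, not the authors' documented preprocessing.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Median imputation for numerical columns and most-frequent imputation for
# categorical ones, applied before model training (assumed strategy).
def make_preprocessor(numeric_cols, categorical_cols):
    return ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ])
```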
Section 2.4 states the target column (Y_bar) for the meta-learning dataset is determined by whether an ML or DL model has the best performance based on 'RMSE for regression, and AUC for classification.' Given that multiple performance metrics are collected for each task (as per Section 2.3), a brief justification for selecting only RMSE and AUC for this critical binary labeling in the meta-analysis would strengthen the methodological rigor. This clarification is of medium impact, as the choice of these specific metrics directly influences the meta-learning outcome and the interpretation of when DL 'outperforms' ML. This detail belongs in Section 2.4 where the meta-dataset construction and target variable definition are described.
Implementation: Add a sentence in Section 2.4 after specifying RMSE and AUC as the basis for determining the best performing model group for the meta-learning target. For instance: 'RMSE and AUC were chosen as the definitive metrics for this binary labeling due to their widespread adoption in evaluating overall predictive power in regression and ranking quality in classification, respectively, providing a consistent basis for comparison across diverse datasets.' or a similar statement reflecting the actual rationale.
The results section effectively uses tables (Table 3, 4, 5) to clearly and comprehensively present model rankings across different dataset sizes and model groups. Multiple metrics (# Best, Average Rank, Median Rank, # in Top 3 Models) are provided, offering a multifaceted view of performance.
The meta-analysis profiling is robust, employing multiple analytical techniques including logistic regression (Table 6), a dedicated H2O-DL predictive model, symbolic regression (Eq. 2), and insightful visualizations (Fig. 3 heatmaps) to explore factors influencing DL versus ML performance.
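A sketch of how heatmaps like those in Figure 3 could be produced from a fitted logistic regression follows: vary two meta-features over a grid while holding the rest at their median values, and record the predicted probability that DL outperforms ML. The column names and the `meta_model` object are assumptions carried over from the earlier logistic-regression sketch.

```python
import numpy as np
import pandas as pd

def probability_grid(meta_model, meta_df, feat_x="n_rows", feat_y="n_cols", n=50):
    """Predicted P(DL outperforms ML) over a grid of two meta-features."""
    gx = np.linspace(meta_df[feat_x].min(), meta_df[feat_x].max(), n)
    gy = np.linspace(meta_df[feat_y].min(), meta_df[feat_y].max(), n)
    grid = pd.DataFrame([(x, y) for y in gy for x in gx], columns=[feat_x, feat_y])
    # Hold all remaining predictors at their median observed values.
    for col in meta_model.model.exog_names:
        if col not in grid.columns and col != "const":
            grid[col] = meta_df[col].median()
    grid["const"] = 1.0
    probs = np.asarray(meta_model.predict(grid[meta_model.model.exog_names]))
    return probs.reshape(n, n)  # rows index feat_y values, columns index feat_x
```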
The paper specifically investigates the impact of dataset size through dedicated analysis on small datasets (Table 5) and a controlled experiment involving down-sampling of larger datasets (Table 7). This provides nuanced insights into how data volume affects relative model performance.
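The down-sampling experiment described above could be reproduced in spirit with a step like the following; the function and data layout are illustrative assumptions.

```python
import pandas as pd

def downsample_train(train_df: pd.DataFrame, n: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Truncate a training split to at most n rows before model fitting.

    The test split is left untouched so that scores remain comparable with
    the full-data runs.
    """
    if len(train_df) <= n:
        return train_df
    return train_df.sample(n=n, random_state=seed)
```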
The findings from the meta-analysis are supported by statistical measures, such as p-values for logistic regression coefficients (Table 6) and performance metrics (AUC, accuracy, F1-score) for the predictive models, lending credibility to the conclusions drawn about when DL might outperform ML.
The term "computationally-attractive" used to describe PCA-based features in the discussion of Figure 3(d) is somewhat ambiguous. Clarifying why these features are considered attractive (e.g., due to dimensionality reduction leading to faster model training, or because they represent a denser encoding of information per feature) would enhance the precision of the interpretation. This is a low-impact suggestion aimed at improving terminological clarity within the results discussion, specifically concerning the interpretation of factors influencing DL model performance.
Implementation: In the sentence discussing Figure 3(d) on page 8, revise 'especially if these are more “computationally-attractive” like PCA-based features' to something like 'especially if these features are derived from dimensionality reduction techniques like PCA, which can be “computationally-attractive” due to the reduced feature space potentially leading to more efficient model training or concentrated information content.'
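The "PCA components necessary to maintain 99% of the variance" meta-feature referenced in Figure 3(d) can be computed roughly as follows; this is an illustrative sketch, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X_numeric: np.ndarray, threshold: float = 0.99) -> int:
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the given threshold."""
    pca = PCA().fit(X_numeric)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```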
The text on page 7, when discussing Figure 3 sub-figure (a), states: 'Notably, for all the explored configurations, the probability does not increase over 0.5 which indicates no configuration found where DL models would outperform ML models, on average.' However, the subsequent discussion of sub-figure (b) mentions: 'Interestingly, this heatmap reveals configurations where the probability is higher than 0.5.' This apparent contradiction needs to be resolved or clarified. It's important to specify if the 'on average' statement for sub-figure (a) is specific to the row/column count configuration alone, and how that relates to findings in other sub-figures. This clarification has a medium impact on ensuring consistent interpretation of the heatmap results. This belongs in the discussion of Figure 3.
Implementation: After the sentence '...no configuration found where DL models would outperform ML models, on average.' for sub-figure (a), add a clarifying statement. For example: 'This observation for sub-figure (a), which considers only the interplay of row and column counts, should be contrasted with sub-figure (b), where the interaction of row counts with numerical and categorical feature counts does reveal specific configurations where the predicted probability of DL outperformance exceeds 0.5.'
On page 5, the text discussing Table 5 (performance on small datasets) states, 'In this scenario, H2O led the chart...'. However, Table 5 itself clearly indicates 'H2O-GBM' as the leading model in the '# Best' category. This discrepancy between the textual description and the table content should be rectified to ensure accuracy and consistency in reporting. Assuming Table 5 is correct, the text should specify H2O-GBM. This is a medium-impact suggestion crucial for the correctness of reported findings. This belongs in the discussion of Table 5.
Implementation: On page 5, change the sentence 'In this scenario, H2O led the chart with 6/36 (16.6%)...' to 'In this scenario, H2O-GBM led the chart with 6/36 (16.6%)...', aligning the text with the model name provided in Table 5.
Figure 1: Critical difference diagram for regression tasks based on RMSE. The best performing model is AutoGluon, as lower RMSE scores indicate better performance.
Figure 2: Critical difference diagram for classification tasks based on accuracy. The best performing model is AutoGluon, as higher accuracy scores indicate better performance.
Table 5: Performance metrics of tree-based ensemble (TE) and DL models for small datasets (< 1000 rows).
Table 6: Coefficients of a logistic regression for predicting the probability that DL outperforms ML.
Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps are generated using the prediction of the logistic regression models. The scatter plot represents the actual observations of the datasets. (a) the impact of the number of columns and rows; (b) the influence of numerical and categorical feature counts; (c) the effect of X-kurtosis and row count; and (d) the role of PCA components necessary to maintain 99% of the variance and number of rows.