Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Matteo Silvestri, Fabrizio Silvestri, Flavio Giorgi, Gabriele Tolomei
arXiv: 2510.20351v1
Sapienza University of Rome

Overall Summary

Study Background and Main Findings

This study investigates a critical flaw in how Large Language Models (LLMs) are evaluated on structured, tabular data. The central research question is whether an LLM's high performance on common benchmarks (like the 'Titanic' or 'Adult Income' datasets) stems from genuine reasoning or from prior exposure to this data during its training, a phenomenon known as data contamination. The authors specifically aim to distinguish between two types of contamination: 'syntactic,' the rote memorization of exact data rows, and 'semantic,' a more subtle familiarity with the meaning of the data's features (e.g., knowing the typical relationships between 'age', 'occupation', and 'income').

To test this, the researchers developed a rigorous experimental framework. They designed two key tasks: a 'Completion Task,' where the model had to fill in missing values in a data record, and an 'Existence Task,' where it had to identify a real record from a set of slightly altered fakes. These tasks were applied to original datasets as well as two controlled variants: an 'obfuscated' version where meaningful column names were replaced with generic codes (e.g., 'age' becomes 'c01'), and a synthetic version that mimicked statistical properties but destroyed the relationships between columns. This design allowed them to isolate the effect of semantic meaning from raw data structure.
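To make the setup concrete, the following minimal sketch (illustrative only, not the authors' code; the function names, prompt wording, and toy record are assumptions) shows how an obfuscated variant and a Completion-task query could be built with pandas, using the paper's 'c01'-style column codes.

```python
import pandas as pd

def obfuscate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace meaningful column names with generic codes (e.g., 'age' -> 'c01')."""
    mapping = {col: f"c{i + 1:02d}" for i, col in enumerate(df.columns)}
    return df.rename(columns=mapping)

def completion_prompt(row: pd.Series, masked_cols: list) -> str:
    """Ask the model to fill in the masked attribute values of a single record."""
    visible = ", ".join(f"{c}={row[c]}" for c in row.index if c not in masked_cols)
    hidden = ", ".join(f"{c}=?" for c in masked_cols)
    return (f"Record: {visible}, {hidden}\n"
            "Select the option that correctly fills in the missing values.")

# Toy 'adult'-like record (values invented for illustration).
df = pd.DataFrame([{"age": 39, "occupation": "Adm-clerical", "income": "<=50K"}])
obf = obfuscate_columns(df)
print(completion_prompt(obf.iloc[0], masked_cols=["c03"]))
```

Under this reading, the Existence task would use the same record format but present one authentic row alongside several perturbed copies and ask the model to identify the authentic one.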

The findings provide a clear and compelling answer. Across all tested models (up to 32 billion parameters), there was no evidence of syntactic contamination; the models consistently failed the Existence Task, indicating they had not memorized specific data entries. However, on the Completion Task, models performed substantially above random chance (achieving up to 78% accuracy versus a 20% baseline) but only on well-known datasets where semantic cues were present. When these cues were removed in the obfuscated versions, performance plummeted to near-random levels. This demonstrates that the models' apparent competence is critically dependent on recognizing the meaning of the data, not on reasoning from its structure.

The study concludes that what is often measured as tabular reasoning is, in fact, a form of latent knowledge retrieval, or 'semantic familiarity.' This has profound implications for the validity of current LLM evaluation practices. The authors argue for a fundamental shift towards 'contamination-aware' evaluation protocols, such as using less-disseminated or semantically neutral datasets, to ensure that future benchmarks accurately measure a model's true inferential capabilities rather than its ability to recall information it has already seen.

Research Impact and Future Directions

This paper provides a methodologically robust and compelling demonstration that Large Language Models' performance on common tabular data benchmarks is significantly inflated by 'semantic contamination.' The core insight, that models remember the meaning of data schemas rather than the data itself, is a crucial contribution to the field. The study's primary strength lies in its elegant experimental design, which effectively isolates and measures this phenomenon, lending strong causal support to its conclusions.

The findings serve as a critical check on the prevailing narrative of emergent reasoning capabilities in LLMs. They show that what appears to be sophisticated inference can, in some contexts, be a form of sophisticated pattern matching based on prior knowledge. The most significant practical implication is that many existing benchmarks for tabular reasoning may be invalid. Researchers and developers relying on these benchmarks could be drawing false conclusions about their models' true abilities, potentially misdirecting research efforts and overstating model readiness for real-world data science tasks.

While the study is limited by the size of the models tested (up to 32B parameters), its conclusions are well-supported for this model class. The key unanswered question is whether these contamination patterns hold for much larger, state-of-the-art models, which might have greater capacity for verbatim memorization. Regardless, this work establishes a vital methodological foundation. It not only raises awareness of a subtle but pervasive problem but also provides the tools to diagnose it, urging the community to move towards more rigorous and valid evaluation standards for assessing the true reasoning abilities of AI systems.

Critical Analysis and Recommendations

Precise Framing of the Central Hypothesis (written-content)
The abstract excels at defining the specific mechanism under investigation, 'strong semantic cues', and immediately provides concrete examples like meaningful column names. This precision is a strength because it moves beyond a vague discussion of 'contamination' to propose a specific, testable hypothesis, effectively framing the paper's unique contribution from the outset.
Section: Abstract
Suggestion: Quantify Key Finding for Greater Impact (written-content)
The abstract states that performance 'sharply declines' but omits the specific numbers. Including a quantitative example (e.g., a drop from over 70% to below 20% accuracy) would provide a more concrete and powerful illustration of the effect's magnitude, making the findings more memorable and compelling for the reader.
Section: Abstract
Compelling Justification of Significance (written-content)
The introduction effectively links the technical problem of data contamination to the core scientific principles of reproducibility and validity. This framing is a strength because it elevates the work from a niche investigation to a crucial inquiry into the integrity of LLM evaluation, making a strong case for its importance.
Section: 1 Introduction
Suggestion: Foreshadow the Core Experimental Contrast (written-content)
The introduction mentions 'controlled testing' but misses an opportunity to briefly foreshadow the central experimental comparison between original and semantically obfuscated datasets. Adding this detail would provide a clearer methodological roadmap early on, strengthening the narrative link between the stated problem and the proposed investigation.
Section: 1 Introduction
Explicit and Persuasive Research Gap Statement (written-content)
The Related Work section culminates in an exemplary 'gap statement,' clearly articulating that while data contamination is a known issue, its specific manifestation in tabular data is systematically under-investigated. This is a strength because it persuasively carves out a distinct and valuable research space for the paper.
Section: 2 Related Work
Suggestion: Bridge the Gap to the Central Hypothesis (written-content)
The section successfully identifies the research gap in tabular data but could more explicitly link it to the paper's core hypothesis about 'semantic cues.' Mentioning that tabular benchmarks are especially vulnerable because they contain rich semantic information would create a more seamless narrative bridge from the literature to this paper's specific investigation.
Section: 2 Related Work
Precise and Well-Exemplified Definitions (written-content)
The section provides exceptionally clear definitions for 'syntactic' and 'semantic' contamination, grounded with concrete examples. This conceptual clarity is a major strength because this distinction is foundational to the paper's entire experimental design and argument.
Section: 3 Background and Problem Definition
Suggestion: Explicitly State the Central Hypothesis (written-content)
While the research questions are clear, the section would be strengthened by explicitly stating the study's central hypothesis: that contamination is primarily semantic, not syntactic. Formally stating this would provide a stronger argumentative anchor and create a more direct link to the experimental results.
Section: 3 Background and Problem Definition
Excellent Operationalization of Abstract Concepts (written-content)
The methodology is a key strength, successfully translating the abstract concepts of 'syntactic' and 'semantic' contamination into a concrete, measurable, and falsifiable experimental design. The use of dual tasks (Completion vs. Existence) and controlled dataset variants (Obfuscated vs. Synthetic) represents a high standard of scientific rigor that effectively isolates the variables of interest.
Section: 4 Methodology
Suggestion: Justify Key Experimental Parameter (written-content)
Methodological Limitation: Lack of Parameter Justification. The paper specifies a 20% rate for masking and perturbing data attributes but provides no rationale for this choice. Justifying this value, explaining how it balances task difficulty with retaining sufficient context, would strengthen the methodology's transparency and help readers assess the robustness of the experimental design; a minimal sketch of such a masking step is given below.
Section: 4 Methodology
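As an illustration of what such a justification would need to pin down, the sketch below shows one plausible way a fixed 20% rate could be applied per record; the rounding policy and the minimum of one masked attribute are assumptions, not details taken from the paper.

```python
import math
import random
import pandas as pd

MASK_RATE = 0.20  # rate stated in the paper; everything else here is an assumed reading

def mask_record(row: pd.Series, rate: float = MASK_RATE, seed: int = 0) -> pd.Series:
    """Hide roughly `rate` of a record's attributes, always masking at least one."""
    rng = random.Random(seed)
    n_mask = max(1, math.floor(len(row) * rate))
    out = row.copy()
    out[rng.sample(list(row.index), n_mask)] = "?"
    return out

row = pd.Series({"age": 39, "workclass": "Private", "education": "Bachelors",
                 "occupation": "Adm-clerical", "income": "<=50K"})
print(mask_record(row))  # with 5 columns, 20% rounds to exactly 1 hidden attribute
```

Even this toy version makes the trade-off visible: on narrow tables the same 20% hides a single attribute, while on wide tables it removes far more context.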
Clear and Direct Interpretation of Results (written-content)
The section excels at presenting results with exceptional clarity, directly and unambiguously answering the research questions posed earlier. This is a strength because the authors effectively synthesize the outcomes into a coherent narrative that makes the paper's conclusions highly persuasive and easy to follow.
Section: 5 Experiments
Data Provides Clear Support for Central Claims (graphical-figure)
The data in Table 1 provides compelling evidence for the paper's main arguments. The stark contrast between high accuracy on the Completion task for semantic datasets (e.g., 73% for 'adult') and near-random performance on non-semantic or obfuscated datasets (e.g., 13% for obfuscated 'adult'), combined with consistently low accuracy on the Existence task, strongly validates the paper's thesis.
Section: 5 Experiments
Suggestion: Report Statistical Variance (graphical-figure)
Methodological Limitation: Lack of Statistical Robustness Reporting. Table 1 reports single point estimates for accuracy without measures of variance such as standard deviations or confidence intervals. This limits interpretation because, without knowing how stable the results are across multiple runs, it is difficult to assess how much the findings are influenced by random noise in sampling; a sketch of how such intervals could be derived is given below.
Section: 5 Experiments
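For example, a percentile bootstrap over per-instance outcomes (a generic sketch with hypothetical counts, not something reported in the paper) would turn each point estimate into an interval:

```python
import numpy as np

def bootstrap_accuracy_ci(correct: np.ndarray, n_boot: int = 10_000,
                          alpha: float = 0.05, seed: int = 0):
    """Return the accuracy and a percentile-bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    # Resample instances with replacement and recompute accuracy on each resample.
    boot = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), lo, hi

# Hypothetical outcomes: 73 correct out of 100 Completion-task queries.
outcomes = np.array([1] * 73 + [0] * 27)
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting intervals of this kind (or standard deviations over repeated sampling of instances and features) would make it clear which gaps in Table 1 exceed sampling noise.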
Logical and Coherent Research Agenda (written-content)
The proposed future work is a strength because it is not a disparate list of ideas but a tightly-coupled extension of the identified limitations. The plan to evaluate larger models and expand the dataset scope creates a clear and compelling roadmap for subsequent research that builds directly on this paper's foundation.
Section: 6 Limitations and Future Work
Suggestion: Propose a Concrete Metric for Future Work (written-content)
The paper proposes categorizing datasets by their 'contamination level' but does not suggest how this might be quantified. Briefly proposing a potential metric, such as a 'Contamination Score' based on the performance gap between original and obfuscated versions, would make the future work section more methodologically insightful and the research agenda more tangible; a sketch of one such score is given below.
Section: 6 Limitations and Future Work
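One possible instantiation of this suggestion (an illustration, not a metric the paper defines) normalises the real-versus-obfuscated accuracy gap by the headroom above the random-guess baseline:

```python
def contamination_score(acc_real: float, acc_obf: float, baseline: float = 0.20) -> float:
    """Share of above-chance accuracy that vanishes once semantic cues are removed.

    Approaches 1.0 when all above-chance performance depends on semantic cues,
    and 0.0 when obfuscation leaves performance unchanged. Clipped to [0, 1]
    because obfuscated accuracy can dip below the chance level.
    """
    headroom = acc_real - baseline
    if headroom <= 0:
        return 0.0  # the model never beat chance, so there is nothing to attribute
    return min(1.0, max(0.0, (acc_real - acc_obf) / headroom))

# Using Table 1's example: qwen_32BQ on 'adult' drops from 0.73 (real) to 0.13 (obfuscated).
print(contamination_score(0.73, 0.13))  # -> 1.0: all above-chance accuracy is cue-dependent
```

A score of this form would let benchmark curators rank datasets by how much of their apparent difficulty is absorbed by semantic familiarity.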
Effective Synthesis of Core Contributions (written-content)
The conclusion is exceptionally well-crafted, concisely restating the key empirical finding (semantic contamination exists), the central insight (it's about meaning, not memorization), and the broader contribution (a call for methodological reform). This provides a powerful, self-contained summary of the entire work.
Section: 7 Conclusions
Suggestion: Concretize the Call to Action (written-content)
The conclusion calls for 'contamination-aware evaluation pipelines' but this phrase remains abstract. Briefly referencing a specific, actionable strategy derived from the paper's findings, such as prioritizing evaluation on less-disseminated or synthetic datasets, would make this call to action more concrete for researchers.
Section: 7 Conclusions

Section Analysis

5 Experiments

Non-Text Elements

Table 1
Full Caption

Table 1: Summary of contamination results across datasets and models. AC denotes accuracy in the Completion task, while AE denotes accuracy in the Existence task. Bold values indicate performances that are statistically significant above the random-guess baseline (α = 0.001).

First Reference in Text
As summarised in Table 1, our experiments reveal a clear divide between semantic and non-semantic datasets in terms of potential data contamination.
Description
  • Experimental Tasks and Metrics: The table measures the performance of several large language models on two tasks. The 'Completion task' (AC) tests if a model can correctly fill in missing values in a row of data, with accuracy measured against four incorrect options (a random guess would be 20% accurate). The 'Existence task' (AE) tests if a model can identify an authentic row of data from among four perturbed, or slightly altered, versions. This helps distinguish between understanding relationships in data versus rote memorization of exact rows.
  • Dataset Categories and Variants: The experiments use two main types of datasets. 'Semantic Datasets' like 'adult' and 'titanic' use human-readable column names (e.g., 'age', 'occupation') that provide contextual clues. 'Non-semantic Datasets' like 'gamma' use abstract numerical data without obvious real-world meaning. Each dataset is also tested in different forms: 'real' (the original data), 'like' (synthetic data that mimics statistical properties but not exact rows), and 'obf' (obfuscated, where meaningful names are replaced with generic codes like 'c01').
  • Performance on Semantic Datasets: For well-known semantic datasets like 'adult' and 'credit', the models show high accuracy on the Completion task (AC), far exceeding the 20% random-guess baseline. For instance, the 'qwen_32BQ' model achieves 73% accuracy on the 'adult' dataset and 78% on the 'credit' dataset. However, their performance on the Existence task (AE) for these same datasets is very low (e.g., 36% for 'adult'), suggesting the models have learned relationships between features but have not memorized the exact data rows.
  • Performance on Non-semantic and Obfuscated Datasets: When models are tested on non-semantic datasets ('gamma', 'synthetic'), their accuracy on the Completion task drops to levels near random guessing (around 18-25%). Similarly, when a semantic dataset like 'adult' is obfuscated (stripping it of meaningful names), the 'qwen_32BQ' model's accuracy plummets from 73% to 13%. This indicates that the models' high performance relies heavily on the semantic meaning embedded in the feature names and values, not on an inherent ability to reason about abstract tabular data.
Scientific Validity
  • ✅ Strong Experimental Controls: The experimental design is robust. The inclusion of non-semantic datasets as a negative control, and 'obfuscated' and 'like' variants as controls for different types of information leakage, allows for a clear disentanglement of semantic contamination from syntactic memorization or structural pattern learning. This strengthens the conclusion that semantic cues are the primary driver of inflated performance.
  • ✅ Clear Support for Central Claims: The data presented in the table provides compelling evidence for the paper's main arguments. The stark difference in Completion Accuracy (AC) between semantic and non-semantic datasets, combined with the consistently low Existence Accuracy (AE) across all conditions, strongly supports the conclusion that contamination is semantic in nature and does not involve verbatim memorization of instances.
  • 💡 Lack of Reported Variance: The table reports point estimates for accuracy scores from what appears to be a single experimental run. Without measures of variance, such as standard deviations or confidence intervals derived from multiple runs with different random seeds, it is difficult to assess the stability and statistical certainty of these results. Performance could be subject to noise from the random sampling of instances and features.
  • 💡 Ambiguity of the Random-Guess Baseline: The caption notes that bold values are significant above a 'random-guess baseline' but does not explicitly state its value. The text implies a 1-in-5 choice task, making the baseline 0.20. Stating this value directly in the caption (e.g., '...baseline of 0.20 (α = 0.001)') would make the table more self-contained and interpretation more immediate for the reader. A sketch of one such explicit significance test is given after this table analysis.
Communication
  • ✅ Effective Hierarchical Structure: The table is well-organized with a clear hierarchy: dataset type (Semantic/Non-semantic), specific dataset, variant (real/like/obf), and metric (AC/AE). This structure makes it easy for readers to make critical comparisons, such as contrasting 'real' vs. 'obf' for a given dataset, which is central to the paper's argument.
  • ✅ Good Use of Text Formatting: Defining the abbreviations 'AC' and 'AE' in the caption is good practice and essential for comprehension. Furthermore, using bold font to highlight statistically significant results allows the reader to quickly identify the key outcomes without needing to mentally compare each value to the baseline.
  • 💡 High Information Density Could Be Complemented Visually: The table is very dense, presenting 160 distinct results. While comprehensive, this makes it challenging to quickly grasp the main trends. A supplementary figure, such as a bar plot comparing AC scores for 'real', 'like', and 'obf' variants on the 'adult' and 'gamma' datasets, would provide a more intuitive and immediate visual summary of the core findings.
  • 💡 Inconsistent Numerical Alignment: The numbers in the columns are not consistently aligned by the decimal point. This minor formatting issue slightly hinders readability and the ease of scanning down columns to compare values. Aligning all numerical values to the decimal point would improve the table's professionalism and visual clarity.
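As a complement to the points above, the caption's significance criterion can be made fully reproducible with an explicit test. The sketch below uses a one-sided exact binomial test against the 0.20 chance level at α = 0.001; the choice of test and the counts are assumptions for illustration, since the paper does not state which procedure produced the bold values.

```python
from scipy.stats import binomtest

def above_chance(n_correct: int, n_total: int, baseline: float = 0.20,
                 alpha: float = 0.001) -> bool:
    """One-sided exact binomial test: is accuracy significantly above the chance level?"""
    result = binomtest(n_correct, n_total, p=baseline, alternative="greater")
    return result.pvalue < alpha

# Hypothetical: 73 correct out of 100 Completion-task queries vs. a 1-in-5 baseline.
print(above_chance(73, 100))  # True: far beyond what guessing among 5 options yields
print(above_chance(25, 100))  # False: indistinguishable from chance at alpha = 0.001
```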
