Evaluating Mathematical Reasoning in Large Language Models: The GSM-Symbolic Benchmark

Table of Contents

  • Overall Summary
  • Section Analysis
    • Abstract
    • Introduction
    • Related Work: Reasoning & Language Models
    • GSM-Symbolic
    • Experiments & Results
    • Conclusion
    • Appendix

Overall Summary

Overview

This study examines the mathematical reasoning capabilities of Large Language Models (LLMs) by introducing a new benchmark, GSM-Symbolic, derived from the GSM8K dataset. The research highlights the limitations of current LLMs, particularly their dependency on pattern matching rather than genuine logical reasoning. By using symbolic templates, the GSM-Symbolic benchmark allows for the generation of diverse math problems, facilitating a more nuanced evaluation of LLM performance. The study also introduces GSM-NoOp, a dataset that adds irrelevant information to problems to test whether LLMs can discern which details matter; the authors find that models often fold this irrelevant information into their calculations.

Significant Elements

figure 1

Description: Figure 1 illustrates the creation of symbolic templates used in GSM-Symbolic, showing how generic placeholders replace specific elements in math problems.

Relevance: This figure is crucial for understanding how GSM-Symbolic enables diverse problem generation and controlled evaluation.

figure 2

Description: Figure 2 displays the distribution of LLM performance on GSM-Symbolic, highlighting significant variability compared to GSM8K.

Relevance: It demonstrates the inconsistency in LLM performance and questions the reliability of single-point metrics.

Conclusion

The research sheds light on the limitations of current LLMs in mathematical reasoning, emphasizing their reliance on pattern matching rather than true understanding. By introducing GSM-Symbolic and GSM-NoOp, the study offers a more comprehensive framework for evaluating LLM capabilities. The findings underscore the need for models capable of formal reasoning, which is crucial for advancing AI applications in complex domains. Future research should focus on developing LLMs with improved logical reasoning skills and exploring alternative evaluation methods to better capture these abilities in real-world contexts.

Section Analysis

Abstract

Overview

This paper investigates the mathematical reasoning abilities of Large Language Models (LLMs) using a new benchmark called GSM-Symbolic, derived from the GSM8K dataset. The authors find that LLMs struggle with variations in numerical values within the same problem structure, and their performance degrades as problem complexity increases. They also introduce GSM-NoOp, a dataset with irrelevant information added to problems, revealing that LLMs often incorporate this irrelevant information into their calculations, suggesting a lack of true understanding of mathematical concepts. The study concludes that current LLMs rely more on pattern matching than genuine logical reasoning, especially in mathematics.

Introduction

Overview

Large Language Models (LLMs) have shown impressive abilities in areas such as language processing and creative tasks. However, whether they can truly reason logically, especially in fields like math and coding, remains an open question. Existing research suggests that LLMs may rely more on recognizing patterns from their training data than on actual understanding, which makes them sensitive to small changes in how questions are phrased. The GSM8K dataset is commonly used to test LLMs' math skills, but it has drawbacks: it provides only a single accuracy score, its problems may have leaked into training data (contamination), and it does not allow flexible testing of how LLMs handle different question variations or difficulty levels.

Related Work: Reasoning & Language Models

Overview

This section discusses existing research on the reasoning abilities of Large Language Models (LLMs). It highlights that while LLMs have shown potential in various domains, their reasoning capabilities are still uncertain. Studies exploring the computational aspects of transformers suggest that these models might have limitations in handling complex tasks and may benefit from additional memory mechanisms like scratchpads. However, it remains unclear whether LLMs can perform true logical reasoning. Several studies suggest that LLMs rely more on probabilistic pattern-matching than formal reasoning, making them sensitive to small changes in input and prone to errors in complex scenarios. This pattern-matching approach, while more advanced than simple memorization, still falls short of genuine logical reasoning.

GSM-Symbolic

Overview

This section introduces GSM-Symbolic, a new benchmark for evaluating the mathematical reasoning of Large Language Models (LLMs). It addresses the limitations of existing benchmarks like GSM8K by using symbolic templates to generate diverse question variations. This approach allows for more controlled experiments and provides more reliable metrics for assessing LLM performance. The section also describes the template generation process, which involves identifying variables, their domains, and conditions to ensure question and answer correctness. Finally, it outlines the experimental setup used in the paper, including the models evaluated, the evaluation process, and the dataset size.
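To make the template mechanism concrete, the sketch below shows one way such a template could be represented and sampled in Python. It is a minimal illustration assuming a dictionary of variable domains and a solvability condition; the problem text paraphrases the Sophie example from Figure 1, and names such as TEMPLATE, VARIABLES, and sample_instance are illustrative rather than the authors' actual generator code.

    import random

    # Paraphrase of the Figure 1 example, with the paper's placeholders.
    TEMPLATE = ("When {name} watches her {family}, she gets out a variety of toys for him. "
                "The bag of building blocks has {x} blocks, the bin of stuffed animals has {y} animals, "
                "and the tower of stacking rings has {z} rings. {name} recently bought a tube of bouncy balls, "
                "bringing her total number of toys up to {total}. How many bouncy balls came in the tube?")

    VARIABLES = {
        "name":   ["Sophie", "Olivia", "Emma"],     # proper-name slot
        "family": ["nephew", "cousin", "brother"],  # relationship slot
        "x": range(5, 100), "y": range(5, 100), "z": range(5, 100),
        "total": range(100, 500),
    }

    def satisfies_conditions(values):
        # Conditions keep the generated question well-posed: the tube must
        # contain a positive number of bouncy balls.
        return values["total"] > values["x"] + values["y"] + values["z"]

    def sample_instance():
        # Rejection-sample variable assignments until the conditions hold.
        while True:
            values = {k: random.choice(list(dom)) for k, dom in VARIABLES.items()}
            if satisfies_conditions(values):
                answer = values["total"] - (values["x"] + values["y"] + values["z"])
                return TEMPLATE.format(**values), answer

    question, answer = sample_instance()

Repeated calls to sample_instance yield many structurally identical but superficially different questions, which is what enables the distributional evaluation described in the Experiments & Results section.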

Non-Text Elements

figure 1

Figure 1 illustrates how the GSM-Symbolic template is created. It shows an example from the original GSM8K dataset alongside a corresponding template. The GSM8K example is a word problem about a person named Sophie and the number of toys she has for her nephew. The template generalizes this problem by replacing specific names and numbers with placeholders like {name}, {family}, {x}, {y}, {z}, and {total}. This allows for generating many similar problems with different names, numbers, and relationships between them, while keeping the underlying structure the same. The figure also shows the solution to both the original problem and the templated version.

First Mention

Text: "Figure 1: Illustration of the GSM-Symbolic template creation process."

Context: This dataset serves as a tool to investigate the presumed reasoning capabilities of LLMs, enabling the design of controllable mathematical reasoning evaluations with more reliable metrics. Our results reveal that all state-of-the-art LLMs exhibit significant performance variations, suggesting the fragility or lack of reasoning.

Relevance: This figure is crucial for understanding how GSM-Symbolic is constructed and how it enables more controlled experiments compared to GSM8K. It visually demonstrates the concept of templates and their use in generating diverse problem instances.

Critique
Visual Aspects
  • Use a more visually distinct style for the placeholders (e.g., a different font, color, or background) to make them stand out from the rest of the text.
  • Consider adding arrows or other visual cues to connect the placeholders in the template to the corresponding elements in the GSM8K example.
  • Use a larger font size for the text within the figure to improve readability.
Analytical Aspects
  • Provide a brief explanation of the symbols used in the template, such as the meaning of 'sample' and the '#variables' and '#conditions' sections.
  • Explain why certain elements are chosen to be variables (e.g., why 'name' and 'family' are variable, but the type of toys is not).
  • Explain how the conditions ensure the generated problems are solvable and at the appropriate difficulty level.

Experiments & Results

Overview

This section presents the main findings of the study on LLM mathematical reasoning. First, it examines the reliability of current GSM8K results by analyzing the performance distribution on GSM-Symbolic, revealing significant variations. It then investigates the fragility of LLM reasoning by comparing performance when changing names versus numbers in problems, finding LLMs more sensitive to numerical changes. The section also explores the impact of question difficulty (number of clauses) on performance, showing that accuracy decreases and variance increases with higher difficulty. Finally, it introduces GSM-NoOp, a dataset with irrelevant information added to problems, demonstrating that LLMs often incorporate this irrelevant information, leading to significant performance drops and suggesting a lack of true understanding of mathematical concepts.
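The distributional evaluation can be summarized in a few lines: score a model on many independently generated GSM-Symbolic sets and report the mean and standard deviation of accuracy. The sketch below assumes hypothetical generate_set and evaluate_model helpers; 50 sets mirrors the number used in the paper.

    import statistics

    def accuracy_distribution(generate_set, evaluate_model, num_sets=50):
        # Score one model on many independently generated benchmark sets.
        accuracies = [evaluate_model(generate_set()) for _ in range(num_sets)]
        return statistics.mean(accuracies), statistics.stdev(accuracies), accuracies

    # mean_acc, std_acc, accs = accuracy_distribution(generate_gsm_symbolic_set, evaluate_model)

Comparing mean_acc and std_acc against the single GSM8K number shows whether that score sits in the typical range of the distribution or at its right tail, which is the comparison drawn in Figure 2.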

Non-Text Elements

figure 2

Figure 2 shows the distribution of performance for several large language models (LLMs) on the GSM-Symbolic benchmark. Each histogram represents a different model and shows how often the model achieved certain accuracy levels across 50 different sets of GSM-Symbolic problems. The x-axis represents the accuracy achieved (as a percentage), and the y-axis represents the frequency (how many times that accuracy level was observed). A dashed vertical line marks the model's performance on the original GSM8K dataset. The average performance on GSM-Symbolic and its standard deviation are also shown for each model.
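A histogram of this kind can be reproduced with a few lines of matplotlib; the sketch below assumes accs holds a model's 50 per-set accuracies (in percent) and gsm8k_acc its single GSM8K score, both hypothetical inputs.

    import matplotlib.pyplot as plt

    def plot_accuracy_histogram(accs, gsm8k_acc, model_name):
        plt.hist(accs, bins=10)                                 # distribution over GSM-Symbolic sets
        plt.axvline(gsm8k_acc, linestyle="--", label="GSM8K")   # dashed line for the original benchmark
        plt.xlabel("Accuracy (%)")
        plt.ylabel("Frequency")
        plt.title(model_name)
        plt.legend()
        plt.show()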

First Mention

Text: "Figure 2: The distribution of 8-shot Chain-of-Thought (CoT) performance across 50 sets generated from GSM-Symbolic templates shows significant variability in accuracy among all state-of-the-art models."

Context: Furthermore, for most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line). Interestingly, the performance on GSM8K falls on the right side of the distribution, which, statistically speaking, should have a very low likelihood, given that GSM8K is basically a single draw from GSM-Symbolic.

Relevance: This figure is important because it shows how consistent (or inconsistent) the models are when answering slightly different versions of the same math problems. The spread of the histograms indicates the variability in performance, and the comparison to GSM8K performance suggests potential issues like data contamination.

Critique
Visual Aspects
  • Use a consistent color scheme for the histograms and the GSM8K lines across all subplots to improve visual coherence.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' to avoid ambiguity.
  • Increase the spacing between the subplots to reduce clutter and improve readability.
Analytical Aspects
  • Explain why 50 sets were chosen and how they were generated. Was it a random sampling? What parameters were varied?
  • Provide a clearer explanation of what the standard deviation represents in this context. A high school student might not be familiar with this concept.
  • Discuss the implications of the GSM8K performance often falling outside the typical range of GSM-Symbolic performance. Why is this surprising and what does it suggest?
figure 3

Figure 3 is a bar chart showing how much the performance of different LLMs drops when tested on GSM-Symbolic compared to their performance on the original GSM8K. Each bar represents a different model, and its length corresponds to the percentage drop in accuracy. A downward bar means performance decreased on GSM-Symbolic. The labels on each bar provide the exact percentage drop for each model.

First Mention

Text: "Figure 3: The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K."

Context: Later, we investigate the factors that impact the performance drops in more depth.

Relevance: This figure directly visualizes the performance degradation discussed in the text. It highlights the impact of using the more diverse and challenging GSM-Symbolic benchmark compared to the static GSM8K.

Critique
Visual Aspects
  • Order the bars from largest to smallest drop for easier comparison.
  • Add a horizontal line at 0% to clearly separate performance gains from drops.
  • Use a color gradient or different shades to visually represent the magnitude of the drop.
Analytical Aspects
  • Explain the implications of this performance drop. Does it suggest overfitting to GSM8K? Does it indicate a lack of generalization?
  • Connect this figure back to Figure 2 and discuss how the drop relates to the distribution of performance.
  • Discuss why some models show a larger drop than others. Are there architectural differences or training differences that might explain this?
figure 4

Figure 4 illustrates the sensitivity of Large Language Models (LLMs) to changes in names, numbers, or both within math word problems. It presents histograms showing the distribution of accuracy scores for six different LLMs across three conditions: changing only the names in the problem, changing only the numbers, and changing both names and numbers. Each histogram shows the frequency of different accuracy levels, allowing for a comparison of performance variability across the three conditions. The figure aims to demonstrate how these changes, while not affecting the underlying mathematical logic, can significantly impact the LLMs' ability to solve the problems.
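The three conditions can be produced from a single template by restricting which variable slots are resampled. The sketch below reuses the VARIABLES idea from the earlier template example; base_values stands in for the original GSM8K assignment and is illustrative, and a full generator would also re-check the solvability conditions after resampling numbers.

    import random

    def perturb(base_values, variables, change_names=True, change_numbers=True):
        values = dict(base_values)
        for key, domain in variables.items():
            is_name_slot = isinstance(next(iter(domain)), str)  # name/family slots vs numeric slots
            if (is_name_slot and change_names) or (not is_name_slot and change_numbers):
                values[key] = random.choice(list(domain))
        return values

    # names_only   = perturb(base_values, VARIABLES, change_names=True,  change_numbers=False)
    # numbers_only = perturb(base_values, VARIABLES, change_names=False, change_numbers=True)
    # both_changed = perturb(base_values, VARIABLES, change_names=True,  change_numbers=True)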

First Mention

Text: "Figure 4: How sensitive are LLMs when we change only names, only proper numbers, or both names and numbers?"

Context: Overall, models have noticeable performance variation even if we only change names, but even more when we change numbers or combine these changes.

Relevance: This figure is highly relevant as it directly addresses the research question of how fragile LLMs are to superficial changes (names) versus core changes (numbers) in mathematical reasoning problems. It provides evidence for the argument that LLMs are more sensitive to changes in numerical values than to changes in names, suggesting a potential over-reliance on pattern matching.

Critique
Visual Aspects
  • Use distinct colors or patterns for the histograms representing different change conditions to improve visual clarity and comparison.
  • Add a clear legend explaining the meaning of each color/pattern used for the change conditions.
  • Label the axes clearly with appropriate units (Accuracy (%) and Frequency).
  • Increase the font size of labels and legends to improve readability.
Analytical Aspects
  • Provide the average accuracy and standard deviation for each condition in the figure or caption to allow for a more quantitative comparison.
  • Discuss the implications of the observed differences in variance between name changes and number changes in more detail.
  • Connect the findings to the hypothesis that LLMs rely on in-distribution pattern matching, explaining how this hypothesis is supported by the observed sensitivity to numerical changes.
  • Consider adding a statistical test to quantify the significance of the observed differences in performance between the conditions.
figure 5

Figure 5 demonstrates how the difficulty of the GSM-Symbolic math problems is modified by changing the number of clauses. It shows four example problems, each representing a different difficulty level: GSM-Symbolic-M1 (minus one clause), GSM-Symbolic (original), GSM-Symbolic-P1 (plus one clause), and GSM-Symbolic-P2 (plus two clauses). Each problem is a word problem involving calculations, and the increasing difficulty is reflected in the addition of more conditions or steps required to solve the problem. This figure provides a concrete illustration of how the benchmark allows for controlled manipulation of problem complexity.
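As a rough illustration of the clause manipulation (with made-up clauses, not the paper's templates), the difficulty levels can be thought of as adding or removing one statement from a base problem, as in the sketch below.

    BASE_CLAUSES = [
        "A box holds {x} red pens and {y} blue pens.",
        "{name} removes {z} blue pens.",
    ]
    EXTRA_CLAUSES = [
        "She then adds {w} green pens.",                # appended for P1
        "Half of the green pens are then given away.",  # appended for P2 (a real generator's
                                                        # conditions would require {w} to be even)
    ]
    QUESTION = "How many pens are in the box now?"

    def build_question(level):
        # level: -1 (GSM-Symbolic-M1), 0 (GSM-Symbolic), 1 (P1), 2 (P2)
        clauses = BASE_CLAUSES[: len(BASE_CLAUSES) + min(level, 0)] + EXTRA_CLAUSES[: max(level, 0)]
        return " ".join(clauses + [QUESTION])

Each added clause introduces one more arithmetic step, which is how the benchmark operationalizes increasing difficulty.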

First Mention

Text: "Figure 5: Modifying the difficulty level of GSM-Symbolic by modifying the number of clauses."

Relevance: This figure is essential for understanding how the authors operationalize 'difficulty' in their experiments. By showing examples of problems with varying numbers of clauses, it clarifies the manipulation used to test the impact of complexity on LLM performance. This directly relates to the research question of how difficulty affects the performance distribution.

Critique
Visual Aspects
  • Use a consistent format and font size for all the example problems.
  • Highlight the added clauses in the P1 and P2 examples using bold text, color, or underlining to make them easily noticeable.
  • Consider adding a brief explanation of what constitutes a 'clause' in this context, as it might not be immediately clear to all readers.
Analytical Aspects
  • Provide a more detailed explanation of how the added clauses increase the difficulty of the problem. For example, explain the additional reasoning steps required or the increased cognitive load.
  • Explain why this method of manipulating difficulty (adding clauses) is appropriate for studying mathematical reasoning in LLMs.
  • Discuss any potential limitations of this approach. For example, does adding a clause always increase the difficulty linearly, or are there other factors that might influence the perceived difficulty level?
figure 6

This figure illustrates how increasing the complexity of a math problem, represented by the number of clauses, affects the performance of Large Language Models (LLMs). It uses histograms to show the distribution of accuracy scores for four different models across four difficulty levels: GSM-M1 (one clause removed), GSM-Symb (original complexity), GSM-P1 (one clause added), and GSM-P2 (two clauses added). As the problems become more complex (moving from M1 to P2), the histograms generally shift to the left, indicating lower accuracy. The spread of the histograms also tends to increase with complexity, suggesting greater variability in performance. Think of it like stacking blocks: it's easier to balance a small tower (M1) than a tall one (P2). The taller the tower gets, the more likely it is to wobble and fall (more variance).

First Mention

Text: "Figure 6: The impact of increasing the number of clauses on performance: As the difficulty increases from GSM-M1 → GSM-Symb→ GSM-P1 → GSM-P2, the distribution of performance shifts to the left (i.e., accuracy decreases), and the variance increases."

Context: As shown in Fig. 6, the trend of the evolution of the performance distribution is very consistent across all models: as the difficulty increases, the performance decreases and the variance increases. Note that overall, the rate of accuracy drop also increases as the difficulty increases. This is in line with the hypothesis that models are not performing formal reasoning, as the number of required reasoning steps increases linearly, but the rate of drop seems to be faster. Moreover, considering the pattern-matching hypothesis, the increase in variance suggests that searching and pattern-matching become significantly harder for models as the difficulty increases.

Relevance: This figure is central to the paper's argument that LLMs struggle with more complex reasoning tasks. It provides visual evidence of the performance degradation and increased variability as problem difficulty increases.

Critique
Visual Aspects
  • Use consistent colors for the same difficulty level across all model histograms to facilitate comparison.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' or 'Number of Samples'.
  • Add a brief explanation within the figure of what GSM-M1, GSM-Symb, GSM-P1, and GSM-P2 represent.
Analytical Aspects
  • Provide the exact number of clauses used in each difficulty level to quantify the complexity increase.
  • Discuss potential reasons why the variance increases with complexity, such as the accumulation of errors in multi-step reasoning.
  • Consider adding a statistical measure of variance (e.g., standard deviation) to each histogram or in a separate table.
figure 7

Figure 7 shows an example of how LLMs can be misled by irrelevant information. It presents a word problem from the GSM-NoOp dataset, where a seemingly relevant but ultimately unimportant detail is added (some kiwis being smaller). Two LLM responses are shown, both incorrectly incorporating the size of the kiwis into their calculations. This is like asking 'If you have 5 apples and 2 are green, how many apples do you have?' A person would understand that the color doesn't change the number of apples, but the LLMs seem to get confused by the extra detail.
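A short worked check makes the failure mode concrete. The numbers below are illustrative (in the spirit of the paper's kiwi example); the point is that the "smaller than average" detail is a No-Op and should not enter the arithmetic.

    friday, saturday = 44, 58
    sunday = 2 * friday              # "double the number picked on Friday"
    smaller_than_average = 5         # irrelevant detail: smaller kiwis are still kiwis

    correct_total = friday + saturday + sunday                                  # 190
    pattern_matched_total = friday + saturday + sunday - smaller_than_average   # 185: the typical model error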

First Mention

Text: "Figure 7: An example from the GSM-NoOp dataset: We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes."

Context: Fig. 7 illustrates an example from GSM-NoOp. An interesting observation is that models tend to blindly subtract the number of smaller fruits, potentially because their training datasets included similar examples that required conversion to subtraction operations. In the Appendix, we include additional failure cases from GSM-NoOp. Overall, we find that models tend to convert statements to operations without truly understanding their meaning. For instance, a common case we observe is that models interpret statements about “discount” as “multiplication”, regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. Consequently, as shown in Fig. 8a, there is a catastrophic performance decline across all tested models, with the Phi-3-mini model experiencing over a 65% drop, and even stronger models such as o1-preview showing significant declines.

Relevance: This figure supports the paper's argument that LLMs rely on pattern matching and struggle with true understanding. It demonstrates how irrelevant information can significantly impact their performance, suggesting a lack of genuine comprehension of the problem.

Critique
Visual Aspects
  • Highlight the irrelevant part of the problem text in a different color or style to emphasize its misleading nature.
  • Clearly label each LLM response with the model name.
  • Consider adding a correct solution alongside the incorrect ones to provide a clear contrast.
Analytical Aspects
  • Explain why the LLMs might have made the specific mistakes shown, relating it back to the pattern-matching hypothesis.
  • Provide some statistics on how often LLMs make similar errors on GSM-NoOp problems.
  • Discuss the implications of these findings for the reliability of LLMs in real-world applications where irrelevant information might be present.
figure 8

Figure 8 is a collection of bar charts demonstrating the performance drop of various large language models (LLMs) on the GSM-NoOp dataset, a modified version of GSM8K designed to assess how LLMs handle irrelevant information within math problems. (a) shows the general performance drop on GSM-NoOp across different models. (b) compares performance on GSM8K, GSM-Symbolic, and GSM-NoOp when using different 'shots' or examples during testing. 'NoOp-Symb' uses examples from GSM-Symbolic, while 'NoOp-NoOp' uses examples from GSM-NoOp. (c) highlights specific models that, while generally performing worse on GSM8K and GSM-Symbolic, show better performance on NoOp-Symb.

First Mention

Text: "Figure 8: (a) The performance of models drops significantly on GSM-NoOp, with more recent models experiencing a greater decline than older ones."

Context: (b) As previously demonstrated, performance on GSM-Symbolic is very close to that on GSM8K. However, on GSM-NoOp, the significant drop in performance cannot be recovered, even when using the exact same question's variations as shots (NoOp-Symb) or when using different GSM-NoOp questions that contain No-Op operations (NoOp-NoOp) as shots. (c) Notably, some models that perform significantly worse than those in (b) on GSM8K and GSM-Symbolic show much better performance on NoOp-Symb.

Relevance: This figure is central to the paper's argument about the limitations of LLMs in mathematical reasoning. It visually demonstrates how LLMs struggle with irrelevant information, even when provided with relevant examples. It supports the idea that LLMs rely on pattern matching and struggle with true understanding.

Critique
Visual Aspects
  • In (a), consider ordering the models by performance drop or by model size for easier comparison.
  • In (b) and (c), use consistent colors for the same datasets (GSM8K, GSM-Symbolic, GSM-NoOp) across all bar charts.
  • Label the y-axes clearly with 'Accuracy (%)' to avoid ambiguity.
Analytical Aspects
  • Explain why more recent models might experience a greater decline in (a). Is it related to their size, training data, or architecture?
  • In (b), discuss the implications of the finding that performance doesn't improve even with relevant examples (NoOp-Symb). Does this suggest a fundamental limitation in how LLMs process information?
  • In (c), analyze why certain models might perform better on NoOp-Symb despite being generally weaker. Could it be due to specific training data or architectural differences?

Conclusion

Overview

This research explored the reasoning abilities of Large Language Models (LLMs) in mathematics, focusing on the limitations of current evaluations that rely on the GSM8K dataset. The authors introduced GSM-Symbolic, a new benchmark offering varied mathematical problems. The study showed significant inconsistencies in LLM performance on similar questions, especially when numerical values change. Performance also decreased as problem complexity increased. The GSM-NoOp dataset, which includes irrelevant information in problems, revealed a major weakness: LLMs often use this irrelevant information, leading to significant errors. This suggests LLMs rely on pattern matching rather than true logical reasoning, even for simple math problems.

Appendix

Overview

This appendix provides supplementary information to the main paper. It includes detailed experimental setups, complete results on GSM8K and GSM-Symbolic benchmarks and their variants, additional results on performance distribution, further analysis of the impact of question difficulty (including the effects of fine-tuning), and a comprehensive discussion of the OpenAI o1-mini and o1-preview models.

Non-Text Elements

figure 9

Figure 9 shows the prompt format used for evaluating the Large Language Models (LLMs). It consists of a preamble (or system instruction), eight example question-answer pairs (referred to as 'shots'), and the target question. The preamble sets the context for the LLM, instructing it to solve mathematical questions step-by-step like an expert. Each shot includes a question (Q:) and an answer (A:) that demonstrates the desired chain-of-thought reasoning process. The target question is presented without an answer, and the LLM is expected to generate a step-by-step solution and provide the final answer. Placeholders like {{question}}, {{solution}}, and {{final answer}} are used to represent the actual content that would be inserted during evaluation. Think of it like a recipe: the preamble is the general instruction (e.g., 'bake at 350 degrees'), the shots are examples of how to make specific dishes (e.g., 'chocolate chip cookies'), and the target question is a new dish the LLM needs to 'cook' (e.g., 'oatmeal raisin cookies') using the same general instructions and the examples provided.
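A sketch of how such a prompt could be assembled is shown below. The preamble wording and shot contents are placeholders standing in for the paper's actual text, and extract_final_answer illustrates one common convention (taking the last number in the generation) rather than the authors' exact scoring code.

    import re

    PREAMBLE = "As an expert problem solver, solve the following mathematical questions step by step."
    SHOT = "Q: {question}\nA: {solution} The final answer is {final_answer}."

    def build_prompt(shots, target_question):
        # shots: list of (question, solution, final_answer) triples; the paper uses 8.
        parts = [PREAMBLE]
        parts += [SHOT.format(question=q, solution=s, final_answer=a) for (q, s, a) in shots]
        parts.append(f"Q: {target_question}\nA:")
        return "\n\n".join(parts)

    def extract_final_answer(generation):
        # Take the last number in the model's output as its answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
        return numbers[-1] if numbers else None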

First Mention

Text: "Figure 9: The prompt format used for evaluations."

Relevance: This figure is crucial for understanding the experimental setup and how the LLMs were evaluated. It provides a clear picture of the input provided to the models, including the context setting, the examples used for few-shot learning, and the format of the target questions. This helps in interpreting the results and understanding the LLMs' behavior.

Critique
Visual Aspects
  • Use a different font or color for the placeholders ({{question}}, {{solution}}, {{final answer}}) to make them stand out more clearly.
  • Add visual separators (e.g., lines or spacing) between the preamble, each shot, and the target question to improve readability.
  • Consider using a more visually appealing layout, such as a table format, to present the prompt elements more clearly.
Analytical Aspects
  • Explain the purpose of providing 8 shots and whether different numbers of shots were tested.
  • Discuss the rationale behind the specific wording of the preamble and how it might influence the LLMs' responses.
  • Explain how the 'final answer' is extracted from the LLM's generated text and how correctness is evaluated.
table 1

Table 1 presents the performance of various Large Language Models (LLMs) on different versions of the GSM8K math problem dataset. It shows the accuracy (percentage of correctly answered questions) for each model on the full GSM8K test set, a smaller 100-question subset of GSM8K, and several variations of GSM-Symbolic (M1, standard, P1, P2, and NoOp). Each variation represents a different level of difficulty or type of change applied to the original problems. The table also includes standard deviations, which show how much the accuracy scores vary across different runs or subsets of the data. Think of it like testing different students (LLMs) on different sets of math problems (datasets). The table shows each student's average score on each problem set and how consistent their scores are.

First Mention

Text: "Table 1: Full 8-shot results of all models on GSM8Kand different variants of GSM-Symbolic."

Relevance: This table summarizes the main quantitative results of the paper. It allows for a direct comparison of different LLMs and their performance across different benchmarks, providing evidence for the claims made about performance variations, the impact of difficulty, and the effect of irrelevant information.

Critique
Visual Aspects
  • Highlight the best-performing model for each dataset using bold text or a different color.
  • Consider using a heatmap or color scale to visually represent the accuracy scores, making it easier to identify patterns and trends.
  • Add a caption that clearly explains the meaning of each column and the units used (accuracy percentage).
Analytical Aspects
  • Explain how the standard deviations were calculated and what they represent in this context. A high school student might not be familiar with this concept.
  • Discuss the statistical significance of the observed differences in performance between different models and datasets.
  • Analyze the trends observed in the table. For example, which models are most robust to changes in difficulty or irrelevant information? Are there any correlations between model size and performance?
figure 10

Figure 10 presents additional results on the performance variation of Large Language Models (LLMs) on the GSM-Symbolic dataset. It shows histograms of accuracy distributions for three different models: Phi-2, Mistral-7b-instruct-v0.1, and Gemma2-2b-it. Each histogram shows how often each model achieved a particular accuracy level across multiple runs or variations of the GSM-Symbolic problems. The x-axis represents the accuracy percentage, and the y-axis represents the frequency. The average accuracy on the original GSM8K dataset is also provided for comparison, along with the average accuracy and standard deviation on GSM-Symbolic.

First Mention

Text: "Figure 10: Additional results on performance variation on GSM-Symbolic."

Relevance: This figure supplements the earlier analysis of performance variation on GSM-Symbolic (Figure 2) by providing results for additional models. It further supports the claim that LLMs exhibit significant performance variability even on slightly different versions of the same math problems, raising concerns about the reliability of single-point accuracy metrics.

Critique
Visual Aspects
  • Use consistent bin sizes for the histograms to facilitate comparison between models.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' to avoid ambiguity.
  • Consider using a different color or pattern for the bars representing the average GSM8K accuracy to distinguish it from the GSM-Symbolic distribution.
Analytical Aspects
  • Explain how many runs or variations of the GSM-Symbolic problems were used to generate the histograms. This would clarify the sample size and the basis for the distributions.
  • Provide a more detailed explanation of what the standard deviation represents in this context. How does it relate to the spread of the distribution?
  • Discuss the implications of the observed performance variations. Do they suggest overfitting to the specific wording or numerical values in the original GSM8K problems?
figure 11

Figure 11 explores whether using examples (shots) from a slightly harder problem set (GSM-P1) or fine-tuning a model on that set improves performance on an even harder set (GSM-P2). Part (a) shows that including examples from GSM-P1 during testing doesn't help much on GSM-P2. It's like trying to learn advanced calculus by only looking at algebra examples – it won't give you the tools you need. Part (b) shows that even fine-tuning a model on GSM-P1, while improving performance on P1, doesn't translate to better performance on P2. This is like training a dog to fetch a specific ball; it might get really good at fetching *that* ball, but not necessarily other objects.

First Mention

Text: "Figure 11: Using in-context shots or finetuning on GSM-P1 does not improve performance on GSM-P2: (a) Compared to the case where 8 shots come from GSM8K, when we include shots from GSM-P1the performance on GSM-P2 does not improve."

Context: (b) Finetuning on GSM-P1 can improve performance on GSM-P1 but not on GSM-P2.

Relevance: This figure is important because it investigates whether exposure to slightly harder problems, either through examples or fine-tuning, can improve performance on significantly harder problems. The negative results suggest that simply increasing the difficulty of training data or examples might not be enough to improve the reasoning capabilities of LLMs.

Critique
Visual Aspects
  • In (a), label the bars clearly with the model names and the source of the shots (GSM8K or P1).
  • In (b), use a different color or line style for the GSM-P1 and GSM-P2 accuracy curves to improve visual distinction.
  • Add a legend to (b) explaining which line represents which dataset.
Analytical Aspects
  • In (a), explain why 8 shots were chosen and whether different numbers of shots were tested.
  • In (b), explain what 'epochs' represent in the context of fine-tuning. How does the number of epochs relate to the amount of training?
  • Discuss the implications of these findings for training and fine-tuning strategies for LLMs. What alternative approaches might be more effective in improving reasoning abilities?
figure 12

Figure 12 presents the performance of two closed-source LLMs, o1-mini and o1-preview, on different versions of the GSM-Symbolic dataset. It uses histograms to show how often each model achieved a particular accuracy on GSM8K, GSM-Symbolic (the standard version), GSM-M1 (easier), GSM-P1 (harder), and GSM-P2 (hardest). The x-axis represents accuracy, and the y-axis represents frequency. The figure also provides the average accuracy and standard deviation for each model and dataset. The key observation is that o1-preview performs very well and consistently across all difficulty levels, while o1-mini's performance degrades as the difficulty increases, similar to the open-source models discussed earlier.

First Mention

Text: "Figure 12: Results on o1-mini and o1-preview: both models mostly follow the same trend we presented in the main text."

Context: However, o1-preview shows very strong results on all levels of difficulty as all distributions are close to each other.

Relevance: This figure is relevant because it extends the analysis to closed-source models, showing that while some closed models like o1-preview demonstrate strong and consistent performance across different difficulty levels, others like o1-mini still struggle with increasing complexity, following similar trends as open-source models.

Critique
Visual Aspects
  • Use consistent colors for the same dataset across both the o1-mini and o1-preview plots to facilitate comparison.
  • Clearly label the axes with 'Accuracy (%)' and 'Frequency'.
  • Consider using box plots instead of histograms to more clearly show the median, quartiles, and outliers of the accuracy distributions.
Analytical Aspects
  • Explain why o1-preview performs so consistently across different difficulty levels. Is it due to its architecture, training data, or other factors?
  • Compare the performance of o1-mini and o1-preview to the open-source models discussed earlier in the paper. Are there any significant differences or similarities?
  • Discuss the implications of these findings for the development of more robust and generalizable LLMs.
figure 13

Figure 13 shows an example of how the o1-preview model fails to understand the context of a word problem from the GSM-NoOp dataset. The problem asks for the current cost of school supplies, given current prices and mentioning that prices were lower last year due to inflation. However, the model incorrectly calculates the cost based on the lower, past prices, even though the question explicitly asks for the current cost. This demonstrates the model's tendency to blindly apply numerical operations without fully grasping the context or relevance of the information provided. It's like a student who sees the word 'discount' and automatically subtracts, regardless of whether a discount is actually being applied in the problem.

First Mention

Text: "Figure 13: Sample response from o1-preview on an example from GSM-NoOp: the model blindly applies the inflation rate, even though the inflation amount is irrelevant as the question clearly indicates the given prices are for "now" and not last year."

Relevance: This figure is relevant because it provides a specific example of how even high-performing closed-source models like o1-preview can fail on GSM-NoOp problems due to their inability to filter out irrelevant information. It reinforces the paper's argument that LLMs struggle with true understanding and rely on pattern matching, leading to errors when presented with irrelevant information.

Critique
Visual Aspects
  • Highlight the part of the problem text that indicates the prices are current ('now') to emphasize the model's misunderstanding.
  • Clearly separate the problem statement, the model's response, and the explanation of the error to improve readability.
  • Consider adding a visual representation of the correct solution alongside the model's incorrect response to highlight the discrepancy.
Analytical Aspects
  • Explain why the model might have made this specific mistake. Does it relate to the model's training data or its internal representation of the problem?
  • Discuss the implications of this finding for the reliability of LLMs in real-world applications where understanding context and relevance is crucial.
  • Connect this example to the broader discussion of pattern matching versus true understanding in LLMs.
figure 14

Figure 14 presents a word problem from the GSM-NoOp dataset, which involves calculating the price difference between sourdough loaves and muffins after a donation. The figure includes responses from two models, o1-preview and o1-mini. Both models provide step-by-step solutions, but they incorrectly account for the donated items by subtracting their value from the total cost, even though the donation doesn't affect the price difference between the two items. The o1-preview model calculates the cost of the remaining items after the donation and then finds the difference. The o1-mini model calculates the initial cost, the value of the donated items, the net costs after the donation, and finally, the difference. Both models arrive at the same incorrect answer due to the erroneous subtraction of the donation value.
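A small numeric check (with made-up prices and quantities, since the exact values are not reproduced here, and assuming the question asks for the total cost of the loaves bought minus the total cost of the muffins bought) shows why the donation is a No-Op: the cost difference depends only on what was purchased, not on what is later given away.

    loaf_price, muffin_price = 6.50, 2.00        # hypothetical prices
    loaves_bought, muffins_bought = 4, 6         # hypothetical quantities
    donated_loaves, donated_muffins = 1, 2       # irrelevant to the cost difference

    correct_difference = loaves_bought * loaf_price - muffins_bought * muffin_price   # 14.0
    pattern_matched = ((loaves_bought - donated_loaves) * loaf_price
                       - (muffins_bought - donated_muffins) * muffin_price)           # 11.5: the models' style of error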

First Mention

Text: "Figure 14: Sample response from ol-preview and ol-mini on an example from GSM-NoOp: while the donation amount is irrelevant to the price difference, the models subtract the amount we donate."

Relevance: This figure further illustrates the point that LLMs struggle with irrelevant information and often misinterpret the problem, even when provided with seemingly straightforward scenarios. It reinforces the paper's argument that LLMs rely on pattern matching and lack true understanding of the underlying mathematical concepts. It shows that even advanced models like o1-preview and o1-mini are susceptible to this issue.

Critique
Visual Aspects
  • Highlight the irrelevant part of the problem (the donation) in a different color or with a different font style to emphasize its misleading nature.
  • Clearly separate and label the responses from the two models (o1-preview and o1-mini) to improve readability.
  • Consider adding a correct solution alongside the model responses to highlight the error and provide a clear contrast.
Analytical Aspects
  • Explain in simpler terms why the donation is irrelevant to the price difference. Use an analogy or a simpler example to illustrate the concept.
  • Discuss the specific pattern-matching behavior that might have led the models to incorporate the donation into their calculations. For example, do they frequently encounter problems where subtraction is required after an initial calculation?
  • Explain the broader implications of this type of error. If LLMs can't handle such simple irrelevant information, how can they be trusted with more complex real-world problems?