Evaluating Mathematical Reasoning in Large Language Models: The GSM-Symbolic Benchmark

Table of Contents

  • Overall Summary
  • Section Analysis
    • Abstract
    • Introduction
    • Related Work: Reasoning & Language Models
    • GSM-Symbolic
    • Experiments & Results
    • Conclusion
    • Appendix

Overall Summary

Overview

This study examines the mathematical reasoning capabilities of Large Language Models (LLMs) by introducing a new benchmark, GSM-Symbolic, derived from the GSM8K dataset. The research highlights the limitations of current LLMs, particularly their dependency on pattern matching rather than genuine logical reasoning. By using symbolic templates, the GSM-Symbolic benchmark allows for the generation of diverse math problems, facilitating a more nuanced evaluation of LLM performance. The study also introduces GSM-NoOp, a dataset that adds irrelevant information to problems to test whether LLMs can discern which details matter; the authors find that models often fold this irrelevant information into their calculations.

Significant Elements

figure 1

Description: Figure 1 illustrates the creation of symbolic templates used in GSM-Symbolic, showing how generic placeholders replace specific elements in math problems.

Relevance: This figure is crucial for understanding how GSM-Symbolic enables diverse problem generation and controlled evaluation.

figure 2

Description: Figure 2 displays the distribution of LLM performance on GSM-Symbolic, highlighting significant variability compared to GSM8K.

Relevance: It demonstrates the inconsistency in LLM performance and questions the reliability of single-point metrics.

Conclusion

The research sheds light on the limitations of current LLMs in mathematical reasoning, emphasizing their reliance on pattern matching rather than true understanding. By introducing GSM-Symbolic and GSM-NoOp, the study offers a more comprehensive framework for evaluating LLM capabilities. The findings underscore the need for models capable of formal reasoning, which is crucial for advancing AI applications in complex domains. Future research should focus on developing LLMs with improved logical reasoning skills and exploring alternative evaluation methods to better capture these abilities in real-world contexts.

Section Analysis

Abstract

Overview

This paper investigates the mathematical reasoning abilities of Large Language Models (LLMs) using a new benchmark called GSM-Symbolic, derived from the GSM8K dataset. The authors find that LLMs struggle with variations in numerical values within the same problem structure, and their performance degrades as problem complexity increases. They also introduce GSM-NoOp, a dataset with irrelevant information added to problems, revealing that LLMs often incorporate this irrelevant information into their calculations, suggesting a lack of true understanding of mathematical concepts. The study concludes that current LLMs rely more on pattern matching than genuine logical reasoning, especially in mathematics.

Introduction

Overview

Large Language Models (LLMs) have shown impressive abilities in areas such as language processing and creative tasks. However, whether they can truly reason logically, especially in fields like math and coding, remains an open question. Existing research suggests that LLMs may rely more on recognizing patterns from their training data than on actual understanding, which makes them sensitive to small changes in how questions are phrased. The GSM8K dataset is commonly used to test LLMs' math skills, but it has drawbacks: it provides only a single accuracy score, its problems may have leaked into training data (contamination), and it does not allow flexible testing of how LLMs handle different question variations or difficulty levels.

Related Work: Reasoning & Language Models

Overview

This section discusses existing research on the reasoning abilities of Large Language Models (LLMs). It highlights that while LLMs have shown potential in various domains, their reasoning capabilities are still uncertain. Studies exploring the computational aspects of transformers suggest that these models might have limitations in handling complex tasks and may benefit from additional memory mechanisms like scratchpads. However, it remains unclear whether LLMs can perform true logical reasoning. Several studies suggest that LLMs rely more on probabilistic pattern-matching than formal reasoning, making them sensitive to small changes in input and prone to errors in complex scenarios. This pattern-matching approach, while more advanced than simple memorization, still falls short of genuine logical reasoning.

GSM-Symbolic

Overview

This section introduces GSM-Symbolic, a new benchmark for evaluating the mathematical reasoning of Large Language Models (LLMs). It addresses the limitations of existing benchmarks like GSM8K by using symbolic templates to generate diverse question variations. This approach allows for more controlled experiments and provides more reliable metrics for assessing LLM performance. The section also describes the template generation process, which involves identifying variables, their domains, and conditions to ensure question and answer correctness. Finally, it outlines the experimental setup used in the paper, including the models evaluated, the evaluation process, and the dataset size.
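To make the template mechanism concrete, the sketch below shows one way such a template could be represented and sampled in Python. It is a minimal illustration assuming a dictionary of variable domains and a solvability condition; the problem text paraphrases the Sophie example from Figure 1, and names such as TEMPLATE, VARIABLES, and sample_instance are illustrative rather than the authors' actual generator code.

    import random

    # Paraphrase of the Figure 1 example, with the paper's placeholders.
    TEMPLATE = ("When {name} watches her {family}, she gets out a variety of toys for him. "
                "The bag of building blocks has {x} blocks, the bin of stuffed animals has {y} animals, "
                "and the tower of stacking rings has {z} rings. {name} recently bought a tube of bouncy balls, "
                "bringing her total number of toys up to {total}. How many bouncy balls came in the tube?")

    VARIABLES = {
        "name":   ["Sophie", "Olivia", "Emma"],     # proper-name slot
        "family": ["nephew", "cousin", "brother"],  # relationship slot
        "x": range(5, 100), "y": range(5, 100), "z": range(5, 100),
        "total": range(100, 500),
    }

    def satisfies_conditions(values):
        # Conditions keep the generated question well-posed: the tube must
        # contain a positive number of bouncy balls.
        return values["total"] > values["x"] + values["y"] + values["z"]

    def sample_instance():
        # Rejection-sample variable assignments until the conditions hold.
        while True:
            values = {k: random.choice(list(dom)) for k, dom in VARIABLES.items()}
            if satisfies_conditions(values):
                answer = values["total"] - (values["x"] + values["y"] + values["z"])
                return TEMPLATE.format(**values), answer

    question, answer = sample_instance()

Repeated calls to sample_instance yield many structurally identical but superficially different questions, which is what enables the distributional evaluation described in the Experiments & Results section.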

Non-Text Elements

figure 1

Figure 1 illustrates how the GSM-Symbolic template is created. It shows an example from the original GSM8K dataset alongside a corresponding template. The GSM8K example is a word problem about a person named Sophie and the number of toys she has for her nephew. The template generalizes this problem by replacing specific names and numbers with placeholders like {name}, {family}, {x}, {y}, {z}, and {total}. This allows for generating many similar problems with different names, numbers, and relationships between them, while keeping the underlying structure the same. The figure also shows the solution to both the original problem and the templated version.

First Mention

Text: "Figure 1: Illustration of the GSM-Symbolic template creation process."

Context: This dataset serves as a tool to investigate the presumed reasoning capabilities of LLMs, enabling the design of controllable mathematical reasoning evaluations with more reliable metrics. Our results reveal that all state-of-the-art LLMs exhibit significant performance variations, suggesting the fragility or lack of reasoning.

Relevance: This figure is crucial for understanding how GSM-Symbolic is constructed and how it enables more controlled experiments compared to GSM8K. It visually demonstrates the concept of templates and their use in generating diverse problem instances.

Critique
Visual Aspects
  • Use a more visually distinct style for the placeholders (e.g., a different font, color, or background) to make them stand out from the rest of the text.
  • Consider adding arrows or other visual cues to connect the placeholders in the template to the corresponding elements in the GSM8K example.
  • Use a larger font size for the text within the figure to improve readability.
Analytical Aspects
  • Provide a brief explanation of the symbols used in the template, such as the meaning of 'sample' and the '#variables' and '#conditions' sections.
  • Explain why certain elements are chosen to be variables (e.g., why 'name' and 'family' are variable, but the type of toys is not).
  • Explain how the conditions ensure the generated problems are solvable and at the appropriate difficulty level.

Experiments & Results

Overview

This section presents the main findings of the study on LLM mathematical reasoning. First, it examines the reliability of current GSM8K results by analyzing the performance distribution on GSM-Symbolic, revealing significant variations. It then investigates the fragility of LLM reasoning by comparing performance when changing names versus numbers in problems, finding LLMs more sensitive to numerical changes. The section also explores the impact of question difficulty (number of clauses) on performance, showing that accuracy decreases and variance increases with higher difficulty. Finally, it introduces GSM-NoOp, a dataset with irrelevant information added to problems, demonstrating that LLMs often incorporate this irrelevant information, leading to significant performance drops and suggesting a lack of true understanding of mathematical concepts.
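The distributional evaluation can be summarized in a few lines: score a model on many independently generated GSM-Symbolic sets and report the mean and standard deviation of accuracy. The sketch below assumes hypothetical generate_set and evaluate_model helpers; 50 sets mirrors the number used in the paper.

    import statistics

    def accuracy_distribution(generate_set, evaluate_model, num_sets=50):
        # Score one model on many independently generated benchmark sets.
        accuracies = [evaluate_model(generate_set()) for _ in range(num_sets)]
        return statistics.mean(accuracies), statistics.stdev(accuracies), accuracies

    # mean_acc, std_acc, accs = accuracy_distribution(generate_gsm_symbolic_set, evaluate_model)

Comparing mean_acc and std_acc against the single GSM8K number shows whether that score sits in the typical range of the distribution or at its right tail, which is the comparison drawn in Figure 2.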

Non-Text Elements

figure 2

Figure 2 shows the distribution of performance for several large language models (LLMs) on the GSM-Symbolic benchmark. Each histogram represents a different model and shows how often the model achieved certain accuracy levels across 50 different sets of GSM-Symbolic problems. The x-axis represents the accuracy achieved (as a percentage), and the y-axis represents the frequency (how many times that accuracy level was observed). A dashed vertical line marks the model's performance on the original GSM8K dataset. The average performance on GSM-Symbolic and its standard deviation are also shown for each model.
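A histogram of this kind can be reproduced with a few lines of matplotlib; the sketch below assumes accs holds a model's 50 per-set accuracies (in percent) and gsm8k_acc its single GSM8K score, both hypothetical inputs.

    import matplotlib.pyplot as plt

    def plot_accuracy_histogram(accs, gsm8k_acc, model_name):
        plt.hist(accs, bins=10)                                 # distribution over GSM-Symbolic sets
        plt.axvline(gsm8k_acc, linestyle="--", label="GSM8K")   # dashed line for the original benchmark
        plt.xlabel("Accuracy (%)")
        plt.ylabel("Frequency")
        plt.title(model_name)
        plt.legend()
        plt.show()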

First Mention

Text: "Figure 2: The distribution of 8-shot Chain-of-Thought (CoT) performance across 50 sets generated from GSM-Symbolic templates shows significant variability in accuracy among all state-of-the-art models."

Context: Furthermore, for most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line). Interestingly, the performance on GSM8K falls on the right side of the distribution, which, statistically speaking, should have a very low likelihood, given that GSM8K is basically a single draw from GSM-Symbolic.

Relevance: This figure is important because it shows how consistent (or inconsistent) the models are when answering slightly different versions of the same math problems. The spread of the histograms indicates the variability in performance, and the comparison to GSM8K performance suggests potential issues like data contamination.

Critique
Visual Aspects
  • Use a consistent color scheme for the histograms and the GSM8K lines across all subplots to improve visual coherence.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' to avoid ambiguity.
  • Increase the spacing between the subplots to reduce clutter and improve readability.
Analytical Aspects
  • Explain why 50 sets were chosen and how they were generated. Was it a random sampling? What parameters were varied?
  • Provide a clearer explanation of what the standard deviation represents in this context. A high school student might not be familiar with this concept.
  • Discuss the implications of the GSM8K performance often falling outside the typical range of GSM-Symbolic performance. Why is this surprising and what does it suggest?
figure 3

Figure 3 is a bar chart showing how much the performance of different LLMs drops when tested on GSM-Symbolic compared to their performance on the original GSM8K. Each bar represents a different model, and its length corresponds to the percentage drop in accuracy. A downward bar means performance decreased on GSM-Symbolic. The labels on each bar provide the exact percentage drop for each model.

First Mention

Text: "Figure 3: The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K."

Context: Later, we investigate the factors that impact the performance drops in more depth.

Relevance: This figure directly visualizes the performance degradation discussed in the text. It highlights the impact of using the more diverse and challenging GSM-Symbolic benchmark compared to the static GSM8K.

Critique
Visual Aspects
  • Order the bars from largest to smallest drop for easier comparison.
  • Add a horizontal line at 0% to clearly separate performance gains from drops.
  • Use a color gradient or different shades to visually represent the magnitude of the drop.
Analytical Aspects
  • Explain the implications of this performance drop. Does it suggest overfitting to GSM8K? Does it indicate a lack of generalization?
  • Connect this figure back to Figure 2 and discuss how the drop relates to the distribution of performance.
  • Discuss why some models show a larger drop than others. Are there architectural differences or training differences that might explain this?
figure 4

Figure 4 illustrates the sensitivity of Large Language Models (LLMs) to changes in names, numbers, or both within math word problems. It presents histograms showing the distribution of accuracy scores for six different LLMs across three conditions: changing only the names in the problem, changing only the numbers, and changing both names and numbers. Each histogram shows the frequency of different accuracy levels, allowing for a comparison of performance variability across the three conditions. The figure aims to demonstrate how these changes, while not affecting the underlying mathematical logic, can significantly impact the LLMs' ability to solve the problems.
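The three conditions can be produced from a single template by restricting which variable slots are resampled. The sketch below reuses the VARIABLES idea from the earlier template example; base_values stands in for the original GSM8K assignment and is illustrative, and a full generator would also re-check the solvability conditions after resampling numbers.

    import random

    def perturb(base_values, variables, change_names=True, change_numbers=True):
        values = dict(base_values)
        for key, domain in variables.items():
            is_name_slot = isinstance(next(iter(domain)), str)  # name/family slots vs numeric slots
            if (is_name_slot and change_names) or (not is_name_slot and change_numbers):
                values[key] = random.choice(list(domain))
        return values

    # names_only   = perturb(base_values, VARIABLES, change_names=True,  change_numbers=False)
    # numbers_only = perturb(base_values, VARIABLES, change_names=False, change_numbers=True)
    # both_changed = perturb(base_values, VARIABLES, change_names=True,  change_numbers=True)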

First Mention

Text: "Figure 4: How sensitive are LLMs when we change only names, only proper numbers, or both names and numbers?"

Context: Overall, models have noticeable performance variation even if we only change names, but even more when we change numbers or combine these changes.

Relevance: This figure is highly relevant as it directly addresses the research question of how fragile LLMs are to superficial changes (names) versus core changes (numbers) in mathematical reasoning problems. It provides evidence for the argument that LLMs are more sensitive to changes in numerical values than to changes in names, suggesting a potential over-reliance on pattern matching.

Critique
Visual Aspects
  • Use distinct colors or patterns for the histograms representing different change conditions to improve visual clarity and comparison.
  • Add a clear legend explaining the meaning of each color/pattern used for the change conditions.
  • Label the axes clearly with appropriate units (Accuracy (%) and Frequency).
  • Increase the font size of labels and legends to improve readability.
Analytical Aspects
  • Provide the average accuracy and standard deviation for each condition in the figure or caption to allow for a more quantitative comparison.
  • Discuss the implications of the observed differences in variance between name changes and number changes in more detail.
  • Connect the findings to the hypothesis that LLMs rely on in-distribution pattern matching, explaining how this hypothesis is supported by the observed sensitivity to numerical changes.
  • Consider adding a statistical test to quantify the significance of the observed differences in performance between the conditions.
figure 5

Figure 5 demonstrates how the difficulty of the GSM-Symbolic math problems is modified by changing the number of clauses. It shows four example problems, each representing a different difficulty level: GSM-Symbolic-M1 (minus one clause), GSM-Symbolic (original), GSM-Symbolic-P1 (plus one clause), and GSM-Symbolic-P2 (plus two clauses). Each problem is a word problem involving calculations, and the increasing difficulty is reflected in the addition of more conditions or steps required to solve the problem. This figure provides a concrete illustration of how the benchmark allows for controlled manipulation of problem complexity.
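As a rough illustration of the clause manipulation (with made-up clauses, not the paper's templates), the difficulty levels can be thought of as adding or removing one statement from a base problem, as in the sketch below.

    BASE_CLAUSES = [
        "A box holds {x} red pens and {y} blue pens.",
        "{name} removes {z} blue pens.",
    ]
    EXTRA_CLAUSES = [
        "She then adds {w} green pens.",                # appended for P1
        "Half of the green pens are then given away.",  # appended for P2 (a real generator's
                                                        # conditions would require {w} to be even)
    ]
    QUESTION = "How many pens are in the box now?"

    def build_question(level):
        # level: -1 (GSM-Symbolic-M1), 0 (GSM-Symbolic), 1 (P1), 2 (P2)
        clauses = BASE_CLAUSES[: len(BASE_CLAUSES) + min(level, 0)] + EXTRA_CLAUSES[: max(level, 0)]
        return " ".join(clauses + [QUESTION])

Each added clause introduces one more arithmetic step, which is how the benchmark operationalizes increasing difficulty.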

First Mention

Text: "Figure 5: Modifying the difficulty level of GSM-Symbolic by modifying the number of clauses."

Relevance: This figure is essential for understanding how the authors operationalize 'difficulty' in their experiments. By showing examples of problems with varying numbers of clauses, it clarifies the manipulation used to test the impact of complexity on LLM performance. This directly relates to the research question of how difficulty affects the performance distribution.

Critique
Visual Aspects
  • Use a consistent format and font size for all the example problems.
  • Highlight the added clauses in the P1 and P2 examples using bold text, color, or underlining to make them easily noticeable.
  • Consider adding a brief explanation of what constitutes a 'clause' in this context, as it might not be immediately clear to all readers.
Analytical Aspects
  • Provide a more detailed explanation of how the added clauses increase the difficulty of the problem. For example, explain the additional reasoning steps required or the increased cognitive load.
  • Explain why this method of manipulating difficulty (adding clauses) is appropriate for studying mathematical reasoning in LLMs.
  • Discuss any potential limitations of this approach. For example, does adding a clause always increase the difficulty linearly, or are there other factors that might influence the perceived difficulty level?
figure 6

This figure illustrates how increasing the complexity of a math problem, represented by the number of clauses, affects the performance of Large Language Models (LLMs). It uses histograms to show the distribution of accuracy scores for four different models across four difficulty levels: GSM-M1 (one clause removed), GSM-Symb (original complexity), GSM-P1 (one clause added), and GSM-P2 (two clauses added). As the problems become more complex (moving from M1 to P2), the histograms generally shift to the left, indicating lower accuracy. The spread of the histograms also tends to increase with complexity, suggesting greater variability in performance. Think of it like stacking blocks: it's easier to balance a small tower (M1) than a tall one (P2). The taller the tower gets, the more likely it is to wobble and fall (more variance).

First Mention

Text: "Figure 6: The impact of increasing the number of clauses on performance: As the difficulty increases from GSM-M1 → GSM-Symb→ GSM-P1 → GSM-P2, the distribution of performance shifts to the left (i.e., accuracy decreases), and the variance increases."

Context: As shown in Fig. 6, the trend of the evolution of the performance distribution is very consistent across all models: as the difficulty increases, the performance decreases and the variance increases. Note that overall, the rate of accuracy drop also increases as the difficulty increases. This is in line with the hypothesis that models are not performing formal reasoning, as the number of required reasoning steps increases linearly, but the rate of drop seems to be faster. Moreover, considering the pattern-matching hypothesis, the increase in variance suggests that searching and pattern-matching become significantly harder for models as the difficulty increases.

Relevance: This figure is central to the paper's argument that LLMs struggle with more complex reasoning tasks. It provides visual evidence of the performance degradation and increased variability as problem difficulty increases.

Critique
Visual Aspects
  • Use consistent colors for the same difficulty level across all model histograms to facilitate comparison.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' or 'Number of Samples'.
  • Add a brief explanation within the figure of what GSM-M1, GSM-Symb, GSM-P1, and GSM-P2 represent.
Analytical Aspects
  • Provide the exact number of clauses used in each difficulty level to quantify the complexity increase.
  • Discuss potential reasons why the variance increases with complexity, such as the accumulation of errors in multi-step reasoning.
  • Consider adding a statistical measure of variance (e.g., standard deviation) to each histogram or in a separate table.
figure 7

Figure 7 shows an example of how LLMs can be misled by irrelevant information. It presents a word problem from the GSM-NoOp dataset, where a seemingly relevant but ultimately unimportant detail is added (some kiwis being smaller). Two LLM responses are shown, both incorrectly incorporating the size of the kiwis into their calculations. This is like asking 'If you have 5 apples and 2 are green, how many apples do you have?' A person would understand that the color doesn't change the number of apples, but the LLMs seem to get confused by the extra detail.
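A short worked check makes the failure mode concrete. The numbers below are illustrative (in the spirit of the paper's kiwi example); the point is that the "smaller than average" detail is a No-Op and should not enter the arithmetic.

    friday, saturday = 44, 58
    sunday = 2 * friday              # "double the number picked on Friday"
    smaller_than_average = 5         # irrelevant detail: smaller kiwis are still kiwis

    correct_total = friday + saturday + sunday                                  # 190
    pattern_matched_total = friday + saturday + sunday - smaller_than_average   # 185: the typical model error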

First Mention

Text: "Figure 7: An example from the GSM-NoOp dataset: We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes."

Context: Fig. 7 illustrates an example from GSM-NoOp. An interesting observation is that models tend to blindly subtract the number of smaller fruits, potentially because their training datasets included similar examples that required conversion to subtraction operations. In the Appendix, we include additional failure cases from GSM-NoOp. Overall, we find that models tend to convert statements to operations without truly understanding their meaning. For instance, a common case we observe is that models interpret statements about “discount” as “multiplication”, regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. Consequently, as shown in Fig. 8a, there is a catastrophic performance decline across all tested models, with the Phi-3-mini model experiencing over a 65% drop, and even stronger models such as o1-preview showing significant declines.

Relevance: This figure supports the paper's argument that LLMs rely on pattern matching and struggle with true understanding. It demonstrates how irrelevant information can significantly impact their performance, suggesting a lack of genuine comprehension of the problem.

Critique
Visual Aspects
  • Highlight the irrelevant part of the problem text in a different color or style to emphasize its misleading nature.
  • Clearly label each LLM response with the model name.
  • Consider adding a correct solution alongside the incorrect ones to provide a clear contrast.
Analytical Aspects
  • Explain why the LLMs might have made the specific mistakes shown, relating it back to the pattern-matching hypothesis.
  • Provide some statistics on how often LLMs make similar errors on GSM-NoOp problems.
  • Discuss the implications of these findings for the reliability of LLMs in real-world applications where irrelevant information might be present.
figure 8

Figure 8 is a collection of bar charts demonstrating the performance drop of various large language models (LLMs) on the GSM-NoOp dataset, a modified version of GSM8K designed to assess how LLMs handle irrelevant information within math problems. (a) shows the general performance drop on GSM-NoOp across different models. (b) compares performance on GSM8K, GSM-Symbolic, and GSM-NoOp when using different 'shots' or examples during testing. 'NoOp-Symb' uses examples from GSM-Symbolic, while 'NoOp-NoOp' uses examples from GSM-NoOp. (c) highlights specific models that, while generally performing worse on GSM8K and GSM-Symbolic, show better performance on NoOp-Symb.

First Mention

Text: "Figure 8: (a) The performance of models drops significantly on GSM-NoOp, with more recent models experiencing a greater decline than older ones."

Context: (b) As previously demonstrated, performance on GSM-Symbolic is very close to that on GSM8K. However, on GSM-NoOp, the significant drop in performance cannot be recovered, even when using the exact same question's variations as shots (NoOp-Symb) or when using different GSM-NoOp questions that contain No-Op operations (NoOp-NoOp) as shots. (c) Notably, some models that perform significantly worse than those in (b) on GSM8K and GSM-Symbolic show much better performance on NoOp-Symb.

Relevance: This figure is central to the paper's argument about the limitations of LLMs in mathematical reasoning. It visually demonstrates how LLMs struggle with irrelevant information, even when provided with relevant examples. It supports the idea that LLMs rely on pattern matching and struggle with true understanding.

Critique
Visual Aspects
  • In (a), consider ordering the models by performance drop or by model size for easier comparison.
  • In (b) and (c), use consistent colors for the same datasets (GSM8K, GSM-Symbolic, GSM-NoOp) across all bar charts.
  • Label the y-axes clearly with 'Accuracy (%)' to avoid ambiguity.
Analytical Aspects
  • Explain why more recent models might experience a greater decline in (a). Is it related to their size, training data, or architecture?
  • In (b), discuss the implications of the finding that performance doesn't improve even with relevant examples (NoOp-Symb). Does this suggest a fundamental limitation in how LLMs process information?
  • In (c), analyze why certain models might perform better on NoOp-Symb despite being generally weaker. Could it be due to specific training data or architectural differences?

Conclusion

Overview

This research explored the reasoning abilities of Large Language Models (LLMs) in mathematics, focusing on the limitations of current evaluations that rely on the GSM8K dataset. The authors introduced GSM-Symbolic, a new benchmark offering varied mathematical problems. The study showed significant inconsistencies in LLM performance on similar questions, especially when numerical values change. Performance also decreased as problem complexity increased. The GSM-NoOp dataset, which includes irrelevant information in problems, revealed a major weakness: LLMs often use this irrelevant information, leading to significant errors. This suggests LLMs rely on pattern matching rather than true logical reasoning, even for simple math problems.

Appendix

Overview

This appendix provides supplementary information to the main paper. It includes detailed experimental setups, complete results on GSM8K and GSM-Symbolic benchmarks and their variants, additional results on performance distribution, further analysis of the impact of question difficulty (including the effects of fine-tuning), and a comprehensive discussion of the OpenAI o1-mini and o1-preview models.

Non-Text Elements

figure 9

Figure 9 shows the prompt format used for evaluating the Large Language Models (LLMs). It consists of a preamble (or system instruction), eight example question-answer pairs (referred to as 'shots'), and the target question. The preamble sets the context for the LLM, instructing it to solve mathematical questions step-by-step like an expert. Each shot includes a question (Q:) and an answer (A:) that demonstrates the desired chain-of-thought reasoning process. The target question is presented without an answer, and the LLM is expected to generate a step-by-step solution and provide the final answer. Placeholders like {{question}}, {{solution}}, and {{final answer}} are used to represent the actual content that would be inserted during evaluation. Think of it like a recipe: the preamble is the general instruction (e.g., 'bake at 350 degrees'), the shots are examples of how to make specific dishes (e.g., 'chocolate chip cookies'), and the target question is a new dish the LLM needs to 'cook' (e.g., 'oatmeal raisin cookies') using the same general instructions and the examples provided.
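A sketch of how such a prompt could be assembled is shown below. The preamble wording and shot contents are placeholders standing in for the paper's actual text, and extract_final_answer illustrates one common convention (taking the last number in the generation) rather than the authors' exact scoring code.

    import re

    PREAMBLE = "As an expert problem solver, solve the following mathematical questions step by step."
    SHOT = "Q: {question}\nA: {solution} The final answer is {final_answer}."

    def build_prompt(shots, target_question):
        # shots: list of (question, solution, final_answer) triples; the paper uses 8.
        parts = [PREAMBLE]
        parts += [SHOT.format(question=q, solution=s, final_answer=a) for (q, s, a) in shots]
        parts.append(f"Q: {target_question}\nA:")
        return "\n\n".join(parts)

    def extract_final_answer(generation):
        # Take the last number in the model's output as its answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
        return numbers[-1] if numbers else None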

First Mention

Text: "Figure 9: The prompt format used for evaluations."

Relevance: This figure is crucial for understanding the experimental setup and how the LLMs were evaluated. It provides a clear picture of the input provided to the models, including the context setting, the examples used for few-shot learning, and the format of the target questions. This helps in interpreting the results and understanding the LLMs' behavior.

Critique
Visual Aspects
  • Use a different font or color for the placeholders ({{question}}, {{solution}}, {{final answer}}) to make them stand out more clearly.
  • Add visual separators (e.g., lines or spacing) between the preamble, each shot, and the target question to improve readability.
  • Consider using a more visually appealing layout, such as a table format, to present the prompt elements more clearly.
Analytical Aspects
  • Explain the purpose of providing 8 shots and whether different numbers of shots were tested.
  • Discuss the rationale behind the specific wording of the preamble and how it might influence the LLMs' responses.
  • Explain how the 'final answer' is extracted from the LLM's generated text and how correctness is evaluated.
table 1

Table 1 presents the performance of various Large Language Models (LLMs) on different versions of the GSM8K math problem dataset. It shows the accuracy (percentage of correctly answered questions) for each model on the full GSM8K test set, a smaller 100-question subset of GSM8K, and several variations of GSM-Symbolic (M1, standard, P1, P2, and NoOp). Each variation represents a different level of difficulty or type of change applied to the original problems. The table also includes standard deviations, which show how much the accuracy scores vary across different runs or subsets of the data. Think of it like testing different students (LLMs) on different sets of math problems (datasets). The table shows each student's average score on each problem set and how consistent their scores are.

First Mention

Text: "Table 1: Full 8-shot results of all models on GSM8Kand different variants of GSM-Symbolic."

Relevance: This table summarizes the main quantitative results of the paper. It allows for a direct comparison of different LLMs and their performance across different benchmarks, providing evidence for the claims made about performance variations, the impact of difficulty, and the effect of irrelevant information.

Critique
Visual Aspects
  • Highlight the best-performing model for each dataset using bold text or a different color.
  • Consider using a heatmap or color scale to visually represent the accuracy scores, making it easier to identify patterns and trends.
  • Add a caption that clearly explains the meaning of each column and the units used (accuracy percentage).
Analytical Aspects
  • Explain how the standard deviations were calculated and what they represent in this context. A high school student might not be familiar with this concept.
  • Discuss the statistical significance of the observed differences in performance between different models and datasets.
  • Analyze the trends observed in the table. For example, which models are most robust to changes in difficulty or irrelevant information? Are there any correlations between model size and performance?
figure 10

Figure 10 presents additional results on the performance variation of Large Language Models (LLMs) on the GSM-Symbolic dataset. It shows histograms of accuracy distributions for three different models: Phi-2, Mistral-7b-instruct-v0.1, and Gemma2-2b-it. Each histogram shows how often each model achieved a particular accuracy level across multiple runs or variations of the GSM-Symbolic problems. The x-axis represents the accuracy percentage, and the y-axis represents the frequency. The average accuracy on the original GSM8K dataset is also provided for comparison, along with the average accuracy and standard deviation on GSM-Symbolic.

First Mention

Text: "Figure 10: Additional results on performance variation on GSM-Symbolic."

Relevance: This figure supplements the earlier analysis of performance variation on GSM-Symbolic (Figure 2) by providing results for additional models. It further supports the claim that LLMs exhibit significant performance variability even on slightly different versions of the same math problems, raising concerns about the reliability of single-point accuracy metrics.

Critique
Visual Aspects
  • Use consistent bin sizes for the histograms to facilitate comparison between models.
  • Label the axes clearly with 'Accuracy (%)' and 'Frequency' to avoid ambiguity.
  • Consider using a different color or pattern for the bars representing the average GSM8K accuracy to distinguish it from the GSM-Symbolic distribution.
Analytical Aspects
  • Explain how many runs or variations of the GSM-Symbolic problems were used to generate the histograms. This would clarify the sample size and the basis for the distributions.
  • Provide a more detailed explanation of what the standard deviation represents in this context. How does it relate to the spread of the distribution?
  • Discuss the implications of the observed performance variations. Do they suggest overfitting to the specific wording or numerical values in the original GSM8K problems?
figure 11

Figure 11 explores whether using examples (shots) from a slightly harder problem set (GSM-P1) or fine-tuning a model on that set improves performance on an even harder set (GSM-P2). Part (a) shows that including examples from GSM-P1 during testing doesn't help much on GSM-P2. It's like trying to learn advanced calculus by only looking at algebra examples – it won't give you the tools you need. Part (b) shows that even fine-tuning a model on GSM-P1, while improving performance on P1, doesn't translate to better performance on P2. This is like training a dog to fetch a specific ball; it might get really good at fetching *that* ball, but not necessarily other objects.

First Mention

Text: "Figure 11: Using in-context shots or finetuning on GSM-P1 does not improve performance on GSM-P2: (a) Compared to the case where 8 shots come from GSM8K, when we include shots from GSM-P1the performance on GSM-P2 does not improve."

Context: (b) Finetuning on GSM-P1 can improve performance on GSM-P1 but not on GSM-P2.

Relevance: This figure is important because it investigates whether exposure to slightly harder problems, either through examples or fine-tuning, can improve performance on significantly harder problems. The negative results suggest that simply increasing the difficulty of training data or examples might not be enough to improve the reasoning capabilities of LLMs.

Critique
Visual Aspects
  • In (a), label the bars clearly with the model names and the source of the shots (GSM8K or P1).
  • In (b), use a different color or line style for the GSM-P1 and GSM-P2 accuracy curves to improve visual distinction.
  • Add a legend to (b) explaining which line represents which dataset.
Analytical Aspects
  • In (a), explain why 8 shots were chosen and whether different numbers of shots were tested.
  • In (b), explain what 'epochs' represent in the context of fine-tuning. How does the number of epochs relate to the amount of training?
  • Discuss the implications of these findings for training and fine-tuning strategies for LLMs. What alternative approaches might be more effective in improving reasoning abilities?
figure 12

Figure 12 presents the performance of two closed-source LLMs, o1-mini and o1-preview, on different versions of the GSM-Symbolic dataset. It uses histograms to show how often each model achieved a particular accuracy on GSM8K, GSM-Symbolic (the standard version), GSM-M1 (easier), GSM-P1 (harder), and GSM-P2 (hardest). The x-axis represents accuracy, and the y-axis represents frequency. The figure also provides the average accuracy and standard deviation for each model and dataset. The key observation is that o1-preview performs very well and consistently across all difficulty levels, while o1-mini's performance degrades as the difficulty increases, similar to the open-source models discussed earlier.

First Mention

Text: "Figure 12: Results on o1-mini and o1-preview: both models mostly follow the same trend we presented in the main text."

Context: However, o1-preview shows very strong results on all levels of difficulty as all distributions are close to each other.

Relevance: This figure is relevant because it extends the analysis to closed-source models, showing that while some closed models like o1-preview demonstrate strong and consistent performance across different difficulty levels, others like o1-mini still struggle with increasing complexity, following similar trends as open-source models.

Critique
Visual Aspects
  • Use consistent colors for the same dataset across both the o1-mini and o1-preview plots to facilitate comparison.
  • Clearly label the axes with 'Accuracy (%)' and 'Frequency'.
  • Consider using box plots instead of histograms to more clearly show the median, quartiles, and outliers of the accuracy distributions.
Analytical Aspects
  • Explain why o1-preview performs so consistently across different difficulty levels. Is it due to its architecture, training data, or other factors?
  • Compare the performance of o1-mini and o1-preview to the open-source models discussed earlier in the paper. Are there any significant differences or similarities?
  • Discuss the implications of these findings for the development of more robust and generalizable LLMs.
figure 13

Figure 13 shows an example of how the o1-preview model fails to understand the context of a word problem from the GSM-NoOp dataset. The problem asks for the current cost of school supplies, given current prices and mentioning that prices were lower last year due to inflation. However, the model incorrectly calculates the cost based on the lower, past prices, even though the question explicitly asks for the current cost. This demonstrates the model's tendency to blindly apply numerical operations without fully grasping the context or relevance of the information provided. It's like a student who sees the word 'discount' and automatically subtracts, regardless of whether a discount is actually being applied in the problem.

First Mention

Text: "Figure 13: Sample response from o1-preview on an example from GSM-NoOp: the model blindly applies the inflation rate, even though the inflation amount is irrelevant as the question clearly indicates the given prices are for "now" and not last year."

Relevance: This figure is relevant because it provides a specific example of how even high-performing closed-source models like o1-preview can fail on GSM-NoOp problems due to their inability to filter out irrelevant information. It reinforces the paper's argument that LLMs struggle with true understanding and rely on pattern matching, leading to errors when presented with irrelevant information.

Critique
Visual Aspects
  • Highlight the part of the problem text that indicates the prices are current ('now') to emphasize the model's misunderstanding.
  • Clearly separate the problem statement, the model's response, and the explanation of the error to improve readability.
  • Consider adding a visual representation of the correct solution alongside the model's incorrect response to highlight the discrepancy.
Analytical Aspects
  • Explain why the model might have made this specific mistake. Does it relate to the model's training data or its internal representation of the problem?
  • Discuss the implications of this finding for the reliability of LLMs in real-world applications where understanding context and relevance is crucial.
  • Connect this example to the broader discussion of pattern matching versus true understanding in LLMs.
figure 14

Figure 14 presents a word problem from the GSM-NoOp dataset, which involves calculating the price difference between sourdough loaves and muffins after a donation. The figure includes responses from two models, o1-preview and o1-mini. Both models provide step-by-step solutions, but they incorrectly account for the donated items by subtracting their value from the total cost, even though the donation doesn't affect the price difference between the two items. The o1-preview model calculates the cost of the remaining items after the donation and then finds the difference. The o1-mini model calculates the initial cost, the value of the donated items, the net costs after the donation, and finally, the difference. Both models arrive at the same incorrect answer due to the erroneous subtraction of the donation value.
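A small numeric check (with made-up prices and quantities, since the exact values are not reproduced here, and assuming the question asks for the total cost of the loaves bought minus the total cost of the muffins bought) shows why the donation is a No-Op: the cost difference depends only on what was purchased, not on what is later given away.

    loaf_price, muffin_price = 6.50, 2.00        # hypothetical prices
    loaves_bought, muffins_bought = 4, 6         # hypothetical quantities
    donated_loaves, donated_muffins = 1, 2       # irrelevant to the cost difference

    correct_difference = loaves_bought * loaf_price - muffins_bought * muffin_price   # 14.0
    pattern_matched = ((loaves_bought - donated_loaves) * loaf_price
                       - (muffins_bought - donated_muffins) * muffin_price)           # 11.5: the models' style of error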

First Mention

Text: "Figure 14: Sample response from ol-preview and ol-mini on an example from GSM-NoOp: while the donation amount is irrelevant to the price difference, the models subtract the amount we donate."

Relevance: This figure further illustrates the point that LLMs struggle with irrelevant information and often misinterpret the problem, even when provided with seemingly straightforward scenarios. It reinforces the paper's argument that LLMs rely on pattern matching and lack true understanding of the underlying mathematical concepts. It shows that even advanced models like o1-preview and o1-mini are susceptible to this issue.

Critique
Visual Aspects
  • Highlight the irrelevant part of the problem (the donation) in a different color or with a different font style to emphasize its misleading nature.
  • Clearly separate and label the responses from the two models (o1-preview and o1-mini) to improve readability.
  • Consider adding a correct solution alongside the model responses to highlight the error and provide a clear contrast.
Analytical Aspects
  • Explain in simpler terms why the donation is irrelevant to the price difference. Use an analogy or a simpler example to illustrate the concept.
  • Discuss the specific pattern-matching behavior that might have led the models to incorporate the donation into their calculations. For example, do they frequently encounter problems where subtraction is required after an initial calculation?
  • Explain the broader implications of this type of error. If LLMs can't handle such simple irrelevant information, how can they be trusted with more complex real-world problems?