This study examines the mathematical reasoning capabilities of Large Language Models (LLMs) by introducing a new benchmark, GSM-Symbolic, derived from the GSM8K dataset. The research highlights the limitations of current LLMs, particularly their dependency on pattern matching rather than genuine logical reasoning. By using symbolic templates, the GSM-Symbolic benchmark allows for the generation of diverse math problems, facilitating a more nuanced evaluation of LLM performance. Additionally, the study explores GSM-NoOp, a dataset that adds seemingly relevant but inconsequential statements to problems in order to test LLMs' ability to discern relevant details, ultimately finding that models often fold this irrelevant information into their calculations.
Description: Figure 1 illustrates the creation of symbolic templates used in GSM-Symbolic, showing how generic placeholders replace specific elements in math problems.
Relevance: This figure is crucial for understanding how GSM-Symbolic enables diverse problem generation and controlled evaluation.
Description: Figure 2 displays the distribution of LLM performance on GSM-Symbolic, highlighting significant variability compared to GSM8K.
Relevance: It demonstrates the inconsistency in LLM performance and questions the reliability of single-point metrics.
The research sheds light on the limitations of current LLMs in mathematical reasoning, emphasizing their reliance on pattern matching rather than true understanding. By introducing GSM-Symbolic and GSM-NoOp, the study offers a more comprehensive framework for evaluating LLM capabilities. The findings underscore the need for models capable of formal reasoning, which is crucial for advancing AI applications in complex domains. Future research should focus on developing LLMs with improved logical reasoning skills and exploring alternative evaluation methods to better capture these abilities in real-world contexts.
This paper investigates the mathematical reasoning abilities of Large Language Models (LLMs) using a new benchmark called GSM-Symbolic, derived from the GSM8K dataset. The authors find that LLMs struggle with variations in numerical values within the same problem structure, and their performance degrades as problem complexity increases. They also introduce GSM-NoOp, a dataset with irrelevant information added to problems, revealing that LLMs often incorporate this irrelevant information into their calculations, suggesting a lack of true understanding of mathematical concepts. The study concludes that current LLMs rely more on pattern matching than genuine logical reasoning, especially in mathematics.
The abstract effectively establishes the need for a more robust evaluation of LLM mathematical reasoning beyond the existing GSM8K benchmark.
The abstract clearly presents the main findings of the study, including the performance variance, the impact of complexity, and the effect of irrelevant information.
The abstract clearly outlines the scope of the study and its contributions, including the introduction of GSM-Symbolic and GSM-NoOp.
Mentioning the number of models and examples used would strengthen the abstract.
Rationale: Providing specific numbers adds weight to the claims and allows readers to better assess the scope of the study.
Implementation: Include phrases like "We evaluated X LLMs on Y examples" or similar quantifications.
Adding a sentence about the implications of these findings for the development of future LLMs would enhance the abstract's impact.
Rationale: Highlighting the broader significance of the work makes it more appealing to a wider audience.
Implementation: Add a concluding sentence like, "These findings highlight the need for new approaches to improve the genuine reasoning capabilities of LLMs."
While the term "clauses" is used to indicate complexity, briefly explaining its meaning in this context would improve clarity for readers unfamiliar with the terminology.
Rationale: Ensuring all readers understand key terms enhances accessibility and comprehension.
Implementation: Briefly define "clauses" as parts or components of the problem statement.
Large Language Models (LLMs) have shown impressive abilities in areas such as language processing and creative tasks. However, whether they can truly reason logically, especially in fields like mathematics and coding, remains an open question. While LLMs appear capable on some tasks, they have clear limitations. Existing research suggests that LLMs may rely more on recognizing patterns from their training data than on actual understanding, making them sensitive to small changes in how questions are phrased. The GSM8K dataset is commonly used to test LLMs' math skills, but it has drawbacks: it provides only a single aggregate score, it may have leaked into training data (contamination), and it does not allow for controlled testing of how LLMs handle different question variations or difficulty levels.
The introduction effectively highlights the importance of studying LLM reasoning abilities and the limitations of current benchmarks.
The introduction provides a good overview of existing research on LLM reasoning and the limitations of the GSM8K dataset.
The introduction clearly outlines the paper's contributions and sets the stage for the subsequent sections.
While the introduction mentions limitations, adding specific examples would make the argument stronger.
Rationale: Concrete examples would better illustrate the challenges in LLM reasoning and make the motivation for the paper's contributions clearer.
Implementation: Include specific examples of how LLMs fail in reasoning tasks or are sensitive to input changes.
The introduction briefly mentions data contamination but could elaborate on its potential consequences for evaluating LLM performance.
Rationale: Explaining the implications of data contamination would highlight the importance of the proposed GSM-Symbolic benchmark.
Implementation: Discuss how data contamination could lead to inflated performance estimates and hinder accurate assessment of LLM capabilities.
While the introduction provides context, explicitly stating the research questions would improve the flow and focus of the paper.
Rationale: Clearly articulated research questions would guide the reader and provide a framework for understanding the subsequent experiments and results.
Implementation: Formulate specific research questions related to the reliability of GSM8K, the fragility of LLM reasoning, and the impact of question complexity and irrelevant information.
This section discusses existing research on the reasoning abilities of Large Language Models (LLMs). It highlights that while LLMs have shown potential in various domains, their reasoning capabilities are still uncertain. Studies exploring the computational aspects of transformers suggest that these models might have limitations in handling complex tasks and may benefit from additional memory mechanisms like scratchpads. However, it remains unclear whether LLMs can perform true logical reasoning. Several studies suggest that LLMs rely more on probabilistic pattern-matching than formal reasoning, making them sensitive to small changes in input and prone to errors in complex scenarios. This pattern-matching approach, while more advanced than simple memorization, still falls short of genuine logical reasoning.
The section provides a good overview of different perspectives on LLM reasoning, including computational modeling and the pattern-matching hypothesis.
The section clearly explains complex concepts like probabilistic pattern-matching and token bias in an accessible way.
The section includes relevant citations to support the claims and provides a good starting point for further reading.
While the section covers relevant topics, a more structured organization would improve readability and clarity.
Rationale: A clear structure would make it easier for readers to follow the different arguments and understand the connections between them.
Implementation: Organize the section into subsections or use headings to separate different aspects of LLM reasoning, such as computational limitations, pattern-matching, and sensitivity to input changes.
While the section mentions related work, it could more explicitly connect these findings to the paper's specific contributions, particularly the introduction of GSM-Symbolic.
Rationale: A stronger connection would highlight the novelty and relevance of the paper's contributions in the context of existing research.
Implementation: Add a paragraph or sentences explicitly linking the limitations of current LLMs and evaluation methods to the motivation for developing GSM-Symbolic and the research questions addressed in the paper.
Concluding the section with a brief discussion of potential future research directions would provide a broader perspective and stimulate further investigation.
Rationale: Highlighting open questions and future research directions would contribute to the overall impact and value of the paper.
Implementation: Add a concluding paragraph discussing potential avenues for future research, such as developing new architectures or training methods to improve LLM reasoning abilities.
This section introduces GSM-Symbolic, a new benchmark for evaluating the mathematical reasoning of Large Language Models (LLMs). It addresses the limitations of existing benchmarks like GSM8K by using symbolic templates to generate diverse question variations. This approach allows for more controlled experiments and provides more reliable metrics for assessing LLM performance. The section also describes the template generation process, which involves identifying variables, their domains, and conditions to ensure question and answer correctness. Finally, it outlines the experimental setup used in the paper, including the models evaluated, the evaluation process, and the dataset size.
The section clearly explains the purpose and design of GSM-Symbolic, making it easy to understand the novelty and value of the benchmark.
The section provides a step-by-step explanation of how templates are created, including the identification of variables, domains, and conditions.
The section clearly outlines the experimental setup, including the models used, the evaluation process, and the dataset size.
While Figure 1 shows one example, including more examples of symbolic templates would further clarify the process and its flexibility.
Rationale: More examples would help readers understand the different types of questions that can be generated and the range of complexity that can be captured.
Implementation: Include a few more examples of symbolic templates in the section or in the appendix, showcasing different problem structures and variable types.
While the section highlights the advantages of GSM-Symbolic, it would be beneficial to also discuss its limitations or potential biases.
Rationale: Acknowledging limitations would provide a more balanced perspective and encourage further research to address these limitations.
Implementation: Add a paragraph discussing potential limitations, such as the scope of mathematical concepts covered or the potential for biases in the template generation process.
The section mentions the dataset size but doesn't explain why these specific numbers were chosen.
Rationale: Justifying the dataset size would strengthen the methodology and ensure the results are statistically significant.
Implementation: Explain the rationale behind choosing 100 templates and 50 samples, perhaps by referring to computational constraints or the desired level of statistical power.
Figure 1 illustrates how the GSM-Symbolic template is created. It shows an example from the original GSM8K dataset alongside a corresponding template. The GSM8K example is a word problem about a person named Sophie and the number of toys she has for her nephew. The template generalizes this problem by replacing specific names and numbers with placeholders like {name}, {family}, {x}, {y}, {z}, and {total}. This allows for generating many similar problems with different names, numbers, and relationships between them, while keeping the underlying structure the same. The figure also shows the solution to both the original problem and the templated version.
Text: "Figure 1: Illustration of the GSM-Symbolic template creation process."
Context: This dataset serves as a tool to investigate the presumed reasoning capabilities of LLMs, enabling the design of controllable mathematical reasoning evaluations with more reliable metrics. Our results reveal that all state-of-the-art LLMs exhibit significant performance variations, suggesting the fragility or lack of reasoning.
Relevance: This figure is crucial for understanding how GSM-Symbolic is constructed and how it enables more controlled experiments compared to GSM8K. It visually demonstrates the concept of templates and their use in generating diverse problem instances.
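To make the template mechanism concrete, the following is a minimal Python sketch, assuming a generator loosely paraphrasing the Figure 1 example: the placeholder names ({name}, {family}, {x}, {y}, {z}, {total}) follow the figure, but the question wording, the name pools, the numeric domains, and the consistency condition are illustrative assumptions rather than the authors' actual pipeline.

```python
import random

# Hypothetical GSM-Symbolic-style template paraphrasing the Figure 1 example.
TEMPLATE = (
    "When {name} watches her {family}, she gets out a variety of toys for him. "
    "The bag of building blocks has {x} blocks in it. The bin of stuffed animals "
    "has {y} stuffed animals inside. The tower of stacking rings has {z} rings on it. "
    "{name} recently bought a tube of bouncy balls, bringing her total number of "
    "toys to {total}. How many bouncy balls came in the tube?"
)

NAMES = ["Sophie", "Ava", "Liam", "Noah"]   # assumed name domain
FAMILY = ["nephew", "niece", "cousin"]      # assumed relation domain

def sample_instance(rng: random.Random) -> dict:
    """Sample placeholder values until the answer-correctness condition holds."""
    while True:
        x, y, z = rng.randint(5, 100), rng.randint(5, 100), rng.randint(5, 100)
        total = rng.randint(100, 500)
        answer = total - (x + y + z)
        if answer > 0:  # condition: the number of bouncy balls must be positive
            question = TEMPLATE.format(
                name=rng.choice(NAMES), family=rng.choice(FAMILY),
                x=x, y=y, z=z, total=total)
            return {"question": question, "answer": answer}

rng = random.Random(0)
instances = [sample_instance(rng) for _ in range(50)]  # one GSM-Symbolic-style set
print(instances[0]["question"], "->", instances[0]["answer"])
```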
This section presents the main findings of the study on LLM mathematical reasoning. First, it examines the reliability of current GSM8K results by analyzing the performance distribution on GSM-Symbolic, revealing significant variations. It then investigates the fragility of LLM reasoning by comparing performance when changing names versus numbers in problems, finding LLMs more sensitive to numerical changes. The section also explores the impact of question difficulty (number of clauses) on performance, showing that accuracy decreases and variance increases with higher difficulty. Finally, it introduces GSM-NoOp, a dataset with irrelevant information added to problems, demonstrating that LLMs often incorporate this irrelevant information, leading to significant performance drops and suggesting a lack of true understanding of mathematical concepts.
The section presents the results in a clear and organized manner, using figures and tables to illustrate the key findings.
The section provides a comprehensive analysis of the results, exploring different factors that contribute to LLM performance variations.
The claims made in the section are well-supported by the presented results and figures.
While the section mentions variance and standard deviations, providing more details on the statistical tests used would strengthen the analysis.
Rationale: More detailed statistical analysis would provide stronger evidence for the claims made in the section.
Implementation: Include p-values or other statistical measures to quantify the significance of the observed differences in performance.
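As one hedged illustration of the kind of statistical reporting this suggestion asks for, the sketch below uses a one-sample t-test from scipy.stats to ask whether a model's single GSM8K accuracy is consistent with its distribution of GSM-Symbolic accuracies; the accuracy values are fabricated for illustration and are not the paper's numbers.

```python
import numpy as np
from scipy import stats

# Fabricated per-set accuracies of one model across 50 GSM-Symbolic sets.
gsm_symbolic_accs = np.random.default_rng(0).normal(loc=0.74, scale=0.03, size=50)
gsm8k_acc = 0.80  # the same model's single GSM8K accuracy (also illustrative)

# One-sample t-test: does the GSM-Symbolic mean differ from the GSM8K score?
t_stat, p_value = stats.ttest_1samp(gsm_symbolic_accs, popmean=gsm8k_acc)
print(f"mean={gsm_symbolic_accs.mean():.3f}, std={gsm_symbolic_accs.std(ddof=1):.3f}, "
      f"t={t_stat:.2f}, p={p_value:.4g}")
```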
The section could benefit from a discussion of the limitations of the experimental setup, such as the choice of models or the use of greedy decoding.
Rationale: Acknowledging limitations would provide a more balanced perspective and encourage further research to address these limitations.
Implementation: Add a paragraph discussing potential limitations of the experimental setup and their potential impact on the results.
While the section focuses on mathematical reasoning, connecting the findings to the broader context of LLM research would enhance the paper's impact.
Rationale: Connecting the findings to broader research questions would highlight the significance of the study and its implications for the development of future LLMs.
Implementation: Discuss how the findings relate to other research on LLM reasoning, such as studies on logical reasoning or common sense reasoning.
Figure 2 shows the distribution of performance for several large language models (LLMs) on the GSM-Symbolic benchmark. Each histogram represents a different model and shows how often the model achieved certain accuracy levels across 50 different sets of GSM-Symbolic problems. The x-axis represents the accuracy achieved (as a percentage), and the y-axis represents the frequency (how many times that accuracy level was observed). A dashed vertical line marks the model's performance on the original GSM8K dataset. The average performance on GSM-Symbolic and its standard deviation are also shown for each model.
Text: "Figure 2: The distribution of 8-shot Chain-of-Thought (CoT) performance across 50 sets generated from GSM-Symbolic templates shows significant variability in accuracy among all state-of-the-art models."
Context: Furthermore, for most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line). Interestingly, the performance on GSM8K falls on the right side of the distribution, which, statistically speaking, should have a very low likelihood, given that GSM8K is effectively a single draw from GSM-Symbolic.
Relevance: This figure is important because it shows how consistent (or inconsistent) the models are when answering slightly different versions of the same math problems. The spread of the histograms indicates the variability in performance, and the comparison to GSM8K performance suggests potential issues like data contamination.
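For readers who want to see the shape of this evaluation, here is a minimal sketch, assuming a hypothetical grade(model, question_set) callable that returns accuracy on one generated set; the helper and its arguments are assumptions, not the authors' evaluation code.

```python
import statistics

def accuracy_distribution(grade, model, question_sets):
    """Grade one model on each generated GSM-Symbolic set and summarize.

    `grade(model, question_set)` is a hypothetical callable returning the
    model's accuracy (in percent) on one set of generated questions.
    """
    accs = [grade(model, qs) for qs in question_sets]  # e.g. 50 sets -> 50 accuracies
    return accs, statistics.mean(accs), statistics.stdev(accs)

# Hypothetical usage, assuming `grade`, `model`, and `question_sets` exist:
# accs, mean_acc, std_acc = accuracy_distribution(grade, model, question_sets)
# A histogram of `accs` with a dashed vertical line at the GSM8K accuracy
# reproduces the layout of Figure 2.
```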
Figure 3 is a bar chart showing how much the performance of different LLMs drops when tested on GSM-Symbolic compared to their performance on the original GSM8K. Each bar represents a different model, and its length corresponds to the percentage drop in accuracy. A downward bar means performance decreased on GSM-Symbolic. The labels on each bar provide the exact percentage drop for each model.
Text: "Figure 3: The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K."
Context: Later, we investigate the factors that impact the performance drops in more depth.
Relevance: This figure directly visualizes the performance degradation discussed in the text. It highlights the impact of using the more diverse and challenging GSM-Symbolic benchmark compared to the static GSM8K.
Figure 4 illustrates the sensitivity of Large Language Models (LLMs) to changes in names, numbers, or both within math word problems. It presents histograms showing the distribution of accuracy scores for six different LLMs across three conditions: changing only the names in the problem, changing only the numbers, and changing both names and numbers. Each histogram shows the frequency of different accuracy levels, allowing for a comparison of performance variability across the three conditions. The figure aims to demonstrate how these changes, while not affecting the underlying mathematical logic, can significantly impact the LLMs' ability to solve the problems.
Text: "Figure 4: How sensitive are LLMs when we change only names, only proper numbers, or both names and numbers?"
Context: Overall, models have noticeable performance variation even if we only change names, but even more when we change numbers or combine these changes.
Relevance: This figure is highly relevant as it directly addresses the research question of how fragile LLMs are to superficial changes (names) versus core changes (numbers) in mathematical reasoning problems. It provides evidence for the argument that LLMs are more sensitive to changes in numerical values than to changes in names, suggesting a potential over-reliance on pattern matching.
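To show how the three conditions in Figure 4 could be produced, here is a hedged sketch that re-samples either the name placeholders, the numeric placeholders, or both from a base assignment; the placeholder split, the sampling pools, and the helper name are illustrative assumptions.

```python
import random

def make_variant(template: str, base_values: dict, rng: random.Random,
                 change: str = "both") -> str:
    """Return a variant changing only names, only numbers, or both."""
    values = dict(base_values)  # start from the original assignment
    if change in ("names", "both"):
        values["name"] = rng.choice(["Sophie", "Ava", "Liam"])  # assumed pool
        values["family"] = rng.choice(["nephew", "niece"])
    if change in ("numbers", "both"):
        values["x"], values["y"], values["z"] = (rng.randint(5, 100) for _ in range(3))
        values["total"] = values["x"] + values["y"] + values["z"] + rng.randint(1, 50)
    return template.format(**values)

# Usage sketch: generate the three Figure 4 conditions from one base problem,
# assuming TEMPLATE and base_values are defined as in the earlier template sketch.
# names_only   = make_variant(TEMPLATE, base_values, rng, change="names")
# numbers_only = make_variant(TEMPLATE, base_values, rng, change="numbers")
# both_changed = make_variant(TEMPLATE, base_values, rng, change="both")
```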
Figure 5 demonstrates how the difficulty of the GSM-Symbolic math problems is modified by changing the number of clauses. It shows four example problems, each representing a different difficulty level: GSM-Symbolic-M1 (minus one clause), GSM-Symbolic (original), GSM-Symbolic-P1 (plus one clause), and GSM-Symbolic-P2 (plus two clauses). Each problem is a word problem involving calculations, and the increasing difficulty is reflected in the addition of more conditions or steps required to solve the problem. This figure provides a concrete illustration of how the benchmark allows for controlled manipulation of problem complexity.
Text: "Figure 5: Modifying the difficulty level of GSM-Symbolic by modifying the number of clauses."
Context:
Relevance: This figure is essential for understanding how the authors operationalize 'difficulty' in their experiments. By showing examples of problems with varying numbers of clauses, it clarifies the manipulation used to test the impact of complexity on LLM performance. This directly relates to the research question of how difficulty affects the performance distribution.
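As a rough illustration of how clause count is used to control difficulty, the sketch below assembles M1/base/P1/P2 variants by removing or appending clauses to a toy problem; the clause texts, numbers, and composition logic are invented for illustration and are not the authors' GSM-Symbolic-P1/P2 construction.

```python
# Hypothetical clause-based difficulty control (all wording is invented).
BASE_CLAUSES = [
    "A basket holds 40 apples.",
    "Half of the apples are given away.",
    "Then 8 more apples are added.",
]
EXTRA_CLAUSES = [
    "Afterwards, a quarter of the remaining apples are bruised and thrown away.",
    "Finally, 3 more apples are bought at the market.",
]
QUESTION = "How many apples are there at the end?"

def build_variant(level: str) -> str:
    """level is one of 'M1', 'base', 'P1', 'P2' (one clause removed/added/two added)."""
    clauses = {
        "M1":   BASE_CLAUSES[:-1],                 # one clause removed
        "base": BASE_CLAUSES,                      # original difficulty
        "P1":   BASE_CLAUSES + EXTRA_CLAUSES[:1],  # one clause added
        "P2":   BASE_CLAUSES + EXTRA_CLAUSES,      # two clauses added
    }[level]
    return " ".join(clauses + [QUESTION])

for level in ("M1", "base", "P1", "P2"):
    print(level, "->", build_variant(level))  # answers: 20, 28, 21, 24
```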
This figure illustrates how increasing the complexity of a math problem, represented by the number of clauses, affects the performance of Large Language Models (LLMs). It uses histograms to show the distribution of accuracy scores for four different models across four difficulty levels: GSM-M1 (one clause removed), GSM-Symb (original complexity), GSM-P1 (one clause added), and GSM-P2 (two clauses added). As the problems become more complex (moving from M1 to P2), the histograms generally shift to the left, indicating lower accuracy. The spread of the histograms also tends to increase with complexity, suggesting greater variability in performance. Think of it like stacking blocks: it's easier to balance a small tower (M1) than a tall one (P2). The taller the tower gets, the more likely it is to wobble and fall (more variance).
Text: "Figure 6: The impact of increasing the number of clauses on performance: As the difficulty increases from GSM-M1 → GSM-Symb→ GSM-P1 → GSM-P2, the distribution of performance shifts to the left (i.e., accuracy decreases), and the variance increases."
Context: As shown in Fig. 6, the trend of the evolution of the performance distribution is very consistent across all models: as the difficulty increases, the performance decreases and the variance increases. Note that overall, the rate of accuracy drop also increases as the difficulty increases. This is in line with the hypothesis that models are not performing formal reasoning, as the number of required reasoning steps increases linearly, but the rate of drop seems to be faster. Moreover, considering the pattern-matching hypothesis, the increase in variance suggests that searching and pattern-matching become significantly harder for models as the difficulty increases.
Relevance: This figure is central to the paper's argument that LLMs struggle with more complex reasoning tasks. It provides visual evidence of the performance degradation and increased variability as problem difficulty increases.
Figure 7 shows an example of how LLMs can be misled by irrelevant information. It presents a word problem from the GSM-NoOp dataset, where a seemingly relevant but ultimately unimportant detail is added (some kiwis being smaller). Two LLM responses are shown, both incorrectly incorporating the size of the kiwis into their calculations. This is like asking 'If you have 5 apples and 2 are green, how many apples do you have?' A person would understand that the color doesn't change the number of apples, but the LLMs seem to get confused by the extra detail.
Text: "Figure 7: An example from the GSM-NoOp dataset: We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes."
Context: Fig. 7 illustrates an example from GSM-NoOp. An interesting observation is that models tend to blindly subtract the number of smaller fruits, potentially because their training datasets included similar examples that required conversion to subtraction operations. In the Appendix, we include additional failure cases from GSM-NoOp. Overall, we find that models tend to convert statements to operations without truly understanding their meaning. For instance, a common case we observe is that models interpret statements about “discount” as “multiplication”, regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. Consequently, as shown in Fig. 8a, there is a catastrophic performance decline across all tested models, with the Phi-3-mini model experiencing over a 65% drop, and even stronger models such as o1-preview showing significant declines.
Relevance: This figure supports the paper's argument that LLMs rely on pattern matching and struggle with true understanding. It demonstrates how irrelevant information can significantly impact their performance, suggesting a lack of genuine comprehension of the problem.
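A minimal sketch of the No-Op idea described above, assuming a simple string-level insertion of a seemingly relevant but inconsequential clause before the final question sentence; the helper function, the insertion convention, and the specific numbers are illustrative assumptions (only the "smaller kiwis" theme echoes Figure 7).

```python
def add_noop_clause(question: str, noop: str) -> str:
    """Insert an inconsequential statement just before the final question sentence."""
    sentences = question.split(". ")
    # Assumed convention: the last sentence is the actual question.
    return ". ".join(sentences[:-1] + [noop, sentences[-1]])

base = ("Oliver picks 10 kiwis on Friday. Then he picks 20 kiwis on Saturday. "
        "On Sunday, he picks double the number of kiwis he did on Friday. "
        "How many kiwis does Oliver have?")
noop = "Five of them were a bit smaller than average"
print(add_noop_clause(base, noop))
# The correct answer is unchanged (10 + 20 + 20 = 50); a model that subtracts 5
# is converting the irrelevant size remark into an operation.
```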
Figure 8 is a collection of bar charts demonstrating the performance drop of various large language models (LLMs) on the GSM-NoOp dataset, a modified version of GSM8K designed to assess how LLMs handle irrelevant information within math problems. (a) shows the general performance drop on GSM-NoOp across different models. (b) compares performance on GSM8K, GSM-Symbolic, and GSM-NoOp when using different 'shots' or examples during testing. 'NoOp-Symb' uses examples from GSM-Symbolic, while 'NoOp-NoOp' uses examples from GSM-NoOp. (c) highlights specific models that, while generally performing worse on GSM8K and GSM-Symbolic, show better performance on NoOp-Symb.
Text: "Figure 8: (a) The performance of models drops significantly on GSM-NoOp, with more recent models experiencing a greater decline than older ones."
Context: (b) As previously demonstrated, performance on GSM-Symbolic is very close to that on GSM8K. However, on GSM-NoOp, the significant drop in performance cannot be recovered, even when using variations of the exact same question as shots (NoOp-Symb) or when using different GSM-NoOp questions that contain No-Op operations (NoOp-NoOp) as shots. (c) Notably, some models that perform significantly worse than those in (b) on GSM8K and GSM-Symbolic show much better performance on NoOp-Symb.
Relevance: This figure is central to the paper's argument about the limitations of LLMs in mathematical reasoning. It visually demonstrates how LLMs struggle with irrelevant information, even when provided with relevant examples. It supports the idea that LLMs rely on pattern matching and struggle with true understanding.
This research explored the reasoning abilities of Large Language Models (LLMs) in mathematics, particularly focusing on limitations of current evaluations using the GSM8K dataset. They introduced GSM-Symbolic, a new benchmark offering varied mathematical problems. The study showed significant inconsistencies in LLM performance on similar questions, especially with changes in numerical values. Performance also decreased with increasing problem complexity. The GSM-NoOp dataset, which includes irrelevant information in problems, revealed a major weakness: LLMs often use this irrelevant information, leading to significant errors. This suggests LLMs rely on pattern matching rather than true logical reasoning, even for simple math problems.
The conclusion effectively summarizes the main findings of the study, including the limitations of GSM8K, the benefits of GSM-Symbolic, and the observed performance variations and limitations of LLMs.
The conclusion clearly articulates the implications of the findings for future research, emphasizing the need for models capable of formal reasoning.
The conclusion ends with a strong statement that reinforces the importance of the research and its contribution to the field of AI.
While the conclusion mentions "substantial performance drops of up to 65%", providing more specific numbers for different models would strengthen the claim.
Rationale: Quantifying the drops would provide a more concrete understanding of the impact of irrelevant information on LLM performance.
Implementation: Include specific performance drop percentages for a few representative models, or refer to a table or figure with detailed results.
The conclusion focuses on the limitations of current methods but could briefly mention potential alternative approaches for evaluating LLM reasoning.
Rationale: Suggesting alternative evaluation methods would provide a more constructive outlook and stimulate further research in this direction.
Implementation: Add a sentence or two discussing alternative approaches, such as incorporating more complex reasoning tasks or using human evaluation to assess understanding.
While the conclusion mentions general intelligence, connecting the findings to specific real-world applications of LLMs would enhance the paper's relevance.
Rationale: Discussing the implications for real-world applications would highlight the practical significance of the research and its potential impact.
Implementation: Add a sentence or two discussing the implications of the findings for applications like automated theorem proving, problem-solving in scientific domains, or other areas where robust mathematical reasoning is crucial.
This appendix provides supplementary information to the main paper. It includes detailed experimental setups, complete results on GSM8K and GSM-Symbolic benchmarks and their variants, additional results on performance distribution, further analysis of the impact of question difficulty (including the effects of fine-tuning), and a comprehensive discussion of the OpenAI o1-mini and o1-preview models.
The appendix provides a wealth of supplementary information that enhances the understanding of the main paper's findings.
The detailed description of the experimental setup, including the prompt template, allows for reproducibility and facilitates further research.
The inclusion of the full results table allows for a more complete analysis and comparison of different models.
While the appendix provides a lot of information, organizing it into clearer subsections with more descriptive headings would improve readability.
Rationale: Clearer organization would make it easier for readers to navigate the appendix and find the specific information they are looking for.
Implementation: Use more descriptive subheadings that clearly indicate the content of each section, such as "A.1 Experimental Setup: Prompting and Decoding" or "A.2 Full Results: GSM8K and GSM-Symbolic Performance".
The additional results presented in the appendix could benefit from more context and explanation. For example, the connection between the additional results and the main paper's findings could be made more explicit.
Rationale: Providing more context would help readers understand the significance of the additional results and how they relate to the overall research question.
Implementation: Add introductory paragraphs or sentences to each subsection explaining the purpose of the additional results and how they complement the findings in the main paper.
The appendix could benefit from a discussion of the limitations of the presented results. For example, the limitations of using greedy decoding or the potential biases in the datasets could be discussed.
Rationale: Acknowledging limitations would provide a more balanced perspective and encourage further research to address these limitations.
Implementation: Add a section or paragraphs discussing the limitations of the presented results and their potential impact on the conclusions of the study.
Figure 9 shows the prompt format used for evaluating the Large Language Models (LLMs). It consists of a preamble (or system instruction), eight example question-answer pairs (referred to as 'shots'), and the target question. The preamble sets the context for the LLM, instructing it to solve mathematical questions step-by-step like an expert. Each shot includes a question (Q:) and an answer (A:) that demonstrates the desired chain-of-thought reasoning process. The target question is presented without an answer, and the LLM is expected to generate a step-by-step solution and provide the final answer. Placeholders like {{question}}, {{solution}}, and {{final answer}} are used to represent the actual content that would be inserted during evaluation. Think of it like a recipe: the preamble is the general instruction (e.g., 'bake at 350 degrees'), the shots are examples of how to make specific dishes (e.g., 'chocolate chip cookies'), and the target question is a new dish the LLM needs to 'cook' (e.g., 'oatmeal raisin cookies') using the same general instructions and the examples provided.
Text: "Figure 9: The prompt format used for evaluations."
Context:
Relevance: This figure is crucial for understanding the experimental setup and how the LLMs were evaluated. It provides a clear picture of the input provided to the models, including the context setting, the examples used for few-shot learning, and the format of the target questions. This helps in interpreting the results and understanding the LLMs' behavior.
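As a hedged sketch of the prompt structure just described, the code below assembles a preamble, eight question/answer shots, and a target question into a single string; the preamble wording and the shot contents are placeholders rather than the authors' verbatim prompt.

```python
# Hypothetical 8-shot chain-of-thought prompt builder (wording is illustrative).
PREAMBLE = ("As an expert problem solver, solve the following mathematical "
            "questions step by step.")

def build_prompt(shots, target_question):
    """shots: list of (question, step_by_step_solution_with_final_answer) pairs."""
    parts = [PREAMBLE]
    for question, solution in shots:       # eight shots in the paper's setup
        parts.append(f"Q: {question}\nA: {solution}")
    parts.append(f"Q: {target_question}\nA:")  # the model completes the reasoning
    return "\n\n".join(parts)

# Usage sketch, assuming `example_shots` and `target` are defined elsewhere:
# prompt = build_prompt(example_shots, target)
```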
Table 1 presents the performance of various Large Language Models (LLMs) on different versions of the GSM8K math problem dataset. It shows the accuracy (percentage of correctly answered questions) for each model on the full GSM8K test set, a smaller 100-question subset of GSM8K, and several variations of GSM-Symbolic (M1, standard, P1, P2, and NoOp). Each variation represents a different level of difficulty or type of change applied to the original problems. The table also includes standard deviations, which show how much the accuracy scores vary across different runs or subsets of the data. Think of it like testing different students (LLMs) on different sets of math problems (datasets). The table shows each student's average score on each problem set and how consistent their scores are.
Text: "Table 1: Full 8-shot results of all models on GSM8Kand different variants of GSM-Symbolic."
Context:
Relevance: This table summarizes the main quantitative results of the paper. It allows for a direct comparison of different LLMs and their performance across different benchmarks, providing evidence for the claims made about performance variations, the impact of difficulty, and the effect of irrelevant information.
Figure 10 presents additional results on the performance variation of Large Language Models (LLMs) on the GSM-Symbolic dataset. It shows histograms of accuracy distributions for three different models: Phi-2, Mistral-7b-instruct-v0.1, and Gemma2-2b-it. Each histogram shows how often each model achieved a particular accuracy level across multiple runs or variations of the GSM-Symbolic problems. The x-axis represents the accuracy percentage, and the y-axis represents the frequency. The average accuracy on the original GSM8K dataset is also provided for comparison, along with the average accuracy and standard deviation on GSM-Symbolic.
Text: "Figure 10: Additional results on performance variation on GSM-Symbolic."
Context:
Relevance: This figure supplements the earlier analysis of performance variation on GSM-Symbolic (Figure 2) by providing results for additional models. It further supports the claim that LLMs exhibit significant performance variability even on slightly different versions of the same math problems, raising concerns about the reliability of single-point accuracy metrics.
Figure 11 explores whether using examples (shots) from a slightly harder problem set (GSM-P1) or fine-tuning a model on that set improves performance on an even harder set (GSM-P2). Part (a) shows that including examples from GSM-P1 during testing doesn't help much on GSM-P2. It's like trying to learn advanced calculus by only looking at algebra examples – it won't give you the tools you need. Part (b) shows that even fine-tuning a model on GSM-P1, while improving performance on P1, doesn't translate to better performance on P2. This is like training a dog to fetch a specific ball; it might get really good at fetching *that* ball, but not necessarily other objects.
Text: "Figure 11: Using in-context shots or finetuning on GSM-P1 does not improve performance on GSM-P2: (a) Compared to the case where 8 shots come from GSM8K, when we include shots from GSM-P1the performance on GSM-P2 does not improve."
Context: (b) Finetuning on GSM-P1 can improve performance on GSM-P1 but not on GSM-P2.
Relevance: This figure is important because it investigates whether exposure to slightly harder problems, either through examples or fine-tuning, can improve performance on significantly harder problems. The negative results suggest that simply increasing the difficulty of training data or examples might not be enough to improve the reasoning capabilities of LLMs.
Figure 12 presents the performance of two closed-source LLMs, o1-mini and o1-preview, on different versions of the GSM-Symbolic dataset. It uses histograms to show how often each model achieved a particular accuracy on GSM8K, GSM-Symbolic (the standard version), GSM-M1 (easier), GSM-P1 (harder), and GSM-P2 (hardest). The x-axis represents accuracy, and the y-axis represents frequency. The figure also provides the average accuracy and standard deviation for each model and dataset. The key observation is that o1-preview performs very well and consistently across all difficulty levels, while o1-mini's performance degrades as the difficulty increases, similar to the open-source models discussed earlier.
Text: "Figure 12: Results on o1-mini and o1-preview: both models mostly follow the same trend we presented in the main text."
Context: However, o1-preview shows very strong results on all levels of difficulty as all distributions are close to each other.
Relevance: This figure is relevant because it extends the analysis to closed-source models, showing that while some closed models like o1-preview demonstrate strong and consistent performance across different difficulty levels, others like o1-mini still struggle with increasing complexity, following similar trends as open-source models.
Figure 13 shows an example of how the o1-preview model fails to understand the context of a word problem from the GSM-NoOp dataset. The problem asks for the current cost of school supplies, given current prices and mentioning that prices were lower last year due to inflation. However, the model incorrectly calculates the cost based on the lower, past prices, even though the question explicitly asks for the current cost. This demonstrates the model's tendency to blindly apply numerical operations without fully grasping the context or relevance of the information provided. It's like a student who sees the word 'discount' and automatically subtracts, regardless of whether a discount is actually being applied in the problem.
Text: "Figure 13: Sample response from o1-preview on an example from GSM-NoOp: the model blindly applies the inflation rate, even though the inflation amount is irrelevant as the question clearly indicates the given prices are for "now" and not last year."
Context:
Relevance: This figure is relevant because it provides a specific example of how even high-performing closed-source models like o1-preview can fail on GSM-NoOp problems due to their inability to filter out irrelevant information. It reinforces the paper's argument that LLMs struggle with true understanding and rely on pattern matching, leading to errors when presented with irrelevant information.
Figure 14 presents a word problem from the GSM-NoOp dataset, which involves calculating the price difference between sourdough loaves and muffins after a donation. The figure includes responses from two models, o1-preview and o1-mini. Both models provide step-by-step solutions, but they incorrectly account for the donated items by subtracting their value from the total cost, even though the donation doesn't affect the price difference between the two items. The o1-preview model calculates the cost of the remaining items after donation and then finds the difference. The o1-mini model calculates the initial cost, the value of the donated items, the net costs after donation, and finally, the difference. Both models arrive at the same incorrect answer due to the erroneous subtraction of the donation value.
Text: "Figure 14: Sample response from o1-preview and o1-mini on an example from GSM-NoOp: while the donation amount is irrelevant to the price difference, the models subtract the amount we donate."
Context:
Relevance: This figure further illustrates the point that LLMs struggle with irrelevant information and often misinterpret the problem, even when provided with seemingly straightforward scenarios. It reinforces the paper's argument that LLMs rely on pattern matching and lack true understanding of the underlying mathematical concepts. It shows that even advanced models like o1-preview and o1-mini are susceptible to this issue.