LLMs Know What They Don't Know: Discovering the Internal Representations of Truthfulness

Table of Contents

Overall Summary

Overview

This paper investigates how Large Language Models (LLMs) internally represent the truthfulness of their generated text, focusing on where this information is located within the model's internal activations. The study analyzes various error detection methods, including probing classifiers applied to specific tokens within the LLM output, and compares their performance across different LLMs and datasets. The research reveals that truthfulness information is concentrated in specific tokens (the exact answer tokens), and that probing these tokens improves error detection. Furthermore, the study explores the generalization of these methods across tasks, the predictability of error types, and the discrepancy between internal knowledge and generated answers. The findings shed light on the inner workings of LLMs and their limitations in generating truthful and accurate information.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Figure 2

Description: Figure 2 visually demonstrates the localized nature of truthfulness information by showing heatmaps of error detection performance across different layers and tokens for the Mistral-7b-instruct model. The heatmaps clearly show that the highest AUC scores are achieved at the exact answer tokens, confirming that these tokens hold the most information about the correctness of the LLM's output.

Relevance: This figure provides strong visual evidence for the key finding that truthfulness information is localized within specific tokens, supporting the argument for using exact answer tokens in error detection.

Table 1

Description: Table 1 provides a quantitative comparison of different error detection techniques across various LLMs and datasets, using the AUC metric. The table clearly shows that probing classifiers applied to the exact answer tokens achieve the best performance compared to other methods, such as using aggregated logits or probabilities. The table includes numerical results (AUC scores) demonstrating the improved performance achieved by focusing on exact answer tokens.

Relevance: This table provides quantitative support for the paper's main claim by showing that probing classifiers applied to the exact answer tokens significantly outperform other error detection methods across multiple LLMs and datasets.

Conclusion

This paper demonstrates that LLMs encode truthfulness information within their internal representations, particularly in the "exact answer tokens." This finding enables improved error detection, especially in open-source models. However, the limited generalization of these truthfulness features across different tasks suggests the presence of skill-specific mechanisms within LLMs. The ability to predict error types and the observed discrepancy between internal knowledge and generated answers further highlight the complexity of LLM behavior. Future research should focus on exploring these skill-specific mechanisms, developing more robust error detection methods that generalize across tasks, and leveraging internal knowledge to improve LLM decoding strategies and mitigate the generation of incorrect information. This could involve developing adaptive probes that tailor their analysis to the specific task or designing new training procedures that encourage LLMs to better align their internal knowledge with their external behavior. Ultimately, these efforts will contribute to building more reliable and trustworthy LLMs for a wider range of applications.

Section Analysis

Abstract

Overview

Large language models (LLMs) often make errors, known as hallucinations. This paper shows that LLMs store information about the truthfulness of their answers, especially within specific tokens. This information can be used to detect errors more effectively, but these detectors don't work universally across different tasks. The paper also shows how to predict the types of errors LLMs are likely to make and reveals that sometimes LLMs internally know the correct answer but still generate a wrong one.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

Large language models (LLMs) are prone to generating incorrect information, often called "hallucinations." This paper shifts from focusing on how humans perceive these errors to examining how LLMs internally encode truthfulness. The research explores how this internal encoding can be used to better detect errors, predict error types, and potentially mitigate them. The paper also addresses the broad definition of "hallucinations" used in the study.

Key Aspects

Strengths

Suggestions for Improvement

Background

Overview

This section defines LLM errors, often called "hallucinations," and discusses existing research on detecting these errors. It emphasizes the lack of a universal definition for hallucinations and adopts a broad interpretation encompassing all types of LLM errors. The section also reviews prior work on error detection, including methods using external knowledge, output logits, and probing classifiers, highlighting the need for a more holistic approach.

Key Aspects

Strengths

Suggestions for Improvement

Better Error Detection

Overview

This section describes experiments on detecting errors in LLM-generated text. It focuses on how choosing the right token within the LLM's output significantly improves error detection. The section defines the task, explains the experimental setup (datasets, models, and metrics), and introduces the concept of using "exact answer tokens" for better error detection.
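
As a minimal illustration of the setup described above, the sketch below shows how the hidden states that the probing classifiers consume could be collected with Hugging Face transformers: generate an answer, re-run a forward pass over the prompt plus answer, and read off the activation at a chosen layer and token. The model name, prompt, and layer index are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' code): extract the hidden state at a chosen
# (layer, token) position of a generated answer; this vector is what a probing
# classifier would receive as input. Model name, prompt, and layer are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any open causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Answer the question. Q: Who wrote 'Pride and Prejudice'? A:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    gen_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    out = model(gen_ids, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each [batch, seq_len, d_model]
layer = 15       # assumed middle layer; the paper selects the layer on a validation set
token_idx = -1   # e.g. the last generated token; exact answer tokens are located separately
probe_input = out.hidden_states[layer][0, token_idx]
```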

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 1

Figure 1 provides examples of an input prompt and the LLM's response from the TriviaQA dataset. It highlights the specific tokens that can be probed for truthfulness information, including the first and last exact answer tokens within the generated response. This figure helps to visualize the concept of exact answer tokens and their position within the LLM's output.

First Mention

Text: "Figure 1: Example for the input and LLM output from the TriviaQA dataset, and the names of the tokens that can be probed."

Context: Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs' unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer's correctness, disregarding subsequent generated content. Figure 1 illustrates the different token locations.

Relevance: This figure is relevant because it visually demonstrates the concept of 'exact answer tokens,' which are crucial for the proposed error detection method. By highlighting these tokens, the figure clarifies how the research pinpoints the locations within the LLM's output where truthfulness signals are strongest.

Critique
Visual Aspects
  • The figure could benefit from clearer visual cues to distinguish between the prompt and the LLM's response. Perhaps different background colors or bounding boxes could be used.
  • The font size for the token labels (e.g., '[INST]', '[/INST]') is small and might be difficult to read. Increasing the font size would improve readability.
  • While the figure shows the exact answer tokens, it could be enhanced by visually highlighting the other token positions mentioned in the text (e.g., last generated token, end of question token) for comparison.
Analytical Aspects
  • The figure could be more impactful by showing examples of both correct and incorrect answers to illustrate how the exact answer tokens differ in these cases.
  • The figure focuses on a single dataset (TriviaQA). Including examples from other datasets used in the paper would demonstrate the broader applicability of the concept.
  • The figure could be accompanied by a brief explanation of how the exact answer tokens are identified (e.g., using an external algorithm) to provide more context.
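
To make the exact-answer-token idea concrete, the following hypothetical sketch locates the token positions of an already-extracted exact answer inside the generated text (the paper delegates the extraction itself to an external few-shot model). The tokenizer choice and the simple substring search are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: map an extracted exact-answer string back to token indices
# in the generated text, so its first/last tokens (and their neighbours) can be
# probed. Assumes a fast tokenizer (for character offsets); not the paper's code.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def exact_answer_span(generated_text: str, exact_answer: str, offset: int = 0):
    """Return (first_token_idx, last_token_idx) of `exact_answer` within the
    tokenized `generated_text`, shifted by `offset` (e.g. the prompt length)."""
    start = generated_text.lower().find(exact_answer.lower())
    if start < 0:
        return None
    end = start + len(exact_answer)
    enc = tok(generated_text, return_offsets_mapping=True, add_special_tokens=False)
    hits = [i for i, (s, e) in enumerate(enc["offset_mapping"]) if s < end and e > start]
    return offset + hits[0], offset + hits[-1]

span = exact_answer_span("The author of 'Pride and Prejudice' is Jane Austen.", "Jane Austen")
# The probe targets the token before this span, its first and last tokens, and the one after.
```
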
figure 2

Figure 2 displays heatmaps showing the performance (AUC values) of a probe error detector across different layers and tokens of the Mistral-7b-instruct LLM. The heatmaps reveal that the error detection performance peaks at the exact answer tokens, particularly in the middle to later layers of the model. This visualization supports the claim that truthfulness information is concentrated in these specific tokens.

First Mention

Text: "Figure 2: AUC values of a probe error detector across layers and tokens, Mistral-7b-instruct. Generation proceeds from left to right, with detection performance peaking at the exact answer tokens."

Context: Following prior work, we use a linear probing classifier for error detection (Li et al., 2024, inter alia) on static tokens: the last generated token (hl,−1), the one before it (hl,−2), and the final prompt token (hl,k). The layer l is selected per token based on validation set performance. For further details on the implementation of each method, refer to Appendix A.4. 3.3 EXACT ANSWER TOKENS Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs' unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer's correctness, disregarding subsequent generated content. Figure 1 illustrates the different token locations. In the following experiments, we implement each error detection method with an "exact answer" version, demonstrating that it often improves performance, especially in probing. The exact answer is identified from a lengthy generated answer using an external algorithm, which processes the question and the LLM's response, A(qi, ŷi), to extract the exact answer. In our implementation, we use Mistral-7b-Instruct in a few-shot learning setup as A. However, we demonstrate that all the LLMs we evaluate are capable of extracting exact answers from their own outputs, as explained in Appendix A.2. After extracting the exact answer, the exact answer tokens are identified through a simple search process. We focus on four specific tokens: the one immediately preceding the first exact answer token, the first exact answer token itself, the last exact answer token, and the one immediately following it. 3.4 RESULTS Patterns of truthfulness encoding. We first focus on probing classifiers to gain insights into the internal representations of LLMs. Specifically, we extensively analyze the effects of layer and token selection on activation extraction for these classifiers. This is done by systematically probing all layers of the model, starting with the last question token and continuing through to the final generated token. Figure 2 shows the AUC metrics of trained probes across various layers and tokens of Mistral-7b-Instruct. While some datasets seem easier for error prediction, all exhibit consistent truthfulness encoding patterns.

Relevance: Figure 2 directly supports a central claim of the paper: that truthfulness information is concentrated in the exact answer tokens. The heatmaps visually demonstrate the peak performance of the error detector at these tokens, providing strong evidence for this claim.

Critique
Visual Aspects
  • The color scale could be improved for better contrast and easier identification of peak performance areas.
  • Adding clear labels or annotations directly on the heatmaps to pinpoint the exact answer token locations would enhance readability.
  • The figure could benefit from a more descriptive caption that explains the axes, the color scale, and the key takeaways.
Analytical Aspects
  • The figure only shows results for one LLM (Mistral-7b-instruct). Including similar heatmaps for other LLMs would strengthen the generalizability of the findings.
  • The figure could be more informative by including a baseline comparison (e.g., performance at the last generated token) to highlight the improvement achieved by focusing on exact answer tokens.
  • A brief explanation of the statistical significance of the observed peak performance (e.g., p-values) would strengthen the analysis.
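
The layer-by-token sweep visualized in Figure 2 can be approximated with a simple loop: fit a linear probe on the activations at every (layer, token) position and record its held-out AUC. The sketch below assumes activations and correctness labels have already been extracted into arrays; it is an illustration of the procedure, not the authors' code.

```python
# Illustrative sketch of the layer x token sweep behind Figure 2: fit a linear
# probe at every (layer, token) position and record its held-out AUC. The
# activation array and labels are assumed to be pre-extracted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auc_heatmap(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """acts: [n_examples, n_layers, n_tokens, d_model]; labels: 1 = correct answer."""
    _, n_layers, n_tokens, _ = acts.shape
    heat = np.zeros((n_layers, n_tokens))
    for layer in range(n_layers):
        for token in range(n_tokens):
            X_tr, X_te, y_tr, y_te = train_test_split(
                acts[:, layer, token, :], labels, test_size=0.2, random_state=0)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            heat[layer, token] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return heat  # values are expected to peak near the exact answer tokens

# Toy shapes, just to show the interface:
heat = probe_auc_heatmap(np.random.randn(200, 4, 6, 32), np.random.randint(0, 2, 200))
```
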
table 1

Table 1 compares the performance of various error detection techniques across different Large Language Models (LLMs) and datasets, using the Area Under the Curve (AUC) metric. The techniques include simple baselines like 'Majority' and more sophisticated methods like 'Logits-mean', 'Logits-min', 'p(True)', and 'Probe'. The table also shows the impact of using exact answer tokens on the performance of these techniques. The best-performing method for each LLM-dataset combination is highlighted in bold.

First Mention

Text: "Table 1: Comparison of error detection techniques using AUC metric, across different models and datasets. The best-performing method is bolded. Using exact answer tokens is useful for many cases, especially probing."

Context: Next, we evaluate various error detection methods by comparing their performance with and without the use of exact answer tokens. Table 1 compares the AUC across three representative datasets (additional datasets and models in Appendix B, showing consistent patterns). Here we present results for the last exact answer token, which outperformed both the first exact answer token and the one preceding it, while the token following the last performed similarly.

Relevance: This table is crucial for understanding the effectiveness of different error detection methods and the impact of using exact answer tokens. It directly addresses the paper's main contribution of improving error detection by focusing on specific tokens.

Critique
Visual Aspects
  • The table could benefit from visual separation between the two LLM groups (Mistral and Llama) to improve readability.
  • Using a color gradient to represent AUC values could make it easier to quickly identify high-performing methods.
  • Adding a brief explanation of the different methods in a footnote or caption would make the table more self-contained.
Analytical Aspects
  • While standard deviations are provided, adding p-values or confidence intervals would strengthen the statistical significance of the results.
  • The table could include a discussion of the limitations of the AUC metric and potential alternative evaluation measures.
  • A more detailed analysis of the differences in performance between the methods, particularly the impact of exact answer tokens, would be valuable.
Numeric Data
  • Mistral-7B-Instruct, TriviaQA, Probe @ Exact: 0.92 AUC
  • Mistral-7B-Instruct, Winobias, Probe @ Exact: 0.92 AUC
  • Mistral-7B-Instruct, Math, Probe @ Exact: 0.95 AUC
  • Llama 3-8b-Instruct, TriviaQA, Probe @ Exact: 0.93 AUC
  • Llama 3-8b-Instruct, Winobias, Probe @ Exact: 0.95 AUC
  • Llama 3-8b-Instruct, Math, Probe @ Exact: 0.83 AUC
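
For comparison with the probe rows in Table 1, the aggregated-confidence baselines can be sketched as simple reductions over the per-token probabilities the model assigned to its own answer, optionally restricted to the exact answer span. Whether the paper aggregates raw logits or probabilities differs per baseline; the log-probability version below is only an illustration with assumed inputs.

```python
# Sketch of the aggregated-confidence baselines compared in Table 1: reduce the
# per-token (log-)probabilities of the generated answer with mean or min,
# optionally restricted to the exact answer span. Inputs are assumed placeholders.
import numpy as np

def confidence_scores(token_probs, answer_span=None):
    """token_probs: probability the model assigned to each generated token.
    answer_span: optional (first, last) indices of the exact answer tokens."""
    probs = np.asarray(token_probs, dtype=float)
    if answer_span is not None:
        first, last = answer_span
        probs = probs[first:last + 1]
    logp = np.log(probs)
    return {"mean": logp.mean(), "min": logp.min()}

# Restricting to the exact answer tokens ignores filler tokens around the answer.
scores = confidence_scores([0.9, 0.8, 0.4, 0.95, 0.99], answer_span=(2, 3))
```
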
table 5

Table 5 presents a comparison of error detection performance, measured by AUC, on the Mistral-7B-Instruct model. Various methods are compared, including majority voting, logit-based methods (mean, min, with and without exact answer tokens), probability-based methods, p(True), and probing at different token positions. The table covers several datasets, including TriviaQA, Winobias, Math, Movies, IMDB, HotpotQA, HotpotQA-WC, Winogrande, NLI, and NQ-WC.

First Mention

Text: "Table 5: Comparison of error detection performance (AUC) on Mistral-7B-Instruct."

Context: In Table 8 we present some qualitative samples from Mistral-7b-instruct, for the phenomenon we observe at error type (C2) Consistently incorrect but generates the correct answer at least one time. The samples in the table represent cases where the probe chose the correct answer. Table 9 compares different decoding mechanisms, including the choice via probe, on non-instruct models, and Table 10 compares on the instruct models.

Relevance: This table provides a comprehensive overview of the error detection performance of the Mistral-7B-Instruct model across various methods and datasets, allowing for a direct comparison of their effectiveness.

Critique
Visual Aspects
  • The table could be improved by visually separating the different datasets into groups (e.g., factual, common sense, etc.) for better readability.
  • Highlighting the best-performing method for each dataset would make it easier to quickly identify the most effective strategies.
  • The table could benefit from a clearer explanation of the different probing locations and their significance.
Analytical Aspects
  • While the table includes standard deviations, adding p-values or confidence intervals would strengthen the statistical significance of the results.
  • A discussion of the limitations of the AUC metric and potential alternative evaluation measures would be valuable.
  • The table could include a more detailed analysis of the impact of using exact answer tokens on the performance of the different methods.
Numeric Data
  • TriviaQA, Probe @ Exact answer last: 0.85 AUC
  • Winobias, Probe @ Exact answer last: 0.92 AUC
  • Math, Probe @ Exact answer last: 0.92 AUC
  • Movies, Probe @ Exact answer last: 0.96 AUC
  • IMDB, Probe @ Exact answer last: 0.97 AUC
table 6

Table 6 presents a comparison of error detection performance, measured by the Area Under the Curve (AUC) score, for the Llama-8b language model. The table is structured to compare various error detection methods across different datasets, including TriviaQA, Winobias, Math, Movies, IMDB, HotpotQA, HotpotQA-WC, Winogrande, NLI, and NQ-WC. The methods compared include a majority baseline, logits-based methods (with and without exact answer tokens), probabilities-based methods, p(True) methods, and probing at different token positions (last generated, before last generated, end of question, exact answer last, and exact answer last+1). Each cell in the table provides the AUC score and its standard deviation for a specific method and dataset combination.

First Mention

Text: "Table 6: Comparison of error detection performance (AUC) on Llama-8b."

Context: The caption is located above Table 6 on page 26.

Relevance: This table is highly relevant as it provides a comprehensive comparison of different error detection methods on the Llama-8b model. It allows for direct comparison of the effectiveness of various techniques and highlights the impact of using exact answer tokens. The results contribute to understanding the strengths and weaknesses of each method and inform the choice of appropriate error detection strategies for different datasets and tasks.

Critique
Visual Aspects
  • Clear Layout: The table is well-organized and easy to read, with clear headings for datasets and methods.
  • Standard Deviations: The inclusion of standard deviations provides valuable information about the variability of the results.
  • Two-Part Structure: The division of the table into two sections for different groups of datasets improves readability.
Analytical Aspects
  • Comprehensive Comparison: The table includes a wide range of error detection methods, allowing for a thorough comparison.
  • Impact of Exact Answers: The inclusion of methods with and without exact answer tokens highlights the importance of this factor in error detection.
  • Statistical Significance: While standard deviations are provided, the table could benefit from including statistical significance tests (e.g., p-values) to determine if the differences between methods are statistically significant.
Numeric Data
table 7

Table 7 provides a comparison of error detection performance, using the AUC metric, for the Llama-8b-Instruct model. It compares various error detection methods across multiple datasets, including TriviaQA, Winobias, Math, Movies, IMDB, HotpotQA, HotpotQA-WC, Winogrande, NLI, and NQ-WC. The methods compared include a majority baseline, logits-based methods (with and without exact answer tokens), probabilities-based methods, p(True) methods, and probing at different token positions. Each cell presents the AUC score and its standard deviation for a specific method and dataset.

First Mention

Text: "Table 7: Comparison of error detection performance (AUC) on Llama-8b-Instruct."

Context: The caption is above Table 7 on page 27.

Relevance: This table is crucial for understanding the performance of different error detection techniques on the Llama-8b-Instruct model. It allows for direct comparison of the methods and shows the effect of using exact answer tokens. The results contribute to evaluating the effectiveness of each method and inform the selection of suitable error detection strategies for different datasets and tasks.

Critique
Visual Aspects
  • Clear Structure: The table is well-organized and easy to read, with clear headings for datasets and methods.
  • Standard Deviations: The inclusion of standard deviations provides valuable information about the variability of the results.
  • Two-Part Structure: The division of the table into two sections for different groups of datasets improves readability.
Analytical Aspects
  • Comprehensive Comparison: The table includes a wide range of error detection methods, allowing for a thorough comparison.
  • Impact of Exact Answers: The inclusion of methods with and without exact answer tokens highlights the importance of this factor in error detection.
  • Statistical Significance: While standard deviations are provided, the table would benefit from statistical significance tests (e.g., p-values) to determine if the differences between methods are statistically significant.
Numeric Data

Generalization Between Tasks

Overview

This section investigates how well error detection models, specifically probing classifiers, generalize across different tasks. While initial results suggest some generalization, further analysis reveals that this is mostly due to information already present in the output logits. True generalization, beyond what logits can capture, is limited to tasks requiring similar skills, like factual recall. This suggests LLMs have multiple, task-specific ways of encoding truthfulness, not a single universal mechanism.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 3

Figure 3 presents two heatmaps illustrating the generalization performance of the Mistral-7b-instruct model across different datasets. Heatmap (a) shows the raw AUC values when a probe trained on one dataset is tested on another. Values above 0.5 suggest some generalization. Heatmap (b) shows the difference in AUC between the probe and a logit-based method. Positive values indicate that the probe generalizes better than simply using logits, suggesting the probe learns something beyond what's available in the output probabilities.

First Mention

Text: "Figure 3: Generalization between datasets, Mistral-7b-instruct. After subtracting the logit-based method's performance, we observe that most datasets show limited or no meaningful generalization."

Context: Results. Figure 3a shows the generalization results for Mistral-7b-instruct, with similar patterns observed for other LLMs in Appendix C. In this context, values above 0.5 indicate successful generalization. At first glance, the results appear consistent with previous research: most heatmap values exceed 0.5, implying some degree of generalization across tasks. This observation supports the existence of a universal mechanism for decoding truthfulness, since the same linear directions—captured by the probe—encode truthfulness information across many datasets. However, upon closer inspection, it turns out that most of this performance can be achieved by logit-based truthfulness detection, which only observes the output logits. Figure 3b presents the same heatmap after subtracting results from our strongest logit-based baseline (Logit-min-exact). This adjusted heatmap reveals the probe’s generalization rarely exceeds what can be achieved by examining logits alone. This implies that the apparent generalization does not stem from a universal internal encoding of truthfulness but rather reflects information already accessible through external features like logits.

Relevance: This figure is central to the paper's argument about the limitations of generalization in error detection. It shows that while some generalization appears to occur, it's mostly explained by information already present in the output logits, challenging the idea of a universal truthfulness encoding.

Critique
Visual Aspects
  • The color scales in both heatmaps could be adjusted for better contrast, making it easier to distinguish between different levels of generalization.
  • Labeling the axes with dataset names directly on the heatmaps, rather than just in the caption, would improve readability.
  • Adding a visual guide or annotation to highlight areas of meaningful generalization (i.e., high values in heatmap (b)) would make the key findings more readily apparent.
Analytical Aspects
  • The caption could be more explicit about the logit-based method used for comparison in heatmap (b).
  • Including a brief explanation of the statistical significance of the observed differences between the probe and logit-based methods would strengthen the analysis.
  • The figure could be enhanced by adding a third heatmap showing the performance of the logit-based method alone, allowing for a direct visual comparison.
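
The analysis behind Figure 3 can be sketched as follows: train a probe on one dataset's exact-answer activations, evaluate its AUC on every other dataset (Figure 3a), then subtract the AUC of a logit-only score on the same test set (Figure 3b). The dictionaries of features, labels, and logit scores below are assumed placeholders, not the authors' data structures.

```python
# Sketch of the cross-task generalization matrices in Figure 3: train on one
# dataset, test on every other (raw AUC), then subtract a logit-only baseline
# AUC on the same test set. All inputs are assumed placeholder dicts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def generalization_heatmaps(features, labels, logit_scores):
    """features/labels/logit_scores: dicts mapping dataset name -> arrays."""
    names = list(features)
    raw = np.zeros((len(names), len(names)))     # Figure 3a analogue
    delta = np.zeros_like(raw)                   # Figure 3b analogue
    for i, train in enumerate(names):
        clf = LogisticRegression(max_iter=1000).fit(features[train], labels[train])
        for j, test in enumerate(names):
            probe_auc = roc_auc_score(labels[test], clf.predict_proba(features[test])[:, 1])
            logit_auc = roc_auc_score(labels[test], logit_scores[test])
            raw[i, j] = probe_auc
            delta[i, j] = probe_auc - logit_auc
    return names, raw, delta
```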

Investigating Error Types

Overview

This section explores the different types of errors LLMs make, focusing on the TriviaQA dataset. It introduces a taxonomy of errors based on how consistently the LLM generates correct or incorrect answers when prompted repeatedly. The section then investigates whether the LLM's internal representations can predict these error types.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 4

Figure 4 illustrates three distinct error types observed in the free-form generation of a large language model (LLM) when the same question is sampled multiple times. Each panel (a, b, and c) depicts a different scenario. Panel (a) shows a case where the LLM mostly generates the correct answer but occasionally produces incorrect ones (hallucinations). Panel (b) shows a case where the LLM mostly generates the same incorrect answer, but occasionally produces the correct one, suggesting some underlying knowledge of the correct answer. Panel (c) shows a case where the LLM generates many different answers, with the correct answer appearing infrequently, indicating low confidence and high variability in its responses.

First Mention

Text: "Figure 4: Different error types in free-form generation, exposed when resampled many times."

Context: To analyze errors from the LLM's perspective, we sample K = 30 responses at a temperature setting of T = 1 for each example in the dataset and then analyze the resulting distribution of answers. Figure 4 illustrates three representative error types. In one (Figure 4a), the model usually gives the correct answer but occasionally makes an error, implying correct information is present but sampling may lead to mistakes. In another (Figure 4b), the model often responds incorrectly, though it is capable of providing the right answer, indicating some retained knowledge despite consistently making the same error. In a third type (Figure 4c), the model generates a wide array of mostly incorrect answers, reflecting low confidence in any generated answer.

Relevance: This figure is crucial for understanding the different ways LLMs can make errors. It moves beyond simply labeling an answer as correct or incorrect and delves into the patterns of errors, which is essential for developing targeted mitigation strategies.

Critique
Visual Aspects
  • Using more distinct colors for the correct (green) and incorrect (red) labels would improve visibility and accessibility.
  • Including the actual questions being asked in each panel would provide more context and make the examples more understandable.
  • Adding a brief explanation of the temperature setting (T=1) and its effect on sampling would be helpful for readers unfamiliar with this concept.
Analytical Aspects
  • While the figure shows representative examples, it doesn't provide information on the prevalence of each error type in the dataset. Adding this information would give a better understanding of the overall error distribution.
  • The figure focuses on a single dataset (TriviaQA). Showing examples from other datasets would demonstrate the generalizability of the error types.
  • The figure could be strengthened by connecting the observed error types to the specific limitations or biases of LLMs, providing a deeper explanation of why these errors occur.
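
A rough sketch of the resampling analysis illustrated in Figure 4: sample K answers per question at temperature 1 and bucket the question by how often, and how consistently, the correct answer appears. The category names follow the figure loosely and the thresholds are assumptions, not the paper's exact taxonomy definitions.

```python
# Illustrative bucketing of a question by its resampled answers (K samples at T=1).
# Thresholds and category names are assumptions, not the paper's exact definitions.
from collections import Counter

def classify_error_type(sampled_answers, is_correct):
    """sampled_answers: list of K strings; is_correct: callable answer -> bool."""
    k = len(sampled_answers)
    n_correct = sum(is_correct(a) for a in sampled_answers)
    n_distinct = len(Counter(a.strip().lower() for a in sampled_answers))
    if n_correct == k:
        return "consistently correct"
    if n_correct >= k // 2:
        return "mostly correct, occasionally wrong"   # cf. Figure 4a
    if n_correct > 0 and n_distinct <= 3:
        return "mostly wrong, sometimes correct"      # cf. Figure 4b
    return "many scattered answers"                   # cf. Figure 4c
```
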
table 2

Table 2 presents the AUC scores for classifying different error types using the internal representations of four LLMs: Mistral-7b, Mistral-Instr-7b, Llama3-8b, and Llama3-Instr-8b. The error types include (A) Refuses to answer, (B) Consistently correct, (C) Consistently incorrect, (D) Two competing, and (E) Many answers. The AUC scores, along with their standard deviations, indicate how well the models' internal representations can predict these error types. Higher AUC scores suggest better predictability.

First Mention

Text: "Table 2: AUC scores for error type classification. Error types are predictable from the inner model representations, indicating the encoding of fine-grained information on errors."

Context: Our taxonomy offers an external, behavioral analysis of LLMs, which we complement by an intrinsic evaluation. We explore whether LLMs encode information on potential error types within their intermediate activations, offering a deeper insight into the underlying mechanisms. To investigate this, we train a probe in a one-to-many setting, where a single probe identifies a specific error type from all others. We use representations extracted from the answers produced via greedy decoding. Table 2 presents the test set results for all models. Our findings demonstrate that the error type can be predicted from the intermediate representations of the greedy decoding generations, suggesting that they may encode not just output correctness but also features that are correlative with fine-grained information about potential errors. This predictability opens up possibilities for targeted interventions on specific error types.

Relevance: This table directly supports the claim that LLMs encode information about the types of errors they are likely to make. The AUC scores demonstrate that error types are predictable from internal representations, suggesting a link between internal states and external behavior.

Critique
Visual Aspects
  • The table could be more visually appealing by using color gradients or shading to represent the AUC values, making it easier to compare performance across models and error types.
  • Adding a clear visual separation between the different LLMs would improve readability.
  • The error type labels could be made more descriptive to provide a better understanding of each category.
Analytical Aspects
  • While the table includes standard deviations, adding p-values or confidence intervals would strengthen the statistical significance of the results and allow for more robust comparisons between models.
  • The table could benefit from a discussion of the limitations of using AUC as the sole evaluation metric and potential alternative measures.
  • A more detailed analysis of the features or representations that contribute to the prediction of each error type would provide deeper insights into the underlying mechanisms.
Numeric Data
  • Mistral-7b, Refuses to answer: 0.86 AUC
  • Mistral-Instr-7b, Refuses to answer: 0.85 AUC
  • Llama3-8b, Refuses to answer: 0.87 AUC
  • Llama3-Instr-8b, Refuses to answer: 0.88 AUC
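
The one-vs-rest probing described above can be sketched as fitting one linear classifier per error type on the greedy-decoding activations and reporting its AUC, mirroring the structure of Table 2. The input arrays below are assumed placeholders, not the authors' data.

```python
# Sketch of the one-vs-rest error-type probes summarized in Table 2: one linear
# classifier per error type, trained on greedy-decoding activations. Arrays are
# assumed placeholders; `error_types` holds a string label per example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def error_type_aucs(acts: np.ndarray, error_types: np.ndarray) -> dict:
    """acts: [n_examples, d_model]; returns AUC of detecting each type vs. the rest."""
    results = {}
    for etype in np.unique(error_types):
        y = (error_types == etype).astype(int)
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts, y, test_size=0.2, stratify=y, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results[str(etype)] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return results
```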

Detecting the Correct Answer

Overview

This section explores whether the internal signals of LLMs about correctness align with their actual generated answers. By using a "probe" trained to detect errors, the researchers select answers from a pool of generated responses and compare the accuracy of this probe-based selection to traditional methods like greedy decoding. The results show that the probe significantly improves accuracy, especially when the LLM doesn't consistently generate the correct answer, suggesting a disconnect between the LLM's internal knowledge and its external behavior.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 5

Figure 5 presents two bar charts comparing the accuracy of different answer selection strategies for the Mistral-7B-Instruct model on two datasets: (a) TriviaQA and (b) Math. The strategies compared are: greedy decoding (taking the first generated answer), random selection, selecting the most frequent answer (majority vote), and selecting the answer with the highest probability according to a probe trained to detect correct answers. The bars are grouped by error types, which categorize the model's behavior across multiple generations of the same question (e.g., consistently correct, consistently incorrect, two competing answers, many answers). The figure highlights that using the probe to select answers leads to significant accuracy improvements, especially for error types where the model doesn't show a clear preference for the correct answer across multiple generations.

First Mention

Text: "Figure 5: Different answer choice strategies, Mistral-7B-Instruct. A notable improvement in accuracy is observed for error types where the LLM shows no preference for the correct answer across repeated generations."

Context: Results. The results for Mistral-7b-instruct are summarized in Figure 5, with additional results for other LLMs and datasets as well as qualitative examples provided in Appendix E. We only present results on error types that appear 30 times or more in our test dataset. Overall, using the probe to select answers enhances the LLM's accuracy across all examined tasks. However, the extent of improvement varies by error type. For instance, in the TriviaQA dataset, there is minimal gain in the “mostly correct” category (B2). In contrast, substantial gains—ranging from 30 to 40 points in some cases—are observed in the “mostly incorrect” (C2), “two competing answers” (D), and “many answers” (E1) categories. Interestingly, and perhaps surprisingly, the probe is most effective in cases where the LLM lacks any (external) preference for the correct answer during generation. The fact that the probe can effectively identify the correct answer in these scenarios points to a significant disconnect between the LLM’s internal encoding and its external behavior. These results suggest that even when the model encodes information of which answer is correct, it can still generate an incorrect answer in practice.

Relevance: This figure is highly relevant as it directly addresses the research question of whether the LLM's internal knowledge of truthfulness aligns with its external behavior (answer generation). It shows that using a probe based on internal representations can significantly improve answer selection, especially when the model's generation behavior is inconsistent or incorrect, revealing a disconnect between internal knowledge and external behavior.

Critique
Visual Aspects
  • Adding a legend explaining the colors used for each answer selection strategy would improve clarity.
  • Labeling the y-axis with 'Accuracy' would make it more explicit what the bars represent.
  • The error type labels on the x-axis could be made more concise and easier to understand at a glance. Using abbreviations or shorter descriptions would help.
Analytical Aspects
  • The figure only shows results for Mistral-7B-Instruct. Including similar charts for other LLMs would strengthen the generalizability of the findings.
  • While the caption mentions 'notable improvement,' quantifying this improvement with specific percentage increases would make the results more impactful.
  • The figure could be enhanced by adding error bars to the bars, representing standard deviations or confidence intervals, to show the statistical significance of the observed differences.
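
The answer-selection strategies compared in Figure 5 reduce to a small amount of glue code once a correctness probe is trained: score the exact-answer activation of each of the K sampled answers and return the highest-scoring one, against greedy and majority-vote baselines. The function signatures below are assumptions for illustration.

```python
# Sketch of the probe-based answer selection in Figure 5: among K sampled answers
# to one question, return the answer whose exact-answer activation the trained
# correctness probe scores highest. `probe` is any fitted sklearn-style binary
# classifier; the other inputs are assumed placeholders.
from collections import Counter
import numpy as np

def select_by_probe(answers, answer_acts, probe):
    """answers: list of K strings; answer_acts: [K, d_model] activations."""
    scores = probe.predict_proba(np.asarray(answer_acts))[:, 1]
    return answers[int(np.argmax(scores))]

def select_by_majority(answers):
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]

# Accuracy of each strategy (greedy, random, majority, probe) is then compared per error type.
```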

Discussion and Conclusions

Overview

This section summarizes the paper's findings, highlighting the localized nature of truthfulness information within exact answer tokens. This improves error detection, particularly in open-source LLMs. The study also reveals that truthfulness features don't generalize well across different tasks, suggesting skill-specific mechanisms. Furthermore, LLMs can often predict their own error types, and there's a discrepancy between internal knowledge and generated answers, where LLMs might internally know the correct answer but still generate an incorrect one. The paper concludes by emphasizing the value of analyzing internal representations for understanding and mitigating LLM errors.

Key Aspects

Strengths

Suggestions for Improvement

Implementation Details

Overview

This appendix details the implementation of the error detection methods, including how errors were identified, how the probing classifiers were implemented, the datasets used, and the baseline methods. This information is crucial for reproducing the study's results and understanding the methodology.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 3

Table 3 shows the success rate of different large language models (LLMs) in extracting the exact answer from their own generated long-form answers. It lists four models (Mistral-7b, Mistral-Instruct-7b, Llama3-8b, and Llama3-Instruct-8b) and their respective success rates. This demonstrates that LLMs can, to a large extent, identify the key information within their own outputs, even if they don't always present it as the final, concise answer.

First Mention

Text: "Table 3: Success rate of extracting exact answer from a long model answer. Each model is used to extract answers from its own output."

Context: To avoid bias in our probing task, we only retain questions for which a valid exact answer was successfully extracted. This ensures there is no unfair correlation between invalid answers and incorrect answers in the experiments. We note the following: (a) While it is possible to use an instruct LLM to extract every answer regardless of its correctness, we chose the aforementioned strategy to improve the efficiency of our experiments; (b) This is just one possible implementation. For each LLM, one could use the same LLM to extract its own exact answer token, as demonstrated in a proof-of-concept over 1000 samples of TriviaQA in Table 3. Alternatively, it may be more efficient to train a smaller system specifically designed for detecting exact answer tokens, which would be more suitable for real-world scenarios. We choose to keep the extraction process as abstract as possible, as our primary focus is not on the specific implementation, but on analyzing the potential gains from probing these locations. Additionally, if the exact answer token is not among the first generated tokens, we examine the token immediately preceding it (“before exact answer token”). If the exact answer token is not the last one, we also examine the following token. When the exact answer spans multiple tokens, the first and last exact answer tokens are probed separately.

Relevance: Table 3 is relevant because it demonstrates the feasibility of automatically extracting the exact answer from an LLM's output. This extraction process is crucial for the proposed method of probing exact answer tokens for truthfulness signals. The high success rates shown in the table validate the use of this approach.

Critique
Visual Aspects
  • The table is very simple and could be visually enhanced with clearer headings and formatting. For example, bolding the model names would improve readability.
  • Adding a row indicating the dataset used for this evaluation (TriviaQA) would provide important context.
  • The success rates are presented as decimals. While clear, presenting them as percentages might be more intuitive for a broader audience.
Analytical Aspects
  • The table only presents a proof-of-concept with 1000 samples. While indicative, it would be stronger to show results on the full dataset or a larger sample size.
  • The table doesn't provide any measure of variability or uncertainty (e.g., standard deviation, confidence intervals). Including such measures would make the results more robust.
  • The table could benefit from a brief discussion of the extraction method used and its potential limitations. This would provide more context and transparency.
Numeric Data
  • Mistral-7b: 0.99
  • Mistral-Instruct-7b: 0.96
  • Llama3-8b: 0.99
  • Llama3-Instruct-8b: 0.95
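
The extraction step evaluated in Table 3 can be sketched as a few-shot prompt that asks the model to return only the exact answer span from its own long-form response. The wording and in-context example below are assumptions, not the authors' prompt.

```python
# Illustrative few-shot extraction prompt of the kind evaluated in Table 3, where
# a model pulls the exact answer out of its own long-form response. The wording
# and in-context example are assumptions, not the authors' prompt.
FEW_SHOT_EXTRACTION_PROMPT = """Extract the exact answer from the response. \
Reply with the answer span only.

Question: Who wrote 'Pride and Prejudice'?
Response: The novel 'Pride and Prejudice' was written by Jane Austen in 1813.
Exact answer: Jane Austen

Question: {question}
Response: {response}
Exact answer:"""

def build_extraction_prompt(question: str, response: str) -> str:
    return FEW_SHOT_EXTRACTION_PROMPT.format(question=question, response=response)
```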

Full Error Detection Results

Overview

This appendix provides the complete error detection results, complementing the findings presented in the main paper. It includes figures showing the performance of the probe across different layers and tokens for various datasets and models, as well as tables with detailed results for all error detection methods and datasets. These results support the main paper's conclusions.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 6

Figure 6 presents heatmaps visualizing the Area Under the Curve (AUC) values of a trained probe across different layers and tokens for the Mistral-7b-instruct model. Each heatmap corresponds to a different dataset (HotpotQA, HotpotQA with context, Movies, Winogrande, NLI, IMDB). The x-axis represents the tokens, with the exact answer tokens highlighted. The y-axis represents the model's layers. The color intensity indicates the AUC value, with darker blue representing higher AUC and thus better error detection performance. The figure demonstrates that the error detection performance peaks around the exact answer tokens, particularly in the middle to later layers of the model, similar to the pattern observed in Figure 2 for TriviaQA, Winobias, and Math.

First Mention

Text: "Figure 6 presents the AUC values of a traind probe across layers and token for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models. See our repo https://github.com/technion-cs-nlp/LLMsKnow for the figures."

Context: B FULL ERROR DETECTION RESULTS Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models. See our repo https://github.com/technion-cs-nlp/LLMsKnow for the figures. Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results.

Relevance: Figure 6 expands upon the findings presented in Figure 2 by showing that the pattern of peak error detection performance at the exact answer tokens holds across a wider range of datasets. This reinforces the paper's central argument about the localized nature of truthfulness information within LLMs.

Critique
Visual Aspects
  • The color scale could be improved for better contrast, making it easier to distinguish between different AUC values.
  • Directly labeling the exact answer tokens on the x-axis of each heatmap would improve readability.
  • The figure caption could be more descriptive, explaining the meaning of the axes and the color scale in more detail.
Analytical Aspects
  • While the caption mentions similar patterns across other models, it would be beneficial to include these heatmaps in the appendix or provide a reference to where they can be found.
  • The figure could be strengthened by including a baseline comparison, such as the performance at the last generated token, to visually demonstrate the improvement achieved by focusing on exact answer tokens.
  • A brief discussion of the statistical significance of the observed peak performance (e.g., p-values) would add rigor to the analysis.
table 4

Table 4 compares the performance (AUC) of various error detection methods on the Mistral-7B model across ten datasets. The methods include simple strategies like always predicting the majority class, using the mean or minimum of logits or probabilities (with and without considering the 'exact' answer tokens), prompting the model to assess its own truthfulness ('p(True)'), and using probing classifiers at different token locations. The table shows that probing the exact answer token generally yields the highest AUC scores, indicating better error detection performance.

First Mention

Text: "Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results."

Context: B FULL ERROR DETECTION RESULTS Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models. See our repo https://github.com/technion-cs-nlp/LLMsKnow for the figures. Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results.

Relevance: Table 4 provides comprehensive results supporting the paper's claim that focusing on exact answer tokens improves error detection. It compares various methods, including established baselines and the proposed probing technique, demonstrating the latter's superior performance across multiple datasets.

Critique
Visual Aspects
  • The table could benefit from visual grouping of related methods (e.g., logits-based, probabilities-based) to improve readability.
  • Highlighting the best-performing method for each dataset would make it easier to quickly identify the most effective strategies.
  • The table could be split into two separate tables, one for each model, to avoid excessive width and improve clarity.
Analytical Aspects
  • While the table includes standard deviations, adding p-values or confidence intervals would strengthen the statistical significance of the results and allow for more robust comparisons.
  • The table could include a discussion of the limitations of the AUC metric and potential alternative evaluation measures.
  • A more detailed analysis of the impact of using 'exact' answer tokens on the performance of different methods would be valuable.
table 5

Table 5, similar to Table 4, presents a comparison of error detection performance (AUC) but for the Mistral-7B-Instruct model. It evaluates various methods across the same ten datasets, including majority voting, logits and probabilities (with mean, min, max, and exact answer variations), p(True), and probing at different token positions. The results consistently show that probing, especially at the exact answer token, often outperforms other methods.

First Mention

Text: "Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results."

Context: Same as Table 4's first mention.

Relevance: Table 5 provides further evidence supporting the paper's main claim by demonstrating the effectiveness of probing exact answer tokens for error detection on an instructed version of the LLM. This shows that the method's benefits extend beyond the base model to a more refined, instruction-tuned version.

Critique
Visual Aspects
  • The table could be improved by visually separating different groups of methods (e.g., logits-based, probabilities-based) for better readability.
  • Highlighting the rows corresponding to the probing methods would emphasize the key findings.
  • The table could be split into two separate tables, one for each model, to avoid excessive width and improve clarity.
Analytical Aspects
  • While standard deviations are provided, adding p-values or confidence intervals would strengthen the statistical significance of the comparisons.
  • A discussion of the limitations of the AUC metric and potential alternative evaluation measures would be beneficial.
  • A more detailed analysis of the differences in performance between methods, particularly the impact of using exact answer tokens, would be valuable.
table 6

Table 6 presents the error detection performance (AUC) for the Llama-8b model, similar in structure to Tables 4 and 5. It compares various methods, including majority voting, logits and probabilities (with and without exact answer tokens), p(True), and probing at different token positions, across the same ten datasets. The results generally show that probing at the exact answer token yields the best performance.

First Mention

Text: "Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results."

Context: Same as Table 4's first mention.

Relevance: Table 6 extends the analysis to a different LLM architecture (Llama-8b), demonstrating that the benefits of probing exact answer tokens for error detection are not limited to a specific model (Mistral). This strengthens the generalizability of the findings.

Critique
Visual Aspects
  • The table could benefit from visual grouping of related methods (e.g., logits-based, probabilities-based) to improve readability.
  • Highlighting the best-performing method for each dataset would make it easier to identify the most effective strategies.
  • The table could be split into two separate tables, one for each model, to avoid excessive width and improve clarity.
Analytical Aspects
  • While standard deviations are provided, adding p-values or confidence intervals would strengthen the statistical significance of the results.
  • The table could include a discussion of the limitations of the AUC metric and potential alternative evaluation measures.
  • A more detailed analysis of the impact of using 'exact' answer tokens on the performance of different methods would be valuable.
table 7

Table 7 presents the error detection performance (AUC) for the Llama-8b-Instruct model, mirroring the structure of the previous tables. It compares various methods, including majority voting, logits and probabilities (with and without exact answer tokens), p(True), and probing at different token positions, across ten datasets. The results generally indicate that probing at the exact answer token leads to the best error detection performance.

First Mention

Text: "Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results."

Context: Same as Table 4's first mention.

Relevance: Table 7 completes the comprehensive analysis by presenting results for the instruction-tuned version of the Llama model. This demonstrates the consistency of the findings across both base and instructed versions of two different LLM architectures (Mistral and Llama), further strengthening the generalizability of the paper's claims.

Critique
Visual Aspects
  • The table could benefit from visual grouping of related methods (e.g., logits-based, probabilities-based) to improve readability.
  • Highlighting the best-performing method for each dataset would make it easier to identify the most effective strategies.
  • The table could be split into two separate tables, one for each model, to avoid excessive width and improve clarity.
Analytical Aspects
  • While standard deviations are provided, adding p-values or confidence intervals would strengthen the statistical significance of the results.
  • The table could include a discussion of the limitations of the AUC metric and potential alternative evaluation measures.
  • A more detailed analysis of the impact of using 'exact' answer tokens on the performance of different methods would be valuable.
figure 6

Figure 6 presents heatmaps visualizing the performance of a probe error detector across different layers and tokens for the Mistral-7b-instruct model. Each heatmap corresponds to a different dataset (HotpotQA, HotpotQA with context, Movies, Winogrande, NLI, IMDB). The x-axis represents the tokens, with the exact answer tokens highlighted. The y-axis represents the model's layers. The color intensity indicates the AUC value, with darker blue representing higher AUC and thus better error detection performance. The key observation is that the detection performance spikes at the exact answer tokens, supporting the idea that truthfulness information is concentrated there.

First Mention

Text: "Figure 6: AUC values of a probe error detector across layers and tokens, Mistral-7b-instruct. The detection performance spikes at the exact answer tokens."

Context: B FULL ERROR DETECTION RESULTS Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models. See our repo https://github.com/technion-cs-nlp/LLMsKnow for the figures. Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results.

Relevance: Figure 6 is highly relevant as it visually demonstrates the core finding of the paper: that truthfulness information is concentrated in the exact answer tokens. The heatmaps show a clear spike in error detection performance (AUC) at these tokens across various datasets, providing strong evidence for this claim.

Critique
Visual Aspects
  • The x-axis labels could be made more readable by increasing the font size or using abbreviations.
  • Adding a clear visual marker (e.g., a vertical line or a different color) to indicate the exact answer token position on each heatmap would improve clarity.
  • The color scale could be adjusted for better contrast, making it easier to distinguish between different AUC values.
Analytical Aspects
  • While the figure shows results for Mistral-7b-instruct, including similar heatmaps for other LLMs analyzed in the paper would strengthen the generalizability of the findings.
  • The figure could benefit from a baseline comparison, such as showing the AUC values for the last generated token or other token positions, to highlight the improvement achieved by focusing on exact answer tokens.
  • A brief discussion of the statistical significance of the observed spikes in AUC values (e.g., p-values or confidence intervals) would strengthen the analysis.
table 4

Table 4 compares the performance (AUC) of various error detection methods on the Mistral-7B model across ten different datasets. The methods include simple strategies like always predicting the majority class, using the mean, minimum, or maximum of logits or probabilities, prompting the model to assess its own truthfulness ('p(True)'), and using probing classifiers at various token positions. The table shows that probing the exact answer tokens generally yields the highest AUC scores, indicating superior error detection performance.

First Mention

Text: "Table 4: Comparison of error detection performance (AUC) on Mistral-7B."

Context: B FULL ERROR DETECTION RESULTS Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models. See our repo https://github.com/technion-cs-nlp/LLMsKnow for the figures. Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results.

Relevance: Table 4 provides a comprehensive comparison of different error detection methods, demonstrating the effectiveness of the proposed approach (probing exact answer tokens) compared to existing baselines. This table supports the central claim of improved error detection.

Critique
Visual Aspects
  • The table could be more readable by grouping the datasets based on task similarity (e.g., factual vs. common sense).
  • Highlighting the best-performing method for each dataset would make it easier to quickly identify the most effective strategies.
  • The table could benefit from a clearer visual separation between the different types of methods (e.g., logits-based, probabilities-based, probing).
Analytical Aspects
  • While the table includes standard deviations, adding p-values or confidence intervals would strengthen the statistical significance of the results and allow for more robust comparisons between methods.
  • The table could be enhanced by including a discussion of the limitations of the AUC metric and potential alternative evaluation measures.
  • A more detailed analysis of the differences in performance between methods, particularly the impact of using exact answer tokens, would be valuable.
Numeric Data
  • TriviaQA, Probe @ Exact answer last: 0.89 AUC
  • Winobias, Probe @ Exact answer last: 0.96 AUC
  • Math, Probe @ Exact answer last: 0.95 AUC
  • Movies, Probe @ Exact answer last: 0.92 AUC
  • IMDB, Probe @ Exact answer last: 0.88 AUC

Full Generalization Results

Overview

This appendix provides the full set of results for the generalization experiments, complementing the analysis in the main paper. It includes heatmaps showing the generalization performance of three LLMs (Mistral-7b, Llama-3-8b, and Llama-3-8b-instruct) across different datasets. The heatmaps show both raw AUC values and the performance difference between the probe method and a logit-based baseline. These results further support the paper's findings about the limited generalization of truthfulness features across tasks.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 7

Figure 7 illustrates the generalization capabilities of the Mistral-7b model across different datasets using two heatmaps. Heatmap (a) displays raw AUC values, where values above 0.5 suggest some level of generalization. Heatmap (b) shows the difference in AUC scores between the probe method and a logit-based method. Positive values in heatmap (b) indicate that the probe generalizes better than the logit-based method, learning additional information beyond what is captured by the output logits. The datasets used for both training and testing include TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA with context (WC), and Natural Questions with context (WC).
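A generalization heatmap of this kind can be produced by training the probe on one dataset and evaluating it on every other. The sketch below is a minimal illustration under that assumption; the `datasets` mapping and the logistic-regression probe are placeholders rather than the paper's implementation, and heatmap (b) would additionally subtract a per-test-dataset logit-baseline AUC from each column.

```python
# Minimal sketch (assumption): a train-on-A / test-on-B generalization matrix
# like Figure 7a. `datasets` maps a dataset name to (features, labels), where
# the features are probe inputs taken at the exact answer token.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def generalization_matrix(datasets):
    names = list(datasets)
    auc = np.zeros((len(names), len(names)))
    for i, train_name in enumerate(names):
        X_tr, y_tr = datasets[train_name]
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        for j, test_name in enumerate(names):
            X_te, y_te = datasets[test_name]
            auc[i, j] = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    return names, auc  # rows: training dataset, columns: test dataset
```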

First Mention

Text: "Figure 7: Generalization between datasets, Mistral-7b."

Context: Figures 7, 8 and 9 present the generalization results for the remaining models. While these results exhibit similar high-level patterns to those found in the main paper on Mistral-7b-instruct, notable differences suggest that these models may possess different mechanisms for encoding truthfulness.

Relevance: Figure 7 is relevant because it provides a visual representation of how well the error detection capabilities of Mistral-7b generalize across different tasks. It helps to understand whether the model has a universal truthfulness mechanism or if it's task-specific. This is important for determining the practical applicability of error detection methods.

Critique
Visual Aspects
  • The color scales could be improved for better contrast and to highlight areas of strong or weak generalization more effectively.
  • Labeling the axes directly with the dataset names instead of relying solely on the caption would improve readability.
  • Adding a visual cue, such as a diagonal line, to separate the training and testing datasets on the heatmaps would make it easier to interpret the results.
Analytical Aspects
  • While the caption mentions raw AUC values and differences, it would be beneficial to include the actual values in the figure or a supplementary table for more detailed analysis.
  • The figure could be strengthened by including a statistical significance test (e.g., p-values) to determine if the observed differences between the probe and logit-based methods are statistically significant.
  • The caption could provide more context by briefly explaining the logit-based method used for comparison.
figure 8

Figure 8, similar to Figure 7, presents the generalization performance but for the Llama-3-8b model. It uses two heatmaps: (a) shows the raw AUC values, where values above 0.5 suggest some generalization across datasets, and (b) shows the difference in AUC between the probe method and a logit-based method. Positive values in (b) indicate that the probe learns information beyond what's captured by the logits. The same datasets are used for training and testing as in Figure 7: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA WC, and NQ WC.

First Mention

Text: "Figure 8: Generalization between datasets, Llama-3-8b."

Context: Figures 7, 8 and 9 present the generalization results for the remaining models. While these results exhibit similar high-level patterns to those found in the main paper on Mistral-7b-instruct, notable differences suggest that these models may possess different mechanisms for encoding truthfulness.

Relevance: Figure 8 is relevant as it provides insights into the generalization capabilities of the Llama-3-8b model, complementing the analysis of Mistral-7b in Figure 7. Comparing the two figures allows for an assessment of whether the observed generalization patterns are model-specific or hold across different LLM architectures. This is crucial for understanding the broader applicability of the findings.

Critique
Visual Aspects
  • The color scales could be improved for better contrast, making it easier to distinguish between different levels of generalization.
  • Labeling the axes directly with dataset names, rather than just in the caption, would improve readability.
  • Adding a visual separation between the training and testing datasets on the heatmaps would enhance clarity.
Analytical Aspects
  • While the caption mentions raw AUC values and differences, it would be beneficial to include the actual values in the figure or a supplementary table for more detailed analysis.
  • The figure could be strengthened by including a statistical significance test (e.g., p-values) to determine if the observed differences between the probe and logit-based methods are statistically significant.
  • The caption could provide more context by briefly explaining the logit-based method used for comparison and its limitations.
figure 9

Figure 9 presents the generalization performance of Llama-3-8b-instruct across different datasets, using the same format as Figures 7 and 8. It includes two heatmaps: (a) raw AUC values, where values above 0.5 suggest some generalization, and (b) the difference in AUC between the probe method and a logit-based method. Positive values in (b) indicate better generalization by the probe compared to the logit-based method.

First Mention

Text: "Figure 9: Generalization between datasets, Llama-3-8b-instruct."

Context: C FULL GENERALIZATION RESULTS Figures 7, 8 and 9 present the generalization results for the remaining models. While these results exhibit similar high-level patterns to those found in the main paper on Mistral-7b-instruct, notable differences suggest that these models may possess different mechanisms for encoding truthfulness.

Relevance: Figure 9 completes the generalization analysis by showing the results for the instruction-tuned version of the Llama model. This allows for a comparison between the base Llama model (Figure 8) and its instructed counterpart, as well as with the Mistral models (Figures 3 and 7). This comparison helps to understand the impact of instruction tuning on generalization performance and whether the observed patterns are consistent across different model variants.

Critique
Visual Aspects
  • Maintaining a consistent color scale across all generalization figures (Figures 3, 7, 8, and 9) would facilitate easier comparison between models and variants.
  • Labeling the axes directly with dataset names would improve readability.
  • Adding a visual separation between training and testing datasets on the heatmaps could enhance clarity.
Analytical Aspects
  • Including the performance of the logit-based method itself would provide a more complete picture and allow for a direct comparison with the probe method.
  • Discussing the statistical significance of the observed differences between the probe and logit-based methods would strengthen the analysis.
  • The caption could be more explicit about the specific logit-based method used for the comparison.

Taxonomy of Errors

Overview

This appendix explains the way errors are categorized (taxonomy) in the research, providing more detail and justification for the chosen categories.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 10

Figure 10 is a line graph showing the percentage of answers for which at least one generated answer was correct when resampling the LLM's response multiple times. The x-axis represents the number of resamples (from 1, which is equivalent to greedy decoding, up to 31). The y-axis represents the percentage of correct answers. The graph shows an increasing trend, meaning that as the number of resamples increases, the chance of getting at least one correct answer also increases. However, the rate of increase diminishes as the number of resamples grows, suggesting a point of diminishing returns.
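The curve itself is straightforward to reconstruct from per-sample correctness judgments. The following sketch assumes a boolean matrix `sample_correct` whose first column corresponds to greedy decoding; this layout is an illustrative assumption, not the authors' data format.

```python
# Minimal sketch (assumption): the Figure 10 curve. `sample_correct` has shape
# (n_questions, K); entry (q, k) is True if the k-th sampled answer for
# question q was correct, and column 0 holds the greedy-decoded answer.
import numpy as np

def at_least_one_correct_curve(sample_correct: np.ndarray) -> np.ndarray:
    # Cumulative "any correct answer seen so far" along the resampling axis.
    any_correct = np.cumsum(sample_correct, axis=1) > 0
    # Percentage of questions with at least one correct answer, per resample count.
    return any_correct.mean(axis=0) * 100
```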

First Mention

Text: "Figure 10: The percentage of answers for which at least one generated answer was correct. The first step is greedy decoding."

Context: D TAXONOMY OF ERRORS Figure 10 presents, for each number of resamples, the percentage of answers for which at least one generated answer was correct. The experiment was done on Mistral-7b-instruct with the TriviaQA dataset. For many questions where greedy decoding fails to provide the correct answer, the LLM is still able to generate the correct answer in at least one resample. The plot plateaus around 30 resamples.

Relevance: This figure is relevant because it justifies the choice of K=30 resamples used in the error type taxonomy. It shows that increasing the number of resamples improves the chances of finding the correct answer, but the improvement plateaus around 30, suggesting that further resampling would yield diminishing returns.

Critique
Visual Aspects
  • The y-axis label could be more descriptive, such as 'Percentage of questions with at least one correct answer among resamples'.
  • Adding a horizontal line at the level achieved by greedy decoding (1 resample) would provide a clear visual comparison.
  • The figure could benefit from a title that clearly states the dataset and model used (TriviaQA, Mistral-7b-instruct).
Analytical Aspects
  • The figure could be more informative by showing the distribution of correct answers across resamples, not just the percentage of questions with at least one correct answer. For example, a box plot or violin plot could be used.
  • The figure only shows results for one dataset and model. Including similar graphs for other datasets and models would strengthen the generalizability of the findings.
  • The caption could include a brief discussion of the implications of the diminishing returns, such as the trade-off between computational cost and accuracy improvement.
Numeric Data
  • Correctness at 1 resample (greedy decoding): 63 %
  • Correctness at 30 resamples: 88 %

Detecting the Correct Answer Full Results

Overview

This appendix provides complete results for the experiments on detecting the correct answer within a pool of generated responses, expanding on the analysis in the main paper. It includes examples where the model internally encoded the correct answer but consistently generated an incorrect one. The appendix also presents tables comparing different answer selection strategies, including probe-based selection, for both instruct and non-instruct LLMs across various datasets. The results consistently show that probe-based selection improves accuracy, especially when the LLM doesn't have a strong preference for the correct answer during generation.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 8

Table 8 presents examples where Mistral-7b-Instruct consistently generated the wrong answer but occasionally produced the correct one. The probe successfully identified the correct answer in these instances. The table shows five questions from TriviaQA, the incorrect answer most frequently generated, how many times that incorrect answer was generated out of 30 samples, the correct answer, and how many times the correct answer was generated. This table highlights the disconnect between the model's internal knowledge and its external behavior.

First Mention

Text: "Table 8: Examples of questions where Mistral-7b-Instruct consistently provided incorrect answers but occasionally generated the correct one. In these instances, the probe successfully identified the right answer. For each question, the model was samples 30 times."

Context: In Table 8 we present some qualitative samples from Mistral-7b-instruct, for the phenomenon we observe at error type (C2) Consistently incorrect but generates the correct answer at least one time. The samples in the table represent cases where the probe chose the correct answer. Table 9 compares different decoding mechanisms, including the choice via probe, on non-instruct models, and Table 10 compares on the instruct models. For all datasets and models, we observe similar conclusions to those in the main paper: significant improvement is observed for error types where the LLM shows no preference to the correct answer.

Relevance: Table 8 is highly relevant because it provides specific examples demonstrating that the LLM sometimes knows the correct answer but fails to generate it consistently. This supports the paper's key finding of a discrepancy between the LLM's internal knowledge and its external behavior. The probe's ability to identify the correct answer from the sample pool further emphasizes its effectiveness.

Critique
Visual Aspects
  • The table could benefit from clearer headings. Instead of 'Wrong Answer' and 'Correct Answer', more descriptive headings like 'Most Frequent Incorrect Answer' and 'Correct Answer (identified by probe)' would be helpful.
  • Adding a column indicating the percentage of times each answer was generated could provide a quicker understanding of the answer distribution.
  • Visually separating the question from the answers (e.g., with a horizontal line) would improve readability.
Analytical Aspects
  • The table only shows five examples. While illustrative, including more examples or providing statistics on how often this phenomenon occurs in the dataset would strengthen the analysis.
  • The table focuses on Mistral-7b-Instruct. Showing similar examples for other LLMs would demonstrate the generalizability of the finding.
  • The table could be enhanced by briefly explaining how the probe identifies the correct answer among the samples, providing more insight into its mechanism.
Numeric Data
  • Question 1, Incorrect Answer Count: 29
  • Question 1, Correct Answer Count: 1
  • Question 2, Incorrect Answer Count: 27
  • Question 2, Correct Answer Count: 1
  • Question 3, Incorrect Answer Count: 18
  • Question 3, Correct Answer Count: 2
  • Question 4, Incorrect Answer Count: 17
  • Question 4, Correct Answer Count: 3
  • Question 5, Incorrect Answer Count: 21
  • Question 5, Correct Answer Count: 4
table 9

Table 9 compares different answer choice strategies for non-instruct LLMs (Mistral-7b and Llama-8b) across three datasets (TriviaQA, Math, Winobias). The strategies include greedy decoding, random sampling, majority vote (choosing the most frequent answer), and using the probe. The table shows the performance (accuracy with standard deviation) of each strategy for different error types, as defined in the paper's taxonomy. Hyphens indicate missing data or inapplicable strategies. The table highlights how different strategies perform for various error types and models, providing a comprehensive comparison.
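For reference, the four strategies compared here can be expressed compactly. The sketch below is an assumed illustration for a single question: `answers` holds the K sampled answer strings (index 0 being the greedy answer) and `probe_scores` holds the probe's probability that each sample is correct; both names are placeholders.

```python
# Minimal sketch (assumption): the answer-selection strategies compared in
# Tables 9 and 10, applied to one question's pool of K sampled answers.
import random
from collections import Counter

def select_answer(answers, probe_scores, strategy="probe"):
    if strategy == "greedy":
        return answers[0]                     # greedy-decoded answer
    if strategy == "random":
        return random.choice(answers)         # uniformly random sample
    if strategy == "majority":
        # most frequent answer string among the samples
        return Counter(answers).most_common(1)[0][0]
    if strategy == "probe":
        # sample the probe scores as most likely to be correct
        return max(zip(answers, probe_scores), key=lambda pair: pair[1])[0]
    raise ValueError(f"unknown strategy: {strategy}")
```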

First Mention

Text: "Table 9: Various answer choice strategies, non-instruct models."

Context: In Table 8 we present some qualitative samples from Mistral-7b-instruct, for the phenomenon we observe at error type (C2) Consistently incorrect but generates the correct answer at least one time. The samples in the table represent cases where the probe chose the correct answer. Table 9 compares different decoding mechanisms, including the choice via probe, on non-instruct models, and Table 10 compares on the instruct models. For all datasets and models, we observe similar conclusions to those in the main paper: significant improvement is observed for error types where the LLM shows no preference to the correct answer.

Relevance: Table 9 is relevant as it provides a detailed comparison of different answer selection strategies for non-instruct LLMs. It shows how these strategies perform across various error types and datasets, offering insights into their strengths and weaknesses. This comparison helps to understand the effectiveness of the probe-based selection method relative to other strategies.

Critique
Visual Aspects
  • The table is quite dense and could benefit from clearer visual separation between datasets and models. Using different background colors or borders could improve readability.
  • The error type labels could be made more concise or abbreviations could be used to reduce clutter.
  • Highlighting the best-performing strategy for each error type and model would make it easier to identify key trends.
Analytical Aspects
  • While the table shows accuracy with standard deviation, it would be beneficial to include statistical significance tests (e.g., p-values) to determine if the differences between strategies are statistically significant.
  • The table could be enhanced by including a discussion of the limitations of each strategy and the reasons behind their varying performance across error types.
  • The table focuses on non-instruct models. A direct comparison with the performance of instruct models (as shown in Table 10) would be valuable for understanding the impact of instruction tuning on answer selection strategies.
Numeric Data
table 10

Table 10 presents the accuracy of different answer selection strategies for instruction-tuned language models on the TriviaQA, Math, and Winobias datasets. The strategies include greedy decoding, random sampling, majority voting (choosing the most frequent answer), and probe-based selection. The results are broken down by error type, which categorizes the model's response patterns across multiple generations of the same question. The table shows how the accuracy of each strategy varies depending on the type of error the model makes.

First Mention

Text: "Table 10: Various answer choice strategies, instruct models."

Context: In Table 8 we present some qualitative samples from Mistral-7b-instruct, for the phenomenon we observe at error type (C2) Consistently incorrect but generates the correct answer at least one time. The samples in the table represent cases where the probe chose the correct answer. Table 9 compares different decoding mechanisms, including the choice via probe, on non-instruct models, and Table 10 compares on the instruct models. For all datasets and models, we observe similar conclusions to those in the main paper: significant improvement is observed for error types where the LLM shows no preference to the correct answer.

Relevance: Table 10 is relevant because it directly compares the performance of different answer selection strategies, including the proposed probe-based method, for instruction-tuned LLMs. It shows how the effectiveness of each strategy varies depending on the type of error the LLM makes, providing insights into the strengths and weaknesses of each approach.

Critique
Visual Aspects
  • The table could benefit from clearer visual separation between the different datasets and models. Using different background colors or borders could improve readability.
  • Highlighting the best-performing strategy for each error type and dataset would make it easier to identify the most effective approaches.
  • The error type labels could be made more concise or explained in a footnote to avoid cluttering the table.
Analytical Aspects
  • While the table shows accuracy values, it would be beneficial to include standard deviations or confidence intervals to provide a measure of uncertainty.
  • The table could be enhanced by adding a discussion of the statistical significance of the observed differences between the strategies. Are the improvements achieved by the probe statistically significant?
  • The table could include a more detailed analysis of why certain strategies perform better for specific error types. This would provide a deeper understanding of the relationship between LLM behavior and answer selection strategies.
Numeric Data
  • TriviaQA, Mistral-7b-Instruct, All Error Types, Probing: 0.89
  • Math, Mistral-7b-Instruct, All Error Types, Probing: 0.96
  • Winobias, Mistral-7b-Instruct, All Error Types, Probing: 0.91
  • TriviaQA, Llama-8b-Instruct, All Error Types, Probing: 0.93
  • Math, Llama-8b-Instruct, All Error Types, Probing: 0.92
  • Winobias, Llama-8b-Instruct, All Error Types, Probing: 1.0