Interpreting Sparse Autoencoder Features in Large Language Models Using Novel Scoring Techniques

Overall Summary

Overview

The study introduces an automated pipeline for interpreting the latent features of Sparse Autoencoders (SAEs) trained on large language models (LLMs). The pipeline uses LLMs to generate natural language explanations for SAE latents and evaluates those explanations with several scoring methods, including a new intervention score. The research demonstrates that SAE latents are more interpretable than individual neurons and explores the semantic similarity of independently trained SAEs. The work provides open-source resources, including code and explanations, contributing to the broader goal of improving model interpretability.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Figure

Description: Figure 1 shows LLM-generated explanations for SAE latents that activate on selected tokens in a sample sentence, along with each explanation's detection and fuzzing scores.

Relevance: This figure illustrates the automated interpretation process, providing a tangible example of the quality and specificity of generated explanations.

Table

Description: Table 1 presents Spearman correlation coefficients between different scoring methods, including Fuzzing, Detection, and Simulation, across 600 latent scores.

Relevance: The table quantifies the relationships between various scoring methods, offering insights into their agreement and the different aspects of explanation quality they capture.

Conclusion

The study significantly advances the field of model interpretability by introducing a novel automated pipeline for explaining SAE features in LLMs, supported by innovative scoring techniques. The findings suggest that SAE latents are more interpretable than neurons, with implications for improving model transparency and understanding. Future research should focus on quantifying interpretability improvements, exploring alternative sampling strategies, and applying these techniques to diverse model architectures. The open-source resources provided will enable further exploration and validation within the community, driving progress in the development of interpretability tools and techniques.

Section Analysis

Abstract

Overview

This abstract introduces an automated pipeline using Large Language Models (LLMs) to interpret the millions of latent features generated by Sparse Autoencoders (SAEs) applied to deep neural networks. The pipeline includes novel scoring techniques for evaluating explanation quality, including "intervention scoring." The authors apply this to analyze SAEs trained on different LLMs and find that SAE latents are more interpretable than individual neurons, even sparsified ones. The work also explores the semantic similarity between independently trained SAEs and provides open-source code and explanations.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

This introduction provides context on the growing importance of interpretability in large language models (LLMs). It highlights the shift from analyzing individual, often polysemantic, neurons to focusing on linear combinations of neurons as representations of human-interpretable concepts. Sparse autoencoders (SAEs) are introduced as a key method for extracting these more interpretable features, addressing the limitations of previous neuron-centric approaches.

Key Aspects

Strengths

Suggestions for Improvement

Related Work

Overview

This section discusses existing work on automated interpretability, focusing on methods for explaining neuron or latent activations in large language models. It covers approaches using LLMs like GPT-4 to generate and evaluate explanations, techniques involving activation collection and pattern analysis, and alternative methods like patching latent activations into the residual stream for explanation generation.

Key Aspects

Strengths

Suggestions for Improvement

Methods

Overview

This section details the methodology for explaining and scoring Sparse Autoencoder (SAE) latents. It describes the process of collecting activations over a 10M-token sample of RedPajama-v2, the prompting strategy used for the explainer model (Llama 70b), and the focus on maximally activating contexts. It then introduces four new context-based scoring methods (detection, fuzzing, surprisal, and embedding) plus intervention scoring as alternatives to traditional simulation scoring.
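As context for the collection step, a minimal sketch of how per-latent activations could be gathered and maximally activating contexts selected, using synthetic tensors in place of real SAE activations; the shapes, sparsity threshold, and helper names are illustrative assumptions, not the authors' implementation.

```python
import torch

# Hypothetical stand-in for SAE latent activations over a sample of contexts:
# shape (n_contexts, ctx_len, n_latents), mostly zero because SAE latents are sparse.
n_contexts, ctx_len, n_latents = 1_000, 32, 4_096
acts = torch.rand(n_contexts, ctx_len, n_latents)
acts = acts * (acts > 0.99)  # keep roughly 1% of entries non-zero

def max_activating_contexts(acts: torch.Tensor, latent: int, k: int = 20) -> torch.Tensor:
    """Indices of the k contexts whose peak activation of `latent` is largest,
    i.e. the maximally activating examples shown to the explainer model."""
    peak_per_context = acts[..., latent].amax(dim=1)  # (n_contexts,)
    return peak_per_context.topk(k).indices

top_ctx = max_activating_contexts(acts, latent=7)
print(top_ctx.shape)  # torch.Size([20])
```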

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1. SAE latents explanations in a random sentence. To visualize the latent...
Full Caption

Fig. 1. SAE latents explanations in a random sentence. To visualize the latent explanations produced, we select a sentence taken from the RPJv2 dataset. We selected 4 tokens in different positions in that sentence and filter for latents that are active in different layers. Then we randomly select active latents and their corresponding explanations to display. We display the detection and fuzzing scores of each explanation, which indicate how well it explains other examples in the dataset (see section 3 for details on these scores). The features selected had high activation, but were not cherry-picked based on explanations or scores.

Key Insights
  • The figure provides a qualitative glimpse into the types of explanations generated by the automated pipeline. It suggests that SAE latents can capture meaningful linguistic features, such as pronoun usage, quotation markers, and semantic categories (e.g., sports, online platforms).
  • The figure contributes to the broader goal of understanding the internal representations of LLMs. If SAE latents are indeed more interpretable than neurons, they could be valuable tools for analyzing and controlling LLM behavior.
  • The figure directly addresses the research objective of automatically interpreting SAE features. It provides a visual example of the output of the proposed pipeline.
  • The main limitation is the use of a single sentence for visualization. Future work could explore visualizing explanations across multiple sentences or using more robust evaluation metrics. Additionally, the paper should provide more details on the distribution of detection and fuzzing scores across all generated explanations, not just the few shown in the figure.
Key Values
  • Detection scores and fuzzing scores are presented for each explanation. These scores range from 0.56 to 0.95 for detection and 0.59 to 0.88 for fuzzing.
  • These scores provide a preliminary indication of the quality of the generated explanations. Higher scores suggest that the explanations are better at capturing the latent's activation patterns. However, the limitations of using a single sentence for evaluation should be acknowledged.
  • The relationship between detection and fuzzing scores is not explicitly discussed, but they generally seem to correlate positively. This suggests that explanations that are good at identifying activating sentences are also good at localizing the activating tokens within those sentences.
  • The range of scores observed suggests that some latents are more easily explained than others. Further analysis is needed to understand the factors that contribute to explanation quality.
First Reference in Text
We evaluate explanation quality using fuzzing, detection and embedding scores, as those were both quick to compute and easy to interpret.
Summary

Figure 1 showcases examples of explanations generated by the automated pipeline for Sparse Autoencoder (SAE) latents. It uses a single sentence from the RedPajama-v2 dataset and highlights four tokens. For each highlighted token, the figure displays explanations for several active latents across different layers of the model. Each explanation box includes the feature number, layer, detection score, fuzzing score, and the natural language explanation itself. The caption emphasizes that the features were selected based on high activation, not on the quality of their explanations, to avoid cherry-picking.

Methodological Critique
  • The methodology of selecting a single sentence and four tokens within it for visualization is a simplification. While it provides a concrete example, it doesn't demonstrate the robustness of the explanations across a wider range of contexts. The random selection of active latents is a good practice to avoid bias, but the caption doesn't specify how many latents were sampled per token.
  • The caption mentions detection and fuzzing scores but doesn't explain them in detail, directing the reader to section 3. This is acceptable but could be improved by briefly stating what these scores represent (e.g., accuracy of identifying activating contexts). The caption also states that features were chosen based on high activation, but doesn't define "high activation" quantitatively.
  • The figure aims to demonstrate the interpretability of SAE latents. The provided explanations seem qualitatively reasonable, but a more rigorous evaluation is needed (as mentioned in the reference text). The figure doesn't provide evidence for the claim that SAE latents are *more* interpretable than neurons, which is a central claim of the paper.
  • The methodology of visualizing explanations using a single sentence is not a standard practice in interpretability research. While understandable for illustrative purposes, the paper should rely on more robust quantitative evaluations for its main claims.
Presentation Critique
  • The figure is generally clear, but the explanation boxes could be visually improved. The feature number and layer information could be presented more concisely. The color coding of the tokens in the sentence is helpful.
  • The visual organization is straightforward, with clear separation between the sentence, highlighted tokens, and corresponding explanations. The inclusion of detection and fuzzing scores directly within the explanation boxes is effective.
  • The figure is likely understandable for an audience familiar with LLMs and interpretability research, but may require additional context for a broader audience. Terms like "SAE latent," "detection score," and "fuzzing score" should be clearly defined in the paper.
  • While the visualization itself is not standard, the use of captions, scores, and explanations aligns with general conventions for presenting qualitative results in interpretability research.
Fig. 2. The new proposed scoring methods. In detection scoring, the scorer...
Full Caption

Fig. 2. The new proposed scoring methods. In detection scoring, the scorer model is tasked with selecting the set of sentences that activate a given latent given an explanation. In this work, we show 5 examples at the same time, and each has an identical probability of being a sentence that activates the latent, independent of whether any other example also activates the latent. The activating tokens are colored in green for display, but that information is not shown to the scorer model. For fuzzing scoring, the scorer model is tasked with selecting the sentences where the highlighted tokens are the tokens that activate a target latent given an explanation of that latent. In surprisal scoring, activating and non-activating examples are run through the model and the loss over those sentences is computed. Correct explanations should decrease the loss in activating sentences compared to a generic explanation, but shouldn't significantly decrease the loss in non-activating sentences. For embedding scoring, activating and non-activating sentences are embedded as "documents" that should be retrieved using the explanation as a query.

Key Insights
  • The proposed scoring methods offer a potentially more efficient and nuanced way to evaluate explanations for SAE latents compared to simulation scoring.
  • By focusing on the ability of a scorer model to discriminate between activating and non-activating contexts, these methods aim to capture the core function of a good explanation.
  • The development of these methods contributes to the growing field of interpretability research for LLMs, specifically in the context of SAE-based analysis.
  • The practical utility and effectiveness of these methods need to be further validated through empirical evaluation and comparison with existing methods. More details on implementation are crucial for reproducibility and wider adoption.
Key Values
  • The figure presents example inputs and expected outputs for each scoring method. For instance, in Detection, the expected output is a binary vector [1, 0, 0]. In Fuzzing, the output is also a binary vector [1, 0, 1]. In Surprisal, the output is a comparison of probabilities (P(Sentence|Exp1) > P(Sentence|Exp2)). In Embedding, the output is a comparison of embedding similarities.
  • These example values illustrate how each scoring method quantifies the quality of an explanation. They show how the scorer model uses the explanation to classify sentences or tokens as activating or non-activating.
  • The key difference between the methods lies in their granularity (sentence-level vs. token-level) and their use of probabilities vs. embeddings. Detection and Fuzzing operate at the sentence and token level, respectively, while Surprisal and Embedding leverage probabilities and embeddings.
  • These methods address the challenge of evaluating explanation quality in the context of SAE latents, which is a relatively new area of research. They offer alternatives to simulation scoring, which can be computationally expensive.
First Reference in Text
As an alternative to simulation scoring, we introduce four new evaluation methods that focus on how well an explanation enables a scorer to discriminate between activating and non-activating contexts.
Summary

Figure 2 outlines four novel scoring methods for evaluating the quality of explanations for SAE latents: Detection, Fuzzing, Surprisal, and Embedding. Each method is illustrated with an example featuring an explanation, a set of sentences, and the expected output or scoring mechanism. The caption provides a concise description of each method's process. Importantly, it clarifies that while activating tokens are highlighted in green for the reader's benefit, the scorer model does *not* have access to this information.
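To make the detection and fuzzing setups concrete, here is a minimal sketch assuming the balanced batch construction described in the caption and a scorer callable that would normally wrap an LLM prompt; the keyword-overlap stand-in below is purely illustrative and not the paper's scorer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Example:
    text: str
    activates: bool  # ground truth: does this context activate the latent?

def build_detection_batch(activating: Sequence[Example],
                          non_activating: Sequence[Example],
                          n: int = 5, p_activating: float = 0.5,
                          seed: int = 0) -> List[Example]:
    """Draw n examples where each slot is independently activating with
    probability p_activating, matching the independence described in the caption."""
    rng = random.Random(seed)
    return [rng.choice(list(activating)) if rng.random() < p_activating
            else rng.choice(list(non_activating)) for _ in range(n)]

def detection_score(explanation: str, batch: Sequence[Example],
                    scorer: Callable[[str, str], bool]) -> float:
    """Accuracy of the scorer's activating / non-activating judgments, given only
    the explanation and the raw sentence text. Fuzzing reuses the same accuracy
    computation, but over sentences whose highlighted tokens may or may not be
    the latent's true activating tokens."""
    hits = sum(scorer(explanation, ex.text) == ex.activates for ex in batch)
    return hits / len(batch)

# The real scorer would wrap an LLM prompt; a keyword-overlap stand-in for the sketch:
toy_scorer = lambda expl, text: any(w in text.lower() for w in expl.lower().split())
```

Chance accuracy on such a balanced batch is 0.5, which is a useful reference point when reading the score distributions later in the paper.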

Methodological Critique
  • Introducing four new scoring methods simultaneously makes comparison and interpretation complex. The rationale for needing four distinct methods should be clearly articulated. Are they designed to capture different aspects of explanation quality? How do they relate to existing methods (like simulation scoring)?
  • While the caption describes the general procedure for each method, it lacks specific details about implementation. For example, what type of scorer model is used? How are the sentences selected? How is the loss computed in surprisal scoring? What embedding model is used for embedding scoring? These details are crucial for reproducibility.
  • The figure provides illustrative examples, but no empirical evidence to support the effectiveness of these scoring methods. The paper should present results comparing these methods against each other and against existing benchmarks.
  • The proposed methods seem generally aligned with the goal of evaluating explanation quality, but their novelty and practical utility need to be demonstrated through rigorous experimentation and comparison with existing methods.
Presentation Critique
  • The caption is dense and could be improved by breaking it down into smaller, more digestible chunks. Each scoring method could be described in a separate paragraph or bullet point. Using a table format to summarize the methods could also enhance clarity.
  • The visual organization is adequate but could be improved. Using consistent visual cues (e.g., color coding, arrows) to indicate the flow of information in each example would make it easier to follow. The figure could also benefit from clearer labels and headings.
  • The figure assumes familiarity with concepts like "scorer model," "latent," "activation," and "embedding." While appropriate for a specialized audience, the paper should provide clear definitions and explanations for a broader readership.
  • The presentation of the methods, while not entirely standard, generally adheres to conventions for illustrating methodological approaches. However, the lack of specific implementation details hinders reproducibility.
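For the surprisal scoring described in Fig. 2 above, a rough sketch of the loss comparison, using GPT-2 purely as a stand-in for whatever scorer model the authors actually use; the prompt strings and model choice are assumptions. Embedding scoring can be implemented analogously by embedding sentences as documents and ranking them by cosine similarity against the explanation used as a query.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is only a stand-in; the paper's scorer model and prompt format differ.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def sentence_loss(prefix: str, sentence: str) -> float:
    """Average cross-entropy of `sentence` tokens when conditioned on `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    sent_ids = tok(sentence, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, sent_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the sentence tokens
    return lm(input_ids, labels=labels).loss.item()

def surprisal_margin(explanation: str, generic: str, sentence: str) -> float:
    """Positive margin: the candidate explanation lowers the loss on the sentence
    more than a generic explanation does. A good explanation should yield a
    clearly positive margin on activating sentences and little or no margin on
    non-activating ones."""
    return sentence_loss(generic, sentence) - sentence_loss(explanation, sentence)

margin = surprisal_margin(
    explanation="This latent fires on sports-related vocabulary: ",
    generic="This latent fires on some tokens: ",
    sentence="The striker scored twice in the second half.",
)
```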
Fig. 3. Fuzzing and detection scores for different sampling techniques. Panels...
Full Caption

Fig. 3. Fuzzing and detection scores for different sampling techniques. Panels a) and b) show the distributions of fuzzing and detection scores, respectively, as a function of different example sampling methods for explanation generation. Sampling only from the top activation gets on average lower accuracy in fuzzing and on detection when compared with randomly sampling and sampling from quantiles. The distributions from random sampling and sampling from quantiles are very similar. Panels c) and d) measure how the explanations generalize across activation quantiles, showing that explanations generated from the top quantiles are better at distinguishing non-activating examples in detection, but have lower accuracy on other quantiles, especially on the lower activating deciles. This also happens with explanations generated from examples sampled randomly and from quantiles, but the accuracy does not drop as much in lower activating deciles.

Key Insights
  • The main finding is that sampling strategies significantly impact the quality and generalizability of generated explanations for SAE latents.
  • This insight has implications for interpretability research, suggesting that relying solely on top activating examples can lead to biased and less robust explanations. Using more diverse sampling strategies like 'random' or 'quantile' sampling can improve the generalizability of explanations.
  • The figure directly addresses the research objective of evaluating different methods for generating explanations. It provides empirical evidence to support the claim that broader sampling leads to better explanations.
  • Future work could explore more sophisticated sampling strategies that balance the need to capture the most activating examples with the need to generalize to a wider range of contexts. Additionally, incorporating statistical significance testing would strengthen the conclusions drawn from the figure.
Key Values
  • Panels (a) and (b) show that 'random' and 'quantile' sampling achieve higher median accuracy scores for both fuzzing and detection compared to 'top' sampling. The distributions for 'random' and 'quantile' are very similar.
  • These higher median scores indicate that explanations generated from a broader range of examples are more robust and generalize better across different contexts.
  • Panels (c) and (d) reveal that 'top' sampling performs well at distinguishing non-activating examples (decile 0) but poorly on lower activation deciles. 'Random' and 'quantile' sampling show more consistent performance across deciles, although they also exhibit a slight drop in accuracy for lower deciles.
  • The decile-level analysis highlights the trade-off between accurately capturing the most activating examples and generalizing to a wider range of activations. It suggests that over-reliance on top activating examples can lead to explanations that are overly specific and fail to capture the full range of a latent's meaning.
First Reference in Text
We find that randomly sampling from a broader set of examples leads to explanations that cover a larger set of activating examples, sometimes to the detriment of the top activating examples, see fig. 3.
Summary

Figure 3 explores the impact of different context sampling strategies on the performance of generated explanations, measured by fuzzing and detection accuracy. Panels (a) and (b) present the overall score distributions for 'top', 'random', and 'quantile' sampling, along with a 'random explanations' baseline. Panels (c) and (d) show how explanation accuracy varies across activation deciles for each sampling method, specifically for detection. The figure demonstrates that while 'top' sampling excels at distinguishing non-activating examples, it performs poorly on lower activation deciles. 'Random' and 'quantile' sampling provide more balanced performance.
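A minimal sketch of the three sampling strategies compared in Fig. 3, operating on a pool of activating contexts paired with their peak activations; the quantile boundaries and counts here are illustrative assumptions rather than the paper's exact settings.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, float]  # (context text, peak latent activation in that context)

def sample_top(pool: Sequence[Example], n: int) -> List[Example]:
    """'top' sampling: the n most strongly activating contexts."""
    return sorted(pool, key=lambda e: e[1], reverse=True)[:n]

def sample_random(pool: Sequence[Example], n: int, seed: int = 0) -> List[Example]:
    """'random' sampling: n activating contexts drawn uniformly."""
    return random.Random(seed).sample(list(pool), n)

def sample_quantiles(pool: Sequence[Example], n: int, n_quantiles: int = 10,
                     seed: int = 0) -> List[Example]:
    """'quantile' sampling: split the activating contexts into activation
    quantiles and draw roughly n / n_quantiles examples from each."""
    rng = random.Random(seed)
    ranked = sorted(pool, key=lambda e: e[1], reverse=True)
    step = max(1, len(ranked) // n_quantiles)
    buckets = [ranked[i:i + step] for i in range(0, len(ranked), step)][:n_quantiles]
    per_bucket = max(1, n // n_quantiles)
    picks: List[Example] = []
    for bucket in buckets:
        picks.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return picks[:n]
```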

Methodological Critique
  • The figure compares three sampling methods: 'top', 'random', and 'quantile'. While these are common strategies, the rationale for choosing these specific methods and the precise implementation of 'quantile' sampling (e.g., number of quantiles, how they are defined) are not explicitly stated.
  • The caption mentions 'fuzzing' and 'detection' scores but doesn't fully define them. It assumes the reader understands these metrics from prior context. Including a brief explanation or referencing the relevant section would improve clarity.
  • The figure visually supports the claim that 'random' and 'quantile' sampling yield better overall performance than 'top' sampling. However, it doesn't provide statistical significance testing to confirm these observations. Reporting p-values or confidence intervals would strengthen the analysis.
  • The methodology of evaluating explanations based on different sampling strategies is sound and relevant to the research question. However, the lack of precise details about the 'quantile' sampling implementation and the absence of statistical significance testing limit the reproducibility and robustness of the findings.
Presentation Critique
  • The caption is quite lengthy and dense. Breaking it down into shorter, more focused sentences would improve readability. Separately captioning each panel (a-d) would also be beneficial.
  • The visual organization is generally clear, with distinct panels for overall score distributions and decile-level analysis. However, the y-axis labels in panels (c) and (d) could be more descriptive (e.g., 'Detection Accuracy'). Adding clear titles to each panel would also enhance readability.
  • The figure assumes familiarity with terms like 'fuzzing', 'detection', 'activation quantiles', and 'deciles'. While appropriate for a specialized audience, the paper should provide clear definitions and explanations for broader accessibility.
  • The presentation of the results generally follows conventions for visualizing performance comparisons. However, the lack of statistical annotations (e.g., p-values, confidence intervals) and more descriptive axis labels could be improved.

Results

Overview

This section presents the results of comparing different scoring methods for explanations of Sparse Autoencoder (SAE) latents, finding that fuzzing and detection scores correlate most strongly with simulation scoring. It also highlights the value of intervention scoring for interpreting features missed by other methods. The impact of sampling techniques, explainer model size, and SAE architecture on explanation quality is explored, showing benefits from broader sampling and larger SAEs. Finally, the overlap between latents at adjacent layers is analyzed, suggesting that limited compute budgets are better spent on wider SAEs trained on fewer layers.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 1. Spearman correlation computed over 600 different latent scores
Key Insights
  • The main finding is that the proposed scoring methods exhibit varying degrees of correlation with each other and with the established Simulation method.
  • This finding has implications for the choice of scoring method in interpretability research. If different methods measure different aspects of explanation quality, researchers need to carefully consider which method is most appropriate for their specific research question.
  • The table contributes to the research objective of evaluating the proposed scoring methods. It provides a quantitative assessment of their agreement with each other and with the existing benchmark.
  • Future work could explore the reasons for the low correlations observed with Embedding. It's important to understand whether this method is capturing a valuable but distinct aspect of explanation quality, or whether it's simply less reliable than the other methods. Additionally, investigating the statistical significance of the observed correlations would strengthen the analysis.
Key Values
  • The highest correlation (0.73) is observed between Fuzzing and Detection. Fuzzing also shows a relatively high correlation (0.70) with Simulation. The lowest correlations are generally observed with Embedding, particularly with Surprisal (0.20).
  • The high correlation between Fuzzing and Detection suggests that these two methods capture similar aspects of explanation quality. The relatively high correlation between Fuzzing and Simulation provides some support for the validity of the Fuzzing method. The low correlations involving Embedding suggest that this method measures a different aspect of explanation quality compared to the other methods.
  • The correlations between the new methods and Simulation are of particular interest, as Simulation is the established benchmark. The relatively strong correlation between Fuzzing and Simulation (0.70) is a positive sign. However, the weaker correlations for Detection (0.44), Embedding (0.41), and Surprisal (0.30) raise questions about their agreement with Simulation.
  • These correlation values provide insights into the relationships between different scoring methods for SAE latent explanations. They suggest that some methods are more closely related than others, and that Embedding, in particular, may be capturing a distinct aspect of explanation quality.
First Reference in Text
Since simulation scoring is an established method, we measure how our other context-based scoring techniques correlate with simulation, as well as how they correlate between themselves (see tables 1 and A1).
Summary

Table 1 presents a correlation matrix of Spearman rank correlation coefficients calculated between different scoring methods for SAE latent explanations. The methods included are Fuzzing, Detection, Simulation, Embedding, and Surprisal. The correlations are computed over 600 different latent scores. The table is symmetric, with a diagonal of 1s (as each method perfectly correlates with itself).
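As a concrete illustration of how such a correlation matrix could be computed, a small sketch with synthetic scores standing in for the 600 real latent scores behind Table 1; with the real score vectors, the same call also yields the p-values the methodological critique below asks for.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in for the real score table: one row per latent,
# one column per scoring method (600 latents, as in Table 1).
rng = np.random.default_rng(0)
methods = ["fuzzing", "detection", "simulation", "embedding", "surprisal"]
scores = rng.random((600, len(methods)))

rho, pval = spearmanr(scores)  # pairwise Spearman matrix and matching p-values
for i, a in enumerate(methods):
    for j, b in enumerate(methods):
        if j > i:
            print(f"{a:>10} vs {b:<10} rho={rho[i, j]:+.2f}  p={pval[i, j]:.1e}")
```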

Methodological Critique
  • Using Spearman correlation is appropriate for comparing ranking-based scoring methods, as it doesn't assume linearity. However, the choice of 600 latent scores should be justified. Is this a representative sample? How were these latents selected?
  • The table caption mentions that the correlations are computed over "600 different latent scores." More detail is needed on how these latents were selected. Were they randomly sampled? Were they chosen to represent a diverse range of latent types or activation patterns? Additionally, specifying the layers from which these latents were taken would be helpful.
  • The table shows the correlations between the proposed scoring methods and the established simulation scoring method. This allows for a direct comparison and helps assess the validity of the new methods. However, the table doesn't provide any information about the statistical significance of these correlations. Including p-values would strengthen the analysis.
  • Calculating correlations between scoring methods is a standard approach for evaluating their agreement. However, relying solely on correlations can be misleading. It's important to also consider the magnitude of the differences between scores and the practical implications of these differences.
Presentation Critique
  • The table is clearly organized and easy to read. The row and column headers clearly identify the scoring methods being compared. The use of a symmetric matrix is appropriate for presenting correlations.
  • The visual presentation of the correlation matrix is standard and effective. However, using color coding to highlight the strength of correlations could improve readability and quickly convey the key findings.
  • The table is understandable for a technical audience familiar with statistical concepts like correlation. However, for a broader audience, a brief explanation of Spearman correlation and its interpretation would be beneficial.
  • The presentation of the correlation matrix adheres to standard conventions. However, adding a brief explanation of the scoring methods in the caption or a footnote would improve the table's self-containedness.
Fig. 4. SAE features are more interpretable than random features, especially when...
Full Caption

SAE features are more interpretable than random features, especially when intervening more strongly. Our explainer also produces explanations that are scored higher than random explanations. Right: Many features that would normally be uninterpreted when using context-based automatic interpretability are interpretable in terms of their effects on output.

Key Insights
  • The main finding is that intervention scores provide a valuable complementary perspective to context-based interpretability methods, revealing features that influence output but are not easily explained by input contexts.
  • This insight has implications for understanding and controlling LLM behavior. By identifying features that directly influence output, intervention scores can help researchers develop more targeted and effective methods for steering and debugging LLMs.
  • The figure directly addresses the research objective of developing new interpretability metrics. It provides empirical evidence to support the claim that intervention scores capture a distinct and valuable aspect of interpretability.
  • Future work could explore the relationship between intervention scores and other interpretability metrics, as well as their application to different models, layers, and tasks. Further investigation is also needed to understand the limitations of intervention scores and to develop methods for improving their robustness and reliability.
Key Values
  • The left panel shows that SAE features have higher intervention scores than random features, especially at higher target KL divergences (1.0 and 3.0). The explainer-generated explanations also score higher than random explanations.
  • These higher intervention scores for SAE features suggest that they have a more meaningful and interpretable impact on model output compared to random features. The increasing difference with higher KL divergence indicates that this effect is more pronounced for stronger interventions. The fact that generated explanations score higher than random ones suggests the explainer is producing meaningful explanations.
  • The right panel shows a negative correlation between intervention scores and fuzzing scores (y = -2.71x + 2.79, r = -0.37).
  • This negative correlation suggests that features with high intervention scores (i.e., those that strongly influence output) tend to have low fuzzing scores (i.e., they are difficult to interpret based on input contexts). This highlights the value of intervention scores in capturing features that are not easily interpretable using traditional context-based methods.
First Reference in Text
In figure 4, we see that the intervention score we propose is a valuable contribution to the set of automatic interpretability metrics because it (a) distinguishes between features from a trained SAE and random features and (b) recalls features that context-based scoring methods fail to interpret.
Summary

Figure 4 presents an analysis of intervention scores, a proposed metric for interpretability. The left panel shows distributions of intervention scores for SAE features, random features, and random explanations at different target KL divergences (intervention strengths). The right panel shows a scatter plot comparing intervention scores with fuzzing scores (a context-based interpretability metric). The figure focuses on Gemma 2 9B at layer 32.
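The intervention setup implies calibrating the steering strength to a target KL divergence between the base and intervened output distributions. A hedged sketch of that calibration step is below, where `run_with_latent_scaled` is an assumed callable (not from the paper or any specific library) that adds alpha times the latent's decoder direction to the residual stream and returns base and intervened next-token logits.

```python
import torch
import torch.nn.functional as F

def kl_from_baseline(base_logits: torch.Tensor, intervened_logits: torch.Tensor) -> float:
    """KL(P_base || P_intervened) over the next-token distribution."""
    log_p = F.log_softmax(base_logits, dim=-1)
    log_q = F.log_softmax(intervened_logits, dim=-1)
    return F.kl_div(log_q, log_p, reduction="sum", log_target=True).item()

def calibrate_strength(run_with_latent_scaled, target_kl: float,
                       lo: float = 0.0, hi: float = 100.0, steps: int = 20) -> float:
    """Binary-search the steering strength so the intervention hits a chosen
    target KL (e.g. the 1.0 and 3.0 shown in the figure). Assumes KL grows
    monotonically with strength. `run_with_latent_scaled(alpha)` is an assumed
    callable that adds alpha times the latent's decoder direction to the
    residual stream and returns (base_logits, intervened_logits) for a prompt."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        base, intervened = run_with_latent_scaled(mid)
        if kl_from_baseline(base, intervened) < target_kl:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```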

Methodological Critique
  • The use of intervention scores is a novel approach to interpretability, directly assessing the causal impact of features on output. Comparing these scores to random features and random explanations provides a good baseline.
  • While the caption mentions Section 3.5.5 for details on intervention scores, a brief explanation of how these scores are calculated would enhance the figure's self-containedness. Key details like the choice of scorer model, the types of interventions performed, and the method for calculating KL divergence are missing from the figure and caption.
  • The left panel visually demonstrates the difference between SAE features, random features, and random explanations. However, it would be stronger to include statistical significance tests (e.g., p-values) to quantify these differences. The right panel shows a negative correlation between intervention and fuzzing scores, suggesting that these methods capture different aspects of interpretability. Quantifying this correlation (e.g., with Spearman's rank correlation coefficient) and assessing its significance would be beneficial.
  • Comparing intervention scores with existing context-based methods is a valuable contribution. However, further validation is needed to establish the robustness and generalizability of intervention scores across different models, layers, and tasks.
Presentation Critique
  • The caption clearly describes the main findings, but could be improved by providing more context about intervention scores and their interpretation. The terms "target KL divergence" and "fuzzing score" are used without definition, assuming prior knowledge from the reader.
  • The left panel effectively uses histograms to visualize the distributions of intervention scores. The right panel uses a scatter plot to show the relationship between intervention and fuzzing scores. Adding axis labels and titles to the plots would improve clarity. Including the equation of the fitted line and the R-squared value in the right panel would provide more quantitative information about the correlation.
  • The figure is understandable for a technical audience familiar with interpretability research and statistical concepts. However, for a broader audience, more explanation of the technical terms and the methodology would be necessary.
  • The visual presentation of the results is generally clear, but could be enhanced with clearer labels, titles, and statistical annotations.

Overlap Between Latents at Adjacent Layers

Overview

This section explores the overlap between latent features learned by SAEs at adjacent layers in the residual stream of a transformer model. It highlights the motivation for studying this overlap, including potential resource optimization and sanity-checking the interpretability pipeline. The Hungarian algorithm is used to align latents due to permutation symmetry, and the analysis reveals greater semantic similarity between adjacent residual stream SAEs compared to MLP SAEs, suggesting potential compute savings by training wider SAEs on fewer layers.
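A minimal sketch of the alignment step, assuming latents are matched by cosine similarity between decoder directions (the section does not state the exact similarity measure, so treat that as an assumption) and using SciPy's Hungarian-algorithm solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_latents(decoder_a: np.ndarray, decoder_b: np.ndarray):
    """Match the latents of two SAEs (e.g. adjacent residual-stream layers)
    by maximising cosine similarity between decoder directions, resolving the
    permutation symmetry with the Hungarian algorithm.

    decoder_a, decoder_b: arrays of shape (n_latents, d_model)."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    sim = a @ b.T                               # (n_latents, n_latents) cosine similarities
    rows, cols = linear_sum_assignment(-sim)    # negate: the solver minimises cost
    return cols, sim[rows, cols]                # matched index and similarity per latent

# Toy usage with random decoders standing in for two trained SAEs:
rng = np.random.default_rng(0)
match_idx, match_sim = align_latents(rng.normal(size=(64, 32)), rng.normal(size=(64, 32)))
print(float(match_sim.mean()))
```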

Key Aspects

Strengths

Suggestions for Improvement

Conclusion

Overview

This conclusion summarizes the main contributions of the paper, which focuses on explaining the latents of Sparse Autoencoders (SAEs) trained on large language models (LLMs). It highlights the proposed scoring techniques for evaluating explanation quality, discusses the limitations of current methods regarding context length and non-activating examples, and suggests future directions, including incorporating explanation length into metrics and improving the selection of non-activating examples. The conclusion also emphasizes the potential of improved explanations for model steering, concept localization, and editing.

Key Aspects

Strengths

Suggestions for Improvement

Appendix

Overview

This appendix provides supplementary information on the explainer prompt, examples of activating contexts and explanations, the different scoring methods (detection, fuzzing, surprisal, embedding), adversarial examples, the correlation between scores, and the SAE models used. It also covers factors that influence explainability: dataset, chain of thought, activation information, context length, origin and number of examples, explainability across layers, the size of explainer and scorer models, and SAE size and location.

Key Aspects

Strengths

Suggestions for Improvement
