Overall Summary
Overview
The study introduces an automated pipeline for interpreting the latent features of Sparse Autoencoders (SAEs) in large language models (LLMs). The pipeline leverages LLMs to generate natural language explanations for SAE features and evaluates those explanations with several new scoring methods, including intervention scoring. The research demonstrates that SAE latents are more interpretable than individual neurons and explores the semantic similarity of independently trained SAEs. This work provides open-source resources, including code and explanations, contributing to the broader goal of improving model interpretability.
Key Findings
- Automated Interpretation of SAE Features: The study developed a pipeline that uses LLMs to create explanations for features extracted by SAEs, which are trained on deep neural network activations; the quality of the explanations is evaluated using novel metrics.
- Novel Scoring Techniques: Five new scoring methods, including intervention scoring, were introduced to better assess explanation quality. These methods provide more efficient alternatives to traditional evaluation techniques, such as simulation scoring.
- Improved Interpretability of SAE Latents: The research confirms that SAE latents offer superior interpretability compared to individual neurons, even when those neurons are sparsified with top-k postprocessing.
- Analysis of SAE Similarity: The semantic similarity of independently trained SAEs was measured, revealing substantial overlap between SAEs trained on adjacent layers, particularly in the residual stream, suggesting that adjacent layers learn largely redundant features.
- Open-Source Resources: The study provides open-source code and explanations, facilitating broader access and reproducibility of the research findings, thus supporting the community's efforts in model interpretability.
Strengths
- Clear Motivation and Problem Statement: The paper effectively identifies the challenge of interpreting millions of SAE features, justifying the development of automated methods to handle this complexity.
- Novel Methodology: The introduction of new scoring techniques, particularly intervention scoring, marks a valuable contribution to the field, offering fresh insights into feature interpretability.
- Significant Findings: The research provides strong evidence that SAE latents improve interpretability over individual neurons, offering a new perspective on analyzing model components.
- Comprehensive Evaluation: The study evaluates the impact of various factors on explanation quality, including sampling techniques and model architectures, giving a holistic view of the interpretability pipeline.
- Detailed Explanation of Methodology: The paper thoroughly describes the processes of data collection, prompting, and scoring, enhancing the reproducibility of the research.
Areas for Improvement
- Elaborate on Specific LLM and SAE Architectures: Providing detailed descriptions of the LLMs and SAE architectures used would improve clarity and reproducibility.
- Quantify the Improvement in Interpretability: Adding quantitative metrics to support claims of improved interpretability would provide more concrete evidence and allow for comparisons with other methods.
- Visualize Overlap: Including visualizations to represent the semantic overlap between different layers would enhance understanding and provide more intuitive insights into the findings.
Significant Elements
Figure
Description: Figure 1 visualizes latent feature explanations generated by LLMs for SAE features. It highlights tokens within a sentence and displays the detection and fuzzing scores for each explanation.
Relevance: This figure illustrates the automated interpretation process, providing a tangible example of the quality and specificity of generated explanations.
Table
Description: Table 1 presents Spearman correlation coefficients between different scoring methods, including Fuzzing, Detection, and Simulation, across 600 latent scores.
Relevance: The table quantifies the relationships between various scoring methods, offering insights into their agreement and the different aspects of explanation quality they capture.
Conclusion
The study significantly advances the field of model interpretability by introducing a novel automated pipeline for explaining SAE features in LLMs, supported by innovative scoring techniques. The findings suggest that SAE latents are more interpretable than neurons, with implications for improving model transparency and understanding. Future research should focus on quantifying interpretability improvements, exploring alternative sampling strategies, and applying these techniques to diverse model architectures. The open-source resources provided will enable further exploration and validation within the community, driving progress in the development of interpretability tools and techniques.
Section Analysis
Abstract
Overview
This abstract introduces an automated pipeline using Large Language Models (LLMs) to interpret the millions of latent features generated by Sparse Autoencoders (SAEs) applied to deep neural networks. The pipeline includes novel scoring techniques for evaluating explanation quality, including "intervention scoring." The authors apply this to analyze SAEs trained on different LLMs and find that SAE latents are more interpretable than individual neurons, even sparsified ones. The work also explores the semantic similarity between independently trained SAEs and provides open-source code and explanations.
Key Aspects
- Automated Interpretation of SAE Features: An automated pipeline leverages LLMs to generate and evaluate natural language explanations for the numerous features extracted by SAEs from deep neural network activations.
- Novel Scoring Techniques: Five new techniques are introduced to assess the quality of generated explanations, offering more efficient alternatives to previous methods. One highlighted technique, "intervention scoring," evaluates the interpretability of feature interventions.
- Improved Interpretability of SAE Latents: The research confirms that SAE latents offer significantly better interpretability compared to individual neurons, even when sparsified using top-k postprocessing.
- Analysis of SAE Similarity: The semantic similarity of independently trained SAEs is measured, revealing high similarity between SAEs trained on nearby layers of the residual stream.
- Open-Source Resources: The code and generated explanations are made available as open-source resources.
Strengths
- Clear Motivation and Problem Statement: The abstract effectively establishes the challenge of interpreting millions of SAE features and the need for automated methods.
  "However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one." (Page 1)
- Novel Methodology: The introduction of new scoring techniques, particularly intervention scoring, demonstrates a valuable contribution to the field.
  "One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods." (Page 1)
- Significant Findings: The confirmation of improved interpretability of SAE latents over neurons and the analysis of SAE similarity provide valuable insights.
  "Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-k postprocessing." (Page 1)
Suggestions for Improvement
- Elaborate on Specific LLM and SAE Architectures: While the abstract mentions using "two different open-weight LLMs," specifying the models would enhance clarity and reproducibility.
  "We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs." (Page 1)
  Rationale: Providing specific model details allows readers to better understand the context of the experiments and potentially reproduce the results.
  Implementation: Include the names and versions of the LLMs used, as well as details about the SAE architectures (e.g., number of layers, latent dimensions).
- Quantify the Improvement in Interpretability: While the abstract states that SAE latents are "much more interpretable," quantifying this improvement with metrics would strengthen the claim.
  "Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons..." (Page 1)
  Rationale: Quantitative measures provide more concrete evidence of the improvement and allow for comparisons with other methods.
  Implementation: Include specific metrics used to evaluate interpretability, such as precision, recall, or F1-score, and report the values obtained for both SAE latents and neurons.
Introduction
Overview
This introduction provides context on the growing importance of interpretability in large language models (LLMs). It highlights the shift from analyzing individual, often polysemantic, neurons to focusing on linear combinations of neurons as representations of human-interpretable concepts. Sparse autoencoders (SAEs) are introduced as a key method for extracting these more interpretable features, addressing the limitations of previous neuron-centric approaches.
Key Aspects
- Need for Interpretability: The introduction emphasizes the importance of understanding the internal representations of LLMs, despite their impressive performance.
- Limitations of Neuron-Level Analysis: The text points out the challenges of interpreting individual neurons due to their polysemantic nature, activating across diverse contexts.
- Linear Representation Hypothesis: The introduction highlights the idea that human-interpretable concepts are encoded in linear combinations of neurons, motivating the use of feature extraction methods.
- SAEs for Feature Extraction: Sparse autoencoders are presented as a solution to the polysemanticity problem, offering a way to extract more interpretable and potentially monosemantic latent features.
- Growing Use of SAEs: The text notes the increasing adoption of SAEs as an important interpretability tool for large language models, including their application to models like GPT-4 and Claude.
Strengths
- Clear Context and Motivation: The introduction effectively establishes the context of LLM interpretability research and motivates the shift towards feature-based analysis.
  "At the same time, we understand little about the internal representations driving their behavior." (Page 1)
- Concise Explanation of SAEs: The introduction gives a compact, accurate description of the SAE architecture (a minimal sketch of this structure follows below).
  "SAEs consist of two parts: an encoder that transforms activation vectors into a sparse, higher-dimensional latent space, and a decoder that projects the latents back into the original space." (Page 1)
Suggestions for Improvement
- Connect to Abstract's Contributions: While the introduction provides good background, explicitly linking it back to the specific contributions mentioned in the abstract would strengthen the narrative.
  "Sparse autoencoders (SAEs) were proposed as a way to address polysemanticity (Cunningham et al., 2023)." (Page 1)
  Rationale: This would create a stronger flow from the general problem to the specific solutions proposed in the paper.
  Implementation: Briefly mention the automated pipeline and scoring techniques introduced in the abstract, highlighting how they address the challenges outlined in the introduction.
- Expand on the Significance of Scaling SAEs: The introduction mentions scaling SAEs to larger models, but elaborating on the significance of this would enhance the impact.
  "Recently, a significant effort was made to scale SAE training to larger models, like GPT-4 (Gao et al., 2024) and Claude (Templeton et al., 2024), and they have become an important interpretability tool for LLMs." (Page 1)
  Rationale: This would underscore the relevance of the work to current trends in LLM interpretability.
  Implementation: Briefly discuss the challenges and benefits of applying SAEs to larger models, potentially mentioning computational costs or insights gained from analyzing these larger models.
Related Work
Overview
This section discusses existing work on automated interpretability, focusing on methods for explaining neuron or latent activations in large language models. It covers approaches using LLMs like GPT-4 to generate and evaluate explanations, techniques involving activation collection and pattern analysis, and alternative methods like patching latent activations into the residual stream for explanation generation.
Key Aspects
- Early LLM-based Interpretability: Early work used LLMs like GPT-4 to explain neuron activations by providing examples of activating contexts and tasking the LLM to generate explanations. These explanations were then evaluated based on their ability to predict neuron activations in new contexts.
- Context and Activation Pattern Analysis: Several approaches focus on collecting contexts and corresponding latent activations, then using larger models to identify patterns and generate explanations based on these patterns.
- Evaluation Methods: Various evaluation methods have been proposed, including having the model generate activating examples and measuring neuron activation, and using "interpretability agents" for iterative experimentation.
- Cheaper Automated Interpretability: A more resource-efficient method involves using the model being explained as the explanation generation model. This involves patching latent activations into the residual stream and generating continuations related to the latent.
Strengths
- Comprehensive Overview: The section provides a good overview of different approaches to automated interpretability, covering both LLM-based methods and more resource-efficient techniques.
  "In general, current approaches focus on collecting contexts together with latent activations from the model to be explained, and use a larger model to find patterns in activating contexts." (Page 2)
Suggestions for Improvement
- Deeper Discussion of Evaluation Metrics: While the section mentions several evaluation methods, a more in-depth discussion of their strengths and weaknesses would be beneficial.
  "Following those works, other methods of evaluating explanations have been proposed, including asking a model to generate activating examples and measuring how much the neuron activates (Kopf et al., 2024; Hernandez et al., 2022)." (Page 2)
  Rationale: A deeper discussion of evaluation metrics would help readers understand the trade-offs between different approaches and choose the most appropriate method for their specific needs.
  Implementation: Expand on the advantages and disadvantages of each evaluation method, potentially providing examples of how they have been used in practice and discussing their limitations.
- Connect to Current Work's Approach: Explicitly stating how the current work builds upon or differs from the discussed related work would strengthen the context.
  "One of the first approaches to automated interpretability focused on explaining neurons of GPT-2 using GPT-4 (Bills et al., 2023)." (Page 2)
  Rationale: This would clarify the novelty and contribution of the current work in relation to existing methods.
  Implementation: Add a paragraph or subsection explicitly comparing and contrasting the current approach with the related work discussed, highlighting the key differences and innovations.
Methods
Overview
This section details the methodology for explaining and scoring Sparse Autoencoder (SAE) latents. It describes the process of collecting activations from a 10M token sample of RedPajama-v2, the prompting strategy used for the explainer model (Llama 70b), and the focus on maximally activating contexts. It also introduces four context-based scoring methods (detection, fuzzing, surprisal, and embedding) as alternatives to traditional simulation scoring, along with intervention scoring, which evaluates a latent's effect on model output.
Key Aspects
- Activation Collection: Latent activations were collected from SAEs trained on Llama 3.1 8b using a 10M token sample of RedPajama-v2. The analysis revealed that a significant portion of latents don't activate frequently, especially with smaller context windows, and this sparsity varies with SAE size and training data.
- Prompting Strategy: The explainer model, Llama 70b, was prompted with 32-token activating examples with highlighted activating tokens and their activation strength. Chain-of-thought prompting was found to not significantly improve explanation quality.
- Maximally Activating Contexts: The method focuses on maximally activating contexts for generating and scoring explanations, acknowledging the potential bias but prioritizing conciseness and specificity over broader, potentially less meaningful explanations.
- Novel Scoring Methods: Four new scoring methods are introduced: detection, fuzzing, surprisal, and embedding. These methods focus on the ability of an explanation to discriminate between activating and non-activating contexts, offering more compute-efficient alternatives to simulation scoring. Intervention scoring assesses the impact of a feature on model output.
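The detection and fuzzing scores described above both reduce to a binary-classification accuracy over examples. The sketch below shows one way such a score can be computed; `scorer_says_activating` is a hypothetical wrapper around the scorer LLM, and the paper's exact prompts, example counts, and aggregation may differ.

```python
from typing import Callable, List, Tuple

def detection_score(
    explanation: str,
    examples: List[Tuple[str, bool]],                     # (sentence, truly_activating)
    scorer_says_activating: Callable[[str, str], bool],   # hypothetical scorer-LLM wrapper
) -> float:
    """Balanced accuracy of the scorer at separating activating from non-activating sentences."""
    tp = fp = tn = fn = 0
    for sentence, is_activating in examples:
        pred = scorer_says_activating(explanation, sentence)
        if pred and is_activating:
            tp += 1
        elif pred and not is_activating:
            fp += 1
        elif not pred and is_activating:
            fn += 1
        else:
            tn += 1
    tpr = tp / max(tp + fn, 1)   # recall on activating sentences
    tnr = tn / max(tn + fp, 1)   # recall on non-activating sentences
    return 0.5 * (tpr + tnr)
```

Fuzzing scoring uses the same accounting, except each example is a sentence with highlighted tokens and the scorer judges whether the highlighted tokens are the ones that activate the latent.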
Strengths
- Detailed Explanation of Methodology: The section provides a thorough explanation of the data collection, prompting, and scoring processes, making the research reproducible.
  "We collected latent activations from the SAEs over a 10M token sample of RedPajama-v2 (RPJv2; Computer 2023)." (Page 2)
- Introduction of Novel Scoring Methods: The four new scoring methods offer a valuable contribution to the field, addressing limitations of existing methods like simulation scoring.
  "As an alternative to simulation scoring, we introduce four new evaluation methods that focus on how well an explanation enables a scorer to discriminate between activating and non-activating contexts." (Page 5)
Suggestions for Improvement
- Clarify Rationale for 32-Token Context: While the section mentions using 32-token contexts, the rationale behind this choice isn't fully explained.
  "The activating example is selected to only have 32 tokens, irrespective of the position of the activating tokens in the prompt." (Page 3)
  Rationale: Understanding the reason for this specific context length would provide further insight into the methodology and its potential limitations.
  Implementation: Add a sentence or two explaining the choice of 32 tokens, perhaps relating it to computational constraints or the typical length of relevant contexts for the target features.
- Provide More Detail on Failure Modes: While the section briefly mentions failure modes of sampling strategies, providing more concrete examples would be beneficial.
  "For examples of these types of failure modes, see fig. A1 and the discussion in the appendix A.2." (Page 4)
  Rationale: More detailed examples of failure modes would help readers understand the challenges and trade-offs involved in different sampling strategies.
  Implementation: Include a brief description of the failure modes illustrated in Figure A1 and Appendix A.2, highlighting the specific issues encountered with different sampling approaches.
Non-Text Elements
Fig. 1. SAE latents explanations in a random sentence. To visualize the latent...
Full Caption
Fig. 1. SAE latents explanations in a random sentence. To visualize the latent explanations produced, we select a sentence taken from the RPJv2 dataset. We selected 4 tokens in different positions in that sentence and filter for latents that are active in different layers. Then we randomly select active latents and their corresponding explanations to display. We display the detection and fuzzing scores of each explanation, which indicate how well it explains other examples in the dataset (see section 3 for details on these scores). The features selected had high activation, but were not cherry-picked based on explanations or scores.
Key Insights
- The figure provides a qualitative glimpse into the types of explanations generated by the automated pipeline. It suggests that SAE latents can capture meaningful linguistic features, such as pronoun usage, quotation markers, and semantic categories (e.g., sports, online platforms).
- The figure contributes to the broader goal of understanding the internal representations of LLMs. If SAE latents are indeed more interpretable than neurons, they could be valuable tools for analyzing and controlling LLM behavior.
- The figure directly addresses the research objective of automatically interpreting SAE features. It provides a visual example of the output of the proposed pipeline.
- The main limitation is the use of a single sentence for visualization. Future work could explore visualizing explanations across multiple sentences or using more robust evaluation metrics. Additionally, the paper should provide more details on the distribution of detection and fuzzing scores across all generated explanations, not just the few shown in the figure.
Key Values
- Detection scores and fuzzing scores are presented for each explanation. These scores range from 0.56 to 0.95 for detection and 0.59 to 0.88 for fuzzing.
- These scores provide a preliminary indication of the quality of the generated explanations. Higher scores suggest that the explanations are better at capturing the latent's activation patterns. However, the limitations of using a single sentence for evaluation should be acknowledged.
- The relationship between detection and fuzzing scores is not explicitly discussed, but they generally seem to correlate positively. This suggests that explanations that are good at identifying activating sentences are also good at localizing the activating tokens within those sentences.
- The range of scores observed suggests that some latents are more easily explained than others. Further analysis is needed to understand the factors that contribute to explanation quality.
First Reference in Text
We evaluate explanation quality using fuzzing, detection and embedding scores, as those were both quick to compute and easy to interpret.
Summary
Figure 1 showcases examples of explanations generated by the automated pipeline for Sparse Autoencoder (SAE) latents. It uses a single sentence from the RedPajama-v2 dataset and highlights four tokens. For each highlighted token, the figure displays explanations for several active latents across different layers of the model. Each explanation box includes the feature number, layer, detection score, fuzzing score, and the natural language explanation itself. The caption emphasizes that the features were selected based on high activation, not on the quality of their explanations, to avoid cherry-picking.
Methodological Critique
- The methodology of selecting a single sentence and four tokens within it for visualization is a simplification. While it provides a concrete example, it doesn't demonstrate the robustness of the explanations across a wider range of contexts. The random selection of active latents is a good practice to avoid bias, but the caption doesn't specify how many latents were sampled per token.
- The caption mentions detection and fuzzing scores but doesn't explain them in detail, directing the reader to section 3. This is acceptable but could be improved by briefly stating what these scores represent (e.g., accuracy of identifying activating contexts). The caption also states that features were chosen based on high activation, but doesn't define "high activation" quantitatively.
- The figure aims to demonstrate the interpretability of SAE latents. The provided explanations seem qualitatively reasonable, but a more rigorous evaluation is needed (as mentioned in the reference text). The figure doesn't provide evidence for the claim that SAE latents are *more* interpretable than neurons, which is a central claim of the paper.
- The methodology of visualizing explanations using a single sentence is not a standard practice in interpretability research. While understandable for illustrative purposes, the paper should rely on more robust quantitative evaluations for its main claims.
Presentation Critique
- The figure is generally clear, but the explanation boxes could be visually improved. The feature number and layer information could be presented more concisely. The color coding of the tokens in the sentence is helpful.
- The visual organization is straightforward, with clear separation between the sentence, highlighted tokens, and corresponding explanations. The inclusion of detection and fuzzing scores directly within the explanation boxes is effective.
- The figure is likely understandable for an audience familiar with LLMs and interpretability research, but may require additional context for a broader audience. Terms like "SAE latent," "detection score," and "fuzzing score" should be clearly defined in the paper.
- While the visualization itself is not standard, the use of captions, scores, and explanations aligns with general conventions for presenting qualitative results in interpretability research.
Fig. 2. The new proposed scoring methods. In detection scoring, the scorer...
Full Caption
Fig. 2. The new proposed scoring methods. In detection scoring, the scorer model is tasked with selecting the set of sentences that activate a given latent given an explanation. In this work, we show 5 examples at the same time, and each has an identical probability of being a sentence that activates the latent, independent of whether any other example also activates the latent. The activating tokens are colored in green for display, but that information is not shown to the scorer model. For fuzzing scoring, the scorer model is tasked with selecting the sentences where the highlighted tokens are the tokens that activate a target latent given an explanation of that latent. In surprisal scoring, activating and non-activating examples are run through the model and the loss over those sentences is computed. Correct explanations should decrease the loss in activating sentences compared to a generic explanation, but shouldn't significantly decrease the loss in non-activating sentences. For embedding scoring, activating and non-activating sentences are embedded as "documents" that should be retrieved using the explanation as a query.
Key Insights
- The proposed scoring methods offer a potentially more efficient and nuanced way to evaluate explanations for SAE latents compared to simulation scoring.
- By focusing on the ability of a scorer model to discriminate between activating and non-activating contexts, these methods aim to capture the core function of a good explanation.
- The development of these methods contributes to the growing field of interpretability research for LLMs, specifically in the context of SAE-based analysis.
- The practical utility and effectiveness of these methods need to be further validated through empirical evaluation and comparison with existing methods. More details on implementation are crucial for reproducibility and wider adoption.
Key Values
- The figure presents example inputs and expected outputs for each scoring method. For instance, in Detection, the expected output is a binary vector [1, 0, 0]. In Fuzzing, the output is also a binary vector [1, 0, 1]. In Surprisal, the output is a comparison of probabilities (P(Sentence|Exp1) > P(Sentence|Exp2)). In Embedding, the output is a comparison of embedding similarities.
- These example values illustrate how each scoring method quantifies the quality of an explanation. They show how the scorer model uses the explanation to classify sentences or tokens as activating or non-activating.
- The key difference between the methods lies in their granularity (sentence-level vs. token-level) and their use of probabilities vs. embeddings. Detection and Fuzzing operate at the sentence and token level, respectively, while Surprisal and Embedding leverage probabilities and embeddings.
- These methods address the challenge of evaluating explanation quality in the context of SAE latents, which is a relatively new area of research. They offer alternatives to simulation scoring, which can be computationally expensive.
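To complement the Surprisal example above, here is a minimal sketch of the loss comparison it describes. `sentence_loss(prefix, sentence)` is a hypothetical helper returning the scorer model's average cross-entropy on `sentence` given `prefix`; the actual prompt format and aggregation used in the paper may differ.

```python
from statistics import mean

GENERIC_EXPLANATION = "This latent activates on a broad variety of text."

def surprisal_score(explanation, activating, non_activating, sentence_loss):
    """A good explanation lowers the loss on activating sentences (relative to a generic
    explanation) more than it lowers the loss on non-activating sentences."""
    gain_on_activating = mean(
        sentence_loss(GENERIC_EXPLANATION, s) - sentence_loss(explanation, s)
        for s in activating
    )
    gain_on_non_activating = mean(
        sentence_loss(GENERIC_EXPLANATION, s) - sentence_loss(explanation, s)
        for s in non_activating
    )
    return gain_on_activating - gain_on_non_activating   # larger is better
```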
First Reference in Text
As an alternative to simulation scoring, we introduce four new evaluation methods that focus on how well an explanation enables a scorer to discriminate between activating and non-activating contexts.
Summary
Figure 2 outlines four novel scoring methods for evaluating the quality of explanations for SAE latents: Detection, Fuzzing, Surprisal, and Embedding. Each method is illustrated with an example featuring an explanation, a set of sentences, and the expected output or scoring mechanism. The caption provides a concise description of each method's process. Importantly, it clarifies that while activating tokens are highlighted in green for the reader's benefit, the scorer model does *not* have access to this information.
Methodological Critique
- Introducing four new scoring methods simultaneously makes comparison and interpretation complex. The rationale for needing four distinct methods should be clearly articulated. Are they designed to capture different aspects of explanation quality? How do they relate to existing methods (like simulation scoring)?
- While the caption describes the general procedure for each method, it lacks specific details about implementation. For example, what type of scorer model is used? How are the sentences selected? How is the loss computed in surprisal scoring? What embedding model is used for embedding scoring? These details are crucial for reproducibility.
- The figure provides illustrative examples, but no empirical evidence to support the effectiveness of these scoring methods. The paper should present results comparing these methods against each other and against existing benchmarks.
- The proposed methods seem generally aligned with the goal of evaluating explanation quality, but their novelty and practical utility need to be demonstrated through rigorous experimentation and comparison with existing methods.
Presentation Critique
- The caption is dense and could be improved by breaking it down into smaller, more digestible chunks. Each scoring method could be described in a separate paragraph or bullet point. Using a table format to summarize the methods could also enhance clarity.
- The visual organization is adequate but could be improved. Using consistent visual cues (e.g., color coding, arrows) to indicate the flow of information in each example would make it easier to follow. The figure could also benefit from clearer labels and headings.
- The figure assumes familiarity with concepts like "scorer model," "latent," "activation," and "embedding." While appropriate for a specialized audience, the paper should provide clear definitions and explanations for a broader readership.
- The presentation of the methods, while not entirely standard, generally adheres to conventions for illustrating methodological approaches. However, the lack of specific implementation details hinders reproducibility.
Fig. 3. Fuzzing and detection scores for different sampling techniques. Panels...
Full Caption
Fig. 3. Fuzzing and detection scores for different sampling techniques. Panels a) and b) show the distributions of fuzzing and detection scores, respectively, as a function of different example sampling methods for explanation generation. Sampling only from the top activation gets on average lower accuracy in fuzzing and on detection when compared with randomly sampling and sampling from quantiles. The distributions from random sampling and sampling from quantiles are very similar. Panels c) and d) measure how the explanations generalize across activation quantiles, showing that explanations generated from the top quantiles are better at distinguishing non-activating examples in detection, but have lower accuracy on other quantiles, especially on the lower activating deciles. This also happens with explanations generated from examples sampled randomly and from quantiles, but the accuracy does not drop as much in lower activating deciles.
Key Insights
- The main finding is that sampling strategies significantly impact the quality and generalizability of generated explanations for SAE latents.
- This insight has implications for interpretability research, suggesting that relying solely on top activating examples can lead to biased and less robust explanations. Using more diverse sampling strategies like 'random' or 'quantile' sampling can improve the generalizability of explanations.
- The figure directly addresses the research objective of evaluating different methods for generating explanations. It provides empirical evidence to support the claim that broader sampling leads to better explanations.
- Future work could explore more sophisticated sampling strategies that balance the need to capture the most activating examples with the need to generalize to a wider range of contexts. Additionally, incorporating statistical significance testing would strengthen the conclusions drawn from the figure.
Key Values
- Panels (a) and (b) show that 'random' and 'quantile' sampling achieve higher median accuracy scores for both fuzzing and detection compared to 'top' sampling. The distributions for 'random' and 'quantile' are very similar.
- These higher median scores indicate that explanations generated from a broader range of examples are more robust and generalize better across different contexts.
- Panels (c) and (d) reveal that 'top' sampling performs well at distinguishing non-activating examples (decile 0) but poorly on lower activation deciles. 'Random' and 'quantile' sampling show more consistent performance across deciles, although they also exhibit a slight drop in accuracy for lower deciles.
- The decile-level analysis highlights the trade-off between accurately capturing the most activating examples and generalizing to a wider range of activations. It suggests that over-reliance on top activating examples can lead to explanations that are overly specific and fail to capture the full range of a latent's meaning.
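To illustrate the 'quantile' sampling strategy compared above, the sketch below stratifies activating examples by activation quantile before sampling. The number of quantiles and the per-quantile counts are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def sample_by_quantile(examples, activations, n_quantiles=10, per_quantile=4, seed=0):
    """Sample examples evenly across activation quantiles instead of only the top activations."""
    rng = np.random.default_rng(seed)
    activations = np.asarray(activations)
    edges = np.quantile(activations, np.linspace(0.0, 1.0, n_quantiles + 1))
    sampled = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((activations >= lo) & (activations <= hi))[0]
        if len(idx) > 0:
            chosen = rng.choice(idx, size=min(per_quantile, len(idx)), replace=False)
            sampled.extend(examples[i] for i in chosen)
    return sampled
```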
First Reference in Text
We find that randomly sampling from a broader set of examples leads to explanations that cover a larger set of activating examples, sometimes to the detriment of the top activating examples, see fig. 3.
Summary
Figure 3 explores the impact of different context sampling strategies on the performance of generated explanations, measured by fuzzing and detection accuracy. Panels (a) and (b) present the overall score distributions for 'top', 'random', and 'quantile' sampling, along with a 'random explanations' baseline. Panels (c) and (d) show how explanation accuracy varies across activation deciles for each sampling method, specifically for detection. The figure demonstrates that while 'top' sampling excels at distinguishing non-activating examples, it performs poorly on lower activation deciles. 'Random' and 'quantile' sampling provide more balanced performance.
Methodological Critique
- The figure compares three sampling methods: 'top', 'random', and 'quantile'. While these are common strategies, the rationale for choosing these specific methods and the precise implementation of 'quantile' sampling (e.g., number of quantiles, how they are defined) are not explicitly stated.
- The caption mentions 'fuzzing' and 'detection' scores but doesn't fully define them. It assumes the reader understands these metrics from prior context. Including a brief explanation or referencing the relevant section would improve clarity.
- The figure visually supports the claim that 'random' and 'quantile' sampling yield better overall performance than 'top' sampling. However, it doesn't provide statistical significance testing to confirm these observations. Reporting p-values or confidence intervals would strengthen the analysis.
- The methodology of evaluating explanations based on different sampling strategies is sound and relevant to the research question. However, the lack of precise details about the 'quantile' sampling implementation and the absence of statistical significance testing limit the reproducibility and robustness of the findings.
Presentation Critique
- The caption is quite lengthy and dense. Breaking it down into shorter, more focused sentences would improve readability. Separately captioning each panel (a-d) would also be beneficial.
- The visual organization is generally clear, with distinct panels for overall score distributions and decile-level analysis. However, the y-axis labels in panels (c) and (d) could be more descriptive (e.g., 'Detection Accuracy'). Adding clear titles to each panel would also enhance readability.
- The figure assumes familiarity with terms like 'fuzzing', 'detection', 'activation quantiles', and 'deciles'. While appropriate for a specialized audience, the paper should provide clear definitions and explanations for broader accessibility.
- The presentation of the results generally follows conventions for visualizing performance comparisons. However, the lack of statistical annotations (e.g., p-values, confidence intervals) and more descriptive axis labels could be improved.
Results
Overview
This section presents the results of comparing different scoring methods for explanations of Sparse Autoencoder (SAE) latents, finding fuzzing and detection scores correlate most with simulation scoring. It also highlights the value of intervention scoring in interpreting features missed by other methods. The impact of sampling techniques, explainer model size, and SAE architecture on explanation quality are explored, showing benefits from broader sampling and larger SAEs. Finally, the overlap between latents at adjacent layers is analyzed, suggesting prioritization of wider SAEs on fewer layers for limited compute budgets.
Key Aspects
- Scoring Method Comparison: Fuzzing and detection scoring methods show the highest correlation with the established simulation scoring, while surprisal and embedding correlate more with each other and detection.
- Intervention Scoring: Intervention scoring proves valuable by distinguishing between trained and random SAE features and by interpreting features missed by context-based methods.
- Sampling Techniques: Sampling examples from across the activation distribution, rather than just the top activating examples, leads to higher quality explanations that generalize better.
- Explainer Model Size: Larger explainer models generally produce better explanations, but the performance difference between Llama 3.1 70b and Claude Sonnet 3.5 is not substantial.
- SAE Architecture: SAEs with more latents and those trained on residual streams have higher scores. Earlier layers have lower scores, but scores stabilize after initial layers.
- Latent Overlap: Adjacent residual stream SAEs exhibit higher semantic overlap than MLP SAEs, suggesting potential resource optimization by training wider SAEs on fewer layers.
Strengths
- Comprehensive Evaluation: The section evaluates various factors affecting explanation quality, including scoring methods, sampling techniques, and model architectures, providing a holistic view of the proposed pipeline.
  "We evaluate explanation quality using fuzzing, detection and embedding scores, as those were both quick to compute and easy to interpret." (Page 7)
- Emphasis on Generalization: The study highlights the importance of evaluating explanations across the entire activation distribution, not just on top activating examples, which is a crucial aspect for robust interpretability.
  "This effect would not be seen if the scoring were done on just the most activating examples, underscoring a problem with current auto-interpretability evaluations..." (Page 7)
Suggestions for Improvement
- Quantify Semantic Overlap: While the section mentions measuring semantic similarity, providing specific metrics or quantitative results would strengthen the analysis.
  "We observe that the residual stream SAEs have higher semantic overlap for neighboring layers than the MLP SAEs." (Page 9)
  Rationale: Quantitative measures would provide more concrete evidence of the observed semantic overlap and allow for more rigorous comparisons.
  Implementation: Report specific metrics used to quantify semantic overlap, such as cosine similarity or Frobenius inner product values, and include visualizations or tables to illustrate the differences between residual stream and MLP SAEs.
- Further Investigate Explainer Model Differences: The section notes similar performance between Llama and Claude, but further investigation into the reasons and potential for Claude optimization could be valuable.
  "We don’t find the performance of Claude Sonnet 3.5 to be much higher than that of Llama 3.1 70b..." (Page 7)
  Rationale: Understanding the factors influencing explainer model performance could lead to improved explanation quality and more efficient use of resources.
  Implementation: Conduct further experiments to analyze the impact of prompt optimization and model architecture on explanation quality for both Llama and Claude, potentially exploring different prompting strategies or fine-tuning approaches.
Non-Text Elements
Table 1. Spearman correlation computed over 600 different latent scores
Key Insights
- The main finding is that the proposed scoring methods exhibit varying degrees of correlation with each other and with the established Simulation method.
- This finding has implications for the choice of scoring method in interpretability research. If different methods measure different aspects of explanation quality, researchers need to carefully consider which method is most appropriate for their specific research question.
- The table contributes to the research objective of evaluating the proposed scoring methods. It provides a quantitative assessment of their agreement with each other and with the existing benchmark.
- Future work could explore the reasons for the low correlations observed with Embedding. It's important to understand whether this method is capturing a valuable but distinct aspect of explanation quality, or whether it's simply less reliable than the other methods. Additionally, investigating the statistical significance of the observed correlations would strengthen the analysis.
Key Values
- The highest correlation (0.73) is observed between Fuzzing and Detection. Fuzzing also shows a relatively high correlation (0.70) with Simulation. The lowest correlations are generally observed with Embedding, particularly with Surprisal (0.20).
- The high correlation between Fuzzing and Detection suggests that these two methods capture similar aspects of explanation quality. The relatively high correlation between Fuzzing and Simulation provides some support for the validity of the Fuzzing method. The low correlations involving Embedding suggest that this method measures a different aspect of explanation quality compared to the other methods.
- The correlations between the new methods and Simulation are of particular interest, as Simulation is the established benchmark. The relatively strong correlation between Fuzzing and Simulation (0.70) is a positive sign. However, the weaker correlations for Detection (0.44), Embedding (0.41), and Surprisal (0.30) raise questions about their agreement with Simulation.
- These correlation values provide insights into the relationships between different scoring methods for SAE latent explanations. They suggest that some methods are more closely related than others, and that Embedding, in particular, may be capturing a distinct aspect of explanation quality.
First Reference in Text
Since simulation scoring is an established method, we measure how our other context-based scoring techniques correlate with simulation, as well as how they correlate between themselves (see tables 1 and A1).
Summary
Table 1 presents a correlation matrix of Spearman rank correlation coefficients calculated between different scoring methods for SAE latent explanations. The methods included are Fuzzing, Detection, Simulation, Embedding, and Surprisal. The correlations are computed over 600 different latent scores. The table is symmetric, with a diagonal of 1s (as each method perfectly correlates with itself).
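For readers who want to reproduce this kind of table from their own scores, a Spearman correlation matrix can be computed directly with SciPy; the score array below is a random placeholder standing in for the 600 per-latent scores of each method.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder scores: one row per latent, one column per scoring method.
rng = np.random.default_rng(0)
methods = ["Fuzzing", "Detection", "Simulation", "Embedding", "Surprisal"]
scores = rng.random((600, len(methods)))   # replace with real per-latent scores

corr, pvals = spearmanr(scores)            # pairwise Spearman correlations (and p-values)
for i, name in enumerate(methods):
    row = "  ".join(f"{corr[i, j]:.2f}" for j in range(len(methods)))
    print(f"{name:<11}{row}")
```

Reporting the `pvals` matrix alongside the correlations would address the significance concern raised above.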
Methodological Critique
- Using Spearman correlation is appropriate for comparing ranking-based scoring methods, as it doesn't assume linearity. However, the choice of 600 latent scores should be justified. Is this a representative sample? How were these latents selected?
- The table caption mentions that the correlations are computed over "600 different latent scores." More detail is needed on how these latents were selected. Were they randomly sampled? Were they chosen to represent a diverse range of latent types or activation patterns? Additionally, specifying the layers from which these latents were taken would be helpful.
- The table shows the correlations between the proposed scoring methods and the established simulation scoring method. This allows for a direct comparison and helps assess the validity of the new methods. However, the table doesn't provide any information about the statistical significance of these correlations. Including p-values would strengthen the analysis.
- Calculating correlations between scoring methods is a standard approach for evaluating their agreement. However, relying solely on correlations can be misleading. It's important to also consider the magnitude of the differences between scores and the practical implications of these differences.
Presentation Critique
- The table is clearly organized and easy to read. The row and column headers clearly identify the scoring methods being compared. The use of a symmetric matrix is appropriate for presenting correlations.
- The visual presentation of the correlation matrix is standard and effective. However, using color coding to highlight the strength of correlations could improve readability and quickly convey the key findings.
- The table is understandable for a technical audience familiar with statistical concepts like correlation. However, for a broader audience, a brief explanation of Spearman correlation and its interpretation would be beneficial.
- The presentation of the correlation matrix adheres to standard conventions. However, adding a brief explanation of the scoring methods in the caption or a footnote would improve the table's self-containedness.
Fig. 4. SAE features are more interpretable than random features, especially when...
Full Caption
SAE features are more interpretable than random features, especially when intervening more strongly. Our explainer also produces explanations that are scored higher than random explanations. Right: Many features that would normally be uninterpreted when using context-based automatic interpretability are interpretable in terms of their effects on output.
Key Insights
- The main finding is that intervention scores provide a valuable complementary perspective to context-based interpretability methods, revealing features that influence output but are not easily explained by input contexts.
- This insight has implications for understanding and controlling LLM behavior. By identifying features that directly influence output, intervention scores can help researchers develop more targeted and effective methods for steering and debugging LLMs.
- The figure directly addresses the research objective of developing new interpretability metrics. It provides empirical evidence to support the claim that intervention scores capture a distinct and valuable aspect of interpretability.
- Future work could explore the relationship between intervention scores and other interpretability metrics, as well as their application to different models, layers, and tasks. Further investigation is also needed to understand the limitations of intervention scores and to develop methods for improving their robustness and reliability.
Key Values
- The left panel shows that SAE features have higher intervention scores than random features, especially at higher target KL divergences (1.0 and 3.0). The explainer-generated explanations also score higher than random explanations.
- These higher intervention scores for SAE features suggest that they have a more meaningful and interpretable impact on model output compared to random features. The increasing difference with higher KL divergence indicates that this effect is more pronounced for stronger interventions. The fact that generated explanations score higher than random ones suggests the explainer is producing meaningful explanations.
- The right panel shows a negative correlation between intervention scores and fuzzing scores (y = -2.71x + 2.79, r = -0.37).
- This negative correlation suggests that features with high intervention scores (i.e., those that strongly influence output) tend to have low fuzzing scores (i.e., they are difficult to interpret based on input contexts). This highlights the value of intervention scores in capturing features that are not easily interpretable using traditional context-based methods.
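To make the 'target KL divergence' axis concrete: intervention strength can be calibrated by comparing the model's next-token distribution before and after intervening on a latent. The sketch below shows only that comparison, under the assumption that raw logits are available; the paper's full intervention-scoring procedure is described in its Section 3.5.5 and is not reproduced here.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p_logits, q_logits):
    """KL(intervened || baseline) between next-token distributions, given raw logits."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def pick_strength(baseline_logits, intervened_logits_by_strength, target_kl):
    """Pick the intervention strength whose KL from the baseline is closest to the target."""
    kls = {s: kl_divergence(l, baseline_logits)
           for s, l in intervened_logits_by_strength.items()}
    return min(kls, key=lambda s: abs(kls[s] - target_kl))
```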
First Reference in Text
In figure 4, we see that the intervention score we propose is a valuable contribution to the set of automatic interpretability metrics because it (a) distinguishes between features from a trained SAE and random features and (b) recalls features that context-based scoring methods fail to interpret.
Summary
Figure 4 presents an analysis of intervention scores, a proposed metric for interpretability. The left panel shows distributions of intervention scores for SAE features, random features, and random explanations at different target KL divergences (intervention strengths). The right panel shows a scatter plot comparing intervention scores with fuzzing scores (a context-based interpretability metric). The figure focuses on Gemma 2 9B at layer 32.
Methodological Critique
- The use of intervention scores is a novel approach to interpretability, directly assessing the causal impact of features on output. Comparing these scores to random features and random explanations provides a good baseline.
- While the caption mentions Section 3.5.5 for details on intervention scores, a brief explanation of how these scores are calculated would enhance the figure's self-containedness. Key details like the choice of scorer model, the types of interventions performed, and the method for calculating KL divergence are missing from the figure and caption.
- The left panel visually demonstrates the difference between SAE features, random features, and random explanations. However, it would be stronger to include statistical significance tests (e.g., p-values) to quantify these differences. The right panel shows a negative correlation between intervention and fuzzing scores, suggesting that these methods capture different aspects of interpretability. Quantifying this correlation (e.g., with Spearman's rank correlation coefficient) and assessing its significance would be beneficial.
- Comparing intervention scores with existing context-based methods is a valuable contribution. However, further validation is needed to establish the robustness and generalizability of intervention scores across different models, layers, and tasks.
Presentation Critique
- The caption clearly describes the main findings, but could be improved by providing more context about intervention scores and their interpretation. The terms "target KL divergence" and "fuzzing score" are used without definition, assuming prior knowledge from the reader.
- The left panel effectively uses histograms to visualize the distributions of intervention scores. The right panel uses a scatter plot to show the relationship between intervention and fuzzing scores. Adding axis labels and titles to the plots would improve clarity. Including the equation of the fitted line and the R-squared value in the right panel would provide more quantitative information about the correlation.
- The figure is understandable for a technical audience familiar with interpretability research and statistical concepts. However, for a broader audience, more explanation of the technical terms and the methodology would be necessary.
- The visual presentation of the results is generally clear, but could be enhanced with clearer labels, titles, and statistical annotations.
Overlap Between Latents at Adjacent Layers
Overview
This section explores the overlap between latent features learned by SAEs at adjacent layers in the residual stream of a transformer model. It highlights the motivation for studying this overlap, including potential resource optimization and sanity-checking the interpretability pipeline. The Hungarian algorithm is used to align latents due to permutation symmetry, and the analysis reveals greater semantic similarity between adjacent residual stream SAEs compared to MLP SAEs, suggesting potential compute savings by training wider SAEs on fewer layers.
Key Aspects
- Motivation for Overlap Analysis: Investigating feature overlap can reveal redundancies, inform training strategies, and validate the consistency of the interpretability pipeline.
- Permutation Symmetry and Alignment: The Hungarian algorithm is employed to address the permutation symmetry issue and align latents for meaningful comparison between layers.
- Semantic Similarity Measurement: The Frobenius inner product of explanation embeddings is used to measure the semantic similarity between aligned latents.
- Residual Stream vs. MLP Overlap: Adjacent residual stream SAEs demonstrate higher semantic overlap than MLP SAEs, suggesting different training prioritization strategies.
- Resource Optimization Implications: The findings suggest that training wider SAEs on a smaller subset of residual stream layers may be more compute-efficient than training narrower SAEs on all layers.
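A sketch of the alignment step described above, assuming unit-normalized explanation embeddings for the latents of two layers: SciPy's `linear_sum_assignment` (the Hungarian algorithm) finds a one-to-one matching that maximizes total cosine similarity, and the mean similarity of matched pairs summarizes semantic overlap. Whether the paper matches on explanation embeddings, decoder directions, or both is not restated here, so treat the objective below as illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def semantic_overlap(emb_layer_a, emb_layer_b):
    """Align latents between two layers and return the mean cosine similarity of matched pairs.

    emb_layer_a, emb_layer_b: arrays of shape (n_latents, d) holding unit-normalized
    explanation embeddings for each layer's latents.
    """
    sim = emb_layer_a @ emb_layer_b.T            # cosine similarities (rows are unit-norm)
    rows, cols = linear_sum_assignment(-sim)     # negate: the solver minimizes cost
    return float(sim[rows, cols].mean())
```

Note that the sum of `sim[rows, cols]` is exactly the Frobenius inner product of the two embedding matrices after applying the matching permutation, which connects this sketch to the similarity measure described above.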
Strengths
- Clear Motivation: The section clearly articulates the reasons for studying latent overlap, connecting it to practical concerns like resource optimization and pipeline validation.
  "Measuring the degree of feature overlap is interesting for a few reasons. First, if adjacent SAEs learn almost identical features, it may not be worthwhile to train and interpret SAEs at every single residual stream layer." (Page 9)
Suggestions for Improvement
- Visualize Overlap: While the text describes the use of the Frobenius inner product, visualizing the overlap between layers would enhance understanding and provide a more intuitive representation of the findings.
  "We observe that the residual stream SAEs have higher semantic overlap for neighboring layers than the MLP SAEs." (Page 9)
  Rationale: Visualizations can effectively communicate complex relationships and patterns, making the results more accessible and impactful.
  Implementation: Include heatmaps or other visual representations of the Frobenius inner product matrices, illustrating the degree of overlap between different layers for both residual stream and MLP SAEs.
- Quantify Overlap Difference: While the section mentions higher overlap in residual stream SAEs, quantifying this difference would strengthen the claim and provide a more precise comparison.
  "We observe that the residual stream SAEs have higher semantic overlap for neighboring layers than the MLP SAEs." (Page 9)
  Rationale: Quantitative measures provide more concrete evidence and allow for more rigorous comparisons between different architectures.
  Implementation: Report specific values for the Frobenius inner product or other relevant metrics for both residual stream and MLP SAEs, potentially including statistical significance tests to assess the difference in overlap.
Conclusion
Overview
This conclusion summarizes the main contributions of the paper, which focuses on explaining the latents of Sparse Autoencoders (SAEs) trained on large language models (LLMs). It highlights the proposed scoring techniques for evaluating explanation quality, discusses the limitations of current methods regarding context length and non-activating examples, and suggests future directions, including incorporating explanation length into metrics and improving the selection of non-activating examples. The conclusion also emphasizes the potential of improved explanations for model steering, concept localization, and editing.
Key Aspects
- Summary of Contributions: The conclusion reiterates the key contributions of the paper, including the introduction of novel scoring techniques for SAE latent explanations and the analysis of explanation distributions and overlaps.
- Limitations and Future Directions: The authors acknowledge the limitations of current methods, particularly in handling long-context features and incorporating explanation length into evaluation metrics. Future work will focus on addressing these limitations.
- Potential Applications: The conclusion highlights the potential applications of improved explanations for model steering, concept localization, and editing, emphasizing the broader impact of this research.
- Resource Optimization: The conclusion reinforces the finding that training wider SAEs on a subset of residual stream layers may be more efficient than training narrower SAEs on all layers, given computational constraints.
Strengths
- Clear Summary of Findings: The conclusion effectively summarizes the key findings of the paper, providing a concise overview of the main contributions and their implications.
  "Explaining the latents of SAEs trained on cutting-edge LLMs is a computationally demanding task, requiring scalable methods to both generate explanations and assess their quality." (Page 10)
- Forward-Looking Perspective: The conclusion offers a forward-looking perspective, outlining future research directions and potential applications of the presented techniques.
  "Access to better, automatically generated explanations could play a crucial role in areas like model steering, concept localization, and editing." (Page 10)
Suggestions for Improvement
- Elaborate on Future Work: While the conclusion mentions future directions, elaborating on specific plans and potential approaches would strengthen the impact.
  "We leave these extensions to future work." (Page 10)
  Rationale: Providing more concrete details about future work would give readers a clearer understanding of the research trajectory and potential future contributions.
  Implementation: Expand on the planned improvements for handling long-context features and incorporating explanation length into metrics. Discuss potential approaches, such as using hierarchical models or reinforcement learning techniques, to address these challenges.
- Quantify Potential Resource Savings: While the conclusion mentions resource optimization, quantifying the potential savings from training wider SAEs on fewer layers would strengthen the argument.
  "Our results suggest that, when compute resources are constrained, it may be more efficient to train wider SAEs on a small subset of residual stream layers, rather than narrower SAEs on all layers." (Page 10)
  Rationale: Providing concrete numbers or estimates of the potential compute savings would make the resource optimization argument more compelling and impactful.
  Implementation: Estimate the computational cost of training different SAE architectures and compare the resource requirements for training wider SAEs on fewer layers versus narrower SAEs on all layers. Present these estimates in a table or graph to illustrate the potential savings.
Appendix
Overview
This appendix provides supplementary information regarding the explainer prompt, examples of activating contexts and explanations, different scoring methods (detection, fuzzing, surprisal, embedding), adversarial examples, correlation between scores, SAE models used, and factors that influence explainability, including dataset, chain of thought, activation information, context length, origin and number of examples, explainability across layers, size of explainer and scorer models, and SAE size and location.
Key Aspects
- Explainer Prompt: Details the prompt used for the explainer model, including guidelines for generating concise explanations of language patterns and optional Chain of Thought (COT) prompting for improved analysis.
- Scoring Methods: Explains the different scoring methods used to evaluate the quality of generated explanations, including detection, fuzzing, surprisal, and embedding scores, providing details on their implementation and prompts.
- SAE Model Details: Provides information on the different SAE models used throughout the work, including the number of latents and average L0 norms per layer for various models trained on Gemma and Llama.
- Factors Influencing Explainability: Discusses various factors that impact the explainability of SAE latents, such as the training dataset, use of COT, activation information, context length, example sampling strategies, and model sizes.
- Automatically Interpreting Interventions: Details the process of automatically interpreting interventions, including scoring implementation, explanation generation, and baselines used for comparison.
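To complement the appendix's description of embedding scoring, here is a minimal retrieval-style sketch. `embed` is a hypothetical helper returning unit-normalized NumPy vectors (e.g., from an off-the-shelf sentence-embedding model); the appendix documents the actual embedding model and prompts used, which may differ from this formulation.

```python
import numpy as np

def embedding_score(explanation, activating, non_activating, embed):
    """Treat the explanation as a query and the sentences as documents: a good explanation
    should be closer (higher cosine similarity) to activating sentences than to
    non-activating ones."""
    query = embed([explanation])[0]                    # unit-normalized vector
    docs = embed(activating + non_activating)          # one unit-normalized vector per sentence
    sims = docs @ query                                # cosine similarities
    labels = np.array([1] * len(activating) + [0] * len(non_activating))
    # Fraction of (activating, non-activating) pairs ranked correctly (an AUC-style summary).
    pos, neg = sims[labels == 1], sims[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())
```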
Strengths
- Detailed Explanations: The appendix provides comprehensive explanations of the various components of the interpretability pipeline, including the explainer prompt, scoring methods, and SAE model details.
  "The system prompt if not using Chain of Thought (COT): You are a meticulous AI researcher conducting an important investigation into patterns found in language." (Page 14)
Suggestions for Improvement
- Clarify Adversarial Examples Section: The section on adversarial examples is brief and could benefit from more concrete examples and discussion of potential mitigation strategies.
  "As SAEs are scaled and latents become sparser and more specific, techniques that overly rely on activating contexts will have more imprecise results Gao et al. (2024)." (Page 20)
  Rationale: A more detailed discussion of adversarial examples would help readers understand the limitations of current methods and potential areas for future research.
  Implementation: Provide specific examples of adversarial examples and discuss potential mitigation strategies, such as using more robust sampling techniques or incorporating adversarial training into the explanation generation process.
- Improve Clarity of Correlation Table: Table A1, showing the correlation between scores, could be improved by adding labels for rows and columns and providing more context on the interpretation of the values.
  Rationale: Clearer labeling and context would make the table easier to understand and interpret.
  Implementation: Add row and column labels to Table A1, indicating which scoring methods are being compared. Provide a brief explanation of how to interpret the correlation values and their significance.