Do-Not-Answer: A Safety Benchmark for LLMs

Overall Summary

Overview

This paper introduces "Do-Not-Answer," an open-source dataset designed to evaluate the safety of large language models (LLMs). The dataset comprises 939 prompts that responsible LLMs should refuse to answer, organized by a three-level hierarchical risk taxonomy covering five risk areas (information hazards, malicious uses, discrimination/exclusion/toxicity, misinformation harms, and human-chatbot interaction harms). Six popular LLMs (GPT-4, ChatGPT, Claude, LLaMA-2, ChatGLM2, and Vicuna) were evaluated on this dataset, revealing varying levels of safety performance. Additionally, the researchers developed smaller, more efficient BERT-like classifiers that achieve performance comparable to GPT-4, demonstrating their potential as a cost-effective alternative for automatic safety evaluation.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 1

Description: This table presents the distribution of questions across the five risk areas and twelve harm types defined in the taxonomy, providing insights into the composition and scope of the Do-Not-Answer dataset.

Relevance: The table demonstrates the coverage of various risk categories and allows researchers to understand the dataset's focus and potential biases. It also helps in interpreting the evaluation results in the context of the dataset composition.

Figure 6

Description: This figure displays five heatmaps showing the distribution of action categories across six LLMs for each risk area, providing a visual summary of model behavior across different types of risky prompts.

Relevance: The heatmaps offer a clear and concise visualization of the models' strengths and weaknesses in handling various risk areas, allowing for quick comparison of their safety performance and identification of areas for improvement.

Conclusion

This paper makes a significant contribution to the field of LLM safety by introducing a comprehensive risk taxonomy, a novel dataset of risky prompts, and an evaluation of popular LLMs. The finding that LLaMA-2 performed best in terms of safety and the promising results of the smaller, Longformer-based automatic evaluation method are valuable insights for developers. Future research should focus on expanding the dataset with non-risky instructions, extending the evaluation to other languages and multi-turn conversations, and further refining automatic evaluation techniques to improve the accuracy and scalability of LLM safety assessments. These advancements are crucial for the responsible development and deployment of LLMs in real-world applications.

Section Analysis

Abstract

Overview

This paper introduces "Do-Not-Answer," an open-source dataset designed to evaluate the safety of large language models (LLMs). The dataset focuses on prompts that responsible LLMs should refuse to answer. The researchers evaluated six popular LLMs using this dataset and found varying levels of safety performance. They also developed smaller, BERT-like classifiers that can automatically evaluate LLM safety with comparable effectiveness to GPT-4.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

Large language models (LLMs) are rapidly evolving, demonstrating both beneficial and harmful emergent capabilities. This necessitates evaluating and mitigating these risks, especially for open-source LLMs, which often lack robust safety mechanisms. This paper introduces the "Do-Not-Answer" dataset, an open-source resource designed to evaluate LLM safeguards by focusing on prompts that responsible models should refuse to answer. This dataset aims to enable safer development and deployment of open-source LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Related Work

Overview

This section discusses existing research on the risks of deploying Large Language Models (LLMs), including studies focusing on specific risk areas like bias, toxicity, and misinformation, and work on holistic risk evaluation. It highlights the limitations of current datasets and emphasizes the need for a comprehensive taxonomy and open-source dataset for evaluating LLM safety.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 1

This figure presents a radar chart comparing six large language models (LLMs) across different safety aspects. Each axis of the radar chart represents a category of potential harm, such as 'Information Hazards' or 'Malicious Uses'. The performance of each LLM is represented by a colored line, forming a polygon on the chart. A point closer to the outer edge of the chart indicates a higher score in that category, meaning the LLM is better at avoiding that specific harm. The LLMs compared are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna.

First Mention

Text: "Figure 1: A comprehensive evaluation of LLM safeguards."

Context: This figure is presented at the beginning of the 'Related Work' section on page 2. It follows a discussion of the need for better safety evaluations in LLMs.

Relevance: This figure is highly relevant because it visually summarizes the core evaluation performed in the paper. It allows for a quick comparison of the safety performance of different LLMs across various risk categories.

Critique
Visual Aspects
  • One axis label is missing, making it difficult to interpret that dimension of the comparison.
  • The scale on the axes (0.8 to 1.0) is quite narrow, potentially obscuring subtle differences in performance.
  • The colors used for the LLM lines are somewhat similar, making it challenging to distinguish them at a glance.
Analytical Aspects
  • The figure lacks any indication of statistical significance. It's unclear whether the observed differences in performance are statistically meaningful.
  • The figure doesn't provide any context or explanation for why certain LLMs perform better or worse in specific categories.
  • The choice of categories and their relative importance in the overall safety assessment are not discussed.
Numeric Data

Safety Taxonomy

Overview

This section introduces a three-level hierarchical taxonomy for classifying the risks associated with large language models (LLMs), particularly text-only models. It builds upon previous research and covers five key risk areas: information hazards, malicious uses, discrimination/exclusion/toxicity, misinformation harms, and human-chatbot interaction harms. The taxonomy provides a detailed breakdown of potential hazards, outlining how these risks manifest and giving examples of the types of questions or prompts that could trigger them.
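
To make the hierarchy concrete, the taxonomy can be written down as a simple nested mapping from risk areas to harm types, with the per-category question counts taken from Table 1 below. This is a minimal sketch: the dictionary name and the abbreviated harm-type labels are illustrative, not taken from the paper's released code.

```python
# Sketch of the five risk areas and twelve harm types in the paper's taxonomy,
# with question counts as reported in Table 1 (harm-type names abbreviated).
DO_NOT_ANSWER_TAXONOMY = {
    "Information Hazards": {
        "Sensitive organization/government information": 136,
        "Private individual information": 112,
    },
    "Malicious Uses": {
        "Assisting illegal activities": 132,
        "Unethical or unsafe actions": 71,
        "Disinformation campaigns": 40,
    },
    "Discrimination, Exclusion, Toxicity, Hateful, Offensive": {
        "Social stereotypes and unfair discrimination": 95,
        "Toxic language (hate speech)": 53,
        "Adult content": 28,
    },
    "Misinformation Harms": {
        "False or misleading information": 92,
        "Material harm (e.g., medicine or law)": 63,
    },
    "Human-Chatbot Interaction Harms": {
        "Mental health or overreliance crisis": 67,
        "Treat chatbot as a human": 50,
    },
}

# Sanity check: the twelve harm-type counts sum to the 939 questions in the dataset.
total = sum(n for harms in DO_NOT_ANSWER_TAXONOMY.values() for n in harms.values())
assert total == 939, total
```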

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 2

This figure illustrates a three-level taxonomy of risks associated with Large Language Models (LLMs). The top level categorizes risks into five broad areas: Information Hazards, Malicious Uses, Discrimination/Toxicity, Misinformation Harms, and Human-Chatbot Interaction Harms. Each of these areas is then broken down into more specific harm types in the second level. The third level further details each harm type with specific examples. The taxonomy helps to organize and understand the various ways LLMs can potentially cause harm.

First Mention

Text: "Figure 2: Three-level taxonomy of LLM risks."

Context: This figure appears in the 'Safety Taxonomy' section on page 4, after a discussion of existing risk categorization efforts and the need for a more comprehensive taxonomy.

Relevance: This figure is crucial for understanding the structure and scope of the research. It provides the framework for the Do-Not-Answer dataset and guides the evaluation of LLM safety.

Critique
Visual Aspects
  • The figure could benefit from a clearer visual hierarchy. The connections between the levels could be more distinct, perhaps using different line weights or colors.
  • The text within the boxes is quite small, making it difficult to read without zooming in.
  • The figure is somewhat cluttered, making it challenging to follow the different branches of the taxonomy.
Analytical Aspects
  • The figure doesn't explain how the taxonomy was developed or validated. It's unclear whether it's based on empirical data or expert opinions.
  • The figure doesn't discuss the relative importance of the different risk areas or harm types. Are some risks considered more critical than others?
  • The figure doesn't provide any context on how the taxonomy will be used in the subsequent evaluation. How are the prompts in the Do-Not-Answer dataset mapped to this taxonomy?
Numeric Data
table 1

This table shows the number of questions in the Do-Not-Answer dataset that fall into each of the five risk areas and twelve harm types defined in the safety taxonomy. The table has three columns: 'Risk Area', 'Harm Type', and '# Q' (number of questions). It provides a breakdown of the dataset's composition and shows how the questions are distributed across different risk categories.

First Mention

Text: "Table 1: The number of questions (# Q) falling into our five risk areas and twelve harm types."

Context: This table is presented in the 'Safety Taxonomy' section on page 5, immediately after the introduction of the three-level risk taxonomy in Figure 2.

Relevance: This table is important because it provides a quantitative overview of the Do-Not-Answer dataset. It shows the distribution of questions across different risk categories, which is crucial for understanding the scope and focus of the evaluation.

Critique
Visual Aspects
  • The table could be more visually appealing. Using different colors or shading for the rows could improve readability.
  • The table is quite wide, making it difficult to fit on the page without wrapping text. Consider splitting it into two smaller tables or rotating it.
  • The '# Q' column could be renamed to something more descriptive, such as 'Number of Questions'.
Analytical Aspects
  • The table doesn't explain how the questions were generated or selected for inclusion in the dataset. What was the process for creating these questions?
  • The table doesn't discuss the rationale for the number of questions in each category. Why are there more questions for some harm types than others?
  • The table doesn't provide any context on how these numbers will be used in the subsequent evaluation. How will the distribution of questions affect the results?
Numeric Data
  • Information Hazards - Risks from leaking or inferring sensitive information (organization/gov): 136
  • Information Hazards - Compromise privacy by leaking or inferring private information (person/individual): 112
  • Malicious Uses - Assisting illegal activities: 132
  • Malicious Uses - Nudging or advising users to perform unethical or unsafe actions: 71
  • Malicious Uses - Reducing the cost of disinformation campaigns: 40
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Social stereotypes and unfair discrimination: 95
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Toxic language (hate speech): 53
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Adult content: 28
  • Misinformation Harms - Disseminating false or misleading information: 92
  • Misinformation Harms - Causing material harm by disseminating misinformation e.g. in medicine or law: 63
  • Human-chatbot Interaction Harms - Mental health or overreliance crisis: 67
  • Human-chatbot Interaction Harms - Treat chatbot as a human: 50

Data Collection

Overview

This section details how the researchers collected 939 risky questions and the corresponding responses from six different Large Language Models (LLMs). The questions are designed so that a responsible LLM should refuse to answer them. The researchers used a novel three-round conversation strategy with GPT-4 to generate these questions, addressing the challenge of eliciting risky content. They also describe how they handled borderline cases and filled in question templates with specific risky scenarios. Finally, they collected responses from three commercial and three open-source LLMs and report statistics on the length of these responses.
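
The exact prompts used in the three-round strategy are not reproduced in this summary, so the sketch below only illustrates the general shape of such a loop against the OpenAI chat API. The model name, prompt wording, and post-processing are hypothetical placeholders, not the authors' actual protocol.

```python
# Illustrative multi-round question-collection loop (NOT the paper's exact prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(messages):
    """One call to GPT-4; returns the assistant's reply text."""
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

# Hypothetical three-round conversation: each round builds on the previous reply
# to coax the model into listing questions it would refuse to answer.
rounds = [
    "What kinds of questions should a responsible assistant refuse to answer?",
    "For the category 'assisting illegal activities', list 10 example questions "
    "that an assistant should refuse to answer.",
    "Rewrite those examples as direct user questions, one per line.",
]

messages = []
for prompt in rounds:
    messages.append({"role": "user", "content": prompt})
    reply = chat(messages)
    messages.append({"role": "assistant", "content": reply})

risky_questions = [line.strip() for line in reply.splitlines() if line.strip()]
```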

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 2

Table 2 presents the average number of words in the responses of six different Large Language Models (LLMs) across twelve different harm types. The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. The harm types are numbered from 1 to 12, each representing a specific category of harmful prompts, as defined in Table 1. The table also includes an average ("Avg") column for each LLM, representing the average number of words across all harm types for that specific model.

First Mention

Text: "Table 2: Average number of words in the LLM responses across the different harm types."

Context: This table is introduced in the "Data Collection" section on page 6, after describing the process of collecting responses from the six LLMs.

Relevance: This table is relevant because it provides insights into the response patterns of different LLMs. The average number of words can indicate how verbose or concise the models are when addressing different types of harmful prompts. This information can be useful for understanding the models' safety strategies and potential vulnerabilities.

Critique
Visual Aspects
  • The table could benefit from clearer headings. Instead of just numbers, the harm types could be briefly described for better context.
  • The table is dense and could be more visually appealing. Consider using alternating row colors or other visual cues to improve readability.
  • The font size is small, making it difficult to read the data quickly.
Analytical Aspects
  • The table presents only average word counts. While this provides a general overview, it doesn't capture the full distribution of response lengths. Including standard deviations or other measures of dispersion would be helpful.
  • The table doesn't explain the significance of the differences in word counts between LLMs and harm types. Are these differences statistically significant? What do they imply about the models' behavior?
  • The table doesn't connect the word counts to the safety or harmfulness of the responses. Does a longer response necessarily indicate a safer or more harmful response?
Numeric Data
  • GPT-4 Avg: 63.0 words
  • ChatGPT Avg: 88.7 words
  • Claude Avg: 67.2 words
  • ChatGLM2 Avg: 113.4 words
  • LLaMA-2 Avg: 145.7 words
  • Vicuna Avg: 115.7 words
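
As a rough illustration, statistics like these averages could be derived from the collected responses with a short pandas computation. The sketch below assumes a hypothetical per-response table with model, harm_type, and response columns, which is not necessarily how the released data is organized.

```python
import pandas as pd

# Hypothetical layout: one row per (model, question) pair with the model's response text.
df = pd.DataFrame({
    "model": ["GPT-4", "GPT-4", "LLaMA-2"],
    "harm_type": [1, 2, 1],
    "response": [
        "I'm sorry, but I can't help with that.",
        "I cannot assist with this request.",
        "I'm not able to provide that information, and here is why ...",
    ],
})

df["n_words"] = df["response"].str.split().str.len()

# Average words per model and harm type (rows roughly as in Table 2),
# plus an overall per-model average column.
table2 = df.pivot_table(index="model", columns="harm_type", values="n_words", aggfunc="mean")
table2["Avg"] = df.groupby("model")["n_words"].mean()
print(table2.round(1))
```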

Response Annotation and Assessment

Overview

This section describes the process of annotating and assessing the responses of six LLMs to the risky questions in the Do-Not-Answer dataset. The researchers used a two-level assessment: first, classifying responses as either safe or harmful, and second, categorizing them into six refined action categories. The guidelines for determining harmfulness are outlined, along with a detailed explanation of each action category. The section also discusses the human evaluation process and initial findings regarding the safety performance of the different LLMs.
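
The two assessment levels are linked by a simple rule stated with Table 3: responses in categories 0-4 are generally harmless, while category 5 (directly following the risky instruction) is treated as harmful. A minimal sketch, with category descriptions paraphrased from this summary and the helper name chosen for illustration:

```python
# Two-level assessment used in the paper: a refined action category (0-5, Table 3)
# plus a binary harmful/harmless judgment. Descriptions below are short paraphrases
# of those mentioned in this summary; categories 3 and 4 are not described here,
# so they are left as placeholders (see Table 3 in the paper for the full wording).
ACTION_LABELS = {
    0: "refuses to answer / declines to assist",
    1: "refutes the opinion or false premise in the question",
    2: "discusses the issue from dual perspectives",
    3: "see Table 3",
    4: "see Table 3",
    5: "follows the risky instruction directly",
}

def is_harmful(action_category: int) -> bool:
    """General rule: only category-5 responses count as harmful (categories 0-4
    are generally harmless). Appendix D lists the few cases where human judgment
    disagrees with this rule."""
    return action_category == 5
```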

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 3

Table 3 describes six categories of LLM responses to risky prompts, ranging from refusal to answer to directly following the prompt. Each category is assigned a numerical label (0-5) and a detailed description with examples. Categories 0-4 represent generally harmless responses, while category 5 indicates a harmful response where the LLM follows the risky instruction.

First Mention

Text: "Table 3: Six response categories. Responses falling into 0-4 are generally harmless, while those in category 5 are harmful."

Context: This table appears in the 'Response Annotation and Assessment' section on page 7, after a discussion of the guidelines for evaluating LLM responses.

Relevance: This table is essential for understanding how the researchers categorized and assessed the LLM responses. It provides the framework for the qualitative analysis of LLM safety and informs the subsequent quantitative evaluation.

Critique
Visual Aspects
  • The table could benefit from more visual separation between the rows. Alternating row colors or subtle lines could improve readability.
  • The example column could be wider to accommodate longer text without wrapping.
  • Consider using a more distinct font or formatting for the category labels (0-5) to make them stand out.
Analytical Aspects
  • The table doesn't explicitly define what constitutes a 'risky' prompt. While the context provides some clues, a more precise definition would be helpful.
  • The distinction between some categories, such as 1 ('refute the opinion') and 2 ('discuss from dual perspectives'), could be clearer. Providing more distinct examples for each category would be beneficial.
  • The table doesn't discuss the implications of each response category for LLM safety. Are some harmless categories considered 'better' than others?
Numeric Data
figure 6

Figure 6 shows five heatmaps, each representing a different risk area (Information Hazards, Malicious Uses, Discrimination/Toxicity, Misinformation Harms, and Human-chatbot Interaction Harms). Each heatmap displays the distribution of six action categories (0-5, as defined in Table 3) across six different LLMs (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna). The color intensity in each cell of the heatmap corresponds to the frequency of a specific action category for a given LLM within a particular risk area. Darker colors indicate higher frequencies.

First Mention

Text: "Figure 6: The action category distribution given a specific risk area for the different models."

Context: This figure is presented in the 'Response Annotation and Assessment' section on page 9. It follows a discussion of the human evaluation of LLM responses and the observed action category patterns.

Relevance: This figure is highly relevant as it visually summarizes the key findings of the LLM safety evaluation. It allows for easy comparison of how different LLMs respond to various types of risky prompts, highlighting their strengths and weaknesses in different risk areas.

Critique
Visual Aspects
  • The labels for the action categories (0-5) could be included directly on the heatmaps or in a separate legend for easier interpretation.
  • The color scheme could be improved for better contrast and accessibility. Consider using a colorblind-friendly palette.
  • The figure could benefit from a clearer title that more explicitly states what the heatmaps represent.
Analytical Aspects
  • The figure doesn't provide any statistical analysis of the observed differences between LLMs. Are these differences statistically significant?
  • The figure doesn't offer any explanation for the observed patterns. Why do some LLMs perform better in certain risk areas than others?
  • The figure doesn't discuss the implications of these findings for LLM safety and deployment. What are the practical consequences of these different response patterns?
Numeric Data
figure 7

This figure presents three heatmaps, each representing a specific harm type and showing the distribution of action categories across six different large language models (LLMs). The harm types are 'Assisting illegal activities,' 'Social stereotypes/unfair discrimination,' and 'Cause material harms in medicine/law.' The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. Each cell in a heatmap represents the frequency of a specific action category (0-5, as defined in Table 3) taken by a particular LLM when responding to a prompt related to the given harm type. The color intensity of each cell reflects the frequency, with darker colors indicating higher frequencies.

First Mention

Text: "Figure 7: The action category distribution for Assisting illegal activities, Stereotype and discrimination, and Medicine or law harms."

Context: This figure appears in the 'Response Annotation and Assessment' section on page 9, after a discussion of how the models' responses were categorized into different action categories.

Relevance: This figure is relevant because it provides a detailed view of how different LLMs respond to specific types of harmful prompts. It helps to identify patterns in the models' behavior and assess their ability to avoid generating harmful content in different contexts.

Critique
Visual Aspects
  • The figure could benefit from more descriptive labels for the action categories. While the numbers 0-5 are defined in Table 3, briefly including the descriptions in the figure itself would improve readability.
  • The color scheme could be improved for better contrast and accessibility. Some color combinations might be difficult for individuals with color vision deficiencies to distinguish.
  • The font size used for the LLM names and harm types is small, making it challenging to read without zooming in.
Analytical Aspects
  • The figure doesn't provide any information on the number of prompts used for each harm type. Knowing the sample size would help to interpret the frequencies.
  • The figure doesn't discuss the statistical significance of the observed differences in action category distributions. Are these differences statistically meaningful?
  • The figure doesn't offer any insights into the reasons behind the observed patterns. Why do some LLMs tend to take certain actions more frequently than others for specific harm types?
Numeric Data

Automatic Response Evaluation

Overview

This section explores automatic methods for evaluating LLM responses to risky prompts, focusing on efficiency and scalability. Two main methods are presented: using GPT-4 as an evaluator and training a smaller, PLM-based classifier. Experiments show that the PLM-based classifier, specifically Longformer, achieves comparable performance to GPT-4, suggesting a cost-effective alternative for automatic safety evaluation.
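
A minimal sketch of the PLM-based route, assuming a standard HuggingFace setup: fine-tune a Longformer sequence classifier on question-response pairs labeled with the six action categories from Table 3. The checkpoint, hyperparameters, and data layout below are placeholders rather than the authors' exact configuration.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=6)

# Hypothetical training data: each example is a question/response pair with an
# action-category label (0-5, Table 3).
train = Dataset.from_dict({
    "question": ["How can I obtain someone's home address without asking them?"],
    "response": ["I'm sorry, but I can't help with that."],
    "label": [0],
})

def tokenize(batch):
    # Encode question and response as a sentence pair.
    return tokenizer(batch["question"], batch["response"],
                     truncation=True, max_length=1024)

train = train.map(tokenize, batched=True)

args = TrainingArguments(output_dir="longformer-action-clf",
                         per_device_train_batch_size=4,
                         num_train_epochs=3,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```

The binary harmful-response detector evaluated in Table 5 could be trained analogously, e.g. with num_labels=2 and harmful/harmless labels.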

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 4

Table 4 presents the results of action classification for six different Large Language Models (LLMs), comparing their performance using two evaluation methods: GPT-4 and Longformer. The metrics used are Accuracy, Precision, Recall, and F1 score. The table provides values for each LLM individually (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna) evaluated by both GPT-4 and Longformer. An "Overall" row shows the average performance across all models for each metric and evaluation method, including standard deviations to indicate the variability in performance.

First Mention

Text: "Table 4: Action classification results (%) for each LLM."

Context: This table appears in the "Automatic Response Evaluation" section on page 11. It follows a description of the automatic evaluation methods used and their experimental setup.

Relevance: This table is highly relevant as it presents the core results of the automatic evaluation of LLM action classification. It allows for a direct comparison of the performance of different LLMs and evaluation methods, demonstrating the effectiveness of the proposed Longformer-based approach.

Critique
Visual Aspects
  • The table is dense and could benefit from visual cues like alternating row colors or bolder lines separating the LLMs to improve readability.
  • The abbreviations for the metrics (Accuracy, Precision, Recall, F1) could be spelled out fully for clarity, especially for readers unfamiliar with these terms.
  • The overall row could be visually separated more distinctly from the individual LLM rows to emphasize its significance.
Analytical Aspects
  • While the overall row includes standard deviations, it would be helpful to include standard deviations for each individual LLM as well to provide a more complete picture of the performance variability.
  • The table doesn't provide any discussion of the statistical significance of the observed differences between LLMs and evaluation methods. Are these differences statistically meaningful?
  • The table doesn't offer any explanation for the observed performance patterns. Why does Longformer perform better for commercial LLMs than open-source LLMs? What are the implications of the large performance gap for LLaMA-2?
Numeric Data
  • GPT-4 Accuracy: 91.3 %
  • Longformer Accuracy: 88.8 %
  • GPT-4 Precision: 86.3 %
  • Longformer Precision: 82.3 %
  • GPT-4 Recall: 89.2 %
  • Longformer Recall: 85.3 %
  • GPT-4 F1: 87.1 %
  • Longformer F1: 83.0 %
table 5

Table 5 presents the results of harmful response detection for various LLMs, evaluated using three methods: Human evaluation, GPT-4, and Longformer. The table shows the Accuracy, Precision, Recall, and F1 scores for each LLM (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna) under each evaluation method. An "Overall" row provides the average performance across all models for each metric and evaluation method, including standard deviations.

First Mention

Text: "Table 5: Harmful response detection results (%) for each LLM."

Context: This table is located in the "Automatic Response Evaluation" section on page 11. It directly follows Table 4, which presents the action classification results.

Relevance: This table is highly relevant as it shows the performance of different LLMs and evaluation methods on the crucial task of harmful response detection. It provides a direct comparison between human evaluation and the proposed automatic methods (GPT-4 and Longformer), demonstrating the effectiveness of the Longformer-based approach.

Critique
Visual Aspects
  • Similar to Table 4, this table could benefit from visual enhancements like alternating row colors or bolder separator lines to improve readability.
  • The overall row could be highlighted more prominently to distinguish it from the individual LLM rows.
  • The table could be made more concise by presenting the overall averages and standard deviations in a separate, smaller table.
Analytical Aspects
  • The table doesn't discuss the statistical significance of the observed differences between LLMs and evaluation methods. Are the differences between human evaluation and automatic evaluation statistically significant?
  • The table doesn't provide any explanation for the performance patterns. Why does Longformer achieve comparable results to GPT-4? What are the implications of the relatively lower performance of some LLMs on harmful response detection?
  • The table could benefit from a discussion of the practical implications of these results. How can these findings be used to improve LLM safety and deployment?
Numeric Data
  • Human Accuracy: 98.4 %
  • GPT-4 Accuracy: 98.1 %
  • Longformer Accuracy: 80.4 %
  • Human Precision: 84.6 %
  • GPT-4 Precision: 79.2 %
  • Longformer Precision: 87.1 %
  • Human Recall: 92.1 %
  • GPT-4 Recall: 83.1 %
  • Longformer Recall: 83.8 %
  • Human F1: 87.1 %
  • GPT-4 F1: 80.4 %
  • Longformer F1: 83.0 %
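
Given human-annotated labels and an automatic evaluator's predictions, metrics like those in Tables 4 and 5 can be computed with scikit-learn as below. The macro averaging and the toy labels are assumptions for illustration; the averaging scheme is not specified in this summary.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true: human-annotated action categories (0-5); y_pred: automatic evaluator's labels.
y_true = [0, 5, 2, 0, 1, 5]
y_pred = [0, 5, 2, 4, 1, 0]

acc = accuracy_score(y_true, y_pred)
# Averaging scheme is an assumption made for this sketch.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Accuracy {acc:.1%}  Precision {prec:.1%}  Recall {rec:.1%}  F1 {f1:.1%}")
```
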
table 6

This table presents the percentage of harmless responses generated by each of the six evaluated LLMs: LLaMA-2, ChatGPT, Claude, GPT-4, Vicuna, and ChatGLM2. A higher percentage indicates a better safety performance, meaning the model is more likely to avoid generating harmful content in response to risky prompts.

First Mention

Text: "Table 6: Proportion of harmless responses of each LLM (%; higher is better)."

Context: This table is mentioned on page 11, within the 'Automatic Response Evaluation' section. It summarizes the overall safety performance of the LLMs based on human evaluation.

Relevance: This table is highly relevant because it provides a direct comparison of the overall safety performance of the six LLMs. It summarizes the key findings of the human evaluation and allows for a quick assessment of which models are better at avoiding harmful responses.

Critique
Visual Aspects
  • The table could benefit from visual cues, such as color gradients or bolding, to highlight the best and worst performing models.
  • The table could be made more visually appealing by using alternating row colors or other formatting enhancements.
  • The table's caption could be more descriptive, explicitly mentioning that the percentages represent the proportion of *harmless* responses.
Analytical Aspects
  • The table doesn't provide any information on the statistical significance of the differences in harmless response rates between the models. Are these differences statistically meaningful?
  • The table doesn't offer any insights into the reasons behind the observed performance differences. Why does one model perform better than another?
  • The table doesn't discuss the implications of these findings for practical applications. What are the consequences of different harmless response rates for real-world LLM deployment?
Numeric Data
  • LLaMA-2: 99.7 %
  • ChatGPT: 98.5 %
  • Claude: 98.3 %
  • GPT-4: 97.6 %
  • Vicuna: 94.5 %
  • ChatGLM2: 90.9 %
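
These figures are simple proportions of responses judged harmless out of the 939 questions per model. A short sketch of the computation, with a hypothetical annotation file and column names:

```python
import pandas as pd

df = pd.read_csv("human_annotations.csv")  # columns assumed: model, harmful (0/1)

# Proportion of harmless responses per model, as in Table 6 (higher is better).
harmless_rate = (1 - df.groupby("model")["harmful"].mean()) * 100
print(harmless_rate.sort_values(ascending=False).round(1))
```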

Conclusion

Overview

This paper introduced a three-level taxonomy for evaluating the risks of harm from LLMs, created a dataset of 939 risky questions and 5,000+ responses from six LLMs, and defined criteria for safe and responsible answers. The authors found that LLaMA-2 was the safest model and that a small fine-tuned model (Longformer) could achieve evaluation results comparable to GPT-4's.

Key Aspects

Strengths

Suggestions for Improvement

Limitations and Future Work

Overview

This section outlines the limitations of the current research and suggests directions for future work. The primary limitations relate to the data collection process, specifically the focus on only risky instructions and the limited dataset size. The evaluation scope is also limited to English, single-turn, and zero-shot settings. Future work will address these limitations by including non-risky instructions, expanding the dataset, collecting multi-label annotations, and extending the evaluation to other languages, multi-turn conversations, and few-shot settings.

Key Aspects

Strengths

Suggestions for Improvement

Protected Groups

Overview

This appendix lists the protected groups considered when generating the risky question set for the LLM safety evaluation. These groups include race, religion, gender, organization, and individual names. The inclusion of these groups aims to ensure that the evaluation covers a broad range of potential biases and vulnerabilities in LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Three Harmful Responses of LLaMA-2

Overview

This appendix showcases the only three instances where LLaMA-2, generally the safest model tested, produced harmful responses. These instances all relate to risky over-reliance, where the model's human-like tone could lead users to depend on it inappropriately for emotional support or sensitive information.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 7

Table 7 presents three examples of harmful responses generated by the LLaMA-2 model. Each example includes a question posed to the model and the corresponding response. These responses are considered harmful because they exhibit a human-like tone, which could lead users to over-rely on the LLM for emotional support or sensitive information, creating a potential risk.

First Mention

Text: "Table 7: Three harmful responses of LLaMA-2 since human-tone responses may result in risky over-reliance."

Context: This table is presented in Appendix B on page 15. It follows a brief discussion of LLaMA-2's overall safety performance and its ranking as the safest among the evaluated LLMs.

Relevance: This table is highly relevant because it provides concrete examples of LLaMA-2's failures, despite its overall high safety ranking. It highlights a specific vulnerability related to human-like responses and potential over-reliance, offering valuable insights into the limitations of current LLM safety mechanisms.

Critique
Visual Aspects
  • The table could benefit from clearer visual separation between the examples. Adding more whitespace or lines between rows would improve readability.
  • The 'Title' column could be removed as the alternating 'Question' and 'Response' labels already provide sufficient structure.
  • The table could be made more visually appealing by using alternating row colors or other formatting enhancements.
Analytical Aspects
  • The table only provides three examples, which might not be representative of all potential harmful human-like responses. Including more diverse examples would strengthen the analysis.
  • The table doesn't offer a detailed explanation of why these specific responses are considered harmful. Elaborating on the potential risks of over-reliance would be beneficial.
  • The table doesn't discuss potential mitigation strategies for this type of harmful response. Suggesting ways to make LLM responses less human-like or to discourage over-reliance would be valuable.
Numeric Data

Response Action Category over Harm Types

Overview

This appendix visually represents the distribution of response action categories across different Large Language Models (LLMs) for various harm types. It highlights the observation that models tend to exhibit specific response patterns depending on the type of harm presented in the prompt.
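
Heatmaps of this kind (cf. Figures 6, 7, and 9) could be reproduced from the annotated responses with pandas and seaborn, as in the sketch below; the file name and column names are assumptions about how the annotations might be stored.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical layout of the annotated responses: one row per (model, question) pair.
df = pd.read_csv("annotated_responses.csv")  # columns assumed: model, harm_type, action

models = ["GPT-4", "ChatGPT", "Claude", "ChatGLM2", "LLaMA-2", "Vicuna"]
harm_types = sorted(df["harm_type"].unique())

fig, axes = plt.subplots(1, len(harm_types), figsize=(3.5 * len(harm_types), 3), sharey=True)
for ax, harm in zip(axes, harm_types):
    sub = df[df["harm_type"] == harm]
    # Rows: models; columns: action categories 0-5; cells: response counts.
    counts = (sub.groupby(["model", "action"]).size()
                 .unstack(fill_value=0)
                 .reindex(index=models, columns=range(6), fill_value=0))
    sns.heatmap(counts, ax=ax, cmap="Blues", annot=True, fmt="d", cbar=False)
    ax.set_title(str(harm))
plt.tight_layout()
plt.savefig("action_category_heatmaps.png", dpi=200)
```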

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 9

Figure 9 presents six heatmaps illustrating the distribution of action categories taken by six large language models (LLMs) in response to prompts related to six different harm types. Each heatmap corresponds to a specific harm type, such as leaking sensitive information or generating adult content. The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. The action categories (0-5) represent different ways the LLMs can respond, ranging from refusing to answer to directly following the harmful instruction. The color intensity in each cell of the heatmap indicates the frequency of a specific action category for a given LLM and harm type, with darker colors representing higher frequencies. This allows for a visual comparison of how different LLMs handle various types of risky prompts.

First Mention

Text: "Figure 9: Given a specific harm type, refined response category distribution across models."

Context: This figure is introduced in Appendix C on page 16, which focuses on analyzing the response action categories of different LLMs across various harm types.

Relevance: This figure is highly relevant because it provides a detailed breakdown of how different LLMs respond to various types of harmful prompts. It visually summarizes the models' behavior across different risk categories, allowing for a direct comparison of their safety performance and the effectiveness of their safety mechanisms.

Critique
Visual Aspects
  • The figure could benefit from clearer labels for the action categories. While the numbers 0-5 are defined in Table 3, briefly including the descriptions in the figure itself or providing a separate legend would improve readability.
  • The color scheme could be improved for better contrast and accessibility. Consider using a colorblind-friendly palette and ensuring sufficient contrast between adjacent colors.
  • The font size used for the LLM names and harm types is relatively small, making it difficult to read without zooming in. Increasing the font size would improve clarity.
Analytical Aspects
  • The figure doesn't provide any information on the number of prompts used for each harm type. Knowing the sample size would help to interpret the frequencies and understand the statistical significance of the observed patterns.
  • The figure doesn't discuss the statistical significance of the differences in action category distributions between LLMs. Are the observed differences statistically meaningful or due to random variation?
  • The figure doesn't offer any explanation for the observed patterns. Why do some LLMs tend to take certain actions more frequently than others for specific harm types? Connecting these patterns to the models' training or architecture would provide valuable insights.
Numeric Data
table 8

Table 8 provides statistics on instances where the assigned action category (a measure of harmfulness based on the six categories defined in Table 3) contradicts the actual harmfulness of the LLM's response, as determined by human evaluation. These are referred to as 'mismatched cases.' Two types of mismatches are presented: (1) where the action category is 5 (indicating a harmful response according to Table 3), but the response is judged harmless by human annotators; and (2) where the action category is 0-4 (indicating a harmless response), but the response is judged harmful. The table shows the number of occurrences of each mismatch type for each of the six LLMs: GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. A 'Total' column sums the two mismatch types for each LLM.

First Mention

Text: "Table 8: Statistics of mismatched cases of each mode for the six models."

Context: This table is introduced in Appendix D on page 16, which discusses mismatched cases where the assigned action category doesn't align with the human-judged harmfulness of the response.

Relevance: This table is relevant because it highlights potential limitations or inconsistencies in the action category classification scheme. It shows that relying solely on the action categories might not always accurately reflect the true harmfulness of an LLM's response, as determined by human judgment. This underscores the importance of human evaluation and the need for further refinement of automatic evaluation methods.

Critique
Visual Aspects
  • The table could be more visually appealing by using alternating row colors or other formatting enhancements.
  • The labels for the mismatch types could be made more descriptive, for example, '(1) Action 5, Harmless Response' and '(2) Action 0-4, Harmful Response'.
  • The table could benefit from a clearer title that explicitly mentions the two types of mismatches being presented.
Analytical Aspects
  • The table only presents counts of mismatched cases. Providing examples of these mismatches would help to understand the nature of the discrepancies and the reasons behind them.
  • The table doesn't discuss the potential causes of these mismatches. Are they due to limitations in the action category definitions, errors in human annotation, or other factors?
  • The table doesn't discuss the implications of these mismatches for the overall evaluation of LLM safety. How do these discrepancies affect the conclusions drawn about the models' performance?
Numeric Data
  • GPT-4 Total Mismatches: 2
  • ChatGPT Total Mismatches: 1
  • Claude Total Mismatches: 3
  • ChatGLM2 Total Mismatches: 16
  • LLaMA-2 Total Mismatches: 2
  • Vicuna Total Mismatches: 10

Mismatched Cases

Overview

This appendix presents cases where the assigned action category (a measure of how the LLM responded) did not match the human judgment of whether the response was actually harmful. These "mismatched cases" highlight situations where an LLM's response might be categorized as safe based on its action (such as refusing to answer) even though the actual content of the response is harmful, or vice versa. This points to the limitations of relying solely on action categories to judge safety and emphasizes the importance of careful content review.
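
The two mismatch types counted in Table 8 can be pulled directly out of the two annotation columns; a sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("human_annotations.csv")  # columns assumed: model, action (0-5), harmful (0/1)

# Type 1: action category 5 (follows the instruction) but judged harmless.
type1 = (df["action"] == 5) & (df["harmful"] == 0)
# Type 2: action category 0-4 (does not follow) but judged harmful.
type2 = (df["action"] < 5) & (df["harmful"] == 1)

mismatches = pd.DataFrame({
    "type1": type1.groupby(df["model"]).sum(),
    "type2": type2.groupby(df["model"]).sum(),
})
mismatches["total"] = mismatches.sum(axis=1)
print(mismatches)  # compare against the counts reported in Table 8
```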

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 8

This table shows the number of mismatched cases for each of the six tested large language models (LLMs). A mismatch occurs when the predicted action category (0-5, representing how the LLM responded to a prompt) doesn't align with the human judgment of whether the response was actually harmful. There are two types of mismatches: 1) the model followed the prompt (category 5), but the response was judged harmless, and 2) the model didn't follow the prompt (categories 0-4), but the response was judged harmful. The table lists the number of each type of mismatch and the total number of mismatches for each LLM.

First Mention

Text: "Table 8: Statistics of mismatched cases of each mode for the six models."

Context: Appendix D introduces this table to explain discrepancies between the assigned action categories and the human-labeled harmfulness of LLM responses.

Relevance: This table is important because it highlights the limitations of using action categories alone to determine harmfulness. It shows that the automated classification doesn't always agree with human judgment, suggesting a need for improvement in the automated evaluation methods or a deeper understanding of what constitutes a harmful response.

Critique
Visual Aspects
  • The table could be more visually clear by using headings like 'Mismatch Type 1', 'Mismatch Type 2', and 'Total' instead of just '(1)', '(2)', and 'Total'.
  • Alternating row colors or light gridlines could improve readability.
  • The table could benefit from a more descriptive caption, such as 'Number of Mismatched Cases between Predicted Action Category and Human Judgment of Harmfulness'.
Analytical Aspects
  • The table would be more informative if it included the percentage of mismatches relative to the total number of responses for each LLM. This would help understand the severity of the mismatch problem for each model.
  • The table doesn't explain the reasons behind these mismatches. Providing examples or a qualitative analysis of the mismatched cases would offer valuable insights.
  • The table doesn't discuss the implications of these findings for the overall evaluation. How do these mismatches affect the conclusions about the relative safety of different LLMs?
Numeric Data
  • GPT-4 Mismatch Type 1: 2
  • GPT-4 Mismatch Type 2: 0
  • GPT-4 Total: 2
  • ChatGPT Mismatch Type 1: 1
  • ChatGPT Mismatch Type 2: 0
  • ChatGPT Total: 1
  • Claude Mismatch Type 1: 3
  • Claude Mismatch Type 2: 0
  • Claude Total: 3
  • ChatGLM2 Mismatch Type 1: 12
  • ChatGLM2 Mismatch Type 2: 4
  • ChatGLM2 Total: 16
  • LLaMA-2 Mismatch Type 1: 0
  • LLaMA-2 Mismatch Type 2: 2
  • LLaMA-2 Total: 2
  • Vicuna Mismatch Type 1: 3
  • Vicuna Mismatch Type 2: 7
  • Vicuna Total: 10
table 9

This table provides specific examples of mismatched cases where the assigned action category doesn't align with the human-labeled harmfulness of the response. Each row shows an example from either ChatGLM2 or Vicuna. The table includes the model, whether the response was judged harmful, the assigned action category, the original question, the model's response, and the reason for the mismatch. For example, one ChatGLM2 response was classified as refusing to assist (category 0) but was judged harmful because it provided risky instructions. Another ChatGLM2 response followed the prompt (category 5), offering to interpret blood test results, but was judged harmless in the context of a single-turn chat.

First Mention

Text: "Table 9: Mismatched examples in ChatGLM2 and Vicuna. Bold text indicates the refined label of the responses while the whole content reflects its harmfulness."

Context: Appendix D presents this table after introducing the concept of mismatched cases in Table 8.

Relevance: This table is crucial for understanding the nature of the mismatches described in Table 8. By providing concrete examples, it helps to illustrate the limitations of the automated action category classification and the complexities of evaluating LLM harmfulness.

Critique
Visual Aspects
  • The table is quite dense and could benefit from improved formatting. Using alternating row colors or clearer dividers between examples would enhance readability.
  • The 'Title' column is a bit confusing as it contains multiple subheadings. Separating these into individual columns (Model, Harmful, Refined_type, Question, Response, Reason) would make the table clearer.
  • The 'Reason' column could be placed next to the 'Refined_type' column to make the connection between the action category and the reason for the mismatch more direct.
Analytical Aspects
  • The table only shows four examples, which might not be representative of all types of mismatches. Including more examples or a more detailed analysis of the different types of mismatches would be beneficial.
  • The table would be more informative if it included the original harm type associated with each question. This would provide more context for understanding the mismatch.
  • The table doesn't discuss potential solutions for addressing these mismatches. Suggesting ways to improve the action category classification or the human evaluation guidelines would be valuable.
Numeric Data