Do-Not-Answer: A Safety Benchmark for LLMs

Overall Summary

Overview

This paper introduces "Do-Not-Answer," an open-source dataset designed to evaluate the safety of large language models (LLMs). The dataset comprises 939 prompts that responsible LLMs should refuse to answer, organized by a three-level hierarchical risk taxonomy covering five risk areas (information hazards, malicious uses, discrimination/exclusion/toxicity, misinformation harms, and human-chatbot interaction harms). Six popular LLMs (GPT-4, ChatGPT, Claude, LLaMA-2, ChatGLM2, and Vicuna) were evaluated on this dataset, revealing varying levels of safety performance. Additionally, the researchers developed smaller, more efficient BERT-like classifiers that achieve performance comparable to GPT-4, demonstrating their potential as a cost-effective alternative for automatic safety evaluation.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 1

Description: This table presents the distribution of questions across the five risk areas and twelve harm types defined in the taxonomy, providing insights into the composition and scope of the Do-Not-Answer dataset.

Relevance: The table demonstrates the coverage of various risk categories and allows researchers to understand the dataset's focus and potential biases. It also helps in interpreting the evaluation results in the context of the dataset composition.

Figure 6

Description: This figure displays five heatmaps showing the distribution of action categories across six LLMs for each risk area, providing a visual summary of model behavior across different types of risky prompts.

Relevance: The heatmaps offer a clear and concise visualization of the models' strengths and weaknesses in handling various risk areas, allowing for quick comparison of their safety performance and identification of areas for improvement.

Conclusion

This paper makes a significant contribution to the field of LLM safety by introducing a comprehensive risk taxonomy, a novel dataset of risky prompts, and an evaluation of popular LLMs. The finding that LLaMA-2 performed best in terms of safety and the promising results of the smaller, Longformer-based automatic evaluation method are valuable insights for developers. Future research should focus on expanding the dataset with non-risky instructions, extending the evaluation to other languages and multi-turn conversations, and further refining automatic evaluation techniques to improve the accuracy and scalability of LLM safety assessments. These advancements are crucial for the responsible development and deployment of LLMs in real-world applications.

Section Analysis

Abstract

Overview

This paper introduces "Do-Not-Answer," an open-source dataset designed to evaluate the safety of large language models (LLMs). The dataset focuses on prompts that responsible LLMs should refuse to answer. The researchers evaluated six popular LLMs using this dataset and found varying levels of safety performance. They also developed smaller, BERT-like classifiers that can automatically evaluate LLM safety with comparable effectiveness to GPT-4.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

Large language models (LLMs) are rapidly evolving, demonstrating both beneficial and harmful emergent capabilities. This necessitates evaluating and mitigating these risks, especially for open-source LLMs, which often lack robust safety mechanisms. This paper introduces the "Do-Not-Answer" dataset, an open-source resource designed to evaluate LLM safeguards by focusing on prompts that responsible models should refuse to answer. This dataset aims to enable safer development and deployment of open-source LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Related Work

Overview

This section discusses existing research on the risks of deploying Large Language Models (LLMs), including studies focusing on specific risk areas like bias, toxicity, and misinformation, and work on holistic risk evaluation. It highlights the limitations of current datasets and emphasizes the need for a comprehensive taxonomy and open-source dataset for evaluating LLM safety.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 1

This figure presents a radar chart comparing six large language models (LLMs) across different safety aspects. Each axis of the radar chart represents a category of potential harm, such as 'Information Hazards' or 'Malicious Uses'. The performance of each LLM is represented by a colored line, forming a polygon on the chart. A point closer to the outer edge of the chart indicates a higher score in that category, meaning the LLM is better at avoiding that specific harm. The LLMs compared are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna.

First Mention

Text: "Figure 1: A comprehensive evaluation of LLM safeguards."

Context: This figure is presented at the beginning of the 'Related Work' section on page 2. It follows a discussion of the need for better safety evaluations in LLMs.

Relevance: This figure is highly relevant because it visually summarizes the core evaluation performed in the paper. It allows for a quick comparison of the safety performance of different LLMs across various risk categories.

Critique
Visual Aspects
  • One axis label is missing, making it difficult to interpret that dimension of the comparison.
  • The scale on the axes (0.8 to 1.0) is quite narrow, potentially obscuring subtle differences in performance.
  • The colors used for the LLM lines are somewhat similar, making it challenging to distinguish them at a glance.
Analytical Aspects
  • The figure lacks any indication of statistical significance. It's unclear whether the observed differences in performance are statistically meaningful.
  • The figure doesn't provide any context or explanation for why certain LLMs perform better or worse in specific categories.
  • The choice of categories and their relative importance in the overall safety assessment are not discussed.
Numeric Data

Safety Taxonomy

Overview

This section introduces a three-level hierarchical taxonomy for classifying the risks associated with large language models (LLMs), particularly text-only models. It builds upon previous research and covers five key risk areas: information hazards, malicious uses, discrimination/exclusion/toxicity, misinformation harms, and human-chatbot interaction harms. The taxonomy provides a detailed breakdown of potential hazards, outlining how these risks manifest and giving examples of the types of questions or prompts that could trigger them.
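
To make the hierarchy concrete, the taxonomy can be written down as a simple nested mapping from risk areas to harm types, with the per-category question counts taken from Table 1 below. This is a minimal sketch: the dictionary name and the abbreviated harm-type labels are illustrative, not taken from the paper's released code.

```python
# Sketch of the five risk areas and twelve harm types in the paper's taxonomy,
# with question counts as reported in Table 1 (harm-type names abbreviated).
DO_NOT_ANSWER_TAXONOMY = {
    "Information Hazards": {
        "Sensitive organization/government information": 136,
        "Private individual information": 112,
    },
    "Malicious Uses": {
        "Assisting illegal activities": 132,
        "Unethical or unsafe actions": 71,
        "Disinformation campaigns": 40,
    },
    "Discrimination, Exclusion, Toxicity, Hateful, Offensive": {
        "Social stereotypes and unfair discrimination": 95,
        "Toxic language (hate speech)": 53,
        "Adult content": 28,
    },
    "Misinformation Harms": {
        "False or misleading information": 92,
        "Material harm (e.g., medicine or law)": 63,
    },
    "Human-Chatbot Interaction Harms": {
        "Mental health or overreliance crisis": 67,
        "Treat chatbot as a human": 50,
    },
}

# Sanity check: the twelve harm-type counts sum to the 939 questions in the dataset.
total = sum(n for harms in DO_NOT_ANSWER_TAXONOMY.values() for n in harms.values())
assert total == 939, total
```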

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 2

This figure illustrates a three-level taxonomy of risks associated with Large Language Models (LLMs). The top level categorizes risks into five broad areas: Information Hazards, Malicious Uses, Discrimination/Toxicity, Misinformation Harms, and Human-Chatbot Interaction Harms. Each of these areas is then broken down into more specific harm types in the second level. The third level further details each harm type with specific examples. The taxonomy helps to organize and understand the various ways LLMs can potentially cause harm.

First Mention

Text: "Figure 2: Three-level taxonomy of LLM risks."

Context: This figure appears in the 'Safety Taxonomy' section on page 4, after a discussion of existing risk categorization efforts and the need for a more comprehensive taxonomy.

Relevance: This figure is crucial for understanding the structure and scope of the research. It provides the framework for the Do-Not-Answer dataset and guides the evaluation of LLM safety.

Critique
Visual Aspects
  • The figure could benefit from a clearer visual hierarchy. The connections between the levels could be more distinct, perhaps using different line weights or colors.
  • The text within the boxes is quite small, making it difficult to read without zooming in.
  • The figure is somewhat cluttered, making it challenging to follow the different branches of the taxonomy.
Analytical Aspects
  • The figure doesn't explain how the taxonomy was developed or validated. It's unclear whether it's based on empirical data or expert opinions.
  • The figure doesn't discuss the relative importance of the different risk areas or harm types. Are some risks considered more critical than others?
  • The figure doesn't provide any context on how the taxonomy will be used in the subsequent evaluation. How are the prompts in the Do-Not-Answer dataset mapped to this taxonomy?
Numeric Data
table 1

This table shows the number of questions in the Do-Not-Answer dataset that fall into each of the five risk areas and twelve harm types defined in the safety taxonomy. The table has three columns: 'Risk Area', 'Harm Type', and '# Q' (number of questions). It provides a breakdown of the dataset's composition and shows how the questions are distributed across different risk categories.

First Mention

Text: "Table 1: The number of questions (# Q) falling into our five risk areas and twelve harm types."

Context: This table is presented in the 'Safety Taxonomy' section on page 5, immediately after the introduction of the three-level risk taxonomy in Figure 2.

Relevance: This table is important because it provides a quantitative overview of the Do-Not-Answer dataset. It shows the distribution of questions across different risk categories, which is crucial for understanding the scope and focus of the evaluation.

Critique
Visual Aspects
  • The table could be more visually appealing. Using different colors or shading for the rows could improve readability.
  • The table is quite wide, making it difficult to fit on the page without wrapping text. Consider splitting it into two smaller tables or rotating it.
  • The '# Q' column could be renamed to something more descriptive, such as 'Number of Questions'.
Analytical Aspects
  • The table doesn't explain how the questions were generated or selected for inclusion in the dataset. What was the process for creating these questions?
  • The table doesn't discuss the rationale for the number of questions in each category. Why are there more questions for some harm types than others?
  • The table doesn't provide any context on how these numbers will be used in the subsequent evaluation. How will the distribution of questions affect the results?
Numeric Data
  • Information Hazards - Risks from leaking or inferring sensitive information (organization/gov): 136
  • Information Hazards - Compromise privacy by leaking or inferring private information (person/individual): 112
  • Malicious Uses - Assisting illegal activities: 132
  • Malicious Uses - Nudging or advising users to perform unethical or unsafe actions: 71
  • Malicious Uses - Reducing the cost of disinformation campaigns: 40
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Social stereotypes and unfair discrimination: 95
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Toxic language (hate speech): 53
  • Discrimination, Exclusion, Toxicity, Hateful, Offensive - Adult content: 28
  • Misinformation Harms - Disseminating false or misleading information: 92
  • Misinformation Harms - Causing material harm by disseminating misinformation e.g. in medicine or law: 63
  • Human-chatbot Interaction Harms - Mental health or overreliance crisis: 67
  • Human-chatbot Interaction Harms - Treat chatbot as a human: 50

Data Collection

Overview

This section details how the researchers collected 939 risky questions and the corresponding responses from six different Large Language Models (LLMs). The questions are designed so that a responsible LLM should refuse to answer them. The researchers used a novel three-round conversation strategy with GPT-4 to generate these questions, addressing the challenge of eliciting risky content. They also describe how they handled borderline cases and filled in question templates with specific risky scenarios. Finally, they collected responses from three commercial and three open-source LLMs and report statistics on the length of these responses.
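
The exact prompts used in the three-round strategy are not reproduced in this summary, so the sketch below only illustrates the general shape of such a loop against the OpenAI chat API. The model name, prompt wording, and post-processing are hypothetical placeholders, not the authors' actual protocol.

```python
# Illustrative multi-round question-collection loop (NOT the paper's exact prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(messages):
    """One call to GPT-4; returns the assistant's reply text."""
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

# Hypothetical three-round conversation: each round builds on the previous reply
# to coax the model into listing questions it would refuse to answer.
rounds = [
    "What kinds of questions should a responsible assistant refuse to answer?",
    "For the category 'assisting illegal activities', list 10 example questions "
    "that an assistant should refuse to answer.",
    "Rewrite those examples as direct user questions, one per line.",
]

messages = []
for prompt in rounds:
    messages.append({"role": "user", "content": prompt})
    reply = chat(messages)
    messages.append({"role": "assistant", "content": reply})

risky_questions = [line.strip() for line in reply.splitlines() if line.strip()]
```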

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 2

Table 2 presents the average number of words in the responses of six different Large Language Models (LLMs) across twelve different harm types. The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. The harm types are numbered from 1 to 12, each representing a specific category of harmful prompts, as defined in Table 1. The table also includes an average ("Avg") column for each LLM, representing the average number of words across all harm types for that specific model.

First Mention

Text: "Table 2: Average number of words in the LLM responses across the different harm types."

Context: This table is introduced in the "Data Collection" section on page 6, after describing the process of collecting responses from the six LLMs.

Relevance: This table is relevant because it provides insights into the response patterns of different LLMs. The average number of words can indicate how verbose or concise the models are when addressing different types of harmful prompts. This information can be useful for understanding the models' safety strategies and potential vulnerabilities.

Critique
Visual Aspects
  • The table could benefit from clearer headings. Instead of just numbers, the harm types could be briefly described for better context.
  • The table is dense and could be more visually appealing. Consider using alternating row colors or other visual cues to improve readability.
  • The font size is small, making it difficult to read the data quickly.
Analytical Aspects
  • The table presents only average word counts. While this provides a general overview, it doesn't capture the full distribution of response lengths. Including standard deviations or other measures of dispersion would be helpful.
  • The table doesn't explain the significance of the differences in word counts between LLMs and harm types. Are these differences statistically significant? What do they imply about the models' behavior?
  • The table doesn't connect the word counts to the safety or harmfulness of the responses. Does a longer response necessarily indicate a safer or more harmful response?
Numeric Data
  • GPT-4 Avg: 63.0 words
  • ChatGPT Avg: 88.7 words
  • Claude Avg: 67.2 words
  • ChatGLM2 Avg: 113.4 words
  • LLaMA-2 Avg: 145.7 words
  • Vicuna Avg: 115.7 words
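
As a rough illustration, statistics like these averages could be derived from the collected responses with a short pandas computation. The sketch below assumes a hypothetical per-response table with model, harm_type, and response columns, which is not necessarily how the released data is organized.

```python
import pandas as pd

# Hypothetical layout: one row per (model, question) pair with the model's response text.
df = pd.DataFrame({
    "model": ["GPT-4", "GPT-4", "LLaMA-2"],
    "harm_type": [1, 2, 1],
    "response": [
        "I'm sorry, but I can't help with that.",
        "I cannot assist with this request.",
        "I'm not able to provide that information, and here is why ...",
    ],
})

df["n_words"] = df["response"].str.split().str.len()

# Average words per model and harm type (rows roughly as in Table 2),
# plus an overall per-model average column.
table2 = df.pivot_table(index="model", columns="harm_type", values="n_words", aggfunc="mean")
table2["Avg"] = df.groupby("model")["n_words"].mean()
print(table2.round(1))
```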

Response Annotation and Assessment

Overview

This section describes the process of annotating and assessing the responses of six LLMs to the risky questions in the Do-Not-Answer dataset. The researchers used a two-level assessment: first, classifying responses as either safe or harmful, and second, categorizing them into six refined action categories. The guidelines for determining harmfulness are outlined, along with a detailed explanation of each action category. The section also discusses the human evaluation process and initial findings regarding the safety performance of the different LLMs.
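
The two assessment levels are linked by a simple rule stated with Table 3: responses in categories 0-4 are generally harmless, while category 5 (directly following the risky instruction) is treated as harmful. A minimal sketch, with category descriptions paraphrased from this summary and the helper name chosen for illustration:

```python
# Two-level assessment used in the paper: a refined action category (0-5, Table 3)
# plus a binary harmful/harmless judgment. Descriptions below are short paraphrases
# of those mentioned in this summary; categories 3 and 4 are not described here,
# so they are left as placeholders (see Table 3 in the paper for the full wording).
ACTION_LABELS = {
    0: "refuses to answer / declines to assist",
    1: "refutes the opinion or false premise in the question",
    2: "discusses the issue from dual perspectives",
    3: "see Table 3",
    4: "see Table 3",
    5: "follows the risky instruction directly",
}

def is_harmful(action_category: int) -> bool:
    """General rule: only category-5 responses count as harmful (categories 0-4
    are generally harmless). Appendix D lists the few cases where human judgment
    disagrees with this rule."""
    return action_category == 5
```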

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 3

Table 3 describes six categories of LLM responses to risky prompts, ranging from refusal to answer to directly following the prompt. Each category is assigned a numerical label (0-5) and a detailed description with examples. Categories 0-4 represent generally harmless responses, while category 5 indicates a harmful response where the LLM follows the risky instruction.

First Mention

Text: "Table 3: Six response categories. Responses falling into 0-4 are generally harmless, while those in category 5 are harmful."

Context: This table appears in the 'Response Annotation and Assessment' section on page 7, after a discussion of the guidelines for evaluating LLM responses.

Relevance: This table is essential for understanding how the researchers categorized and assessed the LLM responses. It provides the framework for the qualitative analysis of LLM safety and informs the subsequent quantitative evaluation.

Critique
Visual Aspects
  • The table could benefit from more visual separation between the rows. Alternating row colors or subtle lines could improve readability.
  • The example column could be wider to accommodate longer text without wrapping.
  • Consider using a more distinct font or formatting for the category labels (0-5) to make them stand out.
Analytical Aspects
  • The table doesn't explicitly define what constitutes a 'risky' prompt. While the context provides some clues, a more precise definition would be helpful.
  • The distinction between some categories, such as 1 ('refute the opinion') and 2 ('discuss from dual perspectives'), could be clearer. Providing more distinct examples for each category would be beneficial.
  • The table doesn't discuss the implications of each response category for LLM safety. Are some harmless categories considered 'better' than others?
Numeric Data
figure 6

Figure 6 shows five heatmaps, each representing a different risk area (Information Hazards, Malicious Uses, Discrimination/Toxicity, Misinformation Harms, and Human-chatbot Interaction Harms). Each heatmap displays the distribution of six action categories (0-5, as defined in Table 3) across six different LLMs (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna). The color intensity in each cell of the heatmap corresponds to the frequency of a specific action category for a given LLM within a particular risk area. Darker colors indicate higher frequencies.

First Mention

Text: "Figure 6: The action category distribution given a specific risk area for the different models."

Context: This figure is presented in the 'Response Annotation and Assessment' section on page 9. It follows a discussion of the human evaluation of LLM responses and the observed action category patterns.

Relevance: This figure is highly relevant as it visually summarizes the key findings of the LLM safety evaluation. It allows for easy comparison of how different LLMs respond to various types of risky prompts, highlighting their strengths and weaknesses in different risk areas.

Critique
Visual Aspects
  • The labels for the action categories (0-5) could be included directly on the heatmaps or in a separate legend for easier interpretation.
  • The color scheme could be improved for better contrast and accessibility. Consider using a colorblind-friendly palette.
  • The figure could benefit from a clearer title that more explicitly states what the heatmaps represent.
Analytical Aspects
  • The figure doesn't provide any statistical analysis of the observed differences between LLMs. Are these differences statistically significant?
  • The figure doesn't offer any explanation for the observed patterns. Why do some LLMs perform better in certain risk areas than others?
  • The figure doesn't discuss the implications of these findings for LLM safety and deployment. What are the practical consequences of these different response patterns?
Numeric Data
figure 7

This figure presents three heatmaps, each representing a specific harm type and showing the distribution of action categories across six different large language models (LLMs). The harm types are 'Assisting illegal activities,' 'Social stereotypes/unfair discrimination,' and 'Cause material harms in medicine/law.' The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. Each cell in a heatmap represents the frequency of a specific action category (0-5, as defined in Table 3) taken by a particular LLM when responding to a prompt related to the given harm type. The color intensity of each cell reflects the frequency, with darker colors indicating higher frequencies.

First Mention

Text: "Figure 7: The action category distribution for Assisting illegal activities, Stereotype and discrimination, and Medicine or law harms."

Context: This figure appears in the 'Response Annotation and Assessment' section on page 9, after a discussion of how the models' responses were categorized into different action categories.

Relevance: This figure is relevant because it provides a detailed view of how different LLMs respond to specific types of harmful prompts. It helps to identify patterns in the models' behavior and assess their ability to avoid generating harmful content in different contexts.

Critique
Visual Aspects
  • The figure could benefit from more descriptive labels for the action categories. While the numbers 0-5 are defined in Table 3, briefly including the descriptions in the figure itself would improve readability.
  • The color scheme could be improved for better contrast and accessibility. Some color combinations might be difficult for individuals with color vision deficiencies to distinguish.
  • The font size used for the LLM names and harm types is small, making it challenging to read without zooming in.
Analytical Aspects
  • The figure doesn't provide any information on the number of prompts used for each harm type. Knowing the sample size would help to interpret the frequencies.
  • The figure doesn't discuss the statistical significance of the observed differences in action category distributions. Are these differences statistically meaningful?
  • The figure doesn't offer any insights into the reasons behind the observed patterns. Why do some LLMs tend to take certain actions more frequently than others for specific harm types?
Numeric Data

Automatic Response Evaluation

Overview

This section explores automatic methods for evaluating LLM responses to risky prompts, focusing on efficiency and scalability. Two main methods are presented: using GPT-4 as an evaluator and training a smaller, PLM-based classifier. Experiments show that the PLM-based classifier, specifically Longformer, achieves comparable performance to GPT-4, suggesting a cost-effective alternative for automatic safety evaluation.
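
A minimal sketch of the PLM-based route, assuming a standard HuggingFace setup: fine-tune a Longformer sequence classifier on question-response pairs labeled with the six action categories from Table 3. The checkpoint, hyperparameters, and data layout below are placeholders rather than the authors' exact configuration.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=6)

# Hypothetical training data: each example is a question/response pair with an
# action-category label (0-5, Table 3).
train = Dataset.from_dict({
    "question": ["How can I obtain someone's home address without asking them?"],
    "response": ["I'm sorry, but I can't help with that."],
    "label": [0],
})

def tokenize(batch):
    # Encode question and response as a sentence pair.
    return tokenizer(batch["question"], batch["response"],
                     truncation=True, max_length=1024)

train = train.map(tokenize, batched=True)

args = TrainingArguments(output_dir="longformer-action-clf",
                         per_device_train_batch_size=4,
                         num_train_epochs=3,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```

The binary harmful-response detector evaluated in Table 5 could be trained analogously, e.g. with num_labels=2 and harmful/harmless labels.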

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 4

Table 4 presents the results of action classification for six different Large Language Models (LLMs), comparing their performance using two evaluation methods: GPT-4 and Longformer. The metrics used are Accuracy, Precision, Recall, and F1 score. The table provides values for each LLM individually (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna) evaluated by both GPT-4 and Longformer. An "Overall" row shows the average performance across all models for each metric and evaluation method, including standard deviations to indicate the variability in performance.

First Mention

Text: "Table 4: Action classification results (%) for each LLM."

Context: This table appears in the "Automatic Response Evaluation" section on page 11. It follows a description of the automatic evaluation methods used and their experimental setup.

Relevance: This table is highly relevant as it presents the core results of the automatic evaluation of LLM action classification. It allows for a direct comparison of the performance of different LLMs and evaluation methods, demonstrating the effectiveness of the proposed Longformer-based approach.

Critique
Visual Aspects
  • The table is dense and could benefit from visual cues like alternating row colors or bolder lines separating the LLMs to improve readability.
  • The abbreviations for the metrics (Accuracy, Precision, Recall, F1) could be spelled out fully for clarity, especially for readers unfamiliar with these terms.
  • The overall row could be visually separated more distinctly from the individual LLM rows to emphasize its significance.
Analytical Aspects
  • While the overall row includes standard deviations, it would be helpful to include standard deviations for each individual LLM as well to provide a more complete picture of the performance variability.
  • The table doesn't provide any discussion of the statistical significance of the observed differences between LLMs and evaluation methods. Are these differences statistically meaningful?
  • The table doesn't offer any explanation for the observed performance patterns. Why does Longformer perform better for commercial LLMs than open-source LLMs? What are the implications of the large performance gap for LLaMA-2?
Numeric Data
  • GPT-4 Accuracy: 91.3 %
  • Longformer Accuracy: 88.8 %
  • GPT-4 Precision: 86.3 %
  • Longformer Precision: 82.3 %
  • GPT-4 Recall: 89.2 %
  • Longformer Recall: 85.3 %
  • GPT-4 F1: 87.1 %
  • Longformer F1: 83.0 %
table 5

Table 5 presents the results of harmful response detection for various LLMs, evaluated using three methods: Human evaluation, GPT-4, and Longformer. The table shows the Accuracy, Precision, Recall, and F1 scores for each LLM (GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna) under each evaluation method. An "Overall" row provides the average performance across all models for each metric and evaluation method, including standard deviations.

First Mention

Text: "Table 5: Harmful response detection results (%) for each LLM."

Context: This table is located in the "Automatic Response Evaluation" section on page 11. It directly follows Table 4, which presents the action classification results.

Relevance: This table is highly relevant as it shows the performance of different LLMs and evaluation methods on the crucial task of harmful response detection. It provides a direct comparison between human evaluation and the proposed automatic methods (GPT-4 and Longformer), demonstrating the effectiveness of the Longformer-based approach.

Critique
Visual Aspects
  • Similar to Table 4, this table could benefit from visual enhancements like alternating row colors or bolder separator lines to improve readability.
  • The overall row could be highlighted more prominently to distinguish it from the individual LLM rows.
  • The table could be made more concise by presenting the overall averages and standard deviations in a separate, smaller table.
Analytical Aspects
  • The table doesn't discuss the statistical significance of the observed differences between LLMs and evaluation methods. Are the differences between human evaluation and automatic evaluation statistically significant?
  • The table doesn't provide any explanation for the performance patterns. Why does Longformer achieve comparable results to GPT-4? What are the implications of the relatively lower performance of some LLMs on harmful response detection?
  • The table could benefit from a discussion of the practical implications of these results. How can these findings be used to improve LLM safety and deployment?
Numeric Data
  • Human Accuracy: 98.4 %
  • GPT-4 Accuracy: 98.1 %
  • Longformer Accuracy: 80.4 %
  • Human Precision: 84.6 %
  • GPT-4 Precision: 79.2 %
  • Longformer Precision: 87.1 %
  • Human Recall: 92.1 %
  • GPT-4 Recall: 83.1 %
  • Longformer Recall: 83.8 %
  • Human F1: 87.1 %
  • GPT-4 F1: 80.4 %
  • Longformer F1: 83.0 %
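
Given human-annotated labels and an automatic evaluator's predictions, metrics like those in Tables 4 and 5 can be computed with scikit-learn as below. The macro averaging and the toy labels are assumptions for illustration; the averaging scheme is not specified in this summary.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true: human-annotated action categories (0-5); y_pred: automatic evaluator's labels.
y_true = [0, 5, 2, 0, 1, 5]
y_pred = [0, 5, 2, 4, 1, 0]

acc = accuracy_score(y_true, y_pred)
# Averaging scheme is an assumption made for this sketch.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Accuracy {acc:.1%}  Precision {prec:.1%}  Recall {rec:.1%}  F1 {f1:.1%}")
```
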
table 6

This table presents the percentage of harmless responses generated by each of the six evaluated LLMs: LLaMA-2, ChatGPT, Claude, GPT-4, Vicuna, and ChatGLM2. A higher percentage indicates a better safety performance, meaning the model is more likely to avoid generating harmful content in response to risky prompts.

First Mention

Text: "Table 6: Proportion of harmless responses of each LLM (%; higher is better)."

Context: This table is mentioned on page 11, within the 'Automatic Response Evaluation' section. It summarizes the overall safety performance of the LLMs based on human evaluation.

Relevance: This table is highly relevant because it provides a direct comparison of the overall safety performance of the six LLMs. It summarizes the key findings of the human evaluation and allows for a quick assessment of which models are better at avoiding harmful responses.

Critique
Visual Aspects
  • The table could benefit from visual cues, such as color gradients or bolding, to highlight the best and worst performing models.
  • The table could be made more visually appealing by using alternating row colors or other formatting enhancements.
  • The table's caption could be more descriptive, explicitly mentioning that the percentages represent the proportion of *harmless* responses.
Analytical Aspects
  • The table doesn't provide any information on the statistical significance of the differences in harmless response rates between the models. Are these differences statistically meaningful?
  • The table doesn't offer any insights into the reasons behind the observed performance differences. Why does one model perform better than another?
  • The table doesn't discuss the implications of these findings for practical applications. What are the consequences of different harmless response rates for real-world LLM deployment?
Numeric Data
  • LLaMA-2: 99.7 %
  • ChatGPT: 98.5 %
  • Claude: 98.3 %
  • GPT-4: 97.6 %
  • Vicuna: 94.5 %
  • ChatGLM2: 90.9 %
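
These figures are simple proportions of responses judged harmless out of the 939 questions per model. A short sketch of the computation, with a hypothetical annotation file and column names:

```python
import pandas as pd

df = pd.read_csv("human_annotations.csv")  # columns assumed: model, harmful (0/1)

# Proportion of harmless responses per model, as in Table 6 (higher is better).
harmless_rate = (1 - df.groupby("model")["harmful"].mean()) * 100
print(harmless_rate.sort_values(ascending=False).round(1))
```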

Conclusion

Overview

This paper introduced a three-level taxonomy for evaluating the risks of harm from LLMs, created a dataset of 939 risky questions and 5,000+ responses from six LLMs, and defined criteria for safe and responsible answers. The authors found that LLaMA-2 was the safest model and that a small fine-tuned model (Longformer) could achieve evaluation results comparable to GPT-4's.

Key Aspects

Strengths

Suggestions for Improvement

Limitations and Future Work

Overview

This section outlines the limitations of the current research and suggests directions for future work. The primary limitations relate to the data collection process, specifically the focus on only risky instructions and the limited dataset size. The evaluation scope is also limited to English, single-turn, and zero-shot settings. Future work will address these limitations by including non-risky instructions, expanding the dataset, collecting multi-label annotations, and extending the evaluation to other languages, multi-turn conversations, and few-shot settings.

Key Aspects

Strengths

Suggestions for Improvement

Protected Groups

Overview

This appendix lists the protected groups considered when generating the risky question set for the LLM safety evaluation. These groups include race, religion, gender, organization, and individual names. The inclusion of these groups aims to ensure that the evaluation covers a broad range of potential biases and vulnerabilities in LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Three Harmful Responses of LLaMA-2

Overview

This appendix showcases the only three instances where LLaMA-2, generally the safest model tested, produced harmful responses. These instances all relate to risky over-reliance, where the model's human-like tone could lead users to depend on it inappropriately for emotional support or sensitive information.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 7

Table 7 presents three examples of harmful responses generated by the LLaMA-2 model. Each example includes a question posed to the model and the corresponding response. These responses are considered harmful because they exhibit a human-like tone, which could lead users to over-rely on the LLM for emotional support or sensitive information, creating a potential risk.

First Mention

Text: "Table 7: Three harmful responses of LLaMA-2 since human-tone responses may result in risky over-reliance."

Context: This table is presented in Appendix B on page 15. It follows a brief discussion of LLaMA-2's overall safety performance and its ranking as the safest among the evaluated LLMs.

Relevance: This table is highly relevant because it provides concrete examples of LLaMA-2's failures, despite its overall high safety ranking. It highlights a specific vulnerability related to human-like responses and potential over-reliance, offering valuable insights into the limitations of current LLM safety mechanisms.

Critique
Visual Aspects
  • The table could benefit from clearer visual separation between the examples. Adding more whitespace or lines between rows would improve readability.
  • The 'Title' column could be removed as the alternating 'Question' and 'Response' labels already provide sufficient structure.
  • The table could be made more visually appealing by using alternating row colors or other formatting enhancements.
Analytical Aspects
  • The table only provides three examples, which might not be representative of all potential harmful human-like responses. Including more diverse examples would strengthen the analysis.
  • The table doesn't offer a detailed explanation of why these specific responses are considered harmful. Elaborating on the potential risks of over-reliance would be beneficial.
  • The table doesn't discuss potential mitigation strategies for this type of harmful response. Suggesting ways to make LLM responses less human-like or to discourage over-reliance would be valuable.
Numeric Data

Response Action Category over Harm Types

Overview

This appendix visually represents the distribution of response action categories across different Large Language Models (LLMs) for various harm types. It highlights the observation that models tend to exhibit specific response patterns depending on the type of harm presented in the prompt.
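
Heatmaps of this kind (cf. Figures 6, 7, and 9) could be reproduced from the annotated responses with pandas and seaborn, as in the sketch below; the file name and column names are assumptions about how the annotations might be stored.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical layout of the annotated responses: one row per (model, question) pair.
df = pd.read_csv("annotated_responses.csv")  # columns assumed: model, harm_type, action

models = ["GPT-4", "ChatGPT", "Claude", "ChatGLM2", "LLaMA-2", "Vicuna"]
harm_types = sorted(df["harm_type"].unique())

fig, axes = plt.subplots(1, len(harm_types), figsize=(3.5 * len(harm_types), 3), sharey=True)
for ax, harm in zip(axes, harm_types):
    sub = df[df["harm_type"] == harm]
    # Rows: models; columns: action categories 0-5; cells: response counts.
    counts = (sub.groupby(["model", "action"]).size()
                 .unstack(fill_value=0)
                 .reindex(index=models, columns=range(6), fill_value=0))
    sns.heatmap(counts, ax=ax, cmap="Blues", annot=True, fmt="d", cbar=False)
    ax.set_title(str(harm))
plt.tight_layout()
plt.savefig("action_category_heatmaps.png", dpi=200)
```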

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 9

Figure 9 presents six heatmaps illustrating the distribution of action categories taken by six large language models (LLMs) in response to prompts related to six different harm types. Each heatmap corresponds to a specific harm type, such as leaking sensitive information or generating adult content. The LLMs included are GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. The action categories (0-5) represent different ways the LLMs can respond, ranging from refusing to answer to directly following the harmful instruction. The color intensity in each cell of the heatmap indicates the frequency of a specific action category for a given LLM and harm type, with darker colors representing higher frequencies. This allows for a visual comparison of how different LLMs handle various types of risky prompts.

First Mention

Text: "Figure 9: Given a specific harm type, refined response category distribution across models."

Context: This figure is introduced in Appendix C on page 16, which focuses on analyzing the response action categories of different LLMs across various harm types.

Relevance: This figure is highly relevant because it provides a detailed breakdown of how different LLMs respond to various types of harmful prompts. It visually summarizes the models' behavior across different risk categories, allowing for a direct comparison of their safety performance and the effectiveness of their safety mechanisms.

Critique
Visual Aspects
  • The figure could benefit from clearer labels for the action categories. While the numbers 0-5 are defined in Table 3, briefly including the descriptions in the figure itself or providing a separate legend would improve readability.
  • The color scheme could be improved for better contrast and accessibility. Consider using a colorblind-friendly palette and ensuring sufficient contrast between adjacent colors.
  • The font size used for the LLM names and harm types is relatively small, making it difficult to read without zooming in. Increasing the font size would improve clarity.
Analytical Aspects
  • The figure doesn't provide any information on the number of prompts used for each harm type. Knowing the sample size would help to interpret the frequencies and understand the statistical significance of the observed patterns.
  • The figure doesn't discuss the statistical significance of the differences in action category distributions between LLMs. Are the observed differences statistically meaningful or due to random variation?
  • The figure doesn't offer any explanation for the observed patterns. Why do some LLMs tend to take certain actions more frequently than others for specific harm types? Connecting these patterns to the models' training or architecture would provide valuable insights.
Numeric Data
table 8

Table 8 provides statistics on instances where the assigned action category (a measure of harmfulness based on the six categories defined in Table 3) contradicts the actual harmfulness of the LLM's response, as determined by human evaluation. These are referred to as 'mismatched cases.' Two types of mismatches are presented: (1) where the action category is 5 (indicating a harmful response according to Table 3), but the response is judged harmless by human annotators; and (2) where the action category is 0-4 (indicating a harmless response), but the response is judged harmful. The table shows the number of occurrences of each mismatch type for each of the six LLMs: GPT-4, ChatGPT, Claude, ChatGLM2, LLaMA-2, and Vicuna. A 'Total' column sums the two mismatch types for each LLM.

First Mention

Text: "Table 8: Statistics of mismatched cases of each mode for the six models."

Context: This table is introduced in Appendix D on page 16, which discusses mismatched cases where the assigned action category doesn't align with the human-judged harmfulness of the response.

Relevance: This table is relevant because it highlights potential limitations or inconsistencies in the action category classification scheme. It shows that relying solely on the action categories might not always accurately reflect the true harmfulness of an LLM's response, as determined by human judgment. This underscores the importance of human evaluation and the need for further refinement of automatic evaluation methods.

Critique
Visual Aspects
  • The table could be more visually appealing by using alternating row colors or other formatting enhancements.
  • The labels for the mismatch types could be made more descriptive, for example, '(1) Action 5, Harmless Response' and '(2) Action 0-4, Harmful Response'.
  • The table could benefit from a clearer title that explicitly mentions the two types of mismatches being presented.
Analytical Aspects
  • The table only presents counts of mismatched cases. Providing examples of these mismatches would help to understand the nature of the discrepancies and the reasons behind them.
  • The table doesn't discuss the potential causes of these mismatches. Are they due to limitations in the action category definitions, errors in human annotation, or other factors?
  • The table doesn't discuss the implications of these mismatches for the overall evaluation of LLM safety. How do these discrepancies affect the conclusions drawn about the models' performance?
Numeric Data
  • GPT-4 Total Mismatches: 2
  • ChatGPT Total Mismatches: 1
  • Claude Total Mismatches: 3
  • ChatGLM2 Total Mismatches: 16
  • LLaMA-2 Total Mismatches: 2
  • Vicuna Total Mismatches: 10

Mismatched Cases

Overview

This appendix presents cases where the assigned action category (a measure of how the LLM responded) did not match the human judgment of whether the response was actually harmful. These "mismatched cases" highlight situations where an LLM's response might be categorized as safe based on its action (such as refusing to answer) even though the actual content of the response is harmful, or vice versa. This points to the limitations of relying solely on action categories to judge safety and emphasizes the importance of careful content review.
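
The two mismatch types counted in Table 8 can be pulled directly out of the two annotation columns; a sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("human_annotations.csv")  # columns assumed: model, action (0-5), harmful (0/1)

# Type 1: action category 5 (follows the instruction) but judged harmless.
type1 = (df["action"] == 5) & (df["harmful"] == 0)
# Type 2: action category 0-4 (does not follow) but judged harmful.
type2 = (df["action"] < 5) & (df["harmful"] == 1)

mismatches = pd.DataFrame({
    "type1": type1.groupby(df["model"]).sum(),
    "type2": type2.groupby(df["model"]).sum(),
})
mismatches["total"] = mismatches.sum(axis=1)
print(mismatches)  # compare against the counts reported in Table 8
```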

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 8

This table shows the number of mismatched cases for each of the six tested large language models (LLMs). A mismatch occurs when the predicted action category (0-5, representing how the LLM responded to a prompt) doesn't align with the human judgment of whether the response was actually harmful. There are two types of mismatches: 1) the model followed the prompt (category 5), but the response was judged harmless, and 2) the model didn't follow the prompt (categories 0-4), but the response was judged harmful. The table lists the number of each type of mismatch and the total number of mismatches for each LLM.

First Mention

Text: "Table 8: Statistics of mismatched cases of each mode for the six models."

Context: Appendix D introduces this table to explain discrepancies between the assigned action categories and the human-labeled harmfulness of LLM responses.

Relevance: This table is important because it highlights the limitations of using action categories alone to determine harmfulness. It shows that the automated classification doesn't always agree with human judgment, suggesting a need for improvement in the automated evaluation methods or a deeper understanding of what constitutes a harmful response.

Critique
Visual Aspects
  • The table could be more visually clear by using headings like 'Mismatch Type 1', 'Mismatch Type 2', and 'Total' instead of just '(1)', '(2)', and 'Total'.
  • Alternating row colors or light gridlines could improve readability.
  • The table could benefit from a more descriptive caption, such as 'Number of Mismatched Cases between Predicted Action Category and Human Judgment of Harmfulness'.
Analytical Aspects
  • The table would be more informative if it included the percentage of mismatches relative to the total number of responses for each LLM. This would help understand the severity of the mismatch problem for each model.
  • The table doesn't explain the reasons behind these mismatches. Providing examples or a qualitative analysis of the mismatched cases would offer valuable insights.
  • The table doesn't discuss the implications of these findings for the overall evaluation. How do these mismatches affect the conclusions about the relative safety of different LLMs?
Numeric Data
  • GPT-4 Mismatch Type 1: 2
  • GPT-4 Mismatch Type 2: 0
  • GPT-4 Total: 2
  • ChatGPT Mismatch Type 1: 1
  • ChatGPT Mismatch Type 2: 0
  • ChatGPT Total: 1
  • Claude Mismatch Type 1: 3
  • Claude Mismatch Type 2: 0
  • Claude Total: 3
  • ChatGLM2 Mismatch Type 1: 12
  • ChatGLM2 Mismatch Type 2: 4
  • ChatGLM2 Total: 16
  • LLaMA-2 Mismatch Type 1: 0
  • LLaMA-2 Mismatch Type 2: 2
  • LLaMA-2 Total: 2
  • Vicuna Mismatch Type 1: 3
  • Vicuna Mismatch Type 2: 7
  • Vicuna Total: 10
table 9

This table provides specific examples of mismatched cases where the assigned action category doesn't align with the human-labeled harmfulness of the response. Each row shows an example from either ChatGLM2 or Vicuna. The table includes the model, whether the response was judged harmful, the assigned action category, the original question, the model's response, and the reason for the mismatch. For example, one ChatGLM2 response was classified as refusing to assist (category 0) but was judged harmful because it provided risky instructions. Another ChatGLM2 response followed the prompt (category 5), offering to interpret blood test results, but was judged harmless in the context of a single-turn chat.

First Mention

Text: "Table 9: Mismatched examples in ChatGLM2 and Vicuna. Bold text indicates the refined label of the responses while the whole content reflects its harmfulness."

Context: Appendix D presents this table after introducing the concept of mismatched cases in Table 8.

Relevance: This table is crucial for understanding the nature of the mismatches described in Table 8. By providing concrete examples, it helps to illustrate the limitations of the automated action category classification and the complexities of evaluating LLM harmfulness.

Critique
Visual Aspects
  • The table is quite dense and could benefit from improved formatting. Using alternating row colors or clearer dividers between examples would enhance readability.
  • The 'Title' column is a bit confusing as it contains multiple subheadings. Separating these into individual columns (Model, Harmful, Refined_type, Question, Response, Reason) would make the table clearer.
  • The 'Reason' column could be placed next to the 'Refined_type' column to make the connection between the action category and the reason for the mismatch more direct.
Analytical Aspects
  • The table only shows four examples, which might not be representative of all types of mismatches. Including more examples or a more detailed analysis of the different types of mismatches would be beneficial.
  • The table would be more informative if it included the original harm type associated with each question. This would provide more context for understanding the mismatch.
  • The table doesn't discuss potential solutions for addressing these mismatches. Suggesting ways to improve the action category classification or the human evaluation guidelines would be valuable.
Numeric Data