Evaluating Large Language Models for Scientific Feedback

Overall Summary

Overview

This research investigates whether large language models (LLMs), specifically GPT-4, can provide useful feedback on research papers. Through a retrospective analysis comparing GPT-4's feedback with human peer reviews and a prospective user study, it examines the reliability and credibility of LLM-generated scientific feedback.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Figure 1

Description: A flow diagram illustrating the study's methodology, including LLM feedback generation, comparison with human feedback, and the user study process.

Relevance: Provides a clear visual overview of the research process and the different stages involved in evaluating LLM-generated feedback.

Figure 3

Description: A scatter plot comparing the frequency of different feedback aspects provided by GPT-4 and human reviewers, highlighting discrepancies in emphasis.

Relevance: Visually demonstrates the LLM's tendency to focus on certain aspects of feedback more than humans, revealing potential strengths and weaknesses in its evaluation capabilities.

Conclusion

This research demonstrates the potential of LLMs, particularly GPT-4, as a valuable tool for providing scientific feedback. While LLM feedback can be helpful and align with human feedback in many aspects, it is crucial to acknowledge its current limitations, particularly in providing specific and in-depth critique. Human expert review remains essential for rigorous scientific evaluation, and future research should focus on addressing the identified limitations and exploring the ethical implications of using LLMs in this context. The findings suggest that LLMs and human feedback can complement each other, potentially transforming research practices and enhancing the efficiency and accessibility of scientific feedback mechanisms.

Section Analysis

Abstract

Overview

This abstract presents a large-scale study investigating the potential of large language models (LLMs), specifically GPT-4, to provide useful feedback on research papers. The study involved two parts: a retrospective analysis comparing GPT-4's feedback with human peer reviews and a prospective user study assessing researcher perceptions of GPT-4 generated feedback. The results indicate a significant overlap between GPT-4 and human feedback, comparable to the overlap between two human reviewers, suggesting the potential utility of LLMs in augmenting scientific feedback mechanisms.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

The introduction section of this research paper establishes the importance of feedback in scientific research, highlighting the challenges posed by the increasing volume of publications and specialization of knowledge. It then introduces large language models (LLMs) as a potential solution to these challenges, specifically focusing on GPT-4, and outlines the study's objective to systematically analyze the reliability and credibility of LLM-generated scientific feedback.

Key Aspects

Strengths

Suggestions for Improvement

Results

Overview

The Results section details the methodology and findings of two evaluations: a retrospective analysis comparing GPT-4 generated feedback with human peer reviews and a prospective user study assessing researcher perceptions of the LLM feedback. The retrospective analysis found significant overlap between GPT-4 and human feedback, comparable to inter-reviewer overlap. The user study revealed that researchers found the LLM feedback helpful, aligning with human feedback, and offering novel perspectives.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1

Figure 1 provides a visual overview of the study's methodology for evaluating the ability of Large Language Models (LLMs) to generate helpful feedback on scientific papers. The figure is a three-part flow diagram, illustrating the processes of: (a) generating LLM feedback from a raw PDF, (b) comparing this LLM feedback with human feedback, and (c) conducting a user study to evaluate the feedback. Each part of the diagram is further broken down into steps, visually represented by boxes and arrows, with key terms and concepts highlighted within each step. For instance, section (a) showcases a four-step pipeline that begins with processing a 'Raw PDF' and ends with the generation of structured 'Feedback,' including steps like 'Parsed PDF' and using a 'Prompt.' Section (b) emphasizes the comparison between 'Language Model Feedback' and 'Human Feedback' through summarized text and a dedicated 'Matched Feedback' area highlighting commonalities. Lastly, section (c) illustrates the process of 'User Recruiting' for a user study aimed at assessing the generated feedback.

First Mention

Text: "Specifically, we developed a GPT-4 based scientific feedback generation pipeline that takes the raw PDF of a paper and produces structured feedback (Fig. 1a)."

Context: This sentence appears in the second paragraph of the Introduction section, where the authors are outlining their approach to analyzing the reliability and credibility of LLM-generated scientific feedback.

Relevance: Figure 1 is crucial for understanding the study's methodology. It visually outlines the steps involved in generating LLM feedback, comparing it to human feedback, and conducting a user study, providing a clear roadmap for the research process.

Critique
Visual Aspects
  • The figure effectively uses a flow diagram format to break down a complex process into understandable steps.
  • The use of color coding, arrows, and concise labels enhances clarity.
  • However, the lack of a legend for some visual elements, like the hexagon with a spiral inside, could benefit from further explanation.
Analytical Aspects
  • The figure clearly outlines the key steps involved in the research process, providing a visual representation of the methodology.
  • The figure highlights the importance of comparing LLM feedback to human feedback, which is a crucial aspect of evaluating the LLM's performance.
  • The figure also emphasizes the role of a user study in assessing the perceived helpfulness and utility of LLM feedback from the perspective of researchers.
Numeric Data
Figure 3

Figure 3 is a scatter plot comparing how frequently GPT-4 and human reviewers raise different feedback aspects. The x-axis represents the log frequency ratio between GPT-4 and human feedback, with positive values indicating that GPT-4 comments on that aspect more frequently than humans. For example, GPT-4 raises 'Implications of the Research' 7.27 times more frequently than human reviewers, while it is 10.69 times less likely to suggest 'Add experiments on more datasets'. The size of each circle represents the prevalence of that aspect in human feedback.

First Mention

Text: "Fig. 3 presents the relative frequency of each of the 11 aspects of comments raised by humans and LLM."

Context: This sentence, found in the fifth paragraph of the Results section, introduces Figure 3 as a visual representation of the differences in emphasis between LLM and human feedback on specific aspects of scientific papers.

Relevance: Figure 3 directly supports the section's discussion on the varying emphasis placed on different aspects of feedback by LLMs and human reviewers. It visually demonstrates the discrepancies in frequency ratios for specific aspects, highlighting the potential for complementary roles between human and AI feedback.

Critique
Visual Aspects
  • The figure is well-labeled and easy to understand.
  • The use of color and circle size effectively conveys the relative frequencies and prevalence of different feedback aspects.
  • The caption could be improved by explicitly mentioning that circle size corresponds to the prevalence in human feedback, not LLM feedback.
Analytical Aspects
  • The figure effectively visualizes the differences in emphasis between LLM and human feedback, providing a clear picture of the LLM's tendencies.
  • The specific frequency ratios mentioned in the caption (7.27 times more frequent for 'Implications of the Research', 10.69 times less likely for 'Add experiments on more datasets') provide quantifiable insights into these differences.
  • The figure suggests that LLMs may have specific strengths and weaknesses in evaluating different aspects of scientific papers, highlighting the need for further research on how to best leverage their capabilities.
Numeric Data
  • Frequency ratio of 'Implications of the Research' comments (GPT-4/Human): 7.27
  • Frequency ratio of 'Add experiments on more datasets' comments (GPT-4/Human): 0.0935
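As a quick arithmetic check on the ratios reported above, a few lines of Python reproduce the relationship between the raw frequency ratios and the log-scale axis described for Figure 3; the two ratios come from the caption, everything else is illustrative.

```python
import math

# Frequency ratios quoted for Figure 3 (GPT-4 frequency / human frequency).
ratios = {
    "Implications of the Research": 7.27,           # raised more often by GPT-4
    "Add experiments on more datasets": 1 / 10.69,  # raised less often (≈ 0.0935)
}

# Positive log ratios: GPT-4 comments on the aspect more than humans;
# negative log ratios: less than humans.
for aspect, ratio in ratios.items():
    print(f"{aspect}: ratio = {ratio:.4f}, log ratio = {math.log(ratio):+.2f}")
```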
Figure 4

Figure 4 presents the results of a human study on LLM and human review feedback (n = 308). The figure consists of eight horizontal bar charts (a-h) representing survey responses related to the helpfulness, specificity, perceived impact, and likelihood of future use of LLM-generated feedback compared to human feedback. Each chart presents a different question or statement, and the bars within each chart represent the percentage of respondents who chose each response option. The charts consistently use a blue color scheme, with darker shades of blue indicating a higher percentage of responses. The caption provides detailed information on each subfigure (a-h) and the overall findings of the study.

First Mention

Text: "The results from the user study are illustrated in Fig. 4."

Context: This sentence, located in the sixth paragraph of the Results section, introduces Figure 4 as a visual representation of the findings from the user study conducted to assess researcher perceptions of LLM-generated feedback.

Relevance: Figure 4 is central to the section's discussion on the prospective user study. It visually presents the survey responses, providing insights into researcher perceptions of the helpfulness, specificity, and potential impact of LLM feedback compared to human feedback. The figure directly supports the claim that researchers generally find LLM feedback helpful and see its potential for improving the scientific review process.

Critique
Visual Aspects
  • While the figure effectively communicates the distribution of responses for each question, the lack of explicit axis labels might hinder quick interpretation.
  • The choice of blue shades for percentages is generally clear, but a legend mapping shade intensity to specific percentage ranges would enhance readability.
  • The separation of the figure caption from the main visual content by a significant amount of text might disrupt the reading flow. Providing a brief title within the figure summarizing the study's focus could also improve clarity.
Analytical Aspects
  • The figure provides a comprehensive overview of researcher perceptions of LLM feedback, covering various aspects like helpfulness, specificity, and potential impact.
  • The high percentages of positive responses regarding helpfulness and perceived benefits support the study's claim that LLM feedback can be a valuable tool for researchers.
  • The figure also highlights areas where LLM feedback is perceived as less effective than human feedback, such as specificity, providing valuable insights for future improvements.
Numeric Data
  • Percentage of users who found LLM feedback helpful or very helpful: 57.4 %
  • Percentage of users who found LLM feedback more beneficial than feedback from at least some human reviewers: 82.4 %
  • Percentage of users who believe the LLM feedback system can improve the accuracy of reviews: 70.8 %
  • Percentage of users who believe the LLM feedback system can improve the thoroughness of reviews: 77.9 %
  • Percentage of users who intend to use or potentially use the LLM feedback system again: 50.5 %
  • Percentage of users who believe the LLM feedback system mostly helps authors: 80.5 %
Supplementary Figure 1

Supplementary Figure 1 presents two bar graphs (a and b) comparing the overlap between comments generated by GPT-4 and those made by human reviewers. The caption provides context for the graphs, stating that they depict the fraction of GPT-4's comments that align with at least one human reviewer's feedback. The graphs are divided into two categories: 'Nature Journals' (a) and 'ICLR' (b), presumably representing different publication sources. Both graphs feature the same x-axis categories: 'GPT-4 vs. All Human' and 'GPT-4 (shuffle) vs. All Human'. The 'GPT-4 (shuffle)' category likely refers to a baseline condition where GPT-4's comments have been randomly reassigned to different papers. The y-axis, labeled 'Global Hit Rate (%)', measures the percentage of overlapping comments. Notably, the error bars, representing 95% confidence intervals, are remarkably small, indicating a high level of precision in the measurements.

First Mention

Text: "More than half (57.55%) of the comments raised by GPT-4 were raised by at least one human reviewer (Supp. Fig. 1a)."

Context: This sentence, found in the third paragraph of the Results section, introduces Supplementary Figure 1a as evidence of the significant overlap between LLM feedback and human feedback in the Nature family journal data.

Relevance: Supplementary Figure 1 directly supports the section's claim that LLM feedback significantly overlaps with human-generated feedback. It visually demonstrates the high percentage of GPT-4 comments that align with at least one human reviewer's feedback, both in the Nature family journal data and the ICLR data. The figure also highlights the significant drop in overlap when GPT-4 feedback is shuffled, indicating that the LLM is not simply generating generic comments.

Critique
Visual Aspects
  • The figure is well-organized and easy to understand.
  • The labeling is clear, and the use of bar graphs effectively illustrates the comparison between GPT-4 and human reviewers.
  • The color scheme is appropriate, and the error bars provide valuable information about the reliability of the data.
Analytical Aspects
  • The figure effectively demonstrates the significant overlap between GPT-4 and human feedback, supporting the study's claim.
  • The shuffling experiment provides a strong control condition, demonstrating that the LLM is generating paper-specific feedback, not just generic comments.
  • The small error bars indicate a high level of precision in the measurements, strengthening the reliability of the findings.
Numeric Data
  • Percentage of GPT-4 comments overlapping with at least one human reviewer in Nature family journal data: 57.55 %
  • Percentage of GPT-4 comments overlapping with at least one human reviewer in ICLR data: 77.18 %
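For concreteness, here is a minimal sketch of how a global hit rate and its shuffled control could be computed, assuming comments have already been extracted per paper. The function and argument names, and the `is_match` predicate standing in for the GPT-4-based semantic matcher, are hypothetical rather than the authors' implementation.

```python
import random

def hit_rate(llm_comments_by_paper, human_comments_by_paper, is_match):
    """Percentage of LLM comments matched by at least one human comment on the
    same paper. `is_match(a, b)` is a hypothetical stand-in for the paper's
    GPT-4-based semantic matching step."""
    hits = total = 0
    for paper, llm_comments in llm_comments_by_paper.items():
        human_comments = human_comments_by_paper[paper]
        for comment in llm_comments:
            total += 1
            hits += any(is_match(comment, h) for h in human_comments)
    return 100.0 * hits / max(total, 1)

def shuffled_hit_rate(llm_comments_by_paper, human_comments_by_paper, is_match, seed=0):
    """Control condition: pair each paper's LLM comments with the human
    comments of a randomly chosen paper (usually a different one), then
    recompute the hit rate."""
    papers = list(llm_comments_by_paper)
    shuffled = papers[:]
    random.Random(seed).shuffle(shuffled)
    remapped = {p: human_comments_by_paper[q] for p, q in zip(papers, shuffled)}
    return hit_rate(llm_comments_by_paper, remapped, is_match)
```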
Supplementary Figure 2

Supplementary Figure 2 presents a retrospective evaluation of GPT-4's performance using alternative set overlap metrics as a robustness check. The figure is organized into eight bar graphs: (a) and (b) illustrate the hit rate, (c) and (d) depict the Szymkiewicz–Simpson overlap coefficient, (e) and (f) showcase the Jaccard index, and (g) and (h) display the Sørensen–Dice coefficient. Each graph compares four scenarios: GPT-4 versus human, human versus human, human without control versus human, and shuffled GPT-4 output versus human. The graphs consistently use the same x-axis labels, and the error bars represent 95% confidence intervals.

First Mention

Text: "Results were consistent across other overlapping metrics including Szymkiewicz–Simpson overlap coefficient, Jaccard index, Sørensen–Dice coefficient (Supp. Fig. 2)."

Context: This sentence, appearing in the third paragraph of the Results section, refers to Supplementary Figure 2 as evidence that the observed overlap between LLM and human feedback is consistent across various set overlap metrics, not just the hit rate.

Relevance: Supplementary Figure 2 strengthens the study's findings by demonstrating the robustness of the overlap results across different set overlap metrics. It shows that the comparable overlap between GPT-4 and human feedback, as well as the significant drop in overlap with shuffled GPT-4 feedback, is not limited to the hit rate but holds true for other established metrics like the Szymkiewicz–Simpson overlap coefficient, Jaccard index, and Sørensen–Dice coefficient.
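For reference, the three alternative overlap metrics named above have standard set-theoretic definitions; this is a minimal sketch over sets of matched comment identifiers (treating comments as set elements is an assumption about the exact bookkeeping, and non-empty sets are assumed).

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Szymkiewicz–Simpson overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

def jaccard_index(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def sorensen_dice(a: set, b: set) -> float:
    """Sørensen–Dice coefficient: 2 |A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# Example with toy comment-identifier sets:
a = {"novelty", "baselines", "ablation"}
b = {"baselines", "ablation", "clarity"}
print(overlap_coefficient(a, b), jaccard_index(a, b), sorensen_dice(a, b))
```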

Critique
Visual Aspects
  • The figure is well-organized, and the labeling is clear.
  • The use of different colors for each bar and the inclusion of error bars enhance the readability and interpretability of the data.
  • The significance markers are appropriately placed and clearly visible.
Analytical Aspects
  • The figure effectively demonstrates the robustness of the findings across different set overlap metrics, strengthening the study's conclusions.
  • The consistent use of the same x-axis labels and error bar representation across all graphs facilitates easy comparison and interpretation.
  • The inclusion of a shuffled GPT-4 control condition in each graph further emphasizes the paper-specificity of the LLM feedback.
Numeric Data

Discussion

Overview

The Discussion section summarizes the study's findings, emphasizing the potential of LLMs, specifically GPT-4, for providing helpful scientific feedback. It highlights the significant overlap between LLM-generated feedback and human peer reviews, as well as the positive user perceptions from the prospective study. However, the authors acknowledge the limitations of LLMs, particularly in providing specific and in-depth critique, and emphasize that human expert review remains essential for rigorous scientific evaluation. The section also discusses potential misuse of LLMs in scientific feedback and broader implications for research practices, concluding with limitations of the study and future research directions.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Supplementary Figure 3

Supplementary Figure 3 is a bar chart that illustrates the perceived helpfulness of LLM-based scientific feedback among participants with varying levels of publishing experience. The x-axis represents five categories of publishing experience: "No experience," "1-3 years," "3-5 years," "5-10 years," and "More than 10 years." The y-axis represents the proportion of respondents, ranging from 0.0 to 0.7. Five bars, each in a different color, represent the five levels of perceived helpfulness: "Not at all helpful," "Slightly helpful," "Moderately helpful," "Helpful," and "Highly helpful." Notably, the "Helpful" (green) and "Highly helpful" (purple) bars are consistently taller across all experience levels, indicating a generally positive perception of LLM feedback's helpfulness. For example, in the "No experience" category, approximately 60% of respondents found the feedback "Helpful," and around 10% found it "Highly helpful." Similar trends are observed in other experience categories, with "Helpful" consistently receiving the highest proportion of responses.

First Mention

Text: "Our analysis suggests that people from diverse educational backgrounds and publishing experience can find the LLM scientific feedback generation framework useful (Supp. Fig. 3,4)."

Context: This sentence appears in the first paragraph of the Discussion section, highlighting the broad applicability of the LLM feedback framework across different researcher demographics.

Relevance: Supplementary Figure 3 supports the claim that LLM-based scientific feedback is perceived as helpful by researchers with varying levels of publishing experience. This finding suggests that the LLM feedback framework can be a valuable tool for both novice and experienced researchers, potentially democratizing access to constructive feedback in scientific writing.

Critique
Visual Aspects
  • The figure is clear and easy to understand.
  • The labels are clear and concise, and the color scheme is effective.
  • However, the caption is separated from the figure by a large block of text. This could be improved by placing the caption directly below the figure.
Analytical Aspects
  • The figure effectively demonstrates that the perceived helpfulness of LLM feedback is consistent across different levels of publishing experience.
  • The lack of statistical tests or confidence intervals makes it difficult to assess the statistical significance of the observed differences between groups.
  • Further analysis could explore potential interactions between publishing experience and other factors, such as research field or professional status, to provide a more nuanced understanding of the LLM feedback's impact.
Numeric Data
  • Proportion of respondents with 'No experience' who found LLM feedback 'Helpful': 0.6
  • Proportion of respondents with 'No experience' who found LLM feedback 'Highly helpful': 0.1
Supplementary Figure 4

Supplementary Figure 4 is a bar chart that displays the perceived helpfulness of LLM-based scientific feedback across different professional statuses. The x-axis lists professional statuses including "Undergraduate Student," "Master Student," "Doctoral Student," "Postdoc," "Faculty or Academic Staff," and "Researcher in Industry." The y-axis represents the proportion of respondents, with values ranging from 0 to 0.8. For each professional status, there are five bars, each representing a level of perceived helpfulness, as indicated in the legend on the right. As in Supplementary Figure 3, the "Helpful" and "Highly helpful" categories generally receive higher proportions of responses across professional statuses, suggesting a positive perception of LLM feedback in scientific writing. For instance, among "Doctoral Students," approximately 65% found the feedback "Helpful," and around 10% found it "Highly helpful." This trend of "Helpful" being the most frequent response is consistent across most professional statuses.

First Mention

Text: "Our analysis suggests that people from diverse educational backgrounds and publishing experience can find the LLM scientific feedback generation framework useful (Supp. Fig. 3,4)."

Context: This sentence, also from the first paragraph of the Discussion section, emphasizes the inclusivity of the LLM feedback framework, suggesting its potential benefit for researchers across various career stages and roles.

Relevance: Supplementary Figure 4 complements Supplementary Figure 3 by demonstrating that the perceived helpfulness of LLM feedback extends across different professional statuses. This finding further supports the claim that LLM feedback can be a valuable tool for a diverse range of researchers, regardless of their career stage or position.

Critique
Visual Aspects
  • The figure is clear and easy to understand.
  • The labels are clear and concise, and the color scheme is effective.
  • The caption is, again, separated from the figure by a large block of text, which could be improved by placing the caption closer to the figure.
Analytical Aspects
  • The figure effectively illustrates that the perceived helpfulness of LLM feedback is generally consistent across different professional statuses.
  • The absence of statistical tests or confidence intervals limits the ability to draw definitive conclusions about the statistical significance of the observed differences.
  • Future research could investigate potential variations in the perceived helpfulness of LLM feedback within specific professional statuses, considering factors like research experience, publication record, or institutional affiliation.
Numeric Data
  • Proportion of 'Doctoral Students' who found LLM feedback 'Helpful': 0.65
  • Proportion of 'Doctoral Students' who found LLM feedback 'Highly helpful': 0.1

Methods

Overview

The Methods section outlines the data sources and procedures used in the study. It details the selection criteria and characteristics of the Nature Family Journals dataset and the ICLR dataset, both used for retrospective analysis. The section also describes the pipeline for generating scientific feedback using GPT-4, including PDF parsing, prompt construction, and feedback structure. Finally, it explains the retrospective comment matching pipeline, involving extractive text summarization and semantic text matching, and its validation through human verification.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Supplementary Table 1

Supplementary Table 1 provides a summary of the papers and their associated reviews sampled from 15 Nature family journals. It lists the journal name, the number of papers sampled from that journal, and the total number of reviews associated with those papers. For example, 773 papers were sampled from Nature with 2,324 associated reviews, and 810 papers were sampled from Nature Communications with 2,250 associated reviews. In total, the table includes data for 3,096 papers and 8,745 reviews across 15 Nature family journals.

First Mention

Text: "The first dataset, sourced from Nature family journals, includes 8,745 comments from human reviewers for 3,096 accepted papers across 15 Nature family journals, including Nature, Nature Biomedical Engineering, Nature Human Behaviour, and Nature Communications (Supp. Table 1, Methods)."

Context: This sentence, located in the third paragraph of the Results section, introduces the first dataset used in the retrospective analysis, which comprises papers and reviews from Nature family journals. It mentions Supplementary Table 1 as a source for more detailed information about the dataset.

Relevance: Supplementary Table 1 is relevant to the Methods section as it provides a detailed breakdown of the papers and reviews included in the Nature family journals dataset, which is one of the two main datasets used in the study's retrospective analysis. This table allows readers to understand the scope and composition of the dataset, including the number of papers and reviews from each journal, which is crucial for assessing the generalizability of the study's findings.

Critique
Visual Aspects
  • The table is clear, well-organized, and easy to read.
  • The use of clear headers and simple formatting enhances readability.
  • The table effectively communicates the intended information.
Analytical Aspects
  • The table provides a comprehensive overview of the Nature family journals dataset, including the number of papers and reviews from each journal.
  • The inclusion of a total row summarizing the counts for all journals facilitates easy understanding of the dataset's overall size.
  • The table does not provide information about the selection criteria for the papers or the distribution of papers across different research areas, which could be relevant for assessing the representativeness of the dataset.
Numeric Data
  • Number of papers sampled from Nature: 773
  • Number of reviews associated with papers from Nature: 2324
  • Number of papers sampled from Nature Communications: 810
  • Number of reviews associated with papers from Nature Communications: 2250
  • Total number of papers in the dataset: 3096
  • Total number of reviews in the dataset: 8745
Supplementary Table 2

Supplementary Table 2 summarizes the ICLR papers and their associated reviews sampled from the years 2022 and 2023, grouped by decision category. The table presents the number of papers and reviews for each decision category, including Accept (Oral), Accept (Spotlight), Accept (Poster), Reject after author rebuttal, and Withdrawn after reviews, for both ICLR 2022 and ICLR 2023. For example, in ICLR 2022, there were 55 accepted papers with oral presentations and 200 associated reviews, while in ICLR 2023, there were 90 accepted papers with oral presentations and 317 associated reviews. The table also provides the total number of papers and reviews for each year, with 820 papers and 3,168 reviews in ICLR 2022 and 889 papers and 3,337 reviews in ICLR 2023.

First Mention

Text: "Our second dataset was sourced from ICLR (International Conference on Learning Representations), a leading computer science venue on artificial intelligence (Supp. Table 2, Methods)."

Context: This sentence, found in the third paragraph of the Results section, introduces the second dataset used in the study, which consists of papers and reviews from the ICLR machine learning conference. It mentions Supplementary Table 2 as a source for more detailed information about this dataset.

Relevance: Supplementary Table 2 is relevant to the Methods section as it provides a detailed breakdown of the ICLR dataset used in the study's retrospective analysis. The table presents the number of papers and reviews for each decision category (accept with oral presentation, accept with spotlight presentation, accept with poster presentation, reject, and withdraw) for both ICLR 2022 and ICLR 2023. This information is crucial for understanding the composition of the dataset and for assessing the generalizability of the study's findings across different acceptance outcomes.

Critique
Visual Aspects
  • The table is well-organized and easy to read.
  • The headings are clear and informative.
  • The use of different columns for 2022 and 2023 data makes the comparison straightforward.
Analytical Aspects
  • The table provides a comprehensive overview of the ICLR dataset, including the number of papers and reviews for each decision category and year.
  • The table allows for a direct comparison of the number of papers and reviews between ICLR 2022 and ICLR 2023, which is helpful for understanding the growth of the conference.
  • The table does not provide information about the distribution of papers across different research areas within machine learning, which could be relevant for assessing the representativeness of the dataset.
Numeric Data
  • Number of accepted papers with oral presentations in ICLR 2022: 55
  • Number of reviews associated with accepted papers with oral presentations in ICLR 2022: 200
  • Number of accepted papers with oral presentations in ICLR 2023: 90
  • Number of reviews associated with accepted papers with oral presentations in ICLR 2023: 317
  • Total number of papers in ICLR 2022: 820
  • Total number of reviews in ICLR 2022: 3168
  • Total number of papers in ICLR 2023: 889
  • Total number of reviews in ICLR 2023: 3337
Supplementary Table 3

Supplementary Table 3 presents the results of human verification conducted on the retrospective comment extraction and matching pipeline. It is divided into two subtables: (a) Extractive Summarization and (b) Semantic Matching. Subtable (a) focuses on the accuracy of extracting comments from scientific feedback, reporting the counts of true positives (correctly extracted comments), false negatives (missed relevant comments), and false positives (incorrectly extracted or split comments). It also presents the calculated Precision, Recall, and F1 Score for this stage, with an F1 score of 0.968. Subtable (b) evaluates the accuracy of pairing extracted comments based on semantic similarity. It displays the counts of matches and mismatches between human judgment and the system's predictions, providing Precision, Recall, and F1 Score for this stage, with an F1 score of 0.824.

First Mention

Text: "We validated the pipeline’s accuracy through human verification, yielding an F1 score of 96.8% for extraction (Supp. Table 3a, Methods) and 82.4% for matching (Supp. Table 3b, Methods)."

Context: This sentence, located at the end of the third paragraph in the Results section, describes the human verification process used to validate the accuracy of the comment matching pipeline. It specifically mentions Supplementary Table 3a for extraction results and Supplementary Table 3b for matching results.

Relevance: Supplementary Table 3 is highly relevant to the Methods section as it provides detailed results of the human verification process used to validate the accuracy of the comment extraction and matching pipeline. This pipeline is a crucial component of the study's methodology, as it enables the comparison of LLM-generated feedback with human feedback. The table presents key performance metrics, including precision, recall, and F1 score, for both the extraction and matching stages, demonstrating the high accuracy of the pipeline and strengthening the reliability of the study's findings.

Critique
Visual Aspects
  • The table is generally well-organized and easy to understand.
  • The labeling is clear, and the division into subtables helps to separate the results for different stages of the process.
  • The use of abbreviations (TP, FN, FP) is standard in this context and unlikely to cause confusion.
Analytical Aspects
  • The table provides a clear and concise presentation of the human verification results, including both raw counts and calculated performance metrics.
  • The high F1 scores for both extraction (0.968) and matching (0.824) indicate that the pipeline is highly accurate in identifying and pairing relevant comments.
  • The table does not provide information about the inter-rater reliability of the human annotations, which would be helpful for assessing the consistency of the human judgments.
Numeric Data
  • F1 score for comment extraction: 0.968
  • F1 score for comment matching: 0.824
  • Precision for comment extraction: 0.977
  • Recall for comment extraction: 0.96
  • Precision for comment matching: 0.777
  • Recall for comment matching: 0.878
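The reported F1 scores follow directly from the precision and recall values listed above, since F1 is their harmonic mean; a quick arithmetic check:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.977, 0.960), 3))  # extraction stage -> 0.968
print(round(f1_score(0.777, 0.878), 3))  # matching stage   -> 0.824
```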
Supplementary Table 4

Supplementary Table 4 presents the mean token lengths of papers and human reviews in the two datasets used in the study: ICLR and Nature Family Journals. The table shows that ICLR papers have a mean token length of 5,841.46, while Nature Family Journal papers have a mean token length of 12,444.06. Similarly, human reviews for ICLR papers have a mean token length of 671.53, while those for Nature Family Journal papers have a mean token length of 1,337.93.

First Mention

Text: "This token limit exceeds the 5,841.46-token average of ICLR papers and covers over half of the 12,444.06-token average for Nature family journal papers (Supp. Table 4)."

Context: This sentence, found in the last paragraph of the Methods section, explains the rationale for using the initial 6,500 tokens of the extracted text to construct the prompt for GPT-4. It refers to Supplementary Table 4 for the mean token lengths of papers in both datasets, justifying the chosen token limit.

Relevance: Supplementary Table 4 is relevant to the Methods section as it provides context for the decision to use the initial 6,500 tokens of the extracted text to construct the prompt for GPT-4. The table shows the mean token lengths of papers in both the ICLR and Nature Family Journals datasets, highlighting the difference in length between the two. This information justifies the chosen token limit, as it exceeds the average length of ICLR papers and covers over half of the average length of Nature Family Journal papers, ensuring that a substantial portion of each paper is included in the prompt.

Critique
Visual Aspects
  • The table is clear and well-organized.
  • The labeling is straightforward and easy to understand.
  • The choice of a table format is appropriate for presenting this type of data.
Analytical Aspects
  • The table effectively conveys the difference in mean token lengths between ICLR and Nature Family Journal papers, providing justification for the chosen token limit.
  • The table only presents mean token lengths, without providing information about the distribution of token lengths or the standard deviation, which could be helpful for understanding the variability in paper lengths.
  • The table does not explain the method used for tokenization, which could be relevant for understanding how the token lengths were calculated.
Numeric Data
  • Mean token length of ICLR papers: 5841.46
  • Mean token length of Nature Family Journal papers: 12444.06
  • Mean token length of human reviews for ICLR papers: 671.53
  • Mean token length of human reviews for Nature Family Journal papers: 1337.93
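As an illustration of the 6,500-token prompt budget discussed above, the sketch below counts and truncates tokens with the `tiktoken` library; the choice of the `cl100k_base` encoding is an assumption made for illustration, since the table does not state which tokenizer produced the reported counts.

```python
import tiktoken

# Assumption: a GPT-4-style BPE tokenizer; cl100k_base is illustrative only.
encoding = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    """Token count of the kind averaged in Supplementary Table 4."""
    return len(encoding.encode(text))

def truncate_to_token_budget(paper_text: str, budget: int = 6500) -> str:
    """Keep only the first `budget` tokens of the parsed paper text, mirroring
    the 6,500-token prompt limit described in the Methods."""
    tokens = encoding.encode(paper_text)
    return encoding.decode(tokens[:budget])
```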
Supplementary Table 5

Supplementary Table 5 presents examples of comments extracted from both LLM (GPT-4) and human feedback on papers submitted to the ICLR conference, categorized by human coding. The table is organized by human coding aspects, such as 'Clarity and Presentation', 'Comparison to Previous Studies', 'Theoretical Soundness', 'Novelty', and 'Reproducibility'. For each category, the table provides both the human and GPT-4 generated comments, allowing for a direct comparison of the feedback provided by both sources. The comments highlight various aspects of the paper, such as writing quality, comparison with existing methods, theoretical rigor, novelty of the approach, and reproducibility of the results.

First Mention

Text: "Using our extractive text summarization pipeline, we extracted lists of comments from both the LLM and human feedback for each paper. Each comment was then annotated according to our predefined schema, identifying any of the 11 aspects it represented (Supp. Table 5,6,7)."

Context: This sentence, located in the last paragraph of the Methods section, describes the process of annotating comments extracted from LLM and human feedback according to a predefined schema of 11 key aspects. It mentions Supplementary Tables 5, 6, and 7 as examples of the annotated comments.

Relevance: Supplementary Table 5 is relevant to the Methods section as it provides examples of the annotated comments extracted from LLM and human feedback. The table showcases the different aspects of scientific papers that were considered in the annotation process, such as clarity and presentation, comparison to previous studies, theoretical soundness, novelty, and reproducibility. This information helps readers understand the scope and depth of the annotation schema and provides concrete examples of how the comments were categorized.

Critique
Visual Aspects
  • The table is well-organized and easy to read.
  • The use of clear headers and concise comments makes the information readily accessible.
  • The table could benefit from highlighting the key differences between the human and LLM comments for each aspect, making the comparison more explicit.
Analytical Aspects
  • The table provides a diverse range of examples, covering various aspects of scientific papers, which demonstrates the comprehensiveness of the annotation schema.
  • The comments themselves are insightful and provide valuable feedback on the papers, showcasing the potential of both human and LLM feedback for improving scientific writing.
  • The table does not provide information about the frequency of each aspect in the dataset, which would be helpful for understanding the relative importance of different aspects in scientific feedback.
Numeric Data
Supplementary Table 6

Supplementary Table 6 presents example comments extracted from both LLM (GPT-4) and human feedback on papers submitted to the ICLR conference. The table is organized by 'Human Coding', which categorizes the comments based on their focus: 'Add ablations experiments', 'Implications of the Research', or 'Ethical Aspects'. For each category, the table provides both the human and GPT-4 generated comments. This table illustrates the similarities and differences in feedback provided by humans and the LLM.

First Mention

Text: "null"

Context: null

Relevance: While Supplementary Table 6 is not explicitly mentioned in the Methods section, it is relevant as it provides examples of the types of comments that the comment matching pipeline would be processing. It showcases the diversity of feedback aspects and the nuances in language used by both humans and LLMs, which the pipeline needs to accurately capture and compare.

Critique
Visual Aspects
  • The table is clear and well-organized.
  • The use of separate columns for human and LLM feedback allows for easy comparison.
  • The table effectively communicates the different perspectives on paper review from human and LLM sources.
Analytical Aspects
  • The table provides valuable qualitative insights into the types of comments generated by both humans and LLMs.
  • The examples highlight the LLM's ability to identify similar concerns as human reviewers, such as the need for ablation experiments or the discussion of ethical implications.
  • The table also reveals differences in the level of detail and specificity between human and LLM comments, which is a key finding discussed in the Results section.
Numeric Data
Supplementary Table 7

Supplementary Table 7 presents example comments from both a large language model (LLM) and human reviewers regarding a scientific paper submitted to ICLR. The comments are categorized by human coding aspects such as 'Add ablations experiments', 'Implications of the Research', and 'Ethical Aspects'. Each row includes the source of the comment (Human or GPT-4) and the comment itself.

First Mention

Text: "null"

Context: null

Relevance: Similar to Supplementary Table 6, Supplementary Table 7, though not directly mentioned in the Methods section, is relevant as it provides additional examples of the comments that the comment matching pipeline would be analyzing. It further illustrates the range of feedback aspects covered by both human and LLM feedback and the variations in language and specificity.

Critique
Visual Aspects
  • The table is clearly structured, with distinct headers and easy-to-read text.
  • The categories provide useful context for the comments.
  • However, without the original paper, the comments lack sufficient context for a reader to fully understand them.
Analytical Aspects
  • The table offers further qualitative evidence of the LLM's ability to generate feedback that aligns with human concerns, such as the need for ablation studies or the consideration of ethical implications.
  • The examples highlight the LLM's capacity to identify potential issues that human reviewers might overlook, such as the lack of IRB approval information.
  • The table also reinforces the observation that LLM feedback can sometimes be less specific or actionable compared to human feedback, as seen in the examples related to 'Add ablations experiments'.
Numeric Data
Supplementary Figure 6

This flow diagram illustrates a three-stage pipeline designed to compare comments generated by a Large Language Model (LLM) with those from human reviewers. The pipeline begins with 'Language Model Feedback,' where key comments are extracted from the LLM's analysis of a scientific paper. The second stage, 'Language Model Summary,' condenses the extracted comments into a more succinct form. Finally, in the 'Feedback Matching' stage, the original feedback and the summarized feedback are compared using semantic similarity analysis, with matching points highlighted and a similarity rating assigned.
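Read as code, the three stages of the diagram amount to two extraction calls followed by one matching call; the sketch below is a hypothetical paraphrase, with `llm(prompt)` standing in for the GPT-4 calls rather than the authors' implementation, and the prompt wording abbreviated.

```python
import json

def match_feedback(llm_feedback: str, human_feedback: str, llm) -> dict:
    """Sketch of the three-stage pipeline in Supplementary Figure 6.

    `llm(prompt)` is a hypothetical helper that calls GPT-4 and returns a JSON
    string; the prompts are abbreviated, not the paper's exact templates.
    """
    # Stages 1-2: extractive summarization of each review into keyed comments.
    llm_comments = json.loads(llm(f"Extract the key concerns as JSON:\n{llm_feedback}"))
    human_comments = json.loads(llm(f"Extract the key concerns as JSON:\n{human_feedback}"))

    # Stage 3: semantic matching, returning a rationale and similarity rating
    # for each matched pair of comments.
    return json.loads(llm(
        "Match comments that raise the same concern; give a rationale and a "
        "similarity rating for each pair.\n"
        f"A: {json.dumps(llm_comments)}\nB: {json.dumps(human_comments)}"
    ))
```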

First Mention

Text: "We developed a retrospective comment matching pipeline to evaluate the overlap between feedback from LLM and human reviewers (Fig. 1b, Methods)."

Context: This sentence, located in the fourth paragraph of the Results section, introduces the retrospective comment matching pipeline and refers to Figure 1b for a visual representation of the process. However, Figure 1b is a simplified overview, and Supplementary Figure 6 provides a more detailed workflow of the pipeline.

Relevance: Supplementary Figure 6 is crucial for understanding the methodology used to compare LLM and human feedback. It visually details the steps involved in extracting comments, summarizing them, and matching them based on semantic similarity. This pipeline is central to the study's retrospective analysis, enabling the quantification of overlap between LLM and human feedback.

Critique
Visual Aspects
  • The flow diagram is generally clear and easy to follow, with distinct stages and connecting arrows.
  • However, it might benefit from a more visually appealing layout and the inclusion of specific examples for each stage.
  • Adding visual cues to differentiate between LLM and human feedback throughout the pipeline could enhance clarity.
Analytical Aspects
  • The figure clearly outlines the key steps involved in the comment matching process, providing a transparent view of the methodology.
  • The use of semantic similarity analysis for matching comments is a robust approach, capturing the meaning and intent behind the feedback rather than relying solely on keyword matching.
  • The inclusion of a similarity rating and justification for each matched pair adds a layer of granularity to the analysis, allowing for a more nuanced assessment of the overlap between LLM and human feedback.
Numeric Data
Supplementary Figure 7

Supplementary Figure 7 explores the robustness of hit rate measurements in ICLR data by controlling for the number of comments. Five bar graphs (a-e) depict hit rate comparisons for various categories of ICLR papers, considering factors like acceptance type (oral presentation, spotlight, poster) and post-review status (rejected, withdrawn). Each bar graph compares the hit rates of 'GPT-4 vs. Human,' 'Human vs. Human,' 'Human (w/o control) vs. Human,' and 'GPT-4 (shuffle) vs. Human.' The presence of error bars suggests the use of confidence intervals, likely at 95%, although the caption does not explicitly confirm this. Statistical significance is indicated using asterisks: *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001.
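The caption reportedly does not state how those intervals were computed; a paper-level bootstrap is one standard way to obtain a 95% confidence interval for a hit rate, sketched here as an assumption rather than the authors' procedure.

```python
import random

def bootstrap_hit_rate_ci(per_paper_counts, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for a hit rate via resampling papers with replacement.

    `per_paper_counts` is a list of (matched_comments, total_comments) pairs,
    one per paper -- a hypothetical layout, not the authors' data structure.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [per_paper_counts[rng.randrange(len(per_paper_counts))]
                  for _ in per_paper_counts]
        hits = sum(m for m, _ in sample)
        denom = sum(t for _, t in sample)
        stats.append(100.0 * hits / max(denom, 1))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```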

First Mention

Text: "The results, with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."

Context: This sentence, found in the eighth paragraph of the Methods section, refers to Supplementary Figure 7 as part of the robustness check performed to ensure that controlling for the number of comments does not significantly alter the hit rate results in the ICLR data.

Relevance: Supplementary Figure 7 is crucial for demonstrating the robustness of the study's findings regarding the overlap between LLM and human feedback. It shows that controlling for the number of comments in the ICLR data does not significantly affect the hit rate results, supporting the claim that the observed overlap is not simply an artifact of the number of comments generated.

Critique
Visual Aspects
  • The figure is well-organized and clearly labeled, with distinct categories and consistent labeling across the graphs.
  • The color scheme effectively distinguishes the different comparison groups.
  • The use of error bars and statistical significance markers enhances the interpretation of the results.
Analytical Aspects
  • The figure effectively demonstrates the robustness of the hit rate results across different categories of ICLR papers and with and without controlling for the number of comments.
  • The consistent pattern of higher hit rates for 'GPT-4 vs. Human' and 'Human vs. Human' compared to 'GPT-4 (shuffle) vs. Human' further supports the claim that LLM feedback is paper-specific and not simply generic.
  • The use of statistical significance markers provides a clear indication of the statistical robustness of the observed differences between groups.
Numeric Data
Supplementary Figure 8

Supplementary Figure 8 comprises nine bar graphs, labeled (a) to (i), showcasing a robustness check by controlling for the number of comments when measuring overlap within datasets from nine Nature family journals. Each graph depicts the 'Hit Rate (%)' across four scenarios: 'GPT-4 vs. Human,' 'Human vs. Human,' 'Human (w/o control) vs. Human,' and 'GPT-4 (shuffle) vs. Human.' The figure aims to demonstrate that controlling for the number of comments yields similar results to analyses without this control, suggesting that the observed overlap between LLM and human feedback is comparable to the overlap between different human reviewers. The caption further states that additional results for other Nature family journals are presented in Supplementary Figure 9 and that the error bars in the graphs represent 95% confidence intervals. Statistical significance is indicated by asterisks: * for P < 0.05, ** for P < 0.01, *** for P < 0.001, and **** for P < 0.0001.

First Mention

Text: "The results, with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."

Context: This sentence, also from the eighth paragraph of the Methods section, refers to Supplementary Figure 8 as part of the robustness check for the Nature family journals data. It highlights that the results are consistent with and without controlling for the number of comments, further supporting the reliability of the findings.

Relevance: Supplementary Figure 8 is essential for demonstrating the robustness of the study's findings in the Nature family journals data. It shows that controlling for the number of comments does not significantly alter the hit rate results, reinforcing the claim that the observed overlap between LLM and human feedback is comparable to the overlap between human reviewers.

Critique
Visual Aspects
  • The figure is generally clear and well-organized, with each graph clearly labeled and the statistical significance clearly marked.
  • The use of different colors for the bars representing different comparison groups could enhance readability.
  • The caption provides a comprehensive explanation of the figure's purpose, methodology, and results.
Analytical Aspects
  • The figure effectively demonstrates the consistency of the hit rate results across different Nature family journals and with and without controlling for the number of comments.
  • The similar hit rates observed for 'GPT-4 vs. Human' and 'Human vs. Human' across most journals further support the claim that LLM feedback is comparable to human feedback in terms of identifying similar concerns.
  • The use of statistical significance markers and the reporting of p-values provide a robust assessment of the statistical significance of the observed differences.
Numeric Data
Supplementary Figure 10

Supplementary Figure 10 consists of four scatter plots that examine the robustness of controlling for the number of comments when analyzing the correlation of hit rates between GPT-4 and human reviewers in predicting peer review outcomes. Each scatter plot compares the hit rate of GPT-4 versus human reviewers with the hit rate of human reviewers versus other human reviewers. Subplot (a) focuses on hit rates across various Nature family journals while controlling for the number of comments, showing a correlation coefficient (r) of 0.80 and a p-value of 3.69 x 10^-4. Subplot (b) examines hit rates across ICLR papers with different decision outcomes, also controlling for the number of comments, with an r-value of 0.98 and a p-value of 3.28 x 10^-3. Subplot (c) analyzes hit rates across Nature family journals without controlling for the number of comments (r = 0.75, P = 1.37 x 10^-3). Lastly, subplot (d) investigates hit rates across ICLR papers with different decision outcomes without controlling for the number of comments (r = 0.98, P = 2.94 x 10^-3). The circle size in each scatter plot represents the sample size for each data point.

First Mention

Text: "Results with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."

Context: This sentence, located in the eighth paragraph of the Methods section, refers to Supplementary Figure 10 as part of a robustness check to demonstrate that controlling for the number of comments yields similar results to analyses without this control when comparing hit rates between GPT-4 and human reviewers across different journals and decision outcomes.

Relevance: Supplementary Figure 10 is relevant to the Methods section as it provides a visual representation of the robustness check performed to validate the consistency of the study's findings regarding the correlation of hit rates between GPT-4 and human reviewers. It demonstrates that the observed correlations are not significantly affected by variations in the number of comments, strengthening the reliability of the study's conclusions.

Critique
Visual Aspects
  • The figure is well-organized and easy to understand, with each subplot clearly labeled and the axes appropriately scaled.
  • The use of different colors for different journals and decision outcomes in subplots (a) and (b) enhances readability and allows for easy comparison.
  • The inclusion of a diagonal dashed line representing a perfect match in each subplot provides a helpful visual reference for assessing the correlation between hit rates.
Analytical Aspects
  • The figure effectively demonstrates the robustness of the correlation between GPT-4 and human hit rates, both with and without controlling for the number of comments.
  • The inclusion of correlation coefficients (r) and p-values for each subplot provides a quantitative measure of the strength and significance of the observed correlations.
  • The figure could benefit from a more detailed explanation of the method used to control for the number of comments, such as specifying whether regression analysis or another statistical technique was employed.
Numeric Data
  • Correlation coefficient (r) for hit rates across Nature family journals, controlling for the number of comments: 0.8
  • P-value for hit rates across Nature family journals, controlling for the number of comments: 0.000369
  • Correlation coefficient (r) for hit rates across ICLR papers with different decision outcomes, controlling for the number of comments: 0.98
  • P-value for hit rates across ICLR papers with different decision outcomes, controlling for the number of comments: 0.00328
  • Correlation coefficient (r) for hit rates across Nature family journals without controlling for the number of comments: 0.75
  • P-value for hit rates across Nature family journals without controlling for the number of comments: 0.00137
  • Correlation coefficient (r) for hit rates across ICLR papers with different decision outcomes without controlling for the number of comments: 0.98
  • P-value for hit rates across ICLR papers with different decision outcomes without controlling for the number of comments: 0.00294
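The r and P values listed above are ordinary Pearson statistics, so they can be reproduced from per-journal (or per-decision-category) hit rates with `scipy.stats.pearsonr`; the arrays below are invented placeholders, not the figure's actual data.

```python
from scipy.stats import pearsonr

# Invented placeholder hit rates (%); the real per-journal values underlying
# Supp. Fig. 10a are not reproduced here.
gpt4_vs_human = [57.6, 52.1, 48.3, 61.0, 55.4]
human_vs_human = [30.9, 28.5, 27.1, 33.2, 31.0]

r, p = pearsonr(gpt4_vs_human, human_vs_human)
print(f"r = {r:.2f}, P = {p:.2e}")
```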
Supplementary Figure 12

Supplementary Figure 12 presents the prompt template used with GPT-4 to generate scientific feedback on papers from the Nature journal family dataset. The template instructs GPT-4 to draft a high-quality review outline for a Nature family journal, starting with "Review outline:" followed by four sections: "1. Significance and novelty," "2. Potential reasons for acceptance," "3. Potential reasons for rejection," and "4. Suggestions for improvement." For the "Potential reasons for rejection" section, the template emphasizes providing multiple key reasons with at least two sub-bullet points for each reason, detailing the arguments with specificity. Similarly, the "Suggestions for improvement" section encourages listing multiple key suggestions with detailed explanations. The template emphasizes providing thoughtful and constructive feedback in outline form only.
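A hedged sketch of how a template like the one described above might be filled in and sent to GPT-4 using the OpenAI Python client; the template wording is paraphrased from the description, and the model name and call parameters are illustrative assumptions rather than the paper's exact configuration.

```python
from openai import OpenAI

# Paraphrased from the template description above; the exact wording in
# Supplementary Figure 12 differs.
REVIEW_PROMPT = """Draft a high-quality review outline for a Nature family journal.
Start with "Review outline:" and cover four sections:
1. Significance and novelty
2. Potential reasons for acceptance
3. Potential reasons for rejection (multiple key reasons, each with at least
   two specific sub-bullet points)
4. Suggestions for improvement (multiple key suggestions, with details)
Provide thoughtful, constructive feedback in outline form only.

Paper text:
{paper_text}
"""

def generate_feedback(paper_text: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REVIEW_PROMPT.format(paper_text=paper_text)}],
    )
    return response.choices[0].message.content
```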

First Mention

Text: "This prompt is then fed into GPT-4, which generates the scientific feedback in a single pass. Further details and validations of the pipeline can be found in the Supplementary Information."

Context: This passage appears in the Methods section, where the prompt-based feedback generation pipeline is described.

Relevance: Supplementary Figure 12 is crucial for understanding the specific instructions provided to GPT-4 for generating scientific feedback. It reveals the structure and content requirements of the feedback, highlighting the emphasis on detailed explanations, multiple reasons for rejection, and specific suggestions for improvement. This information is essential for interpreting the quality and nature of the LLM-generated feedback analyzed in the study.

Critique
Visual Aspects
  • The figure is not visually present; only the caption is provided, which adequately describes the prompt template.
  • Presenting the prompt template as plain text within the caption is appropriate, as it allows for easy readability and understanding of the instructions.
  • Including a visual representation of the prompt template, such as a screenshot of the input interface, could further enhance clarity and provide a more concrete understanding of how the prompt is presented to GPT-4.
Analytical Aspects
  • The prompt template demonstrates a well-structured approach to eliciting scientific feedback from GPT-4, covering key aspects of a peer review.
  • The emphasis on detailed explanations, multiple reasons for rejection, and specific suggestions for improvement encourages the LLM to generate comprehensive and constructive feedback.
  • The template could benefit from further refinement to encourage GPT-4 to provide more specific and actionable feedback, potentially by incorporating examples or prompting for specific types of evidence or analysis.
Numeric Data
Supplementary Figure 13

Supplementary Figure 13 displays the prompt template used with GPT-4 for extractive text summarization of comments in both LLM-generated and human-written feedback. The template instructs GPT-4 to identify the key concerns raised in a review, focusing specifically on potential reasons for rejection. It requires the analysis to be presented in JSON format, including a concise summary and the exact wording from the review for each concern. The template provides an example JSON format, illustrating how to structure the output with numbered keys representing each concern and corresponding values containing a summary and verbatim quote. It emphasizes ignoring minor issues like typos and clarifications and instructs GPT-4 to output only the JSON data.
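Based on that description, the extraction output is numbered JSON with a concise summary and a verbatim quote per concern; the snippet below shows one plausible shape and how it would be parsed (the key names and the concerns themselves are invented for illustration).

```python
import json

# Illustrative extraction output in the shape described above; the key names
# ("summary", "verbatim") and the concerns are invented.
example_output = """{
  "1": {"summary": "Missing baseline comparisons",
        "verbatim": "The paper does not compare against recent baselines ..."},
  "2": {"summary": "Unclear theoretical justification",
        "verbatim": "The derivation in Section 3 omits key assumptions ..."}
}"""

for key, concern in json.loads(example_output).items():
    print(f"Concern {key}: {concern['summary']}")
```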

First Mention

Text: "The pipeline first performs extractive text summarization (refs. 34–37) to extract the comments from both LLM and human-written feedback."

Context: This sentence appears in the Methods section's description of the retrospective comment matching pipeline.

Relevance: Supplementary Figure 13 is essential for understanding how comments are extracted from both LLM and human feedback for subsequent analysis. It reveals the specific instructions given to GPT-4 for identifying key concerns, focusing on potential reasons for rejection, and structuring the output in JSON format. This information is crucial for interpreting the accuracy and reliability of the comment extraction process and the subsequent comparison between LLM and human feedback.

Critique
Visual Aspects
  • The figure is not visually present; only the caption is provided, which adequately describes the prompt template.
  • Presenting the prompt template as plain text within the caption is appropriate, as it allows for easy readability and understanding of the instructions.
  • Including a visual representation of the prompt template, such as a screenshot of the input interface or an example JSON output, could further enhance clarity and provide a more concrete understanding of the extraction process.
Analytical Aspects
  • The prompt template focuses on extracting key concerns, particularly potential reasons for rejection, which aligns with the study's objective of evaluating the LLM's ability to provide constructive feedback.
  • The requirement for JSON format ensures structured and machine-readable output, facilitating automated analysis and comparison of comments.
  • The template could benefit from further refinement to encourage GPT-4 to extract more specific and actionable comments, potentially by prompting for specific types of evidence or analysis related to the identified concerns.
Numeric Data
Supplementary Figure 14

Supplementary Figure 14 presents the prompt template used with GPT-4 for semantic text matching to identify shared comments between two sets of feedback. The input consists of two JSON files containing extracted comments from the previous step, one for each review being compared (either LLM vs. human or human vs. human). GPT-4 is instructed to match points with a significant degree of similarity in their concerns, avoiding superficial similarities or weak connections. For each matched pair, GPT-4 is asked to provide a rationale explaining the match and rate the similarity on a scale of 5 to 10, with detailed descriptions for each rating level. The output is expected in JSON format, with each key representing a matched pair (e.g., "A1-B2") and the corresponding value containing the rationale and similarity rating. If no match is found, an empty JSON object should be output.
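Correspondingly, the matcher's output can be parsed and filtered on the similarity rating; the snippet below is illustrative (the pair contents are invented) and uses the minimum validity threshold of 7 reported under Numeric Data later in this entry.

```python
import json

# Illustrative matcher output in the format described above; keys such as
# "A1-B2" pair comment A1 from one review with comment B2 from the other.
# The rationales and ratings are invented.
matcher_output = """{
  "A1-B2": {"rationale": "Both ask for ablations isolating the new loss term.",
            "similarity": 8},
  "A3-B1": {"rationale": "Only loosely related writing concerns.",
            "similarity": 5}
}"""

MIN_VALID_RATING = 7  # validity threshold reported under Numeric Data below

matches = json.loads(matcher_output)
valid_pairs = {pair: m for pair, m in matches.items()
               if m["similarity"] >= MIN_VALID_RATING}
print(valid_pairs)  # only "A1-B2" passes the threshold
```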

First Mention

Text: "It then applies semantic text matching (refs. 38–40) to identify shared comments between the two feedback sources."

Context: This sentence follows the extractive summarization step in the Methods section's description of the comment matching pipeline.

Relevance: Supplementary Figure 14 is crucial for understanding how shared comments are identified between LLM and human feedback. It reveals the specific instructions given to GPT-4 for matching comments based on semantic similarity, avoiding superficial matches, providing rationales for each match, and rating the similarity level. This information is essential for interpreting the accuracy and reliability of the comment matching process and the subsequent analysis of overlap between LLM and human feedback.

Critique
Visual Aspects
  • The figure is not visually present; only the caption is provided, which adequately describes the prompt template.
  • Presenting the prompt template as plain text within the caption is appropriate, as it allows for easy readability and understanding of the instructions.
  • Including a visual representation of the prompt template, such as a screenshot of the input interface or an example JSON output with matched pairs, could further enhance clarity and provide a more concrete understanding of the matching process.
Analytical Aspects
  • The prompt template emphasizes matching comments based on significant similarity in concerns, avoiding superficial matches, which enhances the reliability of the overlap analysis.
  • The requirement for rationales for each match provides transparency and allows for human verification of the matching process.
  • The similarity rating scale with detailed descriptions for each level provides a nuanced measure of the degree of overlap between comments, allowing for a more fine-grained analysis of the LLM's performance.
Numeric Data
  • Minimum similarity rating for a match to be considered valid: 7