This research investigates the potential of large language models (LLMs), specifically GPT-4, to provide useful feedback on research papers. Through a retrospective analysis comparing GPT-4's feedback with human peer reviews and a prospective user study, the study examines the reliability and credibility of LLM-generated scientific feedback.
Description: A flow diagram illustrating the study's methodology, including LLM feedback generation, comparison with human feedback, and the user study process.
Relevance: Provides a clear visual overview of the research process and the different stages involved in evaluating LLM-generated feedback.
Description: A scatter plot comparing the frequency of different feedback aspects provided by GPT-4 and human reviewers, highlighting discrepancies in emphasis.
Relevance: Visually demonstrates the LLM's tendency to focus on certain aspects of feedback more than humans, revealing potential strengths and weaknesses in its evaluation capabilities.
This research demonstrates the potential of LLMs, particularly GPT-4, as a valuable tool for providing scientific feedback. While LLM feedback can be helpful and align with human feedback in many aspects, it is crucial to acknowledge its current limitations, particularly in providing specific and in-depth critique. Human expert review remains essential for rigorous scientific evaluation, and future research should focus on addressing the identified limitations and exploring the ethical implications of using LLMs in this context. The findings suggest that LLMs and human feedback can complement each other, potentially transforming research practices and enhancing the efficiency and accessibility of scientific feedback mechanisms.
This abstract presents a large-scale study investigating the potential of large language models (LLMs), specifically GPT-4, to provide useful feedback on research papers. The study involved two parts: a retrospective analysis comparing GPT-4's feedback with human peer reviews and a prospective user study assessing researcher perceptions of GPT-4 generated feedback. The results indicate a significant overlap between GPT-4 and human feedback, comparable to the overlap between two human reviewers, suggesting the potential utility of LLMs in augmenting scientific feedback mechanisms.
The abstract clearly states the research question: can LLMs provide useful feedback on research papers? This focus allows for a targeted and well-defined study.
The abstract outlines a robust methodology involving both retrospective analysis and a prospective user study. This two-pronged approach provides both quantitative and qualitative data to assess the effectiveness of LLM feedback.
The abstract highlights key findings, including the significant overlap between LLM and human feedback and the positive user perceptions. These results effectively demonstrate the potential of LLMs in this domain.
While the abstract mentions limitations, it could benefit from a more detailed discussion of the specific challenges encountered with GPT-4, such as its tendency to focus on certain aspects of feedback and struggles with in-depth critique.
Rationale: A more thorough discussion of limitations would strengthen the abstract's objectivity and provide a more balanced perspective on the findings.
Implementation: Include specific examples of limitations observed in the study, such as instances where GPT-4 provided generic or superficial feedback. This would provide readers with a clearer understanding of the current capabilities and limitations of LLMs in this context.
The abstract could benefit from a brief mention of the ethical considerations surrounding the use of LLMs for scientific feedback, such as potential biases in the model or the risk of over-reliance on automated feedback.
Rationale: Addressing ethical implications is crucial for responsible AI development and deployment, especially in a sensitive domain like scientific evaluation.
Implementation: Add a sentence acknowledging the importance of ethical considerations and outlining the need for further research on potential biases and responsible use guidelines for LLM-based feedback systems.
The abstract briefly mentions future directions but could expand on specific research avenues, such as exploring other LLMs, fine-tuning models on scientific feedback datasets, or integrating visual LLMs for a more comprehensive analysis.
Rationale: A more detailed discussion of future directions would highlight the study's contribution to ongoing research and inspire further exploration in this promising area.
Implementation: Include specific examples of future research avenues, such as investigating the effectiveness of different LLM architectures, developing methods for mitigating potential biases, or exploring the use of LLMs for other aspects of the scientific process, like grant proposal evaluation or literature review.
The introduction section of this research paper establishes the importance of feedback in scientific research, highlighting the challenges posed by the increasing volume of publications and specialization of knowledge. It then introduces large language models (LLMs) as a potential solution to these challenges, specifically focusing on GPT-4, and outlines the study's objective to systematically analyze the reliability and credibility of LLM-generated scientific feedback.
The introduction effectively establishes the need for new feedback mechanisms by highlighting the limitations of traditional approaches and the growing challenges in scientific evaluation.
The authors clearly identify a gap in the existing literature regarding the systematic evaluation of LLMs for scientific feedback, justifying the need for their study.
The introduction clearly outlines the scope of the study, focusing on the reliability and credibility of LLM-generated feedback and specifying the datasets and methods used.
While the introduction acknowledges the potential of LLMs, it could benefit from a more explicit discussion of the ethical implications of using LLMs for scientific feedback.
Rationale: Addressing ethical considerations is crucial for responsible AI development and deployment, especially in a domain as sensitive as scientific evaluation. It would demonstrate the authors' awareness of potential biases, misuse, and the impact on the scientific community.
Implementation: Include a paragraph discussing potential ethical concerns, such as biases in LLM training data, the risk of plagiarism or manipulation, and the importance of transparency and accountability in using LLM-generated feedback. This could also highlight the need for guidelines and best practices for responsible use.
The introduction briefly mentions the unknown "perils" of LLMs but could benefit from a more detailed discussion of their limitations, particularly in the context of scientific feedback.
Rationale: Acknowledging the limitations of LLMs would provide a more balanced perspective and set realistic expectations for the study's findings. It would also highlight the need for further research and development to address these limitations.
Implementation: Include a paragraph discussing specific limitations of LLMs, such as their potential for generating generic or superficial feedback, their inability to understand complex scientific concepts, and their reliance on the quality and representativeness of training data. This could also mention the need for human oversight and the importance of combining LLM feedback with expert human judgment.
The introduction mentions the challenges faced by marginalized researchers in accessing feedback but could strengthen the connection between this issue and the potential of LLMs to address it.
Rationale: Explicitly linking LLMs to the issue of scientific inequalities would highlight the potential societal impact of this research and emphasize the importance of developing inclusive and equitable feedback mechanisms.
Implementation: Expand on the discussion of scientific inequalities, emphasizing how LLMs could provide more accessible and equitable feedback opportunities for researchers from underrepresented groups. This could include examples of how LLMs could overcome geographical barriers, language barriers, or biases in traditional feedback systems.
The Results section details the methodology and findings of two evaluations: a retrospective analysis comparing GPT-4 generated feedback with human peer reviews and a prospective user study assessing researcher perceptions of the LLM feedback. The retrospective analysis found significant overlap between GPT-4 and human feedback, comparable to inter-reviewer overlap. The user study revealed that researchers found the LLM feedback helpful, aligning with human feedback, and offering novel perspectives.
The section employs a rigorous methodology, including a two-stage comment matching pipeline with human verification for both extraction and matching, ensuring accuracy in comparing LLM and human feedback.
The section provides a thorough analysis of the overlap between LLM and human feedback, considering various factors like journal type, decision outcome, number of reviewers raising a comment, and comment position in the sequence.
The inclusion of a prospective user study provides valuable insights into researcher perceptions of LLM feedback, complementing the quantitative analysis with qualitative data on helpfulness, specificity, and perceived benefits.
While the section mentions the LLM generating novel feedback not raised by human reviewers, it relies on anecdotal evidence from user comments. Quantifying this novelty would strengthen the claim.
Rationale: Providing a quantitative measure of novel feedback would provide a more objective assessment of this aspect of LLM capability and its potential value.
Implementation: Develop a method to identify and quantify comments from LLM feedback that are not present in human reviews. This could involve comparing the semantic content of comments or using a clustering approach to identify unique themes in LLM feedback.
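A minimal sketch of such a comparison, assuming the extracted comments are available as plain-text lists and using TF-IDF cosine similarity as a simple stand-in for a stronger semantic encoder (the threshold and helper name are illustrative, not from the study):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def novel_llm_comments(llm_comments, human_comments, threshold=0.35):
    """Return LLM comments whose closest human comment falls below a
    similarity threshold, i.e., candidate 'novel' feedback."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(llm_comments + human_comments)  # shared vocabulary
    llm_vecs = matrix[: len(llm_comments)]
    human_vecs = matrix[len(llm_comments):]
    similarities = cosine_similarity(llm_vecs, human_vecs)  # shape: (n_llm, n_human)
    return [c for c, row in zip(llm_comments, similarities) if row.max() < threshold]

llm = ["Discuss broader implications of the method.", "Add an ablation on layer depth."]
human = ["An ablation over network depth would strengthen the claims."]
print(novel_llm_comments(llm, human))  # prints LLM comments with no close human counterpart
```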
The section acknowledges limitations in the specificity and actionability of LLM feedback. Further investigation into the causes and potential solutions for this limitation is warranted.
Rationale: Understanding why LLM feedback sometimes lacks specificity is crucial for improving the system and making it more useful for researchers.
Implementation: Analyze the characteristics of LLM feedback that is perceived as generic or vague. Explore techniques like prompt engineering, fine-tuning on specific scientific domains, or incorporating external knowledge sources to enhance the specificity and actionability of LLM feedback.
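As a small illustration of the prompt-engineering direction, a refined prompt could require each criticism to be anchored to a specific passage and paired with a concrete action; the template below is a hypothetical sketch, not the prompt used in the study:

```python
SPECIFICITY_PROMPT = """You are reviewing the paper below. For each criticism you raise:
1. Quote the specific sentence, figure, table, or equation it refers to.
2. Explain in 2-3 sentences why it is a problem.
3. Propose one concrete, actionable change the authors could make.
Avoid generic advice that could apply to any paper.

Paper text:
{paper_text}
"""

def build_specificity_prompt(paper_text: str) -> str:
    """Fill the hypothetical specificity-oriented template with the parsed paper text."""
    return SPECIFICITY_PROMPT.format(paper_text=paper_text)
```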
The section mentions the potential of LLM feedback to benefit researchers lacking access to timely feedback, but doesn't explore this impact in detail. Further investigation is needed to assess its effectiveness in mitigating scientific inequalities.
Rationale: Understanding the potential of LLM feedback to address scientific inequalities is crucial for ensuring equitable access to feedback and promoting inclusivity in scientific research.
Implementation: Conduct targeted studies to evaluate the effectiveness of LLM feedback for researchers from underrepresented groups. This could involve comparing the quality and impact of LLM feedback with traditional feedback mechanisms for these researchers and assessing its role in promoting participation and success in scientific endeavors.
Figure 1 provides a visual overview of the study's methodology for evaluating the ability of Large Language Models (LLMs) to generate helpful feedback on scientific papers. The figure is a three-part flow diagram, illustrating the processes of: (a) generating LLM feedback from a raw PDF, (b) comparing this LLM feedback with human feedback, and (c) conducting a user study to evaluate the feedback. Each part of the diagram is further broken down into steps, visually represented by boxes and arrows, with key terms and concepts highlighted within each step. For instance, section (a) showcases a four-step pipeline that begins with processing a 'Raw PDF' and ends with the generation of structured 'Feedback,' including steps like 'Parsed PDF' and using a 'Prompt.' Section (b) emphasizes the comparison between 'Language Model Feedback' and 'Human Feedback' through summarized text and a dedicated 'Matched Feedback' area highlighting commonalities. Lastly, section (c) illustrates the process of 'User Recruiting' for a user study aimed at assessing the generated feedback.
Text: "Specifically, we developed a GPT-4 based scientific feedback generation pipeline that takes the raw PDF of a paper and produces structured feedback (Fig. 1a)."
Context: This sentence appears in the second paragraph of the Introduction section, where the authors are outlining their approach to analyzing the reliability and credibility of LLM-generated scientific feedback.
Relevance: Figure 1 is crucial for understanding the study's methodology. It visually outlines the steps involved in generating LLM feedback, comparing it to human feedback, and conducting a user study, providing a clear roadmap for the research process.
Figure 3 is a scatter plot comparing the frequency of different feedback aspects provided by GPT-4 to those provided by human reviewers. The x-axis represents the log frequency ratio between GPT-4 and human feedback, with positive values indicating that GPT-4 comments on that aspect more frequently than humans. The figure highlights that GPT-4 emphasizes certain aspects more than human reviewers, commenting on 'Implications of the Research' 7.27 times more frequently, while it is 10.69 times less likely to comment on 'Add experiments on more datasets'. The size of each circle represents the prevalence of that aspect in human feedback.
Text: "Fig. 3 presents the relative frequency of each of the 11 aspects of comments raised by humans and LLM."
Context: This sentence, found in the fifth paragraph of the Results section, introduces Figure 3 as a visual representation of the differences in emphasis between LLM and human feedback on specific aspects of scientific papers.
Relevance: Figure 3 directly supports the section's discussion on the varying emphasis placed on different aspects of feedback by LLMs and human reviewers. It visually demonstrates the discrepancies in frequency ratios for specific aspects, highlighting the potential for complementary roles between human and AI feedback.
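For context, the log frequency ratio plotted on such an axis can be computed per aspect from normalized comment counts; a minimal sketch with made-up counts (the numbers are illustrative, not the study's data):

```python
import math

# Illustrative aspect counts (not the study's actual data).
gpt4_counts = {"Implications of the Research": 120, "Add experiments on more datasets": 6}
human_counts = {"Implications of the Research": 30, "Add experiments on more datasets": 110}

gpt4_total = sum(gpt4_counts.values())
human_total = sum(human_counts.values())

for aspect in gpt4_counts:
    ratio = (gpt4_counts[aspect] / gpt4_total) / (human_counts[aspect] / human_total)
    # Positive log-ratio: GPT-4 raises the aspect relatively more often than humans.
    print(f"{aspect}: ratio={ratio:.2f}, log-ratio={math.log(ratio):.2f}")
```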
Figure 4 presents the results of a human study on LLM and human review feedback (n = 308). The figure consists of eight horizontal bar charts (a-h) representing survey responses related to the helpfulness, specificity, perceived impact, and likelihood of future use of LLM-generated feedback compared to human feedback. Each chart presents a different question or statement, and the bars within each chart represent the percentage of respondents who chose each response option. The charts consistently use a blue color scheme, with darker shades of blue indicating a higher percentage of responses. The caption provides detailed information on each subfigure (a-h) and the overall findings of the study.
Text: "The results from the user study are illustrated in Fig. 4."
Context: This sentence, located in the sixth paragraph of the Results section, introduces Figure 4 as a visual representation of the findings from the user study conducted to assess researcher perceptions of LLM-generated feedback.
Relevance: Figure 4 is central to the section's discussion on the prospective user study. It visually presents the survey responses, providing insights into researcher perceptions of the helpfulness, specificity, and potential impact of LLM feedback compared to human feedback. The figure directly supports the claim that researchers generally find LLM feedback helpful and see its potential for improving the scientific review process.
Supplementary Figure 1 presents two bar graphs (a and b) comparing the overlap between comments generated by GPT-4 and those made by human reviewers. The caption provides context for the graphs, stating that they depict the fraction of GPT-4's comments that align with at least one human reviewer's feedback. The graphs are divided into two categories: 'Nature Journals' (a) and 'ICLR' (b), presumably representing different publication sources. Both graphs feature the same x-axis categories: 'GPT-4 vs. All Human' and 'GPT-4 (shuffle) vs. All Human'. The 'GPT-4 (shuffle)' category likely refers to a baseline condition where GPT-4's comments have been randomly reassigned to different papers. The y-axis, labeled 'Global Hit Rate (%)', measures the percentage of overlapping comments. Notably, the error bars, representing 95% confidence intervals, are remarkably small, indicating a high level of precision in the measurements.
Text: "More than half (57.55%) of the comments raised by GPT-4 were raised by at least one human reviewer (Supp. Fig. 1a)."
Context: This sentence, found in the third paragraph of the Results section, introduces Supplementary Figure 1a as evidence of the significant overlap between LLM feedback and human feedback in the Nature family journal data.
Relevance: Supplementary Figure 1 directly supports the section's claim that LLM feedback significantly overlaps with human-generated feedback. It visually demonstrates the high percentage of GPT-4 comments that align with at least one human reviewer's feedback, both in the Nature family journal data and the ICLR data. The figure also highlights the significant drop in overlap when GPT-4 feedback is shuffled, indicating that the LLM is not simply generating generic comments.
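A minimal sketch of the kind of shuffle control described here, assuming comments are stored per paper and that a `hit_rate(llm_comments, human_comments)` function from the matching pipeline is available (both names are placeholders); it averages per-pairing hit rates as a simple stand-in for the paper's global hit rate:

```python
import random

def shuffled_hit_rate(llm_by_paper, human_by_paper, hit_rate, seed=0):
    """Pair each paper's GPT-4 comments with a different, randomly chosen
    paper's human reviews, then average the resulting hit rates."""
    rng = random.Random(seed)
    papers = list(llm_by_paper)
    shuffled = papers[:]
    rng.shuffle(shuffled)
    while any(src == dst for src, dst in zip(papers, shuffled)):  # avoid self-pairing
        rng.shuffle(shuffled)
    rates = [hit_rate(llm_by_paper[src], human_by_paper[dst])
             for src, dst in zip(papers, shuffled)]
    return sum(rates) / len(rates)
```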
Supplementary Figure 2 presents a retrospective evaluation of GPT-4's performance using alternative set-overlap metrics as a robustness check. The figure is organized into eight bar graphs: (a) and (b) illustrate the hit rate, (c) and (d) the Szymkiewicz–Simpson overlap coefficient, (e) and (f) the Jaccard index, and (g) and (h) the Sørensen–Dice coefficient. Each graph compares four scenarios: GPT-4 versus human, human versus human, human without control versus human, and shuffled GPT-4 output versus human. All graphs use the same x-axis labels, and the error bars represent 95% confidence intervals.
Text: "Results were consistent across other overlapping metrics including Szymkiewicz–Simpson overlap coefficient, Jaccard index, Sørensen–Dice coefficient (Supp. Fig. 2)."
Context: This sentence, appearing in the third paragraph of the Results section, refers to Supplementary Figure 2 as evidence that the observed overlap between LLM and human feedback is consistent across various set overlap metrics, not just the hit rate.
Relevance: Supplementary Figure 2 strengthens the study's findings by demonstrating the robustness of the overlap results across different set overlap metrics. It shows that the comparable overlap between GPT-4 and human feedback, as well as the significant drop in overlap with shuffled GPT-4 feedback, is not limited to the hit rate but holds true for other established metrics like the Szymkiewicz–Simpson overlap coefficient, Jaccard index, and Sørensen–Dice coefficient.
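For reference, the overlap metrics named here reduce to simple set formulas once each side's comments are represented by the IDs of matched items; a minimal sketch with toy ID sets (the matching itself comes from the pipeline described in the Methods):

```python
def hit_rate(a, b):
    """Fraction of A's comments that have a counterpart in B."""
    return len(a & b) / len(a)

def overlap_coefficient(a, b):  # Szymkiewicz–Simpson
    return len(a & b) / min(len(a), len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def sorensen_dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

# Toy example: shared IDs stand in for matched comment pairs.
gpt4 = {"c1", "c2", "c3", "c4"}
human = {"c2", "c3", "c5", "c6", "c7"}
print(hit_rate(gpt4, human))             # 0.50
print(overlap_coefficient(gpt4, human))  # 0.50
print(jaccard(gpt4, human))              # ~0.29
print(sorensen_dice(gpt4, human))        # ~0.44
```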
The Discussion section summarizes the study's findings, emphasizing the potential of LLMs, specifically GPT-4, for providing helpful scientific feedback. It highlights the significant overlap between LLM-generated feedback and human peer reviews, as well as the positive user perceptions from the prospective study. However, the authors acknowledge the limitations of LLMs, particularly in providing specific and in-depth critique, and emphasize that human expert review remains essential for rigorous scientific evaluation. The section also discusses potential misuse of LLMs in scientific feedback and broader implications for research practices, concluding with limitations of the study and future research directions.
The section provides a balanced perspective on the potential and limitations of LLMs for scientific feedback. It acknowledges the promising findings while also highlighting the current limitations and emphasizing the continued importance of human expertise.
The section offers clear recommendations for the responsible use of LLMs in scientific feedback, advocating for their use as a tool for manuscript improvement before submission and cautioning against replacing human review with automated feedback generation.
The section goes beyond summarizing the study's findings and engages in a thoughtful discussion of the broader implications of LLMs for research practices, envisioning potential paradigm shifts in how research is conducted, collaborated on, and evaluated.
While the section briefly mentions potential misuse, a more in-depth discussion of ethical considerations is warranted. This could include exploring potential biases in LLM training data, the risk of plagiarism or manipulation of feedback, and the impact on reviewer workload and recognition.
Rationale: A more comprehensive discussion of ethical implications would demonstrate the authors' awareness of the potential societal impact of this technology and contribute to responsible AI development and deployment in the scientific community.
Implementation: Dedicate a paragraph to exploring ethical considerations, drawing on existing literature on AI ethics and responsible AI principles. This could include discussing the need for transparency in using LLM feedback, guidelines for mitigating potential biases, and strategies for ensuring fair recognition of both human and AI contributions in the review process.
The section outlines several future research directions but could benefit from more specific and actionable suggestions. For example, it could discuss specific strategies for fine-tuning LLMs on scientific feedback datasets, methods for integrating visual LLMs, or approaches for addressing error detection and correction.
Rationale: Providing more concrete suggestions for future research would guide further exploration in this area and facilitate the development of more advanced and reliable LLM-based feedback systems.
Implementation: Expand on each future research direction with specific examples of potential research questions, methodologies, and datasets. This could include discussing the use of reinforcement learning to train LLMs for specific feedback tasks, exploring the development of multimodal LLMs that can analyze both text and visual elements, or investigating the use of LLMs for identifying and correcting specific types of errors in scientific papers.
While the section mentions the potential of LLMs to benefit researchers lacking access to timely feedback, it doesn't delve into the potential impact on scientific inequalities. Further discussion is needed on how to ensure equitable access to LLM feedback and mitigate potential biases that could exacerbate existing inequalities.
Rationale: Addressing the potential impact on scientific inequalities is crucial for ensuring that LLM feedback promotes inclusivity and doesn't inadvertently reinforce existing power imbalances in the scientific community.
Implementation: Discuss potential strategies for ensuring equitable access to LLM feedback, such as developing open-source tools, providing subsidized access for researchers from underresourced institutions, and translating LLM feedback into multiple languages. Additionally, address the need for ongoing research to monitor and mitigate potential biases in LLM feedback that could disadvantage certain groups of researchers.
Supplementary Figure 3 is a bar chart that illustrates the perceived helpfulness of LLM-based scientific feedback among participants with varying levels of publishing experience. The x-axis represents five categories of publishing experience: "No experience," "1-3 years," "3-5 years," "5-10 years," and "More than 10 years." The y-axis represents the proportion of respondents, ranging from 0.0 to 0.7. Five bars, each in a different color, represent the five levels of perceived helpfulness: "Not at all helpful," "Slightly helpful," "Moderately helpful," "Helpful," and "Highly helpful." Notably, the "Helpful" (green) and "Highly helpful" (purple) bars are consistently taller across all experience levels, indicating a generally positive perception of LLM feedback's helpfulness. For example, in the "No experience" category, approximately 60% of respondents found the feedback "Helpful," and around 10% found it "Highly helpful." Similar trends are observed in other experience categories, with "Helpful" consistently receiving the highest proportion of responses.
Text: "Our analysis suggests that people from diverse educational backgrounds and publishing experience can find the LLM scientific feedback generation framework useful (Supp. Fig. 3,4)."
Context: This sentence appears in the first paragraph of the Discussion section, highlighting the broad applicability of the LLM feedback framework across different researcher demographics.
Relevance: Supplementary Figure 3 supports the claim that LLM-based scientific feedback is perceived as helpful by researchers with varying levels of publishing experience. This finding suggests that the LLM feedback framework can be a valuable tool for both novice and experienced researchers, potentially democratizing access to constructive feedback in scientific writing.
Supplementary Figure 4 is a bar chart that displays the perceived helpfulness of LLM-based scientific feedback across different professional statuses. The x-axis lists professional statuses, including "Undergraduate Student," "Master Student," "Doctoral Student," "Postdoc," "Faculty or Academic Staff," and "Researcher in Industry." The y-axis represents the proportion of respondents, with values ranging from 0 to 0.8. For each professional status, there are five bars, each representing a level of perceived helpfulness, as indicated in the legend on the right. Similar to Supplementary Figure 3, the "Helpful" and "Highly helpful" categories generally receive higher proportions of responses across the different professional statuses, suggesting a positive perception of LLM feedback in scientific writing. For instance, among "Doctoral Students," approximately 65% found the feedback "Helpful," and around 10% found it "Highly helpful." This trend of "Helpful" being the most frequent response is consistent across most professional statuses.
Text: "Our analysis suggests that people from diverse educational backgrounds and publishing experience can find the LLM scientific feedback generation framework useful (Supp. Fig. 3,4)."
Context: This sentence, also from the first paragraph of the Discussion section, emphasizes the inclusivity of the LLM feedback framework, suggesting its potential benefit for researchers across various career stages and roles.
Relevance: Supplementary Figure 4 complements Supplementary Figure 3 by demonstrating that the perceived helpfulness of LLM feedback extends across different professional statuses. This finding further supports the claim that LLM feedback can be a valuable tool for a diverse range of researchers, regardless of their career stage or position.
The Methods section outlines the data sources and procedures used in the study. It details the selection criteria and characteristics of the Nature Family Journals dataset and the ICLR dataset, both used for retrospective analysis. The section also describes the pipeline for generating scientific feedback using GPT-4, including PDF parsing, prompt construction, and feedback structure. Finally, it explains the retrospective comment matching pipeline, involving extractive text summarization and semantic text matching, and its validation through human verification.
The section provides a thorough description of the datasets used, including the number of papers and reviews, publication period, selection criteria, and sources. This transparency allows for replication and assessment of the data's representativeness.
The section clearly explains the steps involved in both the LLM feedback generation pipeline and the retrospective comment matching pipeline. This clarity allows readers to understand how the data was processed and analyzed.
The section describes the human verification process used to validate the accuracy of the comment matching pipeline. This step strengthens the reliability of the overlap analysis and demonstrates the authors' commitment to ensuring the validity of their results.
While the section mentions prompt construction, it could benefit from a more detailed explanation of the specific prompts used for GPT-4, including any variations or refinements made during the study. This would provide insights into how the prompts were designed to elicit relevant and comprehensive feedback.
Rationale: Prompt engineering is crucial for eliciting desired responses from LLMs. A detailed description of the prompts used would enhance the reproducibility of the study and allow other researchers to understand the specific instructions given to GPT-4.
Implementation: Include the full prompt templates used for both the Nature Family Journals dataset and the ICLR dataset as supplementary figures. Additionally, discuss any iterative refinements made to the prompts based on preliminary results or feedback from human annotators.
The section acknowledges the token limit of GPT-4 but doesn't discuss its potential impact on the analysis. It would be beneficial to address whether this limit resulted in any information loss or affected the comprehensiveness of the LLM feedback.
Rationale: The token limit of LLMs is a known constraint that can affect their ability to process and analyze long documents. Addressing this limitation would enhance the transparency of the study and allow readers to assess the potential impact on the results.
Implementation: Analyze the distribution of paper lengths in both datasets and compare it to the token limit of GPT-4. Discuss whether any papers exceeded the limit and, if so, how this was handled. Additionally, explore potential mitigation strategies for future studies, such as using techniques for summarizing or compressing long documents before feeding them to the LLM.
The section mentions focusing on criticisms in the feedback but doesn't provide specific criteria for identifying criticisms. A clearer explanation of how criticisms were distinguished from other types of comments would improve the transparency and replicability of the analysis.
Rationale: The definition of "criticisms" can be subjective and vary between reviewers. Providing clear criteria for identifying criticisms would ensure consistency in the analysis and allow other researchers to replicate the study.
Implementation: Define the specific criteria used to identify criticisms in both LLM and human feedback. This could involve using keywords, phrases, or sentence structures that indicate a critical comment. Additionally, provide examples of comments that were classified as criticisms and comments that were not, to illustrate the distinction.
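One hypothetical starting point for such criteria is a keyword heuristic over comment text; the markers below are illustrative only and would need human-verified rules before use in an analysis like this one:

```python
CRITICISM_MARKERS = ("lacks", "missing", "unclear", "does not", "fails to",
                     "insufficient", "limited", "should compare", "no baseline")

def looks_like_criticism(comment: str) -> bool:
    """Rough keyword heuristic for flagging critical comments."""
    lowered = comment.lower()
    return any(marker in lowered for marker in CRITICISM_MARKERS)

print(looks_like_criticism("The evaluation lacks a comparison to recent baselines."))  # True
print(looks_like_criticism("Figure 2 nicely summarizes the architecture."))            # False
```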
Supplementary Table 1 provides a summary of the papers and their associated reviews sampled from 15 Nature family journals. It lists the journal name, the number of papers sampled from that journal, and the total number of reviews associated with those papers. For example, 773 papers were sampled from Nature with 2,324 associated reviews, and 810 papers were sampled from Nature Communications with 2,250 associated reviews. In total, the table includes data for 3,096 papers and 8,745 reviews across 15 Nature family journals.
Text: "The first dataset, sourced from Nature family journals, includes 8,745 comments from human reviewers for 3,096 accepted papers across 15 Nature family journals, including Nature, Nature Biomedical Engineering, Nature Human Behaviour, and Nature Communications (Supp. Table 1, Methods)."
Context: This sentence, located in the third paragraph of the Results section, introduces the first dataset used in the retrospective analysis, which comprises papers and reviews from Nature family journals. It mentions Supplementary Table 1 as a source for more detailed information about the dataset.
Relevance: Supplementary Table 1 is relevant to the Methods section as it provides a detailed breakdown of the papers and reviews included in the Nature family journals dataset, which is one of the two main datasets used in the study's retrospective analysis. This table allows readers to understand the scope and composition of the dataset, including the number of papers and reviews from each journal, which is crucial for assessing the generalizability of the study's findings.
Supplementary Table 2 summarizes the ICLR papers and their associated reviews sampled from the years 2022 and 2023, grouped by decision category. The table presents the number of papers and reviews for each decision category, including Accept (Oral), Accept (Spotlight), Accept (Poster), Reject after author rebuttal, and Withdrawn after reviews, for both ICLR 2022 and ICLR 2023. For example, in ICLR 2022, there were 55 accepted papers with oral presentations and 200 associated reviews, while in ICLR 2023, there were 90 accepted papers with oral presentations and 317 associated reviews. The table also provides the total number of papers and reviews for each year, with 820 papers and 3,168 reviews in ICLR 2022 and 889 papers and 3,337 reviews in ICLR 2023.
Text: "Our second dataset was sourced from ICLR (International Conference on Learning Representations), a leading computer science venue on artificial intelligence (Supp. Table 2, Methods)."
Context: This sentence, found in the third paragraph of the Introduction section, introduces the second dataset used in the study, which consists of papers and reviews from the ICLR machine learning conference. It mentions Supplementary Table 2 as a source for more detailed information about this dataset.
Relevance: Supplementary Table 2 is relevant to the Methods section as it provides a detailed breakdown of the ICLR dataset used in the study's retrospective analysis. The table presents the number of papers and reviews for each decision category (accept with oral presentation, accept with spotlight presentation, accept with poster presentation, reject, and withdraw) for both ICLR 2022 and ICLR 2023. This information is crucial for understanding the composition of the dataset and for assessing the generalizability of the study's findings across different acceptance outcomes.
Supplementary Table 3 presents the results of human verification conducted on the retrospective comment extraction and matching pipeline. It is divided into two subtables: (a) Extractive Summarization and (b) Semantic Matching. Subtable (a) focuses on the accuracy of extracting comments from scientific feedback, reporting the counts of true positives (correctly extracted comments), false negatives (missed relevant comments), and false positives (incorrectly extracted or split comments). It also presents the calculated Precision, Recall, and F1 Score for this stage, with an F1 score of 0.968. Subtable (b) evaluates the accuracy of pairing extracted comments based on semantic similarity. It displays the counts of matches and mismatches between human judgment and the system's predictions, providing Precision, Recall, and F1 Score for this stage, with an F1 score of 0.824.
Text: "We validated the pipeline’s accuracy through human verification, yielding an F1 score of 96.8% for extraction (Supp. Table 3a, Methods) and 82.4% for matching (Supp. Table 3b, Methods)."
Context: This sentence, located at the end of the third paragraph in the Results section, describes the human verification process used to validate the accuracy of the comment matching pipeline. It specifically mentions Supplementary Table 3a for extraction results and Supplementary Table 3b for matching results.
Relevance: Supplementary Table 3 is highly relevant to the Methods section as it provides detailed results of the human verification process used to validate the accuracy of the comment extraction and matching pipeline. This pipeline is a crucial component of the study's methodology, as it enables the comparison of LLM-generated feedback with human feedback. The table presents key performance metrics, including precision, recall, and F1 score, for both the extraction and matching stages, demonstrating the high accuracy of the pipeline and strengthening the reliability of the study's findings.
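For reference, the precision, recall, and F1 values reported in the table follow the standard definitions over true-positive, false-positive, and false-negative counts; a minimal sketch with placeholder counts (not the study's verification data):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")  # precision=0.900 recall=0.947 f1=0.923
```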
Supplementary Table 4 presents the mean token lengths of papers and human reviews in the two datasets used in the study: ICLR and Nature Family Journals. The table shows that ICLR papers have a mean token length of 5,841.46, while Nature Family Journal papers have a mean token length of 12,444.06. Similarly, human reviews for ICLR papers have a mean token length of 671.53, while those for Nature Family Journal papers have a mean token length of 1,337.93.
Text: "This token limit exceeds the 5,841.46-token average of ICLR papers and covers over half of the 12,444.06-token average for Nature family journal papers (Supp. Table 4)."
Context: This sentence, found in the last paragraph of the Methods section, explains the rationale for using the initial 6,500 tokens of the extracted text to construct the prompt for GPT-4. It refers to Supplementary Table 4 for the mean token lengths of papers in both datasets, justifying the chosen token limit.
Relevance: Supplementary Table 4 is relevant to the Methods section as it provides context for the decision to use the initial 6,500 tokens of the extracted text to construct the prompt for GPT-4. The table shows the mean token lengths of papers in both the ICLR and Nature Family Journals datasets, highlighting the difference in length between the two. This information justifies the chosen token limit, as it exceeds the average length of ICLR papers and covers over half of the average length of Nature Family Journal papers, ensuring that a substantial portion of each paper is included in the prompt.
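A minimal sketch of the token-budget truncation described here, using the tiktoken library to count GPT-4 tokens (the 6,500-token budget comes from the Methods; the helper name is an assumption):

```python
import tiktoken

def truncate_to_token_budget(text: str, budget: int = 6500, model: str = "gpt-4") -> str:
    """Keep only the first `budget` tokens of the parsed paper text."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:budget])

# Papers shorter than the budget pass through unchanged.
sample = "We propose a new method for protein structure prediction."
assert truncate_to_token_budget(sample) == sample
```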
Supplementary Table 5 presents examples of comments extracted from both LLM (GPT-4) and human feedback on papers submitted to the ICLR conference, categorized by human coding. The table is organized by human coding aspects, such as 'Clarity and Presentation', 'Comparison to Previous Studies', 'Theoretical Soundness', 'Novelty', and 'Reproducibility'. For each category, the table provides both the human and GPT-4 generated comments, allowing for a direct comparison of the feedback provided by both sources. The comments highlight various aspects of the paper, such as writing quality, comparison with existing methods, theoretical rigor, novelty of the approach, and reproducibility of the results.
Text: "Using our extractive text summarization pipeline, we extracted lists of comments from both the LLM and human feedback for each paper. Each comment was then annotated according to our predefined schema, identifying any of the 11 aspects it represented (Supp. Table 5,6,7)."
Context: This sentence, located in the last paragraph of the Methods section, describes the process of annotating comments extracted from LLM and human feedback according to a predefined schema of 11 key aspects. It mentions Supplementary Tables 5, 6, and 7 as examples of the annotated comments.
Relevance: Supplementary Table 5 is relevant to the Methods section as it provides examples of the annotated comments extracted from LLM and human feedback. The table showcases the different aspects of scientific papers that were considered in the annotation process, such as clarity and presentation, comparison to previous studies, theoretical soundness, novelty, and reproducibility. This information helps readers understand the scope and depth of the annotation schema and provides concrete examples of how the comments were categorized.
Supplementary Table 6 presents example comments extracted from both LLM (GPT-4) and human feedback on papers submitted to the ICLR conference. The table is organized by 'Human Coding', which categorizes the comments based on their focus: 'Add ablations experiments', 'Implications of the Research', or 'Ethical Aspects'. For each category, the table provides both the human and GPT-4 generated comments. This table illustrates the similarities and differences in feedback provided by humans and the LLM.
Text: "null"
Context: null
Relevance: While Supplementary Table 6 is not explicitly mentioned in the Methods section, it is relevant as it provides examples of the types of comments that the comment matching pipeline would be processing. It showcases the diversity of feedback aspects and the nuances in language used by both humans and LLMs, which the pipeline needs to accurately capture and compare.
Supplementary Table 7 presents example comments from both a large language model (LLM) and human reviewers regarding a scientific paper submitted to ICLR. The comments are categorized by human coding aspects such as 'Add ablations experiments', 'Implications of the Research', and 'Ethical Aspects'. Each row includes the source of the comment (Human or GPT-4) and the comment itself.
Text: "null"
Context: null
Relevance: Similar to Supplementary Table 6, Supplementary Table 7, though not directly mentioned in the Methods section, is relevant as it provides additional examples of the comments that the comment matching pipeline would be analyzing. It further illustrates the range of feedback aspects covered by both human and LLM feedback and the variations in language and specificity.
Supplementary Figure 6 is a flow diagram illustrating the three-stage pipeline designed to compare comments generated by a Large Language Model (LLM) with those from human reviewers. The pipeline begins with 'Language Model Feedback,' where key comments are extracted from the LLM's analysis of a scientific paper. The second stage, 'Language Model Summary,' condenses the extracted comments into a more succinct form. Finally, in the 'Feedback Matching' stage, the original feedback and the summarized feedback are compared using semantic similarity analysis, with matching points highlighted and a similarity rating assigned.
Text: "We developed a retrospective comment matching pipeline to evaluate the overlap between feedback from LLM and human reviewers (Fig. 1b, Methods)."
Context: This sentence, located in the fourth paragraph of the Results section, introduces the retrospective comment matching pipeline and refers to Figure 1b for a visual representation of the process. However, Figure 1b is a simplified overview, and Supplementary Figure 6 provides a more detailed workflow of the pipeline.
Relevance: Supplementary Figure 6 is crucial for understanding the methodology used to compare LLM and human feedback. It visually details the steps involved in extracting comments, summarizing them, and matching them based on semantic similarity. This pipeline is central to the study's retrospective analysis, enabling the quantification of overlap between LLM and human feedback.
Supplementary Figure 7 explores the robustness of hit rate measurements in ICLR data by controlling for the number of comments. Five bar graphs (a-e) depict hit rate comparisons for various categories of ICLR papers, considering factors like acceptance type (oral presentation, spotlight, poster) and post-review status (rejected, withdrawn). Each bar graph compares the hit rates of 'GPT-4 vs. Human,' 'Human vs. Human,' 'Human (w/o control) vs. Human,' and 'GPT-4 (shuffle) vs. Human.' The presence of error bars suggests the use of confidence intervals, likely at 95%, although the caption does not explicitly confirm this. Statistical significance is indicated using asterisks with varying levels: *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001.
Text: "The results, with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."
Context: This sentence, found in the eighth paragraph of the Methods section, refers to Supplementary Figure 7 as part of the robustness check performed to ensure that controlling for the number of comments does not significantly alter the hit rate results in the ICLR data.
Relevance: Supplementary Figure 7 is crucial for demonstrating the robustness of the study's findings regarding the overlap between LLM and human feedback. It shows that controlling for the number of comments in the ICLR data does not significantly affect the hit rate results, supporting the claim that the observed overlap is not simply an artifact of the number of comments generated.
Supplementary Figure 8 comprises nine bar graphs, labeled (a) to (i), showing a robustness check that controls for the number of comments when measuring overlap within datasets from nine Nature family journals. Each graph depicts the 'Hit Rate (%)' across four scenarios: GPT-4 feedback versus human feedback ('GPT-4 vs. Human'), feedback from two different human reviewers ('Human vs. Human'), the same human-to-human comparison without controlling for the number of comments ('Human (w/o control) vs. Human'), and a shuffled-GPT-4 baseline ('GPT-4 (shuffle) vs. Human'). The figure aims to demonstrate that controlling for the number of comments yields results similar to analyses without this control, suggesting that the observed overlap between LLM and human feedback is comparable to the overlap between different human reviewers. The caption further states that additional results for other Nature family journals are presented in Supplementary Figure 9 and that the error bars represent 95% confidence intervals. Statistical significance is indicated by asterisks: * for P < 0.05, ** for P < 0.01, *** for P < 0.001, and **** for P < 0.0001.
Text: "The results, with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."
Context: This sentence, also from the eighth paragraph of the Methods section, refers to Supplementary Figure 8 as part of the robustness check for the Nature family journals data. It highlights that the results are consistent with and without controlling for the number of comments, further supporting the reliability of the findings.
Relevance: Supplementary Figure 8 is essential for demonstrating the robustness of the study's findings in the Nature family journals data. It shows that controlling for the number of comments does not significantly alter the hit rate results, reinforcing the claim that the observed overlap between LLM and human feedback is comparable to the overlap between human reviewers.
Supplementary Figure 10 consists of four scatter plots that examine the robustness of controlling for the number of comments when analyzing the correlation of hit rates between GPT-4 and human reviewers in predicting peer review outcomes. Each scatter plot compares the hit rate of GPT-4 versus human reviewers with the hit rate of human reviewers versus other human reviewers. Subplot (a) focuses on hit rates across various Nature family journals while controlling for the number of comments, showing a correlation coefficient (r) of 0.80 and a p-value of 3.69 x 10^-4. Subplot (b) examines hit rates across ICLR papers with different decision outcomes, also controlling for the number of comments, with an r-value of 0.98 and a p-value of 3.28 x 10^-3. Subplot (c) analyzes hit rates across Nature family journals without controlling for the number of comments (r = 0.75, P = 1.37 x 10^-3). Lastly, subplot (d) investigates hit rates across ICLR papers with different decision outcomes without controlling for the number of comments (r = 0.98, P = 2.94 x 10^-3). The circle size in each scatter plot represents the sample size for each data point.
Text: "Results with and without this control, were largely similar across both the ICLR dataset for different decision outcomes (Supp. Fig. 7,10) and the Nature family journals dataset across different journals (Supp. Fig. 8,9,10)."
Context: This sentence, located in the fifth paragraph of the Methods section, refers to Supplementary Figure 10 as part of a robustness check to demonstrate that controlling for the number of comments yields similar results to analyses without this control when comparing hit rates between GPT-4 and human reviewers across different journals and decision outcomes.
Relevance: Supplementary Figure 10 is relevant to the Methods section as it provides a visual representation of the robustness check performed to validate the consistency of the study's findings regarding the correlation of hit rates between GPT-4 and human reviewers. It demonstrates that the observed correlations are not significantly affected by variations in the number of comments, strengthening the reliability of the study's conclusions.
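For reference, the r and P values in such scatter plots can be reproduced with a standard Pearson correlation; a minimal sketch with illustrative per-journal hit rates (not the study's data):

```python
from scipy.stats import pearsonr

# Illustrative per-journal hit rates (fractions), not the study's data.
gpt4_vs_human = [0.55, 0.48, 0.62, 0.50, 0.58]
human_vs_human = [0.52, 0.45, 0.60, 0.49, 0.61]

r, p_value = pearsonr(gpt4_vs_human, human_vs_human)
print(f"r={r:.2f}, P={p_value:.2e}")
```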
Supplementary Figure 12 presents the prompt template used with GPT-4 to generate scientific feedback on papers from the Nature journal family dataset. The template instructs GPT-4 to draft a high-quality review outline for a Nature family journal, starting with "Review outline:" followed by four sections: "1. Significance and novelty," "2. Potential reasons for acceptance," "3. Potential reasons for rejection," and "4. Suggestions for improvement." For the "Potential reasons for rejection" section, the template emphasizes providing multiple key reasons with at least two sub-bullet points for each reason, detailing the arguments with specificity. Similarly, the "Suggestions for improvement" section encourages listing multiple key suggestions with detailed explanations. The template emphasizes providing thoughtful and constructive feedback in outline form only.
Text: "This prompt is then fed into GPT-4, which generates the scientific feedback in a single pass. Further details and validations of the pipeline can be found in the Supplementary Information."
Relevance: Supplementary Figure 12 is crucial for understanding the specific instructions provided to GPT-4 for generating scientific feedback. It reveals the structure and content requirements of the feedback, highlighting the emphasis on detailed explanations, multiple reasons for rejection, and specific suggestions for improvement. This information is essential for interpreting the quality and nature of the LLM-generated feedback analyzed in the study.
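A minimal sketch of how a template like the one described above might be combined with the parsed paper text and sent to GPT-4 via the OpenAI Python client; the template wording paraphrases the figure description, and the helper names are placeholders, not the study's exact implementation:

```python
from openai import OpenAI

REVIEW_PROMPT = """Your task is to draft a high-quality review outline for a Nature family journal.
Start with "Review outline:" and cover four sections:
1. Significance and novelty
2. Potential reasons for acceptance
3. Potential reasons for rejection (multiple key reasons, each with at least two specific sub-bullet points)
4. Suggestions for improvement (multiple key suggestions, with details)
Provide thoughtful and constructive feedback, in outline form only.

Paper:
{paper_text}
"""

def generate_feedback(paper_text: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(paper_text=paper_text)}],
    )
    return response.choices[0].message.content
```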
Supplementary Figure 13 displays the prompt template used with GPT-4 for extractive text summarization of comments in both LLM-generated and human-written feedback. The template instructs GPT-4 to identify the key concerns raised in a review, focusing specifically on potential reasons for rejection. It requires the analysis to be presented in JSON format, including a concise summary and the exact wording from the review for each concern. The template provides an example JSON format, illustrating how to structure the output with numbered keys representing each concern and corresponding values containing a summary and verbatim quote. It emphasizes ignoring minor issues like typos and clarifications and instructs GPT-4 to output only the JSON data.
Text: "The pipeline first performs extractive text summarization [34–37] to extract the comments from both LLM and human-written feedback."
Relevance: Supplementary Figure 13 is essential for understanding how comments are extracted from both LLM and human feedback for subsequent analysis. It reveals the specific instructions given to GPT-4 for identifying key concerns, focusing on potential reasons for rejection, and structuring the output in JSON format. This information is crucial for interpreting the accuracy and reliability of the comment extraction process and the subsequent comparison between LLM and human feedback.
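A minimal sketch of how the JSON output described here might be parsed downstream; the field names `summary` and `verbatim` are assumptions based on the figure description, not confirmed keys from the study:

```python
import json

def parse_extracted_concerns(raw_json: str) -> list:
    """Turn the model's numbered-key JSON output into an ordered list of concerns."""
    data = json.loads(raw_json)
    ordered_keys = sorted(data, key=int)  # keys are numbered strings: "1", "2", ...
    return [{"id": key, **data[key]} for key in ordered_keys]

example = ('{"1": {"summary": "No baseline comparison", '
           '"verbatim": "The paper does not compare against prior methods."}}')
print(parse_extracted_concerns(example))
```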
Supplementary Figure 14 presents the prompt template used with GPT-4 for semantic text matching to identify shared comments between two sets of feedback. The input consists of two JSON files containing extracted comments from the previous step, one for each review being compared (either LLM vs. human or human vs. human). GPT-4 is instructed to match points with a significant degree of similarity in their concerns, avoiding superficial similarities or weak connections. For each matched pair, GPT-4 is asked to provide a rationale explaining the match and rate the similarity on a scale of 5 to 10, with detailed descriptions for each rating level. The output is expected in JSON format, with each key representing a matched pair (e.g., "A1-B2") and the corresponding value containing the rationale and similarity rating. If no match is found, an empty JSON object should be output.
Text: "It then applies semantic text matching [38–40] to identify shared comments between the two feedback sources."
Relevance: Supplementary Figure 14 is crucial for understanding how shared comments are identified between LLM and human feedback. It reveals the specific instructions given to GPT-4 for matching comments based on semantic similarity, avoiding superficial matches, providing rationales for each match, and rating the similarity level. This information is essential for interpreting the accuracy and reliability of the comment matching process and the subsequent analysis of overlap between LLM and human feedback.
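A minimal sketch of how the matching output described here could be turned into a hit rate: count how many LLM comments (the "A" side) appear in at least one matched pair at or above a similarity threshold. The "A1-B2" key format and the rationale/similarity fields follow the figure description; the threshold value is an assumption:

```python
import json

def hit_rate_from_matches(match_json: str, n_llm_comments: int, threshold: int = 7) -> float:
    """Fraction of LLM comments matched to at least one human comment
    with a similarity rating at or above the threshold."""
    matches = json.loads(match_json)  # e.g. {"A1-B2": {"rationale": "...", "similarity": 8}}
    matched_llm_ids = {pair.split("-")[0]
                       for pair, info in matches.items()
                       if info["similarity"] >= threshold}
    return len(matched_llm_ids) / n_llm_comments

example = '{"A1-B2": {"rationale": "Both note the missing ablation study.", "similarity": 8}}'
print(hit_rate_from_matches(example, n_llm_comments=4))  # 0.25
```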