Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review

Ehlullah Karakurt, Akhan Akbulut
Preprints.org
Istanbul Kültür University

Overall Summary

Study Background and Main Findings

This paper presents a Systematic Literature Review (SLR) analyzing 77 high-quality studies to map the current landscape of Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) integration in enterprise knowledge management and document automation. The primary objective is to identify and quantify the 'lab to market' gap, which is the disparity between academic research practices and the practical requirements for robust, production-scale enterprise deployment. The methodology involves a rigorous, multi-stage filtering process of literature published between 2015 and 2025, guided by nine specific research questions covering platforms, datasets, algorithms, and evaluation metrics.

The review's key findings reveal a field that is largely in an experimental phase, heavily reliant on specific technologies and practices. A significant majority of implementations are built on cloud-native infrastructures (66.2%) and utilize public, open-source data from sources like GitHub (54.5%), which introduces a risk of poor generalization to specific corporate contexts. Architecturally, supervised learning is the dominant paradigm (92.2%), with a clear shift toward Transformer-based models. For the core RAG process, dense vector search is the standard retrieval method (80.5%), often augmented with other techniques to handle domain-specific terminology.

The central conclusion is the strong evidence for the 'lab to market' gap, particularly in evaluation and validation. The literature is dominated by technical, automated metrics like precision and recall (80.5%) and academic validation methods such as k-fold cross-validation (93.5%). In stark contrast, metrics that measure tangible business impact (15.6%) and validation through real-world case studies (13.0%) are rare. The most frequently cited challenges are controlling AI 'hallucinations' and ensuring factual consistency (48.1%), followed by data privacy (37.7%) and system latency (31.2%).

Based on this synthesis, the paper proposes a strategic roadmap to bridge the identified gap. This roadmap prioritizes future research in several key areas: developing secure and privacy-preserving retrieval mechanisms, optimizing for ultra-low latency, creating holistic evaluation benchmarks that include business key performance indicators (KPIs), and expanding RAG capabilities to handle multimodal and multilingual data. The paper positions itself not just as a summary of existing work but as a forward-looking guide for transitioning RAG+LLM systems from academic prototypes to enterprise-ready solutions.

Research Impact and Future Directions

Overall, the paper's central claim of a significant 'lab to market' gap in RAG+LLM development is strongly supported by the evidence synthesized from the 77 reviewed studies. The most compelling finding is the stark, quantitative disconnect between the prevalence of academic validation techniques (93.5% use k-fold cross-validation) and the scarcity of methods that measure real-world value (only 15.6% of studies report business impact metrics). However, the overall reliability of the paper is significantly weakened by numerous and severe internal inconsistencies, particularly in its graphical figures. Multiple charts contain data that directly contradicts source tables or other figures, and key analyses, such as the relationship heatmap in Figure 13, are presented without any discernible methodology, rendering them scientifically unverifiable.

Major Limitations and Risks: The primary risk to the paper's credibility is a pattern of systematic data inconsistency and a lack of methodological transparency in key analyses. Several figures (e.g., Figure 2, 9, 10, 14) present data that is inconsistent with their source tables, suggesting a lack of rigor in data handling and visualization. Furthermore, the criteria for selecting 'top performing' configurations (Table 11) are undefined, introducing subjectivity into a key part of the results. The most severe flaw is the relationship heatmap (Figure 13), which is presented with no methodology, an asymmetric structure that indicates a calculation error, and values that contradict the text. These issues collectively undermine confidence in the paper's more advanced analytical claims beyond the descriptive statistics.

Based on this analysis, the paper's findings can be used for strategic planning and understanding industry trends with a Medium level of confidence. The Systematic Literature Review design is appropriate for mapping the research landscape and identifying prevalent practices and challenges, which it does effectively. However, the confidence is not high because the numerous data inconsistencies and methodological gaps require that specific quantitative claims be treated with caution. To raise confidence, the most critical next step would be an independent replication of the data extraction and analysis to verify the quantitative findings and correct the widespread errors in the figures. Following this, a rigorous meta-analysis focusing on the subset of studies with real-world performance data would be required to move from describing the field to providing validated, prescriptive guidance on best practices.

Critical Analysis and Recommendations

Strong, Memorable Central Thesis (written-content)
The paper effectively frames its findings around the central concept of a 'lab to market' gap. This provides a powerful and memorable narrative hook that synthesizes the core argument—that academic research practices in RAG/LLM development are misaligned with enterprise needs. This strong thesis makes the paper's contribution clear and impactful.
Section: Abstract
Omission of Literature Review Time Frame (written-content)
The abstract fails to specify the time period of the included studies (2015-2025), a critical piece of context for a Systematic Literature Review. Including this information directly in the abstract would immediately inform the reader about the currency and scope of the review, enhancing its transparency and methodological rigor.
Section: Abstract
Effective 'Funnel' Structure in Introduction (written-content)
The introduction is structured as a logical funnel, starting with the broad business problem of information overload, narrowing to the technical limitations of LLMs, introducing RAG as the solution, and finally specifying the SLR methodology. This classic structure effectively guides the reader and builds a strong justification for the research.
Section: Introduction
Introduction of a Novel Organizing Framework (written-content)
The paper introduces the 'RAG–Enterprise Value Chain,' a conceptual framework that structures the analysis by linking technical components to stages of business value creation. This is a significant contribution that elevates the paper from a descriptive review to a more analytical and prescriptive work, offering a useful model for researchers and practitioners.
Section: Background and Related Work
Lack of Visual Aid for Core RAG Architecture (graphical-figure)
The paper describes the core RAG architecture verbally, which can be difficult to follow for non-specialists. Adding a simple block diagram illustrating the flow from user query to retrieval, generation, and final response would significantly improve the clarity and accessibility of this foundational concept.
Section: Background and Related Work
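While the recommended block diagram would serve most readers best, the verbally described flow can also be captured in a few lines of illustrative pseudocode. The sketch below is a minimal, hypothetical rendering of a RAG loop (query, retrieval, prompt augmentation, generation); the retriever and llm objects and their methods are placeholders, not the reviewed paper's implementation.

```python
# Minimal RAG loop: query -> retrieve -> augment prompt -> generate -> respond.
# All components are illustrative placeholders, not code from the reviewed studies.

def answer_query(query, retriever, llm, top_k=5):
    # 1. Retrieval: fetch the k passages most similar to the query
    #    (dense vector search in the majority of the reviewed studies).
    passages = retriever.search(query, top_k=top_k)

    # 2. Augmentation: ground the prompt in the retrieved evidence.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # 3. Generation: the LLM produces a response conditioned on the retrieved context.
    return llm.generate(prompt)
```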
Transparent and Reproducible Research Protocol (written-content)
The methodology section provides exceptional detail on the SLR process, including the specific databases, the exact Boolean search string, and explicit filtering criteria. This high level of transparency is a hallmark of a rigorous SLR, allowing other researchers to understand, evaluate, and potentially replicate the study.
Section: Research Methodology
Lack of Procedural Detail in Quality Assessment (written-content)
The paper outlines its quality scoring system but omits crucial procedural details, such as how many reviewers conducted the assessment, how disagreements were resolved, and the inter-rater reliability score. This methodological limitation reduces the perceived objectivity and credibility of this critical filtering step.
Section: Research Methodology
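If the quality assessment were repeated by two independent reviewers, the missing inter-rater reliability figure would be simple to report. A minimal sketch with scikit-learn's cohen_kappa_score, using invented per-criterion ratings (2 = yes, 1 = partial, 0 = no); the numbers are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings of the same paper's quality criteria by two reviewers.
reviewer_a = [2, 2, 1, 0, 2, 1, 2, 2]
reviewer_b = [2, 1, 1, 0, 2, 2, 2, 2]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are conventionally read as substantial agreement
```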
Misleading Visualization of the Study Selection Process (graphical-figure)
Figure 2 is described as showing the number of papers at three distinct screening stages but only displays the final counts from each database. This is a significant flaw in methodological reporting, as it fails to visualize the filtering process and contradicts the accompanying text, undermining the transparency of the review.
Section: Research Methodology
Insightful Synthesis Beyond Simple Description (written-content)
The results section demonstrates a high level of analysis by moving beyond reporting simple statistics to uncovering deeper relationships in the data. For example, the cross-tabulation analysis linking the underutilization of real-world case studies with the scarcity of business impact metrics provides a powerful, evidence-based insight into the 'lab to market' gap.
Section: Results
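Once the per-study codings are extracted, the kind of cross-tabulation praised here is a one-liner to reproduce. A hedged pandas sketch with hypothetical column names and toy rows:

```python
import pandas as pd

# Hypothetical extraction sheet: one row per primary study.
studies = pd.DataFrame({
    "validation_method": ["k-fold", "k-fold", "case study", "k-fold", "holdout"],
    "reports_business_kpi": [False, False, True, False, False],
})

# Cross-tabulate validation approach against business-impact reporting.
print(pd.crosstab(studies["validation_method"], studies["reports_business_kpi"]))
```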
Undefined Criteria for 'Top Performance' Creates Subjectivity (graphical-figure)
Table 11 synthesizes 'top performing' configurations without defining the objective criteria used for this selection. This methodological limitation makes the analysis subjective and irreproducible, weakening the claims about which architectures are definitively 'best' for specific tasks.
Section: Results
Severe Methodological Flaws in Relationship Heatmap (graphical-figure)
Figure 13, which aims to show relationships between research topics, suffers from critical flaws: the methodology for calculating the overlap values is completely absent, the resulting matrix is incorrectly asymmetric, and the data shown contradicts values cited in the text. These errors render this novel meta-analysis scientifically unverifiable and unreliable.
Section: Results
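One way to see why the asymmetry signals a calculation error: a topic-overlap matrix built by counting studies that address both topics is symmetric by construction. A minimal numpy sketch with hypothetical topics and studies:

```python
import numpy as np

# Hypothetical binary matrix: rows = studies, columns = topics (1 = study addresses the topic).
topics = ["hallucination", "privacy", "latency"]
study_topic = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# Entry (i, j) = number of studies covering both topic i and topic j.
cooccurrence = study_topic.T @ study_topic
assert (cooccurrence == cooccurrence.T).all()  # symmetric by construction
```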
Strong Bridge Between Research and Practice (written-content)
The discussion section excels at translating the review's analytical findings into actionable, evidence-based recommendations for enterprise adoption. This focus on practical implications regarding infrastructure, compliance, and evaluation makes the paper highly valuable for practitioners and decision-makers.
Section: Discussion
Robust, Data-Driven Conclusion (written-content)
The conclusion provides a powerful synthesis that is explicitly grounded in the quantitative findings of the review. By weaving in key statistics on platform adoption, dataset usage, and recurring challenges, it delivers a credible, evidence-based snapshot of the field that strongly supports its final assessment.
Section: Conclusions and Future Work
Redundant and Inconsistent Future Research Roadmaps (written-content)
The paper presents two slightly different lists of future research directions in the Discussion and Conclusion sections. This creates redundancy and minor inconsistencies. Consolidating these into a single, definitive roadmap in the conclusion would provide a more powerful and unified final message.
Section: Conclusions and Future Work

Section Analysis

Abstract

Introduction

Background and Related Work

Non-Text Elements

Table 1. Distribution of Studies by Knowledge Management Domain.
Figure/Table Image (Page 5)
First Reference in Text
Based on our review of 77 studies, enterprise data span PDFs, spreadsheets, wikis, and transcripts across multiple domains (Table 1).
Description
  • Categorization of Research Studies: This table categorizes 77 research papers into six distinct application areas, referred to as 'Knowledge Management Domains'. This breakdown shows where research effort is concentrated in the field of enterprise knowledge management using advanced AI. The most researched area is 'Regulatory compliance governance' (helping companies follow rules), which includes 20 papers, making up 26% of the total. The next most common is 'Contract legal document automation' with 18 papers (23.4%). At the other end of the spectrum, 'Healthcare documentation' is the least studied domain, with only 4 papers (5.2%). The table provides both the raw number of papers and their corresponding percentage for each category, summing to a total of 77 papers.
Scientific Validity
  • ✅ Clear Quantitative Summary: The table provides a clear and appropriate quantitative summary of the literature distribution, which is a fundamental requirement for a systematic literature review. This categorization effectively maps the research landscape and highlights key areas of focus within the field.
  • 💡 Ambiguity in Classification Methodology: The methodology for assigning studies to these domains is not specified. It is unclear if the categories are mutually exclusive or if a single study could be assigned to multiple domains. To improve methodological rigor, please clarify the classification criteria. For instance, were these categories predefined or emergent from the literature? How were studies spanning multiple domains handled?
  • 💡 Disconnect Between Reference Text and Table Content: The reference text claims the table illustrates the variety of data types (PDFs, spreadsheets, etc.) found in the studies. However, the table actually categorizes studies by application domain (e.g., 'Contract legal document automation'). This is a significant mismatch. The text should be revised to accurately reflect that Table 1 shows the distribution of research across application domains, not data formats.
Communication
  • ✅ Simple and Effective Layout: The table's design is clean, simple, and easy to interpret. The use of clear column headers ('Domain', '# Papers', '%') and the inclusion of both absolute counts and percentages allow for quick comprehension of the data.
  • 💡 Lack of Sorting: The rows are not sorted in any discernible order (e.g., by frequency or alphabetically). To improve readability and immediately highlight the most significant findings, please sort the table rows in descending order based on the '# Papers' column. This would make it easier for readers to identify the most and least researched domains at a glance.
  • ✅ Self-Contained and Informative Caption: The caption is concise and accurately describes the content of the table. A reader can understand the table's primary purpose—to show the distribution of studies by domain—from the caption and the table itself, without needing extensive context from the main text.
Table 2. The RAG-Enterprise Value Chain: Mapping RAG + LLM Stages to Research...
Full Caption

Table 2. The RAG-Enterprise Value Chain: Mapping RAG + LLM Stages to Research Questions.

Figure/Table Image (Page 5)
First Reference in Text
Not explicitly referenced in main text
Description
  • Conceptual Framework for AI Deployment: This table presents a five-stage conceptual model, termed the 'RAG-Enterprise Value Chain,' which outlines the process of deploying a specific type of AI system in a business setting. RAG, or Retrieval-Augmented Generation, is a technique where an AI model retrieves information from a knowledge base before generating an answer. The model breaks this process down into: 1. Input (defining data), 2. Retrieval (finding relevant information), 3. Generation (creating the output), 4. Validation (quality checks), and 5. Business Impact (measuring real-world value).
  • Mapping of Process Stages to Research Questions: The core function of the table is to connect each stage of the proposed deployment model to the specific research questions (RQs) that the paper investigates. For example, the 'Input' stage is aligned with RQ1 (Platforms) and RQ2 (Datasets), which concern the foundational infrastructure. Similarly, the 'Validation' stage is linked to RQ5 (Metrics) and RQ6 (Validation), which focus on how to measure the system's performance. This structure serves as a roadmap for how the paper will analyze the existing literature.
Scientific Validity
  • ✅ Provides a Strong Methodological Structure: The table introduces a conceptual framework that provides a clear and logical structure for the systematic literature review. By mapping research questions to distinct stages of a deployment pipeline, it establishes a coherent methodology for analyzing and synthesizing the literature, which is a significant strength for a review paper.
  • ✅ Introduces a Potentially Novel Heuristic: The proposed 'RAG-Enterprise Value Chain' is a novel contribution in itself. It offers a useful heuristic or mental model for both researchers and practitioners to think about the end-to-end lifecycle of enterprise RAG systems, from data ingestion to measuring business value.
  • 💡 Framework is Conceptual, Not Empirically Validated: The 'Value Chain' is proposed as a conceptual model to organize this review. While useful for this purpose, its validity and applicability as a general framework for enterprise AI deployment are not empirically tested or validated within the paper. Its status as a heuristic versus a validated model should be clear.
  • 💡 Inconsistent or Incomplete RQ Mapping: The alignment of RQs to stages could be more rigorous. For instance, RQ8 (Best Configs) is placed under 'Generation', but the best configuration would logically depend on all preceding stages (Input, Retrieval). Furthermore, some RQs mentioned elsewhere in the paper (e.g., RQ7 on software metrics) are not included in this mapping, which creates an inconsistency.
Communication
  • ✅ Clear and Organized Presentation: The table is well-structured with clear column headers ('Stage', 'Key RQ Alignment', 'Description'). This format makes it easy for the reader to understand the proposed framework and how it relates to the paper's research questions at a glance.
  • ✅ Functions as an Effective Reader Roadmap: The table effectively sets expectations for the structure of the paper's results and discussion. It acts as a roadmap, guiding the reader through the logical flow of the analysis, which enhances the overall clarity of the manuscript.
  • 💡 Explicit Reference in Text is Needed: Although the text in section 2.3 describes the framework, it does not explicitly refer to 'Table 2'. To improve clarity and direct the reader's attention, the text should include a direct reference, such as '...as detailed in Table 2.'
Table 3. Prior Reviews on RAG and LLMs.
Figure/Table Image (Page 6)
First Reference in Text
Although numerous primary studies explore RAG and enterprise use cases, previous surveys and mappings have covered portions of the space (Table 3) [1-3,7,35,59].
Description
  • Summary of Previous Review Articles: This table summarizes six previous review articles related to advanced AI techniques, specifically Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). RAG is a method where an AI model first searches for relevant information before generating a response, making its answers more factual. LLMs are the powerful AI systems, like GPT, that perform these tasks. For each review, the table lists the authors, the time frame of the studies they covered (e.g., 2020-2023), the number of papers they analyzed (ranging from 27 to 52), and their specific research focus. The topics covered by these prior reviews include the evolution of RAG methods, benchmarking LLM performance, and addressing 'hallucination,' which is when an AI model generates incorrect or nonsensical information.
Scientific Validity
  • ✅ Establishes a Clear Research Gap: The table effectively situates the current work by summarizing prior reviews. This is a critical component of a systematic literature review, as it demonstrates that the authors have surveyed existing syntheses and are positioning their paper to fill a specific, unaddressed niche—in this case, the application of RAG/LLMs to enterprise knowledge management.
  • ✅ Supports Claims Made in the Text: The information in the 'Focus' column directly supports the authors' subsequent claim that previous reviews have only covered 'portions of the space' and none have focused specifically on their chosen scope. This strengthens the justification for the current study.
  • 💡 Lacks Explicit Selection Criteria: The paper does not describe the methodology used to identify these six prior reviews. While the selected papers appear relevant, explicitly stating the search strategy for finding other review articles would enhance the methodological transparency and rigor of this section.
Communication
  • ✅ Efficient and Clear Layout: The table is well-organized with clear, descriptive column headers ('Authors', 'Years', '#Papers', 'Focus'). This allows readers to quickly scan the table and understand the landscape of prior review literature without needing to read through lengthy prose.
  • ✅ Self-Contained and Informative: The table, along with its caption, is largely self-contained. A reader can grasp the key takeaway—that several reviews on RAG/LLMs exist but with different focuses—just by looking at this element, which is a hallmark of effective data presentation.
  • 💡 Minor Redundancy in Columns: The 'Citation' column, which lists the paper's internal reference number, is somewhat redundant since the 'Authors' column already provides the primary identifier for the work. To streamline the table and reduce visual clutter, consider removing the 'Citation' column.

Research Methodology

Non-Text Elements

Figure 1. Systematic Literature Review process.
Figure/Table Image (Page 6)
First Reference in Text
The questions were then translated into precise Boolean search strings (Figure 1).
Description
  • Three-Phase Research Process: This figure is a flowchart that outlines the standard three-phase process for conducting a Systematic Literature Review (SLR), which is a structured method for finding and analyzing existing research. The phases are: 1. Planning (defining the scope), 2. Conducting (executing the search and analysis), and 3. Reporting (presenting the findings).
  • Detailed Steps within Each Phase: Each phase is broken down into specific tasks. The 'Planning' phase includes identifying research questions, keywords, and databases. The 'Conducting' phase involves collecting and selecting studies, assessing their quality, and then extracting and synthesizing the data. The 'Reporting' phase culminates in reporting the results. A looping arrow from 'Reporting' back to 'Planning' suggests the process can be iterative.
Scientific Validity
  • ✅ Adherence to Standard Methodology: The figure correctly illustrates the established and widely accepted stages of an SLR process. This demonstrates a commitment to methodological rigor and transparency, which is a key strength for a review paper.
  • 💡 Generic and Non-Specific: The flowchart is highly generic and does not provide any specific details pertinent to this particular study, such as the actual keywords, databases, or selection criteria used. While it correctly outlines the process, it fails to document the application of that process for this review, which should be detailed in the main text.
  • 💡 Mismatch between Reference Text and Figure Content: The reference text specifically mentions translating questions into 'Boolean search strings' and points to this figure. However, the figure only shows a high-level step, 'Identify keywords,' and makes no mention of Boolean logic or search strings. The figure, therefore, does not directly support the specific action described in the text.
Communication
  • ✅ Clear and Intuitive Layout: The diagram uses a clear left-to-right flow for the main stages and a top-to-bottom sequence for the tasks within each stage. The use of distinct colors and simple icons effectively segments the process and makes the information easy to digest at a glance.
  • 💡 Ambiguous Iteration Loop: The dotted arrow looping from 'Reporting' back to the beginning is ambiguous. It is unclear if this implies the process is cyclical, or if it suggests that the findings of one review can inform the planning of a future one. To improve clarity, the meaning of this iterative loop should be explicitly stated in the caption or main text.
  • 💡 Inconsistent Flow Logic: Within the 'Conducting' section, 'Study selection' is duplicated, and its relationship with 'Quality assessment' is unclear. The arrows should be more precise to show the exact sequence of operations. Suggest refining the flowchart to show a single, unambiguous path for study selection and its temporal relationship to quality assessment.
Figure 2. Distribution of the selected papers after each screening stage.
Figure/Table Image (Page 7)
First Reference in Text
Figure 2 presents the number of records retrieved from each database in three major stages of the selection process: initial retrieval, after applying exclusion criteria, and after quality assessment.
Description
  • Source Distribution of Final Papers: This bar chart illustrates the number of research papers that were ultimately selected for the literature review, broken down by the academic database they were sourced from. It shows that out of the total papers analyzed, the vast majority (55) came from Google Scholar. The next most significant source was IEEE Xplore, contributing 8 papers. The other four databases—ACM DL, ScienceDirect, SpringerLink, and Wiley—each contributed a much smaller number, ranging from 2 to 5 papers.
Scientific Validity
  • 💡 The figure does not support the claims made in the reference text.: There is a major contradiction between the reference text and the figure's content. The text states that the figure shows the number of records at three distinct stages of the selection process (initial retrieval, after exclusion, after quality assessment). However, the bar chart only displays a single set of values: the final number of selected papers from each source. It fails to visualize the filtering process, which is a significant omission in methodological reporting.
  • ✅ Provides transparency on final sources.: The figure is useful for transparently showing the provenance of the final set of 77 studies. It clearly communicates the heavy reliance on Google Scholar, which is an important piece of information for readers evaluating the scope and potential biases of the literature search.
  • 💡 Lacks essential context for evaluating the search process.: By only showing the final counts, the figure omits crucial context. It's impossible to know the initial number of hits from each database or the number of papers excluded at each stage. This information is necessary to assess the precision and effectiveness of the search strategy across the different databases. A visualization showing the attrition of papers at each stage would be far more informative.
Communication
  • 💡 The visualization is misleading and inconsistent with its description.: The primary communication failure is that the figure does not deliver what the reference text promises. It is presented as a multi-stage visualization but is, in fact, a simple, single-stage bar chart. To resolve this, you must either (a) change the figure to a stacked or grouped bar chart that actually shows the data from all three stages, with a clear legend, or (b) rewrite the reference text and caption to accurately describe that the figure only shows the final distribution of selected papers.
  • ✅ The choice of a bar chart is appropriate for the data shown.: For the data it does show (final counts from different categories), a bar chart is a suitable and easily interpretable choice. It effectively highlights the disparity in the number of papers sourced from each database.
  • 💡 Missing data labels reduce precision.: The exact values for the bars are not labeled, forcing the reader to estimate them from the y-axis. This is especially imprecise for the shorter bars. To improve clarity and precision, add numerical labels on top of or inside each bar.
  • 💡 Axis labels could be more specific.: The y-axis label 'Number of Papers' is generic. A more descriptive label like 'Final Number of Selected Papers' would be more accurate and leave no room for ambiguity, especially given the confusion created by the reference text.
Figure 3. Quality score distribution of the selected papers (scores range...
Full Caption

Figure 3. Quality score distribution of the selected papers (scores range 11-16).

Figure/Table Image (Page 8)
First Reference in Text
Figure 3 shows the resulting distribution of quality scores (11–16), where each “yes” earned 2 points, “partial” earned 1 point and “no” earned 0 points [35].
Description
  • Distribution of Paper Quality Scores: This bar chart shows the results of a quality check performed on the research papers selected for this review. Each of the 77 final papers was assigned a quality score between 11 and 16. The chart displays how many papers received each score. The most frequent score was 14, awarded to 20 papers. The scores follow a roughly bell-shaped distribution, with fewer papers at the lower end (5 papers scored 11) and the upper end (10 papers scored 16). This process is used to ensure that only methodologically sound studies are included in the final analysis.
Scientific Validity
  • ✅ Demonstrates a rigorous filtering process: The figure provides transparent evidence that a quality assessment was conducted on the selected literature, which is a critical step in a systematic review. By showing the distribution of scores, it supports the claim that the final corpus of 77 papers meets a certain quality threshold.
  • 💡 Lacks justification for the quality cutoff threshold: The text states that papers scoring less than 10 were removed, but provides no rationale for this specific cutoff value. The choice of a threshold can significantly impact the final set of included studies. To improve methodological rigor, please provide a justification for why a score of 10 was deemed the appropriate cutoff for inclusion.
  • 💡 Details of the scoring protocol are absent: The validity of the quality scores depends on the reliability of the scoring process. The manuscript does not specify whether the scoring was performed by one or multiple reviewers. If multiple reviewers were involved, information on inter-rater reliability (e.g., Cohen's kappa) should be reported to ensure the scores are objective and reproducible.
Communication
  • ✅ Effective choice of visualization: A bar chart (or histogram) is an appropriate and effective way to visualize the frequency distribution of discrete scores. It clearly and immediately communicates the central tendency and spread of the quality scores.
  • 💡 Ambiguous X-axis label: The X-axis is labeled 'Value', which is too generic. For better clarity and to make the figure more self-contained, please change this label to 'Quality Score'.
  • 💡 Y-axis scale is inappropriate for count data: The Y-axis, which represents a count of papers, uses decimal tick marks (e.g., 2.5, 7.5, 12.5). Since the number of papers must be an integer, the axis labels should be whole numbers (e.g., 0, 5, 10, 15, 20) to avoid confusion.
  • 💡 Missing data labels: The exact count for each bar must be estimated by the reader. To improve precision and readability, please add numerical data labels on top of each bar indicating the exact number of papers for each score.
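To make the scoring scheme quoted above concrete: assuming eight assessment criteria (which would yield the implied maximum of 16), a paper's score and inclusion decision reduce to a short calculation. The answers below are invented for illustration.

```python
POINTS = {"yes": 2, "partial": 1, "no": 0}
CUTOFF = 10  # per the text, papers scoring below 10 were removed

# Hypothetical answers to eight quality-assessment criteria for one paper.
answers = ["yes", "yes", "partial", "yes", "partial", "yes", "yes", "no"]

score = sum(POINTS[a] for a in answers)  # 2+2+1+2+1+2+2+0 = 12
print(score, score >= CUTOFF)            # 12 True -> included
```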
Table 4. The 77 primary studies used in this systematic literature review.
Figure/Table Image (Page 8)
First Reference in Text
Not explicitly referenced in main text
Description
  • Complete List of Analyzed Studies: This table provides a comprehensive list of all 77 primary research articles that were selected and analyzed for this systematic literature review. This serves as the foundational dataset for the entire study, detailing each piece of literature that the authors' conclusions are based on.
  • Key Information for Each Study: For each of the 77 papers, the table presents four key pieces of information: a unique ID number (from 1 to 77), the full title of the paper, its year of publication, and its corresponding number in the paper's main reference list. The publication years range from 2017 to 2024, indicating a strong focus on recent research in this fast-moving field.
Scientific Validity
  • ✅ Enhances Transparency and Reproducibility: Providing a complete list of all included primary studies is a fundamental requirement for a rigorous systematic literature review. This transparency allows other researchers to scrutinize the selected corpus and potentially replicate or build upon the review, which is a significant methodological strength.
  • 💡 Missed Opportunity for Data Integration: While essential, the table is functionally just a bibliography. Its scientific utility could be greatly enhanced by adding columns that integrate data from other parts of the methodology. For example, including the 'Knowledge Management Domain' (from Table 1), 'Publication Type' (from Figure 4), and the assigned 'Quality Score' (from Figure 3) for each study would create a powerful summary table and a much richer dataset for the reader.
Communication
  • ✅ Clear and Standardized Format: The table is presented in a clean, standard format with clearly labeled columns, making it easy for readers to find the title, year, and reference for any given study.
  • 💡 Placement in Main Body: Long reference tables like this one, which span multiple pages, can disrupt the flow of the main text. It is often better practice to place such comprehensive lists in an Appendix to improve the readability of the core manuscript.
  • 💡 Lack of Direct Reference: The table is not explicitly cited by number in the text (e.g., "The final 77 studies are listed in Table 4"). While its presence is implied, a direct reference should be added when the final set of studies is first mentioned to guide the reader effectively.
Figure 4. Distribution of publication types (journal vs. conference).
Figure/Table Image (Page 10)
First Reference in Text
Figure 4 shows that the selected publications are slightly favoring conference proceedings (58.4%) over journal articles (41.6%), which is typical for a fast-moving field like RAG.
Description
  • Publication Trends Over Time: This is a stacked bar chart that visualizes the number of selected research papers published each year from 2015 to 2025. It illustrates a significant trend: the field experienced very little activity until 2020, after which there was a dramatic and accelerating increase in publications. The peak is in 2024 with a total of 27 papers.
  • Distribution of Publication Venues: The chart breaks down the publications into two types: 'Journal' articles (peer-reviewed academic publications in serials) and 'Conference' proceedings (papers presented at academic conferences). The colors show the proportion of each type per year. According to the text, overall, 58.4% of the 77 studies are from conferences, while 41.6% are from journals, highlighting a preference for faster-moving conference venues.
Scientific Validity
  • ✅ Effectively illustrates the field's rapid growth: The visualization strongly supports the paper's narrative that this is a recent and rapidly evolving field of study. The exponential increase in publications from 2020 onwards is a key piece of evidence that is clearly and appropriately conveyed.
  • 💡 The 2025 data point is potentially misleading: The chart shows a sharp drop in publications for 2025, which could be misinterpreted as a decline in the field's activity. This is almost certainly an artifact of the study's data collection cutoff date (June 15, 2025). This limitation should be explicitly stated in the caption or main text to prevent misinterpretation.
  • 💡 Exact percentages in the text are not verifiable from the figure: The reference text provides precise percentages (58.4% vs. 41.6%). However, due to the lack of data labels on the bars, it is impossible for a reader to verify these numbers by looking at the figure. The chart supports the qualitative trend but not the precise quantitative claim.
Communication
  • ✅ Appropriate choice of chart type: A stacked bar chart is an excellent choice for this data, as it effectively shows two things at once: the change in the total volume of publications over time, and the internal composition (journal vs. conference) of that volume for each year.
  • 💡 Lack of data labels hinders readability: The absence of numerical labels on the bar segments forces the reader to estimate the counts, reducing the chart's precision. To improve clarity, please add data labels to each segment of the bars to show the exact number of journal and conference papers for each year.
  • 💡 Color choice could be improved for accessibility: The use of two similar shades of orange may be difficult for readers with color vision deficiency to distinguish. For better accessibility, consider using a more distinct color pair or adding different patterns to the bars.
  • 💡 Axes and legend are clear: The axes are clearly labeled ('Number of Publications', 'Year'), and the legend ('Publication Type') is easy to understand, making the chart's structure straightforward to interpret.

Results

Non-Text Elements

Figure 5. Thematic taxonomy of RAG and LLM components emerging from the...
Full Caption

Figure 5. Thematic taxonomy of RAG and LLM components emerging from the reviewed literature: relationships among learning paradigms, indexing strategies, model backbones, and application domains.

Figure/Table Image (Page 10)
First Reference in Text
Figure 5 shows the thematic taxonomy of the RAG + LLM components under review, visualizing the architectural and conceptual relationships resulting from classical machine learning methods to modern variants of RAG.
Description
  • Conceptual Map of the Research Field: This figure is a concept map, or 'thematic taxonomy,' that visually organizes the key topics and technologies found across the 77 reviewed research papers. It acts as a high-level blueprint of the entire field, with the central theme branching out into major categories like the types of AI models used, how they are trained, what they are used for, and the challenges they face.
  • Core Technologies and Components: The map details the core components of the AI systems discussed. This includes 'LLM Backbones,' which are the fundamental architectures of the Large Language Models (e.g., GPT-style); 'RAG Architectures,' which are different methods for Retrieval-Augmented Generation (a technique where the AI looks up information before answering); and 'Indexing Strategies,' which are ways to organize the knowledge base for fast searching (e.g., using 'Dense-Vector Indices' that represent information numerically).
  • Applications and Challenges: The taxonomy also covers the practical aspects of this technology. The 'Use Cases' branch shows application areas like 'Knowledge Management' and 'Document Automation' in domains such as 'Regulatory Compliance'. Conversely, the 'Challenges and Research Gaps' branch highlights key problems that researchers are trying to solve, such as 'Hallucination' (when the AI generates incorrect information), 'Data Privacy', and 'Latency' (the time delay for a response).
Scientific Validity
  • ✅ Provides an excellent synthesis of the literature: Creating a thematic taxonomy is a highly appropriate and valuable method for synthesizing the findings of a systematic literature review. It moves beyond simple statistics to provide a structured, conceptual framework that represents the intellectual landscape of the field. This is a significant contribution of the paper.
  • ✅ Appears to be data-driven: The caption and reference text state that the taxonomy 'emerges from the reviewed literature,' which implies that the categories and their relationships are derived from the content of the 77 analyzed papers. This grounding in the review's dataset gives the conceptual map scientific validity as a summary of the field's structure.
  • 💡 Lacks quantitative weighting: The map shows the existence of various themes and components but provides no information on their prevalence in the literature. For example, it's impossible to tell if 'Supervised Learning' was mentioned more frequently than 'Unsupervised Learning'. The scientific value could be greatly enhanced by incorporating quantitative data, for instance, by varying the size of the nodes or the thickness of the connecting lines based on the frequency of mentions in the reviewed papers.
Communication
  • ✅ Gives a powerful 'at-a-glance' overview: The mind map format is highly effective for communicating the entire scope of the research area in a single view. It allows readers to quickly understand the key components and how they relate to one another, serving as an excellent visual abstract for the paper's findings.
  • 💡 High information density can be overwhelming: The diagram is very dense, containing a large number of nodes and connections. While comprehensive, this can be visually overwhelming for a reader, and the small font size may be difficult to read, especially in print. Consider simplifying the diagram or breaking it into a few smaller, more focused figures for key sub-topics.
  • 💡 Undefined color-coding scheme: The nodes are color-coded (e.g., green, blue, purple), but there is no legend to explain the meaning of the colors. This renders the color scheme purely decorative when it could have been used to convey an additional layer of information (e.g., distinguishing between concepts, technologies, and applications). Please add a legend to define the color scheme.
Table 5. Distribution of Platform Topologies.
Figure/Table Image (Page 11)
First Reference in Text
Table 5 quantifies the prevalence of these deployment methods across the 77 rigorously reviewed studies.
Description
  • Categorization of AI Deployment Infrastructures: This table classifies the 77 reviewed research papers based on their 'Platform Topology,' which refers to the type of computing infrastructure used to run the AI systems. It breaks down the studies into four distinct categories: 'Cloud native infrastructures' (using public services like Amazon Web Services or Google Cloud), 'On premises data centers' (using a company's own private servers), 'Edge or mobile hardware' (running AI on devices like smartphones), and 'Hybrid topologies' (a combination of these approaches).
  • Dominance of Cloud-Based Platforms: The key finding is the overwhelming dominance of cloud-based solutions. A significant majority of the studies, 51 out of 77 (representing 66.2%), utilized cloud-native infrastructures. This highlights the central role of scalable, on-demand cloud computing in modern AI research and development.
  • Prevalence of Other Topologies: The table also quantifies less common approaches. 'On premises' deployments were used in 15 studies (19.5%), often for reasons of data security or regulatory compliance. A smaller but notable group of 8 studies (10.4%) focused on 'edge or mobile hardware,' which is important for applications requiring low latency or offline functionality. 'Hybrid' systems were the least common, appearing in only 3 studies (3.9%).
Scientific Validity
  • ✅ Directly Addresses a Core Research Question: The table provides a direct and quantitative answer to RQ1, which asks about the platforms used in enterprise RAG + LLM studies. This is an appropriate and fundamental analysis for a systematic literature review, clearly mapping the technological landscape.
  • ✅ Data is Internally Consistent and Transparent: The table presents both absolute counts (# of Studies) and percentages, which is good practice. The numbers correctly sum to the total of 77 studies, and the percentages sum to 100%, indicating careful data handling and internal consistency.
  • 💡 Ambiguity in Category Definitions: While the 'Characterization' column is helpful, the operational definitions for the categories could be more precise. For example, the distinction between a 'hybrid' system and a 'cloud native' system that accesses on-premises data can be blurry. Providing more rigorous, mutually exclusive definitions for each topology would enhance the methodological soundness of the classification.
Communication
  • ✅ Excellent Clarity and Structure: The table is exceptionally clear and well-structured. The inclusion of the 'Characterization' column, which provides a concise description for each platform type, is a major strength. It makes the table highly self-contained and easy for readers to understand without needing to refer back to the main text.
  • ✅ Effective Use of a Table Format: A table is the ideal format for presenting this type of categorical data. It allows for a direct comparison of the prevalence of different platform types and presents the quantitative data in a precise and unambiguous way.
  • 💡 Suggest Sorting for Emphasis: To further improve readability and immediately emphasize the main finding, consider sorting the rows in descending order based on the '# of Studies' or '%' column. This would place the most dominant category, 'Cloud native infrastructures,' at the top, making the key takeaway even more apparent at a glance.
Table 6. Distribution of Dataset Categories.
Figure/Table Image (Page 11)
First Reference in Text
The systematic review of 77 quality assessed studies (2015–2025) identifies four dataset categories used to develop and evaluate Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) for enterprise level knowledge management and document automation (Table 6) [1,35].
Description
  • Classification of Data Sources: This table categorizes the 77 reviewed research papers based on the type of data they used to build and test their AI systems. It identifies four main sources: 'GitHub open source', 'Proprietary repositories', 'Benchmarks', and 'Custom industrial corpora'.
  • Dominance of Public Open-Source Data: The most significant finding is that a majority of studies, 42 out of 77 (54.5%), relied on publicly available data from GitHub. GitHub is a popular online platform where developers share code, documentation, and project information, making it a convenient but generic source for research.
  • Usage of Private and Standardized Datasets: The remaining studies used more specialized data. 'Custom industrial corpora' (special datasets created by companies for a specific task) were used in 13 studies (16.9%). 'Proprietary repositories' (a company's private internal documents) were used in 12 studies (15.6%). Finally, 'Benchmarks' (standardized academic datasets used for fair comparison of AI models) were the least common, used in 10 studies (13.0%).
Scientific Validity
  • ✅ Directly addresses a key research question: The table provides a clear, quantitative answer to RQ2 regarding the datasets used in the field. This is a fundamental and appropriate analysis for a systematic literature review, as it maps the data landscape and reveals common practices and potential biases in the literature.
  • 💡 Ambiguity in category definitions: The distinction between 'Proprietary repositories' and 'Custom industrial corpora' is not sufficiently clear from the provided descriptions. This ambiguity could lead to inconsistent classification of studies. To improve methodological rigor, please provide more precise, mutually exclusive operational definitions for these categories. For instance, is the key differentiator whether the data was pre-existing versus assembled specifically for the research?
  • 💡 Potential for unaddressed bias: The finding that over half the studies use public GitHub data is significant and points to a potential limitation in the field: 'domain shift'. Models trained on general-purpose public data may not perform well on specialized, private enterprise data. The table effectively surfaces this issue, but the authors should explicitly discuss the implications of this data source bias in their analysis.
Communication
  • ✅ Excellent use of a descriptive column: The inclusion of the 'Description' column is a major strength. It makes the table highly self-contained and ensures that readers can understand the meaning of each category without having to search for definitions in the main text, which is a best practice for table design.
  • ✅ Clear and well-structured layout: The table is cleanly formatted with clear headers and presents both absolute counts and percentages, allowing for easy interpretation and comparison across categories.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency. To improve readability and immediately highlight the most important finding, please sort the rows in descending order based on the '# Studies' column. This would place 'GitHub open source' at the top, making the key takeaway instantly apparent.
Figure 6. Proportional distribution of dataset sources.
Figure/Table Image (Page 11)
First Reference in Text
Not explicitly referenced in main text
Description
  • Proportional Breakdown of Data Sources: This pie chart illustrates the relative share of four different types of datasets used across the 77 reviewed research papers. The largest slice, 'GitHub Open-Source' (publicly available code and documents from the GitHub platform), accounts for 52.5% of all datasets used. The remaining sources are divided among 'Benchmarks' (standardized academic datasets for model comparison) at 18.2%, 'Custom Industrial' (datasets specially collected by companies for a specific task) at 17.2%, and 'Proprietary Repos' (private, internal company data) at 12.1%.
Scientific Validity
  • 💡 Severe data inconsistency with Table 6: There is a critical inconsistency between the data presented in this pie chart and the data in Table 6, which purports to show the same information. For example, this figure reports 'GitHub Open-Source' at 52.5% and 'Benchmarks' at 18.2%, whereas Table 6 reports them at 54.5% and 13.0%, respectively. This discrepancy is a major scientific validity issue that undermines the credibility of the data presentation. The raw counts in Table 6 appear to be correct, indicating that this figure has been plotted with incorrect values; recomputing the shares from those counts, as sketched after this list, reproduces the table's percentages.
  • 💡 Redundant visualization: This figure is entirely redundant as it visualizes the exact same categorical distribution data already presented with greater precision in Table 6. In scientific reporting, it is generally considered poor practice to present the same data in both a table and a figure. The table alone would suffice, or a bar chart could be used if a visual representation is strongly desired.
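Recomputing the shares from the Table 6 counts makes the discrepancy explicit; the short sketch below reproduces the table's percentages, which this pie chart should have matched.

```python
# Counts as reported in Table 6 (total = 77 studies).
counts = {"GitHub open source": 42, "Custom industrial corpora": 13,
          "Proprietary repositories": 12, "Benchmarks": 10}
total = sum(counts.values())  # 77

for name, n in counts.items():
    print(f"{name}: {100 * n / total:.1f}%")
# -> 54.5%, 16.9%, 15.6%, 13.0%, whereas Figure 6 shows 52.5%, 17.2%, 12.1%, 18.2%.
```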
Communication
  • 💡 Inappropriate choice of chart type: Pie charts are widely discouraged for scientific communication because the human eye is poor at accurately comparing angles and areas. A simple bar chart would be a far more effective and clearer way to present this data, as it allows for easy and accurate comparison of the lengths of the bars.
  • 💡 Lack of reference in the main text: This figure is an 'orphan' element; it is not referenced anywhere in the main body of the paper. All figures must be explicitly introduced and discussed in the text to provide context and guide the reader.
  • 💡 Inconsistent terminology: The labels used in the figure (e.g., 'GitHub Open-Source', 'Proprietary Repos') are slightly different from those in Table 6 (e.g., 'GitHub open source', 'Proprietary repositories'). Please ensure terminology is consistent across all figures and tables for clarity.
  • 💡 Cluttered labeling: The external labels connected by lines create visual clutter, particularly for the 'GitHub Open-Source' category. A bar chart would allow for much cleaner and more direct labeling of the categories and their corresponding values.
Table 7. Distribution of Machine Learning Paradigms in Enterprise RAG + LLM...
Full Caption

Table 7. Distribution of Machine Learning Paradigms in Enterprise RAG + LLM Studies.

Figure/Table Image (Page 12)
First Reference in Text
A review of 77 studies, subjected to a rigorous quality filter, shows an overwhelming preference for supervised learning when combining RAG and LLM in enterprise contexts (Table 7) [7,29].
Description
  • Classification of AI Training Methodologies: This table categorizes the 77 reviewed studies based on the 'machine learning paradigm' they employed, which is the fundamental approach used to train the AI models. The paradigms are divided into three types: Supervised, Unsupervised, and Semi-supervised learning.
  • Overwhelming Dominance of Supervised Learning: The most striking finding is the dominance of 'Supervised learning', which was used in 71 out of 77 studies, accounting for 92.2% of the literature. Supervised learning is a method where the AI learns from data that has been pre-labeled with the correct answers, much like a student studying with an answer key. This suggests most research in this area relies on having well-annotated datasets.
  • Limited Use of Unsupervised and Semi-supervised Methods: In contrast, 'Unsupervised learning' and 'Semi-supervised learning' are far less common, each accounting for only 3 studies (3.9%). Unsupervised learning involves finding hidden patterns in unlabeled data without an answer key. Semi-supervised learning is a hybrid approach that uses a small amount of labeled data to help guide the learning process on a much larger set of unlabeled data. Their low prevalence indicates a potential research gap in scenarios where labeled data is scarce.
Scientific Validity
  • ✅ Directly Addresses a Core Research Question: The table provides a clear and quantitative answer to RQ3 ('Which types of machine learning... are employed?'). This classification is a fundamental and appropriate analysis for a systematic literature review, effectively mapping the methodological landscape of the field.
  • ✅ Strongly Supports the Central Claim in the Text: The data presented (92.2% for supervised vs. 3.9% for others) provides robust, quantitative evidence for the claim in the reference text of an 'overwhelming preference for supervised learning'. The data and the interpretation are in perfect alignment.
  • 💡 Lacks Granularity in Classification: While the broad categories are useful, they may oversimplify the methodologies. For example, 'supervised learning' could encompass a wide range of techniques from simple classification to complex fine-tuning of large models. The methodology for assigning a single paradigm to each paper, especially for studies that might use hybrid approaches, is not detailed. Providing more insight into the classification criteria would enhance methodological transparency.
Communication
  • ✅ Excellent Clarity and Self-Containment: The table is exceptionally clear. The inclusion of the 'Description' column, which provides a concise definition for each technical term, is a major strength. It makes the table self-contained and easily understandable for readers who may not be experts in machine learning.
  • ✅ Effective Data Presentation: The table format is ideal for this categorical data. It presents the absolute counts and percentages cleanly, allowing for easy comparison and immediate comprehension of the key finding: the dominance of supervised learning.
  • 💡 Suggest Sorting for Emphasis: To further enhance readability and immediately highlight the main finding, sorting the rows in descending order by '# Studies' would be beneficial. This would place 'Supervised' at the top, reinforcing the primary message of the table at first glance.
Figure 7. Distribution of machine learning paradigms.
Figure/Table Image (Page 13)
Figure 7. Distribution of machine learning paradigms.
First Reference in Text
Figure 7 visualizes these rates and highlights the near ubiquity of supervised methods (92.2%), while unsupervised and semi supervised strategies remain under researched [7,29].
Description
  • Proportional Breakdown of AI Training Methods: This donut chart displays the proportional distribution of the three main machine learning training methods, or 'paradigms,' used across the 77 studies. The chart is dominated by 'Supervised' learning, which accounts for 92.2% of all cases. Supervised learning is a technique where an AI model is trained on data that has been pre-labeled with correct answers. The other two methods, 'Unsupervised' learning (where the AI finds patterns in unlabeled data) and 'Semi-supervised' learning (a hybrid approach), are used much less frequently, each accounting for 3.9% of studies.
Scientific Validity
  • ✅ Data strongly supports the textual claim: The numerical data presented in the chart (92.2% for supervised methods) perfectly aligns with and provides strong visual support for the central claim made in the reference text regarding the 'near ubiquity' of this paradigm.
  • 💡 Redundant and less precise than the table: This figure is entirely redundant, as it visualizes the exact same data presented with greater precision in Table 7. It is generally considered poor practice in scientific reporting to present the same data in both a table and a figure. The table provides the exact counts and is sufficient on its own.
  • 💡 Inappropriate chart type for scientific comparison: Donut charts (and pie charts) are widely discouraged for scientific data visualization because they require readers to compare angles and areas, which is less accurate than comparing lengths. A simple bar chart would be a more methodologically sound and effective choice for presenting this categorical data, as it allows for direct and unambiguous comparison between the paradigms.
Communication
  • 💡 Ineffective and redundant visualization: The primary communication issue is the figure's redundancy with Table 7. It adds no new information and clutters the paper. For the sake of clarity and conciseness, this figure should be removed, and the text should refer directly to Table 7.
  • 💡 Poor choice of chart type: As noted under Scientific Validity, a donut chart is a poor choice for communicating this data. The extreme imbalance (92.2% vs. 3.9%) makes the smaller segments very difficult to see and compare. A horizontal bar chart would be a far superior alternative, allowing for clear labeling and easy comparison of the categories; a minimal sketch of such a chart follows this entry.
  • ✅ Clear labeling of percentages: The chart does a good job of clearly labeling each segment with the paradigm name and its exact percentage, which makes the main message easy to grasp despite the limitations of the chart type.
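To make the bar-chart suggestion concrete, here is a minimal matplotlib sketch that replaces the donut with a labeled horizontal bar chart, using the counts reported in Table 7 (71 supervised, 3 unsupervised, 3 semi-supervised out of 77 studies).

```python
import matplotlib.pyplot as plt

paradigms = ["Semi-supervised", "Unsupervised", "Supervised"]  # smallest first so the largest ends up on top
counts = [3, 3, 71]                                            # counts reported in Table 7 (n = 77)

fig, ax = plt.subplots(figsize=(6, 2.5))
bars = ax.barh(paradigms, counts)
ax.bar_label(bars, labels=[f"{c} ({c / 77:.1%})" for c in counts], padding=3)
ax.set_xlabel("Number of studies (n = 77)")
ax.set_xlim(0, 80)                                             # leave room for the end-of-bar labels
plt.tight_layout()
plt.show()
```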
Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing...
Full Caption

Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing Strategies.

Figure/Table Image (Page 13)
Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing Strategies.
First Reference in Text
Table 8 provides a detailed taxonomy of these methods, classifying them into traditional baselines, deep learning models, RAG variants, and indexing strategies.
Description
  • Taxonomy and Frequency of AI Techniques: This table categorizes and counts the mentions of various AI techniques across the 77 reviewed studies. It is divided into four main groups: 'Traditional ML (Baselines)', 'Deep Learning Models', 'RAG Architectures', and 'Retrieval & Indexing'.
  • Key Frequency Data: The most frequently mentioned categories of techniques were 'Retrieval & Indexing' (127 total mentions) and 'Traditional ML' (121 total mentions). 'Retrieval & Indexing' refers to methods for organizing and searching information; the most common was 'Dense Vector' search (62 mentions), a modern technique that understands meaning, followed by older keyword-based methods like 'BM25/TF-IDF' (45 mentions). 'Traditional ML' includes foundational algorithms like 'Naïve Bayes' (26 mentions) and 'SVM' (22 mentions), which are often used as 'baselines'—a standard for comparison to prove that newer methods are an improvement. 'RAG Architectures'—the specific designs for the core AI technology being studied—were mentioned 82 times. In contrast, older 'Deep Learning Models' like LSTM and CNN were mentioned only 8 times, indicating the field has largely moved on to newer approaches.
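To illustrate the distinction drawn above between keyword-based and dense retrieval, the sketch below scores a toy query with TF-IDF (standing in for BM25) and with a placeholder embed() function standing in for a real sentence-embedding model, then blends the two scores in the way hybrid pipelines are commonly wired. All names and data here are illustrative assumptions, not the reviewed systems.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["invoice approval workflow policy",
        "contract clause on data retention periods",
        "employee onboarding checklist"]
query = "how long must contract data be retained?"

# Sparse keyword scoring (TF-IDF here; BM25 plays the same role in most pipelines).
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
sparse_scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()

# Dense scoring: embed() is a hypothetical stand-in for any sentence-embedding model;
# the random vectors below carry no real semantics.
def embed(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

dense_scores = cosine_similarity(embed([query]), embed(docs)).ravel()

# A simple hybrid: weighted sum of the two score vectors, then rank.
hybrid = 0.5 * sparse_scores + 0.5 * dense_scores
print("documents ranked best-first:", np.argsort(-hybrid))
```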
Scientific Validity
  • ✅ Provides a granular, quantitative overview of the technical landscape: The table demonstrates a rigorous data extraction process by not only categorizing technologies but also quantifying their frequency of mention. This provides a valuable, data-driven snapshot of the specific algorithms, architectures, and strategies that are most prevalent in the literature, directly addressing RQ4.
  • 💡 The unit of measurement, '# Mentions,' is ambiguous and potentially misleading: The metric '# Mentions' is not clearly defined. It is unclear whether this is a count of papers that mention a technique (i.e., a maximum of 77) or a raw count of every time the term appears across all papers. If the latter, a single paper that discusses an algorithm extensively could disproportionately inflate its count, skewing the perception of its overall importance in the field. To improve methodological rigor, please clarify the counting protocol. A metric like '# of Studies Mentioning Technique' would be less ambiguous and more robust.
  • 💡 The categorization conflates components with parallel technologies: 'Retrieval & Indexing' strategies are not independent alternatives to 'RAG Architectures'; they are integral sub-components of them. Presenting them as separate, parallel categories in the table is conceptually confusing. A more accurate representation would be a hierarchical structure that shows indexing strategies as a component within RAG systems.
  • 💡 Risk of misinterpretation without textual context: The high number of mentions for 'Traditional ML' (121) could lead a reader to incorrectly conclude that these methods are dominant. The crucial context—that they are primarily used as baselines for comparison—is only provided in the subsequent text. The table itself does not convey this, creating a risk of misinterpretation if viewed in isolation.
Communication
  • ✅ Effective grouping for a high-level overview: The table successfully organizes a large and complex set of technical terms into four digestible categories. This structure helps the reader to quickly grasp the main classes of technologies discussed in the literature.
  • 💡 Crucial context is missing from the table itself: The fact that 'Traditional ML' algorithms are used as 'Baselines' is critical for correct interpretation. This information should be integrated directly into the table to make it more self-contained. Suggest changing the category label from 'Traditional ML (Baselines)' to 'Traditional ML (Primarily used as baselines)' or adding a note directly in the table.
  • 💡 Lack of internal sorting reduces readability: Within each category, the specific algorithms are not consistently sorted by their number of mentions. To improve clarity and make it easier for readers to identify the most prominent techniques at a glance, please sort the items within each category in descending order of their '# Mentions'.
  • 💡 Note is helpful but could be more prominent: The note clarifying that counts represent total mentions is helpful but is placed in small font below the table. Given the ambiguity of the '# Mentions' header, this important clarification could be missed. Consider making the header more descriptive, such as '# Total Mentions (across 77 studies)'.
Figure 8. Frequency of the top five machine learning algorithms used primarily...
Full Caption

Figure 8. Frequency of the top five machine learning algorithms used primarily as baselines or classifiers in RAG + LLM studies.

Figure/Table Image (Page 14)
Figure 8. Frequency of the top five machine learning algorithms used primarily as baselines or classifiers in RAG + LLM studies.
First Reference in Text
Figure 8 illustrates the continued relevance of classical algorithms as benchmarks alongside these modern innovations.
Description
  • Frequency of Top Five Classical AI Algorithms: This bar chart displays the frequency of use for the five most common 'classical' machine learning algorithms found in the 77 reviewed studies. These algorithms are often used as 'baselines,' which are simpler, established methods that new, more complex AI systems are compared against to prove their superiority. The most frequently used algorithm was Naive Bayes, appearing in 26 studies. The others, in descending order of frequency, were SVM (Support Vector Machine) in 22 studies, Logistic Regression in 19, Decision Tree in 18, and Random Forest in 15.
Scientific Validity
  • ✅ Effectively visualizes a key finding: The figure successfully highlights a key finding from the data in Table 8: that despite the focus on modern LLMs, traditional machine learning algorithms are still highly prevalent in the research literature, primarily for benchmarking. This supports the paper's narrative about the role of these established methods.
  • 💡 Critical inconsistency in the unit of measurement: There is a significant contradiction between this figure and Table 8 regarding the unit of measurement. The y-axis of this figure is labeled 'Number of Studies,' implying that, for example, 26 distinct studies used Naive Bayes. However, the header in Table 8, from which this data is derived, is '# Mentions,' and its note clarifies that 'Counts represent total mentions.' These are two very different metrics, and the inconsistency must be resolved. 'Number of Studies' is the more scientifically robust and meaningful metric; if it is the intended one, the labeling in Table 8 should be changed accordingly.
Communication
  • ✅ Excellent and informative caption: The caption is a major strength. It provides crucial context by explicitly stating that these algorithms are used 'primarily as baselines or classifiers.' This information is essential for correct interpretation and makes the figure highly self-contained, a significant improvement over the presentation in Table 8.
  • ✅ Appropriate choice of visualization: A bar chart is the ideal visualization for comparing the frequencies of a small number of discrete categories. It is clear, intuitive, and effectively communicates which algorithms are most common.
  • 💡 Missing data labels: The exact values for each bar must be estimated from the y-axis, which reduces precision. To improve clarity and make the chart easier to read, please add numerical data labels on top of each bar.
Table 9. Distribution of Metric Categories.
Figure/Table Image (Page 14)
Table 9. Distribution of Metric Categories.
First Reference in Text
Quality Evaluation Questions lists the five primary categories (Table 9) of evaluation metrics used across the 77 reviewed studies and the proportion of studies employing each type.
Description
  • Classification of AI Performance Metrics: This table categorizes the different types of performance measures, or 'evaluation metrics,' used in the 77 reviewed studies to assess the quality of AI systems. It is divided into five main categories, ranging from automated technical scores to human judgments and real-world business outcomes.
  • Dominance of Automated Technical Metrics: The data shows a strong reliance on automated, technical metrics. The most common are standard classification measures like 'Precision / Recall / Accuracy', used in 62 studies (80.5%). These are followed by retrieval-specific metrics like 'Recall@K / Precision@K' (measuring if the right answer is in the top K results), used in 56 studies (72.7%), and text-generation metrics like 'ROUGE / BLEU' (which compare AI-generated text to human examples), used in 34 studies (44.2%).
  • Scarcity of Human-Centric and Business-Oriented Metrics: In stark contrast to the technical metrics, measures that reflect real-world usability are far less common. 'Human Evaluation', where a person directly assesses the AI's output for qualities like fluency and factual correctness, was included in only 15 studies (19.5%). Even rarer were 'Business Impact Metrics', which measure tangible outcomes like reduced costs or increased efficiency, appearing in just 12 studies (15.6%). This disparity highlights a significant gap between academic evaluation and practical enterprise value.
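To make the automated categories above concrete, the following sketch computes classification-style precision, recall, and accuracy with scikit-learn and a hand-rolled Recall@K for a toy ranked list; the data are illustrative and not drawn from any reviewed study.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Classification-style metrics over toy labels (1 = relevant/correct, 0 = not).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy:",  accuracy_score(y_true, y_pred))    # 5/6
print("precision:", precision_score(y_true, y_pred))   # 3/3
print("recall:",    recall_score(y_true, y_pred))      # 3/4

# Retrieval-style Recall@K: how many relevant documents appear in the top K results?
def recall_at_k(ranked_doc_ids, relevant_ids, k):
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

print(recall_at_k(["d3", "d7", "d1", "d9"], relevant_ids={"d1", "d4"}, k=3))  # 0.5
```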
Scientific Validity
  • ✅ Effectively Highlights a Critical Research Gap: The table's primary strength is its clear, quantitative demonstration of the gap between academic evaluation practices and the needs of enterprise applications. The stark contrast between the high usage of automated metrics (80.5%) and the low usage of business impact metrics (15.6%) is a significant finding that strongly supports the paper's narrative about the 'lab to market' challenge.
  • 💡 Methodological ambiguity regarding non-exclusive categories: The sum of studies across all categories (179) far exceeds the total number of papers (77), indicating that the categories are not mutually exclusive and that a single study can employ multiple metric types. This is a critical methodological detail that is not explicitly stated. To avoid potential misinterpretation, a note should be added to the table clarifying that studies can be counted in multiple categories. This would improve the transparency and rigor of the data presentation.
  • ✅ Directly addresses a core research question: This table provides a direct and data-driven answer to RQ5 ('Which evaluation metrics are used to assess model performance?'). The categorization and quantification are appropriate for a systematic literature review and effectively map the current state of evaluation practices in the field.
Communication
  • ✅ Excellent use of a descriptive column: The 'Description' column is a major communication strength. It provides concise definitions for each metric category, making the table highly self-contained and accessible to readers who may not be familiar with all the technical terms (e.g., ROUGE/BLEU).
  • ✅ Clear and well-structured layout: The table is cleanly formatted with clear headers and presents both absolute counts (# Studies) and percentages, allowing for easy interpretation and comparison of the prevalence of each metric category.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted in any particular order. To improve readability and immediately emphasize the key finding, please sort the rows in descending order based on the '# Studies' or '%' column. This would place the most common metrics at the top and the least common at the bottom, making the disparity instantly clear to the reader.
Figure 9. Proportions of studies using each evaluation metric category (n = 77).
Figure/Table Image (Page 15)
Figure 9. Proportions of studies using each evaluation metric category (n = 77).
First Reference in Text
Figure 9 visualizes these percentages [17,31,32,108].
Description
  • Visualization of Evaluation Metric Usage: This bar chart displays the percentage of the 77 reviewed studies that utilized different categories of performance metrics to evaluate their AI systems. The chart highlights a strong preference for automated, technical metrics over those that involve human judgment or measure real-world business value.
  • Dominance of Technical Metrics: The chart shows that the most common metric category was 'Precision/Recall/Accuracy', used by approximately 75% of the studies. This was followed by 'Recall/Prec' (a shorthand for metrics that check if the correct answer is within the top results), used by 68% of studies, and 'ROUGE/BLEU' (metrics that compare AI-generated text to a human-written reference), used by 42%.
  • Scarcity of Human-Centric and Business Metrics: A key takeaway is the sharp decline in the use of metrics that assess practical utility. 'Human Eval' (where a person directly scores the AI's output) was reported in only 18% of studies. Even less frequent were 'Business Impact' metrics (measuring outcomes like efficiency gains or cost reduction), which were used in just 15% of the studies. This visualizes a significant gap between laboratory-style evaluation and real-world application assessment.
Scientific Validity
  • 💡 Critical data inconsistency with Table 9: There is a severe inconsistency between the percentages displayed in this figure and the data presented in Table 9, which is the source for this visualization. For every category, the percentage in the figure differs from the value calculated from the table (e.g., 'Precision/Recall/Accuracy' is 75% in the figure vs. 80.5% calculated from Table 9; 'Human Eval' is 18% vs. 19.5%). This is a major scientific validity issue that undermines the credibility of the data presentation and suggests an error in data handling or figure generation. It must be corrected so that the figure accurately reflects the source data; a short sketch after this list shows how the percentages can be regenerated directly from the table counts to prevent such drift.
  • 💡 Redundant and less precise visualization: This figure is entirely redundant, as it visualizes the exact same categorical data already presented with greater precision (i.e., with raw counts) in Table 9. In scientific reporting, it is generally considered poor practice to present the same data in both a table and a figure. The table alone is sufficient and more precise.
  • ✅ Effectively visualizes the central narrative: Despite the numerical inaccuracies, the overall shape of the distribution—the steep drop-off from technical metrics to business metrics—effectively visualizes and reinforces the paper's central argument about the 'lab to market' gap in evaluation practices.
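One way to prevent the drift noted above is to generate the figure's values directly from the table counts rather than re-entering them by hand. A minimal sketch using the counts reported in Table 9 (n = 77):

```python
# Regenerate the Figure 9 percentages directly from the Table 9 counts (n = 77)
# so that the figure and the table cannot drift apart.
counts = {
    "Precision / Recall / Accuracy": 62,
    "Recall@K / Precision@K": 56,
    "ROUGE / BLEU": 34,
    "Human Evaluation": 15,
    "Business Impact Metrics": 12,
}
n = 77
for name, c in counts.items():
    print(f"{name}: {c / n:.1%}")   # 80.5%, 72.7%, 44.2%, 19.5%, 15.6%
```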
Communication
  • 💡 Missing Y-axis label: The vertical axis is not labeled, forcing the reader to infer that it represents a percentage. To adhere to best practices for data visualization, the y-axis must be explicitly labeled, for example, as 'Percentage of Studies (%)'.
  • 💡 Inconsistent and unclear X-axis labels: The labels on the x-axis are inconsistent with those in Table 9. For example, 'Recall/Prec' is an unclear abbreviation for 'Recall@K / Precision@K'. Furthermore, the labels are rotated, which reduces readability. A horizontal bar chart would be a better choice as it allows for fully horizontal, and thus more legible, category labels.
  • 💡 Poor image quality: The resolution of the figure is low, resulting in blurry text and imprecise visual elements. Figures should always be provided in a high-resolution format to ensure clarity and professionalism.
  • 💡 Redundant presentation of information: As noted in the scientific validity section, presenting the same data in both Table 9 and Figure 9 is redundant. To improve the manuscript's conciseness and impact, one of these elements should be removed. Given the data inconsistencies in the figure, it is the clear candidate for removal.
Table 10. Distribution of Validation Methods.
Figure/Table Image (Page 15)
Table 10. Distribution of Validation Methods.
First Reference in Text
Studies use various validation strategies to assess the robustness and generalizability of RAG + LLM systems (Table 10, Figure 10) [17,31,32].
Description
  • Classification of Model Validation Strategies: This table categorizes the 77 reviewed studies based on the method they used to validate the performance of their AI systems. Three primary strategies are identified: 'k fold Cross Validation', 'Hold out Split', and 'Real world Case Study'.
  • Dominance of k-fold Cross Validation: The most prevalent method is 'k fold Cross Validation', which was used in 72 out of 77 studies (93.5%). This is a common technique in machine learning where the dataset is split into 'k' subsets; the model is trained on k-1 of the subsets and tested on the remaining one, a process that is repeated k times to ensure the results are stable and not just a fluke of one particular data split.
  • Usage of Other Validation Methods: The 'Hold out Split' method, a simpler approach where the data is divided just once into a training set and a test set, was used in 20 studies (26.0%). The least common approach was the 'Real world Case Study', where the AI system is deployed in a live business environment to measure its actual impact. This was reported in only 10 studies (13.0%). The fact that the numbers add up to more than 77 indicates that some studies used more than one validation method.
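For readers less familiar with the two dominant academic strategies, the sketch below contrasts them on a toy dataset with scikit-learn; it is illustrative only and not tied to any reviewed system.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validation: the data is split into k folds, and the model is trained
# and tested k times so the final score is not a fluke of one particular split.
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean accuracy:", cv_scores.mean())

# Hold-out split: a single train/test partition; cheaper, but higher variance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))
```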
Scientific Validity
  • ✅ Directly addresses a core research question: The table provides a clear, quantitative answer to RQ6 ('Which validation approaches... are adopted?'). This analysis is a fundamental component of a systematic literature review, effectively mapping the methodological practices in the field.
  • ✅ Highlights a key gap between academic and practical validation: The data strongly supports the paper's central theme of a 'lab to market' gap. The overwhelming prevalence of a traditional academic technique like k-fold cross-validation (93.5%) compared to the scarcity of real-world case studies (13.0%) is a significant and well-supported finding.
  • 💡 Methodological ambiguity of non-exclusive categories: The sum of studies across the categories is 102, which is greater than the total of 77 papers. This indicates that the categories are not mutually exclusive, a critical detail that is not explicitly stated in the table itself. To improve transparency and prevent misinterpretation, a note should be added to the table clarifying that a single study could employ and be counted under multiple validation methods.
  • 💡 Table oversimplifies a nuanced application: The text provides a crucial piece of context that the table omits: k-fold is predominantly used for 'retrieval modules', while the less computationally expensive hold-out split is used for 'generative LLM components'. The table presents these as parallel choices, which is an oversimplification. The data is valid, but the table's structure doesn't fully capture the nuance of how these methods are applied within a single system.
Communication
  • ✅ Excellent clarity and self-containment: The table is exceptionally well-designed for clarity. The 'Description' column, which provides a concise explanation of each validation method, is a major strength. It makes the technical terms accessible and ensures the table is understandable on its own.
  • ✅ Appropriate format for the data: A table is the ideal format for presenting this type of categorical data, as it clearly lays out the different methods, their descriptions, and their frequencies (both absolute and relative) in a way that is easy to read and compare.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency. To improve readability and make the main finding more immediately obvious, please sort the rows in descending order based on the '# Studies' column. This would place 'k fold Cross Validation' at the top, instantly highlighting its dominance.
Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM...
Full Caption

Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM studies.

Figure/Table Image (Page 15)
Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM studies.
First Reference in Text
Studies use various validation strategies to assess the robustness and generalizability of RAG + LLM systems (Table 10, Figure 10) [17,31,32].
Description
  • Visualization of AI Model Validation Methods: This bar chart displays the percentage of the 77 reviewed studies that used one of three common methods for validating the performance of their AI systems. The most widely used method is 'K-Fold CV' (k-fold Cross Validation), employed by 92.3% of studies. This is a standard academic technique where a dataset is repeatedly split into training and testing portions to get a stable measure of performance. The 'Hold-out Split' method, a simpler one-time split of the data, was used by 25.6% of studies. The least common method was 'Case Study/Field Trial', where the system is tested in a real-world setting, used by only 12.8% of studies. The percentages sum to more than 100%, indicating that some studies used multiple validation techniques.
Scientific Validity
  • 💡 Critical data inconsistency with Table 10: There is a significant scientific validity issue due to data inconsistency between this figure and its source, Table 10. For every category, the percentages are different: 'K-Fold CV' is 92.3% here vs. 93.5% in the table; 'Hold-out Split' is 25.6% vs. 26.0%; and 'Case Study/Field Trial' is 12.8% vs. 13.0%. These discrepancies, while small, suggest errors in data transcription or figure generation and undermine the credibility of the reported findings. The figure must be corrected to match the source data in Table 10.
  • 💡 Redundant and less precise visualization: This figure is entirely redundant, as it visualizes the exact same information already presented with greater precision (including raw counts) in Table 10. It is poor practice to present identical data in both a table and a figure. The table is superior as it provides both absolute and relative frequencies. This figure should be removed.
  • ✅ Visually supports the paper's narrative: Despite the numerical errors, the visual pattern of the chart—a steep decline from the academic 'K-Fold CV' to the practical 'Case Study/Field Trial'—effectively reinforces the paper's central argument about the gap between academic validation and real-world application.
Communication
  • 💡 Redundant presentation clutters the manuscript: The primary communication failure is the figure's redundancy. Including both Table 10 and Figure 10 to show the same data is inefficient and adds unnecessary clutter to the paper. For clarity and conciseness, this figure should be removed and the text should refer only to the more detailed Table 10.
  • 💡 Missing Y-axis label: The vertical axis lacks a label, forcing the reader to infer its meaning from the context and the data labels on the bars. To adhere to standard data visualization practices, the y-axis must be explicitly labeled, for example, 'Percentage of Studies (%)'.
  • ✅ Appropriate choice of chart type and clear data labels: A bar chart is a suitable choice for comparing the usage rates of these discrete categories. A key strength is the inclusion of precise percentage labels on top of each bar, which makes the chart's quantitative information easy to read and understand.
Figure 11. Number of studies using each metric category (multi select allowed;...
Full Caption

Figure 11. Number of studies using each metric category (multi select allowed; n=77 total studies).

Figure/Table Image (Page 16)
Figure 11. Number of studies using each metric category (multi select allowed; n=77 total studies).
First Reference in Text
Table 9 and Figure 11 show the distribution of metric categories for 77 quality assessed articles.
Description
  • Categorization of Software Metrics: This bar chart shows the number of studies that used different types of software metrics. These metrics are used to measure characteristics of computer code and software development processes. The chart displays five categories of these metrics.
  • Dominance of Object-Oriented Metrics: The most commonly used type of metric was 'object-oriented metrics', which were featured in 24 of the 77 studies. These are measures that analyze the structure and complexity of code written in an object-oriented style, a common modern programming paradigm. The next most frequent category, identified in the text as 'procedural and domain-specific metrics', was used in 11 studies. The remaining three categories, identified as 'web, process, and performance metrics', were used very rarely, each appearing in only 2 studies.
Scientific Validity
  • 💡 Critical confusion between 'Software Metrics' and 'Evaluation Metrics': There is a severe methodological flaw in how this figure is presented. The reference text incorrectly links this figure to Table 9, stating they both show 'metric categories'. However, Table 9 details AI evaluation metrics (like accuracy and ROUGE), while this figure, according to the text in section 4.7, shows software metrics (like object-oriented complexity). These are two entirely different concepts and research questions (RQ5 vs. RQ7). This conflation is a major scientific error that fundamentally confuses two separate analyses and undermines the paper's clarity and rigor.
  • 💡 Lack of a source data table: Unlike other charts in this paper that are based on a corresponding table (e.g., Figure 10 is based on Table 10), there is no source table provided for the data in Figure 11. For a systematic literature review, transparency is paramount. The raw data (the list of metric categories and the number of studies for each) must be presented in a table to allow for verification and scrutiny.
  • ✅ Caption clarifies counting methodology: The caption's inclusion of '(multi select allowed; n=77 total studies)' is an example of good practice. It correctly informs the reader that the sum of the bars will not equal 77 because a single study could use metrics from multiple categories, which is crucial for correct interpretation.
Communication
  • 💡 Figure is critically misleading due to incorrect referencing: The primary communication failure is that the reference text creates a false equivalence between this figure and Table 9, leading to significant confusion. This figure should be exclusively discussed within the context of RQ7 (Software Metrics) and completely decoupled from the discussion of RQ5 (Evaluation Metrics) and Table 9. The incorrect reference must be removed.
  • 💡 Missing X-axis labels: The bars on the x-axis are not labeled in the figure itself. A reader cannot understand what each bar represents without searching for the information in the main text. This violates the principle that figures should be as self-contained as possible. Each bar must be clearly labeled with its respective metric category.
  • 💡 Data labels would improve clarity: The exact count for each bar must be estimated from the y-axis. To improve precision and readability, please add numerical data labels (e.g., '24', '11', '2') on top of each bar.
  • ✅ Appropriate chart type: A bar chart is a suitable and effective choice for displaying the frequency of use for several non-exclusive categories, allowing for easy comparison between them.
Table 11. Key Configurations and Performance Findings.
Figure/Table Image (Page 16)
Table 11. Key Configurations and Performance Findings.
First Reference in Text
Table 11 summarizes the top configurations, and Figure 12 charts the frequency with which each configuration achieved state of the art results on its respective benchmark or case study [31,61,105].
Description
  • Summary of Top-Performing AI System Setups: This table identifies the five most successful AI system setups, or 'configurations,' found in the reviewed literature. A configuration is a specific combination of an AI architecture (like 'RAG Token' or 'Hybrid RAG') and a particular Large Language Model (like 'BART' or 'GPT-3.5'). The table aims to answer the question: 'What combinations of technologies work best for specific tasks?'
  • Task-Specific Performance Data: For each of the five configurations, the table details the specific business task it excelled at (e.g., 'Knowledge grounded QA' or 'Contract Clause Generation'). It also provides the number of studies in which that configuration was reported as the top performer, with the most frequent being 'RAG Token + Fine Tuned BART' which was the best in 5 studies.
  • Key Performance Findings: The 'Key Findings' column summarizes the performance of each configuration with specific metrics or qualitative outcomes. For example, the 'RAG Token + Fine Tuned BART' setup achieved up to an 87% exact match score on question-answering tasks and reduced AI 'hallucinations' (factually incorrect outputs) by 35%. Another configuration, used for technical manual tasks, reportedly 'Reduced manual editing time by 40%' in real-world trials.
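Since Table 11 cites an 'exact match' score without defining it, a brief sketch of one common implementation may help readers interpret such figures; the normalization rules below (lowercasing, punctuation stripping) are an assumed convention, not necessarily the one used in the reviewed studies.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace (one common EM convention)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["Net 30 days", "the vendor"],
                  ["net 30 days.", "The customer"]))   # 0.5
```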
Scientific Validity
  • ✅ Synthesizes performance data to answer a key research question: The table provides a direct and valuable answer to RQ8 ('Best Performing RAG + LLM Configurations'). By moving beyond simple frequency counts to synthesize performance outcomes, it offers a higher level of analysis that is crucial for a systematic review and provides actionable insights.
  • 💡 The criteria for 'top performance' are undefined: A major methodological limitation is the lack of a clear, objective definition for what constitutes a 'top performing' configuration or 'state of the art results'. It is unclear if this was determined by the original authors' claims or by a standardized re-evaluation. This subjectivity makes the selection process opaque and potentially irreproducible. The criteria for inclusion in this table must be explicitly defined.
  • 💡 Conclusions are drawn from a very small sample size: The total number of 'top performing reports' summarized here is 16 (5+4+3+2+2). This is a very small subset of the 77 total studies analyzed. Drawing strong, generalizable conclusions about the 'best' configurations from such a small sample is statistically questionable. This limitation should be prominently acknowledged in the text.
  • 💡 Performance metrics are not directly comparable: The table presents a mix of disparate performance metrics (e.g., 'exact match' for QA, 'ROUGE-L' for summarization, '% time saved' for workflow automation). While this accurately reflects the literature, presenting them side-by-side without strong caveats could misleadingly imply they are comparable. The inherent difficulty in comparing performance across different task types should be discussed.
Communication
  • ✅ Excellent structure for conveying complex information: The table is very well-structured. The columns for 'Configuration', 'Task Type', and 'Key Findings' work together effectively to link a specific technology stack to a specific business problem and its performance outcome. This is a highly effective way to communicate complex findings.
  • ✅ 'Key Findings' column provides valuable qualitative context: The inclusion of a qualitative summary in the 'Key Findings' column is a major strength. It goes beyond raw numbers to explain why a configuration was successful, making the results much more meaningful and interpretable for the reader.
  • 💡 The column header '#*' is cryptic: The header '#*' is not self-explanatory. While its meaning is clarified in a footnote, this is easy for a reader to miss. For better clarity, the header should be changed to something more descriptive, such as '# Top Performing Studies' or '# SOTA Reports'.
Figure 12. Several studies have shown that each RAG + LLM configuration...
Full Caption

Figure 12. Several studies have shown that each RAG + LLM configuration attained top reported performance (n = 16 total top performing reports).

Figure/Table Image (Page 17)
Figure 12. Several studies have shown that each RAG + LLM configuration attained top reported performance (n = 16 total top performing reports).
First Reference in Text
Table 11 summarizes the top configurations, and Figure 12 charts the frequency with which each configuration achieved state of the art results on its respective benchmark or case study [31,61,105].
Description
  • Frequency of Top-Performing AI Configurations: This bar chart visualizes how frequently each of the five best-performing AI system setups, or 'configurations,' were reported as achieving top results in the literature. A configuration refers to a specific combination of an AI architecture and a Large Language Model. The chart is based on a total of 16 such top-performing reports identified across the 77 reviewed studies.
  • Key Data Points: The most frequently cited top-performing setup was 'RAG-Token + BART-ft', which appeared in 5 reports. The next most frequent was 'RAG-Seq + GPT-3.5' with 4 reports. The remaining configurations, 'Hybrid RAG + T5-L', 'RAG-Token + ROBERTa', and 'RAG-Seq + Flan-T5', were reported as top-performers in 3, 2, and 2 reports, respectively. The bars are sorted in descending order of frequency.
Scientific Validity
  • 💡 Redundant and less informative than Table 11: This figure is entirely redundant as it visualizes only a single column of data ('#*') that is already presented clearly in Table 11. The table is scientifically superior because it provides crucial context, such as the specific task and the qualitative performance findings for each configuration. Presenting the same, limited data in a separate figure adds no new insight and is an inefficient use of space.
  • 💡 Conclusions are based on a very small sample size: The entire analysis is based on only 16 'top performing reports' drawn from a pool of 77 studies. Drawing strong conclusions about which configurations are definitively 'best' from such a small and select sample is statistically weak. The findings should be presented with strong caveats about the limited evidence base.
  • 💡 The criteria for 'top performance' are subjective and undefined: The methodology for identifying a 'top performing report' or 'state of the art results' is not defined. This introduces a significant risk of selection bias and makes the analysis subjective and difficult to reproduce. Objective criteria for this classification must be provided.
  • ✅ Data is consistent with the source table: The numerical values displayed on the bars (5, 4, 3, 2, 2) perfectly match the data presented in the '#*' column of Table 11. This internal consistency is a positive aspect, even if the figure itself is redundant.
Communication
  • 💡 The figure is redundant and should be removed: The primary communication issue is the figure's complete redundancy with Table 11. It adds no new information and clutters the manuscript. The most effective way to improve communication would be to remove this figure entirely and refer only to the more comprehensive Table 11 in the text.
  • 💡 Missing Y-axis label: The vertical axis lacks a label, forcing the reader to infer its meaning. To adhere to standard data visualization practices, the y-axis should be explicitly labeled, for example, 'Number of Top Performance Reports'.
  • ✅ Effective use of sorting and data labels: The chart correctly follows best practices by sorting the bars in descending order, which immediately highlights the most frequent configuration. Additionally, the inclusion of clear numerical labels on top of each bar makes the exact values easy to read.
  • ✅ Appropriate choice of chart type: A bar chart is a suitable and clear way to compare the frequencies of a small number of discrete categories, making the relative prevalence of each configuration easy to grasp visually.
Table 12. Distribution of Challenges in Enterprise RAG + LLM Studies.
Figure/Table Image (Page 17)
Table 12. Distribution of Challenges in Enterprise RAG + LLM Studies.
First Reference in Text
Despite the rapid advances in RAG + LLM for enterprise knowledge management and document automation, the synthesis of 77 high quality studies reveals five recurring challenges and several open research directions (Table 12) [17,32-34,38-42,44,45,58,106,108].
Description
  • Categorization of Key Research Challenges: This table identifies and quantifies the five most frequently mentioned challenges in the field of enterprise AI systems, as reported in the 77 reviewed research papers. The challenges range from technical issues like performance and accuracy to practical concerns like security and business value.
  • Prevalence of Factual Consistency as the Top Challenge: The most cited problem is 'Hallucination Factual Consistency', which was mentioned in 37 studies (48.1%). This refers to the critical issue of AI models generating information that is factually incorrect or not grounded in the source data. This high frequency underscores it as the primary concern for researchers in the field.
  • Distribution of Other Major Challenges: Other significant challenges include 'Data Privacy & Security', a concern in 29 studies (37.7%), and 'Latency & Scalability' (the speed and performance of the AI), mentioned in 24 studies (31.2%). The difficulty of adapting AI to new specialized areas ('Domain Adaptation Transfer Learning') was noted in 18 studies (23.4%). The least frequently mentioned challenge was the 'Difficulty in Measuring Business Impact', cited in just 12 studies (15.6%), highlighting a gap in research focused on quantifiable business outcomes.
Scientific Validity
  • ✅ Provides a data-driven summary of research gaps: The table effectively answers RQ9 by quantitatively summarizing the key challenges identified in the literature. This is a valuable contribution for a systematic review, as it moves beyond a simple narrative to provide evidence-based insights into the field's most pressing problems.
  • 💡 Lacks methodological transparency for challenge extraction: The methodology for identifying and coding a 'challenge' from a paper is not defined. It is unclear if this was based on keyword searches, a qualitative analysis of the papers' discussion sections, or explicit statements of limitations by the original authors. This lack of a defined protocol makes the results difficult to reproduce and their objectivity hard to assess.
  • 💡 Ambiguity of non-exclusive categories is not stated: The total number of studies listed (120) is significantly higher than the 77 papers in the review, indicating that the categories are not mutually exclusive (i.e., one paper can mention multiple challenges). This is a crucial methodological detail that should be explicitly stated in a note on the table to prevent readers from misinterpreting the data.
Communication
  • ✅ Excellent clarity and self-containment: The table is very well-designed for communication. The 'Description' column is a major strength, providing a concise and clear explanation of each technical challenge. This makes the table highly self-contained and easily understandable for a broad audience.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency, which forces the reader to scan the entire table to identify the most and least significant challenges. To improve readability and immediately highlight the key findings, please sort the rows in descending order based on the '# Studies' or '%' column.
  • ✅ Appropriate and effective table format: A table is the ideal format for presenting this type of categorical data. It allows for a clear, direct comparison of the challenges and presents both absolute counts and percentages in a precise and unambiguous manner.
Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9)....
Full Caption

Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9). Color intensity reflects how often two RQs are contextually addressed together.

Figure/Table Image (Page 18)
Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9). Color intensity reflects how often two RQs are contextually addressed together.
First Reference in Text
Knowledge overlaps and gaps between RQs are illustrated in the heatmap in Figure 13.
Description
  • Visualization of Inter-Topic Relationships: This figure is a heatmap, a grid where colors represent values, used here to show the strength of the relationship between the nine different research questions (RQs) that guided the study. Each row represents one RQ (from RQ1 to RQ9), and each column represents a topic that corresponds to one of the RQs (e.g., the 'Architectures' column corresponds to RQ4). The cells in the grid are colored and contain a number from 0.1 to 1.0, where warmer colors (like yellow) and higher numbers indicate a stronger connection, meaning the two topics were frequently discussed together in the reviewed papers. Cooler colors (like green and blue) and lower numbers indicate a weaker connection.
  • Key Finding of Strong Overlap: The heatmap is used to highlight areas of strong overlap in the research. For instance, the text points out a strong relationship between the topic of 'Architectures' (RQ4) and 'Best Configurations' (RQ8). The corresponding cells in the heatmap show high values (0.9 and 0.7), indicating that papers discussing AI system architecture are very likely to also discuss which configurations perform best. This type of analysis helps identify the most tightly connected concepts in the research field.
Scientific Validity
  • 💡 Critical lack of methodological transparency: The most significant flaw is the complete absence of methodology for calculating the overlap values. The caption implies a co-occurrence frequency ('how often...addressed together'), while the text later mentions a 'Pearson r' value, which is a statistical correlation. These are different metrics. Without a clear, reproducible definition of how the 0.1-1.0 values were derived from the 77 papers, the entire figure is scientifically unverifiable.
  • 💡 Asymmetric matrix suggests flawed calculation: A matrix showing the overlap or correlation between two variables (e.g., RQ4 and RQ8) should be symmetric; the relationship between A and B is the same as between B and A. This matrix is not symmetric (e.g., the cell for [row RQ8, column Architectures/RQ4] is 0.7, while the cell for [row RQ4, column Best_Configs/RQ8] is 0.9). This asymmetry indicates a fundamental error in the calculation or the conceptual model, severely undermining the validity of the data; a sketch after this list illustrates why any genuinely computed co-occurrence or correlation matrix must be symmetric.
  • 💡 Inconsistent data between figure and text: The text states that the relationship between RQ4 and RQ8 has a 'Pearson r = 0.77'. However, the heatmap displays two different values for this relationship (0.7 and 0.9). This direct contradiction between the text and the figure is a serious error that confuses the reader and questions the reliability of the data.
  • ✅ Novel analytical approach: Despite its flawed execution, the conceptual approach of creating a relationship matrix to identify knowledge overlaps and gaps between research questions is a novel and valuable form of meta-analysis for a literature review. It attempts to provide a deeper structural insight into the field beyond simple frequency counts.
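To make the symmetry argument concrete, the sketch below builds a hypothetical papers-by-RQs indicator matrix (the paper provides no such data or methodology, so the values are illustrative only) and shows that both a raw co-occurrence count and a Jaccard-style normalized overlap derived from it are symmetric by construction.

```python
import numpy as np

# Hypothetical reconstruction: a binary papers-by-RQs indicator matrix where
# entry [p, q] is 1 if paper p addresses RQ(q+1). Values are illustrative only.
rng = np.random.default_rng(1)
addresses = (rng.random((77, 9)) < 0.4).astype(int)

# Raw co-occurrence: number of papers addressing RQi and RQj together.
co = addresses.T @ addresses

# Jaccard-style normalized overlap is likewise symmetric by construction.
totals = addresses.sum(axis=0)
union = totals[:, None] + totals[None, :] - co
overlap = co / np.maximum(union, 1)

assert np.allclose(co, co.T) and np.allclose(overlap, overlap.T)
```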
Communication
  • 💡 Severely confusing axis labeling: The figure's axes are extremely difficult to interpret. The Y-axis is labeled with RQ numbers (RQ1-RQ9), but the X-axis is labeled with topic names ('Platforms', 'Datasets', etc.). The reader must refer to Table 2 elsewhere in the paper to deduce that the X-axis labels are just proxies for the RQ numbers. This violates the principle of a self-contained figure and creates unnecessary work and confusion for the reader. Both axes should be clearly and consistently labeled (e.g., as RQ1-RQ9), with topic names provided in the caption if necessary.
  • 💡 Unlabeled color bar: The color bar on the right is not labeled. It is unclear what the 0.1-1.0 scale represents. To improve clarity, the color bar needs a descriptive title, such as 'Overlap Score' or 'Correlation Coefficient', consistent with the (currently missing) methodology.
  • 💡 Poor readability of X-axis labels: The long, rotated labels on the X-axis are difficult to read and contribute to visual clutter. Using consistent RQ1-RQ9 labels for both axes, as suggested above, would resolve this issue and make the entire heatmap cleaner and more readable.
  • ✅ Appropriate choice of visualization type: A heatmap is the correct and most effective type of visualization for displaying a correlation or overlap matrix. It allows for the rapid, intuitive identification of strong (hot colors) and weak (cool colors) relationships within a complex dataset.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 14. Selected publications per year (2015-2025).
Figure/Table Image (Page 19)
Figure 14. Selected publications per year (2015-2025).
First Reference in Text
Finally, the publication trend in Figure 14 could be annotated with event markers (e.g., major model releases) to contextualize inflection points [1-3].
Description
  • Trend of Publications Over Time: This bar chart illustrates the number of selected research papers published each year between 2015 and 2025. It shows a clear trend of exponential growth in the field. From 2015 to 2019, there were very few publications (between 1 and 5 per year). A significant acceleration began in 2022 with 10 publications, followed by a surge to 20 publications in 2023 and a peak of 23 in 2024. The count for 2025 shows a sharp drop to 4 publications, which is likely an artifact of the data collection for the review ending partway through that year.
Scientific Validity
  • 💡 Critical data inconsistency with a previous figure: There is a severe scientific validity issue as the data in this figure directly contradicts the data presented in Figure 4, which also shows publications per year. For example, this figure shows 23 publications for 2024, whereas Figure 4 shows a total of 27 for the same year. Likewise, the counts for 2020, 2021, 2022, and 2023 are all different between the two figures. This is a major error that undermines the credibility of the data analysis and must be corrected.
  • 💡 Redundant and less informative visualization: This figure is redundant. Figure 4 already presents the same trend of total publications over time but provides the additional, valuable detail of breaking down the totals into journal and conference papers. This figure, therefore, adds no new information and should be removed in favor of the more comprehensive Figure 4.
  • 💡 The 2025 data point is misleading without context: The sharp drop in publications in 2025 could be misinterpreted by a reader as a sudden decline in research interest. This is almost certainly an artifact of the study's data collection cutoff date. This crucial context is not provided in the caption or text, making the final data point misleading. This limitation must be explicitly stated.
Communication
  • 💡 Fundamentally flawed Y-axis scale: The y-axis of the chart is incorrectly scaled. The axis labels only go up to 20, but the bar for the year 2024 clearly represents a value of 23, extending well beyond the top of the axis. This is a basic and critical error in data visualization that makes the chart inaccurate and unprofessional. The axis range must be corrected to encompass the maximum data value.
  • 💡 Redundant figure clutters the manuscript: The primary communication issue is the figure's redundancy with Figure 4. Including both is inefficient and clutters the discussion section. To improve the manuscript's clarity and conciseness, this figure should be removed, and all discussion of publication trends should refer to the more detailed Figure 4.
  • 💡 Missing data labels: The exact number of publications for each year is not labeled on the bars, forcing the reader to estimate the values from the flawed y-axis. For precision and clarity, numerical data labels should be added to the top of each bar; a minimal matplotlib sketch after this entry shows both fixes (the corrected axis range and the bar labels).
  • ✅ Appropriate choice of chart type: A bar chart is a suitable and effective choice for visualizing the number of publications over discrete time intervals (years), as it makes the growth trend easy to see.
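To make the axis-range and data-label suggestions concrete, here is a minimal matplotlib sketch; only the counts quoted in the description above (2022 through 2025) are taken from the figure, and the remaining values are placeholders that should be replaced with the review's actual data.

```python
import matplotlib.pyplot as plt

years = list(range(2015, 2026))
# Placeholder counts: only 2022 (10), 2023 (20), 2024 (23), and 2025 (4) are quoted above;
# substitute the actual per-year counts from the review data.
counts = [1, 1, 2, 3, 5, 6, 8, 10, 20, 23, 4]

fig, ax = plt.subplots()
bars = ax.bar(years, counts)
ax.bar_label(bars)                              # exact value printed on each bar
ax.set_ylim(0, max(counts) * 1.15)              # axis always encloses the tallest bar
ax.set_xticks(years)
ax.set_xlabel("Publication year")
ax.set_ylabel("Number of selected studies")
plt.tight_layout()
plt.show()
```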

Conclusions and Future Work

Key Aspects

Strengths

Suggestions for Improvement