Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review

Ehlullah Karakurt, Akhan Akbulut
Preprints.org
Istanbul Kültür University

Overall Summary

Study Background and Main Findings

This paper presents a Systematic Literature Review (SLR) analyzing 77 high-quality studies to map the current landscape of Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) integration in enterprise knowledge management and document automation. The primary objective is to identify and quantify the 'lab to market' gap, which is the disparity between academic research practices and the practical requirements for robust, production-scale enterprise deployment. The methodology involves a rigorous, multi-stage filtering process of literature published between 2015 and 2025, guided by nine specific research questions covering platforms, datasets, algorithms, and evaluation metrics.

The review's key findings reveal a field that is largely in an experimental phase, heavily reliant on specific technologies and practices. A significant majority of implementations are built on cloud-native infrastructures (66.2%) and utilize public, open-source data from sources like GitHub (54.5%), which introduces a risk of poor generalization to specific corporate contexts. Architecturally, supervised learning is the dominant paradigm (92.2%), with a clear shift toward Transformer-based models. For the core RAG process, dense vector search is the standard retrieval method (80.5%), often augmented with other techniques to handle domain-specific terminology.

The central conclusion is the strong evidence for the 'lab to market' gap, particularly in evaluation and validation. The literature is dominated by technical, automated metrics like precision and recall (80.5%) and academic validation methods such as k-fold cross-validation (93.5%). In stark contrast, metrics that measure tangible business impact (15.6%) and validation through real-world case studies (13.0%) are rare. The most frequently cited challenges are controlling AI 'hallucinations' and ensuring factual consistency (48.1%), followed by data privacy (37.7%) and system latency (31.2%).

Based on this synthesis, the paper proposes a strategic roadmap to bridge the identified gap. This roadmap prioritizes future research in several key areas: developing secure and privacy-preserving retrieval mechanisms, optimizing for ultra-low latency, creating holistic evaluation benchmarks that include business key performance indicators (KPIs), and expanding RAG capabilities to handle multimodal and multilingual data. The paper positions itself not just as a summary of existing work but as a forward-looking guide for transitioning RAG+LLM systems from academic prototypes to enterprise-ready solutions.

Research Impact and Future Directions

Overall, the paper's central claim of a significant 'lab to market' gap in RAG+LLM development is strongly supported by the evidence synthesized from the 77 reviewed studies. The most compelling finding is the stark, quantitative disconnect between the prevalence of academic validation techniques (93.5% use k-fold cross-validation) and the scarcity of methods that measure real-world value (only 15.6% of studies report business impact metrics). However, the overall reliability of the paper is significantly weakened by numerous and severe internal inconsistencies, particularly in its graphical figures. Multiple charts contain data that directly contradicts source tables or other figures, and key analyses, such as the relationship heatmap in Figure 13, are presented without any discernible methodology, rendering them scientifically unverifiable.

Major Limitations and Risks: The primary risk to the paper's credibility is a pattern of systematic data inconsistency and a lack of methodological transparency in key analyses. Several figures (e.g., Figure 2, 9, 10, 14) present data that is inconsistent with their source tables, suggesting a lack of rigor in data handling and visualization. Furthermore, the criteria for selecting 'top performing' configurations (Table 11) are undefined, introducing subjectivity into a key part of the results. The most severe flaw is the relationship heatmap (Figure 13), which is presented with no methodology, an asymmetric structure that indicates a calculation error, and values that contradict the text. These issues collectively undermine confidence in the paper's more advanced analytical claims beyond the descriptive statistics.

Based on this analysis, the paper's findings can be used for strategic planning and understanding industry trends with a Medium level of confidence. The Systematic Literature Review design is appropriate for mapping the research landscape and identifying prevalent practices and challenges, which it does effectively. However, the confidence is not high because the numerous data inconsistencies and methodological gaps require that specific quantitative claims be treated with caution. To raise confidence, the most critical next step would be an independent replication of the data extraction and analysis to verify the quantitative findings and correct the widespread errors in the figures. Following this, a rigorous meta-analysis focusing on the subset of studies with real-world performance data would be required to move from describing the field to providing validated, prescriptive guidance on best practices.

Critical Analysis and Recommendations

Strong, Memorable Central Thesis (written-content)
The paper effectively frames its findings around the central concept of a 'lab to market' gap. This provides a powerful and memorable narrative hook that synthesizes the core argument—that academic research practices in RAG/LLM development are misaligned with enterprise needs. This strong thesis makes the paper's contribution clear and impactful.
Section: Abstract
Omission of Literature Review Time Frame (written-content)
The abstract fails to specify the time period of the included studies (2015-2025), a critical piece of context for a Systematic Literature Review. Including this information directly in the abstract would immediately inform the reader about the currency and scope of the review, enhancing its transparency and methodological rigor.
Section: Abstract
Effective 'Funnel' Structure in Introduction (written-content)
The introduction is structured as a logical funnel, starting with the broad business problem of information overload, narrowing to the technical limitations of LLMs, introducing RAG as the solution, and finally specifying the SLR methodology. This classic structure effectively guides the reader and builds a strong justification for the research.
Section: Introduction
Introduction of a Novel Organizing Framework (written-content)
The paper introduces the 'RAG–Enterprise Value Chain,' a conceptual framework that structures the analysis by linking technical components to stages of business value creation. This is a significant contribution that elevates the paper from a descriptive review to a more analytical and prescriptive work, offering a useful model for researchers and practitioners.
Section: Background and Related Work
Lack of Visual Aid for Core RAG Architecture (graphical-figure)
The paper describes the core RAG architecture verbally, which can be difficult to follow for non-specialists. Adding a simple block diagram illustrating the flow from user query to retrieval, generation, and final response would significantly improve the clarity and accessibility of this foundational concept.
Section: Background and Related Work
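While the recommended block diagram would serve most readers best, the verbally described flow can also be captured in a few lines of illustrative pseudocode. The sketch below is a minimal, hypothetical rendering of a RAG loop (query, retrieval, prompt augmentation, generation); the retriever and llm objects and their methods are placeholders, not the reviewed paper's implementation.

```python
# Minimal RAG loop: query -> retrieve -> augment prompt -> generate -> respond.
# All components are illustrative placeholders, not code from the reviewed studies.

def answer_query(query, retriever, llm, top_k=5):
    # 1. Retrieval: fetch the k passages most similar to the query
    #    (dense vector search in the majority of the reviewed studies).
    passages = retriever.search(query, top_k=top_k)

    # 2. Augmentation: ground the prompt in the retrieved evidence.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # 3. Generation: the LLM produces a response conditioned on the retrieved context.
    return llm.generate(prompt)
```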
Transparent and Reproducible Research Protocol (written-content)
The methodology section provides exceptional detail on the SLR process, including the specific databases, the exact Boolean search string, and explicit filtering criteria. This high level of transparency is a hallmark of a rigorous SLR, allowing other researchers to understand, evaluate, and potentially replicate the study.
Section: Research Methodology
Lack of Procedural Detail in Quality Assessment (written-content)
The paper outlines its quality scoring system but omits crucial procedural details, such as how many reviewers conducted the assessment, how disagreements were resolved, and the inter-rater reliability score. This methodological limitation reduces the perceived objectivity and credibility of this critical filtering step.
Section: Research Methodology
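If the quality assessment were repeated by two independent reviewers, the missing inter-rater reliability figure would be simple to report. A minimal sketch with scikit-learn's cohen_kappa_score, using invented per-criterion ratings (2 = yes, 1 = partial, 0 = no); the numbers are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings of the same paper's quality criteria by two reviewers.
reviewer_a = [2, 2, 1, 0, 2, 1, 2, 2]
reviewer_b = [2, 1, 1, 0, 2, 2, 2, 2]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are conventionally read as substantial agreement
```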
Misleading Visualization of the Study Selection Process (graphical-figure)
Figure 2 is described as showing the number of papers at three distinct screening stages but only displays the final counts from each database. This is a significant flaw in methodological reporting, as it fails to visualize the filtering process and contradicts the accompanying text, undermining the transparency of the review.
Section: Research Methodology
Insightful Synthesis Beyond Simple Description (written-content)
The results section demonstrates a high level of analysis by moving beyond reporting simple statistics to uncovering deeper relationships in the data. For example, the cross-tabulation analysis linking the underutilization of real-world case studies with the scarcity of business impact metrics provides a powerful, evidence-based insight into the 'lab to market' gap.
Section: Results
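Once the per-study codings are extracted, the kind of cross-tabulation praised here is a one-liner to reproduce. A hedged pandas sketch with hypothetical column names and toy rows:

```python
import pandas as pd

# Hypothetical extraction sheet: one row per primary study.
studies = pd.DataFrame({
    "validation_method": ["k-fold", "k-fold", "case study", "k-fold", "holdout"],
    "reports_business_kpi": [False, False, True, False, False],
})

# Cross-tabulate validation approach against business-impact reporting.
print(pd.crosstab(studies["validation_method"], studies["reports_business_kpi"]))
```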
Undefined Criteria for 'Top Performance' Creates Subjectivity (graphical-figure)
Table 11 synthesizes 'top performing' configurations without defining the objective criteria used for this selection. This methodological limitation makes the analysis subjective and irreproducible, weakening the claims about which architectures are definitively 'best' for specific tasks.
Section: Results
Severe Methodological Flaws in Relationship Heatmap (graphical-figure)
Figure 13, which aims to show relationships between research topics, suffers from critical flaws: the methodology for calculating the overlap values is completely absent, the resulting matrix is incorrectly asymmetric, and the data shown contradicts values cited in the text. These errors render this novel meta-analysis scientifically unverifiable and unreliable.
Section: Results
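One way to see why the asymmetry signals a calculation error: a topic-overlap matrix built by counting studies that address both topics is symmetric by construction. A minimal numpy sketch with hypothetical topics and studies:

```python
import numpy as np

# Hypothetical binary matrix: rows = studies, columns = topics (1 = study addresses the topic).
topics = ["hallucination", "privacy", "latency"]
study_topic = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# Entry (i, j) = number of studies covering both topic i and topic j.
cooccurrence = study_topic.T @ study_topic
assert (cooccurrence == cooccurrence.T).all()  # symmetric by construction
```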
Strong Bridge Between Research and Practice (written-content)
The discussion section excels at translating the review's analytical findings into actionable, evidence-based recommendations for enterprise adoption. This focus on practical implications regarding infrastructure, compliance, and evaluation makes the paper highly valuable for practitioners and decision-makers.
Section: Discussion
Robust, Data-Driven Conclusion (written-content)
The conclusion provides a powerful synthesis that is explicitly grounded in the quantitative findings of the review. By weaving in key statistics on platform adoption, dataset usage, and recurring challenges, it delivers a credible, evidence-based snapshot of the field that strongly supports its final assessment.
Section: Conclusions and Future Work
Redundant and Inconsistent Future Research Roadmaps (written-content)
The paper presents two slightly different lists of future research directions in the Discussion and Conclusion sections. This creates redundancy and minor inconsistencies. Consolidating these into a single, definitive roadmap in the conclusion would provide a more powerful and unified final message.
Section: Conclusions and Future Work

Section Analysis

Abstract

Introduction

Background and Related Work

Non-Text Elements

Table 1. Distribution of Studies by Knowledge Management Domain.
Figure/Table Image (Page 5)
First Reference in Text
Based on our review of 77 studies, enterprise data span PDFs, spreadsheets, wikis, and transcripts across multiple domains (Table 1).
Description
  • Categorization of Research Studies: This table categorizes 77 research papers into six distinct application areas, referred to as 'Knowledge Management Domains'. This breakdown shows where research effort is concentrated in the field of enterprise knowledge management using advanced AI. The most researched area is 'Regulatory compliance governance' (helping companies follow rules), which includes 20 papers, making up 26% of the total. The next most common is 'Contract legal document automation' with 18 papers (23.4%). At the other end of the spectrum, 'Healthcare documentation' is the least studied domain, with only 4 papers (5.2%). The table provides both the raw number of papers and their corresponding percentage for each category, summing to a total of 77 papers.
Scientific Validity
  • ✅ Clear Quantitative Summary: The table provides a clear and appropriate quantitative summary of the literature distribution, which is a fundamental requirement for a systematic literature review. This categorization effectively maps the research landscape and highlights key areas of focus within the field.
  • 💡 Ambiguity in Classification Methodology: The methodology for assigning studies to these domains is not specified. It is unclear if the categories are mutually exclusive or if a single study could be assigned to multiple domains. To improve methodological rigor, please clarify the classification criteria. For instance, were these categories predefined or emergent from the literature? How were studies spanning multiple domains handled?
  • 💡 Disconnect Between Reference Text and Table Content: The reference text claims the table illustrates the variety of data types (PDFs, spreadsheets, etc.) found in the studies. However, the table actually categorizes studies by application domain (e.g., 'Contract legal document automation'). This is a significant mismatch. The text should be revised to accurately reflect that Table 1 shows the distribution of research across application domains, not data formats.
Communication
  • ✅ Simple and Effective Layout: The table's design is clean, simple, and easy to interpret. The use of clear column headers ('Domain', '# Papers', '%') and the inclusion of both absolute counts and percentages allow for quick comprehension of the data.
  • 💡 Lack of Sorting: The rows are not sorted in any discernible order (e.g., by frequency or alphabetically). To improve readability and immediately highlight the most significant findings, please sort the table rows in descending order based on the '# Papers' column. This would make it easier for readers to identify the most and least researched domains at a glance.
  • ✅ Self-Contained and Informative Caption: The caption is concise and accurately describes the content of the table. A reader can understand the table's primary purpose—to show the distribution of studies by domain—from the caption and the table itself, without needing extensive context from the main text.
Table 2. The RAG-Enterprise Value Chain: Mapping RAG + LLM Stages to Research...
Full Caption

Table 2. The RAG-Enterprise Value Chain: Mapping RAG + LLM Stages to Research Questions.

Figure/Table Image (Page 5)
First Reference in Text
Not explicitly referenced in main text
Description
  • Conceptual Framework for AI Deployment: This table presents a five-stage conceptual model, termed the 'RAG-Enterprise Value Chain,' which outlines the process of deploying a specific type of AI system in a business setting. RAG, or Retrieval-Augmented Generation, is a technique where an AI model retrieves information from a knowledge base before generating an answer. The model breaks this process down into: 1. Input (defining data), 2. Retrieval (finding relevant information), 3. Generation (creating the output), 4. Validation (quality checks), and 5. Business Impact (measuring real-world value).
  • Mapping of Process Stages to Research Questions: The core function of the table is to connect each stage of the proposed deployment model to the specific research questions (RQs) that the paper investigates. For example, the 'Input' stage is aligned with RQ1 (Platforms) and RQ2 (Datasets), which concern the foundational infrastructure. Similarly, the 'Validation' stage is linked to RQ5 (Metrics) and RQ6 (Validation), which focus on how to measure the system's performance. This structure serves as a roadmap for how the paper will analyze the existing literature.
Scientific Validity
  • ✅ Provides a Strong Methodological Structure: The table introduces a conceptual framework that provides a clear and logical structure for the systematic literature review. By mapping research questions to distinct stages of a deployment pipeline, it establishes a coherent methodology for analyzing and synthesizing the literature, which is a significant strength for a review paper.
  • ✅ Introduces a Potentially Novel Heuristic: The proposed 'RAG-Enterprise Value Chain' is a novel contribution in itself. It offers a useful heuristic or mental model for both researchers and practitioners to think about the end-to-end lifecycle of enterprise RAG systems, from data ingestion to measuring business value.
  • 💡 Framework is Conceptual, Not Empirically Validated: The 'Value Chain' is proposed as a conceptual model to organize this review. While useful for this purpose, its validity and applicability as a general framework for enterprise AI deployment are not empirically tested or validated within the paper. Its status as a heuristic versus a validated model should be clear.
  • 💡 Inconsistent or Incomplete RQ Mapping: The alignment of RQs to stages could be more rigorous. For instance, RQ8 (Best Configs) is placed under 'Generation', but the best configuration would logically depend on all preceding stages (Input, Retrieval). Furthermore, some RQs mentioned elsewhere in the paper (e.g., RQ7 on software metrics) are not included in this mapping, which creates an inconsistency.
Communication
  • ✅ Clear and Organized Presentation: The table is well-structured with clear column headers ('Stage', 'Key RQ Alignment', 'Description'). This format makes it easy for the reader to understand the proposed framework and how it relates to the paper's research questions at a glance.
  • ✅ Functions as an Effective Reader Roadmap: The table effectively sets expectations for the structure of the paper's results and discussion. It acts as a roadmap, guiding the reader through the logical flow of the analysis, which enhances the overall clarity of the manuscript.
  • 💡 Explicit Reference in Text is Needed: Although the text in section 2.3 describes the framework, it does not explicitly refer to 'Table 2'. To improve clarity and direct the reader's attention, the text should include a direct reference, such as '...as detailed in Table 2.'
Table 3. Prior Reviews on RAG and LLMs.
Figure/Table Image (Page 6)
First Reference in Text
Although numerous primary studies explore RAG and enterprise use cases, previous surveys and mappings have covered portions of the space (Table 3) [1-3,7,35,59].
Description
  • Summary of Previous Review Articles: This table summarizes six previous review articles related to advanced AI techniques, specifically Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). RAG is a method where an AI model first searches for relevant information before generating a response, making its answers more factual. LLMs are the powerful AI systems, like GPT, that perform these tasks. For each review, the table lists the authors, the time frame of the studies they covered (e.g., 2020-2023), the number of papers they analyzed (ranging from 27 to 52), and their specific research focus. The topics covered by these prior reviews include the evolution of RAG methods, benchmarking LLM performance, and addressing 'hallucination,' which is when an AI model generates incorrect or nonsensical information.
Scientific Validity
  • ✅ Establishes a Clear Research Gap: The table effectively situates the current work by summarizing prior reviews. This is a critical component of a systematic literature review, as it demonstrates that the authors have surveyed existing syntheses and are positioning their paper to fill a specific, unaddressed niche—in this case, the application of RAG/LLMs to enterprise knowledge management.
  • ✅ Supports Claims Made in the Text: The information in the 'Focus' column directly supports the authors' subsequent claim that previous reviews have only covered 'portions of the space' and none have focused specifically on their chosen scope. This strengthens the justification for the current study.
  • 💡 Lacks Explicit Selection Criteria: The paper does not describe the methodology used to identify these six prior reviews. While the selected papers appear relevant, explicitly stating the search strategy for finding other review articles would enhance the methodological transparency and rigor of this section.
Communication
  • ✅ Efficient and Clear Layout: The table is well-organized with clear, descriptive column headers ('Authors', 'Years', '#Papers', 'Focus'). This allows readers to quickly scan the table and understand the landscape of prior review literature without needing to read through lengthy prose.
  • ✅ Self-Contained and Informative: The table, along with its caption, is largely self-contained. A reader can grasp the key takeaway—that several reviews on RAG/LLMs exist but with different focuses—just by looking at this element, which is a hallmark of effective data presentation.
  • 💡 Minor Redundancy in Columns: The 'Citation' column, which lists the paper's internal reference number, is somewhat redundant since the 'Authors' column already provides the primary identifier for the work. To streamline the table and reduce visual clutter, consider removing the 'Citation' column.

Research Methodology

Non-Text Elements

Figure 1. Systematic Literature Review process.
Figure/Table Image (Page 6)
First Reference in Text
The questions were then translated into precise Boolean search strings (Figure 1).
Description
  • Three-Phase Research Process: This figure is a flowchart that outlines the standard three-phase process for conducting a Systematic Literature Review (SLR), which is a structured method for finding and analyzing existing research. The phases are: 1. Planning (defining the scope), 2. Conducting (executing the search and analysis), and 3. Reporting (presenting the findings).
  • Detailed Steps within Each Phase: Each phase is broken down into specific tasks. The 'Planning' phase includes identifying research questions, keywords, and databases. The 'Conducting' phase involves collecting and selecting studies, assessing their quality, and then extracting and synthesizing the data. The 'Reporting' phase culminates in reporting the results. A looping arrow from 'Reporting' back to 'Planning' suggests the process can be iterative.
Scientific Validity
  • ✅ Adherence to Standard Methodology: The figure correctly illustrates the established and widely accepted stages of an SLR process. This demonstrates a commitment to methodological rigor and transparency, which is a key strength for a review paper.
  • 💡 Generic and Non-Specific: The flowchart is highly generic and does not provide any specific details pertinent to this particular study, such as the actual keywords, databases, or selection criteria used. While it correctly outlines the process, it fails to document the application of that process for this review, which should be detailed in the main text.
  • 💡 Mismatch between Reference Text and Figure Content: The reference text specifically mentions translating questions into 'Boolean search strings' and points to this figure. However, the figure only shows a high-level step, 'Identify keywords,' and makes no mention of Boolean logic or search strings. The figure, therefore, does not directly support the specific action described in the text.
Communication
  • ✅ Clear and Intuitive Layout: The diagram uses a clear left-to-right flow for the main stages and a top-to-bottom sequence for the tasks within each stage. The use of distinct colors and simple icons effectively segments the process and makes the information easy to digest at a glance.
  • 💡 Ambiguous Iteration Loop: The dotted arrow looping from 'Reporting' back to the beginning is ambiguous. It is unclear if this implies the process is cyclical, or if it suggests that the findings of one review can inform the planning of a future one. To improve clarity, the meaning of this iterative loop should be explicitly stated in the caption or main text.
  • 💡 Inconsistent Flow Logic: Within the 'Conducting' section, 'Study selection' is duplicated, and its relationship with 'Quality assessment' is unclear. The arrows should be more precise to show the exact sequence of operations. Suggest refining the flowchart to show a single, unambiguous path for study selection and its temporal relationship to quality assessment.
Figure 2. Distribution of the selected papers after each screening stage.
Figure/Table Image (Page 7)
First Reference in Text
Figure 2 presents the number of records retrieved from each database in three major stages of the selection process: initial retrieval, after applying exclusion criteria, and after quality assessment.
Description
  • Source Distribution of Final Papers: This bar chart illustrates the number of research papers that were ultimately selected for the literature review, broken down by the academic database they were sourced from. It shows that out of the total papers analyzed, the vast majority (55) came from Google Scholar. The next most significant source was IEEE Xplore, contributing 8 papers. The other four databases—ACM DL, ScienceDirect, SpringerLink, and Wiley—each contributed a much smaller number, ranging from 2 to 5 papers.
Scientific Validity
  • 💡 The figure does not support the claims made in the reference text.: There is a major contradiction between the reference text and the figure's content. The text states that the figure shows the number of records at three distinct stages of the selection process (initial retrieval, after exclusion, after quality assessment). However, the bar chart only displays a single set of values: the final number of selected papers from each source. It fails to visualize the filtering process, which is a significant omission in methodological reporting.
  • ✅ Provides transparency on final sources.: The figure is useful for transparently showing the provenance of the final set of 77 studies. It clearly communicates the heavy reliance on Google Scholar, which is an important piece of information for readers evaluating the scope and potential biases of the literature search.
  • 💡 Lacks essential context for evaluating the search process.: By only showing the final counts, the figure omits crucial context. It's impossible to know the initial number of hits from each database or the number of papers excluded at each stage. This information is necessary to assess the precision and effectiveness of the search strategy across the different databases. A visualization showing the attrition of papers at each stage would be far more informative.
Communication
  • 💡 The visualization is misleading and inconsistent with its description.: The primary communication failure is that the figure does not deliver what the reference text promises. It is presented as a multi-stage visualization but is, in fact, a simple, single-stage bar chart. To resolve this, you must either (a) change the figure to a stacked or grouped bar chart that actually shows the data from all three stages, with a clear legend, or (b) rewrite the reference text and caption to accurately describe that the figure only shows the final distribution of selected papers.
  • ✅ The choice of a bar chart is appropriate for the data shown.: For the data it does show (final counts from different categories), a bar chart is a suitable and easily interpretable choice. It effectively highlights the disparity in the number of papers sourced from each database.
  • 💡 Missing data labels reduce precision.: The exact values for the bars are not labeled, forcing the reader to estimate them from the y-axis. This is especially imprecise for the shorter bars. To improve clarity and precision, add numerical labels on top of or inside each bar.
  • 💡 Axis labels could be more specific.: The y-axis label 'Number of Papers' is generic. A more descriptive label like 'Final Number of Selected Papers' would be more accurate and leave no room for ambiguity, especially given the confusion created by the reference text.
Figure 3. Quality score distribution of the selected papers (scores range...
Full Caption

Figure 3. Quality score distribution of the selected papers (scores range 11-16).

Figure/Table Image (Page 8)
First Reference in Text
Figure 3 shows the resulting distribution of quality scores (11–16), where each “yes” earned 2 points, “partial” earned 1 point and “no” earned 0 points [35].
Description
  • Distribution of Paper Quality Scores: This bar chart shows the results of a quality check performed on the research papers selected for this review. Each of the 77 final papers was assigned a quality score between 11 and 16. The chart displays how many papers received each score. The most frequent score was 14, awarded to 20 papers. The scores follow a roughly bell-shaped distribution, with fewer papers at the lower end (5 papers scored 11) and the upper end (10 papers scored 16). This process is used to ensure that only methodologically sound studies are included in the final analysis.
Scientific Validity
  • ✅ Demonstrates a rigorous filtering process: The figure provides transparent evidence that a quality assessment was conducted on the selected literature, which is a critical step in a systematic review. By showing the distribution of scores, it supports the claim that the final corpus of 77 papers meets a certain quality threshold.
  • 💡 Lacks justification for the quality cutoff threshold: The text states that papers scoring less than 10 were removed, but provides no rationale for this specific cutoff value. The choice of a threshold can significantly impact the final set of included studies. To improve methodological rigor, please provide a justification for why a score of 10 was deemed the appropriate cutoff for inclusion.
  • 💡 Details of the scoring protocol are absent: The validity of the quality scores depends on the reliability of the scoring process. The manuscript does not specify whether the scoring was performed by one or multiple reviewers. If multiple reviewers were involved, information on inter-rater reliability (e.g., Cohen's kappa) should be reported to ensure the scores are objective and reproducible.
Communication
  • ✅ Effective choice of visualization: A bar chart (or histogram) is an appropriate and effective way to visualize the frequency distribution of discrete scores. It clearly and immediately communicates the central tendency and spread of the quality scores.
  • 💡 Ambiguous X-axis label: The X-axis is labeled 'Value', which is too generic. For better clarity and to make the figure more self-contained, please change this label to 'Quality Score'.
  • 💡 Y-axis scale is inappropriate for count data: The Y-axis, which represents a count of papers, uses decimal tick marks (e.g., 2.5, 7.5, 12.5). Since the number of papers must be an integer, the axis labels should be whole numbers (e.g., 0, 5, 10, 15, 20) to avoid confusion.
  • 💡 Missing data labels: The exact count for each bar must be estimated by the reader. To improve precision and readability, please add numerical data labels on top of each bar indicating the exact number of papers for each score.
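To make the scoring scheme quoted above concrete: assuming eight assessment criteria (which would yield the implied maximum of 16), a paper's score and inclusion decision reduce to a short calculation. The answers below are invented for illustration.

```python
POINTS = {"yes": 2, "partial": 1, "no": 0}
CUTOFF = 10  # per the text, papers scoring below 10 were removed

# Hypothetical answers to eight quality-assessment criteria for one paper.
answers = ["yes", "yes", "partial", "yes", "partial", "yes", "yes", "no"]

score = sum(POINTS[a] for a in answers)  # 2+2+1+2+1+2+2+0 = 12
print(score, score >= CUTOFF)            # 12 True -> included
```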
Table 4. The 77 primary studies used in this systematic literature review.
Figure/Table Image (Page 8)
First Reference in Text
Not explicitly referenced in main text
Description
  • Complete List of Analyzed Studies: This table provides a comprehensive list of all 77 primary research articles that were selected and analyzed for this systematic literature review. This serves as the foundational dataset for the entire study, detailing each piece of literature that the authors' conclusions are based on.
  • Key Information for Each Study: For each of the 77 papers, the table presents four key pieces of information: a unique ID number (from 1 to 77), the full title of the paper, its year of publication, and its corresponding number in the paper's main reference list. The publication years range from 2017 to 2024, indicating a strong focus on recent research in this fast-moving field.
Scientific Validity
  • ✅ Enhances Transparency and Reproducibility: Providing a complete list of all included primary studies is a fundamental requirement for a rigorous systematic literature review. This transparency allows other researchers to scrutinize the selected corpus and potentially replicate or build upon the review, which is a significant methodological strength.
  • 💡 Missed Opportunity for Data Integration: While essential, the table is functionally just a bibliography. Its scientific utility could be greatly enhanced by adding columns that integrate data from other parts of the methodology. For example, including the 'Knowledge Management Domain' (from Table 1), 'Publication Type' (from Figure 4), and the assigned 'Quality Score' (from Figure 3) for each study would create a powerful summary table and a much richer dataset for the reader.
Communication
  • ✅ Clear and Standardized Format: The table is presented in a clean, standard format with clearly labeled columns, making it easy for readers to find the title, year, and reference for any given study.
  • 💡 Placement in Main Body: Long reference tables like this one, which span multiple pages, can disrupt the flow of the main text. It is often better practice to place such comprehensive lists in an Appendix to improve the readability of the core manuscript.
  • 💡 Lack of Direct Reference: The table is not explicitly cited by number in the text (e.g., "The final 77 studies are listed in Table 4"). While its presence is implied, a direct reference should be added when the final set of studies is first mentioned to guide the reader effectively.
Figure 4. Distribution of publication types (journal vs. conference).
Figure/Table Image (Page 10)
First Reference in Text
Figure 4 shows that the selected publications are slightly favoring conference proceedings (58.4%) over journal articles (41.6%), which is typical for a fast-moving field like RAG.
Description
  • Publication Trends Over Time: This is a stacked bar chart that visualizes the number of selected research papers published each year from 2015 to 2025. It illustrates a significant trend: the field experienced very little activity until 2020, after which there was a dramatic and accelerating increase in publications. The peak is in 2024 with a total of 27 papers.
  • Distribution of Publication Venues: The chart breaks down the publications into two types: 'Journal' articles (peer-reviewed academic publications in serials) and 'Conference' proceedings (papers presented at academic conferences). The colors show the proportion of each type per year. According to the text, overall, 58.4% of the 77 studies are from conferences, while 41.6% are from journals, highlighting a preference for faster-moving conference venues.
Scientific Validity
  • ✅ Effectively illustrates the field's rapid growth: The visualization strongly supports the paper's narrative that this is a recent and rapidly evolving field of study. The exponential increase in publications from 2020 onwards is a key piece of evidence that is clearly and appropriately conveyed.
  • 💡 The 2025 data point is potentially misleading: The chart shows a sharp drop in publications for 2025, which could be misinterpreted as a decline in the field's activity. This is almost certainly an artifact of the study's data collection cutoff date (June 15, 2025). This limitation should be explicitly stated in the caption or main text to prevent misinterpretation.
  • 💡 Exact percentages in the text are not verifiable from the figure: The reference text provides precise percentages (58.4% vs. 41.6%). However, due to the lack of data labels on the bars, it is impossible for a reader to verify these numbers by looking at the figure. The chart supports the qualitative trend but not the precise quantitative claim.
Communication
  • ✅ Appropriate choice of chart type: A stacked bar chart is an excellent choice for this data, as it effectively shows two things at once: the change in the total volume of publications over time, and the internal composition (journal vs. conference) of that volume for each year.
  • 💡 Lack of data labels hinders readability: The absence of numerical labels on the bar segments forces the reader to estimate the counts, reducing the chart's precision. To improve clarity, please add data labels to each segment of the bars to show the exact number of journal and conference papers for each year.
  • 💡 Color choice could be improved for accessibility: The use of two similar shades of orange may be difficult for readers with color vision deficiency to distinguish. For better accessibility, consider using a more distinct color pair or adding different patterns to the bars.
  • 💡 Axes and legend are clear: The axes are clearly labeled ('Number of Publications', 'Year'), and the legend ('Publication Type') is easy to understand, making the chart's structure straightforward to interpret.

Results

Non-Text Elements

Figure 5. Thematic taxonomy of RAG and LLM components emerging from the...
Full Caption

Figure 5. Thematic taxonomy of RAG and LLM components emerging from the reviewed literature: relationships among learning paradigms, indexing strategies, model backbones, and application domains.

Figure/Table Image (Page 10)
First Reference in Text
Figure 5 shows the thematic taxonomy of the RAG + LLM components under review, visualizing the architectural and conceptual relationships resulting from classical machine learning methods to modern variants of RAG.
Description
  • Conceptual Map of the Research Field: This figure is a concept map, or 'thematic taxonomy,' that visually organizes the key topics and technologies found across the 77 reviewed research papers. It acts as a high-level blueprint of the entire field, with the central theme branching out into major categories like the types of AI models used, how they are trained, what they are used for, and the challenges they face.
  • Core Technologies and Components: The map details the core components of the AI systems discussed. This includes 'LLM Backbones,' which are the fundamental architectures of the Large Language Models (e.g., GPT-style); 'RAG Architectures,' which are different methods for Retrieval-Augmented Generation (a technique where the AI looks up information before answering); and 'Indexing Strategies,' which are ways to organize the knowledge base for fast searching (e.g., using 'Dense-Vector Indices' that represent information numerically).
  • Applications and Challenges: The taxonomy also covers the practical aspects of this technology. The 'Use Cases' branch shows application areas like 'Knowledge Management' and 'Document Automation' in domains such as 'Regulatory Compliance'. Conversely, the 'Challenges and Research Gaps' branch highlights key problems that researchers are trying to solve, such as 'Hallucination' (when the AI generates incorrect information), 'Data Privacy', and 'Latency' (the time delay for a response).
Scientific Validity
  • ✅ Provides an excellent synthesis of the literature: Creating a thematic taxonomy is a highly appropriate and valuable method for synthesizing the findings of a systematic literature review. It moves beyond simple statistics to provide a structured, conceptual framework that represents the intellectual landscape of the field. This is a significant contribution of the paper.
  • ✅ Appears to be data-driven: The caption and reference text state that the taxonomy 'emerges from the reviewed literature,' which implies that the categories and their relationships are derived from the content of the 77 analyzed papers. This grounding in the review's dataset gives the conceptual map scientific validity as a summary of the field's structure.
  • 💡 Lacks quantitative weighting: The map shows the existence of various themes and components but provides no information on their prevalence in the literature. For example, it's impossible to tell if 'Supervised Learning' was mentioned more frequently than 'Unsupervised Learning'. The scientific value could be greatly enhanced by incorporating quantitative data, for instance, by varying the size of the nodes or the thickness of the connecting lines based on the frequency of mentions in the reviewed papers.
Communication
  • ✅ Gives a powerful 'at-a-glance' overview: The mind map format is highly effective for communicating the entire scope of the research area in a single view. It allows readers to quickly understand the key components and how they relate to one another, serving as an excellent visual abstract for the paper's findings.
  • 💡 High information density can be overwhelming: The diagram is very dense, containing a large number of nodes and connections. While comprehensive, this can be visually overwhelming for a reader, and the small font size may be difficult to read, especially in print. Consider simplifying the diagram or breaking it into a few smaller, more focused figures for key sub-topics.
  • 💡 Undefined color-coding scheme: The nodes are color-coded (e.g., green, blue, purple), but there is no legend to explain the meaning of the colors. This renders the color scheme purely decorative when it could have been used to convey an additional layer of information (e.g., distinguishing between concepts, technologies, and applications). Please add a legend to define the color scheme.
Table 5. Distribution of Platform Topologies.
Figure/Table Image (Page 11)
First Reference in Text
Table 5 quantifies the prevalence of these deployment methods across the 77 rigorously reviewed studies.
Description
  • Categorization of AI Deployment Infrastructures: This table classifies the 77 reviewed research papers based on their 'Platform Topology,' which refers to the type of computing infrastructure used to run the AI systems. It breaks down the studies into four distinct categories: 'Cloud native infrastructures' (using public services like Amazon Web Services or Google Cloud), 'On premises data centers' (using a company's own private servers), 'Edge or mobile hardware' (running AI on devices like smartphones), and 'Hybrid topologies' (a combination of these approaches).
  • Dominance of Cloud-Based Platforms: The key finding is the overwhelming dominance of cloud-based solutions. A significant majority of the studies, 51 out of 77 (representing 66.2%), utilized cloud-native infrastructures. This highlights the central role of scalable, on-demand cloud computing in modern AI research and development.
  • Prevalence of Other Topologies: The table also quantifies less common approaches. 'On premises' deployments were used in 15 studies (19.5%), often for reasons of data security or regulatory compliance. A smaller but notable group of 8 studies (10.4%) focused on 'edge or mobile hardware,' which is important for applications requiring low latency or offline functionality. 'Hybrid' systems were the least common, appearing in only 3 studies (3.9%).
Scientific Validity
  • ✅ Directly Addresses a Core Research Question: The table provides a direct and quantitative answer to RQ1, which asks about the platforms used in enterprise RAG + LLM studies. This is an appropriate and fundamental analysis for a systematic literature review, clearly mapping the technological landscape.
  • ✅ Data is Internally Consistent and Transparent: The table presents both absolute counts (# of Studies) and percentages, which is good practice. The numbers correctly sum to the total of 77 studies, and the percentages sum to 100%, indicating careful data handling and internal consistency.
  • 💡 Ambiguity in Category Definitions: While the 'Characterization' column is helpful, the operational definitions for the categories could be more precise. For example, the distinction between a 'hybrid' system and a 'cloud native' system that accesses on-premises data can be blurry. Providing more rigorous, mutually exclusive definitions for each topology would enhance the methodological soundness of the classification.
Communication
  • ✅ Excellent Clarity and Structure: The table is exceptionally clear and well-structured. The inclusion of the 'Characterization' column, which provides a concise description for each platform type, is a major strength. It makes the table highly self-contained and easy for readers to understand without needing to refer back to the main text.
  • ✅ Effective Use of a Table Format: A table is the ideal format for presenting this type of categorical data. It allows for a direct comparison of the prevalence of different platform types and presents the quantitative data in a precise and unambiguous way.
  • 💡 Suggest Sorting for Emphasis: To further improve readability and immediately emphasize the main finding, consider sorting the rows in descending order based on the '# of Studies' or '%' column. This would place the most dominant category, 'Cloud native infrastructures,' at the top, making the key takeaway even more apparent at a glance.
Table 6. Distribution of Dataset Categories.
Figure/Table Image (Page 11)
First Reference in Text
The systematic review of 77 quality assessed studies (2015–2025) identifies four dataset categories used to develop and evaluate Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) for enterprise level knowledge management and document automation (Table 6) [1,35].
Description
  • Classification of Data Sources: This table categorizes the 77 reviewed research papers based on the type of data they used to build and test their AI systems. It identifies four main sources: 'GitHub open source', 'Proprietary repositories', 'Benchmarks', and 'Custom industrial corpora'.
  • Dominance of Public Open-Source Data: The most significant finding is that a majority of studies, 42 out of 77 (54.5%), relied on publicly available data from GitHub. GitHub is a popular online platform where developers share code, documentation, and project information, making it a convenient but generic source for research.
  • Usage of Private and Standardized Datasets: The remaining studies used more specialized data. 'Custom industrial corpora' (special datasets created by companies for a specific task) were used in 13 studies (16.9%). 'Proprietary repositories' (a company's private internal documents) were used in 12 studies (15.6%). Finally, 'Benchmarks' (standardized academic datasets used for fair comparison of AI models) were the least common, used in 10 studies (13.0%).
Scientific Validity
  • ✅ Directly addresses a key research question: The table provides a clear, quantitative answer to RQ2 regarding the datasets used in the field. This is a fundamental and appropriate analysis for a systematic literature review, as it maps the data landscape and reveals common practices and potential biases in the literature.
  • 💡 Ambiguity in category definitions: The distinction between 'Proprietary repositories' and 'Custom industrial corpora' is not sufficiently clear from the provided descriptions. This ambiguity could lead to inconsistent classification of studies. To improve methodological rigor, please provide more precise, mutually exclusive operational definitions for these categories. For instance, is the key differentiator whether the data was pre-existing versus assembled specifically for the research?
  • 💡 Potential for unaddressed bias: The finding that over half the studies use public GitHub data is significant and points to a potential limitation in the field: 'domain shift'. Models trained on general-purpose public data may not perform well on specialized, private enterprise data. The table effectively surfaces this issue, but the authors should explicitly discuss the implications of this data source bias in their analysis.
Communication
  • ✅ Excellent use of a descriptive column: The inclusion of the 'Description' column is a major strength. It makes the table highly self-contained and ensures that readers can understand the meaning of each category without having to search for definitions in the main text, which is a best practice for table design.
  • ✅ Clear and well-structured layout: The table is cleanly formatted with clear headers and presents both absolute counts and percentages, allowing for easy interpretation and comparison across categories.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency. To improve readability and immediately highlight the most important finding, please sort the rows in descending order based on the '# Studies' column. This would place 'GitHub open source' at the top, making the key takeaway instantly apparent.
Figure 6. Proportional distribution of dataset sources.
Figure/Table Image (Page 11)
First Reference in Text
Not explicitly referenced in main text
Description
  • Proportional Breakdown of Data Sources: This pie chart illustrates the relative share of four different types of datasets used across the 77 reviewed research papers. The largest slice, 'GitHub Open-Source' (publicly available code and documents from the GitHub platform), accounts for 52.5% of all datasets used. The remaining sources are divided among 'Benchmarks' (standardized academic datasets for model comparison) at 18.2%, 'Custom Industrial' (datasets specially collected by companies for a specific task) at 17.2%, and 'Proprietary Repos' (private, internal company data) at 12.1%.
Scientific Validity
  • 💡 Severe data inconsistency with Table 6: There is a critical inconsistency between the data presented in this pie chart and the data in Table 6, which purports to show the same information. For example, this figure reports 'GitHub Open-Source' at 52.5% and 'Benchmarks' at 18.2%, whereas Table 6 reports them at 54.5% and 13.0%, respectively. This discrepancy is a major scientific validity issue that undermines the credibility of the data presentation. The raw counts in Table 6 appear to be correct, indicating that this figure has been plotted with incorrect values; recomputing the shares from those counts, as sketched after this list, reproduces the table's percentages.
  • 💡 Redundant visualization: This figure is entirely redundant as it visualizes the exact same categorical distribution data already presented with greater precision in Table 6. In scientific reporting, it is generally considered poor practice to present the same data in both a table and a figure. The table alone would suffice, or a bar chart could be used if a visual representation is strongly desired.
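Recomputing the shares from the Table 6 counts makes the discrepancy explicit; the short sketch below reproduces the table's percentages, which this pie chart should have matched.

```python
# Counts as reported in Table 6 (total = 77 studies).
counts = {"GitHub open source": 42, "Custom industrial corpora": 13,
          "Proprietary repositories": 12, "Benchmarks": 10}
total = sum(counts.values())  # 77

for name, n in counts.items():
    print(f"{name}: {100 * n / total:.1f}%")
# -> 54.5%, 16.9%, 15.6%, 13.0%, whereas Figure 6 shows 52.5%, 17.2%, 12.1%, 18.2%.
```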
Communication
  • 💡 Inappropriate choice of chart type: Pie charts are widely discouraged for scientific communication because the human eye is poor at accurately comparing angles and areas. A simple bar chart would be a far more effective and clearer way to present this data, as it allows for easy and accurate comparison of the lengths of the bars.
  • 💡 Lack of reference in the main text: This figure is an 'orphan' element; it is not referenced anywhere in the main body of the paper. All figures must be explicitly introduced and discussed in the text to provide context and guide the reader.
  • 💡 Inconsistent terminology: The labels used in the figure (e.g., 'GitHub Open-Source', 'Proprietary Repos') are slightly different from those in Table 6 (e.g., 'GitHub open source', 'Proprietary repositories'). Please ensure terminology is consistent across all figures and tables for clarity.
  • 💡 Cluttered labeling: The external labels connected by lines create visual clutter, particularly for the 'GitHub Open-Source' category. A bar chart would allow for much cleaner and more direct labeling of the categories and their corresponding values.
Table 7. Distribution of Machine Learning Paradigms in Enterprise RAG + LLM...
Full Caption

Table 7. Distribution of Machine Learning Paradigms in Enterprise RAG + LLM Studies.

Figure/Table Image (Page 12)
First Reference in Text
A review of 77 studies, subjected to a rigorous quality filter, shows an overwhelming preference for supervised learning when combining RAG and LLM in enterprise contexts (Table 7) [7,29].
Description
  • Classification of AI Training Methodologies: This table categorizes the 77 reviewed studies based on the 'machine learning paradigm' they employed, which is the fundamental approach used to train the AI models. The paradigms are divided into three types: Supervised, Unsupervised, and Semi-supervised learning.
  • Overwhelming Dominance of Supervised Learning: The most striking finding is the dominance of 'Supervised learning', which was used in 71 out of 77 studies, accounting for 92.2% of the literature. Supervised learning is a method where the AI learns from data that has been pre-labeled with the correct answers, much like a student studying with an answer key. This suggests most research in this area relies on having well-annotated datasets.
  • Limited Use of Unsupervised and Semi-supervised Methods: In contrast, 'Unsupervised learning' and 'Semi-supervised learning' are far less common, each accounting for only 3 studies (3.9%). Unsupervised learning involves finding hidden patterns in unlabeled data without an answer key. Semi-supervised learning is a hybrid approach that uses a small amount of labeled data to help guide the learning process on a much larger set of unlabeled data. Their low prevalence indicates a potential research gap in scenarios where labeled data is scarce.
Scientific Validity
  • ✅ Directly Addresses a Core Research Question: The table provides a clear and quantitative answer to RQ3 ('Which types of machine learning... are employed?'). This classification is a fundamental and appropriate analysis for a systematic literature review, effectively mapping the methodological landscape of the field.
  • ✅ Strongly Supports the Central Claim in the Text: The data presented (92.2% for supervised vs. 3.9% for others) provides robust, quantitative evidence for the claim in the reference text of an 'overwhelming preference for supervised learning'. The data and the interpretation are in perfect alignment.
  • 💡 Lacks Granularity in Classification: While the broad categories are useful, they may oversimplify the methodologies. For example, 'supervised learning' could encompass a wide range of techniques from simple classification to complex fine-tuning of large models. The methodology for assigning a single paradigm to each paper, especially for studies that might use hybrid approaches, is not detailed. Providing more insight into the classification criteria would enhance methodological transparency.
Communication
  • ✅ Excellent Clarity and Self-Containment: The table is exceptionally clear. The inclusion of the 'Description' column, which provides a concise definition for each technical term, is a major strength. It makes the table self-contained and easily understandable for readers who may not be experts in machine learning.
  • ✅ Effective Data Presentation: The table format is ideal for this categorical data. It presents the absolute counts and percentages cleanly, allowing for easy comparison and immediate comprehension of the key finding: the dominance of supervised learning.
  • 💡 Suggest Sorting for Emphasis: To further enhance readability and immediately highlight the main finding, sorting the rows in descending order by '# Studies' would be beneficial. This would place 'Supervised' at the top, reinforcing the primary message of the table at first glance.
Figure 7. Distribution of machine learning paradigms.
Figure/Table Image (Page 13)
Figure 7. Distribution of machine learning paradigms.
First Reference in Text
Figure 7 visualizes these rates and highlights the near ubiquity of supervised methods (92.2%), while unsupervised and semi supervised strategies remain under researched [7,29].
Description
  • Proportional Breakdown of AI Training Methods: This donut chart displays the proportional distribution of the three main machine learning training methods, or 'paradigms,' used across the 77 studies. The chart is dominated by 'Supervised' learning, which accounts for 92.2% of all cases. Supervised learning is a technique where an AI model is trained on data that has been pre-labeled with correct answers. The other two methods, 'Unsupervised' learning (where the AI finds patterns in unlabeled data) and 'Semi-supervised' learning (a hybrid approach), are used much less frequently, each accounting for 3.9% of studies.
Scientific Validity
  • ✅ Data strongly supports the textual claim: The numerical data presented in the chart (92.2% for supervised methods) perfectly aligns with and provides strong visual support for the central claim made in the reference text regarding the 'near ubiquity' of this paradigm.
  • 💡 Redundant and less precise than the table: This figure is entirely redundant, as it visualizes the exact same data presented with greater precision in Table 7. It is generally considered poor practice in scientific reporting to present the same data in both a table and a figure. The table provides the exact counts and is sufficient on its own.
  • 💡 Inappropriate chart type for scientific comparison: Donut charts (and pie charts) are widely discouraged for scientific data visualization because they require readers to compare angles and areas, which is less accurate than comparing lengths. A simple bar chart would be a more methodologically sound and effective choice for presenting this categorical data, as it allows for direct and unambiguous comparison between the paradigms.
Communication
  • 💡 Ineffective and redundant visualization: The primary communication issue is the figure's redundancy with Table 7. It adds no new information and clutters the paper. For the sake of clarity and conciseness, this figure should be removed, and the text should refer directly to Table 7.
  • 💡 Poor choice of chart type: As noted under Scientific Validity, a donut chart is a poor choice for communicating this data. The extreme imbalance (92.2% vs. 3.9%) makes the smaller segments very difficult to see and compare. A horizontal bar chart would be a far superior alternative, allowing for clear labeling and easy comparison of the categories; a minimal sketch of such a chart follows this entry.
  • ✅ Clear labeling of percentages: The chart does a good job of clearly labeling each segment with the paradigm name and its exact percentage, which makes the main message easy to grasp despite the limitations of the chart type.
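To make the bar-chart suggestion concrete, here is a minimal matplotlib sketch that replaces the donut with a labeled horizontal bar chart, using the counts reported in Table 7 (71 supervised, 3 unsupervised, 3 semi-supervised out of 77 studies).

```python
import matplotlib.pyplot as plt

paradigms = ["Semi-supervised", "Unsupervised", "Supervised"]  # smallest first so the largest ends up on top
counts = [3, 3, 71]                                            # counts reported in Table 7 (n = 77)

fig, ax = plt.subplots(figsize=(6, 2.5))
bars = ax.barh(paradigms, counts)
ax.bar_label(bars, labels=[f"{c} ({c / 77:.1%})" for c in counts], padding=3)
ax.set_xlabel("Number of studies (n = 77)")
ax.set_xlim(0, 80)                                             # leave room for the end-of-bar labels
plt.tight_layout()
plt.show()
```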
Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing...
Full Caption

Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing Strategies.

Figure/Table Image (Page 13)
Table 8. Taxonomy and Frequency of Algorithms, RAG Architectures, and Indexing Strategies.
First Reference in Text
Table 8 provides a detailed taxonomy of these methods, classifying them into traditional baselines, deep learning models, RAG variants, and indexing strategies.
Description
  • Taxonomy and Frequency of AI Techniques: This table categorizes and counts the mentions of various AI techniques across the 77 reviewed studies. It is divided into four main groups: 'Traditional ML (Baselines)', 'Deep Learning Models', 'RAG Architectures', and 'Retrieval & Indexing'.
  • Key Frequency Data: The most frequently mentioned categories of techniques were 'Retrieval & Indexing' (127 total mentions) and 'Traditional ML' (121 total mentions). 'Retrieval & Indexing' refers to methods for organizing and searching information; the most common was 'Dense Vector' search (62 mentions), a modern technique that understands meaning, followed by older keyword-based methods like 'BM25/TF-IDF' (45 mentions). 'Traditional ML' includes foundational algorithms like 'Naïve Bayes' (26 mentions) and 'SVM' (22 mentions), which are often used as 'baselines'—a standard for comparison to prove that newer methods are an improvement. 'RAG Architectures'—the specific designs for the core AI technology being studied—were mentioned 82 times. In contrast, older 'Deep Learning Models' like LSTM and CNN were mentioned only 8 times, indicating the field has largely moved on to newer approaches.
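To illustrate the distinction drawn above between keyword-based and dense retrieval, the sketch below scores a toy query with TF-IDF (standing in for BM25) and with a placeholder embed() function standing in for a real sentence-embedding model, then blends the two scores in the way hybrid pipelines are commonly wired. All names and data here are illustrative assumptions, not the reviewed systems.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["invoice approval workflow policy",
        "contract clause on data retention periods",
        "employee onboarding checklist"]
query = "how long must contract data be retained?"

# Sparse keyword scoring (TF-IDF here; BM25 plays the same role in most pipelines).
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
sparse_scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()

# Dense scoring: embed() is a hypothetical stand-in for any sentence-embedding model;
# the random vectors below carry no real semantics.
def embed(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

dense_scores = cosine_similarity(embed([query]), embed(docs)).ravel()

# A simple hybrid: weighted sum of the two score vectors, then rank.
hybrid = 0.5 * sparse_scores + 0.5 * dense_scores
print("documents ranked best-first:", np.argsort(-hybrid))
```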
Scientific Validity
  • ✅ Provides a granular, quantitative overview of the technical landscape: The table demonstrates a rigorous data extraction process by not only categorizing technologies but also quantifying their frequency of mention. This provides a valuable, data-driven snapshot of the specific algorithms, architectures, and strategies that are most prevalent in the literature, directly addressing RQ4.
  • 💡 The unit of measurement, '# Mentions,' is ambiguous and potentially misleading: The metric '# Mentions' is not clearly defined. It is unclear whether this is a count of papers that mention a technique (i.e., a maximum of 77) or a raw count of every time the term appears across all papers. If the latter, a single paper that discusses an algorithm extensively could disproportionately inflate its count, skewing the perception of its overall importance in the field. To improve methodological rigor, please clarify the counting protocol. A metric like '# of Studies Mentioning Technique' would be less ambiguous and more robust.
  • 💡 The categorization conflates components with parallel technologies: 'Retrieval & Indexing' strategies are not independent alternatives to 'RAG Architectures'; they are integral sub-components of them. Presenting them as separate, parallel categories in the table is conceptually confusing. A more accurate representation would be a hierarchical structure that shows indexing strategies as a component within RAG systems.
  • 💡 Risk of misinterpretation without textual context: The high number of mentions for 'Traditional ML' (121) could lead a reader to incorrectly conclude that these methods are dominant. The crucial context—that they are primarily used as baselines for comparison—is only provided in the subsequent text. The table itself does not convey this, creating a risk of misinterpretation if viewed in isolation.
Communication
  • ✅ Effective grouping for a high-level overview: The table successfully organizes a large and complex set of technical terms into four digestible categories. This structure helps the reader to quickly grasp the main classes of technologies discussed in the literature.
  • 💡 Crucial context is missing from the table itself: The fact that 'Traditional ML' algorithms are used as 'Baselines' is critical for correct interpretation. This information should be integrated directly into the table to make it more self-contained. Suggest changing the category label from 'Traditional ML (Baselines)' to 'Traditional ML (Primarily used as baselines)' or adding a note directly in the table.
  • 💡 Lack of internal sorting reduces readability: Within each category, the specific algorithms are not consistently sorted by their number of mentions. To improve clarity and make it easier for readers to identify the most prominent techniques at a glance, please sort the items within each category in descending order of their '# Mentions'.
  • 💡 Note is helpful but could be more prominent: The note clarifying that counts represent total mentions is helpful but is placed in small font below the table. Given the ambiguity of the '# Mentions' header, this important clarification could be missed. Consider making the header more descriptive, such as '# Total Mentions (across 77 studies)'.
Figure 8. Frequency of the top five machine learning algorithms used primarily...
Full Caption

Figure 8. Frequency of the top five machine learning algorithms used primarily as baselines or classifiers in RAG + LLM studies.

Figure/Table Image (Page 14)
Figure 8. Frequency of the top five machine learning algorithms used primarily as baselines or classifiers in RAG + LLM studies.
First Reference in Text
Figure 8 illustrates the continued relevance of classical algorithms as benchmarks alongside these modern innovations.
Description
  • Frequency of Top Five Classical AI Algorithms: This bar chart displays the frequency of use for the five most common 'classical' machine learning algorithms found in the 77 reviewed studies. These algorithms are often used as 'baselines,' which are simpler, established methods that new, more complex AI systems are compared against to prove their superiority. The most frequently used algorithm was Naive Bayes, appearing in 26 studies. The others, in descending order of frequency, were SVM (Support Vector Machine) in 22 studies, Logistic Regression in 19, Decision Tree in 18, and Random Forest in 15.
Scientific Validity
  • ✅ Effectively visualizes a key finding: The figure successfully highlights a key finding from the data in Table 8: that despite the focus on modern LLMs, traditional machine learning algorithms are still highly prevalent in the research literature, primarily for benchmarking. This supports the paper's narrative about the role of these established methods.
  • 💡 Critical inconsistency in the unit of measurement: There is a significant contradiction between this figure and Table 8 regarding the unit of measurement. The y-axis of this figure is labeled 'Number of Studies,' implying that, for example, 26 distinct studies used Naive Bayes. However, the header in Table 8, from which this data is derived, is '# Mentions,' and its note clarifies that 'Counts represent total mentions.' These are two very different metrics, and the inconsistency must be resolved. 'Number of Studies' is the more scientifically robust and meaningful metric; if it is the intended one, the labeling in Table 8 should be changed accordingly.
Communication
  • ✅ Excellent and informative caption: The caption is a major strength. It provides crucial context by explicitly stating that these algorithms are used 'primarily as baselines or classifiers.' This information is essential for correct interpretation and makes the figure highly self-contained, a significant improvement over the presentation in Table 8.
  • ✅ Appropriate choice of visualization: A bar chart is the ideal visualization for comparing the frequencies of a small number of discrete categories. It is clear, intuitive, and effectively communicates which algorithms are most common.
  • 💡 Missing data labels: The exact values for each bar must be estimated from the y-axis, which reduces precision. To improve clarity and make the chart easier to read, please add numerical data labels on top of each bar.
Table 9. Distribution of Metric Categories.
Figure/Table Image (Page 14)
Table 9. Distribution of Metric Categories.
First Reference in Text
Quality Evaluation Questions lists the five primary categories (Table 9) of evaluation metrics used across the 77 reviewed studies and the proportion of studies employing each type.
Description
  • Classification of AI Performance Metrics: This table categorizes the different types of performance measures, or 'evaluation metrics,' used in the 77 reviewed studies to assess the quality of AI systems. It is divided into five main categories, ranging from automated technical scores to human judgments and real-world business outcomes.
  • Dominance of Automated Technical Metrics: The data shows a strong reliance on automated, technical metrics. The most common are standard classification measures like 'Precision / Recall / Accuracy', used in 62 studies (80.5%). These are followed by retrieval-specific metrics like 'Recall@K / Precision@K' (measuring if the right answer is in the top K results), used in 56 studies (72.7%), and text-generation metrics like 'ROUGE / BLEU' (which compare AI-generated text to human examples), used in 34 studies (44.2%).
  • Scarcity of Human-Centric and Business-Oriented Metrics: In stark contrast to the technical metrics, measures that reflect real-world usability are far less common. 'Human Evaluation', where a person directly assesses the AI's output for qualities like fluency and factual correctness, was included in only 15 studies (19.5%). Even rarer were 'Business Impact Metrics', which measure tangible outcomes like reduced costs or increased efficiency, appearing in just 12 studies (15.6%). This disparity highlights a significant gap between academic evaluation and practical enterprise value.
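To make the automated categories above concrete, the following sketch computes classification-style precision, recall, and accuracy with scikit-learn and a hand-rolled Recall@K for a toy ranked list; the data are illustrative and not drawn from any reviewed study.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Classification-style metrics over toy labels (1 = relevant/correct, 0 = not).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy:",  accuracy_score(y_true, y_pred))    # 5/6
print("precision:", precision_score(y_true, y_pred))   # 3/3
print("recall:",    recall_score(y_true, y_pred))      # 3/4

# Retrieval-style Recall@K: how many relevant documents appear in the top K results?
def recall_at_k(ranked_doc_ids, relevant_ids, k):
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

print(recall_at_k(["d3", "d7", "d1", "d9"], relevant_ids={"d1", "d4"}, k=3))  # 0.5
```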
Scientific Validity
  • ✅ Effectively Highlights a Critical Research Gap: The table's primary strength is its clear, quantitative demonstration of the gap between academic evaluation practices and the needs of enterprise applications. The stark contrast between the high usage of automated metrics (80.5%) and the low usage of business impact metrics (15.6%) is a significant finding that strongly supports the paper's narrative about the 'lab to market' challenge.
  • 💡 Methodological ambiguity regarding non-exclusive categories: The sum of studies across all categories (179) far exceeds the total number of papers (77), indicating that the categories are not mutually exclusive and that a single study can employ multiple metric types. This is a critical methodological detail that is not explicitly stated. To avoid potential misinterpretation, a note should be added to the table clarifying that studies can be counted in multiple categories. This would improve the transparency and rigor of the data presentation.
  • ✅ Directly addresses a core research question: This table provides a direct and data-driven answer to RQ5 ('Which evaluation metrics are used to assess model performance?'). The categorization and quantification are appropriate for a systematic literature review and effectively map the current state of evaluation practices in the field.
Communication
  • ✅ Excellent use of a descriptive column: The 'Description' column is a major communication strength. It provides concise definitions for each metric category, making the table highly self-contained and accessible to readers who may not be familiar with all the technical terms (e.g., ROUGE/BLEU).
  • ✅ Clear and well-structured layout: The table is cleanly formatted with clear headers and presents both absolute counts (# Studies) and percentages, allowing for easy interpretation and comparison of the prevalence of each metric category.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted in any particular order. To improve readability and immediately emphasize the key finding, please sort the rows in descending order based on the '# Studies' or '%' column. This would place the most common metrics at the top and the least common at the bottom, making the disparity instantly clear to the reader.
Figure 9. Proportions of studies using each evaluation metric category (n = 77).
Figure/Table Image (Page 15)
Figure 9. Proportions of studies using each evaluation metric category (n = 77).
First Reference in Text
Figure 9 visualizes these percentages [17,31,32,108].
Description
  • Visualization of Evaluation Metric Usage: This bar chart displays the percentage of the 77 reviewed studies that utilized different categories of performance metrics to evaluate their AI systems. The chart highlights a strong preference for automated, technical metrics over those that involve human judgment or measure real-world business value.
  • Dominance of Technical Metrics: The chart shows that the most common metric category was 'Precision/Recall/Accuracy', used by approximately 75% of the studies. This was followed by 'Recall/Prec' (a shorthand for metrics that check if the correct answer is within the top results), used by 68% of studies, and 'ROUGE/BLEU' (metrics that compare AI-generated text to a human-written reference), used by 42%.
  • Scarcity of Human-Centric and Business Metrics: A key takeaway is the sharp decline in the use of metrics that assess practical utility. 'Human Eval' (where a person directly scores the AI's output) was reported in only 18% of studies. Even less frequent were 'Business Impact' metrics (measuring outcomes like efficiency gains or cost reduction), which were used in just 15% of the studies. This visualizes a significant gap between laboratory-style evaluation and real-world application assessment.
Scientific Validity
  • 💡 Critical data inconsistency with Table 9: There is a severe inconsistency between the percentages displayed in this figure and the data presented in Table 9, which is the source for this visualization. For every category, the percentage in the figure differs from the value calculated from the table (e.g., 'Precision/Recall/Accuracy' is 75% in the figure vs. 80.5% calculated from Table 9; 'Human Eval' is 18% vs. 19.5%). This is a major scientific validity issue that undermines the credibility of the data presentation and suggests an error in data handling or figure generation. It must be corrected so that the figure accurately reflects the source data; a short sketch after this list shows how the percentages can be regenerated directly from the table counts to prevent such drift.
  • 💡 Redundant and less precise visualization: This figure is entirely redundant, as it visualizes the exact same categorical data already presented with greater precision (i.e., with raw counts) in Table 9. In scientific reporting, it is generally considered poor practice to present the same data in both a table and a figure. The table alone is sufficient and more precise.
  • ✅ Effectively visualizes the central narrative: Despite the numerical inaccuracies, the overall shape of the distribution—the steep drop-off from technical metrics to business metrics—effectively visualizes and reinforces the paper's central argument about the 'lab to market' gap in evaluation practices.
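One way to prevent the drift noted above is to generate the figure's values directly from the table counts rather than re-entering them by hand. A minimal sketch using the counts reported in Table 9 (n = 77):

```python
# Regenerate the Figure 9 percentages directly from the Table 9 counts (n = 77)
# so that the figure and the table cannot drift apart.
counts = {
    "Precision / Recall / Accuracy": 62,
    "Recall@K / Precision@K": 56,
    "ROUGE / BLEU": 34,
    "Human Evaluation": 15,
    "Business Impact Metrics": 12,
}
n = 77
for name, c in counts.items():
    print(f"{name}: {c / n:.1%}")   # 80.5%, 72.7%, 44.2%, 19.5%, 15.6%
```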
Communication
  • 💡 Missing Y-axis label: The vertical axis is not labeled, forcing the reader to infer that it represents a percentage. To adhere to best practices for data visualization, the y-axis must be explicitly labeled, for example, as 'Percentage of Studies (%)'.
  • 💡 Inconsistent and unclear X-axis labels: The labels on the x-axis are inconsistent with those in Table 9. For example, 'Recall/Prec' is an unclear abbreviation for 'Recall@K / Precision@K'. Furthermore, the labels are rotated, which reduces readability. A horizontal bar chart would be a better choice as it allows for fully horizontal, and thus more legible, category labels.
  • 💡 Poor image quality: The resolution of the figure is low, resulting in blurry text and imprecise visual elements. Figures should always be provided in a high-resolution format to ensure clarity and professionalism.
  • 💡 Redundant presentation of information: As noted in the scientific validity section, presenting the same data in both Table 9 and Figure 9 is redundant. To improve the manuscript's conciseness and impact, one of these elements should be removed. Given the data inconsistencies in the figure, it is the clear candidate for removal.
Table 10. Distribution of Validation Methods.
Figure/Table Image (Page 15)
Table 10. Distribution of Validation Methods.
First Reference in Text
Studies use various validation strategies to assess the robustness and generalizability of RAG + LLM systems (Table 10, Figure 10) [17,31,32].
Description
  • Classification of Model Validation Strategies: This table categorizes the 77 reviewed studies based on the method they used to validate the performance of their AI systems. Three primary strategies are identified: 'k fold Cross Validation', 'Hold out Split', and 'Real world Case Study'.
  • Dominance of k-fold Cross Validation: The most prevalent method is 'k fold Cross Validation', which was used in 72 out of 77 studies (93.5%). This is a common technique in machine learning where the dataset is split into 'k' subsets; the model is trained on k-1 of the subsets and tested on the remaining one, a process that is repeated k times to ensure the results are stable and not just a fluke of one particular data split.
  • Usage of Other Validation Methods: The 'Hold out Split' method, a simpler approach where the data is divided just once into a training set and a test set, was used in 20 studies (26.0%). The least common approach was the 'Real world Case Study', where the AI system is deployed in a live business environment to measure its actual impact. This was reported in only 10 studies (13.0%). The fact that the numbers add up to more than 77 indicates that some studies used more than one validation method.
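For readers less familiar with the two dominant academic strategies, the sketch below contrasts them on a toy dataset with scikit-learn; it is illustrative only and not tied to any reviewed system.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validation: the data is split into k folds, and the model is trained
# and tested k times so the final score is not a fluke of one particular split.
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean accuracy:", cv_scores.mean())

# Hold-out split: a single train/test partition; cheaper, but higher variance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))
```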
Scientific Validity
  • ✅ Directly addresses a core research question: The table provides a clear, quantitative answer to RQ6 ('Which validation approaches... are adopted?'). This analysis is a fundamental component of a systematic literature review, effectively mapping the methodological practices in the field.
  • ✅ Highlights a key gap between academic and practical validation: The data strongly supports the paper's central theme of a 'lab to market' gap. The overwhelming prevalence of a traditional academic technique like k-fold cross-validation (93.5%) compared to the scarcity of real-world case studies (13.0%) is a significant and well-supported finding.
  • 💡 Methodological ambiguity of non-exclusive categories: The sum of studies across the categories is 102, which is greater than the total of 77 papers. This indicates that the categories are not mutually exclusive, a critical detail that is not explicitly stated in the table itself. To improve transparency and prevent misinterpretation, a note should be added to the table clarifying that a single study could employ and be counted under multiple validation methods.
  • 💡 Table oversimplifies a nuanced application: The text provides a crucial piece of context that the table omits: k-fold is predominantly used for 'retrieval modules', while the less computationally expensive hold-out split is used for 'generative LLM components'. The table presents these as parallel choices, which is an oversimplification. The data is valid, but the table's structure doesn't fully capture the nuance of how these methods are applied within a single system.
Communication
  • ✅ Excellent clarity and self-containment: The table is exceptionally well-designed for clarity. The 'Description' column, which provides a concise explanation of each validation method, is a major strength. It makes the technical terms accessible and ensures the table is understandable on its own.
  • ✅ Appropriate format for the data: A table is the ideal format for presenting this type of categorical data, as it clearly lays out the different methods, their descriptions, and their frequencies (both absolute and relative) in a way that is easy to read and compare.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency. To improve readability and make the main finding more immediately obvious, please sort the rows in descending order based on the '# Studies' column. This would place 'k fold Cross Validation' at the top, instantly highlighting its dominance.
Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM...
Full Caption

Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM studies.

Figure/Table Image (Page 15)
Figure 10. Distribution of validation approaches across 77 enterprise RAG + LLM studies.
First Reference in Text
Studies use various validation strategies to assess the robustness and generalizability of RAG + LLM systems (Table 10, Figure 10) [17,31,32].
Description
  • Visualization of AI Model Validation Methods: This bar chart displays the percentage of the 77 reviewed studies that used one of three common methods for validating the performance of their AI systems. The most widely used method is 'K-Fold CV' (k-fold Cross Validation), employed by 92.3% of studies. This is a standard academic technique where a dataset is repeatedly split into training and testing portions to get a stable measure of performance. The 'Hold-out Split' method, a simpler one-time split of the data, was used by 25.6% of studies. The least common method was 'Case Study/Field Trial', where the system is tested in a real-world setting, used by only 12.8% of studies. The percentages sum to more than 100%, indicating that some studies used multiple validation techniques.
Scientific Validity
  • 💡 Critical data inconsistency with Table 10: There is a significant scientific validity issue due to data inconsistency between this figure and its source, Table 10. For every category, the percentages are different: 'K-Fold CV' is 92.3% here vs. 93.5% in the table; 'Hold-out Split' is 25.6% vs. 26.0%; and 'Case Study/Field Trial' is 12.8% vs. 13.0%. These discrepancies, while small, suggest errors in data transcription or figure generation and undermine the credibility of the reported findings. The figure must be corrected to match the source data in Table 10.
  • 💡 Redundant and less precise visualization: This figure is entirely redundant, as it visualizes the exact same information already presented with greater precision (including raw counts) in Table 10. It is poor practice to present identical data in both a table and a figure. The table is superior as it provides both absolute and relative frequencies. This figure should be removed.
  • ✅ Visually supports the paper's narrative: Despite the numerical errors, the visual pattern of the chart—a steep decline from the academic 'K-Fold CV' to the practical 'Case Study/Field Trial'—effectively reinforces the paper's central argument about the gap between academic validation and real-world application.
Communication
  • 💡 Redundant presentation clutters the manuscript: The primary communication failure is the figure's redundancy. Including both Table 10 and Figure 10 to show the same data is inefficient and adds unnecessary clutter to the paper. For clarity and conciseness, this figure should be removed and the text should refer only to the more detailed Table 10.
  • 💡 Missing Y-axis label: The vertical axis lacks a label, forcing the reader to infer its meaning from the context and the data labels on the bars. To adhere to standard data visualization practices, the y-axis must be explicitly labeled, for example, 'Percentage of Studies (%)'.
  • ✅ Appropriate choice of chart type and clear data labels: A bar chart is a suitable choice for comparing the usage rates of these discrete categories. A key strength is the inclusion of precise percentage labels on top of each bar, which makes the chart's quantitative information easy to read and understand.
Figure 11. Number of studies using each metric category (multi select allowed;...
Full Caption

Figure 11. Number of studies using each metric category (multi select allowed; n=77 total studies).

Figure/Table Image (Page 16)
Figure 11. Number of studies using each metric category (multi select allowed; n=77 total studies).
First Reference in Text
Table 9 and Figure 11 show the distribution of metric categories for 77 quality assessed articles.
Description
  • Categorization of Software Metrics: This bar chart shows the number of studies that used different types of software metrics. These metrics are used to measure characteristics of computer code and software development processes. The chart displays five categories of these metrics.
  • Dominance of Object-Oriented Metrics: The most commonly used type of metric was 'object-oriented metrics', which were featured in 24 of the 77 studies. These are measures that analyze the structure and complexity of code written in an object-oriented style, a common modern programming paradigm. The next most frequent category, identified in the text as 'procedural and domain-specific metrics', was used in 11 studies. The remaining three categories, identified as 'web, process, and performance metrics', were used very rarely, each appearing in only 2 studies.
Scientific Validity
  • 💡 Critical confusion between 'Software Metrics' and 'Evaluation Metrics': There is a severe methodological flaw in how this figure is presented. The reference text incorrectly links this figure to Table 9, stating they both show 'metric categories'. However, Table 9 details AI evaluation metrics (like accuracy and ROUGE), while this figure, according to the text in section 4.7, shows software metrics (like object-oriented complexity). These are two entirely different concepts and research questions (RQ5 vs. RQ7). This conflation is a major scientific error that fundamentally confuses two separate analyses and undermines the paper's clarity and rigor.
  • 💡 Lack of a source data table: Unlike other charts in this paper that are based on a corresponding table (e.g., Figure 10 is based on Table 10), there is no source table provided for the data in Figure 11. For a systematic literature review, transparency is paramount. The raw data (the list of metric categories and the number of studies for each) must be presented in a table to allow for verification and scrutiny.
  • ✅ Caption clarifies counting methodology: The caption's inclusion of '(multi select allowed; n=77 total studies)' is an example of good practice. It correctly informs the reader that the sum of the bars will not equal 77 because a single study could use metrics from multiple categories, which is crucial for correct interpretation.
Communication
  • 💡 Figure is critically misleading due to incorrect referencing: The primary communication failure is that the reference text creates a false equivalence between this figure and Table 9, leading to significant confusion. This figure should be exclusively discussed within the context of RQ7 (Software Metrics) and completely decoupled from the discussion of RQ5 (Evaluation Metrics) and Table 9. The incorrect reference must be removed.
  • 💡 Missing X-axis labels: The bars on the x-axis are not labeled in the figure itself. A reader cannot understand what each bar represents without searching for the information in the main text. This violates the principle that figures should be as self-contained as possible. Each bar must be clearly labeled with its respective metric category.
  • 💡 Data labels would improve clarity: The exact count for each bar must be estimated from the y-axis. To improve precision and readability, please add numerical data labels (e.g., '24', '11', '2') on top of each bar.
  • ✅ Appropriate chart type: A bar chart is a suitable and effective choice for displaying the frequency of use for several non-exclusive categories, allowing for easy comparison between them.
Table 11. Key Configurations and Performance Findings.
Figure/Table Image (Page 16)
Table 11. Key Configurations and Performance Findings.
First Reference in Text
Table 11 summarizes the top configurations, and Figure 12 charts the frequency with which each configuration achieved state of the art results on its respective benchmark or case study [31,61,105].
Description
  • Summary of Top-Performing AI System Setups: This table identifies the five most successful AI system setups, or 'configurations,' found in the reviewed literature. A configuration is a specific combination of an AI architecture (like 'RAG Token' or 'Hybrid RAG') and a particular Large Language Model (like 'BART' or 'GPT-3.5'). The table aims to answer the question: 'What combinations of technologies work best for specific tasks?'
  • Task-Specific Performance Data: For each of the five configurations, the table details the specific business task it excelled at (e.g., 'Knowledge grounded QA' or 'Contract Clause Generation'). It also provides the number of studies in which that configuration was reported as the top performer, with the most frequent being 'RAG Token + Fine Tuned BART' which was the best in 5 studies.
  • Key Performance Findings: The 'Key Findings' column summarizes the performance of each configuration with specific metrics or qualitative outcomes. For example, the 'RAG Token + Fine Tuned BART' setup achieved up to an 87% exact match score on question-answering tasks and reduced AI 'hallucinations' (factually incorrect outputs) by 35%. Another configuration, used for technical manual tasks, reportedly 'Reduced manual editing time by 40%' in real-world trials.
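Since Table 11 cites an 'exact match' score without defining it, a brief sketch of one common implementation may help readers interpret such figures; the normalization rules below (lowercasing, punctuation stripping) are an assumed convention, not necessarily the one used in the reviewed studies.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace (one common EM convention)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["Net 30 days", "the vendor"],
                  ["net 30 days.", "The customer"]))   # 0.5
```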
Scientific Validity
  • ✅ Synthesizes performance data to answer a key research question: The table provides a direct and valuable answer to RQ8 ('Best Performing RAG + LLM Configurations'). By moving beyond simple frequency counts to synthesize performance outcomes, it offers a higher level of analysis that is crucial for a systematic review and provides actionable insights.
  • 💡 The criteria for 'top performance' are undefined: A major methodological limitation is the lack of a clear, objective definition for what constitutes a 'top performing' configuration or 'state of the art results'. It is unclear if this was determined by the original authors' claims or by a standardized re-evaluation. This subjectivity makes the selection process opaque and potentially irreproducible. The criteria for inclusion in this table must be explicitly defined.
  • 💡 Conclusions are drawn from a very small sample size: The total number of 'top performing reports' summarized here is 16 (5+4+3+2+2). This is a very small subset of the 77 total studies analyzed. Drawing strong, generalizable conclusions about the 'best' configurations from such a small sample is statistically questionable. This limitation should be prominently acknowledged in the text.
  • 💡 Performance metrics are not directly comparable: The table presents a mix of disparate performance metrics (e.g., 'exact match' for QA, 'ROUGE-L' for summarization, '% time saved' for workflow automation). While this accurately reflects the literature, presenting them side-by-side without strong caveats could misleadingly imply they are comparable. The inherent difficulty in comparing performance across different task types should be discussed.
Communication
  • ✅ Excellent structure for conveying complex information: The table is very well-structured. The columns for 'Configuration', 'Task Type', and 'Key Findings' work together effectively to link a specific technology stack to a specific business problem and its performance outcome. This is a highly effective way to communicate complex findings.
  • ✅ 'Key Findings' column provides valuable qualitative context: The inclusion of a qualitative summary in the 'Key Findings' column is a major strength. It goes beyond raw numbers to explain why a configuration was successful, making the results much more meaningful and interpretable for the reader.
  • 💡 The column header '#*' is cryptic: The header '#*' is not self-explanatory. While its meaning is clarified in a footnote, this is easy for a reader to miss. For better clarity, the header should be changed to something more descriptive, such as '# Top Performing Studies' or '# SOTA Reports'.
Figure 12. Several studies have shown that each RAG + LLM configuration...
Full Caption

Figure 12. Several studies have shown that each RAG + LLM configuration attained top reported performance (n = 16 total top performing reports).

Figure/Table Image (Page 17)
Figure 12. Several studies have shown that each RAG + LLM configuration attained top reported performance (n = 16 total top performing reports).
First Reference in Text
Table 11 summarizes the top configurations, and Figure 12 charts the frequency with which each configuration achieved state of the art results on its respective benchmark or case study [31,61,105].
Description
  • Frequency of Top-Performing AI Configurations: This bar chart visualizes how frequently each of the five best-performing AI system setups, or 'configurations,' were reported as achieving top results in the literature. A configuration refers to a specific combination of an AI architecture and a Large Language Model. The chart is based on a total of 16 such top-performing reports identified across the 77 reviewed studies.
  • Key Data Points: The most frequently cited top-performing setup was 'RAG-Token + BART-ft', which appeared in 5 reports. The next most frequent was 'RAG-Seq + GPT-3.5' with 4 reports. The remaining configurations, 'Hybrid RAG + T5-L', 'RAG-Token + ROBERTa', and 'RAG-Seq + Flan-T5', were reported as top-performers in 3, 2, and 2 reports, respectively. The bars are sorted in descending order of frequency.
Scientific Validity
  • 💡 Redundant and less informative than Table 11: This figure is entirely redundant as it visualizes only a single column of data ('#*') that is already presented clearly in Table 11. The table is scientifically superior because it provides crucial context, such as the specific task and the qualitative performance findings for each configuration. Presenting the same, limited data in a separate figure adds no new insight and is an inefficient use of space.
  • 💡 Conclusions are based on a very small sample size: The entire analysis is based on only 16 'top performing reports' drawn from a pool of 77 studies. Drawing strong conclusions about which configurations are definitively 'best' from such a small and select sample is statistically weak. The findings should be presented with strong caveats about the limited evidence base.
  • 💡 The criteria for 'top performance' are subjective and undefined: The methodology for identifying a 'top performing report' or 'state of the art results' is not defined. This introduces a significant risk of selection bias and makes the analysis subjective and difficult to reproduce. Objective criteria for this classification must be provided.
  • ✅ Data is consistent with the source table: The numerical values displayed on the bars (5, 4, 3, 2, 2) perfectly match the data presented in the '#*' column of Table 11. This internal consistency is a positive aspect, even if the figure itself is redundant.
Communication
  • 💡 The figure is redundant and should be removed: The primary communication issue is the figure's complete redundancy with Table 11. It adds no new information and clutters the manuscript. The most effective way to improve communication would be to remove this figure entirely and refer only to the more comprehensive Table 11 in the text.
  • 💡 Missing Y-axis label: The vertical axis lacks a label, forcing the reader to infer its meaning. To adhere to standard data visualization practices, the y-axis should be explicitly labeled, for example, 'Number of Top Performance Reports'.
  • ✅ Effective use of sorting and data labels: The chart correctly follows best practices by sorting the bars in descending order, which immediately highlights the most frequent configuration. Additionally, the inclusion of clear numerical labels on top of each bar makes the exact values easy to read.
  • ✅ Appropriate choice of chart type: A bar chart is a suitable and clear way to compare the frequencies of a small number of discrete categories, making the relative prevalence of each configuration easy to grasp visually.
Table 12. Distribution of Challenges in Enterprise RAG + LLM Studies.
Figure/Table Image (Page 17)
Table 12. Distribution of Challenges in Enterprise RAG + LLM Studies.
First Reference in Text
Despite the rapid advances in RAG + LLM for enterprise knowledge management and document automation, the synthesis of 77 high quality studies reveals five recurring challenges and several open research directions (Table 12) [17,32-34,38-42,44,45,58,106,108].
Description
  • Categorization of Key Research Challenges: This table identifies and quantifies the five most frequently mentioned challenges in the field of enterprise AI systems, as reported in the 77 reviewed research papers. The challenges range from technical issues like performance and accuracy to practical concerns like security and business value.
  • Prevalence of Factual Consistency as the Top Challenge: The most cited problem is 'Hallucination Factual Consistency', which was mentioned in 37 studies (48.1%). This refers to the critical issue of AI models generating information that is factually incorrect or not grounded in the source data. This high frequency underscores it as the primary concern for researchers in the field.
  • Distribution of Other Major Challenges: Other significant challenges include 'Data Privacy & Security', a concern in 29 studies (37.7%), and 'Latency & Scalability' (the speed and performance of the AI), mentioned in 24 studies (31.2%). The difficulty of adapting AI to new specialized areas ('Domain Adaptation Transfer Learning') was noted in 18 studies (23.4%). The least frequently mentioned challenge was the 'Difficulty in Measuring Business Impact', cited in just 12 studies (15.6%), highlighting a gap in research focused on quantifiable business outcomes.
Scientific Validity
  • ✅ Provides a data-driven summary of research gaps: The table effectively answers RQ9 by quantitatively summarizing the key challenges identified in the literature. This is a valuable contribution for a systematic review, as it moves beyond a simple narrative to provide evidence-based insights into the field's most pressing problems.
  • 💡 Lacks methodological transparency for challenge extraction: The methodology for identifying and coding a 'challenge' from a paper is not defined. It is unclear if this was based on keyword searches, a qualitative analysis of the papers' discussion sections, or explicit statements of limitations by the original authors. This lack of a defined protocol makes the results difficult to reproduce and their objectivity hard to assess.
  • 💡 Ambiguity of non-exclusive categories is not stated: The total number of studies listed (120) is significantly higher than the 77 papers in the review, indicating that the categories are not mutually exclusive (i.e., one paper can mention multiple challenges). This is a crucial methodological detail that should be explicitly stated in a note on the table to prevent readers from misinterpreting the data.
Communication
  • ✅ Excellent clarity and self-containment: The table is very well-designed for communication. The 'Description' column is a major strength, providing a concise and clear explanation of each technical challenge. This makes the table highly self-contained and easily understandable for a broad audience.
  • 💡 Lack of sorting reduces immediate impact: The rows are not sorted by frequency, which forces the reader to scan the entire table to identify the most and least significant challenges. To improve readability and immediately highlight the key findings, please sort the rows in descending order based on the '# Studies' or '%' column.
  • ✅ Appropriate and effective table format: A table is the ideal format for presenting this type of categorical data. It allows for a clear, direct comparison of the challenges and presents both absolute counts and percentages in a precise and unambiguous manner.
Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9)....
Full Caption

Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9). Color intensity reflects how often two RQs are contextually addressed together.

Figure/Table Image (Page 18)
Figure 13. Heatmap of overlap and gaps between research questions (RQ1–RQ9). Color intensity reflects how often two RQs are contextually addressed together.
First Reference in Text
Knowledge overlaps and gaps between RQs are illustrated in the heatmap in Figure 13.
Description
  • Visualization of Inter-Topic Relationships: This figure is a heatmap, a grid where colors represent values, used here to show the strength of the relationship between the nine different research questions (RQs) that guided the study. Each row represents one RQ (from RQ1 to RQ9), and each column represents a topic that corresponds to one of the RQs (e.g., the 'Architectures' column corresponds to RQ4). The cells in the grid are colored and contain a number from 0.1 to 1.0, where warmer colors (like yellow) and higher numbers indicate a stronger connection, meaning the two topics were frequently discussed together in the reviewed papers. Cooler colors (like green and blue) and lower numbers indicate a weaker connection.
  • Key Finding of Strong Overlap: The heatmap is used to highlight areas of strong overlap in the research. For instance, the text points out a strong relationship between the topic of 'Architectures' (RQ4) and 'Best Configurations' (RQ8). The corresponding cells in the heatmap show high values (0.9 and 0.7), indicating that papers discussing AI system architecture are very likely to also discuss which configurations perform best. This type of analysis helps identify the most tightly connected concepts in the research field.
Scientific Validity
  • 💡 Critical lack of methodological transparency: The most significant flaw is the complete absence of methodology for calculating the overlap values. The caption implies a co-occurrence frequency ('how often...addressed together'), while the text later mentions a 'Pearson r' value, which is a statistical correlation. These are different metrics. Without a clear, reproducible definition of how the 0.1-1.0 values were derived from the 77 papers, the entire figure is scientifically unverifiable.
  • 💡 Asymmetric matrix suggests flawed calculation: A matrix showing the overlap or correlation between two variables (e.g., RQ4 and RQ8) should be symmetric; the relationship between A and B is the same as between B and A. This matrix is not symmetric (e.g., the cell for [row RQ8, column Architectures/RQ4] is 0.7, while the cell for [row RQ4, column Best_Configs/RQ8] is 0.9). This asymmetry indicates a fundamental error in the calculation or the conceptual model, severely undermining the validity of the data; a sketch after this list illustrates why any genuinely computed co-occurrence or correlation matrix must be symmetric.
  • 💡 Inconsistent data between figure and text: The text states that the relationship between RQ4 and RQ8 has a 'Pearson r = 0.77'. However, the heatmap displays two different values for this relationship (0.7 and 0.9). This direct contradiction between the text and the figure is a serious error that confuses the reader and questions the reliability of the data.
  • ✅ Novel analytical approach: Despite its flawed execution, the conceptual approach of creating a relationship matrix to identify knowledge overlaps and gaps between research questions is a novel and valuable form of meta-analysis for a literature review. It attempts to provide a deeper structural insight into the field beyond simple frequency counts.
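To make the symmetry argument concrete, the sketch below builds a hypothetical papers-by-RQs indicator matrix (the paper provides no such data or methodology, so the values are illustrative only) and shows that both a raw co-occurrence count and a Jaccard-style normalized overlap derived from it are symmetric by construction.

```python
import numpy as np

# Hypothetical reconstruction: a binary papers-by-RQs indicator matrix where
# entry [p, q] is 1 if paper p addresses RQ(q+1). Values are illustrative only.
rng = np.random.default_rng(1)
addresses = (rng.random((77, 9)) < 0.4).astype(int)

# Raw co-occurrence: number of papers addressing RQi and RQj together.
co = addresses.T @ addresses

# Jaccard-style normalized overlap is likewise symmetric by construction.
totals = addresses.sum(axis=0)
union = totals[:, None] + totals[None, :] - co
overlap = co / np.maximum(union, 1)

assert np.allclose(co, co.T) and np.allclose(overlap, overlap.T)
```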
Communication
  • 💡 Severely confusing axis labeling: The figure's axes are extremely difficult to interpret. The Y-axis is labeled with RQ numbers (RQ1-RQ9), but the X-axis is labeled with topic names ('Platforms', 'Datasets', etc.). The reader must refer to Table 2 elsewhere in the paper to deduce that the X-axis labels are just proxies for the RQ numbers. This violates the principle of a self-contained figure and creates unnecessary work and confusion for the reader. Both axes should be clearly and consistently labeled (e.g., as RQ1-RQ9), with topic names provided in the caption if necessary.
  • 💡 Unlabeled color bar: The color bar on the right is not labeled. It is unclear what the 0.1-1.0 scale represents. To improve clarity, the color bar needs a descriptive title, such as 'Overlap Score' or 'Correlation Coefficient', consistent with the (currently missing) methodology.
  • 💡 Poor readability of X-axis labels: The long, rotated labels on the X-axis are difficult to read and contribute to visual clutter. Using consistent RQ1-RQ9 labels for both axes, as suggested above, would resolve this issue and make the entire heatmap cleaner and more readable.
  • ✅ Appropriate choice of visualization type: A heatmap is the correct and most effective type of visualization for displaying a correlation or overlap matrix. It allows for the rapid, intuitive identification of strong (hot colors) and weak (cool colors) relationships within a complex dataset.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 14. Selected publications per year (2015-2025).
Figure/Table Image (Page 19)
Figure 14. Selected publications per year (2015-2025).
First Reference in Text
Finally, the publication trend in Figure 14 could be annotated with event markers (e.g., major model releases) to contextualize inflection points [1-3].
Description
  • Trend of Publications Over Time: This bar chart illustrates the number of selected research papers published each year between 2015 and 2025. It shows a clear trend of exponential growth in the field. From 2015 to 2019, there were very few publications (between 1 and 5 per year). A significant acceleration began in 2022 with 10 publications, followed by a surge to 20 publications in 2023 and a peak of 23 in 2024. The count for 2025 shows a sharp drop to 4 publications, which is likely an artifact of the data collection for the review ending partway through that year.
Scientific Validity
  • 💡 Critical data inconsistency with a previous figure: There is a severe scientific validity issue as the data in this figure directly contradicts the data presented in Figure 4, which also shows publications per year. For example, this figure shows 23 publications for 2024, whereas Figure 4 shows a total of 27 for the same year. Likewise, the counts for 2020, 2021, 2022, and 2023 are all different between the two figures. This is a major error that undermines the credibility of the data analysis and must be corrected.
  • 💡 Redundant and less informative visualization: This figure is redundant. Figure 4 already presents the same trend of total publications over time but provides the additional, valuable detail of breaking down the totals into journal and conference papers. This figure, therefore, adds no new information and should be removed in favor of the more comprehensive Figure 4.
  • 💡 The 2025 data point is misleading without context: The sharp drop in publications in 2025 could be misinterpreted by a reader as a sudden decline in research interest. This is almost certainly an artifact of the study's data collection cutoff date. This crucial context is not provided in the caption or text, making the final data point misleading. This limitation must be explicitly stated.
Communication
  • 💡 Fundamentally flawed Y-axis scale: The y-axis of the chart is incorrectly scaled. The axis labels only go up to 20, but the bar for the year 2024 clearly represents a value of 23, extending well beyond the top of the axis. This is a basic and critical error in data visualization that makes the chart inaccurate and unprofessional. The axis range must be corrected to encompass the maximum data value.
  • 💡 Redundant figure clutters the manuscript: The primary communication issue is the figure's redundancy with Figure 4. Including both is inefficient and clutters the discussion section. To improve the manuscript's clarity and conciseness, this figure should be removed, and all discussion of publication trends should refer to the more detailed Figure 4.
  • 💡 Missing data labels: The exact number of publications for each year is not labeled on the bars, forcing the reader to estimate the values from the flawed y-axis. For precision and clarity, numerical data labels should be added to the top of each bar; a minimal matplotlib sketch after this entry shows both fixes (the corrected axis range and the bar labels).
  • ✅ Appropriate choice of chart type: A bar chart is a suitable and effective choice for visualizing the number of publications over discrete time intervals (years), as it makes the growth trend easy to see.
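To make the axis-range and data-label suggestions concrete, here is a minimal matplotlib sketch; only the counts quoted in the description above (2022 through 2025) are taken from the figure, and the remaining values are placeholders that should be replaced with the review's actual data.

```python
import matplotlib.pyplot as plt

years = list(range(2015, 2026))
# Placeholder counts: only 2022 (10), 2023 (20), 2024 (23), and 2025 (4) are quoted above;
# substitute the actual per-year counts from the review data.
counts = [1, 1, 2, 3, 5, 6, 8, 10, 20, 23, 4]

fig, ax = plt.subplots()
bars = ax.bar(years, counts)
ax.bar_label(bars)                              # exact value printed on each bar
ax.set_ylim(0, max(counts) * 1.15)              # axis always encloses the tallest bar
ax.set_xticks(years)
ax.set_xlabel("Publication year")
ax.set_ylabel("Number of selected studies")
plt.tight_layout()
plt.show()
```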

Conclusions and Future Work

Key Aspects

Strengths

Suggestions for Improvement