This paper introduces Data Formulator 2 (Df2), an AI-powered visualization system designed to address the challenges of iterative authoring in exploratory data analysis. Existing AI tools often require users to provide complete, text-only descriptions of visualizations upfront, which is impractical when analytical goals evolve during exploration. Df2 tackles this limitation by blending a graphical user interface (GUI) with natural language (NL) input, allowing users to specify chart designs precisely while delegating data transformation tasks to the AI. The system also introduces "data threads," a mechanism for tracking the history of data transformations and visualizations, enabling users to easily revisit, revise, and branch from previous steps.
The core of Df2's methodology involves decoupling chart specification from data transformation. Users define their visualization intent through a combination of GUI interactions (e.g., drag-and-drop field mapping) and concise NL instructions. The system then generates a Vega-Lite specification (a high-level grammar for interactive graphics) and prompts a large language model (LLM) to produce Python code for the necessary data transformations. Df2 executes this code, handles potential errors, and instantiates the Vega-Lite specification with the transformed data to generate the visualization. Data threads provide a visual representation of the user's interaction history, facilitating navigation and reuse of previous results.
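To make the described decoupling concrete, the pipeline can be sketched roughly as below: a Vega-Lite skeleton is built from GUI encodings, the model supplies Python transformation code, and the executed result fills the spec's data slot. This is an illustrative reconstruction under stated assumptions, not Df2's actual implementation; `fake_llm_transform` stands in for the real model call, and all function names are hypothetical.

```python
import pandas as pd

def build_vegalite_skeleton(chart_type, encodings):
    """Build a Vega-Lite spec skeleton from GUI-specified encodings;
    the data slot stays empty because transformation is delegated to the AI."""
    return {
        "mark": chart_type,
        "encoding": {channel: {"field": field}
                     for channel, field in encodings.items()},
        "data": {"values": []},
    }

def fake_llm_transform(instruction, df):
    """Stand-in for the LLM call that would return transformation code;
    here one pandas transformation is hard-coded for illustration."""
    return "result = df.groupby('country', as_index=False)['renewables'].mean()"

def run_pipeline(df, chart_type, encodings, instruction):
    spec = build_vegalite_skeleton(chart_type, encodings)
    code = fake_llm_transform(instruction, df)
    scope = {"df": df}
    exec(code, scope)                        # execute the AI-generated code
    spec["data"]["values"] = scope["result"].to_dict("records")
    return spec
```

In Df2 itself the transformation code comes from the model, and a failure at the `exec` step would trigger the error-correction mechanism the paper describes rather than surfacing directly to the user.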
A user study with eight participants of varying expertise in data analysis and programming demonstrated Df2's effectiveness in supporting iterative visualization authoring. Participants successfully completed a series of tasks involving the creation of 16 visualizations requiring diverse data transformations. The study revealed distinct iteration styles: some users preferred broader exploration with multiple branches (wide trees), while others favored deeper, more linear progressions (deep trees). Participants also employed various prompting techniques, ranging from imperative commands to questions and chat-style interactions. The study highlighted the importance of Df2's transparency features, such as code explanations and data provenance tracking, in building user trust and facilitating verification of AI-generated outputs.
The discussion explores future directions for Df2, including integration with visualization recommendation systems and the development of agent-based systems for coordinating data transformation and chart editing. The authors acknowledge the limitations of the user study, particularly its focus on reproduction tasks and the lab setting, and propose future research involving open-ended exploration and longitudinal studies to investigate long-term user behavior and learning effects.
Data Formulator 2 (Df2) presents a compelling approach to iterative visualization authoring, effectively addressing the limitations of existing AI-powered tools. The system's innovative blend of GUI and NL input, coupled with its sophisticated data threading mechanism, empowers users to navigate complex data transformations and explore diverse visualization strategies with remarkable efficiency. The user study, while limited in sample size, provides strong qualitative evidence for Df2's usability and its potential to transform data analysis workflows. The system's transparency features, including code explanations and data provenance tracking, foster user trust and facilitate verification of AI-generated outputs.
However, the study's reliance on reproduction tasks and the lab setting constrains the generalizability of findings to real-world, open-ended exploration scenarios. Future research addressing these limitations, along with the proposed enhancements for recommendation systems and agent-based chart editing, will be crucial for realizing Df2's full potential. The core strength of this work lies in its robust, user-centered design and its potential to democratize access to sophisticated data visualization techniques by lowering the barrier to entry for users with varying levels of programming expertise. The integration of AI capabilities within an intuitive interface offers a promising pathway for more efficient, insightful, and accessible data exploration.
The abstract effectively establishes the existing gap in AI-powered visualization tools, specifically their inadequacy for iterative authoring, which is a common practice in exploratory data analysis.
The abstract clearly introduces Data Formulator 2 (Df2) as the proposed solution and immediately states its primary design goal: to overcome the limitations of existing systems in iterative authoring.
The abstract successfully communicates the core mechanisms of Df2 that address the identified problem, such as the blend of GUI and NL inputs, AI-driven data transformation, and support for navigating iteration history.
The inclusion of a user study with a specific number of participants lends credibility to the system's claims and indicates that the findings are backed by empirical evidence.
The abstract mentions that Df2 helped participants complete "challenging data exploration sessions." While conciseness is key in an abstract, a brief descriptor of these challenges (e.g., involving complex data transformations or multi-step analyses) would clarify the context of the user study's findings, a crucial part of summarizing the paper's contribution, without significantly increasing length. This is a low-impact change.
Implementation: Consider revising the last sentence to incorporate a brief qualifier for the challenges. For example: "A user study with eight participants demonstrated that Df2 allowed participants to develop their own iteration styles to complete challenging data exploration sessions, such as those involving evolving analytical goals and multi-step data transformations."
The Introduction clearly articulates the core problem: the mismatch between the iterative nature of data exploration and the capabilities of existing AI-powered visualization tools. It effectively sets the stage by detailing why current solutions fall short.
The paper doesn't just state a general problem but pinpoints specific deficiencies in current tools, namely the issues with text-only prompts (lack of precision, difficulty in describing complex designs) and the lack of support for iterative behaviors like branching and backtracking.
The Introduction provides a strong logical bridge from the identified problems to the proposed key insights of Df2. The multi-modal chart builder and data threads are presented as direct responses to the limitations discussed.
The section concludes with a clear, bulleted list of the paper's main contributions, which helps the reader understand the scope and impact of the work upfront.
The Introduction mentions that the user study 'discovered data analysts’ different iteration styles.' While the full details are rightly reserved for later sections, providing a very brief, high-level characterization of these styles (e.g., 'ranging from cautious, linear refinements to more exploratory, branched investigations') within the introduction could further pique reader interest and make this specific contribution more tangible from the outset. This is a medium-impact suggestion that could enhance the foreshadowing of key findings, fitting well within the summary of contributions.
Implementation: Consider expanding the sentence slightly, for example: 'We conducted a user study that discovered data analysts’ different iteration styles (e.g., varying in their approach to branching and refinement) and rich experiences using our new interaction approaches...'
Figure 1: With Data Formulator 2, analysts can iterate on a previous design by (1) selecting a chart from data threads and (2) providing combined natural language and graphical user interface inputs in the chart builder to specify the new design. The AI model generates code to transform the data and update the chart. Data threads are updated with new charts for future use.
Figure 2: An analyst explores electricity from different energy sources, renewable percentage trends, and country rankings by renewable percentages using a dataset on CO2 and electricity for 20 countries (2000-2020, table 1). The analyst creates five data versions in three branches to support different chart designs. DF2 allows users to manage iteration directions and create rich visualizations using a blended UI and natural language inputs.
Figure 3: DF2 overview. Users create visualizations by providing fields (drag-and-drop or type) and NL instructions to the Chart Builder, delegating data transformation to AI. Data View shows derived data. Users navigate data history and select contexts for the next iteration using data threads (the thread in use is displayed as local data threads). They refine or create new charts by providing instructions in the Chart Builder. The main panel provides pop-up windows to inspect code, explanations, and chat history.
Figure 4: Experiences with DF2: (1) creating the basic renewable energy chart using drag-and-drop to encode fields; (2 and 3) creating charts requiring new fields by providing field names and optional natural language instructions to derive new data.
Figure 5: Iteration with DF2: (1) provide an instruction to filter the renewable energy percentage chart by top CO2 countries, (2) update the chart with Global Median? and instruct DF2 to add the global median alongside the top 5 CO2 countries' trends, and (3) move Global Median? from column to opacity to update the chart design without deriving new data.
The section clearly lays out the foundational design choices of Df2—decoupling chart specification from data transformation and using data threads for iteration—providing a strong conceptual framework for the subsequent detailed descriptions. This upfront clarity helps the reader understand the core architectural decisions.
The description of how users compose charts using a blend of GUI (shelf-configuration) and NL inputs is thorough and effectively justifies the benefits of this approach, such as saving users effort in writing verbose prompts for complex designs.
The method details a sophisticated, multi-segment prompting strategy for the LLM, including a "goal refinement" step and the inclusion of dialog history. The automated error correction mechanism, where Df2 queries the LLM with error messages, demonstrates a robust approach to AI integration.
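The automated error-correction mechanism described here can be sketched as a simple retry loop: execute the generated code, and on failure feed the error message back to the model. A minimal sketch with hypothetical function names; Df2's actual prompt segments and retry policy may differ.

```python
def generate_with_repair(prompt, ask_llm, execute, max_retries=3):
    """Ask the model for code; if execution fails, send the error
    message back and retry, mirroring the described repair loop."""
    code = ask_llm(prompt)
    for _ in range(max_retries):
        try:
            return execute(code)
        except Exception as err:
            code = ask_llm(
                f"{prompt}\n\nThe previous code failed with:\n{err}\nPlease fix it."
            )
    raise RuntimeError("could not produce runnable transformation code")
```

The loop caps the number of model round-trips, which matters because each repair attempt adds latency and cost to the interaction.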
The distinction between global and local data threads, along with their specific roles in navigation, context awareness, and facilitating quick revisions, is well-explicated. This highlights a nuanced understanding of user needs during iterative analysis.
The system provides multiple avenues for users to inspect AI-generated results (data, code, explanations, chat history) and allows direct manipulation of chart styles without AI intervention. This empowers users and builds trust.
The "goal refinement" step, where the LLM elaborates the user's intent into a JSON object before code generation, is an important design feature for improving transformation accuracy. While its rationale is clearly stated, the Method section could enhance reader understanding by briefly clarifying how, or if, this refined goal is exposed to the user. Knowing whether users can inspect or influence this AI-interpreted goal before final code generation is pertinent to understanding the system's transparency and the user's agency in the AI-assisted workflow. This is a medium-impact suggestion as it relates to the interpretability of the AI's intermediate reasoning and user oversight.
Implementation: Add a sentence after describing the "goal refinement" step (page 6) to specify if the refined JSON is visible to the user (e.g., in logs, or as a pre-confirmation step) or if it's a purely internal process. For instance: "This refined JSON goal is logged as part of the interaction history, accessible via the 'view chat history' pop-up, allowing users to retrospectively understand the LLM's interpretation, though it is not presented for pre-confirmation in the current design."
The paper comprehensively describes error handling for the AI-generated Python data transformation code. However, after successful code execution, the process involves instantiating the Vega-Lite script with the new data, including inferring semantic types. It would strengthen the Method section to briefly address how Df2 handles potential errors that might arise specifically during this Vega-Lite instantiation or subsequent rendering phase (e.g., type mismatches not caught by Python, Vega-Lite spec errors, or rendering engine issues with the transformed data). This is a low-to-medium impact suggestion that would provide a more complete picture of the system's robustness.
Implementation: Following the description of Vega-Lite script instantiation (page 7), add a sentence clarifying the handling of errors at this stage. For example: "If errors arise during the Vega-Lite instantiation or rendering (e.g., due to incompatible data types with the chart template or malformed Vega-Lite specifications), Df2 currently surfaces these errors to the user, prompting a revision of either the chart design or the transformation logic. Future work could explore AI-assisted diagnostics for such visualization-specific errors."
Figure 6: DF2's workflow: (1) DF2 generates a Vega-Lite spec skeleton based on user specifications and chart type. (2) If new fields (e.g., Rank) are required, DF2 prompts its AI model to generate data transformation code. (3) The Vega-Lite skeleton is then instantiated with the new data to produce the desired chart.
Figure 7: DF2 converts user encodings into a Vega-Lite specification, which is combined with AI-transformed data to visualize country ranks in 2000 and 2020.
Figure 8: Data threads and local data threads (right). Users can select previous data or charts to create new branches, and the AI reuses code for new transformations based on user instructions. The local data thread offers shortcuts to (1) rerun the previous instruction, (2) issue a follow-up instruction, or (3) expand the previous card to revise and rerun the instruction.
Figure 9: DF2 provides explanations of the AI-generated code to help users understand the data transformation. This example explains the code behind table-56 in Figure 8.
The Results section clearly presents quantitative data on task completion rates and times, providing a solid baseline for Df2's performance in the hands of users. This is complemented by a well-categorized breakdown of hints requested, offering insights into areas where users faced challenges.
The integration of direct participant quotes throughout the section is highly effective. These quotes vividly illustrate user experiences, particularly when comparing Df2 to other tools and explaining their interaction strategies, adding depth and authenticity to the findings.
The paper provides a nuanced and detailed characterization of the diverse iteration styles (wide vs. deep trees; backtracking/revise vs. follow-up) and prompting techniques users developed. This qualitative analysis, supported by specific examples and user rationale, reveals valuable insights into how users adapt to and utilize novel AI-powered systems.
The section thoroughly explores participants' verification strategies, highlighting how different backgrounds influenced their methods for assessing AI-generated outputs (e.g., relying on code explanations, the code itself, or data tables). This sheds light on trust formation and the importance of transparency features.
The inclusion of 'Additional Feedback' detailing specific user suggestions for Df2 improvements (e.g., interface affordances, AI disambiguation) demonstrates a commitment to user-centered design and provides a clear pathway for future system refinements.
Medium impact. While the qualitative descriptions of iteration styles are rich and well-supported by quotes, augmenting this with quantitative data would provide more objective evidence for the observed behavioral clusters (e.g., 'wide vs. deep' and 'backtrack vs. follow-up'). This analysis directly pertains to user study data and belongs in the Results section. It would enhance the rigor of these findings, allowing for more direct comparisons between styles and potentially revealing correlations with user backgrounds or task outcomes. Such data could further substantiate the claims about distinct user approaches.
Implementation: Analyze the recorded user study data, particularly the interaction logs and workflow structures exemplified in Figure 12. Extract metrics such as: average number of branches created per participant, average depth of data threads, frequency of 'revise' actions (self-loops) versus 'follow-up' actions (new nodes) for each participant or user group. Present these metrics concisely, perhaps in a small table or integrated into the textual discussion of these iteration styles, to complement the qualitative observations.
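As one possible realization of this analysis, the suggested metrics could be computed from logged workflow trees roughly as follows. The edge-list input format is an assumption made for illustration, meant to mirror the trees in Figure 12 rather than Df2's actual log schema.

```python
from collections import defaultdict

def thread_metrics(edges, revisions):
    """Summarize an iteration tree: edges are (parent, child) pairs between
    data-table versions; revisions maps a node to its count of in-place
    prompt revisions (the self-loops in Figure 12)."""
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))
    roots = nodes - {child for _, child in edges}

    def depth(node):
        kids = children[node]
        return 1 if not kids else 1 + max(depth(k) for k in kids)

    return {
        "branches": sum(max(0, len(k) - 1) for k in children.values()),
        "max_depth": max(depth(r) for r in roots),
        "revise": sum(revisions.values()),       # self-loop actions
        "follow_up": len(edges),                 # new-node actions
    }
```

Per-participant values of these four numbers would directly quantify the wide-vs-deep and revise-vs-follow-up distinctions the section describes qualitatively.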
Medium-to-high impact. The Results section effectively details the diverse prompting styles adopted by participants. An analysis exploring potential correlations between these styles (e.g., imperative commands, questions, verbosity, direct data manipulation) and task-related outcomes—such as efficiency (time to completion, number of interactions), error rates (frequency of AI misinterpretations), or the need for hints—would offer significant insights. This analysis of user-generated data is appropriate for the Results section and could inform the development of prompting guidelines or adaptive AI feedback mechanisms within Df2, thereby enhancing usability and user success.
Implementation: Systematically categorize the prompts used by participants based on the styles already identified (e.g., imperative, question-based, verbose, concise, column-focused). For each participant or prompt style category, analyze corresponding task segments for metrics like time taken per sub-task, number of AI interactions required to achieve a correct visualization, and instances where hints were requested or significant corrections were needed. Report any observed correlations or notable patterns, or the lack thereof, to provide a richer understanding of prompt effectiveness.
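A rough sketch of how this categorization and outcome analysis might proceed, using a crude heuristic classifier as a placeholder for the manual coding of prompts that such a study would actually require. The log format and the thresholds are assumptions for illustration only.

```python
def classify_prompt(text):
    """Heuristic stand-in for manual coding of the prompt styles the study
    identifies (question-based, verbose, imperative)."""
    if text.rstrip().endswith("?"):
        return "question"
    if len(text.split()) > 15:
        return "verbose"
    return "imperative"

def outcomes_by_style(log):
    """Average interactions-to-correct-chart per prompt style, given a log
    of (prompt_text, n_interactions) pairs (hypothetical format)."""
    buckets = {}
    for prompt, n_interactions in log:
        buckets.setdefault(classify_prompt(prompt), []).append(n_interactions)
    return {style: sum(ns) / len(ns) for style, ns in buckets.items()}
```

Comparing these per-style averages (and similar aggregates for hint requests or correction counts) would surface the correlations, or their absence, that the suggestion calls for.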
Figure 10: Participants' self-reported roles, expertise in chart creation, data transformation, programming, and AI assistants (1=novice, 4=expert), task completion time, and hints needed during study tasks.
Figure 11: The dataset and tasks in our user study. (1) Dataset 1: Understanding top earning majors and the relation between salary and women percentage. (2) Dataset 2: Exploring movie genres with best return-on-investment values (profit vs. profit ratio) and top movies. The branching directions are added for illustration; participants developed their own iteration strategies. We refer to these target charts as C1-7 for the college dataset and M1-9 for the movies dataset.
Figure 12: Participants' workflows for the study tasks in Figure 11 (C1-7 for college, M1-9 for movies). Each node represents a data table version, with blue for initial datasets, yellow for data tables instantiating (one or multiple) target visualizations in Figure 11 (the number i in a node indicates the i-th target visualization for the given dataset), and gray for others. Self-loop arrows indicate prompt revisions and data table updates ('×2' indicates two revisions).
The discussion effectively articulates a forward-looking vision for Df2 by proposing integration with recommendation systems, leveraging Df2's unique strengths like dynamic data transformation and data threads to overcome limitations of existing recommenders.
The paper demonstrates a pragmatic understanding of current AI capabilities by proposing a balanced approach to chart editing—maintaining precise GUI control for stylistic refinements while exploring future AI-driven agent systems for more complex, unified interactions.
The discussion on proactive AI clarification highlights a user-centric approach to future development, aiming to reduce user effort in verification and build trust by making the AI a more intelligent conversational partner.
The authors exhibit strong methodological awareness by openly discussing the user study's limitations, such as the nature of the tasks and the lab environment, and by proposing specific future studies (open exploration, longitudinal) to address these gaps.
Medium-to-high impact. The discussion correctly identifies the increased risk of bias and irrelevant suggestions when Df2's data transformation capabilities expand the recommendation space. While acknowledging this is good, elaborating on potential research avenues or specific strategies to mitigate these issues would significantly strengthen this future work direction. This is pertinent to the Discussion section as it addresses a critical challenge for the proposed enhancements to recommendation capabilities, directly impacting the responsible development and deployment of such AI features.
Implementation: After the sentence 'Therefore, as part of future work, it would be valuable to explore ways to support visual recommendation in a larger exploration space, especially managing and communicating exploration paths to the user to prevent unintentional bias towards an undesired direction,' consider adding: 'This could involve research into developing algorithmic safeguards for recommendation diversity, incorporating interactive user feedback mechanisms to refine suggestion quality and flag potential biases, or designing transparent interfaces that clearly articulate the provenance and rationale behind AI-generated recommendations and alternative exploration paths.'
Medium impact. The proposal for an agent-based system to coordinate data transformation and chart editing is innovative. However, the discussion could benefit from briefly considering how user control and oversight would be maintained within such a system, particularly given that AI agents might require multiple interactions. Addressing user agency in complex AI interactions is crucial for system usability and trust, making it a relevant point for the Discussion section's future work considerations.
Implementation: Following the sentence 'The key challenge is managing response time and maintaining reliability, as AI agents often require multiple interactions to reach consensus,' add a sentence such as: 'Furthermore, future investigations should explore mechanisms for effective user intervention and preference articulation within these agent-based dialogues, ensuring users can guide, correct, or override agent decisions, particularly when consensus is slow to achieve or diverges from the user’s evolving intent.'