Large Language Models (LLMs) are increasingly used to support data wrangling, but their integration into interactive transformation workflows raises new challenges for auditability, reproducibility, and accountability. When users approve, reject, or refine LLM-generated suggestions, conventional data lineage systems often fail to capture why a change occurred, who was responsible for it, and which transformation produced the final dataset. This thesis investigates a compact traceability framework for human–LLM-assisted transformations of uploaded tabular files. The target setting is a single structured tabular data file (e.g., CSV), column-level transformation workflows, and practical reproducibility. The framework tracks file versions, table versions, selected columns, LLM suggestions, human decisions, approved transformation specifications, generated code references, execution events, and resulting output versions, with the goal of enabling reconstruction and rollback without storing the full conversation. This thesis is co-supervised by Prof. Jiannan Wang (jnwang@tsinghua.edu.cn) from the Department of Computer Science and Technology at Tsinghua University, who serves as the second supervisor.
Thesis Type |
Student | Mohammad Abdel Aziz
Status | Running
Presentation room | Seminar room I5 - 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yixin Peng
Contact | peng@dbis.rwth-aachen.de
Background
Data wrangling is a central step in data analysis pipelines, especially when users need to clean, normalize, enrich, or reshape tabular data before downstream analysis. Interactive systems such as data cleaning and data transformation tools have shown that users benefit from immediate feedback, editable transformation scripts, previews, and undo functionality. More recently, LLM-based systems have extended this workflow by generating transformation suggestions, natural-language explanations, and executable code.
However, LLM-assisted wrangling changes the provenance problem. A final table version is no longer produced only by a deterministic script or a fixed workflow. Instead, it may result from an interaction among uploaded data, selected columns, user intent, prompt context, LLM-generated suggestions, human approval decisions, generated code, execution results, and later refinements. Existing provenance and workflow systems provide useful concepts for entities, activities, agents, derivations, and execution traces, but they do not directly specify how to represent the decision process between a human user and an LLM in a compact, auditable, and practically reproducible way.
This thesis therefore focuses on the design and prototyping of a traceability model for human–LLM-assisted CSV transformations. The central challenge is to preserve enough information to explain and reproduce a transformation, while avoiding the storage of the full conversation history. The work addresses three guiding questions: how to represent human–LLM-assisted tabular transformations as a compact provenance graph; what information is necessary to reconstruct why a data modification occurred and enable rollback; and how practical reproducibility can be supported when the suggestion phase involves non-deterministic LLM interaction.
Tasks
a) Analyze the workflow and derive a compact provenance model
- Analyze the target workflow for human–LLM-assisted CSV transformation, including file upload, column selection, suggestion generation, user decision, transformation approval, code generation, execution, output versioning, and rollback.
- Identify the minimum traceability information required to reconstruct why a data modification occurred without storing the full conversation.
- Design a compact provenance graph with core node and edge types, including file versions, table versions, selected columns, LLM suggestions, human decisions, approved transformation specifications, code references, execution events, and output versions.
- Map the proposed graph model to established provenance concepts such as entities, activities, agents, derivations, usage, generation, and attribution.
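The node and edge types listed above could be sketched as simple typed records mapped to PROV concepts. The following is a minimal, illustrative sketch; all class and field names are assumptions made for this example, not part of the thesis specification:

```python
from dataclasses import dataclass, field

# Illustrative node types for the compact provenance graph.
# Assumed PROV mapping: TableVersion / LLMSuggestion / HumanDecision
# are prov:Entity, ExecutionEvent is prov:Activity, and the user and
# the LLM are prov:Agent instances (not modeled here).

@dataclass
class TableVersion:          # prov:Entity
    version_id: str
    content_hash: str        # hash of the table contents, not the data itself

@dataclass
class LLMSuggestion:         # prov:Entity, attributed to the LLM agent
    suggestion_id: str
    model_config: dict       # model name, temperature, etc.
    summary: str             # compact summary instead of the full conversation

@dataclass
class HumanDecision:         # prov:Entity, attributed to the human agent
    decision_id: str
    verdict: str             # "approved" | "rejected" | "refined"
    suggestion_id: str       # wasDerivedFrom the suggestion it responds to

@dataclass
class ExecutionEvent:        # prov:Activity
    event_id: str
    code_ref: str            # reference to the generated code, e.g. a file hash
    used: list = field(default_factory=list)       # prov:used (input versions)
    generated: list = field(default_factory=list)  # prov:wasGeneratedBy (outputs)
```

Derivation edges between table versions then follow from each execution event's `used` and `generated` lists, which keeps the graph compact while remaining PROV-compatible.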
b) Compare provenance requirements for deterministic and LLM-assisted transformations
- Review existing concepts from data provenance, workflow provenance, interactive data cleaning, and script-level provenance tracking.
- Compare deterministic transformation pipelines with human–LLM-assisted workflows, focusing on intent capture, responsibility assignment, suggestion provenance, decision provenance, code provenance, and output provenance.
- Define what should be treated as reproducible artifacts, including transformation specifications, prompt-relevant metadata, model configuration, code references, input/output table versions, and execution metadata.
- Clarify the boundary between full conversation logging and compact traceability, including which information can be abstracted, hashed, summarized, or omitted.
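One plausible way to draw the boundary between full conversation logging and compact traceability is to store approved transformation specifications verbatim (they must be re-executable) while reducing conversation-heavy fields to hashes. The record layout and field names below are hypothetical:

```python
import hashlib

def compact_record(spec: dict, prompt_context: str, model_config: dict) -> dict:
    """Build a compact traceability record: keep the approved spec in full,
    replace the prompt context with a hash so the conversation is not stored."""
    return {
        "spec": spec,  # reproducible artifact, stored verbatim
        "prompt_hash": hashlib.sha256(prompt_context.encode()).hexdigest(),
        "model_config": model_config,  # model name, temperature, etc.
    }

rec = compact_record(
    spec={"op": "normalize_dates", "column": "order_date", "format": "%Y-%m-%d"},
    prompt_context="(full chat transcript, discarded after hashing)",
    model_config={"model": "some-llm", "temperature": 0.2},
)
```

The hash still allows an auditor to verify that a retained transcript matches the one that produced the suggestion, without the framework itself storing any conversation text.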
c) Prototype the traceability framework for CSV transformations
- Implement a prototype that supports one uploaded CSV file and column-level transformation workflows.
- Represent each user-approved transformation as a structured specification connected to the relevant input table version, selected columns, LLM suggestion, human decision, generated code reference, execution event, and output table version.
- Provide functionality to inspect the provenance graph for a selected output column or table version.
- Support rollback by identifying the predecessor table version and the transformation chain that led to the current state.
- Store traceability records in a compact format suitable for querying and later export.
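The prototype behavior described above could be approximated by a small tracker that snapshots table versions, records each approved transformation as a chain entry, and resolves rollback by walking the chain backwards. This is a minimal sketch using pandas; the `Tracker` API is an assumption for illustration only:

```python
import hashlib
import pandas as pd

class Tracker:
    """Minimal sketch of a traceability store: table versions plus a
    transformation chain supporting lineage inspection and rollback."""

    def __init__(self):
        self.versions = {}   # version_id -> DataFrame snapshot
        self.chain = []      # ordered transformation records

    def _vid(self, df: pd.DataFrame) -> str:
        # Content-addressed version id from the serialized table.
        return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()[:12]

    def commit(self, df: pd.DataFrame) -> str:
        vid = self._vid(df)
        self.versions[vid] = df.copy()
        return vid

    def apply(self, df, columns, transform, spec):
        """Apply an approved transformation to selected columns and record
        the provenance edge from input version to output version."""
        in_vid = self.commit(df)
        out = df.copy()
        out[columns] = transform(out[columns])
        out_vid = self.commit(out)
        self.chain.append({"spec": spec, "columns": columns,
                           "input": in_vid, "output": out_vid})
        return out, out_vid

    def rollback(self, vid: str):
        """Return the predecessor table version of `vid`, if any."""
        for rec in reversed(self.chain):
            if rec["output"] == vid:
                return self.versions[rec["input"]]
        return None
```

A usage sketch: `out, vid = t.apply(df, ["name"], lambda b: b.apply(lambda s: s.str.strip()), {"op": "strip"})` followed by `t.rollback(vid)` to recover the pre-transformation snapshot. A real prototype would persist `chain` in a queryable, exportable format rather than in memory.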
d) Evaluate reproducibility, auditability, and rollback support
- Evaluate whether the stored traceability records are sufficient to reconstruct the transformation history of a CSV workflow.
- Test rollback scenarios by reverting selected table versions and verifying that the graph correctly identifies the responsible transformation chain.
- Assess practical reproducibility by re-executing approved transformation specifications and comparing regenerated outputs with stored output versions.
- Analyze robustness under LLM non-determinism by distinguishing between the non-deterministic suggestion phase and the reproducible approved transformation phase.
- Produce an error and limitation analysis covering ambiguous user intent, changed model outputs, failed code execution, partial transformations, and missing metadata.
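The reproducibility check in the evaluation plan, re-executing an approved specification and comparing the result against the stored output version, can be sketched with content hashes. The `run_spec` interpreter below is a deliberately tiny stand-in (it implements only a single illustrative operation) and is an assumption, not the thesis design:

```python
import hashlib

def table_hash(rows) -> str:
    """Order-sensitive content hash of a table serialized as CSV-like lines."""
    payload = "\n".join(",".join(map(str, r)) for r in rows)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_spec(rows, spec):
    """Deterministically re-execute an approved spec. Illustrative only:
    supports a single 'uppercase' operation on one column index."""
    col = spec["column"]
    return [[v.upper() if i == col else v for i, v in enumerate(r)]
            for r in rows]

# At approval time the output hash is stored; at evaluation time the spec
# is re-executed and the regenerated output must hash to the same value.
rows = [["alice", "de"], ["bob", "fr"]]
spec = {"op": "uppercase", "column": 1}
stored_hash = table_hash(run_spec(rows, spec))
assert table_hash(run_spec(rows, spec)) == stored_hash
```

This separation is exactly why the approved-transformation phase can be reproducible even though the suggestion phase is not: only the deterministic spec is re-executed, never the LLM call.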
References
[1] Moreau, L., Missier, P. (eds.): PROV-DM: The PROV Data Model. W3C Recommendation, World Wide Web Consortium (2013).
[2] Lebo, T., Sahoo, S., McGuinness, D. (eds.): PROV-O: The PROV Ontology. W3C Recommendation, World Wide Web Consortium (2013).
[3] Buneman, P., Khanna, S., Tan, W.C.: Why and Where: A Characterization of Data Provenance. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 316–330. Springer, Berlin, Heidelberg (2001).
[4] Raman, V., Hellerstein, J.M.: Potter’s Wheel: An Interactive Data Cleaning System. In: Proceedings of the 27th International Conference on Very Large Data Bases, VLDB 2001, pp. 381–390. Morgan Kaufmann, San Francisco (2001).
[5] Kandel, S., Paepcke, A., Hellerstein, J.M., Heer, J.: Wrangler: Interactive Visual Specification of Data Transformation Scripts. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2011, pp. 3363–3372. ACM, New York (2011).
[6] Pimentel, J.F., Murta, L., Braganholo, V., Freire, J.: noWorkflow: A Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts. Proceedings of the VLDB Endowment 10(12), 1841–1844 (2017).
[7] Narayan, A., Chami, I., Orr, L.J., Ré, C.: Can Foundation Models Wrangle Your Data? Proceedings of the VLDB Endowment 16(4), 738–746 (2022).
[8] Chen, W.-H., Tong, W., Case, A., Zhang, T.: Dango: A Mixed-Initiative Data Wrangling System using Large Language Model. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025, article 389, pp. 1–28. ACM, New York (2025).
Prerequisites
- Good programming skills in Python and experience with data processing libraries such as pandas.
- Basic understanding of relational/tabular data, CSV processing, and data cleaning workflows.
- Familiarity with database concepts such as data lineage, provenance, versioning, and reproducibility is beneficial.
- Basic experience with LLM APIs, prompt-based systems, or code generation workflows is helpful.
- Interest in human-in-the-loop systems, auditability, and responsible data management.