AI Act training data provenance for LLMs (Art. 10, Art. 11 + Annex IV): how to document?

Question

Jurisdiction: EU, DE

When the EU AI Act requires providers of high-risk AI systems to document the provenance of their training data (Art. 10 on data governance, Art. 11 + Annex IV on technical documentation), what does "adequate documentation of data provenance" look like in practice for fine-tuned LLMs?

Specifically: if you're fine-tuning on a mix of licensed, public, and synthetic data, how do you structure the data cards so that a regulator can trace which subset influenced a specific output class? We're struggling with the gap between dataset-level documentation and model-output-level traceability.

Has anyone built an internal data lineage tracker that survived a DPIA review?

Direct answers and proposed approaches
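
One possible starting point, offered as a minimal sketch rather than a compliance-approved design: keep one data card per training subset, pin each card to a content hash, and tag each subset with the output classes it was curated for. Every name below (SubsetCard, make_card, trace, manifest_json, the field names, the example IDs) is hypothetical and not taken from the AI Act or any existing tool. The tagging records curation intent only; it does not demonstrate causal influence on a given model output, which would need separate attribution analysis.

```python
from dataclasses import dataclass, asdict
from typing import Iterable, List, Tuple
import hashlib
import json

# Assumption: the three source categories named in the question.
ALLOWED_SOURCE_TYPES = {"licensed", "public", "synthetic"}


@dataclass(frozen=True)
class SubsetCard:
    """One data card per training subset (all field names are hypothetical)."""
    subset_id: str
    source_type: str                 # "licensed", "public" or "synthetic"
    origin: str                      # where the data came from
    license_ref: str                 # contract, licence, or generator reference
    record_count: int
    content_sha256: str              # pins the exact snapshot used for training
    output_classes: Tuple[str, ...]  # output classes this subset was curated for


def fingerprint(records: Iterable[str]) -> str:
    """Hash the records in order so the card identifies one exact snapshot."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def make_card(subset_id: str, source_type: str, origin: str, license_ref: str,
              records: Iterable[str], output_classes: Iterable[str]) -> SubsetCard:
    """Build a card, rejecting source types outside the agreed vocabulary."""
    if source_type not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    materialised = list(records)
    return SubsetCard(subset_id, source_type, origin, license_ref,
                      len(materialised), fingerprint(materialised),
                      tuple(output_classes))


def trace(cards: Iterable[SubsetCard], output_class: str) -> List[str]:
    """Return the IDs of subsets tagged for an output class (intent, not causation)."""
    return [card.subset_id for card in cards if output_class in card.output_classes]


def manifest_json(run_id: str, cards: Iterable[SubsetCard]) -> str:
    """Serialise the cards for one fine-tuning run into a reviewable manifest."""
    payload = {"run_id": run_id, "subsets": [asdict(card) for card in cards]}
    return json.dumps(payload, indent=2, sort_keys=True)


# Usage with made-up example data.
cards = [
    make_card("lic-001", "licensed", "vendor corpus (hypothetical)", "LIC-2024-01",
              ["doc a", "doc b"], ["summarisation"]),
    make_card("pub-001", "public", "public web crawl (hypothetical)", "n/a",
              ["page x"], ["classification"]),
    make_card("syn-001", "synthetic", "in-house generator (hypothetical)", "internal",
              ["gen 1", "gen 2", "gen 3"], ["summarisation", "classification"]),
]
print(trace(cards, "summarisation"))  # ['lic-001', 'syn-001']
```

Writing the manifest out once per fine-tuning run and storing it next to the model artefact gives a reviewer a fixed record to check against, while the hash makes silent changes to a subset detectable. Whether this level of detail satisfies Annex IV or a DPIA reviewer is a question for legal review, not something the sketch can settle.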

Risks, gaps, and constructive pushback