EU AI Act Article 13 transparency obligations: documenting training data provenance for high-risk medical AI systems

Question

When building a high-risk AI system under the EU AI Act (classified via Article 6(1) and Annex I in the final text, which was Annex II in the 2021 proposal), how are you handling the Article 13 transparency obligation around training data provenance? Specifically:

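1. **Data lineage documentation**: Article 13 requires instructions for use that describe the system's characteristics, capabilities and limitations, including, where appropriate, information about the training, validation and testing data sets. For a medical diagnostic model trained on multi-institutional datasets, does your team trace each data source back to its original legal basis (e.g. explicit consent under GDPR Art. 9(2)(a), whether broad or specific, vs. the scientific research exemption in Art. 9(2)(j))?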

2. **Training data vs. fine-tuning data**: If a base model was pre-trained on general medical literature and then fine-tuned on proprietary hospital data, which data provenance chain needs to be documented for Article 13 compliance — both, or only the fine-tuning layer?

3. **SOC 2 intersection**: Are teams mapping AI Act transparency requirements to SOC 2 CC6.1 (logical access) and CC7.1 (system monitoring) controls, or keeping them as separate audit trails?

Looking for practical implementations, not just regulatory theory. What did your auditors actually ask for?

k8s_wiz · Answer

The training data provenance problem is particularly acute in medical AI because you're often dealing with datasets that have passed through multiple hands: hospital → research consortium → commercial vendor → fine-tuning team.

Our approach:
- Every dataset gets a **Data Provenance Manifest** (YAML) that tracks: original source, consent or other legal basis for processing special-category data (GDPR Art. 9), anonymization method, any transformations applied, and downstream recipients.
- We store SHA-256 hashes of each dataset version in an immutable ledger (we use a private Hyperledger instance) so auditors can verify that the data actually used for training matches what was documented.
- For the AI Act Art. 13 transparency requirement, we generate a **Model Card** that includes: data source summary, known limitations, demographic coverage analysis, and a plain-language explanation of the model's decision logic for the intended audience.
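
To make the manifest-plus-hash idea concrete, here is a minimal sketch in Python using only the standard library. It is not our production code: the class name `ProvenanceManifest`, its field names, and the `seal`/`verify` helpers are hypothetical, chosen to mirror the items listed above. It uses a dataclass rather than YAML to avoid a third-party dependency, and it leaves the ledger write out entirely.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class ProvenanceManifest:
    # Hypothetical field names mirroring the manifest items described above.
    dataset_id: str
    version: str
    original_source: str
    legal_basis: str
    anonymization_method: str
    transformations: List[str] = field(default_factory=list)
    downstream_recipients: List[str] = field(default_factory=list)
    sha256: str = ""  # filled in by seal()


def seal(manifest: ProvenanceManifest, dataset_path: str) -> ProvenanceManifest:
    """Record the dataset's current hash in the manifest."""
    manifest.sha256 = sha256_of_file(dataset_path)
    return manifest


def verify(manifest: ProvenanceManifest, dataset_path: str) -> bool:
    """Return True only if the file on disk still matches the sealed hash."""
    return bool(manifest.sha256) and manifest.sha256 == sha256_of_file(dataset_path)
```

In this sketch, `seal` would run when a dataset version is frozen for training, and `verify` would run again right before the training job reads the file, so any silent change to the data shows up as a hash mismatch.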

The uncomfortable truth: most medical AI vendors can't actually reconstruct their training data lineage. We've seen at least two cases where vendors couldn't prove consent for specific data subsets during audits.
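
A cheap way to catch this before an auditor does is to scan lineage records for subsets with no documented legal basis. The sketch below is illustrative only: the record keys (`subset_id`, `original_source`, `legal_basis`, `consent_record_ref`) and the function name are assumptions, not a standard schema.

```python
from typing import Dict, List, Optional

# Hypothetical set of fields every data subset must document.
REQUIRED_FIELDS = ("original_source", "legal_basis", "consent_record_ref")


def find_lineage_gaps(subsets: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return one message per subset that is missing a required lineage field."""
    gaps = []
    for subset in subsets:
        # Treat absent keys, None, and empty strings all as "not documented".
        missing = [name for name in REQUIRED_FIELDS if not subset.get(name)]
        if missing:
            subset_id = subset.get("subset_id") or "<unknown>"
            gaps.append(f"{subset_id}: missing {', '.join(missing)}")
    return gaps
```

Running a check like this on every manifest change turns "we cannot prove consent for subset X" from an audit finding into a failed pre-training gate.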

Direct answers and proposed approaches

Risks, gaps, and constructive pushback