Synthetic data in the Spotlight: Compliance Insights on the AI Act and the GDPR

10.12.2024.

Compliance is the key 

Failure to comply with the GDPR or the new AI Act can lead to severe financial consequences for businesses. Penalties under Article 99 of the AI Act range from warnings and non-monetary measures to fines of up to 7% of the offender’s total worldwide annual turnover. Serious violations of the GDPR that breach its core principles can result in fines of up to 4% of total worldwide annual turnover or up to €20 million, whichever is higher. 

Non-compliance with either instrument can therefore cause significant financial problems for companies operating across the EU. Beyond adhering to the new AI legislation, companies must also be aware of privacy implications, since machine learning models are trained on data sets that may contain personal data. 

Seems complicated? What if there were data that didn’t need to be collected from the real world? Enter synthetic data. 

What is synthetic data? 

Synthetic data is artificially created through computer simulation or generated by algorithms to replace real-world data. It serves as an alternative or supplement to real-world data for machine learning models, especially when real data is not readily available. 

This concept has gained popularity in deep learning and has various use cases, such as data science experiments and safeguarding patient data in healthcare to make clinical trials more efficient. 

While synthetic data is artificial, it statistically and mathematically reflects real-world events. It aids in training machine learning algorithms that require large amounts of data, which can be costly and come with data usage restrictions.  
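To see what "statistically reflects real-world events" means in practice, here is a minimal, purely illustrative sketch: we stand in hypothetical "real" measurements with random numbers, fit simple summary statistics, and sample a synthetic dataset from the fitted distribution. Real synthetic data generators are far more sophisticated, but the principle is the same: no synthetic record corresponds to any real individual, yet aggregate properties are preserved.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" measurements (e.g. ages in a customer dataset).
real = rng.normal(loc=40, scale=12, size=10_000)

# Fit simple summary statistics to the real data...
mu, sigma = real.mean(), real.std()

# ...and sample a synthetic dataset from the fitted distribution.
# No synthetic row belongs to any real person, but the aggregate
# statistics that a model would learn from are preserved.
synthetic = rng.normal(loc=mu, scale=sigma, size=10_000)

print(round(real.mean(), 1), round(synthetic.mean(), 1))
```

A model trained on `synthetic` would see roughly the same distribution as one trained on `real`, which is exactly why synthetic data can substitute for scarce or restricted real data.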

Why synthetic data? 

Data privacy regulators recommend synthetic data as an alternative to real-world data in specific contexts. For example, the Norwegian Data Protection Authority fined the country’s Confederation of Sport for a personal data breach involving the unintended online exposure of personal data of 3.2 million Norwegians during the testing of a cloud computing solution. The regulator noted this could have been avoided by using synthetic data. 

What does the AI Act say about synthetic data? 

The AI Act primarily regulates high-risk AI systems and imposes requirements on AI models that process personal data. Article 15 mandates that high-risk AI systems be designed and developed to achieve accuracy, robustness, cybersecurity, and consistent performance throughout their lifecycle. 

Article 10, titled ‘Data and data governance’ (applicable to high-risk AI systems from August 2026), sets out criteria for the training, validation and testing data sets used by high-risk AI systems. These data sets must be subject to data governance and management practices appropriate to the system’s intended purpose, covering, among other things, relevant design choices, data collection processes and the origin of the data, and relevant data preparation processing operations. 

You are probably wondering where this lengthy article mentions synthetic data. Keep reading, we’re almost there. 

Now, if we jump to article 10 (5), the AI Act allows, where strictly necessary for the purpose of ensuring bias detection and correction in relation to high-risk AI systems, the processing of special categories of personal data, subject to appropriate safeguards. Under article 10 (5) (a), this processing is permitted only if bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic and anonymized data. 

Long story short, article 10 (5) allows the processing of special categories of personal data for bias detection and correction in high-risk AI systems, but only where other data, including synthetic and anonymized data, cannot fulfill that objective. 

Additionally, the AI Act acknowledges the non-personal nature of synthetic data in article 59 (1) (b). 

The GDPR on synthetic data 

All privacy regimes are built on principles that promote lawful data processing, along with requirements regarding data quality, minimization and security. Replacing collected real-world data with artificially created data can add a novel layer of security to personal data. And since synthetic data is mainly created on demand, it can also align with the principle of data minimization. 

From a data protection by design perspective, synthetic data technology could enhance data privacy as well as improve fairness and mitigate bias. 

However, even though advocates argue that synthetic data amounts to anonymized data and point to Recital 29 GDPR for support, it remains unresolved whether synthetic data can be regarded as anonymized. The degree to which a synthetic dataset differs from the original data is generally seen as the decisive factor in determining whether it can be considered anonymous. 

What is the difference between anonymized and pseudonymized data? 

Anonymized data refers to information that has been processed to remove identifiers that could be used to identify a person, either directly or indirectly. 

As stated in the GDPR, the principles of data protection apply to any information concerning an identified or identifiable natural person. They do not apply to anonymous information, meaning information which does not relate to an identified or identifiable natural person, or to personal data rendered anonymous in such a manner that the data subject is not or is no longer identifiable. 

However, total anonymization can be difficult to achieve, since organizations need to consider indirect markers that may still exist within their datasets. If such indirect markers can be used to unmask the data subjects, there is no anonymization. 

On the other hand, the GDPR states that personal data which has undergone pseudonymization should be considered data on an identifiable natural person. It is further stipulated that, to determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person, to identify that person directly or indirectly. 

The key point is that pseudonymization is temporary and reversible; it is therefore subject to privacy rules in all major data protection regulations, such as the GDPR, the CPRA and HIPAA. Pseudonymization can be achieved in different ways, such as with cryptographic hashes or encryption. In contrast to anonymization, pseudonymization retains the ability to recover the original data, which is why this technique is more often used by organizations and businesses. 

Article 4 of the GDPR defines pseudonymization as the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information (for example, a hash key, token or code), provided that such additional information is kept separately and is subject to technical and organizational measures ensuring that the personal data cannot be attributed to an identified or identifiable natural person. 
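The structure of that definition can be illustrated with a small sketch (the key and record here are invented for illustration): a direct identifier is replaced with a keyed hash, and the secret key is the "additional information" that must be stored separately under Article 4(5). Anyone holding the key can recompute the pseudonym and link records back to the person, which is precisely why pseudonymized data remains personal data.

```python
import hashlib
import hmac

# The "additional information" of Article 4(5): a secret key that must be
# kept separately from the pseudonymized dataset, under technical and
# organizational safeguards. (Hypothetical value for illustration.)
SECRET_KEY = b"store-me-in-a-separate-system"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed cryptographic hash."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "diagnosis": "A12.3"}  # hypothetical record
record["name"] = pseudonymize(record["name"])

# Whoever holds SECRET_KEY can recompute pseudonymize("Jane Doe") and
# re-link this record to the person, so the GDPR still applies in full.
print(record["name"][:12], "...")
```

Note the contrast with anonymization: deleting `SECRET_KEY` everywhere would move the data toward anonymity, while keeping it anywhere keeps the data identifiable.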

A similar definition of pseudonymization can be found in the Serbian Law on Personal Data Protection. Our Law mentions pseudonymization in a few other places as well; for example, under article 50, the controller and the processor must implement appropriate technical, organizational and personnel measures to achieve a level of security appropriate to the risk, and these measures include the pseudonymization and encryption of personal data. 

Let’s jump to the conclusion: the main difference between anonymization and pseudonymization comes down to a single question: can the dataset be turned from non-identifiable back to identifiable with additional information or technologies? 

Is anonymization the real goal of synthetic data or not? 

Creating synthetic data without any links to the original data subjects could ensure confidentiality and eliminate the risk of re-identification. However, there’s an ongoing debate: does anonymization strip synthetic data of its utility? Can synthetic data produce results like those of real-world data when anonymized? 

For synthetic data to be truly anonymous and compliant with GDPR criteria, re-identification must be impossible, whether directly or indirectly.  

The AEPD’s 2021 publication on misunderstandings about anonymization clarifies that anonymization does not necessarily mean a loss of utility. The usefulness of anonymized data depends on the purpose and on the acceptable re-identification risk. Since personal data cannot be stored indefinitely beyond its original purpose, keeping it around in the hope of future utility is not an option. 

History has shown multiple examples of flawed anonymization leading to re-identification. In 2014, the New York City Taxi and Limousine Commission released data on over 173 million taxi trips taken in 2013, including allegedly anonymized medallion and license numbers. Because the anonymization was inadequate, it was possible to recover the original license numbers and identify individual drivers. 

Can it be exempted from the GDPR? 

Not really. Even if synthetic data could be fully equated with anonymized data, the GDPR’s safeguards cannot be avoided entirely. If we think of the generation of a synthetic dataset as an anonymization process, it still qualifies as the processing of personal data. That means the controller still needs to establish a lawful basis for generating synthetic data. 

What can you do to achieve compliance? 

Using synthetic data has considerable potential to enhance compliance with the main requirements of the GDPR and the AI Act. Taking the legal requirements into account and implementing the process carefully can help you remain compliant. 

You can start by assessing the specific needs of your business and whether synthetic data is the right fit. It is most useful in machine learning and scientific research. 

Pay attention to your obligations under the GDPR. Don’t forget to assess whether the synthetic datasets differ sufficiently from their original counterparts, and adjust the settings of your AI model accordingly. It is advisable to perform an identifiability test on the synthetic dataset to determine whether re-identification is possible. Also consider all organizational and technical measures, particularly those concerning integrity, security management, confidentiality and incident response. 
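One simple building block of such an identifiability test can be sketched as follows (with random stand-in data; real tests are more thorough and also check attribute inference and membership risks): for every synthetic record, measure the distance to the closest original record. A synthetic row that is identical, or nearly identical, to a real one may simply be a copy of a real person’s data and is a re-identification red flag.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical numeric datasets: rows are individuals, columns attributes.
original = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))

def min_distance_to_original(synthetic: np.ndarray,
                             original: np.ndarray) -> np.ndarray:
    """For each synthetic row, the Euclidean distance to its
    closest original row."""
    diffs = synthetic[:, None, :] - original[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

closest = min_distance_to_original(synthetic, original)

# Synthetic records that effectively duplicate a real record.
suspicious = int((closest < 1e-6).sum())
print("records identical to a real one:", suspicious)
```

If this count is non-zero, or if many distances are unusually small, the synthetic dataset differs too little from its source, and treating it as non-personal data would be hard to defend.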

For tailored guidance and to ensure full compliance, it would be wise to consult an attorney specializing in IT law.  

This is not our first time discussing AI-related topics. You can learn more about the AI Act and corporate compliance by reading this article: ‘AI Meets Corporate Compliance- Setting the stage’. If you’re interested in the interplay between AI and intellectual property, check out one of our older articles: ‘AI & Intellectual Property: Can AI own copyright?’. 

 

Note: This text does not constitute legal advice, but rather the personal opinion of the author. 
