Mapping the EU AI Landscape (Part 3): Can you Train on my Data?

BlogPost

Oct 28, 2024


TL;DR: When training AI systems in the EU, you must anonymize personal data unless processing it is essential for bias reduction, and you can use copyrighted material for training unless owners explicitly opt out (though there’s no standard opt-out mechanism yet). For AI-generated content, ownership depends on human creativity: pure AI outputs fall into the public domain, while significant human involvement enables copyright claims. The rules are intentionally broad and lack specific standards, which may lead to different interpretations across EU countries.

DISCLAIMER: The views and opinions expressed in this blog are solely my own and do not reflect those of my employer, Fraunhofer, or any of its affiliates. All content is based on my personal insights, experiences, and research, and should not be construed as representing official positions or policies of Fraunhofer.

Intro

After writing my previous post on the EU data strategy, I was left with a sense of dissatisfaction. Working in data myself, I failed to see clear, well-defined rules on how we’re supposed to collect and process data.

For this reason, I took some very real problems the sector is facing (framed around training LLMs, though they extend to other modalities too) and turned them into questions I aimed to answer:

  1. How can you handle Personally Identifiable Information (PII) in your data?
    • Should you anonymize data?
    • What is the best way to access data?
  2. What about licensed and copyrighted material?
    • Can you train your model on copyrighted material?
    • What about intellectual property?

Documents and Conventions

Before starting, I should clarify which document is which (mostly for myself, I need this). Since some documents come with an official name (e.g., Regulation 2016/679) and some with a common name (that one is the GDPR), here’s a list of the ones I’ll reference:

  • AI Act: Regulation (EU) 2024/1689
  • GDPR: Regulation (EU) 2016/679
  • DC790: Directive (EU) 2019/790 on copyright in the Digital Single Market
  • DD96: Directive 96/9/EC on the legal protection of databases
  • CH29: Directive 2001/29/EC on the harmonisation of copyright (the “InfoSoc” Directive)
  • Law Enforcement Directive: Directive (EU) 2016/680

For citation, I’ll use a simplified notation: Article X.Y.Z, where X is the article number, Y is the paragraph, and Z is the point[1]. For the introductory sections, I’ll use Recital X.Y.Z. This might be obvious for those with a legal background, but for my fellow tech people - this one’s for you! <3

With this said, los geht’s (let’s go)!

Personal Identification Information

Let’s start by looking at how the AI Act handles personal data. Searching for “personal data” in the Act reveals several key points:

  • Recital 10: GDPR’s fundamental rights to data protection still apply. No exceptions for AI training.
  • Recital 67[2]: You must be transparent about how and why you originally collected the data.
  • Recital 69[3]: Emphasizes data minimization and protection principles. You should anonymize and encrypt data where possible, or better yet, use federated approaches where algorithms go to the data instead of data moving around.
  • Recital 140: Discusses data sandboxes[4] and how personal data collected for other purposes can be used in AI systems only under specific conditions[5].
  • Article 10: Focuses on data governance, with a range of requirements if you’re working with personal data[6].
  • Article 59: Details rules for using personal data in data sandboxes. Essentially, in these sandboxes, you may process “lawfully” gathered personal data, but only if working in sectors like healthcare, environment, transport, or public sector efficiency. Additionally, data must be deleted after use.

To summarize, anonymizing or pseudonymizing personal data is essential. In rare cases where full anonymization isn’t feasible, you must demonstrate that processing the personal data is necessary to mitigate biases.
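To make “anonymize or pseudonymize” a bit more concrete, here’s a minimal sketch of rule-based pseudonymization in Python. The patterns and the salted-hash scheme are my own illustration (production pipelines layer NER models and human review on top of regexes), and keep in mind that under the GDPR, pseudonymized data still counts as personal data:

```python
import hashlib
import re

SALT = b"rotate-me-per-project"  # illustrative secret; guards against simple lookup attacks

# Toy patterns -- real pipelines combine many more rules with NER models.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d /()-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def pseudonymize(text: str) -> str:
    """Replace matched PII with a stable salted token, e.g. <EMAIL:3f2a91bc>.

    Stable tokens keep the text useful (the same identifier always maps to
    the same placeholder) without exposing the raw value.
    """
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m: f"<{kind}:{hashlib.sha256(SALT + m.group().encode()).hexdigest()[:8]}>",
            text,
        )
    return text

print(pseudonymize("Reach Ana at ana@example.com or +49 170 1234567."))
# Reach Ana at <EMAIL:...> or <PHONE:...>.
```

Notice that the bare name “Ana” slips right through: catching names needs NER models, which is exactly why de-identification is a research field of its own.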

Data Features

Recital 69 introduces the principle of “data minimization and protection”, but what does that actually mean? According to Article 66 of the AI Act, the data used for AI also needs to be of “high quality”—a somewhat vague term. In essence, data used in AI should meet certain standards and adhere to specific principles.

Principle of Data Minimization

The European Data Protection Supervisor defines the principle of data minimisation as follows: “data controllers should collect only the personal data they really need, and should keep it only for as long as they need it.” Working in data myself, the question of “how much data is enough” often pops up. In the context of training large language models, it sometimes feels as though there’s no such thing as “enough” data, which suggests data minimization could be sidelined as LLMs demand ever more data from ever more sources. Furthermore, Article 10.5 of the AI Act allows the collection of personal data specifically to reduce biases, potentially creating a loophole where vast amounts of personal data could be collected without anonymization, all under the aim of mitigating bias in LLM training[7].

High-Quality Data

Recently, my team and I published an article discussing our data processing methods for one of our projects. In it, we touch on a field-wide issue: there’s no standard definition for “high-quality” data, and different techniques are emerging to assess various properties of text. So, I was surprised to see “high-quality data” mentioned in the AI Act. Reading through, I noticed the definitions are still pretty broad and seem to revolve around a few main concepts:

  • Biases (Recital 67): We’ve talked about this before. The AI Act also acknowledges the risk of feedback loops where biased data leads to biased models, which in turn produce even more biased data, creating a self-sustaining loop.
  • Relevance and Representativeness (Article 10.3): High-quality data should be relevant to the AI system’s intended purpose and sufficiently representative of the real-world scenarios the system is expected to operate in. You could argue that if you’re trying to build artificial general intelligence, all data is relevant.
  • Error-Free and Complete (Article 10.3): I have no idea what this means from a technical perspective. What counts as an error in a text? A typo? A syntactic issue? A conspiracy theory (non-factual content)? Thankfully, the Act adds “to the best extent possible” before this requirement (Article 10.2.h).
  • Appropriate Statistical Properties (Article 10.3): Another vague requirement. However, if you can define certain statistical properties as “quality signals” (like factual correctness, average word length, or toxic content), you could start measuring these to give each signal a concrete value (see the sketch right after this list).
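Since the Act doesn’t define these statistical properties, here’s a toy sketch of how you might turn a few of them into measurable “quality signals”. The signal names and heuristics are entirely my own invention, not anything prescribed by the Act:

```python
def quality_signals(text: str) -> dict:
    """Compute a few heuristic quality signals for one document.

    These are illustrative proxies: cheap to compute, easy to threshold,
    and nowhere near a legal definition of "high quality".
    """
    words = text.split()
    lines = text.splitlines()
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "alpha_ratio": sum(c.isalpha() for c in text) / max(len(text), 1),
        "duplicate_lines": len(lines) - len(set(lines)),  # crude boilerplate indicator
    }

doc = "The AI Act requires high-quality data.\nWhat counts as an error in a text?"
print(quality_signals(doc))
# {'n_words': 14, 'avg_word_len': ..., 'alpha_ratio': ..., 'duplicate_lines': 0}
```

In practice you would calibrate thresholds for each signal against a reference corpus; the point is simply that vague requirements become auditable once you commit to concrete, measurable proxies.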

Back to PII

This section was about PII—did you forget? I definitely did and had to reread from the start.

So, did we manage to find an answer on handling PII in your data? Surprisingly, yes.

The golden rule: always anonymize your data! Removing PII from text is a growing field of research[8], and there are some established methods out there. But if you really want to play at pro level, consider never even touching the data directly. Instead, go for a federated approach (wink wink to Gaia-X).
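And what does “the algorithm goes to the data” actually look like? Below is a toy sketch of federated averaging on a linear model with NumPy: each client runs gradient steps locally and ships back only model weights, never raw records. Real deployments (for instance with frameworks like Flower or TensorFlow Federated) add secure aggregation and differential privacy on top; treat this purely as an illustration of the idea:

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """One client's local training: the raw data (X, y) never leaves this function."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient (up to a constant factor)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """The 'server' side of FedAvg: aggregate local weights, weighted by dataset size."""
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_ws, axis=0, weights=sizes / sizes.sum())

# Toy setup: three data holders (say, hospitals) that keep their records on-site.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

def make_client(n):
    X = rng.normal(size=(n, 3))
    return X, X @ true_w + 0.1 * rng.normal(size=n)

clients = [make_client(n) for n in (40, 60, 80)]
w = np.zeros(3)
for _ in range(30):
    w = federated_round(w, clients)
print(w)  # converges toward true_w; only weights ever crossed the 'network'
```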

Finally, if none of this appeals to you and you’re doing nothing to address PII, just say you need it all to mitigate bias! Sure, recent research might call this out as ineffective, but you can always try to spin it using scaling laws.

Copyright and Intellectual Property

Now let’s dig into copyright and intellectual property (IP).

What do we actually care about here? For AI systems, I’d say it boils down to two main questions:

  1. If I train a system on copyrighted material (say, with non-commercial clauses), can I use the system for commercial purposes?
  2. And what about intellectual property? If my model spits out something close to the input data, can I get sued? And if the output is modified, at what point does it stop being the original source’s IP?

Train on Copyrighted Material

So, let’s see what the AI Act says. First up, if you’re conducting research, you’re in the clear and can train on pretty much anything (Recital 25). But if you’re working on something for commercial purposes, things get trickier.

Before we start discussing training on copyrighted data, let’s clarify what copyright really means and how it relates to other terms. If you’re already familiar with these concepts, feel free to skip ahead.

In simple terms, copyright protects the ownership of the original work. For example, if you scroll down to the bottom of this post, you’ll see “© Copyright 2024, All right Reserved By - Nicolo’ Brandizzi”, which means I own this post. Now, say you want to use this blog post to create a podcast episode. Can you read parts of my blog post for your podcast? That’s where the license comes in. A license regulates how copyrighted work can be used by others. If I license this post for “non-commercial use only,” you wouldn’t be able to monetize it on your podcast. Finally, since this post is hosted on my website, which in turn lives on GitHub Pages, we’re both subject to GitHub’s Terms of Service (ToS). GitHub’s ToS states that users aren’t allowed to distribute my content outside of GitHub (e.g., mirroring my post on other sites). So, the ToS basically defines the rules we all have to follow to use a platform or service.

Custom Licenses

Here’s the thing: there’s no fixed set of licenses or ways to interact with copyrighted content. You’ve probably heard of licenses like Creative Commons, the MIT License, and the GNU General Public License. These are well-established, but you can always create your own!

For instance, I could create a unique license for this blog post where I allow AI training only on paragraphs that start with a consonant. Unusual? Definitely. But still legally binding.

Why don’t we see more custom licenses? Probably due to several factors: legal expertise is usually required, and custom licenses are harder to interpret and enforce, which can lead to legal headaches.

Opting Out of AI Training

If you’re training a model for commercial use, Recital 105 makes it clear: you need to check if the data owner has explicitly said, “Hey, don’t use this for AI”. If there’s no specific mention or opt-out, you’re technically in the clear to use the data.

So, how do data owners actually opt out of having their content fed into an AI? Well, the Act doesn’t lay it out in exact terms but points[9] us to Article 4.3 of DC790, which says that rights holders need to express their wishes in an “[…] appropriate manner, such as machine-readable means in the case of content made publicly available online.”

In plain terms, if the opt-out isn’t expressed “appropriately” (like, say, yelling “NO AI” from your window), your data might still be used for AI training. But let’s clarify something important: if the data isn’t labeled with a “do not train” tag (i.e., a machine-readable indicator), that doesn’t mean copyright protection vanishes. Copyright remains; these guidelines simply let you train on the data as long as the owner hasn’t expressed an opt-out.

An Example of Appropriate Opting Out

So, what does this “appropriate manner” actually look like?

Take DeviantArt, for example (a popular digital art platform you may know for other reasons too 👀). If we check out their Terms of Service, point 24 states: “Unless you actively give your consent, for Artificial Intelligence Purposes, DeviantArt will include a robots meta tag with the “noai” or “noimageai” directive in the head section of the HTML page associated with that Content on the Site […]”

This is a fantastic, straightforward way to set your intent individually. Now, if we could just agree on standard tags to cover different data types (e.g., “notextai,” “noaudioai,” etc.), we’d be all set.
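If you’re on the crawling side and want to honor that signal, the check itself is cheap; here’s a minimal sketch using requests and BeautifulSoup. The “noai”/“noimageai” directives come straight from DeviantArt’s ToS quoted above, but remember they’re a platform convention, not an official standard, and my assumption that some sites ship them via the X-Robots-Tag HTTP header is just that, an assumption:

```python
import re

import requests
from bs4 import BeautifulSoup

NO_AI = {"noai", "noimageai"}  # DeviantArt-style directives; not a universal standard

def allows_ai_training(url: str) -> bool:
    """Return False if the page opts out via a robots meta tag or HTTP header."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()

    # Directives may arrive as an HTTP header...
    header = resp.headers.get("X-Robots-Tag", "").lower()
    if NO_AI & set(re.split(r"[,\s]+", header)):
        return False

    # ...or as <meta name="robots" content="noai, noimageai"> in the page head.
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup.find_all("meta", attrs={"name": "robots"}):
        directives = {d.strip().lower() for d in tag.get("content", "").split(",")}
        if NO_AI & directives:
            return False
    return True

print(allows_ai_training("https://example.com/"))  # hypothetical URL
```

A real crawler would also consult robots.txt and cache its decisions; the point is simply that once an opt-out is machine-readable, honoring it takes a few lines of code.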

What About the EU?

DeviantArt’s approach is commendable, but it’s not exactly a universal standard yet. So, does the EU offer a standard for opting out? Kind of.

The DC790 introduces the concept of Text and Data Mining (TDM), referring to automated methods of analyzing vast amounts of text and data. TDM’s copyright implications are complex, mainly because it’s wrapped up in copyright exemptions. For instance, Article 3 of DC790 grants certain exemptions to copyright[10] if entities conduct research[11].

For everyone else, data use is allowed if the data doesn’t explicitly prohibit TDM and if you’ve accessed it lawfully (i.e., no hacking). In that case, Article 4 of DC790 exempts you too.

So you might think you’re covered by opting out of TDM and keeping your data safe, right? Well, yes and no. Remember how I mentioned that there’s no standardized opt-out? This is a well-documented problem and hasn’t been solved yet. As it stands, there’s no universally recognized method for opting out.

Whose Job Is This?

Recent years have seen a flood of new datasets[12] released on various platforms like HuggingFace, Zenodo, Kaggle, and so on.

So, who’s responsible for collecting the “right” data? Is it the dataset provider (i.e., that one struggling PhD student) or the people who end up using the data? Well, according to Recital 106, it’s the latter. The AI providers bear the responsibility of identifying the copyrights associated with the data they use.

I’m genuinely conflicted about this. On one hand, it makes sense to hold the people actually training the models accountable, given they likely have the resources to investigate copyright. But on the other hand, datasets often come stripped down, with only the core content (like text from blog posts or images from DeviantArt) leaving out the original HTML or metadata where copyright info might have been stored. So, what’s the solution in these cases?

Honestly, I think we should obligate or at least heavily incentivize dataset collectors (yes, our overwhelmed PhD students again) to include copyright information for each data point. This small step could go a long way.
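What could that look like in practice? Even something as lightweight as shipping licensing fields next to each record would help. The sketch below is a hypothetical schema I made up for illustration; none of these field names are an existing standard:

```python
# A hypothetical per-record provenance schema for a published dataset.
# Field names are illustrative, not any existing standard.
record = {
    "text": "First paragraph of some blog post...",
    "source_url": "https://example.com/post",  # where the content was crawled from
    "retrieved_at": "2024-10-01T12:00:00Z",    # crawl date matters: opt-outs change over time
    "license": "CC-BY-NC-4.0",                 # or "unknown" -- itself useful information
    "tdm_opt_out": False,                      # was a machine-readable opt-out present?
    "opt_out_signal": None,                    # e.g. "meta robots: noai" when opted out
}
```

With fields like these, a downstream AI provider could filter by license or drop every record where tdm_opt_out is true, instead of trying to reverse-engineer provenance after the fact.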

Summary on Copyrighted Material

  • Check before you train: If you’re working on a commercial AI, it’s on you to make sure the data is free to use. Look for opt-out tags or warnings.
  • Opting out is possible: Data owners can make it clear their data is off-limits with metadata, terms, contracts, or even public statements.
  • Stay responsible: The EU wants a balance between innovation and protecting rights, so don’t think you can ignore these rules and use everything freely.

Derivative Works

Alright, let’s dive into the second question (the one about intellectual property, hope you didn’t forget already).

This one is hard. Why? Because it’s basically philosophy.

Take my blog post here, for instance. If I license it with a “non-commercial use” clause, that means you can’t turn it into a podcast for profit. But what happens if I add a no-derivatives clause? What does “derivative” even mean?

The Blurry Line of Similarity

When it comes to defining what’s derivative, things get murky fast. Copyright and trademark laws don’t follow one-size-fits-all rules; it all depends on context, interpretation, and, often, a judge’s decision.

Take the Creative Commons non-derivative license. It says, “If you remix, transform, or build upon the material, you may not distribute the modified material”. Terms like “remix,” “transform,” and “build upon” are vague, and they link to footnotes that dig into what counts as an adaptation. In short: “modification rises to the level of an adaptation under copyright law when the modified work is based on the prior work but manifests sufficient new creativity to be copyrightable, such as a translation of a novel from one language to another, or the creation of a screenplay based on a novel.”

This “sufficient new creativity” is a fuzzy area, and only a court can really decide what qualifies. I found a few landmark cases that talk about this gray zone:

  • DC Comics v. Towle (2015): DC Comics sued Mark Towle, a custom-car builder operating as Gotham Garage, for creating replicas of the Batmobile. Although the Batmobile is “just a car”, the court ruled it was a protected character due to its unique design and connection to Batman’s world. This shows that even vehicles can be characters under copyright law when they are distinctive enough.

  • Universal vs. Nintendo (1982): Universal claimed Nintendo’s Donkey Kong was too similar to King Kong. However, the court sided with Nintendo, ruling that the broad idea of a giant ape was public domain, and Donkey Kong was distinct in its own way. Here, the court acknowledged similar themes, but differences in characterization and execution allowed Nintendo to win.

  • Mattel vs. MGA Entertainment (2011): Mattel, the maker of Barbie, sued MGA, claiming the Bratz dolls were derivative of Barbie. After years of litigation, the court ruled in favor of MGA, noting that while the two products shared similarities as fashion dolls, the distinctive style and design of Bratz dolls made them original enough. The case highlighted how styling and branding can create separation, even within similar categories of products.

Each of these cases shows that there’s no universal rule for “too similar.” Courts look at key characteristics, overall design, and whether the new work borrows too much from the original or brings something fresh.

So, what does this mean for our AI models? The answer, honestly, is that no one knows yet[13].

Who owns the output?

So, imagine you’re using an AI system to clean up the readability of your blog posts. The system takes your messy, grammatically broken draft and spits out a polished, readable piece instead. Who owns this newly polished version? Let’s run through the options:

  1. The company providing the AI service
  2. The AI system itself
  3. You
  4. It depends
  5. No one

If you guessed “it depends”, you’re on the right track!

Copyrighting AI-Generated Outputs

Of the five options enumerated above, let’s eliminate option 2 right away. No, the AI system itself can’t claim ownership: under the law, copyright is only granted to natural and legal persons[14], meaning humans or organizations. AI isn’t considered either of these (yet), so it’s out.

What about option 1, the company providing the service? If they owned the output, users would likely abandon AI systems altogether, since it would mean you couldn’t use any AI-generated work commercially. That’s why most AI companies state in their terms of service that the output belongs to the user[15]. Besides, copyright requires human creativity, something the company didn’t exactly contribute.

That leaves us with two possible owners: you or no one, and two levels of human involvement: some, or none. It’s straightforward: if the AI system alone generated the content with zero human creativity, it falls into the public domain. Nobody owns it.

On the other hand, if you co-created with the AI, adding your own creativity, then you meet the human creativity requirement and can claim copyright. But how much involvement is enough? Probably more than just pushing a button and sitting back. For a deep dive into this, check out this paper.

Summary on Derivative Works

So, what does all this mean for AI? If your AI system is just reproducing the exact same input data, prepare to get sued for copyright infringement. For everything else, the jury’s still out, but one thing is clear: to copyright AI-generated content, you need to contribute some genuine human creativity.

Conclusion

So, we’ve made it to the end. First off, thank you for sticking around this long! (And if you just jumped to this section from the intro, no hard feelings. <3)

Key Take-Aways

To recap, here’s what we’ve covered:

  • PII Handling: Always anonymize or pseudonymize data, and only use personal data when necessary to mitigate biases. Following data minimization principles is crucial, though federated approaches can further limit direct data access.
  • Data Quality: The AI Act emphasizes high-quality, error-free, relevant, and representative data, but “high quality” lacks a strict definition, leaving much up to interpretation.
  • Copyrighted Material: When using copyrighted data, check for opt-out tags or machine-readable permissions, as unmarked data could be used by default. However, this doesn’t nullify copyright protections; it simply allows use unless the owner explicitly opts out.
  • Derivative Works: The concept of “derivative” remains legally ambiguous, with AI outputs falling into a gray area regarding originality and transformation.
  • AI-Generated Outputs: Ownership of AI-generated content depends on human involvement. Minimal involvement, like pressing a button, isn’t sufficient for copyright, as outputs must involve meaningful human creativity.

Also, fun fact: I used ChatGPT to help summarize this list. Does that make it public domain? Well, that’s still a gray area…

Personal Take-Aways

As some of you know, my background is pretty technical, and coming from engineering, I’ve learned there’s always an answer (a yes or a no, a number that defines a minimum threshold, or something equally straightforward).

This? This is the complete opposite. Regulation is a different world, where you’re defining acceptable behavior while considering so many factors that even listing them all is a task in itself. After all that effort, the regulation you create can still end up used in ways you’d never imagine, sometimes for the worse[16].

So, why am I saying this? Maybe I’m just trying to convince myself that the EU has a monumental task in balancing regulation with freedom. Unlike the U.S., the EU isn’t a single state but a collection of member states, each with its own priorities. If you pass a law that’s too restrictive, a member state might veto it. So, regulations have to be broad enough to avoid pushback while still trying to meet their original goals.

But that’s also why there’s a lack of specifics in EU policy. And while it makes sense, it doesn’t help. We seriously need standards (like a consistent way to opt out of AI training) and precise laws. This ambiguity hinders innovation. After all, a law might be interpreted one way in Spain and completely differently in Poland[17]. And no one wants to risk a lawsuit over unclear laws.

It’s ironic. Legislating is about weighing endless variables and unpredictable outcomes. You know what’s good at that? AI. Yes, this is a place where AI could actually help. In an ideal future, I’d love to see AI draft laws, with humans having the final say. Humans would always hold the power, but it’s becoming clear that we’re not always great at weighing long-term consequences. So, why not let something more capable handle that?

Footnotes

  1. I actually found that the proper citation style is in the form Article X(Y)(Z), but that looks awful to me. Too many parentheses…

  2. In order to facilitate compliance with Union data protection law, such as the GDPR, data governance and management practices should include, in the case of personal data, transparency about the original purpose of the data collection. 

  3. The right to privacy and to protection of personal data must be guaranteed throughout the entire lifecycle of the AI system. In this regard, the principles of data minimisation and data protection by design and by default, as set out in Union data protection law, are applicable when personal data are processed. Measures taken by providers to ensure compliance with those principles may include not only anonymisation and encryption, but also the use of technology that permits algorithms to be brought to the data and allows training of AI systems without the transmission between parties or copying of the raw or structured data themselves, without prejudice to the requirements on data governance provided for in this Regulation. 

  4. Information on data sandboxes is still limited. The AI Act only mentions that each EU member state is responsible for setting up these sandboxes (Recital 138), where regulations may be relaxed to promote innovation. Realistically, they’ll likely resemble the Testing and Experimentation Facilities (TEFs) we discussed in the post on the Coordinated Plan on AI.

  5. The statement refers to the following articles: (i) Article 6.4 GDPR: This article deals with the legal basis for processing personal data, requiring that processing be “necessary for compliance with a legal obligation to which the controller is subject.” This emphasizes that any processing of personal data for law enforcement or judicial cooperation purposes must have a solid legal basis under EU or Member State law. (ii) Article 9.2.g GDPR: This provision concerns the processing of special categories of personal data, which are considered more sensitive and require higher levels of protection. Point g specifically allows processing such data when “necessary for reasons of substantial public interest, on the basis of Union or Member State law which shall be proportionate to the aim pursued.” This is relevant because law enforcement activities often involve sensitive data like criminal records or biometric information. (iii) Articles 5, 6, and 10 of the New Data Protection Framework: These articles mirror the principles of lawfulness, fairness, and transparency in data processing, the legal basis for processing, and limitations on processing sensitive data, respectively, as outlined in the GDPR. Their inclusion emphasizes that these principles also apply to EU institutions and bodies, ensuring consistent data protection standards. (iv) Article 4.2 and Article 10 of the Law Enforcement Directive: While not explicitly stated, the reference to these articles without prejudice suggests that the statement does not intend to limit or restrict the application of these provisions. Article 4.2 defines “competent authority,” which is crucial for determining which entities are subject to the directive’s rules. Article 10 deals with data protection principles, requiring that processing be “necessary and proportionate” and carried out “with due regard for the legitimate interests of the data subject.” This reinforces the need for a balanced approach that protects both individual rights and law enforcement needs. 

  6. Among the musts we can see: (2.b) always disclose where your data came from and how you collected it; (2.c) how you processed your data before use; (2.d) formulate assumptions about what you expect to find in your data; (2.f) examine possible biases; (2.g) given the biases you suspect to find, implement measures to detect and mitigate them; (4) data should be geographically, contextually and behaviourally appropriate for the intended purpose (i.e., if you want an LLM for the EU, don’t train it on American data only); (5) if you really cannot do anything about biases with your current data, you are exceptionally allowed to process other categories of personal data; (5.c) in this case you have to document who has access to the data; (5.d) you cannot transfer the data; and (5.e) you have to delete it once you’re done.

  7. However, recent studies (Dong et al. 2024 and Huang et al. 2023) have shown that increasing the quantity of data is not necessarily the most effective technique for reducing bias. 

  8. Look up “de-identification” on Google Scholar.

  9. Recital 106: “[…] comply with the reservation of rights expressed by rightsholders pursuant to Article 4(3) of Directive (EU) 2019/790.”. 

  10. Specifically, this article exempts data use under Article 5.a (giving authors control over reproducing their database content) and Article 7.1 (protecting against data extraction and re-utilization) from DD96, Article 2 of CH29 (protecting authors’ copyright), and Article 15.1 of DC790 (covering press publications and online use). 

  11. But to be clear, these exemptions are limited to “research organizations and cultural heritage institutions”. DC790 defines a research organization as “a university, including its libraries, a research institute or any other entity, the primary goal of which is to conduct scientific research or to carry out educational activities involving also the conduct of scientific research: (a) on a not-for-profit basis or by reinvesting all the profits in its scientific research; (b) pursuant to a public interest mission recognised by a Member State”. 

  12. According to Fortune Business Insights, the AI training dataset market is projected to grow from $2.39 billion in 2023 to $17.04 billion by 2032, a compound annual growth rate of 24.7%!

  13. There are various court cases open: Sarah Andersen et al. vs. Stability AI, Midjourney, and DeviantArt (2023), Getty Images vs. Stability AI (2023), Artists Against OpenAI’s DALL-E

  14. I need to share this story I found when I was looking into animal copyright. Back in 2011, the photographer David Slater traveled to Indonesia to photograph wildlife. During his trip, a crested macaque took his camera and snapped several photos, including a famous selfie. When Slater published the photo, it became popular, and he later asserted copyright over the image. However, in 2015, PETA filed a lawsuit on behalf of the monkey, arguing that Naruto, the macaque, should hold the copyright to the photo. They contended that since the monkey had created the photograph independently, copyright should belong to Naruto, with PETA as a legal representative to protect Naruto’s interests. The U.S. District Court dismissed the case, ruling that animals do not have standing to hold copyright under U.S. law. According to the court, copyright law in the United States only applies to works created by human authors, and no existing legal framework allows non-human animals to claim copyright ownership.

  15. Among these companies we have OpenAI, Microsoft and Anthropic

  16. One well-known example of an EU regulation used in an unexpected way is the General Data Protection Regulation (GDPR). Originally intended to protect personal data privacy, it has also led to some unintended consequences. For instance, in 2018, journalists in the UK tried to use Freedom of Information requests to obtain information on political donations and lobbying activities. However, some organizations and government bodies cited GDPR as a reason for refusing to share data, claiming that revealing donors’ identities could infringe on personal privacy. This led to concerns that GDPR was being leveraged to limit public accountability rather than protect individual privacy. 

  17. Labor laws for gig economy workers vary widely in the EU. Spain’s 2021 “Riders’ Law” classifies gig workers as employees, granting them minimum wage, social security, and other protections. Poland, on the other hand, largely treats gig workers as independent contractors, offering fewer protections and more flexibility. 
