The Digital Poisoners: How Russian Propaganda Networks Are Corrupting AI Training Data


Jul 23, 2025


DISCLAIMER: The views and opinions expressed in this blog are solely my own and do not reflect those of my employer or any of its affiliates.

TL;DR: Russian propaganda networks are actively trying to corrupt AI training data. My investigation into “Pravda” (Russian for “truth”) related domains found millions of propaganda documents in public web archives, and even in supposedly “clean” datasets. This post details my findings and outlines steps for dataset creators, AI developers, and policymakers to combat this new form of information warfare.

The Investigation Begins

A few weeks ago, I stumbled upon this article from The Bulletin of the Atomic Scientists. The headline was alarming: Russian networks were flooding the internet with propaganda specifically designed to corrupt AI chatbots. As someone who works on data collection and training of LLMs1, this immediately caught my attention.

The article described a sophisticated campaign using hundreds of fake news websites, all variations of “Pravda” (Russian for “truth”), pumping out disinformation at an industrial scale. But I wondered: was this propaganda actually making it into the datasets we use to train AI models?

I had to find out. Fortunately, I had Common Crawl2 dumps lying around. Time to dig in.

Following the Trail

My investigation started simply: searching for any URL containing “pravda” in the Common Crawl dumps from 2013 to 2024. But I quickly realized I needed a more comprehensive approach. Using domain lists compiled by DFRLab and this excellent Recorded Future report, I assembled a list of 306 domains associated with the Pravda disinformation network.
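
I processed full dumps locally, but a lighter-weight way to run the same kind of domain scan is Common Crawl’s public CDX index API. Here is a minimal sketch, with an illustrative crawl ID and just two of the 306 domains; large domains would need the paginated API:

```python
import requests

# Two illustrative domains; the real list has 306 entries.
PRAVDA_DOMAINS = ["news-pravda.com", "pravda-en.com"]

# One index endpoint per crawl; CC-MAIN-2024-33 is just an example crawl ID.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

def count_captures(domain: str) -> int:
    """Count captured pages for one domain in one crawl via the CDX API."""
    resp = requests.get(
        CDX_ENDPOINT,
        params={"url": f"{domain}/*", "output": "json"},
        timeout=60,
    )
    if resp.status_code == 404:  # the index answers 404 when a domain has no captures
        return 0
    resp.raise_for_status()
    # The API returns one JSON record per line; counting lines counts captures.
    return sum(1 for line in resp.text.splitlines() if line.strip())

for domain in PRAVDA_DOMAINS:
    print(domain, count_captures(domain))
```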

Armed with this list, I processed 29.1 billion documents from Common Crawl, using 1,482 CPU hours to filter through the data. The domains revealed a clear evolution:

[Figure: Pravda content growth over time, 2013–2024]

What you’re seeing is a more than 500-fold increase in propaganda3 content since 2013, with the steepest growth occurring after Russia’s 2022 invasion of Ukraine. In absolute numbers:

  • 2013: 3,783 documents
  • 2020: 879,992 documents
  • 2024: 1,966,501 documents

That’s nearly 2 million documents in 2024 alone, totaling 75GB of poisoned text across all years. To put this in context, 75GB of text is roughly equivalent to 37,500 novels or the entire English Wikipedia… thrice.
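
That conversion is back-of-the-envelope arithmetic; the per-novel and Wikipedia sizes below are rough assumptions of mine:

```python
corpus_mb = 75_000       # 75 GB of Pravda-network text
novel_mb = 2             # assumed ~2 MB of plain text per novel
wikipedia_gb = 25        # assumed size of English Wikipedia as plain text

print(corpus_mb / novel_mb)   # 37500.0 novels
print(75 / wikipedia_gb)      # 3.0 English Wikipedias
```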

While widespread, the dissemination of Pravda network documents was highly targeted, and the targeting shows up most clearly in language. A look at the language distribution reveals the strategy4:

Language      Percentage
Ukrainian     34.11%
Slovak        33.57%
Portuguese     9.37%
German         6.80%
Czech          5.36%

The high percentages in Ukrainian and Slovak suggest an effort to influence narratives directly within or near the conflict zone, and in countries with strong historical or geographical ties. The presence of Portuguese, German, and Czech content indicates a broader strategy to shape perceptions across Europe.


Unmasking the Propaganda

To understand what narratives were being pushed, I needed to analyze the actual content. Using Google’s Gemini embedding model5, I compared the documents against propaganda seed terms across seven categories (a minimal sketch of the comparison follows the list):

  • Pro-Kremlin narratives (“special military operation”, “liberation of Donbas”)
  • Anti-Western rhetoric (“NATO aggression”, “Western decadence”)
  • Disinformation tropes (“staged massacre”, “false flag operation”)
  • Anti-Ukraine propaganda (“Nazi regime in Kyiv”, “corrupt elites in Kyiv”)
  • Ideological messaging (“Christian Orthodox roots”, “spiritual decay of Europe”)
  • Historical revisionism (“Crimea has always been Russian”, “NATO betrayed agreements”)
  • Conspiracy theories (“deep state”, “New World Order”)
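
Here is a minimal sketch of that comparison. To keep it self-contained, it substitutes an open sentence-transformers model for the Gemini embeddings I actually used, and abbreviates the seed lists:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for the Gemini embedding model

# Abbreviated seed lists; the full analysis used all seven categories.
SEED_TERMS = {
    "pro_kremlin": ["special military operation", "liberation of Donbas"],
    "anti_western": ["NATO aggression", "Western decadence"],
    "disinfo_tropes": ["staged massacre", "false flag operation"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def category_scores(doc_text: str) -> dict[str, float]:
    """Max cosine similarity between one document and each category's seed terms."""
    doc_vec = model.encode([doc_text], normalize_embeddings=True)[0]
    scores = {}
    for category, seeds in SEED_TERMS.items():
        seed_vecs = model.encode(seeds, normalize_embeddings=True)
        # With normalized vectors, the dot product is cosine similarity.
        scores[category] = float(np.max(seed_vecs @ doc_vec))
    return scores

print(category_scores("Troops continue the special military operation near Donetsk."))
```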

The semantic analysis revealed something striking: this isn’t random spam. The content clustered into 17 distinct narrative groups, with clear thematic coherence. The propaganda is carefully crafted to reinforce specific worldviews and talking points.
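
A simplified way to reproduce that clustering step is k-means over the document embeddings; the sketch below uses placeholder vectors (real ones would come from the embedding step above) and the 17 clusters found in the analysis:

```python
import collections
import numpy as np
from sklearn.cluster import KMeans

# Placeholder embeddings; in the real pipeline these come from the embedding model.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10_000, 384))

kmeans = KMeans(n_clusters=17, random_state=0, n_init=10)
labels = kmeans.fit_predict(doc_vectors)

# Cluster sizes reveal the dominant narrative groups.
print(collections.Counter(labels).most_common(5))
```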

[Figure: Word cloud of the dominant narrative cluster]

The word cloud above illustrates the most dominant cluster, highlighting the highly politicized and conflict-oriented vocabulary being pushed. Common terms include “ukraine”, “war”, “military”, “nato”, “europe”, “west”, and “nazi.”

Beyond Common Crawl: High-Quality Datasets

Common Crawl is intentionally unfiltered: it’s meant to capture the web as it is, warts and all. But what about the carefully curated datasets that AI companies use for training? I decided to check Nemotron6, a high-quality dataset created by NVIDIA.
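
The check itself boils down to matching each record’s source URL against the flagged domain list. Here is a minimal sketch; it assumes each record exposes a “url” field, which web-scrape corpora generally do, though field names vary by dataset:

```python
from urllib.parse import urlparse

def contamination_stats(records, domains):
    """Count records whose source URL falls under any flagged domain.

    `records` is an iterable of dicts with a 'url' field; `domains` is the
    flagged domain list. Returns (hits, total, rate per million documents).
    """
    flagged = set(domains)
    total = hits = 0
    for rec in records:
        total += 1
        host = urlparse(rec["url"]).hostname or ""
        # Match the domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in flagged):
            hits += 1
    rate = hits / total * 1_000_000 if total else 0.0
    return hits, total, rate

# Toy usage with made-up records:
records = [
    {"url": "https://news-pravda.com/world/2024/some-article"},
    {"url": "https://example.org/recipe"},
]
print(contamination_stats(records, ["news-pravda.com"]))  # (1, 2, 500000.0)
```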

The results were mixed:

  • In Nemotron’s low-quality partition: 2 contaminated documents per million (6,920 documents in total)
  • In Nemotron’s high-quality partition: 30 contaminated documents per million (21,998 documents in total)

It’s particularly interesting that the “high-quality” partition of Nemotron showed fifteen times the contamination rate of the “low-quality” one. This finding suggests that even sophisticated filtering methods may struggle to identify and remove subtle propaganda. The percentages are low, but we’re still talking about tens of thousands of documents that made it through the quality filters of a supposedly “clean” dataset.

Why This Matters

You might think: “So what? It’s less than 1% of the data. Can this really affect an AI model?”

The answer is yes, and here’s why. Training a large language model isn’t like teaching a human – it’s more like creating an incredibly sophisticated pattern-matching system. The model learns statistical associations between words, concepts, and narratives. When propaganda content appears thousands of times with consistent messaging, it creates strong statistical associations that the model learns to reproduce.

Consider this: if even 0.1% of a model’s training data consistently associates “Ukraine” with “Nazi regime” or “NATO” with “aggression”, the model learns these associations. When users later ask about these topics, the model might subtly reflect these biases in its responses.
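
A deliberately simplified toy shows the mechanism: even 0.1% contamination leaves a consistent, nonzero signal in the conditional statistics a model learns.

```python
import collections

def next_word_probs(bigrams, word):
    """Conditional distribution over the next word, given `word`."""
    counts = collections.Counter(b for a, b in bigrams if a == word)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# 10,000 toy bigrams starting with "ukraine"; 0.1% come from a source
# that always pairs the word with the same loaded term.
clean = [("ukraine", "government")] * 9_990
poison = [("ukraine", "nazi regime")] * 10

print(next_word_probs(clean + poison, "ukraine"))
# {'government': 0.999, 'nazi regime': 0.001}: small but consistent,
# exactly the kind of association a pattern-matcher picks up.
```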

It’s not about the model explicitly saying “Ukraine is run by Nazis” – it’s about subtle biases in word choice, framing, and the kinds of information the model considers relevant. These biases can shape how millions of users understand current events.

A New Battlefield

What we’re witnessing is information warfare adapted for the AI age. Instead of targeting human minds directly through social media or news outlets, this campaign targets the training data of AI systems that will shape how future generations access and understand information.

The Bulletin article aptly calls this “LLM grooming” – and the term is accurate. Just as grooming involves the gradual manipulation of a victim’s perception of reality, these propaganda networks are slowly, methodically shaping how AI models understand the world. It’s a patient, long-term strategy. The Pravda network has been operating since at least 2013, slowly seeding the internet with content. They’re not trying to convince you or me – they’re trying to influence the AI systems that our children will rely on for homework help, news summaries, and understanding the world.

What Can Be Done?

This investigation reveals both the scale of the problem and potential solutions:

For AI Developers

  1. Audit your data sources: Don’t assume curated datasets are clean. As I showed, even high-quality datasets contain propaganda.
  2. Implement bias testing: Propaganda content clusters together. Identify and review suspicious clusters before letting their documents into training data.
  3. Don’t ignore non-English content: The focus on Ukrainian and Slovak content shows that attackers target specific languages. Every language needs proper content moderation.
  4. Transparency: Document data sources and filtering methods. Users have a right to know what went into training their AI.

For Policymakers

  1. Treat data poisoning as a security threat: This is not just about content moderation; it’s about protecting the integrity of AI systems that will shape public discourse.
  2. Support research: We need better tools for detecting coordinated disinformation campaigns in training data.
  3. Mandatory documentation of training data: We need to know if our AI systems have inherited biases from poisoned data. The EU AI Act already requires high-risk AI systems to document their training data sources and quality measures – this should become standard practice globally.

The Investigation Continues

This analysis represents just the tip of the iceberg. We’ve identified one network, but how many others exist? How much propaganda has already been incorporated into the AI models we use today?

Let’s be clear about what we’re dealing with here: this is not accidental misinformation or biased reporting. This is an intentional, coordinated campaign designed to poison public discourse. We already live in increasingly polarized societies, where people struggle to find common ground and shared truth. These propaganda networks are deliberately exploiting our divisions, amplifying conspiracy theories, and fracturing our ability to have productive conversations about real challenges. When these poisoned narratives get embedded into AI systems – systems that millions will rely on for information – the potential for societal harm multiplies exponentially.

What started as curiosity about a news article revealed a sophisticated operation that’s been poisoning the well of AI training data. The Pravda network shows us that in the age of AI, controlling information means controlling not just what people read today, but how machines will understand and explain the world tomorrow.

  1. I worked on the OpenGPT-X project, where we trained the multilingual LLM Teuken.

  2. Common Crawl is a massive repository of web crawl data that’s freely available to researchers. It contains petabytes of data collected since 2008 and is the foundation for training many large language models, including GPT-3, LLaMA, and others. Essentially, if you’re training a large language model, you’re probably using Common Crawl data. 

  3. While my initial filtering identified millions of documents from these domains, it’s important to recognize that not all content associated with “Pravda” domains is necessarily propaganda. These sites sometimes host legitimate news, cultural articles, or re-post content that isn’t overtly propagandistic. Through further analysis using semantic embedding, I determined that approximately 2.68% of the Pravda-related documents identified could be classified as hardcore propaganda. 

  4. You might notice English is missing from this list. That’s not because the Pravda network ignores English. I simply couldn’t analyze English-language content due to computational constraints ¯\_(ツ)_/¯ 

  5. An embedding model converts text into numerical representations that capture semantic meaning. This allows us to mathematically compare how similar different texts are to known propaganda narratives. 

  6. Nemotron is a curated dataset where low-quality content has been filtered out using various quality metrics. It’s the kind of “clean” dataset that companies might use when they want higher quality training data than raw Common Crawl provides. 

If you like my work, consider buying me a coffee to support future posts.
