Data Privacy and Generative AI:
The Truth About Common Security Promises

For many people and organizations, data privacy isn’t just a preference; it’s non-negotiable. Beyond preventing their information from training future AI models, they need absolute assurance that sensitive data is never exposed to others, not even for a moment. When working with text data, whether in research, business, or social impact projects, trust in how that data is handled is essential.

A first important distinction is how you access a generative AI service.

• Web interface (chat-style services):  When you log in through a provider’s website (like ChatGPT, Claude, or Gemini in a browser), your conversations are generally not private. They are often stored indefinitely, and, unless you explicitly disable it, they may be used for training and quality improvement. Even when companies allow you to “turn off training,” the conversations may still remain accessible to them.

• API access (software integration): By contrast, when generative AI is used through an API (as in our software), the rules are different. Most GenAI providers state that data sent via API is not used for training. However, they typically retain it temporarily, often for 30 days, for system monitoring and debugging purposes.

When a software company integrates GenAI, it currently does so through API calls, so while the data is not used for training, it is typically kept for 30 days before deletion. This is the case for our own software, QDA Miner and WordStat. In recent years, however, many software companies integrating generative AI have started to advertise zero data retention, no training on your data, or even full data encryption. At first glance, these promises sound like a guarantee of privacy. But what do they actually mean in practice?

What does “zero data retention” mean?

When a company advertises zero data retention (or ZDR), it usually means that the data you send to their servers is never stored at all or is deleted immediately after processing. If a company temporarily holds data for administrative or debugging purposes, whether for the typical 30 days or for just a few minutes, then it cannot claim to offer true zero data retention.

Even when companies claim zero data retention, your data might still be stored or used for training. Here’s the catch: these policies often apply only to the company’s gateway server that receives your request and forwards it. The actual AI model, typically hosted on a separate server, may have different data handling practices. The underlying GenAI service itself may still process, log, or briefly store the data before it disappears. This is why “zero data retention” claims require careful scrutiny. Unless providers explicitly guarantee that no data is retained anywhere, including by the AI service itself, these promises may not deliver true end-to-end privacy.

What about “full data encryption”?

Encryption is another term that is sometimes misunderstood. When providers claim that your data is “fully encrypted,” they usually mean it is encrypted while stored on their servers or while being transmitted. However, at some point, your text must be decrypted and sent to the Generative AI server so that the AI model can process it. That means the AI service itself can still “see” your data, at least while handling your request. And unless the provider also guarantees zero data retention, applied not only to its own servers but to the GenAI service as well, that data may still be retained after processing.

Why “No Training” Isn’t the Whole Story

Some providers reassure users that their data will not be used for training AI models. While this is important, it doesn’t address the confidentiality concern: the fact that your text may still pass through third-party servers and remain accessible (in some form) for a limited time.

How to Verify “Zero Data Retention” Claims

If privacy is critical, here are some steps you can take to make sure a vendor’s zero data retention claim is trustworthy:

1) Ask for explicit contractual language: For enterprise or paid plans, request the exact wording in the terms of service or contract with the Generative AI service provider they are using. Make sure it clearly states that data is not stored or used for training by the GenAI service itself, not just the vendor’s gateway server.

2) Check the provider’s official documentation: Look for statements in API or enterprise documentation confirming data retention policies.

3) Review security and compliance certifications: Certifications and compliance frameworks like SOC 2, ISO 27001, HIPAA, or GDPR demonstrate robust data management practices. They don’t guarantee zero retention on their own, but they do indicate the company takes data security seriously.

4) Request a Technical Explanation: Ask how your data flows through the system. Confirm whether it is encrypted in transit, temporarily stored, or processed in a multi-tenant environment. A trustworthy provider should be able to explain this clearly.

5) Explore Local Alternatives: If absolute privacy is required, consider running a local model (like Ollama or other open-source options) on your own hardware. That way, your data never leaves your machine and you are fully in control.

Zero Data Retention in Peril: When Courts Override Privacy Promises

Even the strongest data retention policies can crumble under legal pressure. A striking recent example demonstrates this vulnerability: in May 2025, a federal court ordered OpenAI to indefinitely retain all user data, including both ChatGPT conversations and API outputs, that would normally be deleted under their standard policies (see the court order here). This order affects many OpenAI services, whether accessed through web interfaces or API calls, effectively suspending their promised data deletion practices. While OpenAI states there are some conditions under which such policies are still in place (see OpenAI response), the scope and duration of these exceptions remain uncertain.

This legal intervention reveals a fundamental flaw in relying on any cloud-based privacy guarantees. While this particular case involves OpenAI, similar court orders could potentially affect any GenAI provider regardless of their stated privacy policies. For organizations with strict confidentiality requirements, this exposes a critical blind spot: your data’s ultimate fate depends not only on a provider’s stated policies, but also on their legal battles, battles you have no control over. While court-ordered data preservation isn’t an everyday occurrence, it underscores why truly sensitive data may only be safe when it never leaves your own systems.

What About QDA Miner and WordStat?

As mentioned above, our software applications use API calls to access those Generative AI services directly. Your data’s location depends on your chosen provider: OpenAI, Claude, and Gemini process data on US-based servers; Mistral operates from France; and DeepSeek processes and stores data in China. According to all those companies, the data is not used for training, but it is typically stored for 30 days for administrative and maintenance purposes (yet, as we saw, this may no longer be true, as GenAI companies may be forced to keep your data indefinitely). For this reason, we do not and cannot claim zero data retention, and even providers claiming to meet such a standard today may have to retract those claims, now or in the near future.

If your work requires strict privacy, we offer you two options:

1. Disable GenAI access: Both QDA Miner and WordStat offer two ways to disable internet access, preventing the use of GenAI features. The first is a software setting visible to the user (which means it can be modified); the second is an administrative option that prevents users from re-enabling GenAI access. You will still have access to other local AI features such as topic modeling and word embedding in WordStat, or Query-by-Examples, Cluster Coding, and Similarity Searches in QDA Miner. These “traditional” AI features run entirely on your computer, and the data they process never leaves your system.

2. Use Ollama models: Among the various GenAI engines we support, we implemented a connection to Ollama, a service that runs LLM models locally. In other words, the model runs directly on your machine, and no data is transmitted to the cloud. The only requirement is a capable computer with sufficient graphics processing power: at least 8 GB of VRAM, though 16 GB or more is recommended for optimal performance. This setup ensures that your data stays fully private and under your control.
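The local-versus-cloud distinction can be checked programmatically. Here is a minimal Python sketch (not our software’s actual code) that builds a request payload in the shape used by Ollama’s /api/generate endpoint and verifies whether an endpoint URL points at the local machine; the model name and prompt are placeholders, and it assumes Ollama’s default install on port 11434:

```python
from urllib.parse import urlparse

# Ollama's default local endpoint (assumption: default install, port 11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

def is_local_endpoint(url: str) -> bool:
    """Return True when the endpoint is the local machine, meaning
    no text would leave your computer when calling it."""
    host = urlparse(url).hostname or ""
    return host in ("localhost", "127.0.0.1", "::1")

def build_request(model: str, prompt: str) -> dict:
    """Payload shape used by Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

if __name__ == "__main__":
    # Placeholder model name and prompt, for illustration only.
    payload = build_request("llama3", "Summarize this interview transcript...")
    print(is_local_endpoint(OLLAMA_URL))                                # local: data stays here
    print(is_local_endpoint("https://api.openai.com/v1/chat/completions"))  # remote: data leaves
```

A check like this makes the privacy trade-off concrete: the same request payload sent to localhost never crosses the network, while the identical payload sent to a cloud endpoint is subject to that provider’s retention policies.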

In conclusion:

Marketing promises like zero data retention and full encryption may sound reassuring, but they don’t guarantee complete privacy. When confidentiality is paramount, the only truly safe approach is keeping your data local, either by avoiding cloud-based GenAI services entirely or using local models that never transmit data beyond your own system.

It’s important to note that these privacy concerns shouldn’t discourage the use of cloud-based generative AI for all applications. When working with non-confidential data, these services can be incredibly valuable tools. For instance, analyzing publicly available datasets, such as government reports, published research papers, news articles, or open-source social media content, poses no confidentiality risks since the information is already in the public domain. Similarly, educational projects using anonymized or synthetic data, marketing content creation, general research on publicly available information, or prototype development with dummy data can all benefit from the power and convenience of cloud-based AI services. The key is making an informed decision based on your data’s sensitivity level: use the robust capabilities of cloud services when privacy isn’t a concern, but switch to local solutions when confidentiality is paramount.