Data Privacy and Generative AI:The Truth About Common Security Promises July 5, 2025 - Blogs on Text Analytics
For many people and organizations, data privacy isn’t just a preference, it’s non-negotiable. Beyond preventing their information from training future AI models, they need absolute assurance that sensitive data is never exposed to others, not even for a moment. When working with text data, whether in research, business, or social impact projects, trust in how that data is handled is essential.
A first important distinction is how you access a generative AI service.
• Web interface (chat-style services): When you log in through a provider’s website (like ChatGPT, Claude, or Gemini in a browser), your conversations are generally not private. They are often stored indefinitely, and, unless you explicitly disable it, they may be used for training and quality improvement. Even when companies allow you to “turn off training,” the conversations may still remain accessible to them.
• API access (software integration): By contrast, when generative AI is used through an API (as in our software), the rules are different. Most GenAI providers state that data sent via API is not used for training. However, they typically retain it temporarily, often for 30 days, for system monitoring and debugging purposes.
When a software company integrates GenAI, it is currently done using API calls, so while the data is not used for training, it is typically kept for 30 days before deletion. This is the case for our own software QDA Miner and WordStat. Now, in recent years, many software companies integrating generative AI have started to advertise zero data retention, no training on your data, or even full data encryption. At first glance, these promises sound like a guarantee of privacy. But what do they actually mean in practice?
What does “zero data retention” mean?
When a company advertises zero data retention (or ZDR), it usually means that the data you send to their servers is never stored at all or is deleted immediately after processing. If a company temporarily holds data for administrative or debugging purposes, typically for 30 days, but even for just a few minutes, then they cannot claim to offer true zero data retention.
Even when companies claim zero data retention, your data might still be stored or used for training. Here’s the catch: these policies often apply only to the company’s gateway server that receives your request and forwards it. The actual AI model, typically hosted on a separate server, may have different data handling practices. The underlying GenAI service itself may still process, log, or briefly store the data before it disappears. This is why “zero data retention” claims require careful scrutiny. Unless providers explicitly guarantee that no data is retained anywhere, including by the AI service itself, these promises may not deliver true end-to-end privacy.
The “Multi-Tenant” vs. “Private Instance” Trap: It is also critical to look at where the model is running. When a vendor promises ZDR on a standard public cloud API, your data is still traveling through a multi-tenant environment, meaning your text passes through the exact same shared server queues as every other company on Earth. True data sovereignty requires moving away from shared cloud endpoints toward isolated, private server infrastructure.
What about “full data encryption”?
Encryption is another term that is sometimes misunderstood. When providers claim that your data is “fully encrypted,” they usually mean it is encrypted while stored on their servers or while being transmitted. However, at some point, your text must be decrypted and sent to the Generative AI server so that the AI model can process it. That means the AI service itself can still “see” your data, at least while handling your request. And unless they also claim a full zero data retention applied not only to their server but to the GenAI service as well.
Why “No Training” Isn’t the Whole Story
Some providers reassure users that their data will not be used for training AI models. While this is important, it doesn’t address the confidentiality concern: the fact that your text may still pass through third-party servers and remain accessible (in some form) for a limited time.
How to Verify “Zero Data Retention” Claims
If privacy is critical, here are some steps you can take to make sure a vendor’s zero data retention claim is trustworthy:
1) Ask for explicit contractual language: For enterprise or paid plans, request the exact wording in the terms of service or contract with the Generative AI service provider they are using. Make sure it clearly states that data is not stored or used for training by the GenAI service itself, not just the vendor’s gateway server.
2) Check the provider’s official documentation: Look for statements in API or enterprise documentation confirming data retention policies.
3) Review security and compliance certifications: Certifications like SOC 2, ISO 27001, HIPAA, or GDPR demonstrate robust data management practices. They don’t guarantee zero retention on their own, but they do indicate the company takes data security seriously.
4) Request a Technical Explanation: Ask how your data flows through the system. Confirm whether it is encrypted in transit, temporarily stored, or processed in a multi-tenant environment. A trustworthy provider should be able to explain this clearly.
5) Explore Local and Private Infrastructure: If absolute privacy is required, look for software that decouples from public APIs entirely. Instead of sending data to a third-party cloud, you should have the flexibility to host open-weight models locally on your own workstation (using frameworks like Ollama or LM Studio) or route requests through your organization’s internal, private secure servers.
Zero Data Retention in Peril: When Courts Override Privacy Promises
Even the strongest data retention policies can crumble under legal pressure. A striking recent example demonstrates this vulnerability: in May 2025, a federal court ordered OpenAI to indefinitely retain all user data, including both ChatGPT conversations and API outputs, that would normally be deleted under their standard policies (see the court order here). This order affects many OpenAI services, whether accessed through web interfaces or API calls, effectively suspending their promised data deletion practices. While OpenAI states there are some conditions under which such policies are still in place (see OpenAI response), the scope and duration of these exceptions remain uncertain.
This legal intervention reveals a fundamental flaw in relying on any cloud-based privacy guarantees. While this particular case involves OpenAI, similar court orders could potentially affect any GenAI provider regardless of their stated privacy policies. For organizations with strict confidentiality requirements, this exposes a critical blind spot: your data’s ultimate fate depends not only on a provider’s stated policies, but also on their legal battles, battles you have no control over. While court-ordered data preservation isn’t an everyday occurrence, it underscores why truly sensitive data may only be safe when it never leaves your own systems.
What About QDA Miner and WordStat
As we mentioned before, our software applications use API calls to directly access Generative AI services. If you choose to use public cloud providers, your data’s location depends entirely on that provider: OpenAI, Claude, and Gemini process data on US-based servers; Mistral operates from France; and DeepSeek processes data in China. While these companies state that API data is not used for model training, they traditionally retain it for up to 30 days for administrative monitoring, and as the 2025 federal court rulings have shown, legal interventions can force cloud companies to preserve data indefinitely.
The reality is straightforward: no software vendor can guarantee end-to-end cloud privacy if the system forces you to route data through a third-party’s public API endpoint. To provide researchers and institutional review boards (IRBs) with the practical control they need, we have expanded the configuration options for our GenAI integrations. QDA Miner and WordStat do not lock you into a single cloud provider or a fixed data pipeline. By allowing users to customize server endpoints, choose regional routing, or execute models locally, the software supports four distinct paths to match your specific data security requirements:
- Public Cloud with Regional Controls (e.g., European Servers): For organizations that utilize cloud workflows but must adhere to strict regional data protection regulations (such as GDPR), we provide explicit routing controls. For example, when using OpenAI, you can configure the software to target European-based servers. This ensures that your data payloads remain within specific geographic legal jurisdictions, avoiding the regulatory complexities of transatlantic data transfers.
- Local AI Engines (Ollama and LM Studio): If your work requires absolute confidentiality, you can completely cut the cord to the cloud. In addition to our existing integration with Ollama, we have added support for LM Studio. Both of these local orchestration engines allow you to run open-weight models directly on your hardware. When using these tools, the LLM runs entirely within the RAM of your local workstation. Zero bytes of data leave your computer. The only requirement is a capable computer with a dedicated graphics card (a minimum of 8 GB of VRAM is required, though 16 GB or more is highly recommended for optimal processing speed).
- Private Organization Servers (Institutional Gateways): For universities, government agencies, and enterprises that maintain their own secure AI infrastructure, QDA Miner and WordStat now allow you to configure and switch between up to three distinct private server URLs. Our system standardizes on the universal OpenAI API communication protocol. If your institution hosts its own private, firewalled instances of models on platforms like Azure OpenAI or Google Cloud (Vertex AI), you can point our software directly to your organization’s secure network endpoint. If the enterprise server requires specific authentication wrappers, a researcher can simply run a lightweight local gateway (like LiteLLM) on their workstation to act as a seamless, secure pass-through proxy. Your data stays entirely within your institution’s approved cloud container, managed completely by your own IT department.
- Total Disabling of Outbound Internet Access:While Retaining Local AI: For institutions that require strict network isolation, QDA Miner and WordStat include a universal Internet Access Toggle. Turning this setting off completely cuts the software’s connection to the outside web. It disables all outbound HTTPS and API traffic, which halts not only public and private cloud AI access, but also general web features like automated software updates,and geocoding services. To help organizations enforce these rules, we offer two levels of control:
User-Level Toggle: A setting within the software options that allows the user to turn off all outbound internet operations manually.
Administrator-Level Lockout: A deployment configuration designed for IT managers that forces the internet access option to “Off” and completely removes the toggle from the user interface, preventing users from re-enabling it.
Crucially, activating this restriction does not look like a complete loss of capability:
Offline Local Processing remains fully functional: Because local tools like Ollama and LM Studio operate entirely within the machine’s internal memory space (localhost), they require zero internet access to function. Researchers can continue to use full generative text extraction on a completely disconnected or air-gapped computer.
Traditional Analytics are unchanged: You retain full access to all traditional text analytics and local machine learning features (such as Topic Modeling in WordStat or Cluster Coding, Query-by-Examples, and Coding Similarity features in QDA Miner), which run entirely on the local CPU or GPU.
Note for IT Administrators: If your organization requires a more granular approach, such as the ability to block public AI access while whitelisting a specific internal corporate server endpoint through your firewall, please contact us. We are actively looking to collaborate with institutional IT departments to tailor these administrative deployment features to your specific network infrastructure needs.
In conclusion:
Marketing promises like zero data retention and full encryption may sound reassuring, but they do not guarantee complete privacy. When confidentiality is paramount, the only truly safe approach is to avoid public cloud-based GenAI services entirely. Instead, you should keep your data local by utilizing offline models that never transmit data beyond your computer, or route your queries through a private server that keeps the data strictly within your organization.
It’s important to note that these privacy concerns shouldn’t discourage the use of cloud-based generative AI for all applications. When working with non-confidential data, these services can be incredibly valuable tools. For instance, analyzing publicly available datasets such as government reports, published research papers, news articles, or open-source social media content, poses no confidentiality risks since the information is already in the public domain. Similarly, educational projects using anonymized or synthetic data, marketing content creation, general research on publicly available information, or prototype development with dummy data can all benefit from the power and convenience of cloud-based AI services.
Ultimately, the key is making an informed decision based on your data’s sensitivity level. By understanding the underlying architecture of these systems, researchers and organizations can choose the exact level of security their data requires—leveraging the robust capabilities of cloud services when privacy isn’t a concern, or switching to local solutions to enforce the absolute sovereignty of an offline, local environment.