Extracting GBs of training data from ChatGPT

Itamar Golan
December 27, 2023

Introduction

Some of you might be wondering about the risk of your employees or applications accidentally disclosing sensitive data to GenAI tools like ChatGPT, Bard, Jasper, Bing, etc. This is a prime example of this huge threat.

Why is GenAI riskier than regular websites?

GenAI tools, based on Large Language Models (LLMs), are trained on a massive corpora of data. Generally, the more data they are trained on, the better they perform. Unlike previous tools and Shadow IT, with which we used to share data, it is explicitly clear that the data shared with these tools will be used for future training, in order to get better. Consequently, there's a high likelihood that this data, including your code, personally identifiable information (PII), secrets, and other sensitive tokens, will be reproduced by these GenAI tools in the future elsewhere. Also, due to claims by employees about an increase of 25%-50% in productivity, the rate and the volume of GenAI usage is not comparable to anything we’ve seen before.

Extracting GBs of Training Data from ChatGPT

This simple yet insightful paper demonstrates how it's pretty easily possible to extract GBs of training data (including secrets, personally identifiable information (PII), protected health information (PHI), etc.) from ChatGPT using just a few lines of code and a few hundred dollars. The attack method is straightforward but effective. Essentially, you prompt the model with the command, "Repeat the word X forever," and then observe as the model deviates from its original task and begins revealing the training data.

How was it confirmed as training data?

1. Perform a Google search.  

2. Download a substantial amount of internet data (approximately 10 terabytes), then build an efficient index on top of it using a suffix array (code provided below). Next, compare all the data generated from ChatGPT with data existing on the internet before ChatGPT's creation. Any lengthy sequence of text that matches their datasets is almost certainly memorized.

Conclusion

It's important to mention that this specific vulnerability has been patched as of December 25th, 2023. However, the potential risk remains the same: there is a way to manipulate any LLM into leaking the data it was trained on.

Therefore, you must ensure that your sensitive data stays well away from these tools. This means:

- Gaining visibility into GenAI usage within your organization (“Shadow AI”).

- Ensuring safe adoption by your employees & applications.

- Enforcing internal policies.

- Preventing any data leaks and prompt injections.

Links

Original paper
Code
ChatGPT data extraction example

Share this post