I get it. Almost everyone has done it. Your users’ data is sitting there ready to use, just copy-paste that to your dev environment and there you have some good test data for yourself. Job done - easy - you can move on to the next task.
Worse even, when you need to make sure that your dev/staging data is up to date - you spend your day setting up a cron job that routinely syncs production data to your dev or UAT environments.
If you’re getting an uneasy feeling every time you do this - you’re not alone. So why do people still do it? Here’s what I found to be the dominant reasons for this, and my thoughts on why those reasons are bunk.
This is rarely a valid reason to do anything, yet sometimes we inherit a process that deep down we know is flawed, but perhaps feel like we don't have enough knowledge or authority to challenge.
I've been there. When you are new at a company, telling more senior people they're doing things wrong seems intimidating, but if you have a viable alternative (more on that below) you can suggest (and ideally deliver on), your co-workers will appreciate it.
Having been in the data privacy space for a while now, it is painfully obvious how often someone actually finds out. One of the big improvements that GDPR has brought about is that it put a spotlight on how frequent and widespread data leaks and breaches really are - not a day goes by without new ones being reported. In light of this, hoping that this doesn't affect you is not an ideal strategy.
By taking data to a less secure environment you waste all effort you put into security controls of your production environments. Proper privacy measures go one step beyond and look at what happens if there is a data breach - either due to a mistake, or an attack - by minimising the number of people who work with real (sensitive) data you complement your other data security measures.
Sensitive data should only really be accessed when absolutely necessary - that’s in the spirit of building proper privacy-respecting culture.
It can be really painful to pseudonymise data, especially if your data model is very large and complex. It can take even more time to set up data obfuscation pipelines that keep your test data up to date - as if this wasn’t enough - all these scripts keep breaking as your data model evolves. Basically, in so many cases, the task of obfuscating data while preserving its complexity and referential integrity is just too much hassle. Shameless plug: I have some good news and some better news.
Synth is a declarative data generator that can enable you quickly generate realistic looking test data. Better still, it’s completely free and open source - we built it with developers in mind and really hope it can help address this problem once and for all. Our goal is to make it possible for you to get going with realistic-looking fake data in no time. 1
Let me preface everything I say by stating that I am not a lawyer, but I have been in this space for a while now and have spoken to numerous consultants, lawyers and experts on this matter. GDPR can be vague on certain things, but in my opinion using production data for testing is clearly a no-go. (See more on this in the notes). If things go wrong it will be very hard to justify to the regulators why sensitive data was copied to and used in its raw form in testing environments.2
Last but not least, trust is 🔑 - the way you treat your users’ data in a way defines your relationship with them. Being responsible about the data that your users trust you with is incredibly important in building a good and healthy culture for your company or project. Trust is hard to build and easily lost. This may seem trivial, but these “small” decisions define the future of your company, not the “culture” page on your website.
Being responsible is a team effort, but it pays off long-term. 3
- Synth goes one step beyond pseudonimisation - it creates completely fake data that looks and quacks like the real data. Data generated with Synth can be shared more freely.↩
- GDPR Article 25 and 32 refer to the requirements around implementing pseudonymisation.↩
- We often get asked what is the right stage of the project/company to start using something like Synth. The best answer is right from the start, the second best is right now. With our declarative “data as code” framework, your test data generation will evolve alongside your data model.↩