
Seeding test databases in 2021 - best practices

August 31, 2021 · 19 min read


In this tutorial, we'll learn how to design the Prisma data model for a basic message board, and how to seed a database with mock data generated by the open-source tool synth so that we can test our code.

The code for the example we are working with here can be accessed in the examples repository on GitHub.

Data modeling is not boring

What is a data model?

Data modeling (in the context of databases and this tutorial) refers to the practice of formalizing a collection of entities, their properties and relations between one another. It is an almost mathematical process (borrowing a lot of language from set theory) but that should not scare you. When it comes down to it, it is exceedingly simple and quickly becomes more of an art than a science.

The crux of the problem of data modeling is to summarize and write down what constitutes useful entities and how they relate to one another in a graph of connections.

You may wonder what constitutes a useful entity. It is indeed the toughest question to answer. It is very difficult to tackle without a good combined idea of what you are building, the database you are building on top of, and what the most common queries, operations and aggregate statistics are. There are many resources out there that will guide you through answering that question. Here we'll start at the beginning: why is it needed?

Why do I need a data model?

Getting the data model of your application right is often crucial to its performance. A bad data model for your backend can mean it gets crippled by seemingly innocuous tasks. On the other hand, a good grasp of data modeling will make your life as a developer 1000 times easier. A good data model is not a source of constant pain: it lets you develop and expand without slowing you down. It is one of those things that pays compounding returns.

Plus, there are nowadays many open-source tools that make building applications on top of data models really enjoyable. One of them is Prisma.

Prisma is awesome

Prisma is an ORM, an object-relational mapper. It is a powerful framework that lets you specify your data model using a database-agnostic domain-specific language (called the Prisma schema). It uses pluggable generators to build a nice JavaScript API and TypeScript bindings for your data model. Hook that up to your IDE and you get amazing code completion that is tailored to your data model, in addition to a powerful query engine.

Let's walk through an example. We want to get a sense for what it'll take to design the data model for a simple message board, a little like Reddit or Y Combinator's Hacker News. At the very minimum, we want to have a concept of users: people should be able to register for an account. Beyond that, we need a concept of posts: some structure, attached to users, that holds the content they publish.

Using the Prisma schema language, which is very expressive even if you haven't seen it before, our first go at writing down a User entity might look something like this:

model User {
  objectId  Bytes    @id @map("_id")
  id        Int      @unique @default(autoincrement())
  createdAt DateTime @default(now())
  email     String   @unique
  nickname  String
  posts     Post[]
}

In other words, our User entity has the properties id (a database-internal unique identifier), createdAt (a timestamp, defaulting to now if not specified, that marks the creation time of the user's account), email (the user-specified email address, given on registration, which is required to be unique: no two users can share an email address) and nickname (the user-specified display name, given on registration).

In addition, it has a property posts which links a user with its posts through the Post entity. We may come up with something like this for the Post entity:

model Post {
  objectId  Bytes    @id @map("_id")
  id        Int      @unique @default(autoincrement())
  postedAt  DateTime @default(now())
  title     String
  author    User     @relation(fields: [authorId], references: [id])
  authorId  Int
}

In other words, our Post entity has the properties id (a database-internal unique identifier); postedAt (a timestamp, defaulting to now if not specified, that marks the time at which the user created the post and published it); title (the title of the post); and author and authorId, which together specify a one-to-many relationship between users and posts.

note

You may have noticed that the User and Post models have an attribute which we haven't mentioned. The objectId property is an internal unique identifier used by MongoDB (the database we're choosing to implement our data model on in this tutorial).

Let's look closer at these last two properties, author and authorId. There is a significant difference between them with respect to how they are implemented in the database. Remember that, at the end of the day, our data model will need to be realized in a database. Because we're using Prisma, a lot of these details are abstracted away from us. In this case, the Prisma code generator will handle author and authorId slightly differently.

The @relation(...) attribute on the author property is Prisma's way of declaring that authorId is a foreign key field. Because the type of the author property is a User entity, Prisma understands that posts are linked to users via the foreign key authorId, which maps to the user's id (declared @unique in our schema, so it can serve as the referenced key). This is an example of a one-to-many relation.

How that relation is implemented is left to Prisma and depends on the database you choose to use. Since we are using MongoDB here, this is implemented by direct object id references.

Because our data model encodes the relation between posts and users, looking up a user's posts is inexpensive. This is the benefit of designing a good data model for an application: the operations you plan for at this stage are the ones the model is optimized for.
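To make the foreign-key idea concrete, here is a minimal in-memory sketch in TypeScript. It does not use Prisma or MongoDB at all: the UserRow and PostRow types and the postsByAuthor and authorOf helpers are hypothetical names that only mirror the fields of the schema above.

```typescript
// Hypothetical in-memory rows mirroring the User and Post models above.
// This illustrates the foreign-key idea; it is not Prisma's implementation.
interface UserRow {
  id: number;
  email: string;
  nickname: string;
}

interface PostRow {
  id: number;
  title: string;
  authorId: number; // foreign key: must match some UserRow.id
}

const ada: UserRow = { id: 1, email: "ada@example.com", nickname: "ada" };
const bob: UserRow = { id: 2, email: "bob@example.com", nickname: "bob" };
const sampleUsers: UserRow[] = [ada, bob];

const samplePosts: PostRow[] = [
  { id: 1, title: "Hello, board", authorId: 1 },
  { id: 2, title: "Data modeling is fun", authorId: 1 },
  { id: 3, title: "First post", authorId: 2 },
];

// One user, many posts: collect every post whose foreign key points at the user.
function postsByAuthor(posts: PostRow[], author: UserRow): PostRow[] {
  return posts.filter((post) => post.authorId === author.id);
}

// Each post has at most one author: resolve authorId back to a user.
function authorOf(users: UserRow[], post: PostRow): UserRow | undefined {
  const matches = users.filter((user) => user.id === post.authorId);
  return matches.length > 0 ? matches[0] : undefined;
}

console.log(postsByAuthor(samplePosts, ada).length); // 2
```

In a real application the generated Prisma client and the database do this lookup for us; the sketch only shows why storing authorId on each post is enough to answer "which posts belong to this user?".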

To get us started using this Prisma data model in an actual application, let's create a new npm project in an empty directory:

$ npm init

When prompted to specify the entry point, use src/index.js. Install the TypeScript compiler and the type definitions for Node with:

$ npm install --save-dev @types/node typescript

Then you can initialize the TypeScript compiler with

$ npx tsc --init

This creates a tsconfig.json file which configures the behavior of the TypeScript compiler. Create a directory src/ and add the following index.ts:

import {PrismaClient} from '@prisma/client'

const prisma = new PrismaClient()

const main = async () => {
    const user = await prisma.user.findFirst()
    if (user === null) {
        throw Error("No user data.")
    }
    console.log(`found username: ${user.nickname}`)
    process.exit(0)
}

main().catch((e) => {
    console.error(e)
    process.exit(1)
})

Then create a prisma/ directory and add a schema.prisma file containing the Prisma code for the two entities User and Post.

Finally, to our schema.prisma file, we need to add configuration for our local dev database and the generation of the client:

datasource db {
  provider = "mongodb"
  url      = "mongodb://localhost:27017/board"
}

generator client {
  provider = "prisma-client-js"
  previewFeatures = ["mongodb"]
}

Head over to the repository to see an example of the complete file, including the extra configuration.

To build the Prisma client, run

$ npx prisma generate

Finally, to run it all, edit your package.json file (at the root of your project's directory). Look for the "scripts" field and replace the "test" script with:

{
  ...
  "test": "tsc --project ./ && node ."
  ...
}

Now all we need is for an instance of MongoDB to be running while we're working. We can run that straight from the official Docker image:

$ docker run -d --name message-board-example -p 27017:27017 --rm mongo

To run the example do

$ npm run test
> message-board-example@1.0.0 test /tmp/message-board-example
> tsc --project ./ && node .
Error: No user data.

You should see something close to the output above: our simple code failed because it is looking for a user that does not exist (yet) in our dev database. We will fix that in a little bit. But first, here's a secret.

The secret to writing good code

Actually it's no secret at all. It is one of those things that everybody with software engineering experience knows. The key to writing good code is learning from your mistakes!

Coding becomes tedious when it is hard to learn from errors. Usually this is caused by a lengthy process to go from writing the code to testing it. This can happen for many reasons: having to wait for the deployment of a backend in docker compose, sitting idly by while your code compiles just to fail at the end because of a typo, the strong integration of a system with components external to it, and many more.

The process that goes from the early stages of designing something to verifying its functionality and rolling it out is what is commonly called the development cycle.

It should indeed be a cycle. Once the code is out there, deployed and running, it gets reviewed for quality and purpose. More often than not this happens because users break it and give feedback. The outcome of that gets folded into planning and designing for the next iteration or release. The agile philosophy is built on the idea that this cycle should be as short as possible.

So that raises the question: how do you make the development cycle as quick as possible? The faster the cycle is, the better your productivity becomes.

Testing, testing and more testing

One of the keys to shortening a development cycle is making testing easy. With databases and data models, testing is often hacky. In fact, there are very few tools that let you iterate quickly on data models, and even fewer that are developer-friendly.

The core issue at hand is that between iterations on ideas and features, we will need to make small and quick changes to our data model. What happens to our databases and the data in them in that case? Migration is sometimes an option but is notoriously hard and may not work at all if our changes are significant.

For development purposes the quickest solution is seeding our new data model with mock data. That way we can test our changes quickly and idiomatically.

Generate data for your data model

At Synth we are building a declarative test data generator. It lets you write your data model in plain zero-code JSON and seed many relational and non-relational databases with mock data. It is completely free and open-source.

Let's take our data model and seed a development MongoDB database instance with Synth. Then we can make our development cycle very short by using an npm script that sets it all up for us whenever we need it.

Installing synth

We'll need the synth command-line tool to get started. From a terminal, run:

$ curl -sSL https://getsynth.com/install | sh

This will run you through an install script for the binary release of synth. If you prefer installing from source, we got you: head on over to the Installation pages of the official documentation.

Once the installer script is done, try running

$ synth version
synth 0.5.4

to make sure everything works. If it doesn't work, add $HOME/.local/bin to your $PATH environment variable with

$ export PATH=$HOME/.local/bin:$PATH

and try again.

Synth schema

Just like Prisma and its schema DSL, synth lets you write down your data model with zero code.

There is one main difference: the synth schema is aimed at the generation of data. This means it lets you specify the semantics of your data model in addition to its entities and relations. The synth schema has an understanding of what an email, a username, an address are; whereas the Prisma schema only cares about top-level types (strings, integers, etc).

Let's navigate to the top of our example project's directory and create a new directory called synth/ for storing our synth schema files.

├── package.json
├── package-lock.json
├── tsconfig.json
├── prisma/
├── synth/
└── src/

Each file we will put in the synth/ directory that ends in .json will be opened by synth, parsed and interpreted as part of our data model. The structure of these files is simple: each one represents a collection in our database.

Collections

A collection is a single JSON schema file, stored in a namespace directory. Because collections are formed of many elements, their Synth schema type is that of arrays.

To get started, let's create a User.json file in the synth/ directory:

{
  "type": "array",
  "length": 1,
  "content": {
    "type": "null"
  }
}

Then run

$ synth generate synth/
{"users":[null]}

Let's break this down. Our User.json collection schema is a JSON object with three fields. The "type" represents the kind of generator we want. As we said above, collections must generate arrays. The "length" and "content" fields are the parameters we need to specify an array generator. The "length" field specifies how many elements the generated array must have. The "content" specifies from what the elements of the array are generated.

For now the value of "content" is a generator of the null type, which is why our array has null as a single element. But we will soon change this.

Note that the value of "length" can be another generator. Of course, because the length of an array is a non-negative number, it cannot be just any generator. But it can be any kind that generates non-negative numbers. For example:

{
  "type": "array",
  "length": {
    "type": "number",
    "range": {
      "low": 5,
      "high": 10,
      "step": 1
    }
  },
  "content": {
    "type": "null"
  }
}

This now makes our users collection variable length. Its length will be decided by the result of generating a new random integer between 5 and 10.

If you now run

$ synth generate synth/
{"users":[null,null,null,null,null]}

you can see the result of that change.

info

By default synth fixes the seed of its internal PRNG. This means that, by default, running synth many times on the same input schemas will give the same output data. If you want to randomize the seed, and thus randomize the result, simply add the flag --random:

$ synth generate synth/ --random
{"users":[null,null,null,null,null,null,null]}
$ synth generate synth/ --random
{"users":[null,null,null,null,null,null,null,null,null]}
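The behavior described above (a length drawn from a range, and identical output for a fixed seed) can be sketched in a few lines of TypeScript. This is only an illustration of the idea, not synth's implementation: the makeSeededRandom, intInRange and generateNullArray helpers are hypothetical, a small linear congruential generator stands in for synth's PRNG, and treating "high" as exclusive is an assumption of this sketch.

```typescript
// Hypothetical sketch: NOT synth's implementation, only the idea behind it.
// A tiny linear congruential generator (LCG) stands in for synth's PRNG.
function makeSeededRandom(seed: number): () => number {
  let state = Math.abs(Math.floor(seed)) % 4294967296;
  return () => {
    // Numerical Recipes LCG constants; intermediate values stay below 2^53.
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Pick an integer in [low, high), mirroring a "range" with a step of 1.
function intInRange(random: () => number, low: number, high: number): number {
  return low + Math.floor(random() * (high - low));
}

// An "array" generator: draw a length first, then generate that many elements.
function generateNullArray(random: () => number, low: number, high: number): null[] {
  const count = intInRange(random, low, high);
  const result: null[] = [];
  for (let i = 0; i < count; i++) {
    result.push(null);
  }
  return result;
}

// Same seed, same output: this is why repeated runs agree by default.
const firstRun = generateNullArray(makeSeededRandom(0), 5, 10);
const secondRun = generateNullArray(makeSeededRandom(0), 5, 10);
console.log(firstRun.length === secondRun.length); // true
```

Passing a different seed on each run (for example one derived from the current time) would play the role of the --random flag.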

Schema nodes

Before we can get our users collection to match our User Prisma model, we need to understand how to generate more kinds of data with synth.

Everything that goes into a schema file is a schema node. Schema nodes can be identified by the "type" field which specifies which kind of node it is. The documentation pages have a complete taxonomy of schema nodes and their "type".

Generating ids

Let's look back at our User model. It has four properties:

id
createdAt
email
nickname

Let's start with id. How can we generate that?

The type of the id property in the User model is Int:

  id        Int      @unique @default(autoincrement())

and the attribute indicates that the field is meant to increment sequentially, going through values 0, 1, 2 etc.

The synth schema type for numbers is number. Within number there are three varieties of generators: range, constant and id.

What decides the variant is the presence of the "range", "constant" or "id" field in the node's specification.

For example, a range variant would look like

{
    "type": "number",
    "range": {
        "low": 5,
        "high": 10,
        "step": 1
    }
}

whereas a constant variant would look like

{
    "type": "number",
    "constant": 42
}

For the id field we should use the id variant, which is auto-incrementing. Here is an example of id used in an array so we can see it behaves as expected:

{
    "type": "array",
    "length": 10,
    "content": {
        "type": "number",
        "id": {}
    }
}
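As a rough mental model of the id variant, here is a hypothetical TypeScript stand-in for an auto-incrementing generator. The makeIdGenerator helper and its startAt parameter are assumptions of this sketch; check the synth documentation for the actual default start value.

```typescript
// Hypothetical stand-in for an auto-incrementing "id" generator.
// startAt is an assumption of this sketch, not a documented synth default.
function makeIdGenerator(startAt: number): () => number {
  let nextId = startAt;
  return () => nextId++;
}

// Fill an array of length 10, like the schema above does.
const nextUserId = makeIdGenerator(1);
const generatedIds: number[] = [];
for (let i = 0; i < 10; i++) {
  generatedIds.push(nextUserId());
}
console.log(generatedIds.join(", ")); // 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```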

Generating emails

Let us now look at the email field of our User model:

  email     String   @unique

Its type in the data model is that of a String. The synth schema type for that is string.

There are many different variants of string and they are all exhaustively documented. The different variants are identified by the presence of a distinguishing field which can be

"faker"
"pattern"
"uuid"
and a lot more...

Since we are interested in generating email addresses, we will be using the "faker" variant which leverages a preset collection of generators for common properties like usernames, addresses and emails:

{
    "type": "string",
    "faker": {
        "generator": "safe_email"
    }
}

Generating objects

OK, so we now know how to generate the id and the email properties of our User model. But we do not yet know how to put them together in one object. For that we need the object type:

{
    "type": "object",
    "id": {
        "type": "number",
        "id": {}
    },
    "email": {
        "type": "string",
        "faker": {
            "generator": "safe_email"
        }
    }
}

Leverage the docs

Now we have everything we need to finish writing down our User model as a synth schema. A quick lookup of the documentation pages will tell us how to generate the createdAt and nickname fields.

Here is the finished result for our User.json collection:

{
    "type": "array",
    "length": 3,
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "id": {}
        },
        "createdAt": {
            "type": "string",
            "date_time": {
                "format": "%Y-%m-%d %H:%M:%S",
                "begin": "2020-01-01 12:00:00"
            }
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "safe_email"
            }
        },
        "nickname": {
            "type": "string",
            "faker": {
                "generator": "username"
            }
        }
    }
}

caution

date_time is now a generator on its own and is no longer a subtype of the string generator, so current versions of synth reject the createdAt node above with an "unknown variant date_time" error.

Making sure our constraints are satisfied

Looking back at the User model we started from, there's one thing that we did not quite address yet. The email field in the Prisma schema has the @unique attribute:

  email     String   @unique

This means that, in our data model, no two users can share the same email address. Yet, we haven't added that constraint anywhere in our final synth schema for the User.json collection.

What we need here are modifiers. A modifier is an attribute that we can add to any synth schema type to modify the way it behaves. There are two modifiers currently supported: optional and unique.

The optional modifier is an easy way to make a schema node randomly generate something or nothing:

{
    "type": "number",
    "optional": true,
    "constant": 42
}

Whereas the unique modifier is an easy way to enforce the constraint that the values generated have no duplication. So all we need to do, to represent our data model correctly, is to add the unique modifier to the email field:

{
    "type": "string",
    "unique": true,
    "faker": {
        "generator": "safe_email"
    }
}
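One plausible way to think about a unique modifier is as a wrapper that retries the underlying generator until it produces a value it has not returned before. The TypeScript sketch below illustrates that idea only; makeUnique, maxRetries and the deliberately colliding collidingEmail generator are hypothetical names, and synth's actual strategy may differ.

```typescript
// Hypothetical sketch of a "unique" modifier: wrap any string generator and
// retry until it yields a value that has not been returned before.
function makeUnique(generate: () => string, maxRetries: number): () => string {
  const seen: string[] = [];
  return () => {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      const candidate = generate();
      if (seen.indexOf(candidate) === -1) {
        seen.push(candidate);
        return candidate;
      }
    }
    throw new Error("could not generate a new unique value");
  };
}

// A deliberately collision-prone generator: it only knows three addresses.
let emailCounter = 0;
const collidingEmail = (): string => "user" + (emailCounter++ % 3) + "@example.com";

const uniqueEmail = makeUnique(collidingEmail, 5);
const firstThree = [uniqueEmail(), uniqueEmail(), uniqueEmail()];
console.log(firstThree.join(", "));
// user0@example.com, user1@example.com, user2@example.com
```

A fourth call to uniqueEmail() would exhaust its retries and throw, which is the trade-off of this approach: uniqueness can only be satisfied while the underlying generator still has unseen values to offer.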

The completed end result for the User.json collection can be viewed on GitHub here.

How to deal with relations

Now that we have set up our User.json collection, let's turn our attention to the Post model and write out the synth schema for the Post.json collection.

Here is the end result:

{
    "type": "array",
    "length": 5,
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "id": {}
        },
        "postedAt": {
            "type": "string",
            "date_time": {
                "format": "%Y-%m-%d %H:%M:%S",
                "begin": "2020-01-01 12:00:00"
            }
        },
        "title": {
            "type": "string",
            "faker": {
                "generator": "bs"
            }
        },
        "authorId": "@User.content.id"
    }
}

caution

date_time is now a generator on its own and is no longer a subtype of the string generator, so current versions of synth reject the postedAt node above with an "unknown variant date_time" error.

It all looks pretty similar to the User.json collection, except for one important difference at the line

    "authorId": "@User.content.id"

The syntax @... is synth's way of specifying relations between collections. Here we are creating a many-to-one relation between the field authorId of the Post.json collection and the field id of the User.json collection.
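What such a relation guarantees is referential integrity: every generated authorId is taken from an id that actually exists in the users collection. The TypeScript sketch below illustrates that guarantee with hypothetical helpers (generatePosts, danglingPosts); how synth itself decides which user each post points at is not modeled here.

```typescript
// Hypothetical sketch of what a relation such as "@User.content.id" guarantees:
// every generated authorId is copied from an id that exists among the users.
interface SeedUser {
  id: number;
}

interface SeedPost {
  id: number;
  authorId: number;
}

function generatePosts(users: SeedUser[], count: number, pick: () => number): SeedPost[] {
  if (users.length === 0) {
    throw new Error("cannot reference users: the collection is empty");
  }
  const posts: SeedPost[] = [];
  for (let i = 0; i < count; i++) {
    // pick() returns a number in [0, 1); map it onto an existing user.
    const author = users[Math.floor(pick() * users.length)];
    posts.push({ id: i + 1, authorId: author.id });
  }
  return posts;
}

// Referential integrity check: list the posts that point at a missing user.
function danglingPosts(users: SeedUser[], posts: SeedPost[]): SeedPost[] {
  return posts.filter((post) => users.every((user) => user.id !== post.authorId));
}

const seedUsers: SeedUser[] = [{ id: 1 }, { id: 2 }, { id: 3 }];
const seedPosts = generatePosts(seedUsers, 5, Math.random);
console.log(danglingPosts(seedUsers, seedPosts).length); // 0
```

Several posts may share the same authorId, while each post has exactly one: that is the many-to-one shape of the relation.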

The final Post.json collection schema can be viewed on GitHub here.

Synth generate

Now that our data model is implemented in Synth, we're ready to seed our test database with mock data. Here we'll use the official mongo Docker image, but if you are using a relational database like Postgres or MySQL, you can follow the same process.

To start the mongo image in the background (if you haven't done so already), run

$ docker run -d -it -p 27017:27017 --rm mongo

Then, to seed the database with synth just run

$ synth generate synth/ --size 1000 --to mongodb://localhost:27017/board

That's it! Our test mongo instance is now seeded with the data of around 100 users. Head over to the examples repository to see the complete working example.

What's next

Synth is completely free and built in the open by an amazing and fast growing community of contributors.

Join us in our mission to make test data easy and painless! We also have a very active Discord server where many members of the community would be happy to help if you encounter an issue!

Why not to use prod data for testing - and what to do instead

August 4, 2021 · 5 min read

Nodar Daneliya

Founder

I get it. Almost everyone has done it. Your users’ data is sitting there ready to use, just copy-paste that to your dev environment and there you have some good test data for yourself. Job done - easy - you can move on to the next task.

Even worse, when you need to make sure that your dev/staging data is up to date, you spend your day setting up a cron job that routinely syncs production data to your dev or UAT environments.

If you’re getting an uneasy feeling every time you do this - you’re not alone. So why do people still do it? Here’s what I found to be the dominant reasons for this, and my thoughts on why those reasons are bunk.

Why people do it (and why you shouldn’t)

1. It’s always been done this way so why change now

This is rarely a valid reason to do anything, yet sometimes we inherit a process that deep down we know is flawed, but perhaps feel like we don't have enough knowledge or authority to challenge.

I've been there. When you are new at a company, telling more senior people they're doing things wrong seems intimidating, but if you have a viable alternative (more on that below) that you can suggest, and ideally deliver on, your co-workers will appreciate it.

2. No one will find out

Having been in the data privacy space for a while now, it is painfully obvious how often someone actually finds out. One of the big improvements that GDPR has brought about is that it put a spotlight on how frequent and widespread data leaks and breaches really are - not a day goes by without new ones being reported. In light of this, hoping that this doesn't affect you is not an ideal strategy.

My colleague Andre recently posted a great thread about this.

3. It’s just for myself and a bunch of my co-workers, what’s the worst that can happen

By taking data to a less secure environment, you waste all the effort you put into the security controls of your production environments. Proper privacy measures go one step further and look at what happens if there is a data breach, whether due to a mistake or an attack. By minimising the number of people who work with real (sensitive) data, you complement your other data security measures.

Sensitive data should only really be accessed when absolutely necessary - that’s in the spirit of building a proper privacy-respecting culture.

4. There is just no f…ing time to do this properly

It can be really painful to pseudonymise data, especially if your data model is very large and complex. It can take even more time to set up data obfuscation pipelines that keep your test data up to date - as if this wasn’t enough - all these scripts keep breaking as your data model evolves. Basically, in so many cases, the task of obfuscating data while preserving its complexity and referential integrity is just too much hassle. Shameless plug: I have some good news and some better news.

Synth is a declarative data generator that lets you quickly generate realistic-looking test data. Better still, it’s completely free and open source - we built it with developers in mind and really hope it can help address this problem once and for all. Our goal is to make it possible for you to get going with realistic-looking fake data in no time. ¹

5. We’re already GDPR compliant, I can do what I want with the users’ data

Let me preface everything I say by stating that I am not a lawyer, but I have been in this space for a while now and have spoken to numerous consultants, lawyers and experts on this matter. GDPR can be vague on certain things, but in my opinion using production data for testing is clearly a no-go. (See more on this in the notes). If things go wrong it will be very hard to justify to the regulators why sensitive data was copied to and used in its raw form in testing environments.²

Final thoughts

Last but not least, trust is 🔑 - the way you treat your users’ data goes a long way towards defining your relationship with them. Being responsible about the data that your users trust you with is incredibly important in building a good and healthy culture for your company or project. Trust is hard to build and easily lost. This may seem trivial, but these “small” decisions define the future of your company, not the “culture” page on your website.

Being responsible is a team effort, but it pays off long-term. ³

Notes

¹ Synth goes one step beyond pseudonymisation - it creates completely fake data that looks and quacks like the real data. Data generated with Synth can be shared more freely.
² GDPR Articles 25 and 32 refer to the requirements around implementing pseudonymisation.
³ We often get asked what is the right stage of the project/company to start using something like Synth. The best answer is right from the start, the second best is right now. With our declarative “data as code” framework, your test data generation will evolve alongside your data model.

How to Create PostgreSQL Test Data

March 9, 2021 · 7 min read

Christos Hadjiaslanis

Founder

Introduction

Developing high quality software inevitably requires some testing data.

You could be:

Integration testing your application for correctness and regressions
Testing the bounds of your application in your QA process
Testing the performance of queries as the size of your dataset increases

Either way, the software development lifecycle requires testing data as an integral part of developer workflow. In this article, we'll be exploring 3 different methods for generating test data for a Postgres database.

Setup

In this example we'll be using Docker to host our Postgres database.

To get started you'll need to install Docker and start a container running Postgres:

% docker run -p 5432:5432 -d -e POSTGRES_PASSWORD=1234 -e POSTGRES_USER=postgres -e POSTGRES_DB=dev postgres

As you can see, we've set very insecure default credentials. This is not meant to be a robust / productionised instance, but it'll do for our testing harness.

Our Schema

In this example we'll set up a very simple schema. We're creating a basic app where we have a bunch of companies, and those companies have contacts.

CREATE TABLE companies(
   company_id SERIAL PRIMARY KEY,
   company_name VARCHAR(255) NOT NULL
);

CREATE TABLE contacts(
   contact_id SERIAL PRIMARY KEY,
   company_id INT,
   contact_name VARCHAR(255) NOT NULL,
   phone VARCHAR(25),
   email VARCHAR(100),
   CONSTRAINT fk_company
      FOREIGN KEY(company_id)
      REFERENCES companies(company_id)
);

This schema captures some business logic of our app. We have unique primary keys, we have foreign key constraints, and we have some domain-specific data types which have 'semantic meaning'. For example, the random string _SX Æ A-ii is not a valid phone number.

Let's get started.

Manual Insertion

The first thing you can do which works well when you're starting your project is to literally manually insert all the data you need. This involves just manually writing a SQL script with a bunch of INSERT statements. The only thing to really think about is the insertion order so that you don't violate foreign key constraints.

INSERT INTO companies(company_name)
VALUES('BlueBird Inc'),
      ('Dolphin LLC');

INSERT INTO contacts(company_id, contact_name, phone, email)
VALUES(1,'John Doe','(408)-111-1234','john.doe@bluebird.dev'),
      (1,'Jane Doe','(408)-111-1235','jane.doe@bluebird.dev'),
      (2,'David Wright','(408)-222-1234','david.wright@dolphin.dev');

So here we're inserting directly into our database. This method is straightforward but does not scale when you need more data or the complexity of your schema increases. Also, testing for edge cases requires you to hard-code those edge cases in the inserted data, so the amount of work grows linearly with the number of bugs you want to catch.

contact_id	company_id	contact_name	phone	email
1	1	John Doe	(408)-111-1234	john.doe@bluebird.dev
2	1	Jane Doe	(408)-111-1235	jane.doe@bluebird.dev
3	2	David Wright	(408)-222-1234	david.wright@dolphin.dev

Using generate_series to automate the process

Since you're a programmer, you don't like manual work. You like things to be seamless and most importantly automated!

Postgres comes with a handy function called generate_series which, ...drum roll... generates series! We can use this to generate as much data as we want without writing it by hand.

Let's use generate_series to create 100 companies and 100 contacts:

INSERT INTO companies(company_name)
SELECT md5(random()::text)
FROM generate_series(1,100);

INSERT INTO contacts(company_id, contact_name, phone, email)
SELECT id, md5(random()::text), md5(random()::text)::varchar(20), md5(random()::text)
FROM generate_series(1,100) id;

contact_id	company_id	contact_name	phone	email
1	1	81cc02c106b7c30d4e2b032c91cdb75a	d056f1eee1dca55db03c	cd0da2eef81aaa02d6ba15ef4551fb9f
2	2	d2b0112bc9bbec85c5229a4b4f28a350	07ba86b1dc24cdadfd24	7404f5b502084563f2ac20c29ed0e584
3	3	64005702ecaff9f489e8074d6a718aae	50db9534b58e0616cd34	3ea36293665aa1ac38e7d6371893046a
4	4	202e87bc3d0c8c080048b2c0138c709b	65f6ea317bd0f2c950dc	8b8d9b92916f4cf77c38308f6ac4391b
5	5	8b2fd25d7b95158df5af671cb3255755	3e6ddc67aabe7164ce9a	ed32035400a7500203352f3597d2548f

We generated 100 companies and contacts here, the types are correct, but the output is underwhelming. First of all, every company has exactly 1 contact, and more importantly the actual data looks completely useless.

If you care about your data being semantically correct (i.e. text in your phone column actually being a phone number) we need to get more sophisticated.

We could define functions ourselves to generate names / phone numbers / emails etc, but why re-invent the wheel?
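For a sense of what rolling our own would involve, here is a small TypeScript sketch of a single hand-written generator. The randomDigits and fakePhoneNumber helpers are hypothetical; they only produce strings shaped like the "(408)-111-1234" values we inserted manually earlier and make no attempt at valid area codes.

```typescript
// Hypothetical hand-rolled generator: a phone number shaped like the
// "(408)-111-1234" values inserted manually earlier. Not part of Postgres or Synth.
function randomDigits(count: number, random: () => number): string {
  let digits = "";
  for (let i = 0; i < count; i++) {
    digits += Math.floor(random() * 10).toString();
  }
  return digits;
}

function fakePhoneNumber(random: () => number): string {
  return (
    "(" + randomDigits(3, random) + ")-" +
    randomDigits(3, random) + "-" +
    randomDigits(4, random)
  );
}

// Prints a different number on each run, always in the same shape.
console.log(fakePhoneNumber(Math.random));
```

Names, emails, addresses and locale-specific formats would each need a similar function, plus the plumbing to turn the results into INSERT statements, which is exactly the wheel we would rather not re-invent.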

Using a data generator like Synth

Synth is an open-source project designed to solve the problem of creating realistic testing data. It has integration with Postgres, so you won't need to write any SQL.

Synth uses declarative configuration files (just JSON, don't worry) to define how data should be generated. To install the synth binary, refer to the installation page.

The first step to use Synth is to create a workspace. A workspace is just a directory in your filesystem that tells Synth that this is where you are going to be storing configuration:

$ mkdir workspace && cd workspace && synth init 

Next we want to create a namespace (basically a stand-alone data model) for this schema. We do this by simply creating a subdirectory and Synth will treat it as a separate schema:

$ mkdir my_app

Now comes the fun part! Using Synth's configuration language we can specify how our data is generated. Let's start with the smaller table companies.

To tell Synth that companies is a table (or collection in the Synth lingo) we'll create a new file my_app/companies.json.

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "number",
            "id": {}
        },
        "company_name": {
            "type": "string",
            "faker": {
                "generator": "company_name"
            }
        }
    }
}

Here we're telling Synth that we have 2 columns, company_id and company_name. The first is a number, the second is a string and the contents of the JSON object define the constraints of the data.

If we sample some data using this data model we get the following:

$ synth generate my_app/ --size 2
{
  "companies": [
    {
      "company_id": 1,
      "company_name": "Campbell Ltd"
    },
    {
      "company_id": 2,
      "company_name": "Smith PLC"
    }
  ]
}

Now we can do the same thing for the contacts table by creating a file my_app/contacts.json. Here we have the added complexity of a foreign key constraint to the companies table, but we can solve it easily using Synth's same_as generator.

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "same_as",
            "ref": "companies.content.company_id"
        },
        "contact_name": {
            "type": "string",
            "faker": {
                "generator": "name"
            }
        },
        "phone": {
            "type": "string",
            "faker": {
                "generator": "phone_number",
                "locales": ["FR_FR"]
            }
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "safe_email"
            }
        }
    }
}

There is quite a bit going on here. To get an in-depth understanding of the Synth configuration, I'd recommend reading the comprehensive docs. There are tons of cool features which this schema can't really explore!
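
As a rough mental model of what same_as buys us, the Python sketch below mimics the relationship outside of Synth: each contact copies the company_id of a generated company, so the foreign key always points at an existing row. This only illustrates the constraint and is not how Synth implements it; the contact names are made up.

```python
# Stand-in for rows generated from companies.json.
companies = [
    {"company_id": 1, "company_name": "Campbell Ltd"},
    {"company_id": 2, "company_name": "Smith PLC"},
]

# Stand-in for rows generated from contacts.json: company_id mirrors
# the value produced for companies.content.company_id.
contacts = [
    {"company_id": company["company_id"], "contact_name": f"Contact {i}"}
    for i, company in enumerate(companies, start=1)
]

# Foreign key check: every contact must reference a known company.
known_ids = {company["company_id"] for company in companies}
orphans = [c for c in contacts if c["company_id"] not in known_ids]
print(f"{len(contacts)} contacts, {len(orphans)} orphans")
```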

Now that we have both of our tables modeled in Synth, we can generate data into Postgres:

$ synth generate my_app/ --to postgres://postgres:1234@localhost:5432/dev

Taking a look at the contacts table:

contact_id	company_id	contact_name	phone	email
1	1	Carrie Walsh	+44(0)117 496 0785	espinozabetty@hotmail.com
2	2	Brittany Flores	+441632 960 480	osharp@mcdaniel.com
3	3	Tammy Rodriguez	01632960737	brenda82@ward.org
4	4	Amanda Marks	(0808) 1570096	hwilcox@gonzalez.com
5	5	Kimberly Delacruz MD	+44(0)114 4960207	pgarcia@thompson.com
6	6	Jordan Williamson	(0121) 4960483	jamesmiles@weber.org
7	7	Nicholas Williams	(0131) 496 0974	fordthomas@gmail.com

Much better :)

Conclusion#

We explored 3 different ways to generate data.

Manual Insertion: OK to get you started. If your needs are basic, it's the path of least effort to a working dataset.
Postgres generate_series: This method scales better than manual insertion - but if you care about the contents of your data and have foreign key constraints you'll need to write quite a bit of bespoke SQL by hand.
Synth: Synth has a small learning curve, but it removes most of the manual labour of creating realistic testing data at scale.

In the next post we'll explore how to subset your existing database for testing purposes. And don't worry if you have sensitive / personal data - we'll cover that too.

Create realistic test data for your web app

March 8, 2021 · 12 min read

Christos Hadjiaslanis

Founder

So we've all been in this situation. You're building a Web App, you're super productive in your stack and you can go quickly - however generating lots of data to see what your app will look like with enough users and traffic is a pain.

Either you're going to spend a lot of time manually inputting data or you're going to write some scripts to generate that data for you. There must be a better way.

In this post we're going to explore how we can solve this problem using the open-source project Synth. Synth is a state-of-the-art declarative data generator - you tell Synth what you want your data to look like and Synth will generate that data for you.

This tutorial is going to use a simple MERN (Mongo Express React Node) web-app as our test subject, but really Synth is not married to any specific stack.

I'm going to assume you're working on MacOS or Linux (Windows support coming soon 🤞) and that you have NodeJS, Yarn and Docker installed.

For this example we'll be running Synth version 0.3.2.

Getting started#

As a template, we'll use a repository which will give us scaffolding for the MERN app. I picked this example because it shows how to get started quickly with a MERN stack, where the end product is a usable app you can write in 10 minutes. For our purposes, we don't really need to build it from scratch, so let's just clone the repo and avoid writing any code ourselves.

git clone https://github.com/samaronybarros/movies-app.git && cd movies-app

Next, we'll be using Docker to run an ephemeral version of our database locally. Docker is great for getting started quickly with popular software, and luckily for us MongoDB has an image on the Docker registry. So let's set up an instance of MongoDB to run locally (no username / password):

docker run -d --name mongo-on-docker -p 27017:27017 mongo

Starting the Web App#

The repository we just cloned contains a working end-to-end web-app running on a MERN stack. It's a super simple CRUD application enabling the user to add / remove some movie reviews which are persisted on a MongoDB database.

The app consists of 2 main components, a nodejs server which lives under the movies-app/server/ sub-directory, and a React front-end which lives under the movies-app/client sub-directory.

The client and server talk to each other using a standard HTTP API under /movie.

So let's get started and run the back-end:

cd server && yarn install && node index.js

And then the client (you'll need two terminals here 🤷):

cd client && yarn install && yarn start

Cool! If you navigate to http://localhost:8000/ you should see the React App running 🙂

Let's add some movies by hand#

Hold the phone. Why are we adding movies by hand since we have a tool to generate data for us?

Well, by adding a little bit of test data by hand, we can then use Synth to infer the structure of the data and create as many movies as we want. Otherwise we would have to write the entire data definition (what we call a schema) by hand.

So, let's add a couple of movies manually using the Web UI.

Create Movies

Ok, so now that we have a couple of movies, let's get started with Synth!

Synth#

In the following section we will cover how Synth fits into the Web App development workflow:

First we'll install the Synth binary
Then we'll initialize a Synth workspace in our repo to host our data model
Next we'll ingest data from MongoDB into Synth
And finally generate a bunch of fake data from Synth and back into Mongo

Installing Synth#

To install Synth on MacOS / Linux, visit the docs and choose the appropriate installation for your OS. If you are feeling adventurous, you can even build from source!

Declarative Data Generation#

Synth uses a declarative data model to specify how data is generated.

Hmmm, so what is a declarative model you may ask? A declarative model, as opposed to an imperative model, is where you 'declare' your desired end state and the underlying program will figure out how to get there.

On the other hand, an imperative model (which is what we are mostly used to) is a set of step-by-step instructions on how to get to our end state. Most popular programming languages like Java or C are imperative - your code is step-by-step instructions on how to reach an end state.

Programming frameworks like SQL or React or Terraform are declarative. You don't specify how to get to your end-state, you just specify what you want and the underlying program will figure out how to get there.

With Synth you specify what your desired dataset should look like, not how to make it. Synth figures out how to build it for you 😉
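
To make the contrast concrete, here is a small Python sketch. The first function spells out the steps imperatively; the second part only declares the shape of the data and leaves the "how" to a tiny interpreter. The spec format is invented for this illustration and is not Synth's schema language.

```python
import random


def imperative_ratings(count: int, rng: random.Random) -> list:
    """Imperative: we spell out every step of building the list."""
    ratings = []
    for _ in range(count):
        ratings.append(rng.randrange(0, 11))
    return ratings


# Declarative: describe the desired data; no steps are given.
# (Hypothetical mini-format, not Synth's configuration language.)
SPEC = {
    "type": "array",
    "length": 5,
    "content": {"type": "int", "low": 0, "high": 10},
}


def generate(spec: dict, rng: random.Random):
    """A tiny interpreter that works out how to satisfy the spec."""
    if spec["type"] == "array":
        return [generate(spec["content"], rng) for _ in range(spec["length"])]
    if spec["type"] == "int":
        return rng.randint(spec["low"], spec["high"])  # both ends inclusive
    raise ValueError(f"unknown type: {spec['type']}")


rng = random.Random(0)
imperative = imperative_ratings(5, rng)
declarative = generate(SPEC, rng)
print(imperative, declarative)
```

Changing the dataset in the declarative version means editing SPEC only; the interpreter stays untouched.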

Creating a Workspace#

A workspace represents a set of synthetic data namespaces managed by Synth. Workspaces are marked by a .synth/ sub-directory.

A workspace can have zero or more namespaces, where the namespaces are just represented as sub-directories. All information pertaining to a workspace is in its directory.

So let's create a sub-directory called data/ and initialize our Synth workspace.

movies-app $ mkdir data && cd data && synth init

Namespaces#

The namespace is the top-level abstraction in Synth. Namespaces are the equivalent of Schemas in SQL-land. Fields in a namespace can refer to other fields in the same namespace - but you cannot reference data across namespaces.

Namespaces in turn, have collections which are kind of like tables in SQL-land. A visual example of the namespace/collection hierarchy can be seen below.

[Figure: the namespace/collection hierarchy]
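
The same hierarchy can be read off the filesystem. The Python sketch below walks a workspace directory and lists each namespace with its collections, assuming (as described above) that namespaces are sub-directories and that each collection is a .json file inside one. The list_namespaces helper is made up for this example, the cinema and movies names anticipate the namespace created in the next section, and the demo builds a throwaway directory rather than touching a real workspace.

```python
import tempfile
from pathlib import Path


def list_namespaces(workspace: Path) -> dict:
    """Map each namespace (sub-directory) to its collection names."""
    layout = {}
    for entry in sorted(workspace.iterdir()):
        # Skip plain files and hidden directories such as .synth/.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        layout[entry.name] = sorted(p.stem for p in entry.glob("*.json"))
    return layout


# Build a throwaway workspace that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / ".synth").mkdir()
    (workspace / "cinema").mkdir()
    (workspace / "cinema" / "movies.json").write_text("{}")
    layout = list_namespaces(workspace)

print(layout)
```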

To create a namespace, we need to feed some data into Synth.

Feeding Data into Synth#

There are two steps to feed data into Synth from our MongoDB instance:

We need to export data from MongoDB into a format that Synth can ingest. Luckily for us, Synth supports JSON out of the box, so this can be done quite easily with the mongoexport command, a lightweight tool that ships with MongoDB to enable quick dumps of the database via the CLI. We need to specify a little bit more metadata, such as the database we want to export from using --db cinema, the collection using --collection and the specific fields we are interested in using --fields name,rating,time. We want the data from mongoexport to be in a JSON array so that Synth can easily parse it, so let's specify the --jsonArray flag.
Next, we need to create a new Synth namespace using the synth import command. synth import supports a --from flag if you want to import from a file, but if this is not specified it will default to reading from stdin. We need to feed the output of the mongoexport command into Synth. To do this we can use the convenient Bash pipe | to redirect the stdout from mongoexport into Synth's stdin.

docker exec -i mongo-on-docker mongoexport \
    --db cinema \
    --collection movies \
    --fields name,rating,time \
    --forceTableScan \
    --jsonArray | synth import cinema --collection movies
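
The --jsonArray flag matters because, without it, mongoexport writes one JSON document per line rather than a single array. The Python sketch below shows the difference from a consumer's point of view. The two sample movies are invented and simplified (a real export also carries the _id field).

```python
import json

# Default mongoexport output: one JSON document per line.
json_lines = (
    '{"name": "Movie A", "rating": 7.5, "time": ["14:00"]}\n'
    '{"name": "Movie B", "rating": 9.0, "time": ["20:30"]}\n'
)

# With --jsonArray: a single JSON array holding every document.
json_array = (
    '[{"name": "Movie A", "rating": 7.5, "time": ["14:00"]},'
    ' {"name": "Movie B", "rating": 9.0, "time": ["20:30"]}]'
)

# Line-delimited output has to be parsed one line at a time...
from_lines = [json.loads(line) for line in json_lines.splitlines()]
# ...whereas the array form is one valid JSON value.
from_array = json.loads(json_array)

print(from_lines == from_array)
```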

Synth runs an inference step on the JSON data that it's fed, trying to infer the structure of the data. Next Synth automatically creates the cinema namespace by creating the cinema/ sub-directory and populates it with the collection movies.json.

$ tree -a data/
data/
├── .synth
│   └── config.toml
└── cinema
    └── movies.json
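
To get a feel for what an inference step involves, here is a deliberately naive Python sketch that derives a type description from a sample document. It is not Synth's algorithm: the infer function and its output format are invented here, and it only handles the JSON shapes that appear in this tutorial.

```python
def infer(value):
    """Return a rough type description for one JSON value."""
    if isinstance(value, bool):
        return {"type": "bool"}
    if isinstance(value, (int, float)):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        # Describe the array by its first element, if it has one.
        content = infer(value[0]) if value else {"type": "unknown"}
        return {"type": "array", "content": content}
    if isinstance(value, dict):
        fields = {key: infer(item) for key, item in value.items()}
        return {"type": "object", "fields": fields}
    return {"type": "unknown"}


# One hand-entered movie, shaped like the exported documents.
sample = {"name": "Some Movie", "rating": 7.5, "time": ["14:00", "19:30"]}
schema = infer(sample)
print(schema)
```

A purely structural pass like this cannot know that a string is meant to be a time of day, so values generated from it would be structurally right but semantically meaningless.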

We can now use this namespace to generate some data:

$ synth generate cinema/
{
  "movies": [
    {
      "_id": {
        "$oid": "2D4p4WBXpVTMrhRj"
      },
      "name": "2pvj5fas0dB",
      "rating": 7.5,
      "time": [
        "TrplCeFShATp2II422rVdYQB3zVx"
      ]
    },
    {
      "_id": {
        "$oid": "mV57kUhvdsWUwiRj"
      },
      "name": "Ii7rH2TSjuUiyt",
      "rating": 2.5,
      "time": [
        "QRVSMW"
      ]
    }
  ]
}

So now we've generated data with the same schema as the original - but the value of the data points doesn't really line up with the semantic meaning of our dataset. For example, the time array is just garbled text, not actual times of the day.

The last step is to tweak the Synth schema and create some realistic looking data!

Tweaking the Synth schema#

So let's open cinema/movies.json in our favorite text editor and take a look at the schema:

{
  "type": "array",
  "length": {
    "type": "number",
    "subtype": "u64",
    "range": {
      "low": 1,
      "high": 4,
      "step": 1
    }
  },
  "content": {
    "type": "object",
    "time": {
      "type": "array",
      "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
          "low": 1,
          "high": 2,
          "step": 1
        }
      },
      "content": {
        "type": "one_of",
        "variants": [
          {
            "weight": 1.0,
            "type": "string",
            "pattern": "[a-zA-Z0-9]*"
          }
        ]
      }
    },
    "name": {
      "type": "string",
      "pattern": "[a-zA-Z0-9]*"
    },
    "_id": {
      "type": "object",
      "$oid": {
        "type": "string",
        "pattern": "[a-zA-Z0-9]*"
      }
    },
    "rating": {
      "type": "number",
      "subtype": "f64",
      "range": {
        "low": 7.0,
        "high": 10.0,
        "step": 1.0
      }
    }
  }
}

There is a lot going on here but let's break it down.

The top-level object (which represents our movies collection) is of type array - where the content of the array is an object with 4 fields, _id, name, time, and rating.

We can completely remove the _id field since it is automatically managed by MongoDB, and get started on making our data look real. You may want to keep the Generators Reference open while doing so.

Rating#

First let's change the rating field. Our app can only accept numbers between 0 and 10 inclusive in increments of 0.5. So we'll use the Number::Range content type to represent this and replace the existing value:

{
    "range": {
        "high": 10,
        "low": 0,
        "step": 0.5
    },
    "subtype": "f64",
    "type": "number"
}
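
It can help to enumerate the values a range like this allows. The Python sketch below does so under the assumption that high is exclusive, as the Time section notes for the length range. If the same rule applies here, 10 itself is never produced and high would need to be raised by one step to include it. Check the Generators Reference for the exact bounds behaviour in your Synth version; the range_values helper is hypothetical.

```python
import math


def range_values(low: float, high: float, step: float) -> list:
    """List low, low + step, ... up to but not including high."""
    count = math.ceil((high - low) / step)
    return [low + i * step for i in range(count)]


values = range_values(low=0, high=10, step=0.5)
print(len(values), values[0], values[-1])

# Raising high by one step brings 10 into range under this assumption.
inclusive = range_values(low=0, high=10.5, step=0.5)
print(inclusive[-1])
```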

Time#

The time field has been correctly detected as an array of values. First of all, let's say a movie can be shown up to 5 times a day, so we'll change the high field at time.length.range to 6 (high is exclusive). At this stage, the values are just random strings, so let's instead use the String::DateTime content type to generate hours of the day.

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
            "low": 1,
            "high": 6,
            "step": 1
        }
    },
    "content": {
        "type": "one_of",
        "variants": [
            {
                "weight": 1.0,
                "type": "string",
                "date_time": {
                    "subtype": "naive_time",
                    "format": "%H:%M",
                    "begin": "12:00",
                    "end": "23:59"
                }
            }
        ]
    }
}
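
For intuition, the Python sketch below produces the kind of values this schema describes: a random time of day between begin and end, rendered with the %H:%M format. It uses the standard library rather than Synth, and the random_time helper is made up for this example.

```python
import random
from datetime import datetime, timedelta

FORMAT = "%H:%M"


def random_time(begin: str, end: str, rng: random.Random) -> str:
    """Pick a minute between begin and end (inclusive) and format it."""
    start = datetime.strptime(begin, FORMAT)
    stop = datetime.strptime(end, FORMAT)
    minutes = int((stop - start).total_seconds() // 60)
    picked = start + timedelta(minutes=rng.randint(0, minutes))
    return picked.strftime(FORMAT)


rng = random.Random(7)
# Between 1 and 5 showings, matching "up to 5 times a day".
showings = [random_time("12:00", "23:59", rng) for _ in range(rng.randint(1, 5))]
print(showings)
```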

Caution: date_time is now a generator on its own and is no longer a subtype of the string generator.

Name#

Finally, the movie name field should be populated with realistic looking movie names.

Under the hood, Synth uses the Python Faker library to generate so-called 'semantic types' (think credit card numbers, addresses, license plates etc.). Unfortunately Faker does not have movie names, so we can use a random text generator with a capped output size instead.

So let's use the String::Faker content type to generate some fake movie names!

{
    "type": "string",
    "faker": {
        "generator": "text",
        "max_nb_chars": 20
    }
}

Final Schema#

So, making all the changes above, we can use our beautiful finished schema to generate data for our app:

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
            "low": 1,
            "high": 2,
            "step": 1
        }
    },
    "content": {
        "type": "object",
        "name": {
            "type": "string",
            "faker": {
                "generator": "text",
                "max_nb_chars": 20
            }
        },
        "time": {
            "optional": false,
            "type": "array",
            "length": {
                "type": "number",
                "subtype": "u64",
                "range": {
                    "low": 1,
                    "high": 6,
                    "step": 1
                }
            },
            "content": {
                "type": "one_of",
                "variants": [
                    {
                        "weight": 1.0,
                        "type": "string",
                        "date_time": {
                            "subtype": "naive_time",
                            "format": "%H:%M",
                            "begin": "00:00",
                            "end": "23:59"
                        }
                    }
                ]
            }
        },
        "rating": {
            "range": {
                "high": 10,
                "low": 0,
                "step": 0.5
            },
            "subtype": "f64",
            "type": "number"
        }
    }
}

$ synth generate cinema/ --size 5
{
  "movies": [
    {
      "name": "Tonight somebody.",
      "rating": 7,
      "time": [
        "15:17"
      ]
    },
    {
      "name": "Wrong investment.",
      "rating": 7.5,
      "time": [
        "22:56"
      ]
    },
    {
      "name": "Put public believe.",
      "rating": 5.5,
      "time": [
        "20:32",
        "21:06",
        "16:15"
      ]
    },
    {
      "name": "Animal firm public.",
      "rating": 8.5,
      "time": [
        "20:06",
        "20:25"
      ]
    },
    {
      "name": "Change member reach.",
      "rating": 8.0,
      "time": [
        "12:36",
        "14:34"
      ]
    }
  ]
}

Ah, much better!

Generating data from Synth into MongoDB#

So now that we can generate as much correct data as we want, let's point Synth at MongoDB and let loose the dogs of war.

This step can be broken into two parts:

Run the synth generate command with our desired collection movies, specifying the number of records we want using the --size flag.
Pipe stdout to the mongoimport command, mongoexport's long-lost cousin. Again we specify the database we want to import to with --db cinema and the specific collection movies. We also want the --jsonArray flag to notify mongoimport that it should expect a JSON array.

synth generate cinema/ \
    --collection movies \
    --size 1000 \
    | docker exec -i mongo-on-docker mongoimport \
    --db cinema \
    --collection movies \
    --jsonArray

And voila! Our app now has hundreds of valid movies in our database!
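
If you would rather check that claim than take it on faith, generated records can be validated against the rules stated in this post (ratings from 0 to 10 in steps of 0.5, times formatted as %H:%M) before or after importing. The Python sketch below does this for two records copied from the sample output above plus one deliberately broken record; the check_movie helper is hypothetical.

```python
from datetime import datetime


def check_movie(movie: dict) -> list:
    """Return a list of problems found in one generated movie."""
    problems = []
    if not movie.get("name"):
        problems.append("missing name")
    rating = movie.get("rating")
    if not isinstance(rating, (int, float)) or not 0 <= rating <= 10:
        problems.append("rating out of range")
    elif (rating * 2) % 1 != 0:
        problems.append("rating is not a multiple of 0.5")
    for showing in movie.get("time", []):
        try:
            datetime.strptime(showing, "%H:%M")
        except ValueError:
            problems.append(f"bad time: {showing}")
    return problems


movies = [
    {"name": "Tonight somebody.", "rating": 7, "time": ["15:17"]},
    {"name": "Wrong investment.", "rating": 7.5, "time": ["22:56"]},
    {"name": "", "rating": 7.3, "time": ["25:00"]},  # deliberately broken
]
report = {i: check_movie(m) for i, m in enumerate(movies)}
print(report)
```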


Conclusion#

This post was a summary of how you can use Synth to generate realistic looking test data for your Web App. In the next part of this tutorial, we'll explore how we can use Synth to generate relational data, i.e. where you have references between collections in your database.

To check out the Synth source code you can visit the Synth repo on GitHub, and to join the conversation hop on the Synth Discord server.
