Introduction

What is Synth?

Synth is an API-first platform to train and deploy synthetic data models. It can provision full-fledged database infrastructures that synthesize existing data environments. This makes Synth a great fit for you if you need to use your data for research, development or integration testing.

How does it work?

Synth has two main components:

  1. A local trainer. This is a small Docker container packing a collection of wrapped open source software that reads the input data and creates synthetic data models from it.
  2. The getsynth.com API. This is where the models produced by the trainer can be uploaded. The API lets you provision cloud-hosted databases and produce new samples of synthesized data from the models you have uploaded.

Getting Started

We will walk you through getting started. Just to make sure everything goes as expected, we recommend you have the following set up and running before you start.

  1. A fairly recent version of Docker. Anything above 16.0 should do fine.
  2. Some data to synthesize: this could be an existing PostgreSQL database or a small CSV file. In the tutorial image, you will find sample CSV data at /data/sample.csv.

Trainers

Synth manages synthetic data models generated by trainers. Trainers are processes that run locally, i.e. on your machine or your own cloud. The idea is that trainers look at input data in order to adjust model parameters. When they complete, they produce a result which is lightweight, anonymized and detached from the input data. This makes the model less sensitive (from a data privacy perspective) and easier to move around than the original data.

Models can be uploaded to the getsynth.com API, where you can deploy them into database instances or sample data directly from them through the client libraries, the CLI or the API.

Obtaining the trainer container

docker pull gcr.io/getsynth/trainer

The easiest way to run the trainer (and not have to worry about tangled dependencies!) is by pulling the official Docker image directly from our container registry. Alternatively, you can build the image from source yourself.

Running the trainer container

docker run -it -v ~/.local/share/synth:/root/.local/share/synth:rw \
    gcr.io/getsynth/trainer

The trainer saves its state and its own configuration over at ~/.local/share/synth. If you need persistence in your setup, simply mount a volume/directory at that location in the container, as in the snippet.

Linking your account

synth auth login

Next you need to attach the trainer instance to your getsynth.com account. This is so that the data models you train have a place to go when you're done. If you do not already have an account, just follow the instructions when prompted. You'll have the option to also just log in with your GitHub account if that's easier.

If you have already used the trainer in the past (and have mounted the trainer state, as instructed in the previous step), you can skip this step, as your existing credentials will be reused.

Creating a model

synth model new --help

The next few steps will walk you through creating and training a new synthetic data model - the whole point! The subcommands we are interested in here all live under synth model. First, we want to create a new model with synth model new.

Creating a model from a database

To create a new model from a database:

synth model new --from-database=postgresql://{{user}}:{{pw}}@{{host}}/{{db}}

If you want to create a new model directly from an existing database instance that is accessible to you, use the --from-database option. The format for this is the URL connection string for the database you are attempting to connect to. If in doubt as to what those look like in your situation, take a quick look at SQLAlchemy's documentation. The trainer will pass what you supply here to the SQLAlchemy library almost untouched.
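
Because the connection string is a URL, special characters in the user name or password (such as @ or /) have to be percent-encoded before they go into it. Here is a minimal Python sketch of one way to assemble such a string using only the standard library. The helper name build_postgres_url and the example credentials are made up for illustration and are not part of Synth.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, database, port=None):
    """Return a postgresql:// URL with percent-encoded credentials."""
    # safe="" makes quote() escape "/" and "@" as well, which would
    # otherwise be mistaken for URL delimiters.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    location = host if port is None else f"{host}:{port}"
    return f"postgresql://{credentials}@{location}/{quote(database, safe='')}"


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    print(build_postgres_url("alice", "p@ss/word", "localhost", "shop", port=5432))
```

The printed value can then be passed as the --from-database argument.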

Creating a model from a file

To create a new model from a file:

synth model new --from-data=/data/sample.csv

If you just want to create a model from a file, you can supply the path to your CSV file with the --from-data option. If you do not happen to have a CSV file lying around, just use the sample data provided in the image. You will find it at /data/sample.csv in the container.

Inspecting the model manifest

To inspect the model manifest:

synth model inspect {{model_id}}

Remember to replace {{model_id}} with the ID of the model you created in the previous step.

In JSON format, model manifests look like this:

{
  "tables": {
    "customer": {
      "fields": {
        "c_name": {
          "type": "categorical",
          "pii": true,
          "pii_category": "name"
        },
        "c_address": {
          "type": "categorical",
          "pii": true,
          "pii_category": "address"
        },
        "c_phone": {
          "type": "categorical",
          "pii": true,
          "pii_category": "phone_number"
        }
      }
    }
  }
}

Whatever the data's origin, synth model new will scan the data source and attempt to automagically prefill some of the metadata required to specify the model. If it runs successfully, you will get the ID of the new model printed to stdout. You can then take a look at what it found with the synth model inspect subcommand.
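
If you want to review what the scan flagged as personally identifiable, the manifest is easy to walk programmatically. The Python sketch below assumes you have the manifest as JSON text in the shape shown above; the function name pii_fields is ours, not part of Synth.

```python
import json

# The example manifest from this page, as JSON text.
MANIFEST_JSON = """
{
  "tables": {
    "customer": {
      "fields": {
        "c_name": {"type": "categorical", "pii": true, "pii_category": "name"},
        "c_address": {"type": "categorical", "pii": true, "pii_category": "address"},
        "c_phone": {"type": "categorical", "pii": true, "pii_category": "phone_number"}
      }
    }
  }
}
"""


def pii_fields(manifest):
    """Yield (table, field, pii_category) for every field flagged as PII."""
    for table_name, table in manifest.get("tables", {}).items():
        for field_name, field in table.get("fields", {}).items():
            if field.get("pii"):
                yield table_name, field_name, field.get("pii_category")


if __name__ == "__main__":
    manifest = json.loads(MANIFEST_JSON)
    for table, field, category in pii_fields(manifest):
        print(f"{table}.{field}: {category}")
```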

Training a model

To train the model created above:

synth model train --from-database=... {{model_id}}

or, if you created the model from a data file:

synth model train --from-data=... {{model_id}}

Remember to replace {{model_id}} with the ID of the model you created before.

So far all we have is a bunch of metadata that specifies what we call a "model manifest". But were you to sample synthetic data from that alone, the data you would get would most likely not look very close to the original. That is because under the hood, the model specified by this manifest has a lot of parameters that need to be tuned (i.e. trained).

Training a model is as simple as creating a new one. Just remember to use the same value for the --from-database/--from-data argument as you did when creating the model. Otherwise the trainer might complain that some of the data specified in the manifest cannot be found in the data source being used for training.
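
If you train from a CSV file and want to catch such a mismatch early, a rough pre-check could compare the manifest's field names against the file's columns. The Python sketch below is only an illustration of that idea, not how the trainer itself validates data. It assumes that the CSV has a header row, that header names match the manifest field names, and that you know which table the file corresponds to. The function name missing_columns is hypothetical.

```python
import csv
import io


def missing_columns(manifest, table_name, csv_text):
    """Return manifest fields of `table_name` that are absent from the CSV header."""
    # Assumption: the first CSV row is a header of column names.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    expected = manifest["tables"][table_name]["fields"]
    return [name for name in expected if name not in header]


if __name__ == "__main__":
    # Hypothetical manifest and data, for illustration only.
    manifest = {"tables": {"customer": {"fields": {"c_name": {}, "c_phone": {}}}}}
    csv_text = "c_name,c_address\nJane,1 Main St\n"
    print(missing_columns(manifest, "customer", csv_text))
```

An empty list means every field named in the manifest has a matching column.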

Making sure everything is ready for deployment

To double check your model is ready to go:

synth model ls

If the training was successful, your trained model was uploaded to your Synth platform account and is ready to be deployed. Just to double check, you can use the synth model ls subcommand, which will list all models held under your account alongside their current state.

Deploying a model

To deploy a new database instance of your model:

synth instance deploy {{model_id}}

Now that the model is on the Synth platform, you can deploy synthetic copies of your original data to ephemeral database instances whenever you need them!

Sampling from a model

synth model sample {{model_id}} --sample-size=10

Alternatively, you can sample data directly from the CLI with synth model sample. The output format is CSV. Use the --output option to save the result to a file instead of printing the sample to stdout.
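
Since the sample is plain CSV, any CSV reader can consume it. Here is a minimal Python sketch using the standard csv module. It assumes the output starts with a header row, which you should verify against your own sample. The text below is a hypothetical stand-in for real output, and read_sample is our own helper name.

```python
import csv
import io


def read_sample(stream):
    """Parse CSV rows from a text stream into a list of dicts keyed by header."""
    return list(csv.DictReader(stream))


if __name__ == "__main__":
    # Hypothetical sample text; real output depends on your model.
    sample_text = "c_name,c_phone\nJane Doe,555-0100\nJohn Roe,555-0101\n"
    rows = read_sample(io.StringIO(sample_text))
    print(len(rows), "rows; columns:", list(rows[0].keys()))
```

To read a saved sample instead, pass an open file object to read_sample.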

API

Take a look at our API documentation.
