Core concepts

This section covers the core concepts found in Synth.

Namespaces

The namespace is the top-level abstraction in Synth. Namespaces are the equivalent of traditional schemas in the world of relational databases like PostgreSQL. References can exist between fields in a given namespace, but never across namespaces.

Namespaces are simply directories from which synth reads a collection of schema files. For example, a namespace blog could have the following structure:

└── blog/
    ├── users.json
    └── posts.json

Any file whose extension is .json in a namespace directory will be opened by the synth generate subcommand and considered part of the namespace's schema.

Collections

Every namespace has zero or more collections. Collections are addressable by name and correspond to tables in the world of relational databases. Strictly speaking, collections are a superset of tables, as they are in fact arbitrarily deep JSON document trees.

Collections are represented in a namespace directory as JSON files. The name of a collection (the way it is referred to by synth) is its filename without the extension. For example, the file bank/transactions.json defines a collection named transactions in a namespace bank.
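As a rough illustration of the naming rule above, the following Python sketch lists the collection names in a namespace directory. It is not how synth itself is implemented; the function name collection_names is hypothetical, and the sketch assumes only what the text states: files ending in .json directly inside the namespace directory are collections, named by filename without the extension.

```python
from pathlib import Path


def collection_names(namespace_dir):
    """Return the collection names found directly inside a namespace directory.

    Hypothetical helper: a file counts as a collection if its extension is
    .json, and the collection name is the filename without that extension.
    """
    directory = Path(namespace_dir)
    # glob("*.json") is not recursive, so files in subdirectories are skipped.
    return sorted(p.stem for p in directory.glob("*.json") if p.is_file())
```

For a bank/ directory containing transactions.json and users.json, this would return the names transactions and users.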

For a more comprehensive example, let's imagine our namespace bank has a collection transactions and another collection users. The directory structure then looks like this:

└── bank/
    ├── transactions.json
    └── users.json

Each collection must be a valid instance of the synth schema that describes an array. In other words, at the top level every collection must be an array generator.
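The top-level rule can be checked with a few lines of Python. This is only a sketch under a narrow assumption: it inspects the top-level "type" key and nothing else, so it is far from full schema validation. The helper names is_array_generator and check_collection are hypothetical.

```python
import json


def is_array_generator(schema):
    """Return True if the top-level node declares the "array" type."""
    return isinstance(schema, dict) and schema.get("type") == "array"


def check_collection(text):
    """Parse a collection file's JSON text and reject non-array top levels."""
    schema = json.loads(text)
    if not is_array_generator(schema):
        raise ValueError("a collection must be an array generator at the top level")
    return schema
```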

Field references

A field reference is a special kind of field that is useful for declaring relations between different parts of a collection or between different collections in the same namespace.

A field reference can be specified by using the same_as generator type.

The value of the "ref" field should be the address of the field you want to refer to. A field address takes the form <collection name>.<level_0>.<level_1>.... For example, say we have a collection users.json containing the following schema:

{
  "type": "array",
  "length": {
    "type": "number",
    "subtype": "u64",
    "range": {
      "low": 1,
      "high": 4,
      "step": 1
    }
  },
  "content": {
    "type": "object",
    "username": {
      "type": "string",
      "faker": {
        "generator": "username"
      }
    },
    "credit_card": {
      "type": "string",
      "faker": {
        "generator": "credit_card"
      }
    },
    "id": {
      "type": "number",
      "subtype": "u64",
      "id": {}
    }
  }
}

A reference to the username field would have the address users.content.username. If we want to add a reference to this field from another collection we would simply add:

{
  "type": "array",
  "length": 1,
  "content": {
    "type": "object",
    "username": {
      "type": "same_as",
      "ref": "users.content.username"
    }
  }
}
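To make the address format concrete, here is a minimal Python sketch that walks an address such as users.content.username through a namespace held in memory. It is an illustration, not synth's resolver: resolve_field is a hypothetical name, the namespace is assumed to be a dict mapping collection names to their parsed schemas, and the users schema is trimmed to the one field used here.

```python
def resolve_field(namespace, address):
    """Follow <collection name>.<level_0>.<level_1>... to a schema node."""
    collection, *path = address.split(".")
    if collection not in namespace:
        raise KeyError(f"no collection named {collection!r} in this namespace")
    node = namespace[collection]
    for key in path:
        # Each level of the address must name a key of the current node.
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"{address!r} does not resolve at {key!r}")
        node = node[key]
    return node


# Trimmed-down version of the users collection, for illustration only.
namespace = {
    "users": {
        "type": "array",
        "length": 1,
        "content": {
            "type": "object",
            "username": {"type": "string", "faker": {"generator": "username"}},
        },
    }
}

target = resolve_field(namespace, "users.content.username")
print(target["faker"]["generator"])  # prints: username
```

Because the lookup only ever consults a single namespace dict, the sketch also mirrors the rule that references never cross namespaces.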

Schema

The schema is the core data structure that you need to understand to be productive with Synth. The schema represents your data model: it tells Synth exactly how to generate data, which fields are needed, what types they have, and so on. This is a little involved, so there is a section devoted to just the schema.

Scenarios

Since collections correspond so closely to database tables, there will be numerous use cases that use only a subset of the collections in a namespace. This is where scenarios come in.

Scenarios allow us to define a specific use case for the data in a namespace. Expanding on our bank example, we can create a scenario that only generates data for users by having the following directory structure:

└── bank/
    ├── scenarios
    │   └── users-only.json
    ├── transactions.json
    └── users.json

A scenario is created by placing a [scenario-name].json file in the scenarios/ directory inside the namespace; here, users-only.json creates a scenario called users-only. The definition for this scenario looks as follows:

{
  "users": {}
}

This definition explicitly marks the users collection for inclusion inside this scenario.
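The effect of such a definition can be pictured as a filter over the namespace's collections. The Python sketch below rests on assumptions: the only behavior taken from the text is that a key in the scenario file marks a collection for inclusion. The function name collections_in_scenario is hypothetical, the value attached to each key is ignored, and treating an unknown collection name as an error is a choice made for this sketch rather than documented synth behavior.

```python
def collections_in_scenario(namespace, scenario):
    """Keep only the collections whose names appear as keys in the scenario.

    namespace: dict mapping collection names to parsed schemas.
    scenario:  dict parsed from a scenarios/<scenario-name>.json file.
    """
    missing = [name for name in scenario if name not in namespace]
    if missing:
        # Assumption of this sketch: naming an unknown collection is an error.
        raise KeyError(f"scenario names unknown collections: {missing}")
    return {name: namespace[name] for name in scenario}
```

With the bank namespace and the users-only definition above, only the users collection would remain.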

Importing datasets

Synth can ingest and build schemas on the fly with the synth import command.

Generating data

To generate data from an existing namespace, use the synth generate command.

synth uses a seedable pseudo-random source of entropy. By default, the seed is set to a constant value of 0 using the Rust-native rand::SeedableRng::seed_from_u64 function. This means that, by default, the data that synth generates is deterministic: it is only a function of your schema files.

This behavior can be changed with the --seed flag, which sets the seed to a value of your choice, or the --random flag, which randomizes it.
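The idea of seeded, deterministic generation can be shown with a small Python analogy. This is not synth's generator (synth uses Rust's rand crate, as noted above); it uses Python's standard random module, and the function generate_usernames and its output format are hypothetical. The point it illustrates is that a fixed seed makes the output a function of the inputs alone.

```python
import random


def generate_usernames(count, seed=0):
    """Produce `count` pseudo-random usernames from a seeded generator."""
    # A private Random instance keeps the sequence independent of global state.
    rng = random.Random(seed)
    return [f"user_{rng.randrange(10_000):04d}" for _ in range(count)]


# Two runs with the same (default) seed yield identical data.
first = generate_usernames(3)
second = generate_usernames(3)
print(first == second)  # prints: True
```

Passing a different seed, or one drawn from the system clock or OS entropy, plays the role that --seed and --random play for synth.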