
· 2 min read
Sam

[Image: Confetti]

We are celebrating 1,000 stars on the Synth repository.

This wouldn't have happened without your support and contribution, so we want to thank the community for being with us along the way.

The Journey#

The Synth project started all the way back in January 2021, two whole years ago. In that time we have worked tirelessly to deliver a consistent, fast, and easy-to-use tool for generating fake data for databases.

2021#

  • Jan 2021: Synth started
  • Jul 2021: Python integration deprecated
  • Oct 2021: v0.6.0 released

2022#

  • Sep 2022: New maintainers take over
  • Nov 2022: v0.6.9, the latest release, released

2023#

  • Jan 2023: Synth reaches 1k stars

Where Are We Now?#

We're pleased with the adoption we've seen from the Rust community. We couldn't have made it this far without your feedback and contributions.

  • 2 📦 Open source projects
  • 101 🎈 Currently use Synth
  • 28 👨‍👩‍👧‍👦 Contributors
  • 317 ✅ Merged PRs & resolved issues
  • 1,000 ⭐ GitHub stars
  • 18,638 ⌨️ Lines of Rust

* as of January 5th

Contributors#

Many features and enhancements are actually proposed and implemented by the community. We want to take this chance to thank all our contributors!

· 8 min read
Andre Bogus

Rust Nightly

When my colleague Christos wrote about using Rust for our startup, he made no mention of the fact that we actually use a nightly Rust compiler, and our code incorporates a set of nightly features. So I thought it might be beneficial to look at the cost-benefit ratio of using Rust nightly, for us, for the open source ecosystem, and for Rust.

Disclaimer: I wasn't the one who introduced nightly Rust features into this particular codebase, but I have a good few years of experience with nightly Rust, having worked on clippy since 2015.

The Good#

Everybody loves features! Features are great! Can't have too many of them. And even if you don't use any nightly features, using a nightly Rust compiler will give you 12 weeks of head start on performance benefits before they hit stable Rust. In fact, for a good four years, I only had a nightly compiler installed, because my Chromebook had very little disk space and I needed nightly for clippy anyway. Besides, who's going to find bugs that nightly may have if no one tests it? Apart from officially unstable features (like, say, internal compiler APIs that clippy used), I only ever once encountered any incompatibility – which I will report later.

That said, once you have nightly, those features are a real temptation. Curiosity had me playing with new toys on more than one crate: bytecount, initially flamer and overflower, and mutagen, which is still nightly-only. Apart from the latter, most only played with one feature at a time. Even among unstable features, many were pretty solid already, and changes were often easy enough to follow. And if you only write a prototype of something (well, you probably shouldn't be using Rust, but if, like me, you are very well versed in it and feel sufficiently productive), an unstable feature or two may give you the extra oomph to get it done quickly.

Many of the features unlock certain powers that our code can use (e.g. async_closure), or give us a better programming experience (e.g. box_patterns). Those latter features are often undervalued; if for example you have a chain of objects on the heap, matching through all of them in one pattern makes the code much easier to read and maintain (as long as that feature is available, of course). Having to write two, three, four match statements for this becomes cumbersome quickly.
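To make the box_patterns point concrete, here is a toy example of my own (not code from the synth codebase) that matches through two levels of heap allocation in a single arm; it only compiles on nightly with the feature enabled:

#![feature(box_patterns)]

// A toy chain of heap-allocated nodes; with `box_patterns`, one match arm can
// look through several Box layers instead of nesting separate match statements.
enum Node {
    Leaf(i32),
    Inner(Box<Node>),
}

fn leaf_two_levels_down(node: Node) -> Option<i32> {
    match node {
        Node::Inner(box Node::Inner(box Node::Leaf(v))) => Some(v),
        _ => None,
    }
}

fn main() {
    let chain = Node::Inner(Box::new(Node::Inner(Box::new(Node::Leaf(42)))));
    assert_eq!(leaf_two_levels_down(chain), Some(42));
}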

The Bad#

But there's a psychological factor at play: once you give in to temptation with one feature, your hesitation to adopt yet another one has already diminished. Adding a third one looks even more benign and the fourth one is a no-brainer, right? And so at synth, our core/src/lib.rs starts with this:

#![feature(
    format_args_capture,
    async_closure,
    map_first_last,
    box_patterns,
    error_iter,
    try_blocks
)]
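
As a small illustration of the first of these (my own snippet, not taken from the synth codebase), format_args_capture lets format strings pick up variables from the surrounding scope by name; it required the feature gate on the nightlies of early 2021 and has since been stabilized in Rust 1.58:

fn greet(name: &str) -> String {
    // `name` is captured from the enclosing scope by the format string;
    // previously this had to be written as format!("Hello, {}!", name).
    format!("Hello, {name}!")
}

fn main() {
    println!("{}", greet("nightly"));
}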

There even used to be a try_trait feature until I removed it, because it was easier to replace with a custom implementation than to change the whole code over to try_trait_v2 when the trait design was improved. The synth/src/lib.rs uses the concat_idents feature in some macros as well, so that's not even all of it. git blame tells us that the first four features were introduced in January '21, with the latter two added in July. Coming back to that try_trait change, when #88223 hit, I found our code no longer compiled with the latest nightly. Of course we had pinned an older nightly, but I regularly checked whether there was trouble on the horizon.

Now this trait was mostly used to get try-like semantics for generators (the part of synth that builds the test data), which could either return a result or yield yet another part of the thing to generate. While this design is very elegant, it does not really depend on the try trait implementation at all; it only reused a few types from it (and even rarely used question marks). So when this code stopped working, I tried to update it to the new implementation, and when that proved too hairy, I simply removed the feature along with the one or two question marks, added our own GeneratorResult trait and changed a few occurrences of Try trait usage to use our own trait so the code worked again.
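
To illustrate the shape of that workaround, here is a minimal sketch; the names GeneratorResult and GeneratorState are made up for this example, and the real trait in synth looks different:

// A generator step either yields another chunk of generated data or finishes,
// without relying on the unstable `Try` trait.
enum GeneratorState<Y, R> {
    Yielded(Y),
    Complete(R),
}

trait GeneratorResult {
    type Yield;
    type Return;

    fn into_state(self) -> GeneratorState<Self::Yield, Self::Return>;
}

impl<Y, E> GeneratorResult for Result<Y, E> {
    type Yield = Y;
    type Return = E;

    fn into_state(self) -> GeneratorState<Y, E> {
        match self {
            Ok(value) => GeneratorState::Yielded(value),
            Err(err) => GeneratorState::Complete(err),
        }
    }
}

fn main() {
    let step: Result<u64, &str> = Ok(42);
    match step.into_state() {
        GeneratorState::Yielded(value) => println!("generated: {}", value),
        GeneratorState::Complete(err) => println!("finished with error: {}", err),
    }
}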

I think this case shows us two things:

  1. Sometimes the cost-benefit ratio of a feature can change over time, and often the costs are paid later rather than sooner. In this case, it was the cost of replacing the feature with our own implementation which we could have made right away without using the feature. On the other hand, the feature had worked well for us for a good while, so that was basically repaying technical debt.
  2. Often, it's not too hard to replace a feature. Still it's a chore that takes up time and resources, so unless there's an acute need to do so, the opportunity cost means other things will often be more valuable right now. So features, once used, tend to linger.

Regarding the pinned nightly and the only incompatibility I ever encountered: I removed that pin when the May 2021 nightly version we had been using stopped working with an updated proc_macro2 crate. It was later re-established with a newer version. We have this version in all our CI jobs and also in our rust-toolchain file. There were also a few hiccups when CI and the main toolchain inadvertently went out of sync, but those were fixed up quickly.

For a version that is billed as "unstable", Rust nightly is actually surprisingly solid. There are two reasons for that: 1. Every PR that gets merged has been extensively tested by CI, so obvious errors are caught before the code even hits nightly. 2. Whenever a change looks risky, the infrastructure team supplies a "crater run". Crater is a tool that tries to compile every crate on crates.io with a given version of the compiler and checks whether anything that used to compile now fails. Since crates.io has 68798 crates in stock at the time of this writing, there's a pretty good chance that whatever weird thing you might encounter in live code is thrown at the compiler. I have made such a rustc change once, and it was very reassuring to know that my PR didn't break any code out there.

The Conclusion#

If you want to compile your code with the current fastest version, you can use nightly now. As long as you don't use any features, your code should still compile on stable (however, I would still use a stable compiler in CI to check, because some changes may become insta-stable, e.g. added trait implementations; Rust cannot put those behind a feature gate). There is a very small risk of breakage, but you can revert to a beta or stable compiler with no hassle if that happens.

Using nightly features is in a way like every other form of technical debt: a bit of risk-taking that can give you a potentially big payoff now, at the price of possible future breakage. Whether you want to take that risk depends a lot on your project and the phase it lives in. If you're a startup desperate to get your project out there, not using that feature may mean there won't be a project left to fix later. On the other hand, if you are writing code that should live for a while, or a library that aims to be widely used, avoiding nightly features is likely your best bet.

If you are on nightly, you have two options: Go all in and embrace the instability or pin a known good version. I now think that pinning is the only sane option for all but hobby projects (where a bit of breakage can be acceptable every now and then). I note that clippy has a special place here, because it's essentially tied to the current rust version by design, and we get away with syncing every two weeks and staying on master (not even nightly) otherwise. Once you decide on a pinned version, you may as well pin all your dependencies and update them very cautiously, because any update could break your build. Even then it may be a good idea to test with a current nightly every now and then to gauge whether any incompatibility will hit you when you eventually decide to update.

If you encounter breakage, take a step back and check whether the feature is still pulling its weight or whether it's cheaper to get rid of it. So to sum up: going nightly carries some risk. Being aware of and mitigating that risk can give you benefits now, with a price tag attached in the future. As always, your mileage may vary.

· 8 min read

Rust

When building a company you are setting out to fundamentally solve a problem. For this reason, engineers have been systematically attracted to the romantic idea of changing the world with your brain and a laptop. We are at heart problem solvers.

As engineers, we can (and most of us have) become zealous at times about our solutions to these problems. We have pragmatists who just get stuff done - they address the symptom fast and effectively. We have idealists who will grind at an elegant scalable solution and try to treat the disease. Whichever camp you subscribe to, at a certain point you need to form an opinion about which technologies you are going to use to solve the problems you see in the world - and this opinion will inevitably cause contention.

Conventional wisdom is to 'use the right tool for the job'. The choice of programming language for example, depends on the domain of the problem you are trying to solve. If you're implementing some algorithm, in a secluded project, it's easy to make the case about what the language for the job may be. You can run a benchmark and literally test the execution time for each candidate language (if you're optimising for execution time). You can persuade yourself you've made a rational and 'objectively correct' decision.

However, in the context of building a business, your optimisation function is a high-dimensional mess involving performance, development velocity, hiring, server costs, ecosystem, tooling, support, licenses etc. You can assign weights to what is most important for your business, but at the end of the day the decision is inevitably qualitative.

At Synth, we're working on building the best data generator in the world. We made a conscious decision to use Rust for our main line of products. After more than a year of building I've had the opportunity to see Rust at its best and worst in the context of starting a company - this post is a compilation of these (at times cynical) thoughts.

Development Velocity#

Rust has a really steep learning curve. Coming from an OO background, it took me months to become productive in Rust. This was incredibly frustrating for me as I felt that my lack of productivity was impacting the team, which it was. Even when you eventually do become productive (and you will), Rust forces you to really think deeply about what you're doing, and things inevitably take longer to get over the line. A poorly thought out design decision today can come back to haunt you months later. What should be a simple change or refactor can end up resulting in a complete teardown as you try to appease the borrow checker. This is deadly.

The entire premise of a startup is that you have to iterate rapidly. Very few companies know what they should be building from day one. It's an iterative process involving a feedback loop of talking to users and making changes to reflect the feedback. The faster you can make that feedback loop, the higher probability you have of success.

Correctness#

The evident hit in development velocity is redeemed to an extent by Rust's emphasis on writing correct programs. "If it compiles, it works", so to speak. I've found this to be true for the most part while building with Rust, and it is an absolute joy to work with for this reason.

Even if your program is not perfect, you understand the failure modes much better. The set of unknown failure modes is reduced substantially as your program breaks in exactly the way you expect it to. The lack of null pointers in conjunction with the Result paradigm (vs, say, exceptions) compels you to build correct programs where edge cases are well understood and handled explicitly by you (or with unimplemented!(), but no one is perfect).
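
As a trivial illustration of that explicitness (my example, not synth code), a fallible operation advertises failure in its signature, and the compiler will not let the caller silently ignore it:

use std::num::ParseIntError;

// The possibility of failure is part of the return type; the caller has to
// decide what an invalid input means instead of hitting an exception at runtime.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    raw.trim().parse::<u16>()
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on port {}", port),
        Err(e) => eprintln!("invalid port: {}", e),
    }
}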

If you've reached product market fit - correctness may counteract the development velocity hit. When you know what you're building you need to iterate less. Your dev team is also going to be spending less time dealing with bugs as you've already dealt with that while trying to appease the compiler.

If it compiles it works - and this is an invaluable asset when you're aggressively shipping code.

Talent#

Getting great talent is unbelievably important for an early stage startup. The fact that the absolute number of competent and experienced Rust developers is so small initially seems detrimental to getting great people. This is exacerbated by Rust's steep learning curve as you need to hire someone with experience, or it's going to take months for them to become productive. However, this is not the full picture.

In our experience, the competence of your average Rust developer is much higher than that of developers in more conventional programming languages. Something spoke to these individuals when they picked up Rust; it's hard to put your finger on it, but it's that same quality that makes a great engineer. It's also been a pleasant surprise to find out that really good engineers will seek you out as an employer because you use Rust. They don't want to work in *script or Java or C++. They want to work with Rust because it's great.

Open Source#

At Synth, we've chosen to adopt an open-core business model. The idea behind an open-core business is that you develop an open-source product with a permissive license which solves a real technical problem. You work on building a user base, a community and a great product all out in the open. You then structure your business model around solving the corresponding organisational problem - and that's how you make money.

We've been really lucky to have a really active set of contributors - giving ideas, reporting bugs and contributing (at times very significant) code. It is hard to know for sure, but we have a strong hunch that a lot of the contributors are active because they have an interest in Rust projects specifically. A lot of our contributors are also interested in learning Rust - not necessarily being veterans of the language. This has worked out great as the more experienced members of our core team mentor and review code of young rustaceans, building a symbiotic positive feedback loop.

Thank you to all our contributors - you know who you are and you guys are amazing.

Libraries#

Rust has an ecosystem of incredibly high quality libraries. The Rust core team has led by example and focused on a high quality and tight standard library. The result of a highly focused standard library is (unfortunately) a lack of canonical libraries for doing things outside the standard library. So you want a web server? Pick from one of the hundreds available. You want a crate (Rust lingo for a library) for working with JWT tokens? Here are nine; pick one. I mean, even something as fundamental as an asynchronous runtime is split between tokio, async-std and others. As a young rustacean this can be overwhelming.

What ends up happening over time is that certain libraries become implicitly canonical as they receive overwhelming support, turning into serious dependencies that differentiate themselves from their alternatives. Also, in a project update from RustConf 2021 it was mentioned that the idea of having 'recommended crates' may be explored in the future.

The lack of canonical non-standard libraries is an issue when you're getting started - but over time this diminishes as you get a better understanding of the ecosystem. What has been constantly detrimental to our development velocity has been the lack of client libraries for Rust. We've had to write a bunch of different integrations ourselves, but they're often clunky as we don't have the time to invest in making them really high quality. For example most of Google's products have at best an unofficial code-generated crate maintained by the community, and at worst absolutely nothing. You need to write it from scratch.

Should you build your startup with Rust?#

Well it depends. Assuming you're building a product in the right domain for Rust (say a CLI as opposed to a social media site), even then the answer is not clear-cut. If you don't have close to 100% conviction that you know what you're building, I would be inclined to say no. Development velocity and being able to make rapid iterations is so important for an early stage startup that it outweighs a lot of the benefits that Rust brings to the table.

If your company is later stage, and you now understand exactly what you should be building (assuming this is ever the case), then I would say yes. The 'correctness' of Rust programs and the propensity of Rust to attract great engineers can help in building a great engineering culture and a great company.

· 10 min read
Andre Bogus

50 ways to crash our product

I personally think that the software we build should make more people's lives better than it makes worse. So when users recently started filing bug reports, I read them with mixed feelings. On one hand, it meant that those particular users were actually using synth, on the other hand, it also meant that we were failing to give them a polished experience. So when it was my turn to write more stuff about what we do here, I set myself a challenge: Find as many ways as I can to break our product.

I briefly considered fuzzing, but decided against it. It felt like cheating. Where's the challenge in that? Also, I wanted to be sure that the bugs would be reachable by ordinary (or perhaps at least exceptional) users, and I would accept misleading error messages (which a fuzzer couldn't judge) as bugs. Finally, I am convinced I learn more about code when actively trying to break it, and that's always a plus. So "let's get cracking!" I quoth, and off I went.

Overview#

Before we start, I should perhaps give a short architectural overview of synth. Basically, the software has four parts:

  1. The DSL (which is implemented by a set of types in core/src/schema that get deserialized from JSON),
  2. a compiler that creates a generator graph (a directed acyclic graph of items that can generate values),
  3. export (writing to the data sink) and
  4. import facilities (for creating a synth namespace from a database schema).

My plan was to look at each of the components and see if I can find inputs to break them in interesting ways. For example, leaving out certain elements or putting incorrect JSON data (that would not trip up the deserialization part, but lead to incorrect compilation later on) might be a fruitful target. Starting from an empty schema:

{
    "type": "array",
    "length": 1,
    "content": {
        "type": "object"
    }
}

I then called synth generate until I found a problem. First, I attempted to insert confusing command line arguments, but the clap-based parser handled all of them gracefully. Kudos!

#1 The first thing I tried was using a negative length:

{
    "type": "array",
    "length": -1,
    "content": {
        "type": "object"
    }
}

Which was met with BadRequest: could not convert from value 'i64(-1)': Type { expected: "U32", got: "i64(-1)" }. Not exactly a crash, but the error message could be friendlier and have more context. I should note that this is a very unspecialized error variant within the generator framework. It would make sense to validate this before compiling the generator and emit a more user-friendly error.
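
A hypothetical pre-compilation check along these lines could name the offending field directly; this is only a sketch of the idea (assuming the raw schema is available as a serde_json::Value), not synth's actual code:

// Reject lengths that are negative, fractional or too large before the
// generator graph is ever built, with an error that names the field.
fn check_length(length: &serde_json::Value) -> Result<u32, String> {
    match length.as_u64() {
        Some(n) if n <= u32::MAX as u64 => Ok(n as u32),
        _ => Err(format!(
            "`length` must be an integer between 0 and {}, got `{}`",
            u32::MAX,
            length
        )),
    }
}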

Bonus: If we make the length "optional": true (which could happen because of a copy & paste error), depending on the seed, we will get another BadRequest error. The evil thing is that this will only happen with about half of the seeds, so you may or may not be lucky here (or may even become unlucky if another version would slightly change the seed handling).

#2 Changing the length field to {} makes for another befuddling error:

Error: Unable to open the namespace
Caused by:
    0: at file 2_unitlength/unitlength.json
    1: Failed to parse collection
    2: missing field `type` at line 8 column 1

The line number is wrong here; the length should be in line six in the content object, not in line eight.

#3 It hasn't been long since we could start using literal numbers for number constants here (for example for the length). The old way would use a number generator. A recent improvement lets us generate arbitrary numbers; however, this is likely not a good idea for a length field:

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u32"
    },
    "content": {
        "type": "object"
    }
}

This might be done very quickly, but far more likely it will churn for a long time, exhausting memory in the process, because this actually generates a whole lot of empty objects (which are internally BTreeMaps, so an empty one comes in at 24 bytes) – up to 4,294,967,295 of them, which would fill 96GB! While this is not an error per se, we should probably at least warn on this mistake. We could also think about streaming the result instead of storing it all in memory before writing it out, at least unless there are references that need to be stored; this would also allow us to produce output more quickly.

#4 Let's now add a string:

```json synth[expect = "`string` generator is missing a subtype"]
{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u32"
    },
    "content": {
        "type": "object",
        "s": {
            "type": "string"
        }
    }
}
```


Oops, I forgot to specify which kind of string. But I wouldn't know that from the error:
```console
Error: Unable to open the namespace
Caused by:
    0: at file 4_unknownstring/unknownstring.json
    1: Failed to parse collection
    2: invalid value: map, expected map with a single key at line 10 column 1
```

#5 Ok, let's make that a format then. However, I forgot that the format must contain a map with the keys "format" and "arguments", and put them into the s map directly:

```json synth[expect = "`arguments` is expected to be a field of `format`"]
{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u32"
    },
    "content": {
        "type": "object",
        "s": {
            "type": "string",
            "format": "say my {name}",
            "arguments": {
                "name": "name"
            }
        }
    }
}
```


```console
Error: Unable to open the namespace
Caused by:
    0: at file 5_misformat/misformat.json
    1: Failed to parse collection
    2: invalid value: map, expected map with a single key at line 14 column 1
```

#6 Ok, then let's try to use a faker. Unfortunately, I haven't really read the docs, so I'll just try the first thing that comes to mind:

```json synth[expect = "`faker` is expected to have a `generator` field. Try '"faker": {"generator": "name"}'"]
{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u32"
    },
    "content": {
        "type": "object",
        "name": {
            "type": "string",
            "faker": "name"
        }
    }
}
```


This gets us:
```console
Error: Unable to open the namespace
Caused by:
    0: at file empty/empty.json
    1: Failed to parse collection
    2: invalid type: string "name", expected struct FakerContent at line 11 column 1
```

One could say that the error is not exactly misleading, but not exactly helpful either. As I've tried a number of things already, I'll take it. Once I get the syntax right ("faker": { "generator": "name" }), the rest of the faker stuff seems to be rock solid.

#7 Trying to mess with date_time, I mistakenly specify a date format for a naive_time value.

```json synth[expect = "unknown variant `date_time`, expected one of `pattern`, `faker`, `categorical`, `serialized`, `uuid`, `truncated`, `sliced`, `format`, `constant`"]
{
    "type": "array",
    "length": 1,
    "content": {
        "type": "object",
        "date": {
            "type": "string",
            "date_time": {
                "format": "%Y-%m-%d",
                "subtype": "naive_time",
                "begin": "1999-01-01",
                "end": "2199-31-12"
            }
        }
    }
}
```


This gets me the following error, which is again misplaced at the end of the input and not exactly understandable. The same happens if I select a date format of `"%H"` and bounds of `0` to `23`.
```console
Error: Unable to open the namespace
Caused by:
    0: at file 7_datetime/datetime.json
    1: Failed to parse collection
    2: input is not enough for unique date and time at line 16 column 1
```

I believe since the time is not constrained in any way by the input, we should just issue a warning and generate an unconstrained time instead, so the user will at least get some data. Interestingly, seconds seem to be optional, so %H:%M works.

#8 Moreover, if I use naive_date instead but make the minimum 0-0-0, we get the technically correct but still mis-spanned:

Error: Unable to open the namespace
Caused by:
    0: at file 8_endofdays/endofdays.json
    1: Failed to parse collection
    2: input is out of range at line 16 column 1

For the record, the error is on line 11.

#9 Now we let date_time have some rest and go on to categorical. Having just one variant with a weight of 0 will actually trigger an unreachable error:

{
    "type": "array",
    "length": 1,
    "content": {
        "type": "object",
        "cat": {
            "type": "string",
            "categorical": {
                "empty": 0
            }
        }
    }
}

Well, the code thinks we should not be able to reach it. Surprise!

thread 'main' panicked at 'internal error: entered unreachable code', /home/andre/projects/synth/core/src/schema/content/categorical.rs:82:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

For the record, this is the first internal error I was able to uncover so far. Given this success with categorical strings, it was natural to check whether one_of could be similarly broken, but the generator just chose the one variant despite its 0.0 weight.
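
To show how code can end up in a branch its author deemed unreachable, here is a minimal weighted-pick sketch of my own (not synth's implementation): with a single variant of weight 0, the loop never selects anything and the "unreachable" arm is hit.

// `roll` is assumed to be a random number below the total weight; with every
// weight at zero, no variant is ever selected and we fall through.
fn pick<'a>(variants: &[(&'a str, u64)], mut roll: u64) -> &'a str {
    for &(name, weight) in variants {
        if roll < weight {
            return name;
        }
        roll -= weight;
    }
    unreachable!("the roll should always land on a variant")
}

fn main() {
    // A lone variant with weight 0 makes the "unreachable" branch reachable.
    println!("{}", pick(&[("empty", 0)], 0));
}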

#10 Unsupported types on import

Databases can sometimes contain strange things, and so far the support is in beta, so it was expected that I would find types for which we currently don't implement import. This includes JSON for mysql and postgres, the mysql spatial datatypes as well as postgres' geometric types, user-defined enumerations, postgres' network address types, postgres arrays (soon only nested ones), etc.

The way to reproduce that is to create a table with a field of the type, e.g. here with mysql

CREATE TABLE IF NOT EXISTS json (
    data JSON
);
DELETE FROM json;
INSERT INTO json (data) VALUES ('{ "a": ["b", 42] }');

Now call synth import jsonnamespace --from mysql://<user>:<password>@<host>:<port>/<database> to get

Error: We haven't implemented a converter for json

Since the error is mostly the same for all types, and was somewhat expected, I won't claim a point for each type here.

#11 Exporting an array of nulls into postgres is not correctly implemented, so

{
    "type": "array",
    "length": 5,
    "content": {
        "type": "object",
        "s": {
            "type": "array",
            "length": 1,
            "content": {
                "type": "null"
            }
        }
    }
}

will give us a wrong data type error from postgres. The problem here is that we lose the type information from the generator, and just emit null values which do not allow us to construct the right types for encoding into a postgres buffer. The solution would be to re-architect the whole system to reinstate that type information, possibly side-stepping sqlx in the process. Note that this is not equal to issue #171, which relates to nested arrays.

#12 Going back to #3, I thought about other ways to make the code over-consume resources. But time and memory are not the only things to consume; in fact, it's easy enough to consume another: the stack. The following bash script:

X='{ "type": "null" }'
for i in $(seq 0 4096)
do
    X="{ \"type\": \"string\", \"format\": { \"format\": \"{x}\", \"arguments\": { \"x\": $X } } }"
done
echo $X > 12_stack_depth/stack_depth.json
synth gen --size 1 12_stack_depth

will generate the following error:

Error: Unable to open the namespace
Caused by:
    0: at file 12_stack_depth/stack_depth.json
    1: Failed to parse collection
    2: recursion limit exceeded at line 1 column 2929

So I give up. I've found 1 way to crash our product with an unintended error, reproduced some known limitations and outlined a number of error messages we can improve on. I fell far short of my original goal, which either means I'm really bad at finding errors, or our code is incredibly reliable. Given the track record of software written in Rust, I'd like to think it's the latter, but I'll leave judgement to you.

Anyway, this was a fun exercise, and I looked at many more things that turned out to just work well, so that's a good thing™. With all the test automation we have today, it's easy to forget that the manual approach also has its upsides. So feel free to try to break your (or our) code!

· 12 min read

So you want to mock an API

API mocking refers to the process of simulating the behaviour of a real API using a fake replacement.

There are many great use-cases for API mocking. You may want to mock an API to eliminate development dependencies between engineering teams. If, for example, a service which your front-end depends on isn't ready, you may want a placeholder to unblock your front-end team.

API mocking is also a really powerful tool when doing integration tests against 3rd party APIs. This can be broken down roughly into functional and non-functional testing:

  • Functional Testing: You care about the semantics of the API. It is important to be able to make a request and get an appropriate response which takes your request into account. Unfortunately API providers often have sub-par testing environments which can make this a real pain.
  • Non-Functional Testing: You don't really care about the semantics of the API. This can be used to speed up integration tests as requests don't need to travel to the API providers servers. You may also want to verify your SLAs, load test / stress test your systems etc. In this case being rate-limited by the API provider's testmode can be a limiting constraint.

At Synth, we're building a declarative data generator. We wanted to apply our data generation engine to mocking a subset of a popular API and see how far we could go. We set out to prototype a solution over (roughly) 5 days as a side project - this blog post is an overview of that journey.

Day 1: Picking an API#

So much to do, so little time. We decided we wanted to mock a popular API but didn't know where to start. Companies like Stripe have an excellent testmode and even an open source http server which you can use instead.

We decided to ask the internet 'Which API have you been struggling to test against?' on various forums like Reddit, HN and others. OK, so now we wait and see what the internet has to say.

Day 2: Choosing Shopify#

Lo and behold, the internet responded! A bunch of people chimed in, primarily complaining about payment processors (except for Stripe, which was explicitly praised yet again!). A few products and companies came up repeatedly as being difficult to test against. We qualitatively evaluated the internet's feedback and reviewed documentation from the different APIs mentioned to understand the implementation complexity. After all, we had 3.5 days left, so we couldn't pick anything too complex. In the end we decided to go with the Shopify API!

Just as a disclaimer: we have absolutely no issues with Shopify, it just so happens that a lot of the feedback we got pointed us in that direction.

Now the Shopify API is pretty big - and we're building a mock server POC from scratch, so we decided to narrow down and try to mock a single endpoint first. We chose the Event API, which seemed pretty straightforward. Looking at the Event API, there are three dimensions to consider when designing our POC solution.

1. The Data Model#

The Event API returns a JSON payload which is a collection of Events. An example Event can be seen below:

{
    // Refers to a certain event and its resources.
    "arguments": "Ipod Nano - 8GB",
    // A text field containing information about the event.
    "body": null,
    // The date and time (ISO 8601 format) when the event was created.
    "created_at": "2015-04-20T08:33:57-11:00",
    // The ID of the event.
    "id": 164748010,
    // A human readable description of the event.
    "description": "Received a new order",
    // A relative URL to the resource the event is for, if applicable.
    "path": "/admin/orders/406514653/transactions/#1145",
    // A human readable description of the event. Can contain some HTML formatting.
    "message": "Received a new order",
    // The ID of the resource that generated the event.
    "subject_id": 406514653,
    // The type of the resource that generated the event.
    "subject_type": "Order",
    // The type of event that occurred.
    "verb": "confirmed"
}

Off the bat it's clear that there is some business logic that needs to be implemented. For example, there is some notion of causality, i.e. an Order cannot be closed before it's been placed. This non-trivial business logic was good news - it means we can showcase some complex data generation logic that's built into synth.

Since we don't have access to the code that runs the Shopify API, we have to simulate the behaviour of the Event data model. There are varying degrees of depth into which one can go, and we broke it into 4 levels:

  1. Level 1 - Stub: Level 1 is just about exposing an endpoint where the data on a per-element basis 'looks' right. You have the correct types, but you don't really care about correctness across elements. For example, you care that path has the correct subject_id in the URI, but you don't care that a given Order goes from placed to closed to re_opened, etc...
  2. Level 2 - Mock: Level 2 involves maintaining the semantics of the Events collection as a whole. For example, created_at should always increase as id increases (a larger id means an event was generated at a later date), verbs should follow proper causality (as per the order example above), etc.
  3. Level 3 - Emulate: Level 3 is about maintaining semantics across endpoints. For example creating an order in a different Shopify API endpoint should create an order_placed event in the Event API.
  4. Level 4 - Simulate: Here you are basically reverse engineering all the business logic of the API. It should be indistinguishable from the real thing.

Really these levels can be seen as increasing in scope as you simulate semantics per-element, per-endpoint, cross-endpoint and finally for the entire API.

2. The Endpoint Behaviour#

The Event API exposes 2 endpoints:

  • GET /admin/api/2021-07/events.json which retrieves a list of all events
  • GET /admin/api/2021-07/events/{event_id}.json which retrieves a single event by its ID.

The first endpoint exposes various query parameters (which are basically filters) which alter the response body:

limit:          The number of results to show. (default: 50, maximum: 250)
since_id:       Show only results after the specified ID.
created_at_min: Show events created at or after this date and time. (format: 2014-04-25T16:15:47-04:00)
created_at_max: Show events created at or before this date and time. (format: 2014-04-25T16:15:47-04:00)
filter:         Show events specified in this filter.
verb:           Show events of a certain type.
fields:         Show only certain fields, specified by a comma-separated list of field names.

Luckily the filtering behaviour is simple, and as long as the implementation stays true to the description in the docs it should be easy to emulate.

The second endpoint takes one query parameter which is a comma-separated list of fields to return for a given event. Again should be easy enough.

3. Authentication#

We decided not to touch authentication for now as the scope would blow up for a 5-day POC. Interestingly we got a bunch of feedback that mocking OAuth flows or similar would be really useful, regardless of any specific API. We may come back to this at a future date.

Day 3: Evaluating Implementation Alternatives#

And then there was Day 3. We'd done our due diligence to pick a popular yet underserved API, and we'd drilled down on how deep we could go in trying to faithfully represent the implementation.

As any self-respecting engineer would do, we decided to scour the internet for off-the-shelf solutions to automate as much of the grunt work as possible. Some naive Googling brought up a mock server called JSON server - an API automation solution which spins up a REST API for you given a data definition. Excited by this we quickly wrote up 2 fake Event API events, and started JSON server feeding it the fake events - and it worked!

Well, almost; we were initially excited by the fact that it did exactly what it said on the tin, and very well. However, it didn't have an easy way to specify the custom query parameters we needed to faithfully reproduce the API, for example returning results before or after a given created_at timestamp (feel free to let us know if we missed something here!).

So we needed something a little more sophisticated. The internet came to the rescue again with a comprehensive list of API simulation tools. The basic precondition we had was that the API simulator had to be OSS with a permissive license we could build on. This immediately disqualified 50% of the available solutions, and we did a divide and conquer exercise quickly evaluating the rest.

[Image: API simulation tools]

The remaining tools were either not built for this purpose, or they were incredibly complex pieces of software that would take a while to get acquainted with.

In the end we decided to implement the endpoint functionality ourselves - we figured that a 50 LOC node/express server would do a fine job for a POC.

Day 4: Implementing the core functionality#

Day 4 was the most straightforward. Let's get this thing to work!

1. The Data Model#

We decided to reproduce the API at level 1-2 since we didn't really have any other endpoints. We used synth to quickly whip up a data model that generates data that looks like responses from the Event API. I won't go into depth on how this works here as it's been covered in other posts. In about 15 minutes of tweaking the synth schema, we generated ~10 MB of data that looks like this:

[
    {
        "arguments": "generate virtual platforms",
        "body": null,
        "created_at": "2019-09-17T14:16:47",
        "description": "Received new order",
        "id": 477306,
        "message": "Received new order",
        "path": "/admin/orders/83672/transactions/#6020",
        "subject_id": 1352997,
        "subject_type": "Order",
        "verb": "closed"
    },
    {
        "arguments": "innovate best-of-breed schemas",
        "body": null,
        "created_at": "2017-05-20T00:04:41",
        "description": "Received new order",
        "id": 370051,
        "message": "Received new order",
        "path": "/admin/orders/82607/transactions/#9154",
        "subject_id": 1226112,
        "subject_type": "Order",
        "verb": "sale_pending"
    },
    {
        "arguments": "incentivize scalable mindshare",
        "body": null,
        "created_at": "2018-02-21T12:51:36",
        "description": "Received new order",
        "id": 599084,
        "message": "Received new order",
        "path": "/admin/orders/01984/transactions/#3595",
        "subject_id": 1050540,
        "subject_type": "Order",
        "verb": "placed"
    }
]

We then dumped it all in a MongoDB collection with the one-liner:

$ synth generate shopify --to mongodb://localhost:27017/shopify --size 40000

Next step is to re-create the Event API endpoint.

2. Creating the API#

Creating the API was pretty straightforward. We wrote a prisma model for the responses which basically worked out of the box with the data dumped into MongoDB by synth. This gave us all the filtering we needed basically for free.

Then we wrote a quick and dirty express server that maps the REST endpoint's query strings into a query for prisma. The whole thing turned out to be ~90 LOC. You can check out the source here.

Day 5: Packaging#

[Image: Docker]

The data is ready, the API is ready, time to package this thing up and give it to people to actually use. Let's see if our experiment was a success.

While strategising about distributing our API, we were optimising for two things:

  1. Ease of use - how simple it is for someone to download this thing and get going
  2. Time - we have 2-3 hours to make sure this thing is packaged and ready to go

in that order.

We needed to pack the data, database and a node runtime to actually run the server. Our initial idea was to use docker-compose with 2 services, the database and web-server, and then the network plumbing to get it to work. After discussing this for a few minutes, we decided that docker-compose may be an off-ramp for some users as they don't have it installed or are not familiar with how it works. This went against our first tenet which is 'ease of use'.

So we decided to take the slightly harder and hackier route of packaging the whole thing in a single Docker container. It seemed like the best trade-off between goals 1 and 2.

There were 6 steps to getting this thing over the line:

  1. Start with the MongoDB base image. This gives us a Linux environment and a database.
  2. Download and install NodeJS runtime in the container.
  3. Download and install synth in the container.
  4. Copy over the JavaScript sources and the synth data model
  5. Write a small ENTRYPOINT shell script that starts mongod, generates the data and starts the server
  6. Start the server and expose port 3000

And we're done! We've hackily (but happily) packaged our mock API into a platform-agnostic one-liner.

Was it a success?#

An important aspect of this experiment was to see if we could conceive, research, design and implement a PoC in a week (as a side project, we were working on synth at the same time). I can safely say this was a success! We got it done to spec.

An interesting thing to note is that 60% of the time was spent on ideating, researching and planning - and only 40% on the actual implementation. However, spending all that time planning before writing code definitely saved us a bunch of time; had we not planned so much, the project would have overshot or failed.

Now, whether the PoC itself was a success is a different question. This is where you come in. If you're using the Event API, build the image and play around with it.

You can get started by quickly cloning our git repository and then:

cd shopify && docker build -t shopify-mock . && docker run --rm -p 3000:3000 shopify-mock

then simply:

curl "localhost:3000/admin/api/2021-07/events.json"

We'd like to keep iterating on the Shopify API and improve it. If there is interest we'll add more endpoints and improve the existing Event data model.

If you'd like to contribute, or are interested in mocks for APIs other than Shopify, feel free to open an issue on GitHub!

· 19 min read

Seeding test databases in 2021 - best practices

In this tutorial, we'll learn how to design the Prisma data model for a basic message board and how to seed databases with the open-source tool synth and generate mock data to test our code.

The code for the example we are working with here can be accessed in the examples repository on GitHub.

Data modeling is not boring#

What is a data model?#

Data modeling (in the context of databases and this tutorial) refers to the practice of formalizing a collection of entities, their properties and relations between one another. It is an almost mathematical process (borrowing a lot of language from set theory) but that should not scare you. When it comes down to it, it is exceedingly simple and quickly becomes more of an art than a science.

The crux of the problem of data modeling is to summarize and write down what constitutes useful entities and how they relate to one another in a graph of connections.

You may wonder what constitutes a useful entity. It is indeed the toughest question to answer. It is very difficult to tackle it without a good combined idea of what you are building, the database you are building on top of and what the most common queries, operations and aggregate statistics are. There are many resources out there that will guide you through answering that question. Here we'll start with the beginning: why is it needed?

Why do I need a data model?#

Oftentimes, getting the data model of your application right is crucial to its performance. A bad data model for your backend can mean it gets crippled by seemingly innocuous tasks. On the other hand, a good grasp of data modeling will make your life as a developer 1000 times easier. A good data model is not a source of constant pain; it lets you develop and expand without slowing you down. It is just one of those things that pays compounding returns.

Plus, there are nowadays many open-source tools that make building applications on top of data models really enjoyable. One of them is Prisma.

Prisma is awesome#

Prisma is an ORM: an object-relational mapper. It is a powerful framework that lets you specify your data model using a database-agnostic domain specific language (called the Prisma schema). It uses pluggable generators to build a nice JavaScript API and TypeScript bindings for your data model. Hook that up to your IDE and you get amazing code completion that is tailored to your data model, in addition to a powerful query engine.

Let's walk through an example. We want to get a sense for what it'll take to design the data model for a simple message board, a little like Reddit or YCombinator's Hacker News. At the very minimum, we want to have a concept of users: people should be able to register for an account. Beyond that, we need a concept of posts: some structure, attached to users, that holds the content they publish.

Using the Prisma schema language, which is very expressive even if you haven't seen it before, our first go at writing down a User entity might look something like this:

model User {
  objectId  Bytes    @id @map("_id")
  id        Int      @unique @default(autoincrement())
  createdAt DateTime @default(now())
  email     String   @unique
  nickname  String
  posts     Post[]
}

In other words, our User entity has properties id (a database internal unique identifier), createdAt (a timestamp, defaulting to now if not specified, that marks the creation time of the user's account), email (the user-specified email address, given on registration) which is required to be unique (no two users can share an email address) and nickname (the user specified display name, given on registration).

In addition, it has a property posts which links a user with its posts through the Post entity. We may come up with something like this for the Post entity:

model Post {
  objectId  Bytes    @id @map("_id")
  id        Int      @unique @default(autoincrement())
  postedAt  DateTime @default(now())
  title     String
  author    User     @relation(fields: [authorId], references: [id])
  authorId  Int
}

In other words, our Post entity has properties id (a database internal unique identifier); postedAt (a timestamp, defaulting to now if not specified, that marks the time at which the user created the post and published it); title (the title of the post); and author and authorId, which specify a one-to-many relationship between users and posts.

note

You may have noticed that the User and Post models have an attribute which we haven't mentioned. The objectId property is an internal unique identifier used by mongoDB (the database we're choosing to implement our data model on in this tutorial).

Let's look closer at these last two properties author and authorId. There is a significant difference between them with respect to how they are implemented in the database. Remember that, at the end of the day, our data model will need to be realized into a database. Because we're using Prisma, a lot of these details are abstracted away from us. In this case, the prisma code-generator will handle author and authorId slightly differently.

The @relation(...) attribute on the author property is Prisma's way of declaring that authorId is a foreign key field. Because the type of the author property is a User entity, Prisma understands that posts are linked to users via the foreign key authorId which maps to the user's id, the associated primary key. This is an example of a one-to-many relation.

How that relation is implemented is left to Prisma and depends on the database you choose to use. Since we are using mongodb here, this is implemented by direct object id references.

Because our data model encodes the relation between posts and users, looking up a user's posts is inexpensive. This is the benefit of designing a good data model for an application: operations you have designed and planned for at this stage, are optimized for.

To get us started using this Prisma data model in an actual application, let's create a new npm project in an empty directory:

$ npm init

When prompted to specify the entry point, use src/index.js. Install some nice typescript bindings for node with:

$ npm install --save-dev @types/node typescript

Then you can initialize the typescript compiler with

$ npx tsc --init

This creates a tsconfig.json file which configures the behavior of the typescript compiler. Create a directory src/ and add the following index.ts:

import {PrismaClient} from '@prisma/client'
const prisma = new PrismaClient()
const main = async () => {
    const user = await prisma.user.findFirst()
    if (user === null) {
        throw Error("No user data.")
    }
    console.log(`found username: ${user.nickname}`)
    process.exit(0)
}

main().catch((e) => {
    console.error(e)
    process.exit(1)
})

Then create a prisma/ directory and add a schema.prisma file containing the Prisma code for the two entities User and Post.

Finally, to our schema.prisma file, we need to add configuration for our local dev database and the generation of the client:

datasource db {
  provider = "mongodb"
  url      = "mongodb://localhost:27017/board"
}

generator client {
  provider = "prisma-client-js"
  previewFeatures = ["mongodb"]
}

Head over to the repository to see an example of the complete file, including the extra configuration.

To build the Prisma client, run

$ npx prisma generate

Finally, to run it all, edit your package.json file (at the root of your project's directory). Look for the "scripts" field and modify the "test" script with:

{
  ...
  "test": "tsc --project ./ && node ."
  ...
}

Now all we need is for an instance of mongoDB to be running while we're working. We can run that straight from the official docker image:

$ docker run -d --name message-board-example -p 27017:27017 --rm mongo

To run the example do

$ npm run test
> message-board-example@1.0.0 test /tmp/message-board-example
> tsc --project ./ && node .
Error: No user data.

You should see something close to the output of the snippet: our simple code failed because it is looking for a user that does not exist (yet) in our dev database. We will fix that in a little bit. But first, here's a secret.

The secret to writing good code#

Actually it's no secret at all. It is one of those things that everybody with software engineering experience knows. The key to writing good code is learning from your mistakes!

When coding becomes tedious is when it is hard to learn from errors. Usually this is caused by a lengthy process to go from writing the code to testing it. This can happen for many reasons: having to wait for the deployment of a backend in docker compose, sitting idly by while your code compiles just to fail at the end because of a typo, the strong integration of a system with components external to it, and many more.

The process that goes from the early stages of designing something to verifying its functionalities and rolling it out, that is what is commonly called the development cycle.

It should indeed be a cycle. Once the code is out there, deployed and running, it gets reviewed for quality and purpose. More often than not this happens because users break it and give feedback. The outcome of that gets folded in planning and designing for the next iteration or release. The agile philosophy is built on the idea that this cycle should be as short as possible.

So that brings the question: how do you make the development cycle as quick as possible? The faster the cycle is, the better your productivity becomes.

Testing, testing and more testing#

One of the keys to shortening a development cycle is making testing easy. When playing with databases and data models, it is something that is often hacky. In fact there are very few tools that let you iterate quickly on data models, much less developer-friendly tools.

The core issue at hand is that between iterations on ideas and features, we will need to make small and quick changes to our data model. What happens to our databases and the data in them in that case? Migration is sometimes an option but is notoriously hard and may not work at all if our changes are significant.

For development purposes the quickest solution is seeding our new data model with mock data. That way we can test our changes quickly and idiomatically.

Generate data for your data model#

At Synth we are building a declarative test data generator. It lets you write your data model in plain zero-code JSON and seed many relational and non-relational databases with mock data. It is completely free and open-source.

Let's take our data model and seed a development mongoDB database instance with Synth. Then we can make our development cycle very short by using an npm script that sets it all up for us whenever we need it.

Installing synth#

We'll need the synth command-line tool to get started. From a terminal, run:

$ curl -sSL https://getsynth.com/install | sh

This will run you through an install script for the binary release of synth. If you prefer installing from source, we've got you covered: head on over to the Installation pages of the official documentation.

Once the installer script is done, try running

$ synth version
synth 0.5.4

to make sure everything works. If it doesn't work, add $HOME/.local/bin to your $PATH environment variable with

$ export PATH=$HOME/.local/bin:$PATH

and try again.

Synth schema#

Just like Prisma and its schema DSL, synth lets you write down your data model with zero code.

There is one main difference: the synth schema is aimed at the generation of data. This means it lets you specify the semantics of your data model in addition to its entities and relations. The synth schema has an understanding of what an email, a username, an address are; whereas the Prisma schema only cares about top-level types (strings, integers, etc).

Let's navigate to the top of our example project's directory and create a new directory called synth/ for storing our synth schema files.

├── package.json
├── package-lock.json
├── tsconfig.json
├── prisma/
├── synth/
└── src/

Each file we will put in the synth/ directory that ends in .json will be opened by synth, parsed and interpreted as part of our data model. The structure of these files is simple: each one represents a collection in our database.

Collections#

A collection is a single JSON schema file, stored in a namespace directory. Because collections are formed of many elements, their Synth schema type is that of arrays.

To get started, let's create a User.json file in the synth/ directory:

{
  "type": "array",
  "length": 1,
  "content": {
    "type": "null"
  }
}

Then run

$ synth generate synth/
{"users":[null]}

Let's break this down. Our User.json collection schema is a JSON object with three fields. The "type" represents the kind of generator we want. As we said above, collections must generate arrays. The "length" and "content" fields are the parameters we need to specify an array generator. The "length" field specifies how many elements the generated array must have. The "content" specifies from what the elements of the array are generated.

For now the value of "content" is a generator of the null type, which is why our array has null as its single element. But we will soon change this.

Note that the value of "length" can itself be another generator. Of course, because the length of an array is a non-negative number, it cannot be just any generator. But it can be any kind that generates non-negative numbers. For example:

{
  "type": "array",
  "length": {
    "type": "number",
    "range": {
      "low": 5,
      "high": 10,
      "step": 1
    }
  },
  "content": {
    "type": "null"
  }
}

This now makes our users collection variable length. Its length will be decided by the result of generating a new random integer between 5 and 10.

If you now run

$ synth generate synth/
{"users":[null,null,null,null,null]}

you can see the result of that change.

info

By default synth fixes the seed of its internal PRNG. This means that, by default, running synth many times on the same input schemas will give the same output data. If you want to randomize the seed (and thus the result), simply add the --random flag:

$ synth generate synth/ --random
{"users":[null,null,null,null,null,null,null]}
$ synth generate synth/ --random
{"users":[null,null,null,null,null,null,null,null,null]}

Schema nodes#

Before we can get our users collection to match our User Prisma model, we need to understand how to generate more kinds of data with synth.

Everything that goes into a schema file is a schema node. Schema nodes can be identified by the "type" field which specifies which kind of node it is. The documentation pages have a complete taxonomy of schema nodes and their "type".

Generating ids#

Let's look back at our User model. It has four properties:

  • id
  • createdAt
  • email
  • nickname

Let's start with id. How can we generate that?

The type of the id property in the User model is Int:

  id        Int      @unique @default(autoincrement())

and the @default(autoincrement()) attribute indicates that the field is meant to increment sequentially, going through values 0, 1, 2, etc.

The synth schema type for numbers is number. Within number there are three varieties of generators: range, constant, and id.

What decides the variant is the presence of the "range", "constant", or "id" field in the node's specification.

For example, a range variant would look like

{
    "type": "number",
    "range": {
        "low": 5,
        "high": 10,
        "step": 1
    }
}

whereas a constant variant would look like

{
    "type": "number",
    "constant": 42
}

For the id field we should use the id variant, which is auto-incrementing. Here is an example of id used in an array so we can see it behaves as expected:

{
    "type": "array",
    "length": 10,
    "content": {
        "type": "number",
        "id": {}
    }
}

Generating emails#

Let us now look at the email field of our User model:

  email     String   @unique

Its type in the data model is that of a String. The synth schema type for that is string.

There are many different variants of string, and they are all exhaustively documented. The different variants are identified by the presence of a distinguishing field, such as "faker", "pattern", or "date_time".
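
For instance, the "pattern" variant (which shows up again later in this series when synth infers a schema from existing data) generates strings matching a regular expression. A minimal sketch, with a made-up pattern:

```json synth
{
    "type": "string",
    "pattern": "[A-Z]{2}-[0-9]{4}"
}
```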

Since we are interested in generating email addresses, we will be using the "faker" variant which leverages a preset collection of generators for common properties like usernames, addresses and emails:

{
    "type": "string",
    "faker": {
        "generator": "safe_email"
    }
}

Generating objects#

OK, so we now know how to generate the id and the email properties of our User model. But we do not yet know how to put them together in one object. For that we need the object type:

{
    "type": "object",
    "id": {
        "type": "number",
        "id": {}
    },
    "email": {
        "type": "string",
        "faker": {
            "generator": "safe_email"
        }
    }
}

Leverage the docs#

Now we have everything we need to finish writing down our User model as a synth schema. A quick lookup of the documentation pages will tell us how to generate the createdAt and nickname fields.

Here is the finished result for our User.json collection:

```json synth[expect = "unknown variant `date_time`"]
{
  "type": "array",
  "length": 3,
  "content": {
    "type": "object",
    "id": {
      "type": "number",
      "id": {}
    },
    "createdAt": {
      "type": "string",
      "date_time": {
        "format": "%Y-%m-%d %H:%M:%S",
        "begin": "2020-01-01 12:00:00"
      }
    },
    "email": {
      "type": "string",
      "faker": {
        "generator": "safe_email"
      }
    },
    "nickname": {
      "type": "string",
      "faker": {
        "generator": "username"
      }
    }
  }
}
```


:::caution
[`date_time`][synth-datetime] is now a generator on its own and is no longer a subtype of the `string` generator
:::
Making sure our constraints are satisfied#

Looking back at the [`User` model](#prisma-is-awesome) we started from, there's one thing that we did not quite address yet. The `email` field in the Prisma schema has the `@unique` attribute:
```graphql
  email     String   @unique
```

This means that, in our data model, no two users can share the same email address. Yet, we haven't added that constraint anywhere in our final synth schema for the User.json collection.

What we need to use here is modifiers. A modifier is an attribute that we can add to any synth schema type to modify the way it behaves. There are two modifiers currently supported: optional and unique.

The optional modifier is an easy way to make a schema node randomly generate something or nothing:

{
    "type": "number",
    "optional": true,
    "constant": 42
}

The unique modifier, on the other hand, is an easy way to enforce the constraint that the generated values contain no duplicates. So all we need to do to represent our data model correctly is to add the unique modifier to the email field:

{
    "type": "string",
    "unique": true,
    "faker": {
        "generator": "safe_email"
    }
}

The completed end result for the User.json collection can be viewed on GitHub here.

How to deal with relations#

Now that we have set up our User.json collection, let's turn our attention to the Post model and write out the synth schema for the Post.json collection.

Here is the end result:

```json synth[expect = "unknown variant `date_time`"]
{
  "type": "array",
  "length": 5,
  "content": {
    "type": "object",
    "id": {
      "type": "number",
      "id": {}
    },
    "postedAt": {
      "type": "string",
      "date_time": {
        "format": "%Y-%m-%d %H:%M:%S",
        "begin": "2020-01-01 12:00:00"
      }
    },
    "title": {
      "type": "string",
      "faker": {
        "generator": "bs"
      }
    },
    "authorId": "@User.content.id"
  }
}
```


:::caution
[`date_time`][synth-datetime] is now a generator on its own and is no longer a subtype of the `string` generator
:::
It all looks pretty similar to the `User.json` collection, except for one important difference at the line
```json synth
    "authorId": "@User.content.id"
```

The @... syntax is synth's way of specifying relations between collections. Here we are creating a many-to-one relation between the authorId field of the Post.json collection and the id field of the User.json collection.
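
To make the effect concrete, here is a hand-written sketch of what the generated output could look like (the values are illustrative and most fields are omitted for brevity, so this is not actual synth output): every authorId is drawn from the pool of generated User ids, so several posts may point at the same user.

```json
{
  "users": [
    { "id": 1, "email": "lucia.perez@example.org" },
    { "id": 2, "email": "tom.harris@example.org" }
  ],
  "posts": [
    { "id": 1, "title": "synergize web-enabled platforms", "authorId": 2 },
    { "id": 2, "title": "leverage dynamic paradigms", "authorId": 1 },
    { "id": 3, "title": "empower scalable channels", "authorId": 2 }
  ]
}
```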

The final Post.json collection schema can be viewed on GitHub here.

Synth generate#

Now that our data model is implemented in Synth, we're ready to seed our test database with mock data. Here we'll use the official mongo Docker image, but if you are using a relational database like Postgres or MySQL, you can follow the same process.

To start the mongo image in the background (if you haven't done so already), run

$ docker run -d -it -p 27017:27017 --rm mongo

Then, to seed the database with synth just run

$ synth generate synth/ --size 1000 --to mongodb://localhost:27017/board

That's it! Our test mongo instance is now seeded with the data of around 1,000 users. Head over to the examples repository to see the complete working example.
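
To close the loop on the npm script mentioned at the beginning: one way to wrap the docker and synth commands above, so that a fresh seeded database is a single command away, is a couple of entries in package.json. This is only a sketch (the script names are arbitrary, and in practice you may want to wait for mongo to accept connections before seeding):

```json
{
  "scripts": {
    "db:start": "docker run -d -it -p 27017:27017 --rm mongo",
    "db:seed": "synth generate synth/ --size 1000 --to mongodb://localhost:27017/board",
    "db:reset": "npm run db:start && npm run db:seed"
  }
}
```

Running npm run db:reset then rebuilds the development database whenever we need it.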

What's next#

Synth is completely free and built in the open by an amazing and fast-growing community of contributors.

Join us in our mission to make test data easy and painless! We also have a very active Discord server where many members of the community would be happy to help if you encounter an issue!

ยท 8 min read
Andre Bogus

I recently wrote the most complex procedural Rust macro Iโ€™ve ever attempted. This post tries to outline the problems Iโ€™ve encountered and tells how I overcame them.

The Background#

With synth, we are building a declarative command line test data generator. For now, the specification that declares what test data to build is just JSON that gets deserialised to our data structures using serde_json. This was a quick and easy way to configure our application without much overhead in terms of code. For example:

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 3
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "id": {}
        },
        "name": {
            "type": "string",
            "faker": {
                "generator": "name"
            }
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "ascii_email"
            }
        }
    }
}

However, itโ€™s also not very nice to write (for example JSON has no comments, no formulas, etc.), so we wanted to bind our specification to a scripting language. Our end goal is to extend the language (both in terms of builtin functions and syntax) to make the configuration really elegant. After some testing and benchmarking different runtimes, our choice fell on koto, a nice little scripting language that was built foremost for live coding.

Unfortunately, koto has a very bare interface to bind to external Rust code. Since we are talking about a rather large number of types we want to include, it was clear from the start that we would want to generate the code to bind to koto.

Early Beginnings#

So I started with a somewhat simple macro to wrangle koto types (e.g. Maps) into our Rust types. However, I soon found that the marshalling overhead would have been fine for an initial setup phase, but not for recurrent calls into koto (for example for filter functions called for each row). Thus I changed my approach to try and bind functions, then extended that to types and impl blocks.

I found โ€“ as I then thought โ€“ a genius technique of generating functions that would call each other, thus daisy-chaining a series of code blocks into one that could then be called with another bindlang_main!() proc macro:

static FN_NUMBER: AtomicUsize = AtomicUsize::new(0);
fn next_fn(mut b: Block, arg: &Expr) -> Item {
    let number = FN_NUMBER.fetch_add(1, SeqCst);
    let this_fn = fn_ident(number);
    if let Some(n) = number.checked_sub(1) {
        let last_fn = fn_ident(n);
        b.stmts.push(parse_quote! { #last_fn(#arg); });
    }
    b.stmts.extend(last_call);
    parse_quote! { fn #this_fn(#arg) { #b } }
}
#[proc_macro]
fn bindlang_main(arg: TokenStream) -> TokenStream {
    let arg = ident(arg.to_string());
    TokenStream::from(if let Some(n) = FN_NUMBER.load(SeqCst).checked_sub(1) {
        let last_fn = fn_ident(n);
        quote! { #last_fn(#arg) }
    } else {
        proc_macro2::TokenStream::default()
    })
}

I also wrote a derive macro to implement the marshalling traits. This worked well for a small example that was entirely contained within one module, but failed once the code was spread out through multiple modules: The functions would no longer be in the same scope and therefore stopped finding each other.

Worse, I needed a number of pre-defined maps with functions for method dispatch for our external types within koto. A type in Rust can have an arbitrary number of impl blocks but I needed exactly one table, and I couldnโ€™t simply daisy-chain those.

It was clear I needed a different solution. After thinking long and hard I came to the conclusion that I needed to pull all the code together in one scope, by the bindlang_main!() macro. My idea was that I create a HashMap of syn::Items to be quoted together into one TokenStream. A lazy static Arc<Mutex<Vec<Context>>> was to collect the information from multiple attribute invocations:

#[derive(Default)]
struct Context {
    bare_fns: Vec<MethodSig>,
    modules: HashMap<String, Vec<MethodSig>>,
    vtables: HashMap<String, Vec<MethodSig>>,
    types: HashMap<String, String>,
}

lazy_static::lazy_static! {
    static ref CONTEXT: Arc<Mutex<Context>> = Arc::new(Mutex::new(Context::default()));
}
#[proc_macro_attribute]
pub fn bindlang(_attrs: TokenStream, code: TokenStream) -> TokenStream {
    let code_cloned = code.clone();
    let input = parse_macro_input!(code_cloned as Item);
    // evaluate input here, and store information in Context
    // CONTEXT.lock().unwrap()...
    code
}

This was when I found out that none of syn's types are Send and therefore cannot be stored within a Mutex. My first attempt to circumvent this was moving everything to Strings and using syn::parse_str to get the items back out. This failed because of macro hygiene: each identifier in Rust proc_macros has an identity. Two identifiers resulting from two macro operations will get different identities, no matter if their textual representation is the same.

I also found that proc_macro_derives have no way to get the #[derive(..)] attribute of the type, and I wanted to also bind derived trait implementations (at least for Default, because some types have no other constructors). So I removed the derive and moved the implementation to the #[bindlang] attribute macro, which now works on types, impl blocks and fns.

Beware: This makes use of the (unspecified, but as of now working) top-down order of macro expansion to work!

Dirty tricks avoided#

There is a Span::mixed_site() constructor that will yield semi-hygienic macros (like with macro_rules!). However, this looked risky (macro hygiene is there to protect us, so we had better have a good reason to override it), so I took the data-oriented approach, collecting the info I needed in the lazy_static and walking it within bindlang_main!() to create the code. I still tried to generate the trait impls for marshalling directly in the attribute macro, but this again ran into macro hygiene trouble, because I could not recreate the virtual dispatch table identifiers. After moving this part to the main macro, too, the macro finally expanded successfully.

Except it didnโ€™t compile successfully.

I had forgotten to use the items I was creating code for in the macro, and koto requires all external types to implement Display. So I added those imports as macro arguments and added the Display impls to be met with a type inference error within the macro invocation. Clearly I needed some type annotations, but the error message only showed me the macro invocation, which was pretty unhelpful.

Expanding Our Vision#

My solution to debug those was to cargo expand the code, comment out the macro invocation, and copy the expanded portions in their place so that my IDE would pinpoint the error for me. I had to manually un-expand some format! invocations so the code would resolve correctly, and finally found where I needed more type annotations. With those added, the code finally compiled. Whew!

I then extended the bindings to also cover trait impls and Options, while my colleague Christos changed the code that marshals Rust values into koto values to map Result::Err(_) into koto's runtime errors. Remembering that implicit constructors (structs and enum variants) are also useful, I added support for binding those if they are public. There was another error where the generated intermediate code wouldn't parse, but some eprintln! debugging helped pinpoint the piece of code where it happened.

When trying to bind functions taking a reference argument other than self (e.g. fn from(value: &Value) -> Self), I found that the bindings would not work, because my FromValue implementation could not get references. Remember, a function in Rust cannot return a borrow into a value that lives only within it. It took me a while to remember that I blogged about the solution in 2015! Closures to the rescue! The basic idea is to have a function that takes a closure with a reference argument and returns the result of that closure:

pub trait RefFromValue {
    fn ref_from_value<R, F: Fn(&Self) -> R>(
        key_path: &KeyPath<'_>,
        value: &Value,
        f: F,
    ) -> Result<R, RuntimeError>;
}

Having this in a separate trait allows us to distinguish types where the borrow isn't &T, e.g. for &str. We also gain a bit of simplicity by using unified function call syntax (MyType::my_fn(..) instead of v0.my_fn()). This also meant I had to nest the argument parsing: I did this by creating the innermost Expr and wrapping it in argument extractors in reverse argument order:

let mut expr: Expr = parse_quote! { #path(#(#inner_idents),*) };
for (i, ((a, v), mode)) in idents.iter().zip(args.iter()).enumerate().rev() {
    expr = if mode.is_ref() {
        parse_quote! {
            ::lang_bindings::RefFromValue::ref_from_value(
                &::lang_bindings::KeyPath::Index(#i, None),
                #a,
                |#v| #expr
            )?
        }
    } else {
        parse_quote! {
            match (::lang_bindings::FromValue::from_value(
                &::lang_bindings::KeyPath::Index(#i, None),
                #a
            )?) {
                (#v) => #expr
            }
        }
    };
}

Note that match in the else part is there to introduce a binding without requiring a Block, a common macro trick. Now all that was left to do was add #[bindlang] attributes to our Namespace and its contents, and also add a lot of Display implementations because koto requires this for all ExternalValue implementors.

In conclusion, our test configuration should now look something like:

synth.Namespace({
    synth.Name("users"): synth.Content.Array(
        {
            synth.Name("id"): synth.Content.Number(NumberContent.Id(schema.Id)),
            synth.Name("name"): synth.Content.String(StringContent.Faker("firstname", ["EN"])),
            synth.Name("email"): synth.Content.String(StringContent.Faker("email", ["EN"])),
        },
        synth.Content.Number(NumberContent.Constant(10))
    )
})

That's only the beginning: I want to introduce a few coercions, a custom koto prelude and perhaps some syntactic sugar to make this even easier to both read and write.

The Takeaway#

Macros that collect state to use later are possible (well, as long as the expansion order stays as it is) and useful, especially where code is strewn across various blocks (or even files or modules). However, if code relies on other code, it best be emitted in one go, otherwise module visibility and macro hygiene conspire to make life hard for the macro author. And if at one point the expansion order gets changed in a way that breaks the macro, I can change it to a standalone crate to be called from build.rs thanks to proc_macro2 being decoupled from the actual implementation.

ยท 5 min read

I get it. Almost everyone has done it. Your usersโ€™ data is sitting there ready to use, just copy-paste that to your dev environment and there you have some good test data for yourself. Job done - easy - you can move on to the next task.

Worse even, when you need to make sure that your dev/staging data is up to date - you spend your day setting up a cron job that routinely syncs production data to your dev or UAT environments.

If youโ€™re getting an uneasy feeling every time you do this - youโ€™re not alone. So why do people still do it? Hereโ€™s what I found to be the dominant reasons for this, and my thoughts on why those reasons are bunk.

Why people do it (and why you shouldnโ€™t)#

1. Itโ€™s always been done this way so why change now#

This is rarely a valid reason to do anything, yet sometimes we inherit a process that deep down we know is flawed, but perhaps feel like we don't have enough knowledge or authority to challenge.

I've been there. When you are new at a company, telling more senior people they're doing things wrong seems intimidating, but if you have a viable alternative (more on that below) you can suggest (and ideally deliver on), your co-workers will appreciate it.

2. No one will find out#

Having been in the data privacy space for a while now, it is painfully obvious how often someone actually finds out. One of the big improvements that GDPR has brought about is that it put a spotlight on how frequent and widespread data leaks and breaches really are - not a day goes by without new ones being reported. In light of this, hoping that this doesn't affect you is not an ideal strategy.

My colleague Andre recently posted a great thread about this.

3. Itโ€™s just for myself and a bunch of my co-workers, whatโ€™s the worst that can happen#

By taking data to a less secure environment, you waste all the effort you put into the security controls of your production environments. Proper privacy measures go one step beyond and look at what happens if there is a data breach, whether due to a mistake or an attack: by minimising the number of people who work with real (sensitive) data, you complement your other data security measures.

Sensitive data should only really be accessed when absolutely necessary - thatโ€™s in the spirit of building proper privacy-respecting culture.

4. There is just no fโ€ฆing time to do this properly#

It can be really painful to pseudonymise data, especially if your data model is very large and complex. It can take even more time to set up data obfuscation pipelines that keep your test data up to date - as if this wasnโ€™t enough - all these scripts keep breaking as your data model evolves. Basically, in so many cases, the task of obfuscating data while preserving its complexity and referential integrity is just too much hassle. Shameless plug: I have some good news and some better news.

Synth is a declarative data generator that enables you to quickly generate realistic-looking test data. Better still, it's completely free and open source - we built it with developers in mind and really hope it can help address this problem once and for all. Our goal is to make it possible for you to get going with realistic-looking fake data in no time. 1

5. Weโ€™re already GDPR compliant, I can do what I want with the usersโ€™ data#

Let me preface everything I say by stating that I am not a lawyer, but I have been in this space for a while now and have spoken to numerous consultants, lawyers and experts on this matter. GDPR can be vague on certain things, but in my opinion using production data for testing is clearly a no-go. (See more on this in the notes). If things go wrong it will be very hard to justify to the regulators why sensitive data was copied to and used in its raw form in testing environments.2

Final thoughts#

Last but not least, trust is 🔑 - how you treat your users' data defines your relationship with them. Being responsible with the data your users trust you with is incredibly important in building a good and healthy culture for your company or project. Trust is hard to build and easily lost. This may seem trivial, but these "small" decisions define the future of your company, not the "culture" page on your website.

Being responsible is a team effort, but it pays off long-term. 3

Notes#


  1. Synth goes one step beyond pseudonymisation - it creates completely fake data that looks and quacks like the real data. Data generated with Synth can be shared more freely.↩
  2. GDPR Articles 25 and 32 refer to the requirements around implementing pseudonymisation.↩
  3. We often get asked what is the right stage of the project/company to start using something like Synth. The best answer is right from the start, the second best is right now. With our declarative โ€œdata as codeโ€ framework, your test data generation will evolve alongside your data model.โ†ฉ

ยท 7 min read

Introduction#

Developing high quality software inevitably requires some testing data.

You could be:

  • Integration testing your application for correctness and regressions
  • Testing the bounds of your application in your QA process
  • Testing the performance of queries as the size of your dataset increases

Whatever the case, the software development lifecycle requires testing data as an integral part of the developer workflow. In this article, we'll explore 3 different methods for generating test data for a Postgres database.

Setup#

In this example we'll be using Docker to host our Postgres database.

To get started, you'll need to install Docker and start a container running Postgres:

% docker run -p 5432:5432 -d -e POSTGRES_PASSWORD=1234 -e POSTGRES_USER=postgres -e POSTGRES_DB=dev postgres

As you can see, we've set very insecure default credentials. This is not meant to be a robust / productionised instance, but it'll do for our testing harness.

Our Schema#

In this example we'll setup a very simple schema. We're creating a basic app where we have a bunch of companies, and those companies have contacts.

CREATE TABLE companies(
   company_id SERIAL PRIMARY KEY,
   company_name VARCHAR(255) NOT NULL
);

CREATE TABLE contacts(
   contact_id SERIAL PRIMARY KEY,
   company_id INT,
   contact_name VARCHAR(255) NOT NULL,
   phone VARCHAR(25),
   email VARCHAR(100),
   CONSTRAINT fk_company
      FOREIGN KEY(company_id)
      REFERENCES companies(company_id)
);

This schema captures some business logic of our app. We have unique primary keys, we have foreign key constraints, and we have some domain-specific data types which have 'semantic meaning'. For example, the random string _SX ร† A-ii is not a valid phone number.

Let's get started.

Manual Insertion#

The first thing you can do which works well when you're starting your project is to literally manually insert all the data you need. This involves just manually writing a SQL script with a bunch of INSERT statements. The only thing to really think about is the insertion order so that you don't violate foreign key constraints.

INSERT INTO companies(company_name)
VALUES('BlueBird Inc'),
      ('Dolphin LLC');

INSERT INTO contacts(company_id, contact_name, phone, email)
VALUES(1,'John Doe','(408)-111-1234','john.doe@bluebird.dev'),
      (1,'Jane Doe','(408)-111-1235','jane.doe@bluebird.dev'),
      (2,'David Wright','(408)-222-1234','david.wright@dolphin.dev');

So here we're inserting directly into our database. This method is straightforward but does not scale when you need more data or the complexity of your schema increases. Also, testing for edge cases requires hard-coding those edge cases into the inserted data, resulting in a linear amount of work for the bugs you want to catch.

| contact_id | company_id | contact_name | phone | email |
|---|---|---|---|---|
| 1 | 1 | John Doe | (408)-111-1234 | john.doe@bluebird.dev |
| 2 | 1 | Jane Doe | (408)-111-1235 | jane.doe@bluebird.dev |
| 3 | 2 | David Wright | (408)-222-1234 | david.wright@dolphin.dev |

Using generate_series to automate the process#

Since you're a programmer, you don't like manual work. You like things to be seamless and most importantly automated!

Postgres comes with a handy function called generate_series which, ...drum roll... generates series! We can use this to generate as much data as we want without writing it by hand.

Let's use generate_series to create 100 companies and 100 contacts:

INSERT INTO companies(company_name)
SELECT md5(random()::text)
FROM generate_series(1,100);

INSERT INTO contacts(company_id, contact_name, phone, email)
SELECT id, md5(random()::text), md5(random()::text)::varchar(20), md5(random()::text)
FROM generate_series(1,100) id;
| contact_id | company_id | contact_name | phone | email |
|---|---|---|---|---|
| 1 | 1 | 81cc02c106b7c30d4e2b032c91cdb75a | d056f1eee1dca55db03c | cd0da2eef81aaa02d6ba15ef4551fb9f |
| 2 | 2 | d2b0112bc9bbec85c5229a4b4f28a350 | 07ba86b1dc24cdadfd24 | 7404f5b502084563f2ac20c29ed0e584 |
| 3 | 3 | 64005702ecaff9f489e8074d6a718aae | 50db9534b58e0616cd34 | 3ea36293665aa1ac38e7d6371893046a |
| 4 | 4 | 202e87bc3d0c8c080048b2c0138c709b | 65f6ea317bd0f2c950dc | 8b8d9b92916f4cf77c38308f6ac4391b |
| 5 | 5 | 8b2fd25d7b95158df5af671cb3255755 | 3e6ddc67aabe7164ce9a | ed32035400a7500203352f3597d2548f |

We generated 100 companies and contacts here; the types are correct, but the output is underwhelming. First of all, every company has exactly 1 contact, and more importantly the actual data looks completely useless.

If you care about your data being semantically correct (i.e. text in your phone column actually being a phone number) we need to get more sophisticated.

We could define functions ourselves to generate names / phone numbers / emails etc, but why re-invent the wheel?

Using a data generator like Synth#

Synth is an open-source project designed to solve the problem of creating realistic testing data. It has integration with Postgres, so you won't need to write any SQL.

Synth uses declarative configuration files (just JSON, don't worry) to define how data should be generated. To install the synth binary, refer to the installation page.

The first step to use Synth is to create a workspace. A workspace is just a directory in your filesystem that tells Synth this is where you are going to be storing configuration:

$ mkdir workspace && cd workspace && synth init 

Next we want to create a namespace (basically a stand-alone data model) for this schema. We do this by simply creating a subdirectory and Synth will treat it as a separate schema:

$ mkdir my_app

Now comes the fun part! Using Synth's configuration language we can specify how our data is generated. Let's start with the smaller table companies.

To tell Synth that companies is a table (or collection in the Synth lingo) we'll create a new file my_app/companies.json.

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "number",
            "id": {}
        },
        "company_name": {
            "type": "string",
            "faker": {
                "generator": "company_name"
            }
        }
    }
}

Here we're telling Synth that we have 2 columns, company_id and company_name. The first is a number, the second is a string and the contents of the JSON object define the constraints of the data.

If we sample some data using this data model we get the following:

$ synth generate my_app/ --size 2
{
  "companies": [
    {
      "company_id": 1,
      "company_name": "Campbell Ltd"
    },
    {
      "company_id": 2,
      "company_name": "Smith PLC"
    }
  ]
}

Now we can do the same thing for the contacts table by creating a file my_app/contacts.json. Here we have the added complexity of a foreign key constraint to the companies table, but we can solve it easily using Synth's same_as generator.
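
In isolation, the same_as node is just a pointer to another field, addressed by its path in the namespace; this is the fragment you will find embedded in the full schema below:

```json
{
    "type": "same_as",
    "ref": "companies.content.company_id"
}
```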

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "same_as",
            "ref": "companies.content.company_id"
        },
        "contact_name": {
            "type": "string",
            "faker": {
                "generator": "name"
            }
        },
        "phone": {
            "type": "string",
            "faker": {
                "generator": "phone_number",
                "locales": ["FR_FR"]
            }
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "safe_email"
            }
        }
    }
}

There is quite a bit going on here - to get an in-depth understanding of the synth configuration, I'd recommend reading the comprehensive docs. There are tons of cool features which this schema doesn't really showcase!

Now that we have both of our tables' data models under Synth, we can generate data into Postgres:

$ synth generate my_app/ --to postgres://postgres:1234@localhost:5432/dev

Taking a look at the contacts table:

| contact_id | company_id | contact_name | phone | email |
|---|---|---|---|---|
| 1 | 1 | Carrie Walsh | +44(0)117 496 0785 | espinozabetty@hotmail.com |
| 2 | 2 | Brittany Flores | +441632 960 480 | osharp@mcdaniel.com |
| 3 | 3 | Tammy Rodriguez | 01632960737 | brenda82@ward.org |
| 4 | 4 | Amanda Marks | (0808) 1570096 | hwilcox@gonzalez.com |
| 5 | 5 | Kimberly Delacruz MD | +44(0)114 4960207 | pgarcia@thompson.com |
| 6 | 6 | Jordan Williamson | (0121) 4960483 | jamesmiles@weber.org |
| 7 | 7 | Nicholas Williams | (0131) 496 0974 | fordthomas@gmail.com |

Much better :)

Conclusion#

We explored 3 different ways to generate data.

  • Manual Insertion: OK to get you started. If your needs are basic, it's the path of least effort to creating a working dataset.
  • Postgres generate_series: This method scales better than manual insertion - but if you care about the contents of your data and have foreign key constraints you'll need to write quite a bit of bespoke SQL by hand.
  • Synth: Synth has a small learning curve, but for creating realistic testing data at scale it removes most of the manual labour.

In the next post we'll explore how to subset your existing database for testing purposes. And don't worry if you have sensitive / personal data - we'll cover that too.

ยท 12 min read

So we've all been in this situation. You're building a Web App, you're super productive in your stack and you can move quickly - however, generating lots of data to see what your app will look like with enough users and traffic is a pain.

Either you're going to spend a lot of time manually inputting data or you're going to write some scripts to generate that data for you. There must be a better way.

In this post we're going to explore how we can solve this problem using the open-source project Synth. Synth is a state-of-the-art declarative data generator - you tell Synth what you want your data to look like and Synth will generate that data for you.

This tutorial is going to use a simple MERN (Mongo Express React Node) web-app as our test subject, but really Synth is not married to any specific stack.

I'm going to assume you're working on MacOS or Linux (Windows support coming soon 🤞) and that you have NodeJS, Yarn and Docker installed.

For this example we'll be running Synth version 0.3.2.

Getting started#

As a template, we'll use a repository which will give us scaffolding for the MERN app. I picked this example because it shows how to get started quickly with a MERN stack, where the end product is a usable app you can write in 10 minutes. For our purposes, we don't really need to build it from scratch, so let's just clone the repo and avoid writing any code ourselves.

git clone https://github.com/samaronybarros/movies-app.git && cd movies-app

Next, we'll be using docker to run an ephemeral version of our database locally. Docker is great for getting started quickly with popular software, and luckily for us MongoDB has an image on the docker registry. So - let's setup an instance of MongoDB to run locally (no username / password):

docker run -d --name mongo-on-docker -p 27017:27017 mongo

Starting the Web App#

The repository we just cloned contains a working end-to-end web-app running on a MERN stack. It's a super simple CRUD application enabling the user to add / remove some movie reviews which are persisted on a MongoDB database.

The app consists of 2 main components, a nodejs server which lives under the movies-app/server/ sub-directory, and a React front-end which lives under the movies-app/client sub-directory.

The client and server talk to each other using a standard HTTP API under /movie.

So let's get started and run the back-end:

cd server && yarn install && node index.js

And then the client (you'll need two terminals here ๐Ÿคท):

cd client && yarn install && yarn start

Cool! If you navigate to http://localhost:8000/ you should see the React App running ๐Ÿ™‚

Let's add some movies by hand#

Hold the phone. Why are we adding movies by hand since we have a tool to generate data for us?

Well, by adding a little bit of test data by hand, we can then use Synth to infer the structure of the data and create as many movies as we want for us. Otherwise we would have to write the entire data definition (what we call a schema) by hand.

So, let's add a couple of movies manually using the Web UI.

Create Movies

Ok, so now that we have a couple of movies, let's get started with Synth!

Synth#

In the following section we will cover how Synth fits into the Web App development workflow:

  1. First we'll install the Synth binary
  2. Then we'll initialize a Synth workspace in our repo to host our data model
  3. Next we'll ingest data from MongoDB into Synth
  4. And finally generate a bunch of fake data from Synth and back into Mongo

Installing Synth#

To install Synth on MacOS / Linux, visit the docs and choose the appropriate installation for your OS. If you are feeling adventurous, you can even build from source!

Declarative Data Generation#

Synth uses a declarative data model to specify how data is generated.

Hmmm, so what is a declarative model you may ask? A declarative model, as opposed to an imperative model, is where you 'declare' your desired end state and the underlying program will figure out how to get there.

On the other hand, an imperative model (which is what we are mostly used to) is step-by-step instructions on how to get to our end-state. Most popular programming languages like Java or C are imperative - your code is step-by-step instructions on how to reach an end state.

Programming frameworks like SQL or React or Terraform are declarative. You don't specify how to get to your end-state, you just specify what you want and the underlying program will figure out how to get there.

With Synth you specify what your desired dataset should look like, not how to make it. Synth figures how to build it for you ๐Ÿ˜‰

Creating a Workspace#

A workspace represents a set of synthetic data namespaces managed by Synth. Workspaces are marked by a .synth/ sub-directory.

A workspace can have zero or more namespaces, where the namespaces are just represented as sub-directories. All information pertaining to a workspace is in its directory.

So let's create sub-directory called data/ and initialize our Synth workspace.

movies-app $ mkdir data && cd data && synth init

Namespaces#

The namespace is the top-level abstraction in Synth. Namespaces are the equivalent of Schemas in SQL-land. Fields in a namespace can refer to other fields in a namespace - but you cannot reference data across namespaces.

Namespaces in turn, have collections which are kind of like tables in SQL-land. A visual example of the namespace/collection hierarchy can be seen below.

[Diagram: the namespace/collection hierarchy]

To create a namespace, we need to feed some data into Synth.

Feeding Data into Synth#

There are two steps to feed data into Synth from our MongoDB instance:

  1. We need to export data from MongoDB into a format that Synth can ingest. Luckily for us, Synth supports JSON out of the box, so this can be done quite easily with the mongoexport command - a lightweight tool that ships with MongoDB to enable quick dumps of the database via the CLI. We need to specify a little bit more metadata, such as the database we want to export from using --db cinema, the collection using --collection and the specific fields we are interested in --fields name,rating,time. We want the data from mongoexport to be in a JSON array so that Synth can easily parse it, so let's specify the --jsonArray flag.
  2. Next, we need to create a new Synth namespace using the synth import command. synth import supports a --from flag if you want to import from a file, but if this is not specified it will default to reading from stdin. We need to feed the output of the mongoexport command into Synth. To do this we can use the convenient Bash pipe | to redirect the stdout from mongoexport into Synth's stdin.
docker exec -i mongo-on-docker mongoexport \
    --db cinema \
    --collection movies \
    --fields name,rating,time \
    --forceTableScan \
    --jsonArray | synth import cinema --collection movies

Synth runs an inference step on the JSON data that it's fed, trying to infer the structure of the data. Next Synth automatically creates the cinema namespace by creating the cinema/ sub-directory and populates it with the collection movies.json.

$ tree -a data/
data/
├── .synth
│   └── config.toml
└── cinema
    └── movies.json

We can now use this namespace to generate some data:

$ synth generate cinema/
{
  "movies": [
    {
      "_id": {
        "$oid": "2D4p4WBXpVTMrhRj"
      },
      "name": "2pvj5fas0dB",
      "rating": 7.5,
      "time": [
        "TrplCeFShATp2II422rVdYQB3zVx"
      ]
    },
    {
      "_id": {
        "$oid": "mV57kUhvdsWUwiRj"
      },
      "name": "Ii7rH2TSjuUiyt",
      "rating": 2.5,
      "time": [
        "QRVSMW"
      ]
    }
  ]
}

So now we've generated data with the same schema as the original - but the value of the data points doesn't really line up with the semantic meaning of our dataset. For example, the time array is just garbled text, not actual times of the day.

The last step is to tweak the Synth schema and create some realistic-looking data!

Tweaking the Synth schema#

So let's open cinema/movies.json in our favorite text editor and take a look at the schema:

{
  "type": "array",
  "length": {
    "type": "number",
    "subtype": "u64",
    "range": {
      "low": 1,
      "high": 4,
      "step": 1
    }
  },
  "content": {
    "type": "object",
    "time": {
      "type": "array",
      "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
          "low": 1,
          "high": 2,
          "step": 1
        }
      },
      "content": {
        "type": "one_of",
        "variants": [
          {
            "weight": 1.0,
            "type": "string",
            "pattern": "[a-zA-Z0-9]*"
          }
        ]
      }
    },
    "name": {
      "type": "string",
      "pattern": "[a-zA-Z0-9]*"
    },
    "_id": {
      "type": "object",
      "$oid": {
        "type": "string",
        "pattern": "[a-zA-Z0-9]*"
      }
    },
    "rating": {
      "type": "number",
      "subtype": "f64",
      "range": {
        "low": 7.0,
        "high": 10.0,
        "step": 1.0
      }
    }
  }
}

There is a lot going on here but let's break it down.

The top-level object (which represents our movies collection) is of type array - where the content of the array is an object with 4 fields, _id, name, time, and rating.

We can completely remove the field _id since this is automatically managed by MongoDB and get started in making our data look real. You may want to have the Generators Reference open here for reference.

Rating#

First let's change the rating field. Our app can only accept numbers between 0 and 10 inclusive in increments of 0.5. So we'll use the Number::Range content type to represent this and replace the existing value:

{
    "range": {
        "high": 10,
        "low": 0,
        "step": 0.5
    },
    "subtype": "f64",
    "type": "number"
}

Time#

The time field has been correctly detected as an array of values. First of all, let's say a movie can be shown up to 5 times a day, so we'll change the high field at time.length.range to 6 (high is exclusive). At this stage, the values are just random strings, so let's instead use the String::DateTime content type to generate hours of the day.

```json synth[expect = "unknown variant `date_time`"]
{
  "type": "array",
  "length": {
    "type": "number",
    "subtype": "u64",
    "range": {
      "low": 1,
      "high": 5,
      "step": 1
    }
  },
  "content": {
    "type": "one_of",
    "variants": [
      {
        "weight": 1.0,
        "type": "string",
        "date_time": {
          "subtype": "naive_time",
          "format": "%H:%M",
          "begin": "12:00",
          "end": "23:59"
        }
      }
    ]
  }
}
```


:::caution
`date_time` is now a generator on its own and is no longer a subtype of the `string` generator
:::
Name#

Finally, the movie name field should be populated with realistic-looking movie names.

Under the hood, Synth uses the Python Faker library to generate so-called 'semantic types' (think credit card numbers, addresses, license plates etc.). Unfortunately Faker does not have movie names, so we can use a random text generator with a capped output size instead.

So let's use the `String::Faker` content type to generate some fake movie names!

```json synth
{
    "type": "string",
    "faker": {
        "generator": "text",
        "max_nb_chars": 20
    }
}
```

Final Schema#

So, making all the changes above, we can use our beautiful finished schema to generate data for our app:

```json synth[expect = "unknown variant `date_time`"]
{
  "type": "array",
  "length": {
    "type": "number",
    "subtype": "u64",
    "range": {
      "low": 1,
      "high": 2,
      "step": 1
    }
  },
  "content": {
    "type": "object",
    "name": {
      "type": "string",
      "faker": {
        "generator": "text",
        "max_nb_chars": 20
      }
    },
    "time": {
      "optional": false,
      "type": "array",
      "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
          "low": 1,
          "high": 5,
          "step": 1
        }
      },
      "content": {
        "type": "one_of",
        "variants": [
          {
            "weight": 1.0,
            "type": "string",
            "date_time": {
              "subtype": "naive_time",
              "format": "%H:%M",
              "begin": "00:00",
              "end": "23:59"
            }
          }
        ]
      }
    },
    "rating": {
      "range": {
        "high": 10,
        "low": 0,
        "step": 0.5
      },
      "subtype": "f64",
      "type": "number"
    }
  }
}
```


```bash
$ synth generate cinema/ --size 5
{
  "movies": [
    {
      "name": "Tonight somebody.",
      "rating": 7,
      "time": [
        "15:17"
      ]
    },
    {
      "name": "Wrong investment.",
      "rating": 7.5,
      "time": [
        "22:56"
      ]
    },
    {
      "name": "Put public believe.",
      "rating": 5.5,
      "time": [
        "20:32",
        "21:06",
        "16:15"
      ]
    },
    {
      "name": "Animal firm public.",
      "rating": 8.5,
      "time": [
        "20:06",
        "20:25"
      ]
    },
    {
      "name": "Change member reach.",
      "rating": 8.0,
      "time": [
        "12:36",
        "14:34"
      ]
    }
  ]
}
```

Ah, much better!

Generating data from Synth into MongoDB#

So now that we can generate as much correct data as we want, let's point Synth at MongoDB and let loose the dogs of war.

This step can be broken into two parts:

  1. Run the synth generate command with our desired collection movies and specifying the number of records we want using the --size field.
  2. Pipe stdout to the mongoimport command, mongoexport's long lost cousin. Again here we specify the database we want to import to, --db cinema and the specific collection movies. We also want the --jsonArray flag to notify mongoimport that it should expect a JSON array.
synth generate cinema/ \
    --collection movies \
    --size 1000 \
    | docker exec -i mongo-on-docker mongoimport \
    --db cinema \
    --collection movies \
    --jsonArray

And voila! Our app now has hundreds of valid movies in our database!


Conclusion#

This post was a summary of how you can use Synth to generate realistic looking test data for your Web App. In the next part of this tutorial, we'll explore how we can use Synth to generate relational data, i.e. where you have references between collections in your database.

To check out the Synth source code you can visit the Synth repo on GitHub, and to join the conversation hop on the Synth Discord server.