Event Stream

Best Practices

Event Design

Here we wish to lay out some opinions on how best to design the specifics of our events - particularly the naming and the schema. This is typically a very subjective exercise, one that's analogous to API design. These opinions should be seen as guidelines or best practices. We should feel free to deviate from them, particularly once we understand their intent and can understand when they don't apply to a particular use case.

Context and Needs

While we are a growing organisation, we are still a small, agile one, and we will continue to be for quite some time. This is a double-edged sword - it means that we have fewer resources to implement things, but it also means that we can communicate more easily and change direction in the future.

As such the "sweet spot" for process and practices is often a light touch. We don't need very dogmatic standards, and we don't need to optimise for edge cases. Instead we need just enough so that we can easily and quickly make decisions, learn, and iterate.

Concretely, this means for the event stream, we don't need to imagine a system which will work for hundreds of developers who cannot easily communicate or coordinate, or where events are being used in drastically different contexts and workloads. We can assume that our events are typically only going to be consumed in a handful of contexts, and we need just enough process to help us take a "best-effort" approach to design and architecture.

Perfect is the enemy of good - especially at our scale.

Naming and Semantics

One of the main things to bear in mind when using an event stream is how it is intended to facilitate our architecture, particularly decoupling. The general idea of this pattern is that the producer of an event should not care about what happens to an event after it is produced, and so should have no real awareness of the event's consumers. The semantics of the event, that is, its meaning, should therefore be expressed in terms of the producer's domain, and should be:

a past-tense description of a change in the state of the producer's domain

Most commonly this would be a noun, describing an entity, and a verb, describing an action that happened to that entity.

So some examples of good events could be:

User created
Purchase made
Card linked

One thing we want to avoid, and which deserves particular attention, is events which are "commands". These are typically only relevant to a particular consumer, and as such are overly specific and introduce coupling. Some examples of poor events could be:

Send email
Issue reward
Refund purchase

Note that due to our use of Protobuf, all of our events are effectively Ruby classes, and as such should follow standard Ruby class naming conventions: CamelCase, with the entity in the singular. So the above "good" examples would be UserCreated, PurchaseMade, CardLinked.
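As a small illustration of the naming convention, the sketch below derives the expected class name from a plain-language event description. The helper `event_class_name` is hypothetical and exists only for this example; it is not part of Protobuf or of our tooling.

```ruby
# Hypothetical helper: turns a human-readable event description into the
# CamelCase class name we would expect for the corresponding event class.
def event_class_name(description)
  description.split.map(&:capitalize).join
end

# The "good" examples from above.
["User created", "Purchase made", "Card linked"].each do |description|
  puts "#{description} -> #{event_class_name(description)}"
end
```

Note that a helper like this only checks the shape of the name. Whether an event is a past-tense fact rather than a command is still a judgment call for the author and reviewers.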

Content and schema

When deciding on the content of an event, we have to perform a balancing act.

On one hand, we want to minimise coupling, particularly unintended coupling. So we want to avoid putting too much data into a given event, to discourage consumers from becoming coupled to the producers in unforeseen ways. This is somewhat analogous to "Hyrum's Law":

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

As an example of a negative outcome we wish to avoid, imagine that every time we produced a "Purchase made" event, we included the user's email address. We may find that other services start to rely on this event to update users' email addresses, which is completely unrelated to the intent of the event. So to a certain extent we want to avoid "fat" events.

On the other hand, we want to be able to provide sufficient information within the event that consumers can undertake whatever actions they need to without requesting more information. So we don't want to make particularly "thin" or "skinny" events. For example, if we had a "Purchase made" event which only had a purchase ID, a consumer of that event would likely need to then request further data from the "Purchases" service, increasing the coupling and reducing a lot of the benefits of using an event stream.

So it would appear, like most things, the desired approach is somewhere in the middle. What we should aim for is:

Include "just enough" information to describe the change

A typical starting point for this would be to include the relevant fields of the entity which has changed, and the IDs of any related entities, but not the details of those entities. As such we should often find ourselves using a fairly flat representation. Because the event already carries the context of which entity it refers to, we should also avoid prefixing attributes with the entity name (for example, "amount" rather than "purchase_amount").

Note that if we find ourselves needing to include a lot of information, or that our consumer needs to request a lot of additional information from the producer, it may be a "smell" that we've poorly modelled this interaction, or perhaps even the domain boundaries. In this case it can be worth taking a step back and considering whether there's a more "architectural" solution.

While this ends up with a design that hews closely to how our models are defined, we should avoid merely using a 1-to-1 representation of our models. Our models are internal implementation details of our services, not external contracts, and we want to be able to easily change them without having to change our external contracts and thus related systems. We don't want to have to co-ordinate with other services and teams just to change the name of a field in our database! So we should always use a simple, explicit mapping.

So, using the example of a "Purchase made" event, our content might look something like:

id
purchaser_id
merchant_id
amount
timestamp
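The field list above, together with the "simple, explicit mapping" recommendation, could be sketched roughly as follows. This is a minimal illustration under stated assumptions: `PurchaseMade` is a plain Struct standing in for the Protobuf-generated class, and the `Purchase` model with its column names (`user_id`, `amount_pence`, `card_token`, `created_at`) is entirely hypothetical.

```ruby
require "time"

# Stand-in for the Protobuf-generated event class (assumption: in a real
# service this class would be generated from the .proto definition).
PurchaseMade = Struct.new(:id, :purchaser_id, :merchant_id, :amount, :timestamp,
                          keyword_init: true)

# Hypothetical internal model. Its attribute names are implementation
# details and deliberately differ from the event's field names.
Purchase = Struct.new(:id, :user_id, :merchant_id, :amount_pence, :card_token,
                      :created_at, keyword_init: true)

# Explicit mapping from model to event. Renaming a model attribute only
# requires a change here; the event contract stays the same.
def purchase_made_event(purchase)
  PurchaseMade.new(
    id: purchase.id,
    purchaser_id: purchase.user_id,   # ID of the related entity only
    merchant_id: purchase.merchant_id,
    amount: purchase.amount_pence,
    timestamp: purchase.created_at.utc.iso8601
  )
end

purchase = Purchase.new(id: 1, user_id: 42, merchant_id: 7, amount_pence: 1250,
                        card_token: "tok_hypothetical",
                        created_at: Time.utc(2024, 1, 1, 12, 0, 0))
event = purchase_made_event(purchase)
puts event.to_h
```

The mapping leaves out `card_token` on purpose: it is an internal detail that no consumer should come to depend on, and the flat field names carry no `purchase_` prefix.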

Summary

Describe an action in the past tense
Use a noun-verb combination for the name, representing entity-action
Include only relevant fields from the entity
Include IDs of any other entities which are relevant
Have a simple but explicit mapping in your code to decouple the event schema from your model schema

Whether or not to consume events published within the same service

Current state

The current state of our system is that many of our services comprise multiple domains, and the boundaries are blurry. Given this situation, it is perfectly acceptable for a service to consume events which it itself publishes.

We should only do this when the producer and consumer exist within different domains in the same service. Currently developers do not have sufficient guidance to understand our domains and their boundaries, so this will be a "best effort" endeavour. The costs of getting this wrong are not particularly high, and it is easily reversible, so a "best effort" should be sufficient. We will continue to form a clearer picture of our domains and provide further guidance to developers.

Note that we should strive to ensure that the events themselves meet our standards, which will be developed in the near term.

Where we are operating within a single domain, we should seek not to use the event stream. Instead, we should either execute the workload synchronously or enqueue an explicit Sidekiq job (at least until we arrive at a decision on the use of private Kafka topics).
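The two in-domain alternatives can be sketched as below. All class and method names (`ReceiptGenerator`, `GenerateReceiptJob`, `complete_purchase_sync`) are hypothetical. The job is written as plain Ruby so the sketch runs without the gem; in a service using Sidekiq it would `include Sidekiq::Job` and be enqueued with `perform_async`.

```ruby
# Hypothetical piece of work that lives in the same domain as its caller.
class ReceiptGenerator
  def self.call(purchase_id)
    "receipt-for-#{purchase_id}"
  end
end

# Option 1: execute the workload synchronously, in line with the caller.
def complete_purchase_sync(purchase_id)
  ReceiptGenerator.call(purchase_id)
end

# Option 2: an explicit job. With Sidekiq this class would
# `include Sidekiq::Job` and callers would run
# `GenerateReceiptJob.perform_async(purchase_id)`.
class GenerateReceiptJob
  def perform(purchase_id)
    ReceiptGenerator.call(purchase_id)
  end
end

puts complete_purchase_sync(1)
puts GenerateReceiptJob.new.perform(2) # calling perform directly, as a test would
```

In both options the caller names the work it triggers, so the dependency is visible at the call site rather than hidden behind an event.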

Future state

As we move towards our desired future architecture, each of our services will only comprise a single domain, with hard boundaries. Once any services are in this state, the recommendation is to not consume events from within the same service that produced the events. The event stream enables excellent patterns for decoupling systems - however that decoupling comes at a cost, and the advantages are fewer in this particular scenario.

It can make the service harder to understand. There are essentially hidden side-effects whenever an event is produced, which are not obvious without reviewing the consumers within the service. It also distorts our ability to mentally model a service as a black-box, with events being external interactions.
It can make the service harder to test. When testing a particular piece of code our tests will not execute any actions which are triggered by events which are consumed within the same service.
There is a cost and complexity in managing the infrastructure related to the event stream, both at the service level and the broker level, although this is a smaller factor than the two above.
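The testing concern above can be illustrated with a small sketch. It assumes that tests replace the event stream with an in-memory fake that records events without delivering them; `FakeStream`, `WelcomeConsumer` and `create_user` are hypothetical names used only for this example.

```ruby
# Hypothetical test double for the event stream: it records published
# events but never delivers them to any consumer.
class FakeStream
  attr_reader :published

  def initialize
    @published = []
  end

  def publish(event)
    @published << event
  end
end

# Hypothetical consumer living in the same service as the producer.
class WelcomeConsumer
  def self.handled
    @handled ||= []
  end

  def self.consume(event)
    handled << event[:id]
  end
end

# Producer code under test: it only knows about the stream.
def create_user(stream, id)
  stream.publish({ name: "UserCreated", id: id })
end

stream = FakeStream.new
create_user(stream, 1)

puts stream.published.length        # the event was produced
puts WelcomeConsumer.handled.empty? # but the in-service consumer never ran
```

A test of `create_user` passes here without exercising the consumer at all, so the side effect has to be covered by a separate test that a reader of the producer code has no obvious reason to look for.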

One downside to this recommendation is that these services will likely need another mechanism by which to undertake asynchronous work, e.g. Sidekiq, or private Kafka topics. The specifics of this will be revisited once we introduce services which meet this standard.