Schemas and contracts

All events sent to Stream Machine adhere to the following:

  1. Serialization Schema
    This is the blueprint of the data that is sent; it describes the shape of the data.

  2. Event Contract
    This describes the content that is sent; it consists of the validations that are applied to the received content.

These two are explained in detail in the sections below.

Serialization Schemas

To guarantee the integrity of the data that is sent to Stream Machine, all events must conform to a serialization schema. These schemas are easy to add and register with Stream Machine, so they are simple to adapt to your use case.

The serialization schema defines how an event is turned into bytes and vice versa.

Currently, Stream Machine supports Apache Avro and JSON Schema. Other serialization formats may be added in the future.

Each serialization schema must include a section with Stream Machine meta information, which is listed below for reference:

The notation below is specific to Avro serialization schemas. The structure, however, remains identical for other serialization methods.
{
  "name": "strmMeta",
  "type": {
    "name": "StrmMeta",
    "type": "record",
    "fields": [
      {
        "name": "schemaId", (1)
        "type": "string"
      },
      {
        "name": "nonce", (2)
        "type": "int"
      },
      {
        "name": "timestamp", (3)
        "type": "long",
        "logicalType": "date"
      },
      {
        "name": "keyLink", (4)
        "type": [
          "null",
          "string"
        ],
        "default": null
      },
      {
        "name": "billingId", (5)
        "type": [
          "null",
          "string"
        ],
        "default": null
      },
      {
        "name": "consentLevels", (6)
        "type": {
          "type": "array",
          "items": "int"
        }
      }
    ]
  }
}
1 the schema reference that is used for serialization, e.g. streammachine/clickstream/0.2.0. Note that this field references a specific version of a serialization schema.
2 an automatically generated, globally unique value that can be used to distinguish individual events from each other
3 the time at which the event was created. This value is set upon reception in the Stream Machine event gateway.
4 the keyLink value ties this event to an encryption key. This string is generated by Stream Machine, which means that event producers send a null value and consumers receive the string.
5 the billing id of your account
6 the consent levels that were given by the data owner for collecting and processing this event

The consent levels in the listing above are defined by your organization and are typically set once, when a user gives consent for specific purposes. The meaning of each consent level is typically defined by the Data Protection Officer within the company.

Event Contracts

To guarantee that data sent to Stream Machine adheres to the rules defined by your organization, events must conform to an event contract. A contract determines which validations are performed, which fields are encrypted, and how events are tied together (events that are tied together get the same encryption key).

An example event contract version is listed below.

A single version of a contract is linked to one, and only one, serialization schema.
{
  "schema": {
    "schema_type": "AVRO", (1)
    "schema_registry_ref": "streammachine/clickstream/0.2.0" (2)
  },
  "key_field": "producerSessionId", (3)
  "pii_fields": { (4)
    "customer/id": 0,
    "producerSessionId": 1
  },
  "validations": [ (5)
    {
      "field": "customer/id",
      "type": "regex",
      "value": "^.+$"
    },
    {
      "field": "url",
      "type": "regex",
      "value": "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
    }
  ]
}
1 the serialization schema type (see available types).
2 the serialization schema reference, which includes the organization name, schema name and schema version.
3 the name of the field in the serialization schema that is used to "tie" events together. Typically, this is the field that identifies an end user's (i.e. your user's) session.
4 the fields whose content in an event should be considered sensitive (i.e. personally identifiable information), and should be encrypted by Stream Machine.
5 the validations that should be performed on the content of specific fields in an event.
When a field is part of an object inside a collection, it cannot yet be part of the pii_fields. We intend to add this in a future version.

Contracts are versatile. One use case that Stream Machine foresees is a single serialization schema with many contracts (i.e. the same shape of the data, but with different rules applied to it).

The validations that are performed on the data received by Stream Machine currently only support regex [1] (based on customer use cases, we intend to extend this). An example of a validation follows:

Assume that in your organization an attribute of your event, say 'user/customer_id',
has to consist of 9 digits and must not start with a zero. You can express
this as a validation rule in the event contract, implemented with a regex.
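
As a minimal sketch, the rule described above could be written as the validation entry below, in the same shape as the "validations" list of the example contract. The pattern `^[1-9][0-9]{8}$` and the helper `is_valid_customer_id` are illustrative assumptions, not part of Stream Machine. Stream Machine evaluates regexes with Java's pattern syntax; Python's `re` module is used here only to try the pattern out, and this simple pattern behaves the same in both.

```python
import re

# Hypothetical validation entry, shaped like the entries in the
# "validations" list of the example contract.
customer_id_rule = {
    "field": "user/customer_id",
    "type": "regex",
    # First digit 1-9, followed by exactly eight more digits.
    "value": "^[1-9][0-9]{8}$",
}

_pattern = re.compile(customer_id_rule["value"])


def is_valid_customer_id(value: str) -> bool:
    """Return True if the whole value matches the 9-digit rule."""
    return _pattern.fullmatch(value) is not None
```

With this rule, a value such as "123456789" is accepted, while "012345678" (leading zero) and "12345678" (only 8 digits) are rejected.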

This is the mechanism that Stream Machine provides to increase the quality of your event data: validate before acceptance, and let the data processing teams define the rules instead of the data generating teams. An example of a validation can be seen (and tried) in Sending and receiving an event by hand.
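
The validate-before-acceptance idea can be sketched as follows, using the two validations from the example contract. This is not Stream Machine's implementation. It assumes that a slash in a field name (as in customer/id) denotes a nested field, and that a value must match its regex in full. The helper names `lookup` and `failed_validations` are hypothetical.

```python
import re
from typing import Any, List

# The two validations from the example contract.
VALIDATIONS = [
    {"field": "customer/id", "type": "regex", "value": "^.+$"},
    {
        "field": "url",
        "type": "regex",
        "value": "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
    },
]


def lookup(event: dict, path: str) -> Any:
    """Resolve a slash-separated path such as "customer/id" (assumed nesting)."""
    node: Any = event
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def failed_validations(event: dict, validations: List[dict]) -> List[str]:
    """Return the field paths whose value is missing or does not match its regex."""
    failures = []
    for rule in validations:
        value = lookup(event, rule["field"])
        if not isinstance(value, str) or re.fullmatch(rule["value"], value) is None:
            failures.append(rule["field"])
    return failures
```

An event would be accepted only when `failed_validations` returns an empty list. For example, an event with an empty customer id and a malformed url fails both rules.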

It is important to note the difference between key_field and keyLink, which are related but fundamentally different:

  1. key_field is part of the event contract and keyLink is part of the serialization schema

  2. key_field determines which field in the serialization schema should be used for considering whether events belong to the same sequence (for example a session)

  3. keyLink links a single event to an encryption key

  4. The value for key_field is determined by you

  5. The value for keyLink is determined by Stream Machine

  6. The value of key_field is used when creating a keyLink

As you can see, the two have a strong relationship, but they are different.
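
The relationship in the list above can be illustrated with a small sketch. It is purely hypothetical: the class `KeyLinkAssigner`, the in-memory dictionary, and the use of a random UUID as the keyLink value are assumptions made for illustration, and say nothing about how Stream Machine actually generates or stores key links.

```python
import uuid
from typing import Dict


class KeyLinkAssigner:
    """Hypothetical stand-in for the step that fills in strmMeta.keyLink."""

    def __init__(self, key_field: str) -> None:
        # key_field comes from the event contract and is chosen by you.
        self.key_field = key_field
        self._links: Dict[str, str] = {}

    def assign(self, event: dict) -> dict:
        """Return a copy of the event with a keyLink derived from key_field."""
        key_value = event[self.key_field]
        # Events sharing the same key_field value share one keyLink.
        if key_value not in self._links:
            self._links[key_value] = str(uuid.uuid4())
        meta = dict(event.get("strmMeta", {}))
        meta["keyLink"] = self._links[key_value]
        return {**event, "strmMeta": meta}
```

In this sketch, a producer sends events with keyLink set to null, and two events with the same producerSessionId come back with the same keyLink string, while an event from another session gets a different one.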

To prevent confusion between the two fields, we plan to rename key_field to something more descriptive.

1. see java.util.regex.Pattern for details