Stream Machine concepts
Stream Machine is an event processing engine designed around the following concepts.
Data science teams everywhere are well aware of the many ways in which data can be wrong, and it is generally left up to them to work their way out of the mess. Stream Machine moves the responsibility for generating correct data upstream, so that the entities producing event data receive immediate feedback in case of discrepancies.
Events in Stream Machine are strictly defined both in shape and content. Events that do not conform will be rejected.
The events conform to a certain schema, and this schema defines which event attributes contain Personally Identifiable Information (PII). Events that are accepted by Stream Machine will have these attributes encrypted before entering any persistent storage. The encryption key is linked to an event attribute that defines its session, i.e. the thing that ties the events together as a sequence belonging to one entity. The encryption key is rotated every 24 hours.
A limited number of example schemas can be seen in the Stream Machine Portal
Both on ingest and on further processing, Stream Machine was designed and built to provide low latency. The current implementation uses http/2 for ingest, with typical 99th-percentile latencies well below 10 ms.
Internally we use Kafka for high-throughput, fault-tolerant pipelines. Batch sizes are configurable at will, and having your event data in your own Kafka consumer within 1 second is easily achievable.
Stream Machine was designed from the ground up for horizontal scalability and fault tolerance; the only remaining single point of failure is an outage of an entire cloud region.
A schema consists of three components:
The serialization schema defines how an event is turned into bytes and vice versa. Currently Stream Machine supports Apache Avro and JSON Schema; other serialization mechanisms are fairly easy to add.
Each serialization schema must include a section with Stream Machine meta information, of which the most conceptually important attribute holds the consent levels that were given for collecting and processing the event.
These consent levels are defined by your organization and are typically set once, when a user gives consent for specific purposes. The meaning of each consent level is typically decided by the company's Data Protection Officer.
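As an illustration, consent levels might be modeled as below. The numeric codes, their meanings, and the `consentLevels` field name are invented for this sketch and are not Stream Machine defaults:

```python
# Hypothetical consent-level mapping, as it might be defined by a
# Data Protection Officer. Codes and meanings are illustrative only.
CONSENT_LEVELS = {
    0: "strictly necessary processing only",
    1: "analytics on pseudonymized data",
    2: "personalization",
    3: "marketing",
}

def consented(event_meta: dict, required: set) -> bool:
    """Return True if the event carries at least all required consent levels."""
    return required.issubset(set(event_meta.get("consentLevels", [])))

# The meta section of an event records the consents the user gave:
meta = {"consentLevels": [0, 1, 2]}
print(consented(meta, {1, 2}))  # True: both required levels are present
print(consented(meta, {3}))     # False: marketing consent was not given
```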
This defines which attributes of the event contain PII data. As an example, consider a typical website clickstream. The attribute url will generally not be considered PII, but the attribute session_id will! Attributes defined as PII will be encrypted upon acceptance by Stream Machine.
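To make this concrete, here is a minimal sketch of tagging attributes as PII and transforming only those before storage. The field names come from the clickstream example above; the placeholder masking stands in for Stream Machine's real encryption:

```python
# Which attributes are PII is declared in the schema; here that
# declaration is modeled as a plain set. The masking below is a
# placeholder showing *which* fields get transformed, not how.
PII_FIELDS = {"session_id"}          # url is deliberately not listed

def mask_pii(event: dict) -> dict:
    """Return a copy of the event with PII attributes replaced."""
    return {
        k: ("<encrypted>" if k in PII_FIELDS else v)
        for k, v in event.items()
    }

event = {"url": "/checkout", "session_id": "abc-123"}
print(mask_pii(event))  # {'url': '/checkout', 'session_id': '<encrypted>'}
```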
A schema defines custom attribute validation rules. Assume the attribute /user/customer_id in your organization has to consist of 9 digits not starting with a zero; you could easily express this as a validation rule in the event schema. This is the mechanism that Stream Machine provides to increase the quality of your event data: validate before acceptance, and let the data processing teams define the rules instead of the data generating teams.
An example of a validation can be seen (and tried) in Sending and receiving an event by hand.
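The customer_id rule above (nine digits, not starting with a zero) boils down to a simple regular expression. The rule is the example from this section; the code around it is an illustrative sketch:

```python
import re

# Nine digits, first digit non-zero: the example rule from this section.
CUSTOMER_ID_RULE = re.compile(r"^[1-9][0-9]{8}$")

def is_valid_customer_id(value: str) -> bool:
    return CUSTOMER_ID_RULE.fullmatch(value) is not None

print(is_valid_customer_id("123456789"))  # True
print(is_valid_customer_id("023456789"))  # False: starts with a zero
print(is_valid_customer_id("12345678"))   # False: only eight digits
```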
Stream Machine encrypts the events, but with what key? To understand this we have to look at another component of the schema: the event sequence identifier attribute. The value of this attribute defines whether or not events belong to the same sequence. This might be the website actions of one person, or the device id of a car sending location data; Stream Machine does not care. The first time a new value is seen in this key field, an encryption key is generated in Stream Machine and linked to the event via its strmMeta/keyLink value. The encryption key and its associated key link remain in use for 24 hours, after which a new pair is generated.
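The rotation behaviour can be sketched as follows. The derivation below (hashing the sequence value together with the day) is purely illustrative; Stream Machine generates and stores real encryption keys rather than deriving them like this:

```python
import hashlib
from datetime import date

def key_link(sequence_value: str, day: date) -> str:
    """Illustrative only: one stable key link per sequence per 24 hours."""
    raw = f"{sequence_value}:{day.isoformat()}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

d1, d2 = date(2021, 5, 1), date(2021, 5, 2)
# Same sequence, same day -> same key link, so events stay correlatable:
assert key_link("session-abc", d1) == key_link("session-abc", d1)
# The next day, a new link (and a new key) is used:
assert key_link("session-abc", d1) != key_link("session-abc", d2)
```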
The primary event stream is called the encrypted stream, and by design it no longer contains PII data. Everyone in your company can use it. In case these data become compromised, you have a business issue, but not a privacy issue.
Even these data are useful. With a typical clickstream, where url is not considered PII, you could identify dead ends on your site, or train recommender engines on the encrypted stream, because the attributes that identify the sequence, even though encrypted, remain the same for 24 hours. This is plenty long enough to understand typical customer journeys without compromising the privacy of your users.
If your use case requires specific permissions, the process is as follows:
Here you instruct Stream Machine to decrypt event data with the consent levels you specify. Stream Machine will
drop all events that do not contain at least all the consent levels you require, and
decrypt the attributes with the consent levels you requested. Attributes with other PII levels will not be decrypted, so you receive exactly what you have a right to, and nothing more.
You access this stream with a specific set of credentials.
This means that, provided the company is careful not to re-use credentials, data consumers will only receive the data they are legally allowed to receive.
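The two rules above (drop events lacking the required consents, decrypt only the requested attributes) can be sketched like this; the field names and the placeholder decrypt step are assumptions for illustration:

```python
def decrypted_stream(events, required_consents: set, decryptable_fields: set):
    """Yield events carrying all required consents, with only the
    requested PII attributes 'decrypted' (placeholder transformation)."""
    for event in events:
        consents = set(event["strmMeta"]["consentLevels"])
        if not required_consents.issubset(consents):
            continue  # drop: event lacks at least one required consent
        out = dict(event)
        for field in decryptable_fields:
            if field in out:
                # stand-in for real decryption of the attribute
                out[field] = out[field].replace("enc:", "")
        yield out

events = [
    {"strmMeta": {"consentLevels": [1, 2]}, "session_id": "enc:abc"},
    {"strmMeta": {"consentLevels": [1]}, "session_id": "enc:def"},
]
result = list(decrypted_stream(events, required_consents={1, 2},
                               decryptable_fields={"session_id"}))
print(len(result))               # 1: the second event was dropped
print(result[0]["session_id"])   # abc
```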
Stream Machine is currently an engine running on Google Cloud, with http/2 input. In order to have data accepted by Stream Machine you need the following:
a valid account
a stream definition, with associated credentials. Stream Machine currently uses the OAuth 2.0 client credentials scheme.
an http/1.1 or http/2 client that sends data in the correct format. We provide drivers for various languages that simplify creating the events and sending the data; for maximum performance, http/2 is preferred over http/1.1. You can find the drivers on github.com/streammachineio. You don't need our software to send data. Here you can play with the actual http interaction.
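A minimal sketch of what such a request looks like. The endpoint URL, header names, and payload shape below are assumptions for illustration; the drivers on github.com/streammachineio handle the real serialization and transport for you:

```python
import json

def build_event_request(event: dict, access_token: str):
    """Assemble the pieces of an ingest request (illustrative only;
    consult the drivers for the real endpoint and wire format)."""
    url = "https://ingest.example.invalid/event"    # hypothetical endpoint
    headers = {
        "Authorization": f"Bearer {access_token}",  # OAuth 2.0 bearer token
        "Content-Type": "application/json",
    }
    body = json.dumps(event).encode()
    return url, headers, body

url, headers, body = build_event_request(
    {"url": "/checkout", "strmMeta": {"consentLevels": [1]}}, "my-token")
print(headers["Authorization"])  # Bearer my-token
```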
Stream Machine internally keeps its data in Apache Kafka topics, which typically auto-expire their data after 7 days.
For getting the data into your systems we currently have the following options.
Currently we offer periodic (minute-scale) exports to AWS S3 and Google Cloud Storage buckets, in JSON Lines format, a very common format among data scientists.
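JSON Lines is simply one JSON document per line, so reading an export is straightforward. In this sketch an in-memory file stands in for a downloaded S3 or Google Cloud Storage object, and the event fields are invented:

```python
import io
import json

# An exported batch: one JSON event per line (JSON Lines).
export = io.StringIO(
    '{"url": "/home", "session_id": "x1"}\n'
    '{"url": "/checkout", "session_id": "x2"}\n'
)

events = [json.loads(line) for line in export if line.strip()]
print(len(events))        # 2
print(events[1]["url"])   # /checkout
```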