Stream Machine concepts
Stream Machine processes events that contain personal data in a way that complies with privacy regulations.
Stream Machine improves the quality of event data by separating the rules that govern the shape and content of the event data from the teams that generate the data. So in essence, the data scientists and analysts determine the rules, not the front-end teams where the data originates.
Stream Machine moves the decisions around privacy compliance from the software developers and data teams to those entities that know about privacy. The rules that govern the personal data aspects of an event are handled by so-called event contracts, and do not require work by software developers.
Stream Machine takes care of the complexities of handling high-volume event data with low latency and high availability.
Stream Machine makes sure that there is an audit trail around the handling of personal event data. It provides the tooling to show what entity is using what personal data for what purpose.
Stream Machine is an event processing engine designed around the following concepts.
Data science teams everywhere are well aware of the many ways that data can be wrong, and it is generally left up to them to work their way out of the mess. Stream Machine moves the responsibility for generating correct data forward, so that the entities that produce event data receive immediate feedback in case of discrepancies.
Events in Stream Machine are strictly defined both in shape and content. Events that do not conform will be rejected. For this, Stream Machine uses Schemas and Contracts to which all events must conform.
Events conform to a schema, and this schema defines which event attributes contain Personally Identifiable Information (PII). Events that are accepted by Stream Machine have these attributes encrypted before they enter any persistent storage. The encryption key is linked to the event attribute that defines its session, i.e. the attribute that ties the events together as a sequence belonging to one entity. The encryption key is rotated every 24 hours.
A limited number of example schemas can be seen in the Stream Machine Portal.
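To make this concrete, here is a hedged sketch of what a clickstream-style event could look like before and after ingestion. The field names and values are illustrative assumptions, not an actual Stream Machine schema:

```python
# Hypothetical event, before ingestion: the producer fills in plain values.
raw_event = {
    "strmMeta": {"schemaRef": "clickstream/1.0", "consentLevels": [0, 1]},
    "sessionId": "session-1234",          # PII: ties events into one sequence
    "url": "https://shop.example.com/",   # not PII in this assumed schema
}

# After ingestion: PII attributes are encrypted before anything is persisted,
# and strmMeta/keyLink records which (24-hour rotating) key was used.
stored_event = {
    "strmMeta": {"schemaRef": "clickstream/1.0", "consentLevels": [0, 1],
                 "keyLink": "9e4d6f2a"},
    "sessionId": "AWyX1kq9...",           # ciphertext, stable for 24 hours
    "url": "https://shop.example.com/",
}
```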
Both on ingest and on further processing, Stream Machine was designed and built to provide low latency. The current implementation uses http/2 for ingest, with typical 99th-percentile latencies well below 10 ms.
Internally we use Kafka for high-throughput, fault-tolerant pipelines. Batch sizes can be configured at will, and having your event data in your own Kafka consumer within one second is easily achievable.
Stream Machine was designed from the ground up for horizontal scalability and fault tolerance. The only remaining single point of failure is the cloud region itself: the service goes down only if the whole region does.
Stream Machine encrypts the events, but with what key? To understand this we have to look at another component of the schema: the event sequence identifier attribute. The value in this attribute defines whether or not events belong to the same sequence. This might be the website actions of one person, or the device id of a car sending location data; Stream Machine does not care what the identifier represents. The first time a new value is seen in this attribute, an encryption key is generated in Stream Machine and linked to the event via its strmMeta/keyLink value. The encryption key and its associated key link remain in use for 24 hours, after which a new pair is generated.
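A minimal sketch of this bookkeeping, assuming an in-memory key store; this illustrates the described behaviour and is not Stream Machine's actual implementation:

```python
import os
import time
import uuid

KEY_TTL_SECONDS = 24 * 3600   # keys and key links rotate every 24 hours
_keys: dict[str, tuple[str, bytes, float]] = {}  # sequence value -> (keyLink, key, created_at)

def key_for(sequence_value: str) -> tuple[str, bytes]:
    """Return the (keyLink, key) pair for a sequence identifier value,
    generating a fresh pair on first sight or after 24 hours."""
    entry = _keys.get(sequence_value)
    if entry is None or time.time() - entry[2] > KEY_TTL_SECONDS:
        entry = (uuid.uuid4().hex[:8], os.urandom(32), time.time())
        _keys[sequence_value] = entry
    key_link, key, _ = entry
    return key_link, key
```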
The primary event stream is called the encrypted stream, and by design it no longer contains PII data. Everyone in your company can use it. Should these data ever become compromised, you have a business issue, but not a privacy issue.
Even these data are useful. With a typical clickstream, where the url is not considered personal data, you could identify dead ends on your site, or train recommender engines on the encrypted stream, because the attributes that identify the sequence, even though encrypted, remain the same for 24 hours. This is plenty long enough to understand typical customer journeys without compromising the privacy of your users.
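Because the encrypted sequence attribute stays stable within the 24-hour key window, you can still group events into journeys without decrypting anything. A small illustrative sketch, with made-up field names:

```python
from collections import defaultdict

# Events from the encrypted stream: sessionId is ciphertext, but identical
# ciphertext within the key window means "same sequence".
encrypted_events = [
    {"sessionId": "AWyX1kq9...", "url": "/home"},
    {"sessionId": "AWyX1kq9...", "url": "/checkout"},
    {"sessionId": "Bq7ffd02...", "url": "/home"},
]

journeys = defaultdict(list)
for event in encrypted_events:
    journeys[event["sessionId"]].append(event["url"])

# Each value is one customer journey, without ever learning who the customer is.
for session, urls in journeys.items():
    print(session, "->", urls)
```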
Stream Machine supports two types of consent levels when creating a decrypted output stream: granular and cumulative.
Cumulative: only the highest consent level is configured on the output stream. All consent levels from zero up to and including this level are decrypted in the output stream.
Granular: all the consent levels that are to be decrypted in the output stream are explicitly configured. This way, it is possible to have "gaps" between the consent levels. For example, you can specify levels 1 and 4, which means that all other levels, including 2 and 3, remain encrypted, as the sketch below illustrates.
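A minimal sketch of how the two modes translate into the set of decrypted levels, assuming consent levels are small integers (the function is illustrative, not the Stream Machine API):

```python
def levels_to_decrypt(mode: str, configured: list[int]) -> set[int]:
    """Resolve the consent levels that will be decrypted on an output stream."""
    if mode == "cumulative":
        # Only the highest level is configured; everything from zero up to
        # and including that level is decrypted.
        return set(range(max(configured) + 1))
    if mode == "granular":
        # Exactly the configured levels are decrypted; gaps stay encrypted.
        return set(configured)
    raise ValueError(f"unknown consent mode: {mode}")

assert levels_to_decrypt("cumulative", [2]) == {0, 1, 2}
assert levels_to_decrypt("granular", [1, 4]) == {1, 4}   # 2 and 3 stay encrypted
```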
The resulting set of consent levels affects two things; see: Create a decrypted stream.
Here you instruct Stream Machine to decrypt event data with the above consent levels. Stream Machine will:
drop all events that do not contain at least all the consent levels you require
decrypt attributes with the consent levels you requested. Attributes with other PII levels will not be decrypted, so you receive exactly what you have a right to, and nothing more.
The sketch below puts both rules together.
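This is a hedged sketch with hypothetical event and schema shapes (the real shapes are defined by your schema and contract):

```python
from typing import Callable, Optional

def process_event(event: dict,
                  attribute_levels: dict[str, int],   # attribute -> PII consent level
                  requested: set[int],                # levels configured on the stream
                  decrypt: Callable[[str], str]) -> Optional[dict]:
    # Rule 1: drop events that don't carry at least all required consent levels.
    if not requested.issubset(event["strmMeta"]["consentLevels"]):
        return None
    # Rule 2: decrypt only the attributes whose PII level was requested;
    # attributes with other levels stay encrypted.
    out = dict(event)
    for attribute, level in attribute_levels.items():
        if level in requested and attribute in out:
            out[attribute] = decrypt(out[attribute])
    return out
```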
You access this stream with a specific set of credentials.
This means that data consumers will only receive the data they are legally allowed to receive, provided the company is careful not to re-use credentials.
Stream Machine is currently an engine running on Google Cloud, with http/2 input. In order to have data accepted by Stream Machine you need the following:
a valid account
a stream definition, with associated credentials. Stream Machine currently uses the OAuth 2.0 client credentials scheme.
an http/http2 client that sends data in the correct format. We provide drivers for various languages that simplify creating the events and sending the data. For maximum performance, http/2 is preferred over http/1.1. You can find the drivers on github.com/streammachineio. You don't need our software to send data. Here you can play with the actual http interaction; a minimal sketch of it also follows this list.
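For illustration, a minimal sketch of the raw interaction without our drivers, using the httpx library (installed as httpx[http2]) for http/2 support. The endpoint URLs and event fields are assumptions; take the real values from the documentation and your schema:

```python
import httpx

AUTH_URL = "https://auth.example.com/token"    # hypothetical endpoint
EVENT_URL = "https://in.example.com/event"     # hypothetical endpoint

# OAuth 2.0 client credentials: exchange the stream's id/secret for a token.
token = httpx.post(AUTH_URL, data={
    "grant_type": "client_credentials",
    "client_id": "my-stream-client-id",
    "client_secret": "my-stream-client-secret",
}).json()["access_token"]

# Shape is dictated by your schema; these fields are illustrative.
event = {
    "strmMeta": {"schemaRef": "clickstream/1.0", "consentLevels": [0, 1]},
    "sessionId": "session-1234",
    "url": "https://shop.example.com/checkout",
}

# http/2 gives the best latency; http/1.1 also works.
with httpx.Client(http2=True) as client:
    response = client.post(EVENT_URL, json=event,
                           headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
```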
Stream Machine internally keeps its data in Apache Kafka topics, which typically auto-expire their data after 7 days.
For getting the data into your systems we currently have the following options.
Currently we have the option to do periodic (minute-scale) exports to AWS S3 and Google Cloud buckets, in JSON Lines format. This is a very common format for data scientists.
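A JSON Lines export is simply one JSON object per line, so a downloaded export file (the file name here is hypothetical) can be read back with a few lines of code:

```python
import json

# Each non-empty line in a JSON Lines export is one complete event.
with open("events-2021-06-01T12-00.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]

print(f"{len(events)} events in this export")
```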