Exporting to S3

If you want to export stream data to AWS S3, you first need to create a Sink pointing to the S3 bucket.

  • You might already have a bucket and credentials. In that case, you can skip the Preparation and go directly to Create a Sink.

  • You might not have the bucket and credentials yet, but be able to create them yourself. In that case, you can follow along from the Preparation.

  • You might need someone else to set this up for you in your AWS account. In that case, you can forward this document to them, so they know what to do.

Preparation: Set up S3 bucket and credentials

Before creating a sink, you need:

  • An S3 bucket (Step 1)

  • An IAM user with the correct permissions to write to this bucket (Step 2)

The end result of these steps is an AWS credentials file that gives Stream Machine write access to a specific bucket/prefix.

To do so, follow the steps below:

1. Create the bucket

Create a bucket with the command below, substituting your own bucket name:

$ aws s3 mb s3://<your-bucket-name>
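
If you want the bucket in a specific region instead of your CLI default, you can pass the --region flag (eu-west-1 below is just an example):

$ aws s3 mb s3://<your-bucket-name> --region eu-west-1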

2. Create the necessary credentials

Create a file with the policy document below and save it in the current directory. This file contains the permissions Stream Machine needs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>/<optional-prefix>/*.jsonl"
        }
    ]
}

Make sure you replace both occurrences of <your-bucket-name> with the actual name of your S3 bucket, and replace <optional-prefix> with the prefix under which Stream Machine should put the files. If you don't use a prefix, leave it out together with one of the slashes around it, as a double slash will not work.
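
If you prefer not to edit the file by hand, a substitution along these lines also works. This sketch assumes you saved the unedited document as strm-policy-template.json, want the bucket my-export-bucket with prefix my-prefix (all three names are just examples), and want the final policy in strm-policy.json:

$ sed -e 's/<your-bucket-name>/my-export-bucket/g' \
      -e 's/<optional-prefix>/my-prefix/' \
      strm-policy-template.json > strm-policy.json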

The provided policy document shows the minimal set of permissions needed by Stream Machine. We use these as follows:

  • GetBucketLocation: This is an unfortunate necessity, because the AWS SDK requires us to connect to the region in which the bucket was originally created. Stream Machine cannot know this in advance, so we need to query it with this operation (see the example after this list).

  • PutObject: Stream Machine only writes *.jsonl files to the specified location (bucket + prefix).

We don’t need more permissions than these, and we prefer to be granted as few permissions as possible.
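
If you are curious what Stream Machine will see for your bucket, you can run the location query yourself (a null LocationConstraint means the bucket lives in us-east-1):

$ aws s3api get-bucket-location --bucket <your-bucket-name>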

Stream Machine validates the configuration by writing an empty JSONL file (file name: .strm_test_<random UUID>.jsonl) to the specified bucket/prefix, using the provided credentials.

Stream Machine needs access to the bucket you’ve just created, so you need an IAM user with the policy from Preparation step 2 attached. (This example uses the name strm-export-demo, but we recommend you use a more descriptive name for your organization.)

First, create the user:

$ aws iam create-user --user-name strm-export-demo

Then attach the policy from Preparation step 2. This listing assumes the policy document is saved as strm-policy.json in the current directory; replace the file name if yours differs.

$ aws iam put-user-policy --user-name strm-export-demo --policy-name strm-bucket-write-access --policy-document file://strm-policy.json

Finally, create an access key for this user and save the credentials (keep these safe, as they provide access to the bucket):

$ aws iam create-access-key --user-name strm-export-demo > s3.json
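
You can quickly verify that the file looks right, for instance by extracting the access key id (this assumes you have jq installed):

$ jq -r .AccessKey.AccessKeyId s3.json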

Create a Sink

1. Preparation

First, make sure you have a file called s3.json in your current directory, with the following contents:

{
    "AccessKey": {
        "UserName": "<strm-export-demo>",
        "AccessKeyId": "<your access key>",
        "Status": "Active",
        "SecretAccessKey": "<your secret access key>",
        "CreateDate": "2021-04-08T08:19:33+00:00" // The actual date might differ
    }
}
This is the same JSON as returned by aws iam create-access-key.

2. Create the sink

When you have the correct AWS credentials in a file s3.json, you can create the sink using the command below:

$ strm sinks create s3 s3-export --bucket-name stream-machine-export-demo --credentials-file s3.json

Output:

{
  "sinkType": "S3",
  "sinkName": "s3-export",
  "bucketName": "stream-machine-export-demo"
}

You can see all your sinks with strm sinks list.

Create an exporter

An exporter is the Stream Machine component that reads your stream and sends its events in batches to the sink (which in this example writes them to your S3 bucket).

Let’s create an exporter on the demo stream (make sure you have created this stream first). Exporter names only need to be unique per connected stream, so you could, for instance, always call them 's3'.

$ strm exporters create --exporter-name s3-export-demo --sink-name s3-export --sink-type s3 --interval 30 --path-prefix your-optional-prefix demo

Output:

{
  "name": "s3-export-demo",
  "linkedStream": "demo",
  "sinkName": "s3-export",
  "sinkType": "S3",
  "intervalSecs": 30,
  "type": "BATCH",
  "pathPrefix": "your-optional-prefix"
}

Note that we’re exporting data from the stream demo, and sending batches every 30 seconds.

Also note that the --path-prefix argument is optional. Make sure it matches the prefix you granted PutObject permission on in the IAM policy.
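
A quick way to double-check is to look at the Resource lines of the policy you attached in the Preparation and confirm that the prefix there is the one you pass to --path-prefix. With the names used in this walkthrough, the output would look roughly like this:

$ grep Resource strm-policy.json

Output:

            "Resource": "arn:aws:s3:::stream-machine-export-demo"
            "Resource": "arn:aws:s3:::stream-machine-export-demo/your-optional-prefix/*.jsonl"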

Checking the result

Everything has been set up, and after <interval> seconds you should start seeing data in your bucket.

$ aws s3 ls stream-machine-export-demo/your-optional-prefix/

Output:

2021-03-26 10:56:31      79296 2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl
2021-03-26 10:57:01     275726 2021-03-26T09:57:00-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl
2021-03-26 10:57:31     277399 2021-03-26T09:57:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl

In a future version, these filenames will show the stream name instead of the UUID that we use internally.

Now let’s have a look inside one of the files:

$ aws s3 cp s3://stream-machine-export-demo/your-optional-prefix/2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl - | head -1

Output:

{
  "strmMeta": {
    "schemaId": "nps_unified_v1",
    "nonce": 1166008347,
    "timestamp": 1616752583028,
    "keyLink": -1041113576,
    "billingId": "hello0123456789",
    "consentLevels": []
  },
  ...
  "version": "",
  "device_id": "ARXmmtSPDUrUkYjM9KXNS3EjWevX7SgfmsL20bls",
  "customer_id": "ARXmmtQG2niSlDp9ejYWnprox14WGMvYcFuM2iMd8TE=",
  "consent_level": "",
  "session_id": "ARXmmtShwQynjwguyW1IzMyyYf5blOr82+aNGQr2BA==",
  "swimlane_id": "swimlane-id-79",
  ....
}
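
Because each line in these files is a complete JSON document, standard command line tools work well for quick checks, for instance counting the events in a file or listing the schema ids it contains (this assumes jq is installed; substitute one of your own file names):

$ aws s3 cp s3://stream-machine-export-demo/your-optional-prefix/<some-file>.jsonl - | wc -l
$ aws s3 cp s3://stream-machine-export-demo/your-optional-prefix/<some-file>.jsonl - | jq -r '.strmMeta.schemaId' | sort | uniq -c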

About the filenames

The last part of the filenames identifies the partitions being processed by the Kafka consumers that do the actual exports. Under a high event rate, when more than one Kafka consumer is needed, you would see the partitions divided over multiple filenames. In this example, the topic has 5 partitions, and all of them are processed by one Kafka consumer.

Because we manage the Kafka consumer offsets manually, we’re fairly confident there will be no duplicate or missing data in your bucket.

Important considerations for consumers

The S3 exporter is a very generic building block that integrates into most architectures, which makes it widely usable.

Still, there are some things to be aware of:

Empty files

When there are no events, the S3 exporter does not write any files to the bucket, so you won’t see many empty files.

However, when the batch exporter component is created or (re)started, we write an empty JSONL file to validate the configuration (does the bucket exist and does Stream Machine have the appropriate permissions?). This results in some empty files, so your downstream code needs to be able to handle these.
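
For example, if your downstream processing starts by downloading the files, a minimal sketch that skips the empty ones could look like this (bucket and prefix as used in this walkthrough; replace the echo with your actual processing):

$ aws s3 sync s3://stream-machine-export-demo/your-optional-prefix/ ./export
$ find ./export -name '*.jsonl' -size +0c -print | while read -r f; do
      echo "processing $f"
  done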

IAM user credentials

Stream Machine stores the provided IAM credentials in highly secured, encrypted storage. Nevertheless, it might be wise to create a dedicated IAM user that is only used by Stream Machine for connecting to the S3 bucket.

This user should only have the necessary permissions (Preparation step 2), only on the necessary resources (bucket + optional prefix + .jsonl suffix).

This way, you can easily revoke or change the credentials and re-upload them using our CLI (strm sinks create is also used to update an existing sink) without impacting other applications.
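
For example, rotating the key for the dedicated user comes down to something like this (the last step uses the old access key id, which you can find in your previous s3.json):

$ aws iam create-access-key --user-name strm-export-demo > s3.json
$ strm sinks create s3 s3-export --bucket-name stream-machine-export-demo --credentials-file s3.json
$ aws iam delete-access-key --user-name strm-export-demo --access-key-id <old access key id>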

Tearing down

Tearing down an export requires removing the exporter first, and only then the sink. You’re not required to remove the sink at all; it’s just a configuration item.

$ strm exporters delete demo --exporter-name s3-export-demo
$ strm sinks delete s3-export s3
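
If you also want to clean up the AWS resources created in the Preparation (only do this when nothing else depends on them), something along these lines should work; note that --force also deletes all exported files in the bucket:

$ aws iam delete-access-key --user-name strm-export-demo --access-key-id <your access key id>
$ aws iam delete-user-policy --user-name strm-export-demo --policy-name strm-bucket-write-access
$ aws iam delete-user --user-name strm-export-demo
$ aws s3 rb s3://stream-machine-export-demo --force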