Exporting to S3

If you want to export stream data to AWS S3, you first need to create a sink pointing to the target S3 bucket.

Create a sink

You first need to create an AWS S3 JSON credentials file that gives Stream Machine write access [1] to a specific bucket. Follow along with this tutorial, which boils down to:

$ aws s3 mb s3://stream-machine-export-demo
$ aws iam create-user --user-name export-demo
$ aws iam attach-user-policy --user-name export-demo \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
$ aws iam create-access-key --user-name export-demo > s3.json
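
The s3.json credentials file created above is simply the output of aws iam create-access-key, which looks roughly like this (all values below are placeholders):

{
    "AccessKey": {
        "UserName": "export-demo",
        "AccessKeyId": "<access-key-id>",
        "Status": "Active",
        "SecretAccessKey": "<secret-access-key>",
        "CreateDate": "<create-date>"
    }
}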

$ strm sinks create s3 s3-export \
    --bucket-name stream-machine-export-demo -cf s3.json
{
    "sinkType": "S3",
    "sinkName": "s3-export",
    "bucketName": "stream-machine-export-demo"
}

The minimal set of permissions should be as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>/<optional-prefix>/*"
        }
    ]
}
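
If you'd rather not grant AmazonS3FullAccess, you can save the minimal policy above to a file and attach it as a customer-managed policy instead. For example (the policy name and file name here are only illustrative; fill in your own account id):

$ aws iam create-policy --policy-name strm-export-demo-write \
    --policy-document file://minimal-policy.json
$ aws iam attach-user-policy --user-name export-demo \
    --policy-arn arn:aws:iam::<your-account-id>:policy/strm-export-demo-write
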
Currently the CLI checks neither the validity of the bucket nor your access rights to it; that only happens once you start sending data. There is also no error reporting yet, so please get in touch with Stream Machine as soon as possible if something is not working as you expect. We’re still in an alpha version, but we’re happy to help you out.

You can see all your sinks with strm sinks list.

Create an exporter

An exporter is the Stream Machine component that reads your stream and sends its events in batches to the S3 bucket.

$ strm exporters create --help
Usage: strm exporters create [OPTIONS] streamName

  Create a new exporter

Options:
  -en, --exporter-name TEXT     Name of the exporter.
  -sn, --sink-name TEXT         The name of the sink that should be
                                used for this exporter.
  -st, --sink-type [GCLOUD|S3]  The type of sink that should be used
                                for this exporter.
  -i, --interval INT            The interval in seconds between each
                                batch that is exported to the configured sink.
  -pp, --path-prefix TEXT       Optional path prefix. Every object
                                that is exported to the configured sink will
                                have this path prepended to the resource
                                destination.
  -h, --help                    Show this message and exit

Arguments:
  streamName  Name of the stream that is being exported

Let’s create an exporter on the demo stream (make sure to create that stream first). Exporter names only need to be unique per connected stream, so you could, for instance, always call them 's3'.

$ strm exporters create -en s3-export-demo -sn s3-export -st S3 \
    -i 30 -pp encrypted-events demo
{
  "name": "s3-export-demo",
  "linkedStream": "demo",
  "sinkName": "s3-export",
  "sinkType": "S3",
  "intervalSecs": 30,
  "type": "BATCH",
  "pathPrefix": "encrypted-events"
}

Because we’re sending data on the demo stream and the exporter sends a batch every 30 seconds, we should quickly see some objects in the bucket.

$ aws s3 ls stream-machine-export-demo/encrypted-events/
2021-03-26 10:56:31      79296 2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl
2021-03-26 10:57:01     275726 2021-03-26T09:57:00-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl
2021-03-26 10:57:31     277399 2021-03-26T09:57:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl
In a future version, these filenames will show the stream name instead of the UUID that we use internally.

Let’s have a look inside one of the files:

$ aws s3 cp s3://stream-machine-export-demo/encrypted-events/2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl - | head -1
{
  "strmMeta": {
    "schemaId": "nps_unified_v1",
    "nonce": 1166008347,
    "timestamp": 1616752583028,
    "keyLink": -1041113576,
    "billingId": "hello0123456789",
    "consentLevels": []
  },
  ...
  "version": "",
  "device_id": "ARXmmtSPDUrUkYjM9KXNS3EjWevX7SgfmsL20bls",
  "customer_id": "ARXmmtQG2niSlDp9ejYWnprox14WGMvYcFuM2iMd8TE=",
  "consent_level": "",
  "session_id": "ARXmmtShwQynjwguyW1IzMyyYf5blOr82+aNGQr2BA==",
  "swimlane_id": "swimlane-id-79",
  ...
}
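
Since the export format is JSON Lines (one event per line), standard command-line tools are enough for a quick inspection. For example, assuming jq is installed, you could count the events in a batch per schema:

$ aws s3 cp s3://stream-machine-export-demo/encrypted-events/2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl - \
    | jq -r '.strmMeta.schemaId' | sort | uniq -c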

About the filenames

The last part of each filename identifies the partitions processed by the Kafka consumer that does the actual export. Under a high event rate we need more than one Kafka consumer, and the partitions would then be divided over multiple filenames. In this example, the topic has 5 partitions, and all of them are processed by a single Kafka consumer.
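
If you need the partition list in a downstream script, you could, for instance, split the filename on the triple dash (the filename below is taken from the listing above):

$ echo "2021-03-26T09:56:30-stream-151daf78-eb70-4b6a-aeb4-578edc32bee6---0-1-2-3-4.jsonl" \
    | awk -F'---' '{print $2}' | sed 's/\.jsonl$//'
0-1-2-3-4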

Thanks to manual offset management in the Kafka consumer, we’re fairly confident there will be no duplicate or missing data in your bucket.

Tear down

To tear down an export, first remove the exporter, and only then the sink. You’re not required to remove the sink at all; it’s just a configuration item.

$ strm exporters delete demo -en s3-export-demo
$ strm sinks delete s3-export s3

1. The AmazonS3FullAccess policy used in the example is probably more than you need; see the minimal set of permissions above.