Thursday, October 29, 2015

Sensor: You Can Try Out Some Real Data

I've set up a rendering of some actual sensor data in a couple of formats:

  • A line chart with X as time and Y as 3 lines of x, y, z acceleration
  • A 3d plot of x, y, z acceleration with color being the sample time
It is interesting to see the actual sensor fidelity in visual form. CMSensorRecorder records at 50 samples per second, and each visualization shows 400 samples, or 8 seconds of data.
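For reference, here is a minimal sketch of how raw 50 Hz samples can be turned into time-indexed points for the line chart. The sample shape ({x, y, z}) and the start timestamp are assumptions for illustration, not the actual app's data model:

```javascript
// Map raw 50 Hz accelerometer samples into time-indexed points for a line chart.
// The sample shape ({x, y, z}) and the start timestamp are illustrative assumptions.
const SAMPLE_RATE = 50;   // CMSensorRecorder records at 50 samples/second
const WINDOW = 400;       // 400 samples = 8 seconds of data

function toSeries(samples, startMs) {
  return samples.slice(0, WINDOW).map((s, i) => ({
    t: startMs + (i * 1000) / SAMPLE_RATE,  // 20 ms between samples
    x: s.x, y: s.y, z: s.z,
  }));
}
```

A 400-sample window then spans exactly 8 seconds, with the last point at start + 7980 ms.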

You can try out the sample here: http://test.accelero.com. There are a couple of suggested start times shown on the page. Enter a time and hit the Fetch button. Recall that the Fetch button lets the browser query DynamoDB directly for the sample results -- in this case anonymously, and hard-coded to this particular user's Cognito Id...


Once the results are shown you should be able to drag around on the 3d plot to see the acceleration over time.

The above timeslice is a short sample where the watch starts flat and is rotated 90 degrees in a few steps. If you try out the second sample you will see a recording of a more circular motion of the watch.

Note that d3.js is used for the line charts and vis.js is used for the interactive 3d plot.

Sunday, October 25, 2015

Apple Watch Accelerometer displayed!

There you have it! A journey started in June has finally rendered the results intended. Accelerometer data from the Watch is processed through a pile of AWS services to a dynamic web page.

Here we see the very first rendering of a four second interval where the watch is rotated around its axis. X, Y and Z axes are red, green, blue respectively. Sample rate is 50/second.

The accelerometer data itself is mildly interesting. Rendering it on the Watch or the iPhone was a trivial exercise. The framework in place is what makes this fun:
  • Ramping up on WatchOS 2.0 while it was being developed
  • Same with Swift 2.0
  • Getting data out of the Watch
  • The AWS iOS and Javascript SDKs
  • Cognito federated identity for both the iPhone app and the display web page
  • A server-less data pipeline using Kinesis, Lambda and DynamoDB
  • A single-page static content web app with direct access to DynamoDB
No web servers, just a configuration exercise using AWS PaaS resources. This app should run at near-100% uptime, is charged primarily per use, scales with little intervention, is logged, AND is security-first by design.
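The heart of the server-less pipeline is a Lambda that turns Kinesis events into DynamoDB rows. A minimal sketch of that decoding step, with the payload layout (JSON carrying a cognitoId and a samples array) assumed for illustration:

```javascript
// Sketch of the Lambda stage of the pipeline: decode a Kinesis event's
// base64 payloads into flat items ready for DynamoDB puts. The payload
// layout (JSON with a cognitoId and a samples array) is an assumption.
function eventToItems(event) {
  const items = [];
  for (const record of event.Records) {
    const batch = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8')
    );
    for (const sample of batch.samples) {
      // Tag each sample with the batch's Cognito Id before storing it
      items.push(Object.assign({ cognitoId: batch.cognitoId }, sample));
    }
  }
  return items;
}
```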

Code for this checkpoint is here.

Friday, October 23, 2015

Amazon's iOS SDK KinesisRecorder: bug found!

Recall earlier posts discussing 50% extra Lambda->DynamoDB event storage. It turns out the problem is the AWS SDK KinesisRecorder running on the iPhone. Unlike the sample code provided, I actually have concurrent saveRecord() and submitAllRecords() flows -- sort of like the real world. And this concurrency exposed a problem in the way KinesisRecorder selects data to submit to Kinesis.

Root Cause: rowid is not a stable handle for selecting and removing records.

Anyway, I made a few changes to KinesisRecorder:submitAllRecords(). These changes mostly index records by their partition_key. This seems to work OK for me. However, it may not scale for cases where KinesisRecorder winds up managing a larger number of rows. This needs some benchmarking.
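A toy model of the idea (the names here are mine, not the SDK's internals): select a snapshot of records, and later delete exactly those records by their stable key, so rows inserted by a concurrent saveRecord() mid-flight are never touched.

```javascript
// Toy model of the fix: rows are selected and later deleted by a stable key
// (here, partition_key) instead of a SQLite rowid, so a concurrent saveRecord()
// can insert rows between select and delete without shifting which rows are
// removed. These names are illustrative, not the actual AWS SDK internals.
class RecordStore {
  constructor() { this.rows = new Map(); this.next = 0; }
  saveRecord(data) {
    const key = 'pk-' + this.next++;
    this.rows.set(key, data);
    return key;
  }
  selectBatch(limit) { return Array.from(this.rows.keys()).slice(0, limit); }
  removeByKeys(keys) { keys.forEach((k) => this.rows.delete(k)); }
}
```

submitAllRecords() would selectBatch(), send those rows to Kinesis, then removeByKeys() exactly what it sent.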

The pull request is here, and here's the updated iPhone code to do the right thing.

As they say "now we're cookin' with gas!"

Here we see the actual storage rate is around the expected 50 per second. The error and retry rates are minimal.

Sooo, back to analyzing the data that is actually stored in DynamoDB!

Sunday, October 18, 2015

AWS Lambda: You can't improve what you don't measure

Now that there is a somewhat reliable pipeline of data from the Watch-iPhone out to AWS, I have a chance to measure the actual throughput through AWS. Interesting results indeed.

As of this moment, Lambda is performing 50% more work than is needed.

Here's a plot of DynamoDB write activity:

Here we see that during a big catchup phase, when a backlog of events is being sent at a high rate, the writes are globally limited to 100/second. This is good and expected. However, the last part is telling. Here we have caught up and only 50 events/second are being sent through the system. But DynamoDB is showing 75 writes/second!

Here's a filtered log entry from one of the Lambda logs:

Indeed, individual batches are being retried. See job 636 for example. Sometimes on the same Lambda 'instance'. Sometimes on a different instance. This seems to indicate some sort of visibility timeout issue (assuming the Lambda queuer even has this concept).
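Since re-dispatch is evidently possible, one defense is to make the Lambda idempotent: skip any job id already processed. In a real deployment the "seen" check would be something durable like a DynamoDB conditional write; an in-memory Set is enough to show the shape (names here are illustrative):

```javascript
// With re-dispatch possible, make processing idempotent: skip any job id
// already handled. In a real Lambda the "seen" check would be a conditional
// write to DynamoDB; an in-memory Set is enough to show the shape.
const seen = new Set();
function processJob(jobId, handler) {
  if (seen.has(jobId)) return false;  // duplicate dispatch: drop it
  seen.add(jobId);
  handler();
  return true;
}
```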

Recall, the Watch is creating 50 accelerometer samples per second through CMSensorRecorder. And the code on the Watch-iPhone goes through various buffers and queues and winds up sending batches to Kinesis. Then the Kinesis->Lambda connector buffers and batches this data for processing by Lambda. This sort of pipeline will always have tradeoffs between latency, efficiency and reliability. My goal is to identify some top-level rules of thumb for future designs. Again, my baseline settings:
  • Watch creates 50 samples/second
  • Watch dequeuer will dequeue up to 200 samples per batch
  • These batches are queued on the iPhone and flushed to Kinesis every 30 seconds
  • Kinesis->Lambda will create jobs of up to 5 of these batches
  • Lambda will take these jobs and store them to DynamoDB (in batches of 25)
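The baseline settings above imply some quick arithmetic on batch sizes and rates:

```javascript
// Rough arithmetic implied by the baseline settings above.
const sampleRate = 50;    // samples/second from the Watch
const batchSize = 200;    // samples per Watch dequeue batch
const flushSecs = 30;     // iPhone flush interval to Kinesis
const jobBatches = 5;     // batches per Kinesis->Lambda job

const secsPerBatch = batchSize / sampleRate;                   // 4 s of data per batch
const batchesPerFlush = (sampleRate * flushSecs) / batchSize;  // 7.5 batches per flush
const samplesPerJob = jobBatches * batchSize;                  // up to 1000 samples per job
```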
There are some interesting observations:
  • These jobs contain up to 1000 samples and take a little over 3 seconds to process
  • The latency pipeline from actual event to DynamoDB store should be:
    • Around 2-3 minute delay in CMSensorRecorder
    • Transmit from Watch to iPhone is on the order of 0.5 seconds
    • iPhone buffering for Kinesis is 0-30 seconds
    • Kinesis batching will be 0-150 seconds (batches of 5)
    • Execution time in the Lambda of around 3 seconds
  • Some basic analysis of the Lambda flow shows items getting re-dispatched often!
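Summing those stage estimates gives an end-to-end latency bound (taking the CMSensorRecorder delay as 2-3 minutes):

```javascript
// End-to-end latency bound (seconds) from the stage estimates above.
const stages = [
  { name: 'CMSensorRecorder delay', min: 120, max: 180 },
  { name: 'Watch -> iPhone',        min: 0.5, max: 0.5 },
  { name: 'iPhone Kinesis buffer',  min: 0,   max: 30  },
  { name: 'Kinesis batching',       min: 0,   max: 150 },
  { name: 'Lambda execution',       min: 3,   max: 3   },
];
const minTotal = stages.reduce((sum, s) => sum + s.min, 0);  // 123.5 s
const maxTotal = stages.reduce((sum, s) => sum + s.max, 0);  // 363.5 s
```

So even when everything works, an event can take anywhere from about 2 to about 6 minutes to land in DynamoDB, dominated by the CMSensorRecorder delay.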
This last part is interesting. In digging through the Lambda documentation and postings from other folks, there is very little definition of the retry contract. Documents say things like "if the lambda throws an exception..." or "items will retry until they get processed". Curiously, nothing spells out what constitutes "processed", when it decides to retry, etc.

Unlike SQS, there is no documented visibility timeout: what exactly is it? How does Lambda decide to start up an additional worker? Will it ever progress past a corrupted record?

This may be a good time to replumb the pipeline back to my traditional 'archive first' approach: S3->SQS->worker. This has defined retry mechanisms, works at very high throughput, and looks to be a bit cheaper anyway!


Thursday, October 15, 2015

Apple Watch Accelerometer -> iPhone -> Kinesis -> Lambda -> DynamoDB

I've been cleaning up the code flow for more and more of the edge cases. Now, batches sent to Kinesis include Cognito Id and additional instrumentation. This will help when it comes time to troubleshoot data duplication, dropouts, etc. in the analytics stream.

For this next pass, the Lambda function records the data in DynamoDB -- including duplicates. The data looks like this:



The Lambda function (here in source) deserializes the event batch and iterates through each record, one DynamoDB put at a time. Effective throughput is around 40 puts/second (on a table provisioned at 75/sec).

Here's an example run from the Lambda logs (comparing batch size 10 and batch size 1):

START RequestId: d0e5b23a-54f1-4be8-b100-3a4eaabfbced Version: $LATEST
2015-10-16T04:10:46.409Z d0e5b23a-54f1-4be8-b100-3a4eaabfbced Records: 10 pass: 2000 fail: 0
END RequestId: d0e5b23a-54f1-4be8-b100-3a4eaabfbced
REPORT RequestId: d0e5b23a-54f1-4be8-b100-3a4eaabfbced Duration: 51795.09 ms Billed Duration: 51800 ms Memory Size: 128 MB Max Memory Used: 67 MB
START RequestId: 6f430920-1789-43e1-a3b9-21aa8f79218e Version: $LATEST
2015-10-16T04:13:22.468Z 6f430920-1789-43e1-a3b9-21aa8f79218e Records: 1 pass: 200 fail: 0
END RequestId: 6f430920-1789-43e1-a3b9-21aa8f79218e
REPORT RequestId: 6f430920-1789-43e1-a3b9-21aa8f79218e Duration: 5524.53 ms Billed Duration: 5600 ms Memory Size: 128 MB Max Memory Used: 67 MB
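Dividing records stored by duration in the REPORT lines above confirms the ~40 puts/second figure:

```javascript
// Effective DynamoDB put rate from the two runs logged above.
const rate10 = 2000 / 51.79509;  // batch size 10: ~38.6 puts/second
const rate1  = 200 / 5.52453;    // batch size 1:  ~36.2 puts/second
```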

Recall, the current system configuration is:
  • 50 events/second are created by the Watch Sensor Recorder
  • These events are dequeued in the Watch into batches of 200 items
  • These batches are sent to the iPhone on the fly
  • The iPhone queues these batches in the onboard Kinesis recorder
  • This recorder flushes to Amazon every 30 seconds
  • Lambda will pick up these flushes in batches (presently a batch size of 1)
  • These batches will be written to DynamoDB [async.queue concurrency = 8]
The Lambda batch size of 1 is an interesting tradeoff. This results in the lowest-latency processing. The cost appears to be around 10% more work (mostly a lot more startup/dispatch cycles).

Regardless, this pattern needs to write to DB faster than the event creation rate...

Next steps to try:
  • Try dynamo.batchWriteItem -- this may help, but adds overhead to deal with failed items and provisioning exceptions
  • Consider batching multiple sensor events into a single row. The idea here is to group all 50 events in a particular second into the same row. This will only show improvement if the actual length of an event record is a significant fraction of the 1 KB record size
  • Shrink the size of an event to the bare minimum
  • Consider using Avro for the storage scheme
  • AWS IoT
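On the first item: batchWriteItem accepts at most 25 put requests per call and can return unprocessed items that need retrying, so the Lambda would have to chunk its items first. A minimal sketch of the chunking step:

```javascript
// DynamoDB batchWriteItem takes at most 25 put requests per call, so items
// must be chunked before writing. Any UnprocessedItems a call returns would
// be fed back through the same loop (retry handling not shown).
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

A 5-batch job of 1000 samples would become 40 batchWriteItem calls instead of 1000 individual puts.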
Other tasks in the queue:
  • Examine the actual data sent to DynamoDB -- what are the actual latency results?
  • Any data gaps or duplication?
  • How does the real accelerometer data look?
  • (graph the data in a 'serverless' app)