Sunday, October 18, 2015

AWS Lambda: You can't improve what you don't measure

Now that there is a somewhat reliable pipeline of data from the Watch and iPhone out to AWS, I have a chance to measure the actual throughput on the AWS side. Interesting results indeed.

As of this moment, Lambda is performing 50% more work than is needed.

Here's a plot of DynamoDB write activity:

Here we see that during a big catch-up phase, when a backlog of events is being sent at a high rate, the writes are globally limited to 100/second. This is good and expected. However, the last part is telling. Here we have caught up and only 50 events/second are being sent through the system, yet DynamoDB is showing 75 writes/second: 50% more than necessary.

Here's a filtered log entry from one of the Lambda logs:

Indeed, individual batches are being retried; see job 636 for example. Sometimes the retry runs on the same Lambda 'instance', sometimes on a different one. This seems to indicate some sort of visibility timeout issue (assuming the Lambda queuer even has this concept).

Recall that the Watch is creating 50 accelerometer samples per second through CMSensorRecorder. The code on the Watch and iPhone moves this data through various buffers and queues and ends up sending batches to Kinesis. Then the Kinesis->Lambda connector buffers and batches the data for processing by Lambda. This sort of pipeline will always have tradeoffs between latency, efficiency, and reliability. My goal is to identify some top-level rules of thumb for future designs. Again, my baseline settings:
  • Watch creates 50 samples/second
  • Watch dequeuer will dequeue up to 200 samples per batch
  • These batches are queued on the iPhone and flushed to Kinesis every 30 seconds
  • Kinesis->Lambda will create jobs of up to 5 of these batches
  • Lambda will take these jobs and store them to DynamoDB in batches of 25 (see the sketch after this list)
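To make the last two steps concrete, here is a minimal sketch of what such a Lambda handler could look like, assuming a Python handler and boto3; the table name and item schema are made up for illustration, and this is not necessarily the code actually running in the pipeline:

    import base64
    import json
    from decimal import Decimal

    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('AccelerometerSamples')  # hypothetical table name

    def handler(event, context):
        # Each Kinesis record carries one batch of up to 200 samples from the iPhone;
        # a single invocation ("job") may carry up to 5 of these records.
        samples = []
        for record in event['Records']:
            payload = base64.b64decode(record['kinesis']['data'])
            # parse_float=Decimal because the DynamoDB resource API rejects Python floats
            samples.extend(json.loads(payload, parse_float=Decimal))

        # batch_writer() flushes PutItem requests in groups of 25,
        # the BatchWriteItem limit, and retries any unprocessed items.
        with table.batch_writer() as writer:
            for sample in samples:
                writer.put_item(Item={
                    'deviceId': sample['deviceId'],    # hypothetical key schema
                    'timestamp': sample['timestamp'],
                    'x': sample['x'],
                    'y': sample['y'],
                    'z': sample['z'],
                })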
There are some interesting observations:
  • These jobs contain up to 1000 samples and take a little over 3 seconds to process
  • The latency pipeline from actual event to DynamoDB store should be (summed up after this list):
    • Around 2-3 minute delay in CMSensorRecorder
    • Transmit from Watch to iPhone is on the order of 0.5 seconds
    • iPhone buffering for Kinesis is 0-30 seconds
    • Kinesis batching will be 0-150 seconds (batches of 5)
    • Execution time in the Lambda of around 3 seconds
  • Some basic analysis of the Lambda flow shows items getting re-dispatched often!
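A quick back-of-the-envelope sum of the latency stages listed above (treating the CMSensorRecorder delay as 2-3 minutes):

    # Rough end-to-end latency bounds from the stages above, in seconds
    stages = {
        'CMSensorRecorder delay':       (120, 180),
        'Watch -> iPhone transfer':     (0.5, 0.5),
        'iPhone buffering for Kinesis': (0, 30),
        'Kinesis -> Lambda batching':   (0, 150),
        'Lambda execution':             (3, 3),
    }

    best = sum(lo for lo, hi in stages.values())
    worst = sum(hi for lo, hi in stages.values())
    print(f'best ~{best:.0f}s, worst ~{worst:.0f}s')
    # prints: best ~124s, worst ~364s
    # i.e. roughly 2 to 6 minutes from wrist to DynamoDB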
That last observation about re-dispatching is interesting. In digging through the Lambda documentation and postings from other folks, there is very little definition of the retry contract. The documents say things like "if the Lambda throws an exception..." or "items will retry until they get processed". Curiously, nothing spells out what counts as processed, when a retry is triggered, and so on.
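Whatever the exact contract turns out to be, one way to make these retries harmless is to make the processing idempotent, for example by recording each Kinesis sequence number with a conditional put and skipping records that have already been seen. A sketch of that idea, with a hypothetical dedup table:

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource('dynamodb')
    processed = dynamodb.Table('ProcessedBatches')  # hypothetical dedup table

    def already_processed(sequence_number):
        """Record a Kinesis sequence number; return True if it was seen before."""
        try:
            processed.put_item(
                Item={'sequenceNumber': sequence_number},
                ConditionExpression='attribute_not_exists(sequenceNumber)',
            )
            return False
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                return True
            raise

    # In the handler, skip any record whose sequence number has already been stored:
    #     if already_processed(record['kinesis']['sequenceNumber']):
    #         continue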

Unlike with SQS, what exactly is the visibility timeout? How does Lambda decide to start up an additional worker? Will it ever progress past a corrupted record?

This may be a good time to replumb the pipeline back to my traditional 'archive first' approach: S3->SQS->worker. This has defined retry mechanisms, works at very high throughput, and looks to be a bit cheaper anyway!
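For reference, the worker side of that archive-first pattern is a simple polling loop; a minimal sketch using boto3, with a hypothetical queue URL, bucket name, and store_samples helper standing in for the DynamoDB writes:

    import json

    import boto3

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/sample-batches'  # hypothetical
    BUCKET = 'accelerometer-archive'                                               # hypothetical

    def worker_loop():
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,    # long polling
                VisibilityTimeout=60,  # the retry window is explicit and documented
            )
            for msg in resp.get('Messages', []):
                key = json.loads(msg['Body'])['key']  # message just names the archived S3 object
                body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()
                store_samples(json.loads(body))       # hypothetical helper, e.g. the DynamoDB writes above
                # Delete only after successful processing; on failure the message
                # simply reappears after the visibility timeout and is retried.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])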

