Sunday, October 18, 2015

AWS Lambda: You can't improve what you don't measure

Now that there is a somewhat reliable pipeline of data from the Watch and iPhone out to AWS, I have a chance to measure the actual throughput on the AWS side. Interesting results indeed.

As of this moment, Lambda is performing 50% more work than is needed.

Here's a plot of DynamoDB write activity:

Here we see that during a big catch-up phase, when a backlog of events is being sent at a high rate, the writes are globally limited to 100/second. This is good and expected. However, the last part is telling. Here we have caught up and only 50 events/second are being sent through the system, yet DynamoDB is showing 75 writes/second: 50% more than necessary.

Here's a filtered log entry from one of the Lambda logs:

Indeed, individual batches are being retried; see job 636 for example. Sometimes the retry runs on the same Lambda 'instance', sometimes on a different one. This seems to indicate some sort of visibility timeout issue (assuming the Lambda queuer even has this concept).

Recall that the Watch is creating 50 accelerometer samples per second through CMSensorRecorder. The code on the Watch and iPhone moves this data through various buffers and queues and ends up sending batches to Kinesis. Then the Kinesis->Lambda connector buffers and batches the data for processing by Lambda. This sort of pipeline will always have tradeoffs between latency, efficiency, and reliability. My goal is to identify some top-level rules of thumb for future designs. Again, my baseline settings:
  • Watch creates 50 samples/second
  • Watch dequeuer will dequeue up to 200 samples per batch
  • These batches are queued on the iPhone and flushed to Kinesis every 30 seconds
  • Kinesis->Lambda will create jobs of up to 5 of these batches
  • Lambda will take these jobs and store them to DynamoDB in batches of 25 (see the sketch after this list)
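To make the last two steps concrete, here is a minimal sketch of what such a Lambda handler could look like, assuming a Python handler and boto3; the table name and item schema are made up for illustration, and this is not necessarily the code actually running in the pipeline:

    import base64
    import json
    from decimal import Decimal

    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('AccelerometerSamples')  # hypothetical table name

    def handler(event, context):
        # Each Kinesis record carries one batch of up to 200 samples from the iPhone;
        # a single invocation ("job") may carry up to 5 of these records.
        samples = []
        for record in event['Records']:
            payload = base64.b64decode(record['kinesis']['data'])
            # parse_float=Decimal because the DynamoDB resource API rejects Python floats
            samples.extend(json.loads(payload, parse_float=Decimal))

        # batch_writer() flushes PutItem requests in groups of 25,
        # the BatchWriteItem limit, and retries any unprocessed items.
        with table.batch_writer() as writer:
            for sample in samples:
                writer.put_item(Item={
                    'deviceId': sample['deviceId'],    # hypothetical key schema
                    'timestamp': sample['timestamp'],
                    'x': sample['x'],
                    'y': sample['y'],
                    'z': sample['z'],
                })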
There are some interesting observations:
  • These jobs contain up to 1000 samples and take a little over 3 seconds to process
  • The latency pipeline from actual event to DynamoDB store should be (summed up after this list):
    • Around 2-3 minute delay in CMSensorRecorder
    • Transmit from Watch to iPhone is on the order of 0.5 seconds
    • iPhone buffering for Kinesis is 0-30 seconds
    • Kinesis batching will be 0-150 seconds (batches of 5)
    • Execution time in the Lambda of around 3 seconds
  • Some basic analysis of the Lambda flow shows items getting re-dispatched often!
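A quick back-of-the-envelope sum of the latency stages listed above (treating the CMSensorRecorder delay as 2-3 minutes):

    # Rough end-to-end latency bounds from the stages above, in seconds
    stages = {
        'CMSensorRecorder delay':       (120, 180),
        'Watch -> iPhone transfer':     (0.5, 0.5),
        'iPhone buffering for Kinesis': (0, 30),
        'Kinesis -> Lambda batching':   (0, 150),
        'Lambda execution':             (3, 3),
    }

    best = sum(lo for lo, hi in stages.values())
    worst = sum(hi for lo, hi in stages.values())
    print(f'best ~{best:.0f}s, worst ~{worst:.0f}s')
    # prints: best ~124s, worst ~364s
    # i.e. roughly 2 to 6 minutes from wrist to DynamoDB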
That last observation about re-dispatching is interesting. In digging through the Lambda documentation and postings from other folks, there is very little definition of the retry contract. The documents say things like "if the Lambda throws an exception..." or "items will retry until they get processed". Curiously, nothing spells out what counts as processed, when a retry is triggered, and so on.
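Whatever the exact contract turns out to be, one way to make these retries harmless is to make the processing idempotent, for example by recording each Kinesis sequence number with a conditional put and skipping records that have already been seen. A sketch of that idea, with a hypothetical dedup table:

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource('dynamodb')
    processed = dynamodb.Table('ProcessedBatches')  # hypothetical dedup table

    def already_processed(sequence_number):
        """Record a Kinesis sequence number; return True if it was seen before."""
        try:
            processed.put_item(
                Item={'sequenceNumber': sequence_number},
                ConditionExpression='attribute_not_exists(sequenceNumber)',
            )
            return False
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                return True
            raise

    # In the handler, skip any record whose sequence number has already been stored:
    #     if already_processed(record['kinesis']['sequenceNumber']):
    #         continue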

Unlike with SQS, what exactly is the visibility timeout? How does Lambda decide to start up an additional worker? Will it ever progress past a corrupted record?

This may be a good time to replumb the pipeline back to my traditional 'archive first' approach: S3->SQS->worker. This has defined retry mechanisms, works at very high throughput, and looks to be a bit cheaper anyway!
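For reference, the worker side of that archive-first pattern is a simple polling loop; a minimal sketch using boto3, with a hypothetical queue URL, bucket name, and store_samples helper standing in for the DynamoDB writes:

    import json

    import boto3

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/sample-batches'  # hypothetical
    BUCKET = 'accelerometer-archive'                                               # hypothetical

    def worker_loop():
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,    # long polling
                VisibilityTimeout=60,  # the retry window is explicit and documented
            )
            for msg in resp.get('Messages', []):
                key = json.loads(msg['Body'])['key']  # message just names the archived S3 object
                body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()
                store_samples(json.loads(body))       # hypothetical helper, e.g. the DynamoDB writes above
                # Delete only after successful processing; on failure the message
                # simply reappears after the visibility timeout and is retried.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])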

