Sharp edges in serverless

I recently gave a talk on how to approach cloud-native (and serverless) development specifically for new startups or those coming from traditional development backgrounds.

I am a huge advocate for using cloud services, but one of the key themes of the talk was “beware of sharp edges”: relying on managed services comes with some seriously non-obvious pitfalls.

Even as a seasoned engineer (one who has designed cloud services at AWS), I am still surprised by some of the usability traps hiding in the cloud. They seem to jump out when you least expect it, and the solutions/workarounds are often not pretty.

The devil is in the details, and the details are in the limits pages

When you start working with managed services, you quickly learn how important, and nuanced, service limits can be. I was bitten by this more than once as a junior engineer at AWS, and I am still occasionally surprised when working with clients.

For a service designer, limits are a fundamental way to protect your service and its downstream dependencies, and to create well-understood, well-tested boundaries around it.

As a service user, the key here is to study the limits pages like your livelihood depends on it (maybe it does), and to understand the limits at architecture time. You do not want to encounter an immovable object once you’re in production. What are the fundamental scaling limits of the service? Which limits can be increased and which can’t? How do the limits vary by region?
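One way to make that study concrete: triage a service’s quotas into the ones you can raise and the ones you must architect around. A minimal sketch, assuming quota records shaped like the AWS Service Quotas API returns them (the sample values below are illustrative, not authoritative):

```python
# Sketch: separate hard limits from adjustable ones. The dicts mimic the
# record shape of the AWS Service Quotas API (list_service_quotas);
# the sample names/values are made up for illustration.

def hard_limits(quotas):
    """Names of quotas that cannot be increased -- study these first."""
    return [q["QuotaName"] for q in quotas if not q["Adjustable"]]

sample = [
    {"QuotaName": "Write throughput per shard", "Value": 1.0, "Adjustable": False},
    {"QuotaName": "Shards per region", "Value": 500.0, "Adjustable": True},
]

print(hard_limits(sample))
```

The adjustable limits are a support ticket away; the hard ones are the fundamental scaling walls you need to design around up front.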

While cloud services may advertise themselves as “infinitely scalable”, reality is more complicated. For example, a Kinesis stream can contain an “infinite” number of shards, but each shard is limited to 1 MB/s of write throughput. Similarly, a DynamoDB table can be provisioned for “infinitely” high throughput, but there is still a fundamental limit on the throughput a single partition/host can support. In both of these examples, you’d better architect your application so that your data is distributed across partitions/shards, or you’re going to hit a fundamental scaling wall, probably once you’re already in production.
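Kinesis routes each record to a shard by taking the MD5 hash of its partition key, so key cardinality determines how well load spreads. A quick sketch of that mapping (assuming shards split the 128-bit hash-key space evenly, which is the default when a stream is created):

```python
import hashlib
from collections import Counter

def shard_for(partition_key: str, num_shards: int) -> int:
    # Kinesis maps records to shards via the MD5 hash of the partition key;
    # this mimics that mapping, assuming evenly sized hash-key ranges.
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * num_shards // 2**128

# High-cardinality keys (e.g. per-user IDs) spread load across the shards...
spread = Counter(shard_for(f"user-{i}", 8) for i in range(1000))
print(len(spread))

# ...while a constant key funnels every record into a single shard,
# capping you at that one shard's write throughput no matter how many
# shards you pay for.
hot = Counter(shard_for("all-events", 8) for _ in range(1000))
print(len(hot))
```

With per-entity keys the 1,000 records land across all eight shards; with the constant key they all land on one.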

On CloudFormation and Usability

I am a huge advocate for CloudFormation as a fundamental building block in AWS. In the past I have referred to it as the “gold standard” for IaC on AWS, and “the assembly language of the cloud”.

I maintain that CloudFormation, and tools that output CloudFormation such as the Cloud Development Kit and Serverless Framework, are the best option for IaC due to CFN’s focus on safe, predictable deployments, rollbacks, and first-class integrations with AWS services.

However, some design decisions seem to have traded off usability in favor of these characteristics. It’s no secret that CloudFormation has usability challenges – its declarative format comes with a steep learning curve, the dev/test cycle is slow, errors are obscure, and IDE/tooling support is basically nonexistent (the CDK shows a lot of promise here).

Other tools, such as Terraform, seem to prioritize usability over safety – providing a clean syntax but partial deployments (no rollbacks) by default. The S3-based state store in particular should instill fear into the hearts of on-call engineers everywhere.

Obstacles to the serverless future

I maintain that serverless architectures have so many advantages that they should be the default choice for the majority of new applications. However, there are some sharp edges that may surprise you when building serverless applications on AWS.

I was working with a client recently on a relatively simple greenfield serverless app with 23 functions plus supporting resources. We quickly built out an MVP and a CI/CD pipeline and we were well on our way to changing the world. Right in the middle of one of the busiest sprints of the year:

```
Error --------------------------------------------------

The CloudFormation template is invalid: Template format error: Number of resources, 201, is greater than maximum allowed, 200
```

Fire your architect, because they didn’t anticipate this limit (it can’t be increased; I checked):

Yes, CloudFormation limits the number of resources in a single stack to 200, and yes their official advice is to split the resources across multiple stacks.
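You can at least see this wall coming. A tiny sketch that counts the entries in a template’s `Resources` section against the 200-resource cap (the helper name and the small sample template are mine, for illustration):

```python
import json

def resource_headroom(template: dict, limit: int = 200) -> int:
    # Every entry under the top-level "Resources" key counts toward the
    # per-stack cap -- Lambda functions, IAM roles, permissions, all of it.
    return limit - len(template.get("Resources", {}))

template = json.loads("""
{
  "Resources": {
    "HelloFunction": {"Type": "AWS::Lambda::Function"},
    "HelloFunctionRole": {"Type": "AWS::IAM::Role"},
    "HelloLogGroup": {"Type": "AWS::Logs::LogGroup"}
  }
}
""")

print(resource_headroom(template))  # 197 slots left in this stack
```

Wiring a check like this into CI, failing the build when headroom drops below some threshold, buys you warning before a deploy blows up mid-sprint.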

That’s all well and good, except CloudFormation doesn’t really support moving resources between stacks. If you’re in production, prepare yourself for a painful blue/green style migration (especially your data stores), probably updates to your CI/CD tools, and expect your development progress to screech to a halt.

Of course, it makes perfect sense that CloudFormation limits the number of resources in a stack. The service has to resolve a complex dependency graph and execute a distributed workflow to create, update, and roll back that graph of resources. This doesn’t change the fact that the limit is an obstacle to serverless development on AWS because 1) it is too low and 2) the workarounds are too painful.

Serverless applications specifically demand a staggering number of resources: API Gateway methods for all of your APIs (including OPTIONS methods for CORS), IAM roles for least-privilege access control, permissions resources, etc. This will only get worse with time as higher-level abstractions like the CDK gain in popularity.
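Some back-of-the-envelope accounting shows the multiplier effect. The per-function counts below are assumptions (they vary by framework and configuration), but they illustrate how a modest function count balloons toward the cap:

```python
# Rough resource accounting for a serverless stack. Per-function counts
# are assumptions -- your framework and config determine the real numbers.
PER_FUNCTION = {
    "Lambda function": 1,
    "IAM role": 1,
    "CloudWatch log group": 1,
    "API Gateway method": 1,
    "OPTIONS method (CORS)": 1,
    "Lambda permission": 1,
}

def estimated_resources(num_functions: int, shared: int = 12) -> int:
    # 'shared' covers the REST API itself, its deployment/stage,
    # tables, buckets, and so on -- also a guess.
    return num_functions * sum(PER_FUNCTION.values()) + shared

print(estimated_resources(23))  # 23 * 6 + 12 = 150
```

At six resources per function, 23 functions already consume 150 of your 200 slots, and that’s before versions, aliases, event source mappings, or anything the CDK synthesizes on your behalf.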

In many cases, it makes sense to decompose your architecture into multiple stacks, but for teams getting off the ground, this may be an arbitrarily imposed architectural decision.

If this is indeed a best practice, then AWS tools should make it easy to follow. To date, tools such as SAM, the CDK, and the Serverless Framework do not use multiple stacks by default, leaving panicked users such as myself relying on community-driven solutions.

AWS is heavily invested in serverless, but there are still plenty of usability obstacles to overcome. Beyond making it easy to start a serverless project, AWS should also let users evolve it into a mature, production-grade application with as little friction as possible.