Designing for scale
Traditionally, designing for scale meant carefully sizing your infrastructure for peak usage and then adding headroom to absorb variability in load. When you reached a certain threshold on CPU, memory, disk (capacity and throughput), or network bandwidth, you would repeat the sizing exercise for the increased load and initiate a lengthy procurement and provisioning process. Depending on the application, this could mean scaling up (vertical scaling) with bigger machines or scaling out (horizontal scaling) by deploying more machines. Once deployed, the new capacity would be fixed and run continuously, whether or not it was fully utilized.
In cloud applications, it is easy to scale both vertically and horizontally. Additionally, the number of nodes (in horizontal scaling) can be increased and decreased automatically to improve resource utilization and manage costs better.
Typically, cloud applications are designed to be horizontally scalable. In most cases, the application services or business tier is specifically designed to be stateless, so that compute nodes can be added or removed with no impact on the functioning of the application. If the application state is important, it can be stored externally using a caching or storage service. Depending on the application, session state can also be passed by the caller in each call, or be rehydrated from a data store.
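For example, a stateless web or services tier can keep per-user session state in an external cache instead of in the memory of a particular node. The following is a minimal sketch assuming a Redis cache accessed through the redis-py client; the host name, key prefix, and expiry value are purely illustrative.

```python
import json
import redis

# External cache (hypothetical endpoint). Because every node can reach it,
# compute instances can be added or removed without losing session data.
cache = redis.Redis(host="session-cache.example.internal", port=6379)

SESSION_TTL_SECONDS = 1800  # expire idle sessions after 30 minutes

def save_session(session_id, state):
    # Store the serialized session state externally with an expiry,
    # rather than keeping it in local process memory.
    cache.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(state))

def load_session(session_id):
    # Rehydrate the session on whichever node handles the next request.
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```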
Horizontal scaling in the data tier is usually achieved through sharding. Sharding splits a database across two or more databases to handle higher query or data volumes than a single database node can effectively handle. In traditional application design, you would choose an appropriate sharding strategy and implement all the logic necessary to route read/write requests to the right shard, which increases code complexity. If you choose to use a PaaS cloud database service instead, the responsibility for scalability and availability is largely taken care of by the cloud provider.
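To illustrate the routing logic that sharding otherwise pushes into your application code, here is a minimal sketch of a hash-based shard router; the connection strings and the simple modulo strategy are assumptions for illustration only.

```python
import hashlib

# Hypothetical connection strings, one per shard.
SHARDS = [
    "postgresql://db-shard-0.example.internal/orders",
    "postgresql://db-shard-1.example.internal/orders",
    "postgresql://db-shard-2.example.internal/orders",
]

def shard_for(customer_id):
    # Hash the shard key so that all rows for a given customer are
    # written to, and read from, the same database.
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read/write path must go through the router; this is the extra
# code complexity that a managed PaaS database service absorbs for you.
print(shard_for("customer-42"))
```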
An architecture comprising loosely coupled components is a well-accepted approach and best practice, especially for building highly scalable systems. Loose coupling allows you to distribute your components and scale them independently.
One of the most commonly used design approaches to implement loose coupling is to introduce queues between the major processing components in your architecture. Most PaaS cloud providers offer a queuing service that can be used to design for high concurrency and unusual spikes in load. In a high-velocity data pipeline application, the buffering capability of queues guards against data loss when a downstream processing component is unavailable, slow, or has failed.
The following diagram shows a high-capacity data processing pipeline. Notice that queues are placed strategically between the processing components to match the impedance between the rate at which data flows in and the speed at which each component can process it.
Typically, the web tier writes messages or work requests to a queue. A component in the services tier then picks up the request from the queue and processes it. This ensures faster response times for end users because the queue-based asynchronous processing model does not block the web tier while it waits for a response.
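As a concrete sketch of this pattern, the snippet below assumes AWS SQS accessed through boto3; the queue URL and the handle_order function are hypothetical, and other cloud providers expose equivalent enqueue and dequeue operations.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"  # hypothetical

def submit_order(order):
    # Web tier: enqueue the work request and return to the user immediately.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))

def process_next_batch():
    # Services tier: pick up requests from the queue and process them asynchronously.
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        handle_order(json.loads(message["Body"]))  # application-specific work
        # Delete only after successful processing (see the notes that follow).
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

def handle_order(order):
    ...  # business logic goes here
```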
In a traditional architecture, you may have used message queues with simple enqueue and dequeue operations: work requests are added to the queue and subsequently removed from it for processing. However, implementing queue-based architectures on the cloud is a little different. Your queue may be distributed internally across several nodes by the cloud service, your messages are automatically replicated across several nodes, and any one of these nodes may be unavailable when your request arrives or may fail while your request is being processed.
In order to design more effectively, it is important to understand that:
- The message order is not guaranteed to be preserved between the enqueue and dequeue operations. If there is a requirement to strictly preserve this sequence, you need to include sequencing information as part of the content of each message (see the first sketch after this list).
- One of the replicas of a message may not get deleted (due to a hardware failure or the unavailability of a node). Hence, there is a chance that the message or processing request will get processed twice. It is imperative to design your processing to be idempotent in such circumstances (the first sketch after this list shows one way to do this).
- As the queue is distributed across several servers, it is also possible that no messages or not all messages are returned in any given polling request. The cloud queuing service is not guaranteed to check all the servers for messages against each polling request. However, a message not returned in a given polling request will be returned in a subsequent one.
- Due to the variability in the rate of incoming requests, many polling requests (as described previously) may return no requests for processing at all. For example, orders on an online shopping site may vary widely between daytime and nighttime hours. The empty polling requests are wasteful in terms of resource usage and, more importantly, incur unnecessary costs. One way to reduce these costs is to implement an exponential back-off algorithm that steadily increases the interval between empty polling requests, but this has the downside of not processing requests soon after they arrive. A more effective approach is long polling: the queuing service waits for a message to become available and returns it if one arrives within a configurable time period. Long polling can usually be enabled for a queue through an API call or the management console (see the worker-loop sketch after this list).
- In a cloud queue service, it is important to differentiate between a dequeue and a delete operation. When a message is dequeued, it is not automatically deleted from the queue. This guards against the possibility of the message never reaching the processing component (due to a connection or hardware failure). Therefore, when a message is read off the queue and returned to a component, it is still maintained in the queue; however, it is rendered invisible for a period of time so that other components do not pick it up for processing. As soon as the queue service returns the message, a visibility timeout clock starts. This timeout value is usually configurable. What happens if your processing takes longer than the visibility timeout? In that case, it is good practice to extend the time window through your code so that the message does not become visible again and get processed by another instance of your processing component (the worker-loop sketch after this list shows this).
- If your application does not require each message to be processed immediately upon receipt, you can achieve greater efficiency and throughput by batching a number of requests and processing them together through a batch API (see the batching sketch after this list).
As charges for using cloud queuing services are usually based on the number of requests, batching requests can reduce your bills as well.
- It is important to design and implement a handling strategy for messages that lead to fatal errors or exceptions in your code. Such messages will be picked up and reprocessed repeatedly until the retention period configured for the queue expires. This is wasteful processing and leads to additional charges on your bill. Some queuing services provide a dead letter queue facility to park such messages for further review. Ensure you move a message to the dead letter queue only after a certain number of retries, based on its dequeue (receive) count (see the dead letter sketch after this list).
The number of messages in your queue is also a good metric to use for auto scaling your processing tier (see the queue-depth sketch after this list).
- Depending on the number of different message types and their processing durations, it is good practice to have a separate queue for each. In addition, consider having a separate thread to process each queue instead of a single thread processing multiple queues (see the final sketch after this list).
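The first two points, sequencing and idempotency, can be handled in application code. The sketch below is again based on SQS via boto3: the producer embeds a sequence number in the message body, and the consumer records processed message IDs so that a duplicate delivery becomes a no-op. The in-memory set is only for illustration; in production that record would live in a shared store such as a cache or database.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"  # hypothetical

processed_ids = set()  # illustration only; use a shared cache or database in production

def enqueue_with_sequence(payload, sequence_number):
    # Ordering is not guaranteed by the queue, so carry the sequence number
    # inside the message content and let the consumer reorder if needed.
    body = json.dumps({"seq": sequence_number, "payload": payload})
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)

def process_message(message):
    if message["MessageId"] in processed_ids:
        return  # duplicate delivery: processing it again has no effect
    body = json.loads(message["Body"])
    apply_change(body["seq"], body["payload"])  # application-specific, idempotent work
    processed_ids.add(message["MessageId"])

def apply_change(seq, payload):
    ...
```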
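The polling and visibility timeout behaviour can be seen together in a worker loop. In the sketch below, again assuming SQS via boto3, WaitTimeSeconds enables long polling and change_message_visibility extends the timeout when processing is expected to run long; the 20-second wait, the 120-second extension, and the needs_more_time heuristic are illustrative assumptions.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"  # hypothetical

def worker_loop():
    while True:
        # Long polling: the service holds the request open for up to 20 seconds
        # instead of immediately returning an empty (but still billable) response.
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for message in response.get("Messages", []):
            receipt = message["ReceiptHandle"]
            if needs_more_time(message):
                # Processing will outlast the visibility timeout, so extend it
                # before the message becomes visible to other workers again.
                sqs.change_message_visibility(
                    QueueUrl=QUEUE_URL,
                    ReceiptHandle=receipt,
                    VisibilityTimeout=120,
                )
            handle(message)
            # Dequeue and delete are separate operations: delete explicitly
            # only once processing has succeeded.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)

def needs_more_time(message):
    return False  # placeholder heuristic

def handle(message):
    ...
```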
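Where immediate processing is not required, requests can be grouped and sent through the batch API, which also reduces the per-request charges mentioned earlier. A minimal batching sketch, again assuming SQS via boto3 and a hypothetical queue URL (SQS accepts up to ten messages per batch call):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"  # hypothetical

def enqueue_batch(requests):
    # Send up to ten messages in a single call instead of one call per message.
    entries = [
        {"Id": str(i), "MessageBody": json.dumps(request)}
        for i, request in enumerate(requests[:10])
    ]
    response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    # Batch calls can partially fail; the failed entries should be retried.
    return response.get("Failed", [])
```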
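For the dead letter handling described above, the consumer can check how many times a message has been received and park it after a fixed number of attempts. The sketch assumes SQS, where the ApproximateReceiveCount attribute can be requested with each message; the queue URLs and the limit of five attempts are illustrative. Many queuing services can also do this automatically through a redrive policy configured on the queue itself.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"    # hypothetical
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests-dlq"  # hypothetical
MAX_RECEIVES = 5

def receive_and_handle():
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateReceiveCount"],
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        receives = int(message["Attributes"]["ApproximateReceiveCount"])
        if receives > MAX_RECEIVES:
            # Park the poison message for later review instead of letting it
            # cycle through the queue until the retention period expires.
            sqs.send_message(QueueUrl=DLQ_URL, MessageBody=message["Body"])
        else:
            handle(message)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

def handle(message):
    ...
```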
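The queue depth used for such scaling decisions can be read from the queue itself. In SQS this is exposed as the ApproximateNumberOfMessages queue attribute; the threshold of 1,000 below is an arbitrary illustrative value.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-requests"  # hypothetical

def queue_depth():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# A periodic job (or a metric-based scaling policy) can add worker instances
# when the backlog grows and remove them as it drains.
if queue_depth() > 1000:
    pass  # trigger a scale-out of the processing tier here
```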
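Finally, a minimal sketch of the one-thread-per-queue layout, assuming two hypothetical queues holding different message types with different processing times:

```python
import threading
import boto3

sqs = boto3.client("sqs")  # boto3 low-level clients are safe to share across threads

QUEUES = {
    "thumbnail-jobs": "https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs",  # hypothetical
    "report-jobs": "https://sqs.us-east-1.amazonaws.com/123456789012/report-jobs",        # hypothetical
}

def poll_forever(queue_url):
    while True:
        response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
        for message in response.get("Messages", []):
            handle(queue_url, message)
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])

def handle(queue_url, message):
    ...

# One dedicated worker thread per queue, so a slow message type does not
# hold up the processing of the others.
for name, url in QUEUES.items():
    threading.Thread(target=poll_forever, args=(url,), name=name).start()
```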