Architecturally Significant Requirements

#data-engineering #architecture

I first wanted to call this article "Data Engineering Terms", but then I realized that I was describing more than just terminology. We can think of these terms as quality attribute requirements that greatly affect the design, cost, and ultimate usefulness of a system. That's why I called it Architecturally Significant Requirements (ASRs). Behind the marketing noise, there are real problems that need to be tackled.

In this post, I talk about cloud and data systems because those are the ones I know best, but the same ideas apply to any other system.

To make this concrete, let's build our understanding around an example. Let's say we have a website where one can find pictures of cats. You can go to the site and, to your excitement, get a cat picture, and even upload your own. Cute, isn't it?

Availability

Availability is the share of time during which a provider guarantees that data and services will be accessible. It is usually documented as a percentage of time per year. For example, 99.9% (or three nines) means that your data may be inaccessible for at most about ten minutes per week, or about 8.77 hours per year.
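To make the arithmetic behind the "nines" explicit, here is a minimal sketch that converts an availability percentage into a downtime budget. The function names are made up for illustration, and the year is assumed to be 365.25 days long, which is where the 8.77 hours for three nines comes from.

```python
# Convert an availability percentage into an allowed-downtime budget.
# Assumption: a year averages 365.25 days (8766 hours).
HOURS_PER_YEAR = 365.25 * 24
MINUTES_PER_WEEK = 7 * 24 * 60


def downtime_fraction(availability_percent: float) -> float:
    """Fraction of time the service may be down, e.g. 99.9 -> 0.001."""
    return 1 - availability_percent / 100


def downtime_hours_per_year(availability_percent: float) -> float:
    return downtime_fraction(availability_percent) * HOURS_PER_YEAR


def downtime_minutes_per_week(availability_percent: float) -> float:
    return downtime_fraction(availability_percent) * MINUTES_PER_WEEK


for pct in (99.0, 99.9, 99.99, 99.999):
    print(
        f"{pct}%: {downtime_hours_per_year(pct):.2f} h/year, "
        f"{downtime_minutes_per_week(pct):.2f} min/week"
    )
```

Each extra nine shrinks the downtime budget by a factor of ten, which is a large part of why every additional nine tends to be more expensive than the previous one.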

That doesn't sound like much, but imagine the costs Valve would incur if Steam, the world's largest digital distribution system, were to go down for 9 hours. Even once a year! And what if it's a bank?

Imagine that our cat website requires periodic updating of security rules and shutting down the server for a period of time. This means that our site will not be available for that period of time.

Datacenters and cloud providers often schedule downtime for maintenance, which is acceptable as long as you don't have an immediate need for service access during those maintenance windows.

Cloud providers usually define availability in a service level agreement (SLA). One provider may count the system as available as long as a single ISP can reach even one service, while another may require that all services be reachable from multiple ISPs and countries. There is no universal definition, so you need to dig into the documentation in each case.

High Availability vs Fault Tolerance

A system is called Highly Available if it recovers quickly from failures, minimizing interruptions for the end user. If, for example, a failure occurs on one of the nodes of our cat website, the user whose requests were being processed on that node will receive an error, but the system will still respond to other requests.

A Highly Available system minimizes downtime by quickly restoring essential services in the event of a system, component, or application failure.

High-availability systems are the solution for applications that need to recover quickly and can withstand a brief interruption in the event of a failure. In some industries, applications are so time-critical that they cannot withstand even a few seconds of downtime. However, many other industries can tolerate short interruptions when their database is unavailable.

In the case of a Fault Tolerant system, the failure is handled transparently and the request still completes, so the user gets a correct answer. This can be achieved, for example, by having redundant resources that can always replace those that failed. This, however, is likely to take more time because of the extra steps.
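As a rough illustration of handling a failure with redundant resources, the sketch below tries each replica in turn until one answers. Everything here is hypothetical: the replicas are plain Python functions standing in for real nodes, and a production system would add timeouts, health checks, and narrower error handling.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant replica could serve the request."""


def call_with_failover(
    replicas: Sequence[Callable[[str], bytes]], request: str
) -> bytes:
    """Return the first successful answer, hiding individual node failures."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)  # the extra attempt is the cost of tolerance
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {request!r}")


def broken_node(request: str) -> bytes:
    raise ConnectionError("node is down")


def healthy_node(request: str) -> bytes:
    return b"cat picture for " + request.encode()


# The first node fails, but the caller still gets a correct answer.
picture = call_with_failover([broken_node, healthy_node], "/cats/42")
print(picture)
```

The user never sees the failed node, only a slightly slower response, which matches the trade-off described above.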

Like a Highly Available system, a Fault Tolerant system is designed to minimize downtime. However, the methods used to minimize downtime in a Fault Tolerant system are different from those used in Highly Available systems. After all, a Fault Tolerant system is designed to keep operating even if one of its components fails. Hence Fault Tolerance requires a more complex design and a higher cost of operation in order to withstand a failure in any of its components.

A system can be Highly Available without being Fault Tolerant, but not the other way around: if a system is considered Fault Tolerant, it is also considered Highly Available.

Many systems are willing to tolerate a small amount of downtime with High Availability, rather than paying the much higher cost of Fault Tolerance.

Durability

Durability refers to the continued existence of an object or resource. It does not mean that you can access it; it just means that it continues to exist.

In our cat website example, when you upload a picture of a cat, you expect it to exist on the system until you remove it. You would be very disappointed if it were corrupted or deleted due to technical problems. This is Durability.

Durability is also expressed as a percentage per year. For example, Amazon S3 is designed for a durability of eleven nines: 99.999999999%. This means that, on average, you should not expect to lose more than 0.000000001% of your objects per year.
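To get a feel for what eleven nines means in practice, here is a small worked calculation. It reads the durability figure as an average annual loss probability per object and assumes objects are lost independently; the function names and the object count are made up for illustration.

```python
from decimal import Decimal

# Eleven nines, written as a fraction rather than a percentage.
DURABILITY = Decimal("0.99999999999")
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1E-11 per object per year


def expected_objects_lost_per_year(object_count: int) -> Decimal:
    """Average number of objects lost per year under the stated assumptions."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Decimal:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


cats = 10_000_000  # hypothetical number of stored cat pictures
print(expected_objects_lost_per_year(cats))
print(years_per_lost_object(cats))
```

With ten million pictures this works out to an average of one lost picture every ten thousand years. Decimal is used instead of float so that a number this close to 1 is not distorted by rounding.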

There are several approaches to increasing data durability.

The first approach is to use encoding algorithms and metadata, such as checksums, to detect data corruption and, where enough redundant information is stored, repair it algorithmically.
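The detection half of this approach can be sketched with a checksum stored next to the data. SHA-256 from Python's standard hashlib module is used here as one possible choice; the helper names are illustrative. Note that a plain checksum only detects corruption. Repairing it requires extra information, such as an error-correcting code or another copy.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest that is stored as metadata alongside the data."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """True if the data still matches the checksum recorded at upload time."""
    return checksum(data) == expected_checksum


original = b"pretend these bytes are a cat picture"
stored_checksum = checksum(original)

# Simulate silent corruption on disk by flipping a single bit.
corrupted = bytearray(original)
corrupted[0] ^= 0x01

print(is_intact(original, stored_checksum))
print(is_intact(bytes(corrupted), stored_checksum))
```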

The second approach is to simply store multiple copies of the data in multiple locations. This is called redundancy. To survive data loss it is required that at least one copy of the data remains intact. The chances of data survival increase with the number of copies stored, with multiple locations being an important multiplying factor.
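A back-of-the-envelope sketch shows why the number of copies matters so much. It assumes that each copy is lost independently with the same probability, which is a simplification: copies in the same rack or building tend to fail together, and that is exactly why spreading them across locations helps. The function name and the 1% figure are made up for illustration.

```python
def probability_all_copies_lost(per_copy_loss_probability: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    return per_copy_loss_probability ** copies


# Hypothetical 1% annual chance of losing any single copy.
for copies in (1, 2, 3):
    print(copies, probability_all_copies_lost(0.01, copies))
```

Under these assumptions each additional copy multiplies the chance of total loss by the per-copy loss probability, so three copies at 1% each give roughly a one-in-a-million chance of losing everything.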

Both approaches allow data to survive loss or corruption in one or more locations due to accidents, wars, theft, natural disasters, or an alien invasion.

In HDFS, for example, durability has been achieved through replication since its early days. Hadoop 3.0 added erasure coding (HDFS Erasure Coding) as an alternative that uses disk space more efficiently, while replication remains the default.

Resiliency

Resiliency can be described as the ability of a system to heal itself after failures, high loads, or attacks that would otherwise cause it to crash. It is not about avoiding failures; it is about accepting the fact that failures will happen and reacting to them in a way that avoids downtime or loss of data.

The goal of resiliency is to get the system back to a fully operational state after a failure. This does not mean that it will be available continuously during an incident, only that it will repair itself. That is the key difference between resiliency and availability. For two different systems, it is possible for one to be more resilient but less available than the other.

Resiliency is most often ensured by redundancy and automatic rerouting of operations within the system. Note that system performance may degrade until the problem is fixed, but there will be little or no service interruption and no loss of data.

Metrics such as RPO (Recovery Point Objective), RTO (Recovery Time Objective), and MTTR (Mean Time to Recovery) are some ways to measure system resilience.
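As a small illustration of how such metrics might be computed, the sketch below derives MTTR from a list of incidents and checks each recovery against an RTO. The incident timestamps, the function names, and the RTO values are all invented for the example.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

# Each incident is a (failed_at, recovered_at) pair.
Incident = Tuple[datetime, datetime]


def mean_time_to_recovery(incidents: Sequence[Incident]) -> timedelta:
    """Average time between a failure and the matching recovery."""
    total = sum((recovered - failed for failed, recovered in incidents), timedelta())
    return total / len(incidents)


def meets_rto(incidents: Sequence[Incident], rto: timedelta) -> bool:
    """True if every incident was recovered within the RTO."""
    return all(recovered - failed <= rto for failed, recovered in incidents)


incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),   # 30 minutes
    (datetime(2024, 3, 12, 22, 0), datetime(2024, 3, 12, 23, 30)),  # 90 minutes
]

print(mean_time_to_recovery(incidents))
print(meets_rto(incidents, timedelta(hours=1)))
```

Here the mean hides the fact that one incident took three times longer than the other, which is why an objective such as RTO is usually checked per incident rather than on the average.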

Reliability

The Oxford dictionary gives a great description of Reliability: "the quality of being able to be trusted to do what somebody wants or needs."

Reliability is closely related to Availability. However, a system may be "available" and still not work properly. Reliability is the probability that the system will work as intended.

In our website example, if the cat pictures are highly available, but it takes hours to load one picture... Can we really rely on this product? Okay, performance is really important. But what if half of the uploaded pictures never made it to the storage, and were just thrown away by the system? All of those affect the reliability of our website.

Reliability can't be measured by a single metric. You need a set of specific metrics and events that tell you about the state and health of the system. These metrics, when they stay at an acceptable level, tell us whether or not we can rely on the system.
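For the cat website, two such metrics could be the share of uploads that actually reach storage and the share that do so quickly enough. The sketch below is only an illustration: the record type, the function names, the sample data, and the one-second threshold are all made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UploadOutcome:
    stored: bool            # did the picture reach storage?
    latency_seconds: float  # how long the upload took


def success_rate(outcomes: Sequence[UploadOutcome]) -> float:
    """Share of uploads that were actually stored."""
    return sum(o.stored for o in outcomes) / len(outcomes)


def fraction_within(outcomes: Sequence[UploadOutcome], max_latency_seconds: float) -> float:
    """Share of uploads that were stored and finished within the threshold."""
    fast = sum(o.stored and o.latency_seconds <= max_latency_seconds for o in outcomes)
    return fast / len(outcomes)


outcomes = [
    UploadOutcome(stored=True, latency_seconds=0.2),
    UploadOutcome(stored=True, latency_seconds=4.0),
    UploadOutcome(stored=False, latency_seconds=0.1),
    UploadOutcome(stored=True, latency_seconds=0.5),
]

print(success_rate(outcomes))
print(fraction_within(outcomes, max_latency_seconds=1.0))
```

Neither number alone says whether the site is reliable. Together, and compared against agreed thresholds, they start to.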

Scalability

Scalability is the ability to adjust capacity to meet demand with minimum effort by adding more resources.

Typically, scaling does not involve rewriting code but involves either adding servers or increasing the resources of existing servers. Therefore, we distinguish between vertical and horizontal scaling. Vertical scaling is when we add more RAM, disks, etc. to an existing server, and horizontal scaling is when we add more servers that work together to share the load.
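For horizontal scaling, the basic capacity question is how many servers a given load needs. The sketch below assumes that load spreads evenly across identical servers, which real systems only approximate; the function name and the numbers are invented for the example.

```python
def servers_needed(peak_requests_per_second: int, capacity_per_server: int) -> int:
    """Smallest number of identical servers that can absorb the peak load."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Ceiling division on integers, with at least one server always running.
    return max(1, -(-peak_requests_per_second // capacity_per_server))


# Hypothetical numbers: each server handles 400 requests per second.
print(servers_needed(2500, 400))  # quiet day
print(servers_needed(9000, 400))  # a cat picture went viral
```

Vertical scaling would instead raise capacity_per_server, which works until a single machine cannot be made any bigger.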

Flexibility in configuring the resources consumed by a system is a key business driver for moving those systems to the cloud. With the right design, you can reduce costs without compromising performance or user experience. Similarly, you can maintain a good user experience during periods of high load by simply buying more resources.

Conclusion

Nothing else influences an architecture more than the quality attribute requirements it must satisfy. These quality attribute requirements, or ASRs, are derived during the initial requirements phase and must be carefully studied. While studying the various ASRs, one must also analyze the risks, because the closer we try to get to a perfectly error-free or fault-tolerant system, the more the cost increases. Sometimes we can let it fail and try again later. Such decisions can only be made with good research and clear communication.

Good architecture has a lot of screws. Well-distributed data architecture has a little more. Just as developers and architects make decisions about various good patterns and best practices, it is equally important that they always have a compass pointing to the general principles of architecture that should guide them at any given time.