I am very annoyed that all sorts of big data engineers confuse S3 and HDFS systems, assuming that S3 is the same as HDFS. That’s not true.
HDFS is a distributed file system designed for storing large data. It runs on physical machines on which something else can run.
S3 is an AWS object storage, it has nothing to do with storing files, all data in S3 is stored as objects (Object Entities) with a key (document name), value (object content), and VersionID associated with it. Nothing else can be done in S3 because it is not a file system.
Without going deep into the details of HDFS, let's look at the architecturally relevant places.
Hadoop has one main node called NameNode and a bunch of dependent DataNodes.
The Data Nodes process data, serve data consumers, and log any change to the file system namespace or its properties to the Name Node. The DataNode has no knowledge of HDFS files. It stores each HDFS data block in a separate file on its local file system.
The NameNode maintains the file system namespace, contains all metadata information for the Hadoop cluster, and enforces a data replication policy. The NameNode machine is the single point of failure for the HDFS cluster, if it fails, the rest of the cluster cannot be accessed. To prevent this single point of failure, many techniques have been devised such as providing redundant NameNodes or Quorum Journal Manager (QJM).
Object storage has a completely different storage architecture than HDFS. Many of the patterns and paradigms developed specifically for HDFS primitives may not transfer as well to object storage as one would like.
Data stored in object storage is highly available and globally replicated (when using multi-region storage) without loss of performance. Object storage systems are not as dependent on the master node as HDFS is.
An object here consists of data and an object identifier. Any metadata (logical file name, creation time, owner, access rights) needs to be placed outside the storage. This eliminates the need for the hierarchical structure used in file storage — there is no limit to the amount of metadata that can be used. It also means that directory and permission metadata will not work the way it does in HDFS. Hadoop applications such as Ranger and Sentry (to name just a couple of examples) depend on these HDFS permissions.
HDFS supports the traditional hierarchical organization of files. The user or application can create directories and store files within those directories. The file system namespace hierarchy is similar to most other existing file systems — you can create and delete files, move a file from one directory to another, or rename a file. Although HDFS relaxes some POSIX requirements to allow streaming access to file system data, generally supports it.
Object storage differs from file and block storage in that data is stored in an "object" rather than in a block that makes up a file. There is no directory structure in object storage, everything is stored in a flat address space. The simplicity of object storage makes it scalable but also limits its functionality.
Unlike file systems, object storage does not support POSIX I/O calls: open, close, read, write and search for a file. Instead, they have only two basic operations: PUT and GET.
- PUT creates a new object and fills it with data. As a result, the data in the existing object cannot be changed, so all objects in the repository are immutable. When you create a new object, the repository returns its unique identifier. This is usually a UUID, which has no internal meaning like a filename.
- GET retrieves the contents of an object based on the object identifier (UUID). To edit an object in the repository, you need to create a copy of it and make changes to it. While doing so, you must keep track of which object identifiers correspond to the more recent version of the data.
Since object storage has only a few available operations, there are important limitations or advantages:
- Data blocks in the object store are intended to be written once, so a node does not need to lock the object before reading the contents. There is no risk that another node will write something to the object while reading it.
- The only reference to the object is a unique object identifier(UUID). So, you can use a simple hash function over it to determine the physical location of the object (disk or storage node). The compute node does not need to contact the metadata server to know which server actually hosts the object's content.
- Accessing data without locks and deterministic mapping of objects to the physical location of the data allows for efficient scaling of I/O performance. There are no situations where many compute nodes access data via object storage servers.
Because of this, almost every object store has an additional database layer on top of the data layer. This database layer (vendors may call it a "gateway" or "proxy") has a more user-friendly interface. It usually supports mapping an object identifier to user-friendly metadata, such as object name or access rights.
The consistency model of an HDFS is one-copy-update-semantics. All clients see the contents of the file identically as if only one copy of the file existed. Creation, updates, and deletions are visible immediately, and that the directory listing results are current with respect to the files within that directory.
Object storages are generally Eventually Consistent — it may take time for object creation, deletion, and updating to become visible to all calling users. Old copies of a file may exist for an indefinite period of time.
The delete and rename directory operations are implemented by recursive file operations. Hence, they take time proportional to the number of files during which partial updates can be seen.
Hadoop assumes that rename and deletes operations are atomic. This means that job output is observed by readers on an all-or-nothing basis. This is important for data consistency because partial data should not be written when a job fails, which could corrupt the dataset.
Object storage FileSystem clients implement these as operations on the individual objects whose names match the directory prefix. As a result, the changes take place a file at a time and are not atomic. If an operation fails partway through the process, then the state of the object store reflects the partially completed operation.
In addition, S3 has no atomic renaming of directories — they can fail partway through, and callers cannot safely rely on atomic renames as part of a commit algorithm. This is a critical problem for data integrity and leads to the complex application logic to ensure data consistency, such as never adding data to an existing partition.
In HDFS, compute power and storage capacity work together on the same nodes. When you need more storage, you must add comparable compute power, and vice versa. You cannot add one without the other.
The flat organization of the address namespace, together with extensible metadata functionality, allows you to scale storage capacity and compute power independently. As your needs change, you can add object storage nodes without having to build up compute power that you will not use.
When you store data in S3 instead of HDFS, you can access it directly from multiple clusters. This makes it easy to dismount and create new clusters without moving data around.
Since HDFS is often on-premise on the same private network as the clients, data access latencies are minimized. For typical large-scale ETL jobs, this is not a significant factor, but small jobs, a large number of small files, or a large number of consecutive file system operations can be sensitive to latency per request.
S3, on the other hand, is always somewhere further away in AWS data centers and in many situations, S3 has a higher I/O variance than HDFS. This can be problematic if you have strict I/O requirements, such as in an application based on HBase or another NoSQL database. Although read I/O variance can often be reduced with caching and read replicas, you might consider other cloud-based tools.
A common best practice is to maintain the reference table sources in cloud object store but bring the file into HDFS as a first step in the job if the job re-query the same pieces of data often and don't require shuffle activity. That allows jobs to automatically use disk buffer caches.
There are many other criteria, such as cost, durability, security, backup capabilities, etc. But let's not think about that, S3 wins here anyway.
Hadoop and HDFS made it cheap to store and distribute large amounts of data. But now with everyone moving to cloud architectures, the benefits of HDFS have become minimal and not worth the complexity it brings. That's why organizations now and in the future will use object storage in the cloud as the backend for their storage solutions.
Many clients for object storage pretend that it is a file system, a file system with the same features and operations as HDFS. This is ultimately a sham: they have different characteristics, and sometimes the illusion fails. S3 is not a block storage file system because it does not expose the concept of blocks, directories, etc., which help make normal file system operations fast. But it is possible to build much more complex interfaces based on object storage. Most parallel file systems, including Lustre, Panasas, and BeeGFS, are built on concepts derived from object storage. They make tradeoffs in the front-end and back-end to balance scalability with performance and usability. But this flexibility is provided by building on top of object-oriented, rather than block-based, data views.