HDFS - Hadoop Distributed File System
HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN. HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented non-relational database management system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine. (Source: IBM)
Functions of a File System
Following are the basic functions of a file system (illustrated with a short sketch after the list):
- Controls how data is stored and retrieved.
- Maintains metadata about data (files and folders).
- Enforces permissions and security for viewing files and folders.
- Manages storage capacity efficiently.
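To make these functions concrete, here is a minimal sketch in Java using the standard java.nio.file API, touching each function on a local file system. The file name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileSystemBasics {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("example.txt"); // hypothetical file name

        // 1. Store and retrieve data
        Files.write(path, "hello, file system".getBytes());
        byte[] data = Files.readAllBytes(path);
        System.out.println("Read back: " + new String(data));

        // 2. Metadata about the file
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        System.out.println("Size: " + attrs.size() + " bytes, created: " + attrs.creationTime());

        // 3. Permissions and security
        System.out.println("Readable: " + Files.isReadable(path)
                + ", writable: " + Files.isWritable(path));
    }
}
```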
Different File Systems
With all these file systems already available, why do we need yet another one?
Local File System vs. HDFS
As we know, a big data cluster has multiple nodes, each node has multiple hard drives, and data is stored on those drives by breaking it into blocks of 128 MB (configurable) each. These blocks are then replicated across multiple nodes for high availability.
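As a rough sketch of how the block size and replication factor can be set, the snippet below uses the Hadoop Java client. The property keys dfs.blocksize and dfs.replication are standard HDFS settings, normally configured cluster-wide in hdfs-site.xml; the file paths here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard HDFS properties, usually set cluster-wide in hdfs-site.xml;
        // overriding them here affects only files written by this client.
        conf.set("dfs.blocksize", "134217728"); // 128 MB in bytes
        conf.set("dfs.replication", "3");       // keep 3 copies of each block

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; on write it is split into 128 MB
        // blocks, which the cluster then replicates across nodes.
        fs.copyFromLocalFile(new Path("/tmp/input.csv"),   // hypothetical local path
                             new Path("/data/input.csv")); // hypothetical HDFS path
        fs.close();
    }
}
```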
Local File System
Now, a local file system runs at the node level. HDFS uses each node's local file system to store files/data on that individual node, but the local file system only knows about the blocks stored on its own node.
HDFS, on the other hand, operates at the cluster level, so it knows which nodes hold which blocks of a file to process.
Say you want to store a file of 6 GB on a Hadoop cluster. HDFS will break the file down into small 128 MB blocks, spread them across multiple nodes, and replicate each block. Now let's get down to the node level: each node holds multiple blocks, and the node's local file system only keeps information about the blocks it stores locally; it has no information about what resides on the other nodes. So HDFS comes into play, keeping track of all the blocks across the different nodes and enabling processing.
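To see this tracking in action, here is a hedged sketch that asks HDFS (via the Hadoop FileSystem API) where every block of a file lives; the file path is hypothetical. For a 6 GB file with 128 MB blocks, you would expect 6144 / 128 = 48 blocks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big-file.dat"); // hypothetical HDFS path

        // Ask HDFS for the location of every block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // For a 6 GB file with 128 MB blocks this prints ~48 entries, each
        // listing the nodes that hold a replica of that block.
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("Block %d: offset=%d, hosts=%s%n",
                    i, blocks[i].getOffset(), String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```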
Benefits of HDFS
- Supports distributed processing by storing data as blocks rather than whole files.
- Handles failures by replicating blocks.
- Scales to support future expansion.
- Cost-effective because it runs on commodity hardware.
Note: If you look closely at these benefits, you will see that HDFS directly addresses the challenges of distributed computing.
Did you find this article useful? Give it a clap 👏 and share it with the community. Have some thoughts, or did I miss something? Share with me in the comments 📝.
The author is a data scientist with a passion for building meaningful, impact-oriented products. He is a Kaggle Expert, a former Google Developer Student Club (GDSC) Lead, and an AWS Educate Cloud Ambassador. He loves to connect with people. If you like his work, say hi to him.