What is HDFS?

Posted by tib on June 18th, 2019

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed across many machines and replicated to ensure durability in the face of failures and high availability to parallel applications.

It is cost effective because it uses commodity hardware. It is built around the concepts of blocks, data nodes and the name node.

Where to use HDFS

  • Very large files: Files should be hundreds of megabytes, gigabytes or larger.
  • Streaming data access: The time to read the whole data set matters more than the latency of reading the first record. HDFS is built around a write-once, read-many-times pattern (see the sketch after this list).
  • Commodity hardware: It works on low-cost hardware.
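
As an example of the read side of that write-once, read-many pattern, here is a minimal sketch using the Hadoop Java FileSystem API. The cluster URI and the file path are hypothetical placeholders for illustration, not values taken from this article.

    // Minimal sketch: streaming a large file out of HDFS with the Java FileSystem API.
    // The cluster URI and file path below are hypothetical placeholders.
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamingRead {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Connect to the name node; adjust the URI to your own cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            // Open a (hypothetical) large file and stream it sequentially to stdout.
            try (FSDataInputStream in = fs.open(new Path("/data/logs/2019-06-18.log"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }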

Where not to use HDFS

  • Low-latency data access: Applications that need very fast access to the first record should not use HDFS, because it is optimized for high throughput over the whole data set rather than for the time to fetch the first record.
  • Lots of small files: The name node holds the metadata of every file in memory, so a huge number of small files consumes more memory than the name node can reasonably provide.
  • Multiple writers and arbitrary modifications: It should not be used when files need to be written to by multiple writers or modified repeatedly.

HDFS concepts

  1. Blocks: A block is the minimum amount of data that HDFS reads or writes. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, a file in HDFS that is smaller than the block size does not occupy a full block's worth of space; e.g. a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large simply to reduce the cost of seeks (see the sketch after this list).
  2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, since it knows the status and metadata of every file in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is kept in the name node's memory, allowing fast access. Moreover, the HDFS cluster is accessed by many clients concurrently, and all of this information is handled by a single machine.
  3. Data Node: Data nodes store and retrieve blocks when they are told to by a client or the name node. They report back to the name node periodically with the list of blocks they are storing. Data nodes, which run on commodity hardware, also carry out block creation, deletion and replication as directed by the name node.
  4. Secondary Name Node: It is a separate physical machine that acts as a helper to the name node. It performs periodic checkpoints, communicating with the name node and taking snapshots of the metadata, which helps minimize downtime and data loss.
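
To make the block and name node concepts concrete, here is a small sketch that asks the name node for a file's block metadata through the Java FileSystem API. The cluster URI and file path are hypothetical placeholders.

    // Minimal sketch: reading a file's block metadata from the name node.
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

            // The path is a placeholder; point it at any existing file on the cluster.
            FileStatus status = fs.getFileStatus(new Path("/data/events.csv"));
            System.out.println("block size : " + status.getBlockSize());   // 128 MB by default
            System.out.println("replication: " + status.getReplication());

            // One entry per block, with the data nodes that hold its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                        + ", hosts " + String.join(",", b.getHosts()));
            }
        }
    }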

HDFS Features and Goals

The Hadoop Distributed File System (HDFS) is a distributed file system and a core part of Hadoop used for data storage. It is designed to run on commodity hardware.

Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that work with large data sets.

Let's look at some of the important features and goals of HDFS.

Features of HDFS

  • Highly scalable - HDFS is highly scalable because it can scale to many nodes in a single cluster.
  • Replication - Due to unfavorable conditions such as node failure, the node containing the data may be lost. To overcome such problems, HDFS always maintains copies of the data on other machines (see the sketch after this list).
  • Fault tolerance - In HDFS, fault tolerance refers to the robustness of the system in the event of failure. HDFS is fault-tolerant in that if any machine fails, another machine containing a copy of that data automatically takes over.
  • Distributed data storage - This is one of the most important features of HDFS and is what makes Hadoop so powerful. Here, data is split into multiple blocks and stored across the nodes of the cluster.
  • Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
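
As an illustration of replication, the sketch below raises the replication factor of a single file so that HDFS keeps more copies of its blocks on different data nodes. The cluster URI, the file path and the factor of 3 are assumptions made for the example.

    // Minimal sketch: changing the replication factor of one (hypothetical) file.
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

            // Ask the name node to keep 3 replicas of every block of this (hypothetical) file.
            boolean changed = fs.setReplication(new Path("/data/important.parquet"), (short) 3);
            System.out.println("replication updated: " + changed);
        }
    }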

Goals of HDFS

  • Hardware failure handling - An HDFS cluster contains multiple server machines, so hardware failure is common; detecting failures and recovering from them quickly is a core goal of HDFS.
  • Streaming data access - Applications that run on HDFS need streaming access to their data sets; they are not general-purpose applications that typically run on general-purpose file systems.
  • Coherence model - Applications that run on HDFS need to follow the write-once, read-many approach, so a file once created need not be changed. However, it may be appended to and truncated (see the sketch after this list).
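
Here is a minimal sketch of that coherence model: a file is created once and later extended only by appending. The paths are hypothetical placeholders, and the cluster must support append (it is enabled by default in recent Hadoop releases).

    // Minimal sketch: write once, then append; existing bytes are never rewritten in place.
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnly {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
            Path p = new Path("/data/audit.log");   // hypothetical path

            // Create the file once...
            try (FSDataOutputStream out = fs.create(p)) {
                out.writeBytes("first record\n");
            }
            // ...and add to its end later.
            try (FSDataOutputStream out = fs.append(p)) {
                out.writeBytes("appended record\n");
            }
        }
    }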

 
