What is HBase?

Posted by tib on July 23rd, 2019

HBase is an open-source, sorted-map datastore built on Hadoop. It is column-oriented and horizontally scalable, and it is modeled on Google's Bigtable. It consists of a set of tables that keep data in key-value format. HBase is well suited to the sparse, distributed data sets that are common in big data use cases. It provides APIs that enable development in practically any programming language. It is the part of the Hadoop ecosystem that gives random, real-time read/write access to data stored in the Hadoop file system.

Why HBase?

  • An RDBMS slows down sharply as the data grows very large.
  • An RDBMS expects highly structured data, i.e. data that fits a well-defined schema.
  • Any change to the schema may require downtime.
  • For sparse datasets, maintaining NULL values adds too much overhead.

Features of HBase

  • Horizontally scalable: capacity grows by adding servers to the cluster, and new columns can be added to a column family at any time.
  • Automatic failover: when a server fails, data handling switches automatically to another server without manual intervention by an administrator.
  • Integration with the MapReduce framework: HBase tables can be read and written by MapReduce jobs, and HBase itself is built on top of the Hadoop Distributed File System (HDFS).
  • It is a sparse, distributed, persistent, multidimensional sorted map that is indexed by row key, column key, and timestamp.
  • It is often referred to as a key-value store, a column-family-oriented database, or a store of versioned maps of maps.
  • Fundamentally, it is a platform for storing and retrieving data with random access.
  • It does not care about data types (you can store an integer in one row and a string in another for the same column).
  • It does not enforce relationships within your data.
  • It is meant to run on a cluster of computers built from commodity hardware.
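
The "sorted map indexed by row key, column key, and timestamp" idea can be pictured with a small Python model. This is only an illustrative sketch, not the HBase API: the class name SortedVersionedMap and its methods are hypothetical, and real HBase stores raw bytes rather than Python objects.

```python
from collections import defaultdict


class SortedVersionedMap:
    """Toy model of the layout: row key -> column key -> timestamp -> value.

    Hypothetical class for illustration only; it is not part of HBase.
    """

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Values are opaque: no data type is enforced per column.
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)

    def scan(self):
        # Rows come back in sorted row-key order.
        for row in sorted(self._rows):
            yield row, {col: dict(v) for col, v in sorted(self._rows[row].items())}


table = SortedVersionedMap()
table.put("row2", "info:age", 1, 30)            # an integer in one row
table.put("row1", "info:age", 1, "thirty")      # a string in the same column
table.put("row1", "info:age", 2, "thirty-one")  # a newer version of that cell

print(table.get("row1", "info:age"))      # newest version
print(table.get("row1", "info:age", 1))   # a specific older version
print([row for row, _ in table.scan()])   # row keys in sorted order
```

Rows that never set a column simply have no entry for it, which is why sparse data carries no NULL overhead in this layout.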

HBase Read

A read against HBase must be reconciled between the HFiles, the MemStore, and the BlockCache. The BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk reads. There is one BlockCache per RegionServer, and block caching can be enabled or disabled for each column family. The BlockCache holds data in the form of blocks, a block being the unit of data that HBase reads from disk in a single pass. The HFile is physically laid out as a sequence of blocks plus an index over those blocks. This means that reading a block from HBase requires only looking up that block's location in the index and retrieving it from disk.

Block: the smallest indexed unit of data, and also the smallest unit of data that can be read from disk. The default size is 64 KB.

When a smaller block size is preferred: for workloads dominated by random lookups. Smaller blocks create a larger index and therefore consume more memory.

When a larger block size is preferred: for workloads that perform sequential scans frequently. This saves memory, because larger blocks mean fewer index entries and therefore a smaller index.
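
The trade-off is easy to see with some back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, one index entry per block; the 128 MB file size and the helper name index_entries are hypothetical and chosen only for illustration.

```python
def index_entries(hfile_bytes, block_bytes):
    """Index entries for one file, assuming one entry per block (a simplification)."""
    return -(-hfile_bytes // block_bytes)  # ceiling division


KB = 1024
MB = 1024 * KB
hfile = 128 * MB  # hypothetical HFile size

for block in (16 * KB, 64 * KB, 256 * KB):
    print(f"{block // KB:>4} KB blocks -> {index_entries(hfile, block)} index entries")
```

Under these assumptions, going from 64 KB to 16 KB blocks quadruples the number of index entries (2048 to 8192), while 256 KB blocks cut it to 512.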

Reading a row from HBase requires first checking the MemStore and then the BlockCache; only after that are the HFiles on disk accessed.
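
That lookup order can be sketched as follows. This is a simplified model, not HBase code: the class ReadPath is hypothetical, plain dictionaries stand in for the MemStore, the BlockCache, and the on-disk blocks, and the index maps each row directly to its block, whereas a real block index is keyed by block boundaries.

```python
class ReadPath:
    """Toy read path: MemStore first, then BlockCache, then HFile blocks."""

    def __init__(self, hfile_blocks):
        # hfile_blocks: block id -> {row: value}, a stand-in for blocks on disk.
        self.hfile_blocks = hfile_blocks
        self.memstore = {}
        self.block_cache = {}
        self.disk_reads = 0
        # Stand-in for the block index: which block holds which row.
        self.index = {
            row: block_id
            for block_id, rows in hfile_blocks.items()
            for row in rows
        }

    def get(self, row):
        if row in self.memstore:           # 1. most recent, unflushed writes
            return self.memstore[row]
        block_id = self.index.get(row)     # locate the block via the index
        if block_id is None:
            return None
        if block_id not in self.block_cache:
            self.disk_reads += 1           # 3. cache miss: read the whole block
            self.block_cache[block_id] = self.hfile_blocks[block_id]
        return self.block_cache[block_id][row]  # 2. served from the cache


reader = ReadPath({0: {"a": 1, "b": 2}, 1: {"c": 3}})
reader.get("a")            # first access loads block 0 from "disk"
reader.get("b")            # same block, so it is served from the cache
print(reader.disk_reads)   # one disk read covered both lookups
```

Because a whole block is cached at once, a second row from the same block costs no further disk read.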

HBase Write

When a write is made, by default it goes to two places:

  • the write-ahead log (WAL), also called the HLog, and
  • the in-memory write buffer, the MemStore.

Clients do not interact directly with the underlying HFiles during writes; instead, each write goes to both the WAL and the MemStore. A write to HBase is complete only once both the WAL and the MemStore have confirmed it.
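
A minimal sketch of that rule, with a Python list standing in for the WAL file and a dictionary for the MemStore. The class WritePath and its put method are hypothetical names used for illustration, not the HBase client API.

```python
class WritePath:
    """Toy write path: every put is recorded in the WAL and in the MemStore."""

    def __init__(self):
        self.wal = []        # append-only log, stand-in for the WAL file
        self.memstore = {}   # in-memory write buffer

    def put(self, row, value):
        self.wal.append((row, value))  # durable record of the change
        self.memstore[row] = value     # latest value, held in memory
        return True                    # acknowledged only after both steps


writer = WritePath()
writer.put("row1", "v1")
writer.put("row1", "v2")
print(len(writer.wal))   # the log keeps every change
print(writer.memstore)   # the buffer keeps only the latest value per row
```

Note that no HFile is touched here: in this model, as in the description above, files on disk are never modified by the client's write.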

HBase MemStore

  • The MemStore is a write buffer in which HBase accumulates data in memory before a permanent write.
  • Its contents are flushed to disk to form an HFile once the MemStore fills up.
  • A flush does not write to an existing HFile; instead, each flush creates a new file.
  • The HFile is the underlying storage format for HBase.
  • HFiles belong to a column family (there is one MemStore per column family). A column family can have multiple HFiles, but an HFile never holds data for more than one column family.
  • The size at which the MemStore is flushed is set in hbase-site.xml by the property hbase.hregion.memstore.flush.size.
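
The flush behaviour in the list above can be sketched like this. It is a toy model under stated assumptions: the class ColumnFamilyStore is hypothetical, the size estimate simply adds key and value lengths, and the flush_size argument stands in for hbase.hregion.memstore.flush.size.

```python
class ColumnFamilyStore:
    """Toy model of one column family: a MemStore plus the HFiles it produces."""

    def __init__(self, flush_size):
        self.flush_size = flush_size  # threshold in bytes (stand-in for the config value)
        self.memstore = {}
        self.memstore_bytes = 0
        self.hfiles = []              # each flush appends a new, immutable file

    def put(self, row, value):
        self.memstore[row] = value
        # Rough size estimate for illustration; real accounting is more detailed.
        self.memstore_bytes += len(row) + len(value)
        if self.memstore_bytes >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # Write the buffered rows out in sorted order as a brand-new file;
        # existing files are never modified.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}
        self.memstore_bytes = 0


store = ColumnFamilyStore(flush_size=10)
store.put("r2", "aaa")    # 5 bytes buffered
store.put("r1", "bbb")    # reaches 10 bytes, which triggers a flush
print(store.hfiles)       # one file, rows sorted by key
print(store.memstore)     # buffer is empty again
```

Since every flush adds a file, a column family accumulates several HFiles over time, which matches the one-to-many relationship described above.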

What happens if the server hosting a MemStore that has not yet been flushed crashes?

Every server in an HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write is not considered successful until the new WAL entry has been written successfully, which guarantees durability: if a server crashes, any data that was in its MemStore but not yet flushed can be recovered by replaying the WAL.
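
Recovery from the WAL can be sketched as a simple replay. This is an illustration only: the function replay_wal is hypothetical, and the sequence numbers used to tell flushed from unflushed entries are an assumption made for this example.

```python
def replay_wal(wal_entries, last_flushed_seq):
    """Rebuild an in-memory buffer from log entries that were never flushed.

    wal_entries: iterable of (sequence_number, row, value) in write order.
    last_flushed_seq: highest sequence number already persisted to disk.
    Both are hypothetical bookkeeping for this sketch.
    """
    memstore = {}
    for seq, row, value in wal_entries:
        if seq > last_flushed_seq:   # skip changes that already reached disk
            memstore[row] = value    # later entries overwrite earlier ones
    return memstore


wal = [(1, "r1", "a"), (2, "r2", "b"), (3, "r1", "c")]
print(replay_wal(wal, last_flushed_seq=1))  # only entries 2 and 3 are replayed
```

Because entries are applied in write order, the rebuilt buffer ends up with the same latest value per row that the crashed server held in memory.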

RDBMS vs HBase

RDBMS concepts map roughly onto HBase concepts as follows.

  • A schema/database in an RDBMS is comparable to a namespace in HBase.
  • A table in an RDBMS is comparable to a column family in HBase.
  • A record (after table joins) in an RDBMS is comparable to a record in HBase.
  • A collection of tables in an RDBMS is comparable to a table in HBase.
