Hadoop

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It is designed to handle massive volumes of data in a fault-tolerant and scalable manner. Here's an overview of Hadoop, its architecture, advantages, and disadvantages:

Hadoop Architecture

Hadoop is built on two core components: the Hadoop Distributed File System (HDFS) for storage and MapReduce for data processing. These components work across a cluster of interconnected machines, commonly referred to as nodes.

HDFS – Storage Layer

HDFS is a scalable, fault-tolerant distributed file system that stores data across multiple DataNodes (worker nodes) in the cluster. A central NameNode (master node) manages the file system metadata—such as file names, permissions, and the location of data blocks. HDFS ensures reliability by replicating each data block across multiple DataNodes, providing high availability and resilience against hardware failures.
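The block-and-replica model can be sketched in plain Python. This is a toy in-memory simulation, not the HDFS API: the tiny block size, the node names, and the round-robin placement are all illustrative (real HDFS uses 128 MB blocks by default and rack-aware placement), though the default replication factor of 3 is accurate.

```python
# Toy simulation of HDFS block splitting and replica placement.
# Illustrative only: real HDFS uses 128 MB blocks and rack-aware
# placement; the block size and node names here are made up.

BLOCK_SIZE = 4          # bytes per block (tiny, for demonstration)
REPLICATION_FACTOR = 3  # HDFS defaults to 3 replicas per block

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list,
                   replication: int = REPLICATION_FACTOR):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(b"hello hadoop!")
layout = place_replicas(len(blocks), nodes)
# With 3 replicas per block, losing any single DataNode still
# leaves 2 copies of every block available.
```

The NameNode plays the role of the `layout` dictionary here: it records which DataNodes hold each block, while the blocks themselves live only on the DataNodes.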

MapReduce – Processing Layer

MapReduce is Hadoop’s data processing engine. It divides large datasets into smaller tasks, which are executed in parallel across the cluster. The JobTracker (or ResourceManager in newer YARN-based versions) assigns tasks to TaskTrackers (or NodeManagers) on each node. This distributed approach improves processing speed and ensures efficient use of cluster resources.
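The map-shuffle-reduce flow can be shown with the classic word-count example. The sketch below runs the three phases sequentially in one process purely to illustrate the data flow; in a real Hadoop job the map and reduce tasks run in parallel on different nodes, and the shuffle is handled by the framework.

```python
from collections import defaultdict

# Minimal in-process sketch of the MapReduce model (word count).
# Real Hadoop distributes these phases across the cluster; here
# they run sequentially to make the data flow visible.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key (the framework does this in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts -> {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call depends only on its own input line, and each reduce call only on one key's values, both phases parallelize naturally across nodes.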

Nodes in the Cluster

Hadoop clusters are composed of two types of nodes:

  • Master Nodes: Include the NameNode and JobTracker/ResourceManager, responsible for coordination, scheduling, and metadata management.
  • Worker Nodes: Run DataNodes and TaskTrackers/NodeManagers, handling actual data storage and task execution.

Together, these nodes collaborate to deliver scalable, reliable, and high-performance storage and processing of massive data volumes.

Advantages of Hadoop

  • Scalability: Hadoop can scale horizontally by adding more nodes to the cluster, allowing it to handle growing volumes of data with ease.
  • Fault Tolerance: Hadoop's distributed architecture ensures high availability and fault tolerance by replicating data across multiple nodes. If a node fails, the data can be recovered from replicas stored on other nodes.
  • Cost-Effectiveness: Hadoop runs on commodity hardware, making it a cost-effective solution for storing and processing large datasets compared to traditional proprietary systems.
  • Flexibility:
    • Handles structured (e.g., CSV, JSON), semi-structured (e.g., XML, Avro), and unstructured data (e.g., text, logs).
    • Supports diverse workloads, from batch processing to near-real-time analytics via ecosystem tools.
    • Supports various programming languages including Java, Python, and Scala for writing MapReduce jobs.
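Python support typically comes through Hadoop Streaming, which lets any program that reads stdin and writes stdout act as a mapper or reducer. The helpers below sketch that style for word count; real Streaming scripts read `sys.stdin` line by line, but list-based functions are used here so the logic can be shown (and tested) in isolation.

```python
# Word count in the style of Hadoop Streaming. Real Streaming
# scripts read sys.stdin and print records; list-based helpers
# are used here so the logic is easy to follow in isolation.

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as Streaming mappers do."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def reducer(sorted_records):
    """Sum counts per word; Hadoop delivers records sorted by key."""
    totals = {}
    for record in sorted_records:
        word, count = record.split("\t")
        totals[word] = totals.get(word, 0) + int(count)
    return totals

records = sorted(mapper(["big data", "big clusters"]))
# reducer(records) -> {'big': 2, 'clusters': 1, 'data': 1}
```

The tab-separated key/value convention and the sorted input to the reducer mirror what Hadoop Streaming actually provides, which is why the reducer can assume all records for one word arrive together.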

Disadvantages of Hadoop

  • Complexity: Setting up and configuring a Hadoop cluster can be complex and requires specialized skills in distributed systems and cluster management.
  • Latency: Hadoop is optimized for high-throughput batch processing; MapReduce jobs carry significant startup overhead, making it poorly suited to low-latency, real-time workloads.
  • Limited SQL Support: Tools like Hive and Impala provide SQL-like querying over Hadoop data, but they may lack the performance and capabilities of a traditional RDBMS for complex queries.
  • Data Modelling: Requires extra effort for data modelling and schema management compared to traditional relational databases.
  • ACID Support: Hadoop doesn’t provide built-in support for ACID transactions. While components like HBase offer ACID features, they may not be suitable for all use cases.
  • Others:
    • MapReduce requires a lot of code, even for simple tasks.
    • Natively supports Java; other languages such as Python require bridges like Hadoop Streaming and extra effort.
    • Primarily reads from HDFS but can integrate with other storage systems.