Hadoop Interview Questions and Answers
What is Big Data? What are the "Vs" of Big Data?
- Big Data refers to extremely large and complex datasets that cannot be easily processed or analyzed using traditional data processing tools. The "Vs" commonly associated with Big Data are:
- Volume: The sheer amount of data.
- Velocity: The speed at which data is generated and needs to be processed.
- Variety: The different types of data (structured, semi-structured, unstructured).
- Veracity: The trustworthiness and accuracy of the data.
What is Apache Hadoop? Why is it used?
- Apache Hadoop is an open-source framework that allows for the distributed storage and processing of large datasets across clusters of computers, using simple programming models. It's used because it provides a scalable, fault-tolerant, flexible, and cost-effective way to handle and analyze Big Data.
What are the core components of Hadoop?
- The core components are:
- HDFS (Hadoop Distributed File System): For distributed storage.
- YARN (Yet Another Resource Negotiator): For resource management and job scheduling.
- MapReduce: The original processing framework (though often replaced by Spark or other engines running on YARN).
Explain the architecture of HDFS.
- HDFS follows a Master-Slave architecture:
- NameNode (Master): Manages the file system namespace, controls access to files, and stores metadata (file names, directories, block locations, permissions). There is typically one active NameNode.
- DataNodes (Slaves): Store the actual data blocks of files. They perform block creation, deletion, and replication upon instruction from the NameNode. There are multiple DataNodes in a cluster.
- Secondary NameNode (Checkpointing): Periodically merges the NameNode's edit logs with the fsimage to prevent the edit log from becoming too large. (Note: In HA setups, JournalNodes replace this role).
What is the role of the NameNode in HDFS?
- The NameNode is the central authority and single point of access for HDFS. Its primary roles include: managing the file system namespace (directory tree and file metadata), handling client requests for file operations (open, close, rename, etc.), and regulating block replication.
What is the role of the DataNode in HDFS?
- DataNodes are the workhorses of HDFS. They are responsible for: storing data in the form of blocks, serving read and write requests from clients, and performing block creation, deletion, and replication based on instructions from the NameNode. They also send heartbeats to the NameNode.
What is a Block in HDFS? What is the default block size?
- A Block is the smallest unit of data that HDFS stores. When a file is uploaded to HDFS, it is broken down into blocks. The default block size is 128 MB in Hadoop 2.x and 3.x (it was 64 MB in 1.x).
Why is the HDFS block size so large compared to traditional file systems?
- A large block size minimizes the number of seek operations when reading data from disk, which is a significant bottleneck in distributed systems. It also reduces the amount of metadata the NameNode needs to manage.
What is Data Replication in HDFS? What is the default replication factor?
- Data Replication is the process of storing multiple copies of each data block across different DataNodes in the cluster. This provides fault tolerance and high availability. The default replication factor is 3.
Explain the HDFS Read operation.
- A client requests a file from the NameNode.
- The NameNode returns the list of blocks and their locations (DataNodes) for the file.
- The client picks a DataNode for the first block and requests the data.
- The client reads the block directly from the DataNode.
- The client repeats this process for subsequent blocks, potentially reading blocks in parallel from different DataNodes.
Explain the HDFS Write operation.
- A client requests to create a file from the NameNode.
- The NameNode checks permissions and whether the file already exists. If the checks pass, it records the new file, and the client obtains an output stream (internally backed by a DataStreamer) for writing.
- The client breaks the file into blocks.
- For each block, the client asks the NameNode for a list of DataNodes to store the replicas.
- The client writes the data to the first DataNode in the list.
- The first DataNode forwards the data to the second DataNode, and so on (data pipelining).
- Once all replicas are written, the pipeline is closed, and the client notifies the NameNode that the block write is complete.
What is HDFS High Availability (HA)? How is it achieved?
- HDFS HA ensures that the NameNode is not a single point of failure. It is achieved by having two (or more) NameNodes, one Active and the rest Standby. The Standby NameNode keeps its state synchronized with the Active NameNode using a shared edit-log storage system (Quorum Journal Manager - QJM, or NFS). A ZooKeeper-based Failover Controller (ZKFC) manages the failover process if the Active NameNode fails.
What is the role of YARN in Hadoop?
- YARN is the resource management layer of Hadoop. It is responsible for:
- Managing computing resources (CPU, memory) across the cluster.
- Scheduling applications to run on the cluster.
- Decoupling the resource management from the processing logic, allowing various processing engines (MapReduce, Spark, Tez) to run on the same cluster.
Explain the YARN Architecture.
- ResourceManager (RM): The global master that arbitrates resources among all applications in the system. It consists of two main components: the Scheduler and the ApplicationsManager.
- NodeManager (NM): The per-machine agent responsible for containers, monitoring their resource usage (CPU, memory), and reporting them to the ResourceManager.
- ApplicationMaster (AM): The per-application entity. Its responsibility is to negotiate resources from the ResourceManager and work with the NodeManagers to execute and monitor tasks.
- Container: A resource allocation unit in YARN. It is an encapsulation of resources like CPU, memory, and disk I/O on a single NodeManager.
What is the difference between Hadoop 1.x and Hadoop 2.x/3.x?
- The major difference is the introduction of YARN in Hadoop 2.x/3.x. In 1.x, MapReduce handled both resource management and processing (JobTracker and TaskTrackers). YARN separated these roles, allowing multiple processing frameworks to run on Hadoop. Other improvements include HDFS HA, HDFS Federation, and larger default block sizes.
Explain the MapReduce programming model.
- MapReduce is a batch processing model consisting of two main phases:
- Map Phase: Takes input data (key-value pairs) and processes it to produce intermediate key-value pairs.
- Reduce Phase: Takes the intermediate key-value pairs (grouped by key) and processes them to produce the final output key-value pairs.
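As a rough illustration of the model, the sketch below implements word count in Python, runnable as a Hadoop Streaming job; the script name, jar path, and input/output paths in the comment are hypothetical. The mapper emits (word, 1) pairs and the reducer sums counts per key after the shuffle/sort.

```python
#!/usr/bin/env python3
# Word count in the MapReduce style, runnable via Hadoop Streaming, e.g. (paths hypothetical):
#   hadoop jar hadoop-streaming.jar -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /data/in -output /data/out
import sys
from itertools import groupby

def run_mapper():
    # Map phase: emit an intermediate (word, 1) pair for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reducer():
    # Reduce phase: Hadoop delivers the intermediate pairs sorted by key,
    # so consecutive lines with the same word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```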
What is the role of the Mapper in MapReduce?
- The Mapper is the first phase of a MapReduce job. It takes input data split into smaller chunks, processes each chunk, and emits intermediate key-value pairs.
What is the role of the Reducer in MapReduce?
- The Reducer is the second phase. It receives intermediate key-value pairs that have been grouped by key (after the Shuffling and Sorting phase) and processes them to produce the final output.
Explain the Shuffling and Sorting phase in MapReduce.
- This phase occurs automatically between the Map and Reduce phases.
- Shuffling: The intermediate key-value pairs from all Mappers are partitioned and transferred to the appropriate Reducers.
- Sorting: Before being fed to the Reducer, the intermediate key-value pairs for each key are sorted.
What is a Combiner in MapReduce? When is it used?
- A Combiner is an optional component that can be used in the MapReduce framework. It performs a local aggregation of the intermediate key-value pairs emitted by a Mapper before they are sent to the Reducers. It's used to reduce the amount of data transferred over the network, improving performance. It should only be used if the operation is commutative and associative (like sum or count).
What is a Partitioner in MapReduce? What is the default Partitioner?
- A Partitioner determines which Reducer instance an intermediate key-value pair should be sent to during the Shuffling phase. This ensures that all intermediate values for the same key go to the same Reducer. The default is the HashPartitioner, which chooses the Reducer as hash(key) mod the number of Reducers.
What is Apache Hive? What is its purpose?
- Apache Hive is a data warehousing system built on top of Hadoop. It provides a SQL-like query language called HiveQL (HQL) that allows users to query and analyze large datasets stored in HDFS, HBase, and other Hadoop-compatible file systems. It translates HQL queries into MapReduce, Tez, or Spark jobs.
What is the difference between Managed and External tables in Hive?
- Managed Table: Hive manages both the data and the schema. When you drop a managed table, Hive drops both the metadata and the underlying data in HDFS.
- External Table: Hive only manages the schema (metadata). The data resides in a specified location in HDFS (or other storage). When you drop an external table, Hive only drops the metadata; the data in HDFS remains untouched. Used when data is already in HDFS or managed by other processes.
What is Apache Pig? What is its purpose?
- Apache Pig is a platform for analyzing large datasets. It uses a high-level data flow language called Pig Latin. Pig is used for ETL (Extract, Transform, Load) tasks on large datasets. It provides a more procedural approach compared to Hive's declarative SQL. Pig Latin scripts are translated into MapReduce, Tez, or Spark jobs.
What is Apache Spark? How is it different from MapReduce?
- Apache Spark is a unified analytics engine for large-scale data processing. It is significantly faster than traditional MapReduce, especially for iterative algorithms and interactive data analysis, because it can perform computations in memory using Resilient Distributed Datasets (RDDs) or DataFrames/Datasets. It also offers a wider range of functionality beyond batch processing (Spark Streaming, MLlib, GraphX).
What is RDD in Spark?
- RDD (Resilient Distributed Dataset) is a fundamental data structure in Spark. It is a fault-tolerant collection of elements that can be operated on in parallel across a cluster. RDDs are immutable and can be cached in memory for faster access.
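A minimal PySpark sketch of working with an RDD; the inline collection is purely for illustration, and a real job would typically read from HDFS instead.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Build an RDD; in practice it would usually come from storage,
# e.g. sc.textFile("hdfs:///data/input") (hypothetical path).
words = sc.parallelize(["spark", "hadoop", "spark", "hive"])

# Transformations build a new, immutable RDD; nothing executes yet.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()             # keep the result in memory for repeated use
print(counts.collect())    # action: e.g. [('spark', 2), ('hadoop', 1), ('hive', 1)]

sc.stop()
```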
What is Apache HBase? What are its characteristics?
- Apache HBase is a distributed, versioned, non-relational (NoSQL) database built on top of HDFS. It provides random, real-time read/write access to large datasets. Characteristics include: column-oriented storage, strong consistency, automatic sharding (Regions), and integration with Hadoop ecosystem.
What is the HBase Data Model?
- HBase data is organized into:
- Tables: Contain rows of data.
- Row Key: A unique identifier for each row, used for sorting and partitioning.
- Column Families: A logical and physical grouping of columns. All columns within a family are stored together.
- Columns: A qualifier within a column family.
- Cells: The intersection of a row key, column family, column, and timestamp.
- Timestamp: Represents the version of the data in a cell.
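For illustration, here is a small sketch using the third-party happybase Python client, which talks to HBase through the Thrift gateway; the host name and the users table with an info column family are assumptions.

```python
import happybase  # third-party client; requires the HBase Thrift server to be running

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("users")                        # assumed table with column family "info"

# A cell is addressed by row key + "family:qualifier"; versions are kept per timestamp.
table.put(b"user#1001", {b"info:name": b"Alice", b"info:city": b"Paris"})

row = table.row(b"user#1001")
print(row[b"info:name"])   # b'Alice'

connection.close()
```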
What is Apache Sqoop? What is it used for?
- Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It's used for importing data from RDBMS into HDFS, Hive, or HBase, and exporting data from Hadoop back to RDBMS.
What is Apache Flume? What is it used for?
- Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store (like HDFS, HBase, or Kafka). It uses an agent-based architecture with Sources, Channels, and Sinks.
What is Apache ZooKeeper? Why is it important in Hadoop?
- Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It's crucial in Hadoop for coordinating distributed components, such as managing NameNode failover in HDFS HA, ResourceManager HA in YARN, and managing RegionServers in HBase.
What is Apache Kafka? How does it relate to the Hadoop ecosystem?
- Apache Kafka is a distributed event streaming platform. It's often used as a high-throughput, fault-tolerant messaging system for ingesting real-time data into the Hadoop ecosystem. Components like Spark Streaming, Flume, and Storm can consume data from Kafka topics for processing.
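A small sketch with the kafka-python client showing the producer and consumer roles; the broker address and the clickstream topic are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer  # kafka-python package

BROKERS = "broker:9092"   # hypothetical broker address

# Producer: publish a message to a topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("clickstream", b'{"user": 42, "page": "/home"}')
producer.flush()

# Consumer: read messages back from the beginning of the topic.
consumer = KafkaConsumer("clickstream",
                         bootstrap_servers=BROKERS,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)   # stop iterating if no new messages arrive
for message in consumer:
    print(message.partition, message.offset, message.value)
```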
What is Apache Oozie? What is its purpose?
- Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It's used to define and schedule complex workflows of dependent jobs (like MapReduce, Pig, Hive jobs) as a single unit.
What are the different modes of Hadoop deployment?
- Standalone Mode: Runs as a single Java process. Used for testing and debugging.
- Pseudo-Distributed Mode: Runs on a single machine but simulates a distributed environment with separate processes for NameNode, DataNode, ResourceManager, etc.
- Fully Distributed Mode: Runs on a cluster of machines with separate nodes for NameNode, ResourceManager, and multiple DataNodes and NodeManagers. This is the production mode.
What are the main configuration files in Hadoop?
- core-site.xml: Global configuration settings (e.g., HDFS URI).
- hdfs-site.xml: HDFS-specific configurations (e.g., replication factor, block size).
- yarn-site.xml: YARN-specific configurations (e.g., ResourceManager address).
- mapred-site.xml: MapReduce-specific configurations (e.g., MapReduce framework name).
How do you access data in HDFS from the command line?
- Using the hdfs dfs command, which is an alias for hadoop fs. Examples: hdfs dfs -ls /, hdfs dfs -put localfile.txt /hdfs/path/, hdfs dfs -get /hdfs/path/hdfsfile.txt .
What is the purpose of the JobHistoryServer in YARN?
- The JobHistoryServer (JHS) stores metadata and logs about completed MapReduce jobs. This allows users to access and view details about past jobs, including counters, configuration, and task attempts.
What is Fault Tolerance in Hadoop? How is it achieved?
- Fault Tolerance is the ability of the system to continue operating even if some components fail. In Hadoop, it's primarily achieved through:
- HDFS Data Replication: Multiple copies of data blocks ensure data availability even if DataNodes fail.
- YARN: If a NodeManager or ApplicationMaster fails, YARN can reschedule tasks on other available resources.
- HDFS HA: Ensures the NameNode is not a single point of failure.
What is Rack Awareness in HDFS?
- Rack Awareness is the concept of the NameNode being aware of the network topology of the cluster, specifically which rack each DataNode belongs to. This is used to optimize data placement for replication (default strategy: one replica on the local rack, two replicas on a different rack) and for read operations (reading from the closest replica).
What is the maximum number of NameNodes in an HDFS HA setup?
- Classically two NameNodes (one Active, one Standby); Hadoop 3.x also supports more than two (one Active with multiple Standbys). HDFS Federation allows for multiple independent NameNodes, each managing a portion of the namespace, but these do not provide HA for a single namespace.
What is the role of the Quorum Journal Manager (QJM) in HDFS HA?
- QJM is the recommended shared storage system for HDFS HA. Both the Active and Standby NameNodes communicate with a group of JournalNodes. The Active NameNode writes edit log entries to the JournalNodes, and the Standby NameNode reads these entries to keep its state synchronized.
What is the difference between MapReduce and Spark? (Revisited)
- MapReduce: Batch processing, disk-based intermediate storage, slower for iterative jobs, simpler programming model.
- Spark: In-memory processing, faster, supports batch, streaming, SQL, ML, Graph processing, more complex API (RDDs, DataFrames).
What are the advantages of using Spark over MapReduce?
- Faster processing (in-memory computation).
- Supports more complex computations (iterative algorithms, interactive queries).
- Provides a unified platform for different workloads (batch, streaming, SQL, ML).
- Easier to program (more expressive APIs in Scala, Python, Java, R).
What is Spark SQL?
- Spark SQL is a Spark module for structured data processing. It provides a programming interface for working with structured data using SQL queries or the DataFrame API. It allows you to query data stored in various formats (Parquet, ORC, JSON) and sources (Hive, JDBC, HDFS).
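A short PySpark sketch of the two interfaces Spark SQL offers, the DataFrame API and SQL over a temporary view; the events.json path and its columns are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Read structured data; the path and schema are hypothetical.
events = spark.read.json("hdfs:///data/events.json")

# DataFrame API
events.groupBy("user").count().show()

# Equivalent SQL over a temporary view
events.createOrReplaceTempView("events")
spark.sql("SELECT user, COUNT(*) AS hits FROM events GROUP BY user").show()

spark.stop()
```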
What is Spark Streaming?
- Spark Streaming is a Spark module that enables processing live streams of data. It takes live input data streams and divides them into batches, which are then processed by the Spark engine. This allows for near real-time data processing.
What are the different types of joins in Hive?
- JOIN (inner join)
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
- CROSS JOIN
- LEFT SEMI JOIN
What is the Metastore in Hive?
- The Metastore is a central repository for Hive metadata (schema information, location of data in HDFS, partition information, etc.). It can be configured to use a variety of databases (Derby, MySQL, PostgreSQL, etc.).
What is SerDe in Hive?
- SerDe (Serializer/Deserializer) is used by Hive to read data from HDFS and write data back to HDFS. It tells Hive how to process records from the data file. Hive provides various built-in SerDes (e.g., LazySimpleSerDe for delimited text files).
What is Partitioning and Bucketing in Hive?
- Partitioning: Dividing a table into partitions based on the values of one or more columns. This helps improve query performance by allowing Hive to scan only relevant partitions.
- Bucketing: Dividing data within a partition into buckets based on the hash of a column. This helps optimize joins and sampling.
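For illustration, a sketch of HiveQL DDL for a partitioned, bucketed table (the table and column names are assumptions), issued here through PySpark's spark.sql so the examples stay in Python; a Hive-enabled SparkSession or the Hive CLI/Beeline could run the same statement.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets spark.sql() create tables in the Hive metastore.
spark = SparkSession.builder.appName("hive-ddl-demo").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (        -- hypothetical table
        order_id INT,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)          -- one directory per sale_date value
    CLUSTERED BY (order_id) INTO 8 BUCKETS     -- hash order_id into 8 buckets per partition
    STORED AS ORC
""")

# Queries that filter on the partition column only scan the matching partitions.
spark.sql("SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01'").show()
```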
What is the difference between Hive and HBase?
- Hive: Data warehousing system, provides SQL-like queries for batch processing, optimized for scanning large datasets, schema-on-read.
- HBase: NoSQL database, provides random real-time read/write access, optimized for key-based lookups and range scans, schema-on-write (column families defined upfront).
What are the components of a Sqoop job?
- Sqoop uses connectors to interact with different databases. When you run a Sqoop job, it typically generates Java code for MapReduce jobs (or uses Spark/Tez) to transfer the data in parallel.
What is the purpose of the .crc file in HDFS?
- HDFS stores checksums for each block in a hidden .crc file to verify data integrity. When a client reads data, it verifies the checksum to detect corruption.
How do you ensure data security in Hadoop?
- Authentication: Using Kerberos for strong authentication between users/services and Hadoop components.
- Authorization: HDFS file permissions, Hive/HBase table permissions, Sentry/Ranger for fine-grained access control.
- Encryption: Encryption at rest (HDFS encryption) and encryption in transit (SSL/TLS).
- Auditing: Logging user actions and access attempts.
What is Kerberos in the context of Hadoop security?
- Kerberos is a network authentication protocol that provides strong authentication for client/server applications using secret-key cryptography. In Hadoop, it's used to verify the identity of users and services before granting access to resources.
What is Sentry/Ranger? (Security)
- Apache Sentry (Cloudera) and Apache Ranger (Hortonworks/Apache) are centralized security frameworks for fine-grained authorization in the Hadoop ecosystem. They allow administrators to define and manage access control policies for components like Hive, Impala, HDFS, HBase, etc.
What is Hadoop Federation? (HDFS) (Optional)
- HDFS Federation allows you to scale the NameNode horizontally by having multiple independent NameNodes, each managing a portion of the HDFS namespace. This addresses the NameNode's scalability limitations and provides isolation between namespaces.
What is the difference between HDFS Federation and HDFS HA? (HDFS)
- HDFS Federation: Multiple independent NameNodes managing different namespaces for scalability and isolation.
- HDFS HA: Two (or more) NameNodes managing the *same* namespace for fault tolerance.
What are the different types of filesystems supported by Hadoop?
- HDFS (hdfs://)
- Local File System (file://)
- S3 File System (s3n:// or s3a://) - for AWS S3
- Google Cloud Storage File System (gs://) - for GCP Cloud Storage
- Azure Blob Storage File System (wasb:// or abfs://) - for Azure Blob Storage
What is the purpose of the CompositeInputFormat in MapReduce? (Advanced MapReduce) (Optional)
- CompositeInputFormat supports map-side joins: it lets a single MapReduce job read multiple sorted, identically partitioned input sources and join their records before they reach the Mapper.
What is a Custom Partitioner? When would you use one? (MapReduce)
- A Custom Partitioner is a user-defined class that implements the Partitioner interface. You would use one when the default hash-based partitioning doesn't meet your requirements, for example, if you need to partition data based on a specific business logic or ensure that data for related keys goes to the same Reducer based on a criterion other than the key itself.
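The MapReduce Partitioner itself is a Java interface, but the same idea can be sketched in PySpark, which accepts a custom partitioning function via RDD.partitionBy; the EU-/US- key convention here is a made-up business rule.

```python
from pyspark import SparkContext

sc = SparkContext(appName="custom-partitioner-demo")

def region_partitioner(key):
    # Hypothetical business rule: all EU keys go to partition 0, everything else to
    # partition 1, instead of relying on the default hash of the key.
    return 0 if key.startswith("EU-") else 1

orders = sc.parallelize([("EU-123", 10.0), ("US-456", 20.0), ("EU-789", 5.0)])
by_region = orders.partitionBy(2, region_partitioner)

# glom() groups each partition's contents so we can see where records landed.
print(by_region.glom().collect())

sc.stop()
```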
What is the difference between a Combiner and a Reducer? (MapReduce)
- Combiner: Optional, runs on the Map side, performs local aggregation, reduces data transfer. Must be commutative and associative.
- Reducer: Mandatory, runs on the Reduce side, performs global aggregation, processes all values for a given key.
What is the role of the ShuffleHandler in YARN? (YARN)
- The ShuffleHandler is a service running on the NodeManager that is responsible for serving the intermediate Map outputs to the Reducer tasks during the shuffle phase.
What is Containerization in the context of Hadoop 3.x?
- Hadoop 3.x provides support for running YARN containers within Docker containers. This allows for better resource isolation, dependency management, and easier deployment of complex applications.
What is Erasure Coding in HDFS 3.x?
- Erasure Coding is a technique used in HDFS 3.x to provide fault tolerance with less storage overhead compared to traditional replication. Instead of storing multiple full replicas of a block, it stores data and parity blocks across DataNodes. It's suitable for cold data or data with lower access frequency.
What are the different types of schedulers in YARN?
- FIFO Scheduler: First-In, First-Out. Simple but not suitable for shared clusters.
- Capacity Scheduler: Allows organizations to allocate dedicated capacity to different queues/organizations. Supports hierarchical queues and guarantees capacity.
- Fair Scheduler: Aims to give all applications a fair share of resources over time. Resources are shared among applications.
How do you submit a Spark job to a YARN cluster?
- Using the spark-submit script, specifying the YARN cluster manager (e.g., --master yarn) and other parameters such as the application JAR/Python script, main class, executor memory, etc.
What is the difference between Spark RDDs, DataFrames, and Datasets? (Spark)
- RDD: Low-level API, schema-less, immutable, resilient, distributed collection of objects. Provides fine-grained control.
- DataFrame: Higher-level API, structured data with named columns, optimized using the Catalyst optimizer, available in Scala, Java, Python, R. Less control but better performance for structured data.
- Dataset: Combines the benefits of RDDs (type safety) and DataFrames (optimization). Available in Scala and Java.
What is the Zookeeper Ensemble? (ZooKeeper)
- A ZooKeeper ensemble is a group of ZooKeeper servers that work together to provide high availability and fault tolerance. A majority of servers in the ensemble must be operational for the service to be available.
What is the difference between Sqoop Import and Sqoop Export? (Sqoop)
- Sqoop Import: Transfers data from a relational database (RDBMS) to Hadoop (HDFS, Hive, HBase).
- Sqoop Export: Transfers data from Hadoop (HDFS, Hive, HBase) to a relational database (RDBMS).
What are the components of a Flume Agent? (Flume)
- Source: Consumes data from external sources (e.g., files, network ports, Kafka).
- Channel: A temporary store for events between the Source and Sink (e.g., Memory Channel, File Channel, Kafka Channel).
- Sink: Delivers data to a destination (e.g., HDFS Sink, HBase Sink, Logger Sink, Kafka Sink).
What is the purpose of the _SUCCESS file in Hadoop output directories?
- The _SUCCESS file is an empty marker file created by MapReduce, Spark, and other processing frameworks in the output directory upon successful completion of a job. Its presence indicates that the job finished without errors and the output data is complete.
What is Speculative Execution in MapReduce/Spark?
- Speculative execution is a feature where if a task is running slower than expected, the framework launches a duplicate copy of the task on another node. The task that finishes first is accepted, and the other is terminated. This helps mitigate the impact of slow nodes (stragglers).
What are Hadoop Counters? What are they used for? (MapReduce)
- Counters are a mechanism in MapReduce to collect statistics about a job's progress. They are used to track metrics like the number of input records processed, output records written, bytes read/written, or custom application-specific metrics. Useful for monitoring and debugging.
What is the Distributed Cache in MapReduce?
- The Distributed Cache is a facility provided by the MapReduce framework to distribute small, read-only files (like lookup tables, configuration files, JARs) needed by the Map or Reduce tasks to all nodes in the cluster before the job starts.
How do you handle small files in HDFS? What are the issues with many small files?
- Issues with many small files:
- NameNode overhead (metadata for each file).
- Increased seek times when accessing data.
- Inefficient MapReduce processing (each file typically gets its own Mapper).
- Handling small files:
- Combine small files into larger SequenceFiles, Avro data files, or Parquet/ORC files.
- Use HAR (Hadoop Archive) (less common).
- Use HBase for storing many small records.
What is the difference between Block and Split in MapReduce?
- Block: A physical division of data in HDFS, typically 128 MB.
- Split: A logical division of input data that is processed by a single Mapper task. Typically, a split corresponds to an HDFS block, but it can be smaller or larger depending on the InputFormat and compression.
What is the purpose of the InputFormat in MapReduce?
- The InputFormat is responsible for:
- Splitting the input data into logical splits (InputSplits).
- Creating a RecordReader for each split to read the data record by record.
What is the purpose of the OutputFormat in MapReduce?
- The OutputFormat is responsible for:
- Validating the output specification of the job.
- Creating a RecordWriter to write the output key-value pairs to the output file(s).
What is Impala? How is it different from Hive? (Cloudera specific, but common) (Optional)
- Impala is a massively parallel processing (MPP) SQL query engine for data stored in Hadoop. It provides much faster query response times than Hive (when Hive uses MapReduce) because it bypasses MapReduce and directly queries the data using its own engine. It's often used for interactive SQL queries. Hive is better for complex ETL and batch processing.
What is Tez? (Processing Framework) (Optional)
- Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, based on YARN. It's used by Hive and Pig as a faster execution engine than MapReduce by creating directed acyclic graphs (DAGs) of tasks.
What is the difference between ACID properties in traditional RDBMS and HBase? (HBase)
- Traditional RDBMS offer full ACID (Atomicity, Consistency, Isolation, Durability) compliance. HBase offers ACID properties at the row level, meaning operations on a single row are atomic and consistent, but transactions spanning multiple rows are not natively supported with full ACID guarantees without external coordination.
What is Bloom Filter in HBase? (HBase) (Optional)
- A Bloom Filter is a probabilistic data structure used in HBase (specifically in HFiles) to quickly check if a given Row Key or Row-Column combination exists in the file without having to read the entire file. It helps speed up read operations.
What is Replication in HBase? (HBase) (Optional)
- HBase Replication allows you to replicate data asynchronously between HBase clusters, often in different data centers. This is used for disaster recovery or distributing reads across clusters.
What is the difference between Flume and Sqoop? (Data Ingestion)
- Flume: Primarily for ingesting streaming data (logs, events) from various sources into Hadoop. Designed for continuous data flow.
- Sqoop: Primarily for bulk data transfer between relational databases and Hadoop. Designed for batch imports/exports.
What is the role of the Active and Standby NameNodes in HDFS HA? (HDFS) (Revisited)
- The Active NameNode handles all client requests and modifications to the namespace. The Standby NameNode stays synchronized by reading the edit logs written by the Active NameNode to the JournalNodes. If the Active NameNode fails, the Standby takes over and becomes the new Active NameNode.
What is the purpose of the Balancer in HDFS? (HDFS)
- The HDFS Balancer is a tool used to balance the data distribution across DataNodes in the cluster. It moves blocks from DataNodes with higher disk utilization to DataNodes with lower disk utilization to ensure even disk space usage.
What is the command to format the NameNode? When is it used?
- The command is hdfs namenode -format. It initializes the HDFS file system and should *only* be used when setting up a new HDFS cluster for the first time, or when you want to completely wipe the existing HDFS data and metadata (use with extreme caution!).
How do you check the health of HDFS DataNodes?
- You can check the NameNode UI (typically at http://namenode_host:9870), which lists all DataNodes and their status (live, dead, decommissioned). You can also use the hdfs dfsadmin -report command.
How do you check the health of YARN NodeManagers?
- You can check the ResourceManager UI (typically at http://resourcemanager_host:8088), which lists all NodeManagers and their status.
What is the difference between a MapReduce Job, Task, and Attempt? (MapReduce)
- Job: The overall execution of a MapReduce program (e.g., a WordCount job).
- Task: Either a Map task or a Reduce task within a job. A job is broken down into multiple Map and Reduce tasks.
- Attempt: A specific instance of a task running on a NodeManager. If a task fails, YARN may launch new attempts for that task.
What is the default number of Reducers if not specified in a MapReduce job?
- The default is 1. This means all intermediate data will be sent to a single Reducer, which can be a performance bottleneck. It's usually recommended to configure the number of Reducers based on the data size and cluster capacity.
How can you optimize a Hive query? (Hive)
- Use appropriate file formats (Parquet, ORC).
- Partition and bucket tables.
- Vectorization (processing data in batches).
- Cost-Based Optimizer (CBO).
- Tune MapReduce/Tez/Spark execution engine parameters.
- Use Tez or Spark instead of MapReduce as the execution engine.
- Properly join tables (e.g., using Map joins for small tables).
What is the difference between LIMIT and FETCH FIRST in Hive? (Hive) (Optional)
- LIMIT restricts the number of rows returned by the query.
- FETCH FIRST is a standard SQL clause (though Hive supports it) that also restricts the number of rows, often used with ORDER BY to get the top N rows.
What is the purpose of the EXPLAIN command in Hive and Spark SQL?
- The EXPLAIN command shows the execution plan for a query. This helps you understand how the query will be processed (e.g., which tables will be scanned, how joins will be performed, which partitions will be accessed) and identify potential performance issues.
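In PySpark, the same plan can be inspected either with an EXPLAIN statement or with DataFrame.explain(); this assumes an active SparkSession named spark and an events view as in the earlier Spark SQL sketch.

```python
# SQL form: prints the physical plan for the query.
spark.sql("EXPLAIN SELECT user, COUNT(*) AS hits FROM events GROUP BY user").show(truncate=False)

# DataFrame form: explain(True) also prints the parsed, analyzed, and optimized logical plans.
spark.table("events").groupBy("user").count().explain(True)
```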
What is Schema-on-Read vs Schema-on-Write? How do Hive and HBase fit?
- Schema-on-Write: The schema is enforced when data is written to the storage (e.g., traditional RDBMS). Data must conform to the schema before storage. HBase has some schema-on-write aspects (column families).
- Schema-on-Read: The schema is applied when data is read (queried). Data can be stored in a flexible format, and the interpretation is done during the query. Hive is a prime example; you define the schema in HiveQL, but the underlying data in HDFS can be simple text files.
What is a Region in HBase? (HBase)
- A Region is a contiguous, sorted range of rows in an HBase table. HBase tables are automatically sharded horizontally by row key into Regions. Regions are served by RegionServers.
What is a WAL (Write Ahead Log) in HBase? (HBase)
- The WAL is a log of all changes made to data in HBase before they are written to the MemStore (in-memory buffer). This ensures durability; if a RegionServer crashes, the WAL can be replayed to recover the data that was in memory but not yet persisted to disk (HFiles).
What is MemStore and HFile in HBase? (HBase)
- MemStore: An in-memory write buffer in a RegionServer where new writes are stored before being flushed to disk. Each column family in a Region has its own MemStore.
- HFile: The actual on-disk storage format for HBase data. When a MemStore reaches a certain size, its contents are flushed to a new HFile in HDFS.
What is Compaction in HBase? (HBase)
- Compaction is the process of merging multiple HFiles into a smaller number of larger HFiles. This is done to improve read performance, reduce the number of files to manage, and clean up deleted or expired versions of data. There are minor and major compactions.
What are the different deployment modes for Spark? (Spark)
- Standalone Mode: Spark's own simple cluster manager.
- Apache Mesos: Cluster manager.
- Apache YARN: Cluster manager (most common in Hadoop ecosystems).
- Kubernetes: Container orchestration platform.
What is the purpose of the Driver Program in Spark? (Spark)
- The Driver Program is the process that runs the main function of your Spark application. It creates the SparkContext (or SparkSession), defines the DAG (Directed Acyclic Graph) of computations, and coordinates the execution on the cluster manager.
What are Transformations and Actions in Spark? (Spark)
- Transformations: Operations on RDDs/DataFrames/Datasets that create a new RDD/DataFrame/Dataset (e.g., map, filter, groupBy). They are lazy; they don't compute results immediately.
- Actions: Operations that trigger the execution of the transformations and return a result to the driver program or write data to external storage (e.g., count, collect, saveAsTextFile).
What is Lazy Evaluation in Spark? (Spark)
- Spark uses lazy evaluation for transformations. This means that transformations are not executed immediately when they are called. Instead, Spark builds a DAG of transformations. Execution only happens when an Action is called. This allows Spark to optimize the execution plan.
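A tiny PySpark illustration of lazy evaluation; it assumes an active SparkContext named sc.

```python
nums = sc.parallelize(range(1_000_000))

evens = nums.filter(lambda n: n % 2 == 0)   # transformation: nothing runs yet
squares = evens.map(lambda n: n * n)        # still nothing runs; Spark only records the lineage (DAG)

print(squares.count())                      # action: now the whole pipeline executes -> 500000
```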
What is the difference between collect() and take() in Spark? (Spark)
- collect() returns all elements of the RDD/DataFrame/Dataset to the driver program. Use with caution on large datasets, as it can cause out-of-memory errors on the driver.
- take(n) returns the first n elements to the driver program. Safer than collect() for inspecting a small subset of data.
What is Broadcast Join in Spark? (Spark)
- Broadcast Join is an optimization technique in Spark where if one of the tables being joined is small enough, Spark can broadcast the smaller table to all executor nodes. This avoids the expensive shuffle operation required for standard joins and can significantly improve performance.
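A PySpark sketch of a broadcast join; it assumes an active SparkSession named spark, and the Parquet paths and the country_code join key are assumptions.

```python
from pyspark.sql.functions import broadcast

orders = spark.read.parquet("hdfs:///warehouse/orders")        # large fact table (hypothetical path)
countries = spark.read.parquet("hdfs:///warehouse/countries")  # small dimension table (hypothetical path)

# The broadcast() hint asks Spark to ship the small table to every executor,
# so the join happens locally on each node and the large table is not shuffled.
joined = orders.join(broadcast(countries), on="country_code")
joined.show(5)
```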
What is Shuffle in Spark? (Spark)
- Shuffle is a complex and expensive operation in Spark where data needs to be redistributed across partitions and executors. It typically occurs during operations like groupByKey, reduceByKey, join, or orderBy. Optimizing shuffle is crucial for Spark performance.
What is the difference between reduceByKey and groupByKey in Spark? (Spark)
- groupByKey groups the values for each key and returns a new RDD of (key, Iterable[value]) pairs.
- reduceByKey groups the values for each key and then applies a reduction function to combine them. It is generally more efficient than groupByKey because it performs the reduction on the map side before shuffling, reducing the amount of data transferred (see the sketch below).
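A minimal PySpark comparison, assuming an active SparkContext named sc; both produce the same result, but reduceByKey shuffles far less data.

```python
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# groupByKey: every (key, value) pair is shuffled, then summed on the reduce side.
grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey: partial sums are computed map-side, so less data crosses the network.
reduced = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(grouped.collect()))   # [('a', 3), ('b', 1)]
print(sorted(reduced.collect()))   # [('a', 3), ('b', 1)]
```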
What is YARN ResourceManager High Availability (HA)? (YARN)
- ResourceManager HA provides fault tolerance for the YARN ResourceManager. It involves having multiple ResourceManagers (one Active, others Standby), with ZooKeeper used to store application state and perform leader election so that a Standby can automatically take over if the Active ResourceManager fails.
What is the purpose of the container-executor in YARN? (YARN) (Optional)
- The container-executor is a process running on the NodeManager responsible for launching and managing containers. It provides an extra layer of security and resource isolation, especially when running tasks as different users.
What are YARN Queues? How are they used? (YARN)
- YARN queues are used to manage and allocate resources to different applications or users in a multi-tenant cluster. Schedulers (Capacity or Fair) manage these queues, allowing administrators to define resource limits, priorities, and access control for different queues.
What are the different components of Apache Kafka? (Kafka)
- Producers: Applications that publish messages to Kafka topics.
- Consumers: Applications that subscribe to and read messages from Kafka topics.
- Brokers: Kafka servers that store messages and serve consumer/producer requests.
- Topics: Categories or feeds to which messages are published. Topics are partitioned.
- Partitions: Ordered, immutable sequence of records within a topic. Data is appended to partitions.
- ZooKeeper: Used by Kafka for managing broker metadata, leader election, and configuration. (Note: Kafka is moving away from ZooKeeper in newer versions).
What is the difference between a Kafka Topic and a Partition? (Kafka)
- A Topic is a category of messages. A Partition is an ordered, immutable sequence of records within a topic. A topic is divided into one or more partitions. This allows for horizontal scaling and parallel processing of messages.
What is the role of ZooKeeper in Kafka? (Kafka) (Revisited)
- ZooKeeper was traditionally used by Kafka for:
- Broker registration and discovery.
- Leader election for brokers and partitions.
- Storing configuration and metadata.
- Newer Kafka versions are reducing dependency on ZooKeeper.
What is the difference between Apache Storm, Spark Streaming, and Flink? (Stream Processing) (Optional)
- Storm: True real-time, tuple-at-a-time processing (micro-batching is available via Trident). Low-level API.
- Spark Streaming: Micro-batch processing on top of Spark Core. Simpler API than Storm, integrates well with other Spark components.
- Flink: True stream processing (event-at-a-time or micro-batch). Provides excellent state management and fault tolerance.
What is the purpose of the .jhist file in MapReduce? (MapReduce) (Optional)
- The .jhist file (Job History file) contains details about a completed MapReduce job, including configuration, task attempts, counters, etc. It's used by the JobHistoryServer.
What is the role of the Namenode's FsImage and EditLog? (HDFS)
- FsImage: A snapshot of the HDFS file system namespace and its state at a particular point in time. Stored on disk.
- EditLog: A transaction log that records all changes made to the HDFS namespace since the last FsImage. Stored on disk. The NameNode applies the EditLog to the FsImage in memory to reconstruct the current state.
What is the process of Checkpointing by the Secondary NameNode? (HDFS)
- The Secondary NameNode periodically fetches the EditLog (and, if needed, the latest FsImage) from the NameNode, applies the EditLog to the FsImage, and creates a new, merged FsImage. This new FsImage is then transferred back to the NameNode, which can truncate its EditLog. This process reduces the size of the EditLog and the time it takes for the NameNode to start.
What is the difference between a cold boot and a warm boot of the NameNode? (HDFS)
- Cold Boot: The NameNode starts by loading the FsImage from disk into memory and then applying the entire EditLog. This can take a long time for large clusters.
- Warm Boot: In an HA setup, the Standby NameNode is already keeping its state synchronized by applying EditLog entries. If it becomes the Active NameNode, it only needs to process a small portion of the latest EditLog entries, resulting in a much faster startup.
What is the purpose of the Rack Awareness script? (HDFS)
- The Rack Awareness script is a custom script configured in HDFS that the NameNode calls to determine which rack a given DataNode belongs to. This information is used for replica placement and read optimization.
What is the difference between Map-side join and Reduce-side join in MapReduce? (MapReduce)
- Map-side Join: Performed in the Map phase. Requires the input data to be sorted or partitioned in a specific way. Often used with the Distributed Cache to broadcast smaller tables. More efficient as it avoids the shuffle phase for the join.
- Reduce-side Join: Performed in the Reduce phase. Data from both inputs with the same join key are brought together at the same Reducer via shuffling. Less efficient due to the shuffle cost.
What is the concept of Data Locality in Hadoop? Why is it important?
- Data Locality refers to the ability of the processing framework (MapReduce, Spark) to run a task on the same node where the data it needs is stored (or on a node on the same rack). It's important because moving computation to data is much more efficient than moving large amounts of data over the network to the computation. Hadoop tries to schedule tasks with data locality preference.
What are the different levels of Data Locality?
- NODE_LOCAL: The task runs on the same node as the data block. Highest preference.
- RACK_LOCAL: The task runs on a different node but on the same rack as the data block. Second highest preference.
- ANY: The task runs on any node in the cluster. Lowest preference, involves cross-rack or cross-data center network transfer.
What are the challenges of managing a large Hadoop cluster?
- Monitoring and alerting.
- Configuration management.
- Security setup and management (Kerberos, Sentry/Ranger).
- Resource management and multi-tenancy (YARN queue configuration).
- Upgrades and patching.
- Troubleshooting distributed issues.
- Data governance and lineage.
- Cost management (if running on cloud).
How do you monitor a Hadoop cluster?
- Using the web UIs provided by Hadoop components (NameNode UI, ResourceManager UI, JobHistoryServer UI).
- Using monitoring tools integrated with Hadoop distributions (Cloudera Manager, Ambari - now part of Cloudera).
- Using external monitoring systems (Prometheus, Grafana, Nagios) integrated with JMX metrics from Hadoop daemons.
- Using Cloud Provider monitoring services (Cloud Monitoring for GCP, CloudWatch for AWS).
- Analyzing logs (Cloud Logging for GCP, S3/CloudWatch Logs for AWS).
What is the purpose of the distcp command?
- distcp (Distributed Copy) is a tool used for copying large amounts of data between Hadoop file systems (including different clusters or different storage systems like S3, GCS). It uses MapReduce to perform the copy operation in parallel.
What is the difference between hadoop fs -copyFromLocal and hadoop fs -put?
- They are essentially the same command; put is the more commonly used alias for copyFromLocal. Both copy files from the local filesystem to HDFS.
What is the difference between hadoop fs -copyToLocal and hadoop fs -get?
- They are essentially the same command; get is the more commonly used alias for copyToLocal. Both copy files from HDFS to the local filesystem.
What is the purpose of the YARN Timeline Server? (YARN) (Optional)
- The YARN Timeline Server stores and provides generic application information and framework-specific information for current and finished applications. It provides more detailed historical data than the JobHistoryServer.
What is the difference between a Spark Application, Job, Stage, and Task? (Spark)
- Application: The user program using Spark APIs. Consists of a Driver program and executors.
- Job: Triggered by a Spark Action. A job is divided into Stages.
- Stage: A set of tasks that can be executed together in parallel. Stages are separated by shuffle boundaries (wide dependencies); narrow dependencies are pipelined within a stage.
- Task: The smallest unit of execution in Spark. Runs on an executor and processes a single partition of data.
What are Narrow and Wide Dependencies in Spark RDDs? (Spark) (Optional)
- Narrow Dependency: Each partition of the parent RDD is used by at most one partition of the child RDD (e.g., map, filter). Allows for pipelining and less overhead.
- Wide Dependency: Multiple partitions of the parent RDD are used by each partition of the child RDD (e.g., groupByKey, reduceByKey, join). Requires a shuffle operation.
What is Data Skew in Spark/MapReduce? How can you handle it?
- Data Skew occurs when data is unevenly distributed across partitions, leading to some tasks processing significantly more data than others. This can cause bottlenecks as the job waits for the slowest tasks to complete.
- Handling Data Skew (a salting sketch follows this list):
- Salting/adding a random prefix to keys.
- Custom partitioning.
- Using Combiners (in MapReduce).
- Using reduceByKey instead of groupByKey (in Spark).
- Re-partitioning the data (in Spark).
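A PySpark sketch of the salting technique mentioned above; it assumes an active SparkContext named sc, and the key names and number of salt buckets are arbitrary.

```python
import random

skewed = sc.parallelize([("hot_key", 1)] * 1000 + [("rare_key", 1)] * 10)

SALT_BUCKETS = 8

# 1. Prefix each key with a random salt so the hot key spreads over several partitions.
salted = skewed.map(lambda kv: (f"{random.randint(0, SALT_BUCKETS - 1)}#{kv[0]}", kv[1]))

# 2. Aggregate the salted keys (the heavy key is now split across up to 8 reducers).
partial = salted.reduceByKey(lambda a, b: a + b)

# 3. Strip the salt and aggregate once more to get the true per-key totals.
totals = (partial
          .map(lambda kv: (kv[0].split("#", 1)[1], kv[1]))
          .reduceByKey(lambda a, b: a + b))

print(sorted(totals.collect()))   # [('hot_key', 1000), ('rare_key', 10)]
```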
What is the difference between Parquet and ORC file formats in Hadoop?
- Both are columnar storage file formats used in Hadoop that offer better performance and compression compared to row-based formats (like text or SequenceFile) for analytical queries.
- Parquet: Developed by Cloudera and Twitter. Widely supported.
- ORC (Optimized Row Columnar): Developed by Hortonworks. Designed specifically for Hive. Often shows better compression and read performance with Hive.
What is the purpose of the .inprogress file in HDFS?
- When a file is being written to HDFS, it may first be written to a temporary file with a .inprogress suffix, which indicates that the file is still being written and is not yet complete. Once the file is successfully closed, the suffix is removed.
What is the default replication factor in HDFS? Can you change it? How?
- The default replication factor is 3. Yes, you can change it:
- Globally via dfs.replication in hdfs-site.xml.
- Per file using the hdfs dfs -setrep command.
What is the difference between a Hot and Cold standby NameNode? (HDFS HA) (Optional)
- Hot Standby: The Standby NameNode is fully synchronized with the Active NameNode and can take over almost immediately. Requires shared storage (QJM).
- Cold Standby: The Standby NameNode is not actively synchronized. If the Active fails, the Cold Standby needs to load the latest FsImage and apply the EditLog, which takes significant time. Less common in modern HA setups.
What is the purpose of the Failover Controller in HDFS HA? (HDFS) (Optional)
- The Failover Controller is a component (often implemented using ZooKeeper) that monitors the health of the Active NameNode. If the Active NameNode fails, the Failover Controller initiates the failover process to transition the Standby NameNode to the Active state.
What is the default port for the NameNode, ResourceManager, and JobHistoryServer UIs?
- NameNode UI (Hadoop 3.x): 9870 (Hadoop 2.x: 50070)
- ResourceManager UI: 8088
- JobHistoryServer UI: 19888
What is the purpose of the core-site.xml file?
- core-site.xml contains core configuration settings for Hadoop, such as the default filesystem URI (fs.defaultFS), which points to the NameNode address.
What is the purpose of the yarn-site.xml file?
- yarn-site.xml contains configuration settings for YARN, such as the ResourceManager address (yarn.resourcemanager.hostname) and scheduler configurations.
What is the purpose of the mapred-site.xml file?
- mapred-site.xml contains configuration settings for MapReduce, such as the MapReduce framework to use (mapreduce.framework.name, typically set to yarn) and JobHistoryServer configuration.
What is the purpose of the hdfs-site.xml file?
- hdfs-site.xml contains configuration settings specific to HDFS, such as the replication factor (dfs.replication), block size (dfs.blocksize), and NameNode/DataNode directories.
What is the difference between a Block and a Replica? (HDFS)
- A Block is a logical division of a file. A Replica is a physical copy of a block stored on a DataNode. A block has multiple replicas (default 3).
What is the purpose of the Checksum in HDFS? (HDFS) (Revisited)
- Checksums are used to detect data corruption in HDFS. Each DataNode maintains checksums for the blocks it stores. When a client reads a block, it verifies the checksum to ensure the data hasn't been corrupted during storage or transfer.
What is the command to check the disk usage in HDFS?
- hdfs dfs -du /hdfs/path (shows disk usage of files/directories)
- hdfs dfs -count -q /hdfs/path (shows quota information)
What is the command to set the replication factor for a file in HDFS?
- hdfs dfs -setrep -w <replication_factor> /hdfs/path (the -w flag waits until replication completes)
What is the command to add a new DataNode to the cluster? (Admin) (Optional)
- Configure the new node with the DataNode software and point it to the NameNode, then start the DataNode process; the NameNode will discover it via heartbeats. You might need to update the include file referenced by dfs.hosts and refresh the NameNode if host inclusion/exclusion is enabled.
What is the command to decommission a DataNode? (Admin) (Optional)
- Add the DataNode's hostname or IP to the dfs.hosts.exclude file on the NameNode. Refresh the NameNode (hdfs dfsadmin -refreshNodes). The NameNode will start replicating blocks from the decommissioning node to other nodes. Once all blocks are replicated, the DataNode can be safely shut down.
What is the purpose of the YARN NodeManager's health checker script? (YARN) (Optional)
- The NodeManager can be configured to run a health checker script periodically. If the script reports the node as unhealthy, the NodeManager will stop accepting new containers and will eventually be considered unhealthy by the ResourceManager, preventing tasks from being scheduled on that node.
What is the difference between a YARN Application and a YARN Container? (YARN)
- A YARN Application is a specific instance of a program submitted to the cluster (e.g., a MapReduce job, a Spark job). A YARN Container is a resource allocation unit (CPU, memory) within which a task of an application runs. An application consists of one or more containers.
What is the purpose of the ApplicationMaster in YARN? (YARN) (Revisited)
- The ApplicationMaster is responsible for managing the lifecycle of a single application. It negotiates resources from the ResourceManager (in the form of containers) and coordinates the execution of the application's tasks by requesting containers from NodeManagers and monitoring their progress.
What is the difference between HBase Master and RegionServer? (HBase)
- HBase Master: Monitors RegionServers, handles Region assignments, manages metadata, and performs administrative tasks (schema changes, load balancing). There can be multiple Masters for HA, but only one is active.
- RegionServer: Hosts and manages Regions. Handles read/write requests from clients for the Regions it serves. Manages MemStores and HFiles. There are multiple RegionServers in a cluster.
What are the different types of compactions in HBase? (HBase) (Optional)
- Minor Compaction: Merges a few smaller HFiles into a larger one. More frequent.
- Major Compaction: Merges all HFiles in a column family of a Region into a single new HFile. Also performs garbage collection (removes deleted/expired versions). Less frequent, resource-intensive.
What is the purpose of ZNode in ZooKeeper? (ZooKeeper)
- A ZNode (ZooKeeper Node) is the basic data unit in ZooKeeper. It is a node in a hierarchical namespace, similar to a file system path. ZNodes can store data and have children. They can be persistent, ephemeral (tied to a client session), or sequential.
What is the concept of Watches in ZooKeeper? (ZooKeeper)
- Clients can set watches on ZNodes. A watch is a one-time trigger that sends an event notification to the client when the ZNode's data or children change. This allows clients to react to changes in the distributed system's state.
What is the difference between Apache Oozie and Apache Airflow? (Workflow Scheduling) (Optional)
- Oozie: Designed specifically for scheduling Hadoop jobs (MapReduce, Pig, Hive). Uses XML to define workflows.
- Airflow: A more general-purpose workflow management platform. Workflows are defined as Directed Acyclic Graphs (DAGs) in Python. Supports a wider range of connectors and is often considered more flexible and easier to use for complex pipelines.
What is the purpose of the .crc file in HDFS? (Revisited)
- The .crc file stores the checksums for the corresponding data file in HDFS. This allows HDFS to verify the integrity of the data blocks.
How does HDFS handle corrupted blocks?
- If a DataNode detects a corrupted block (e.g., during a read operation or background scanning), it reports the corruption to the NameNode. The NameNode then marks the block as corrupted and schedules the replication of a healthy replica to replace the corrupted one.
What is the purpose of the HDFS Safe Mode? (HDFS)
- HDFS Safe Mode is a state that the NameNode enters during startup. In Safe Mode, the NameNode does not allow any modifications to the file system (create, delete, modify). It waits for DataNodes to report their blocks and ensures that a sufficient percentage of blocks have reached the minimum replication level. Once this condition is met, Safe Mode is exited, and the file system becomes fully operational.
How do you exit Safe Mode in HDFS?
- HDFS automatically exits Safe Mode once the required percentage of blocks are reported and replicated. You can also manually exit Safe Mode using the command hdfs dfsadmin -safemode leave (use with caution).
What is the difference between HDFS and a traditional file system (like ext4 or NTFS)?
- HDFS: Distributed, designed for large files and batch processing, fault-tolerant (replication), optimized for sequential reads, write-once-read-many model, doesn't support random writes within a file.
- Traditional File System: Local, designed for smaller files and random access, less fault-tolerant (unless using RAID), supports random reads and writes.
What is the purpose of the MapReduce framework's InputSplit? (MapReduce) (Revisited)
- An InputSplit is a logical representation of a chunk of input data that will be processed by a single Mapper task. It doesn't contain the actual data but rather information about the data's location (file path, offset, length).
What is the purpose of the RecordReader in MapReduce? (MapReduce)
- The RecordReader is created by the InputFormat for each InputSplit. It reads the input data record by record and converts it into key-value pairs that are then fed to the Mapper.
What is the difference between Hive and Impala? (Revisited)
- Hive uses MapReduce/Tez/Spark for execution (batch processing, higher latency). Impala has its own native MPP engine (interactive queries, lower latency). Impala is generally faster for ad-hoc queries, while Hive is better for complex ETL.
What is the purpose of the Zookeeper Quorum? (ZooKeeper) (Revisited)
- A Zookeeper Quorum is the group of Zookeeper servers in an ensemble. For the ensemble to be available, a majority of the servers in the quorum must be running and able to communicate with each other. This ensures consistency and reliability.
What is the difference between Sqoop Connectors?
- Sqoop uses connectors to interact with different database systems. There are generic JDBC connectors and database-specific connectors (e.g., for MySQL, PostgreSQL, Oracle, SQL Server) that might offer better performance or support specific features.
What is the purpose of the --split-by argument in Sqoop import? (Sqoop)
- The --split-by argument tells Sqoop which column to use to create splits for the MapReduce job. Sqoop will generate queries with WHERE clauses based on the values in this column to partition the data for parallel import. Choosing a column with a uniform distribution of values is important for good performance.
What is the purpose of the --boundary-query argument in Sqoop import? (Sqoop) (Optional)
- The --boundary-query argument allows you to provide a custom query to determine the boundaries for creating splits. This can be useful when the data distribution is skewed or when the default split-by column doesn't work well.