Spark Interview Questions and Answers


1. What is Apache Spark?
  • Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It's used for large-scale data processing due to its speed and ease of use compared to traditional MapReduce.
2. What are the key features of Apache Spark?
  • In-Memory Computing: Stores data in memory for faster processing.
  • Scalability: Can handle petabytes of data using a cluster of machines.
  • Ease of Use: Provides APIs in Java, Scala, Python, and R.
  • Unified Analytics Engine: Supports SQL, streaming data, machine learning, and graph processing.
3. What is a Resilient Distributed Dataset (RDD)?
  • RDDs are the fundamental building blocks of Apache Spark. They represent an immutable, distributed collection of objects that can be operated on in parallel across a cluster.
4. What are the characteristics of RDDs?
  • Immutable: Once created, their content cannot be modified.
  • Distributed: Divided into multiple partitions across nodes.
  • Resilient: Fault-tolerant through lineage information.
  • Lazy Evaluation: Transformations are not executed until an action is triggered.
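  • A minimal PySpark sketch of these properties, assuming a local cluster (the app name and data are illustrative):
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD spread across 4 partitions; nothing is computed yet (lazy).
numbers = sc.parallelize(range(10), numSlices=4)

# Transformations return new RDDs; the original is never modified (immutable).
squares = numbers.map(lambda x: x * x)

# An action triggers the actual computation across the partitions.
print(squares.collect())           # [0, 1, 4, 9, ...]
print(numbers.getNumPartitions())  # 4

sc.stop()
```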
5. What is the difference between map and flatMap transformations in Spark RDDs?
  • map(): Transforms each element of the RDD into exactly one new element.
  • flatMap(): Transforms each element into zero or more new elements, flattening the results.
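  • A short PySpark sketch contrasting the two (the sample lines are illustrative):
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "map-vs-flatmap")
lines = sc.parallelize(["hello world", "spark is fast"])

# map(): exactly one output element per input element (here, a list per line).
print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['spark', 'is', 'fast']]

# flatMap(): zero or more output elements per input, flattened into one RDD.
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'spark', 'is', 'fast']

sc.stop()
```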
6. What is lazy evaluation in Spark?
  • Lazy evaluation means that Spark does not immediately execute transformations as they are called. Instead, it builds a logical execution plan and waits until an action is triggered to execute the transformations.
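  • A small PySpark illustration, assuming a local run: no work happens until the action at the end.
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval")

rdd = sc.parallelize(range(1_000_000))

# These transformations only extend the logical execution plan.
filtered = rdd.filter(lambda x: x % 2 == 0)
doubled = filtered.map(lambda x: x * 2)

# The action below triggers the whole pipeline in one pass over the data.
print(doubled.take(5))  # [0, 4, 8, 12, 16]

sc.stop()
```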
7. What is a SparkSession?
  • SparkSession is the entry point to programming Spark with the Dataset and DataFrame API. It allows the creation of DataFrames and execution of SQL queries.
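  • A minimal sketch of creating a SparkSession in PySpark (the app name and master setting are placeholders):
```python
from pyspark.sql import SparkSession

# Build (or reuse) the session; it wraps the SparkContext and the SQL engine.
spark = (SparkSession.builder
         .appName("session-demo")
         .master("local[*]")
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()

spark.stop()
```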
8. What is the difference between DataFrame and Dataset in Spark?
  • DataFrame: A distributed collection of data organized into named columns, similar to a table in a relational database.
  • Dataset: A strongly-typed, immutable collection of objects (available in the Scala and Java APIs) that can be transformed in parallel using functional or relational operations; in Scala, a DataFrame is simply a Dataset[Row].
9. What is Spark SQL?
  • Spark SQL is a module for structured data processing. It allows querying data via SQL as well as the DataFrame and Dataset APIs.
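  • A sketch showing the same query expressed through the DataFrame API and through SQL (the table and data are made up for illustration):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"])

# DataFrame API version of the query.
df.filter(df.age > 30).select("name").show()

# Equivalent SQL over a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```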
10. What is Catalyst Optimizer in Spark SQL?
  • Catalyst is Spark SQL's extensible query optimizer. It leverages Scala language features such as pattern matching to apply rule-based and cost-based optimizations (for example, predicate pushdown and column pruning) to the logical and physical query plans.
11. What is Tungsten in Spark?
  • Tungsten is a Spark project focused on improving the efficiency of memory and CPU for Spark applications, enhancing performance through whole-stage code generation and memory management.
12. What is a Broadcast Variable in Spark?
  • Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
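  • A minimal PySpark sketch, assuming a small lookup table that is worth broadcasting to every executor:
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast-demo")

# Ship the lookup table to each executor once, instead of with every task.
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})

codes = sc.parallelize(["US", "DE", "US"])
print(codes.map(lambda c: country_names.value.get(c, "unknown")).collect())
# ['United States', 'Germany', 'United States']

sc.stop()
```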
13. What is an Accumulator in Spark?
  • Accumulators are variables that are only "added" to through an associative and commutative operation and can be used to implement counters or sums.
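  • A minimal PySpark sketch using the built-in numeric accumulator:
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "accumulator-demo")

total = sc.accumulator(0)

# Each task adds to the accumulator; only the driver can read its value.
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))
print(total.value)  # 10

sc.stop()
```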
14. What is a Spark Driver?
  • The Driver is the central component of a Spark application. It is responsible for creating the SparkContext, scheduling tasks, and coordinating the execution of jobs.
15. What are Executors in Spark?
  • Executors are JVM processes launched on the worker nodes of a Spark cluster. They run the individual tasks of a Spark job, hold cached data in memory or on disk, and return results to the driver.
16. What is a DAG in Spark?
  • A DAG (Directed Acyclic Graph) represents the sequence of computations performed on the data: nodes correspond to RDDs and edges to the transformations applied to them. The DAG scheduler splits this graph into stages at shuffle boundaries.
17. What is a Stage in Spark?
  • A Stage is a set of parallel tasks that are executed as part of a Spark job. Stages are divided based on shuffle boundaries.
18. What is a Task in Spark?
  • A Task is the smallest unit of work in Spark, representing a single computation on a partition of data.
19. What is Shuffling in Spark?
  • Shuffling is the process of redistributing data across partitions, which may involve disk and network I/O, and is triggered by operations like groupByKey() and reduceByKey().
20. How can you minimize data shuffling in Spark?
  • Prefer reduceByKey() or aggregateByKey() over groupByKey(), broadcast small lookup tables instead of shuffling them in joins, and use partitioning strategies (e.g., partitionBy()) so related records are co-located, as in the sketch below.
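  • For example, reduceByKey() combines values within each partition before any data crosses the network, while groupByKey() ships every value (the data here is illustrative):
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey(): every value is shuffled across the network before aggregating.
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey(): values are combined within each partition first (map-side
# combine), so far less data is shuffled for the same result.
sums_reduced = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(sums_reduced.collect()))  # [('a', 4), ('b', 6)]

sc.stop()
```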
21. What is Spark Streaming?
  • Spark Streaming is a component of Spark that enables processing of real-time data streams using micro-batches.
22. What is Structured Streaming in Spark?
  • Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine, allowing processing of streaming data using DataFrame and Dataset APIs.
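  • A minimal Structured Streaming word count, assuming a socket source on localhost:9999 (e.g., started with `nc -lk 9999`):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").master("local[*]").getOrCreate()

# Read a stream of text lines from a socket; host and port are assumptions.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated word counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```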
23. What is a Window Operation in Spark Streaming?
  • Window operations allow aggregation of data over a sliding window of time, enabling computations like moving averages and counts over time intervals.
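  • A sketch of a sliding-window count in Structured Streaming, using the built-in rate source as a stand-in for a real event stream (the renamed column is an assumption of the example):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("window-demo").master("local[*]").getOrCreate()

# The rate source emits (timestamp, value) rows; rename value for readability.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumnRenamed("value", "event_id"))

# Count events per 10-minute window, sliding every 5 minutes.
windowed_counts = (events
                   .groupBy(window(events.timestamp, "10 minutes", "5 minutes"))
                   .count())

query = windowed_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```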
24. What is Checkpointing in Spark?
  • Checkpointing is a process of saving the state of RDDs to reliable storage to recover from failures and to truncate the lineage graph for long-running applications.
25. What is the difference between persist() and cache() in Spark?
  • For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() lets you choose any storage level (memory only, memory and disk, serialized, replicated). For DataFrames and Datasets, the default level for both is MEMORY_AND_DISK.
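  • A small sketch of the difference (the alternative persist() call is commented out, since an RDD can only have one storage level at a time):
```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)

rdd.cache()                                   # same as persist(StorageLevel.MEMORY_ONLY)
# rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory

rdd.count()   # first action materializes and caches the RDD
rdd.count()   # served from the cache, no recomputation

sc.stop()
```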
26. What is the difference between coalesce() and repartition() in Spark?
  • coalesce() reduces the number of partitions by merging existing ones, avoiding a full shuffle, while repartition() can increase or decrease the number of partitions and always performs a full shuffle.
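  • A quick illustration of the resulting partition counts (the numbers are arbitrary):
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partition-demo")

rdd = sc.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())                  # 8

# coalesce(): merge down to fewer partitions without a full shuffle.
print(rdd.coalesce(2).getNumPartitions())      # 2

# repartition(): full shuffle; can go up or down in partition count.
print(rdd.repartition(16).getNumPartitions())  # 16

sc.stop()
```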
27. What is Spark MLlib?
  • MLlib is Spark's machine learning library, providing scalable algorithms for classification, regression, clustering, collaborative filtering, and more.
28. What is a Pipeline in Spark MLlib?
  • A Pipeline chains multiple data processing stages, including feature transformers and estimators, to streamline machine learning workflows.
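  • A minimal Pipeline sketch, loosely following the standard text-classification example (the tiny training set is only for illustration):
```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").master("local[*]").getOrCreate()

training = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop map reduce", 0.0)], ["text", "label"])

# Chain transformers (Tokenizer, HashingTF) with an estimator (LogisticRegression).
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)   # fits every stage in order

model.transform(training).select("text", "prediction").show()

spark.stop()
```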
29. What is Spark GraphX?
  • GraphX is Spark's API for graphs and graph-parallel computation, enabling analysis of graph-structured data.
30. What is the difference between narrow and wide transformations in Spark?
  • Narrow transformations (e.g., map(), filter()) do not require data shuffling, while wide transformations (e.g., groupByKey(), join()) involve shuffling data across partitions.
31. What are the different cluster managers supported by Spark?
  • Standalone, Apache Mesos, Hadoop YARN, and Kubernetes.
32. What is YARN?
  • YARN (Yet Another Resource Negotiator) is a resource management layer for Hadoop that allows multiple data processing engines to handle data stored in a single platform.
33. What is the role of the Cluster Manager in Spark?
  • The Cluster Manager is responsible for acquiring resources on the cluster and allocating them to Spark applications.
34. What is the difference between local and cluster modes in Spark?
  • In local mode, Spark runs on a single machine, while in cluster mode, Spark runs on a cluster of machines managed by a cluster manager.
35. How does Spark handle fault tolerance?
  • Spark uses RDD lineage information to recompute lost data partitions in case of failures, ensuring fault tolerance.