Top 80 Hadoop Interview Questions and Answers for 2024 | Simplilearn

By Shruti M

Big data has grown tremendously over the past decade, and with it has come the widespread adoption of Hadoop to solve major Big Data challenges. Hadoop is one of the most popular frameworks used to store, process, and analyze Big Data, so there is always demand for professionals in this field. But how do you land a job in the Hadoop field? Well, we have answers to that!

Read more: What Are the Skills Needed to Learn Hadoop?

In this blog, we will talk about the Hadoop interview questions that could be asked in a Hadoop interview. We will look into Hadoop interview questions from the entire Hadoop ecosystem, which includes HDFS, MapReduce, YARN, Hive, Pig, HBase, and Sqoop.

Let’s begin with one of the most important topics: HDFS.

The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere, and Hortonworks (now part of Cloudera).

The different Hadoop configuration files include hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and the masters and slaves files.

The three modes in which Hadoop can run are standalone (local) mode, pseudo-distributed mode, and fully distributed mode.

HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes. 

The architecture of HDFS is as follows:

For an HDFS service, we have a NameNode, which runs the master process on one of the machines, and DataNodes, which are the slave nodes.

NameNode is the master service that holds metadata on disk and in RAM. It holds information about the various DataNodes, the location of the blocks, the size of each block, and so on.

DataNodes hold the actual data blocks. They send heartbeats to the NameNode every few seconds (three seconds by default) and periodically send block reports. A DataNode stores and retrieves blocks when the NameNode asks; it serves read and write requests from clients and performs block creation, deletion, and replication based on instructions from the NameNode.

The two types of metadata that a NameNode server holds are metadata stored on disk (the FsImage and edit log files) and metadata held in RAM (information about the DataNodes and the blocks they hold).

By default, HDFS divides each file into blocks of 128 MB; every block except the last one will be exactly 128 MB. For an input file of 350 MB, there are three input splits in total, of 128 MB, 128 MB, and 94 MB.

HDFS Rack Awareness refers to the NameNode's knowledge of the different DataNodes and how they are distributed across the racks of a Hadoop cluster.

By default, each block of data is replicated three times on various DataNodes present on different racks. Two identical blocks cannot be placed on the same DataNode. When a cluster is “rack-aware,” all the replicas of a block cannot be placed on the same rack. If a DataNode crashes, you can retrieve the data block from different DataNodes.   

The following commands will help you restart NameNode and all the daemons:

You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.

You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.

To check the status of the blocks, use the command:

hdfs fsck <path> -files -blocks

To check the health status of FileSystem, use the command:

hdfs fsck / -files -blocks -locations > dfs-fsck.log

Storing many small files on HDFS generates a lot of metadata. Keeping this metadata in RAM is a challenge, because each file, block, or directory object takes roughly 150 bytes of NameNode memory, so the cumulative size of all the metadata becomes too large. For example, one million small files consume at least a few hundred megabytes of NameNode heap for the file and block objects alone.

The following command will copy data from the local file system onto HDFS:

hadoop fs -copyFromLocal [source] [destination]

Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv

In the above syntax, the source is the local path and the destination is the HDFS path. Use the -f (force) option if you want to overwrite a file that already exists on HDFS.
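
The same copy can also be done programmatically. Below is a minimal Java sketch using the Hadoop FileSystem API; the class name is made up for illustration, and the paths simply mirror the example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml and hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default file system (HDFS)
        // Equivalent of: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv
        fs.copyFromLocalFile(new Path("/tmp/data.csv"), new Path("/user/test/data.csv"));
        fs.close();
    }
}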

The commands below are used to refresh the node information during commissioning, or when the decommissioning of nodes is completed.

hdfs dfsadmin -refreshNodes: This is run with the HDFS client and refreshes the node configuration for the NameNode.

yarn rmadmin -refreshNodes: This performs the same administrative task for the ResourceManager.

Yes, the following are ways to change the replication of files on HDFS:

We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, which will apply that replication factor to any new content that comes in.

If you want to change the replication factor for a particular file or directory, use:

$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path/to/file

Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv
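
The replication factor of an existing file can also be changed through the FileSystem API. Here is a minimal Java sketch under the same assumptions as the example above (the file path is just the sample one, and the class name is hypothetical).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent of: hadoop dfs -setrep 4 /user/temp/test.csv
        boolean changed = fs.setReplication(new Path("/user/temp/test.csv"), (short) 4);
        System.out.println("Replication factor updated: " + changed);
        fs.close();
    }
}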

In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information regarding over- and under-replicated blocks.

These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

Consider a cluster with three nodes and replication set to three. At any point, if one of the DataNodes crashes, the blocks become under-replicated: a replication factor was set, but there are not enough replicas to meet it. If the NameNode does not receive information about the replicas, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes.

These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.

Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes. 

After HDFS, let’s now move on to some of the interview questions related to the processing framework of Hadoop: MapReduce.

A distributed cache is a mechanism wherein the data coming from the disk can be cached and made available for all worker nodes. When a MapReduce program is running, instead of reading the data from the disk every time, it would pick up the data from the distributed cache to benefit the MapReduce processing. 

To copy the file to HDFS, you can use the command:

hdfs dfs -put jar_file.jar /user/Simplilearn/lib/jar_file.jar

To set up the application’s JobConf, use the command:

DistributedCache.addFileToClassPath(new Path("/user/Simplilearn/lib/jar_file.jar"), conf);

Then, add it to the driver class.
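
Note that DistributedCache is deprecated in the newer MapReduce API. An equivalent driver-side sketch using the Job class is shown below; the class name and the lookup.txt file are hypothetical, and the jar path is the one used above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-example");
        job.setJarByClass(CacheDriver.class);
        // Adds the jar already stored on HDFS to the classpath of every map and reduce task
        job.addFileToClassPath(new Path("/user/Simplilearn/lib/jar_file.jar"));
        // Plain data files can also be cached and read locally by each task
        job.addCacheFile(new URI("/user/Simplilearn/lib/lookup.txt"));
        // ... mapper, reducer, input and output paths would be configured here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}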

RecordReader: This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

Combiner: This is an optional phase; it is like a mini reducer. The combiner receives the output of the map tasks, works on it, and then passes its output to the reducer phase.

Partitioner: The partitioner decides which reduce task each intermediate key-value pair is sent to. It controls the partitioning of the keys of the intermediate map (and combiner) outputs, so that all values for a given key end up at the same reducer.
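
To make the partitioner's role concrete, here is a minimal sketch of a custom partitioner; the class name and the first-letter scheme are invented for illustration, and it would be registered in the driver with job.setPartitionerClass(...).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every key starting with the same letter to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (numReduceTasks == 0 || k.isEmpty()) {
            return 0;                                          // map-only jobs or empty keys fall back to partition 0
        }
        return Character.toUpperCase(k.charAt(0)) % numReduceTasks;   // always in [0, numReduceTasks)
    }
}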

This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:

By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.

For example, if you have a 1 GB file that is split into eight blocks of 128 MB each, only eight mappers will run on the cluster.

This is an important question, as you would need to know the different data types if you are getting into the field of Big Data.

For every data type in Java, you have an equivalent in Hadoop. Therefore, the following are some Hadoop-specific data types that you could use in your MapReduce program: IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, Text (the equivalent of String), ArrayWritable, MapWritable, and NullWritable.
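
As a quick illustration of these types in use, here is a minimal word-count style mapper sketch (the class name is hypothetical); it reads LongWritable/Text pairs and emits Text/IntWritable pairs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit a Hadoop key-value pair, not a Java String/Integer
            }
        }
    }
}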

If a DataNode is executing any task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first will be accepted, and the other task would be killed. Therefore, speculative execution is useful if you are working in an intensive workload kind of environment.

The following image depicts the speculative execution:

From the above example, you can see that node A has a slower task. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task runs on node B. If node A's task is still slower, the output from node B is accepted.
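
Speculative execution is controlled per job through two boolean properties (it is on by default). A minimal driver sketch, with a hypothetical class name, is shown below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);     // allow duplicate attempts for slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // e.g., turn it off for non-idempotent reducers
        Job job = Job.getInstance(conf, "speculative-execution-demo");
        // ... the rest of the job configuration goes here ...
    }
}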

We need to have the following configuration parameters: the input location of the job in HDFS, the output location of the job in HDFS, the input and output formats, the classes containing the map and reduce functions, and the JAR file containing the mapper, reducer, and driver classes.

As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.

Example: org.apache.hadoop.mapreduce.OutputCommitter

public abstract class OutputCommitter extends org.apache.hadoop.mapreduce.OutputCommitter

MapReduce relies on the OutputCommitter for the following: setting up the job during initialization, cleaning up the job after completion, setting up the task's temporary output, checking whether a task needs a commit, committing the task output, and discarding the task commit.

Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled. 

For a 100 MB size buffer, the spilling will start after the content of the buffer reaches a size of 80 MB. 
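
These two thresholds map to the properties mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent. A minimal sketch of setting them per job follows; the class name is hypothetical and the values are simply the defaults described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);            // in-memory sort buffer of 100 MB
        conf.set("mapreduce.map.sort.spill.percent", "0.80");     // start spilling when the buffer is 80 percent full
        Job job = Job.getInstance(conf, "spill-tuning-demo");
        // ... mapper, reducer, and I/O paths would be set here ...
    }
}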

The number of mappers and reducers can be set in the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the code, one can configure JobConf variables:
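
A minimal sketch of the JobConf calls is shown below; the class name is made up, and the counts match the command-line example above.

import org.apache.hadoop.mapred.JobConf;

public class TaskCountDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountDriver.class);
        conf.setNumMapTasks(5);     // only a hint; the real mapper count still follows the number of input splits
        conf.setNumReduceTasks(2);  // binding; exactly two reduce tasks will run
    }
}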

If this ever happens, the map tasks are assigned to a new node, and the entire task is rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the ApplicationMaster, which takes care of the execution of the application. If a task on a particular node fails because the node became unavailable, it is the role of the ApplicationMaster to have this task scheduled on another node.

Yes. Hadoop supports various input and output file formats, such as TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, NLineInputFormat, TextOutputFormat, and SequenceFileOutputFormat.

Now, let’s learn about resource management and the job scheduling unit in Hadoop, which is YARN (Yet Another Resource Negotiator).

In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer, known as the JobTracker, which was responsible for resource tracking and job scheduling.

Managing jobs with a single JobTracker led to poor utilization of computational resources in MapReduce 1. The JobTracker was overburdened because it handled both job scheduling and resource management, which caused scalability, availability, and resource-utilization problems. In addition, non-MapReduce jobs could not run on v1.

To overcome these issues, Hadoop v2 introduced YARN as the processing layer. In YARN, there is a processing master called the ResourceManager, which can run in high-availability mode. NodeManagers run on multiple machines, and a temporary daemon called the ApplicationMaster is started for each application. Here, the ResourceManager only handles the client connections and takes care of tracking the resources.

In Hadoop v2, the following features are available: better scalability, high availability for the NameNode and the ResourceManager, HDFS Federation, and support for processing engines other than MapReduce on top of YARN.

There is a client/application/API that talks to the ResourceManager. The ResourceManager manages resource allocation in the cluster and has two internal components: the Scheduler and the Applications Manager. The ResourceManager is aware of the resources available with every NodeManager. The Scheduler allocates resources to the various applications running in parallel, based on their requirements; it does not monitor or track the status of the applications.

The Applications Manager accepts job submissions and monitors and restarts the ApplicationMasters in case of failure. The ApplicationMaster manages the resource needs of an individual application; it interacts with the Scheduler to acquire the required resources and with the NodeManagers to execute and monitor tasks. The NodeManager tracks the jobs running on its machine and monitors each container's resource utilization.

A container is a collection of resources, such as RAM, CPU, or network bandwidth. It provides the rights to an application to use a specific amount of resources. 

Let us have a look at the architecture of YARN:

Whenever a job submission happens, the ResourceManager requests the NodeManagers to hold some resources for processing. A NodeManager then guarantees the container that will be available for processing. Next, the ResourceManager starts a temporary daemon called the ApplicationMaster to take care of the execution. The ApplicationMaster, which the Applications Manager launches, runs in one of the containers; the other containers are used for the actual execution. This is, briefly, how YARN takes care of the allocation.

The answer is ResourceManager. It is the name of the master process in Hadoop v2.

The commands are as follows:

a) To check the status of an application: yarn application -status ApplicationID

b) To kill or terminate an application: yarn application -kill ApplicationID

Yes, Hadoop v2 allows us to have more than one ResourceManager. You can have a high-availability YARN cluster with an active ResourceManager and a standby ResourceManager, where ZooKeeper handles the coordination.

There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.

The different schedulers available in YARN are the FIFO scheduler, the Capacity Scheduler, and the Fair Scheduler.

In a high availability cluster, there are two ResourceManagers: one active and the other standby. If a ResourceManager fails in the case of a high availability cluster, the standby will be elected as active and instructs the ApplicationMaster to abort. The ResourceManager recovers its running state by taking advantage of the container statuses sent from all node managers.

Every node in a Hadoop cluster will have one or more processes running that need RAM. The machine itself, which has a Linux file system, also has its own processes that need a specific amount of RAM. Therefore, if you have 10 DataNodes, you need to allocate at least 20 to 30 percent of each machine's resources towards operating-system overheads, Cloudera-based services, and so on. That might leave 11 or 12 GB of RAM and six or seven cores available on every machine for processing; multiply that by 10, and that is your cluster's processing capacity.

If an application demands more memory or more CPU cores than can fit into a single container allocation, the application will fail, because the requested resources exceed the maximum container size.

Now that you have learned about HDFS, MapReduce, and YARN, let us move to the next section. We’ll go over questions about Hive, Pig, HBase, and Sqoop.

Before moving into the Hive interview questions, let us summarize what Hive is all about. Facebook developed Hive to overcome MapReduce's limitations: users found MapReduce challenging to code because not all of them were well-versed in programming languages, and they needed a language similar to SQL, which most users already knew. This gave rise to Hive.

Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. It uses a query language called HiveQL, which is similar to SQL. Hive works on structured data. Let us now have a look at a few Hive questions. 

The different components of Hive are the user interface (such as the Hive CLI and the web UI), the metastore, the driver, the compiler, the optimizer, and the execution engine.

Partition is a process for grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition. 

Partitioning provides granularity in a Hive table. It reduces query latency by scanning only the relevant partitions instead of the entire data set. For example, we can partition a bank's transaction data by month (January, February, and so on); any query for a particular month, say February, will then scan only the February partition rather than the entire table.

We know that Hive's data is stored in HDFS. However, the metadata is stored either locally or in an RDBMS; it is not stored in HDFS, because HDFS read/write operations are time-consuming. Storing the metadata in an RDBMS-backed metastore instead of HDFS gives Hive low-latency metadata access.

The components used in the Hive query processor are the parser, the semantic analyzer, type checking, the logical plan generator, the optimizer, the physical plan generator, the execution engine, operators, and UDFs/UDAFs.

Using SequenceFile format and grouping these small files together to form a single sequence file can solve this problem. Below are the steps:

The following query will add a new column to a Hive table:

ALTER TABLE table_name ADD COLUMNS (new_col INT);

Pig is a scripting language that is used for data processing. Let us have a look at a few questions involving Pig:

The different ways of executing a Pig script are the Grunt shell (interactive mode), a script file (batch mode), and an embedded script (running Pig Latin from within a Java program).

The major components of a Pig execution environment are Pig scripts, the parser, the optimizer, the compiler, and the execution engine.

Pig has three complex data types, which are primarily Tuple, Bag, and Map.

A tuple is an ordered set of fields that can contain different data types for each field. It is represented by parentheses ().

A bag is a set of tuples represented by curly braces {}.

A map is a set of key-value pairs used to represent data elements. It is represented in square brackets [ ].

Example: [key#value, key1#value1,….]

Pig has Dump, Describe, Explain, and Illustrate as the various diagnostic operators.

The dump operator runs the Pig Latin scripts and displays the results on the screen.

Load the data into Pig using the “load” operator.

Display the results using the “dump” operator.

The describe operator is used to view the schema of a relation.

Load the data into Pig using the “load” operator.

View the schema of the relation using the “describe” operator.

The explain operator displays the physical, logical, and MapReduce execution plans.

Load the data into Pig using the “load” operator.

Display the logical, physical, and MapReduce execution plans using the “explain” operator.

The illustrate operator gives the step-by-step execution of a sequence of statements.

Load the data into Pig using the “load” operator.

Show the step-by-step execution of a sequence of statements using the “illustrate” operator.

The group statement collects various records with the same key and groups the data in one or more relations.

Example: Group_data = GROUP Relation_name BY AGE

The order statement is used to display the contents of a relation in sorted order based on one or more fields.

Example: Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC);

The distinct statement removes duplicate records. It is implemented on entire records, not on individual fields.

Example: Relation_2 = DISTINCT Relation_name1

The relational operators in Pig are as follows:

COGROUP: It joins two or more tables and then performs a GROUP operation on the joined table result.

CROSS: This is used to compute the cross product (Cartesian product) of two or more relations.

FOREACH: This will iterate through the tuples of a relation, generating a data transformation.

JOIN: This is used to join two or more tables in a relation.

LIMIT: This will limit the number of output tuples.

SPLIT: This will split the relation into two or more relations.

UNION: It will merge the contents of two relations.

ORDER BY: This is used to sort a relation based on one or more fields.

The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.

Example: Filter the products whose quantity is greater than 1000

A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);

B = FILTER A BY quantity > 1000;

To do this, we need to use the limit operator to retrieve the first 10 records from a file.

Load the data in Pig:

test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2, ...);

Limit the data to the first 10 records:

Limit_test_data = LIMIT test_data 10;

Now let’s have a look at questions from HBase. HBase is a NoSQL database that runs on top of Hadoop. It is often described as a four-dimensional database (row key, column family, column qualifier, and timestamp), in comparison to RDBMS tables, which are two-dimensional.

This is one of the most common interview questions. 

Region server: This contains the HBase tables, which are divided horizontally into “regions” based on their row key values. A region server runs on every worker node, decides the size of the regions it serves (splitting them when they grow too large), and handles read, write, update, and delete requests from clients.

HMaster: This assigns regions to RegionServers for load balancing, and monitors and manages the Hadoop cluster. Whenever a client wants to change the schema or perform any metadata operation, HMaster is used.

ZooKeeper: This provides a distributed coordination service to maintain server state in the cluster. It tracks which servers are alive and available, and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating whether they are ready for read and write operations.

The row key is a primary key for an HBase table. It also allows logical grouping of cells and ensures that all cells with the same row key are located on the same server.

Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers that a delimiter separates. 

The HBase table is disabled to allow modifications to its settings. When a table is disabled, it cannot be accessed through the scan command.

To disable the employee table, use the command: disable 'employee_table'

To check if the table is disabled, use the command: is_disabled 'employee_table'

The following code is used to open a connection in HBase:

HTableInterface usersTable = new HTable(myConf, "users");
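
HTable and HTableInterface are deprecated in current HBase releases. A minimal sketch with the newer Connection API, opening the same "users" table, would look roughly like this (the class name is invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class OpenUsersTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table usersTable = connection.getTable(TableName.valueOf("users"))) {
            // ... Get, Put, and Scan operations against usersTable go here ...
        }   // try-with-resources closes the table and the connection
    }
}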

The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.

The following command alters the hbase1 table and sets REPLICATION_SCOPE to 1. A REPLICATION_SCOPE of 0 indicates that the table is not replicated.

alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}

Yes, it is possible to import and export tables from one HBase cluster to another. 

To export a table, use:

hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"

Example: hbase org.apache.hadoop.hbase.mapreduce.Export "employee_table" "/export/employee_table"

To import, first create the target table on the destination cluster:

create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}

Then run the import:

hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"

Example: hbase org.apache.hadoop.hbase.mapreduce.Import "emp_table_import" "/export/employee_table"

Compaction is the process of merging HBase files into a single file. This is done to reduce the amount of memory required to store the files and the number of disk seeks needed. Once the files are merged, the original files are deleted.

The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-col cell. The Bloom filter is named after its creator, Burton Howard Bloom. It is a data structure that predicts whether a given element is a member of a set of data. These filters provide an in-memory index structure that reduces disk reads and determines the probability of finding a row in a particular file.
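
Bloom filters are enabled per column family. The sketch below uses the HBase 2.x descriptor builders to declare a row-level Bloom filter; the table and family names are placeholders, and the resulting descriptor would be handed to Admin#createTable.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterTable {
    public static void main(String[] args) {
        ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROW)      // BloomType.ROWCOL would index row + column instead
                .build();
        TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("sales"))
                .setColumnFamily(family)
                .build();
        System.out.println(table);                      // passing this to Admin#createTable would create the table
    }
}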

A namespace is a logical grouping of tables, analogous to a database in an RDBMS. You can think of an HBase namespace as the equivalent of an RDBMS database schema.

To create a namespace, use the command: create_namespace 'namespace_name'

To list all the tables that are members of a namespace, use the command: list_namespace_tables 'default'

To list all the namespaces, use the command: list_namespace

67. How does the Write Ahead Log (WAL) help when a RegionServer crashes?

If a RegionServer hosting a MemStore crashes, the data that existed in memory but had not yet been persisted is lost. HBase guards against this by writing to the WAL before the write completes. The HBase cluster keeps a WAL to record changes as they happen, and if HBase goes down, replaying the WAL recovers data that had not yet been flushed from the MemStore to an HFile.

The following command is used to list the contents of an HBase table: scan 'table_name'

To update column families in the table, use the following command:

alter 'table_name', 'column_family_name'

Example: alter 'employee_table', 'emp_address'

The catalog has two tables: hbase:meta and -ROOT-.

The catalog table hbase:meta exists as an HBase table, but it is filtered out of the HBase shell's list command. It keeps a list of all the regions in the system, and its location is stored in ZooKeeper. The -ROOT- table, used in older HBase versions, keeps track of the location of the .META table.

In HBase, all read and write requests should be uniformly distributed across all of the regions in the RegionServers. Hotspotting occurs when a given region serviced by a single RegionServer receives most or all of the read or write requests.

Hotspotting can be avoided by designing the row key so that the data being written is spread across multiple regions of the cluster. Common techniques include salting the row key, hashing it, and reversing fixed-width or sequential keys; a salting sketch follows below.
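
Here is a minimal Java sketch of row-key salting; the bucket count, key format, and class name are assumptions for illustration. A small hash-derived prefix spreads otherwise sequential keys, such as timestamps, across regions.

import java.nio.charset.StandardCharsets;

public class SaltedRowKey {
    private static final int SALT_BUCKETS = 8;   // assumes the table is pre-split into 8 regions

    // Prefixes the original key with a bucket id so sequential keys fan out across region servers.
    static byte[] salt(String originalKey) {
        int bucket = Math.floorMod(originalKey.hashCode(), SALT_BUCKETS);
        return (bucket + "_" + originalKey).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // e.g. a timestamp key might become something like "3_2024-05-01T10:00:00"
        System.out.println(new String(salt("2024-05-01T10:00:00"), StandardCharsets.UTF_8));
    }
}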

Moving onto our final section, let us have a look at some questions on Sqoop. Sqoop is one of the data ingestion tools mainly used for structured data. Using Sqoop, we can store this data on HDFS, which is then used for Hive, MapReduce, Pig, or any other processing frameworks.

The default file formats for Sqoop imports are the delimited text file format and the SequenceFile format. Let us understand each of them individually:

This is the default import format and can be specified explicitly using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows.

1,here is a message,2010-05-01

2,strive to learn,2010-01-01

SequenceFile is a binary format that stores individual records in custom record-specific data types, which are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports the exact storage of all data in binary representations and is appropriate for storing binary data.

The Sqoop eval tool allows users to execute user-defined queries against respective database servers and preview the result in the console.

The following commands show how to import the test_db database and test_demo table, and how to present it to Sqoop.

Suppose there is a “departments” table in “retail_db” that is already imported into Sqoop and you need to export this table back to RDBMS.

i) Create a new “dept” table to export in RDBMS (MySQL)

ii) Export “departments” table to the “dept” table

A JDBC driver is a standard Java API that Sqoop uses to access databases in an RDBMS. Each database vendor is responsible for writing its own implementation that communicates with the corresponding database using its native protocol. Users need to download the drivers separately and install them onto Sqoop prior to its use.

JDBC driver alone is not enough to connect Sqoop to the database. We also need connectors to interact with different databases. A connector is a pluggable piece that is used to fetch metadata and allows Sqoop to overcome the differences in SQL dialects supported by various databases, along with providing optimized data transfer.

To update the columns of a table that has already been exported, we use the --update-key argument with the sqoop export command:

The following is an example:

sqoop export --connect jdbc:mysql://localhost/dbname --username root --password cloudera --table table_name --update-key column_name --export-dir /input/dir

The Codegen tool in Sqoop generates the Data Access Object (DAO) Java classes that encapsulate and interpret imported records.

The following example generates Java code for an “employee” table in the “testdb” database.

sqoop codegen \
--connect jdbc:mysql://localhost/testdb \
--table employee

Yes, Sqoop can be used to convert data into different formats. This depends on the different arguments that are used for importing. 

I hope this blog has helped you with the essential Hadoop interview questions. You learned about HDFS, MapReduce, YARN, Hive, Pig, HBase, and Sqoop. You got an idea as to the kind of Hadoop interview questions you will be asked and what your answers should consist of. If you want to learn more about Big Data and Hadoop, enroll in our Big Data Hadoop Certification program today!

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
