1. What do you understand by the term ‘big data’?
Big data refers to data sets so large and complex that they cannot be handled using conventional data-processing software.
2. How is big data useful for businesses?
Big Data helps businesses or organizations to understand their customers better by allowing them to draw conclusions from large data sets collected over the years. It helps them make better decisions.
3. What is the Port Number for NameNode?
NameNode – Port 50070 (the default web UI port in Hadoop 2; Hadoop 3 uses 9870).
4. What is the function of the JPS command?
The jps command lists the Java processes running on a machine, so it is used to check whether all the Hadoop daemons are running.
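As a rough illustration of such a check, the sketch below parses text in the shape that jps prints (one "<pid> <name>" pair per line) and reports which daemons are absent. The sample output, the PIDs, and the list of expected daemons are assumptions made up for this example; a real check would read the output of the actual command.

```python
# Sample text in the shape jps prints: "<pid> <main class name>" per line.
# The PIDs are invented for illustration.
SAMPLE_JPS_OUTPUT = """\
4821 NameNode
4977 DataNode
5312 ResourceManager
5460 NodeManager
5903 Jps
"""

# Daemons expected on a single-node Hadoop 2+ setup (an assumption).
EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def running_processes(jps_output):
    """Return the set of process names found in jps-style output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:  # skip blank lines and lines without a name
            names.add(parts[1].strip())
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the output."""
    return sorted(set(expected) - running_processes(jps_output))


print(missing_daemons(SAMPLE_JPS_OUTPUT))  # → []
```

An empty list means every expected daemon showed up in the listing.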
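5. What is the command to start up all the Hadoop daemons together?
./sbin/start-all.sh (deprecated in Hadoop 2 and later in favor of running start-dfs.sh and start-yarn.sh)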
6. Name a few features of Hadoop.
• Its open-source nature.
• Data locality.
• Data recovery.
7. What are the five V’s of Big Data?
The five V’s of Big data are Volume, Velocity, Variety, Veracity, and Value.
8. What are the components of HDFS?
The two main components of HDFS are:
• Name Node
• Data Node
9. How is Hadoop related to Big Data?
Hadoop is a framework that specializes in big data operations.
10. Name a few data management tools used with Edge Nodes?
Oozie, Flume, Ambari, and Hue are some of the data management tools that work with edge nodes in Hadoop.
11. What are the steps to deploy a Big Data solution?
The three steps to deploying a Big Data solution are:
• Data Ingestion
• Data Storage
• Data Processing
12. How many modes can Hadoop be run in?
Hadoop can be run in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode.
13. Name the core methods of a reducer.
The three core methods of a reducer are setup(), reduce(), and cleanup().
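In Hadoop, a reducer's lifecycle consists of setup(), called once before processing, reduce(), called once per key, and cleanup(), called once at the end. The sketch below imitates that call order in plain Python. It is an analogy only, not the Hadoop Java API; the class SumReducer and the driver run_reducer are hypothetical names introduced for this example.

```python
class SumReducer:
    """Plain-Python stand-in for a reducer; not the Hadoop API."""

    def setup(self):
        # Called once before any key is processed: initialise state.
        self.results = {}
        self.closed = False

    def reduce(self, key, values):
        # Called once per key with all the values grouped under that key.
        self.results[key] = sum(values)

    def cleanup(self):
        # Called once after the last key: finalise and release resources.
        self.closed = True


def run_reducer(reducer, grouped):
    """Drive the lifecycle in the order setup -> reduce (per key) -> cleanup."""
    reducer.setup()
    for key in sorted(grouped):
        reducer.reduce(key, grouped[key])
    reducer.cleanup()
    return reducer.results


print(run_reducer(SumReducer(), {"a": [1, 2], "b": [5]}))  # → {'a': 3, 'b': 5}
```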
14. What is the command for shutting down all the Hadoop Daemons together?
./sbin/stop-all.sh
15. What is the role of NameNode in HDFS?
NameNode is responsible for processing metadata information for data blocks within HDFS.
16. What is FSCK?
FSCK (File System Check) is a command used to detect inconsistencies and issues in the files stored in HDFS.
17. What are the real-time applications of Hadoop?
Some of the real-time applications of Hadoop are in the fields of:
• Content management.
• Financial agencies.
• Defense and cybersecurity.
• Managing posts on social media.
18. What is the function of HDFS?
HDFS (Hadoop Distributed File System) is Hadoop’s default storage layer. It is used for storing different types of data in a distributed environment.
19. What is commodity hardware?
Commodity hardware refers to the inexpensive, widely available hardware on which the Apache Hadoop framework can run.
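20. Name a few daemons that can be checked with the JPS command.
NameNode, DataNode, ResourceManager, and NodeManager (plus JobTracker and TaskTracker in Hadoop 1).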
21. What are the most common input formats in Hadoop?
• Text Input Format
• Key Value Input Format
• Sequence File Input Format
22. Name a few companies that use Hadoop.
Yahoo, Facebook, Netflix, Amazon, and Twitter.
23. What is the default mode for Hadoop?
Standalone mode is Hadoop’s default mode. It is primarily used for debugging purposes.
24. What is the role of Hadoop in big data analytics?
By providing storage and helping in the collection and processing of data, Hadoop helps in the analytics of big data.
25. What are the components of YARN?
The two main components of YARN (Yet Another Resource Negotiator) are:
• Resource Manager
• Node Manager
26. Tell us how big data and Hadoop are related to each other.
Big data and Hadoop are closely related terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.
27. Explain the steps to be followed to deploy a Big Data solution.
The following three steps are followed to deploy a Big Data solution –
i. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or other sources such as log files, documents, and social media feeds. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.
ii. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
iii. Data Processing
The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc.
28. Why is Hadoop used for Big Data Analytics?
Data analysis has become one of the key parameters of business, so enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities of
• Data collection
Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-effective solution for businesses.
29. What is the Command to format the NameNode?
$ hdfs namenode -format
30. Will you optimize algorithms or code to make them run faster?
The answer to this question should always be “Yes.” Real-world performance matters, and it doesn’t depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects they worked on in the past. Experienced candidates can share their experience accordingly as well. However, be honest about your work; it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience and you will be able to crack the big data interview.
31. How do you approach data preparation?
Data preparation is one of the crucial steps in big data projects. A big data interview may involve at least one question based on data preparation. When the interviewer asks you this question, they want to know what steps or precautions you take during data preparation.
As you already know, data preparation is required to get the necessary data, which can then be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and the reasons behind choosing that particular model. Last but not least, you should also discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.
32. How would you transform unstructured data into structured data?
Unstructured data is very common in big data. It should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can discuss the methods you use to transform one form into the other. You might also share a real-world situation where you did it. If you have recently graduated, you can share information related to your academic projects.
By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and also have the practical experience to work with these. A specific answer to this question will help you crack the big data interview.
33. Which hardware configuration is most beneficial for Hadoop jobs?
Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies with the project-specific workflow and process flow and needs to be customized accordingly.
34. What happens when two users try to access the same file in the HDFS?
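The HDFS NameNode supports exclusive writes only. Hence, when two users try to open the same file for writing, only the first user receives the grant for file access and the second user is rejected. Multiple users can still read the same file at the same time.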
35. How to recover a NameNode when it is down?
1. Use the FsImage, which is a replica of the file system metadata, to start a new NameNode.
2. Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
3. Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start to serve clients.
In large Hadoop clusters, the NameNode recovery process consumes a lot of time, which becomes an even more significant challenge during routine maintenance.
36. What is the difference between “HDFS Block” and “Input Split”?
HDFS physically divides the input data into blocks for storage; each such unit is known as an HDFS Block. An Input Split is a logical division of the data that is assigned to a mapper for the map operation.
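The difference can be illustrated with a small sketch. Physical blocks are cut at fixed byte offsets and may break a line in half, while a logical split is read so that every record stays whole. The logic below is a simplified imitation of how a line-oriented record reader can behave (a split skips its partial first line and reads past its end to finish the last one). The function names, the 10-byte block size, and the sample data are assumptions for illustration only.

```python
def physical_blocks(data, block_size):
    """Cut the bytes at fixed offsets, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def logical_split_records(data, start, length):
    """Return the whole lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first (possibly partial) line; the previous split reads it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line that begins at or before `end` belongs to this split,
    # even if it finishes inside the next physical block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


DATA = b"alpha\nbravo\ncharlie\ndelta\n"
BLOCK = 10  # tiny block size so the cuts are easy to see

print(physical_blocks(DATA, BLOCK))
for offset in range(0, len(DATA), BLOCK):
    print(offset, logical_split_records(DATA, offset, BLOCK))
```

The first physical block ends in the middle of "bravo", yet the first logical split still yields "bravo" as one complete record, and every line is returned by exactly one split.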
37. What are the common input formats in Hadoop?
• Text Input Format – The default input format in Hadoop. Each line of a plain text file becomes a record, with the line's byte offset as the key and the line's contents as the value.
• Sequence File Input Format – Used to read sequence files, Hadoop's binary key-value file format.
• Key Value Input Format – Used for plain text files (files broken into lines) in which each line is split into a key and a value by a separator, a tab by default.
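To make the two text-based formats concrete, here is a minimal plain-Python sketch of the records each one would produce from the same bytes: Text Input Format yields (byte offset, line) pairs, and Key Value Input Format splits each line at the first separator. The function names and the sample data are hypothetical; this imitates the record shapes and is not the Hadoop API.

```python
def text_input_records(data):
    """(byte offset, line) pairs, in the style of Text Input Format."""
    records, offset = [], 0
    for line in data.splitlines(keepends=True):
        records.append((offset, line.rstrip(b"\n").decode()))
        offset += len(line)  # the next line starts after this one's bytes
    return records


def key_value_records(data, separator=b"\t"):
    """(key, value) pairs, splitting each line at the first separator."""
    records = []
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        records.append((key.decode(), value.decode()))
    return records


SAMPLE = b"id1\tapple\nid2\tbanana\n"

print(text_input_records(SAMPLE))
print(key_value_records(SAMPLE))  # → [('id1', 'apple'), ('id2', 'banana')]
```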
38. Explain some important features of Hadoop.
Hadoop supports the storage and processing of big data. It is the best solution for handling big data challenges. Some important features of Hadoop are –
• Open Source – Hadoop is an open source framework which means it is available free of cost. Also, the users are allowed to change the source code as per their requirements.
• Distributed Processing – Hadoop supports distributed processing of data i.e. faster processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is responsible for the parallel processing of data.
• Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each block at different nodes, by default. This number can be changed according to the requirement. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data is done automatically.
• Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of any single machine. So, the data stored in the Hadoop environment is not affected by the failure of a machine.
• Scalability – Another important feature of Hadoop is scalability. It is compatible with a wide range of hardware, and we can easily add new hardware to the cluster.
• High Availability – The data stored in Hadoop is available to access even after the hardware failure. In case of hardware failure, the data can be accessed from another path.
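The fault tolerance described in the list above can be sketched with a toy simulation: each block is copied to three nodes, and a block stays readable as long as one copy sits on a healthy node. The round-robin placement is a deliberate simplification chosen for this example, not the placement policy HDFS actually uses, and the node and block names are invented.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    failed = set(failed_nodes)
    return sorted(b for b, holders in placement.items() if set(holders) - failed)


nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["b1", "b2", "b3"], nodes)

# With three replicas per block, any two node failures leave every block readable.
print(readable_blocks(placement, ["n1", "n2"]))  # → ['b1', 'b2', 'b3']
```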
39. Explain the different modes in which Hadoop runs.
Apache Hadoop runs in the following three modes –
• Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e. on a non-distributed, single node. This mode uses the local file system to perform input and output operations. This mode does not support the use of HDFS, so it is used for debugging. No custom configuration of the configuration files is needed in this mode.
• Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single node just like the Standalone mode. In this mode, each daemon runs in a separate Java process. As all the daemons run on a single node, the same node acts as both the Master and the Slave node.
• Fully-Distributed Mode – In the fully-distributed mode, the daemons run on separate individual nodes and thus form a multi-node cluster. There are different nodes for the Master and Slave roles.
40. Explain the core components of Hadoop.
Hadoop is an open source framework that is meant for storage and processing of big data in a distributed manner. The core components of Hadoop are –
• HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. Large data files are stored in HDFS across a cluster of commodity hardware. It can store data in a reliable manner even when hardware fails.
• Hadoop MapReduce – MapReduce is the Hadoop layer that is responsible for data processing. Applications are written against it to process the unstructured and structured data stored in HDFS. It is responsible for the parallel processing of high volumes of data by dividing the work into independent tasks. The processing is done in two phases, Map and Reduce. Map is the first phase, where the more complex record-level logic is typically written, and Reduce is the second phase, which typically performs lighter-weight operations such as aggregation.
• YARN – YARN is the resource management layer of Hadoop. It manages cluster resources and schedules jobs, and it supports multiple data processing workloads, e.g. data science, real-time streaming, and batch processing.
41. What are the configuration parameters in a “MapReduce” program?
The main configuration parameters in “MapReduce” framework are:
• Input locations of Jobs in the distributed file system
• Output location of Jobs in the distributed file system
• The input format of data
• The output format of data
• The class which contains the map function
• The class which contains the reduce function
• JAR file which contains the mapper, reducer and the driver classes
42. What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?
A block is the smallest contiguous unit of data storage on a hard drive. In HDFS, the blocks of a file are stored across the Hadoop cluster.
• The default block size in Hadoop 1 is: 64 MB
• The default block size in Hadoop 2 is: 128 MB
Yes, we can change the block size by using the dfs.block.size parameter (renamed dfs.blocksize in Hadoop 2) in the hdfs-site.xml file.
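The effect of the block size is simple arithmetic, sketched below: the number of blocks is the file size divided by the block size, rounded up, and the last block holds only the bytes that remain. The helper names and the example file sizes are assumptions chosen for illustration.

```python
MB = 1024 * 1024


def block_count(file_size, block_size):
    """Number of blocks needed for a file (ceiling division)."""
    return -(-file_size // block_size)


def last_block_size(file_size, block_size):
    """Size of the final block, which may be smaller than block_size."""
    remainder = file_size % block_size
    return remainder if remainder else min(file_size, block_size)


one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB))    # Hadoop 1 default → 16
print(block_count(one_gb, 128 * MB))   # Hadoop 2 default → 8
print(last_block_size(200 * MB, 128 * MB) // MB)  # → 72
```

A larger block size means fewer blocks per file, and therefore less metadata for the NameNode to track.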
43. What is Distributed Cache in a MapReduce Framework?
Distributed Cache is a feature of the Hadoop MapReduce framework for caching files needed by applications. The Hadoop framework makes cached files available to every map/reduce task running on the data nodes. Hence, the tasks can access the cache file as a local file in the designated job.
44. What are the three running modes of Hadoop?
The three running modes of Hadoop are as follows:
i. Standalone or local: This is the default mode and does not need any configuration. In this mode, all Hadoop components use the local file system and run in a single JVM.
ii. Pseudo-distributed: In this mode, all the master and slave Hadoop services are deployed and executed on a single node.
iii. Fully distributed: In this mode, Hadoop master and slave services are deployed and executed on separate nodes.
45. Explain JobTracker in Hadoop
JobTracker is a JVM process in Hadoop 1 (MRv1) used to submit and track MapReduce jobs.
JobTracker performs the following activities in Hadoop in sequence –
• JobTracker receives the jobs that a client application submits.
• JobTracker contacts the NameNode to determine the location of the data.
• JobTracker allocates TaskTracker nodes based on available slots.
• JobTracker submits the work to the allocated TaskTracker nodes.
• JobTracker monitors the TaskTracker nodes.
• When a task fails, JobTracker is notified and decides how to reallocate the task.
46. What are the different configuration files in Hadoop?
The different configuration files in Hadoop are –
core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings that are common to MapReduce and HDFS. It uses a hostname and a port.
mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains the configuration settings for the HDFS daemons. It also specifies default block replication and permission checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
47. How can you achieve security in Hadoop?
Kerberos is used to achieve security in Hadoop. At a high level, there are three steps to access a service while using Kerberos. Each step involves a message exchange with a server.
• Authentication – The client authenticates itself to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
• Authorization – In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
• Service Request – In the final step, the client uses the service ticket to authenticate itself to the server.
48. What is commodity hardware?
Commodity hardware is low-cost, widely available hardware that does not offer the high availability or quality of specialized systems. It still needs sufficient RAM, since Hadoop runs a number of services that require memory for execution. One doesn’t require a high-end hardware configuration or supercomputers to run Hadoop; it can be run on any commodity hardware.
49. How does Hadoop MapReduce work?
There are two phases of MapReduce operation.
• Map phase – In this phase, the input data is split and processed by map tasks, which run in parallel. The split data is used for analysis.
• Reduce phase – In this phase, the related split data is aggregated from the entire collection to produce the result.
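The two phases can be imitated in a few lines of plain Python using the classic word count. This is a single-process sketch of the programming model only: it also makes explicit the grouping step between map and reduce (often called the shuffle), and nothing here runs in parallel or uses Hadoop itself. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


print(word_count(["big data", "Big Hadoop data"]))
# → {'big': 2, 'data': 2, 'hadoop': 1}
```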
50. What is MapReduce? What is the syntax you use to run a MapReduce program?
MapReduce is a programming model in Hadoop for processing large data sets, typically stored in HDFS, over a cluster of computers. It is a parallel programming model.
The syntax to run a MapReduce program is – hadoop jar hadoop_jar_file.jar /input_path /output_path (add the main class name after the JAR file if the JAR's manifest does not specify one).