Towards the end of the course, you will get an opportunity to work on a live project that uses different Hadoop ecosystem components together in a single Hadoop implementation to solve big data problems.
1. Set up a minimum 2-node Hadoop cluster
Node 1 – NameNode, JobTracker, DataNode, TaskTracker
Node 2 – Secondary NameNode, DataNode, TaskTracker
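A minimal configuration sketch for this layout, assuming a classic Hadoop 1.x tarball install and the placeholder hostnames node1 and node2:
    # conf/core-site.xml on both nodes: fs.default.name = hdfs://node1:9000   (NameNode address)
    # conf/mapred-site.xml on both nodes: mapred.job.tracker = node1:9001     (JobTracker address)
    # conf/masters on node1 lists node2                (host that runs the Secondary NameNode)
    # conf/slaves on node1 lists node1 and node2       (hosts that run DataNode and TaskTracker, one per line)
    bin/hadoop namenode -format        # format HDFS once, before the first start
    bin/start-dfs.sh                   # starts the NameNode, DataNodes and Secondary NameNode
    bin/start-mapred.sh                # starts the JobTracker and TaskTrackers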
2. Create a simple text file and copy it to HDFS
Find out which node the file went to.
Find out on which DataNodes the file's blocks are written.
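One way to check where the blocks landed is fsck; a sketch, assuming the file was copied to the placeholder path /user/hadoop/sample.txt:
    hadoop fs -put sample.txt /user/hadoop/sample.txt                # copy the local file into HDFS
    hadoop fsck /user/hadoop/sample.txt -files -blocks -locations    # list each block and the DataNodes that hold it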
3. Create a large text file and copy it to HDFS with a block size of 256 MB. Keep all the other files at the default block size and find out how block size impacts performance.
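The block size can be overridden for a single copy with a generic -D option; a sketch, assuming Hadoop 1.x (property dfs.block.size) and a placeholder file name, where 268435456 bytes = 256 MB:
    hadoop fs -D dfs.block.size=268435456 -put bigfile.txt /user/hadoop/bigfile.txt   # copy with 256 MB blocks
    hadoop fsck /user/hadoop/bigfile.txt -files -blocks                               # confirm how many blocks the file occupies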
4. Set a space quota of 200 MB on the projects directory and copy a file of 70 MB with replication = 2.
Identify why the system does not let you copy the file.
How will you solve this problem without increasing the space quota?
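As a hint, the space quota is charged against raw disk usage, i.e. every replica of every block. A sketch of the commands involved, assuming the directory is /projects and the file name is a placeholder:
    hadoop dfsadmin -setSpaceQuota 200m /projects                   # raw space quota of 200 MB on the directory
    hadoop fs -D dfs.replication=2 -put data70mb.txt /projects/     # fails: the quota counts both replicas
    hadoop fs -D dfs.replication=1 -put data70mb.txt /projects/     # one possible workaround without raising the quota
    hadoop fs -count -q /projects                                   # inspect the quota and the remaining space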
5. Configure rack awareness and copy the file to HDFS
Find its rack distribution and identify the command used for it.
Find out how to change the replication factor of the existing file.
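A sketch of the pieces involved, assuming Hadoop 1.x and a placeholder topology script path:
    # core-site.xml: topology.script.file.name = /etc/hadoop/topology.sh    (script maps each DataNode IP to a rack such as /rack1)
    hadoop fsck /user/hadoop/sample.txt -files -blocks -locations -racks    # show each block together with its rack placement
    hadoop fs -setrep -w 3 /user/hadoop/sample.txt                          # change the replication factor of an existing file to 3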
The final certification project is based on real-world use cases, as follows:
Problem Statement 1:
1. Set up a Hadoop cluster with a single node or a 2-node cluster, with all the daemons (NameNode, DataNode, JobTracker, TaskTracker and Secondary NameNode) running in the cluster, and a block size of 128 MB.
2. Write down the namespace ID of the cluster, and create a directory with a namespace quota of 10 and a space quota of 100 MB on that directory.
3. Use the distcp command to copy the data to the same cluster or a different cluster, and create a list of the DataNodes participating in the cluster.
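A sketch of commands covering the three steps above; the hostnames, directory names and the dfs.name.dir location are placeholders, and the property name dfs.block.size assumes Hadoop 1.x:
    # hdfs-site.xml: dfs.block.size = 134217728           (128 MB default block size for the cluster)
    cat /path/to/dfs.name.dir/current/VERSION             # the namespaceID line records the cluster's namespace ID
    hadoop fs -mkdir /certproject
    hadoop dfsadmin -setQuota 10 /certproject             # namespace quota: at most 10 names under the directory
    hadoop dfsadmin -setSpaceQuota 100m /certproject      # space quota of 100 MB
    hadoop distcp hdfs://node1:9000/certproject hdfs://node1:9000/certproject_copy   # intra- or inter-cluster copy
    hadoop dfsadmin -report                               # lists the DataNodes participating in the cluster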
Problem Statement 2:
1. Save the namespace of the NameNode without using the Secondary NameNode, and ensure that the edits file is merged without stopping the NameNode daemon.
2. Set an include file so that no other nodes can talk to the NameNode.
3. Set the cluster rebalancer threshold to 40%.
4. Set the map and reduce slots to 4 and 2 respectively on each node.
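A sketch of how the four steps might be carried out, assuming Hadoop 1.x property names and a placeholder path for the include file:
    # 1. Merge the edits log into the fsimage without the Secondary NameNode and without stopping the NameNode:
    hadoop dfsadmin -safemode enter        # saveNamespace requires safe mode
    hadoop dfsadmin -saveNamespace         # saves the namespace and resets the edits log
    hadoop dfsadmin -safemode leave
    # 2. hdfs-site.xml: dfs.hosts = /etc/hadoop/include   (only hosts listed here may register as DataNodes)
    hadoop dfsadmin -refreshNodes          # make the NameNode re-read the include file
    # 3. Run the balancer with a 40% threshold:
    hadoop balancer -threshold 40
    # 4. mapred-site.xml on each TaskTracker:
    #    mapred.tasktracker.map.tasks.maximum    = 4
    #    mapred.tasktracker.reduce.tasks.maximum = 2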