Improve performance on small Hadoop clusters
Monday, November 30th, 2009
Hadoop is designed to run on huge clusters containing several hundred machines. But some people just don’t need such a big cluster and are able to use the benefits of HDFS and MapReduce on a smaller scale.
We managed to improve the performance of our 10-node test cluster by almost 100% by adjusting the heartbeat intervals. The namenode and the jobtracker use heartbeats to communicate with their workers (datanodes and tasktrackers).
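We concentrate on jobtracker heartbeats here. To manage huge clusters reliably, the minimum interval is 3 seconds, and the interval is increased by one second for every 10 nodes. A tasktracker receives new tasks in the response to its heartbeat, so a free slot can sit idle until the next heartbeat is due. If you have lots of short-running map or reduce tasks, this adds up to a noticeable overhead.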
We patched Hadoop to lower the minimum heartbeat interval to 500ms and the increment to 10ms per node. With this change our MapReduce jobs ran almost twice as fast. If you want to try it, you can take a look at our github branch (view commit). Please note that the git branch contains our adapted version of Hadoop, so use it only for testing purposes.
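To make the numbers concrete, here is a minimal sketch of how the interval works out for a 10-node cluster with the stock and the patched settings. It is not the actual Hadoop code: the class and method names are made up for illustration, and it assumes the interval is the larger of the fixed minimum and the per-node increment multiplied by the number of nodes.

```java
public class HeartbeatSketch {

    /**
     * Hypothetical helper, not part of Hadoop.
     * Assumption: the interval is the larger of a fixed minimum and
     * a per-node increment scaled by the cluster size.
     */
    public static long intervalMs(int nodes, long minMs, double incrementMsPerNode) {
        long scaled = (long) Math.ceil(nodes * incrementMsPerNode);
        return Math.max(minMs, scaled);
    }

    public static void main(String[] args) {
        int nodes = 10;

        // Stock settings from the post: 3 s minimum, 1 s per 10 nodes (100 ms per node).
        long stock = intervalMs(nodes, 3000, 100.0);

        // Patched settings from the post: 500 ms minimum, 10 ms per node.
        long patched = intervalMs(nodes, 500, 10.0);

        System.out.println("stock interval:   " + stock + " ms");
        System.out.println("patched interval: " + patched + " ms");
    }
}
```

Under these assumptions a 10-node cluster sits at the minimum in both cases, so the interval drops from 3000 ms to 500 ms, and each tasktracker gets six times as many chances per minute to report finished tasks and pick up new ones.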
There is a fix (HADOOP-5784) in the upcoming version 0.21 which allows you to lower the heartbeat increment per node.