First HadoopApp on the NCIT Cluster

This example is based on the Hadoop WordCount tutorial. First of all, we need our Map-Reduce program. All files presented here are in this archive.

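     package org.myorg;
     import java.io.IOException;
     import java.util.*;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.conf.*;
     import org.apache.hadoop.io.*;
     import org.apache.hadoop.mapred.*;
     import org.apache.hadoop.util.*;
     public class WordCount {
        // Mapper: emits (word, 1) for every whitespace-separated token of a line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();
          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }
        // Reducer (also used as combiner): sums the counts collected for one word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
        // Driver: args[0] is the input path, args[1] the output path.
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
        }
     }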
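
If you want to check the map and reduce logic before touching the cluster, a plain-Java imitation can help. The sketch below is only an illustration, not part of the original archive: the class LocalWordCount and its methods are hypothetical names, no Hadoop classes are involved, and the two input lines are assumed (they are the sample inputs commonly used with the WordCount tutorial; this page does not show the real input files). It also assumes Java 8 or newer, unlike the JDK 1.6 module used on the cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java imitation of the WordCount job; no Hadoop classes are used.
class LocalWordCount {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group the pairs by word (sorted) and sum them.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs map over every line, then reduces all emitted pairs together.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Assumed sample input, one line per input file.
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With the assumed input this prints Bye 1, Goodbye 1, Hadoop 2, Hello 2 and World 2, one word per line. The TreeMap stands in for the sorting that Hadoop does between the map and the reduce phase.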

Ok, then we want to compile this. We use module load to set any environment variables.

# Original tutorial at:


. /opt/modules/Modules/3.2.5/init/bash

module load java/jdk1.6.0_23-64bit
module load libraries/hadoop-0.20.2

[[ ! -d build ]] && mkdir build;

javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar \
        -sourcepath src \
        -d build \
        src/org/myorg/WordCount.java
jar -cvf wordcount.jar -C build/ .

Let's run it:

[alexandru.herisanu@fep-53-1 ex3]$ ./

Note: src/org/myorg/ uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
adding: org/myorg/WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)

Now you want to upload your files, see here. Let's run it using SGE integration. The HDFS filesystem is always up and running, but the job trackers are not.

# I presume you have just compiled the program and uploaded the input files
# into /user/alexandru.herisanu/myjob (well, your own directory)

qsub -q ibm-nehalem.q -pe hadoop 4 -N HadoopExample -cwd \
        -jsv /opt/n1sge6/sge-6.2u5/ncit-hadoop/ \
        -l hdfs_input=/user/alexandru.herisanu/myjob <<EOF

module load java/jdk1.6.0_23-64bit
module load libraries/hadoop-0.20.2

hadoop --config \$TMPDIR/conf jar wordcount.jar org.myorg.WordCount \
        /user/alexandru.herisanu/myjob /user/alexandru.herisanu/myjob/output
hadoop --config \$TMPDIR/conf fs -cat /user/hadoop-alexandru.herisanu/myjob/output/part*
EOF


Let's run it and look at the output:

[alexandru.herisanu@fep-53-1 ex3]$ cat *149654

11/05/31 12:58:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/05/31 12:58:39 INFO mapred.FileInputFormat: Total input paths to process : 2
11/05/31 12:58:39 INFO mapred.JobClient: Running job: job_201105311258_0001
11/05/31 12:58:40 INFO mapred.JobClient:  map 0% reduce 0%
11/05/31 12:58:49 INFO mapred.JobClient:  map 33% reduce 0%
11/05/31 12:58:52 INFO mapred.JobClient:  map 66% reduce 0%
11/05/31 12:58:53 INFO mapred.JobClient:  map 100% reduce 0%
11/05/31 12:59:01 INFO mapred.JobClient:  map 100% reduce 100%
11/05/31 12:59:03 INFO mapred.JobClient: Job complete: job_201105311258_0001
11/05/31 12:59:03 INFO mapred.JobClient: Counters: 19
11/05/31 12:59:03 INFO mapred.JobClient:   Job Counters
11/05/31 12:59:03 INFO mapred.JobClient:     Launched reduce tasks=1
11/05/31 12:59:03 INFO mapred.JobClient:     Rack-local map tasks=2
11/05/31 12:59:03 INFO mapred.JobClient:     Launched map tasks=3
11/05/31 12:59:03 INFO mapred.JobClient:     Data-local map tasks=1
11/05/31 12:59:03 INFO mapred.JobClient:   FileSystemCounters
11/05/31 12:59:03 INFO mapred.JobClient:     FILE_BYTES_READ=79
11/05/31 12:59:03 INFO mapred.JobClient:     HDFS_BYTES_READ=55
11/05/31 12:59:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=266
11/05/31 12:59:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
11/05/31 12:59:03 INFO mapred.JobClient:   Map-Reduce Framework
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce input groups=5
11/05/31 12:59:03 INFO mapred.JobClient:     Combine output records=6
11/05/31 12:59:03 INFO mapred.JobClient:     Map input records=2
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce shuffle bytes=91
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce output records=5
11/05/31 12:59:03 INFO mapred.JobClient:     Spilled Records=12
11/05/31 12:59:03 INFO mapred.JobClient:     Map output bytes=82
11/05/31 12:59:03 INFO mapred.JobClient:     Map input bytes=51
11/05/31 12:59:03 INFO mapred.JobClient:     Combine input records=8
11/05/31 12:59:03 INFO mapred.JobClient:     Map output records=8
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce input records=6
Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2
Starting Hadoop PE
$HADOOP_HOME = /opt/lib/hadoop/hadoop-0.20.2
starting jobtracker, logging to /export/home/ncit-cluster/prof/alexandru.herisanu/
modified context of job 149654
modified context of job 149654
Stopping Hadoop PE
stopping jobtracker stopping tasktracker stopping tasktracker stopping tasktracker stopping tasktracker
[alexandru.herisanu@fep-53-1 ex3]$
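
The record counters in the log above (Map output records=8, Combine output records=6, Reduce input groups=5) show what the combiner buys us. The sketch below tries to reproduce them in plain Java. It rests on several assumptions that are not stated on this page: the two input lines are assumed sample inputs, each input file is taken to be processed by exactly one map task, and the combiner is taken to run exactly once per map task (Hadoop does not guarantee this). The class CombinerCounters and the method simulate are hypothetical names, and Java 8 or newer is assumed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java estimate of the record counters; no Hadoop classes are used.
class CombinerCounters {

    // Each element of splits is treated as the whole input of one map task.
    static Map<String, Integer> simulate(List<String> splits) {
        int mapOutputRecords = 0;
        int combineOutputRecords = 0;
        Map<String, Integer> totals = new TreeMap<>();

        for (String split : splits) {
            // Map and combine for one task: local count per distinct word.
            Map<String, Integer> local = new TreeMap<>();
            StringTokenizer tokenizer = new StringTokenizer(split);
            while (tokenizer.hasMoreTokens()) {
                local.merge(tokenizer.nextToken(), 1, Integer::sum);
                mapOutputRecords++;
            }
            combineOutputRecords += local.size();
            // Reduce side: merge the combined records of every task.
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        Map<String, Integer> counters = new LinkedHashMap<>();
        counters.put("Map output records", mapOutputRecords);
        counters.put("Combine input records", mapOutputRecords);
        counters.put("Combine output records", combineOutputRecords);
        counters.put("Reduce input records", combineOutputRecords);
        counters.put("Reduce input groups", totals.size());
        counters.put("Reduce output records", totals.size());
        return counters;
    }

    public static void main(String[] args) {
        // Assumed sample input, one string per input file.
        List<String> splits = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        for (Map.Entry<String, Integer> c : simulate(splits).entrySet()) {
            System.out.println(c.getKey() + "=" + c.getValue());
        }
    }
}
```

Under these assumptions the eight (word, 1) pairs emitted by the mappers shrink to six records after combining (three distinct words per file), and the reducer sees five distinct words, which agrees with the counters in the log.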
