Sunday, August 7, 2016

MapReduce FileAlreadyExistsException - Output file already exists in HDFS

The exception below is thrown because the job's output directory already exists (in HDFS, or, as in this trace, on the local file system when the job runs in local mode).


 Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/C:/HadoopWS/outfile already exists  
     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)  
     at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:266)  
     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:139)  


You have to delete the output directory before re-running the job. From the command line, this can be done with:

$ hdfs dfs -rm -r /pathToDirectory

If you would rather do this from Java, the snippet below can be used; it deletes the output directory before the job runs, every time.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Delete the output directory (and its contents) if it already exists
Path output = new Path(outPath);
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(output)) {
    hdfs.delete(output, true);
}
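
For context, here is a minimal sketch of where this check might sit in a driver class, just before the job is submitted. The class name, job name and output path below are placeholders, and the mapper/reducer setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my job");
        job.setJarByClass(MyJobDriver.class);
        // mapper, reducer and input path configuration omitted for brevity

        // Remove any leftover output directory before the job is submitted
        Path output = new Path("/HadoopWS/outfile");   // placeholder output path
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(output)) {
            hdfs.delete(output, true);
        }

        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}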



Another workaround is to pass the output directory to the job as a command-line argument, as shown below.

$ yarn jar {name_of_the_jar_file.jar} {fully_qualified_driver_class} {hdfs_input_path} {output_directory_path}
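
For this to work, the driver has to read the paths from its arguments instead of hard-coding them. A minimal sketch, assuming args[0] is the input path and args[1] the output path (using the org.apache.hadoop.mapreduce.lib.input/output classes):

// args[0] = HDFS input path, args[1] = HDFS output path (must not already exist)
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

A fresh, non-existing output path can then be supplied on every run.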

If you would like the job to create a new output directory every time it runs, the code below can be used.


// Append a timestamp so every run writes to a fresh output directory
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
FileOutputFormat.setOutputPath(job, new Path("/MyDir/" + timeStamp));
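
For this snippet to compile, the driver needs the following imports (java.sql.Timestamp matching the Timestamp used above):

import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Locale;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;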

Tuesday, August 2, 2016

Hadoop - Find the Top 10 Largest Directories in HDFS

Sometimes it is necessary to know which files or directories are eating up all your disk space. The script below uses standard Unix/Linux commands together with the hadoop fs utilities to find the largest directories on HDFS and report the top 10.


 echo -e "calculating the size to determine top 10 directories on HDFS......"  
 for dir in `hadoop fs -ls /|awk '{print $8}'`;do hadoop fs -du $dir/* 2>/dev/null;done|sort -nk1|tail -10 > /tmp/size.txt  
 echo "| ---------------------------     | -------    | ------------ | ---------   | ----------   ------ |" > /tmp/tmp  
 echo "| Dir_on_HDFS | Size_in_MB | User | Group | Last_modified Time |" >> /tmp/tmp  
 echo "| ---------------------------     | -------    | ------------ | ---------   | ----------   ------ |" >> /tmp/tmp  
 while read line;  
 do  
     size=`echo $line|cut -d' ' -f1`  
     size_mb=$(( $size/1048576 ))  
     path=`echo $line|cut -d' ' -f2`  #(Use -f3 if running on cloudera)  
     dirname=`echo $path|rev|cut -d'/' -f1|rev`  
     parent_dir=`echo $path|rev|cut -d'/' -f2-|rev`  
     fs_out=`hadoop fs -ls $parent_dir|grep -w $dirname`  
     user=`echo $fs_out|grep $dirname|awk '{print $3}'`  
     group=`echo $fs_out|grep $dirname|awk '{print $4}'`  
     last_mod=`echo $fs_out|grep $dirname|awk '{print $6,$7}'`  
     echo "| $path | $size_mb | $user | $group | $last_mod |" >> /tmp/tmp  
 done < /tmp/size.txt  
 cat /tmp/tmp | column -t