A Python MapReduce script that computes TF-IDF, built on Hadoop Streaming (Hadoop MapReduce).
- WordCount
- LineNumber
- Warning
- TF-IDF
  - MapReduce1
  - MapReduce2
  - MapReduce3
WordCount
<key> <value>
Mapper: <word> <1>
Reducer: <word> <word_count>
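A minimal sketch of mapper.py and reducer.py as invoked by the streaming command below (Hadoop Streaming passes text on stdin and expects tab-separated key/value pairs on stdout; whitespace tokenization is an assumption):

```python
#!/usr/bin/env python
# mapper.py -- emit <word>\t<1> for every whitespace-separated token
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sum the 1s per word; streaming delivers input sorted by key,
# so all lines for one word arrive contiguously
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t', 1)
    if word != current_word:
        if current_word is not None:
            print('%s\t%d' % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print('%s\t%d' % (current_word, count))
```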
hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
Warning
- By default, MapReduce splits each input file into chunks of up to the HDFS block size (dfs.blocksize, 64 MiB on older Hadoop releases and 128 MiB on Hadoop 2+); mapred.min.split.size only raises the lower bound on the split size. If a file is larger than one split, several mappers each process part of it, and the line-numbering approach below becomes invalid, because every mapper starts counting lines from the beginning of its own split. One workaround is to raise the minimum split size (mapreduce.input.fileinputformat.split.minsize on Hadoop 2) above the largest input file so that each file is handled by a single mapper.
LineNumber
<key> <value>
Mapper: <word, document_name> <line_number>
Reducer: <word> <document_name, [line_number]>
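A sketch of the pair, assuming the input file name is available through the streaming environment (mapreduce_map_input_file on Hadoop 2, map_input_file on Hadoop 1) and that each file fits in a single split (see the warning above):

```python
#!/usr/bin/env python
# mapper.py -- emit <word,document_name>\t<line_number>
import os
import sys

# Hadoop Streaming exports the current input file name into the environment;
# the exact variable name depends on the Hadoop version (an assumption).
doc = os.environ.get('mapreduce_map_input_file',
                     os.environ.get('map_input_file', 'unknown'))
doc = os.path.basename(doc)

for line_number, line in enumerate(sys.stdin, 1):
    for word in line.strip().split():
        print('%s,%s\t%d' % (word, doc, line_number))
```

```python
#!/usr/bin/env python
# reducer.py -- collect the line numbers of each (word, document) pair and
# emit <word>\t<document_name,[line_numbers]>; all keys sharing the same
# "word," prefix arrive contiguously thanks to the streaming sort
import sys

def flush(key, numbers):
    word, doc = key.split(',', 1)
    print('%s\t%s,%s' % (word, doc, numbers))

current_key, numbers = None, []
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            flush(current_key, numbers)
        current_key, numbers = key, []
    numbers.append(int(value))
if current_key is not None:
    flush(current_key, numbers)
```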
hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
TF-IDF
MapReduce1
<key> <value>
Mapper1: <word, document_name> <1>
Reducer1: <word, document_name> <word_count_in_document>
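A sketch of job 1, a per-document word count (the document name is taken from the streaming environment as in the LineNumber mapper):

```python
#!/usr/bin/env python
# mapper1.py -- emit <word,document_name>\t<1>
import os
import sys

doc = os.path.basename(os.environ.get('mapreduce_map_input_file', 'unknown'))
for line in sys.stdin:
    for word in line.strip().split():
        print('%s,%s\t1' % (word, doc))
```

```python
#!/usr/bin/env python
# reducer1.py -- sum the 1s per (word, document) key
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, count))
```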
MapReduce2
<key> <value>
Mapper2: <document_name> <word, word_count_in_document>
Reducer2: <word, document_name> <word_count_in_document, total_words_in_document>
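A sketch of job 2, which re-keys job 1's output by document so that one reducer call sees a whole document and can compute its length (buffering one document's counts in memory is an assumption that holds for modestly sized documents):

```python
#!/usr/bin/env python
# mapper2.py -- turn job 1 output "<word>,<doc>\t<n>" into "<doc>\t<word>,<n>"
import sys

for line in sys.stdin:
    key, n = line.rstrip('\n').split('\t', 1)
    word, doc = key.split(',', 1)
    print('%s\t%s,%s' % (doc, word, n))
```

```python
#!/usr/bin/env python
# reducer2.py -- buffer one document's counts, compute the document length,
# then emit "<word>,<doc>\t<n>,<total>"
import sys

def flush(doc, counts):
    total = sum(n for _, n in counts)
    for word, n in counts:
        print('%s,%s\t%d,%d' % (word, doc, n, total))

current_doc, counts = None, []
for line in sys.stdin:
    doc, value = line.rstrip('\n').split('\t', 1)
    word, n = value.rsplit(',', 1)
    if doc != current_doc:
        if current_doc is not None:
            flush(current_doc, counts)
        current_doc, counts = doc, []
    counts.append((word, int(n)))
if current_doc is not None:
    flush(current_doc, counts)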
MapReduce3
<key> <value>
Mapper3: <word> <document_name, word_count_in_document, total_words_in_document, 1>
Reducer3: <word, document_name> <TF-IDF>
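Job 3 re-keys by word; the trailing 1 lets the reducer count how many documents contain the word (its document frequency). With n the word's count in a document, total the document length, D the corpus size, and d_w the document frequency, TF-IDF = (n / total) * log(D / d_w). A sketch, assuming the corpus size reaches the reducer through an environment variable (e.g. via streaming's -cmdenv option):

```python
#!/usr/bin/env python
# mapper3.py -- turn "<word>,<doc>\t<n>,<total>" into "<word>\t<doc>,<n>,<total>,1"
import sys

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    word, doc = key.split(',', 1)
    print('%s\t%s,%s,1' % (word, doc, value))
```

```python
#!/usr/bin/env python
# reducer3.py -- buffer one word's postings, derive its document frequency,
# then emit "<word>,<doc>\t<tf-idf>"
import math
import os
import sys

# TOTAL_DOCS is the number of documents in the corpus; passing it with
# -cmdenv TOTAL_DOCS=... is an assumption, not part of the original commands.
TOTAL_DOCS = float(os.environ.get('TOTAL_DOCS', '1'))

def flush(word, postings):
    d_w = len(postings)  # number of documents containing this word
    for doc, n, total in postings:
        tfidf = (float(n) / total) * math.log(TOTAL_DOCS / d_w)
        print('%s,%s\t%f' % (word, doc, tfidf))

current_word, postings = None, []
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t', 1)
    doc, n, total, _one = value.rsplit(',', 3)
    if word != current_word:
        if current_word is not None:
            flush(current_word, postings)
        current_word, postings = word, []
    postings.append((doc, int(n), int(total)))
if current_word is not None:
    flush(current_word, postings)
```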
hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
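Note that the three TF-IDF jobs run as a chain: invoke the streaming command once per job, shipping the matching mapperN.py/reducerN.py with -file, and point each job's -input at the previous job's -output directory.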