This repository has been archived by the owner on Mar 17, 2020. It is now read-only.



hackerliang/TF-IDF-MapReduce


TF-IDF-MapReduce

Python MapReduce scripts that compute TF-IDF on Hadoop MapReduce.



WordCount

<key> <value>

Mapper: <word> <1>

Reducer: <word> <wordcount>

hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
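
The key/value layout above can be sketched as a pair of generator functions. This is a minimal illustration, not the repository's actual mapper.py and reducer.py: it assumes Python 3, tab-separated records, and a simple definition of "word" (the `WORD_RE` pattern), none of which the README states.

```python
import re
from itertools import groupby

# Assumption: a "word" is a run of letters, digits, or underscores, lowercased.
WORD_RE = re.compile(r"\w+")


def mapper(lines):
    """Emit one '<word>\t1' record for every word occurrence."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word. Records must arrive sorted by key,
    which is what the shuffle phase provides to a streaming reducer."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# In a streaming job, mapper.py would finish with something like
#     for record in mapper(sys.stdin): print(record)
# and reducer.py with the same loop over reducer(sys.stdin).
```

Locally, `sorted(mapper(lines))` stands in for the shuffle, so `reducer(sorted(mapper(lines)))` reproduces the whole job on a small input.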

LineNumber

Warning

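  • By default, a MapReduce job splits each input file according to mapred.min.split.size and the block size (this README assumes file.blocksize = 64 MiB). A file larger than one split is read by several mappers, and each mapper counts lines from the start of its own split, so this solution produces wrong line numbers for such files.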
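
The effect described in the warning can be reproduced without Hadoop. The sketch below is only an illustration: `number_lines` is a hypothetical helper that numbers lines the way a single mapper would, and the four-line `document` is made-up sample data.

```python
def number_lines(split_lines):
    """Number the lines of one input split, starting at 1,
    as a mapper that only sees this split would."""
    return [(number, line) for number, line in enumerate(split_lines, start=1)]


document = ["alpha", "beta", "gamma", "delta"]

# One split: a single mapper sees the whole file, so the numbers are correct.
whole = number_lines(document)

# Two splits: each mapper restarts at 1, so "gamma" is reported as line 1
# although it is line 3 of the file.
first = number_lines(document[:2])
second = number_lines(document[2:])
```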

<key> <value>

Mapper: <word, document_name> <line_number>

Reducer: <word> <document_name, [line_number]>

hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
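
A possible shape for the LineNumber mapper and reducer is sketched below. It is not the repository's code. It assumes Python 3, tab-separated records with a comma inside composite keys, and that the whole file reaches one mapper (see the warning above). The helper `current_document` is hypothetical: it relies on Hadoop Streaming exporting the input path as an environment variable, whose name is assumed here to be `mapreduce_map_input_file` (or `map_input_file` on older versions).

```python
import os
import re
from itertools import groupby

# Assumption: a "word" is a run of letters, digits, or underscores, lowercased.
WORD_RE = re.compile(r"\w+")


def current_document(default="unknown"):
    """Hypothetical helper: read the input file name from the environment.
    The variable names are assumptions about the streaming runtime."""
    path = (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file")
            or default)
    return os.path.basename(path)


def mapper(lines, document_name):
    """Emit '<word>,<document_name>\t<line_number>' for every word occurrence."""
    for line_number, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line.lower()):
            yield f"{word},{document_name}\t{line_number}"


def reducer(sorted_records):
    """Emit '<word>\t<document_name>,[line numbers]' for each (word, document) key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        word, document_name = key.split(",", 1)
        line_numbers = sorted({int(number) for _, number in group})
        yield f"{word}\t{document_name},{line_numbers}"
```

In a streaming mapper the call would look like `mapper(sys.stdin, current_document())`. Duplicate line numbers are collapsed, so a word that occurs twice on one line is listed once.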

TF-IDF

MapReduce1

<key> <value>

Mapper1: <word, document_name> <1>

Reducer1: <word, document_name> <word_appears_time_in_same_document>
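
The first job is a word count keyed by (word, document). A minimal sketch, assuming Python 3, tab-separated records, a comma inside the composite key, and a document name supplied by the caller (how the repository's scripts obtain it is not stated):

```python
import re
from itertools import groupby

# Assumption: a "word" is a run of letters, digits, or underscores, lowercased.
WORD_RE = re.compile(r"\w+")


def mapper1(lines, document_name):
    """Emit '<word>,<document_name>\t1' for every word occurrence."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield f"{word},{document_name}\t1"


def reducer1(sorted_records):
    """Emit '<word>,<document_name>\t<word_appears_time_in_same_document>'."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"
```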

MapReduce2

<key> <value>

Mapper2: <document_name> <word, word_appears_time_in_same_document>

Reducer2: <word, document_name> <word_appears_time_in_same_document, total_words_in_this_document>
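
The second job regroups the output of the first by document so that each document's total word count can be attached to every record. The sketch below assumes the input records look like `word,document_name<TAB>count`, which is the format used in the illustrations here, not a format confirmed by the README.

```python
from itertools import groupby


def mapper2(job1_records):
    """Re-key by document: emit '<document_name>\t<word>,<count>'."""
    for record in job1_records:
        key, count = record.rstrip("\n").split("\t", 1)
        word, document_name = key.split(",", 1)
        yield f"{document_name}\t{word},{count}"


def reducer2(sorted_records):
    """Emit '<word>,<document_name>\t<count>,<total_words_in_this_document>'."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for document_name, group in groupby(pairs, key=lambda kv: kv[0]):
        # Buffer one document's entries: the total is known only after
        # the last record of the document has been read.
        entries = [value.rsplit(",", 1) for _, value in group]
        total = sum(int(count) for _, count in entries)
        for word, count in entries:
            yield f"{word},{document_name}\t{count},{total}"
```

Buffering one document's entries in memory is a simplification. It is fine for documents with a modest vocabulary, which is an assumption of this sketch.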

MapReduce3

<key> <value>

Mapper3: <word> <document_name, word_appears_time_in_same_document, total_words_in_this_document, 1>

Reducer3: <word, document_name> <TF-IDF>
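
The third job counts, for each word, the documents that contain it (the trailing `1` in the mapper value) and combines that with the per-document counts. The README does not give the formula, so this sketch assumes the common definition TF-IDF = (count / total_words_in_document) * ln(total_documents / documents_containing_word). The `total_documents` parameter is also an assumption: how the repository's reducer learns the corpus size is not stated.

```python
import math
from itertools import groupby


def mapper3(job2_records):
    """Re-key by word: emit '<word>\t<document_name>,<count>,<total>,1'."""
    for record in job2_records:
        key, value = record.rstrip("\n").split("\t", 1)
        word, document_name = key.split(",", 1)
        yield f"{word}\t{document_name},{value},1"


def reducer3(sorted_records, total_documents):
    """Emit '<word>,<document_name>\t<TF-IDF>' under the assumed formula."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        # Each entry is [document_name, count, total, "1"].
        entries = [value.rsplit(",", 3) for _, value in group]
        document_frequency = sum(int(entry[3]) for entry in entries)
        idf = math.log(total_documents / document_frequency)
        for document_name, count, total, _ in entries:
            tf = int(count) / int(total)
            yield f"{word},{document_name}\t{tf * idf:.6f}"
```

A word that appears in every document gets an IDF of 0 under this formula, so its TF-IDF is 0 regardless of how often it occurs.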

hadoop jar /path/to/hadoop-streaming-*.jar \
-D mapreduce.map.memory.mb=5120 \
-D mapreduce.reduce.memory.mb=5120 \
-file mapper.py \
-mapper "python mapper.py" \
-file reducer.py \
-reducer "python reducer.py" \
-input /path/to/inputdirs \
-output /path/to/outputdirs
