Summary

Code for the Large Scale Hierarchical Text Classification (LSHTC) competition on Kaggle.

http://www.kaggle.com/c/lshtc

Summary

A centroid-based flat classifier.

Prediction

Select k candidate classes near the query with the nearest centroid classifier.
Judge with a binary classifier whether the query should be accepted into each candidate class.

(predict.cpp)

Select the k candidate classes whose centroids are closest to the query.

Among those candidates, select the classes whose binary classifier returns p > 0.5. (The binary classifier is implemented as logistic regression.)
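The two prediction stages can be sketched roughly as follows. This is a minimal illustration, not the code in predict.cpp: the `SparseVector` type, the `ClassModel` struct, and the `predict` function are hypothetical names, and the feature vectors and centroids are assumed to be L2-normalized so that a dot product equals cosine similarity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sparse vector: term id -> weight.
using SparseVector = std::unordered_map<int, double>;

// Hypothetical per-class model: a centroid plus logistic regression parameters.
struct ClassModel {
    int label;
    SparseVector centroid;  // assumed to be L2-normalized
    SparseVector weights;   // logistic regression weights
    double bias;
};

// Sparse dot product; iterates over the smaller vector.
double dot(const SparseVector& a, const SparseVector& b) {
    const SparseVector& shorter = a.size() <= b.size() ? a : b;
    const SparseVector& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (const auto& kv : shorter) {
        auto it = longer.find(kv.first);
        if (it != longer.end()) sum += kv.second * it->second;
    }
    return sum;
}

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Stage 1: keep the k classes whose centroids are most similar to the query.
// Stage 2: accept a candidate only if its binary classifier returns p > 0.5.
std::vector<int> predict(const SparseVector& query,
                         const std::vector<ClassModel>& models,
                         std::size_t k) {
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t i = 0; i < models.size(); ++i)
        scored.push_back(std::make_pair(dot(query, models[i].centroid), i));

    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end(),
                      std::greater<std::pair<double, std::size_t>>());

    std::vector<int> labels;
    for (std::size_t j = 0; j < k; ++j) {
        const ClassModel& m = models[scored[j].second];
        double p = sigmoid(dot(query, m.weights) + m.bias);
        if (p > 0.5) labels.push_back(m.label);
    }
    return labels;
}
```

In this sketch the result is empty when no candidate passes the p > 0.5 threshold; how the actual code handles that case is not stated here.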

Training

For each data point:

Select k classes near the data point with the nearest centroid classifier.
Add the data point as training data to the dataset of each selected class.

(prefetch.cpp)
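The dataset-building step might look like the sketch below. It is an illustration under assumptions, not the contents of prefetch.cpp: `Example`, `nearest_classes`, and `build_class_datasets` are hypothetical names, class ids are assumed to be indices into the centroid list, and a data point is assumed to be a positive example for a selected class when that class is among its true labels and a negative example otherwise.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sparse vector: term id -> weight (assumed L2-normalized).
using SparseVector = std::unordered_map<int, double>;

double dot(const SparseVector& a, const SparseVector& b) {
    const SparseVector& shorter = a.size() <= b.size() ? a : b;
    const SparseVector& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (const auto& kv : shorter) {
        auto it = longer.find(kv.first);
        if (it != longer.end()) sum += kv.second * it->second;
    }
    return sum;
}

// Indices of the k centroids most similar to x.
std::vector<std::size_t> nearest_classes(const SparseVector& x,
                                         const std::vector<SparseVector>& centroids,
                                         std::size_t k) {
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t c = 0; c < centroids.size(); ++c)
        scored.push_back(std::make_pair(dot(x, centroids[c]), c));
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end(),
                      std::greater<std::pair<double, std::size_t>>());
    std::vector<std::size_t> result;
    for (std::size_t j = 0; j < k; ++j) result.push_back(scored[j].second);
    return result;
}

// One training example for a class: which document, and whether it is positive.
struct Example {
    std::size_t doc_index;
    bool positive;  // assumption: true when the class is a true label of the document
};

// For every document, add it to the dataset of each of its k nearest classes.
std::vector<std::vector<Example>> build_class_datasets(
    const std::vector<SparseVector>& docs,
    const std::vector<std::set<int>>& doc_labels,
    const std::vector<SparseVector>& centroids,
    std::size_t k) {
    std::vector<std::vector<Example>> datasets(centroids.size());
    for (std::size_t d = 0; d < docs.size(); ++d) {
        std::vector<std::size_t> candidates = nearest_classes(docs[d], centroids, k);
        for (std::size_t j = 0; j < candidates.size(); ++j) {
            std::size_t c = candidates[j];
            Example e;
            e.doc_index = d;
            e.positive = doc_labels[d].count(static_cast<int>(c)) > 0;
            datasets[c].push_back(e);
        }
    }
    return datasets;
}
```

Every document needs a similarity against every centroid in this naive form, which gives some intuition for why this step is the slow one.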

For each class:

Train the binary classifier on the class's own dataset.

(train.cpp)
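As a rough illustration of training one per-class binary classifier, here is a minimal logistic regression fitted with plain stochastic gradient descent. The optimizer, the absence of regularization, and the names `LogisticModel`, `predict_proba`, and `train_logistic` are all assumptions made for this sketch; train.cpp may do this differently.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sparse vector: term id -> weight.
using SparseVector = std::unordered_map<int, double>;

// Hypothetical logistic regression model for a single class.
struct LogisticModel {
    SparseVector weights;
    double bias;
    LogisticModel() : bias(0.0) {}
};

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Probability that x belongs to the class.
double predict_proba(const LogisticModel& m, const SparseVector& x) {
    double z = m.bias;
    for (const auto& kv : x) {
        auto it = m.weights.find(kv.first);
        if (it != m.weights.end()) z += kv.second * it->second;
    }
    return sigmoid(z);
}

// Fit the model on (feature vector, is_member) pairs with plain SGD on log loss.
LogisticModel train_logistic(const std::vector<std::pair<SparseVector, bool>>& samples,
                             std::size_t epochs,
                             double learning_rate) {
    LogisticModel m;
    for (std::size_t e = 0; e < epochs; ++e) {
        for (const auto& s : samples) {
            double y = s.second ? 1.0 : 0.0;
            // Gradient of the log loss with respect to the linear score.
            double err = predict_proba(m, s.first) - y;
            for (const auto& kv : s.first)
                m.weights[kv.first] -= learning_rate * err * kv.second;
            m.bias -= learning_rate * err;
        }
    }
    return m;
}
```

A trained model accepts a query when `predict_proba` returns a value above 0.5, which matches the acceptance rule described in the Prediction section.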

What are the features

A variant of TF-IDF is used.

tf = log(number_of_term_occurs_in_document + 1)
idf = log(total_number_of_documents / (number_of_documents_containing_term + 1)) + 5
tfidf = tf * idf

The feature vector is then normalized by its L2 norm. (code: tfidf_transformer.hpp)
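The formulas above translate fairly directly into code. The following is a minimal sketch rather than the contents of tfidf_transformer.hpp: the `SparseVector` type and the `tfidf_transform` function are hypothetical names, and a term missing from the document-frequency table is assumed to have a document frequency of 0.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>

// Hypothetical sparse vector: term id -> weight.
using SparseVector = std::unordered_map<int, double>;

// term_counts: term id -> number of occurrences in this document.
// doc_freq:    term id -> number of documents containing the term.
// total_docs:  total number of documents.
SparseVector tfidf_transform(const std::unordered_map<int, int>& term_counts,
                             const std::unordered_map<int, int>& doc_freq,
                             std::size_t total_docs) {
    SparseVector out;
    double norm_sq = 0.0;
    for (const auto& kv : term_counts) {
        auto it = doc_freq.find(kv.first);
        // Assumption: unseen terms get a document frequency of 0.
        double df = (it == doc_freq.end()) ? 0.0 : static_cast<double>(it->second);
        double tf = std::log(static_cast<double>(kv.second) + 1.0);
        double idf = std::log(static_cast<double>(total_docs) / (df + 1.0)) + 5.0;
        double w = tf * idf;
        out[kv.first] = w;
        norm_sq += w * w;
    }
    // L2 normalization; an all-zero vector is left unchanged.
    if (norm_sq > 0.0) {
        double inv = 1.0 / std::sqrt(norm_sq);
        for (auto& kv : out) kv.second *= inv;
    }
    return out;
}
```

Because the df + 1 in the denominator never exceeds total_docs + 1, the logarithm stays above roughly -0.7, so the + 5 offset keeps idf positive even for terms that appear in every document.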

What is the metric for the Centroid Classifier

Cosine similarity.
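For sparse feature vectors, cosine similarity and a class centroid can be sketched as below. The names `SparseVector`, `cosine_similarity`, and `centroid` are hypothetical, and building the centroid as the sum of the member vectors is an assumption for this example; since cosine similarity ignores vector length, summing and averaging give the same ranking.

```cpp
#include <cmath>
#include <unordered_map>
#include <vector>

// Hypothetical sparse vector: term id -> weight.
using SparseVector = std::unordered_map<int, double>;

// Sparse dot product; iterates over the smaller vector.
double dot(const SparseVector& a, const SparseVector& b) {
    const SparseVector& shorter = a.size() <= b.size() ? a : b;
    const SparseVector& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (const auto& kv : shorter) {
        auto it = longer.find(kv.first);
        if (it != longer.end()) sum += kv.second * it->second;
    }
    return sum;
}

// cos(a, b) = a.b / (|a| |b|); defined as 0 when either vector is all zeros.
double cosine_similarity(const SparseVector& a, const SparseVector& b) {
    double norm_a = std::sqrt(dot(a, a));
    double norm_b = std::sqrt(dot(b, b));
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot(a, b) / (norm_a * norm_b);
}

// Centroid of a class as the sum of its member vectors (an assumption here).
SparseVector centroid(const std::vector<SparseVector>& members) {
    SparseVector c;
    for (const auto& v : members)
        for (const auto& kv : v) c[kv.first] += kv.second;
    return c;
}
```

When both vectors are already L2-normalized, the two square roots equal 1 and the similarity reduces to a single sparse dot product.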

Requirements

Ubuntu 13.10
g++ 4.8.1
make
32GB RAM

How to Generate the Solution

Please edit SETTINGS.h first.

make
./prefetch
./train
./predict

NOTE: ./prefetch is very slow; processing time will probably exceed 15 hours.

MISC programs

Running the Validation Test

./vt_prefech
./vt_train
./validation

Simple k-NN baseline

Running the validation test:

./vt_knn

Generating submission.txt:

./knn

Simple Nearest Centroid Classifier

Running the validation test:

./vt_ncc

Generating submission.txt:

./ncc

Figure

Model	LBMaF	Training Time	Prediction Time
k-NN	0.23088	n/a	10 minutes
NCC	0.28931	80 seconds	2 hours
NCC+BC	0.33025	15 hours	2 hours

nagadomi/kaggle-lshtc

Folders and files

Latest commit

History

Repository files navigation

Summary

Prediction

Training

What are the feature

What are the metric for Centroid Classifier

Requirements

How to Generate the Solution

MISC programs

Running the Validation Test

Simple k-NN baseline

Simple Nearest Centroid Classifier

Figure

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages