Use Topic Modeling on Source Code (`tmsc`) to grab topics #3

RichardLitt · 2017-11-07T19:16:28Z

Source{d} has a repository that automatically suggests topics based on source code.

Repository: https://github.com/src-d/tmsc
Paper on arxiv: https://arxiv.org/abs/1704.00135
Blogpost about paper: https://blog.sourced.tech/post/github_topic_modeling

People to ping about this: https://twitter.com/tmarkhor, https://twitter.com/francesc

RichardLitt · 2017-11-10T21:38:17Z

From Vadim:

To play with our model, execute

wget https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftopics%2Fc70a7514-9257-4b33-b468-27a8588d4dfa.asdf -O model.asdf

This will fetch the TM (91MB). Afterwards,

pip3 install ast2vec

That's our ML beast written in Python. Finally, load the model:

from ast2vec import Topics; model = Topics().load("model.asdf")

That's it. All the keywords are there. model.matrix is the sparse matrix of keyword -> topic, and model.tokens is the keyword list. Note that those "tokens" are split and stemmed as described in the paper; the processing code is in ast2vec/uast_ids_to_bag.py. There is no need to extract any ASTs to get the identifiers; a regular syntax highlighter is enough.
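
To make the keyword -> topic idea concrete, here is a rough sketch that does not use ast2vec at all. It splits identifiers into lowercase sub-tokens and scores topics with a naive weighted sum. Everything in it is an assumption for illustration: `TOKENS`, `TOPICS`, `WEIGHTS`, `split_identifier` and `rank_topics` are hypothetical names, the toy weights are made up, the (topic, token) orientation may not match `model.matrix`, stemming is skipped, and the summing is not the inference procedure from the paper.

```python
import re
from collections import Counter

# Toy stand-ins for model.tokens and the topic names (hypothetical values).
TOKENS = ["http", "server", "url", "file", "read"]
TOPICS = ["Networking", "File I/O"]

# Toy sparse weights keyed by (topic_index, token_index).
# The orientation is an assumption; check the shape of the real matrix.
WEIGHTS = {
    (0, 0): 0.9, (0, 1): 0.8, (0, 2): 0.5,
    (1, 3): 0.9, (1, 4): 0.7,
}


def split_identifier(identifier):
    """Split on non-letters and camelCase boundaries, then lowercase.

    Stemming, which the real pipeline applies, is deliberately omitted.
    """
    parts = []
    for chunk in re.split(r"[^A-Za-z]+", identifier):
        # "HTTPServer" -> "HTTP", "Server"; "serverUrl" -> "server", "Url"
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [part.lower() for part in parts]


def rank_topics(identifiers):
    """Return (topic, score) pairs, best first, using a naive weighted sum."""
    bag = Counter(tok for ident in identifiers for tok in split_identifier(ident))
    scores = [0.0] * len(TOPICS)
    for (topic_i, token_i), weight in WEIGHTS.items():
        scores[topic_i] += weight * bag.get(TOKENS[token_i], 0)
    return sorted(zip(TOPICS, scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for topic, score in rank_topics(["HTTPServer", "serverUrl"]):
        print(topic, round(score, 2))
```

With the real model, the same loop would presumably run over the loaded sparse matrix and token list instead of the toy constants, after stemming the sub-tokens the same way the model's vocabulary was stemmed.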

RichardLitt added the enhancement label Nov 7, 2017
