Use Topic Modeling on Source Code (tmsc) to grab topics #3

Open
RichardLitt opened this issue Nov 7, 2017 · 1 comment

@RichardLitt

Source{d} has a repository that automatically suggests topics based on source code.

Repository: https://github.com/src-d/tmsc
Paper on arXiv: https://arxiv.org/abs/1704.00135
Blog post about the paper: https://blog.sourced.tech/post/github_topic_modeling

People to ping about this: https://twitter.com/tmarkhor, https://twitter.com/francesc

@RichardLitt

RichardLitt commented Nov 10, 2017

From Vadim:

To play with our model, execute

wget https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftopics%2Fc70a7514-9257-4b33-b468-27a8588d4dfa.asdf -O model.asdf

This will fetch the topic model (91 MB). Afterwards,

pip3 install ast2vec

That's our ML beast written in Python. Finally, load the model:

from ast2vec import Topics; model = Topics().load("model.asdf")

That's it. All the keywords are there: model.matrix is the sparse keyword -> topic matrix, and model.tokens is the keyword list. Note that those "tokens" are split and stemmed as described in the paper; the processing code is in ast2vec/uast_ids_to_bag.py. There is no need to extract any ASTs to get the identifiers; a regular syntax highlighter is enough.
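To make the "no ASTs needed" point concrete, here is a rough sketch of pulling identifiers out of source with only a lexer and splitting them into sub-tokens. It is not the code from ast2vec/uast_ids_to_bag.py: the function names are hypothetical, Python's standard tokenize module stands in for a syntax highlighter, and the stemming step from the paper is left out.

```python
import io
import keyword
import re
import tokenize
from collections import Counter

# Hypothetical helpers for illustration only; not ast2vec's implementation.
# Matches acronym runs ("HTTP"), capitalized or lowercase words ("Response", "parse").
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def extract_identifiers(source):
    """Yield identifier names from Python source using the lexer only (no AST)."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            yield tok.string


def split_identifier(name):
    """Split snake_case and camelCase names into lowercase parts."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(m.group(0).lower() for m in _WORD.finditer(chunk))
    return parts


def bag_of_tokens(source):
    """Count split sub-tokens over all identifiers (stemming omitted here)."""
    bag = Counter()
    for name in extract_identifiers(source):
        bag.update(split_identifier(name))
    return bag


if __name__ == "__main__":
    sample = "def loadModel(file_path):\n    return open(file_path).read()\n"
    print(bag_of_tokens(sample))
```

The sub-tokens produced this way would presumably still need the same stemming as in the paper before they can be looked up in model.tokens.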
