Word embeddings, dictionary creation, and validation #12
@marcdotson Apologies for not having this when I said I would--I ended up running the BERT model overnight, only to discover in the morning that there was a typo. I also went overkill on this and cobbled together three different models: I wanted to really explore what was out there, and as it turns out, there is a lot. Unfortunately, it's not as plug-and-play as I would have liked. The word2vec tutorial I used had a painfully slow custom function to clean up the data, and even running it on a sliver (1,000 obs) of the actual dataset took ~10 minutes. I uploaded a csv called … If you'd like, I can split them into three different .py files instead of one .ipynb, since that would be more cohesive with the structure of the repository. For now, however, my brain needs a bit of a break from this.
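This isn't the notebook's actual code, but as a point of comparison, here is a minimal sketch of how gensim's built-in tokenizer can stand in for a slow hand-rolled cleanup function; the file path and column name are hypothetical:

```python
# Sketch only: a fast preprocessing path for word2vec using gensim's
# simple_preprocess instead of a slow per-document custom cleanup function.
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

df = pd.read_csv("data/reviews.csv")  # hypothetical file name

# simple_preprocess lowercases, strips punctuation, and tokenizes in one pass.
sentences = [simple_preprocess(doc) for doc in df["text"].astype(str)]

# Train word2vec on the tokenized sentences (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=5, workers=4)
model.wv.save("output/word2vec-50d.kv")  # hypothetical output path
```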
@wtrumanrose Great conversation with Carly. Here are her recommendations for when we have time to jump back into this:
In addition to k-means, some highlights from Carly in the recent email chain, moved here to preserve them:
- We should compare k-means to a topic model.
- We should also compare it to affinity propagation (a rough sketch follows this list).
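Neither comparison is in the repo yet, but the affinity propagation leg might look something like this with scikit-learn; the embedding file and subsample size are assumptions:

```python
# Sketch only: affinity propagation chooses the number of clusters itself,
# so it is a useful check on whether a hand-picked k is in a sensible range.
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

embeddings = np.load("output/embeddings-50d.npy")  # hypothetical path

# k-means baseline with a fixed k for comparison.
km = KMeans(n_clusters=25, n_init=10, random_state=42).fit(embeddings)

# Affinity propagation is O(n^2) in memory, so subsample the word tokens.
rng = np.random.default_rng(42)
idx = rng.choice(len(embeddings), size=5_000, replace=False)
ap = AffinityPropagation(damping=0.9, random_state=42).fit(embeddings[idx])
print("affinity propagation found", len(ap.cluster_centers_indices_), "clusters")
```

The topic-model leg needs a document-term matrix rather than word embeddings; a sketch of that comparison is in the tuning example further down.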
Hi, @docsfox. Let's give using this issue a try? Using the randomly sampled subset of 1 million word tokens and the 50-dimensional word embeddings, I've compared a range of possible topics and clusters. Tuning for the number of topics produces a wacky bend in the log-likelihood, but comparing both, I'm going to go ahead and look for a marketing topic and cluster where k = 25. I just have this in R currently (see /code/04_dictionary-identification.R), but I'm interested in the Python comparison.
[image: clustering-km_tune]
[image: clustering-lda_tune]
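For the Python side, here is a hedged sketch of the same tuning sweep with scikit-learn (this is not a translation of /code/04_dictionary-identification.R, and the input paths are placeholders):

```python
# Sketch only: sweep k for k-means on the embeddings and n_components for
# LDA on a document-term matrix, mirroring the two tuning plots above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

embeddings = np.load("output/embeddings-50d.npy")  # hypothetical path
docs = open("data/docs.txt", encoding="utf-8").read().splitlines()  # hypothetical path

# LDA fits counts, not embeddings, so build a document-term matrix.
dtm = CountVectorizer(max_features=5000).fit_transform(docs)

for k in range(5, 55, 5):
    # Within-cluster sum of squares: look for the elbow across k.
    inertia = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings).inertia_
    # Approximate log-likelihood of the fitted LDA model at the same k.
    loglik = LatentDirichletAllocation(n_components=k, random_state=42).fit(dtm).score(dtm)
    print(f"k = {k}: k-means inertia = {inertia:,.0f}, LDA log-likelihood = {loglik:,.0f}")
```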
Hey Marc,
Great, I started running with Fast Kmeans on Friday, which was actually pretty quick, breaking the data up into 1,000 mini-batches. I'll be working on it this afternoon and will let you know if I get similar results with the 50d word embeddings!
Best,
Carly
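Assuming Carly's "Fast Kmeans" is scikit-learn's MiniBatchKMeans (her exact implementation isn't in the thread), the mini-batch setup would look roughly like:

```python
# Sketch only: mini-batch k-means trades a little accuracy for a large
# speedup over full-batch k-means on large token sets.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

embeddings = np.load("output/embeddings-50d.npy")  # hypothetical path

# batch_size=1000 processes the tokens in mini-batches of 1,000 at a time.
mbk = MiniBatchKMeans(n_clusters=25, batch_size=1000, n_init=10,
                      random_state=42).fit(embeddings)
labels = mbk.labels_
```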