
Word embeddings, dictionary creation, and validation #12

Open
marcdotson opened this issue Oct 17, 2022 · 5 comments

@marcdotson
Owner

  • Start with word embeddings. Can we cluster to find marketing language?
  • Validate this “top-down” dictionary creation with “bottom-up” dictionary automation (e.g., using support vector machines or some other predictive model trained on an outcome metric; see the sketch after this list).
  • Automation, validation, and expert input in dictionary creation? Humphreys, A., & Wang, R. J. H. (2018). Automated text analysis for consumer research. Journal of Consumer Research, 44, 1274–1306. https://doi.org/10.1093/jcr/ucx104.
  • Use of sub-dictionaries? Efficiency, effectiveness, innovation, outcome, and performance as sub-dictionaries? A kind of sensitivity analysis by dropping certain dictionary terms?
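As a placeholder for the “bottom-up” idea, here's a minimal Python sketch of dictionary automation with a linear SVM: fit it on TF-IDF features against an outcome metric and treat the top-weighted terms as candidate dictionary entries. The file and column names ("text", "outcome") are assumptions, not anything in the repo.

```python
# Hypothetical "bottom-up" dictionary sketch: a linear SVM over TF-IDF features,
# with the most positively weighted terms as candidate dictionary entries.
# File name and column names ("text", "outcome") are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = pd.read_csv("reviews.csv")                 # placeholder corpus
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(docs["text"])

svm = LinearSVC(C=1.0)
svm.fit(X, docs["outcome"])                       # outcome metric, e.g., a binary flag

# Highest-weighted terms suggest candidate "marketing language" dictionary words.
terms = vectorizer.get_feature_names_out()
weights = svm.coef_.ravel()
top_terms = [terms[i] for i in weights.argsort()[::-1][:50]]
print(top_terms)
```

Dropping a sub-dictionary's terms (efficiency, effectiveness, etc.) and refitting would be one way to run the sensitivity analysis mentioned above.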
@wtrumanrose
Collaborator

@marcdotson Apologies for not having this when I said I would--I ended up running the BERT model overnight, only to discover in the morning that there was a typo. I also went overboard and cobbled together three different models. I wanted to really explore what was out there, and as it turns out, there is a lot. Unfortunately, it's not quite as plug-and-play as I would've liked. The word2vec tutorial I used had a painfully slow custom function to clean up the data, and even running it on a sliver (1,000 obs) of the actual dataset took ~10 minutes.

I uploaded a CSV called python_data.csv, which is just word_tokens.rds tinkered with a bit: most of the punctuation and stop words should be removed, and the text should all be lowercase. For word2vec and TensorFlow, using python_data.csv would probably be best. I'm still trying to understand BERT, as it is more complex than the others, but if we wanted to use BERT, I think we would want to use the original transcript.

If you'd like, I can split them into three different .py files instead of one .ipynb, since it would be more cohesive with the structure of the repository. For now, however, my brain needs a bit of a break from this.

@marcdotson
Owner Author

marcdotson commented May 18, 2023

@wtrumanrose great conversation with Carly. Here are her recommendations when we have time to jump back into this:

  1. Start with general pre-trained word embeddings. She prefers GloVe (have you used this before?) but word2vec may work as well. So it sounds like we were on the right path.
  2. There is also the possibility of just using LLM word embeddings (i.e., a pre-trained transformer). We should especially look at the Bloomberg LLM.
  3. The final step up would be transfer learning: take an LLM's word embeddings and fine-tune them to our specific context. We'd likely still need a GPU cluster, and the Hugging Face library would give us access to the relevant transformer network's word embeddings. (A rough sketch of options 1 and 3 follows below.)
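To make options 1 and 3 concrete, here's a minimal Python sketch, assuming gensim and transformers are installed; the model names (glove-wiki-gigaword-50, bert-base-uncased) are just common defaults, not decisions.

```python
# Option 1: general pre-trained embeddings (50-dimensional GloVe via gensim).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")           # downloads on first use
print(glove.most_similar("marketing", topn=10))      # nearest terms in embedding space

# Options 2-3: transformer embeddings via Hugging Face; transfer learning would
# fine-tune from these same objects on our corpus (likely on a GPU cluster).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    out = bert(**tok("customer acquisition cost", return_tensors="pt"))
contextual_vectors = out.last_hidden_state.squeeze(0)   # one vector per subword token
```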

@marcdotson marcdotson unpinned this issue May 18, 2023
@marcdotson marcdotson changed the title Dictionary creation and validation Word embeddings, dictionary creation, and validation Jul 10, 2023
@marcdotson
Owner Author

In addition to k-means, here are some highlights from Carly in the recent email chain, moved here to preserve them:

We should compare k-means to a topic model.

If you're looking to do clustering for groupings of terms then that does make sense to leave them as word embeddings. Is there a reason you aren't doing a topic model? It seems like you could achieve something similar by running a topic model to create topics and then examine the top features of those topics as well as the breakdown of topics within a document.

And to affinity propagation.

And affinity propagation is just another centroid-based clustering algorithm; the main difference between KMeans and Affinity Prop is that you don't have to predefine a set number of clusters, and it also identifies an exemplar observation for each cluster instead of describing the cluster by its average characteristics.
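For reference, a rough scikit-learn sketch of both options on the word embeddings; the `vectors` array here is a stand-in for the real (n_terms, 50) embedding matrix.

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 50))   # placeholder for the 50-d word embeddings

# k-means: the number of clusters has to be chosen up front.
km = KMeans(n_clusters=25, n_init=10, random_state=42).fit(vectors)

# Affinity propagation: no preset k, and each cluster gets an exemplar observation.
ap = AffinityPropagation(random_state=42).fit(vectors)
exemplars = vectors[ap.cluster_centers_indices_]
print(len(ap.cluster_centers_indices_), "clusters found by affinity propagation")
```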

@marcdotson
Owner Author

Hi, @docsfox. Let's give using this issue a try.

Using the randomly sampled subset of 1 million word tokens and the 50-dimensional word embeddings, I've compared a range of possible topics and clusters. Tuning the number of topics produces a wacky bend in the log-likelihood, but comparing both, I'm going to go ahead and look for a marketing topic and cluster at k = 25. I currently have this in R only (see /code/04_dictionary-identification.R), but I'm interested in the Python comparison.

[Figures: clustering-km_tune and clustering-lda_tune (k-means and LDA tuning plots)]
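For the Python comparison, here's a hedged sketch of the same tuning sweep with scikit-learn: k-means inertia on the embeddings and LDA approximate log-likelihood on a document-term matrix. The file names and the k grid are placeholders, not what 04_dictionary-identification.R actually does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectors = np.load("embeddings_50d.npy")                  # assumed (n_tokens, 50) array
docs = pd.read_csv("python_data.csv")                    # assumes a "text" column
dtm = CountVectorizer(max_features=10000).fit_transform(docs["text"])

for k in (5, 10, 15, 20, 25, 30):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(vectors)
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(dtm)
    # Look for the elbow/bend rather than a global optimum:
    # lower inertia and higher approximate log-likelihood are better.
    print(k, km.inertia_, lda.score(dtm))
```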

@docsfox
Collaborator

docsfox commented Oct 3, 2023 via email
