
Word embeddings, dictionary creation, and validation #12

Open
marcdotson opened this issue Oct 17, 2022 · 5 comments

@marcdotson
Owner

  • Start with word embeddings. Can we cluster to find marketing language?
  • Validate this “top-down” dictionary creation with “bottom-up” dictionary automation (e.g., using support vector machines or some other predictive model trained on an outcome metric; see the sketch after this list).
  • Automation, validation, and expert input in dictionary creation? Humphreys, A., & Wang, R. J. H. (2018). Automated text analysis for consumer research. Journal of Consumer Research, 44, 1274–1306. https://doi.org/10.1093/jcr/ucx104.
  • Use of sub-dictionaries? Efficiency, effectiveness, innovation, outcome, and performance as sub-dictionaries? A kind of sensitivity analysis by dropping certain dictionary terms?
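As a placeholder for the “bottom-up” idea, here's a minimal Python sketch of dictionary automation with a linear SVM: fit it on TF-IDF features against an outcome metric and treat the top-weighted terms as candidate dictionary entries. The file and column names ("text", "outcome") are assumptions, not anything in the repo.

```python
# Hypothetical "bottom-up" dictionary sketch: a linear SVM over TF-IDF features,
# with the most positively weighted terms as candidate dictionary entries.
# File name and column names ("text", "outcome") are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = pd.read_csv("reviews.csv")                 # placeholder corpus
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(docs["text"])

svm = LinearSVC(C=1.0)
svm.fit(X, docs["outcome"])                       # outcome metric, e.g., a binary flag

# Highest-weighted terms suggest candidate "marketing language" dictionary words.
terms = vectorizer.get_feature_names_out()
weights = svm.coef_.ravel()
top_terms = [terms[i] for i in weights.argsort()[::-1][:50]]
print(top_terms)
```

Dropping a sub-dictionary's terms (efficiency, effectiveness, etc.) and refitting would be one way to run the sensitivity analysis mentioned above.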
@wtrumanrose
Collaborator

@marcdotson Apologies for not having this when I said I would--I ended up running the BERT model overnight, only to discover in the morning that there was a typo. I also went overboard and cobbled together three different models. I wanted to really explore what was out there, and as it turns out, there is a lot. Unfortunately, it's not quite as plug-and-play as I would've liked. The word2vec tutorial I used had a painfully slow custom function to clean up the data, and even running it on a sliver (1,000 obs) of the actual dataset took ~10 minutes.

I uploaded a CSV called python_data.csv, which is just word_tokens.rds tinkered with a bit: most of the punctuation and stop words should be removed, and the text should all be lowercase. For word2vec and TensorFlow, using python_data.csv would probably be best. I'm still trying to understand BERT, as it is more complex than the others, but if we wanted to use BERT, I think we would want to use the original transcript.

If you'd like, I can split them into three different .py files instead of one .ipynb, since it would be more cohesive with the structure of the repository. For now, however, my brain needs a bit of a break from this.

@marcdotson
Owner Author

marcdotson commented May 18, 2023

@wtrumanrose great conversation with Carly. Here are her recommendations when we have time to jump back into this:

  1. Start with general pre-trained word embeddings. She prefers GloVe (have you used this before?) but word2vec may work as well. So it sounds like we were on the right path.
  2. There is also the possibility of just using LLM word embeddings (i.e., a pre-trained transformer). We should especially look at the Bloomberg LLM.
  3. The final step up would be transfer learning: take an LLM's word embeddings and fine-tune them to our specific context. We'd likely still need a GPU cluster, and the Hugging Face library would give us access to the relevant transformer network's word embeddings. (A rough sketch of options 1 and 3 follows below.)
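To make options 1 and 3 concrete, here's a minimal Python sketch, assuming gensim and transformers are installed; the model names (glove-wiki-gigaword-50, bert-base-uncased) are just common defaults, not decisions.

```python
# Option 1: general pre-trained embeddings (50-dimensional GloVe via gensim).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")           # downloads on first use
print(glove.most_similar("marketing", topn=10))      # nearest terms in embedding space

# Options 2-3: transformer embeddings via Hugging Face; transfer learning would
# fine-tune from these same objects on our corpus (likely on a GPU cluster).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    out = bert(**tok("customer acquisition cost", return_tensors="pt"))
contextual_vectors = out.last_hidden_state.squeeze(0)   # one vector per subword token
```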

@marcdotson marcdotson unpinned this issue May 18, 2023
@marcdotson marcdotson changed the title Dictionary creation and validation Word embeddings, dictionary creation, and validation Jul 10, 2023
@marcdotson
Owner Author

In addition to k-means, here are some highlights from Carly in the recent email chain, moved here to preserve them:

We should compare k-means to a topic model.

If you're looking to do clustering for groupings of terms then that does make sense to leave them as word embeddings. Is there a reason you aren't doing a topic model? It seems like you could achieve something similar by running a topic model to create topics and then examine the top features of those topics as well as the breakdown of topics within a document.

And to affinity propagation.

And affinity propagation is just another centroid-based clustering algorithm; the main difference between KMeans and Affinity Prop is that you don't have to predefine a set number of clusters, and it also identifies an exemplar observation for each cluster instead of describing the cluster by its average characteristics.
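For reference, a rough scikit-learn sketch of both options on the word embeddings; the `vectors` array here is a stand-in for the real (n_terms, 50) embedding matrix.

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 50))   # placeholder for the 50-d word embeddings

# k-means: the number of clusters has to be chosen up front.
km = KMeans(n_clusters=25, n_init=10, random_state=42).fit(vectors)

# Affinity propagation: no preset k, and each cluster gets an exemplar observation.
ap = AffinityPropagation(random_state=42).fit(vectors)
exemplars = vectors[ap.cluster_centers_indices_]
print(len(ap.cluster_centers_indices_), "clusters found by affinity propagation")
```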

@marcdotson
Owner Author

Hi, @docsfox. Let's give using this issue a try.

Using the randomly sampled subset of 1 million word tokens and the 50-dimensional word embeddings, I've compared a range of possible topics and clusters. Tuning the number of topics produces a wacky bend in the log-likelihood, but comparing both, I'm going to go ahead and look for a marketing topic and cluster at k = 25. I currently have this in R only (see /code/04_dictionary-identification.R), but I'm interested in the Python comparison.

[Figures: clustering-km_tune and clustering-lda_tune (k-means and LDA tuning plots)]
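For the Python comparison, here's a hedged sketch of the same tuning sweep with scikit-learn: k-means inertia on the embeddings and LDA approximate log-likelihood on a document-term matrix. The file names and the k grid are placeholders, not what 04_dictionary-identification.R actually does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectors = np.load("embeddings_50d.npy")                  # assumed (n_tokens, 50) array
docs = pd.read_csv("python_data.csv")                    # assumes a "text" column
dtm = CountVectorizer(max_features=10000).fit_transform(docs["text"])

for k in (5, 10, 15, 20, 25, 30):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(vectors)
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(dtm)
    # Look for the elbow/bend rather than a global optimum:
    # lower inertia and higher approximate log-likelihood are better.
    print(k, km.inertia_, lda.score(dtm))
```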

@docsfox
Collaborator

docsfox commented Oct 3, 2023 via email
