
ULTRA-LM: Language model integration for ULTRA #24

Closed

Conversation

@daniel4x (Contributor) commented on Aug 4, 2024

ULTRA-LM

Hi @migalkin, per my issue here, I finally had some time to refactor part of my code and package it into a PR.

It is worth mentioning that I've also contributed an additional dataset combining KG and textual embeddings. The embeddings are currently provided as a separate download link.

Changes:

  • README.md instructions
  • pretrain.yaml - configuration with the path to LM vectors
  • Custom pretrain script - I slightly changed the original pretrain script to load the embeddings from a given path
  • A new entity model - based on the NbfEntity model; it introduces the new combining layer (see the sketch after this list)
  • New dataset RedHatCVE - a cybersecurity vulnerability dataset. To allow correct mapping between the entity labels and the embeddings, I explicitly added the mapping to the Data object (a bit ugly, I know 😢).
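For readers following along, here is a minimal sketch of what such a combining layer could look like. The class name `LMCombiner` and the concatenate-then-project fusion are illustrative assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

class LMCombiner(nn.Module):
    """Illustrative sketch of fusing frozen LM entity embeddings with
    GNN entity states (an assumption, not the PR's exact implementation)."""

    def __init__(self, lm_vectors: torch.Tensor, hidden_dim: int):
        super().__init__()
        # Frozen pretrained LM embeddings, one row per entity
        self.lm_vectors = nn.Embedding.from_pretrained(lm_vectors, freeze=True)
        # Project the concatenated [LM feature | GNN state] back to hidden_dim
        self.proj = nn.Linear(lm_vectors.shape[1] + hidden_dim, hidden_dim)

    def forward(self, entity_ids: torch.Tensor, gnn_states: torch.Tensor) -> torch.Tensor:
        lm_feats = self.lm_vectors(entity_ids)
        return self.proj(torch.cat([lm_feats, gnn_states], dim=-1))
```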

@daniel4x (Contributor, Author)

Hi @migalkin,
Have you had a chance to look into this PR?

@migalkin (Collaborator)

Hi Daniel, thanks for the PR, it indeed touches upon one of the most requested features!

The new features are pretty nice, but I have a hard time reproducing them because there is no link to the pickle file with the precomputed LLM features.
Generally, in order to be merged into the main branch, the PR will need a bit more work to encompass more datasets and possible use cases:

  • Currently, the PR changes some of the inner workings of the datasets, mostly in favor of the one new dataset, RedHatCVE. We'll need a unified mechanism for obtaining LLM features for all possible datasets that have relevant entity / relation descriptions.
  • Putting node features inside the GNN, like here:

 self.lm_vectors = nn.Embedding.from_pretrained(kwargs['lm_vectors'], freeze=True)

is rather inefficient - it will be tedious to pretrain a model on several datasets because you'll have to swap that registered embedding every time you switch datasets. PyG has a built-in mechanism for adding node features to the Data object (e.g., as Data(x=llm_feature, edge_index=...)), and I believe it would be a much better interface to read the features from the relevant data object together with the edge index and edge type (see the sketch after this list).

  • In many datasets, relations might also have descriptions which could be encoded by LLMs as well, so we'll need a mechanism to optionally process those if they exist.
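To make the suggested interface concrete, here is a small sketch using PyG's standard Data container. The tensor shapes and the `rel_emb` attribute name for optional relation features are assumptions for illustration; `x`, `edge_index`, and arbitrary extra attributes are standard PyG usage:

```python
import torch
from torch_geometric.data import Data

# Hypothetical precomputed LM features (e.g., loaded from the pickle file)
num_entities, num_relations, lm_dim = 1000, 12, 768
llm_feature = torch.randn(num_entities, lm_dim)   # entity description embeddings
rel_feature = torch.randn(num_relations, lm_dim)  # optional relation descriptions

edge_index = torch.randint(0, num_entities, (2, 5000))
edge_type = torch.randint(0, num_relations, (5000,))

# Node (and optionally relation) features travel with the dataset itself,
# so nothing has to be re-registered inside the GNN for each dataset.
data = Data(x=llm_feature, edge_index=edge_index, edge_type=edge_type,
            rel_emb=rel_feature)

# Downstream, the model reads data.x / data.rel_emb alongside
# data.edge_index and data.edge_type from the same object.
```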

Let me know what you think. We also have an ongoing effort in a similar direction; it might be ready in a month or so.

@daniel4x (Contributor, Author)

@migalkin thanks.

Indeed, I agree. In my case, this branch was tailor-made for my research, as I was using a single graph with an LM.
I am glad to hear about the ongoing effort and am looking forward to learning more about it (and even contributing when/if possible 😉).

I believe you can close the PR, or you can leave it open for future reference.

@daniel4x daniel4x closed this Aug 24, 2024
@daniel4x daniel4x deleted the language_model_integration branch September 5, 2024 10:42