
ULTRA-LM: Language model integration for ULTRA #24

Closed

Conversation

@daniel4x (Contributor) commented on Aug 4, 2024

ULTRA-LM

Hi @migalkin, per my issue here, I finally had some time to refactor part of my code and package it into a PR.

It is worth mentioning that I've also contributed an additional dataset combining KG and textual embeddings. The embeddings are currently provided as a separate download link.

Changes:

  • README.md instructions
  • pretrain.yaml - configuration with the path to LM vectors
  • Custom pretrain script - I slightly changed the original pretrain script to load the embeddings from a given path
  • A new entity model - based on the NbfEntity model; it introduces the new combining layer (see the sketch after this list)
  • New dataset RedHatCVE - a cybersecurity vulnerability dataset. To allow correct mapping between the entity labels and the embeddings, I explicitly added the mapping to the Data object (a bit ugly, I know 😢).
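For readers following along, here is a minimal sketch of what such a combining layer could look like. The class name `LMCombiner` and the concatenate-then-project fusion are illustrative assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

class LMCombiner(nn.Module):
    """Illustrative sketch of fusing frozen LM entity embeddings with
    GNN entity states (an assumption, not the PR's exact implementation)."""

    def __init__(self, lm_vectors: torch.Tensor, hidden_dim: int):
        super().__init__()
        # Frozen pretrained LM embeddings, one row per entity
        self.lm_vectors = nn.Embedding.from_pretrained(lm_vectors, freeze=True)
        # Project the concatenated [LM feature | GNN state] back to hidden_dim
        self.proj = nn.Linear(lm_vectors.shape[1] + hidden_dim, hidden_dim)

    def forward(self, entity_ids: torch.Tensor, gnn_states: torch.Tensor) -> torch.Tensor:
        lm_feats = self.lm_vectors(entity_ids)
        return self.proj(torch.cat([lm_feats, gnn_states], dim=-1))
```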

@daniel4x (Contributor, Author)

Hi @migalkin,
Have you had a chance to look into this PR?

@migalkin (Collaborator)

Hi Daniel, thanks for the PR, it indeed touches upon one of the most requested features!

The new features are pretty nice, but I have a hard time reproducing them because there is no link to the pickle file with the precomputed LLM features.
Generally, in order to be merged into the main branch, the PR will need a bit more work to encompass more datasets and possible use cases:

  • Currently, the PR changes some of the inner workings of the datasets, mostly in favor of the one new dataset, RedHatCVE. We'll need a unified mechanism for obtaining LLM features for all possible datasets that have relevant entity / relation descriptions.
  • Putting node features inside the GNN, like here:

 self.lm_vectors = nn.Embedding.from_pretrained(kwargs['lm_vectors'], freeze=True)

is rather inefficient - it will be tedious to pretrain a model on several datasets because you'll have to swap that registered embedding every time you switch datasets. PyG has a built-in mechanism for adding node features to the Data object (e.g., as Data(x=llm_feature, edge_index=...)), and I believe it would be a much better interface to read the features from the relevant data object together with the edge index and edge type (see the sketch after this list).

  • In many datasets, relations might also have descriptions which could be encoded by LLMs as well, so we'll need a mechanism to optionally process those if they exist.
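To make the suggested interface concrete, here is a small sketch using PyG's standard Data container. The tensor shapes and the `rel_emb` attribute name for optional relation features are assumptions for illustration; `x`, `edge_index`, and arbitrary extra attributes are standard PyG usage:

```python
import torch
from torch_geometric.data import Data

# Hypothetical precomputed LM features (e.g., loaded from the pickle file)
num_entities, num_relations, lm_dim = 1000, 12, 768
llm_feature = torch.randn(num_entities, lm_dim)   # entity description embeddings
rel_feature = torch.randn(num_relations, lm_dim)  # optional relation descriptions

edge_index = torch.randint(0, num_entities, (2, 5000))
edge_type = torch.randint(0, num_relations, (5000,))

# Node (and optionally relation) features travel with the dataset itself,
# so nothing has to be re-registered inside the GNN for each dataset.
data = Data(x=llm_feature, edge_index=edge_index, edge_type=edge_type,
            rel_emb=rel_feature)

# Downstream, the model reads data.x / data.rel_emb alongside
# data.edge_index and data.edge_type from the same object.
```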

Let me know what you think. We also have an ongoing effort in a similar direction; it might be ready in a month or so.

@daniel4x (Contributor, Author)

@migalkin thanks.

Indeed, I agree. In my case, this branch was tailor-made for my research, as I was using a single graph with an LM.
I am glad to hear about the ongoing effort and am looking forward to learning more about it (and even contributing when/if possible 😉).

I believe you can close the PR, or you can leave it open for future reference.

@daniel4x daniel4x closed this Aug 24, 2024
@daniel4x daniel4x deleted the language_model_integration branch September 5, 2024 10:42