
Inferring logits from model.forward for the entire batch instead of the last forward's output. #73

Open
michaelfeil opened this issue Jan 10, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@michaelfeil

michaelfeil commented Jan 10, 2024

I am trying to retrieve the logits from the model, to use them with lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/blob/692e0f83b5341b543fa288f84289617f793e4e93/lm_eval/models/huggingface.py#L972

Huggingface transformers

In transformers I can get the logits from the forward pass:

# for models from transformers.AutoModelForCausalLM
# inps.shape = torch.Size([2, 205]) # we are running Batch size 2
# two sequences with context
mylogits = self.hf_model(
    input_ids=inps,  # attention_mask=attn_mask, labels=labels
).logits
# mylogits.shape = torch.Size([2, 205, 32000]) # llama2 has a 32000-token vocabulary
# [batch, padding_length, vocab]

# after a log-softmax, these give the log-likelihood for each token, from which we can infer the certainty
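
(For context, downstream lm-eval-harness turns these logits into per-token log-likelihoods roughly like this; a rough sketch using the tensors above, the actual harness code differs in details:)

# Rough sketch: full-sequence logits -> per-token log-likelihoods
# (tensor names as above; illustrative only)
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(mylogits, dim=-1)              # [batch, padding_length, vocab]
# log-likelihood of each actual next token (shifted by one position)
token_log_likelihood = torch.gather(
    log_probs[:, :-1, :], 2, inps[:, 1:].unsqueeze(-1)
).squeeze(-1)                                            # [batch, padding_length - 1]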

transformers-neuronx

out = self.neuron_model(
    inps,  # attention_mask=attn_mask, labels=labels
)
# returns torch.Size([2, 32000]): only the logits for the last position of each sequence, and already passed through a (log-)softmax.

In simple PyTorch terms

# What I get is equivalent to this:
neuron_model(batched_inps) =~= F.log_softmax(self.hf_model(batched_inps).logits[:,-1,:], dim=-1)
# But I want to compute `the_magic_function` such that:
self.hf_cuda_model(batched_inps).logits[:,:,:] =?= neuron_model.the_magic_function(batched_inps)

Update: 1/11

I got this to work. However, I am using the undocumented cache_ids feature.

The output seems correct, but the code is terribly slow. My laptop GPU (RTX 3060M) runs TinyLlama-1.1B around 25x faster.

def logits_hf(input_ids):
    with torch.inference_mode():
        return self.hf_cuda_model(input_ids).logits

def logits(input_ids):
    """
    Get logits for the entire sequence.

    :param input_ids: torch.Tensor
        A torch tensor of shape [batch, sequence_length];
        the sequence length may vary from call to call
    :return:
        A torch tensor of shape [batch, sequence, vocab] with the
        logits returned from the model's decoder
    """
    _, sequence_length = input_ids.shape

    with torch.inference_mode():
        # feed the sequence one token at a time, advancing the KV cache position via cache_ids
        cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
        input_ids_split = input_ids.split(1, dim=1)

        return torch.stack(
            [
                self.neuron_model(input_ids=input_id, cache_ids=cache_id)
                for input_id, cache_id in zip(input_ids_split, cache_ids)
            ],
            dim=1,
        )
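
A rough sanity check against the HF path (hypothetical `tokenizer`; per the note above, the Neuron output is already log-softmaxed, so the HF side is normalized the same way before comparing):

import torch
import torch.nn.functional as F

# Rough sanity check: per-step Neuron output vs. the log-softmaxed HF forward pass
input_ids = tokenizer(["first prompt", "second prompt"],
                      return_tensors="pt", padding=True).input_ids
neuron_out = logits(input_ids)
hf_out = F.log_softmax(logits_hf(input_ids).cpu(), dim=-1)
print(torch.allclose(neuron_out, hf_out, atol=1e-2))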

Using LlamaForSampling, an inf2.8xlarge instance, tp_degree=2, Neuron 2.15.9

self.neuron_model = LlamaForSampling.from_pretrained(..)
self.neuron_model.to_neuron()
@micwade-aws

Thanks for reporting @michaelfeil - we'll get back to you soon.

@michaelfeil
Author

michaelfeil commented Jan 13, 2024

@micwade-aws Thanks, looking forward to your answer. FYI @jimburtoft, re: our discussion today.

@zhouku92

zhouku92 commented Feb 4, 2024

+1 on this thread. Furthermore, is there any way to get the hidden states of the last layer?

@jluntamazon
Contributor

@michaelfeil Here is one thing you could try:

To return model forward scores during inference, you can use the HuggingFaceGenerationModelAdapter. This wrapper supports the Hugging Face generate() API functionality, including the ability to return model forward scores. The only behavioral difference that you may notice is that we only produce scores for the final token in the prompt (rather than a score for each prompt token).

Here is an example of how to use this wrapper to access the model forward scores:

# Import the generation adapter (lives in transformers_neuronx.generation_utils in current releases)
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

# Model config object
config = ...

# Create your Neuron model
neuron_model = ...
# Compile your Neuron model
neuron_model.to_neuron()

# Create the Hugging Face wrapper model
neuron = HuggingFaceGenerationModelAdapter(config, neuron_model)

# Run inference using the Hugging Face generate API
# Pass in `output_scores=True, return_dict_in_generate=True` to return the scores
result = neuron.generate(inputs, ..., output_scores=True, return_dict_in_generate=True)

# Retrieve the tokens
tokens = result.sequences

# Retrieve the scores
scores = result.scores
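
If you need log-probabilities rather than raw scores, a rough sketch (assuming `result.scores` follows the usual Hugging Face layout of one [batch, vocab] tensor per generated step):

import torch
import torch.nn.functional as F

# Stack the per-step score tensors into [batch, generated_length, vocab]
# and convert them to log-probabilities
step_scores = torch.stack(result.scores, dim=1)
step_log_probs = F.log_softmax(step_scores, dim=-1)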

For additional information about the HuggingFaceGenerationModelAdapter wrapper, see the transformers-neuronx developer guide documentation.

Let me know if this solves the original issue.

@michaelfeil
Author

@jluntamazon Thanks for your response! My issue was more about getting the logits for the whole sequence, specifically to estimate the metrics for lm-eval-harness.

@hannanjgaws
Contributor

Hi @michaelfeil:

We added the ability to return all input prompt context encoding logits in the 2.19 Release. This is enabled by setting output_all_logits=True in the NeuronConfig during Neuron model initialization.

Please note that the model.sample() and HuggingFaceGenerationModelAdapter.generate() APIs do not yet support returning all context encoding logits. For now, you must call the Neuron model directly to return the context encoding logits.

Here is an example of how to use output_all_logits=True to access the logits for all input tokens:

import torch
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig

# Original model checkpoint location
checkpoint = ...

# Create your Neuron model with output_all_logits=True to return all logits during inference
neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
    checkpoint,
    ...,
    neuron_config=NeuronConfig(..., output_all_logits=True)
)

# Compile your Neuron model
neuron_model.to_neuron()

# Prepare your inputs
input_ids = ...
_, context_length = input_ids.shape
cache_ids = torch.arange(0, context_length, dtype=torch.int32)
start_ids = torch.zeros(1, dtype=torch.int32)

# Perform context encoding and return all logits for each input token
logits = neuron_model(input_ids, cache_ids, start_ids)
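
If the goal is lm-eval-harness style scoring, a rough sketch of using these logits (assuming they come back in the same [batch, context_length, vocab] layout as the Hugging Face output discussed above):

import torch
import torch.nn.functional as F

# Sketch: per-token and summed log-likelihood of the prompt,
# plus a greedy-match check (as used by lm-eval-harness loglikelihood)
log_probs = F.log_softmax(logits, dim=-1)
targets = input_ids[:, 1:]                       # tokens to be predicted
token_log_probs = torch.gather(
    log_probs[:, :-1, :], 2, targets.unsqueeze(-1)
).squeeze(-1)
sequence_log_likelihood = token_log_probs.sum(dim=-1)
is_greedy = (log_probs[:, :-1, :].argmax(dim=-1) == targets).all(dim=-1)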

Please let us know if this provides the behavior you are looking for.
