
Inferring logits from model.forward for the entire batch instead of the last forward's output. #73

Open
michaelfeil opened this issue Jan 10, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@michaelfeil

michaelfeil commented Jan 10, 2024

I am trying to retrieve the logits from the model, to use them with lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/blob/692e0f83b5341b543fa288f84289617f793e4e93/lm_eval/models/huggingface.py#L972

Huggingface transformers

In transformers I can get the logits from the forward pass:

# for models from transformers.AutoModelForCausalLM
# inps.shape = torch.Size([2, 205]) # we are running Batch size 2
# two sequences with context
mylogits = self.hf_model(
    input_ids=inps,  # attention_mask=attn_mask, labels=labels
).logits
# mylogits.shape = torch.Size([2, 205, 32000]) # llama2 has a 32000-token vocabulary
# [batch, padding_length, vocab]

# after a log-softmax, these give the log-likelihood for each token, from which we can infer the certainty
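
(For context, downstream lm-eval-harness turns these logits into per-token log-likelihoods roughly like this; a rough sketch using the tensors above, the actual harness code differs in details:)

# Rough sketch: full-sequence logits -> per-token log-likelihoods
# (tensor names as above; illustrative only)
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(mylogits, dim=-1)              # [batch, padding_length, vocab]
# log-likelihood of each actual next token (shifted by one position)
token_log_likelihood = torch.gather(
    log_probs[:, :-1, :], 2, inps[:, 1:].unsqueeze(-1)
).squeeze(-1)                                            # [batch, padding_length - 1]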

transformers-neuronx

out = self.neuron_model(
    inps,  # attention_mask=attn_mask, labels=labels
)
# returns torch.Size([2, 32000]): only the logits for the last position of each sequence, and already passed through a (log-)softmax.

In simple PyTorch terms

# What I get is equivalent to this:
neuron_model(batched_inps) =~= F.log_softmax(self.hf_model(batched_inps).logits[:,-1,:], dim=-1)
# But I want to compute `the_magic_function` such that:
self.hf_cuda_model(batched_inps).logits[:,:,:] =?= neuron_model.the_magic_function(batched_inps)

Update: 1/11

I got this to work. However, I am using the undocumented cache_ids feature.

The output seems correct, but the code is terribly slow. My laptop GPU (RTX 3060M) runs TinyLlama-1.1B around 25x faster.

def logits_hf(input_ids):
    with torch.inference_mode():
        return self.hf_cuda_model(input_ids).logits

def logits(input_ids):
    """
    Get logits for the entire sequence.

    :param input_ids: torch.Tensor
        A torch tensor of shape [batch, sequence_length];
        the sequence length may vary from call to call
    :return:
        A torch tensor of shape [batch, sequence, vocab] with the
        logits returned from the model's decoder
    """
    _, sequence_length = input_ids.shape

    with torch.inference_mode():
        # feed the sequence one token at a time, advancing the KV cache position via cache_ids
        cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
        input_ids_split = input_ids.split(1, dim=1)

        return torch.stack(
            [
                self.neuron_model(input_ids=input_id, cache_ids=cache_id)
                for input_id, cache_id in zip(input_ids_split, cache_ids)
            ],
            dim=1,
        )
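
A rough sanity check against the HF path (hypothetical `tokenizer`; per the note above, the Neuron output is already log-softmaxed, so the HF side is normalized the same way before comparing):

import torch
import torch.nn.functional as F

# Rough sanity check: per-step Neuron output vs. the log-softmaxed HF forward pass
input_ids = tokenizer(["first prompt", "second prompt"],
                      return_tensors="pt", padding=True).input_ids
neuron_out = logits(input_ids)
hf_out = F.log_softmax(logits_hf(input_ids).cpu(), dim=-1)
print(torch.allclose(neuron_out, hf_out, atol=1e-2))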

Using LlamaForSampling, an inf2.8xlarge instance, tp_degree=2, Neuron 2.15.9

self.neuron_model = LlamaForSampling.from_pretrained(..)
self.neuron_model.to_neuron()
@micwade-aws

Thanks for reporting @michaelfeil - we'll get back to you soon.

@michaelfeil
Author

michaelfeil commented Jan 13, 2024

@micwade-aws Thanks, looking forward to your answer. FYI @jimburtoft, re: our discussion today.

@zhouku92

zhouku92 commented Feb 4, 2024

+1 on this thread. Furthermore, is there any way to get the hidden states of the last layer?

@jluntamazon
Contributor

@michaelfeil Here is one thing you could try:

To return model forward scores during inference, you can use the HuggingFaceGenerationModelAdapter. This wrapper supports the Hugging Face generate() API functionality, including the ability to return model forward scores. The only behavioral difference that you may notice is that we only produce scores for the final token in the prompt (rather than a score for each prompt token).

Here is an example of how to use this wrapper to access the model forward scores:

# Import the generation adapter (lives in transformers_neuronx.generation_utils in current releases)
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

# Model config object
config = ...

# Create your Neuron model
neuron_model = ...
# Compile your Neuron model
neuron_model.to_neuron()

# Create the Hugging Face wrapper model
neuron = HuggingFaceGenerationModelAdapter(config, neuron_model)

# Run inference using the Hugging Face generate API
# Pass in `output_scores=True, return_dict_in_generate=True` to return the scores
result = neuron.generate(inputs, ..., output_scores=True, return_dict_in_generate=True)

# Retrieve the tokens
tokens = result.sequences

# Retrieve the scores
scores = result.scores
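
If you need log-probabilities rather than raw scores, a rough sketch (assuming `result.scores` follows the usual Hugging Face layout of one [batch, vocab] tensor per generated step):

import torch
import torch.nn.functional as F

# Stack the per-step score tensors into [batch, generated_length, vocab]
# and convert them to log-probabilities
step_scores = torch.stack(result.scores, dim=1)
step_log_probs = F.log_softmax(step_scores, dim=-1)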

For additional information about the HuggingFaceGenerationModelAdapter wrapper, see the transformers-neuronx developer guide documentation.

Let me know if this solves the original issue.

@michaelfeil
Author

@jluntamazon Thanks for your response! My issue was more about getting the logits for the whole sequence, specifically to estimate the metrics for lm-eval-harness.

@hannanjgaws
Contributor

Hi @michaelfeil:

We added the ability to return all input prompt context encoding logits in the 2.19 Release. This is enabled by setting output_all_logits=True in the NeuronConfig during Neuron model initialization.

Please note that the model.sample() and HuggingFaceGenerationModelAdapter.generate() APIs do not yet support returning all context encoding logits. For now, you must call the Neuron model directly to return the context encoding logits.

Here is an example of how to use output_all_logits=True to access the logits for all input tokens:

import torch
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig

# Original model checkpoint location
checkpoint = ...

# Create your Neuron model with output_all_logits=True to return all logits during inference
neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
    checkpoint,
    ...,
    neuron_config=NeuronConfig(..., output_all_logits=True)
)

# Compile your Neuron model
neuron_model.to_neuron()

# Prepare your inputs
input_ids = ...
_, context_length = input_ids.shape
cache_ids = torch.arange(0, context_length, dtype=torch.int32)
start_ids = torch.zeros(1, dtype=torch.int32)

# Perform context encoding and return all logits for each input token
logits = neuron_model(input_ids, cache_ids, start_ids)
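
If the goal is lm-eval-harness style scoring, a rough sketch of using these logits (assuming they come back in the same [batch, context_length, vocab] layout as the Hugging Face output discussed above):

import torch
import torch.nn.functional as F

# Sketch: per-token and summed log-likelihood of the prompt,
# plus a greedy-match check (as used by lm-eval-harness loglikelihood)
log_probs = F.log_softmax(logits, dim=-1)
targets = input_ids[:, 1:]                       # tokens to be predicted
token_log_probs = torch.gather(
    log_probs[:, :-1, :], 2, targets.unsqueeze(-1)
).squeeze(-1)
sequence_log_likelihood = token_log_probs.sum(dim=-1)
is_greedy = (log_probs[:, :-1, :].argmax(dim=-1) == targets).all(dim=-1)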

Please let us know if this provides the behavior you are looking for.
