This brief document shows how one can calculate log probabilities (and variants such as surprisal) for sentences using autoregressive models such as gpt and gpt2. For demonstration purposes I will use gpt2 (small) from Huggingface and evaluate it on a number agreement task from the BLiMP dataset. This task specifically tests whether the model assigns greater (log) probability to "hasn't" than to "haven't" in pairs of stimuli such as (1) and (2):
(1) The sketch of those trucks hasn't
(2) The sketch of those trucks haven't
Converting this into a hypothesis dealing with surprisals, the model should be "more surprised" to see (2) than (1).
minicons helps in performing such experiments:
from minicons import scorer
import torch
from torch.utils.data import DataLoader
import numpy as np
import json
Incremental/Autoregressive models can be instantiated using:
# Warning: This will download a 550 MB model file if you do not already have it!
model = scorer.IncrementalLMScorer('gpt2', 'cpu')
or equivalently, using a manually constructed model:
from transformers import AutoModelForCausalLM, AutoTokenizer
gpt2 = AutoModelForCausalLM.from_pretrained('gpt2', return_dict=True)
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2', use_fast=True)
model = scorer.IncrementalLMScorer(gpt2, tokenizer=gpt2_tokenizer, device='cpu')
(You may customize the model and the tokenizer here.)
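For instance, a minimal sketch of swapping in a larger checkpoint and running on a GPU might look like the following (gpt2-medium and the CUDA device are purely illustrative choices; any causal LM checkpoint should work):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: a larger GPT-2 variant, scored on a GPU if one is available.
gpt2_medium = AutoModelForCausalLM.from_pretrained('gpt2-medium', return_dict=True)
gpt2_medium_tokenizer = AutoTokenizer.from_pretrained('gpt2-medium', use_fast=True)

model = scorer.IncrementalLMScorer(gpt2_medium, tokenizer=gpt2_medium_tokenizer, device='cuda')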
minicons allows you to compute token-by-token log-probabilities using the model.compute_stats() function, which accepts texts encoded by the model.prepare_text() function. compute_stats() accepts the following parameters:
batch [Iterable]: Input batch (list of sentences or single sentence)
rank [bool]: Whether the model should return ranks of each token (by probability)
base_two [bool]: Use base 2 for the log-prob
return_tensors [bool]: Whether the output should contain tensors.
Each value here represents the conditional log-probability -- log P(word | left context), so the first value represents the log-probability of the second word given the first.
logprobs = model.compute_stats(model.prepare_text("The sketch of those trucks hasn't"))
print(logprobs)
#[[-10.879678726196289, -2.5105514526367188, -6.6631927490234375, -8.962379455566406, -8.681724548339844, -0.0005340576171875]]
Note that you can also pass a batch of texts in a list format.
sentences = ["The sketch of those trucks hasn't", "The sketch of those trucks haven't"]
model.compute_stats(model.prepare_text(sentences))
# [[-10.879678726196289,
# -2.5105514526367188,
# -6.6631927490234375,
# -8.962379455566406,
# -8.681724548339844,
# -0.0005340576171875],
# [-10.879678726196289,
# -2.5105514526367188,
# -6.6631927490234375,
# -8.962379455566406,
# -10.669326782226562,
# -0.0013275146484375]]
To also get tokens in the output, use the following code. Note: minicons adds an additional 0.0 log-probability for the first token/word by convention.
model.token_score(sentences)
'''
[[('The', 0.0),
('sketch', -10.879678726196289),
('of', -2.5105514526367188),
('those', -6.6631927490234375),
('trucks', -8.962379455566406),
('hasn', -8.681724548339844),
("'t", -0.0005340576171875)],
[('The', 0.0),
('sketch', -10.879678726196289),
('of', -2.5105514526367188),
('those', -6.6631927490234375),
('trucks', -8.962379455566406),
('haven', -10.669326782226562),
("'t", -0.0013275146484375)]]
'''
For surprisals, pass surprisal = True to model.token_score() (pass base_two = True if you want surprisals in bits):
'''
[[('The', 0.0),
('sketch', 15.69605827331543),
('of', 3.621960163116455),
('those', 9.612955093383789),
('trucks', 12.929980278015137),
('hasn', 12.525080680847168),
("'t", 0.0007704822928644717)],
[('The', 0.0),
('sketch', 15.69605827331543),
('of', 3.621960163116455),
('those', 9.612955093383789),
('trucks', 12.929980278015137),
('haven', 15.392584800720215),
("'t", 0.0019151987507939339)]]
'''
You can also compute overall sentence scores using the model.sequence_score() function. By default it sums the token log probabilities and divides by the length (i.e., the number of tokens). To get the total log probability instead, pass reduction = lambda x: x.sum(0) as an argument (for surprisals, pass lambda x: -x.sum(0)):
model.sequence_score(["The sketch of those trucks hasn't", "The sketch of those trucks haven't"], reduction = lambda x: x.sum(0))
# Log probabilities of the sentences:
# [tensor(-37.6981), tensor(-39.6865)]
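For comparison, the default reduction returns the length-normalized score described above. A minimal sketch, reusing the same pair of sentences (output omitted, since the exact values depend on the token count used for normalization):

# Default reduction: summed log probability divided by the number of tokens.
model.sequence_score(["The sketch of those trucks hasn't", "The sketch of those trucks haven't"])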
Finally, minicons also facilitates large-scale experiments. For example, let's run our test of GPT2-small's behavior on the full number-agreement task from BLiMP:
stimuli = []
with open("distractor_agreement_relational_noun.jsonl", "r") as f:
    for line in f:
        row = json.loads(line)
        stimuli.append([row['one_prefix_prefix'] + " " + row['one_prefix_word_good'],
                        row['one_prefix_prefix'] + " " + row['one_prefix_word_bad']])
for pair in stimuli[:5]:
    print(f"{pair[0]} vs. {pair[1]}")
## A niece of most senators hasn't vs. A niece of most senators haven't
## The sketch of those trucks hasn't vs. The sketch of those trucks haven't
## A newspaper article about the Borgias has vs. A newspaper article about the Borgias have
## The niece of most guests has vs. The niece of most guests have
## A sketch of lights doesn't vs. A sketch of lights don't
stimuli_dl = DataLoader(stimuli, batch_size = 100)
good_scores = []
bad_scores = []
for batch in stimuli_dl:
    good, bad = batch
    good_scores.extend(model.sequence_score(good, reduction = lambda x: x.sum(0)))
    bad_scores.extend(model.sequence_score(bad, reduction = lambda x: x.sum(0)))
# Testing the extent to which GPT2-small shows patterns of number-agreement:
print(np.mean([g > b for g,b in zip(good_scores, bad_scores)]))
# 0.89
minicons also allows you to compute conditional log probabilities using the conditional_score function! The arguments for this are the same as in sequence_score, except that it now takes two text inputs: a batch of prefixes and a batch of queries. Let's say you wanted to compute the log-probability of "can fly" given "a robin" vs. "a penguin":
prefixes = ['a robin', 'a penguin']
queries = ['can fly'] * 2
model.conditional_score(prefixes, queries) # we will use the default reduction method, which computes log-probability per token
# [-4.762691497802734, -4.574714660644531]
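Since conditional_score takes the same reduction argument as sequence_score, you could also sum over the query tokens rather than averaging, to get the total log probability of each query given its prefix. A minimal sketch (output omitted):

# Total log P("can fly" | prefix), summed over the query tokens.
model.conditional_score(prefixes, queries, reduction = lambda x: x.sum(0))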
Sometimes you already have a BatchEncoding produced by the tokenizer (the same thing you get by calling model.encode(stimuli)). In this case, you can pass the BatchEncoding instance in place of the stimuli. For example, model.sequence_score(model.encode(stimuli), reduction = reduction) is equivalent to model.sequence_score(stimuli, reduction = reduction). This can be useful if you want to avoid repeating tokenization or if you want to use customized token sequences.
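A minimal sketch of this pattern, reusing the sentences batch from earlier (the variable names here are only illustrative):

# Tokenize the batch once ...
encoded = model.encode(sentences)

# ... then reuse the BatchEncoding across scoring calls, e.g. to try
# different reductions without re-tokenizing.
summed_log_probs = model.sequence_score(encoded, reduction = lambda x: x.sum(0))
mean_log_probs = model.sequence_score(encoded)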