Allow split T5 & CLIP prompts for flux & add a separate T5 token counter #1906

Open
wants to merge 1 commit into
base: main


@wbclark wbclark commented Sep 24, 2024

CLIP and T5 process language very differently, starting with how they tokenize input. T5 is case-sensitive, while CLIP is case-insensitive. T5 prefixes tokens that begin a word with "▁", whereas CLIP appends </w> to tokens that end a word. T5's vocabulary also extends into other languages, such as French and German, while CLIP is more likely to have a single token for internet-specific terms like "womancrushwednesday" or "hamillhimself".

So to get the most out of a dual encoder model architecture like Flux, it's valuable to be able to pass separate prompts to the T5 and CLIP encoders to leverage their respective strengths and exert fine-grained artistic control over the conditioning.

This commit introduces the SPLIT keyword for Flux models. When present, everything before SPLIT is processed by T5, while everything after is processed by CLIP. Additionally, an extra token counter has been added to the UI, helping users understand how each encoder interprets their prompts.
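For readers who want a concrete picture of the keyword behaviour described above, here is a minimal sketch. It is not the code from the commit: the helper name `split_prompt`, the whole-word and case-sensitive matching, and the whitespace trimming are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Hypothetical helper: the name and the exact matching rules are assumptions,
# not taken from the commit. SPLIT is matched as a whole, upper-case word.
_SPLIT = re.compile(r"\s*\bSPLIT\b\s*")


def split_prompt(prompt: str) -> Tuple[str, str]:
    """Return (t5_prompt, clip_prompt) for a Flux prompt."""
    parts = _SPLIT.split(prompt, maxsplit=1)
    if len(parts) == 1:
        # No keyword present: both encoders receive the same text.
        return prompt, prompt
    t5_prompt, clip_prompt = parts
    return t5_prompt, clip_prompt


if __name__ == "__main__":
    print(split_prompt("a revolution of ideas SPLIT neon city"))
    print(split_prompt("a revolution of ideas"))
```

Splitting only on the first occurrence keeps the rule simple; how the actual change treats a second SPLIT is not stated here.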

Example images forthcoming.

@wbclark wbclark requested a review from lllyasviel as a code owner September 24, 2024 19:26
@wbclark
Author

wbclark commented Sep 24, 2024

All examples are generated with Flux 1 Dev using the same seed and sampling parameters: euler sampler with simple scheduler, 20 steps, 896x1152 resolution, distilled CFG scale 3.5, CFG scale 1, & seed 65432.

The prompts demonstrate some of the differences in T5 and CLIP tokenization, as seen in this debug output from the 2nd example below (SPLIT case):

T5_PARSED: [['WARRANTkulturelle shift in the fabric of society, a revolution of ideas', 1.0]]
T5_TOKENIZED: [[25732, 25739, 4108, 16, 8, 3002, 13, 2710, 6, 3, 9, 9481, 13, 912]]
T5_TOKENS: [['▁WARRANT', 'kulturelle', '▁shift', '▁in', '▁the', '▁fabric', '▁of', '▁society', ',', '▁', 'a', '▁revolution', '▁of', '▁ideas']]

CLIP_PARSED: [['graffiti-covered walls of a futuristic city with neon lights and dystopian architecture', 1.0]]
CLIP_TOKENIZED: [[11676, 268, 5603, 8258, 539, 320, 30987, 1305, 593, 13919, 3073, 537, 38915, 4920]]
CLIP_TOKENS: [['graffiti</w>', '-</w>', 'covered</w>', 'walls</w>', 'of</w>', 'a</w>', 'futuristic</w>', 'city</w>', 'with</w>', 'neon</w>', 'lights</w>', 'and</w>', 'dystopian</w>', 'architecture</w>']]
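To make the two word-boundary conventions in this debug output easier to compare, the sketch below joins each token list back into text using only the markers shown above. The function names are made up for illustration, and this is a simplification rather than either tokenizer's real decoding routine.

```python
# Token lists copied from the debug output above.
T5_TOKENS = ['▁WARRANT', 'kulturelle', '▁shift', '▁in', '▁the', '▁fabric',
             '▁of', '▁society', ',', '▁', 'a', '▁revolution', '▁of', '▁ideas']
CLIP_TOKENS = ['graffiti</w>', '-</w>', 'covered</w>', 'walls</w>', 'of</w>',
               'a</w>', 'futuristic</w>', 'city</w>', 'with</w>', 'neon</w>',
               'lights</w>', 'and</w>', 'dystopian</w>', 'architecture</w>']


def join_t5(tokens):
    # T5 marks the START of a word with "▁"; unmarked tokens continue a word.
    return "".join(tokens).replace("▁", " ").strip()


def join_clip(tokens):
    # CLIP marks the END of a word with "</w>".
    return "".join(tokens).replace("</w>", " ").strip()


if __name__ == "__main__":
    print(len(T5_TOKENS), join_t5(T5_TOKENS))
    print(len(CLIP_TOKENS), join_clip(CLIP_TOKENS))
```

Both prompts come to 14 tokens. The naive CLIP join yields "graffiti - covered" because the hyphen is its own end-of-word token, whereas the T5 join reproduces the parsed prompt exactly, including the case-preserved "WARRANT" and the German continuation piece "kulturelle".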

Example 1:

prompt_counters_1

image_1

Example 2:

prompt_counters_2

image_2

Example 3:

prompt_counters_3

image_3

@anae-git

https://github.com/DenOfEquity/forgeFlux_dualPrompt

@wbclark
Author

wbclark commented Sep 25, 2024

Hey @DenOfEquity! I noticed your forgeFlux_dualPrompt extension and saw that we're working on similar ideas around multi-prompt support. I recently submitted this PR for dual prompt support and a T5 token counter for Flux models, and I'm also thinking about future improvements such as providing a complete tokenization breakdown for the user. That would be beneficial for users of single encoder models too, although it's an interesting challenge to do that in a way that's not disruptive to the existing interface...

Anyway, I'd love to hear your thoughts on the approach in this PR and if you're interested in collaborating. Before I started working on this, I looked through past issues and PRs here but didn’t see anything similar—so I'm curious now if this was something you and @lllyasviel had previously discussed, especially in terms of native support vs. an extension?

@DenOfEquity
Collaborator

Hi. I haven't had time to test this yet, and I'd also like to see if lllyasviel has new opinions on it. IMO this keyword method is ideal as there's no major UI changes.
There's some brief discussion in #1182, where lllyasviel wasn't sure if direct inclusion was worth it and my brief testing showed some potential but not definitive value. I think the best comparison is between {prompt1} {prompt2} and {prompt1} SPLIT {prompt2}.

Method should be extensible to at least sdxl (clip-l, clip-g).
Does it work with the prompt-bracket-checker builtin extension?

@wbclark
Author

wbclark commented Sep 27, 2024

Thanks for the initial feedback.

> IMO this keyword method is ideal as there's no major UI changes.

I agree, and there is no API change either. It all flows through the same prompt argument as before, and there is no change if the SPLIT keyword is not provided. Plus, users are already familiar with this paradigm (AND, BREAK, etc)... but mostly, I implemented it this way so that the change would be small and not increase maintenance burden.

> There's some brief discussion in #1182, where lllyasviel wasn't sure if direct inclusion was worth it and my brief testing showed some potential but not definitive value. I think the best comparison is between {prompt1} {prompt2} and {prompt1} SPLIT {prompt2}.

One reason I included the tokenization breakdown in my example is that I think it makes it much easier to see why we should care about the ability to specify different prompts per encoder.

I've been working with print debugging to see the tokenization in the terminal, but I want to find the right way to show this to the user, because understanding this will help them to write better prompts (even when using single encoder models).

Two designs I'm thinking about -- 1. show prompt tokenization on mouseover of the token counter, or 2. add a new accordion (collapsed by default to minimize visual disruption) which shows the tokenization once expanded.

I think another good comparison to try is simply prompt vs. prompt SPLIT vs. SPLIT prompt, which helps to understand how truly different the usage of each prompt is in the Flux model architecture.
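The comparisons discussed in this thread can be laid out as a small test matrix. The sketch below only builds prompt strings and pairs them with the fixed sampling settings quoted earlier; the function names, labels, and dictionary keys are hypothetical, and it does not call any Forge API. The "t5_only" and "clip_only" labels assume that an empty side of SPLIT leaves that encoder with an empty prompt.

```python
from typing import Dict, List

# Sampling settings quoted earlier in this thread; the key names are made up.
FIXED_SETTINGS = {
    "sampler": "euler", "scheduler": "simple", "steps": 20,
    "width": 896, "height": 1152,
    "distilled_cfg_scale": 3.5, "cfg_scale": 1, "seed": 65432,
}


def comparison_prompts(prompt1: str, prompt2: str) -> Dict[str, str]:
    """Build the prompt variants suggested in this conversation."""
    return {
        "joined": f"{prompt1} {prompt2}",        # {prompt1} {prompt2}
        "split": f"{prompt1} SPLIT {prompt2}",   # {prompt1} SPLIT {prompt2}
        "both_encoders": prompt1,                # prompt
        "t5_only": f"{prompt1} SPLIT",           # prompt SPLIT
        "clip_only": f"SPLIT {prompt1}",         # SPLIT prompt
    }


def build_jobs(prompt1: str, prompt2: str) -> List[dict]:
    # Same seed and settings for every variant, so only the prompt differs.
    variants = comparison_prompts(prompt1, prompt2)
    return [dict(FIXED_SETTINGS, label=label, prompt=text)
            for label, text in variants.items()]


if __name__ == "__main__":
    for job in build_jobs("a revolution of ideas", "neon city"):
        print(job["label"], "->", job["prompt"])
```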

Anyway, since this project aims to be more research focused, I think there is also a good argument for exposing the functionality to users, as they may discover some methods and applications that we did not consider.

> Method should be extensible to at least sdxl (clip-l, clip-g).

I haven't experimented with that yet, and while my gut intuition is that it might be less valuable than the Flux case due to architectural differences, it would still be a very small change to implement as you are pointing out.

Thinking about it also has me rethinking a few things, like naming the token counters primary/secondary instead of CLIP/T5, and adding the additional counter to the negative prompt as well.

That might actually simplify things even further -- it may even be possible to treat both counters as a single UI element and make almost all of the changes in backend/ and javascript/, further limiting the necessary change in modules/. I will investigate this approach soon.

Plus some of my tests with setting CFG != 1 on Flux have me thinking perhaps negative prompts aren't totally DoA there after all.

> Does it work with the prompt-bracket-checker builtin extension?

No, it requires a small adjustment... and I should also test how embeddings are handled just to make sure. Thanks for pointing this out.

3 participants