Great work! I'd recommend including the "max_length=77" parameter in your example, and it seems like the huggingface hosted interface is broken because of the tokenizer. Also, I think your website link on X is outdated.
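For anyone else trying it locally, here's roughly what I mean — a minimal sketch assuming the model is a T5-style text2text model on the Hub (the repo id below is a placeholder, not the real one):

    from transformers import pipeline

    # Hypothetical repo id -- substitute the actual model name.
    upsampler = pipeline("text2text-generation", model="author/sdxl-prompt-upsampler")

    short_prompt = "a christmas cake on a wooden table"
    # 77 tokens matches the CLIP text-encoder context window SDXL uses.
    out = upsampler(short_prompt, max_length=77, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])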
Eh? Perhaps this is cultural? I wasn’t aware Christmas cake was a thing, but if I had to guess it would just be normal cake with green, red, and white colors/icing and Christmas-themed decorations. Sure enough, that’s most of what shows up on Google image search. It’s also exactly what SD-XL outputs in my (limited) testing. It doesn’t surprise _me_ too much that a text to image model struggles with that concept because it feels rare and underspecified. Having said that, I live in the US south and maybe I’m just ignorant of all that. We mostly eat various pies here.
This is neat and something (aka a text "expander") that I imagine a lot of the commercial offerings (Midjourney, etc.) are using behind the scenes.
This seems to be targeting SDXL workflows, but in my experience a lot of the custom checkpoints derived from SDXL can have widely divergent recommended prompting styles ranging from natural language to just a list of booru tags.
So I'm guessing this is really only optimized for base SDXL, but I would be curious to see how well it worked on some of the more SOTA SDXL checkpoints such as juggernaut and unstable.
I haven't tested extensively with non-SDXL-based checkpoints, but there's nothing really SDXL-specific about the model; if you're using a fine-tune that's trained on booru-style tags it will probably not work as well, but otherwise it should work just fine. And in that case, just fork the project and tune it on whatever prompting style your fine-tune responds to best :)
I'm surprised this isn't getting more love. I love the concept of finetuned, hyper-specific, tiny LLMs. Of course, the data is the most important part.
Thanks for the kind words! I started with the 780M param flan-t5-large model, and kept trying smaller and smaller base models - I was shocked at how good the output was at 77M. As you go smaller, though, it's much easier to accidentally overfit or collapse the model and produce gibberish. Had to be very careful with hyperparams and sanitizing / filtering the dataset.
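If anyone wants to reproduce something similar, the recipe is roughly the sketch below — not my actual training script, and the dataset file and column names are made up:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    model_name = "google/flan-t5-small"  # ~77M params
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Hypothetical dataset of (short prompt, expanded prompt) pairs.
    ds = load_dataset("json", data_files="short_to_long_prompts.jsonl")["train"]

    def tokenize(batch):
        inputs = tokenizer(batch["short_prompt"], max_length=77, truncation=True)
        labels = tokenizer(text_target=batch["long_prompt"], max_length=77, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

    # Small models collapse into gibberish easily; keep the LR modest and watch eval loss.
    args = Seq2SeqTrainingArguments(
        output_dir="prompt-upsampler",
        learning_rate=1e-4,
        per_device_train_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()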
You could definitely use this for upsampling negative prompts, though I haven't tested that much. In theory, future T2I models shouldn't need to be negatively prompted as much; I find it's better to focus on really high quality positive prompts, as that is closer to the captions the model was trained on.
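To make that concrete, here's roughly how an upsampled prompt plugs into a stock diffusers SDXL pipeline — the prompt strings are just illustrative, not real upsampler output:

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Illustrative upsampled positive prompt; the negative prompt stays short.
    prompt = ("a festive christmas cake with white icing, red and green sugar holly, "
              "gold candles, soft window light, shallow depth of field, food photography")

    image = pipe(
        prompt=prompt,
        negative_prompt="blurry, low quality",
        num_inference_steps=30,
        guidance_scale=7.0,
    ).images[0]
    image.save("cake.png")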
I was reading a blog today[1] that was pretty confident that "continual orders-of-magnitude increases in compute usage [by AI] will utterly drown any changes in efficiency", but this is just one of a million ways we can make AI more efficient. It doesn't seem like a foregone conclusion that costs will get orders of magnitude more expensive on every axis.
Yup, the model will still forget details sometimes. This is a common issue with prompt upsampling methods, but I'm hoping to improve this with the next version.