SuperPrompt: Better Text to Image Prompts in 77M Parameters (brianfitzgerald.xyz)
150 points by roborovskis on March 14, 2024 | 31 comments


Great work! I'd recommend including the "max_length=77" parameter in your example, and it seems like the Hugging Face-hosted interface is broken because of the tokenizer. Also, I think your website link on X is outdated.
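For anyone else trying it, here's roughly what usage with that parameter looks like. (This is a sketch: the model id and the prompt prefix are my guesses from the post, not verified.)

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    # Model id and prompt prefix are assumptions based on the post.
    tokenizer = T5Tokenizer.from_pretrained("roborovski/superprompt-v1")
    model = T5ForConditionalGeneration.from_pretrained("roborovski/superprompt-v1")

    input_text = "Expand the following prompt to add more detail: a rainbow penguin in a tuxedo"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids

    # max_length=77 keeps the output within the CLIP text-encoder token
    # limit that SDXL prompts get truncated to.
    outputs = model.generate(input_ids, max_length=77)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))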


will fix these, thanks for the heads up!


I have yet to find a text-to-image generator that works. Whenever I try to generate an image, for example something like a Christmas cake, I get

Google image search: The results are pictures of that nice dark cake filled with currants, raisins and cherries.

TTI: Have a cheesecake, or perhaps a gingerbread house?

They seem to be able to do all sorts of great images, but never what I want.


What is a Christmas cake?

You are asking it to do something humans can't.

"Christmas cake" is just not enough context, and it isn't a "thing".

I would give you a picture of a Christmas-themed cake.

This is a larger issue with using AI in general. You have to be able to communicate efficiently.

Imagine asking a random person for a Christmas cake. Would you really expect to find the right thing?


> You are asking it to do something humans can't.

What are you talking about? Google can do it.

https://postimg.cc/mhn6SPVC

For reference:

https://www.bbcgoodfood.com/recipes/make-mature-christmas-ca...

Fruit cake is also something that these image generation tools cannot create. They come up with all sorts of sponge cake monstrosities.


Uhh, Google showed you a bunch of different things that a bunch of different people call Christmas cakes?

Again, fruit cake lacks all context. The definition in your head is not contained in the word itself.

This is why people get confused even talking to each other…


Eh? Perhaps this is cultural? I wasn’t aware Christmas cake was a thing but if I had to guess it would just be normal cake with green, red, white colors/icing and Christmas themed decorations. Sure enough, that’s most of what shows up on Google image search. It’s also exactly what SD-XL outputs in my (limited) testing. It doesn’t surprise _me_ too much that a text to image model struggles with that concept because it feels rare and under specified. Having said that, I live in the US south and maybe I’m just ignorant to all that. We mostly eat various pies here.


This is neat, and something (a.k.a. text "expanders") that I imagine a lot of the commercial offerings (Midjourney, etc.) are using behind the scenes.

This seems to be targeting SDXL workflows, but in my experience a lot of the custom checkpoints derived from SDXL can have widely divergent recommended prompting styles ranging from natural language to just a list of booru tags.

So I'm guessing this is really only optimized for base SDXL, but I would be curious to see how well it worked on some of the more SOTA SDXL checkpoints such as Juggernaut and Unstable.


I haven't tested extensively with non-SDXL-based checkpoints, but there's nothing really SDXL-specific about the model. If you're using a fine-tune that's trained on booru-style tags, it will probably not work as well, but otherwise it should work just fine. And in that case, just fork the project and tune it on whatever prompting style your fine-tune responds to best :)


I'm surprised this isn't getting more love. I love the concept of finetuned, hyper-specific, tiny LLMs. Of course, the data is the most important part.


Thanks for the kind words! I started with the 780M param flan-t5-large model, and kept trying smaller and smaller base models - I was shocked at how good the output was at 77M. As you go smaller, though, it's much easier to accidentally overfit or collapse the model and produce gibberish. Had to be very careful with hyperparams and sanitizing / filtering the dataset.
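A minimal sketch of what this kind of fine-tune looks like with transformers, for anyone curious. (The hyperparameters and the toy example pair here are illustrative, not the ones actually used for SuperPrompt.)

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    base = "google/flan-t5-small"  # ~77M params, per the post
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSeq2SeqLM.from_pretrained(base)

    # Toy stand-in for the filtered (short prompt -> upsampled prompt) pairs.
    pairs = Dataset.from_dict({
        "prompt": ["a cat on a chair"],
        "upsampled": ["a fluffy ginger cat curled up on a worn leather "
                      "armchair, soft window light, shallow depth of field"],
    })

    def tokenize(batch):
        inputs = tokenizer(batch["prompt"], truncation=True, max_length=77)
        labels = tokenizer(text_target=batch["upsampled"],
                           truncation=True, max_length=77)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="superprompt-ft",
            learning_rate=1e-4,  # small models collapse easily; keep the LR modest
            num_train_epochs=3,
            per_device_train_batch_size=8,
        ),
        train_dataset=train,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()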


Is the lack of training data the only thing preventing this approach from being applied to both positive and negative prompts together?

What size data set is actually needed? Does it need to be machine generated or can you get away with something smaller, perhaps crowdsourced?


You could definitely use this for upsampling negative prompts, though I haven't tested that much. In theory, future T2I models shouldn't need to be negatively prompted as much; I find it's better to focus on really high quality positive prompts, as that is closer to the captions the model was trained on.

You can take a look at the dataset here: https://huggingface.co/datasets/roborovski/upsampled-prompts... Roughly 5k samples were needed for the smaller ones at a minimum, filtered from the 95k total generated.
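If you want to poke at it, loading it with the datasets library is one-liner territory. (The dataset id below is a guess, since the link above is truncated.)

    from datasets import load_dataset

    # Dataset id is a guess -- the link above is truncated.
    ds = load_dataset("roborovski/upsampled-prompts-parti", split="train")
    print(len(ds), ds[0])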


Awesome work! I’d love to see how this could be integrated with existing tools like InvokeAI.


I played with InvokeAI and liked it quite a bit, but it didn't even let me recall my last prompt (so keeping a prompt history would be loads better).


The only way to do that I know of is to right click and select 'use prompt' from an image you generated.


wow, that's a start, thanks!


As Invoke is open-source and already has transformers as a dependency, it should be pretty easy to add.


Discussing with the maintainers on their discord now :) https://discord.com/channels/1020123559063990373/10201235598...


I miss when such discussions happened in GitHub issues.


You're preaching to the choir!


Nice. I've been using GPT-4-turbo with a custom system prompt for this until now. Going to try this out.
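For reference, that setup looks roughly like this. (A sketch only: the system prompt below is an illustrative stand-in, not the commenter's actual one.)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative system prompt, not the custom one mentioned above.
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's text-to-image prompt with vivid, "
                        "concrete visual detail. Keep it under 75 tokens."},
            {"role": "user", "content": "a rainbow penguin in a tuxedo"},
        ],
    )
    print(resp.choices[0].message.content)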


It’s impressive how well the T5 family of models has aged, even compared to newer LLM architectures.


Encoder-decoder vs. decoder-only.


Going to go through the process of getting this running on CNVRS. Would love it if someone could make some GGUFs.

Or really any of the current TestFlight LocalLLaMA projects.

This is really what I am looking for: small, specific models that run perfectly on my phone.


I was reading a blog today[1] that was pretty confident that "continual orders-of-magnitude increases in compute usage [by AI] will utterly drown any changes in efficiency", but this is just one of a million ways we can make AI more efficient. It doesn't seem like a foregone conclusion that the costs will get orders of magnitude more expensive on every axis.

1: Paywalled: https://www.noahpinion.blog/p/three-threats-to-the-age-of-en...


Doesn't Fooocus do something similar?



> Left: Drawbench prompt "A rainbow penguin in a tuxedo". Right: SDXL output with SuperPrompt applied to the same input prompt.

Neither is wearing a tuxedo.


Yup, the model will still forget details sometimes. This is a common issue with prompt upsampling methods, but I'm hoping to improve this with the next version.


I wonder how much of that could be due to "tuxedo penguin" being a thing



