Great work! I'd recommend including the "max_length=77" parameter in your example, and it seems like the huggingface hosted interface is broken because of the tokenizer. Also, I think your website link on X is outdated.
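For anyone else trying it locally, here's roughly what I mean — a minimal sketch assuming the model is a T5-style text2text model on the Hub (the repo id below is a placeholder, not the real one):

    from transformers import pipeline

    # Hypothetical repo id -- substitute the actual model name.
    upsampler = pipeline("text2text-generation", model="author/sdxl-prompt-upsampler")

    short_prompt = "a christmas cake on a wooden table"
    # 77 tokens matches the CLIP text-encoder context window SDXL uses.
    out = upsampler(short_prompt, max_length=77, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])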
Eh? Perhaps this is cultural? I wasn’t aware Christmas cake was a thing, but if I had to guess it would just be normal cake with green, red, and white colors/icing and Christmas-themed decorations. Sure enough, that’s most of what shows up on Google image search. It’s also exactly what SD-XL outputs in my (limited) testing. It doesn’t surprise _me_ too much that a text to image model struggles with that concept because it feels rare and underspecified. Having said that, I live in the US south and maybe I’m just ignorant of all that. We mostly eat various pies here.
This is neat and something (aka a text "expander") that I imagine a lot of the commercial offerings (Midjourney, etc.) are using behind the scenes.
This seems to be targeting SDXL workflows, but in my experience a lot of the custom checkpoints derived from SDXL can have widely divergent recommended prompting styles ranging from natural language to just a list of booru tags.
So I'm guessing this is really only optimized for base SDXL, but I would be curious to see how well it worked on some of the more SOTA SDXL checkpoints such as juggernaut and unstable.
I haven't tested extensively with non-SDXL-based checkpoints, but there's nothing really SDXL-specific about the model; if you're using a fine-tune that's trained on booru-style tags it will probably not work as well, but otherwise it should work just fine. And in that case, just fork the project and tune it on whatever prompting style your fine-tune responds to best :)
I'm surprised this isn't getting more love. I love the concept of finetuned, hyper-specific, tiny LLMs. Of course, the data is the most important part.
Thanks for the kind words! I started with the 780M param flan-t5-large model, and kept trying smaller and smaller base models - I was shocked at how good the output was at 77M. As you go smaller, though, it's much easier to accidentally overfit or collapse the model and produce gibberish. Had to be very careful with hyperparams and sanitizing / filtering the dataset.
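If anyone wants to reproduce something similar, the recipe is roughly the sketch below — not my actual training script, and the dataset file and column names are made up:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    model_name = "google/flan-t5-small"  # ~77M params
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Hypothetical dataset of (short prompt, expanded prompt) pairs.
    ds = load_dataset("json", data_files="short_to_long_prompts.jsonl")["train"]

    def tokenize(batch):
        inputs = tokenizer(batch["short_prompt"], max_length=77, truncation=True)
        labels = tokenizer(text_target=batch["long_prompt"], max_length=77, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

    # Small models collapse into gibberish easily; keep the LR modest and watch eval loss.
    args = Seq2SeqTrainingArguments(
        output_dir="prompt-upsampler",
        learning_rate=1e-4,
        per_device_train_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()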
You could definitely use this for upsampling negative prompts, though I haven't tested that much. In theory, future T2I models shouldn't need to be negatively prompted as much; I find it's better to focus on really high quality positive prompts, as that is closer to the captions the model was trained on.
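To make that concrete, here's roughly how an upsampled prompt plugs into a stock diffusers SDXL pipeline — the prompt strings are just illustrative, not real upsampler output:

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Illustrative upsampled positive prompt; the negative prompt stays short.
    prompt = ("a festive christmas cake with white icing, red and green sugar holly, "
              "gold candles, soft window light, shallow depth of field, food photography")

    image = pipe(
        prompt=prompt,
        negative_prompt="blurry, low quality",
        num_inference_steps=30,
        guidance_scale=7.0,
    ).images[0]
    image.save("cake.png")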
I was reading a blog today[1] that was pretty confident that "continual orders-of-magnitude increases in compute usage [by AI] will utterly drown any changes in efficiency", but this is just one of a million ways we can make AI more efficient. It doesn't seem like a foregone conclusion that costs will get orders of magnitude more expensive on every axis.
Yup, the model will still forget details sometimes. This is a common issue with prompt upsampling methods, but I'm hoping to improve this with the next version.