How is this different from image captioning when the model used is a booru model? People already do that to build training data for fine-tuning these models.
It actually works on top of an image captioning model. SD also responds to keywords like "artstation" and "octane render", which standard captioning doesn't cover, so that's the difference between using an off-the-shelf captioning model and this.
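A minimal sketch of the idea: start from a plain caption, then rank candidate style keywords by how closely their embedding matches the image embedding and append the best ones. This is an assumption about how such tools work (CLIP-Interrogator-style keyword banks), not this project's actual API; the vectors below are toy stand-ins for real CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_prompt(base_caption, image_emb, keyword_embs, top_k=2):
    """Append the top_k style keywords whose embeddings best match the image."""
    ranked = sorted(keyword_embs,
                    key=lambda kw: cosine(image_emb, keyword_embs[kw]),
                    reverse=True)
    return ", ".join([base_caption] + ranked[:top_k])

# Toy embeddings standing in for real CLIP vectors.
image_emb = [0.9, 0.1, 0.3]
keyword_embs = {
    "artstation":    [0.8, 0.2, 0.4],
    "octane render": [0.1, 0.9, 0.0],
    "watercolor":    [0.0, 0.1, 0.9],
}
print(build_prompt("a castle on a hill", image_emb, keyword_embs))
# → a castle on a hill, artstation, watercolor
```

The point is that the caption model supplies the subject ("a castle on a hill") while the keyword ranking supplies the style tags SD was trained to respond to.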