
Is there a Bloom filter equivalent for efficiently testing whether a variable-length string is (or, more challenging, contains!) a substring of a very large string?

I think the classic Bloom filter is suited to exact-match queries but not directly to substring queries. E.g. you could put 500,000 names into the filter and it could tell you efficiently that "Jason Bourne" is probably one of those names, but not that "urn" is a component of one of them.
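
To make the exact-match behavior concrete, here's a minimal toy Bloom filter in Python. The sizing formulas are the standard ones; everything else (class name, hashing scheme) is just an illustrative sketch, not any particular library's API:

    import hashlib
    import math

    class BloomFilter:
        """Minimal Bloom filter: k hash probes into an m-bit array."""

        def __init__(self, num_items, fp_rate=0.01):
            # Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
            self.m = math.ceil(-num_items * math.log(fp_rate) / math.log(2) ** 2)
            self.k = max(1, round(self.m / num_items * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k probe positions from one SHA-256 digest (double hashing).
            d = hashlib.sha256(item.encode("utf-8")).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big")
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    names = BloomFilter(num_items=500_000)
    names.add("Jason Bourne")
    print("Jason Bourne" in names)  # True: no false negatives for inserted keys
    print("urn" in names)           # almost certainly False (~1% false-positive
                                    # chance); exact keys only, no substrings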

For the "is this output in the training data anywhere?" question, the most generally useful question might be somdthing like "are the last 200 tokens of output a verbatim substring of HUGE_TRAINING_STRING?".

A totally different challenge: presumably it's very often appropriate for some relatively large "popular" or "common" strings to actually be memorized and repeated on request. E.g., imagine asking a large language model for the text of the Lord's Prayer or the Pledge of Allegiance or the lyrics to some country's national anthem or something. The expected right answer is going to be that verbatim output.

If it weren't for copyright, this would probably also be true for many long strings that don't occur frequently in the training data, although it wouldn't be a high priority for model training because the LLM isn't a very efficient way to store tons of non-repetitive verbatim text.



Well, you could just slide a 10-token window over your dataset and insert each window (as a 10-token tuple) into the Bloom filter.
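
Concretely, that could look something like the sketch below, reusing the toy BloomFilter from the earlier comment. The "\x1f" join and the function names are assumptions of this sketch. Note that the final check is a necessary condition, not a sufficient one: every 10-token window of a 200-token output hitting the filter doesn't prove the windows all came from one contiguous span of the training data, and each hit can itself be a false positive.

    def index_ngrams(tokens, bloom, n=10):
        """Insert every sliding n-token window of the corpus into the filter."""
        for i in range(len(tokens) - n + 1):
            # "\x1f" (the ASCII unit separator) is a delimiter unlikely to
            # appear inside a token; any unambiguous encoding works.
            bloom.add("\x1f".join(tokens[i:i + n]))

    def output_probably_verbatim(output_tokens, bloom, n=10):
        """Necessary-but-not-sufficient test: True if every overlapping
        n-token window of the output (probably) occurs in the corpus."""
        if len(output_tokens) < n:
            return True  # too short to test at this granularity
        return all(
            "\x1f".join(output_tokens[i:i + n]) in bloom
            for i in range(len(output_tokens) - n + 1)
        )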



