Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

regex search is slow and difficult to index efficiently.


It's plenty fast for many purposes. I grep large source code repos all the time. It's very fast as long as you avoid grepping large binary files for no reason.

The link below shows timings for various regex on a 20MiB text file for several different regex softwares. The best code never exceeds half a second for even the worst expressions. Most are much faster.

http://sljit.sourceforge.net/regex_perf.html


For most reasonable search regexes, it's reasonably fast if your repository is already cached in memory. However, it's still fairly heavyweight, and the cost of running even these reasonable searches hundreds or thousands of times a second becomes significant.

When you add to this the fact that it's possible to deliberately construct regexes that are much more complicated to evaluate (particularly if you allow extensions beyond a true regular language) - potentially being a DoS target - and many repos won't be cached when the search comes in (or caching them would also be very expensive), the costs just get higher.

In contrast, a simple keyword search can be performed with a predictable, low cost, so it's no wonder that they'd prefer to implement this instead.


>When you add to this the fact that it's possible to deliberately construct regexes that are much more complicated to evaluate (particularly if you allow extensions beyond a true regular language) - potentially being a DoS target

Only if you allow extensions, but then they aren't regular expressions.


Backreferences aren't really an extension though, are they?

Re2 excludes them specifically .


Backreferences are certainly an extension. They are the primary reason that regular expression engines that allow them are slow.


Ahhh, "extension" meaning not in the POSIX basic definition. Sure. Though I don't know many tools that implement REs in that basic form. They almost all include features outside that definition.


No, extension as in it is not regular. Nothing to do with POSIX. RE2, Go and Rust do not provide backreference support in exchange for predictable performance.


There are plenty of regex engines that are guaranteed O(n) for all regexes they accept. There's re2, go-regex, and rust-regex just to list a few popular ones.

They still use Perl-like syntax, and might still be able to parse non-regular languages, but you don't need to worry about users deliberately crafting regexes to eat up CPU time like PCRE.


Do you remember Google code search? Difficult yes, but definitely not slow. This seems like it would be a high value feature as well.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: