No it's not. Its regex library is written in Rust, but was inspired by RE2. It shares no code with RE2. (And RE2 is a C++ library, not C.)
Off the top of my head, the only C code in ripgrep is optional integration with PCRE2. In addition to whatever libc is being used on POSIX platforms. Everything else is pure Rust.
I couldn't figure it out from looking through ripgrep's website: does ripgrep support intersection and complement of expressions? Like eg https://github.com/google/redgrep does.
Regular languages are closed under those operations after all.
No, it doesn't. It's only theoretically easy to implement. In practice, they explode the size of the underlying FSM. Moreover, in a command line tool, it's somewhat easy to work around that through the `-v` switch and shell pipelining.
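(For the curious: the blow-up for intersection is easy to see from the classic product construction, which runs both automata in lockstep, so the result can need up to |A| × |B| states. A toy sketch of that, nothing from ripgrep's actual code:)

```rust
use std::collections::{HashSet, VecDeque};

// A tiny DFA over the alphabet {0, 1}: state 0 is the start and
// `delta[state][sym]` gives the next state.
struct Dfa {
    delta: Vec<[usize; 2]>,
    #[allow(dead_code)]
    accept: HashSet<usize>,
}

// Product construction: simulate both DFAs in lockstep. The reachable
// product states are pairs (s1, s2), so the result can have up to
// |A| * |B| states -- this multiplicative growth is the cost of `&`.
fn product_states(a: &Dfa, b: &Dfa) -> usize {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([(0usize, 0usize)]);
    seen.insert((0, 0));
    while let Some((s1, s2)) = queue.pop_front() {
        for sym in 0..2 {
            let next = (a.delta[s1][sym], b.delta[s2][sym]);
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}

fn main() {
    // A: number of 0s is even (2 states). B: number of 1s divisible by 3 (3 states).
    let a = Dfa { delta: vec![[1, 0], [0, 1]], accept: HashSet::from([0]) };
    let b = Dfa { delta: vec![[0, 1], [1, 2], [2, 0]], accept: HashSet::from([0]) };
    // Their intersection needs all 2 * 3 = 6 pairs.
    println!("product states: {}", product_states(&a, &b));
}
```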
ripgrep's regex syntax is the same as Rust's regex crate: https://docs.rs/regex/1.4.4/regex/#syntax (Which is in turn similar to RE2, although it supports a bit more niceties.)
> No, it doesn't. It's only theoretically easy to implement.
Oh, I didn't say anything about easy! I am on and off working on a Haskell re-implementation (but with GADTs and in Oleg's tagless final interpreter style etc, so it's more about exploring the type system).
> In practice, they explode the size of the underlying FSM.
You may be right, but that's still better than the gymnastics you'd have to do by hand to get the same features out of a 'normal' regex.
> Moreover, in a command line tool, it's somewhat easy to work around that through the `-v` switch and shell pipelining.
Alas, that only works if your intersection or complement happens at the top level. You can't express an intersection or complement that's nested somewhere inside a larger expression that way.
Perhaps I'll try to implement a basic version of redgrep in Rust as an exercise. (I just want something that supports basically all the operations regular languages are closed under, but I don't care too much about speed, as long as the runtime complexity is linear.)
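If you do, the skeleton is surprisingly small. Here's a rough sketch of a Brzozowski-derivative matcher in Rust with intersection and complement built in — my own toy code, not redgrep's, and with none of the term simplification a real version would need to keep the expressions from growing:

```rust
use std::rc::Rc;

// Regexes over bytes, including the Boolean operations regular
// languages are closed under.
#[derive(Clone)]
enum Re {
    Empty,               // matches nothing
    Eps,                 // matches the empty string
    Byte(u8),
    Alt(Rc<Re>, Rc<Re>),
    And(Rc<Re>, Rc<Re>), // intersection
    Not(Rc<Re>),         // complement
    Cat(Rc<Re>, Rc<Re>),
    Star(Rc<Re>),
}
use Re::*;

// Does the regex match the empty string? (Brzozowski's "nullable".)
fn nullable(re: &Re) -> bool {
    match re {
        Empty | Byte(_) => false,
        Eps | Star(_) => true,
        Alt(a, b) => nullable(a) || nullable(b),
        And(a, b) | Cat(a, b) => nullable(a) && nullable(b),
        Not(a) => !nullable(a),
    }
}

// The derivative of `re` with respect to byte `c`: a regex matching
// exactly the strings s such that c·s matches `re`.
fn deriv(re: &Re, c: u8) -> Re {
    match re {
        Empty | Eps => Empty,
        Byte(b) => if *b == c { Eps } else { Empty },
        Alt(a, b) => Alt(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
        And(a, b) => And(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
        Not(a) => Not(Rc::new(deriv(a, c))),
        Cat(a, b) => {
            let da_b = Cat(Rc::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Alt(Rc::new(da_b), Rc::new(deriv(b, c)))
            } else {
                da_b
            }
        }
        Star(a) => Cat(Rc::new(deriv(a, c)), Rc::new(Star(a.clone()))),
    }
}

fn matches(re: &Re, input: &[u8]) -> bool {
    let mut cur = re.clone();
    for &c in input {
        cur = deriv(&cur, c);
    }
    nullable(&cur)
}

fn main() {
    // any = (a|b)*
    let any = || Star(Rc::new(Alt(Rc::new(Byte(b'a')), Rc::new(Byte(b'b')))));
    // has_aa = (a|b)* a a (a|b)*
    let has_aa = Cat(
        Rc::new(any()),
        Rc::new(Cat(
            Rc::new(Cat(Rc::new(Byte(b'a')), Rc::new(Byte(b'a')))),
            Rc::new(any()),
        )),
    );
    // (a|b)* & not(has_aa): a/b strings with no "aa" substring.
    let re = And(Rc::new(any()), Rc::new(Not(Rc::new(has_aa))));
    println!("{}", matches(&re, b"abab"));  // true
    println!("{}", matches(&re, b"abaab")); // false
}
```

Matching stays linear in the input length; it's the derivative terms themselves that can blow up without simplification, which is where the real engineering starts.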
Yeah sorry, I've gotten asked this question a lot. The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory. And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.
If you get something working, I'd love to look at it though! Especially if you're building in a tagless final interpreter style. I find that approach extremely elegant.
For my current attempts, I bit off more than I could chew:
I tried to build a system that not only recognizes regular languages, but also serves as a parser for them (a la Parsec).
The latter pushes you to support something like fmap, but the whole derivatives-based approach needs more 'introspection', so supporting general mapping via fmap (ie arbitrary a -> b functions) is out; you can only support things you have more control over than opaque functions.
(And in general, I am doing bifunctors, because I want the complement of the complement to be the original thing.)
Sorry if that's a bit confused. If I were a better theoretician, I could probably work it out.
I haven't touched the code in a while, but recently I have thought about the theory some more. Brzozowski's work on derivatives introduced the concept of the multiplicative inverse of a string. I am working out the ramifications of extending that to multiplicative inverses of arbitrary regular expressions. (The results might already be in the literature; I haven't looked much.)
I don't expect anything groundbreaking to come out of that, but I hope my understanding will improve.
> And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.
Well, in theory I could 'just' write a preprocessor that takes my regex with intersection and complement and translates it to a more traditional one. I wouldn't care too much if that's not very efficient.
I'm interested in those features because of the beauty of the theory, but they would also help make production regular expressions more modular.
Eg suppose you have a regular expression to decide what's a valid username for someone signing up to your system. You decide to use email addresses as your usernames, so the main qualification is that users can receive email at the address. But because usernames will be visible to other users, you have some additional requirements:
'.{0,100} & [^@]*@[^@]* & not (.*(root|admin|<some offensive term>).*@.*) & not (.*<sql injection>.*)'
That's a silly example. I think in production, I would be more likely to see something as complicated as this in eg some ad-hoc log parsing.
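For comparison, the hand-rolled version of that intersection in application code might look like the sketch below: each `&`-clause becomes a separate boolean check, ANDed together by hand. (The banned words are hypothetical placeholders, not a real denylist.)

```rust
// Hypothetical placeholder list standing in for <some offensive term> etc.
const BANNED: &[&str] = &["root", "admin"];

// Hand-rolled "intersection": one boolean check per clause of the
// `&`-regex, combined with &&.
fn valid_username(s: &str) -> bool {
    let lower = s.to_lowercase();
    let len_ok = s.len() <= 100;                              // .{0,100}
    let shape_ok = s.split('@').count() == 2;                 // [^@]*@[^@]*
    let words_ok = BANNED.iter().all(|w| !lower.contains(w)); // not (.*(root|admin|...).*)
    len_ok && shape_ok && words_ok
}

fn main() {
    println!("{}", valid_username("alice@example.com")); // true
    println!("{}", valid_username("admin@example.com")); // false
}
```

The point being: the logic is the same conjunction either way, but here it's scattered across code instead of living in one declarative expression.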
> The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory.
The interesting thing here is that Rust has good threading and fantastic crates.
I played with making a regex library in Rust, which, as per RE2's design, involves constructing graphs and gluing them together as the regex is traversed.
This requires a cycle-catching GC, or just a preallocated arena... It was my first foray into Rust, and I felt I would need to reach for unsafe, which I wasn't ready for. Array indexing might decompose into an arena, but it's syntactically a bit messier (imho).
Would be interesting to see how RE2's design is done in Rust (didn't know that existed).
I like how the article shows both sides of the fence, it makes me realize:
I get a lot of optimizations from pointer stuffing in C. But sometimes we should lay down the good for the better.
You're overcomplicating it. When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself. That's exactly what the regex crate does.
For reference, I am also the author of the regex crate. The only unsafe it uses specific to finite automata is to do explicit elimination of bounds checks in the core hybrid NFA/DFA loop.
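Concretely, the pattern is just: keep every state in one `Vec` and let "pointers" be integer IDs indexing into it. Back edges are plain numbers, so cycles cost nothing and there's no fight with the borrow checker. A minimal sketch of the idea (not the regex crate's actual representation):

```rust
// States live in one Vec; a StateId is just an index into it. No
// lifetimes, no Rc cycles, and the whole machine can be dropped or
// serialized as a unit.
type StateId = u32;

struct Nfa {
    states: Vec<State>,
}

enum State {
    // Consume a byte and go to the next state.
    Byte(u8, StateId),
    // Unconditional fork; cycles are fine because these are plain
    // integers, not owning pointers.
    Split(StateId, StateId),
    Match,
}

impl Nfa {
    fn push(&mut self, s: State) -> StateId {
        self.states.push(s);
        (self.states.len() - 1) as StateId
    }
}

fn main() {
    // Build an NFA for `a+` with a back edge -- a cycle that would need
    // Rc/Weak gymnastics if states held real pointers to each other.
    let mut nfa = Nfa { states: Vec::new() };
    let m = nfa.push(State::Match);
    let split = nfa.push(State::Split(m, m)); // placeholder, patched below
    let a = nfa.push(State::Byte(b'a', split));
    nfa.states[split as usize] = State::Split(a, m); // tie the knot
    println!("states: {}", nfa.states.len());
}
```

Back-patching like that last line is the index-based answer to "glueing graphs together as the regex is traversed": you allocate a slot, keep building, and overwrite it once the target exists.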
> When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself.
As an old C programmer, the difference between an array index and a pointer caught me by surprise. In C a pointer is just an unchecked offset into memory. A real array index is just an unchecked offset into ... maybe a smaller chunk of raw memory.
But in Rust, an array index is something that comes with bounds-checking overhead on every use. And the memory it points to is also constrained - the entire array has to be initialised, so if the index passes the bounds check you are guaranteed Rust's memory consistency invariants are preserved. Indexes also allow you to escape the borrow checker: if you own the slice, there is no need to prove you can access an element of it.
So yeah, you can use indexes instead of pointers, but for Rust that's like saying you can use recursion instead of iteration. Indexing and pointers are two very different things in Rust.
I guess so. But note that I didn't equate them. I just said that you can use an ID index instead. For the particular problem of FSMs, they work very well.
If bounds checks prove to be a problem, you can explicitly elide them. Indeed, Rust's regex does just that. :-)
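(For the curious, "explicitly elide" means an `unsafe` unchecked access guarded by an invariant you've established elsewhere. A sketch of the shape of it — my own illustration, not the regex crate's actual code:)

```rust
// A dense DFA transition table laid out as `table[state * 256 + byte]`.
// If every StateId stored in the table is constructed in-bounds, the
// lookup below can safely skip the bounds check.
struct Dfa {
    table: Vec<u32>, // state_count * 256 entries
    state_count: u32,
}

impl Dfa {
    fn next(&self, state: u32, byte: u8) -> u32 {
        debug_assert!(state < self.state_count);
        let idx = state as usize * 256 + byte as usize;
        // SAFETY: `state < state_count` and `table` has exactly
        // state_count * 256 entries, so `idx` is in bounds.
        unsafe { *self.table.get_unchecked(idx) }
    }
}

fn main() {
    // Two states: state 0 goes to 1 on b'a' and stays at 0 otherwise;
    // state 1 loops to itself on everything.
    let mut table = vec![0u32; 2 * 256];
    table[b'a' as usize] = 1;
    for b in 0..256 {
        table[256 + b] = 1;
    }
    let dfa = Dfa { table, state_count: 2 };
    println!("{}", dfa.next(dfa.next(0, b'x'), b'a'));
}
```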
> I played with making a regex library in Rust, which, as per RE2's design, involves constructing graphs and gluing them together as the regex is traversed
Has anyone built a production grade regex engine using derivatives? I don't think I've seen one. I personally always get stuck at how to handle things like captures or the very large Unicode character classes. Or hacking in look-around. (It's been a while since I've given this thought though, so I'm not sure I'll be able to elaborate much.)
I've made some attempts, but nothing production grade.
About large character classes: how are those harder than in other approaches? If you build any FSM you have to deal with them, don't you?
One way to handle them, which works well when the characters in your classes are mostly close to each other in Unicode, is to express your state transition function as an 'interval map'.
What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.
You want something that can model a step function.
You can either roll your own, or write something around a sorted-map data structure.
The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).
Does that make sense? Or am I misunderstanding the problem?
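To make that concrete, here's a toy interval map over Rust's `BTreeMap` (the intervals and state numbers are made up for the sketch, not from any real engine):

```rust
use std::collections::BTreeMap;

// A step function from chars to a transition target. The map stores,
// keyed by each interval's start, the pair (inclusive end, state).
// Lookup finds the closest interval starting at or before `c`.
struct IntervalMap {
    starts: BTreeMap<u32, (u32, u32)>, // start -> (end, state)
}

impl IntervalMap {
    fn get(&self, c: char) -> Option<u32> {
        let c = c as u32;
        // Last interval whose start is <= c...
        let (_, &(end, state)) = self.starts.range(..=c).next_back()?;
        // ...and check that c falls before that interval's end.
        (c <= end).then_some(state)
    }
}

fn main() {
    // Character class [0-9a-zÀ-ÿ] compiled to three intervals, all
    // leading to (arbitrary) state 7.
    let mut starts = BTreeMap::new();
    starts.insert('0' as u32, ('9' as u32, 7));
    starts.insert('a' as u32, ('z' as u32, 7));
    starts.insert('À' as u32, ('ÿ' as u32, 7));
    let map = IntervalMap { starts };
    println!("{:?}", map.get('k')); // Some(7)
    println!("{:?}", map.get('!')); // None
}
```

Lookup is O(log k) in the number of intervals, so a class covering thousands of contiguous codepoints costs the same as one covering three.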
> I personally always get stuck at how to handle things like captures [...]
Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.
What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?
But that's only for parsing the regex itself. I don't see any match APIs that utilize them. I wouldn't expect to either, because you can't implement capturing inside a DFA. (You need a tagged DFA, which is a strictly more powerful thing. But in that case, the DFA size explodes. See the re2c project and their associated papers.)
If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
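The classic witness for that worst case is `(a|b)*a(a|b){n-1}`, ie "the n-th symbol from the end is an a": the NFA has n+1 states, but any DFA needs 2^n, because it has to remember the last n symbols. A quick subset-construction counter to demonstrate (my own sketch):

```rust
use std::collections::{BTreeSet, HashSet, VecDeque};

// NFA for "(a|b)*a(a|b){n-1}": state 0 loops on both symbols and also
// steps to 1 on 'a'; states 1..n advance on either symbol; n accepts.
// Subset construction reaches every set {0} ∪ S for S ⊆ {1..n}, so the
// DFA has 2^n states even though the NFA has only n + 1.
fn dfa_state_count(n: usize) -> usize {
    let step = |set: &BTreeSet<usize>, sym: u8| -> BTreeSet<usize> {
        let mut next = BTreeSet::new();
        for &s in set {
            if s == 0 {
                next.insert(0);
                if sym == b'a' {
                    next.insert(1);
                }
            } else if s < n {
                next.insert(s + 1);
            }
        }
        next
    };
    let start = BTreeSet::from([0]);
    let mut seen = HashSet::from([start.clone()]);
    let mut queue = VecDeque::from([start]);
    while let Some(set) = queue.pop_front() {
        for sym in [b'a', b'b'] {
            let next = step(&set, sym);
            if seen.insert(next.clone()) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}

fn main() {
    for n in 1..=10 {
        println!("n = {:2}: {} DFA states", n, dfa_state_count(n));
    }
}
```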
> If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
Oh, that's interesting! Because I actually worked on some approaches that don't jump directly to the DFA.
The problem is the notion of (extended) NFA you need is quite a bit more complicated when you support intersection and complement.
Indeed. And in the regex crate and RE2, for example, captures are only implemented in the "NFA" engines (specifically, the PikeVM and the bounded backtracker). So if you support captures, then those engines have to be able to support everything.