No it's not. Its regex library is written in Rust, but was inspired by RE2. It shares no code with RE2. (And RE2 is a C++ library, not C.)
Off the top of my head, the only C code in ripgrep is optional integration with PCRE2. In addition to whatever libc is being used on POSIX platforms. Everything else is pure Rust.
I couldn't figure it out from looking through ripgrep's website: does ripgrep support intersection and complement of expressions? Like eg https://github.com/google/redgrep does.
Regular languages are closed under those operations after all.
No, it doesn't. It's only theoretically easy to implement. In practice, they explode the size of the underlying FSM. Moreover, in a command line tool, it's somewhat easy to work around that through the `-v` switch and shell pipelining.
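(For the curious: the blow-up for intersection is easy to see from the classic product construction, which runs both automata in lockstep, so the result can need up to |A| × |B| states. A toy sketch of that, nothing from ripgrep's actual code:)

```rust
use std::collections::{HashSet, VecDeque};

// A tiny DFA over the alphabet {0, 1}: state 0 is the start and
// `delta[state][sym]` gives the next state.
struct Dfa {
    delta: Vec<[usize; 2]>,
    #[allow(dead_code)]
    accept: HashSet<usize>,
}

// Product construction: simulate both DFAs in lockstep. The reachable
// product states are pairs (s1, s2), so the result can have up to
// |A| * |B| states -- this multiplicative growth is the cost of `&`.
fn product_states(a: &Dfa, b: &Dfa) -> usize {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([(0usize, 0usize)]);
    seen.insert((0, 0));
    while let Some((s1, s2)) = queue.pop_front() {
        for sym in 0..2 {
            let next = (a.delta[s1][sym], b.delta[s2][sym]);
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}

fn main() {
    // A: number of 0s is even (2 states). B: number of 1s divisible by 3 (3 states).
    let a = Dfa { delta: vec![[1, 0], [0, 1]], accept: HashSet::from([0]) };
    let b = Dfa { delta: vec![[0, 1], [1, 2], [2, 0]], accept: HashSet::from([0]) };
    // Their intersection needs all 2 * 3 = 6 pairs.
    println!("product states: {}", product_states(&a, &b));
}
```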
ripgrep's regex syntax is the same as Rust's regex crate: https://docs.rs/regex/1.4.4/regex/#syntax (Which is in turn similar to RE2, although it supports a bit more niceties.)
> No, it doesn't. It's only theoretically easy to implement.
Oh, I didn't say anything about easy! I am on and off working on a Haskell re-implementation (but with GADTs and in Oleg's tagless final interpreter style etc, so it's more about exploring the type system).
> In practice, they explode the size of the underlying FSM.
You may be right, but that's still better than the gymnastics you'd have to do by hand to get the same features out of a 'normal' regex.
> Moreover, in a command line tool, it's somewhat easy to work around that through the `-v` switch and shell pipelining.
Alas, that only works if your intersection or complement happens at the top level. You can't express an intersection or complement that's nested somewhere inside a larger expression that way.
Perhaps I'll try to implement a basic version of redgrep in Rust as an exercise. (I just want something that supports basically all the operations regular languages are closed under, but I don't care too much about speed, as long as the runtime complexity is linear.)
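If you do, the skeleton is surprisingly small. Here's a rough sketch of a Brzozowski-derivative matcher in Rust with intersection and complement built in — my own toy code, not redgrep's, and with none of the term simplification a real version would need to keep the expressions from growing:

```rust
use std::rc::Rc;

// Regexes over bytes, including the Boolean operations regular
// languages are closed under.
#[derive(Clone)]
enum Re {
    Empty,               // matches nothing
    Eps,                 // matches the empty string
    Byte(u8),
    Alt(Rc<Re>, Rc<Re>),
    And(Rc<Re>, Rc<Re>), // intersection
    Not(Rc<Re>),         // complement
    Cat(Rc<Re>, Rc<Re>),
    Star(Rc<Re>),
}
use Re::*;

// Does the regex match the empty string? (Brzozowski's "nullable".)
fn nullable(re: &Re) -> bool {
    match re {
        Empty | Byte(_) => false,
        Eps | Star(_) => true,
        Alt(a, b) => nullable(a) || nullable(b),
        And(a, b) | Cat(a, b) => nullable(a) && nullable(b),
        Not(a) => !nullable(a),
    }
}

// The derivative of `re` with respect to byte `c`: a regex matching
// exactly the strings s such that c·s matches `re`.
fn deriv(re: &Re, c: u8) -> Re {
    match re {
        Empty | Eps => Empty,
        Byte(b) => if *b == c { Eps } else { Empty },
        Alt(a, b) => Alt(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
        And(a, b) => And(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
        Not(a) => Not(Rc::new(deriv(a, c))),
        Cat(a, b) => {
            let da_b = Cat(Rc::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Alt(Rc::new(da_b), Rc::new(deriv(b, c)))
            } else {
                da_b
            }
        }
        Star(a) => Cat(Rc::new(deriv(a, c)), Rc::new(Star(a.clone()))),
    }
}

fn matches(re: &Re, input: &[u8]) -> bool {
    let mut cur = re.clone();
    for &c in input {
        cur = deriv(&cur, c);
    }
    nullable(&cur)
}

fn main() {
    // any = (a|b)*
    let any = || Star(Rc::new(Alt(Rc::new(Byte(b'a')), Rc::new(Byte(b'b')))));
    // has_aa = (a|b)* a a (a|b)*
    let has_aa = Cat(
        Rc::new(any()),
        Rc::new(Cat(
            Rc::new(Cat(Rc::new(Byte(b'a')), Rc::new(Byte(b'a')))),
            Rc::new(any()),
        )),
    );
    // (a|b)* & not(has_aa): a/b strings with no "aa" substring.
    let re = And(Rc::new(any()), Rc::new(Not(Rc::new(has_aa))));
    println!("{}", matches(&re, b"abab"));  // true
    println!("{}", matches(&re, b"abaab")); // false
}
```

Matching stays linear in the input length; it's the derivative terms themselves that can blow up without simplification, which is where the real engineering starts.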
Yeah sorry, I've gotten asked this question a lot. The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory. And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.
If you get something working, I'd love to look at it though! Especially if you're building in a tagless final interpreter style. I find that approach extremely elegant.
For my current attempts, I bit off more than I could chew:
I tried to build a system that not only recognizes regular languages, but also serves as a parser for them (a la Parsec).
The latter pushes you to support something like fmap, but the whole derivatives-based approach needs more 'introspection', so supporting general mapping via fmap (ie arbitrary a -> b functions) is out; you can only support things you have more control over than opaque functions.
(And in general, I am doing bifunctors, because I want the complement of the complement to be the original thing.)
Sorry if that's a bit confused. If I were a better theoretician, I could probably work it out.
I haven't touched the code in a while, but recently I have thought about the theory some more. Brzozowski's work on derivatives introduced the concept of the multiplicative inverse of a string. I am working out the ramifications of extending that to multiplicative inverses of arbitrary regular expressions. (The results might already be in the literature; I haven't looked much.)
I don't expect anything groundbreaking to come out of that, but I hope my understanding will improve.
> And these particular features just don't really pull their weight IMO. They are performance footguns, and IMO, are also tricky to reason about inside of regex syntax.
Well, in theory I could 'just' write a preprocessor that takes my regex with intersection and complement and translates it to a more traditional one. I wouldn't care too much if that's not very efficient.
I'm interested in those features because of the beauty of the theory, but they would also help make production regular expressions more modular.
Eg suppose you have a regular expression to decide what's a valid username for someone signing up to your system. You decide to use email addresses as your usernames, so the main qualification is that users can receive email at the address. But because usernames will be visible to other users, you have some additional requirements:
'.{0,100} & [^@]*@[^@]* & not (.*(root|admin|<some offensive term>).*@.*) & not (.*<sql injection>.*)'
That's a silly example. I think in production, I would be more likely to see something as complicated as this in eg some ad-hoc log parsing.
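For comparison, the hand-rolled version of that intersection in application code might look like the sketch below: each `&`-clause becomes a separate boolean check, ANDed together by hand. (The banned words are hypothetical placeholders, not a real denylist.)

```rust
// Hypothetical placeholder list standing in for <some offensive term> etc.
const BANNED: &[&str] = &["root", "admin"];

// Hand-rolled "intersection": one boolean check per clause of the
// `&`-regex, combined with &&.
fn valid_username(s: &str) -> bool {
    let lower = s.to_lowercase();
    let len_ok = s.len() <= 100;                              // .{0,100}
    let shape_ok = s.split('@').count() == 2;                 // [^@]*@[^@]*
    let words_ok = BANNED.iter().all(|w| !lower.contains(w)); // not (.*(root|admin|...).*)
    len_ok && shape_ok && words_ok
}

fn main() {
    println!("{}", valid_username("alice@example.com")); // true
    println!("{}", valid_username("admin@example.com")); // false
}
```

The point being: the logic is the same conjunction either way, but here it's scattered across code instead of living in one declarative expression.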
> The issue is that building a production grade regex engine---even when it's restricted to regular languages---requires a lot more engineering than theory.
The interesting thing here is that Rust has good threading and fantastic crates.
I played with making a regex library in Rust, which, as per RE2's design, involves constructing graphs and gluing them together as the regex is traversed.
This requires a cycle-catching GC, or just a preallocated arena... It was my first foray into Rust, and I felt I would need to reach for unsafe, which I wasn't ready for. Array indexing might decompose into an arena, but it's syntactically a bit messier (imho).
Would be interesting to see how RE2's design is done in Rust (didn't know that existed).
I like how the article shows both sides of the fence, it makes me realize:
I get a lot of optimizations from pointer stuffing in C. But sometimes we should lay down the good for the better.
You're overcomplicating it. When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself. That's exactly what the regex crate does.
For reference, I am also the author of the regex crate. The only unsafe it uses specific to finite automata is to do explicit elimination of bounds checks in the core hybrid NFA/DFA loop.
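Concretely, the pattern is just: keep every state in one `Vec` and let "pointers" be integer IDs indexing into it. Back edges are plain numbers, so cycles cost nothing and there's no fight with the borrow checker. A minimal sketch of the idea (not the regex crate's actual representation):

```rust
// States live in one Vec; a StateId is just an index into it. No
// lifetimes, no Rc cycles, and the whole machine can be dropped or
// serialized as a unit.
type StateId = u32;

struct Nfa {
    states: Vec<State>,
}

enum State {
    // Consume a byte and go to the next state.
    Byte(u8, StateId),
    // Unconditional fork; cycles are fine because these are plain
    // integers, not owning pointers.
    Split(StateId, StateId),
    Match,
}

impl Nfa {
    fn push(&mut self, s: State) -> StateId {
        self.states.push(s);
        (self.states.len() - 1) as StateId
    }
}

fn main() {
    // Build an NFA for `a+` with a back edge -- a cycle that would need
    // Rc/Weak gymnastics if states held real pointers to each other.
    let mut nfa = Nfa { states: Vec::new() };
    let m = nfa.push(State::Match);
    let split = nfa.push(State::Split(m, m)); // placeholder, patched below
    let a = nfa.push(State::Byte(b'a', split));
    nfa.states[split as usize] = State::Split(a, m); // tie the knot
    println!("states: {}", nfa.states.len());
}
```

Back-patching like that last line is the index-based answer to "glueing graphs together as the regex is traversed": you allocate a slot, keep building, and overwrite it once the target exists.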
> When it comes to finite state machines at least, it's very easy to use an ID index instead of the raw pointer itself.
As an old C programmer, the difference between an array index and a pointer caught me by surprise. In C a pointer is just an unchecked offset into memory. A real array index is just an unchecked offset into ... maybe a smaller chunk of raw memory.
But in Rust, an array index is something that comes with bounds-checking overhead on every use. And the memory it points to is also constrained - the entire array has to be initialised, so if the index passes the bounds check you are guaranteed Rust's memory consistency invariants are preserved. Indexes also allow you to escape the borrow checker: if you own the slice, there is no need to prove you can access an element of it.
So yeah, you can use indexes instead of pointers, but for Rust that's like saying you can use recursion instead of iteration. Indexing and pointers are two very different things in Rust.
I guess so. But note that I didn't equate them. I just said that you can use an ID index instead. For the particular problem of FSMs, they work very well.
If bounds checks prove to be a problem, you can explicitly elide them. Indeed, Rust's regex does just that. :-)
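(For the curious, "explicitly elide" means an `unsafe` unchecked access guarded by an invariant you've established elsewhere. A sketch of the shape of it — my own illustration, not the regex crate's actual code:)

```rust
// A dense DFA transition table laid out as `table[state * 256 + byte]`.
// If every StateId stored in the table is constructed in-bounds, the
// lookup below can safely skip the bounds check.
struct Dfa {
    table: Vec<u32>, // state_count * 256 entries
    state_count: u32,
}

impl Dfa {
    fn next(&self, state: u32, byte: u8) -> u32 {
        debug_assert!(state < self.state_count);
        let idx = state as usize * 256 + byte as usize;
        // SAFETY: `state < state_count` and `table` has exactly
        // state_count * 256 entries, so `idx` is in bounds.
        unsafe { *self.table.get_unchecked(idx) }
    }
}

fn main() {
    // Two states: state 0 goes to 1 on b'a' and stays at 0 otherwise;
    // state 1 loops to itself on everything.
    let mut table = vec![0u32; 2 * 256];
    table[b'a' as usize] = 1;
    for b in 0..256 {
        table[256 + b] = 1;
    }
    let dfa = Dfa { table, state_count: 2 };
    println!("{}", dfa.next(dfa.next(0, b'x'), b'a'));
}
```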
> I played with making a regex library in Rust, which, as per RE2's design, involves constructing graphs and gluing them together as the regex is traversed
Has anyone built a production grade regex engine using derivatives? I don't think I've seen one. I personally always get stuck at how to handle things like captures or the very large Unicode character classes. Or hacking in look-around. (It's been a while since I've given this thought though, so I'm not sure I'll be able to elaborate much.)
I've made some attempts, but nothing production grade.
About large character classes: how are those harder than in other approaches? If you build any FSM you have to deal with them, don't you?
One way to handle them, which works well when the characters in your classes are mostly close to each other in Unicode, is to express your state transition function as an 'interval map'.
What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.
You want something that can model a step function.
You can either roll your own, or write something around a sorted-map data structure.
The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).
Does that make sense? Or am I misunderstanding the problem?
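To make that concrete, here's a toy interval map over Rust's `BTreeMap` (the intervals and state numbers are made up for the sketch, not from any real engine):

```rust
use std::collections::BTreeMap;

// A step function from chars to a transition target. The map stores,
// keyed by each interval's start, the pair (inclusive end, state).
// Lookup finds the closest interval starting at or before `c`.
struct IntervalMap {
    starts: BTreeMap<u32, (u32, u32)>, // start -> (end, state)
}

impl IntervalMap {
    fn get(&self, c: char) -> Option<u32> {
        let c = c as u32;
        // Last interval whose start is <= c...
        let (_, &(end, state)) = self.starts.range(..=c).next_back()?;
        // ...and check that c falls before that interval's end.
        (c <= end).then_some(state)
    }
}

fn main() {
    // Character class [0-9a-zÀ-ÿ] compiled to three intervals, all
    // leading to (arbitrary) state 7.
    let mut starts = BTreeMap::new();
    starts.insert('0' as u32, ('9' as u32, 7));
    starts.insert('a' as u32, ('z' as u32, 7));
    starts.insert('À' as u32, ('ÿ' as u32, 7));
    let map = IntervalMap { starts };
    println!("{:?}", map.get('k')); // Some(7)
    println!("{:?}", map.get('!')); // None
}
```

Lookup is O(log k) in the number of intervals, so a class covering thousands of contiguous codepoints costs the same as one covering three.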
> I personally always get stuck at how to handle things like captures [...]
Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.
What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?
But that's only for parsing the regex itself. I don't see any match APIs that utilize them. I wouldn't expect to either, because you can't implement capturing inside a DFA. (You need a tagged DFA, which is a strictly more powerful thing. But in that case, the DFA size explodes. See the re2c project and their associated papers.)
If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
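The classic witness for that worst case is `(a|b)*a(a|b){n-1}`, ie "the n-th symbol from the end is an a": the NFA has n+1 states, but any DFA needs 2^n, because it has to remember the last n symbols. A quick subset-construction counter to demonstrate (my own sketch):

```rust
use std::collections::{BTreeSet, HashSet, VecDeque};

// NFA for "(a|b)*a(a|b){n-1}": state 0 loops on both symbols and also
// steps to 1 on 'a'; states 1..n advance on either symbol; n accepts.
// Subset construction reaches every set {0} ∪ S for S ⊆ {1..n}, so the
// DFA has 2^n states even though the NFA has only n + 1.
fn dfa_state_count(n: usize) -> usize {
    let step = |set: &BTreeSet<usize>, sym: u8| -> BTreeSet<usize> {
        let mut next = BTreeSet::new();
        for &s in set {
            if s == 0 {
                next.insert(0);
                if sym == b'a' {
                    next.insert(1);
                }
            } else if s < n {
                next.insert(s + 1);
            }
        }
        next
    };
    let start = BTreeSet::from([0]);
    let mut seen = HashSet::from([start.clone()]);
    let mut queue = VecDeque::from([start]);
    while let Some(set) = queue.pop_front() {
        for sym in [b'a', b'b'] {
            let next = step(&set, sym);
            if seen.insert(next.clone()) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}

fn main() {
    for n in 1..=10 {
        println!("n = {:2}: {} DFA states", n, dfa_state_count(n));
    }
}
```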
> If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
Oh, that's interesting! Because I actually worked on some approaches that don't jump directly to the DFA.
The problem is the notion of (extended) NFA you need is quite a bit more complicated when you support intersection and complement.
Indeed. And in the regex crate and RE2, for example, captures are only implemented in the "NFA" engines (specifically, the PikeVM and the bounded backtracker). So if you support captures, then those engines have to be able to support everything.