
It's unsurprising when you consider that there are often several orders of magnitude of difference between what code grows into when you have the capacity, the time, and compounding user requests, and what a meaningful starting point looks like: something that provides useful functionality above and beyond what you had without it.

As an extreme example here[1] is an article by Brian Kernighan about a basic regexp matcher by Rob Pike. The code, with comments, is 35 lines of C.
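
The core of it, as I recall from [1], is just three tiny functions:

    /* match: search for regexp anywhere in text */
    int match(char *regexp, char *text)
    {
        if (regexp[0] == '^')
            return matchhere(regexp+1, text);
        do {    /* must look even if string is empty */
            if (matchhere(regexp, text))
                return 1;
        } while (*text++ != '\0');
        return 0;
    }

    /* matchhere: search for regexp at beginning of text */
    int matchhere(char *regexp, char *text)
    {
        if (regexp[0] == '\0')
            return 1;
        if (regexp[1] == '*')
            return matchstar(regexp[0], regexp+2, text);
        if (regexp[0] == '$' && regexp[1] == '\0')
            return *text == '\0';
        if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
            return matchhere(regexp+1, text+1);
        return 0;
    }

    /* matchstar: search for c*regexp at beginning of text */
    int matchstar(int c, char *regexp, char *text)
    {
        do {    /* a * matches zero or more instances */
            if (matchhere(regexp, text))
                return 1;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return 0;
    }

The entire supported syntax is ^, $, . and *, and that's already enough to be genuinely useful.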

Meanwhile, the regexp engine in Ruby 3.2.2, not even including the Regexp class visible to the language, is ~20597 lines excluding the headers.

They are not reasonably comparable, of course. Pike's code provides a very basic syntax which doesn't even support character classes, and does nothing to speed up the matching. The latter supports a far more complex syntax, and is far faster for repeated or long matches.

But while Ruby's regexp engine is 588 times larger than Pike's code, you of course get a vastly higher boost in overall system capability from those first 35 lines of code (from no regexp matching to some regexp matching) than from the last 35 lines of code, or indeed the second 35 lines.

So if you have a small system and a small team, you work with that: you start small, and it won't be that surprising when you get a lot done in few lines of code, even though you have a long list of things you'll add when you can justify the extra resources (like those missing character classes, and a billion other features).

(Then you get a larger system, and you'll very soon find yourself wondering how you ever managed with so little.)

I think it's mostly surprising because most developers today aren't used to thinking about the capacity constraints of small systems, and so design for lots of features from the start (can't have regexps without character classes, and capture groups, and back-references, and ...).

[1] https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...



"I think it's mostly surprising because most developers today aren't used to thinking about capacity constraints of small systems, and so starts designing for lots of features from the start (can't have regexps without character classes, and capture groups, and back-references, and ...)."

Sometimes when I have expressed admiration for sed(1) in the past on HN, someone replies something like, "Yeah, but it doesn't have capture groups."

It is amazing how much work I do with sed(1). Basic RE most of the time, not even Extended RE. And I'm only using a fraction of what sed can do. It is not just me. This small program is everywhere, on every computer. It's in the toolchains used to build the operating systems and other software that everyone is using. The computing world depends on sed.
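
To give a flavour, the bulk of that work is boring little one-liners like these (file names made up):

    sed 's/foo/bar/g' notes.txt        # replace every foo with bar
    sed -n '/^Error/p' build.log       # print only lines starting with Error
    sed '/^$/d' config                 # delete blank lines

All Basic RE, nothing exotic, and it covers a shocking amount of day-to-day text munging.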

Then there are people online who complain about RE. With memes, no less. It's baffling to me because I find RE so useful. Eventually I realised the reason they dislike RE is that they want to compose something really complex, get in over their heads, and then try to blame RE instead of their own stupidity. Meanwhile they could be doing a multitude of simpler things with RE very effectively. But that's not what they want.

Nope. They want the 2000+ lines of code, not the 35.

Of course this is a generalisation. Hence the word "most". There are some who are interested in the 35.

It's really easy to evaluate software today, assuming one is searching for simplicity, because so much of it is garbage written by people who are hopelessly addicted to needless complexity. They cannot define "simple". The mere use of the word is triggering for them.


Yeah, I think the 35 is probably too simplistic, but as the links elsewhere in this thread show, you can get an implementation that converts to a DFA and is competitive with your browser's and Node's native regexp engines in just a couple of times that. I'm not inherently against having options that do crazy work to provide extra features and squeeze out a bit more performance, but I also wish more people would try starting over, because the results are often surprising.

E.g. I'm pretty much "by accident" building my own desktop environment. I don't really want to, but I've rewritten software bit by bit, since it turns out rewriting something that does exactly what I want is often simpler than trying to fix issues in huge pieces of software.

It took me one night to replace Caja with my own desktop/file manager. It does far less than Caja, but it does more of what I want. E.g. it has a semi-spatial mode that lets me control when to snapshot positions rather than doing it either always or never.

300 lines of code. Should other people run it? Probably not, but at 300 lines we can afford a lot of variants more closely tailored to different workflows.

My terminal is ca. 1000 lines and close to being more accurate and more featureful than st (about 8k lines), and has more of the features I actually use than xterm (88k lines of code). Xterm's scrollbar code alone is many times the size of my entire terminal.

Xterm is more capable, but not 88 times more capable... And less capable in the areas I care about.

I tend to think we need more software that is less capable.

I'd rather have a wide choice of 1kloc terminals than one 88kloc one, because the 1kloc ones are far easier to customize and tailor.


A classic example of diminishing marginal returns.


It does make one wonder why piling on features with diminishing returns is the usual way of software development.

Proof of concept -> early release -> stable release -> add feature -> add feature -> add feature, ..., repeat.

CPU cycles, RAM use, storage space, bugs, updates, managing complexity, manuals, adapting to newer standards, programmers understanding codebases, etc, etc, etc. It's not like incremental additions are 0-cost.

Kind of like inventory: of all the stuff (most) people have in their homes, only a small fraction is used regularly. The rest sees use only very rarely, or exists as "nice to have", decoration, or plain clutter/junk. Turning free living space into a junk bin, making it more difficult to move house, etc. People who think that junk in their attic costs nothing can't do math.

The software fix would be to hunt aggressively for ways to simplify, reduce binary size & in-use memory footprint, remove lesser-used features, and weigh any feature (present or potentially added) against its impact on maintainability, code size, etc.

In other words: maximize the bang-per-byte. Note this does NOT need to mean "feature-starved". Just very capable / useful given its footprint.

As opposed to maximizing the feature list, or throwing complex algorithms at squeezing out every last % of performance.

Any projects out there that have bang-per-byte as #1 priority?


I've come to think we really should be more comfortable forking projects and keeping them small vs. adding features. E.g. consider the space of tiny to small-ish X11 menu tools. There's ratmenu, 9menu, dmenu, rofi, roughly ranging from tiny to slightly on the larger side, and I'm sure many more. If I wanted to add features to 9menu or ratmenu, I wouldn't add them directly unless it was something really minor or very generic. I'd fork them, or rewrite from scratch. Their appeal is that they're small and focused.

I'd rather have a choice of a dozen like them with slightly different feature sets that are easy to customize and tweak for my use, than one big one. Where the cutoff point is varies - I tend to find rofi a bit too big and complex, for example.


Toy solutions deal with small data inputs. How many lines of that 20k are just optimizing for large inputs? (You can't have long repeating sections in small inputs.)


I've written a linear-time regex matcher in 65 lines of code: https://jasonhpriestley.com/regex


That's beautiful. And I think it's a great illustration of how much more impactful the "next n lines" or so are going to be than the "last n lines".

EDIT: The impact is even greater when looking at your performance improvements with the DFA conversion: https://jasonhpriestley.com/regex-dfa


What features does it support? What if I pass in a complex regexp with data several GB in size?

There is a large difference between toy examples and hardened enterprise ready code.


The features are less important than the linear complexity.

It lowers to an NFA. An NFA can recognise any regular language, so adding more features affects the generation of the NFA, while increasing performance means matching faster against the NFA.
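
To make "matching against the NFA" concrete, here's a minimal sketch of the set-of-states simulation in C (the toy machine, matching "ab*c", is made up for illustration; it's not the code from the link):

    #include <stdio.h>

    #define NSTATES 4

    /* Transition function: bitmask of states reachable from state s on
       byte c. This toy NFA matches "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2. */
    static unsigned trans(int s, unsigned char c) {
        if (s == 0 && c == 'a') return 1u << 1;
        if (s == 1 && c == 'b') return 1u << 1;
        if (s == 1 && c == 'c') return 1u << 2;
        return 0;
    }

    /* Keep the set of live states as a bitmask; each input byte maps the
       set to a new set, so matching is O(text length x states), never
       exponential, no matter how the pattern was written. */
    static int match(const char *text) {
        unsigned live = 1u << 0;              /* start state only */
        for (; *text; text++) {
            unsigned next = 0;
            for (int s = 0; s < NSTATES; s++)
                if (live & (1u << s))
                    next |= trans(s, (unsigned char)*text);
            live = next;
        }
        return (live & (1u << 2)) != 0;       /* state 2 accepts */
    }

    int main(void) {
        printf("%d %d %d\n", match("abbbc"), match("ac"), match("abd")); /* 1 1 0 */
        return 0;
    }

Adding syntax only changes how you build trans(); the matching loop stays the same.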

In this case he's done a followup with benchmarks where he's converting the NFA to a DFA and comparing favourably against both Node and your browser's regexp engine with only a tiny bit of extra code:

https://jasonhpriestley.com/regex-dfa
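
And the DFA trick is small too. Extending the toy sketch above (same NSTATES and trans(); an illustration of lazy subset construction, not his code), each distinct set of live NFA states becomes one DFA state, and each transition is computed once and then cached:

    #define NSETS (1u << NSTATES)   /* one potential DFA state per NFA state-set */
    static int dfa[NSETS][256];     /* cached edges; 0 means "not built yet" */

    static int match_dfa(const char *text) {
        unsigned live = 1u << 0;                        /* start state set */
        for (; *text; text++) {
            unsigned char c = (unsigned char)*text;
            if (!dfa[live][c]) {                        /* build this edge once */
                unsigned next = 0;
                for (int s = 0; s < NSTATES; s++)
                    if (live & (1u << s))
                        next |= trans(s, c);
                dfa[live][c] = (int)next + 1;           /* +1 so 0 stays "unbuilt" */
            }
            live = (unsigned)(dfa[live][c] - 1);
        }
        return (live & (1u << 2)) != 0;                 /* state 2 accepts */
    }

Once the edges are warm, each input byte is a single table lookup instead of a loop over states.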

That doesn't mean there isn't value to the rest of those 20k lines of code I referenced - that was not the point.

A lot of them add a bunch of convenience, like a more expressive syntax that saves you from writing more convoluted regexps, and each presumably bought more performance than the previous, smaller iteration, every step up from a much smaller engine way back. I'm sure one could do better with less, but I'm also not dismissing that each step was probably reasonable given the constraints.

But that too is also not the point. It was not about dismissing the size of the Ruby regexp engine as not worth it.

The point was that there is a very significant and rapid diminishing return, and that this explains why you can do so seemingly much with so very little, because the leap from no capability to something usable takes very little, but each subsequent increment will tend to buy you less, for more work.

That doesn't mean people should stop putting in that extra work and squeezing out a bit more. It just gives an answer to the question in the link.


Being O(n) is important, but the size of the constant factor matters too. E.g. the theoretically most efficient known matrix multiplication algorithm only outperforms the common, less efficient algorithms at ludicrous matrix sizes, because of its huge constant factor.


Yes, but this is again totally beside the point, and it's addressed by the second link, to the version that does DFA conversion in a few dozen more lines and is competitive with the regexp implementation in Chrome.


I think this is a good base case for exploring how achieving feature parity affects the code.


A lot of it is, but the point remains: no matter where you set the bar, you can usually get most of the benefit with a tiny portion of the code. Sometimes squeezing out a tiny bit more performance isn't worth much; sometimes it's worth 10x or 100x the amount of code. The point is not that it's inherently wrong to write all that extra code.

But you should at least be aware when the returns are diminishing to a point where the payoff becomes uncertain.


Dealing with edge cases is where your code blows up. Once you find an edge case, do you leave it unaddressed and let the program generate a wrong/unexpected answer? Could it be a potential security flaw?

Unfortunately when we write initial specifications we almost certainly get them wrong in one way or another. Trying to deal with that after the fact commonly leads to application bloat.


My experience is that while edge cases certainly add bloat, there are plenty of cases where the bloat comes from entirely different things, and where simply rewriting with the hindsight of being able to see the overall structure better can do wonders. In other words: a lot of the time it's undoing mistakes made the first time around as a result of not having the full picture.

And often it's a result of trying to cater for everything even when it's not needed. E.g. layers of indirection in anticipation of extensions that never happened, or accessors that are never once accessed.

Often it's history. One of my "favourite" recent examples is the X11 XRender extension. On one hand it modernised X, allowing a much more modern rendering pipeline. On the other hand it insisted on holding on to a world that had moved on:

On one hand, it provides a number of pre-defined visuals and formats, requiring servers that implement XRender to provide ARGB32, RGB24, A8, A4, and A1 formats. On the other hand, it 1) allows the server to provide a list of every other kind it can support, at every depth it can support, and 2) doesn't label the standard, required formats in any way. So as a result you get back a huge list of visuals at depths you don't care about, and formats you don't care about and never will.

As a result, the client library goes through a pointless matching exercise over a bunch of visuals and formats that in many cases have quite likely never once in the history of the XRender extension been used by anyone for anything but testing. Are you going to do complex alpha-blending and compositing on a machine with an 8-bit palette-based display? No.
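
(For what it's worth, libXrender at least exposes a shortcut for the required formats, so a client that only wants, say, ARGB32 can write something like this instead of doing its own trawl; a sketch, error handling omitted:

    #include <X11/Xlib.h>
    #include <X11/extensions/Xrender.h>

    XRenderPictFormat *get_argb32(Display *dpy)
    {
        /* Walks the server's advertised format list for us and picks
           out the standard ARGB32 format. */
        return XRenderFindStandardFormat(dpy, PictStandardARGB32);
    }

The matching still happens, just inside the library rather than in your code.)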

The very, very generic system of visuals and formats X11 supports made sense in the 1980's. Even in the 1990's. It was marginally useful to support legacy hardware into the 2000's (but then we're talking maybe supporting 16 or even 15 bit depth graphics cards, not monochrome or 8 bit).

It's not necessary any longer. If you're going to run hardware where this is an issue, you'll be running old software too. But here it still is, contributing a bunch of pointless code on the server, and forcing the client to implement a bunch of pointless code as well because the protocol just assumes people will care about more precise matching (we don't; I just want to use the standard visuals and formats).

A lot of the time, with the full picture then, you can yank out a whole lot of code and reduce the number of edge cases by offering fewer choices, and you can often - not always - do so without sacrificing functionality that is actually used.


As a counterpoint: Microsoft has done very well financially by holding on to backwards compatibility.

In the OSS world we tend to look at rewrites as "this is for me, who gives a shit about the customer", but for most customer-facing (paying) software the expectations are: do not break the application, and do not break the customer's expectations.

This said, X was and is a mess.


My point isn't so much that ditching the backwards compatibility is always right, even when it lets you shed lots of code, but that you can often massively reduce code size that way. Whether that's the right thing to do is often a fine balance, because it depends a great deal on how many people actually care about the old ways, and sometimes people will get it very wrong in either direction.

Sometimes the 10x more code is actually worth it. But you should be aware when it is 10x more code, and make sure it is worth it.


Rewriting the code isn't the hard part...

The QA is the hard part.

Developer: "Rewriting this line of code shouldn't change anything"

Narrator: It did


Drastically reducing the codebase, if anything, tends to make QA a lot easier ;) (though depending on your needs, you may or may not like the outcome)


You can hold on to compatibility by providing a simple 1-bit API from the 70s, an 8-bit linear mode with palette, scrolling, and a palette-oriented API, and then a 3D API on 32-bit ARGB. Three simple APIs may be simpler to maintain than one unified API.


I think the most important thing to ask about an edge case is “how could I have designed this code to make it not an edge case”


"I should have beat the customer until they gave me full specifications of what they needed before writing code"

Is unfortunately what happens a lot.


Ah, but retrospectively what matters more is "now that we know how bad that got, how can we use what we learnt to do better when we rewrite?". It's not a given that it's a bad thing to allow customers to drive rapid iteration even when it leads to a mess, as long as you're prepared to take the consequences.

A lot of the worst systems I've seen were systems where people resisted rewrites rather than planned for them, often because they didn't establish enough tests, and/or were upset over the sunk costs. Treating the first iteration or two as throwaway learning experiences can be useful. If you plan for that from the start, you can allow yourself to take shortcuts you otherwise wouldn't (but you sure as hell better make sure management understands that sticking with the throwaway version isn't an option), knowing the lessons learnt will feed straight into the next iteration (and of course building a throwaway system doesn't mean every part needs to be written to be ditched).


My tiny regex matcher in C is complete, in 1K of C, with formal verification:

https://github.com/rurban/tiny-regex-c/blob/master/re.c
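
If it keeps the upstream kokke/tiny-regex-c interface (an assumption; check re.h in the repo for the actual signatures), usage is roughly:

    /* Assumed API, modelled on upstream kokke/tiny-regex-c. */
    #include <stdio.h>
    #include "re.h"

    int main(void)
    {
        int len;
        /* re_match returns the index of the match, or -1 if none, and
           writes the match length through the out-parameter. */
        int pos = re_match("[Hh]ello [Ww]orld\\s*[!]?",
                           "ahem.. 'hello world !' ..", &len);
        if (pos != -1)
            printf("match at %d, length %d\n", pos, len);
        return 0;
    }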


I think the “never look back” mindset of product growth is a major driver of code bloat.

Rarely are engineering teams allowed time to go back and refine and optimize legacy code, until it’s a blazing dumpster fire… and even then.

Products never seem to have enough features, so engineering teams plow forward. When a component absolutely needs a rewrite, the rewrite is often implemented with the same haste as the original.

Given the time, I think every engineer could easily factor out vast swaths of code and refine what’s left to be much leaner.

I feel like the same goes for programming languages and platforms. Most devs would prefer to spend their time adding new capabilities rather than optimizing a few lines of code out of existing ones.

But then again “Perfect is the enemy of good” and “If it ain’t broke don’t fix it” and “We’ve got 731 items on the roadmap, so let’s go!”



