
> The C# standard library will not run many case-insensitive comparisons. That comparison object doesn’t just compare two objects for equality, it also has another interface method which computes a hash of a single object.

Which doesn't matter because I'm talking about identical strings, so they will hash the same by definition, and they will have to be compared.

So the question is how fast the CI hash and equality operate compared to the CS ones.

And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS counterpart.



> how fast the CI hash and equality operate compared to the CS ones.

If the strings are ASCII, as in the OP’s use case, I think the difference is not huge.

The CS comparison looks more optimized; it has an inner loop that compares 12 chars (24 bytes) as three 64-bit values: https://source.dot.net/#System.Private.CoreLib/src/libraries...

The CI comparer doesn’t do that; it loads individual UTF-16 elements: https://source.dot.net/#System.Private.CoreLib/src/libraries... Still, it’s very simple code that does sequential memory access.
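
Roughly the shape of that per-element ASCII path (a sketch of the idea only, not the actual BCL code; the method name is made up):

    // Sketch: ordinal-ignore-case equality for two strings that are
    // assumed to be pure ASCII. The real code falls back to a slower
    // path when it sees a non-ASCII char.
    static bool EqualsAsciiIgnoreCase(string a, string b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            char x = a[i], y = b[i];
            // Fold ASCII letters to lowercase by setting bit 0x20.
            if ((uint)(x - 'A') <= 'Z' - 'A') x = (char)(x | 0x20);
            if ((uint)(y - 'A') <= 'Z' - 'A') y = (char)(y | 0x20);
            if (x != y) return false;
        }
        return true;
    }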

> And I asked about comparison because I assumed that would be the costlier of the two operations, relative to its CS counterpart.

I think the bottleneck is random memory loads from the hash table.

Hashing and comparison do sequential RAM access. The CPU’s prefetcher will do its job, you’ll get two memory loads every cycle, and for short strings that’s going to be extremely fast. If that hash table doesn’t fit in L3 cache, main memory latency costs far more than comparing strings of 10-20 characters, case-sensitive or not.
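
If you want numbers for your own data, a quick-and-dirty benchmark is easy to throw together (a sketch, not a rigorous benchmark; the key set and lookup pattern are made up, and the probe strings are fresh copies so neither comparer can short-circuit on reference equality):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class Bench
    {
        static void Main()
        {
            // Short ASCII keys, probed via copies so every hit actually
            // hashes and compares characters.
            string[] keys = Enumerable.Range(0, 1_000_000)
                .Select(i => "word" + i).ToArray();
            string[] probes = keys
                .Select(k => new string(k.ToCharArray())).ToArray();

            Measure("Ordinal          ", StringComparer.Ordinal, keys, probes);
            Measure("OrdinalIgnoreCase", StringComparer.OrdinalIgnoreCase, keys, probes);
        }

        static void Measure(string label, StringComparer cmp,
                            string[] keys, string[] probes)
        {
            var dict = new Dictionary<string, int>(cmp);
            foreach (var k in keys) dict[k] = 0;

            var sw = Stopwatch.StartNew();
            long hits = 0;
            foreach (var p in probes)
                if (dict.ContainsKey(p)) hits++;
            sw.Stop();
            Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms, {hits} hits");
        }
    }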


But all of those optimizations can also be done, and more efficiently, by transforming while reading. Even if they're as fast as they can be, they're not as fast as a memcmp. The C# approach isn't buying you any performance here.


Yeah, it’s possible to change case while loading, but I’m not sure that’s gonna be much faster.

But I’m sure it’s gonna be much harder.

For non-ASCII strings, converting case may change their length in bytes. You don’t even know in advance how much memory you need to transform 2GB of input text (or a 1MB buffer if streaming). And if streaming, you need to be careful to keep code points together: with a naïve approach you’re gonna crash with a runtime exception when you split a single codepoint between chunks.
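
For example (a sketch; it uses a UTF-8 decoder configured to throw on invalid bytes, and the 3-byte chunk size is contrived so the boundary lands inside the two-byte “ï”):

    using System;
    using System.Text;

    class SplitCodepoint
    {
        static void Main()
        {
            // "naïve" is 6 bytes in UTF-8: n a <C3 AF> v e
            byte[] bytes = Encoding.UTF8.GetBytes("naïve");
            var strict = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);

            // Pretend we stream in 3-byte chunks: the first chunk ends
            // with the dangling lead byte 0xC3, so this throws a
            // DecoderFallbackException.
            string first = strict.GetString(bytes, 0, 3);
            string second = strict.GetString(bytes, 3, 3);
            Console.WriteLine(first + second);
        }
    }

(The fix is to use Encoding.GetDecoder(), which carries the partial sequence over to the next chunk, but that’s exactly the kind of care you have to take.)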

English words are 99.9% ASCII, but that remaining 0.1% like “naïve” is not. The C# standard library is doing the right thing for this use case. Specifically, for 99.9% of words the CI comparer will use the faster ASCII-only code to compare or hash, and only do the expensive shenanigans for the small count of non-ASCII words.

Note how C# makes the implementation much simpler: a single parameter passed to the Dictionary<string, Something> constructor makes it case-insensitive automagically.
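
Something like this (assuming a word-count table, as in the OP’s use case):

    using System;
    using System.Collections.Generic;

    // The comparer passed to the constructor makes every insert and
    // lookup ignore case; no manual lowercasing of keys needed.
    var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    counts["Naïve"] = 1;
    counts["NAÏVE"]++;                    // hits the same entry as "Naïve"
    Console.WriteLine(counts["naïve"]);   // prints 2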



