
This is why, in 2021, the mantra that C is a good language for these low-level byte-twiddling tasks needs to die. Dealing with alignment and endianness properly requires a language that allows you to build abstractions.

The following is perfectly well defined in C++, despite looking almost the same as the original unsafe C:

    #include <boost/endian.hpp>
    #include <cstdint>
    #include <cstdio>
    using namespace boost::endian;

    unsigned char b[5] = {0x80,0x01,0x02,0x03,0x04};

    int main() {
        uint32_t x = *((big_uint32_t*)(b+1));
        printf("%08x\n", x);
    }
Note that I deliberately misaligned the pointer by adding 1.

https://gcc.godbolt.org/z/5416oefjx

[Edit] Fun twist: the above code doesn't work if the intermediate variable x is removed, because printf itself is not type safe, so no type conversion happens (the conversion is where the bswap is deferred to). In pure C++, using a type safe formatting function (like fmt or iostreams), this wouldn't happen. printf will let you throw any garbage into it. tl;dr outside embedded use cases writing C in 2021 is fucking nuts.



Correct me if I'm wrong, but your example is just using a library to do the same task, rather than illustrating any difference between C and C++. If you want to pull boost in to do this, that's great, but that hardly seems like a fair comparison to the OP, since instead of implementing code to solve this problem yourself you're just importing someone else's code.


No, the fact that this can be done in a library and looks like a native language feature demonstrates the power of C++ as a language.

This example is demonstrating:

- First class treatment of user (or library) defined types

- Operator overloading

- The fact that it produces fast machine code. Try changing big_uint32_t to regular uint32_t to see how this changes. When you use the latter, ubsan will introduce a trap for runtime checks, but it doesn't need to in this case.


Operator overloading is a mixed blessing though: it can be very convenient, but it's also very good at obfuscating what's going on.

For instance, I'm not familiar with this boost library, so I'd have a lot of trouble puzzling out what your snippet does, especially since there's no explicit function call besides the printf.

Personally, if we're going the OOP route, I'd much prefer something like Rust's `var.to_be()`, `var.to_le()` etc... At least it's very explicit.

My hot take is that operator overloading should only ever be used for mathematical operators (multiplying vectors etc...), everything else is almost invariably a bad idea.


Ironically, it was proposed not so long ago to deprecate to_be/to_le in favour of to_be_bytes/to_le_bytes, since the former conflate abstract values with bit representations.


That's fine if whatever type 'var' happens to be is NOT usable as an arithmetic type, otherwise you can easily just forget to call .to_le() or .to_native(), or whatever, and end up with a bug. I don't know Rust, so don't know if this is the case?

Boost.Endian actually lets you pick between arithmetic and buffer types.

'big_uint32_buf_t' is a buffer type that requires you to call .value() or do a conversion to an integral type. It does not support arithmetic operations.

'big_uint32_t' is an arithmetic type, and supports all the arithmetic operators.

There are also variants of both endian suffixed '_at' for when you know you have aligned access.


The idiomatic way to do this in Rust is to use functions like .to_le_bytes(), so you have the u32 (or whatever) on one end and raw bytes (something like [u8; 4]) on the other. It can get slightly tedious if you're doing it by hand, but it's impossible to accidentally forget. If you're doing this kind of thing at scale, like dealing with TrueType fonts (another bastion of big-endian), it's common to reach for derive macros, which automate a great deal of the tedium.


Who decides what methods to add to the bytes type/abstraction?

If I have a 3 byte big endian integer can I access it easily in rust without resorting to shifts?

In C++ I could probably create a fairly convincing big_uint24_t type and use it in a packed struct and there would be no inconsistencies with how it's used with respect to the more common varieties


In Rust, [u8; N] and &[u8] are both primitive types, and not abstractions. It's possible to create an abstraction around either (the former even more so now with const generics), but that's not necessary. It's also possible to use "extension traits" to add methods, even to existing and built-in types[1].

I'm not sure about a 3 byte big endian integer. I mean, that's going to compile down to some combination of shifting and masking operations anyway, isn't it? I suspect that if you have some oddball binary format that needs this, it will be possible to write some code to marshal it that compiles down to the best possible asm. Godbolt is your friend here :)

[1]: https://rust-lang.github.io/rfcs/0445-extension-trait-conven...


I agree then that in Rust you could make something consistent.

I think there's no need for explicit shifts. You need to memcpy anyway to deal with alignment issues, so you may as well just copy into the last 3 bytes of a zero-initialized, big-endian, 32-bit uint.

https://gcc.godbolt.org/z/jEnsW8WfE
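The same trick can be sketched in standard C++ without Boost (read_be24 is a made-up name): zero a 4-byte buffer, memcpy the 3 payload bytes into its low end, and do one well-defined assembly at the end. Compilers typically fold the memcpy away entirely.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a 3-byte big-endian integer from an
// arbitrarily aligned buffer. memcpy sidesteps alignment issues, and
// the final assembly works on unsigned values, so there's no UB.
uint32_t read_be24(const unsigned char *p) {
    unsigned char tmp[4] = {0};      // leading byte stays zero
    std::memcpy(tmp + 1, p, 3);      // last 3 bytes of a BE 32-bit value
    return (uint32_t)tmp[0] << 24 | (uint32_t)tmp[1] << 16
         | (uint32_t)tmp[2] << 8  | (uint32_t)tmp[3];
}
```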


That's just constant folding. Here's what it looks like when you actually need to go to memory:

https://gcc.godbolt.org/z/9qGqh6M1E

And I think we're on the same page, it should be possible to get similar results in Rust.


It demonstrates that c++ is even less safe.


You are still casting one pointer type into another, which can result in unaligned access.

If you need to change byte order, you should use a library to achieve that.


Boost.Endian is the library here and this code is safe because the big_uint32_t type has an alignment requirement of 1 byte.

This is why ubsan is silent and not even injecting a check into the compiled code.

You can check the alignment constraints with static_assert (something else you can't do in standard C): https://gcc.godbolt.org/z/KTcf9ax6r


C11 has static_assert: https://gcc.godbolt.org/z/E3bGc95o3

It also has _Generic(), so you can roll up a family of endianness conversion functions and safely change types without blowing up somewhere else with a hardcoded conversion routine.


I think you missed the point of the post and the issues described in it.

In my estimation, libraries like boost are way too big and way too clever and they create more problems than they solve. Also, they don't make me happy.

You're overfocusing on a "problem" that is almost completely irrelevant for most of programming. Big endian is rarely encountered (almost no big-endian hardware left, but some file formats and networking APIs have big-endian data in them). Where you still meet it, you don't do endianness conversions willy-nilly: you have only a few lines in a huge project that should be concerned with it. Similar situation for dealing with aligned reads.

So, with boost you end up with a huge slow-compiling dependency to solve a problem using obscure implicit mechanisms that almost no-one understands or can even spot (I would never have guessed that your line above seems to handle misalignment or byte swapping).

This approach is typical for a large group of C++ programmers, who seem to like to optimize for short code snippets, cleverness, and/or pedantry.

The actual issue described in the post was the UB that is easy to hit when doing bit shifting, caused by the implicit conversions that are defined in C. While this is definitely an unhappy situation, it's easy enough to avoid this using plain C syntax (cast expression to unsigned before shifting), using not more code than the boost-type cast in your above code.

The fact that the UB is so easy to hit doesn't call for excessive abstraction, but simply a revisit of some of the UB defined in C, and how compiler writers exploit it.

(Anecdata: I've written a fair share of C code, while not compression or encryption algorithms, and personally I'm not sure I've ever hit one of the evil cases of UB. I've hit Segmentation faults or had Out-of-bounds accesses, sure, but personally I've never seen the language or compilers "haunt me".)


Do you use UBSAN and ASAN? When you write unit tests do you feed numbers like 0x80000000 into your algorithm? When you allocate test memory have you considered doing it with mmap(4096) and putting the data at the end of the map? (Or better yet, double it and use mprotect). Those are some good examples of torture tests if you're in the mood to feel haunted.


Every day I spend futzing around with endianness is a day I'm not solving 'real' problems. These things are a distraction and a complete waste of developer time: It should be solved 'once' and only worried about by people specifically looking to improve on the existing solution. If it can't be handled by a library call, there's something really broken in the language.

(imo, both c and cpp are mainly advocated by people suffering from stockholm syndrome.)


But that's the point: No one spends a day futzing around with endianness, and there are in fact functions for swapping endianness. You can just call them, no need to hide the swap in a pointer cast expression to a type that has the dereferencing operator overloaded.


I agree with the bulk of this post.

Re the anecdata at the end. Have you ever run your code through the sanitizers? I have. CVE-2016-2414 is one of my battle scars, and I consider myself a pretty good programmer who is aware of security implications.


Very little, quite frankly. I've used valgrind in the past, and found very few problems. I just ran -fsanitize=undefined for the first time on one of my current projects, which is an embedded network service of 8KLOC, and with a quick test covering probably 50% of the codepaths by doing network requests, no UB was detected (I made sure the sanitizer works in my build by introducing a (1<<31) expression).

Admittedly I'm not the type of person who spends his time fuzzing his own projects, so my statement was just to say that the kind of bugs that I hit by just testing my software casually are almost all of the very trivial kind - I've never experienced the feeling that the compiler "betrayed" me and introduced an obscure bug for something that looks like correct code.

I can't immediately see the problem in your CVE here [0], was that some kind of betrayal by compiler situation? Seems like strange things could happen if (end - start) underflows.

[0] https://android.googlesource.com/platform/frameworks/minikin...


This one wasn't specifically "betrayal by compiler," but it was a confusion between signed and unsigned quantities for a size field, which is very similar to the UB exhibited in OP.

Also, the fact that you can't see the problem is actually evidence of how insidious these problems are :)

The rules for this are arcane, and, while the solution suggested in OP is correct, it skates close to the edge, in that there are many similar idioms that are not ok. In particular, (p[1] << 8) & 0xff00, which is code I've written, is potentially UB (hence "mask, and then shift" as a mantra). I'd be surprised if anyone other than jart or someone who's been part of the C or C++ standards process can say why.


> the fact that you can't see the problem is actually evidence of how insidious these problems are

I've looked for a while now, but still can't see it, would you be willing to share?

> (p[1] << 8) & 0xff00

With p[1] being uint8_t? Because then I cannot imagine why, and also fail to see a reason to apply the 0xff00 mask here.

If this is for int8_t instead, the problem you are alluding to is sign extension? If p[1] gets promoted to an int in the negative range (then its representation has the high-order bit set), shifting that to the left is UB.


Yes, I was assuming it was char *, as in the OP, which can be signed. And any left shift of a negative quantity is UB in C (I'm not sure if this is fixed in recent C++), it doesn't have to be what's commonly thought of as overflow.
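Concretely, the "mask, and then shift" mantra for a plain (possibly signed) char buffer looks like this (read_be16 is an invented name):

```cpp
#include <cstdint>

// With plain char, p[1] may be negative; promoting it to int and
// shifting left would be UB. Masking first yields a non-negative
// int in [0, 255], which is then safe to shift.
uint16_t read_be16(const char *p) {
    return (uint16_t)(((p[0] & 0xff) << 8) | (p[1] & 0xff));
}
// (p[1] << 8) & 0xff00 -- shift-then-mask -- is the UB-prone variant.
```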


Raph, clearly you're just not as good a programmer as you think you are.


Why thank you Vitali. Coming from you, that is high praise indeed.


As a very minor counterpoint: I like C because frankly it’s fun. I wouldn’t start a web browser or maybe even an operating system in it today, but as a language for messing around I find it rewarding. I also think it is incredibly instructive in a lot of ways. I am not a C++ developer but ANSI C has a special place in my heart.

Also, I will say that when it comes to programming Arduinos and ESP8266/ESP32 chips, I still find that C is my go to despite things like Alia, MicroPython, etc. I think it’s possible that once Zig supports those devices fully that I might move over. But in the meantime I guess I’ll keep minding my off by one errors.


This has nothing to do with C++ because your example only hides the real issue occurring in the blog post example: The unaligned read on the array. Try adding something like

  printf("%08x\n", *((uint32_t*)(b)));
to your example and you'll see that it produces UB as well. The reason there is no UB with big_uint32_t is probably that that struct/class/whatever it is redefines its dereferencing operator to perform byte-wise reads.

Godbolt example: https://gcc.godbolt.org/z/seWrb5cz7


I fail to see your point. The point of my post is that the abstractions you can build in C++ are as easy to use and as efficient as doing things the wrong, unsafe way...so there's no reason not to do things in a safe, correct way.

Obviously if you write C and compile it as C++ you still end up with UB, because C++ aims for extreme levels of compatibility with C.


Sorry for being unclear. My point is that the example in the blog post does two things, a) it reads an unaligned address causing UB and b) it performs byte-order swapping. The post then goes on about avoiding UB in part b), but all the time the UB was caused by the unaligned access in a).

Of course your example solves both a) and b) by using big_uint32_t, and I agree that this is an interesting abstraction provided by Boost, but I think the takeaway "use C++ for low-level byte fiddling" is slightly misleading: Say I was a novice C++ programmer, saw your example of how C++ improves this but at the same time don't know that big_uint32_t solves the hassle of reading a word from an unaligned address for me. Now I use your pattern in my byte-fiddling code, but then I need to read a word in host endianness. What do I do? Right, I remember the HN post and write *((uint32_t*)(b+1)) (without the big_, because I don't need that!). And then I unintentionally introduced UB. In other words, big_uint32_t is a little "magic" in this case, as it suggests a similarity to uint32_t which does not actually exist.

To be honest, I don't think the byte-wise reading is in any way inappropriate in this case: If you're trying to read a word in non-native byte order from an unaligned access, it is perfectly fine to be very explicit about what you're doing in my opinion. There also is nothing unsafe about doing this as long as you follow certain guidelines, as mentioned elsewhere in this thread.


Sure, the only correct way to read an unaligned value into an aligned data type, in both C and C++, is via memcpy.
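i.e. something along these lines (load_u32 is a made-up name; host byte order, so no swap here):

```cpp
#include <cstdint>
#include <cstring>

// The one blessed way to read a possibly-unaligned uint32_t in both
// C and C++: memcpy into a properly aligned object. Compilers turn
// this into a single load on targets that allow unaligned access.
uint32_t load_u32(const unsigned char *p) {
    uint32_t x;
    std::memcpy(&x, p, sizeof x);
    return x;
}
```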

I still think being able to define a type that models what you're doing is incredibly valuable because as long as you don't step outside your type system you get so much for free.


You could also mask and shift the value byte-wise, just like with an endian swap. Depending on the destination and how aggressively the compiler optimizes memcpy, it could even produce better code, perhaps by working in registers more.

Conceptual consistency is a good thing, but there is a generally higher cognitive load to using C++ over C. I've used both C++ and C professionally, and I've gone deeper with type safety and metaprogramming than most folk. I've mostly used C for the last few years, and I don't feel like I'm missing anything. It's still possible to write hard-to-misuse code by coming up with abstractions that play to the language's strengths.

Operator overloading in particular is something I've refined my opinion on over the years. My current thought is that it's best not to use operators in user/application defined APIs, and should be reserved for implementing language defined "standard" APIs like the STL. Instead, it's better to use functions with names that unambiguously describe their purpose.


C is perfect for these problems. I like teaching the endian serialization problem because it broaches so many of the topics that are key to understanding C/C++ in general. Even if we choose to spend the majority of our time plumbing together functions written by better men, it's nice to understand how the language is defined so we could write those functions, even if we don't need to.


For sure, it's a good way to teach that C is insufficient to deal with even the simplest of tasks. Unfortunately teaching has a bad habit of becoming practice, no matter how good the intention.

With regard to teaching C++ specifically I tend to agree with this talk:

CppCon 2015 - Kate Gregory “Stop Teaching C": https://www.youtube.com/watch?v=YnWhqhNdYyk


One of her slides was titled "Stop teaching pointers!" too. My VP back at my old job snapped at me once because I got too excited about the pointer abstractions provided by modern C++. Ever since that day I try to take a more rational approach to writing native code where I consider what it looks like in binary and I've configured my Emacs so it can do what clang.godbolt.org does in a single keystroke.


For the record, she's not really saying people shouldn't learn this low level stuff... just that 'intro to C++' shouldn't be teaching this stuff first

The biggest problem with C++ in industry is that people tend to write "C/C++" when it deserves to be recognized as a language in its own right.


One does not simply introduce C++. It's the most insanely hardcore language there is. I wouldn't have stood any chance understanding it had it not been for my gentle introduction with C for several years.


Really?

Apparently the first-year students at my university didn't have any issue going from Standard Pascal to C++, in the mid-90's.

Proper C++ was taught using our string, vector and collection classes, given that we were still a couple of years away from ISO C++ being fully defined.

C style programming with low level tricks were only introduced later as advanced topics.

Apparently thousands of students managed to get through the remaining 5 years of the degree.


C++ in the mid 90s was a lot simpler than C++ now.


No one obliges you to write C++20 with SFINAE template meta-programming, using classes with CTAD constructors.

Just like no Python newbie is able to master Python 3.9 full language set, standard library, numpy, pandas, django,...


Well there's a reason universities switched to Java when teaching algorithms and containers after the 90's. C++ is a weaker abstraction that encourages the kind of curiosity that's going to cause a student's brain to melt the moment they try to figure out how things work and encounter the sorts of demons the coursework hasn't prepared them to face. If I was going to teach it, I'd start with octal machine codes and work my way up. https://justine.lol/blinkenlights/realmode.html Sort of like if I were to teach TypeScript then I'd start with JavaScript. My approach to native development probably has more in common with web development than it does with modern c++ practices to be honest, and that's something I talk about in one of my famous hacks: https://github.com/jart/cosmopolitan/blob/4577f7fe11e5d8ef0a...


US universities maybe, there isn't much Java on my former university learning plan.

The only subjects that went full into Java were distributed computing and compiler design.

And during the last 20 years they have already gone back on that decision.

I should note that languages like Prolog, ML and Smalltalk were part of the learning subjects as well.

Assembly was part of electronic subjects where design of a pseudo CPU was also part of the themes. So we had our own pseudo Assembly, x86 and MIPS.


> Well there's a reason universities switched to Java when teaching algorithms and containers after the 90's

Where ? I learned algorithms in C and C++ (and also a bit in Caml and LISP) and I was in university 2011-2014


Yes, this is the curse of knowledge: people who know C++ from decades of exposure are usually unable to bring any newcomer to it.


C++ makes Rust look easy to learn.


Yes, there is some value in using C for teaching these concepts. But the problem I see is that, once taught, many people will then continue to use C and their hand written byte swapping functions, instead of moving on to languages with better abstraction facilities and/or availing themselves of the (as you point out) many available library implementations of this functionality.


What are the advantages of this over a simple function with the following signature?

    uint32_t read_big_uint32(char *bytes);
Having a big_uint32_t type seems wrong to me conceptually. You should either deal with sequences of bytes with a defined endianness or with native 32-bit integers of indeterminate endianness (assuming that your code is intended to be endian neutral). Having some kind of halfway house just confuses things.


The library provides those functions too, but I don't see how having an arithmetic type with well defined size, endiannness and alignment is a bad thing.

If you're defining a struct to mirror a data structure from a device, protocol or file format then the language / type system should let you define the properties of the fields, not necessarily force you to introduce a parsing/decoding stage which could be more easily bypassed.


It is no longer arithmetic if there is an endianness. Some things are numbers and some things are sequences of bytes. Arithmetic only works on the former.


I agree, but a little nitpick: a sequence of bytes does not have a defined endianness. Only groups of more than one byte (i.e. half words, words, double words or whatever you want to call them) have an endianness.

In practice, most projects (e.g. the Linux kernel or the socket interface) differentiate between host (indeterminate) byte order and a specific byte order (e.g. network byte order/big endian).


I'd say: putting multiple of those types into a struct then perfectly describes the memory layout of each byte of a data structure in memory or a network packet, in a way that is reliable and user-friendly for the coder to manipulate.
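That idea can be sketched without Boost, composing a hypothetical big-endian field type into a struct that mirrors a wire format byte for byte (be_u16 and UdpHeaderView are invented names):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical big-endian field type: 1-byte alignment, byte-wise
// storage, converts to a native integer on read.
struct be_u16 {
    unsigned char b[2];
    operator uint16_t() const {
        return (uint16_t)((unsigned)b[0] << 8 | b[1]);
    }
};

// Mirrors a UDP header byte for byte: every member has alignment 1,
// so no padding can appear.
struct UdpHeaderView {
    be_u16 src_port, dst_port, length, checksum;
};
static_assert(sizeof(UdpHeaderView) == 8, "layout must match the wire");

uint16_t udp_length(const unsigned char *packet) {
    UdpHeaderView h;
    std::memcpy(&h, packet, sizeof h);   // alignment-safe copy
    return h.length;                     // the conversion does the swap
}
```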


I see. That does seem helpful once you consider how these types compose, rather than thinking about a one-off conversion. However, I think it would be cleaner to have a library that auto-generated a parser for a given struct paired with an endianness specification, rather than baking the endianness into the types. (Probably this could be achieved by template metaprogramming too.)


Or just use the functions in <arpa/inet.h> to convert from host to network byteorder?


this! use hton/ntoh and be happy.

nitpick: the 64-bit versions (htonll, ntohll) are not universally available yet.


By the same token, I think most uses for C++ these days are nuts. If you're doing a greenfield project 90% of the time it's better to use Rust.

C++ has a multitude of its own pitfalls. Some of the C programmer hate for C++ is justified. After all, it's just C with a pre-processing stage in the end.

There's good reasons why many C projects never considered C++ but are already integrating the nascent Rust. I always hated low level programming until Rust made it just as easy and productive as high level stuff


Wouldn't that cast be UB because it is type punning?


No, because no punning exists here. The code is C++, so this calls a conversion function that likely does the bit manipulation internally in a legal way.


char* is allowed to alias other pointer types.


Hm. Afaik, you are always allowed to convert _to_ a char, but _from_ is not ok in general. See e.g. [0]

[0] https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...


Why is it not ok to convert from a char? Some of the information in the gist is wrong. Type punning with unions, for example, is legal. ANSI X3.159-1988 is quite clear on that point in its aliasing rules. I've seen a lot of comments online saying you must use memcpy to read the bits in a float, or that C++ forbids union punning, but where is that written? If that were true, every math library would break.



