What would be a good data representation at that scale? Is a typed/schema-driven approach preferable?
For ingesting such massive amounts of data, it seems a sensible serialization would have minimal divergence between the wire and memory representation, such as Cap'n Proto.
Handling composite types seems tricky too -- e.g., when prefixing lengths to strings or lists, the extra CPU time for variable-sized integers appears preferable to the I/O overhead of the billions of wasted bytes that come with a fixed-size prefix. I'm assuming explicit begin/end delimiters aren't even an option here.
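To make the trade-off concrete, here's a minimal sketch of an unsigned LEB128-style varint (the same scheme protobuf uses) against a fixed 4-byte length prefix. The helper names are mine, just for illustration:

```python
import struct

def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, MSB = continuation flag."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def fixed_prefix(payload: bytes) -> bytes:
    # Fixed 4-byte little-endian length prefix: simple, but 3 wasted
    # bytes for every short string.
    return struct.pack("<I", len(payload)) + payload

def varint_prefix(payload: bytes) -> bytes:
    # Lengths under 128 cost a single prefix byte.
    return encode_varint(len(payload)) + payload

s = b"hello"
# fixed_prefix(s) is 9 bytes; varint_prefix(s) is 6 bytes.
# Over billions of short records, that 3-byte gap dominates I/O.
```

The decoding cost is a short loop (or a branchless word-at-a-time scan), which is usually cheap relative to the saved bandwidth.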
Cap'n Proto is actually a pretty good example of a modern, high-performance serialization. I've been using it as a template in discussions of IoT wire representations.
Fast wire encodings are almost universally TLV ("tag-length-value") style serializations. Delimiter scanning is inefficient, and it also leaves the parser with little ability to predict what is coming over the wire, so it can't optimize processing.
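A toy TLV codec shows why: because the length arrives before the value, the parser knows exactly how far to jump, so it can skip unknown tags or prefetch without ever scanning for a delimiter. This is a minimal sketch, not any real protocol; the tag numbers and the 1-byte-tag/4-byte-length layout are arbitrary choices:

```python
import struct

def tlv_encode(tag: int, value: bytes) -> bytes:
    # 1-byte tag, 4-byte big-endian length, then the raw value.
    return struct.pack(">BI", tag, len(value)) + value

def tlv_decode(buf: bytes, offset: int = 0):
    """Return (tag, value, next_offset). The up-front length lets the
    parser jump straight past the value instead of scanning for an
    end marker."""
    tag, length = struct.unpack_from(">BI", buf, offset)
    start = offset + 5
    return tag, buf[start:start + length], start + length

msg = tlv_encode(1, b"temp=21.5") + tlv_encode(2, b"rh=40")
tag, val, off = tlv_decode(msg)       # first record
tag2, val2, _ = tlv_decode(msg, off)  # second record, reached by offset jump
```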
While older serializations tend to be byte-oriented, newer formats use word-sized "frames" (even if unaligned) to enable nearly branchless, parallel processing of the bytes in the stream using bit-twiddling techniques or vector instructions.
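One classic instance of the bit-twiddling idea, sketched here in Python for clarity (a real decoder would do this in C/SIMD): load 8 bytes as a single 64-bit word and locate a varint's terminating byte -- the first byte whose continuation bit is clear -- with a few word-wide operations instead of a per-byte branch:

```python
import struct

def varint_end_index(chunk8: bytes) -> int:
    """SWAR-style scan over exactly 8 bytes: returns the index of the
    first byte whose MSB (the varint continuation bit) is clear."""
    (word,) = struct.unpack("<Q", chunk8)     # little-endian: byte 0 lands lowest
    # Set bit 7 of every byte that terminates a varint, clear it elsewhere.
    terminators = ~word & 0x8080808080808080
    # Isolate the lowest set bit; its bit position / 8 is the byte index.
    return ((terminators & -terminators).bit_length() - 1) // 8

# b"\xac\x02" is the varint for 300: byte 0 continues, byte 1 terminates.
varint_end_index(b"\xac\x02\x00\x00\x00\x00\x00\x00")  # -> 1
```

The same pattern generalizes: the parser can classify all 8 bytes at once and branch only on the summarized result, which is what makes word-framed formats amenable to vectorization.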