Don't forget the paq family, especially given their tunability. You could probably replace the column representation with a context model that classifies based on the JSON value type, and context mixing, at least in the pre-zpaq days, does the rest. Also check out delta encoding.
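To illustrate the delta-encoding suggestion: slowly changing telemetry channels turn into long runs of small numbers, which any entropy coder downstream squeezes much better. A minimal sketch (function names are mine, not from any particular library):

```typescript
// Delta-encode a numeric telemetry column: store the first value
// as-is, then only the difference to the previous sample.
function deltaEncode(values: number[]): number[] {
  const out: number[] = [];
  let prev = 0;
  for (const v of values) {
    out.push(v - prev);
    prev = v;
  }
  return out;
}

// Inverse transform: a running sum restores the original samples.
function deltaDecode(deltas: number[]): number[] {
  const out: number[] = [];
  let acc = 0;
  for (const d of deltas) {
    acc += d;
    out.push(acc);
  }
  return out;
}

// A slowly drifting sensor reading becomes mostly small deltas:
const samples = [1000, 1001, 1001, 1003, 1002];
const deltas = deltaEncode(samples);
console.log(deltas);              // [1000, 1, 0, 2, -1]
console.log(deltaDecode(deltas)); // round-trips to the original
```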
If you don't have to keep the JSON format exactly as it is, or at least don't have to provide a bit-exact representation, just a data-exact one, you might want to check whether you can build a proper JSON transform into a binary format suitable for this encoding. For example, check whether a float represented inside a string can be reproduced by a printf-style encoder during decoding, and then just store it with a datatype of "single precision, standard POSIX format", or "single precision, this format, followed by a length-prefixed format string". You should make sure that some context models know where you are right now, e.g., whether it's a format string, a content string, a length prefix, etc. Context mixing really is amazing.
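A rough sketch of the float-in-string check, assuming a fixed-point rendering (the function name is hypothetical, and `toFixed` stands in for a printf-style `%.<n>f` encoder). The key safety property is that the transform is only accepted when re-rendering is character-exact, so anything that doesn't round-trip stays a plain string:

```typescript
// Hypothetical transform step: if a JSON string holds a fixed-point
// float, store it as (value, "%.<n>f") instead of the raw characters.
function tryFloatTransform(s: string): { value: number; format: string } | null {
  const m = /^-?\d+\.(\d+)$/.exec(s);
  if (m === null) return null;
  const decimals = m[1].length;
  const value = Number(s);
  // Only accept the transform if re-rendering is character-exact;
  // otherwise we must keep the original string bytes.
  if (value.toFixed(decimals) !== s) return null;
  return { value, format: `%.${decimals}f` };
}

console.log(tryFloatTransform("3.140"));  // { value: 3.14, format: "%.3f" }
console.log(tryFloatTransform("007.5"));  // null: leading zeros don't round-trip
console.log(tryFloatTransform("hello"));  // null: not a number at all
```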
Yes, zstd is definitely on the todo list. The increased performance should be quite welcome on low power edge devices.
I tried to get it to work, but had some trouble with the existing javascript bindings. I have only recently started developing in the javascript ecosystem.
It's not only increased performance. Since you can precompute a dictionary with zstd, you'll (most likely) get much better compression. Or you can at least stick to smaller block sizes.
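zstd's dictionary training itself happens offline (e.g. `zstd --train` over sample files), but the underlying idea is easy to demonstrate with zlib's preset-dictionary support, which Node already ships: seed the compressor with bytes that the small inputs share, so even a single tiny JSON record compresses well. The record layout below is made up for illustration:

```typescript
import * as zlib from "node:zlib";

// A preset dictionary containing the structural bytes the records share.
const dictionary = Buffer.from(
  '{"timestamp":,"channel":,"value":,"unit":"V"}'
);

const sample = Buffer.from(
  '{"timestamp":1700000000,"channel":"bus_voltage","value":28.1,"unit":"V"}'
);

const plain = zlib.deflateRawSync(sample);
const withDict = zlib.deflateRawSync(sample, { dictionary });
console.log(plain.length, withDict.length); // the dictionary version is smaller

// Decompression must use the same dictionary.
const back = zlib.inflateRawSync(withDict, { dictionary });
console.log(back.equals(sample)); // true
```

The same shape applies to zstd, which additionally lets a trained dictionary capture typical values, not just hand-picked structure.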
0.4.15 currently. There are some things that are not production ready (in particular IPNS), but the basic infrastructure (the distributed content-addressed storage) is pretty solid. Discovery also works pretty well. NAT hole punching is impressive. Resource usage is low enough that you can run it on a raspberry pi with some room to spare.
We were storing telemetry that was preprocessed by the MCS. This is preferable for analysis, since you can't afford to pipe the CCSDS packets through the rather slow mission control systems every time you want to plot or analyse the data. It also has the advantage that you store the data exactly as seen in the control room.
The raw CCSDS packet stream also gets stored, but given that the MCS systems are rather inflexible, the raw packets are not as valuable for general analysis.
Have you considered existing scientific columnar data storage formats, like HDF5 or Parquet?
Their main advantage is that they have good, mature implementations in a variety of languages, which would be handy if you ever find Javascript to be too slow.
I developed an archive system for our spacecraft TM/TC using HDF5 with blosc-lz compression for speed reasons (I plan to move to Zstd in the future). One of the main issues I have seen in the industry is the difficulty of upgrading the hardware, so we have to design something that works well on regular hard drives and with minimal RAM requirements.
While HDF5 is good when you manage the whole system, it is always tricky for sharing data with other people, so we are also using SQLite. We lose the compression, but it is very easy to share and people are more familiar with it.
For the export at GSOC we had a streaming REST interface based on Akka Streams. It is used to get the data into Spark-based systems for more complex analysis.
Of course you also lose the compression, but the target system is typically only interested in a small subset of the data.
Sure, I looked at parquet and HDF5. The compression is much better with iptm for the kind of data I am interested in, and the algorithm is also much simpler. It's less than 500 LOC in total.
The reason I wrote this in typescript is that it is nice for prototyping when working with JSON data. If it ever turns out too slow, I would write a rust version. Should not take more than a few days.
I have not profiled it, but I would guess that the most time is spent in the zlib deflate implementation, which is of course already a C library, so I would not expect miracles from a rewrite in rust.
It's amazing just how much of a loaded word "telemetry" has become in the past few years --- 10 years ago it'd just remind me of space missions and such, like in the article, but now I associate the word more with pervasive surveillance and privacy invasion.
We are using SQLite for some things. But for large quantities of simply structured telemetry data it is not very good, space-wise about as good as raw JSON. The reason it is not very efficient is that the disk layout is optimised for fast access rather than compactness.
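A toy comparison makes the layout point concrete (the record fields and values here are invented): the same records deflated row-wise, roughly what a generic row store gives you, versus column-wise, where similar values sit next to each other.

```typescript
import * as zlib from "node:zlib";

// 1000 fake telemetry records: a timestamp, a channel name, a value.
const records = Array.from({ length: 1000 }, (_, i) => ({
  t: 1700000000 + i,
  ch: "bus_voltage",
  v: 28 + (i % 7),
}));

// Row layout: every record repeats the keys and interleaves the fields.
const rows = records.map((r) => JSON.stringify(r)).join("\n");

// Column layout: one array per field, so each field's redundancy
// (repeated names, near-constant values) is contiguous.
const columns = JSON.stringify({
  t: records.map((r) => r.t),
  ch: records.map((r) => r.ch),
  v: records.map((r) => r.v),
});

const rowSize = zlib.deflateSync(Buffer.from(rows)).length;
const colSize = zlib.deflateSync(Buffer.from(columns)).length;
console.log(rowSize, colSize); // the columnar form deflates noticeably smaller
```

The effect gets stronger with per-column transforms like the delta encoding mentioned elsewhere in the thread, since those only work once a column is stored contiguously.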