
If you want more details: we were packing a Row class into a base64-encoded string using an ObjectOutputStream. That's fine for small-scale serialization but sucks at scale, for the reasons mentioned in the post. Sorry we don't have code examples, but it's unclear how useful they'd be given that no one else uses our file format.

A bit more detail on how the format works: each metadata block contains a list of typed columns that defines the schema of a given part. Our map-reduce framework has a bunch of internal logic that tries to reconcile the written Row class with the one the Mapper class is asking for. This lets us ingest different versions of a row within the context of a single job.

I think questions of serialization at scale are generally interesting, although ymmv. I know of one company using Avro, which doesn't let you cleanly update or track schema. They've ended up storing every schema in an HBase table and reserving the first 8 bytes of each row to do a lookup into this table to find the row's schema.
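The overhead being described is easy to demonstrate. Here's a rough Python analogue (pickle + base64 standing in for Java's ObjectOutputStream, a sketch rather than the parent's actual code) comparing a general-purpose object stream against a packed binary encoding whose schema lives in file metadata:

```python
import base64
import pickle
import struct

# A small "row": an id, a score, and a name.
row = {"id": 12345, "score": 3.14, "name": "alice"}

# General-purpose object serialization (analogous to ObjectOutputStream):
# the stream embeds type/structure information alongside every value,
# and base64 then inflates it by a further ~33%.
pickled = base64.b64encode(pickle.dumps(row))

# Schema-aware packing: the schema ("q" = int64, "d" = float64,
# "5s" = 5 bytes) is stored once in the metadata, so the payload
# carries only the values: 8 + 8 + 5 = 21 bytes.
packed = struct.pack("<qd5s", row["id"], row["score"], row["name"].encode())

print(len(pickled), len(packed))  # the object stream is much larger
```

The gap only widens once you multiply by billions of rows, which is presumably why a general-purpose object stream "sucks at scale."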


Avro can store the schema inline or out of line; with inline schemas, it's at the start of the file (embedded JSON), and it describes the schema for all the rows in that file. If you're working with Hive, the schema you put in the Hive metastore is cross-checked with each Avro file read; if any given Avro file doesn't contain a particular column, it just turns up as null for that subset of rows. Spark and Impala work similarly.
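That null-filling behavior is just schema resolution: the reader's schema is matched field-by-field against each file's writer schema. A toy sketch of the rule in plain Python (not the actual Avro implementation; the field names are invented):

```python
# Toy sketch of Avro-style schema resolution: a reader schema is applied
# to records written under an older writer schema. Fields the writer
# never had simply come back as null (None), as with Hive/Spark/Impala.

reader_schema = ["id", "name", "email"]   # what the query expects
writer_schema = ["id", "name"]            # what this older file contains

def resolve(record, writer_fields, reader_fields):
    """Project a written record onto the reader's schema."""
    return {f: record.get(f) if f in writer_fields else None
            for f in reader_fields}

old_record = {"id": 1, "name": "alice"}   # row from an old file
print(resolve(old_record, writer_schema, reader_schema))
# {'id': 1, 'name': 'alice', 'email': None}
```

Real Avro resolution also handles renamed fields (via aliases), reader-side defaults, and type promotions, but the missing-column-becomes-null case is the one that shows up as above.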

I agree serialization at scale is interesting. My particular interest right at this moment is in efficiently doing incremental updates of HDFS files (Parquet & Avro) from observing changes in MySQL tables - not completely trivial because some ETL with joins and unions is required to get data in the right shape.


What do you mean when you say Avro doesn't let you "cleanly update or track schema"?

From what I've read about Avro: 1. It can transform data between two compatible schemas. 2. It can serialize/load schemas off the wire, so you can send the schema in the header.

If schema serialization causes too much overhead, you can set things up so you only send the schema version identifier, as long as the receiver can use that to get access to the full schema.
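Sending only a schema identifier is roughly what the HBase-table trick upthread does with its reserved 8 bytes. A minimal sketch, with an in-memory dict standing in for the schema store and SHA-256 standing in for a real fingerprint (the Avro spec defines its own CRC-64 fingerprint for this purpose):

```python
import hashlib
import json

# Sketch of fingerprint-based schema lookup: instead of shipping the full
# schema with every message, prefix each payload with an 8-byte hash of
# the schema and let the receiver resolve it from a registry.

def fingerprint(schema: dict) -> bytes:
    """8-byte schema identifier derived from a canonical JSON form."""
    canonical = json.dumps(schema, sort_keys=True).encode()
    return hashlib.sha256(canonical).digest()[:8]

registry = {}  # fingerprint -> full schema

def encode(schema: dict, payload: bytes) -> bytes:
    fp = fingerprint(schema)
    registry[fp] = schema          # a real system writes this once, up front
    return fp + payload            # 8-byte prefix + data

def decode(message: bytes):
    fp, payload = message[:8], message[8:]
    return registry[fp], payload   # recover the writer's schema by lookup

schema_v1 = {"name": "Row", "fields": [{"name": "id", "type": "long"}]}
msg = encode(schema_v1, b"\x01\x02\x03")
schema, payload = decode(msg)
assert schema == schema_v1 and payload == b"\x01\x02\x03"
```

The receiver only needs access to the registry (an HBase table, a schema-registry service, whatever), and the per-message overhead drops to 8 bytes regardless of how large the schema is.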


I think what I'd heard about was likely a poorly implemented use of Avro. I haven't actually worked with it.



