
TLDR: we had shitty code, optimized it, now it runs well. No code examples, nothing.


If you want more details: we were packing a Row class into a base64-encoded string using an ObjectOutputStream. That's fine for small-scale serialization but falls apart at scale, for the reasons mentioned in the post. Sorry we don't have code examples, but it's unclear how useful they'd be given that no one else uses our file format.

A bit more detail on how the format works: each metadata block contains a list of typed columns that defines the schema of a given part. Our map-reduce framework has a bunch of internal logic that tries to reconcile the written Row class with the one the Mapper class is asking for. This lets us ingest different versions of a row within the context of a single job.

I think questions of serialization at this scale are generally interesting, although YMMV. I know of one company using Avro, which doesn't let you cleanly update or track schemas. They've ended up storing every schema in an HBase table and reserving the first 8 bytes of each row to do a lookup into this table to find the row's schema.
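For readers unfamiliar with the pattern, here's a minimal sketch of the kind of ObjectOutputStream-plus-base64 packing described above. The Row and RowCodec names are illustrative, not the actual internal classes; the point is that Java serialization writes a stream header and class metadata into every encoded record, which is the per-record overhead that hurts at scale.

```java
import java.io.*;
import java.util.Base64;

// Illustrative stand-in for the internal Row class.
class Row implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final int value;
    Row(String name, int value) { this.name = name; this.value = value; }
}

public class RowCodec {
    // Pack a Row into a base64 string via ObjectOutputStream. Simple,
    // but the serialization stream header and embedded class metadata
    // bloat every record.
    static String pack(Row row) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(row);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    static Row unpack(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (Row) in.readObject();
        }
    }
}
```

Note how even a two-field record round-trips through a string far larger than the payload itself — the overhead is fixed per record, not amortized across a file.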


Avro can store the schema inline or out of line; with inline schemas, it's at the start of the file (embedded JSON), and it describes the schema for all the rows in that file. If you're working with Hive, the schema you put in the Hive metastore is cross-checked with each Avro file read; if any given Avro file doesn't contain a particular column, it just turns up as null for that subset of rows. Spark and Impala work similarly.
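To make the resolution rule concrete — this is a sketch of the behavior, not the actual Avro API: the reader's schema lists the columns a query expects, and any column absent from a given file's writer schema simply resolves to null for those rows.

```java
import java.util.*;

public class SchemaResolve {
    // Resolve a record written under one schema against the columns the
    // reader expects: present columns pass through, absent ones become null.
    static Map<String, Object> resolve(List<String> readerColumns,
                                       Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String col : readerColumns) {
            out.put(col, record.getOrDefault(col, null)); // missing -> null
        }
        return out;
    }
}
```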

I agree serialization at scale is interesting. My particular interest right at this moment is in efficiently doing incremental updates of HDFS files (Parquet & Avro) from observing changes in MySQL tables - not completely trivial because some ETL with joins and unions is required to get data in the right shape.


What do you mean when you say Avro doesn't let you "cleanly update or track schema"?

From what I've read about Avro:

1. It can transform data between two compatible schemas.

2. It can serialize/load schemas off the wire, so you can send the schema in the header.

If schema serialization causes too much overhead, you can set things up so you only send the schema version identifier, as long as the receiver can use that to get access to the full schema.
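This is essentially the schema-registry pattern: prefix each record with a short fingerprint of its schema and let the receiver look the full schema up by that key. A minimal sketch, assuming SHA-256 over the schema text with the first 8 bytes as the identifier (the class and method names here are illustrative, not any particular registry's API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class SchemaRegistry {
    private final Map<String, String> byFingerprint = new HashMap<>();

    // First 8 bytes of SHA-256 over the schema text, hex-encoded,
    // as a compact schema version identifier.
    static String fingerprint(String schema) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(schema.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) hex.append(String.format("%02x", hash[i]));
        return hex.toString();
    }

    // Sender side: store the schema and get back the key to ship with records.
    String register(String schema) throws Exception {
        String key = fingerprint(schema);
        byFingerprint.put(key, schema);
        return key;
    }

    // Receiver side: recover the full schema from the key in the record header.
    String lookup(String key) {
        return byFingerprint.get(key);
    }
}
```

The HBase setup described upthread is the same idea, with the registry backed by an HBase table instead of an in-memory map.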


I think what I'd heard about was likely a poorly implemented use of Avro. I haven't actually worked with it.


"We had shitty code, optimized it, now it runs [maybe] better" is probably the only fixture in software development.

I'm not sure what kind of code samples one would want in the context of an article as abstract as this one.

I didn't take anything new from the article, but I had a lot of "been there, done that" moments when skimming it. The difference between the OP and me: the OP wrote about it so others can learn, something I never did.



