
If you want more details: we were packing a Row class into a base64-encoded string using an ObjectOutputStream. That's fine for small-scale serialization but sucks at scale, for the reasons mentioned in the post. Sorry we don't have code examples, but it's unclear how useful they'd be given that no one else uses our file format.

A bit more detail on how the format works: each metadata block contains a list of typed columns that defines the schema of a given part. Our map-reduce framework has a bunch of internal logic that tries to reconcile the written Row class with the one the Mapper class is asking for. This lets us ingest different versions of a row within the context of a single job.

I think questions of serialization at scale are generally interesting, although ymmv. I know of one company using Avro, which doesn't let you cleanly update or track schema. They've ended up storing every schema in an HBase table and reserving the first 8 bytes of each row to do a lookup into this table to find the row's schema.
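The overhead being described is easy to demonstrate. Here's a rough Python analogue (pickle + base64 standing in for Java's ObjectOutputStream, a sketch rather than the parent's actual code) comparing a general-purpose object stream against a packed binary encoding whose schema lives in file metadata:

```python
import base64
import pickle
import struct

# A small "row": an id, a score, and a name.
row = {"id": 12345, "score": 3.14, "name": "alice"}

# General-purpose object serialization (analogous to ObjectOutputStream):
# the stream embeds type/structure information alongside every value,
# and base64 then inflates it by a further ~33%.
pickled = base64.b64encode(pickle.dumps(row))

# Schema-aware packing: the schema ("q" = int64, "d" = float64,
# "5s" = 5 bytes) is stored once in the metadata, so the payload
# carries only the values: 8 + 8 + 5 = 21 bytes.
packed = struct.pack("<qd5s", row["id"], row["score"], row["name"].encode())

print(len(pickled), len(packed))  # the object stream is much larger
```

The gap only widens once you multiply by billions of rows, which is presumably why a general-purpose object stream "sucks at scale."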


Avro can store the schema inline or out of line; with inline schemas, it's at the start of the file (embedded JSON), and it describes the schema for all the rows in that file. If you're working with Hive, the schema you put in the Hive metastore is cross-checked with each Avro file read; if any given Avro file doesn't contain a particular column, it just turns up as null for that subset of rows. Spark and Impala work similarly.
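That null-filling behavior is just schema resolution: the reader's schema is matched field-by-field against each file's writer schema. A toy sketch of the rule in plain Python (not the actual Avro implementation; the field names are invented):

```python
# Toy sketch of Avro-style schema resolution: a reader schema is applied
# to records written under an older writer schema. Fields the writer
# never had simply come back as null (None), as with Hive/Spark/Impala.

reader_schema = ["id", "name", "email"]   # what the query expects
writer_schema = ["id", "name"]            # what this older file contains

def resolve(record, writer_fields, reader_fields):
    """Project a written record onto the reader's schema."""
    return {f: record.get(f) if f in writer_fields else None
            for f in reader_fields}

old_record = {"id": 1, "name": "alice"}   # row from an old file
print(resolve(old_record, writer_schema, reader_schema))
# {'id': 1, 'name': 'alice', 'email': None}
```

Real Avro resolution also handles renamed fields (via aliases), reader-side defaults, and type promotions, but the missing-column-becomes-null case is the one that shows up as above.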

I agree serialization at scale is interesting. My particular interest right at this moment is in efficiently doing incremental updates of HDFS files (Parquet & Avro) from observing changes in MySQL tables - not completely trivial because some ETL with joins and unions is required to get data in the right shape.


What do you mean when you say Avro doesn't let you "cleanly update or track schema"?

From what I've read about Avro: 1. It can transform data between two compatible schemas. 2. It can serialize/load schemas off the wire, so you can send the schema in the header.

If schema serialization causes too much overhead, you can set things up so you only send the schema version identifier, as long as the receiver can use that to get access to the full schema.
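Sending only a schema identifier is roughly what the HBase-table trick upthread does with its reserved 8 bytes. A minimal sketch, with an in-memory dict standing in for the schema store and SHA-256 standing in for a real fingerprint (the Avro spec defines its own CRC-64 fingerprint for this purpose):

```python
import hashlib
import json

# Sketch of fingerprint-based schema lookup: instead of shipping the full
# schema with every message, prefix each payload with an 8-byte hash of
# the schema and let the receiver resolve it from a registry.

def fingerprint(schema: dict) -> bytes:
    """8-byte schema identifier derived from a canonical JSON form."""
    canonical = json.dumps(schema, sort_keys=True).encode()
    return hashlib.sha256(canonical).digest()[:8]

registry = {}  # fingerprint -> full schema

def encode(schema: dict, payload: bytes) -> bytes:
    fp = fingerprint(schema)
    registry[fp] = schema          # a real system writes this once, up front
    return fp + payload            # 8-byte prefix + data

def decode(message: bytes):
    fp, payload = message[:8], message[8:]
    return registry[fp], payload   # recover the writer's schema by lookup

schema_v1 = {"name": "Row", "fields": [{"name": "id", "type": "long"}]}
msg = encode(schema_v1, b"\x01\x02\x03")
schema, payload = decode(msg)
assert schema == schema_v1 and payload == b"\x01\x02\x03"
```

The receiver only needs access to the registry (an HBase table, a schema-registry service, whatever), and the per-message overhead drops to 8 bytes regardless of how large the schema is.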


I think what I'd heard about was likely a poorly implemented use of Avro. I haven't actually worked with it.



