This seems to have a lot in common with the last big outage, particularly the API request overload and the EBS replication. It seems like the system needs to distinguish better between a single node going down, which requires a re-mirror, and most of an availability zone going down.
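To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is purely hypothetical and not how EBS actually works: the `ZoneHealth` type, the `should_remirror` function, and the 20% threshold are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ZoneHealth:
    """Hypothetical snapshot of node reachability in one availability zone."""
    total_nodes: int
    unreachable_nodes: int


def should_remirror(health: ZoneHealth, zone_outage_threshold: float = 0.2) -> bool:
    """Return True when the failure looks isolated enough to re-mirror.

    The threshold is an assumed value, not a known production setting.
    """
    if health.total_nodes <= 0:
        raise ValueError("total_nodes must be positive")

    unreachable_fraction = health.unreachable_nodes / health.total_nodes

    # A few nodes down: probably isolated failures, so re-mirror as usual.
    # Many nodes down at once: probably a zone-wide event, so hold off
    # instead of flooding the remaining capacity with re-mirror requests.
    return health.unreachable_nodes > 0 and unreachable_fraction < zone_outage_threshold
```

With something like this, 3 unreachable nodes out of 1000 would trigger a re-mirror, while 600 out of 1000 would be treated as a zone-level event and re-mirroring would be held back.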