Although he's technically correct, I think he's missing the point.
> Neither the pauseless algorithm, nor more recent [generational] C4 algorithm impose any multi-core related performance overheads compared to the most efficient collectors out there. There are no added cross-thread synchronization of any kind compared to stop-the-world algorithms. In fact, the whole point of a concurrent algorithm like C4 is to dramatically reduce GC-related synchronization overhead [stop-the-world pauses are dramatic synchronization points between GC and mutator threads].
The main overhead of a concurrent, compacting collector is the need for a read barrier. You simply cannot move data while the program is running without a way of redirecting read requests to an object's latest memory location. It's true that this is technically not a multi-core overhead (it's a multi-threading overhead), but the point is moot: with a single core, you don't need multiple threads (in a managed language, you can simply use fibres, i.e. user-space threads, as Go and Haskell do). The only reason to use a multi-threaded (concurrent) GC is to support multiple cores.
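To make the read-barrier cost concrete, here is a minimal sketch of a Brooks-style forwarding-pointer barrier. This is an illustration, not C4's actual mechanism (C4 uses a loaded-value barrier), but it shows the same idea: every read of a heap reference pays an extra indirection so stale pointers still reach relocated objects. The `Obj` layout and `readBarrier` names are hypothetical.

```go
package main

import "fmt"

// Hypothetical object layout: every object carries a forwarding pointer.
// Before the collector moves an object, the pointer refers to the object
// itself; after a move, it refers to the new copy.
type Obj struct {
	forward *Obj // self, or the relocated copy
	payload int
}

// Every read of a heap reference goes through this indirection —
// this per-read cost is the "read barrier" overhead described above.
func readBarrier(ref *Obj) *Obj {
	return ref.forward
}

func main() {
	from := &Obj{payload: 42}
	from.forward = from // not yet moved: forwards to itself

	fmt.Println(readBarrier(from).payload)

	// Collector compacts: copy the object and install a forwarding pointer.
	to := &Obj{payload: from.payload}
	to.forward = to
	from.forward = to

	// A stale reference still reaches the latest location through the barrier.
	fmt.Println(readBarrier(from) == to)
}
```

A real collector would emit this indirection (or a check that usually branches over it) at every compiled heap load, which is why the overhead is per-read rather than per-collection.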
I don't know enough about GC to argue the details, but the way I read his response (including the next paragraph, which is not about multicore), he seems to claim that a C4-style collector does not carry a severe performance penalty, contrary to the email from Ian Lance Taylor, so I'm not sure he missed the point. He also has a graph that supposedly shows the penalty is not severe.
He also has around five more responses in that thread, not just that one.