
The article doesn’t mention it, but now I’m curious what kind of read length and error rate they are achieving. This could have a huge impact across all sequencing.


It looks like they're using Oxford Nanopore and PacBio sequencing technologies for the long reads. These are two up-and-coming sequencing technologies focused on extremely long reads. My understanding of both is that their error rates on individual base pairs are too high to reliably determine the actual sequence on their own (something like 15% error rates). Typically the long reads from these technologies are used as a "scaffold" to resolve the large-scale structure of a DNA sequence, while another sequencing technology, usually Illumina, is used to resolve the actual sequence. (Illumina produces short reads, but it produces a lot of them, and the error rate is much lower, about 1%-5%.) In addition, since PacBio and Oxford Nanopore are very different technologies, I'm guessing that they probably have different "error profiles", so they probably partially cover for each other's deficiencies when you use both of them at the same time.

Note: Don't take any of the specific numbers above as gospel. These technologies develop extremely quickly, so it's quite likely that my knowledge of typical error rates is out of date.

In any case, here's the relevant quote from the original link (to phys.org), before it was changed to the less technical press release, which doesn't mention any specific technologies used:

"The new project built on that effort, combining nanopore sequencing with other sequencing technologies from PacBio and Illumina, and optical maps from BioNano Genomics. Using these technologies, the team produced a whole-genome assembly that exceeds all prior human genome assemblies in terms of continuity, completeness, and accuracy, even surpassing the current human reference genome by some metrics."


Illumina error rates are <<1% (~0.1%), whereas Nanopore with newer basecalling software is 5-10%. With UMIs you can get a consensus error that's also <<1%. The error profiles are indeed different: Illumina generally creates substitution errors, whereas Nanopore has trouble with "homopolymers" -- counting how many of the same letter occur in a row.
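To make the homopolymer point concrete, here is a minimal Python sketch that finds runs of identical bases and shows why a miscounted run is easy to miss. The function names and the example sequences are made up for illustration; this is not how any basecaller actually works.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return (base, run_length, start_index) for each run of identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((base, length, pos))
        pos += length
    return runs


def compress_homopolymers(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Hypothetical example: the read drops one T from a run of six.
true_seq = "ACGTTTTTTAGC"
miscalled = "ACGTTTTTAGC"

print(homopolymer_runs(true_seq))
print(homopolymer_runs(miscalled))
# After collapsing runs, the two sequences are indistinguishable.
print(compress_homopolymers(true_seq) == compress_homopolymers(miscalled))
```

The two sequences differ only in the length of the T run, so the error is an insertion/deletion in run length rather than a substitution, which is the distinction being drawn above.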


Oxford error rates are up to 15%. They have optimized published runs that show 5% or even better, but in the real world the error rates are much closer to 15%. However, Oxford read lengths can be absolutely massive compared to even PacBio. PacBio's sequencing is actually much more accurate than Oxford, but read lengths top out at about 15,000 bases, I think. Illumina read lengths are a bit less than 100 bases, but the systems are massively parallel compared to both PacBio and Oxford.


I don't think you can call PacBio up-and-coming at this point, but Nanopore certainly.

And those error rate examples are way, way too high. Illumina is closer to Q30, which is a 1/1000 error rate [0]. A 15% error rate would result in an unusable sequence.

[0] https://emea.illumina.com/science/technology/next-generation...
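For anyone unfamiliar with the Q notation: a Phred quality score Q corresponds to an error probability of 10^(-Q/10). A minimal Python sketch of the conversion in both directions (the function names are my own, not from any library):

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Q30 is the 1/1000 figure quoted above.
print(phred_to_error_prob(30))
# The per-read error rates thrown around in this thread, expressed as Q scores.
for rate in (0.15, 0.05, 0.01):
    print(rate, round(error_prob_to_phred(rate), 1))
```

On this scale a 15% error rate is roughly Q8 and a 5% error rate is roughly Q13, which shows how far apart those numbers are from Q30.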


The sequencer may report a quality score of 30, but that doesn't guarantee that the error rate when you align to the genome will actually be 1/1000. Still, you're right that good quality Illumina data can do significantly better than 1% error rate. You can't always get "good quality" data, but I imagine that the researchers on this project probably could, given the well-controlled experimental setup.

And yes, a 15% error rate does result in a sequence that is unusable for the purposes of actually knowing the sequence. But a bunch of really long reads with 15% error can still be used to resolve the large-scale structure of a sequence, and then the lower-error-rate Illumina reads can be aligned onto this large-scale scaffold in order to resolve the actual sequence. At least, this is my understanding of how these technologies are typically used together, and given the mention of PacBio, nanopore, and Illumina, that seems to be what was done in this case.


Yes, the higher-error ones are used for alignment, and even then too high an error rate in a very repetitive region (especially depending on the error type: misreads vs. skipped bases, etc.) makes it too challenging to build a scaffold to align your Illumina reads to.

As of 2018, the error rate for alignments with nanopore was around 3-6 percent.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053456/


15% is quite outdated. There have been major updates to nanopores and software. Typical single read error rate is less than 5% these days.

Single read accuracy is not as important for such projects. As coverage gets to 50-60X, expected assembly accuracy is Q30 on human.
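A rough way to see why coverage matters more than single-read accuracy is a naive majority-vote model. The Python sketch below assumes errors are independent between reads and that every wrong read agrees on the same wrong call; both are simplifying assumptions of mine, not a description of how real assemblers or polishers work.

```python
from math import comb


def majority_error_prob(p, n):
    """Probability that a strict majority of n independent reads is wrong
    at one position, given a per-read error probability p."""
    k_min = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# Per-read error of 5% at increasing (odd) coverage depths.
for depth in (1, 5, 15, 51):
    print(depth, majority_error_prob(0.05, depth))
```

Under these assumptions the consensus error at ~50X comes out far smaller than Q30, so the Q30 figure is presumably limited by errors that are not independent between reads (for example the homopolymer miscounts mentioned elsewhere in this thread), not by random per-read noise.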


The "ultra long" nanopore reads used in this study are often greater than 100 kbp in length and occasionally up to 1 Mbp.


In the movie he mentions Oxford Nanopore tech and using reads of 100,000 to 1,000,000 base pairs.



