PacBio Genome Assembly

New sequencing instruments from Oxford Nanopore and Pacific Biosciences can sequence much longer fragments of DNA than any other sequencing technology (over 2000bp compared to 100-500bp), but at a much higher error rate (typically 15% error). The long read length makes the instruments very attractive for de novo assembly of complex genomes, but the high error rate prevents traditional approaches from being used. Design and implement an algorithm for assembling these reads de novo, such as using spaced seeds for nding overlaps or an error correction pipeline that can be applied directly to the reads. Requires strong algorithm and programming skills.

The three major phases that need to be address are (1) Overlapping, (2) Unitigging, and (3) Consensus generation. All three phases will need to be reconsidered to take into account the high error rate and long read length. The overall error rate on recent data is approximately 15% with 11% insertions, 3% deletions, 1% substitutions, although the errors are randomly distributed with across the read with a uniform probability. The reads also do not have uniform read length, but follow an approximately exponential distribution (mean = 3290, max=24,405 on recent data). In your analysis you should assume the reference genome has been oversampled to 30-fold sequence coverage.


       
(Left) Alignment of a real pacbio read against the reference genome.
(Right) Distribution of read lengths on recent data


Related Information

  • PacBio Sequencing - Background on PacBio's single molecule real time sequencing platform
  • Oxford Nanopore - Overview of nanopore sequencing (details are not available on read characteristics)
  • Sequencing E. coli - (Rasko, 11) Paper on sequencing E. coli with PacBio, describes sequencing characteristics, etc
  • Hybrid Error Correction - (Koren, 12) Describes how to use short high quality reads to polish errors
  • BLASR Paper - (Chaisson, 12) Describes the algorithms for PacBio's aligner along with an analysis of errors
  • Celera Assembler - (Myers, 00) Describes the Celera Assembler as applied to the fruit fly genome

Resources

  • Minimus - (Sommer, 07) A simple assembler for benchmarking
  • PacBio reads - PacBio long reads from Lambda phage, Ecoli, Yeast, Parrot
  • FASTQ Format - Describes how the fastq read format with sequences and quality values
  • Alchemy - PacBio's read simulator.
  • pbh5tools Convert PacBio reads in h5 format into fastq format
  • pbsim - Another simulator of PacBio reads

Additional Resources (Added 11/7/2012)



Using Minimus


1. Download and install AMOS (which includes Minimus)

  mkdir ~/proj
  cd ~/proj
  
  git clone git://amos.git.sourceforge.net/gitroot/amos/amos
  cd amos
  ./bootstrap
  ./configure --prefix=`pwd`
  make
  make install

2. Download and install the simple read simulator

Download from here: http://sourceforge.net/projects/readsim/files/latest/download.

  cd ~/proj
  tar xzvf readsim-1.0.tar.gz

3. Prepare a small test genome and small read set using a portion of the E. coli genome.

See the wiki for more details.

  cd ~/proj
  mkdir small
  head -500 readsim-1.0/example/ecoli/NC_000913.fna > small/ref.fa
  ./readsim-1.0/src/readsim.py sim fa \
      --ref small/ref.fa \
      --pre small/reads.2k.10x \
      --tech pacbio_ec \
      --read_mu 2000 --cov_mu 10

4. Assemble the reads with minimus

  cd ~/proj/small

  ~/proj/amos/bin/toAmos -s reads.2k.10k.fasta -o reads.afg
  ~/proj/amos/bin/minimus reads

  ## Dump overlaps
  ~/proj/amos/bin/bank-report -b reads.bnk/ OVL > overlaps.afg


  ## Dump unitig layous
  ~/proj/amos/bin/bank-report -b reads.bnk/ LAY > layouts.afg

  ## Dump the alignment of reads in the unitigs
  ~/proj/amos/bin/analyzeSNPs -T -a -b -i reads.bnk/ > reads.align

5. Examine the overlaps

The overlaps.afg file is a text file that contains the overlap information in an XML-like format that looks like this:

{OVL
adj:N
rds:1,2
scr:0
ahg:139
bhg:1685
}

This means that reads 1 and 2 overlap in "normal" orientation, meaning the end of read 1 overlaps the beginning of read 2. The read ids are internal numeric ids assigned by AMOS, the file reads.bnk/RED.0.map has the mapping from read number to the read name in the original fasta file. The "ahg" (ahang) is the number of bases of the first read in the overlap (read 1) that occur before the overlap and the "bhg" (bhang) are the number of bases of the second read (read 2) that extend past the overlap:

1:  =============================================>        <-- 1685bp -->
2:   <-- 139bp -->  =========================================================>

Because the reads may be from either strand of the DNA, the other valid orientation is "I" for innie:

1:  =============================================>        <-- 1685bp -->
2:   <-- 139bp -->  <=========================================================

Note the ahand and/or bhang may take negative values if the second read overlaps the beginning of the first read (negative ahang), or the first read extends past the second read (negative bhang)

A:   <-- -ahang -->  =========================================================>
B:  =============================================>        <-- -bhang -->

6. Examine the unitig layouts

The layouts.afg file is a text file that contains the overlap information in an XML-like format that looks like this:

{LAY
{TLE
clr:0,338
off:0
src:1
}
{TLE
clr:0,1883
off:139
src:2
}
{TLE
clr:0,3250
off:396
src:3
}
{TLE
clr:0,604
off:450
src:4
}

The first line "{LAY" marks the beginning of a unitig, and the "{TLE" lines mark the start of a read in a unitig. The first read in the unitig is read number 1 (using reads.bnk/RED.0.map to map from number to name) and the approximate offset of the read in the unitig is base 0 (the first read is always at base 0). The usable bases of the read are 0 through 338. The second read in the unitig is read number 2, with offset 139, and usable bases 0 through 1883. If the read is in reverse complement orientation, then the clr values will go from hi to low instead.

 
read 1: off: 0   len: 338     =====================>
read 2: off: 139 len: 1883            =======================================>
read 3: off: 396 len: 3250                    ===========================================================>
read 4: off: 450 len: 604                          =================>
...

7. Examine the read alignment

The reads.align file is a text file that contains the alignment of the reads in the unitigs that looks like this:

1 709 706 G 120 GGGG 30:30:30:30 1:2:3:4
1 710 707 C 120 CCCC 30:30:30:30 1:2:3:4
1 711 708 G 120 GGGG 30:30:30:30 1:2:3:4
1 712 709 C 60 CCTC 30:30:30:30 1:2:3:4
1 713 710 G 120 GGGG 30:30:30:30 1:2:3:4
1 714 711 C 120 CCCC 30:30:30:30 1:2:3:4
1 715 712 G 120 GGGG 30:30:30:30 1:2:3:4
1 716 713 G 120 GGGG 30:30:30:30 1:2:3:4
1 717 714 T 120 TTTT 30:30:30:30 1:2:3:4
1 718 715 C 120 CCCC 30:30:30:30 1:2:3:4
1 719 716 A 120 AAAA 30:30:30:30 1:2:3:4
1 720 716 - 60 A--- 30:30:30:30 1:2:3:4
1 721 717 C 120 CCCC 30:30:30:30 1:2:3:4
1 722 718 A 120 AAAA 30:30:30:30 1:2:3:4
1 723 719 A 120 AAAA 30:30:30:30 1:2:3:4
1 724 720 C 120 CCCC 30:30:30:30 1:2:3:4
1 725 721 G 150 GGGGG 30:30:30:30:30 1:2:3:4:5
1 726 722 T 150 TTTTT 30:30:30:30:30 1:2:3:4:5
1 727 723 T 150 TTTTT 30:30:30:30:30 1:2:3:4:5
1 728 724 A 150 AAAAA 30:30:30:30:30 1:2:3:4:5
1 729 725 C 150 CCCCC 30:30:30:30:30 1:2:3:4:5
1 730 726 T 150 TTTTT 30:30:30:30:30 1:2:3:4:5

The fields are (from the first line):

1            contig id 
709          gapped contig position 
706          contig position excluding gaps in the consensus
G            consensus base
120          consensus quality
GGGG         the bases of the reads covering this position
30:30:30:30  the quality values of the bases covering this position
1:2:3:4      the read ids covering this position (use reads.bnk/RED.0.map to convert to read names)

Note the disagreement of bases at position 712, the gaps at 720, and the start of read 5 at 725. If you just want the consensus sequence for the unitigs, you can look in reads.fasta





Problem 1: Overlapping


The goal of this project is to design an algorithm that can rapidly compute overlaps with at least a specified overlap length and quality given a large collection of long high error reads (max read length=L) in the presence of high error rate . One could compute overlaps between all-pairs of long reads using a dynamic programming algorithm, but this would be extremely expensive O(n^2) read pairs, each requiring O(L^2) time to measure. Instead design a hashing/clustering scheme that can quickly identify pairs of reads that are likely to overlap and/or a parallelization scheme to compute overlaps on many cores in parallel. To get started, you can use the real and simulated PacBio data to compute overlaps between reads, comparing your results to those computed with BLASR.

Resources

  • Minimus - (Sommer, 07) A simple assembler for benchmarking
  • Minimizers - (Roberts, 04) Space efficient read overlapping algorithm
  • PacBio reads - PacBio long reads from Lambda phage, Ecoli, Yeast, Parrot
  • FASTQ Format - Describes how the fastq read format with sequences and quality values
  • BLASR - PacBio's aligner for long reads with high error rates.
  • Alchemy - PacBio's read simulator.
  • pbsim - Another simulator of PacBio reads


Problem 2: Unitigging


The goal of this project is to simplify the overlap graph down to the non-redundant set of overlaps comprising the unitigs (aka string graph). This project should also explore the consequencies of high error rates, non-uniform read lengths, and low coverage found with real data. To get started, you can using Minimus to compute overlaps between long low error rate reads, and you can compare your output to the output from minimus.

Resources

  • Unitigging Paper - (Myers, 95) Describes how to simplifying the overlap graph to construct unitigs
  • String Graph - (Myers, 05) Refines the unitigging algorithm to compute in linear time
  • Minimus - (Sommer, 07) A fast, lightweight genome assembler


Problem 3: Consensus


The goal of this project is given an approximate layout of reads which specifies the relative position of reads, compute their consensus sequence. The exact consensus can be computed using a dynammic programming algorithm, but is exponential in the number of reads. To get started you can use minimum to overlap and compute unitigs between long, low error reads, and try to replace the make-consensus module.

Resources