Next-Gen Sequencing Techniques and File Formats

This presentation covers the sequencing technologies researchers use, such as Roche 454 and Illumina/Solexa, and the file formats used to store and analyze the resulting data. It also covers the cost and throughput of sequencing, as well as flow space, color space, and sequence space.

Uploaded on Jun 01, 2023
miriyam

Presentation Transcript

1. IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis

2. Cost and Throughput [Figure: sequencing throughput (gigabases) and cost per Kb. Source: Lucinda Fulton, The Genome Center at Washington University]

3. Sequencing Technologies http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

4. Sequence Space
Roche 454: flow space
  Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  Flow space describes the sequence in terms of these base incorporations
  http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD: color space
  Sequencing by DNA ligation, using synthetic DNA molecules that contain two nested known bases and a fluorescent dye
  Each base is sequenced twice
  http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa: base space
  Single-base extensions of fluorescently labeled nucleotides with protected 3' OH groups
  Sequencing proceeds via cycles of base addition and detection, followed by deprotection of the 3' OH
  http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV Next Generation Sequencing (lecture)
  http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
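To make the three "spaces" concrete, here is a minimal Python sketch that re-expresses an ordinary base-space string in color space and in flow space. The slides do not give the encoding tables, so two things here are assumptions of the sketch: the commonly described two-base color code (the XOR of 2-bit base codes with A=0, C=1, G=2, T=3) and a repeating "TACG" flow order. The function names are hypothetical.

```python
# Illustrative sketch only. The 2-bit base codes and the "TACG" flow order
# are assumptions, not details taken from the slides.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_color_space(seq):
    """Encode bases as a known leading base plus one color (0-3) per adjacent pair."""
    if not seq:
        raise ValueError("empty sequence")
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(encoded):
    """Decode by walking the colors forward from the known leading base."""
    bases = [encoded[0]]
    for color in encoded[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


def to_flow_space(seq, flow_order="TACG"):
    """Return the number of bases incorporated in each successive nucleotide flow."""
    if any(base not in flow_order for base in seq):
        raise ValueError("sequence contains a base missing from the flow order")
    flows = []
    i = 0  # position in the sequence
    k = 0  # index of the current flow
    while i < len(seq):
        flowed = flow_order[k % len(flow_order)]
        count = 0
        # A homopolymer run is incorporated within a single flow.
        while i < len(seq) and seq[i] == flowed:
            count += 1
            i += 1
        flows.append(count)
        k += 1
    return flows
```

For example, `to_color_space("ATGGC")` gives `"A3103"`, and `to_flow_space("TTACG")` gives `[2, 1, 1, 1]`. Each color depends on two adjacent bases, which is one way to read the slide's remark that each base is interrogated twice.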

7. Flexible
  Good: copes with rapidly changing data/tech
  Poor: validation
Human Readable
  Convenient for debugging
  Computer doesn't care!

8. Sequences: FASTA, FASTQ, SAM/BAM
Alignments: SAM/BAM, MAF
Annotations: BED, GTF, GFF3, GVF, VCF
http://genome.ucsc.edu/FAQ/FAQformat.html
http://www.sequenceontology.org/

9. FASTQ FASTA

10. FASTQ: Data Format (sequence data format)
  Text based
  Encodes sequence calls and quality scores with ASCII characters
  Stores minimal information about the sequence read
  4 lines per sequence:
    Line 1: begins with @, followed by the sequence identifier and an optional description
    Line 2: the sequence
    Line 3: begins with +, optionally followed by the sequence identifier and description
    Line 4: encoding of the quality scores for the sequence in line 2
  References/Documentation:
    http://maq.sourceforge.net/fastq.shtml
    Cock et al. (2009). Nuc Acids Res 38:1767-1771.
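The four-line record layout can be read with a few lines of Python. This is a minimal sketch, not a full FASTQ parser: it assumes every record occupies exactly four lines (no wrapped sequence or quality lines), and the function name `read_fastq` is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier line, sequence, quality string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+': %r" % plus)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        yield header[1:], seq, qual


# A made-up record, used only to exercise the function.
EXAMPLE_FASTQ = "@read1 illustrative record\nGATTACA\n+\nIIIIIII\n"
RECORDS = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
```

Here `RECORDS` holds a single tuple: the identifier line without its leading `@`, the seven-base sequence, and a quality string of the same length.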

11. FASTQ Example, from Cock et al. (2009). Nuc Acids Res 38:1767-1771. For analysis, it may be necessary to convert to the Sanger form of FASTQ.

13. Quality scores
  Q = Phred quality score
  P = base-calling error probability
  Q = -10 log10(P)
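The relationship between Q and P is the standard Phred definition, Q = -10 log10(P), so a score of 20 corresponds to a 1 in 100 chance that the base call is wrong. A short sketch of the conversion in both directions (function names are hypothetical):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10 log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)
```

With these, Q=30 maps to an error probability of about 0.001, and an error probability of 0.01 maps back to Q=20.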

14. ASCII characters used to encode quality scores, with their decimal codes:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
Format/Platform   Quality score type   ASCII encoding
Sanger            Phred: 0 to 93       33-126
Solexa            Solexa: -5 to 62     59-126
Illumina 1.3      Phred: 0 to 62       64-126
Illumina 1.5      Phred: 0 to 62       64-126
Illumina 1.8      Phred: 0 to 93       33-126 (Sanger format)
Quality score encodings differ among the platforms.
Most analysis tools require Sanger FASTQ quality score encoding.
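Because the offsets differ, a quality string has to be decoded with the right offset before it can be compared or re-encoded. The sketch below converts an Illumina 1.3+ (Phred+64) quality string to Sanger (Phred+33) and maps a Solexa score to a Phred score using the conversion given by Cock et al., Q_phred = 10 log10(10^(Q_solexa/10) + 1). Function names are hypothetical, and the caller is assumed to know which encoding the input uses.

```python
import math


def decode_quality(qual, offset=33):
    """Turn a quality string into a list of integer scores for the given ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def encode_sanger(scores):
    """Encode integer Phred scores (0 to 93) as a Sanger (Phred+33) quality string."""
    for q in scores:
        if not 0 <= q <= 93:
            raise ValueError("score %d is outside the Sanger range 0 to 93" % q)
    return "".join(chr(q + 33) for q in scores)


def illumina13_to_sanger(qual):
    """Re-encode an Illumina 1.3+ (Phred+64) quality string as Sanger (Phred+33)."""
    return encode_sanger(decode_quality(qual, offset=64))


def solexa_to_phred(q_solexa):
    """Map a Solexa-scaled score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

For instance, `illumina13_to_sanger("@h")` returns `"!I"`: the characters `@` (64) and `h` (104) are Q0 and Q40 under Phred+64, which are `!` (33) and `I` (73) under Phred+33. The range check in `encode_sanger` also catches the common mistake of decoding a Phred+33 string with offset 64, which yields negative scores.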

15. http://main.g2.bx.psu.edu/

17. SAM (Sequence Alignment/Map): alignment data format
  SAM is the output of aligners that map reads to a reference genome
  Tab delimited, with a header section and an alignment section
  Header lines begin with @ (the header section is optional)
  The alignment section has 11 mandatory fields
  BAM is the binary format of SAM
  http://samtools.sourceforge.net/

18. Mandatory Alignment Fields http://samtools.sourceforge.net/SAM1.pdf
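The 11 mandatory fields are tab separated and appear in a fixed order. The sketch below splits one alignment line into named fields, using the field names from the current SAM specification (older versions of the document name some columns differently). The function name and the example record are illustrative.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus optional tags."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# An illustrative alignment record (tab separated).
EXAMPLE_SAM = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
EXAMPLE_RECORD = parse_sam_line(EXAMPLE_SAM)
```

In `EXAMPLE_RECORD`, `POS` is the 1-based leftmost mapping position (7), `CIGAR` is `8M2I4M1D3M`, and `QUAL` is `*`, which means no quality string was stored.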

19. Alignment Examples: alignments in SAM format, e.g. CIGAR string 8M2I4M1D3M http://samtools.sourceforge.net/SAM1.pdf
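A CIGAR string such as 8M2I4M1D3M is a run-length list of alignment operations: 8 aligned bases, a 2-base insertion in the read, 4 aligned bases, a 1-base deletion from the reference, then 3 aligned bases. The sketch below parses it and computes how many reference and read bases the alignment covers. The operation letters follow the SAM specification; the function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join("%d%s" % (n, op) for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)
```

For `8M2I4M1D3M`, `reference_span` is 16 (8 + 4 + 1 + 3) and `query_length` is 17 (8 + 2 + 4 + 3). A read whose 1-based POS is 7 therefore ends at reference position 7 + 16 - 1 = 22.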

20. Annotation Formats
  Mostly tab-delimited files that describe the location of genome features (e.g., genes)
  Also used for displaying annotations on standard genome browsers
  Important for associating alignments with specific genome features and their descriptions
  Knowing format details can be important when translating results!
    BED is zero based
    GTF/GFF are one based

21. GTF http://useast.ensembl.org/info/website/upload/gff.html Annotation data format

22. BED format (annotation data format)
Four-column example (chrom, start, end, name):
chr1   86114265  86116346  nsv433165
chr2   1841774   1846089   nsv433166
chr16  2950446   2955264   nsv433167
chr17  14350387  14351933  nsv433168
chr17  32831694  32832761  nsv433169
chr17  32831694  32832761  nsv433170
chr18  61880550  61881930  nsv433171
Six-column example (chrom, start, end, name, score, strand):
chr1  16759829  16778548  chr1:21667704   270866  -
chr1  16763194  16784844  chr1:146691804  407277  +
chr1  16763194  16784844  chr1:144004664  408925  -
chr1  16763194  16779513  chr1:142857141  291416  -
chr1  16763194  16779513  chr1:143522082  293473  -
chr1  16763194  16778548  chr1:146844175  284555  -
chr1  16763194  16778548  chr1:147006260  284948  -
chr1  16763411  16784844  chr1:144747517  405362  +
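BED lines like these can carry anywhere from three columns upward, so a reader has to cope with a variable number of fields. A minimal sketch, assuming whitespace-separated columns in the standard BED order and handling only the first six (the function name is hypothetical):

```python
BED_COLUMNS = ("chrom", "start", "end", "name", "score", "strand")


def parse_bed_line(line):
    """Parse the first six BED columns of a line; columns beyond the first three are optional."""
    cols = line.split()
    if len(cols) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    record = dict(zip(BED_COLUMNS, cols))
    record["start"] = int(record["start"])  # zero based, inclusive
    record["end"] = int(record["end"])      # exclusive
    if "score" in record:
        record["score"] = int(record["score"])
    return record
```

Parsing `chr1 86114265 86116346 nsv433165` yields a record with only the four keys present, while a six-column line also gets `score` and `strand`.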

23. BED: zero based, start inclusive, stop exclusive
  Length = stop - start
GTF/GFF: one based, inclusive
  Length = stop - start + 1
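The two conventions describe the same interval with different numbers, and off-by-one errors usually come from mixing them. A small sketch of the length formulas and the conversion between the two coordinate systems (function names are hypothetical):

```python
def bed_length(start, end):
    """Feature length in BED coordinates (zero based, end exclusive)."""
    return end - start


def gff_length(start, end):
    """Feature length in GTF/GFF coordinates (one based, end inclusive)."""
    return end - start + 1


def bed_to_gff(start, end):
    """Convert a BED interval to GTF/GFF: only the start shifts by one."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert a GTF/GFF interval to BED: only the start shifts by one."""
    return start - 1, end
```

The BED interval (86114265, 86116346) becomes (86114266, 86116346) in GTF/GFF, and both describe a feature of 2081 bases. The end coordinate stays the same because an exclusive zero-based end and an inclusive one-based end name the same position.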

24. GRCh37 NCBI36