A Course on Bioinformatics .

Uploaded on:
A Course on Bioinformatics. Ming Li Bioinformatics Lab Software engineering Dept., UCSB. Part 0. Why Study Bioinformatics?. The pattern of hereditary information development Numerous specialists foresee "21st century will be a century of biotechnology".
Slide 1

A Course on Bioinformatics Ming Li Bioinformatics Lab Computer Science Dept., UCSB

Slide 2

Chapter 0. Why Study Bioinformatics? The pattern of hereditary information development Many specialists foresee "21st century will be an era of biotechnology". Genomes: Human, Rice, Mouse, Yeast, 50 species microbes, greatest computerized gold mine ever. 30 billion in year 2005

Slide 3

Protein DNA (Genotype) Chapter 1. Science Phenotype

Slide 4

Animal CELL Mitochondrion Cytoplasm Nucleolus (rRNA incorporated) Nucleus Plasma membrance Cell coat Chromatin Lots of other stuff/organelles/ribosome

Slide 5

Two sorts of Cells Prokaryotes – no core (microscopic organisms) Their genomes are roundabout Eukaryotes – have core (animal,plants) Linear genomes with different chromosomes in sets. When blending up, they look like Middle: centromere Top: p-arm Bottom: q-arm

Slide 6

Mitosis and Meiosis Mitosis: homologous chromosomes match up, copy, one set to every cell. Meiosis: chromosomes split – to haploid (conceptive) cells. Haploid Number of Chromosomes: Human: 23 Rice: 12 Fruit Fly: 4 Corn: 10 Chimpanzee: 24 House mouse: 20

Slide 7

The DNA Sequences All are made of H (hydrogen), C (carbon), N (nitrogen), O (oxygen), S (sulfur), P (phosphorus). A (solitary chain) DNA succession resembles: 5\' o CH Sugar - base A,C,G,T,U 2 c Phosphate (PO ) c 4 Sugar - base A nucleotide o 3\' phosphate Ribose –RNA Deleting circumnavigated O,  Deoxyribose - DNA Sugar - base and so forth

Slide 8

The 4 bases Thymine Adenine H C A-T o N H N C N H N C H N C N C N C o H Note: this is level! H o H N H N C Uracil replaces T in RNA C N H N H N C N C G-C N C o N H Cytosine Guanine H

Slide 9

Human: 3 billion bases, 30k qualities. E. coli: 5 million bases, 4k qualities T An A T C G cDNA T An invert interpretation A T interpretation translation G C Protein mRNA C G (20 amino acids) (A,C,G,U) G C Codon: three nucleotides encode an amino corrosive. 64 codons 20 amino acids, some w/more codes A T C G T An A T

Slide 10

Genes and Proteins One quality encodes one* protein. Like a program, it begins with begin codon (e.g. ATG), then every three code one amino corrosive. At that point a stop codon (e.g. TGA) means end of the quality. Once in a while, amidst an (eukaryotic) quality, there are introns that are joined out (as garbage) amid interpretation. Great parts are called exons. This is the undertaking of quality finding.

Slide 11

Glycine (GLY) GG* Alanine(ALA) GC* Valine (VAL) GT* Leucine (LEU) CT* Isoleucine (ILE) AT(*-G) Serine (SER) AGT, AGC Threonine (THR) AC* Aspartic Acid (ASP) GAT,GAC Glutamic Acid(GLU) GAA,GAG Lysine (LYS) AAA, AAG Start: ATG, CTG, GTG Arginine (ARG) CG* Asparagine (ASN) AAT, AAC Glutamine (GLN) CAA, CAG Cysteine (CYS) TGT, TGC Methionine (MET) ATG Phenylalanine (PHE) TTT,TTC Tyrosine (TYR) TAT, TAC Tryptophan (TRP) TGG Histidine (HIS) CAT, CAC Proline (PRO) CC* Stop TGA, TAA, TAG A.A. Coding Table

Slide 12

Bad Genes - > Genetic Diseases Hemophilia: on X chromosome. Cystic firosis: on chr. 7, CFTR quality. 4% Caucasians convey the faulty quality. (latent) Sickle-Cell Anemia: single nucleotide change in the primary exon of beta-globin quality (expels a cutting site). 1 in 12 African Americans are bearers. (wiped out for homozygotes) BRCA1 quality (chr. 17q) – in charge of ½ acquired bosom malignancy (10% of bosom growth) Fragile X disorder (rationally impede) – 1 in 1250 guys, 2500 females (command, however females have incompletely communicated great quality). FMR-1 quality: tri-nucleotide rehashes >200 causes infection. P53 quality: chr. 17p, in charge of ½ of all diseases X-appraised: XX, XY, XO(f), XXY(m), XYY(m)

Slide 13

Where to get information? Larry Miller: GenBank http://www.ncbi.nlm.nih.gov Dr. Huandong Sun: SWISS-PROT: http://www.expasy.ch/sprot PDB: http://www.pdb.bnl.gov/Go to our Bioinformatics Lab site: www.cs.ucsb.edu/~mli/Bioinf/assets/index.html for all the data.

Slide 14

Chapter 2. Research and Industry Bioinformatics Labs and training programs built up in all real colleges, scattered in Biology, Physics, Computer Science Research themes: Gene-discovering Mapping Sequencing/succession gathering. Grouping Comparison: arrangement, database look Mining DNA/proteins, discovering themes Genome plan DNA clusters Computational proteomics, protein collapsing Phylogeny recreation and examination

Slide 15

The Commercial Market Current bioinformatics market is worth 300 million/year. Half information, half programming. $2 billion/year bioinformatics showcase in 5 years, anticipated by Oscar Gruss & Son. Extensive variety of various requests Bioinformatics programming require modern calculation bolster, running from probabilistic/estimate calculations, information mining, learning calculations, databases, to GUI plan

Slide 16

Celera + numerous Lion BioScience Netgenics DoubleTwist, eBioinformatics BUSINESS Landscape Universities & Research Labs Pharmaceutical Companies Hospitals, Biotech firms Sell Data & Service Sell huge Systems Sell Web Service

Slide 17

Bioinformatics Companies Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Total: ~50.

Slide 18

Challenges to Computer Science Enormous size of genomics information reasonable for asymptotic investigation Many NP-difficult issues oppose moronic/non-proficient methodologies: different arrangement, far off homology, theme discovering, protein collapsing, phylogeny, quality relationship in expression information, mining and learning, … Approaching these issues from software engineering point of view will be productive.

Slide 19

Algorithmic Techniques in CB Dynamic programming (arrangement, and so forth) Divide and vanquish (Protein collapsing) Approximation calculations (grouping get together, phylogeny) Greedy calculations (succession get together) Heuristics Linear programming and unwinding

Slide 20

Some CS languages: NP-Complete: simple to figure, hard to discover. Estimation calculation: If the insignificant arrangement is f, your answer is g>f, then the guess proportion is g/f. PTAS (Polynomial time guess plot): this is best sort of estimate calculations: for any e, we can accomplish (1+ e )ideal in polynomial time (type may rely on upon e ).

Slide 21

Chapter 3. DNA Sequencing Two approaches to duplicate DNA: 1. Polymerase Chain Reaction (1986): make many duplicates of a part (~500). Needs groundworks (end fragments). Divide into 2. Temper groundworks (5\'- 3\', and 3\'- 5\'). Make two twofold chains. Rehash: 1,2,4,8,16, … 3\' … 5\' 5\' 3\' 5\' 3\' … 5\' 3\'

Slide 22

2. Cloning. Embed a vast bit of DNA into a cloning vector (infection, microscopic organisms, yeast - YAC). At that point the vector is embedded into a host cell to copy actually. DNA Sequencing: Make many duplicates (single strand) Cut them into sections of lengths ~500. For each section of length L, utilize some procedure like PCR, producing all lengths 1 ~ L with some fluorescence color. In old plan, you create all sections end with A, then with C, then G, then T, run them in 4 paths of electrophoresis gel. In the new plan, you have 4 hues (of the color) all parts in 1 path. At that point amass all parts into the most limited basic superstring by GREEDY: more than once consolidate the combine with max cover until wrap up.

Slide 23

Shortest Common Superstring In FOCS 1990, we began formalize and break down the accompanying learning issue: Infer orginal DNA arrangement from sections. Or, on the other hand: given n strings, locate the most brief normal superstring. (1994, J. ACM) The issue was ended up being NPC, 1980. Open for a long time: does GREEDY work? (I.e. does it give direct estimation?) We unraveled this, demonstrating 3, STOC\'91. Enhancements by many individuals to: 2.89, 2.81, 2.79, 2.75, 2.66, … Formal Statement: Given S={s 1 , … s n }, locate a briefest s to such an extent that for all i, s i is a substring of s. E.g. alf ate half deadly alpha horse feed  lethalphalfalfate

Slide 24

Theorem. Insatiable accomplishes 4. Evidence. Given S={s 1 , … ,s m }, build G: Nodes are s 1 , … ,s m Edges: if then include edge: where pref is the pref length. I.e. |s i |=pref+overlap length with s j |SCS(S)| = length most limited Hamilton cycle in G (Modified) Greedy repeated: discover all cycles with least weights in G, then open cycle, link to acquire the last superstring. s j pref s i pref s i s j

Slide 25

This base cycle exists C s i Assuming starting Hamilton cycle has w(C) = n Then blending s i with s j is equal to breaking into two cycles. We have: w(C 1 )+ w(C 2 ) <= n Proof: We combined (s i , s j ) on the grounds that they have max cover. Picture appears: d(s i ,s j )+d(s\'\',s\')<d(s i ,s\')+d(s\'\',s j ) Continue this procedure end with self-cycles: C 1 , C 2 , C 3 , C 4 , … S w(C i ) <= n. s j s\' s\'\' … S" s j s i C 1 S\' s i s j s\' s\'\' C 2

Slide 26

Then we open cycles Let W i =w(C i ) L i =| longest string in C i | |open C i |<= W i + L i n >= S W i Lemma. S 1 and S 2 cover <= w 1 +w 2 S (L i - 2W i ) <= n, by lemma, since L i \'s must be in conclusive SCS. |S|< S (L i +W i ) = S (L i - 2W i )+ S 3W i <= n + 3n =4n. QED w 1 w 1 w 1 w 1 s 1 w 2 w 2 w 2 s 2

Slide 27

Longest Common Subsequence (LCS). V=v 1 v 2 … v n W=w 1 w 2 … w m s(i,j) = length of LCS of V[1..i] and W[1..j] Dynamic Programming: s(i-1,j) s(i,j)=max s(i,j-1) s(i-1,j-1)+1,v i =w j Sequence Alignment s(i,j

View more...