Utilizing Cloud Technologies for Bioinformatics Applications

Slide 1

Utilizing Cloud Technologies for Bioinformatics Applications. Judy Qiu, xqiu@indiana.edu, www.infomall.org/salsa. Community Grids Laboratory, Pervasive Technology Institute, Indiana University. MTAGS Workshop, SC09, Portland, Oregon, November 16, 2009.

Slide 2

Collaborators in SALSA Project. Microsoft Research Technology Collaboration: Azure (Clouds) – Dennis Gannon, Roger Barga; Dryad (Parallel Runtime) – Christophe Poulain; CCR (Threading) – George Chrysanthakopoulos; DSS (Services) – Henrik Frystyk Nielsen. Indiana University SALSA Technology Team: Geoffrey Fox, Judy Qiu, Scott Beason, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Saliya Ekanayake. Applications: Bioinformatics, CGB – Haixu Tang, Mina Rho, Peter Cherbas, Qunfeng Dong; IU Medical School – Gilbert Liu; Demographics (Polis Center) – Neil Devadasan; Cheminformatics – David Wild, Qian Zhu; Physics – CMS group at Caltech (Julian Bunn). Community Grids Lab and UITS RT – PTI.

Slide 3

Convergence is Happening. Data intensive applications (three basic activities: capture, curation, and analysis/visualization). Data intensive paradigms: cloud infrastructure and runtimes; parallel threading and processes.

Slide 4

MapReduce "File/Data Repository" Parallelism. Map = (data parallel) computation reading and writing data. Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram. Communication via messages/files. (Diagram: Portals/Users → Map 1, Map 2, Map 3 → Reduce; computers/disks.)
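The map/reduce pattern on this slide can be sketched in a few lines: map tasks do data-parallel work on input splits, and the reduce step forms a global sum, here a word histogram. This is a minimal illustration of the model, not code from the SALSA project; all names are invented for the example.

```python
# Minimal sketch of the MapReduce pattern: data-parallel map tasks,
# then a reduce that consolidates partial results into global sums.
from collections import Counter
from functools import reduce

def map_task(lines):
    """Map: data-parallel computation over one input split."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def reduce_task(a, b):
    """Reduce: collective/consolidation phase forming global sums."""
    return a + b

# Three input splits stand in for "Map 1, Map 2, Map 3" in the diagram.
splits = [["cloud mapreduce cloud"], ["mapreduce dryad"], ["cloud"]]
partials = [map_task(s) for s in splits]
histogram = reduce(reduce_task, partials, Counter())
print(histogram["cloud"])  # 3
```

In a real runtime (Hadoop, Dryad) the splits live on distributed disks and the partial results move between machines as messages or files, but the map/reduce contract is the same.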

Slide 5

Cluster Configurations Hadoop/Dryad/MPI DryadLINQ/MPI

Slide 6

Dynamic Virtual Cluster Architecture. Applications: Smith-Waterman dissimilarities, CAP3 gene assembly, PhyloD using DryadLINQ, High Energy Physics, clustering, multidimensional scaling, generative topographic mapping. Runtimes: Apache Hadoop / MapReduce++ / MPI; Microsoft DryadLINQ / MPI. Infrastructure software: Linux bare-system, Linux virtual machines (Xen virtualization), Windows Server 2008 HPC bare-system, Windows Server 2008 HPC on Xen virtualization; XCAT infrastructure. Hardware: iDataplex bare-metal nodes. Dynamic virtual cluster provisioning via XCAT supports both stateful and stateless OS images.

Slide 7

Cloud Computing: Infrastructure and Runtimes. Cloud infrastructure: outsourcing of servers, computing, data, file space, etc., handled through Web services that control virtual machine lifecycles. Cloud runtimes: tools (for using clouds) to do data-parallel computations, e.g. Apache Hadoop, Google MapReduce, Microsoft Dryad, and others. Designed for information retrieval, but excellent for a wide range of science data analysis applications. Can also do much traditional parallel computing for data mining if extended to support iterative operations. Not usually run on virtual machines.

Slide 8

Alu and Sequencing Workflow. Data is a collection of N sequences, each hundreds of characters long. These cannot be treated as vectors because there are missing characters. "Multiple Sequence Alignment" (making vectors of characters) does not seem to work if N is larger than O(100). Can calculate N^2 dissimilarities (distances) between sequences (all pairs). Find families by clustering (much better methods than Kmeans); as there are no vectors, use vector-free O(N^2) methods. Map to 3D for visualization using Multidimensional Scaling (MDS), also O(N^2). N = 50,000 runs in 10 hours (all of the above) on 768 cores. Our collaborators just gave us 170,000 sequences and want to look at 1.5 million, so we will develop new algorithms! MapReduce++ will do all steps, as MDS and clustering just need MPI Broadcast/Reduce.
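The O(N^2) all-pairs step described above can be sketched as follows. A plain edit distance stands in for the Smith-Waterman-Gotoh dissimilarity the workflow actually uses, and the function names are invented for the example; the point is that every (i, j) pair is independent, which is what makes the step "doubly data parallel".

```python
# Sketch of the all-pairs dissimilarity computation over sequences.
# edit_distance is a simple Levenshtein distance used as a stand-in for
# the Smith-Waterman-Gotoh dissimilarity of the real workflow.
from itertools import combinations

def edit_distance(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def all_pairs(seqs):
    """Fill the symmetric N x N dissimilarity matrix; each (i, j) block
    is independent, so pairs can be farmed out to map tasks."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d

D = all_pairs(["ACGT", "ACGA", "TTGT"])
```

The resulting matrix feeds both the vector-free clustering and the MDS step, which is why computing it efficiently at N in the hundreds of thousands dominates the workflow.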

Slide 9

Pairwise Distances – ALU Sequences. Calculate pairwise distances for a collection of genes (used for clustering, MDS). O(N^2) problem, "doubly data parallel" at the Dryad stage. Performance close to MPI. Performed on 768 cores (Tempest cluster): 125 million distances in 4 hours 46 minutes. Processes work better than threads when used inside vertices: 100% utilization versus 70%.

Slide 12

Hierarchical Subclustering

Slide 13

Pairwise Clustering: 30,000 Points on Tempest. Clustering by deterministic annealing. (Chart: parallel overhead for thread versus MPI parallelism.)

Slide 14

Dryad versus MPI for Smith-Waterman. Flat is perfect scaling.
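"Flat is perfect scaling" can be made precise with the usual parallel-efficiency metric: if time per unit of work stays flat as cores grow, efficiency T(1) / (p * T(p)) stays near 1. A small sketch, with purely illustrative timings (not measurements from these experiments):

```python
# Parallel efficiency: 1.0 means perfect scaling, lower means overhead.
def efficiency(t1, p, tp):
    """t1: serial time; p: core count; tp: time on p cores."""
    return t1 / (p * tp)

print(round(efficiency(100.0, 8, 12.5), 2))  # 1.0 -> flat, perfect scaling
print(round(efficiency(100.0, 8, 15.0), 2))  # 0.83 -> visible overhead
```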

Slide 15

Hadoop/Dryad Comparison: "Homogeneous" Data. Time per alignment (ms): Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex. Using real data with standard deviation/length = 0.1.

Slide 16

Hadoop/Dryad Comparison: Inhomogeneous Data I. Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed. Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).

Slide 17

Hadoop/Dryad Comparison: Inhomogeneous Data II. This shows the natural load balancing of Hadoop MR's dynamic task assignment using a global pipeline, in contrast to DryadLINQ's static assignment. Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).

Slide 18

Hadoop VM Performance Degradation. Performance degradation = (T_vm – T_baremetal) / T_baremetal. 15.3% degradation at the largest data set size.
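As a worked instance of the formula above (the timings here are made-up placeholders, not measurements from the study):

```python
# Relative slowdown of a virtualized run versus bare metal.
def degradation(t_vm, t_baremetal):
    return (t_vm - t_baremetal) / t_baremetal

# e.g. a 1153 s VM run versus a 1000 s bare-metal run:
print(round(degradation(1153.0, 1000.0), 3))  # 0.153, i.e. 15.3%
```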

Slide 19

PhyloD using Azure and DryadLINQ. Derive associations between HLA alleles and HIV codons, and between codons themselves.

Slide 20

Mapping of PhyloD to Azure

Slide 21

PhyloD Azure Performance. Efficiency versus number of worker roles in the PhyloD prototype run on the Azure March CTP. Number of active Azure workers during a run of the PhyloD application.

Slide 22

Iterative Computations: K-means and Matrix Multiplication. (Charts: performance of K-means; parallel overhead of matrix multiplication.)

Slide 23

Kmeans Clustering: an iteratively refining operation. New maps/reducers/vertices in each iteration; file-system-based communication. Loop unrolling in DryadLINQ gives better performance, but the overheads are extremely large compared to MPI. CGL-MapReduce is an example of MapReduce++: it supports the MapReduce model with iteration (data stays in memory and communication is through streams, not files). (Chart: time for 20 iterations; large overheads.)
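The iterative structure described above can be sketched as k-means expressed in map/reduce form, with the static data held in memory across iterations as in the MapReduce++ model (plain MapReduce would re-read files and spawn fresh tasks every loop). This is an illustrative single-process sketch, not CGL-MapReduce code; in a real run the points would be split across map workers.

```python
# K-means as iterated map/reduce: points are the static in-memory data,
# only the (small) centers move between iterations.
def assign_map(points, centers):
    """Map: each point contributes (nearest-center-index, partial sum)."""
    out = {}
    for p in points:
        k = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        sums, cnt = out.get(k, ([0.0] * len(p), 0))
        out[k] = ([s + a for s, a in zip(sums, p)], cnt + 1)
    return out

def update_reduce(partials, old_centers):
    """Reduce: merge partial sums into new center positions."""
    centers = list(old_centers)
    for k, (sums, cnt) in partials.items():
        centers[k] = tuple(s / cnt for s in sums)
    return centers

def kmeans(points, centers, iters=20):
    for _ in range(iters):  # data stays resident; no per-iteration file I/O
        centers = update_reduce(assign_map(points, centers), centers)
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

The contrast with file-based MapReduce is the loop body: here only the centers are communicated each iteration, which is why streaming, cacheable tasks remove most of the per-iteration overhead the slide measures.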

Slide 24

MapReduce++ (CGL-MapReduce). Streaming-based communication: intermediate results are directly transferred from the map tasks to the reduce tasks, eliminating local files. Cacheable map/reduce tasks: static data stays in memory. Combine phase to merge reductions. The user program is the composer of MapReduce computations. Extends the MapReduce model to iterative computations. (Diagram: User Program → MR Driver (D) → Map Workers (M) and Reduce Workers (R) on worker nodes via a Pub/Sub Broker Network; MRDaemon; file system with data splits; communication.)

Slide 25

SALSA HPC Dynamic Virtual Cluster Hosting. Monitoring infrastructure: SW-G using Hadoop on Linux bare-system, SW-G using Hadoop on Linux on Xen, SW-G using DryadLINQ on Windows Server 2008 bare-system. Cluster switching from Linux bare-system to Xen VMs to Windows 2008 HPC via the XCAT infrastructure on iDataplex bare-metal nodes (32 nodes). SW-G: Smith-Waterman-Gotoh dissimilarity computation, a typical MapReduce style application.

Slide 26

Monitoring Infrastructure: Pub/Sub Broker Network, monitoring interface, virtual/physical clusters, summarizer, XCAT infrastructure switcher, iDataplex bare-metal nodes (32 nodes).

Slide 27

SALSA HPC Dynamic Virtual Clusters

Slide 28

Application Classes (parallel programming/hardware in terms of 5 "application architecture" structures).

Slide 29

Applications & Different Interconnection Patterns. (Diagram: input → map → output; pairwise Pij → reduce; the domain of MapReduce and its iterative extensions versus MPI.)

Slide 30

Summary: Key Features of our Approach II. Dryad/Hadoop/Azure are promising for biology computations. Dynamic virtual clusters allow one to switch between different modes. Overhead of VMs on Hadoop (15%) is acceptable. Inhomogeneous problems currently favor Hadoop over Dryad. MapReduce++ allows iterative problems (much linear algebra / data mining) to use the MapReduce model efficiently.
