Utilizing MapReduce Technologies in Bioinformatics and Medical Informatics

Slide 1

Utilizing MapReduce Technologies in Bioinformatics and Medical Informatics
Judy Qiu, xqiu@indiana.edu, www.infomall.org/salsa
Community Grids Laboratory, Pervasive Technology Institute, Indiana University
Computing for Systems and Computational Biology Workshop, SC09, Portland, Oregon, November 16, 2009

Slide 2

Collaborators in SALSA Project
Microsoft Research Technology Collaboration: Azure (Clouds) Dennis Gannon, Roger Barga; Dryad (Parallel Runtime) Christophe Poulain; CCR (Threading) George Chrysanthakopoulos; DSS (Services) Henrik Frystyk Nielsen
Indiana University SALSA Technology Team: Geoffrey Fox, Judy Qiu, Scott Beason, Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Yang Ruan, Seung-Hee Bae, Hui Li, Saliya Ekanayake
Applications: Bioinformatics, CGB: Haixu Tang, Mina Rho, Peter Cherbas, Qunfeng Dong; IU Medical School: Gilbert Liu; Demographics (Polis Center): Neil Devadasan; Cheminformatics: David Wild, Qian Zhu; Physics: CMS aggregate at Caltech (Julian Bunn)
Community Grids Lab and UITS RT – PTI

Slide 3

Dynamic Virtual Cluster Architecture
Applications: Smith Waterman Dissimilarities, CAP-3 Gene Assembly, PhyloD using DryadLINQ, High Energy Physics, Clustering, Multidimensional Scaling, Generative Topographic Mapping
Runtimes: Apache Hadoop / MapReduce++ / MPI; Microsoft DryadLINQ / MPI
Infrastructure software: Linux bare-system, Linux virtual machines (Xen virtualization), Windows Server 2008 HPC bare-system, Windows Server 2008 HPC on Xen virtualization; XCAT infrastructure
Hardware: iDataplex bare-metal nodes
Dynamic virtual cluster provisioning via XCAT; supports both stateful and stateless OS images.

Slide 4

Cluster Configurations: Hadoop/Dryad/MPI and DryadLINQ/MPI clusters (configuration table not reproduced here)

Slide 5

MapReduce "File/Data Repository" Parallelism
Map = (data parallel) computation reading and writing data
Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
Communication via messages/files
(Diagram: Instruments and Portals/Users feed Map 1, Map 2, Map 3 tasks over Computers/Disks, followed by Reduce.)
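The Map/Reduce pattern on this slide can be sketched in plain Python; this is a toy stand-in, not the Hadoop or Dryad API, with the map phase emitting partial counts and the reduce phase forming the global sums of a histogram.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Map = data-parallel computation: each map task reads its own
    # chunk of data and emits a partial (local) histogram.
    return Counter(chunk)

def reduce_phase(partials):
    # Reduce = collective/consolidation phase: merge the partial
    # histograms into global sums.
    return reduce(lambda a, b: a + b, partials, Counter())

# Three "files" of nucleotide characters, processed independently.
data_chunks = [["A", "C", "A"], ["G", "A", "T"], ["C", "C"]]
histogram = reduce_phase(map_phase(c) for c in data_chunks)
# histogram == Counter({'A': 3, 'C': 3, 'G': 1, 'T': 1})
```

In a real MapReduce runtime the map tasks run on the nodes holding the data and the partial results move over messages/files, exactly as the slide's diagram indicates.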

Slide 6

Cloud Computing: Infrastructure and Runtimes
Cloud infrastructure: outsourcing of servers, computing, data, file space, etc. Handled through Web services that control virtual machine lifecycles.
Cloud runtimes: tools (for using clouds) to do data-parallel computations. Apache Hadoop, Google MapReduce, Microsoft Dryad, and others. Designed for information retrieval but excellent for a wide range of science data analysis applications. Can also do much traditional parallel computing for data mining if extended to support iterative operations. Not usually run on virtual machines.

Slide 7

Some Life Sciences Applications
EST (Expressed Sequence Tag) sequence assembly using the DNA sequence assembly software CAP3.
Metagenomics and Alu repetition alignment using Smith Waterman dissimilarity computations, followed by MPI applications for clustering and MDS (Multi-Dimensional Scaling) for dimension reduction before visualization.
Correlating childhood obesity with environmental factors by combining medical records with geographical data carrying more than 100 attributes, using correlation computation, MDS and genetic algorithms for choosing optimal environmental factors.
Mapping the 26 million entries in PubChem into a few dimensions to aid selection of related chemicals, with a convenient Google Earth-like browser. This uses either hierarchical MDS (which cannot be applied directly as it is O(N²)) or GTM (Generative Topographic Mapping).

Slide 8

Cloud Related Technology Research
MapReduce: Hadoop on virtual machines (private cloud); Dryad (Microsoft) on Windows HPCS
MapReduce++: generalization to efficiently support iterative "maps" as in clustering, MDS, …
Azure: Microsoft cloud
FutureGrid: dynamic virtual clusters switching between VMs, bare-metal, Windows/Linux, …
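The iterative "maps" that motivate MapReduce++ can be illustrated with K-means, which repeats a map step (assign points to centers) and a reduce step (recompute centers) every iteration; a runtime that caches data between iterations avoids re-reading the input each time. This is a hedged 1-D toy, not the MapReduce++ API.

```python
def kmeans(points, centers, iterations=10):
    """Toy iterative MapReduce pattern: each loop is one map + reduce."""
    for _ in range(iterations):
        # Map phase: assign each point to its nearest center.
        buckets = {i: [] for i in range(len(centers))}
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            buckets[i].append(p)
        # Reduce phase: recompute each center as the mean of its bucket.
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in buckets.items()]
    return centers

centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
# centers converges to [1.0, 9.5]
```

In classic MapReduce each iteration would be a separate job that re-reads the points; the point of MapReduce++ is to keep them resident across iterations.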

Slide 9

Alu and Sequencing Workflow
Data is a collection of N sequences, 100's of characters long. These cannot be treated as vectors because there are missing characters. "Multiple Sequence Alignment" (creating vectors of characters) does not seem to work if N is larger than O(100).
Can calculate N² dissimilarities (distances) between sequences (all pairs). Find families by clustering (much better methods than K-means); as there are no vectors, use vector-free O(N²) methods. Map to 3D for visualization using Multidimensional Scaling (MDS), also O(N²).
N = 50,000 runs in 10 hours (all of the above) on 768 cores. Our collaborators just gave us 170,000 sequences and want to look at 1.5 million – we will develop new algorithms!
MapReduce++ will do all steps, as MDS and clustering just need MPI Broadcast/Reduce.

Slide 10

Pairwise Distances – ALU Sequences
Calculate pairwise distances for a collection of genes (used for clustering, MDS). O(N²) problem, "doubly data parallel" at the Dryad stage; performance close to MPI.
Performed on 768 cores (Tempest Cluster): 125 million distances in 4 hours and 46 minutes.
Processes work better than threads when used inside vertices: 100% utilization versus 70%.
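The "doubly data parallel" decomposition can be sketched as follows: the N x N distance matrix is split into row/column blocks, and each (row-block, column-block) pair is an independent task, with symmetry halving the work. The distance function here is a hypothetical Hamming stand-in for the Smith-Waterman-Gotoh dissimilarity; the block scheme is the point.

```python
from itertools import combinations_with_replacement

def block_tasks(n, block):
    # Enumerate the upper-triangular blocks of the N x N matrix;
    # distances are symmetric, so the lower triangle comes for free.
    ranges = [range(s, min(s + block, n)) for s in range(0, n, block)]
    return list(combinations_with_replacement(ranges, 2))

def compute_block(seqs, rows, cols, dist):
    # One independent task: fill the (rows x cols) block.
    return {(i, j): dist(seqs[i], seqs[j])
            for i in rows for j in cols if i <= j}

# Hypothetical stand-in distance (NOT Smith-Waterman-Gotoh):
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))

seqs = ["ACGT", "ACGA", "TCGA", "TTGA"]
D = {}
for rows, cols in block_tasks(len(seqs), 2):   # each task could run remotely
    D.update(compute_block(seqs, rows, cols, hamming))
# All N(N+1)/2 = 10 upper-triangular entries are filled; D[(0, 1)] == 1
```

Each block task reads only its two slices of the input, which is what makes the problem map so cleanly onto Dryad vertices or MapReduce map tasks.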

Slide 13

Hierarchical Subclustering

Slide 14

Dryad versus MPI for Smith Waterman. Flat is perfect scaling.

Slide 15

Hadoop/Dryad Comparison: Inhomogeneous Data I
Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).

Slide 16

Hadoop/Dryad Comparison: Inhomogeneous Data II
This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment using a global pipeline, in contrast to the static assignment of DryadLINQ.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).
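The dynamic-versus-static contrast can be made concrete with a toy scheduler simulation (this is an illustration of the scheduling principle, not the actual Hadoop or DryadLINQ schedulers): with skewed task lengths, static pre-partitioning can leave one worker with all the long tasks, while a dynamic pull model balances the load.

```python
import heapq

def static_makespan(tasks, workers):
    # Static assignment: tasks pre-split into equal-count contiguous
    # chunks, one chunk per worker (as in a static partitioning).
    chunk = len(tasks) // workers
    loads = [sum(tasks[w * chunk:(w + 1) * chunk]) for w in range(workers)]
    return max(loads)

def dynamic_makespan(tasks, workers):
    # Dynamic assignment: each worker pulls the next task from a
    # global queue as soon as it becomes free.
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for t in tasks:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + t)
    return max(finish_times)

tasks = [8, 8, 8, 8, 1, 1, 1, 1]      # skewed "sequence lengths"
s = static_makespan(tasks, 2)          # 32: one worker gets all long tasks
d = dynamic_makespan(tasks, 2)         # 18: work is rebalanced on the fly
```

The gap between the two makespans grows with the inhomogeneity of the task lengths, which is exactly the regime where the slide reports Hadoop's dynamic assignment winning.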

Slide 17

Hadoop VM Performance Degradation
Performance degradation = (T_vm – T_baremetal) / T_baremetal
15.3% degradation at the largest data set size.
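The degradation formula above is a one-liner; the timings below are made-up illustrative numbers (only the 15.3% figure comes from the talk).

```python
def degradation(t_vm, t_baremetal):
    # (T_vm - T_baremetal) / T_baremetal, as defined on the slide.
    return (t_vm - t_baremetal) / t_baremetal

# Hypothetical timings reproducing a ~15.3% slowdown:
pct = degradation(115.3, 100.0)   # ~0.153, i.e. 15.3%
```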

Slide 18

MDS/GTM for 100K (out of 26 million) PubChem entries
Distances in 2D/3D match distances derived from database properties.
Number of activity results: > 300; 200–300; 100–200; < 100 (shown in both MDS and GTM plots).
Developing hierarchical methods to extend to the full 26M dataset.

Slide 19

Correlation between MDS/GTM: canonical correlation between the MDS and GTM plots.

Slide 20

SALSA HPC Dynamic Virtual Cluster Hosting
Monitoring infrastructure; SW-G using Hadoop, SW-G using DryadLINQ.
Cluster switching from Linux bare-system, to Linux on Xen VMs, to Windows Server 2008 HPC bare-system.
XCAT infrastructure; iDataplex bare-metal nodes (32 nodes).
SW-G: Smith Waterman Gotoh dissimilarity computation – a typical MapReduce style application.

Slide 21

Monitoring Infrastructure
Pub/Sub broker network; monitoring interface; virtual/physical clusters; summarizer; switcher; XCAT infrastructure; iDataplex bare-metal nodes (32 nodes).

Slide 22

SALSA HPC Dynamic Virtual Clusters

Slide 23

Summary: Key Features of our Approach
Dryad/Hadoop/Azure are promising for biology computations.
Dynamic virtual clusters allow one to switch between different modes.
Overhead of VMs on Hadoop (15%) is acceptable.
Inhomogeneous problems currently favor Hadoop over Dryad.
MapReduce++ allows iterative problems (good linear algebra/data mining) to use the MapReduce model efficiently.
