Querying and Routing: Data-Centric Forays into Networking - PowerPoint PPT Presentation

Presentation Transcript

  1. Querying and Routing: Data-Centric Forays into Networking Joe Hellerstein, UC Berkeley and Intel Research

  2. Note • These slides were made on PowerPoint for Mac 2004 • There are incompatibilities between the Mac and Windows versions of PowerPoint, particularly with regard to animations. • Please email the author with questions.

  3. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  4. Background: CS262 Experiment w/ Eric Brewer • Merged OS & DBMS grad class, over a year • Eric/Joe, point/counterpoint • Some tie-ins were obvious: • memory mgmt, storage, scheduling, concurrency • Surprising: QP and networks go well side by side • Query processors are dataflow engines. • So are routers (e.g. Kohler’s CLICK toolkit). • Adaptive query techniques look even more like networking ideas • E.g. “Eddy” tuple routers and TCP Congestion Control • Use simple Control/Queuing to “learn”/affect unpredictable dataflows

  5. Networking for DB Dummies (i.e. me) • Core function of protocols: data xfer • Data Manipulation • buffer, checksum, encryption, xfer to/fr app space, presentation • Transfer Control • flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing [Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90] • Basic Internet assumption: • “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

  6. C & T’s Wacky Ideas [slide callouts: Data Modeling! Query Opt! Exchange!] • Thesis: nets are good at xfer control, not so good at data manipulation • Some C&T wacky ideas for better data manipulation • Xfer semantic units, not packets (ALF) • Auto-rewrite layers to flatten them (ILP) • Minimize cross-layer ordering constraints • Control delivery in parallel via packet content

  7. Wacky Ideas in Query Processing • What if… • We had unbounded data producers and consumers (“streams” … “continuous queries”) • We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”) • We couldn’t predict user behavior? (“CONTROL”) • We couldn’t predict behavior of components in the dataflow? (“web services”) • We had partial failure as a given? (transactions not possible?) • Yes … networking people have been here! • Recall Van Jacobson’s quote

  8. Convergence [figure: DATABASE RESEARCH (Data Models, Query Opt, Data Scalability) and NETWORKING RESEARCH (Adaptivity, Federated Control, Node Scalability) converging on: P2P Queries • Approximate/Interactive QP • Adaptive Dataflow • Sensor Net Queries (TinyDB, BBQ) • PIER • Content-Based Routing • Knowledge Plane • Router Toolkits • Wireless Meshes]

  9. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  10. PIER • P2P Information Exchange and Retrieval • An Internet-Scale (peer-to-peer) query engine

  11. Our story at VLDB: What is a Very Large Data Base? [HHLLSS VLDB 03] [figure: a scale spectrum from Single Site and Clusters, through Distributed (10’s – 100’s of nodes, the Database Community), to Internet Scale (1000’s – millions of nodes, the Network Community)] • Challenge: How to run DB style queries at Internet Scale?! • Challenge: How can DB functionality change the Internet?

  12. What are the Key Properties? • Lots of data that is: • Naturally distributed (where it’s generated) • Centralized collection undesirable • Homogeneous in schema • Data is more useful when viewed as a whole • This is the design space we have chosen to investigate. • As opposed to … • Enterprise Information Integration • Semantic Web • Challenges tilted more heavily toward systems/algorithms • As opposed to data semantics & cleaning

  13. Who Needs Internet Scale Querying? Example 1: Filenames • Simple ubiquitous schemas: • Filenames, Sizes, ID3 tags • Early P2P filesharing apps • Napster, Gnutella, KaZaA, etc. • Built “in the garage” • “Normal” non-expert users • Not the greatest example • Often used to break copyright • Fairly trivial technology • But… • Points to key social issues driving adoption of decentralized systems • Provides real workloads to validate more complex designs

  14. Example 2: Network Traces • Schemas are mostly standardized: • IP, SMTP, HTTP, SNMP log formats, firewall log formats, etc. • Network administrators are looking for patterns within their site AND with other sites: • DoS attacks cross administrative boundaries • Tracking epidemiology of viruses/worms • Timeliness is very helpful • Might surprise you just how useful this is: • Traffic on PlanetLab (a distributed research test bed) is mostly generated by people monitoring the network status

  15. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  16. PHI: Public Health for the Internet [HPPRSW 04] • Thought experiment: A Network Oracle • Queryable entity that knows about all network state • Network maps • Link loading • Point-to-point latencies/bandwidth • Event detection (e.g. firewall events) • Naming (DNS, ASes, etc.) • End-system configuration • Router configuration • Data from recent past up to near-real-time • Available to all end-systems • What might this enable?

  17. Applications of a Network Oracle • Performance fault diagnosis • Tracking network attacks • Correlating firewall logs • New routing protocols • E.g. app-specific route selection • Adaptive distributed applications • “Internet Screensavers” • A la SETI@Home • Serendipity!

  18. Benefits? • Short term: • Connect net measurement and security researchers’ datasets. Enable distributed queries for network characterization, epidemiology and alerts. • E.g. top-10 IP address result from Barford et al. • Medium term: • Provide a service for overlay networks and planetary-scale adaptive applications • E.g. feed link measurement results into CDNs, server selection • Long term: • Change the Internet: protocols no longer assume ignorance of network state. Push more intelligence into end systems. • E.g. host-based source routing solutions, congestion avoidance (setting timeouts)

  19. A Center for Disease Control? • Who owns the Center? What do they Control? • This will be unpopular at best • Electronic privacy for individuals • The Internet as “a broadly surveilled police state”? • Provider disincentives • Transparency = support cost, embarrassment • And hard to deliver • Can monitor the chokepoints (ISPs) • But inside intranets?? • E.g. Intel IT • E.g. Berkeley dorms • E.g. Grassroots WiFi agglomerations?

  20. Energizing the End-Users • Endpoints are ubiquitous • Internet, intranet, hotspot • Toward a uniform architecture • End-users will help • Populist appeal to home users is timely • Enterprise IT can dictate endpoint software • Differentiating incentives for endpoint vendors • The connection: peer-to-peer technology • Harnessed to the good! • Ease of use • Built-in scaling • Decentralization of trust and liability

  21. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  22. Coarse Architecture of PIER

  23. [architecture figure: Declarative Queries → Query Plan → PIER (Query Optimizer, Core Relational Execution Engine, Catalog Manager) → Overlay Network → Physical Network]

  24. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  25. Some Background on Overlay Networks [RH ITR 03] • A P2P system like PIER needs to: • Track identities & (IP) addresses of peers currently online • May be many! • May have significant churn • Best not to have n² ID references • Route messages among peers • If you don’t track all peers everywhere, this is “multi-hop” • This is an overlay network • Peers are doing both naming and routing • IP becomes “just” the low-level transport • All the IP routing is opaque

  26. What is a DHT? • Hash Table • data structure that maps “keys” to “values” • essential building block in software systems • Distributed Hash Table (DHT) • similar, but spread across the Internet • Interface • insert(key, value) • lookup(key)

  27. How? • Every DHT node supports a single operation: • Given a key as input, route messages toward the node holding that key
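
To make the interface from slide 26 concrete, here is a minimal sketch in Python of a toy DHT: insert() and lookup() are thin wrappers over key-to-node ownership. The Node class, the dht_key() helper, and the successor rule are illustrative assumptions, and the global ring view is a deliberate simplification; a real DHT would reach the owner via multi-hop route() messages.

```python
import hashlib

def dht_key(s: str) -> int:
    # Map an application key onto the identifier space (here 2^16 slots).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**16

class Node:
    def __init__(self, node_id: int, ring: list):
        self.id, self.ring, self.store = node_id, ring, {}

    def owner(self, key_id: int) -> "Node":
        # Successor rule: the first node clockwise from key_id owns it.
        # (A real DHT reaches this node via multi-hop route() messages;
        # the global view here is a simplification for illustration.)
        ids = sorted(n.id for n in self.ring)
        nid = next((i for i in ids if i >= key_id), ids[0])  # wrap around
        return next(n for n in self.ring if n.id == nid)

    def insert(self, key: str, value) -> None:
        self.owner(dht_key(key)).store.setdefault(key, []).append(value)

    def lookup(self, key: str):
        return self.owner(dht_key(key)).store.get(key, [])

ring: list = []
ring.extend(Node(i, ring) for i in (3, 9000, 40000, 61000))
ring[0].insert("K1", "V1")
print(ring[2].lookup("K1"))   # any node can issue the lookup -> ['V1']
```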

  28.–30. DHT in action [figure: an overlay of nodes, each holding (key, value) pairs] • Operation: take key as input; route messages to node holding key

  31.–33. DHT in action: put() [figure: insert(K1,V1) is routed across the overlay until (K1,V1) is stored at the node responsible for K1]

  34. DHT in action: get() [figure: retrieve(K1) is routed the same way to the node holding K1]

  35. DHT Design Goals • An “overlay” network with: • Flexible mapping of keys to physical nodes • Small network diameter • Small degree (fanout) • Local routing decisions • Robustness to churn • Routing flexibility • Decent locality (low “stretch”) • A “storage” or “memory” mechanism with • Best-effort persistence (via soft state)
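
As a concrete illustration of the last goal, best-effort persistence via soft state, here is a hedged sketch: entries carry a TTL and evaporate unless the publisher renews them. The SoftStateStore class and its 30-second default are hypothetical, not PIER's actual mechanism.

```python
import time

class SoftStateStore:
    """Best-effort persistence via soft state (a hypothetical sketch):
    entries carry a TTL and disappear unless the publisher re-puts them."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self.items = {}   # key -> (value, expiry timestamp)

    def put(self, key, value):
        # Every put (re)starts the clock; publishers renew periodically.
        self.items[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() > expiry:      # lazily drop stale state
            del self.items[key]
            return None
        return value
```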

  36. DHT Topologies • DHTs emulate InterConnect Networks • These have group-theoretic structure • Cayley and Coset graphs • Rich families of such graphs with different properties • We can exploit the structure (i.e. constraints) of the overlay • E.g. to embed complex computations with efficient communication • E.g. to reason about the “influence” of malicious nodes in the network

  37. An Example DHT: Chord • Overlaid 2^k-gons

  38.–42. Routing in Chord • At most one of each gon • E.g. 1-to-0 [animation: each successive hop traverses one gon of a different size]

  43. Routing in Chord • At most one of each gon • E.g. 1-to-0 • What happened? • We constructed the binary number 15! • Routing from x to y is like computing (y - x) mod n by summing powers of 2 (here 8 + 4 + 2 + 1) • Diameter: log n (1 hop per gon type) • Degree: log n (one outlink per gon type)
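
The hop sequence can be sketched in a few lines: greedy Chord-style routing repeatedly takes the largest power-of-two "gon" that does not overshoot the remaining clockwise distance. chord_hops() is an illustrative helper, assuming a full ring of n = 2^k nodes.

```python
def chord_hops(x: int, y: int, n: int = 16) -> list:
    """Greedy Chord-style routing on a full ring of n = 2^k nodes:
    each hop takes the largest power-of-two "gon" that does not
    overshoot the remaining clockwise distance. Illustrative only."""
    hops = []
    while x != y:
        dist = (y - x) % n                    # clockwise distance left
        step = 1 << (dist.bit_length() - 1)   # largest power of 2 <= dist
        x = (x + step) % n
        hops.append(x)
    return hops

# Routing 1 -> 0 on a 16-node ring peels off 8 + 4 + 2 + 1 = 15:
print(chord_hops(1, 0))   # [9, 13, 15, 0] -- log2(16) = 4 hops
```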

  44. Deconstructing DHTs • A DHT is composed of • A logical, underlying interconnection network • An “emulation scheme” • works on a “non-round” # of nodes • without global knowledge of network size • Self-monitoring components • Track and react to churn
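
A minimal sketch of such an emulation scheme, assuming consistent hashing: node names and keys hash onto one identifier circle and a key belongs to its successor node, so any non-round number of nodes works, and churn moves only the departing node's keys. ring_id() and successor() are hypothetical helpers; a real DHT finds the successor by routing, not the global node list used here.

```python
import bisect
import hashlib

def ring_id(name: str, bits: int = 32) -> int:
    # Hash node names and keys onto one identifier circle.
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << bits)

def successor(node_names, key: str) -> str:
    # A key belongs to the first node at or after its ID (wrapping).
    ring = sorted((ring_id(n), n) for n in node_names)
    ids = [i for i, _ in ring]
    pos = bisect.bisect_left(ids, ring_id(key)) % len(ring)
    return ring[pos][1]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]   # 5, not 2^k
keys = [f"key{i}" for i in range(10)]
before = {k: successor(nodes, k) for k in keys}
after = {k: successor(nodes[:-1], k) for k in keys}          # node-e departs
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved}/{len(keys)} keys moved")   # only node-e's keys remap
```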

  45. Road Map • Emerging synergies in databases and networking • Internet-Scale Querying: PIER and PHI • Agenda, design space • Toward a Network Oracle (PHI) • The PIER Query Processor • Design principles & challenges • Overlay Networks: DHTs • Query Processing on DHTs • PIER in action • If time permits • Routing with queries • Related issues in Sensor Networks (TinyDB and BBQ)

  46. DHTs Gave Us Equality Lookups • That’s a start on database query processing. • But what else might we want? • Range Search • Aggregation • Group By • Join • Intelligent Query Dissemination • Theme • All can be built elegantly and opaquely on DHTs! • No need to build a “special” DHT for any of these • Can leverage advances in DHT design • This is the approach we take in PIER

  47. Aggregation in a DHT • SELECT COUNT(*) FROM firewalls • One common approach: • Everybody routes their firewall records to a particular “collector” • This forms a tree • Along the way, count up totals • At root, form final result • Note: the shapes of these trees depend on the DHT topology! • Can reason about comm costs, sensitivity to failure, influence of malefactors, etc. [figure: a binomial aggregation tree over a 16-node ring, rooted at node 0]
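
A sketch of the in-network rollup, assuming the tree shape is given: each node folds its children's partial counts into its own before forwarding, so a single partial aggregate flows up each edge instead of every record. tree_count() and the parent map are illustrative helpers, not PIER's API.

```python
from collections import defaultdict

def tree_count(parent: dict, local_counts: dict) -> int:
    """Roll COUNT(*) up an aggregation tree: every node combines its own
    count with its children's partials, so only one partial aggregate
    flows up each edge. Illustrative helper, not PIER's API."""
    children = defaultdict(list)
    for node, p in parent.items():
        if p is not None:
            children[p].append(node)

    def subtotal(node):
        return local_counts.get(node, 0) + sum(subtotal(c) for c in children[node])

    root = next(n for n, p in parent.items() if p is None)
    return subtotal(root)

# A binomial-tree rollup toward collector node 0 (shape as in the figure):
parent = {0: None, 1: 0, 2: 0, 4: 0, 8: 0, 3: 2, 5: 4, 6: 4, 7: 6}
counts = {n: 1 for n in parent}      # each node holds one firewall record
print(tree_count(parent, counts))    # 9
```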

  48. Aggregation in Koorde • Recall the De Bruijn graph: • Each node x points to 2x mod n and (2x + 1) mod n
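
The slide's edge rule and the routing it enables can be sketched directly: route from x to y by repeatedly doubling and shifting in the next bit of y, reaching y in at most log2(n) hops. These helpers illustrate the idea behind Koorde's routing under the assumption of a full 2^k-node graph; the real protocol also handles a non-round number of live nodes.

```python
def debruijn_neighbors(x: int, n: int = 16):
    # Each node x links to 2x mod n and (2x + 1) mod n, as on the slide.
    return (2 * x) % n, (2 * x + 1) % n

def debruijn_route(x: int, y: int, n: int = 16) -> list:
    """Route by shifting the destination's bits in, one hop per bit:
    every hop doubles x and appends the next high bit of y, so y is
    reached in at most log2(n) hops. Sketch of Koorde's routing idea."""
    k = n.bit_length() - 1          # log2(n) for a power-of-two n
    path = [x]
    for i in reversed(range(k)):
        bit = (y >> i) & 1
        x = (2 * x + bit) % n       # follow one of the two out-edges
        path.append(x)
    return path

print(debruijn_route(1, 6))   # [1, 2, 5, 11, 6] on the 16-node graph
```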

  49. Grouped Aggregation • SELECT COUNT(*) FROM firewalls GROUP BY root-domain • Everybody routes record r to hash(r.root-domain) • Simply forms a tree per group
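
A hedged sketch of the per-group trees: hashing the grouping key sends every record for a given root-domain to the same node, so each group's COUNT(*) accumulates independently. node_for() and the toy records are assumptions for illustration.

```python
import hashlib
from collections import Counter

def node_for(group_key: str, num_nodes: int = 8) -> int:
    # Route each record to hash(group_key): one aggregation tree per group.
    return int(hashlib.sha1(group_key.encode()).hexdigest(), 16) % num_nodes

hosts = ["evil.example.com", "bad.example.org", "worse.example.org"]
per_group = Counter()
for h in hosts:
    root_domain = ".".join(h.split(".")[-2:])      # e.g. "example.org"
    per_group[(node_for(root_domain), root_domain)] += 1

# Each (node, group) pair accumulates its COUNT(*) independently:
for (node, group), count in sorted(per_group.items()):
    print(f"node {node}: COUNT(*) for {group} = {count}")
```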

  50. For each of my attackers, how many sites did they attack, and how many packets were involved? Joins • SELECT F.sourceIP, COUNT(DISTINCT P.*), COUNT(DISTINCT P.destIP) FROM firewalls F, packets P WHERE F.sourceIP = P.sourceIP AND F.destIP = <myIP> GROUP BY P.sourceIP • “Rehash” join: • Everybody routes their F and P records to hash(sourceIP) • Forms a tree per sourceIP, can combine tuples in each tree independently • Automatically parallelizes the join algorithm • No notion of parallelism in the code; it falls out of the DHT • Other join algorithms available • “Fetch matches” • Semi-join variants • Bloom-filter variants
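
A sketch of the "rehash" join under the same hashing idea: both relations are re-routed on the join key, so matching tuples meet at one node and each node joins only its own partition; the parallelism falls out of the partitioning. rehash_join(), node_for(), and the record layout are illustrative assumptions, not PIER's internals.

```python
import hashlib
from collections import defaultdict

def node_for(source_ip: str, num_nodes: int = 4) -> int:
    # Both relations rehash on the join key, so matches meet at one node.
    return int(hashlib.sha1(source_ip.encode()).hexdigest(), 16) % num_nodes

def rehash_join(firewalls, packets, num_nodes: int = 4):
    partitions = defaultdict(lambda: ([], []))   # node -> (F tuples, P tuples)
    for f in firewalls:
        partitions[node_for(f["sourceIP"], num_nodes)][0].append(f)
    for p in packets:
        partitions[node_for(p["sourceIP"], num_nodes)][1].append(p)
    results = []
    for fs, ps in partitions.values():           # each node joins alone
        for f in fs:
            results += [(f["sourceIP"], p["destIP"])
                        for p in ps if p["sourceIP"] == f["sourceIP"]]
    return results

F = [{"sourceIP": "10.0.0.1"}, {"sourceIP": "10.0.0.2"}]
P = [{"sourceIP": "10.0.0.1", "destIP": "a"},
     {"sourceIP": "10.0.0.1", "destIP": "b"}]
print(rehash_join(F, P))   # [('10.0.0.1', 'a'), ('10.0.0.1', 'b')]
```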