User Tools

Site Tools


cs545

CS545 - Fundamentals of Stream Processing, Spring 2013

The world is experiencing an explosion in the amount of online data produced continuously by different types of sensors, processes, and human activities. Being able to analyze live data as it is generated continuously, and distil insights for improved decision making, is vital to several large and complex applications in domains including financial systems, cyber- and physical-security systems, environmental monitoring, health care, manufacturing systems, telecommunication networks and power distribution grids.

Existing store-and-process information management technologies are ill-suited to meet the performance, scalability, and usability requirements of these applications.

Stream processing is a novel distributed compute paradigm that supports the gathering, processing, and analysis of high-volume, heterogeneous, continuous data streams, to extract insights and actionable results in real time. Stream processing builds on research from several domains ranging from distributed systems and relational databases, to programming languages and design, to signal processing and data mining algorithms.

In this course, we provide the fundamentals of the emerging stream processing paradigm. We introduce the key components of this paradigm, including the distributed system infrastructure, the programming model, the design patterns, and the streaming analytics.

Throughout the course, we describe the underlying theoretical principles, illustrative examples and implementations, and end to end real-world case studies to provide students and practitioners a comprehensive guide to building such systems and applications, and advance the state-of-the-art.

Finally, this course includes hands-on exposure to large-scale stream processing through relevant homework assignments involving programming exercises. The central component of this class revolves around student seminars and a final design and implementation project, allowing the students to explore the state-of-the art in this field with the potential for tackling open research challenges.

Programming exercises can be performed on a stream processing middleware. Two options include:

Location and Time

  • Location: EA 502
  • Monday 8:40am-10:30am (first hour is 'spare')
  • Wednesday 10:40am-12:30:pm

Important Dates

  • Homework I
    • announced Feb 18th
    • due March 13th, midnight
  • Homework II
    • announced March 19th
    • due April 15th, midnight
  • Project deadlines
    • Send project groups and topics by Monday, Mar 4th.
    • Project demos are due May 24th.
      1. 9:30am-10:00am, News fusion application
        • Üstün Özgür
      2. 10:00am-10:30am, Implement Section 10 book algorithms as re-usable stream analytics
        • Doğukan Çağatay, Fırat Karataş
      3. 10:30am-11:00am, A Scalable Content Based Pub Sub System
        • Fuat Basık, Mert Emin Kalender
      4. 11:00am-11:30am, Earthquake location detection on Twitter streaming data
        • Tunç Gültekin, Salim Arslan
      5. 11:30am-12:00am, News fusion application
        • Tolga Çekiç
      6. 2:00pm-2:30pm, Implement the rule-based marketing platform
        • Can Telkenaroğlu, Sermetcan Baysal
      7. 2:30pm-3:00pm, Implement Section 11 book algorithms as re-usable stream analytics
        • Habibe Güldamla Özsema, İhsan Mert Özçelik
      8. 3:30pm-4:00pm, Implementing a custom scheduler for Storm
        • Yağmur Şahin, Mehmet Başaran
    • Project reports are due at the demo time.
  • Homework III
    • announced April 19th
    • due May 15th, midnight

Text Book

  • Henrique Andrade, Buğra Gedik, Deepak Turaga. Fundamentals of Stream Processing - Application Design, Systems and Analytics. Cambridge Press.

The book is going to be published this year. So for now, please contact me for a draft.

Grades

  • Seminars - %20
  • Homeworks - %30
  • Project - 50%

Homeworks

Seminars

  • Mar 4
    • Slot 1: Operator Scheduling in a Data Stream Manager
      • Mert Emin Kalender
  • Mar 11
    • Slot 1: Flux: An adaptive partitioning operator for continuous query systems
      • Fuat Basık
    • Slot 2: Providing Resiliency to Load Variations in Distributed Stream Processing
      • Doğukan Çağatay
  • Mar 18
    • Slot 1: Load Shedding in a Data Stream Manager
      • Yağmur Şahin
  • Mar 25
    • Slot 1: Efficient Construction of Compact Shedding Filters for Data Stream Processing
      • Fırat Karataş
    • Slot 2: Elastic scaling of data parallel operators in stream processing
      • Mehmet Başaran
  • Apr 1
    • Slot 1: Auto-pipelining for data stream processing
      • Güldamla Özsema
    • Slot 2: Auto-parallelizing stateful distributed streaming applications
      • İhsan Mert Özçelik
  • Apr 8
    • Slot 1: Highly-Available, Fault-Tolerant, Parallel Dataflows
      • Hakan Sözer
    • Slot 2: Fault-tolerance in the borealis distributed stream processing system
      • Üstün Özgür
  • Apr 15
    • Slot 1: Language-level checkpointing support for stream processing applications
      • Can Telkenaroğlu
    • Slot 2: Fault injection-based assessment of partial fault tolerance in stream processing applications
      • Tolga Çekiç
  • Apr 29
    • Slot 1: Streaming Pattern Discovery in Multiple Time-Series
      • Tunç Gültekin
    • Slot 2: Mining Time Changing Data Streams
      • Salim Arslan
  • May 6
    • Slot 1: Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing
      • Sermetcan Baysal
  • May 13
    • Slot 1: Modeling the execution semantics of stream processing engines with SECRET
      • Sercan Aksoy
    • Slot 2: SODA: An Optimizing Scheduler for Large-Scale Stream-Based Distributed Computer Systems
      • Buğra Yıldız

Projects

Students will design and implement group projects (in groups of 2 students) aimed at creating moderate-sized stream processing applications and to experiment and showcase state-of-the art algorithms in a close to real setting. Some project ideas:

  1. Implement all algorithms from Section 10 of the book as re-usable stream analytics.
  2. Implement all algorithms from Section 11 of the book as re-usable stream analytics.
  3. Implement the rule-based marketing platform from Section 4.5, Exercise 1.
  4. Implement the query-based network monitoring application from Section 4.5, Exercise 2.
  5. Add elastic scaling to Storm
  6. Integrate a distributed memory system with Storm/Streams
  7. Implement a smart scheduler for Storm
  8. Implement a load balancer for Storm
  9. Implement a twitter analytics application
  10. Implement a news fusion application (that finds correlations between two data sources)
  11. Implement in-memory M/R middleware using streaming (researchy)
    1. See “Muppet: MapReduce-Style Processing of Fast Data”
  12. Implement scalable pub/sub middleware using streaming (researchy)
  13. Develop a (embedded?) DSL for stream processing

Topics

  • Introduction to stream processing
  • Application development
  • Optimizations
    • Paper:
      1. A Catalog of Stream Processing Optimizations, ACM Computing Surveys, to appear (contact for access).
  • Scheduling, load balancing, load shedding
    • Reading: Section 5.3, 9.3.3
    • Papers:
      1. Operator Scheduling in a Data Stream Manager, VLDB, 2003.
      2. SODA: An Optimizing Scheduler for Large-Scale Stream-Based Distributed Computer Systems. Middleware, 2008.
      3. Flux: An adaptive partitioning operator for continuous query systems. IEEE ICDE, 2003.
      4. Providing Resiliency to Load Variations in Distributed Stream Processing. VLDB, 2006.
      5. Robust Distributed Stream Processing, ICDE, 2013.
      6. Dynamic Load Balancing for Distributed Streaming Systems. not available yet (contact for access).
      7. Load Shedding in a Data Stream Manager. VLDB, 2003.
      8. Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing. VLDB, 2007.
      9. Efficient Construction of Compact Shedding Filters for Data Stream Processing. ICDE, 2008.
  • Parallelization and scalability
    • Reading: Section 9.3.2
    • Papers:
      1. Elastic scaling of data parallel operators in stream processing. IEEE IPDPS, 2009.
      2. Auto-pipelining for data stream processing. IEEE TPDS, 2012.
      3. Auto-parallelizing stateful distributed streaming applications. PACT, 2012.
      4. Elastic Degree-of-Parallelism Adaptation for Data Stream Processing. not available yet (contact for access)
  • Fault-tolerance
    • Reading: 9.3.4
    • Papers:
      1. Highly-Available, Fault-Tolerant, Parallel Dataflows. SIGMOD, 2004.
      2. High-Availability Algorithms for Distributed Stream Processing. ICDE, 2005.
      3. A Cooperative, Self-Configuring High-Availability Solution for Stream Processing. ICDE, 2007.
      4. Fault-tolerance in the borealis distributed stream processing system. ACM TODS, 2008.
      5. Language-level checkpointing support for stream processing applications. DSN, 2009.
      6. Fault injection-based assessment of partial fault tolerance in stream processing applications. DEBS, 2011.
  • Systems
    • Reading: Chapters 7, 8
    • Papers:
      1. STREAM: The Stanford Stream Data Manager. IEEE Data Engineering Bulletin, 2003.
      2. Gigascope: A Stream Database for Network Applications. ACM SIGMOD, 2003.
      3. The Design of the Borealis Stream Processing Engine. CIDR, 2005.
      4. A model-based framework for building extensible, high performance stream processing middleware and programming language for IBM InfoSphere Streams. Software: Practice & Experience, 2012.
  • Data pre-processing and transformation
    • Sub-topics:
      • Descriptive statistics
      • Sampling
      • Sketches
      • Quatization
      • Dimensionality reduction (x)
      • Transforms
    • Reading: Chapter 10
    • Papers: See Advanced reading sections in Chapter 10
  • Stream mining: Modeling and evaluation
    • Sub-topics:
      • Classification
      • Clustering
      • Regression
      • Frequent pattern mining
      • Anomaly detection (x)
    • Reading: Chapter 11
    • Papers: See Advanced reading sections in Chapter 11

Syllabus

We will have 15 weeks of classes. Each week will have 2 lectures, where each lecture consists of 2x 50 minutes with a 10 minute break in between. The second hour of the Wednesday lecture is 'spare', and will be used rarely.

  • Week 1 (week of Feb 4 - 8)
    • Background and Introduction
  • Week 2 (week of Feb 11 - 15)
    • Application development basics
    • Flow composition
  • Week 3 (week of Feb 18 - 22)
    • Flow manipulations
  • Week 4 (week of Feb 25 - Mar 1)
    • Optimizations
  • Week 5 (week of Mar 4 - 8)
    • Seminar
    • Optimizations
  • Week 6 (week of Feb 11 - 15)
    • Seminar
    • Stream processing middleware
  • Week 7 (week of Feb 19 - 22)
    • Seminar
    • Streaming analytics - pre-processing and transformation: Intro, Descrip. Stats
  • Week 8 (week of Feb 25 - 29)
    • Seminar
    • Streaming analytics - pre-processing and transformation: Sampling, Bloom Filters
  • Week 9 (week of Apr 1 - 5)
    • Seminar
    • Streaming analytics - pre-processing and transformation: Sketches, Quantization
  • Week 10 (week of Apr 8 - 12)
    • Seminar
    • Streaming analytics - modeling and evaluation: VFDTs
  • Week 11 (week of Apr 15 - 19)
    • Seminar
    • Streaming analytics - modeling and evaluation: Microclustering
  • Week 12 (week of Apr 22 - 26)
    • Monday: No classes
    • Seminar
  • Week 13 (week of Apr 29 - May 3)
    • Seminar
    • Wednesday: No classes
  • Week 14 (week of May 6 - May 10)
    • Seminar
    • TBA
  • Week 15 (week of May 13 - May 15)
    • Seminar
cs545.txt · Last modified: 2013/06/24 10:03 by bgedik

Page Tools