Streams Posts

by Scott Schneider

I contribute posts over at StreamsDev, which is the developer-run community for the product I do research and development for, IBM Streams. These posts all involve SPL (Streams Processing Language), the programming language for developing applications on Streams.

We have an academic paper which covers the design of SPL. In brief, SPL is designed to be a programming language for distributed stream computing. The primary abstractions are operators, tuples and streams. Operators are the primary actors in the language; they receive tuples, perform computations on them, and potentially emit tuples which represent the result of that computation. Operators send and receive these tuples over streams, which are the only way they can communicate. Because operators can only communicate with each other over streams, the programming model is similar to functional languages, where functions can only “communicate” by accepting and returning values. Because the programming model of SPL forces operators to be fully isolated from each other, acting on on their explicit inputs and outputs, the language is amenable to various kinds of parallelization.

The first set of posts are brief programming tutorials, introducing some advanced techniques in SPL:

  • Wrapping Custom operators in Composites: Custom operators in SPL are a way to introduce new operator logic. Composite operators are a way to represent a streaming sub-graph (a collection of operators that communicate with each other) as a single operator. Invoking a single custom operator inside of a composite is similar to assigning an anonymous function to a variable in functional languages.
  • Genericity & Composites: In SPL, composite operators are to primitive operators as functions are to expressions in general purpose languages. Specifically, a composite operator is a reusable grouping of multiple operators with well-defined inputs and outputs so that it can be used as a single operator elsewhere in an application. This post explores one of the primary benefits of composite operators, which is that they can be generic with respect to the types they use and operate on.
  • Operator Genericity: Genericity in SPL extends to operators themselves, and composite operators can actually accept other operators as parameters. This capability makes composite operators higher order.
  • General Operator Parameters in UDP Regions: A specific technique for using several orthogonal features in SPL (custom state on operator invocations, operator parallelization and runtime specification of parameters) to use operators in parallel that were not designed for parallelism.

One of the primary reasons to write an application in SPL is to process a large amount of data quickly. The design of SPL allows programmers to first write an application focusing on functionality, and only consider performance after the application is correct. SPL exposes controls that allow programmers to tune their application for the underlying hardware. The post Optimizing Streams Applications presents the slides from a presentation I gave which teaches how the abstractions in SPL map to the underlying hardware in a cluster. It presents the application-level controls that SPL exposes, shows how to use them, and walks through improving the performance of an example application.

Finally, two posts which involved some actual research, in that they involved a significant amount of experimental evaluation:

  • Parallelized File Processing with the Parse Operator: Any real streaming application is going to have to read data from an external source, as well as parse that data from the external format to an internal representation. Sometimes, the parsing is more expensive than the reading, and in such cases, we can use parallelism to improve throughput. This post motivates why parsing may be more expensive, and walks through developing a pattern programmers can use in such situations.
  • The ElasticLoadBalance Operator: Elasticity, in this context, is dynamically adjusting the level of parallelism to maximize throughput. I adapted prior research work on elasticity into a reusable operator that can be used in the product. In this post, I explain how the elasticity algorithm works, why I had to change it from our previously published work, how to use the new operator, and experimental results demonstrating its effectiveness.