Lightning Talk
Streaming analytics serve a variety of downstream applications. As data volumes grow and the need for real-time insights rises, analytics costs are ballooning out of control. Many downstream applications can tolerate approximate results, so practitioners commonly use methods like sampling to reduce costs. Unfortunately, in doing so, practitioners face a cost-accuracy dichotomy: they must choose between low cost and high accuracy.
Sketches, a.k.a. sketching algorithms, provide an opportunity to break this cost-accuracy dichotomy. Sketches can accurately estimate statistical metrics over data streams while consuming extremely few resources and making only one pass over the data. Sketches also provide theoretical error guarantees backed by extensive scientific literature. Metrics that sketches can estimate include quantiles, heavy hitters (i.e., most frequent items), and cardinality (the number of distinct items). While sketches seem like a panacea for ballooning costs, they (a) are difficult and unintuitive to use, (b) require tuning low-level knobs to get optimal performance, and (c) are not well integrated with analytics frameworks.
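To make the idea concrete, here is a minimal Count-Min Sketch, one classic sketch for frequency (heavy-hitter) estimation. This is an illustrative toy, not SketchDB's implementation: it processes the stream in one pass with fixed memory, and its estimates never undercount (hash collisions can only inflate counts).

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch: one-pass frequency estimation in fixed memory.
    Estimates are upper bounds on the true count (collisions only inflate)."""

    def __init__(self, width=272, depth=5):
        self.width = width   # columns per row; larger width -> smaller error
        self.depth = depth   # independent hash rows; more rows -> lower failure probability
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive an independent hash per row by salting with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # The minimum across rows is the least-inflated counter.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

# One pass over a stream; memory stays fixed regardless of stream length.
cms = CountMinSketch()
stream = ["a"] * 1000 + ["b"] * 10 + ["c"]
for item in stream:
    cms.add(item)
```

After this pass, `cms.estimate("a")` returns a value of at least 1000 (and at most the total stream length, 1011), using only `width * depth` counters no matter how long the stream runs.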
To tackle these challenges, we design SketchDB, a drop-in sketch-based optimizer that sits in front of an existing database or streaming platform and provides high-accuracy, low-latency analytics at a fraction of the cost, along with an easy-to-use high-level interface. SketchDB estimates metrics with < 1% error and 10x lower latency, while consuming 10-30x less memory at ingest and query time. SketchDB supports aggregate statistical metrics over entire data streams as well as over subpopulations. To use SketchDB, an operator simply configures it with the specific streams and corresponding metrics that should be accelerated. Our grand aim is to reduce analytics costs by multiple orders of magnitude and democratize the use of approximation primitives like sketches.
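As a sense of what such operator-facing configuration might look like, here is a hypothetical sketch of a stream-and-metric declaration. The stream names, columns, and field names below are illustrative assumptions, not SketchDB's actual schema; only the metric types (quantiles, heavy hitters, cardinality) come from the description above.

```yaml
# Hypothetical configuration sketch -- field names and streams are
# illustrative, not SketchDB's real interface.
streams:
  - name: ad_clicks          # assumed stream name
    metrics:
      - type: quantile       # e.g., p50/p95/p99 of a numeric column
        column: latency_ms
      - type: heavy_hitters  # most frequent values of a column
        column: user_id
  - name: page_views         # assumed stream name
    metrics:
      - type: cardinality    # number of distinct values
        column: visitor_id
```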