Background

Breakout Session

Speed of Apache Pinot at the Cost of Cloud Object Storage with Tiered Storage in StarTree Cloud

Real-time analytics demands systems that deliver ultra-low latency (milliseconds) and extremely high throughput (thousands of queries per second). One such system is Apache Pinot, which excels at real-time use cases such as user-facing analytics and personalization.

Pinot users love its speed and want to use it for all their use cases: internal analytics, ad hoc analytics, and reporting. Such use cases typically require storing data with very long retention periods.

You can do that today, but storing large amounts of data in a system like Pinot can get expensive because storage and compute are tightly coupled. As the total data volume grows, more resources (compute and storage) must be provisioned, whether or not the additional compute is actually utilized, resulting in a high cost to serve.

One option is to introduce a separate, decoupled system for historical data analytics. Such systems use cloud object storage, which reduces cost. But that pushes latencies into the tens of seconds and adds the overhead of maintaining and operating a new system and federating queries across both.

To address these challenges, we added Tiered Storage for Apache Pinot in StarTree Cloud, which gives you the speed of Apache Pinot at the cost of cloud storage! In this talk, we will dive deep into how we built an abstraction in Apache Pinot that makes it agnostic to where the data is located. We'll talk about how we query data directly on the cloud (rather than downloading all of the data first, as lazy loading does) with sub-second latencies in StarTree Cloud. We'll also cover the various ways you can configure and customize which portion of your data stays local and tightly coupled and which moves to the cloud, giving you the best of both worlds.

Neha Pawar

StarTree