Pinterest is a visual discovery engine for finding ideas like recipes, home and style inspiration, and more. The user base, and the use cases that attract users to the platform, have grown dramatically over the years: as of Q1 2022, Pinterest attracts over 433 million monthly active users, and these numbers continue to grow. This growth has produced an ever-increasing amount of data that must be transported from its sources to various real-time and batch systems for ingestion and processing. Pinterest's company-wide data streaming implementation is covered in two parts below. The first describes the stable, scalable, efficient, and usable Apache Kafka platform that powers business-critical functions at Pinterest. The second describes Xenon, their stable and efficient stream processing platform for deploying, monitoring, and maintaining Apache Flink-based applications.

PubSub: Apache Kafka®
Technologies Used:
- Apache Kafka®
- Apache Kafka MirrorMaker
- Amazon Web Services
Short description of project/implementation:
Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.
What problem were they looking to solve with Data Streaming Technology?
Some of the key problems addressed so far include:
- Density of Kafka clusters: how to maintain dense Kafka clusters that easily tolerate replays and backfills, broker replacements, traffic spikes, etc.?
- MirrorMaker instability: how to make MirrorMaker resilient to traffic spikes at source compressed topics?
- Error-prone manual repetitive maintenance operations such as topic movements, optimal topic partition placements, broker health check, broker replacements, etc.: how to avoid human error and save engineers’ time?
- Client library instability and usability: how to reduce the impact of bugs in the client library and simplify configuring it?
- Platform cost: how to run the platform efficiently and save on the infra costs?
- Pipeline visibility: how to make it easier for the platform team and its customers to see the lineage of pipelines?
How did they solve the problem?
- Kafka cluster density: one size does not fit all. The type of workload should determine the choice of hardware for a Kafka cluster, so it is important to categorize workloads into separate clusters to allow for this optimal hardware selection. NVMe SSD-based instance types allowed Pinterest to host a large number of partitions per broker (over 300).
- MirrorMaker instability: to avoid recompressing already-compressed messages, KIP-712 was proposed and implemented internally, reducing resource usage and improving stability under load and backpressure during spikes (see the mirroring sketch after this list).
- Error-prone repetitive maintenance operations and issue remediation: the steps and logic behind those tasks were implemented in Orion, which provides both reactive and proactive maintenance via automation and gives substantial time back to platform engineers to work on genuinely hard problems.
- Client library instability: the Pinterest PSC (PubSub Client) SDK was designed and implemented to address client library bugs and complex client configuration, freeing customers from having to solve cluster discovery, observability, and graceful error handling themselves (see the client-factory sketch after this list).
- Cost efficiency: beyond optimizing the hardware each cluster runs on and increasing density, the cost of network transfer in and out of Kafka clusters was significantly reduced by applying an AZ-aware producer partitioning and consumer assignment strategy (see the rack-aware consumer sketch after this list).
- Improved visibility: a lineage tool that illustrates the pipelines running through Kafka clusters gives the platform team and its customers a simplified view of each pipeline, its configuration, key metrics, etc.
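
To make the MirrorMaker fix concrete: a stock consume-then-produce mirror pays a decompression/recompression cycle on every batch. The sketch below shows that baseline path using the standard Kafka clients (all endpoints and topic names are placeholders); the comments mark the cost that KIP-712-style shallow mirroring eliminates by forwarding compressed batches as-is, something the public client API cannot express.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Baseline consume-then-produce mirroring loop. Every batch fetched from the
 *  source is decompressed by the consumer and recompressed by the producer;
 *  KIP-712 ("shallow mirroring") avoids exactly this round trip by forwarding
 *  compressed batches as-is, which requires client internals beyond this API. */
public class NaiveMirror {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-kafka:9092"); // placeholder
        consumerProps.put("group.id", "mirror");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "dest-kafka:9092"); // placeholder
        producerProps.put("compression.type", "gzip"); // recompression cost paid here
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("mirrored-topic")); // placeholder topic
            while (true) {
                // poll() transparently decompresses each fetched batch ...
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                    // ... and send() recompresses it on the way out.
                    producer.send(new ProducerRecord<>(r.topic(), r.key(), r.value()));
                }
            }
        }
    }
}
```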
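PSC's actual API is not shown in this write-up, so the sketch below only illustrates the cluster-discovery idea behind it: applications name a logical topic URI and the SDK resolves broker endpoints, so no application hard-codes a broker list. The URI scheme, the endpoint registry, and the producerFor helper are all hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Illustrative only: resolves a logical topic URI to connection config,
 *  mirroring the cluster-discovery idea behind Pinterest's PSC SDK. */
public class UriResolvingClientFactory {
    // Hypothetical registry; a real SDK would consult a config/discovery service.
    private static final Map<String, String> CLUSTER_ENDPOINTS = Map.of(
            "prod-ads", "ads-kafka-1:9092,ads-kafka-2:9092",
            "prod-search", "search-kafka-1:9092");

    /** Accepts URIs like "kafka://prod-ads/ad-events" (hypothetical scheme). */
    public static Producer<String, String> producerFor(URI topicUri) {
        String bootstrap = CLUSTER_ENDPOINTS.get(topicUri.getHost());
        if (bootstrap == null) {
            throw new IllegalArgumentException("Unknown cluster: " + topicUri.getHost());
        }
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap); // resolved, never hard-coded by the app
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        URI topicUri = URI.create("kafka://prod-ads/ad-events");
        String topic = topicUri.getPath().substring(1); // "ad-events"
        try (Producer<String, String> producer = producerFor(topicUri)) {
            producer.send(new ProducerRecord<>(topic, "key", "value"));
        }
    }
}
```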
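Pinterest's AZ-aware partitioning and assignment logic lives in their own client stack; the closest built-in analogue in open-source Kafka on the consume side is rack-aware follower fetching (KIP-392), where a consumer advertises its availability zone via client.rack and the broker serves fetches from the in-zone replica. A minimal sketch with placeholder endpoints, topic, and AZ:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AzAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "az-aware-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // KIP-392: advertise this client's zone so brokers can serve fetches
        // from the in-zone replica instead of a cross-AZ leader. Requires
        // brokers configured with broker.rack=<AZ> and
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a"); // placeholder AZ

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```

This covers only the consume path; making the produce path AZ-aware would additionally need something like a custom org.apache.kafka.clients.producer.Partitioner that prefers partitions whose leaders sit in the producer's own zone.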
What was the positive outcome?
Increased platform stability, scalability, efficiency, and usability:
- Kafka clusters stability: 70% drop in the number of incidents related to Kafka
- MirrorMaker stability: 75% CPU usage reduction in MirrorMaker nodes
- PubSub client stability: significant reduction in engineers' time spent on client configuration, optimization, troubleshooting, etc.; e.g., automatic error resolution alone saves about 1 FTE-month.
- Automation of repetitive operations: As much as 100% reduction in engineers' time spent per operation
- Cost efficiency: 300% growth in data throughput alongside a 50% reduction in cost.
Additional Links:
- Optimizing Kafka for the cloud
- Using graph algorithms to optimize Kafka operations (Part 1, Part 2)
- Shallow Mirror
- Lessons Learned from Running Apache Kafka at Scale at Pinterest
- Unified PubSub Client at Pinterest
- Orion: A pluggable automation platform for stateful distributed systems
Streaming: Flink
Technologies Used:
- Apache Flink®
- Apache Hive®
- Apache Kafka®
- Amazon Web Services
Short description of project/implementation:
Xenon provides a stable and efficient stream processing platform for deploying, monitoring, and maintaining Apache Flink-based applications.
What problem were they looking to solve with Data Streaming Technology?
Since Apache Flink was chosen as the stream processing engine at Pinterest, the Stream Processing Platform (SPP) team has had to offer its customers a platform for
- stably running their streaming jobs
- quickly productionalizing new use cases and features
- efficiently using infra resources and implementing best practices
- automating key operations that originally required engineering time from the platform team and the customers
How did they solve the problem?
The SPP team
- increased platform stability by implementing a self-service job deployment and job management framework
- significantly increased developer velocity by providing application developers with a CI/CD framework, a job debugger, a job development framework, and an application certification program
- reduced infra costs by implementing application auto-scaling, job health dashboards and reports, and load balancing, and by optimizing resource usage
- provided native Thrift format support, an NRTG framework, and a SQL abstraction, and federated source/sink system metadata into a unified catalog built on top of the Hive Metastore (see the catalog sketch after this list)
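
The unified catalog's internals are not detailed here; in open-source Flink, the natural building block is HiveCatalog, which persists table metadata for sources and sinks in the Hive Metastore so that SQL jobs can resolve them by name. A minimal sketch, with placeholder catalog, database, path, and table names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class UnifiedCatalogExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Back the catalog with the Hive Metastore so table metadata for
        // sources and sinks lives in one shared place. The hive conf dir
        // must contain a hive-site.xml pointing at the Metastore.
        HiveCatalog catalog = new HiveCatalog(
                "unified",         // catalog name as seen by SQL
                "default",         // default database
                "/etc/hive/conf"); // placeholder path to hive-site.xml
        tEnv.registerCatalog("unified", catalog);
        tEnv.useCatalog("unified");

        // Once source/sink metadata is registered in the catalog, jobs query
        // by name without repeating per-connector configuration.
        // "events" is a placeholder table assumed to exist in the catalog.
        tEnv.executeSql("SELECT * FROM events").print();
    }
}
```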
What was the positive outcome?
Enabled more than 60 use cases across key initiatives, ranging from user growth, shopping, and user engagement to monetization. SPP has been efficiently and stably processing over 6 PB of data per day in near real time, at a scale of over 10M TPS, for more than two years.
Additional Links:
- Flink-powered stream processing platform at Pinterest (starting from 48' 15")
- Pinterest Flink Deployment Framework
- Unified Flink Source at Pinterest: Streaming Data Processing
- Faster Flink adoption with self-service diagnosis tool at Pinterest
- The Apache Flink Story at Pinterest - Flink Forward Global 2021