Pinterest is a visual discovery engine for finding ideas like recipes, home and style inspiration, and more. The user base, and the use cases that attract users to the platform, have grown dramatically over the years: as of Q1 2022, Pinterest attracts over 433 million monthly active users, and these numbers continue to grow. This growth has produced an ever-increasing amount of data that must be transported from its sources to various real-time and batch systems for ingestion and processing. Pinterest's company-wide data streaming implementation is covered in two parts below. The first describes the stable, scalable, efficient, and usable Apache Kafka platform that powers business-critical functions at Pinterest. The second describes Xenon, their stable and efficient stream processing platform for deploying, monitoring, and maintaining Apache Flink-based applications.

PubSub: Apache Kafka®
Technologies Used:
- Apache Kafka®
- Apache Kafka MirrorMaker
- Amazon Web Services
Short description of project/implementation:
Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.
What problem were they looking to solve with Data Streaming Technology?
Some of the key problems addressed so far include:
- Density of Kafka clusters: how to maintain dense Kafka clusters that easily tolerate replays and backfills, broker replacements, traffic spikes, etc.?
- MirrorMaker instability: how to make MirrorMaker resilient to traffic spikes at source compressed topics?
- Error-prone manual repetitive maintenance operations such as topic movements, optimal topic partition placements, broker health check, broker replacements, etc.: how to avoid human error and save engineers’ time?
- Client library instability and usability: how to reduce the impact of bugs in the client library and simplify configuring it?
- Platform cost: how to run the platform efficiently and save on the infra costs?
- Pipeline visibility: how to make it easier for the platform team and its customers to see the lineage of pipelines?
How did they solve the problem?
- Kafka cluster density: one size does not fit all. The type of workload should determine the choice of hardware for a Kafka cluster, so it is important to categorize workloads into separate clusters to allow for this optimal hardware selection. NVMe SSD-based instance types allowed Pinterest to host a large number of partitions per broker (over 300).
- MirrorMaker instability: to avoid recompressing already-compressed messages, KIP-712 was proposed and implemented internally, reducing resource usage and improving stability under load and backpressure during spikes (see the mirroring sketch after this list).
- Error-prone repetitive maintenance operations and issue remediation: the steps and logic behind those tasks were implemented in Orion, which provides both reactive and proactive maintenance via automation and gives substantial time back to platform engineers to work on genuinely hard problems.
- Client library instability: the Pinterest PSC (PubSub Client) SDK was designed and implemented to address client library bugs and complex client configuration, freeing customers from having to solve cluster discovery, observability, and graceful error handling themselves (see the client-factory sketch after this list).
- Cost efficiency: beyond optimizing the hardware each cluster runs on and increasing density, the cost of network transfer in and out of Kafka clusters was significantly reduced by applying an AZ-aware producer partitioning and consumer assignment strategy (see the rack-aware consumer sketch after this list).
- Improved visibility: a lineage tool that illustrates the pipelines running through Kafka clusters gives the platform team and its customers a simplified view of each pipeline, its configuration, key metrics, etc.
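
To make the MirrorMaker fix concrete: a stock consume-then-produce mirror pays a decompression/recompression cycle on every batch. The sketch below shows that baseline path using the standard Kafka clients (all endpoints and topic names are placeholders); the comments mark the cost that KIP-712-style shallow mirroring eliminates by forwarding compressed batches as-is, something the public client API cannot express.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Baseline consume-then-produce mirroring loop. Every batch fetched from the
 *  source is decompressed by the consumer and recompressed by the producer;
 *  KIP-712 ("shallow mirroring") avoids exactly this round trip by forwarding
 *  compressed batches as-is, which requires client internals beyond this API. */
public class NaiveMirror {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-kafka:9092"); // placeholder
        consumerProps.put("group.id", "mirror");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "dest-kafka:9092"); // placeholder
        producerProps.put("compression.type", "gzip"); // recompression cost paid here
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("mirrored-topic")); // placeholder topic
            while (true) {
                // poll() transparently decompresses each fetched batch ...
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                    // ... and send() recompresses it on the way out.
                    producer.send(new ProducerRecord<>(r.topic(), r.key(), r.value()));
                }
            }
        }
    }
}
```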
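PSC's actual API is not shown in this write-up, so the sketch below only illustrates the cluster-discovery idea behind it: applications name a logical topic URI and the SDK resolves broker endpoints, so no application hard-codes a broker list. The URI scheme, the endpoint registry, and the producerFor helper are all hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Illustrative only: resolves a logical topic URI to connection config,
 *  mirroring the cluster-discovery idea behind Pinterest's PSC SDK. */
public class UriResolvingClientFactory {
    // Hypothetical registry; a real SDK would consult a config/discovery service.
    private static final Map<String, String> CLUSTER_ENDPOINTS = Map.of(
            "prod-ads", "ads-kafka-1:9092,ads-kafka-2:9092",
            "prod-search", "search-kafka-1:9092");

    /** Accepts URIs like "kafka://prod-ads/ad-events" (hypothetical scheme). */
    public static Producer<String, String> producerFor(URI topicUri) {
        String bootstrap = CLUSTER_ENDPOINTS.get(topicUri.getHost());
        if (bootstrap == null) {
            throw new IllegalArgumentException("Unknown cluster: " + topicUri.getHost());
        }
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap); // resolved, never hard-coded by the app
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        URI topicUri = URI.create("kafka://prod-ads/ad-events");
        String topic = topicUri.getPath().substring(1); // "ad-events"
        try (Producer<String, String> producer = producerFor(topicUri)) {
            producer.send(new ProducerRecord<>(topic, "key", "value"));
        }
    }
}
```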
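Pinterest's AZ-aware partitioning and assignment logic lives in their own client stack; the closest built-in analogue in open-source Kafka on the consume side is rack-aware follower fetching (KIP-392), where a consumer advertises its availability zone via client.rack and the broker serves fetches from the in-zone replica. A minimal sketch with placeholder endpoints, topic, and AZ:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AzAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "az-aware-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // KIP-392: advertise this client's zone so brokers can serve fetches
        // from the in-zone replica instead of a cross-AZ leader. Requires
        // brokers configured with broker.rack=<AZ> and
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a"); // placeholder AZ

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```

This covers only the consume path; making the produce path AZ-aware would additionally need something like a custom org.apache.kafka.clients.producer.Partitioner that prefers partitions whose leaders sit in the producer's own zone.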
What was the positive outcome?
Increased platform stability, scalability, efficiency, and usability:
- Kafka clusters stability: 70% drop in the number of incidents related to Kafka
- MirrorMaker stability: 75% CPU usage reduction in MirrorMaker nodes
- PubSub client stability: significant reduction in engineers' time spent on client configuration, optimization, troubleshooting, etc.; e.g., automatic error resolution alone saves about 1 FTE-month.
- Automation of repetitive operations: As much as 100% reduction in engineers' time spent per operation
- Cost efficiency: 300% growth in data throughput alongside a 50% reduction in cost.
Additional Links:
- Optimizing Kafka for the cloud
- Using graph algorithms to optimize Kafka operations (Part 1, Part 2)
- Shallow Mirror
- Lessons Learned from Running Apache Kafka at Scale at Pinterest
- Unified PubSub Client at Pinterest
- Orion: A pluggable automation platform for stateful distributed systems
Streaming: Flink
Technologies Used:
- Apache Flink®
- Apache Hive®
- Apache Kafka®
- Amazon Web Services
Short description of project/implementation:
Xenon provides a stable and efficient stream processing platform for deploying, monitoring, and maintaining Apache Flink-based applications.
What problem were they looking to solve with Data Streaming Technology?
Since Apache Flink was chosen as the stream processing engine at Pinterest, the Stream Processing Platform (SPP) team has had to offer its customers a platform for
- stably running their streaming jobs
- quickly productionalizing new use cases and features
- efficiently using infra resources and implementing best practices
- automating key operations that originally required engineering time from the platform team and the customers
How did they solve the problem?
The SPP team
- increased platform stability by implementing a self-service job deployment and job management framework
- significantly increased developer velocity by providing application developers with a CI/CD framework, a job debugger, a job development framework, and an application certification program
- reduced infra costs by implementing application auto-scaling, job health dashboards and reports, and load balancing, and by optimizing resource usage
- provided native Thrift format support, an NRTG framework, and a SQL abstraction, and federated source/sink system metadata into a unified catalog built on top of the Hive Metastore (see the catalog sketch after this list)
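
The unified catalog's internals are not detailed here; in open-source Flink, the natural building block is HiveCatalog, which persists table metadata for sources and sinks in the Hive Metastore so that SQL jobs can resolve them by name. A minimal sketch, with placeholder catalog, database, path, and table names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class UnifiedCatalogExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Back the catalog with the Hive Metastore so table metadata for
        // sources and sinks lives in one shared place. The hive conf dir
        // must contain a hive-site.xml pointing at the Metastore.
        HiveCatalog catalog = new HiveCatalog(
                "unified",         // catalog name as seen by SQL
                "default",         // default database
                "/etc/hive/conf"); // placeholder path to hive-site.xml
        tEnv.registerCatalog("unified", catalog);
        tEnv.useCatalog("unified");

        // Once source/sink metadata is registered in the catalog, jobs query
        // by name without repeating per-connector configuration.
        // "events" is a placeholder table assumed to exist in the catalog.
        tEnv.executeSql("SELECT * FROM events").print();
    }
}
```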
What was the positive outcome?
Enabled more than 60 use cases across key initiatives, ranging from user growth, shopping, and user engagement to monetization. SPP has been efficiently and stably processing over 6 PB of data per day in near real time, at a scale of over 10M TPS, for more than two years.
Additional Links:
- Flink-powered stream processing platform at Pinterest (starting from 48' 15")
- Pinterest Flink Deployment Framework
- Unified Flink Source at Pinterest: Streaming Data Processing
- Faster Flink adoption with self-service diagnosis tool at Pinterest
- The Apache Flink Story at Pinterest - Flink Forward Global 2021