Slack

Salesforce-owned Slack is the work operating system that brings people, apps, processes, and data together with trusted generative and agentic AI, fueling productivity for every employee.

Data Strategy and Contribution
2025

The new CDC pipeline slashed latency from 48 hours to less than 10 minutes

Data Streaming Technology Used:
  • Vitess
  • Apache Kafka®
  • Debezium
  • Kafka Connect
  • Apache Iceberg®
  • Apache Spark®
  • Apicurio Registry
  • S3
  • Parquet

What problem were they looking to solve with Data Streaming Technology?

Slack's core business relies on a massive production database infrastructure built on Vitess, a sharding middleware for MySQL. This environment is vast, encompassing approximately 800 tables, 3,000 shards, and 1 petabyte of data, and it sustains a peak load of roughly 600,000 writes per second. To power its OLAP and analytical workloads, Slack needed to replicate this data into its S3-based data lake.

The existing process was a traditional batch pipeline. It relied on periodically reading backups from the Vitess cluster and loading that data into Parquet files. This approach presented two critical business challenges: prohibitive cost and unacceptable latency. The process of creating, transferring, and processing massive backups was resource-intensive and expensive.

Most importantly, because of the batch process's long runtime, the data available for analytics was up to 48 hours old. In a fast-paced environment, this delay handicapped decision-making, trend analysis, and the ability to react quickly to emerging patterns. The goal was to dismantle this inefficient batch system and build a new architecture that could deliver fresh, near real-time data to analysts and data scientists at a fraction of the cost.

How did they solve the problem?

Slack engineered a sophisticated, streaming-based Change Data Capture (CDC) pipeline to provide a real-time flow of data from its production environment to its analytical systems. The capture fleet itself is a large distributed system, running on 20 Kafka Connect clusters with approximately 300 workers and 1,400 concurrent tasks to tail the Vitess binlogs in real time. This solution was built around a modern, open-source data stack designed for scalability and fault tolerance.
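Slack has not published its exact connector configuration, but a capture task of this kind is typically registered by POSTing a JSON definition to the Kafka Connect REST API. The sketch below pairs the Debezium Vitess connector with Apicurio's Avro converter; all hostnames, ports, the keyspace, and the topic prefix are placeholders, and exact property names vary by connector version.

```python
import requests  # third-party: pip install requests

# Hypothetical Debezium Vitess source connector registration against the
# Kafka Connect REST API. Hosts, ports, keyspace, and registry URL are
# placeholders; consult the Debezium docs for the property names supported
# by your connector version.
connector = {
    "name": "vitess-cdc-example-keyspace",
    "config": {
        "connector.class": "io.debezium.connector.vitess.VitessConnector",
        "tasks.max": "4",
        # VTGate endpoint whose VStream the connector tails for change events.
        "vitess.vtgate.host": "vtgate.internal.example.com",
        "vitess.vtgate.port": "15991",
        "vitess.keyspace": "example_keyspace",
        "vitess.tablet.type": "REPLICA",
        # Prefix for the Kafka topics that receive the change events.
        "topic.prefix": "cdc.example_keyspace",
        # Avro serialization with schemas stored in Apicurio Registry.
        "key.converter": "io.apicurio.registry.utils.converter.AvroConverter",
        "key.converter.apicurio.registry.url": "http://apicurio.internal.example.com/apis/registry/v2",
        "value.converter": "io.apicurio.registry.utils.converter.AvroConverter",
        "value.converter.apicurio.registry.url": "http://apicurio.internal.example.com/apis/registry/v2",
    },
}

resp = requests.post(
    "http://connect.internal.example.com:8083/connectors",
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```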

A key component of the solution is the Debezium Vitess Connector, which runs on Kafka Connect. Slack's engineers are the top committers to this open-source project, showcasing an unparalleled level of commitment and expertise. Beyond developing the connector, Slack's engineers implemented a critical binlog event watermarking system, an innovation developed in collaboration with the Debezium and Vitess communities. This solved one of streaming's most elusive challenges: ensuring data correctness by accurately detecting time windows in a distributed system with variable lag.
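The details of Slack's watermarking system live in the Debezium and Vitess projects rather than in this write-up, but the core low-watermark idea can be sketched simply: a time window is only treated as complete once every shard's binlog stream has advanced past its end. The tracker below is an invented, simplified illustration of that principle, not Slack's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkTracker:
    """Tracks per-shard binlog progress and exposes a global low watermark.

    Simplified illustration: an event-time window is declared complete only
    once *every* shard has reported progress beyond its end, which protects
    correctness when shards lag by different amounts.
    """
    shard_positions: dict[str, float] = field(default_factory=dict)

    def observe(self, shard: str, event_ts: float) -> None:
        # Record the latest binlog event timestamp seen for this shard.
        current = self.shard_positions.get(shard, 0.0)
        self.shard_positions[shard] = max(current, event_ts)

    def low_watermark(self) -> float:
        # The slowest shard bounds what data can safely be considered complete.
        return min(self.shard_positions.values(), default=0.0)

    def window_is_complete(self, window_end_ts: float) -> bool:
        return self.low_watermark() >= window_end_ts


# Invented example: three shards with uneven replication lag.
tracker = WatermarkTracker()
tracker.observe("shard-0", 1_700_000_120.0)
tracker.observe("shard-1", 1_700_000_095.0)  # lagging shard
tracker.observe("shard-2", 1_700_000_130.0)

print(tracker.window_is_complete(1_700_000_100.0))  # False: shard-1 has not caught up
```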

To harden the pipeline against data loss, Slack's engineers identified and fixed critical, subtle edge cases in both Debezium and Kafka, contributing these essential patches back to the open-source community.

The captured change events stream through Kafka to a sink connector that archives them in Apache Iceberg format. A periodic merge job transforms the raw change data into a clean, queryable columnar table. Additionally, daily Spark jobs create snapshot tables that serve the pipelines migrated from the legacy batch system.
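Conceptually, the merge job reduces the raw change log to the newest event per primary key and applies it to the Iceberg table. The PySpark sketch below assumes a hypothetical users table, Debezium-style records already flattened with an op column and a source_offset ordering column, and an Iceberg catalog named lake; none of these names come from Slack.

```python
from pyspark.sql import SparkSession, functions as F, Window

# Assumes a Spark session already configured with the Iceberg catalog
# ("lake") and S3 credentials; all table and column names are illustrative.
spark = SparkSession.builder.appName("cdc-merge-example").getOrCreate()

# Raw change events landed by the Iceberg sink connector.
changes = spark.table("lake.cdc_raw.users_changes")

# Keep only the newest change per primary key, ordered by source position.
latest = (
    changes
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("id").orderBy(F.col("source_offset").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)
latest.createOrReplaceTempView("latest_changes")

# Apply inserts, updates, and deletes to the queryable Iceberg table.
spark.sql("""
    MERGE INTO lake.analytics.users AS t
    USING latest_changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
        t.name = s.name, t.email = s.email, t.updated_at = s.updated_at
    WHEN NOT MATCHED AND s.op != 'd' THEN
        INSERT (id, name, email, updated_at)
        VALUES (s.id, s.name, s.email, s.updated_at)
""")
```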

What was the positive outcome? 

The transition to a real-time streaming architecture yielded transformative results for Slack, delivering dramatic and quantifiable improvements in both cost and performance. The impact was felt immediately across all teams that rely on data for analytics and decision-making.

The most significant outcome was the radical reduction in data latency. The previous batch pipeline delivered data that was up to two days old. The new CDC pipeline slashed this latency from 48 hours to less than 10 minutes. This unlocked the ability for analysts, data scientists, and business leaders to work with data that reflects the state of the business in near real time, enabling more agile and informed operations.

Beyond the performance gains, the project resulted in a massive financial benefit. By eliminating the computationally expensive and inefficient process of creating and processing full database backups, the new CDC pipeline slashed costs by millions of dollars annually. This new architecture not only performs better but does so with a much smaller resource footprint, representing a monumental leap in efficiency. End-users now have faster, more reliable access to data, empowering a more data-driven culture throughout the organization.

Additional links: