Spark Streaming with Delta Lake – Tricks and Treats
Delta Lake is a smart storage & metadata layer designed to expand the capabilities of the modern file-based data lake. From data deletion to indexing and ACID transactions, Delta enriches the data lake with actual database capabilities.
This is why it was obvious to us, the Riskified Data Engineering team, that Delta should be at the center of our data lake infrastructure implementation.
In this talk, we will share the challenges we faced building a Spark Streaming platform that incorporates Delta Lake.
You'll hear about using Delta Lake as both a streaming source and a destination, how we implemented automated schema evolution, our hacks for tuning Spark Streaming on Kubernetes for both cost and performance, and more!
Talk language: Hebrew
Big Data Tech Lead at Riskified. Hen has been a key player in the design and development of Riskified's next-gen big data infrastructure using Delta Lake, Airflow, Snowflake, and Spark on Kubernetes. Before Riskified, Hen worked at various tech companies, facing different scaling & big data challenges.
Hen is also an amateur pilot and a foodie, and loves to travel with his wife and kids.