Agenda
Registration & Breakfast
Opening Session
Keynote – How to Build a Self-Serve Data Platform?
“Data is the new oil,” a quote attributed to Clive Humby, suggests that, like oil, data is valuable, but if unrefined it cannot really be used.
Providing access to high-quality data means nothing if organizations don’t know what to do with it.
The ability to derive meaning from data and use it to create value is directly related to the autonomy of data consumers.
We will explore how to create an analytics stack that integrates applications to collect and transform data and expose its value.
The Data Swamp Face Off: Understanding the Interplay Between Different Data Roles in an Organization
In the data world, there are different types of roles that have different responsibilities and expectations. Data producers are those who generate, collect, or provide data for various purposes. Data consumers are those who use, analyze, or derive insights from data to support decision making or action. Data enablers are those who facilitate, manage, or optimize the data flow and quality between data producers and consumers.
Understanding the interplay between these roles is crucial for creating a successful data strategy and delivering value from data. In this panel, we will explore the challenges and opportunities that arise from the interactions between these players.
Coffee Break
Protecting Privacy in the Kingdom of Data: A Guide for Data Engineers
If data is king, then privacy is its crown jewel. Over the last two decades, I’ve worked as a technical leader in the data domain and have experienced firsthand the importance of balancing the collection of valuable data with personal privacy protection.
I will examine real-life examples of privacy violations to emphasize the importance of privacy for data owners. To move you towards privacy compliance, I will cover:
Different techniques to safeguard personal information, including anonymization, deliberate data decay, and differential privacy.
How and where these techniques can be applied.
Additionally, the talk will explore current privacy threats and the role of data engineers in ensuring privacy and security in handling personal and sensitive information.
By the end of the talk, you will understand privacy challenges and learn practical solutions you can apply in your work. I believe that with data becoming increasingly central to our lives, privacy and data are equally important in the kingdom of data management.
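One of the techniques listed above, differential privacy, can be illustrated with a minimal sketch: a toy counting query protected with Laplace noise. The helper names and the age data below are illustrative assumptions, not material from the talk.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=1.0):
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so adding Laplace(1/epsilon) noise satisfies epsilon-DP.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38, 61, 27]
noisy = private_count(ages, lambda a: a > 30, epsilon=0.5)  # true answer is 5
```

Smaller epsilon means more noise and stronger privacy; the sketch trades exactness for a guarantee about any individual's presence in the data.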
Vector Similarity Server, a Short Intro
Vector similarity search engines are a new type of database, applicable to many use cases such as anomaly detection, recommendation, and search.
In this talk, we will demonstrate building a personalized search engine with Vecsim and CLIP.
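The core idea can be sketched as a brute-force nearest-neighbour search over embedding vectors. In practice, CLIP would produce the embeddings and an engine like Vecsim would index them with approximate methods; the three-dimensional toy vectors and item names below are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Rank (item_id, vector) pairs by similarity to the query vector."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

index = [
    ("red shoes",  [0.9, 0.1, 0.0]),
    ("blue shoes", [0.7, 0.6, 0.1]),
    ("toaster",    [0.0, 0.2, 0.9]),
]
results = top_k([0.8, 0.2, 0.0], index, k=2)
```

A real vector database replaces the linear scan with an approximate index (e.g. HNSW) so the same query scales to millions of vectors.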
Death by a Thousand Schema Changes: The Mechanics of Schema Evolution
Schema changes should be a simple, everyday event, right?
They Are Not!
Even with years of experience, production gets broken more often than we’d like to admit when our schemas evolve, and in this talk, we’re going to explore why.
Through the analysis of several production and data incidents, we’re going to uncover the mechanics of schema changes, their symbiosis with production environments, and the overwhelming complexity of modern software systems.
You’re going to leave this talk with a concrete model for addressing schema changes methodically, hopefully making your next one not as painful as mine.
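To ground the idea of reasoning about schema changes methodically, here is a minimal sketch of a compatibility check. The dict-based schema format and the three rules below are illustrative assumptions, not a tool from the talk.

```python
def breaking_changes(old_schema, new_schema, added_with_default=()):
    """Flag schema changes that can break existing consumers.

    Schemas are {column: type-name} dicts. Dropped columns and type
    changes break readers that expect the old shape; columns added
    without a default break writers still emitting the old shape.
    """
    changes = []
    for col, typ in old_schema.items():
        if col not in new_schema:
            changes.append(f"dropped column: {col}")
        elif new_schema[col] != typ:
            changes.append(f"type change on {col}: {typ} -> {new_schema[col]}")
    for col in new_schema:
        if col not in old_schema and col not in added_with_default:
            changes.append(f"added column without default: {col}")
    return changes

old = {"id": "int", "email": "string"}
new = {"id": "int", "email": "text", "age": "int"}
problems = breaking_changes(old, new)
```

Running such a check in CI before a migration ships is one concrete way to catch the incidents this talk dissects.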
Divide and Conquer in the Works: Leveraging Sub-Population Splits for Accurate Fraud Predictive Models
In the fraud prevention industry, identifying fraudulent activities can be a complex and challenging task. Accurate predictive models are crucial in order to prevent fraudulent activities before they occur. Splitting data into sub-populations is a powerful technique that can improve the accuracy of predictive models by segmenting data into meaningful groups based on common characteristics.
In this session, we will explore the fundamental principles of data splitting and its practical application in real-world scenarios. We will discuss the benefits of data segmentation in fraud prevention and develop a comprehensive understanding of how it can help refine predictive models, minimize false positives, and optimize business value in the industry.
Furthermore, we will learn about the best practices for executing data splitting strategies and how to seamlessly integrate these strategies into existing fraud prevention systems.
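The benefit of segmentation can be sketched with a toy model: a per-segment anomaly threshold instead of one global cutoff. The segment names and the simple mean-plus-z-score rule are illustrative assumptions, not the presenter's model.

```python
from collections import defaultdict
from statistics import mean, pstdev

def fit_per_segment(transactions, segment_key, value_key="amount", z=2.0):
    """Fit a per-segment threshold of mean + z * std.

    Each transaction is compared to its own segment's distribution,
    instead of to one global cutoff that mixes unlike populations.
    """
    by_segment = defaultdict(list)
    for t in transactions:
        by_segment[t[segment_key]].append(t[value_key])
    return {seg: mean(vals) + z * pstdev(vals) for seg, vals in by_segment.items()}

def flag(transaction, thresholds, segment_key, value_key="amount"):
    """Flag a transaction as anomalous within its own segment."""
    return transaction[value_key] > thresholds[transaction[segment_key]]

txns = [
    {"segment": "retail", "amount": 20}, {"segment": "retail", "amount": 25},
    {"segment": "retail", "amount": 22}, {"segment": "b2b", "amount": 5000},
    {"segment": "b2b", "amount": 5200}, {"segment": "b2b", "amount": 4800},
]
thresholds = fit_per_segment(txns, "segment")
```

A $100 charge is wildly suspicious for the retail segment here yet perfectly normal for b2b, which is exactly the false-positive reduction that segmentation buys.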
Cool Down Your Compute: Advanced Iceberg Features That Will Help You Manage Your Data
Apache Iceberg gave rise to many important features that free data engineers from common pains, such as schema evolution, concurrent writes and reads, and scan performance. However, the Iceberg project is much more than a spec. It rides on a Java library that exposes a powerful API, and on a set of metadata tables that enable building powerful applications.
The objective of this talk is to highlight more nuanced yet important features of the Iceberg API and demonstrate the capabilities they make available to us. Specifically, I will focus on metadata queries, tagging/branching, and how to run data management tasks.
How I Became Famous With Data Analysis, and How I Could Do It Much Better
Just a few months ago, I created a simple Tableau dashboard with a nice dataset of all Israeli names over the years. The dashboard went viral, with almost 80,000 views, articles, and interviews, something I could never have predicted.
In this lecture I’ll share the story behind the scenes. I’ll give my take on why it was so successful, and use this dashboard to illustrate all the mistakes I’ve made, and how I could do it much better.
Lightning talks
A Case of Customized Clustering: Choose Your Loss
Plenty of sophisticated out-of-the-box clustering solutions are readily available in data science libraries. However, they are of little help when the clusters must optimize a complex and specific loss function that is not easily differentiable. In this talk, I will walk through a simple tailored clustering algorithm we developed, which uses a customized plug-and-play loss function to cluster sequential data. By utilizing the additivity of our loss function, we dynamically optimize it in polynomial time. This approach is applicable to a vast range of single dimensional clustering problems, from time based traffic modeling, to age based insurance pricing!
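The additive-loss dynamic program described above can be sketched as follows. This is a minimal illustration with a squared-error plug-in loss over contiguous segments of sorted 1D data; the talk's actual loss function and data are not shown.

```python
def cluster_1d(values, k, segment_loss=None):
    """Optimally split sorted 1D values into k contiguous clusters,
    minimizing an additive loss, via dynamic programming in O(k * n^2)
    evaluations of the plug-and-play segment_loss."""
    if segment_loss is None:
        def segment_loss(seg):  # default: sum of squared deviations from mean
            m = sum(seg) / len(seg)
            return sum((x - m) ** 2 for x in seg)

    n = len(values)
    INF = float("inf")
    # best[j][i] = minimal loss of splitting values[:i] into j segments
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                cand = best[j - 1][s] + segment_loss(values[s:i])
                if cand < best[j][i]:
                    best[j][i] = cand
                    cut[j][i] = s
    # Walk the cut table backwards to recover segment boundaries
    bounds, i = [], n
    for j in range(k, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    return best[k][n], list(reversed(bounds))

values = sorted([1, 2, 2, 10, 11, 12, 30])
loss, segments = cluster_1d(values, k=3)
```

Because the loss is additive over segments, the optimal split of a prefix depends only on its last cut point, which is what makes the polynomial-time recursion valid.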
Lunch
Help Your Organization Make Great Decisions Quickly by Providing High-Fidelity Data
We live in an era of big data and machine learning, where data drives the decision-making processes for organizations. This data must be managed and processed effectively to ensure the success of any system.
A healthy data journey is a foundation for any big data and machine learning-based system to reach its full potential. The data journey encompasses all aspects of data management, from its collection and validation to its transformation into insights that inform decisions.
The importance of a healthy data journey cannot be overstated. Without it, the results of big data and machine learning systems will be inaccurate, unreliable, and ultimately less valuable.
Today, we will delve into the critical components of a healthy data journey and the steps organizations can take to ensure the quality of their data. We will explore the role of input validation, API design, data health monitoring, and well-designed data transformation mechanisms in ensuring the success of a big data and machine learning-based system.
By the end of this talk, you will have a deeper understanding of the significance of a healthy data journey and the steps you can take to ensure that your big data and machine learning-based systems are functioning at their best. So, join us as we embark on this journey to discover the key to unlocking the full potential of big data and machine learning.
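The input-validation component mentioned above can be sketched with a toy schema check. The field names and the {field: type} schema format are made-up illustrations; real pipelines would typically lean on a schema library.

```python
def validate_event(event, schema):
    """Check one incoming record against a simple {field: type} schema
    and return a list of problems (an empty list means a healthy record)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(
                f"bad type for {field}: expected {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return errors

schema = {"user_id": str, "amount": float, "ts": int}
good = validate_event({"user_id": "u1", "amount": 9.5, "ts": 123}, schema)
bad = validate_event({"user_id": "u1", "amount": "9.5"}, schema)
```

Rejecting or quarantining records at the door like this keeps bad inputs from silently corrupting every downstream transformation.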
Are You Sure That You’re Sure? Estimating Confidence in Your Data Work
Trust is hard to gain and easy to lose, especially when it comes to data. Recall the last time you sent out a report: were your hands a little sweaty? Did your heart skip a beat? This is because data is never 100% accurate. As data people, we strive for the sweet spot between data-integrity risk and that unattainable perfectly-correct-and-complete data. We call that spot “good enough”. But can you be truly confident that your output is “good enough”? I assert that you definitely can! In this talk I’ll share my data confidence meter, a framework for increasing assurance in your work and building trust with the stakeholders you work with.
Real Time and Batch in a Single Dataset
Traditionally, real-time and batch data are consumed from different sources.
At Outbrain, we’ve built a unified infrastructure for batch & real-time processing that enables us to stream all data into a single dataset.
In this session, I will explain how real-time data is unified with batch calculations to produce a coherent view of our core data.
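The snapshot-plus-stream merge idea can be sketched with a toy key-value view. This is not Outbrain's actual implementation; the watermark rule and record shapes are illustrative assumptions.

```python
def unified_view(batch_rows, stream_events, watermark):
    """Merge a batch snapshot with real-time events into one dataset.

    Batch rows are authoritative up to the watermark timestamp; streaming
    events newer than the watermark are layered on top, with the latest
    event per key winning. Late events already covered by the batch run
    are dropped to avoid double counting.
    """
    view = {row["key"]: row for row in batch_rows}
    for event in sorted(stream_events, key=lambda e: e["ts"]):
        if event["ts"] > watermark:
            view[event["key"]] = event
    return view

batch = [{"key": "a", "clicks": 100, "ts": 10}, {"key": "b", "clicks": 40, "ts": 10}]
stream = [{"key": "a", "clicks": 105, "ts": 12}, {"key": "c", "clicks": 3, "ts": 11},
          {"key": "a", "clicks": 101, "ts": 9}]  # late event, already in the batch
view = unified_view(batch, stream, watermark=10)
```

Consumers query one dataset and never need to know whether a given row arrived via the batch or the real-time path.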
Selling the Future: Navigating the Non-Tech Challenges of ML Products
In this talk, we will delve into the often overlooked but critical challenges of bringing a machine learning product to market. From idea conception to product launch, there are numerous non-technical hurdles that must be overcome in order to create and sell a successful machine learning product.
We will talk about what our product, dev, and business teams need to change or be aware of for our ML product to succeed.
Join us as we explore the obstacles, lessons learned, and best practices in this field. We will discuss key areas such as product validation, go-to-market strategy, and stakeholder management.
This talk is aimed at data scientists, machine learning engineers, and business professionals who work or think of working on machine learning products. Whether you are starting a new venture or seeking to improve the success of an existing product, I hope this session will provide valuable insights and inspiration for your journey.
Dismantling Big Data With DuckDB
What if I told you that you do not need your entire Big Data architecture and tech stack? What if I told you that you could save a lot of money and resources while improving the developer experience for all your data needs?
DuckDB is revolutionizing the way we view and handle Big Data. I will show you how to utilize DuckDB to your advantage and address your data needs with this in-process analytical (OLAP) database in ways you never thought possible.
Having worked with Big Data and OLAP engines for many years now, I know exactly where this new OLAP engine would have the highest impact in your architecture and how you should apply it.
Teaching Your Model to Do Two Things at Once
In the world of job search, we must carefully balance the needs of two parties: jobseekers and employers. Jobseekers expect highly relevant jobs to show up at the top of their search results while employers expect their position in search results to correlate well with their bids.
Correspondingly, in the past few years we’ve designed and redesigned job ranking algorithms which try to solve two problems at once. Solving the first problem serves our jobseekers: how can you optimally rank jobs given a search query and thereby ensure a positive user experience? Solving the second problem serves our employers: how can you accurately estimate the probability that a jobseeker will click on a job and thereby use this click-through-rate to determine business value?
Today, we’ll share how we’ve taught a model to do two things at once. We’ll explore past, current, and future solutions and review the exciting challenges along the way.
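The two-objectives-at-once idea can be sketched as a single blended training loss. This is a toy formulation; the alpha weight and the loss shapes are illustrative assumptions, not the production ranking model.

```python
import math

def log_loss(p, y):
    """Cross-entropy for a predicted click probability p and label y in {0, 1}."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def combined_loss(rel_pred, rel_true, ctr_pred, clicked, alpha=0.7):
    """Blend the jobseeker objective (relevance regression) with the
    employer objective (a calibrated click probability) via weight alpha."""
    relevance_loss = (rel_pred - rel_true) ** 2
    ctr_loss = log_loss(ctr_pred, clicked)
    return alpha * relevance_loss + (1.0 - alpha) * ctr_loss

# A model whose CTR estimate matches the observed click is penalized less.
better = combined_loss(0.8, 0.8, 0.9, 1)
worse = combined_loss(0.8, 0.8, 0.1, 1)
```

Training one model on such a blend forces its shared representation to serve both parties, with alpha as the business dial between relevance and bid-correlated CTR accuracy.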
Coffee Break
Is AI Ready to Fully Take on the Work of Humans? It Depends on the Use Case
eBay regularly works with a gigantic pool of data, collaborating with many different vendors in the process. In the past few years, the industry has seen an ongoing surge in the trend of using AI to perform actions so far done by humans; and here at eBay, too, we’ve been continuously examining this alternative, among other things to automate our catalog content management processes.
But our attempts, well… how should we put it? Made us realize that to date, the very best results can be achieved by using a combination of Data Operations professionals and ML automations. Simply put, human-machine synergy.
Today, we know that a scenario where AI fully replaces humans is at least a few more years away.
So, how did we create our combined work method? What is the proper task division between humans and machines? How do we make sure the human input is being leveraged to enhance future machine output? And do we really need to worry about being replaced?
All this and more will be discussed as part of the talk.
Testing Machine Learning Code and Artifacts, the Sane Way
Machine learning code is known for a lot of things, but testing is not one of them. Let’s face it: testing machine learning code is challenging! We have been missing out on one of the biggest productivity boosters in modern software development. This talk will hopefully change that a bit.
In this talk, rather than treat testing as a “necessary evil”, we will offer several testing strategies to make it easy and somewhat fun. We will cover a few simple but powerful tools for keeping your code problem-free.
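One such strategy, behavioural checks that assert properties of data-processing code rather than brittle hard-coded outputs, might look like this sketch. The `normalize` function is a made-up example, not code from the talk.

```python
def normalize(features):
    """Min-max scale a list of numbers into [0, 1]."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(x - lo) / (hi - lo) for x in features]

# Behavioural tests: properties any correct normalizer must satisfy,
# instead of comparisons against golden outputs that break on refactors.
def test_output_range():
    out = normalize([3, 7, 1, 9])
    assert all(0.0 <= x <= 1.0 for x in out)

def test_constant_input_is_handled():
    assert normalize([5, 5, 5]) == [0.0, 0.0, 0.0]

def test_reversal_equivariance():
    assert normalize([1, 2, 3])[::-1] == normalize([3, 2, 1])

for test in (test_output_range, test_constant_input_is_handled,
             test_reversal_equivariance):
    test()
```

The same style extends to models: asserting invariances (e.g. prediction unchanged under irrelevant input permutations) tests behaviour without pinning exact numbers.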
Taking Your Cloud Vendor to the Next Level: Solving Massive-Scale Data Challenges
Akamai’s content delivery network (CDN) processes about 30% of the internet’s daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors.
In this talk, we will discuss the barriers faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services.
We will describe the iterative process of re-architecting a massive-scale data platform using the aforementioned technologies.
We will also delve into how today, Akamai is able to quickly ingest and make available to customers TBs of data, as well as efficiently query PBs of data and return results within 10 seconds for most queries.
This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.
The Last Mile of Machine Learning Apps
AI apps are harder to make successful than general software apps; you can have the best research, top developers, and the sharpest product team in town and still face failure.
For many AI teams, an all-too-common script starts with a solid product requirement (such as predicting customers at risk of churning) and continues with developing a cutting-edge, highly accurate model. But eventually, after seeing almost no usage among customers, the journey ends with the app being pulled back due to low ROI, seemingly doomed to failure. But this is not a prophecy!
Much can be done to make sure our apps will be useful. The last mile of nurturing users is where most AI apps fail; while developing a working model is the enabler for our solutions, it’s important not to forget that the end users are the ones who decide our apps’ success.
During the talk, we will walk through five key failure points to avoid: hard-learned lessons to make sure your AI apps make it to the finish line.
Keynote – When Personalization Meets Creativity: How Personalizing the User Journey Boosts Creativity and Optimizes Revenue
As data professionals, we always aim to leverage data to enable users to benefit from the endless possibilities of our applications and services. In this talk, we will share how we built personalized multimodal machine learning models on top of our data and machine learning pipelines to personalize the journey of Lightricks users, so they can boost their creativity and drive business success. By the end of the talk, you’ll learn best practices you can adopt to advance your own business goals through personalization.