Agenda
Registration & Breakfast
Opening Session
Keynote – How to Build a Self-Serve Data Platform?
“Data is the new oil” – a quote attributed to Clive Humby explains that, like oil, data is valuable, but if unrefined it cannot really be used.
Providing access to high-quality data means nothing if organizations don’t know what to do with it.
The ability to derive meaning from data and use it to create value is directly related to the autonomy of the data consumers.
We will explore how to create an analytics stack that integrates applications that collect, transform and expose the value of data.
Panel
TBD
Coffee Break
Protecting Privacy in the Kingdom of Data: A Guide for Data Engineers
If data is king, then privacy is its crown jewel. Over the last two decades, I’ve worked as a technical leader in the data domain and have experienced firsthand the importance of balancing the collection of valuable data with personal privacy protection.
I will examine real-life examples of privacy violations to emphasize the importance of privacy for data owners. Towards achieving privacy compliance, I will teach:
Different techniques to safeguard personal information, including anonymization, deliberate data decay, and differential privacy.
How and where these techniques can be applied
Additionally, the talk will explore current privacy threats and the role of data engineers in ensuring privacy and security in handling personal and sensitive information.
By the end of the talk, you will understand privacy challenges and learn practical solutions you can apply in your work. I believe that with data becoming increasingly central to our lives, privacy and data are equally important in the kingdom of data management.
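As a taste of one of the techniques mentioned above, here is a minimal differential-privacy sketch: releasing a count query with Laplace noise. It is an illustration only, not material from the talk; the query result, sensitivity, and epsilon are placeholder assumptions.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private estimate of a numeric query result.

    The noise scale is sensitivity / epsilon, the standard Laplace mechanism
    for epsilon-differential privacy.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Example: privately release a count over a user table.
# Counting queries have sensitivity 1 (adding or removing one user changes the count by at most 1).
true_count = 1_042  # hypothetical query result
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(f"private count: {private_count:.0f}")
```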
Vector Similarity Server, a Short Intro
Vector similarity search engines are a new type of database, applicable to many use cases such as anomaly detection, recommendation, and search.
In this talk, we will demonstrate building a personalized search engine with Vecsim and CLIP.
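To make the idea concrete, here is a minimal sketch of a brute-force vector similarity search over embedding vectors. It does not use the Vecsim API; the embedding dimension and the data are placeholders. In the talk’s setting, the embeddings would come from CLIP and a vector database such as Vecsim would replace the brute-force scan.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most similar vectors by cosine similarity."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = index_norm @ query_norm
    return np.argsort(-scores)[:k]

# Stand-ins for CLIP embeddings: 10k indexed image vectors and one text query vector.
catalog_vectors = np.random.rand(10_000, 512).astype(np.float32)
query_vector = np.random.rand(512).astype(np.float32)
print(cosine_top_k(query_vector, catalog_vectors, k=5))
```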
Death by Thousand Schema Changes: The Mechanics of Schema Evolution
Schema changes should be a simple everyday event, right?
They are not!
Even with years of experience, production gets broken more often than we’d like to admit when our schema evolves, and in this talk, we’re going to explore why.
Through the analysis of several production and data incidents, we’re going to uncover the mechanics of schema changes, their symbiosis with production environments, and the overwhelming complexity of modern software systems.
You’re going to leave this talk with a concrete model for addressing schema changes methodically, hopefully making your next one not as painful as mine.
Connect the Dots, Easier Said Than Done
My 5-year-old son is crazy about Lego. He taught me there are multiple ways to connect two Lego bricks. This made me realize that the same approach can be applied to my work as a Data Scientist.
Using Lego, I will explore how different data points can be connected and combined to build a robust model for improving fraud detection in e-commerce using machine learning, which is what we do at Riskified.
For example, addresses can be connected and combined by geographical distance or by textual similarity, which gives us a hybrid approach for improving our models.
By the end of this talk, you will gain an understanding of how hybrid techniques can improve the performance of predictive models and be able to build your own hybrid Lego model.
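As a rough illustration of the hybrid idea (not Riskified’s actual features), the sketch below combines a geographic distance with a textual similarity for a pair of addresses; the addresses and coordinates are made up.

```python
from difflib import SequenceMatcher
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def address_similarity(addr_a: str, addr_b: str) -> float:
    """Normalized textual similarity between two address strings (0..1)."""
    return SequenceMatcher(None, addr_a.lower(), addr_b.lower()).ratio()

# Hybrid features for a hypothetical billing/shipping address pair:
# both the geographic distance and the textual similarity can feed a fraud model.
features = {
    "geo_distance_km": haversine_km(32.0853, 34.7818, 31.7683, 35.2137),
    "text_similarity": address_similarity("10 Rothschild Blvd, Tel Aviv", "Rothschild Boulevard 10, Tel-Aviv"),
}
print(features)
```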
Cool Down Your Compute: Advanced Iceberg Features That Will Help You Manage Your Data
Apache Iceberg gave rise to many important features that free data engineers from many common pains, such as schema evolution, concurrent writes and reads, and scan performance. However, the Iceberg project is much more than a spec. It rides on a Java library that exposes a powerful API, and on a set of metadata tables that enable building powerful applications.
The goal of this talk is to highlight more nuanced yet important features of the Iceberg API and demonstrate the capabilities they make available to us. Specifically, I will focus on metadata queries, tagging/branching, and how to run data management tasks.
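For a preview of the kinds of operations covered, here is a minimal sketch assuming a Spark session with the Iceberg runtime and SQL extensions configured; the catalog and table names (demo.db.events) are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-metadata-demo").getOrCreate()

# Metadata tables: every Iceberg table exposes snapshots, history, files, and more.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots").show()
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show(5)

# Tagging and branching, via Iceberg's Spark SQL extensions.
spark.sql("ALTER TABLE demo.db.events CREATE TAG `end-of-quarter`")
spark.sql("ALTER TABLE demo.db.events CREATE BRANCH `backfill-test`")

# Data management tasks via stored procedures, e.g. expiring old snapshots.
spark.sql(
    "CALL demo.system.expire_snapshots(table => 'db.events', "
    "older_than => TIMESTAMP '2023-01-01 00:00:00')"
)
```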
How I Became Famous With Data Analysis, and How I Could Do It Much Better
Just a few months ago, I created a simple Tableau dashboard with a nice dataset of all Israeli names over the years. The dashboard went viral, with almost 80,000 views, articles, and interviews, something I could never have predicted.
In this lecture I’ll share the story behind the scenes. I’ll give my take on why it was so successful, and use this dashboard to illustrate all the mistakes I’ve made, and how I could do it much better.
Lightning talks
A Case of Customized Clustering: Choose Your Loss
Plenty of sophisticated out-of-the-box clustering solutions are readily available in data science libraries. However, they are of little help when the clusters must optimize a complex, specific loss function that is not easily differentiable. In this talk, I will walk through a simple tailored clustering algorithm we developed, which uses a customized plug-and-play loss function to cluster sequential data. By exploiting the additivity of our loss function, we dynamically optimize it in polynomial time. This approach is applicable to a vast range of single-dimensional clustering problems, from time-based traffic modeling to age-based insurance pricing!
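A minimal sketch of the general pattern described above: dynamic programming over contiguous segments of a 1-D sequence with an additive, plug-and-play loss. The within-cluster sum of squares used here is a stand-in; the talk’s actual loss function is domain-specific.

```python
import numpy as np

def segment_loss(values: np.ndarray) -> float:
    """Plug-and-play loss for one cluster; here: within-cluster sum of squares.
    Any additive loss computed per contiguous segment can be substituted."""
    return float(((values - values.mean()) ** 2).sum()) if len(values) else 0.0

def cluster_1d(values: np.ndarray, k: int):
    """Partition a 1-D sequence into k contiguous clusters minimizing the total
    additive loss, via dynamic programming in O(k * n^2) evaluations."""
    n = len(values)
    cost = np.full((n + 1, k + 1), np.inf)
    split = np.zeros((n + 1, k + 1), dtype=int)
    cost[0, 0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):
                c = cost[s, j - 1] + segment_loss(values[s:i])
                if c < cost[i, j]:
                    cost[i, j], split[i, j] = c, s
    # Recover cluster boundaries by walking the split table backwards.
    bounds, i = [], n
    for j in range(k, 0, -1):
        bounds.append((split[i, j], i))
        i = split[i, j]
    return cost[n, k], list(reversed(bounds))

# Example: hourly traffic counts clustered into 3 time-of-day regimes.
traffic = np.array([3, 4, 2, 3, 20, 25, 22, 21, 8, 9, 7, 8], dtype=float)
total_loss, clusters = cluster_1d(traffic, k=3)
print(total_loss, clusters)
```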
Lunch
Nourishing the Mind of AI: A Guide to a Healthy Data Journey
We live in an era of big data and machine learning, where data drives the decision-making processes for organizations. This data must be managed and processed effectively to ensure the success of any system.
A healthy data journey is a foundation for any big data and machine learning-based system to reach its full potential. The data journey encompasses all aspects of data management, from its collection and validation to its transformation into insights that inform decisions.
The importance of a healthy data journey cannot be overstated. Without it, the results of big data and machine learning systems will be inaccurate, unreliable, and ultimately less valuable.
Today, we will delve into the critical components of a healthy data journey and the steps organizations can take to ensure the quality of their data. We will explore the role of input validation, API design, data health monitoring, and well-designed data transformation mechanisms in ensuring the success of a big data and machine learning-based system.
By the end of this talk, you will have a deeper understanding of the significance of a healthy data journey and the steps you can take to ensure that your big data and machine learning-based systems are functioning at their best. So, join us as we embark on this journey to discover the key to unlocking the full potential of big data and machine learning.
Are You Sure That You’re Sure? Estimating Confidence in Your Data Work
Trust is hard to gain and easy to lose, especially when it comes to data. Recall the last time you sent out a report: were your hands a little sweaty? Did your heart skip a beat? This is because data is never 100% accurate. As data people, we strive for that sweet spot between data-integrity risk and unattainable, perfectly-correct-and-complete data. We call that spot “good enough”. But can you be truly confident that your output is “good enough”? I assert that you definitely can! In this talk I’ll share my data confidence meter – a framework for increasing assurance in your work and instilling trust in the stakeholders you’re working with.
Real Time and Batch in a Single Dataset
Traditionally, real-time and batch data pipelines are consumed from different sources.
At Outbrain, we’ve built a unified infrastructure for batch and real-time processing that enables us to stream all data into a single dataset.
In this session, I will explain how real-time data is unified with batch calculations to produce a coherent view of our core data.
Selling the Future: Navigating the Non-Tech Challenges of ML Products
In this talk, we will delve into the often overlooked but critical challenges of bringing a machine learning product to market. From idea conception to product launch, there are numerous non-technical hurdles that must be overcome in order to create and sell a successful machine learning product.
We will talk about what our product, dev, and business teams need to change or be aware of for our ML product to succeed.
Join us as we explore the obstacles, lessons learned, and best practices in this field. We will discuss key areas such as product validation, go-to-market strategy, and stakeholder management.
This talk is aimed at data scientists, machine learning engineers, and business professionals who work or think of working on machine learning products. Whether you are starting a new venture or seeking to improve the success of an existing product, I hope this session will provide valuable insights and inspiration for your journey.
Dismantling Big Data With DuckDB
What if I told you that you do not need all of your Big Data architecture and tech stack? What if I told you that you could save a lot of money and resources, all while improving the developer experience for all your data needs?
DuckDB is revolutionizing the way we view and handle Big Data. I will show you how you can utilize DuckDB to your advantage and address your data needs using this in-process in-memory OLAP DB in ways you never thought possible.
Having worked with Big Data and OLAP engines for many years, I know exactly where this new OLAP engine would have the highest impact in your architecture and how you should apply it.
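As a flavor of how lightweight this can be, here is a minimal DuckDB sketch that aggregates a directory of Parquet files in-process; the file path and column names are placeholder assumptions.

```python
import duckdb

con = duckdb.connect()  # in-memory database by default; no cluster or server to provision

daily_stats = con.sql("""
    SELECT date_trunc('day', event_time) AS day,
           count(*)                      AS events,
           count(DISTINCT user_id)       AS users
    FROM 'events/*.parquet'
    GROUP BY 1
    ORDER BY 1
""").df()  # results come back as a pandas DataFrame

print(daily_stats.head())
```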
Teaching Your Model to Do Two Things at Once
In the world of job search, we must carefully balance the needs of two parties: jobseekers and employers. Jobseekers expect highly relevant jobs to show up at the top of their search results while employers expect their position in search results to correlate well with their bids.
Correspondingly, in the past few years we’ve designed and redesigned job ranking algorithms which try to solve two problems at once. Solving the first problem serves our jobseekers: how can you optimally rank jobs given a search query and thereby ensure a positive user experience? Solving the second problem serves our employers: how can you accurately estimate the probability that a jobseeker will click on a job and thereby use this click-through-rate to determine business value?
Today, we’ll share how we’ve taught a model to do two things at once. We’ll explore past, current, and future solutions and review the exciting challenges along the way.
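The general pattern, sketched below under placeholder assumptions (feature sizes, loss weight, synthetic data), is a shared encoder with two heads: one producing a relevance score for ranking and one producing a click-through-rate estimate. This is an illustration of the multi-task idea, not the production model described in the talk.

```python
import torch
import torch.nn as nn

class TwoTaskRanker(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.relevance_head = nn.Linear(hidden, 1)  # ranking score
        self.ctr_head = nn.Linear(hidden, 1)        # click probability (logit)

    def forward(self, x):
        h = self.encoder(x)
        return self.relevance_head(h).squeeze(-1), self.ctr_head(h).squeeze(-1)

model = TwoTaskRanker(n_features=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, 20)                 # synthetic query/job feature vectors
relevance_target = torch.randn(128)      # e.g. graded relevance labels
clicked = torch.randint(0, 2, (128,)).float()

optimizer.zero_grad()
rel_pred, ctr_logit = model(x)
# Weighted sum of the two task losses; the 0.5 weight is an arbitrary placeholder.
loss = nn.functional.mse_loss(rel_pred, relevance_target) \
     + 0.5 * nn.functional.binary_cross_entropy_with_logits(ctr_logit, clicked)
loss.backward()
optimizer.step()
```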
Coffee Break
Is AI Ready to Fully Take on the Work of Humans? Depending on the Use Case
eBay regularly works with a gigantic pool of data, collaborating with many different vendors in the process. In the past few years, the industry has seen an ongoing surge in the trend of using AI to perform actions so far done by humans; and here at eBay, too, we’ve been continuously examining this alternative, among other things to automate our catalog content management processes.
But our attempts, well… how should we put it? They made us realize that, to date, the very best results can be achieved by using a combination of Data Operations professionals and ML automations. Simply put, human-machine synergy.
Today, we know that a scenario where AI fully replaces humans is at least a few more years away.
So, how did we create our combined work method? What is the proper task division between humans and machines? How do we make sure the human input is being leveraged to enhance future machine output? And do we really need to worry about being replaced?
All this and more will be discussed as part of the talk.
Testing Machine Learning Code and Artifacts, the Sane Way
Machine learning code is known for a lot of things, but testing is not one of them. Let’s face it: testing machine learning code is challenging! We have been missing out on one of the biggest productivity boosters in modern software development. This talk will hopefully change that a bit.
In this talk, rather than treat testing as a “necessary evil”, we will offer several testing strategies to make it easy and somewhat fun. We will cover a few simple but powerful tools for keeping your code problem-free.
Taking Your Cloud Vendor to the Next Level: Solving Massive-Scale Data Challenges
Akamai’s content delivery network (CDN) processes about 30% of the internet’s daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors.
In this talk, we will discuss the barriers faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services.
We will describe the iterative process of re-architecting a massive-scale data platform using the aforementioned technologies.
We will also delve into how, today, Akamai is able to quickly ingest TBs of data and make them available to customers, as well as efficiently query PBs of data and return results within 10 seconds for most queries.
This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.
It Will Affect Us All: A Digital and Data Revolution Is Coming to the Automotive Industry
There are multiple revolutions taking place in the automotive industry at the same time – EVs (electric vehicles), AVs (autonomous vehicles), SDVs (software-defined vehicles), connected vehicles, and services. In the future of mobility, data will be a major factor at every level, affecting us all.
Boaz, who has been part of this process for the last few years, will take us through the journey – both in terms of the way the automotive industry is transforming as well as how data and digital are increasingly being used. Furthermore, we will discuss how it opens up opportunities for Israeli startups, data companies, and even employees with new professional options.
Keynote – Data for the People
Did you ever wonder who should wake up at night when a data pipeline goes down? The engineers who built it? The engineers who use it? The DevOps people?
In this talk, I’ll uncover the inner workings of Taboola’s Data Platform Engineering. The data platform should not be just for a select few – all engineers and data scientists should be encouraged to contribute. I’ll share some methods to maintain the integrity of the data, architectural considerations that make it all possible, and what your R&D culture has to do with it.