Agenda
Opening remarks
The Future of Data is Words
The world is changing; it’s pretty clear by this point.
With the advent of large language models (LLMs) and generative AI, we are watching the boundaries of what a machine can understand and do, simply by using words, being pushed further every day. This revolution is not only reshaping technological possibilities but also redefining people’s expectations.
However, while this revolution is happening, our dear old friend SQL is officially celebrating its 50th birthday, proving itself once more as a survivor in the world of data.
So what is it going to be? Who will control the future of data? Words or SQL?
Let’s take a journey into the past, present and future of querying data.
We will talk about how SQL became so popular, how it survived for so long, and what still makes it so sticky 50 years later. We will cover the LLM revolution, the challenges of leveraging AI to transition from SQL to NLQ (natural language querying), and some hints at the technologies that could get us closer, such as RAG, semantic layers, and knowledge graphs.
Lastly, we will peek into the crystal ball and try to figure out what the future might bring, how it might change the way we see modern data platforms and data self-service, and what all of this might mean for us data professionals.
Investing in Data: Unlocking Tomorrow's Opportunities
In a rapidly evolving world, data is the future. This panel brings together four leading investors from diverse backgrounds to explore the opportunities and challenges that lie ahead in the data domain. From transformative technologies to emerging startups, our experts will share their insights on which trends will shape tomorrow. Join us for a forward-looking discussion that will uncover how data-driven innovation is unlocking new paths for growth and success.
Sparkless Patterns: The New Data Stack and Modern Data Architecture
In recent years, we have witnessed the emergence of powerful frameworks and libraries, such as Apache DataFusion, DuckDB, and Polars, that enable us to query and transform vast amounts of data at blazing speeds, even with modest compute resources. These frameworks not only enhance our capabilities but also pave the way for new kinds of data architectures and design patterns that challenge the traditional design of our ETL processes and query infrastructure. Workloads that once required a distributed Spark cluster to complete in a reasonable time, or queries that needed data warehouses to scan vast datasets, can now be processed more efficiently and with less overhead.
In this session, I will focus on three major frameworks: DuckDB, Apache DataFusion, and Polars, which are central to what many refer to as the ‘new data stack’ and its transformative capabilities. We will discuss the significant advantages these frameworks offer to data engineers, especially in cloud-native environments, and demonstrate the new data design patterns they enable through real use cases that utilize them.
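As a taste of what these frameworks look like in practice, here is a minimal sketch (not taken from the talk) of the same aggregation expressed in DuckDB SQL and in Polars; the Parquet path is hypothetical and a recent Polars API is assumed.

```python
# Minimal sketch (not from the talk): the same aggregation in DuckDB SQL and in Polars,
# both running on a single machine against a hypothetical Parquet dataset.
import duckdb
import polars as pl

# DuckDB: query Parquet files directly with SQL, no cluster required.
top_events = duckdb.sql("""
    SELECT event_type, COUNT(*) AS events
    FROM read_parquet('events/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 10
""").df()

# Polars: the same aggregation as a lazy query plan, executed only on collect().
top_events_pl = (
    pl.scan_parquet("events/*.parquet")
      .group_by("event_type")                 # recent Polars API
      .agg(pl.len().alias("events"))
      .sort("events", descending=True)
      .limit(10)
      .collect()
)
```

Both snippets run comfortably on a laptop or a small cloud instance, which is exactly the kind of workload-shrinking the session discusses.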
Boosting Cost Efficiency with Cost-Aware Architecture
In this session, we explore the critical role of cost-aware architecture in achieving optimal cost efficiency for modern software systems. As organizations increasingly rely on cloud services, microservices, and distributed computing, understanding and managing costs becomes paramount.
We will discuss the shift from traditional architecture to cost-aware design and explore how architectural decisions impact operational expenses. We will showcase a real-life example and the lessons learned from evaluating different cloud providers and their services based on cost models, and show how rearchitecting the solution with cost in mind made the difference.
Attendees will gain insights into designing cost-efficient systems that balance performance, scalability, and financial constraints. By adopting a cost-aware mindset, organizations can optimize their architecture and drive sustainable growth.
Personalizing User Content - The Power of a Vector & Search DB
As a user explores the open web, our mission is to recommend the most suitable content for them to read next out of millions of potential items. Our goal is to optimize for clicks and conversions, and we have to do so (very) very fast.
In this talk, we will cover how we at Taboola tackled this problem by leveraging the power of Vespa, a high-scale vector DB. Unlike other DBs, Vespa is both a search AND a vector DB. We’ll discuss Vespa’s architecture, topology, and ranking features; how we implemented near-real-time data updates (using Debezium, Kafka, and in-house mirroring); the filters we use to reduce the document space; the optimizations we implemented along the way and the tools and techniques we used to measure their impact; and how we embedded deep learning model estimations into simple 32-bit vectors to generate CTR and CVR estimations for ranking.
Ooops... I Deleted the Whole Production Table. What Did I Learn From It?
I got some new gray hair on the spot.
After that, I took a deep breath and called my tech lead to discuss the issue. We ended up recovering in less than half a day, and we coined our motto: “Write your code as if you’re going to delete production”.
Let me take you through this nightmare that taught me 3 important lessons, 3 lessons that separate tech leads from the rest.
Out Of Distribution (OOD) for Classification Problems
Classification model inference is a relatively straightforward task, unless the real-world input data contains examples from classes the model has not seen before. In that case, the feature vector of an ‘unseen’ input might coincide with one of the typical vectors of a class the model was trained on, and the unseen class can receive a prediction with a high certainty level. We shall discuss this problem and its possible and popular solutions.
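One of the popular solutions the session alludes to is maximum softmax probability (MSP) thresholding: flag an input as ‘unseen’ when the model’s top class probability is low. The sketch below is illustrative only, and the 0.7 threshold is an arbitrary choice.

```python
# Illustrative sketch of one popular OOD baseline: maximum softmax probability (MSP)
# thresholding. Inputs whose top softmax score is below a threshold are flagged as
# "unseen" (-1) instead of being assigned a confident in-distribution class.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_with_ood(logits: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Return the argmax class, or -1 when the model is not confident enough."""
    probs = softmax(logits)
    return np.where(probs.max(axis=1) >= threshold, probs.argmax(axis=1), -1)

# The first row is a confident in-distribution input, the second is near-uniform.
logits = np.array([[4.0, 0.5, 0.1],
                   [1.1, 1.0, 0.9]])
print(predict_with_ood(logits))   # [ 0 -1]
```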
From Scaling APIs to Scaling Data to Scaling Your Team
Forget the titles, no one is born a Data Engineer. With the right guidance, attitude and passion, the transition to data engineering is not only possible, but also brings many advantages. In this talk, I will take you through my own journey of leaving the world of APIs and services as a backend engineer to enter the world of variety and scale as a data engineer. I will cover why I was passionate about this transition, the challenges I faced, the change in perception, and the benefits of being a data engineer with a background in backend engineering. By the end of this talk, you will be convinced that one of the best ways to scale your team is by incorporating new talent from various backgrounds, giving your team a diverse set of knowledge and skills.
Congratulations! Meet Your New Little Brother
As you all know, data professions have become some of the most in-demand professions in the world.
We at the Ministry of Education felt an obligation to prepare students for this important field, and therefore we established the National Data Analysis Major for High Schools in Israel.
In this talk, we’ll tell you about the dilemmas we faced over what should be taught in the data analysis program and which skills would best prepare students to be data-analysis-oriented.
We will tell you about the global interest in our Major and the research being done worldwide on our groundbreaking program.
Above all, we will tell you about you, the data people, and your enormous contribution to upgrading and improving the Major.
It’s Just an Alert, Don’t Wake Up!
You have a very important job: you need to keep production up & running 24/7! You set up alerts so you will know when something is wrong, but now there are alerts! Some of them are very important, so important that you need to drop everything you are doing and handle them. Others are less important, and you can look at them in the morning or even ignore them.
Maybe we have too many alerts?
Or maybe we are missing alerts?
What is the purpose of an alert?
Have you ever thought about the KPIs of a good alert? How do we measure whether an alert is good or bad?
In this talk we are going to define what a good alert is and how to create KPIs for alerts, so you can build alerts that are more accurate and help keep production up & running.
Building a Data-Driven Culture: An Anthology of Strategies and Insights
In an era defined by data abundance, the ability to harness its potential has become a competitive imperative. From startups to multinational corporations, the pursuit of a data-driven culture has never been more urgent. But amidst the buzzwords and promises, what truly distinguishes success and failure?
This talk offers an anthology of real-world stories that illuminate the journey toward building and sustaining a data-driven culture. Through narratives drawn from diverse industries, we uncover the challenges, failures, triumphs, and transformative strategies of organizations committed to unleashing the power of their data.
Attendees will gain invaluable insights into practical approaches that have proven effective in fostering a data-driven culture, as well as lessons learned from initiatives that fell short of expectations, and will leave equipped with real-world knowledge that can be applied immediately within their organizations.
Live in the Data Wild West: The Data Contracts Sheriff
In today’s data-driven world, managing vast amounts of data from various online and offline sources to various destinations is both a challenge and an opportunity. At Riskified, we have established a robust data contracts platform to ensure trust and quality across our data ecosystem.
In this session, we will explore how we have built comprehensive data contracts that include strict testing, detailed documentation, and an extensive data catalog based on open-source products like dbt, DataHub, Great Expectations, and Elementary.
We will share our methodologies for maintaining data integrity, the tools we use for quality assurance, and the best practices for documenting and cataloging data pipelines and tables.
Join us to discover how our approach to data contracts can enhance data reliability and support informed decision-making within your organization.
Dropping Doesn’t Mean Losing
As engineers, we are used to keeping the data we produce, usually within the guidelines of our business needs (for example, setting a retention policy once the data is too old to be used) or limitations (we cannot afford more than X TB of data).
However, sometimes we might create “too much” data that we simply cannot support, and an instant drop of the data is in order.
At Outbrain, as part of our data democratization strategy, developers are free to send (almost) any kind of log/message they want or need from their microservice/application. Therefore, we had to introduce mechanisms to drop data based on various dimensions of the sent message (and the sending rate/volume).
In this session I will present the idea behind the system and how it works.
Exploring the Depths of Apache Iceberg's Metadata Capabilities
Apache Iceberg provides a powerful framework for managing large-scale data with advanced features that are crucial for today’s data-intensive applications. This talk will focus on the various metadata tables that Iceberg offers and how they can be leveraged to enhance data management practices across diverse use cases, such as compaction, incremental processing, and even monitoring.
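For a flavor of what querying these metadata tables looks like, here is a hedged PySpark sketch; the catalog and table names (demo.db.events) are hypothetical, and a Spark session already configured with an Iceberg catalog is assumed.

```python
# Hedged PySpark sketch: inspecting Iceberg metadata tables. The catalog/table names
# (demo.db.events) are hypothetical; a Spark session with an Iceberg catalog is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshot history: useful for auditing and for choosing incremental-processing bounds.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots"
).show()

# Data-file layout: file counts and sizes per partition help decide when to compact.
spark.sql("""
    SELECT partition, COUNT(*) AS data_files, SUM(file_size_in_bytes) AS total_bytes
    FROM demo.db.events.files
    GROUP BY partition
    ORDER BY data_files DESC
""").show()
```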
Unifying Real-Time and Data Lake: Yotpo's Transition from Chaos to Coherence
The real-time and data lake worlds are becoming a single beast; they are converging. Kafka has become the de facto streaming platform, integrating with modern data lakes. Flink has become hugely popular and has been adopted by big vendors such as Amazon, Microsoft, and Alibaba. Confluent not only adopted Flink but also introduced its TableFlow solution, which allows you to seamlessly create data lake tables from Kafka topics.
This convergence signifies the necessity for real-time and data lake systems to share a common language and standards.
In this talk, I will share Yotpo’s journey in data generation and ingestion into the lake. Beginning with the transition from operational databases to daily snapshotting for lake availability, we progressed to implementing DB-to-lake streaming using CDC and Debezium. Currently, we are advancing towards well-defined async APIs and a unified architecture that ensures better performance and cost efficiency while treating data as a first-class citizen from the very beginning of a service’s design. I will delve into the key components enabling this architecture, including Kafka, the Outbox pattern, Flink, CloudEvents, and AsyncAPI, explaining why and how we decided to put everything together in order to execute this exciting transition.
If you think the real-time layer should cooperate better with data lake, this talk is for you.
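To make the Outbox pattern mentioned above concrete, here is a minimal, illustrative sketch (not Yotpo’s actual code); the table and column names and the connection string are hypothetical, PostgreSQL with psycopg2 is assumed, and a CDC tool such as Debezium would stream the outbox table to Kafka.

```python
# Illustrative outbox sketch (not Yotpo's code): the business row and the outbox event
# are written in one database transaction; a CDC tool such as Debezium then streams the
# outbox table to Kafka. Table/column names and the connection string are hypothetical.
import json
import uuid
import psycopg2

conn = psycopg2.connect("dbname=app user=app")

def create_order(customer_id: str, amount: float) -> None:
    with conn:                      # one transaction: both inserts commit, or neither does
        with conn.cursor() as cur:
            order_id = str(uuid.uuid4())
            cur.execute(
                "INSERT INTO orders (id, customer_id, amount) VALUES (%s, %s, %s)",
                (order_id, customer_id, amount),
            )
            cur.execute(
                "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s, %s, %s)",
                (str(uuid.uuid4()), "order", order_id, "OrderCreated",
                 json.dumps({"order_id": order_id, "amount": amount})),
            )
```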
Big Data Processing with GPUs
In this talk, we present our successful optimization of Spark-based big data pipelines at PayPal, achieving a 70% cost reduction through strategic GPU utilization. We discuss the key challenges we encountered in our big data and machine learning domains, and how we used Spark RAPIDS to solve them. In addition, we’ll share a glimpse of the new GPU-accelerated data processing domain.
How to Decipher User Uncertainty with GenAI and Vector Search
User expectations are sky-high, yet users have increasing difficulty articulating their complex needs in a simple search bar on a website. This talk dives into leveraging generative AI and vector search to transform vague user queries into the results the user actually wanted, even if they did not initially know exactly what they wanted. We will explore why traditional search methods fall short in grasping user intent and address the common problems users face with ambiguous search queries.
Learn how GenAI generates embeddings that capture the context and semantics of queries. This talk is not only about theory, though: you will see practical examples of how to generate embeddings and how to set up vector indexes.
See how these advanced search capabilities can transform user interaction and, as a result, business outcomes, turning user uncertainty into certainty with GenAI and vector search.
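As a preview of those two practical steps, here is a hedged sketch of generating embeddings and building a vector index; the model name and libraries (sentence-transformers, FAISS) are assumptions for illustration, not necessarily the speaker’s stack.

```python
# Hedged sketch of the two steps above: generate embeddings, then build a vector index.
# The model name and libraries (sentence-transformers, FAISS) are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")       # hypothetical embedding model choice
docs = [
    "Noise-cancelling headphones for long flights",
    "Lightweight trail running shoes",
    "Espresso machine with milk frother",
]

# Embed and L2-normalize so that inner product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# A vague query still lands near the semantically closest document.
query = model.encode(["something to help me focus on a noisy plane"],
                     normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]], scores[0])
```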
Closing remarks
The Hitchhiker's Guide to Advanced A/B Testing Techniques
Reliable data influences your business operations at nearly every level, but what if your testing is misleading you? Join us for a deep dive into the world of A/B testing where we unravel the complexities of advanced techniques that drive better decision-making in business environments.
This 30-minute session will focus on state-of-the-art methodologies such as parallel testing, CUPED (Controlled Experiment Using Pre-Experiment Data), sequential testing, and sample ratio mismatch. Designed for data scientists and analysts, this talk will provide actionable insights and frameworks to scale your tests, improve accuracy, and better understand the impact of your treatments.
Whether you’re looking to refine your current A/B testing practices or explore advanced approaches, this session will equip you with the knowledge to implement these techniques effectively in your A/B testing.
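For readers unfamiliar with CUPED, the technique boils down to one adjustment: subtract theta times the centered pre-experiment covariate from the experiment metric, where theta = cov(X, Y) / var(X). The sketch below is a toy illustration on synthetic data, not material from the talk.

```python
# Illustrative CUPED sketch on synthetic data (not from the talk): adjust the experiment
# metric Y with a pre-experiment covariate X to reduce variance without biasing the
# estimated treatment effect. theta = cov(X, Y) / var(X).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(100, 20, n)                        # pre-experiment metric, e.g. past spend
treated = rng.integers(0, 2, n)                   # random 50/50 assignment
y = x + rng.normal(0, 10, n) + 0.5 * treated      # experiment metric, true effect = 0.5

theta = np.cov(x, y)[0, 1] / x.var()
y_cuped = y - theta * (x - x.mean())              # CUPED-adjusted metric

def effect(metric: np.ndarray) -> float:
    return metric[treated == 1].mean() - metric[treated == 0].mean()

print(f"raw effect   {effect(y):.3f}  (metric variance {y.var():.0f})")
print(f"CUPED effect {effect(y_cuped):.3f}  (metric variance {y_cuped.var():.0f})")
```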
Democratizing Data Pipelines: Introducing the AI-Powered ETL Assistant
This session introduces an innovative ETL assistant designed to democratize data processing and significantly lower the barrier to handling organizational data requirements. The tool we developed leverages AI to simplify the generation and management of entire data pipelines, making them accessible even to those with minimal data engineering skills. It integrates cutting-edge technologies like OpenMetadata, Streamlit, and the OpenAI assistant, enabling users to manage data pipelines efficiently and with ease.
The ETL Assistant automates DDL and pipeline generation, while ensuring strict adherence to schema requirements. This automation not only significantly reduces the setup time, but also minimizes the need for specialized data engineering knowledge, thus accelerating project timelines and reducing overhead costs. Join us to discover how AI can transform the way you manage data pipelines, making these processes more intuitive and reducing bottlenecks for the organization.
Metric Store
Imagine this: You confidently walk into a crucial meeting armed with diligently prepared numbers from your trusted analyst. Midway through your presentation, a disruptive interruption arises. A member of the audience boldly claims to have a completely different set of numbers, challenging the accuracy and validity of your analysis…
In today’s world of distributed and autonomous teams, the abundance of diverse data sources has created a pressing need for a unified and reliable metrics solution. This is where the concept of a metrics store comes into play. During the presentation, I will delve deep into the strategies and provide invaluable insights into their advantages, drawbacks, complexity, and crucial considerations for implementation. Moreover, I will explore the optimal timing for adoption, effective techniques for maximizing adoption and long-term retention, and the essential maintenance practices to keep the metrics store running smoothly.
Data-Driven Advocacy: Using Visualization Skills for Hasbara
Crisis communication during a complex conflict, like the one that erupted on October 7th, requires clear and concise messaging, especially when reaching an international audience.
Nir and Omer, data visualization experts, leveraged Tableau to create impactful data stories that cut through the noise and resonated with a global audience during this conflict. Their biggest challenge? Choosing the right narrative to effectively convey the story while maintaining data integrity.
In this talk, they’ll share their experiences in distilling complex messages into compelling visuals. By presenting real cases of visualizations developed to support Israeli advocacy during the conflict, they will provide a high-level overview of data visualization best practices and show how to incorporate impactful ‘data storytelling’ in the context of Hasbara and beyond.
Beyond Empowerment: Data Accuracy Challenges in Self-Service BI
Self-service BI unlocks data for everyone, but ensuring accuracy can be tricky. This session explores how a simple “BI Validated” logo tackles this challenge, addressing a common pitfall of self-service BI.
Battle of the Titans: Python vs. Tableau in Live EDA
In the rapidly evolving field of data analytics, the choice between mastering Python or Tableau remains a prevalent dilemma among professionals. This session aims to settle the debate through a dynamic, live demonstration of exploratory data analysis (EDA) on the same dataset! On one side of the ring, Efrat will use Tableau to swiftly create and manipulate visual analytics, while on the other, Shuki will live-code with Python’s scientific packages, both wrestling for the best insights. Join us for a captivating session where you will judge and vote for the best EDA weapon.
Here, There and Everywhere: Spotting Data Opportunities 101
This talk is not about data.
It’s about a data state of mind – about the thrill of discovering hidden opportunities and the ingenuity to turn them into impactful solutions. Every day we face unique challenges and inefficiencies, both in our professional and personal lives. Most, if not all, of these problems can be solved or prevented by using a data-centric mindset.
In this talk, I will walk you through real-life stories that demonstrate how I was able to identify, develop, and implement innovative solutions in a low-tech, non-data-driven environment, work that was eventually credited as relevant data science and analytics experience at the beginning of my career.
You will leave this talk with practical takeaways: actionable strategies to identify data opportunities, select the relevant tools, enhance efficiency, and drive impactful change in your organization.
Revolutionizing Business Intelligence: Unlocking the Power of AI for Seamless Self-Service BI Transfer
Over the last year, AI has emerged as a transformative force, reshaping industries and redefining our approach to data. In this session, I will show how we can leverage AI across the Business Intelligence (BI) landscape – from revolutionizing development processes and documentation to facilitating knowledge sharing. A focal point of our discussion will be the advent of user-friendly, self-service BI tools. These innovations empower users to engage with data analytics directly, simplifying complex processes and democratizing data insights.
Join us as I navigate through the exciting intersections of AI and BI, unveiling the future of data-driven decision-making, and walk away with fresh insights and actionable tactics to implement AI, setting a new pace for progress and innovation in your domain.
Threat Hunting Powered by Efficient and Straightforward Anomaly Detection on Your Data Lake
Dating with a Super Model: Why Good Prompt Engineering for Data Monitoring Requires Some Flirting
Mastering prompts for the automatic monitoring of trends and data events is a fine art. This talk reveals the top 3 lessons I learned while building automatic data monitoring for my product, tailored to multiple different business models and dozens of use cases. And, like everything in life, why it is done so much better with love.
Evaluating the Unseen: Supervised Evaluation for Unsupervised Algorithm
In today’s complex and unstructured environment, many machine-learning problems involve unlabeled data. Evaluating online models requires innovative approaches that minimize manual labeling, which is resource-intensive and usually not scalable. In this talk, we will learn how to leverage a supervised classification approach to generate labels for evaluation. Using these labels enables an in-depth evaluation of online models to detect areas of underperformance or data shifts in scale. We will walk through the use of this approach in the identity resolution clustering task, aiming to inspire how it can be used in other complex data science domains. Get ready for a practical guide to making data evaluation simpler and more effective.
Tailor-Made LLM Evaluations: How to Create Custom Evaluations for Your LLM
In the ever-changing world of Generative AI, new LLMs are being released on a daily basis, and while there are standardized scoring approaches for evaluating them, they don’t always evaluate based on what is important to us. In this talk, we will go over the two main approaches to evaluate LLMs – Benchmarking and LLM-as-a-judge. We will discuss which one to choose and how to create custom evaluations that suit our own use cases. Lastly, we will go over a set of best practices on how to create the best possible evaluation that produces an objective and deterministic score.
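As a rough illustration of the LLM-as-a-judge approach, here is a hedged sketch; `call_llm` is a placeholder for whichever model client you use, and the rubric and 1–5 scale are illustrative. Running the judge at temperature 0 and averaging several judgments are common ways to push toward the objective, deterministic scores the talk aims for.

```python
# Hedged sketch of LLM-as-a-judge. `call_llm` is a placeholder for your model client;
# the rubric and the 1-5 scale are illustrative, not a standard.
JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 1 (wrong) to 5 (fully correct and faithful to the reference).
Reply with the number only."""

def call_llm(prompt: str) -> str:
    """Placeholder: plug in whichever LLM client you use (ideally at temperature 0)."""
    raise NotImplementedError

def judge(question: str, reference: str, candidate: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip())

def evaluate(dataset: list[dict]) -> float:
    """Average judge score over a custom eval set of {question, reference, candidate} rows."""
    scores = [judge(r["question"], r["reference"], r["candidate"]) for r in dataset]
    return sum(scores) / len(scores)
```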
Correlation Sucks - Long Live Causality! Mastering Causal Inference in Data Science
In today’s data-driven world, distinguishing correlation from causation is a competitive advantage. Businesses using causal inference can attribute outcomes to actions, leading to more effective strategies and clearer insights.
This session delves into the realm of causal inference, exploring foundational concepts like potential outcomes, causal graphs, and the challenges of confounders, colliders, and mediators. Attendees will learn robust methodologies for establishing causality, from A/B testing to advanced techniques like IPTW (Inverse Probability of Treatment Weighting). We’ll showcase real-world applications with case studies in product development and marketing strategies, demonstrating how causal inference informs decision-making in tech industries.
By the end of this session, participants will understand how to critically appraise and construct causal hypotheses, turning complex data into actionable insights for strategic advantage.
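To make IPTW concrete, here is a minimal sketch on synthetic data: fit propensity scores with logistic regression, weight each unit by the inverse probability of the treatment it actually received, and compare the result against the confounded naive estimate. The library choice (scikit-learn) and the data-generating process are illustrative assumptions, not material from the talk.

```python
# Minimal IPTW sketch on synthetic data (scikit-learn and the data-generating process
# are illustrative assumptions): estimate propensity scores with logistic regression and
# weight each unit by the inverse probability of the treatment it actually received.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
confounder = rng.normal(size=n)                              # drives treatment AND outcome
treated = rng.binomial(1, 1 / (1 + np.exp(-confounder)))     # confounded assignment
outcome = 2.0 * treated + 3.0 * confounder + rng.normal(size=n)   # true effect = 2.0

# Naive difference in means is biased by the confounder.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Propensity scores and inverse-probability weights.
X = confounder.reshape(-1, 1)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

iptw = (np.average(outcome[treated == 1], weights=w[treated == 1])
        - np.average(outcome[treated == 0], weights=w[treated == 0]))
print(f"naive: {naive:.2f}   IPTW: {iptw:.2f}   (true effect: 2.0)")
```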
Learning the Ropes of Synthetic Data
This talk will cover the basics and applications of synthetic data across various fields, emphasizing its importance in maintaining privacy while retaining data utility. It will start with a definition of synthetic data, its creation process, and how it differs from traditional de-identification techniques. The talk will then explore applications for tabular data use cases using an open data set to illustrate privacy and utility considerations.
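As a toy illustration of the basic idea (not a production approach), the sketch below fits simple per-column distributions on a hypothetical “real” table and samples brand-new rows; real synthesizers also model correlations between columns, which this deliberately ignores.

```python
# Toy sketch of the basic idea behind synthetic tabular data (illustrative only):
# fit simple per-column distributions on a "real" table and sample brand-new rows.
# Production synthesizers also model correlations between columns, ignored here.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
real = pd.DataFrame({
    "age": rng.integers(18, 90, 1_000),
    "city": rng.choice(["Haifa", "Tel Aviv", "Beer Sheva"], 1_000),
})

city_freq = real["city"].value_counts(normalize=True)
synthetic = pd.DataFrame({
    # numeric column: sample from a normal fitted to the real mean/std, clipped to range
    "age": rng.normal(real["age"].mean(), real["age"].std(), 1_000).round().clip(18, 90),
    # categorical column: sample from the observed category frequencies
    "city": rng.choice(city_freq.index.to_numpy(), size=1_000, p=city_freq.to_numpy()),
})
print(synthetic.head())   # plausible rows, but none copied from the real table
```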
LLMs and Knowledge Graphs: A Case Study in Blue(y)
In this talk, we will explore how LLMs and knowledge graphs can be integrated to enhance accuracy and relevance over naïve RAG. Using the TV show “Bluey” as a proof of concept, we’ll illustrate the workflows for generating a knowledge graph from unstructured data and designing an ontology and extraction pipeline. We will also discuss strategies to leverage the knowledge graph to improve the contextual understanding, accuracy and traceability of LLM responses, highlighting applications in various domains.
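A toy sketch of the core idea, using a few hand-written “Bluey” triples (illustrative, not the speakers’ pipeline): store extracted (subject, relation, object) facts in a graph and pass an entity’s neighborhood to the LLM as grounding context, rather than the raw text chunks used in naive RAG.

```python
# Toy sketch (illustrative, not the speakers' pipeline): store extracted
# (subject, relation, object) triples in a graph and pass an entity's neighborhood
# to the LLM as grounding context, instead of the raw text chunks used in naive RAG.
import networkx as nx

G = nx.DiGraph()
triples = [
    ("Bluey", "is_a", "Blue Heeler"),
    ("Bluey", "sister_of", "Bingo"),
    ("Bandit", "father_of", "Bluey"),
    ("Chilli", "mother_of", "Bluey"),
]
for subj, rel, obj in triples:
    G.add_edge(subj, obj, relation=rel)

def graph_context(entity: str) -> str:
    """Serialize the entity's incoming and outgoing edges as plain-text facts."""
    edges = list(G.out_edges(entity, data=True)) + list(G.in_edges(entity, data=True))
    return "\n".join(f"{u} {d['relation']} {v}" for u, v, d in edges)

question = "Who are Bluey's parents?"
prompt = f"Answer using only these facts:\n{graph_context('Bluey')}\n\nQuestion: {question}"
print(prompt)   # traceable: every fact in the prompt maps back to a graph edge
```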
Beyond Opt-Outs: Quantifying User Notification Harm
This talk presents a holistic framework for quantifying user notification harm, developed through a cross-functional collaboration at Google spanning data science, user experience research, and stakeholders from Google Photos, Google Maps, Google Search, and YouTube.
Previously, user notification harm was measured via a balance between clicks and opt-outs, which are both measurable quantities but do not clearly express the end user’s experience or motivations. As opt-outs are infrequent, relying on them as a measure of notification harm is ineffective in A/B testing. Furthermore, opt-outs can underestimate harm in users who tolerate or ignore unwanted notifications without opting out. To create a positive digital environment, we need a scalable way to measure and understand the potential harm of poorly designed or excessive notifications.
We combined quantitative and qualitative approaches to uncover the root causes of notification dissatisfaction. This involved feature engineering, ML modeling, and analyzing user feedback – particularly free-text survey responses processed with LLM technology – to better understand our end-user experience and motivations. The research revealed key indicators of user harm from notifications, giving Google product teams tools to refine their strategies and create a more positive and respectful user experience.
Navigating the Uncharted: Ensuring Prompt Quality in the Age of Language Models
In today’s era dominated by large language models, the quality of prompts is crucial for effective utilization of these powerful tools. This talk, led by Ortal, a Senior Researcher specializing in Natural Language Processing (NLP) at Gong’s AI division, will delve into the challenges of understanding prompt impact, navigating upgrades to foundation models, and selecting the most suitable models for specific tasks. I will share strategies for evaluating prompt performance, discuss methods to identify and address pitfalls, and explore decision-making processes in this ever-evolving landscape. Join me for insights into prompt quality assurance in the age of language models.
Redefining Agile for Data Science: A Team First Approach
“Agile methodologies weren’t built for data science teams and projects…” This sentiment resonates with many data scientists. The belief is that the exploratory and uncertain nature of data science clashes with Agile’s short-cycle, flexible approach, creating inefficiencies and ambiguities.
In this talk, I will present the challenges of applying Agile’s iterative cycles to data science projects without compromising their inherently exploratory nature. By encouraging collaboration and clear communication within teams, Agile strengthens adaptability, empowers decision-making, and builds accountability, leading to more cohesive and resilient teams.
This practical framework enhances both individual and team efficiency, fostering higher transparency, collaboration, and productivity. This talk is essential for team leads and individual contributors seeking to streamline their workflow and deliver more consistent, reliable results.
I Want to Build a RAG System, Now What?
Many organizations are looking into building a RAG system either for internal use or as features in their products. In this session, we will discuss the engineering and data science involved in building RAG systems in the real world. We will also share key lessons learned from the past two years, working with over 5,000 customers to develop various AI products.
Transforming Medical Records Into a Heart Attack Prediction Model
Every 34 seconds, a US citizen dies from cardiovascular disease (CVD), often preventable with early detection. Despite 47% of adults having high blood pressure, over 80% are not managing it. Hello Heart aims to bridge this gap by collecting daily blood pressure readings, medication intake, and symptoms through a mobile app. We use this data and clinical records to train ML models that alert users at risk of heart attacks or strokes.
In this talk, I’ll discuss transforming raw Electronic Health Records (EHR) into aggregated features for training and inference. EHRs, while valuable for research and model development, pose significant challenges. Key considerations for the data representation include the ideal modeling approach, hardware resources, data sparsity, and label validation.
Whether you’re a data scientist, engineer, or product manager, these decisions are crucial for extracting valuable insights from raw medical data and ensuring a proof-of-concept analysis reaches production. Join me to explore our journey from raw data to prediction and prevention!
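For a sense of what “aggregated features” can mean in practice, here is an illustrative pandas sketch over a hypothetical event-level EHR schema; the codes, dates, and column names are made up and do not describe Hello Heart’s actual pipeline.

```python
# Illustrative pandas sketch over a hypothetical event-level EHR schema (codes, dates,
# and column names are made up, not Hello Heart's pipeline): aggregate raw events into
# one feature row per patient for training or inference.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-03", "2024-02-10", "2024-03-01", "2024-01-20", "2024-02-05"]),
    "code":  ["I10", "SBP", "SBP", "SBP", "E11"],   # diagnosis / measurement codes
    "value": [None, 151.0, 148.0, 128.0, None],     # measurement value where applicable
})

features = events.groupby("patient_id").agg(
    n_events=("code", "size"),
    n_hypertension_dx=("code", lambda c: (c == "I10").sum()),
    mean_sbp=("value", "mean"),
    last_event=("event_date", "max"),
)
print(features)
```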
Discover the Modern Way to Build Real-Time Analytics Solutions at Any Scale
Delve into the realm of big data analytics, with a specific focus on the challenges inherent in large-scale Real-time Analytics. Explore how contemporary technologies like SingleStore offer potential solutions to these complexities.
Walk away with actionable strategies to overcome common big data hurdles, drive business value and discover how innovative architecture and advanced features can enhance your analytics capabilities.
- Strategies for achieving Real-time data freshness in analytics.
- How analytical queries can deliver sub-second response times, even when dealing with many billions of records with highly concurrent queries.
- Simplifying architecture by leveraging the flexibility of schema, semi-structured data (JSON and BSON), and the capabilities of a relational database management system (RDBMS) in the same platform.
- Addressing data discrepancies at their root cause.
- Identifying the optimal scenarios for utilizing SingleStore, Elasticsearch, MongoDB, and BigQuery.
- Effectively combining vector similarity search with analytics, RAG, and hybrid search for more accurate GenAI results.
- Real-life production use cases will also be discussed.
Accelerate Data Pipeline Development Time: Learn How to Unleash Rivery’s Gen AI for a 20X Leap
Even as data teams remain lean in 2024, data engineers are still expected to swiftly deliver data for various use cases. Adding new data sources and updating existing ones consumes nearly half of a data engineer’s time, hindering your organization’s data and AI-led goals. Rivery’s modern data platform solves this issue across all your data sources with an innovative blueprint engine and generative AI. Join this session to learn how to overcome unscalable data pipeline challenges and unlock the benefits of all of your data.
Building Data and AI Applications with the Snowflake AI Data Cloud
In this session, you will learn how Snowflake can help you build your next Data and AI Application:
Strengthen Your Data Foundation: Discover how to create a robust and scalable data architecture.
Accelerate Your Enterprise AI: Explore the tools and techniques to fast-track your AI initiatives.
Build Top-Notch Data and AI Applications: Gain insights into developing high-performance applications using Snowflake’s powerful platform.
Throughout the session, we will dive into specific tools and techniques, providing you with practical knowledge that you can apply directly to your projects.
Who Should Attend: This session is designed for Data Engineers, BI Developers, and Data Scientists who are looking to elevate their skills and leverage Snowflake’s capabilities to the fullest.
A Brave New World: How New Lineage Technology Can Reduce Data Spend 40%
With Data increasingly viewed as a strategic asset, the ability to track data lineage is key for effective data management. The problem is that traditional lineage solutions are falling short in today’s complex data ecosystems.
In this presentation we will detail a multi-layered data lineage approach that delivers full visibility and insights for data management and optimization.
The Key Takeaways for Attendees Include:
- How data usage analytics will help you figure out which data products are getting used and how much, and which ones are just sitting there collecting dust.
- How to use data cost visibility to get a clear picture of which data products are costing you big $$$, so you can optimize budgets and cut wasteful spending.
- Why data product ownership attribution is essential to ensure accountability and enhance collaboration.
- How full data lineage transparency shines a bright light on your data’s entire journey to properly govern and optimize every step.
This presentation is designed for data managers and any professionals involved in data strategy and implementation looking to enhance their data management practices.
Deep Entity Matching for Improved User Experience
Connecting users in a meaningful and effective way is a critical challenge for many applications. In e-commerce marketplaces, sellers often list items with incomplete and noisy information, making it more difficult to be found by potential buyers. In this talk we’ll explore how we can overcome this obstacle to deliver a better experience for our users. We developed an entity matching framework that integrates information retrieval and deep learning to accurately match listed items to their corresponding catalog products.
By the end of this session you will be able to:
- Understand what entity matching is and why it matters
- Identify potential challenges in other domains that can be tackled by entity matching
- Apply this approach in your own domain to enhance user experience
This talk is meant for everyone: researchers, engineers, and managers – you are all invited!
The 10 Commandments for Crafting Custom LLMs with Domain Mastery
Are your data and domain expertise your greatest assets?
This talk explores strategic approaches to transform these assets into high-performing custom language models.
Discover key strategies for success and get the most out of your LLMs to drive significant business impact.
How Similarweb Serves 100s of TBs to Their Worldwide Users in Milliseconds
If you want to scare a Data Engineer with four words, ‘big data, high concurrency’ will probably do it. As data moved from the realm of BI reporting to being a customer-facing commodity, serving huge volumes of data to thousands of unforgiving app users is no small challenge. In this session, Yoav Shmaria, VP R&D SaaS Platform at Similarweb will share how, using Firebolt, they serve data about millions of websites to their worldwide customers with consistent millisecond response times. He’ll demo how their newest market tool takes keyword analysis to the next level – running complex queries on TB-scale datasets instantly.
How LSports Shoots and Scores with DoubleCloud
The technical migration to DoubleCloud and the integration of ClickHouse significantly enhanced LSports’ data handling capabilities. Query times improved by 180x, allowing for efficient real-time data analytics. This improvement not only streamlined the handling of complex queries and large data volumes but also reinforced LSports’ competitive position within the sports data sector.
LSports aims to leverage DoubleCloud’s advanced data processing capabilities and real-time analytics services to foster the development of innovative sports data products. These enhancements are crucial for managing live event data and intricate analytics tasks, while also ensuring the platform’s scalability to accommodate increasing data volumes and a growing customer base, thereby maintaining system responsiveness and reliability.
In this session, we will talk about how DoubleCloud stands out in the market as a highly cost-effective solution, providing enhanced performance at a lower price point.