• Databases (11)

    Alexey Goncharuk

    Querify Labs

    JIT In the Wild: What Databases Do To Speed Up Your Queries

    4 July, 10:00, «Hall 1»

    Just-in-time (JIT) compilation is well known, at least among Java developers and users, but is JIT limited to virtual machines?

    In this talk, we will explore why and how modern databases and data management systems use JIT to make the most of available computational resources and improve query performance. We will see how JIT and ahead-of-time compilation work together, then survey some of the available JIT toolkits for Java and C++ and how they can be used in non-database projects.

    The talk was accepted to the conference program

    Oleg Bondar

    Yandex Infrastructure

    YDB is an open-source Distributed SQL Database

    4 July, 10:00, «Hall 3»

    YDB is an open-source, distributed, horizontally scalable, fault-tolerant database built from scratch. I will talk about YDB's design and internals, the decisions made along the way, and its use cases.

    YDB's key characteristics include:
    * strict consistency with a possibility to relax guarantees for better performance;
    * horizontal scalability;
    * geo-replication for availability;
    * support for high-performance multi-row and multi-table ACID OLTP transactions;
    * support for row-based tables;
    * support for column-based tables;
    * support for embedded streams;
    * support for YQL, a declarative query language (a dialect of SQL);
    * distributed query execution on heterogeneous databases and storages including PostgreSQL, S3, etc.;
    * YDB is used as a mission-critical database for many Internet-scale services;
    * YDB has been designed as a platform for various data storage and processing systems and is aimed at solving a wide range of problems.

    YDB is currently used as a core component of:
    * a system for continuous stream processing;
    * persistent queues;
    * a high-performance network block store.

    This will be the first time a talk about YDB is presented at such a big international conference.

    The talk was accepted to the conference program

    Mons Anderson

    Tarantool & VK Cloud

    Reliability of In-Memory Databases on the example of Tarantool

    4 July, 15:50, «Hall 3»

    An in-memory database is not a new concept, yet such solutions are still strongly associated with the terms "cache", "non-persistent" and "unreliable".

    In-memory databases have much wider applications than caching, and their reliability is no worse than that of the most proven relational databases.

    I will tell you about approaches that allow an in-memory database to be as reliable as the sunrise.

    I will walk through the database engine, from the incoming request to synchronous replication and the transaction mechanism, all running at 1 million RPS.

    The purpose of my talk is to show that in-memory technologies are already mature and reliable enough to be the primary data storage for your product.

    The talk was accepted to the conference program

    Igor Latkin

    KTS

    Building the Perfect Queue Broker: Our Journey with Tarantool

    4 July, 17:00, «Hall 3»

    Microservice communication is a crucial part of most applications, but it is hard to find a single solution that satisfies every need that arises. We’ll discuss how we at KTS created a queue builder service on top of a distributed Tarantool Database cluster, enabling everything we want from a queue broker, from channels and tenants to deferred tasks and strict user FIFO, while keeping it high-performance and fault-tolerant. I’ll outline why none of the existing solutions, such as Apache Kafka, RabbitMQ and NATS, satisfied our needs, and share the implementation details, choices and compromises we made. I’ll also try to answer the question of why Tarantool proves (again?) to be the most flexible platform for data manipulation in the OLTP world.

    The talk was accepted to the conference program

    Evgenii Ivanov

    Yandex Infrastructure

    Scale it easy: YDB's high performance in a nutshell

    4 July, 13:30, «Hall 3»

    Implementing a distributed database with strong consistency isn’t difficult; the challenge lies in ensuring speed and scalability. YDB excels in these aspects. In this talk, we’ll discuss YDB’s architecture and its high performance, present results of benchmarks, and compare YDB to top competitors.

    Why are performance and scalability so important nowadays? We live in the era of big data, and performance is what makes big data practically useful. Imagine a petabyte database that can handle only 1K transactions per second, or the same database requiring 1M servers to operate. In both cases, the database would not be very useful. In today’s world, big data is everywhere; databases must be fast enough to handle this data and affordable for many companies or even individuals. Performance provides both speed and low cost. A truly high-performance database allows you to process more data using less hardware.

    YDB is such a database: it is fast, affordable, and reliable. It combines high availability and scalability with strong consistency and ACID transactions. It is also time-tested: installations with thousands of servers and petabytes of data have been running for years. In an increasingly data-driven world, YDB’s blend of performance, scalability, and reliability sets it apart as a top choice for organizations seeking to maximize the value of their data.

    In this talk, we’ll delve deeper into YDB’s architecture, starting with an overview of its key components and design principles and their role in achieving high performance and fault tolerance. Throughout the presentation, we’ll illustrate how the proper separation of key-value and query processing layers can improve performance on various request distributions. Additionally, we’ll examine YDB’s actor system, detailing its evolution and its impact on throughput and latency. To better illustrate YDB’s performance, we’ll provide comprehensive examples of various benchmarks, including:
    * Low-level component testing to showcase the efficiency of individual modules
    * Key-value YCSB tests to demonstrate YDB’s ability to handle high-throughput workloads
    * Distributed transactions to highlight the system’s capacity for managing complex, multi-node operations

    We’ll also share case studies of real-world optimization scenarios, enabling attendees to learn:
    * How a simple yet effective configuration can increase throughput by up to 30%
    * Strategies for addressing gRPC-layer bottlenecks and their impact on performance
    * The crucial role of SDK implementations in shaping the end-user performance experience
    * The tools and methods we employ in our performance optimization efforts, such as profiling and monitoring

    In conclusion, we’ll compare YDB against top competitors such as CockroachDB and Yugabyte, analyzing various aspects such as ease of deployment, performance and scalability. By sharing these results, we’ll help the audience gain a better understanding of YDB’s capabilities and how it outperforms competing solutions in handling large-scale data processing tasks efficiently.

    The talk was accepted to the conference program

    Grigory Reznikov

    YTsaurus

    Cypress: a distributed transactional file system in YTsaurus

    4 July, 11:10, «Hall 3»

    YTsaurus is a data storage and processing system that was recently open-sourced. It includes storage for very large tables, efficient data processing engines that support both ad-hoc analytical queries and data processing pipelines, and an efficient OLTP key-value store. After 12 years of development by experienced developers, YTsaurus has proved itself to be a reliable, efficient and convenient way to manage data for different purposes. YTsaurus scales well: it can store exabytes of data and process them using millions of CPUs.

    Cypress is a distributed file system at the core of YTsaurus. In this talk, we will discuss the Cypress interface and its functionality, and demonstrate how Cypress features can be used to simplify working with data.

    We will pay special attention to transactions and discuss how transactions allow using Cypress as a distributed coordination system.

    The talk was accepted to the conference program

    Daniil Zakhlystov

    Nebius

    Greenplum Physical Backups with WAL-G

    4 July, 18:10, «Hall 3»

    In 2022, we released the Yandex Managed Service for Greenplum® to the public. One of its features is physical backups via WAL-G instead of logical ones (gpbackup/pg_dump). I’ll tell you how we implemented cluster-wide consistent physical backups, PITR, delta backups, and other nice features that are now available in WAL-G.

    At first glance, Greenplum is just multiple Postgres instances wrapped together. Since we already have WAL-G for Postgres physical backups, it should be easy to create a physical backup of Greenplum instances, right? However, there were no existing solutions for Greenplum physical backups, so we decided to build our own.

    I’ll go over the following topics:
    * how we were the first in the world to implement physical backups for Greenplum;
    * what challenges we faced and how we solved them;
    * what benefits we gained as a result;
    * and, of course, how to begin using physical backups with WAL-G right now!

    The talk was accepted to the conference program

    Ashot Vardanian

    Unum Cloud

    Vector Search and Databases at Scale

    4 July, 12:20, «Hall 3»

    Vector search databases are appearing on every corner. Most people have already heard about Pinecone, Weaviate, Qdrant, and the open-source libraries that preceded them: Facebook's FAISS and Google's ScaNN. We will look into the algorithms for approximate nearest neighbor search, profile them, and highlight the bottlenecks that stop most of them from scaling beyond a billion entries. This will help you navigate the increasingly complex space of vector and semantic search products and choose the optimal configuration parameters for indexing, the right neural network to produce embeddings, and the appropriate hardware for the task.

    The talk was accepted to the conference program

    Peter Zaitsev

    Percona

    From a database in container to DBaaS on Kubernetes

    4 July, 14:40, «Hall 1»

    DBaaS is growing fast, but it’s typically proprietary and tied to one cloud vendor. We believe Kubernetes finally allows us to build a fully open-source DBaaS solution that can be deployed anywhere Kubernetes runs. Join us as we discuss running databases on Kubernetes in production.

    DBaaS is the fastest-growing way to deploy databases. It is fast and convenient, and it greatly reduces toil, yet it is typically built on proprietary software and tightly coupled with the cloud vendor. We believe Kubernetes finally allows us to build a fully open-source DBaaS solution that can be deployed anywhere Kubernetes runs, whether on a public cloud or in your private data center.

    In this presentation, we will talk about all the important steps it takes to run a database on Kubernetes in production. We will answer the following questions:
    1. Can you do it without operators?
    2. Can you use only Kubernetes primitives to run a production-grade database, and then a DBaaS?
    We will describe the most important user requirements and the typical problems you would encounter while building DBaaS solutions, and explain how you can solve them using the Kubernetes Operator framework.

    The talk was accepted to the conference program

    Evgeny Ermakov

    Toloka.AI

    Modern Data Stack: Choose Suffer Love

    Data Lakehouse, Data Mesh and Data Fabric are loud-sounding terms in the data world. But what if the desire to “implement hype tech” suddenly turns into reality? Be careful what you wish for, because that’s exactly what happened at Toloka.AI: the data platform team was given the task “Azure. Modern Data Stack. Tomorrow.”

    In our talk, we will go through all the stages of our work with the Modern Data Stack (choose, suffer, love) and answer the following questions:

    * What is a modern data platform and on what pillars does it rest?
    * How not to drown in the world of Modern Data Stack solutions?
    * What issues can you encounter when integrating different systems?
    * Which tools (of those that we have tried) are must-have, and which ones can we try to replace?
    * How has the work of analysts changed (and has it changed)?
    * What is worth repeating and what is not worth repeating if you follow the same path?

    The Program Committee has not yet taken a decision on this talk

    Maxim Pchelin

    Nebius

    Vladimir Verstov

    Yandex Go

    Data Management Platform over YTsaurus: The Path from User to Contributor

    4 July, 14:40, «Hall 3»

    What do you do when your business wants a fast, reliable and convenient data management platform (DMP) for DWH without spending money on paid solutions? How do you build a high-quality system for processing petabytes of data using open-source tools and modify them if necessary?

    In the talk, we will share our experience of creating a DMP that ensures the day-to-day operation of the DWH for Yandex Taxi, Eats, Grocery, Delivery, and others. We will present our technology stack, primarily based on YTsaurus, the brand new open-source alternative to Hadoop. We will tell you how we built our platform on top of it, when we used other tools (like ClickHouse or Greenplum), and how we made improvements to YTsaurus.

    The talk was accepted to the conference program