látogató számláló

Data+ML Workshops • Monday, June 10

DuckDB: Using an in-process database for data science and data engineering

2x80 minutes • in English • in-person

Description

This workshop gives an introduction to the DuckDB in-process SQL database management system. In the workshop, I describe DuckDB's architecture and internals. Then, I continue with hands-on exercises focusing on two clients: the CLI client and the Python client. Using these clients, we explore real-world datasets using DuckDB. During the exploration, I demonstrate several features of DuckDB, including its SQL dialect and its integration with other Python libraries. I explain how DuckDB's extension ecosystem works with examples for key extensions. Finally, I describe design patterns and best practices for achieving high performance.

Please bring your laptop as there will be hands-on sessions for using DuckDB. I recommend using a local Jupyter notebook installation (e.g., running in VS Code, PyCharm, or Jupyter Lab). Alternatively, Python exercises can be performed in Google Colab.

If you have macOS, we recommend using homebrew to install the DuckDB command line client. On Windows, we recommend WinGet. Linux users should use wget.

Outline

An overview of DuckDB

  • Architecture
  • Storage and execution
  • In-memory vs. persistent mode
  • DuckDB clients
  • DuckDB versions
  • Interoperability between clients
  • What version 1.0.0 means

The DuckDB CLI client

  • Installation
  • Help and tldr page
  • Output modes
  • The `~.duckdbrc` file

The DuckDB Python client

  • DuckDB in Jupyter (Jupyter Lab, VS Code, Google Colab)
  • Pandas integration
  • Plotting
  • Relational API
  • PySpark API

Data set: Railway network and services in the Netherlands

  • Overview
  • Railway services dataset
  • Train stations dataset

DuckDB's SQL dialect via examples

  • Basics
  • Data import and export shorthands
  • Friendly SQL extensions
    • Trailing comma
    • GROUP BY ALL
    • ORDER BY ALL
    • FROM-first syntax
    • REATE TABLE ... AS SELECT
    • COLUMNS()
    • lambdas
  • Advanced SQL
    • common table extensions
    • Pivot and unpivot
    • Window functions
    • Joins
  • Design patterns

Extensions

  • Installation and autoloading
  • httpfs extension (https:// and s3:// support)
  • Parquet extension
  • Partial reading
  • File formats: JSON, Iceberg, Delta Lake
  • Outlook: spatial, vss
  • The DuckDB extension ecosystem

Performance guide

  •  Importing
  • Keys and indexing
  • Directly querying files
  • Spilling to disk

Trainer

Gábor Szárnyas is Developer Relations Advocate at DuckDB Labs. He obtained his PhD in software engineering in 2019 and spent 3 years as a post-doctoral researcher at CWI in Amsterdam, working on graph data management techniques.