DuckDB: Using an in-process database for data science and data engineering

This workshop gives an introduction to the DuckDB in-process SQL database management system. In the workshop, I describe DuckDB's architecture and internals. Then, I continue with hands-on exercises focusing on two clients: the CLI client and the Python client. Using these clients, we explore real-world datasets using DuckDB. During the exploration, I demonstrate several features of DuckDB, including its SQL dialect and its integration with other Python libraries. I explain how DuckDB's extension ecosystem works with examples for key extensions. Finally, I describe design patterns and best practices for achieving high performance.

Please bring your laptop as there will be hands-on sessions for using DuckDB. I recommend using a local Jupyter notebook installation (e.g., running in VS Code, PyCharm, or Jupyter Lab). Alternatively, Python exercises can be performed in Google Colab.

If you have macOS, we recommend using homebrew to install the DuckDB command line client. On Windows, we recommend WinGet. Linux users should use wget.

An overview of DuckDB

Architecture
Storage and execution
In-memory vs. persistent mode
DuckDB clients
DuckDB versions
Interoperability between clients
What version 1.0.0 means

The DuckDB CLI client

Installation
Help and tldr page
Output modes
The `~.duckdbrc` file

The DuckDB Python client

DuckDB in Jupyter (Jupyter Lab, VS Code, Google Colab)
Pandas integration
Plotting
Relational API
PySpark API

Data set: Railway network and services in the Netherlands

Overview
Railway services dataset
Train stations dataset

DuckDB's SQL dialect via examples

Basics
Data import and export shorthands
Friendly SQL extensions
- Trailing comma
- GROUP BY ALL
- ORDER BY ALL
- FROM-first syntax
- REATE TABLE ... AS SELECT
- COLUMNS()
- lambdas
Advanced SQL
- common table extensions
- Pivot and unpivot
- Window functions
- Joins
Design patterns

Extensions

Installation and autoloading
httpfs extension (https:// and s3:// support)
Parquet extension
Partial reading
File formats: JSON, Iceberg, Delta Lake
Outlook: spatial, vss
The DuckDB extension ecosystem

Performance guide

Importing
Keys and indexing
Directly querying files
Spilling to disk

Description

Outline

Trainer