TL;DR: DuckDB is the perfect engine for modern data analysis: it's fast, flexible, and bridges SQL and Python seamlessly, making it ideal for ad hoc analysis and EDA workflows. We've integrated it deeply into Fabi.ai to give analysts and data scientists the speed and power they need, whether they're querying multiple file formats, working with cloud data, or switching between SQL and Python on the fly.
Whether you're doing ad hoc or exploratory data analysis or building an internal app or Python dashboard, speed, scale, and flexibility are critical. Data practitioners often need to work with multiple data sources, perform ad hoc analyses, and deliver insights in real time. Since founding Fabi.ai, Lei and I have been laser-focused on empowering analysts, data scientists, and other data practitioners to navigate these challenges seamlessly, and we found that DuckDB was the perfect fit to supercharge this experience. Any tool that doesn't scale, runs slowly, or introduces friction into your workflow is a distraction, and we aim to eliminate all distractions.
DuckDB is rapidly gaining traction as a high-performance, in-process SQL database, and at Fabi.ai, we’ve made it a core part of our platform. Why? Because it aligns perfectly with the needs of modern data practitioners: it’s fast, lightweight, and capable of handling a variety of file formats and queries, and perhaps most importantly, it’s a perfect bridge between SQL and Python, two core languages in any modern AI platform. In this post, we’ll dive into why we integrated DuckDB into Fabi.ai and how you can leverage its capabilities to level up your ad hoc and exploratory data analysis (EDA) or your Python dashboarding game.
What is DuckDB, and why does it matter for ad hoc analysis and EDA?
DuckDB is often described as “SQLite for analytics,” but that comparison only scratches the surface of what it can do. Designed specifically for OLAP (Online Analytical Processing), DuckDB shines in the world of analytics, providing incredibly fast query speeds and the ability to handle large datasets that would choke a traditional SQL engine. It’s particularly powerful for interactive data exploration, the kind of work that analysts and data scientists need to do when they’re trying to quickly answer complex questions without building and deploying large-scale data pipelines.
Key features of DuckDB include:
- In-process operation: DuckDB doesn’t require a separate server, making it lightweight and easy to embed into applications like Fabi.ai.
- Optimized for analytical queries: It can handle complex SQL queries with impressive speed.
- Native support for multiple file formats: Whether you’re dealing with CSVs, Parquet files, or even JSON, DuckDB makes querying these formats simple and fast.
- Direct integration with Python: DuckDB plays well with Python DataFrames, enabling seamless transitions between SQL and Python-based workflows.
These features make DuckDB an ideal tool for ad hoc analysis and EDA, where analysts need to work quickly with a wide range of data sources, often without the luxury of pre-built infrastructure. At Fabi.ai, we saw the potential to combine DuckDB’s strengths with our platform’s existing capabilities, allowing our users to experience the best of both worlds: the flexibility of SQL and Python, wrapped in a tool that supports rapid analysis and iteration.
How Fabi.ai leverages DuckDB
How we integrated DuckDB
At Fabi.ai, we’ve integrated DuckDB directly into our core platform, making it the native engine for querying Python DataFrames. When you upload a dataset, Fabi.ai automatically creates a SQL cell backed by the DuckDB query execution engine. This means that whether you’re querying a Parquet file, joining multiple Excel sheets, or analyzing a CSV, DuckDB is doing the heavy lifting while you get the familiar experience of writing SQL or manipulating Python DataFrames.
This integration goes a step further by allowing you to seamlessly switch between SQL and Python in your analysis. Want to do some data wrangling in Pandas or Polars and then run a complex SQL query on the result? No problem. Need to load a file, run a quick aggregation in SQL, and then visualize the results in Python? Easy. DuckDB makes this fluid transition possible, and Fabi.ai wraps it all in a user-friendly interface designed for analysts who want power without complexity.
We also create tables in DuckDB under certain conditions to reduce compute costs and give you the fastest possible analysis experience on your data. These tables serve as a temporary, high-performance analytics cache.
Capabilities that DuckDB unlocks
One of the most exciting aspects of DuckDB is its rich set of capabilities, which we’ve fully embraced in Fabi.ai. Here are a few key features you’ll want to explore:
Advanced (and, honestly, just cool and fun) SQL syntax: DuckDB comes with some unique SQL capabilities that go beyond basic querying. For example, you can reference aliases in the SELECT clause, or even place the FROM clause before the SELECT clause. This makes it a fantastic tool for exploratory analysis, where complex questions can arise on the fly.
Here are some of my favorite syntax tricks, using our fictional company Superdope.
The EXCLUDE clause, which selects all columns except those specified:
Referencing aliases in the SELECT statement:
Using a regex with the COLUMNS expression to retrieve columns matching a certain pattern:
You can find a lot more cool and fun tricks in this post written by Alex Monahan or on the DuckDB blog.
Multi-format file support: Gone are the days when you had to wrangle CSVs into shape before importing them into your SQL database. DuckDB supports reading and querying a variety of formats natively, including Parquet, JSON, and Excel files. Whether your data lives in an S3 bucket as Parquet files or is scattered across several Excel sheets, DuckDB can pull it all into a SQL-friendly environment without any friction.
Example of using DuckDB to analyze a JSON file (hint, this is great when studying an API that provides a structured JSON response):
Querying S3 directly: DuckDB also allows direct querying of data stored in S3. This is a game-changer for analysts who frequently work with cloud-stored datasets but want to avoid the hassle of downloading large files just to run a few queries. With DuckDB in Fabi.ai, you can point your query directly at an S3 path or presigned URL and start exploring the data immediately.
If you’re a Fabi.ai or DuckDB user, check out the DuckDB blog series where they cover some of their favorite tips.
These capabilities, coupled with DuckDB’s performance and flexibility, open up a world of possibilities for ad hoc analysis, data discovery, and EDA. Whether you’re cleaning up messy data, running exploratory queries to find insights, or preparing a dataset for visualization, DuckDB in Fabi.ai gives you the tools to do it all, without the typical overhead or complexity.
By the way, want to try out DuckDB without Fabi.ai? No problem: it’s super easy to install locally and get started. Check out their installation documentation.
DuckDB is a perfect tool for data analysis and EDA
In a world where data analysis often involves juggling multiple tools and workflows, DuckDB stands out as a powerful unifier. By integrating DuckDB into Fabi.ai, we’re giving data practitioners the ability to harness the full potential of SQL and Python in one place, with the performance and flexibility they need to tackle even the most complex analysis tasks.
From ad hoc queries to full-fledged exploratory data analysis, DuckDB enables fast, intuitive, and scalable workflows that fit seamlessly into the Fabi.ai platform. It’s perfect for small to medium-sized datasets and ideal for the day-to-day work of analysts and data scientists who need to move quickly from data exploration to insight generation.
We invite you to explore DuckDB’s capabilities within Fabi.ai and see for yourself how it can transform your analysis workflows. With its ability to handle a variety of file formats, query directly from cloud storage, and support advanced SQL syntax, DuckDB is more than just a SQL engine—it’s the future of flexible, scalable data analysis.