A new wave: from Big Data to Small Data

Earlier this week I attended Small Data SF, a first-time conference organized by MotherDuck, Turso and Ollama. The Small Data Manifesto, in a nutshell, is that more data isn’t necessarily better, and that single machines and nodes have come such a long way since the early 2000s that they’re sufficient for most data analyses in a real-world enterprise setting.

If I’m being honest, I went into the conference with tepid expectations. I was mostly excited about connecting with some of the titans of industry who were going to be present and also learning more about DuckDB, which I’m a big fan of.

But walking away, I felt very differently than I anticipated, and I think I fully drank the Kool-Aid.

Sure, there was some good old sales pitching disguised as thought leadership talks and a lot of self-promotion, but as a whole, the conference felt like a breath of fresh air. It felt like we were collectively taking a really big step back and acknowledging that we’re recovering from a Big Data Hangover™.

To be fair, this only feels like a hangover because over the past 10 years, while we’ve all collectively been obsessing over Big Data, there have been two major changes quietly taking place:

  1. The processing power of local machines* has gone up by orders of magnitude
  2. New libraries and tools have been created that process data more efficiently

In the chart above**, I mapped out the typical floor and ceiling for the number of cores available on local machines and virtual machines over the years. This had to be plotted on a log scale!

Since the advent of Big Data in the mid-2000s, compute power has increased by orders of magnitude. Of course, this is just the number of cores (or vCPUs), but this trend holds even when looking at RAM, network throughput and virtually every other dimension that matters for data processing. Note: A 400 vCPU virtual machine on EC2 is going to cost you a pretty penny, so I’m not suggesting that this compute power is economically viable, but you can easily get a 32 vCPU machine for close to a dollar per hour.

Now shifting to the business data problem, one of my favorite quotes from the conference was: 

“You don’t have a Big Data problem, you have a lot of small data problems”  - Celina Wong

And from my own experience and what I see every day in the industry, this absolutely hit the nail on the head. Yes, sure, there are some companies out there that actually need to do analysis on trillions of rows and terabytes of data, but that’s the exception, not the rule. One of the talks was from Gaurav Saxena, who co-authored the Redshift fleet analysis, and the data from that research was referenced by George Fraser, who paraphrased what he wrote in his blog post:

“Of queries that scan at least 1 MB, the median query scans about 100 MB. The 99.9th percentile query scans about 300 GB. Analytic databases like Snowflake and Redshift are "massively parallel processing" systems, but 99.9% of real world queries could run on a single large node”


10 years ago, analyzing 1GB of data on your laptop was basically not an option. So Big Data, along with distributed systems like Hadoop and Spark, came along, and over time an industry built up around it. Executives were told that data is the new oil and that you have to moneyball your way through business problems, so they hired data teams with cool titles like Data Scientist and spent money on platforms that supported this big data and promised to crunch through crazy amounts of it to find insights. But take a look at a modern laptop. A mid-range MacBook Pro has a 12-core CPU, 18-core GPU and 18GB of memory! If you throw DuckDB (or another columnar database of your choice) at it, along with Polars (which can offer a 10X speed gain over pandas), you can easily chew through gigabytes of data like it’s nothing.
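
To make that concrete, here’s a minimal sketch of the kind of workflow this enables. The file name and columns are made up purely for illustration; the point is that DuckDB can query a local Parquet file directly and hand the result to Polars, all on one machine:

```python
import duckdb
import polars as pl

# Query a local Parquet file directly with DuckDB: no cluster, no warehouse.
# "orders.parquet" and its columns are hypothetical.
orders = duckdb.sql("""
    SELECT customer_id, order_date, amount
    FROM 'orders.parquet'
    WHERE order_date >= DATE '2024-01-01'
""").pl()  # hand the result off to Polars as a DataFrame

# Aggregate in Polars; gigabyte-scale data fits comfortably in a laptop's memory.
monthly_revenue = (
    orders
    .with_columns(pl.col("order_date").dt.truncate("1mo").alias("month"))
    .group_by("month")
    .agg(pl.col("amount").sum().alias("revenue"))
    .sort("month")
)
print(monthly_revenue)
```

Nothing here needs Spark or a warehouse; both libraries read the file in place.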

Okay, so local machines are incredibly powerful and new libraries have been created that are much more performant. Now let’s take a quick detour. Benn Stancil, co-founder of Mode and self-proclaimed jaded critic, also gave a talk, and one point really stuck with me: your data is likely not clean enough or good enough that you can plug it into a BI tool, slice and dice it using all the fancy promises of BI, and get an insight. It’s much more likely that you have patchy data that you can’t drill into, and you’re much better off doing ad hoc analyses that don’t scale, drawing the best conclusions you can to answer a business question.

Machines are much more powerful than they used to be, and you probably need to be doing more ad hoc analysis to make educated guesses about your data. I think these point to two really important trends, and trends drive industries:

  1. For most of your analyses, processing can probably be done locally, which greatly simplifies things from a tooling perspective, all at a much lower cost
  2. There’s growing discontent with BI and a movement to embrace more one-off queries

On point two, I’d expand further into two (more) points: 

  1. AI is changing how we think about analysis. It’s making ad hoc analysis 10X easier than it used to be, and the efficiency gains are becoming undeniable
  2. I still believe BI, with a rigid semantic layer and definitions, offers a tremendous amount of value, but only for the right use cases

So if we look at all of these trends mixed in with the rise of AI, I predict a strong movement away from Big Data for everything and BI for all reporting. Those things will still exist. Walmart will still want to process millions of records every day to provide recommendations, but we’ll accept that that’s the exception. Executives will still want BI dashboards with numbers they can 100% trust and “drill into” to some degree. But for the most part, internal reporting that doesn’t require a canonical dashboard will be done locally or semi-locally, on the fly.

At Fabi.ai we’ve been locked in on helping teams answer ad hoc data requests and conduct exploratory data analysis incredibly efficiently. Our vision is that not only will data analysts and scientists be able to get incredibly useful insights at the snap of a finger, but so will product managers, accountants, growth marketing managers and other technically inclined individuals who would give you a blank stare at the mention of Spark, but can wrap their heads around SQL and Python, especially when supported with AI. From the very beginning we’ve obsessed over performance to ensure that data size and volume are not a barrier to this type of analysis. To that end we’ve embraced DuckDB and Polars, and everything happens on a single virtual machine. And coming out of this conference, I’m more confident than ever that we’re at the crest of a wave that’s going to wash over the data industry over the next 5-10 years.
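
As a rough illustration of that single-machine pattern (this is a sketch, not our actual implementation, and the file and column names are made up), Polars can lazily scan a local file and only materialize the final aggregate:

```python
import polars as pl

# Lazily scan a local Parquet file: nothing is read into memory yet.
# "events.parquet" and its columns are hypothetical.
events = pl.scan_parquet("events.parquet")

# Build the query plan, then collect only the small aggregated result.
daily_active = (
    events
    .filter(pl.col("event_type") == "page_view")
    .group_by(pl.col("ts").dt.date().alias("day"))
    .agg(pl.col("user_id").n_unique().alias("active_users"))
    .sort("day")
    .collect()  # Polars also offers a streaming mode for larger-than-memory files
)
print(daily_active)
```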

Let’s connect on X or LinkedIn and chat!

A note on Iceberg: I think there’s something happening here as well and I think it may actually (counter-intuitively) play well with the idea of small data. The founder of RisingWave talks about it in this post. I haven’t fully figured it out, but it seems like we’re heading in a direction where we can handle massive amounts of data storage at scale but then handle the compute locally or semi-locally. I would love to connect with someone who has done some research in this area.
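
For what it’s worth, one concrete way this could look today: DuckDB has an iceberg extension that can read an Iceberg table directly, so the storage can live in object storage while the query runs in a single local process. A rough sketch, with a hypothetical table path (reading from S3 would also require the httpfs extension and credentials to be configured):

```python
import duckdb

con = duckdb.connect()

# Load DuckDB's Iceberg reader.
con.sql("INSTALL iceberg; LOAD iceberg;")

# Hypothetical table location; in practice this would point at an Iceberg
# table in object storage, with httpfs and credentials set up for S3.
con.sql("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://my-bucket/warehouse/orders')
""").show()
```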

______
*Small Data SF focused mostly on true local machines, as in, your laptop. However, the improvements we’re seeing in compute power on your machine are mirrored in virtual machines in the cloud. At Fabi.ai, we don’t yet support local development (let us know if this sounds interesting to you), but we believe that the theme of doing analysis on a single machine or node still applies. I’m not a computer engineer, and I understand that a CPU core is not 1:1 with a vCPU, but the general point is that overall processing power in non-distributed systems has gone up materially since the early 2000s. I welcome being educated on the finer points.

**I had ChatGPT help me pull this data together. I spot-checked machines from each era and the data here seems correct, or at least sufficient to illustrate the point.

"I was able to get insights in 1/10th of the time it normally would have"

Don't take our word for it, give it a try!