Why (and how) to use SQL and Python in tandem

SQL and Python are arguably the two most foundational pillars for data analysis (we should throw in spreadsheets as the third pillar for completeness). If you’re doing any form of exploratory data analysis (EDA) to get insights from your data, you need to know how to leverage SQL and Python. They’re musts if you’re analyzing data at any sort of scale and doing anything more than adding a few filters and pivoting your data (say you’re analyzing A/B test results, building a churn prediction model, etc.).

However, to use the two in tandem effectively, it’s important to understand how they operate together. I would also wager that how you think about splitting the workload between SQL and Python is quickly evolving with the rise of exciting new technologies like DuckDB, Polars, and, of course, AI.

In this post, I’ll talk about why SQL and Python are so fundamental to exploratory data analysis and ad hoc analysis, when to use one versus the other, and how to effectively blend the two languages to work for you. We’ll also discuss how AI is changing the way you should think about your analyses and what the new technologies mean for your workflow.

SQL vs Python: What they’re designed for

If you’re already familiar with SQL and Python, you can skip ahead to the next section.

TL;DR: SQL is the language you need to extract data from your data warehouse, but you can also use it for simple or even fairly advanced data manipulation. On the other hand, Python is a general-purpose programming language with packages and libraries that make advanced data analysis, machine learning, and data visualization incredibly convenient.

The Merits of SQL and Python

Any data analysis starts with, you guessed it, data. You’re either using a file (CSV, Excel, Parquet, JSON etc.) or you’re pulling data from a data warehouse. If you’re pulling data from a warehouse you’ll need to use some SQL dialect (we’ll leave the topic of NoSQL databases for another article). SQL is designed with one purpose in mind: retrieving and slicing and dicing data. You can pull data from a single table, join across tables, aggregate, average, and do pretty much anything else you could imagine yourself doing in a spreadsheet.

Python, on the other hand, is a general-purpose programming language. However, if you’re reading this article, you’re likely doing, or want to do, some sort of advanced data analysis. Python is probably the single best programming language (along with R) for that kind of work, with its vast set of open source libraries and packages designed specifically for data science, statistical analysis, and machine learning.
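
To make this concrete, here’s a quick sketch of the kind of analysis that takes just a few lines in Python but would be painful in pure SQL. The data and column names below are made up purely for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: headcount and revenue for a handful of companies
companies = pd.DataFrame({
    'employee_count': [12, 45, 230, 1200, 8],
    'annual_revenue_musd': [1.5, 6.0, 40.0, 310.0, 0.9],
})

# Fit a simple linear regression of revenue on headcount
model = LinearRegression()
model.fit(companies[['employee_count']], companies['annual_revenue_musd'])
print(model.coef_[0], model.intercept_)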

When to Use SQL vs. Python

The line between SQL and Python can get a bit blurry. Say you have a ‘companies’ table you want to analyze, and you want to calculate the average number of employees across the companies in it. In SQL, you could simply write:

SELECT avg(employee_count) FROM companies;

Alternatively, you could retrieve the entire list of companies from your warehouse with SQL, store it as a dataframe (in this case we’ll use Pandas), and calculate the average in Python:

query = "SELECT * FROM companies" 
df = pd.read_sql(query, engine) 
average_employee_count = df['employee_count'].mean()

This might look like Python is complicating things a bit. To be fair, it kind of is (until we look at more complex tasks). 

Let’s say we want to get a distribution of NBA player shooting efficiency from a ‘player_stats’ table. In SQL, this is what our query might look like:

WITH min_max_values AS (
  SELECT
    min(shooting_efficiency_percent) AS min_shooting_efficiency,
    max(shooting_efficiency_percent) AS max_shooting_efficiency
  FROM player_stats
)

SELECT
  floor((shooting_efficiency_percent - min_max_values.min_shooting_efficiency) / (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) * 30) AS bin,
  count(*) AS frequency,
  floor((shooting_efficiency_percent - min_max_values.min_shooting_efficiency) / (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) * 30) * (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) / 30 + min_max_values.min_shooting_efficiency AS bin_start,
  floor((shooting_efficiency_percent - min_max_values.min_shooting_efficiency) / (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) * 30) * (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) / 30 + min_max_values.min_shooting_efficiency + (min_max_values.max_shooting_efficiency - min_max_values.min_shooting_efficiency) / 30 AS bin_end
FROM player_stats, min_max_values
GROUP BY bin, bin_start, bin_end
ORDER BY bin;

But using Python, this becomes:

query = "SELECT * FROM player_stats"
player_stats = pd.read_sql(query, engine)
# Cut shooting efficiency into 30 equal-width bins (mirroring the SQL above) and count players in each
player_stats['bin'] = pd.cut(player_stats['shooting_efficiency_percent'], bins=30)
bin_counts = player_stats['bin'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin', 'frequency']
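
And if the end goal is to actually look at the distribution, a couple more lines gets you a chart. Here’s a minimal sketch using matplotlib, reusing the player_stats DataFrame from above:

import matplotlib.pyplot as plt

# Plot the same 30-bin distribution directly from the DataFrame
plt.hist(player_stats['shooting_efficiency_percent'], bins=30)
plt.xlabel('Shooting efficiency (%)')
plt.ylabel('Number of players')
plt.show()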

Now, suddenly Python is looking awfully efficient. So, why not always pull all the data and use Python? 

For smaller datasets, this can simply be a matter of preference (if Python is an available option in the first place). However, as you start working with larger datasets, you’ll quickly run into performance issues if you pull all your data into a dataframe to work on it with Python. You’ll generally find it much more convenient to pull the core dataset you want to study using SQL, then do the analysis on that carved out data using Python.
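
Here’s a rough sketch of what that split looks like in practice, assuming a hypothetical orders table and the same SQLAlchemy engine set up in the next section: let SQL do the filtering and heavy aggregation, then explore the much smaller result in pandas.

import pandas as pd

# SQL does the heavy lifting: filter to recent orders and aggregate to one row per customer
query = """
    SELECT customer_id, count(*) AS order_count, sum(order_total) AS total_spend
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
"""
customers = pd.read_sql(query, engine)

# The warehouse has already shrunk the data, so the flexible exploratory work in pandas stays fast
customers['avg_order_value'] = customers['total_spend'] / customers['order_count']
print(customers['avg_order_value'].describe())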

How do I get data from SQL to Python?

To use both SQL and Python, you’ll primarily be working in Python, using a library that lets you query your local or remote database.

Connect SQL with Python

There are a number of Python libraries out there to connect to local or remote SQL databases. For the sake of example today, we’ll take a look at SQLAlchemy, a commonly used library that connects to most common SQL databases. Here’s an example of how our code could look: 

from sqlalchemy import create_engine, text
import pandas as pd

# Replace with your actual database connection details
DATABASE_URI = 'postgresql+psycopg2://user:password@hostname/database_name'
engine = create_engine(DATABASE_URI)

# Create a connection to the database
with engine.connect() as connection:
    # Define the raw SQL query
    query = text("SELECT * FROM companies WHERE industry = :industry")
    # Execute the query with a parameter and fetch results into a DataFrame
    results = connection.execute(query, {'industry': 'Technology'})
    df = pd.DataFrame(results.fetchall(), columns=results.keys())

Again, when doing this analysis, consider the volume of data. Pandas dataframes are notorious for performing slowly when dealing with large datasets. Don’t pull in more than you need (more on this below).
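
If you genuinely do need to scan a large table, one way to soften the blow is to stream it in chunks rather than loading everything into memory at once. A minimal sketch, reusing the engine from above (the 50,000-row chunk size is arbitrary):

# Iterate over the table in chunks instead of materializing one giant DataFrame
total_employees = 0
for chunk in pd.read_sql("SELECT employee_count FROM companies", engine, chunksize=50_000):
    total_employees += chunk['employee_count'].sum()
print(total_employees)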

Leverage emerging technologies

So far, we’ve talked about how you should use SQL to whittle down your dataset to what you want to analyze, then use Python for any sort of more advanced data analysis. However, there are some new technologies coming onto the scene that are changing this best practice.

DuckDB and Polars are two great examples of these new players. They’re designed specifically to handle reasonably large volumes of data incredibly fast.

Theoretically, this means you don’t have to be as selective when pulling data with your SQL query since you won’t take much of a hit when you analyze your data later. 

For example: on certain operations, Polars has a reported speed boost of anywhere from 10x to 50x over pandas, which means that, in theory, a piece of analysis that used to take 20 minutes to run can now be done on the same hardware in two minutes or less. However, more data will always make things run slower. If you’re able to conduct the bulk of the raw data manipulation in SQL, you should still aim to do so to get the biggest boost possible.
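
To give a feel for what this looks like, here’s a quick sketch of both tools reading the same Parquet file of player stats. The file path and the team column are made up for illustration, and the Polars snippet assumes a recent version with the group_by API:

import duckdb
import polars as pl

# DuckDB: run SQL directly against a local Parquet file, no warehouse round trip
team_efficiency = duckdb.sql(
    "SELECT team, avg(shooting_efficiency_percent) AS avg_efficiency "
    "FROM 'player_stats.parquet' GROUP BY team"
).df()
print(team_efficiency)

# Polars: lazily scan the same file so only the columns you actually touch get read
lazy_summary = (
    pl.scan_parquet("player_stats.parquet")
    .group_by("team")
    .agg(pl.col("shooting_efficiency_percent").mean().alias("avg_efficiency"))
)
print(lazy_summary.collect())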

The role of AI in SQL and Python

AI has cemented itself as a key tool for any sort of ad hoc analysis or EDA. Let’s take a look at what AI has to do with how you use and connect SQL and Python. 

Comparing the accuracy of AI for one coding language vs another (e.g. SQL vs Python) can be hard to do. At Fabi.ai we work with both languages every single day and, in our experience, AI has been much more accurate at generating Python than SQL. This is because Python uses very specific functions to perform complex tasks, so the opportunities to take a wrong turn are much lower. 

If you’re looking to leverage AI to accelerate your EDAs and uncover new insights from your data, you might find yourself doing more of the work in Python.

Use SQL to whittle down your data for analysis & Python to analyze data in depth 

If you’re wondering where SQL ends and where Python starts when conducting exploratory data analysis or an ad hoc analysis, here’s the rule of thumb: Use SQL to pull your working set of data and use Python to manipulate it. 

But as with anything coding related, the lines are blurry and the decision can simply come down to personal preference or comfort in either language. 

However, regardless of which of the two you prefer for analysis today, it’s worth checking out some of the new technologies to see if they’ll make your work easier. 

And, if you’re using both today in tandem, consider exploring how you can use AI in your workflows to give you a material boost in productivity.

Lastly, a shameless plug: we know how frustrating it can be to try to integrate all these recommendations across multiple tools and workflows, which is why we built Fabi.ai’s Smartbooks product. It lets you combine the power of SQL, Python, AI, and no-code in one platform so you can easily get the benefits of each without a ton of technical wrestling.

If you’re curious to see how it works, you can get started for free in less than 5 minutes.

"I was able to get insights in 1/10th of the time it normally would have"

Don't take our word for it, give it a try!