
SQL vs Python: which to use for data analysis
TL;DR: Combining SQL and Python creates powerful workflows for advanced data manipulation. This guide demonstrates four practical examples: ETL processes using dbt and Snowpark, A/B test analysis, customer churn prediction, and interactive dashboards. Learn how SQL's data retrieval capabilities complement Python's processing and visualization tools for more efficient data-driven tasks.
In our previous post about exploratory data analysis (EDA), we discussed why integrating SQL and Python is crucial for effective data work. As data practitioners, combining these complementary tools can significantly enhance our analytical capabilities and efficiency.
This guide walks through four practical examples of SQL and Python working together: ETL processes using dbt and Snowpark, A/B test analysis, customer churn prediction, and interactive dashboards.
We'll also examine Polars DataFrames, an emerging technology that offers significant performance improvements over traditional data processing approaches when handling large datasets.
Before diving into specific examples, let's name the key scenarios where SQL and Python work together effectively: building ETL pipelines, analyzing A/B tests and other experiments, training machine learning models, and creating custom reports and dashboards.
This isn't an exhaustive list, but these four scenarios represent ripe opportunities to uplevel our analysis work by using SQL and Python in tandem. Let's take a look at some specific examples within each.
For the examples in this post, we're going to use a fictional company, "Superdope," that sells cool widgets online. Superdope has 200 employees, data responsibilities are shared between a data engineer and a data scientist, and both SQL and Python are used throughout their data operations.
In modern data workflows, integrating data from multiple sources, cleaning it, and loading it into a data warehouse is essential.
Traditionally, SQL has been the go-to tool for these tasks, but with advancements like dbt and Snowpark from Snowflake, you can now integrate Python directly into your ETL processes. This approach lets you do two things: apply Python logic that would be difficult or impossible to express in SQL alone, and keep that logic baked directly into your data pipeline rather than running it as a separate, one-off script.
So let's take a look at Superdope's use case. They have customer reviews from their online store, and they want to score the sentiment of each customer review on a regular basis for better targeted marketing.
First, we want to retrieve only the most recent reviews, since we don't want to process the entire dataset every time:
WITH recent_reviews AS (
    SELECT
        customer_id,
        review_text,
        review_date
    FROM
        {{ ref('superdope_customer_reviews') }}
    WHERE
        review_date >= DATEADD(day, -7, CURRENT_DATE())
)

SELECT
    customer_id,
    review_text,
    review_date
FROM
    recent_reviews
Then we want to apply the sentiment analysis:
from snowflake.snowpark import Session
from textblob import TextBlob

def analyze_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity

def model(dbt, session: Session):
    # Request textblob for the Snowpark environment (the package must be available in Snowflake's Anaconda channel)
    dbt.config(materialized="table", packages=["textblob"])

    # Load the recent reviews from the last 7 days
    # (dbt Python models reference upstream models with dbt.ref rather than Jinja)
    df = dbt.ref("superdope_recent_reviews").to_pandas()

    # Perform sentiment analysis on each review
    df['sentiment_score'] = df['review_text'].apply(analyze_sentiment)

    # Convert back to a Snowpark DataFrame and return
    return session.create_dataframe(df)
And then, of course, we run the models:
dbt run --models superdope_recent_reviews superdope_sentiment_analysis
You can see from this example how we were able to do analysis in Python that wouldn't have been possible with SQL alone, and bake it directly into the data pipeline.
It’s hard to make data-driven decisions without using A/B testing at some point. Often, we’re using SQL to retrieve experimental data, and Python to perform statistical analysis and visualize the results.
ℹ️ Note for Fabi.ai users: In a lot of cases, A/B testing data may come in the form of a file. Even if your data does not live in a data warehouse, you can upload it to Fabi.ai and store it in a temporary database to accomplish the following example.
For this example, let’s say that our friends at Superdope are testing out a new checkout page to increase conversion.
First we want to retrieve the A/B test data and group it by week so that we can analyze the trend:
SELECT
    user_id,
    "group",
    conversion,
    DATE_TRUNC('week', event_date) AS week
FROM
    ab_test_results
WHERE
    test_id = 42
GROUP BY
    week, user_id, "group", conversion;
Then we want to calculate the statistical significance (note: we’re not grouping by week here, but doing so would be helpful to plot the trends of the two tests over time as the experiment is running):
import polars as pl
from scipy.stats import chi2_contingency

# Load data into a Polars DataFrame
df = pl.read_database(query, connection)

# Build a contingency table of converted vs. not-converted counts per group
# (conversion is assumed to be 0/1)
contingency_table = (
    df.group_by("group")
    .agg([
        pl.col("conversion").sum().alias("converted"),
        (pl.len() - pl.col("conversion").sum()).alias("not_converted"),
    ])
    .select(["converted", "not_converted"])
    .to_numpy()
)

# Chi-square test for a difference in conversion rate between groups
chi2, p, _, _ = chi2_contingency(contingency_table)

print(f"Chi-square test p-value: {p}")
if p < 0.05:
    print("Statistically significant difference between groups")
else:
    print("No significant difference between groups")
If your A/B test is just an export from your marketing platform, you can obviously skip the SQL step, but if you do store these results in a database or data warehouse, you may find this handy. You can see how we were able to quickly filter on the relevant experiment and measure the significance in just a few lines of Python.
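As noted above, keeping the weekly grouping also makes it easy to plot how each variant is trending while the experiment runs. Here's a minimal sketch of what that could look like, assuming the query results are still in the Polars DataFrame df from the previous step (with week, group, and conversion columns) and that matplotlib is your plotting library of choice:

import matplotlib.pyplot as plt
import polars as pl

# Weekly conversion rate per group (conversion is assumed to be 0/1)
weekly = (
    df.group_by(["week", "group"])
    .agg(pl.col("conversion").mean().alias("conversion_rate"))
    .sort("week")
)

# One line per test group
for group_name in weekly["group"].unique().to_list():
    group_df = weekly.filter(pl.col("group") == group_name)
    plt.plot(
        group_df["week"].to_numpy(),
        group_df["conversion_rate"].to_numpy(),
        marker="o",
        label=str(group_name),
    )

plt.xlabel("Week")
plt.ylabel("Conversion rate")
plt.title("Weekly conversion rate by test group")
plt.legend()
plt.show()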
If you're on a RevOps or MarketingOps team and wondering why you wouldn't do this in dbt and store the results back in your data warehouse, it's because A/B tests tend to be ephemeral. You may not want to go through the process of baking them into your data model, which can also require advanced data engineering skills that you may not have readily available.
Machine learning is an extremely powerful tool for predicting future outcomes and is something you can use to do custom lead scoring, sales forecasting, or even churn prediction. For the example below, we’ll focus on the churn prediction use case.
When building machine learning models, we can use SQL to efficiently retrieve large datasets from databases and then feed them into Python machine learning libraries like scikit-learn to make the predictions. Additionally, Python's flexibility allows for advanced feature engineering, which you may need depending on the quality of your data and the types of models you decide to use. For example, if you're doing a regression analysis, there's a good chance you'll have to find some way to handle null values or categorical values.
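To make that last point concrete, here's a small, hypothetical sketch of the kind of cleanup you might do in pandas before fitting a model. The column names are made up purely for illustration:

import pandas as pd

# Hypothetical customer features with missing values and a categorical column
features = pd.DataFrame({
    "avg_order_value": [52.0, None, 110.5],
    "plan_tier": ["basic", "pro", None],
})

# Handle nulls: fill numeric gaps with the median and categorical gaps with a placeholder
features["avg_order_value"] = features["avg_order_value"].fillna(features["avg_order_value"].median())
features["plan_tier"] = features["plan_tier"].fillna("unknown")

# One-hot encode the categorical column so it can be fed to a regression or scikit-learn model
features = pd.get_dummies(features, columns=["plan_tier"])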
ℹ️ A note on Polars here: it doesn't integrate well with many machine learning libraries like scikit-learn. Support for common libraries is growing, but as of the time of writing, it's easier to work with traditional pandas DataFrames for this kind of modeling.
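If you've been working in Polars up to this point, the usual workaround is to convert to pandas right before the modeling step, for example:

# Assuming polars_df holds your query results loaded with Polars
pandas_df = polars_df.to_pandas()  # typically requires pyarrow; hands the data off to pandas and scikit-learn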
OK, now let's look at the example for Superdope. They want to use the customer sentiment score we calculated earlier to try to predict high-risk customers.
First, we'll need to retrieve customer information and create some of the baseline features we want to use for the model. In this simple example, we'll use average order value (AOV), number of purchases in the last 90 days, and average sentiment score:
SELECT
    orders.customer_id,
    AVG(order_value) AS avg_order_value,
    COUNT(order_id) FILTER (WHERE order_date >= NOW() - INTERVAL '90 days') AS total_purchases_last_90_days,
    AVG(review_sentiment) AS avg_review_sentiment
FROM orders
JOIN reviews ON orders.customer_id = reviews.customer_id
GROUP BY orders.customer_id;
Then, we'll build a composite "health_score," which we can use to label customers as "high risk" or not. Once we've done that, we can build our prediction model using a simple random forest classifier (or substitute whatever model you'd rather use here).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data into a pandas DataFrame
df = pd.read_sql(query, connection)

# Feature engineering for customer health score
df['health_score'] = (
    0.4 * df['avg_order_value'] +
    0.3 * df['total_purchases_last_90_days'] +
    0.3 * df['avg_review_sentiment']
)

# Define threshold for labeling customers as 'high risk' of churn
# (assume the threshold is chosen by the business; the bottom quartile is used here purely as an illustration)
threshold = df['health_score'].quantile(0.25)
df['high_risk'] = (df['health_score'] < threshold).astype(int)

# Prepare data for modeling
X = df[['avg_order_value', 'total_purchases_last_90_days', 'avg_review_sentiment']].values
y = df['high_risk'].values  # Binary column indicating high-risk customers

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model to predict high-risk customers
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
Once you’ve built and tested this model, you can either do a one-off analysis on customers that aren’t yet scored, or you can consider inserting this prediction into your data model (see example above about doing this in dbt).
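For the one-off route, scoring customers that don't have a label yet is just another query plus a predict call. Here's a rough sketch that reuses the trained model from above; the table name in the query is hypothetical:

# Pull the same three features for customers that haven't been scored yet (hypothetical table name)
new_customers = pd.read_sql("SELECT * FROM superdope_unscored_customer_features", connection)

# Score them with the model trained above
X_new = new_customers[['avg_order_value', 'total_purchases_last_90_days', 'avg_review_sentiment']].values
new_customers['predicted_high_risk'] = model.predict(X_new)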
Custom reporting and dashboarding are essential for delivering insights to stakeholders. Often, data practitioners deliver this reporting via a business intelligence (BI) tool.
Unfortunately, most BI solutions don't offer integrated Python kernels. So if you need to do advanced analysis that's best done (or can only be done) in Python, the analysis has to happen outside the BI tool, be written back to your data warehouse, and then be loaded into the BI tool. That's a lot of heavy lifting, especially if you're still in the exploratory phase.
If we’re looking to customize our reporting beyond the capabilities of common BI tools, we can use SQL to pull the necessary data, and Python can create dynamic and customizable reports or dashboards. This approach makes it possible to integrate machine learning in our models or create richer interactions with charts and tables in the user interface.
Let's go back to Superdope for this last example. They built their customer risk scoring model, wrote the "high_risk" field back to their data warehouse, and now their stakeholders want an internal tool to pull this data. We'll keep this example very simple, but with data apps and Python you can start building all sorts of interesting tools, for example letting your users upload files so you're not relying on all the data living in the warehouse.
In the first piece of this script, we write a function to retrieve the data from the data warehouse and store it as a DataFrame.
import streamlit as st
import pandas as pd
import sqlite3
from datetime import datetime

# Function to connect to the database and retrieve data
def load_data(connection, min_date, high_risk_threshold):
    query = """
        SELECT customer_id, last_purchase_date, high_risk
        FROM superdope_scored_customers
        WHERE last_purchase_date >= ?
    """
    df = pd.read_sql(query, connection, params=(min_date,))

    # Flag customers at or above the user-defined risk threshold
    df['high_risk'] = df['high_risk'].astype(float) >= high_risk_threshold
    return df
The second piece of the script is the main part, where we build the Streamlit app and the UX components for the end user, including the filters, table, and export functionality.
# Streamlit app
st.title("Superdope Customer Risk Dashboard")

# Database connection (replace 'your_database.db' with your actual database)
connection = sqlite3.connect('your_database.db')

# Date range selector
start_date = st.date_input("Select start date for last purchase", value=datetime(2023, 1, 1))
high_risk_threshold = st.slider("Set the threshold for high-risk customers", min_value=0.0, max_value=1.0, value=0.5)

# Load and filter data based on user input
data = load_data(connection, start_date, high_risk_threshold)

# Display the filtered data
st.subheader("Filtered Customer Data")
st.write(data)

# Option to download the filtered dataframe
st.download_button(
    label="Download the filtered data as CSV",
    data=data.to_csv(index=False),
    file_name='superdope_filtered_customers.csv',
    mime='text/csv'
)

# Close the database connection
connection.close()
In this example we used Streamlit, but you can accomplish the same thing with Jupyter notebook plugins like Voilà or with Plotly Dash. These are all powerful solutions; the two important things to keep in mind are that, in order to share these reports (or data apps) with your stakeholders, you'll need to host and manage the service yourself (including things like the authentication method), and that these solutions can quickly run into scalability limits, which may affect how much data you can process and your users' experience.
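For comparison, here's a very rough Plotly Dash sketch of the same idea. It drops the interactive filters for brevity, and the database file and table names are placeholders carried over from the Streamlit example:

import sqlite3
import pandas as pd
from dash import Dash, html, dash_table

# Load the scored customers once at startup (placeholder database and table names)
connection = sqlite3.connect('your_database.db')
df = pd.read_sql(
    "SELECT customer_id, last_purchase_date, high_risk FROM superdope_scored_customers",
    connection,
)
connection.close()

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Superdope Customer Risk Dashboard"),
    dash_table.DataTable(data=df.to_dict("records"), page_size=20),
])

if __name__ == "__main__":
    app.run(debug=True)

Whichever framework you choose, the hosting and authentication considerations above still apply.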
Another more scalable and convenient option is to use Fabi.ai, which lets you publish interactive Python dashboards in just a few clicks.
When combining SQL and Python in your data workflows, consider these best practices:
The combination of SQL and Python creates powerful data analysis workflows. Whether you're building ETL processes, performing statistical analysis, implementing machine learning models, or creating interactive dashboards, these tools complement each other perfectly.
For those looking to streamline their data analysis while maintaining robust capabilities, consider exploring Fabi.ai for free. It provides an integrated environment for working with SQL and Python, and you can get up and running in less than five minutes.