4 practical ways to use SQL and Python in tandem
TL;DR: You can combine SQL and Python for advanced data manipulation in your day-to-day workflows. This post covers four examples of this in practice: ETL processes using dbt and Snowpark, A/B test analysis, customer churn prediction, and building interactive dashboards. We also discuss how combining SQL's data retrieval capabilities with Python's advanced processing and visualization tools can significantly improve efficiency and insights in data-driven tasks, while also introducing Polars as a high-performance alternative to Pandas for handling large datasets.
In a previous post, we talked about why working with SQL and Python in tandem is crucial for exploratory data analysis (EDA) and ad hoc analysis. As data practitioners, combining these two powerful tools can unlock a new level of efficiency and insight in our workflows.
In this post, we’ll walk through four practical examples of how SQL and Python can be used together to tackle complex data tasks:
- Leveraging Python during the transformation using dbt and Snowpark
- Analyzing statistical significance from A/B test results
- Predicting customer churn
- Building interactive dashboards w/ Plotly
We’ll also touch on Polars DataFrames, a new and emerging technology that’s quickly gaining traction in the data community. Polars offers a significant performance boost over traditional Pandas DataFrames, making it a great option for handling large datasets and performing complex transformations. If you haven’t explored Polars yet, this is a perfect opportunity to see how it can fit into your SQL and Python workflow.
Common situations where Python and SQL integrate together
Before diving into the examples at length, let’s briefly discuss the most common scenarios where SQL and Python work best together, especially in the context of EDA and ad hoc analysis:
- Advanced data integration and ETL (Extract, Transform, Load): Combining SQL’s data extraction capabilities with Python’s transformation and loading functions allows for a seamless ETL process that can handle large volumes of data efficiently. We’re going to use an example using a hybrid SQL and Python approach and Snowpark.
- Data-driven decision-making and A/B Testing: SQL can quickly retrieve experimental data, and Python can be used to perform statistical tests and visualize the results, helping teams make informed decisions faster.
- Advanced machine learning and feature engineering: SQL is often used to retrieve large datasets from databases, which can then be fed into Python’s machine learning models, which are great for forming predictions. Python also excels in feature engineering, a critical step in building effective models.
- Custom reporting and dashboarding: SQL can pull the necessary data, while Python can be used to create dynamic and customizable reports or dashboards, offering a more interactive and user-friendly experience.
This isn’t an exhaustive list, but the four scenarios above represent ripe opportunities to uplevel our analysis work by using SQL and Python in tandem. Let's take a look at some specific examples within each.
Python and SQL in practice: Step-by-step examples
For the example in this post, we’re going to use a fictional company “Superdope” that sells cool widgets online. They have 200 employees and data responsibilities are shared between a data engineer and a data scientist.
1. Advanced data integration and ETL (Extract, Transform, Load) w/ dbt and Snowpark
In modern data workflows, integrating data from multiple sources, cleaning it, and loading it into a data warehouse is essential.
Traditionally, SQL has been the go-to tool for these tasks, but now, with advancements like dbt and Snowpark from Snowflake, you can directly integrate Python into your ETL processes. This approach lets you do two things:
- Hybrid transformations: Snowpark lets you write Python code that runs directly within Snowflake so you can perform complex data manipulations and apply machine learning models without leaving your data warehouse environment.
- Seamless integration: dbt typically handles SQL-based transformations, but, with Snowpark, you can embed Python into your dbt models and create a powerful hybrid transformation layer that leverages the strengths of both SQL and Python.
So let’s take a look at Superdope’s use case. They have customer reviews from their online store, and they want to score the sentiment of each customer on a regular basis for better target marketing.
First we want to retrieve the most recent reviews since we don’t want to process the entire dataset every time:
WITH recent_reviews AS (
SELECT
customer_id,
review_text,
review_date
FROM
{{ ref('superdope_customer_reviews') }}
WHERE
review_date >= DATEADD(day, -7, CURRENT_DATE())
)
SELECT
customer_id,
review_text,
review_date
FROM
recent_reviews
Then we want to apply the sentiment analysis:
from snowflake.snowpark import Session
from textblob import TextBlob
def analyze_sentiment(text):
analysis = TextBlob(text)
return analysis.sentiment.polarity
def model(dbt, session: Session):
dbt.config(materialized="table")
# Load the recent reviews from the last 7 days
df = session.table("{{ ref('superdope_recent_reviews') }}").to_pandas()
# Perform sentiment analysis on each review
df['sentiment_score'] = df['review_text'].apply(analyze_sentiment)
# Convert back to Snowpark DataFrame and return
return session.create_dataframe(df)
And then of course we want to then run the model
dbt run --models superdope_recent_reviews superdope_sentiment_analysis
You can see from this example how we were able to do some analysis using Python that you would not have been able to do just with SQL and bake that directly into your data pipeline.
2. Data-driven decision-making for A/B testing
It’s hard to make data-driven decisions without using A/B testing at some point. Often, we’re using SQL to retrieve experimental data, and Python to perform statistical analysis and visualize the results.
ℹ️ Note for Fabi.ai users: In a lot of cases, A/B testing data may come in the form of a file. Even if your data does not live in a data warehouse, you can upload it to Fabi.ai and store it in a temporary database to accomplish the following example.
For this example, let’s say that our friends at Superdope are testing out a new checkout page to increase conversion.
First we want to retrieve the A/B test data and group it by week so that we can analyze the trend:
SELECT
user_id,
group,
conversion,
DATE_TRUNC('week', event_date) AS week
FROM
ab_test_results
WHERE
test_id = 42
GROUP BY
week, user_id, group, conversion;
Then we want to calculate the statistical significance (note: we’re not grouping by week here, but doing so would be helpful to plot the trends of the two tests over time as the experiment is running):
import polars as pl
from scipy.stats import chi2_contingency
# Load data into Polars DataFrame
df = pl.read_sql(query, connection)
# Perform chi-square test by grouping data by 'group' only
contingency_table = df.groupby('group').agg(pl.col('conversion').sum()).to_numpy()
chi2, p, _, _ = chi2_contingency(contingency_table)
print(f"Chi-square test p-value: {p}")
if p < 0.05:
print("Statistically significant difference between groups")
else:
print("No significant difference between groups")
If your A/B test is just an export from your marketing platform, you can obviously skip the SQL step, but if you do store these results in a database or data warehouse, you may find this handy. You can see how we were able to quickly filter on the relevant experiment and measure the significance in just a few lines of Python.
If you’re on a RevOps or MarketingOps team and wondering why you might not do this in dbt and store the results back in your data warehouse it’s because A/B tests tend to be more ephemeral. You may not want to go through the process of baking this into your model, which can also require advanced data engineering skills which you may not have readily available.
3. Advanced machine learning for churn prediction
Machine learning is an extremely powerful tool for predicting future outcomes and is something you can use to do custom lead scoring, sales forecasting, or even churn prediction. For the example below, we’ll focus on the churn prediction use case.
When building machine learning models, we can use SQL to efficiently retrieve large datasets from databases, which we can then feed into Python’s machine learning libraries like scikit-learn to complete the prediction. Additionally, Python’s flexibility allows for advanced feature engineering, which you may need, depending on the quality of your data and the types of machine learning models you decide to use. For example, if you’re doing a regression analysis, there’s a good chance you’ll have to find some way to handle null values or cardinal values.
ℹ️ A note on Polars here: They don’t integrate well with a lot of machine learning libraries like scikit-learn. They’re constantly building more support for common libraries, but as of the time of writing this post, in its current state, it's easier to work with more traditional pandas dataframes.
OK, now let’s look at the example for Superdope. They want to use the customer sentiment score we calculated earlier to try and predict high risk customers.
First, they’ll need to retrieve customer information and create some of the baseline features they want to use for our model. In this simple example, we’ll use average order value (AOV), number of purchases and average sentiment score:
SELECT customer_id,
AVG(order_value) AS avg_order_value,
COUNT(order_id) FILTER (WHERE order_date >= NOW() - INTERVAL '90 days') AS total_purchases_last_90_days,
AVG(review_sentiment) AS avg_review_sentiment
FROM orders
JOIN reviews ON orders.customer_id = reviews.customer_id
GROUP BY customer_id;
Then, we’ll build a composite “health_score,” which we can use to label customers as “high risk” or not “high risk”. Once we’ve done that, we can build our prediction model using a simple random forecast classifier (or insert whatever model you would rather use here).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load data into a Pandas DataFrame
df = pd.read_sql(query, connection)
# Feature engineering for customer health score
df['health_score'] = (
0.4 * df['avg_order_value'] +
0.3 * df['total_purchases_last_90_days'] +
0.3 * df['avg_review_sentiment']
)
# Define threshold for labeling customers as 'high risk' of churn
df['high_risk'] = (df['health_score'] < threshold).astype(int) # Assume threshold is defined
# Prepare data for modeling
X = df[['avg_order_value', 'total_purchases_last_90_days', 'avg_review_sentiment']].values
y = df['high_risk'].values # Binary column indicating high-risk customers
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model to predict high-risk customers
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
Once you’ve built and tested this model, you can either do a one-off analysis on customers that aren’t yet scored, or you can consider inserting this prediction into your data model (see example above about doing this in dbt).
4. Custom reporting & dashboarding w/ Plotly
Custom reporting and dashboarding are essential for delivering insights to stakeholders. Often, data practitioners deliver this reporting via a business intelligence (BI) tool.
Unfortunately, most BI solutions don’t offer integrated Python kernels. So if you have to do some advanced data analysis that’s best done (or can only be done) using Python, the analysis has to happen outside of BI, written back to your data warehouse then uploaded to BI. This is a lot of heavy lifting, especially if you’re still in the exploratory phase.
If we’re looking to customize our reporting beyond the capabilities of common BI tools, we can use SQL to pull the necessary data, and Python can create dynamic and customizable reports or dashboards. This approach makes it possible to integrate machine learning in our models or create richer interactions with charts and tables in the user interface.
Let’s go back to Superdope for this last example. They built their customer risk scoring model, they were able to write the “high_risk” field back to their data warehouse, and now their stakeholders want an internal tool to pull this data. We’ll keep this example very simple, but with data apps and Python, you can start building all sorts of interesting tools to, for example, let your users upload files and move away from relying on data being in the warehouse.
In the first piece of this script, we write a function to retrieve the data from the data warehouse and store it as a DataFrame.
import streamlit as st
import pandas as pd
import sqlite3
from datetime import datetime
# Function to connect to the database and retrieve data
def load_data(connection, min_date, high_risk_threshold):
query = f"""
SELECT customer_id, last_purchase_date, high_risk
FROM superdope_scored_customers
WHERE last_purchase_date >= min_date
"""
df = pd.read_sql(query, connection, params=(min_date,))
# Adjust high_risk based on user-defined threshold
df['high_risk'] = df['high_risk'].astype(float) >= high_risk_threshold
return df
The second piece of this script is the main part where we build the Streamlit app and build the UX components for the end-user, including the filters, table and export functionality.
# Streamlit app
st.title("Superdope Customer Risk Dashboard")
# Database connection (replace 'your_database.db' with your actual database)
connection = sqlite3.connect('your_database.db')
# Date range selector
start_date = st.date_input("Select start date for last purchase", value=datetime(2023, 1, 1))
high_risk_threshold = st.slider("Set the threshold for high-risk customers", min_value=0.0, max_value=1.0, value=0.5)
# Load and filter data based on user input
data = load_data(connection, start_date, high_risk_threshold)
# Display the filtered data
st.subheader("Filtered Customer Data")
st.write(data)
# Option to download the filtered dataframe
st.download_button(
label="Download the filtered data as CSV",
data=data.to_csv(index=False),
file_name='superdope_filtered_customers.csv',
mime='text/csv'
)
# Close the database connection
connection.close()
In this example we used Streamlit, but you can accomplish this with Jupyter notebook plugins like Voilà or using Plotly Dash. These are all powerful solutions, the only two important pieces to keep in mind is that in order to share these reports (or data apps) out with your stakeholders, you’ll need to host these and manage the service along with things like the authentication method. These solutions also quickly run into scalability which may impact how much data you can process and/or your user’s experience.
Another more scalable and convenient option is to use Fabi.ai, which lets you publish interactive Python dashboards in just a few clicks.
What’s next: You’re ready to start jamming with SQL and Python
When you combine SQL and Python together in your data workflows you can seriously upgrade your capabilities as a practitioner. And, with the addition of Polars, you can handle larger datasets and perform more complex transformations faster than ever before. Whether you’re working on ETL processes, real-time data analysis, A/B testing, churn prediction, sales forecasting, or custom reporting, SQL and Python together offer a versatile and efficient solution.
If you’re interested in exploring these use cases, but want to cut down on some of the manual work that goes into advanced analysis (and sharing it with stakeholders), you can check out Fabi.ai for free and get up and running in less than five minutes.