4 practical ways to use SQL and Python together for modern data analysis

TL;DR: Combining SQL and Python creates powerful workflows for advanced data manipulation. This guide demonstrates four practical examples: ETL processes using dbt and Snowpark, A/B test analysis, customer churn prediction, and interactive dashboards. Learn how SQL's data retrieval capabilities complement Python's processing and visualization tools for more efficient data-driven tasks.

In our previous post about exploratory data analysis (EDA), we discussed why integrating SQL and Python is crucial for effective data work. As data practitioners, combining these complementary tools can significantly enhance our analytical capabilities and efficiency.

This guide walks through four practical examples of SQL and Python working together:

  1. Leveraging Python in the transformation layer using dbt and Snowpark
  2. Analyzing statistical significance from A/B test results
  3. Predicting customer churn
  4. Building interactive dashboards with Streamlit

We'll also examine Polars DataFrames, an emerging technology that offers significant performance improvements over traditional data processing approaches when handling large datasets.

Common scenarios for SQL and Python integration

Before diving into specific examples, let's explore key scenarios where SQL and Python work together effectively:

  1. Advanced data integration and ETL (Extract, Transform, Load) processes: Combining SQL's data extraction capabilities with Python's transformation and loading functions allows for a seamless ETL process that can handle large volumes of data efficiently. We'll walk through an example that uses a hybrid SQL and Python approach with dbt and Snowpark.
  2. Data-driven decision-making and A/B Testing: SQL can quickly retrieve experimental data, and Python can be used to perform statistical tests and visualize the results, helping teams make informed decisions faster.
  3. Advanced machine learning and feature engineering: SQL is often used to retrieve large datasets from databases, which can then be fed into Python's machine learning libraries to make predictions. Python also excels at feature engineering, a critical step in building effective models.
  4. Custom reporting and dashboarding: SQL can pull the necessary data, while Python can be used to create dynamic and customizable reports or dashboards, offering a more interactive and user-friendly experience.

This isn’t an exhaustive list, but the four scenarios above represent ripe opportunities to uplevel our analysis work by using SQL and Python in tandem. Let's take a look at some specific examples within each.

Practical examples using SQL and Python

For the examples in this post, we're going to use a fictional company, "Superdope," that sells cool widgets online. Superdope has 200 employees, and data responsibilities are shared between a data engineer and a data scientist. They leverage both SQL and Python throughout their data operations.

1. Building ETL workflows with dbt and Snowpark

In modern data workflows, integrating data from multiple sources, cleaning it, and loading it into a data warehouse is essential. 

Traditionally, SQL has been the go-to tool for these tasks, but now, with advancements like dbt and Snowpark from Snowflake, you can directly integrate Python into your ETL processes. This approach lets you do two things: 

  1. Hybrid transformations: Snowpark lets you write Python code that runs directly within Snowflake so you can perform complex data manipulations and apply machine learning models without leaving your data warehouse environment.
  2. Seamless integration: dbt typically handles SQL-based transformations, but, with Snowpark, you can embed Python into your dbt models and create a powerful hybrid transformation layer that leverages the strengths of both SQL and Python.

So let's take a look at Superdope's use case. They have customer reviews from their online store, and they want to score each customer's sentiment on a regular basis for better targeted marketing.

First we want to retrieve the most recent reviews since we don’t want to process the entire dataset every time:

WITH recent_reviews AS (
    SELECT
        customer_id,
        review_text,
        review_date
    FROM
        {{ ref('superdope_customer_reviews') }}
    WHERE
        review_date >= DATEADD(day, -7, CURRENT_DATE())
)

SELECT
    customer_id,
    review_text,
    review_date
FROM
    recent_reviews

Then we want to apply the sentiment analysis:

from snowflake.snowpark import Session
from textblob import TextBlob

def analyze_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity

def model(dbt, session: Session):
    dbt.config(materialized="table", packages=["textblob"])  # make TextBlob available in the Snowpark environment

    # Load the recent reviews from the last 7 days
    df = dbt.ref("superdope_recent_reviews").to_pandas()

    # Perform sentiment analysis on each review
    df['sentiment_score'] = df['review_text'].apply(analyze_sentiment)

    # Convert back to Snowpark DataFrame and return
    return session.create_dataframe(df)

And then, of course, we run the models:

dbt run --models superdope_recent_reviews superdope_sentiment_analysis

You can see from this example how we were able to do analysis with Python that wouldn't have been possible with SQL alone, and bake it directly into the data pipeline.

2. A/B testing analysis with SQL and Python

It’s hard to make data-driven decisions without using A/B testing at some point. Often, we’re using SQL to retrieve experimental data, and Python to perform statistical analysis and visualize the results. 

ℹ️ Note for Fabi.ai users: In a lot of cases, A/B testing data may come in the form of a file. Even if your data does not live in a data warehouse, you can upload it to Fabi.ai and store it in a temporary database to accomplish the following example.

For this example, let’s say that our friends at Superdope are testing out a new checkout page to increase conversion.

First we want to retrieve the A/B test data and group it by week so that we can analyze the trend:

SELECT 
    user_id, 
    "group",  -- quoted because GROUP is a reserved keyword
    conversion,
    DATE_TRUNC('week', event_date) AS week
FROM 
    ab_test_results
WHERE 
    test_id = 42
GROUP BY 
    week, user_id, "group", conversion;

Then we want to calculate the statistical significance (note: we're not grouping by week here, but doing so would be helpful for plotting the trends of the two groups over time as the experiment runs):

import polars as pl
from scipy.stats import chi2_contingency

# Load data into a Polars DataFrame
df = pl.read_database(query, connection)

# Build a contingency table by grouping on 'group' only: converted vs. not-converted counts
contingency_table = (
    df.group_by("group")
    .agg(
        pl.col("conversion").sum().alias("converted"),
        (pl.len() - pl.col("conversion").sum()).alias("not_converted"),
    )
    .select("converted", "not_converted")
    .to_numpy()
)

# Perform the chi-square test of independence
chi2, p, _, _ = chi2_contingency(contingency_table)

print(f"Chi-square test p-value: {p}")
if p < 0.05:
    print("Statistically significant difference between groups")
else:
    print("No significant difference between groups")
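
As an aside, if you keep the weekly grouping from the SQL above, it's easy to chart how each group trends over time while the experiment runs. Here's a minimal sketch using Plotly Express, assuming the query results are loaded into the Polars DataFrame df from the snippet above, with week, group, and conversion columns:

import plotly.express as px
import polars as pl

# Average conversion rate per week and group
weekly = (
    df.group_by("week", "group")
    .agg(pl.col("conversion").mean().alias("conversion_rate"))
    .sort("week")
)

# Plotly Express works with pandas, so convert before plotting
fig = px.line(
    weekly.to_pandas(),
    x="week",
    y="conversion_rate",
    color="group",
    title="Weekly conversion rate by test group",
)
fig.show()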

If your A/B test is just an export from your marketing platform, you can obviously skip the SQL step, but if you do store these results in a database or data warehouse, you may find this handy. You can see how we were able to quickly filter on the relevant experiment and measure the significance in just a few lines of Python. 

If you're on a RevOps or MarketingOps team and wondering why you wouldn't do this in dbt and store the results back in your data warehouse, it's because A/B tests tend to be more ephemeral. You may not want to go through the process of baking them into your model, which can also require advanced data engineering skills that you may not have readily available.

3. Churn prediction with machine learning

Machine learning is an extremely powerful tool for predicting future outcomes and is something you can use to do custom lead scoring, sales forecasting, or even churn prediction. For the example below, we’ll focus on the churn prediction use case. 

When building machine learning models, we can use SQL to efficiently retrieve large datasets from databases, which we can then feed into Python's machine learning libraries like scikit-learn to make the prediction. Additionally, Python's flexibility allows for advanced feature engineering, which you may need depending on the quality of your data and the types of machine learning models you decide to use. For example, if you're doing a regression analysis, there's a good chance you'll have to find some way to handle null values or categorical values.
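
For instance, here's a minimal sketch of those two cleanup steps in pandas; the query, connection, and column names like plan_type are placeholders:

import pandas as pd

# Pull hypothetical customer features via SQL
df = pd.read_sql(query, connection)

# Fill missing numeric values (e.g., customers with no reviews yet)
df["avg_review_sentiment"] = df["avg_review_sentiment"].fillna(0.0)

# One-hot encode a categorical column such as a subscription plan
df = pd.get_dummies(df, columns=["plan_type"], drop_first=True)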

ℹ️ A note on Polars here: it doesn't integrate well with a lot of machine learning libraries like scikit-learn. Support for common libraries is constantly improving, but as of the time of writing this post, it's easier to work with more traditional pandas DataFrames.

OK, now let's look at the example for Superdope. They want to use the customer sentiment score we calculated earlier to try to predict high-risk customers.

First, they'll need to retrieve customer information and create some of the baseline features to use in the model. In this simple example, we'll use average order value (AOV), number of purchases in the last 90 days, and average sentiment score:

SELECT orders.customer_id, 
       AVG(order_value) AS avg_order_value,
       COUNT(order_id) FILTER (WHERE order_date >= NOW() - INTERVAL '90 days') AS total_purchases_last_90_days,
       AVG(review_sentiment) AS avg_review_sentiment
FROM orders
JOIN reviews ON orders.customer_id = reviews.customer_id
GROUP BY orders.customer_id;

Then, we'll build a composite "health_score," which we can use to label customers as "high risk" or not. Once we've done that, we can build our prediction model using a simple random forest classifier (or substitute whatever model you'd rather use here).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data into a Pandas DataFrame
df = pd.read_sql(query, connection)

# Feature engineering for customer health score
df['health_score'] = (
    0.4 * df['avg_order_value'] + 
    0.3 * df['total_purchases_last_90_days'] + 
    0.3 * df['avg_review_sentiment']
)

# Define a threshold for labeling customers as 'high risk' of churn
threshold = df['health_score'].quantile(0.25)  # e.g., flag the bottom quartile of health scores
df['high_risk'] = (df['health_score'] < threshold).astype(int)

# Prepare data for modeling
X = df[['avg_order_value', 'total_purchases_last_90_days', 'avg_review_sentiment']].values
y = df['high_risk'].values  # Binary column indicating high-risk customers

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model to predict high-risk customers
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

Once you’ve built and tested this model, you can either do a one-off analysis on customers that aren’t yet scored, or you can consider inserting this prediction into your data model (see example above about doing this in dbt).
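
For the one-off route, a minimal sketch of scoring customers might look like this, reusing the trained model and connection from above (the unscored_customers table is a placeholder):

# Score customers that don't yet have a risk label
new_df = pd.read_sql("SELECT * FROM unscored_customers", connection)

features = ['avg_order_value', 'total_purchases_last_90_days', 'avg_review_sentiment']
new_df['high_risk_prediction'] = model.predict(new_df[features].values)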

4. Interactive dashboards with Streamlit

Custom reporting and dashboarding are essential for delivering insights to stakeholders. Often, data practitioners deliver this reporting via a business intelligence (BI) tool. 

Unfortunately, most BI solutions don't offer integrated Python kernels. So if you need to do some advanced data analysis that's best done (or can only be done) using Python, the analysis has to happen outside of the BI tool, be written back to your data warehouse, and then be loaded into the BI tool. That's a lot of heavy lifting, especially if you're still in the exploratory phase.

If we're looking to customize our reporting beyond the capabilities of common BI tools, we can use SQL to pull the necessary data and Python to create dynamic, customizable reports or dashboards. This approach makes it possible to integrate machine learning into our reporting or create richer interactions with charts and tables in the user interface.

Let’s go back to Superdope for this last example. They built their customer risk scoring model, they were able to write the “high_risk” field back to their data warehouse, and now their stakeholders want an internal tool to pull this data. We’ll keep this example very simple, but with data apps and Python, you can start building all sorts of interesting tools to, for example, let your users upload files and move away from relying on data being in the warehouse. 
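
As a quick aside before we get to the dashboard itself, letting users upload their own file in Streamlit can be as simple as this sketch:

import streamlit as st
import pandas as pd

# Let users bring their own data instead of relying on the warehouse
uploaded_file = st.file_uploader("Upload a customer CSV", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(df)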

In the first piece of this script, we write a function to retrieve the data from the data warehouse and store it as a DataFrame.

import streamlit as st
import pandas as pd
import sqlite3
from datetime import datetime

# Function to connect to the database and retrieve data
def load_data(connection, min_date, high_risk_threshold):
    # '?' is the SQLite parameter placeholder; the date is passed as an ISO string
    query = """
    SELECT customer_id, last_purchase_date, high_risk 
    FROM superdope_scored_customers
    WHERE last_purchase_date >= ?
    """
    df = pd.read_sql(query, connection, params=(str(min_date),))
    
    # Adjust high_risk based on user-defined threshold
    df['high_risk'] = df['high_risk'].astype(float) >= high_risk_threshold
    
    return df

The second piece of the script is the main part, where we build the Streamlit app and its UX components for the end user, including the filters, table, and export functionality.

# Streamlit app
st.title("Superdope Customer Risk Dashboard")

# Database connection (replace 'your_database.db' with your actual database)
connection = sqlite3.connect('your_database.db')

# Date range selector
start_date = st.date_input("Select start date for last purchase", value=datetime(2023, 1, 1))
high_risk_threshold = st.slider("Set the threshold for high-risk customers", min_value=0.0, max_value=1.0, value=0.5)

# Load and filter data based on user input
data = load_data(connection, start_date, high_risk_threshold)

# Display the filtered data
st.subheader("Filtered Customer Data")
st.write(data)

# Option to download the filtered dataframe
st.download_button(
    label="Download the filtered data as CSV",
    data=data.to_csv(index=False),
    file_name='superdope_filtered_customers.csv',
    mime='text/csv'
)

# Close the database connection
connection.close()

In this example we used Streamlit, but you can accomplish this with Jupyter notebook plugins like Voilà or with Plotly Dash. These are all powerful solutions, but there are two important things to keep in mind: to share these reports (or data apps) with your stakeholders, you'll need to host them and manage the service, along with things like the authentication method. These solutions also run into scalability limits fairly quickly, which may impact how much data you can process and your users' experience.

Another more scalable and convenient option is to use Fabi.ai, which lets you publish interactive Python dashboards in just a few clicks. 

Best practices for SQL and Python integration

When combining SQL and Python in your data workflows, consider these best practices:

  1. Efficient data loading:
    • Use parameterized queries to prevent SQL injection
    • Implement chunking for large datasets (see the sketch after this list)
    • Consider using Polars for improved performance
  2. Data processing:
    • Leverage SQL for initial data filtering and aggregation
    • Use Python for complex transformations
    • Implement proper error handling
  3. Pipeline optimization:
    • Cache intermediate results
    • Implement proper logging
    • Monitor memory usage
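
To make the first two data-loading points concrete, here's a minimal sketch of a parameterized, chunked read with pandas; the connection, table, and column names are placeholders, and the '?' placeholder style depends on your database driver:

import pandas as pd

# Stream a parameterized query in chunks instead of loading everything at once
chunks = pd.read_sql(
    "SELECT order_id, order_value FROM orders WHERE order_date >= ?",
    connection,
    params=("2024-01-01",),
    chunksize=50_000,
)

total_revenue = 0.0
for chunk in chunks:
    total_revenue += chunk["order_value"].sum()

print(f"Total revenue since 2024-01-01: {total_revenue:,.2f}")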

What’s next: Making SQL and Python work together

The combination of SQL and Python creates powerful data analysis workflows. Whether you're building ETL processes, performing statistical analysis, implementing machine learning models, or creating interactive dashboards, these tools complement each other perfectly.

For those looking to streamline their data analysis while maintaining robust capabilities, consider exploring Fabi.ai for free. It provides an integrated environment for working with SQL and Python, and you can get up and running in less than five minutes. 
