Why (and how) to use SQL and Python in tandem
TL;DR: To query a remote SQL server and analyze the results using Python (pandas), you should leverage SQLAlchemy for your database connection and pd.read_sql to store the results as a DataFrame. From there, you can easily analyze your data using common Python packages.
When working with large datasets stored in SQL databases or data warehouses, extracting, transforming, and analyzing that data directly in Python can supercharge your data analysis workflow. Rather than being limited to the capabilities of SQL alone, you can leverage powerful Python libraries like pandas to unlock advanced analytical techniques, data visualization, and machine learning workflows—all while seamlessly pulling data directly from your remote SQL database.
If you’re more of a visual learner, check out our accompanying YouTube tutorial for a step-by-step video walkthrough:
Why use Python to connect to a SQL database?
In modern data science and analytics, data frequently lives in relational databases or data warehouses such as MySQL, PostgreSQL, BigQuery, Microsoft SQL Server, or Snowflake. These systems are designed for efficient data storage and retrieval, often accessible via SQL (Structured Query Language).
While SQL is powerful for querying and joining tables, it does have limitations when it comes to certain statistical analyses, complex transformations, or plotting. For example, consider a dataset of basketball player stats across multiple seasons. You might want to identify defensive and offensive outliers—players who consistently outperform or underperform their peers.
Doing this entirely in SQL can become cumbersome. With Python’s ecosystem of libraries, you can:
- Easily calculate advanced metrics like z-scores
- Create visually appealing plots directly from your query results
- Integrate machine learning models to predict future player performance
Here’s what this might look like:
In this chart, produced using plotly, we’ve identified the outliers using the scipy stats module, and we can clearly see that Kobe Bryant was an outlier. This would have been much more difficult to spot and convey using just SQL.
In this tutorial, we’re going to use SQLAlchemy to quickly connect to a remote server, store the data as a pandas DataFrame using pd.read_sql, and then analyze and plot the data.
Connecting to your relational database/data warehouse and analyzing your data in Python
1. SQLAlchemy and other alternatives
To access your data, you typically need a "driver" or "connector" library specific to your database. For example:
- MySQL: mysql-connector-python or PyMySQL
- PostgreSQL: psycopg2
- BigQuery: google-cloud-bigquery
- Microsoft SQL Server: pyodbc or mssql+pyodbc with sqlalchemy
However, using these drivers directly can get a bit messy. That’s where SQLAlchemy comes in. SQLAlchemy is a Python SQL toolkit and Object Relational Mapper (ORM) that provides a consistent interface for connecting to a wide array of databases, so you don’t have to worry about writing vendor-specific code for each one. It also integrates nicely with pandas.
Why use SQLAlchemy?
- Unified interface: You can use similar code to connect to different databases.
- ORM capabilities: If you want to scale beyond simple queries, SQLAlchemy’s ORM can map database tables to Python objects (see the short sketch after this list).
- Cleaner code: Establishing connections and executing queries feels more “Pythonic.”
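As a quick illustration of the ORM style, here’s a minimal sketch using SQLAlchemy 2.0 declarative mapping. The players table and its columns are hypothetical, and you won’t need any of this for the rest of the tutorial:

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

# Hypothetical table: each row of "players" becomes a Player object
class Player(Base):
    __tablename__ = "players"

    id: Mapped[int] = mapped_column(primary_key=True)
    player_name: Mapped[str]
    average_points_scored: Mapped[float]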
Getting started is simple. Open up your terminal and install SQLAlchemy using pip or your preferred installation method:
pip install sqlalchemy
Depending on your database, you’ll need to install the appropriate driver. For Microsoft SQL Server, a common setup is:
pip install pyodbc
For other databases, you might need:
- PostgreSQL: pip install psycopg2
- MySQL: pip install pymysql
- BigQuery: pip install sqlalchemy-bigquery (the SQLAlchemy dialect, which also pulls in google-cloud-bigquery)
Example connection (e.g. to Microsoft SQL Server):
from sqlalchemy import create_engine
# Example: connecting to a remote SQL Server database
# Replace <username>, <password>, <host>, <database>
connection_string = "mssql+pyodbc://<username>:<password>@<host>/<database>?driver=ODBC+Driver+17+for+SQL+Server"
engine = create_engine(connection_string)
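Connection URLs for other databases follow the same pattern (for example, postgresql+psycopg2://<username>:<password>@<host>/<database> or mysql+pymysql://<username>:<password>@<host>/<database>). If you want to verify the connection before moving on, a quick sanity check might look like this (a sketch, assuming the server in the connection string above is reachable):

from sqlalchemy import text

# Run a trivial query to confirm the engine can reach the database
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1"))
    print(result.scalar())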
That’s it! You now have an engine object that can run queries against your remote database. Let’s take a closer look at how to use SQLAlchemy and pandas DataFrames with a practical example.
A note on other ORMs:
- If you’re building a straightforward Django app, the Django ORM’s integration might be better suited for your needs
- For simpler or async-oriented projects, Peewee, Tortoise, Ormar, or Gino could be better fits
However, if you’re just getting started and want to keep it simple, SQLAlchemy is the way to go.
2. Querying your database or data warehouse and storing the results as a Python Pandas DataFrame (pd.read_sql)
If you’re retrieving relational data for analytics, you’ll likely want to work with it as a pandas DataFrame. Pandas DataFrames are tabular data structures that make it easy to clean, transform, and visualize your data. Another emerging library worth exploring is Polars, but pandas remains the most commonly used option today.
One straightforward way to retrieve SQL query results directly into a DataFrame is the pd.read_sql function. This function allows you to pass a SQL query and a database connection object (or engine) and returns a DataFrame.
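In its simplest form, the pattern looks something like this. This is a minimal sketch assuming a local SQLite file named example.db with a players table; swap in your own engine and query:

from sqlalchemy import create_engine
import pandas as pd

# Any SQLAlchemy engine works here; SQLite is just the easiest to try locally
engine = create_engine("sqlite:///example.db")
df = pd.read_sql("SELECT * FROM players", engine)
print(df.head())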
Using SQLAlchemy and pd.read_sql, this is what your entire script looks like if you’re querying BigQuery:
from sqlalchemy import create_engine
import pandas as pd

def query_bigquery_with_sqlalchemy(project_id, query):
    # Create the SQLAlchemy engine for BigQuery
    engine = create_engine(f"bigquery://{project_id}")
    try:
        # Use pd.read_sql to execute the query and load results into a DataFrame
        df = pd.read_sql(query, engine)
        print("Query completed successfully!")
        return df
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def main():
    # Replace with your project ID
    project_id = "{your_project_id}"

    # Example query (replace with your query)
    query = """
        SELECT
            *
        FROM
            `{your_database.your_table}`
    """

    # Execute the query and get the results as a DataFrame
    df = query_bigquery_with_sqlalchemy(project_id, query)
    print(df)

if __name__ == "__main__":
    main()
In this example we’re simply printing the results of a "SELECT *" query on a single table. The query_bigquery_with_sqlalchemy function is the one to focus on: it takes the project ID and query as input and returns a DataFrame.
In order for this to run, you will need to generate service account credentials and add these to your environment variables. In the case of BigQuery, I had to generate the JSON credentials and set the path to my credentials as an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path_to_your_credentials/your_credentials.json"
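If you prefer to keep everything in the script, you can also set the variable from Python before creating the engine. This is just a sketch; the path and filename are placeholders:

import os

# Must be set before the BigQuery engine/client is created
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_your_credentials/your_credentials.json"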
In the case of BigQuery, you’ll also need to provide your project ID.
Save the script above in a file of your choosing (for example, your_analysis.py).
Going back to pd.read_sql: without it, you would have to execute the query and build the DataFrame yourself, which looks something like this:
# Execute the query yourself, then convert the results to a DataFrame
with engine.connect() as conn:
    result = conn.execute(text(query))  # text() comes from sqlalchemy
    rows = result.fetchall()
    columns = result.keys()
df = pd.DataFrame(rows, columns=columns)
As you can see, pd.read_sql consolidates fetching and DataFrame creation into one convenient step, making your code cleaner and easier to maintain.
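One related convenience worth knowing: pd.read_sql can also forward bind parameters to SQLAlchemy, which keeps user-supplied values out of the SQL string itself. A small sketch (the players table and season column are hypothetical):

from sqlalchemy import text

df = pd.read_sql(
    text("SELECT * FROM players WHERE season = :season"),
    engine,
    params={"season": 2015},
)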
Analyzing your query results using pandas (or Polars)
Now that you’ve got your data as a DataFrame (df), you can perform complex analysis using pandas. Let’s say you want to identify outlier players—those who are unusually high or low in certain metrics.
Step 1: Aggregate Player Stats
player_stats = df.groupby('player_name').agg({
    'season': 'nunique',
    'average_points_scored': 'mean',
    'defensive_reb_percent': 'mean'
}).reset_index()

# Filter for players with > 5 seasons
player_stats = player_stats[player_stats['season'] > 5]
Step 2: Calculate a Z-score
To find outliers, we might use a z-score to measure how far each player’s performance deviates from the mean:
# Requires numpy (np) and scipy.stats (stats); see the full script below for the imports
z_scores = np.abs(stats.zscore(player_stats[['average_points_scored', 'defensive_reb_percent']]))
outliers = (z_scores > 3).any(axis=1)
Players with a high z-score in points, assists, or rebounds might be considered offensive outliers, while those with particularly low stats might be defensive specialists or just underperformers—context and domain knowledge matter here!
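To see who actually got flagged before plotting, you can filter the aggregated frame with the boolean mask (continuing from the variables above):

# Peek at the flagged players and the metrics that triggered the flag
print(player_stats.loc[outliers, ['player_name', 'average_points_scored', 'defensive_reb_percent']])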
Step 3: Plot the Data
With the DataFrame in hand, you can visualize these outliers using plotting libraries like plotly, seaborn, bokeh or matplotlib.
# Create the scatter plot, coloring outliers differently (px is plotly.express)
fig = px.scatter(player_stats,
                 x='average_points_scored',
                 y='defensive_reb_percent',
                 width=800,
                 height=450,
                 color=outliers.astype(str),
                 color_discrete_map={'True': '#FF8A80', 'False': '#80D8FF'},
                 hover_data=['player_name', 'season'])
# Update layout
fig.update_layout(
    title='Average Points vs Defensive Rebounds (Players with >5 Seasons)',
    xaxis_title='Average Points per Season',
    yaxis_title='Defensive Rebound Percentage',
    showlegend=False
)

fig.show()
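If you’re running this as a plain script rather than in a notebook, fig.show() will try to open a browser tab. You can also write the chart to a standalone HTML file instead (the filename here is just an example):

# Save the interactive chart to an HTML file you can open or share
fig.write_html("player_outliers.html")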
Then run your script with the following command (using whatever filename you chose earlier):
python your_analysis.py
And you should see the chart from earlier in this post.
Here's what the complete script looks like:
from sqlalchemy import create_engine
import pandas as pd
import plotly.express as px
import numpy as np
from scipy import stats

def query_bigquery_with_sqlalchemy(project_id, query):
    # Create the SQLAlchemy engine for BigQuery
    engine = create_engine(f"bigquery://{project_id}")
    try:
        # Use pd.read_sql to execute the query and load results into a DataFrame
        df = pd.read_sql(query, engine)
        print("Query completed successfully!")
        return df
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def main():
    # Replace with your project ID
    project_id = "{your_project_id}"

    # Example query (replace with your query)
    query = """
        SELECT
            *
        FROM
            `{your_database.your_table}`
    """

    # Execute the query and get the results as a DataFrame
    df = query_bigquery_with_sqlalchemy(project_id, query)

    ## Do some interesting analysis on the queried data
    # Group and aggregate the data
    player_stats = df.groupby('player_name').agg({
        'season': 'nunique',
        'average_points_scored': 'mean',
        'defensive_reb_percent': 'mean'
    }).reset_index()

    # Filter for players with > 5 seasons
    player_stats = player_stats[player_stats['season'] > 5]

    # Calculate z-scores for outlier detection
    z_scores = np.abs(stats.zscore(player_stats[['average_points_scored', 'defensive_reb_percent']]))
    outliers = (z_scores > 3).any(axis=1)

    # Create scatter plot
    fig = px.scatter(player_stats,
                     x='average_points_scored',
                     y='defensive_reb_percent',
                     width=800,
                     height=450,
                     color=outliers.astype(str),
                     color_discrete_map={'True': '#FF8A80', 'False': '#80D8FF'},
                     hover_data=['player_name', 'season'])

    # Update layout
    fig.update_layout(
        title='Avg. Points vs Avg. Def. Rebounds',
        xaxis_title='Average Points per Season',
        yaxis_title='Defensive Rebound Percentage',
        showlegend=False
    )

    fig.show()

if __name__ == "__main__":
    main()
3. Considering Polars for Larger Datasets
If you’re dealing with very large datasets and need more performance than pandas can offer, consider using Polars—a next-generation DataFrame library written in Rust. It’s still under heavy development, but early benchmarks show it to be faster and more memory-efficient than pandas in many cases.
Polars also supports SQL-like queries and integrates well with Python, although the ecosystem and tooling aren’t as mature as pandas yet.
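As a rough sketch of what this can look like, assuming a recent Polars version (pl.read_database can accept a SQLAlchemy engine or connection, and pl.from_pandas converts an existing pandas DataFrame; the players table here is hypothetical):

import polars as pl

# Read directly from the database, reusing the SQLAlchemy engine from earlier...
pl_df = pl.read_database("SELECT * FROM players", engine)

# ...or convert a pandas DataFrame you already have
pl_df = pl.from_pandas(df)
print(pl_df.head())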
Recap and key takeaways
By connecting Python to your remote SQL server or data warehouse, you combine the power of SQL’s data retrieval capabilities with Python’s flexible data analysis ecosystem. Using SQLAlchemy to establish a stable, database-agnostic connection, and then leveraging pd.read_sql to pull query results into a pandas DataFrame, allows you to quickly and easily perform complex transformations, statistical analyses, and data visualizations.
Key Takeaways:
- Use "pip install sqlalchemy" and appropriate database drivers to connect to various databases.
- Once connected, pandas read_sql (or pd.read_sql) simplifies getting query results into a DataFrame.
- After loading the data into a DataFrame, you can perform powerful analyses, outlier detection, aggregations, and create visualizations that would be cumbersome with pure SQL.
- Consider Polars for faster performance on large datasets or more complex transformations.
We’ve built Fabi.ai with one goal: making data retrieval, analysis, and distribution incredibly easy. As part of that, we’ve abstracted away the complexity of connecting to remote SQL servers and preparing the data for downstream analysis. With our data warehouse connectors, you can easily write a query and seamlessly analyze it using Python and AI. You can get started for free in less than 2 minutes.