Pandas read_csv(): How to read CSV files into Python for modern data analysis
TL;DR: If you’re just here for a quick start guide, skip ahead to the "Getting started 101" section below. In short, it only takes three lines of code:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
This will read your CSV file into a pandas DataFrame and print the first 10 rows. For more details, keep reading!
The read_csv() function in pandas is a cornerstone for anyone working with tabular data in Python. Whether you’re analyzing sales numbers, processing user data, or manipulating datasets for machine learning, read_csv is often your starting point when working with local files. This guide will cover everything you need to know, from the basics to advanced usage.
Basic Requirements:
- A comma-separated values (CSV) file
- A basic understanding of Python
- Python installed locally on your laptop or machine
If you don’t have Python installed or aren’t ready to dive into code, we’ll touch on alternative methods at the end of this article.
Pandas read_csv(): Getting started 101
If your CSV file is clean and simple, you can start analyzing it with just three lines of code:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Let’s break this down:
- Import pandas as pd to load the library
- Use pd.read_csv('data.csv') to read the CSV file into a pandas DataFrame
- Print the first 10 rows using print(df.head(10))
Important: The read_csv() call on the second line assumes the file is in the same folder you’re running your Python script from. In other words, if your “my_script.py” script lives in the “my_project” folder, “data.csv” needs to be in “my_project” as well, not anywhere else or in a sub-folder. If you do keep data in a sub-folder, such as “my_project/project_data” (which is generally best practice), you’ll need to modify that line of code:
df = pd.read_csv('project_data/data.csv')
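If you want the script to work no matter which folder you launch it from, one option (a small sketch using the standard library) is to build the path relative to the script file itself:
from pathlib import Path

# Resolve the data file relative to this script's location,
# not the folder Python happens to be run from
data_path = Path(__file__).parent / 'project_data' / 'data.csv'
df = pd.read_csv(data_path)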
Pandas read_csv: Useful and common parameters
While the basic usage is straightforward, real-world datasets often come with quirks. Let’s explore some common read_csv parameters that make handling complex datasets easier.
Before starting with Python
If you’re new to Python, here are quick prerequisites:
- Install Python (https://python.org/downloads)
- Install pandas using pip install pandas
- Create and save Python scripts using your favorite code editor (e.g., VS Code, PyCharm, or Jupyter Notebook).
If you’re comfortable with these steps, feel free to skip ahead.
Common read_csv parameters
- delimiter: Use this if your file isn’t comma-separated (e.g., tab-separated files):
df = pd.read_csv('data.tsv', delimiter='\t')
- header: Specify which row to use as column names (the default is the first row):
df = pd.read_csv('data.csv', header=0) # First row as header
- usecols: Select specific columns to read:
df = pd.read_csv('data.csv', usecols=['column1', 'column2'])
- index_col: Set a specific column as the index:
df = pd.read_csv('data.csv', index_col='id')
- na_values: Handle missing values:
df = pd.read_csv('data.csv', na_values=['NA', 'N/A', 'null'])
- nrows: Limit the number of rows read:
df = pd.read_csv('data.csv', nrows=100)
- dtype: Enforce specific data types for columns:
df = pd.read_csv('data.csv', dtype={'id': int, 'price': float})
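These parameters can be combined freely. As a sketch, assume a hypothetical sales.csv file with id, region, and amount columns:
df = pd.read_csv(
    'sales.csv',                      # hypothetical example file
    usecols=['id', 'region', 'amount'],
    index_col='id',
    dtype={'region': 'string', 'amount': float},
    na_values=['NA', 'N/A', 'null'],
    nrows=1000,                       # sample the first 1,000 rows while exploring
)
print(df.head())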
Advanced features of read_csv
Reading large files efficiently
For very large datasets, consider reading in chunks:
for chunk in pd.read_csv('large_data.csv', chunksize=1000):
    process(chunk)  # replace process() with your own per-chunk logic
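As a concrete illustration (a sketch assuming a hypothetical numeric 'amount' column), you can aggregate each chunk as it streams in rather than holding the entire file in memory:
total = 0.0
row_count = 0
for chunk in pd.read_csv('large_data.csv', chunksize=1000):
    total += chunk['amount'].sum()  # 'amount' is a hypothetical column
    row_count += len(chunk)
print(f"Average amount: {total / row_count:.2f}")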
Memory and type inference:
By default, pandas infers column types from internal chunks of the file, which can trigger mixed-type warnings on large files. Passing low_memory=False makes pandas infer each column’s type from the whole file at once, at the cost of more memory; specifying dtype explicitly is often the better fix.
Handling different file encodings
If your file isn’t UTF-8 encoded, specify the encoding:
df = pd.read_csv('data.csv', encoding='ISO-8859-1')
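If you don’t know the encoding ahead of time, one option (a sketch using the third-party chardet package, installed with pip install chardet) is to sniff a sample of the raw bytes first:
import chardet

# Guess the encoding from the first ~100 KB of raw bytes
with open('data.csv', 'rb') as f:
    guess = chardet.detect(f.read(100_000))

df = pd.read_csv('data.csv', encoding=guess['encoding'])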
Parsing dates
Many datasets include date fields. To ensure proper parsing, use the parse_dates parameter:
df = pd.read_csv('data.csv', parse_dates=['date_column'])
This converts the specified column to datetime format, making it easier to filter and analyze.
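Once parsed, you can filter and group on the dates directly. A quick sketch, using the same hypothetical date_column:
df = pd.read_csv('data.csv', parse_dates=['date_column'])

# Keep only rows from 2024 onward, then count rows per month
recent = df[df['date_column'] >= '2024-01-01']
print(recent.groupby(recent['date_column'].dt.to_period('M')).size())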
Error handling with bad lines
If your CSV contains problematic lines, you can use on_bad_lines:
df = pd.read_csv('data.csv', on_bad_lines='skip')
This skips lines that cause errors during parsing (on_bad_lines was added in pandas 1.3; older versions used error_bad_lines=False). Passing 'warn' instead emits a warning for each skipped line so you know what was dropped.
Other common scenarios with examples
Reading password-protected files
If your CSV is compressed and password-protected, tools like pyminizip or zipfile can help. First, extract the file and then use pd.read_csv:
import zipfile

# Extract the archive using its password, then read the extracted CSV
with zipfile.ZipFile('data.zip') as z:
    z.extractall(pwd=b'password')
df = pd.read_csv('data.csv')
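If you’d rather not write the extracted file to disk, you can also read a member straight from the archive. A small sketch, assuming the archive contains data.csv (note that the built-in zipfile module only handles the legacy ZipCrypto encryption, not AES):
import zipfile

with zipfile.ZipFile('data.zip') as z:
    # Stream the CSV member directly out of the archive
    with z.open('data.csv', pwd=b'password') as f:
        df = pd.read_csv(f)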
Merging multiple CSV files
For scenarios where you have multiple CSV files to process, pandas makes it simple:
import glob
all_files = glob.glob("data_folder/*.csv")
combined_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
This concatenates all files in the folder into a single DataFrame.
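A common variation (sketched below) tags each row with the file it came from, so records can be traced back to their source:
import glob
import os

frames = []
for f in glob.glob("data_folder/*.csv"):
    frame = pd.read_csv(f)
    frame['source_file'] = os.path.basename(f)  # track each row's origin
    frames.append(frame)
combined_df = pd.concat(frames, ignore_index=True)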
Working with compressed CSVs
Compressed files like .zip, .gz, or .bz2 can be read directly using read_csv:
df = pd.read_csv('data.csv.gz', compression='gzip')
Pandas infers the compression from the file extension by default, so the compression argument is optional here; it’s mainly useful when the extension doesn’t match the format. Either way, this saves time and disk space when handling large files.
Combining with other libraries
For enhanced data manipulation, combine pandas with libraries like numpy and matplotlib:
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
print(np.mean(df['column1']))
plt.hist(df['column1'])
plt.show()
This approach bridges data manipulation with statistical analysis and visualization. We touch on creating histograms and other similar charts with Python pandas in detail in this post.
Alternatives to using Python Pandas to read CSV files
If Python isn’t installed or you’re looking for quick analysis options, here are alternatives:
1. ChatGPT or Claude
Upload your CSV file and ask for analysis. These AI platforms can provide quick insights.
2. Google Sheets
- Drag and drop your CSV file into Google Sheets for a no-code solution.
- Use Google Sheets' built-in analysis tools.
3. AI platforms like Fabi.ai
Fabi.ai allows you to:
- Upload and analyze CSV files directly.
- Generate insights and visualizations without writing code while still leveraging the power of Python for advanced data analysis and visualization.
- Seamlessly integrate with other tools for advanced workflows.
4. BI tools
Tools like Tableau and Power BI can also handle CSVs for quick visualization and reporting. They’re ideal for non-programmers or teams needing dashboard capabilities.
Further learning: Pro tips and other file formats
CSV files show up in nearly every data analysis; however, there are many other formats you’ll likely come across in the wild. In particular, when dealing with larger enterprise data, you’ll often find yourself working with larger datasets stored remotely in a data warehouse.
Beyond CSV: Exploring other formats
- Excel files:
df = pd.read_excel('data.xlsx')
- Google Sheets: Use the Google Sheets Python API or Fabi.ai connectors to fetch data programmatically.
- JSON files: We cover in more detail how to conveniently convert JSON to a tabular format in this post.
df = pd.read_json('data.json')
Querying data warehouses
If you’re conducting data analysis in an enterprise setting, there’s a very good chance you’ll quickly graduate past CSV files and into structured data warehouses, which can hold much larger volumes of data. Python and pandas offer plenty of ways to work with SQL data warehouses, with some real benefits we’ve covered in the past.
Some of the benefits of using a data warehouse instead of CSV files:
- Avoiding versioning issues since you can constantly pull the latest data in a programmatic way.
- Real-time data analysis without tedious file transfers.
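To make that concrete, here’s a minimal sketch of pulling query results into a DataFrame, using Python’s built-in sqlite3 as a stand-in for a real warehouse connection (example.db and the sales table are hypothetical; in practice you’d use your warehouse’s own driver or SQLAlchemy):
import sqlite3
import pandas as pd

# A local SQLite database standing in for a warehouse connection
conn = sqlite3.connect('example.db')
df = pd.read_sql('SELECT * FROM sales WHERE amount > 100', conn)
conn.close()
print(df.head())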
Using DuckDB for SQL queries on files
DuckDB is a fast, in-process analytical database with a lot of features designed specifically to make data analysis a breeze, and it works particularly well on small and medium-sized datasets. You can run DuckDB locally or leverage the DuckDB functionality embedded directly within Fabi.ai.
To run DuckDB locally and query a CSV file directly:
import duckdb
query = "SELECT * FROM 'data.csv' WHERE column1 > 100"
df = duckdb.query(query).df()
This lets you combine the power of SQL and Python for analysis while still working with CSV files, and it handles larger CSVs particularly well.
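DuckDB can also query pandas DataFrames that are in scope simply by referring to them by variable name, which makes mixing SQL and Python especially convenient. A sketch, assuming data.csv has the same hypothetical column1 as above:
import duckdb
import pandas as pd

# A small in-memory DataFrame; DuckDB can reference it by variable name
labels = pd.DataFrame({'column1': [150, 200], 'label': ['a', 'b']})

result = duckdb.query(
    "SELECT c.*, l.label FROM 'data.csv' AS c JOIN labels AS l USING (column1)"
).df()
print(result.head())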
Conclusion
You can get up and running with pd.read_csv() in a few lines of code, and handy parameters like delimiter, header, and nrows cover the more common scenarios such as custom headers and different delimiters.
Beyond using read_csv() to load a CSV file into a pandas DataFrame, you may want to consider AI tools like Claude or ChatGPT if you’re just doing some simple data analysis, or perhaps a BI solution or dedicated data analysis platform. Once you’re ready to work with larger enterprise data, learning how to convert the output of SQL queries into a pandas DataFrame is a great next step.
If you’d like to start analyzing your CSV file immediately, with no initial Python setup and with AI assistance, we invite you to try out Fabi.ai for free and start getting insights in under 2 minutes.