TL;DR: Use Matplotlib's plt.hist() to get started quickly with pandas histograms. Consider Seaborn or Plotly for more visually appealing or interactive charts. Alternatives to histograms include box plots, violin plots and hexbins.
Histograms are one of the most fundamental and widely used visualization tools in data analysis. Whether you're exploring the distribution of numerical data or comparing datasets, histograms provide a quick and intuitive way to understand patterns and trends. In this post, we’ll walk you through how to create histograms using Python’s pandas library, explore advanced visualization techniques, and discuss alternatives that can offer deeper insights for specific scenarios.
If you’re more of a visual learner or want to see some examples in action, check out our video:
What is a histogram?
A histogram is a type of bar chart that displays the distribution of numerical data by grouping values into intervals (or "bins"). It’s an essential tool for:
- Understanding the spread and central tendency of data.
- Identifying outliers or anomalies.
- Comparing distributions across datasets.
Each bar in a histogram represents the frequency of data points within a specific range, making it easy to visualize patterns, skewness, and variability at a glance.
For example, if you’re analyzing customer age data for a product, a histogram can show you the most common age groups, helping guide targeted marketing strategies. Histograms are a fundamental tool for exploratory data analysis and storytelling.
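To make the idea of bins concrete, here is a minimal sketch that counts values per interval with NumPy. The ages and bin edges are made up for illustration and do not come from a real dataset.

```python
import numpy as np

# Hypothetical customer ages, invented for this example
ages = np.array([22, 25, 31, 35, 38, 41, 47, 52, 58, 63])

# Bin edges define the intervals: [20, 30), [30, 40), ..., [60, 70]
bin_edges = [20, 30, 40, 50, 60, 70]

# np.histogram returns the number of values in each bin and the edges used
counts, edges = np.histogram(ages, bins=bin_edges)

# Print a small text histogram: one row per bin
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {'#' * int(count)} ({count})")
```

Each row corresponds to one bar of the chart: the bar's position is the interval and its height is the count.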
How to create a histogram from a Python pandas DataFrame
pandas is a powerful, foundational library for data manipulation in Python. If you’re doing any sort of data analysis in Python, you’re likely using either pandas or Polars. The tight integration between pandas and Matplotlib makes it incredibly easy to create histograms directly from a DataFrame.
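As a quick illustration of that integration, the sketch below plots a histogram straight from a DataFrame column using pandas' own plotting accessor, which draws with Matplotlib behind the scenes. The DataFrame and its values are placeholders generated for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: 1,000 basket sizes drawn from a normal distribution
rng = np.random.default_rng(0)
df = pd.DataFrame({'basket_size': rng.normal(50, 12, size=1000)})

# Series.plot.hist() builds the histogram and returns a Matplotlib Axes
ax = df['basket_size'].plot.hist(bins=30, edgecolor='black')
ax.set_xlabel('Basket Size per Order')
ax.set_ylabel('Frequency')
ax.set_title('Histogram straight from a pandas column')
```

Call plt.show() afterwards to display the chart in a script. Because the return value is a regular Matplotlib Axes, any Matplotlib customization still applies.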
Let’s start by creating a histogram with Matplotlib, but in later sections, we’ll explore other options that are a bit more visually appealing, and also interactive.
Basic Python histograms using Matplotlib
Matplotlib is effectively the standard charting library for Python and is tightly integrated with pandas.
In the examples below, we’re going to use some fake data. We generate it with a script so that you can play around with the data and see how it changes the chart produced. If you would like to follow along, here’s the script that generates fake sales data for our Superdope company:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate 5000 orders with different distributions for each channel
n_orders = 5000

# Assign each order to a channel, with a different probability per channel
channels = np.random.choice(
    ['Mobile', 'Web', 'In-Store'],
    size=n_orders,
    p=[0.4, 0.35, 0.25]  # Different probabilities for each channel
)

# Generate units sold with different distributions per channel
basket_sizes = []  # Initialize the list
for channel in channels:
    if channel == 'Mobile':
        # Mobile: Lower average order size
        basket_sizes.append(np.random.normal(20, 8))
    elif channel == 'Web':
        # Web: Medium average order size
        basket_sizes.append(np.random.normal(50, 12))
    else:  # In-Store
        # In-Store: Highest average order size
        basket_sizes.append(np.random.normal(150, 15))

# Create DataFrame
superdope_sales = pd.DataFrame({
    'order_id': range(1, n_orders + 1),
    'channel': channels,
    'basket_size': basket_sizes
})
This script simply generates a list of orders, each with the number of items sold and the channel where the order was placed (mobile, web or in store), and stores it as a pandas DataFrame named “superdope_sales”.
Let’s now plot this data using Matplotlib:
import matplotlib.pyplot as plt
# Create a single histogram for all transactions
plt.figure(figsize=(12, 6))
plt.hist(superdope_sales['basket_size'], bins=100, color='blue', edgecolor='black')
plt.title('Distribution of Order Sizes')
plt.xlabel('Basket Size per Order')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
# Show the plot
plt.show()
In the code above, plt.hist() is the key function to focus on. It takes the data to plot as its first argument (here, the “basket_size” column of the DataFrame) and “bins”, which specifies the number of bins to use. The other critical function is plt.show(), which displays the chart. The rest mostly adds details like the chart color, the title and the axis labels.
Here’s the chart it produces:
As you can see in the chart above, where we use only 10 bins (bins=10), a pattern starts to emerge and the distribution looks bimodal. Let’s see what happens when we increase the number of bins to 100, as in the code listing:
Now we can see that there are actually three distinct distributions hidden in the data - we did this deliberately as you can see in the data generation script where we’ve given a different mean for each channel.
Think of the number of bins like the focus of a camera. You’ll need to tweak the bin count to capture the right profile of the data.
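One way to explore this is to draw the same data with several bin settings side by side. The sketch below assumes a compact, vectorized stand-in for the data generation script (same means and spreads, but a different random generator, so exact values will differ) and also tries NumPy's 'auto' bin rule, which Matplotlib accepts as a string.

```python
import numpy as np
import matplotlib.pyplot as plt

# Compact stand-in for the article's data (assumption: same means and
# standard deviations per channel, different random number generator)
rng = np.random.default_rng(42)
params = {'Mobile': (20, 8), 'Web': (50, 12), 'In-Store': (150, 15)}
channels = rng.choice(list(params), size=5000, p=[0.4, 0.35, 0.25])
mean = np.array([params[c][0] for c in channels])
std = np.array([params[c][1] for c in channels])
basket_size = rng.normal(mean, std)

# Draw the same data with three different bin settings
bin_options = [10, 100, 'auto']
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True)
for ax, bins in zip(axes, bin_options):
    ax.hist(basket_size, bins=bins, color='blue', edgecolor='black')
    ax.set_title(f'bins={bins}')
    ax.set_xlabel('Basket Size per Order')
axes[0].set_ylabel('Frequency')
fig.tight_layout()

# Report how many bins NumPy's 'auto' rule picks for this data
auto_edges = np.histogram_bin_edges(basket_size, bins='auto')
print(f"'auto' chose {len(auto_edges) - 1} bins")
```

Call plt.show() to display the figure. An automatic rule is a reasonable starting point, but it is still worth trying a few values by hand, since the right level of detail depends on the question you are asking.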
Okay, so we’ve created a basic histogram, but admittedly, it’s not the most visually appealing plot and it’s also static (users can’t hover over individual bins to see the content). Let’s explore some more advanced libraries to help you create a beautiful histogram.
More advanced histogram charting libraries
Seaborn
Seaborn builds on Matplotlib’s capabilities, offering a cleaner syntax and more visually appealing charts. It’s particularly useful for overlaying distributions and adding kernel density estimates (KDE).
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.histplot(data=superdope_sales, x='basket_size',
             kde=True, bins=100)
plt.title('Distribution of Order Sizes (Seaborn)')
plt.xlabel('Basket Size per Order')
plt.ylabel('Frequency')

# Show the plot
plt.show()
The kde=True option overlays a smooth curve, providing a better sense of the data’s density.
This is starting to look pretty good. Let’s take it a step further and make it interactive.
Plotly
For interactive histograms, Plotly is an excellent choice. You can zoom, pan, and hover over data points for deeper exploration.
import plotly.express as px
# Create a single histogram using Plotly
fig = px.histogram(superdope_sales, x='basket_size',
                   title='Distribution of Order Sizes (Plotly)',
                   nbins=100,
                   labels={'basket_size': 'Basket Size per Order', 'count': 'Frequency'})
# Show the plot
fig.show()
With just a few lines of code, you can create a chart that users can interact with directly in a browser or dashboard.
Advanced Histogram Plotting Tips
Overlaid histograms
In the examples above we’ve only been showing a single histogram for the entire dataset. But as we saw when we increased the number of bins, the distribution profile may actually differ depending on whether the sale was made on mobile, on the web or in store (and we know from our data generation script that that’s the case). So rather than view all sales data as a single histogram, it would be helpful to view each distribution by channel.
Let’s use Plotly again:
import plotly.express as px
# Create one histogram per channel by coloring on the 'channel' column
fig = px.histogram(superdope_sales, x='basket_size',
                   title='Distribution of Order Sizes by Channel (Plotly)',
                   color='channel',
                   nbins=100,
                   labels={'basket_size': 'Basket Size per Order', 'count': 'Frequency'})
# Show the plot
fig.show()
You’ll notice that this code is almost identical to the code above, but we’ve added the “color” parameter, and suddenly we can distinctly spot the pattern: In-store sales have a much higher basket size!
Histogram facets
Instead of trying to overlay all the histograms on a single chart, you can also break them out into different facets. This clearly disambiguates each distribution. However, if you choose to do this, you’ll want to make sure to be mindful of the min and max of each axis to make sure you’re telling the true story. In the example below, we’ve forced the min and max on the X axis to be the same for each chart to make sure we were comparing apples to apples.
Alternatives to histograms
While histograms are powerful, they may not always be the best choice for every analysis. Here are some powerful alternative visualizations:
Box Plots
Box plots show the distribution of data while highlighting key statistics like the median, quartiles, and outliers. They serve a similar purpose to histograms, but they do a better job of highlighting the main range that the data falls within. In other words, you can simplify your view of a dataset even more: in-store basket sizes tend to be between $137 and $160. This type of simplification can help when it comes to decision making.
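The statistics a box plot draws can also be computed directly. The sketch below calculates the quartiles per channel with pandas and then draws the box plot with DataFrame.boxplot. It uses a compact stand-in for the sales data (same parameters, different random values), so the numbers it prints will not match the figures quoted in this post exactly.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Compact stand-in for the article's data (assumption: same means and
# standard deviations per channel, different random number generator)
rng = np.random.default_rng(42)
params = {'Mobile': (20, 8), 'Web': (50, 12), 'In-Store': (150, 15)}
channels = rng.choice(list(params), size=5000, p=[0.4, 0.35, 0.25])
mean = np.array([params[c][0] for c in channels])
std = np.array([params[c][1] for c in channels])
sales = pd.DataFrame({'channel': channels,
                      'basket_size': rng.normal(mean, std)})

# Quartiles per channel: the box spans q1 to q3, the line marks the median
summary = (sales.groupby('channel')['basket_size']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
summary.columns = ['q1', 'median', 'q3']
print(summary.round(1))

# Draw one box per channel
fig, ax = plt.subplots(figsize=(8, 5))
sales.boxplot(column='basket_size', by='channel', ax=ax)
ax.set_title('Basket Size by Channel')
ax.set_ylabel('Basket Size per Order')
fig.suptitle('')  # Remove the automatic "Boxplot grouped by" title
```

Call plt.show() to display the chart. The q1 and q3 columns give the "main range" for each channel as two numbers, which is often all a decision maker needs.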
Violin plots
If you want to take your box plots to the next level, consider violin plots. They’re very similar, but offer a few key advantages:
- Distribution details: In a lot of ways, they resemble histograms more than they do box plots. You can see the detailed distribution, which can be important if you suspect different distributions within each group. For example, if you think there may be two modes for “in-store purchases”, the box plot would hide that.
- Outlier details: You can see the outliers with a box plot, but in a violin plot it becomes much more obvious how much of an outlier an outlier truly is.
- Visual appeal: Violin plots are a bit more visually appealing, which can be important for user engagement!
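To illustrate the first point, the sketch below draws a box plot and a violin plot of the same two-mode sample side by side using Matplotlib. The sample is hypothetical: it imagines in-store purchases that cluster around two different basket sizes, which is not how the data in this post was generated.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical bimodal sample: two clusters of in-store basket sizes
rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(120, 8, size=500),
                          rng.normal(180, 8, size=500)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# The box plot summarizes the sample as quartiles, so both modes
# collapse into a single wide box
ax_box.boxplot(bimodal)
ax_box.set_title('Box plot')
ax_box.set_ylabel('Basket Size per Order')

# The violin plot traces the estimated density, so the two bulges
# around 120 and 180 remain visible
parts = ax_violin.violinplot(bimodal, showmedians=True)
ax_violin.set_title('Violin plot')

fig.tight_layout()
```

Call plt.show() to display the figure. Both panels report a median near the middle of the range, but only the violin shows that few orders actually fall there.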
Raincloud plots
If you want to get even a bit fancier with your distribution plots and you like violin and box plots, you may want to check out raincloud plots. These require a bit more technical know-how and aren’t necessarily ready out of the box (no pun intended), but can make for some very neat looking charts.
The plot above was generated using the ptitprince Python library (named after the drawing of a snake that ate an elephant in Le Petit Prince).
Hexbin plots
Sometimes you want to understand the distribution of items along two dimensions. Let’s say you want to view the distribution of baskets by number of items per basket and by total basket value. You could create two histograms, or you could look at the orders on a scatter plot. Scatter plots don’t tell you the full story though, because they fail to convey the density of the points. That’s where hexbins come into play.
We can see here that most baskets are in the 3 to 7 item and $80 to $160 range.
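A hexbin chart like this can be sketched with Matplotlib's hexbin function. The two-dimensional data below is invented for illustration (item counts from a Poisson distribution and a noisy per-item price), so its shape will only loosely resemble the chart described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical two-dimensional data: items per basket and basket value
rng = np.random.default_rng(7)
n_orders = 5000
items = rng.poisson(5, size=n_orders) + 1           # at least 1 item per basket
price_per_item = rng.normal(24, 4, size=n_orders)   # assumed average price
basket_value = items * price_per_item

fig, ax = plt.subplots(figsize=(8, 6))

# Each hexagon is colored by the number of orders that fall inside it;
# mincnt=1 leaves empty hexagons blank
hb = ax.hexbin(items, basket_value, gridsize=25, cmap='Blues', mincnt=1)
fig.colorbar(hb, ax=ax, label='Number of baskets')
ax.set_xlabel('Items per Basket')
ax.set_ylabel('Basket Value ($)')
ax.set_title('Basket Density by Item Count and Value')
```

Call plt.show() to display the chart. The gridsize argument plays the same role as the bin count in a histogram: larger values give smaller hexagons and more detail.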
Ridgeline plots
If you want to display the distribution for a lot of distinct groups, you could use a ridgeline plot. JoyPy is a Python package that lets you get started quickly (it is named after Joy Division’s 1979 album cover for Unknown Pleasures).
That said, ridgeline plots are a bit better looking than they are practical. They can be a fun way to tell a story but may not be the most scientifically useful plots.
Ready to try these out?
Histograms are fundamental to exploratory data analysis. They provide a quick way to profile data and understand how it’s distributed. You can quickly generate a pandas histogram in a few lines of code using matplotlib, but you can also get more sophisticated with boxplots, hexbins, violin plots and other distribution density charts.
If you want to try these out and get started without the hassle of setting up a Python environment locally, you can sign up for free at Fabi.ai and take these for a spin. Fabi.ai is an AI data analysis platform designed to make data exploration and collaboration incredibly easy.