TL;DR: EDA is your first conversation with your data. It's where you get to know each other before deeper analysis. In 2025, EDA combines traditional statistical methods with AI to help you understand your data better and faster than ever.
Exploratory Data Analysis (EDA) is the process of profiling a new dataset. It helps you understand the data, the questions it can answer, and its pitfalls. It is the crucial first step in any data analysis project, and one you can't skip.
When you're presented with new data, skipping the EDA step can lead to a lot of backtracking and headaches. For example, you may build a data model focused on a certain field, only to find later that most records have missing values for that field. Or your analysis may depend on a field's average, but one or two outliers skew the numbers. Proper exploratory data analysis can quickly spot these rudimentary issues and help you gain a general understanding of how variables relate to each other, which is useful across many industries and situations.
EDA isn't a new concept. But, it's changing fast with the rise of AI solutions like OpenAI's ChatGPT and Anthropic's Claude. AI can easily profile the data and give some insights, as well as generate code to let you quickly experiment with different approaches and questions.
A small note before diving in: in this post we're going to focus specifically on EDA for structured and semi-structured data. Unstructured data, like images, videos, and long-form text, along with geographic data, needs special treatment because of its size and complexity.
Before we dive into new ways that AI is transforming EDA, let’s take a moment to review some tried-and-true approaches to EDA.
The most basic form of exploratory data analysis is simply to look at some basic stats about the fields in your dataset. For numerical fields, you may want to look at the count of non-null values, the mean, the standard deviation, the minimum and maximum, key percentiles, and the number of nulls.
For example, if you check an age field, this profile will show you the demographic you're working with. The ratio of nulls to the count will tell you how reliable the field is.
Categorical (nominal) fields have a similar but different set of metrics to consider: the count of non-null values, the number of unique values, the most frequent value and its frequency, and the number of nulls.
For example, if the categorical field is US states, this profile shows how representative the data is of all states and whether it skews toward a specific state.
pandas has a very convenient describe() method that can generate these stats in one shot. If you're working with a DataFrame df, you can simply run:
# Generate summary statistics for all columns (numeric and categorical)
# (note: the old datetime_is_numeric argument was removed in pandas 2.0)
summary = df.describe(include='all')
# Add a row for null counts
summary.loc['nulls'] = df.isnull().sum()
print(summary)
This will print out a summary for each column:
Name Age Department
count 4.0 4.000000 4
unique 4.0 NaN 2
top Alice NaN IT
freq 1.0 NaN 2
mean NaN 35.000000 NaN
std NaN 8.164966 NaN
min NaN 25.000000 NaN
25% NaN 28.750000 NaN
50% NaN 35.000000 NaN
75% NaN 41.250000 NaN
max NaN 45.000000 NaN
nulls 1.0 1.000000 1
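For context, the output above came from a tiny toy dataset. A hypothetical DataFrame along these lines (the names, ages, and departments here are made up for illustration) produces a similar, though not identical, summary:
import pandas as pd

# Hypothetical example data: four people plus one fully missing row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dan', None],
    'Age': [25, 30, 40, 45, None],
    'Department': ['IT', 'IT', 'HR', 'HR', None],
})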
Once you've looked at some basic summary statistics for each field and have started to get a sense of which fields you may want to focus on, you can start digging a bit deeper.
Univariate analysis is the study of one (uni) variable (variate) at a time, independently of other variables; in other words, it's just studying one field at a time. Technically, the summary statistics above are a form of univariate analysis, but some univariate methods are worth calling out specifically.
The single most important form of univariate analysis for exploratory data analysis is the histogram. Histograms count the number of occurrences in "bins," which is a fancy way of saying they show you how your values are distributed. Let's say we're looking at order basket sizes, and the summary statistics tell us that the minimum value is $10, the maximum is $200, and the average is $80. We might assume that most baskets are in the $80 range, but there may be baskets with very different profiles hidden behind these numbers:
As we can see in the example histogram above, there are clearly three different basket types and very few of them are $80. Histograms are powerful tools because you can also group the counts by a category field (more on that below). If you’re looking to learn more about how to create histograms in Python and some powerful alternatives, we’ve written a whole post on just that.
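In the meantime, here's a minimal sketch of plotting such a histogram with pandas and matplotlib, assuming a hypothetical orders DataFrame with a basket_size column (both names are made up for illustration):
import matplotlib.pyplot as plt

# Distribution of order basket sizes, counted into 20 bins
orders['basket_size'].plot(kind='hist', bins=20, edgecolor='black')
plt.xlabel('Basket size ($)')
plt.ylabel('Number of orders')
plt.title('Order basket size distribution')
plt.show()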
If the field you’re studying is a category field, histograms won’t work. The equivalent would be a bar plot showing the count of unique values.
In the example above, we can see the count by U.S. states, which gives us a good sense of which states the data represents.
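A minimal sketch of that kind of bar plot, assuming the states live in a column called state:
import matplotlib.pyplot as plt

# Count of records per state, plotted as a bar chart
df['state'].value_counts().plot(kind='bar')
plt.xlabel('State')
plt.ylabel('Record count')
plt.show()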
The next step after univariate analysis is to analyze how two (bi) or more (multi) variables relate to each other. Let's revisit the basket-size example from the histogram above. We saw that there seemed to be multiple distributions hidden in the dataset. If we play around with the data a bit and overlay histograms by different categories, we may notice something interesting.
We can now see that the three distributions each correspond to a channel type: mobile orders have the lowest basket sizes, and in-store purchases are worth more than both web and mobile orders. Careful, though: correlation does not mean causation!
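A sketch of how those overlaid histograms could be built, assuming the same hypothetical orders DataFrame with a channel column:
import matplotlib.pyplot as plt

# One semi-transparent histogram per channel, drawn on the same axes
for channel, group in orders.groupby('channel'):
    plt.hist(group['basket_size'], bins=20, alpha=0.5, label=channel)
plt.xlabel('Basket size ($)')
plt.ylabel('Number of orders')
plt.legend(title='Channel')
plt.show()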
With multivariate analysis, we can start exploring some fun ways to slice the data. Boxplots are another great way to visualize ranges and distributions of values broken out by a category field.
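As a sketch, seaborn makes this a one-liner, again using the hypothetical orders DataFrame and column names from above:
import seaborn as sns
import matplotlib.pyplot as plt

# Basket size ranges and quartiles, broken out by channel
sns.boxplot(data=orders, x='channel', y='basket_size')
plt.show()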
You can use scatter plots or hexbin plots to understand the relationship between two numeric fields.
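A minimal sketch, assuming a hypothetical people DataFrame with height and weight columns:
import matplotlib.pyplot as plt

# Scatter plot of two numeric fields; for very large datasets,
# swap the scatter for a hexbin to avoid overplotting
people.plot(kind='scatter', x='height', y='weight', alpha=0.3)
# people.plot(kind='hexbin', x='height', y='weight', gridsize=25)
plt.show()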
Beyond seeing how numeric values relate to each other, you may want to check for patterns based on a third, categorical variable.
Now we can see not only that there is a correlation between height and weight, but also that there are two distinct groups in the data.
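A sketch of how that grouping could be added, assuming a hypothetical group column on the same people DataFrame:
import seaborn as sns
import matplotlib.pyplot as plt

# Color each point by a third, categorical variable to reveal subgroups
sns.scatterplot(data=people, x='height', y='weight', hue='group', alpha=0.5)
plt.show()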
Moving beyond two, three, or even four variables in a multivariate analysis can start to get tedious. If you're unsure where to start, a correlation analysis can help you find which variables are worth studying together.
A correlation matrix hides some of the detail of how variables relate, but it can hint at which variables may be related.
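A minimal sketch of a correlation matrix heatmap with pandas and seaborn, for whatever DataFrame df you're exploring:
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between numeric columns, visualized as a heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()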
All these charts are relatively easy to generate in spreadsheets or with Python, especially when you're leveraging AI for code generation.
AI can really supercharge your EDA workflow in two big ways: it can profile the data and surface initial insights, and it can generate the code you need to quickly experiment with different approaches and questions.
Properly leveraging AI for EDA can 10X your productivity and get you to insights much faster than you would without it. However, AI does have its pitfalls, and you do need to make sure you supervise it closely. We’ll touch on this in more detail later on.
A note on Fabi.ai: The team behind Fabi.ai has lived and breathed data analysis their entire careers, and we understand what it takes to explore data. Fabi.ai was designed as an AI-powered data analysis platform specifically to supercharge data practitioners and bring all these EDA methods together into a single tool.
Let's walk through a practical example of EDA in action, step by step, using a publicly available dataset.
To get started, let's download the data: contributions to candidates in the 2013 New York City election.
Before diving into the data, we can take a quick look at the fields available in the dataset. We have things like AMNT (probably the donation amount), the office and election ID (which may break down contributions for specific campaigns), and the donor and recipient.
Just from the field names, we can already start forming questions we may be able to ask of the data.
To kick things off, let’s first ask ChatGPT what it thinks we can get from the data and what fields may be interesting. After uploading the data, let’s ask “What are some questions I might be able to ask of this data? Here's some context: A list of campaign contributions for candidates in the 2013 City election.”
ChatGPT called out a few ideas I hadn't thought of.
We can also ask AI to give us some insights into the fields: “What seems reliable and important if I want to study contribution patterns by candidate and contributor?”
It came back with a very specific list of fields, but it also flagged some really interesting gotchas.
It's calling out potentially important fields with a significant amount of missing data, which we'll need to watch out for. Information about employment and intermediaries, in particular, looks unreliable. That said, most campaign contributions may simply not go through an intermediary, while larger donations may, so we shouldn't write these fields off entirely; we should just be prepared to dig in.
With our questions in hand and a basic understanding of the data, let’s start digging in. For this part of the analysis, we’re going to use Python. Let’s first filter down the key fields we want to focus on and rename them for usability:
df = dataframe1[['AMNT', 'PREVAMNT', 'DATE', 'REFUNDDATE', 'NAME', 'C_CODE',
'OCCUPATION', 'CITY', 'STATE', 'RECIPID', 'RECIPNAME',
'CANCLASS', 'COMMITTEE']].rename(columns={
'AMNT': 'contribution_amount',
'PREVAMNT': 'previous_contribution_amount',
'DATE': 'contribution_date',
'REFUNDDATE': 'refund_date',
'NAME': 'contributor_name',
'C_CODE': 'contributor_code',
'OCCUPATION': 'contributor_occupation',
'CITY': 'contributor_city',
'STATE': 'contributor_state',
'RECIPID': 'recipient_id',
'RECIPNAME': 'recipient_name',
'CANCLASS': 'candidate_class',
'COMMITTEE': 'committee_type'
})
display(df)
The df DataFrame should now be a bit more manageable. Let’s ask AI to generate some code to provide an overview of these fields. Here’s what it produced:
print("=== Data Types ===")
for idx, (col, dtype) in enumerate(df.dtypes.items(), 1):
    print(f"{idx}. {col}: {dtype}")

print("\n=== Null Counts ===")
for idx, (col, nulls) in enumerate(df.isnull().sum().items(), 1):
    print(f"{idx}. {col}: {nulls}")

print("\n=== Basic Statistics ===")
display(df.describe(include='all').style.set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}]
))
We’re working with a mix of field types here, so we need to do a bit of extra work to get a preview of the nulls and a preview of basic stats for all fields.
These stats contain some interesting information. Even just looking at the null counts, it's pretty clear that most transactions are not refunds (173,211 null refund dates out of 176,642 records). This also tells me that I may need to filter out contributions that were later refunded for my analysis to be accurate.
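As a quick sketch, that filter could look like this (treating any row with a non-null refund_date as refunded):
# Keep only contributions that were never refunded
non_refunded = df[df['refund_date'].isnull()]
print(f"{len(non_refunded)} of {len(df)} contributions have no refund date")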
When I look at other stats, I can also see that contribution amounts range from -$19,084 to $3,000,000, with a mean of $451. This is an incredibly wide range that skews to the lower end. Let’s actually take a look at the distribution.
This chart isn't very helpful: the outliers stretch the scale so much that we need to zoom in before any pattern becomes visible.
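One way to zoom in is simply to cap the range before plotting. Here's a sketch, with an arbitrary $1,000 cutoff chosen for illustration:
import matplotlib.pyplot as plt

# Focus on the bulk of the distribution: contributions between $0 and $1,000
small = df[df['contribution_amount'].between(0, 1000)]
small['contribution_amount'].plot(kind='hist', bins=50)
plt.xlabel('Contribution amount ($)')
plt.ylabel('Number of contributions')
plt.show()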
This raises a question: is it worth analyzing the many small transactions or the few large ones? Let's ask the AI to generate a chart showing cumulative contribution amounts by contribution band. This will tell us what makes up the lion's share of contributions.
This tells us an interesting story. Although there are fewer of them (only 13), contributions of more than $750k make up the majority of the total contributions. So, in this analysis, simply focusing on the larger transactions will suffice.
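For reference, here's a rough sketch of the kind of banded, cumulative view described above; the band edges are arbitrary choices for illustration, not the ones from the original analysis:
import pandas as pd

# Bucket contributions into bands and compute each band's cumulative share of the total
bins = [0, 100, 1_000, 10_000, 100_000, 750_000, float('inf')]
labels = ['<$100', '$100-1K', '$1K-10K', '$10K-100K', '$100K-750K', '>$750K']
positive = df[df['contribution_amount'] > 0].copy()
positive['band'] = pd.cut(positive['contribution_amount'], bins=bins, labels=labels)
by_band = positive.groupby('band', observed=True)['contribution_amount'].sum()
print((by_band.cumsum() / by_band.sum()).round(3))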
We will end this example here. It shows the challenges of working with real data: it's messy and confusing. Hopefully, this also shows the power of using AI for exploratory data analysis. These steps took approximately 10 minutes, but they would normally take much longer.
EDA is as much about knowing how to use the right tools as it is about knowing what questions to ask.
Sometimes you’re presented with some new data that you’re just trying to explore to understand what you’re working with. In that case, it’s important to understand where this data comes from and read any notes or documentation that came with the data.
From there, take a minute to think about the types of questions you can ask of the data. This doesn’t have to be a list of questions that you’re going to explore to the end, but it will help you think about what fields you may want to focus on. This is also a great way to leverage ChatGPT or Claude for the initial phases. Without uploading the data, you can tell the AI your fields and any context. Then, ask, "What insights could I extract from this data?"
However, if you're working in an enterprise setting, chances are you're looking at data for the first time because a stakeholder asked you a question. This is effectively a prepackaged hypothesis for you.
Before taking the question at face value, we recommend a couple of quick sanity checks to save yourself a headache later.
Once you know the question is worth exploring, the next thing to ask is: which fields can I count on? The data itself often holds the answer.
As we've touched on above, there are some obvious ways to spot issues, such as checking null counts and looking for outliers in the summary statistics.
But there are also more subtle traps. If you’ve worked on enterprise data, you’ve likely seen a table that has four different “amount” fields. All these fields may look like perfectly reasonable candidates for your analysis, but only one is right. Understanding the data lineage can become very important. This is where more advanced data engineering tools like dbt, Dagster, or Monte Carlo come in handy.
Ah… we love data, and we always want more. But for any data analysis there's an optimal, "Goldilocks" amount of data: enough to reach statistical significance, but not so much that it becomes unwieldy.
This is where the upfront work of framing your question and hypothesis really pays off: understanding them will generally give you a sense of the scope of data you need.
Data teams have a wide range of analysis tools at their disposal, and knowing which to use, and when, can boost both productivity and quality.
Going forward, EDA is going to look very different from how it has in the past.
With AI readily available to anyone, it must become a core part of your workflow. To maximize your efficiency and efficacy when conducting EDA, here are 9 best practices to incorporate into your work this year.
Finally, if you’re ready to supercharge your exploratory data analysis even more, Fabi.ai combines the power of AI with enterprise-grade data tools to help you uncover insights faster. Get started today with a free trial and discover why leading data teams trust us for their EDA needs.