TL;DR: EDA is your first conversation with your data. It's where you get to know each other before deeper analysis. In 2025, EDA combines traditional statistical methods with AI to help you understand your data better and faster than ever.
Exploratory Data Analysis (EDA) is the process of profiling a new dataset. It helps you understand the data, the questions it can answer, and its pitfalls. It is the crucial first step in any data analysis project, and one you can't skip.
When you're presented with new data, skipping the EDA step can lead to a lot of backtracking and headaches. For example, you may build a data model focused on a certain field, only to find later that most records have missing values for that field. Or your analysis may depend on a field's average, but one or two outliers skew the numbers. Proper exploratory data analysis can quickly spot these rudimentary issues and help you gain a general understanding of how variables relate to each other, which is useful across many industries and situations.
EDA isn't a new concept. But, it's changing fast with the rise of AI solutions like OpenAI's ChatGPT and Anthropic's Claude. AI can easily profile the data and give some insights, as well as generate code to let you quickly experiment with different approaches and questions.
A small note before diving in: in this post we're going to focus specifically on EDA for structured and semi-structured data. Unstructured data, like images, videos, and long-form text, along with geographic data, needs special treatment because of its size and complexity.
Before we dive into new ways that AI is transforming EDA, let’s take a moment to review some tried-and-true approaches to EDA.
The most basic form of exploratory data analysis is simply to look at some basic stats about the fields in your dataset. For numerical fields, you may want to look at the count of non-null values, the mean, the standard deviation, the minimum and maximum, key percentiles, and the number of nulls.
For example, if you check an age field, this profile will show you the demographic you're working with. The ratio of nulls to the count will tell you how reliable the field is.
Categorical (nominal) fields have a similar but different set of metrics to consider: the count of non-null values, the number of unique values, the most frequent value and its frequency, and the number of nulls.
For example, if the categorical field is US states, this profile shows how representative the data is of all states and whether it skews toward a specific state.
pandas has a very convenient describe() method that can generate these stats in one shot. If you're working with a DataFrame df, you can simply run:
# Generate summary statistics for all columns (numeric and categorical)
# (note: the old datetime_is_numeric argument was removed in pandas 2.0)
summary = df.describe(include='all')
# Add a row for null counts
summary.loc['nulls'] = df.isnull().sum()
print(summary)
This will print out a summary for each column:
Name Age Department
count 4.0 4.000000 4
unique 4.0 NaN 2
top Alice NaN IT
freq 1.0 NaN 2
mean NaN 35.000000 NaN
std NaN 8.164966 NaN
min NaN 25.000000 NaN
25% NaN 28.750000 NaN
50% NaN 35.000000 NaN
75% NaN 41.250000 NaN
max NaN 45.000000 NaN
nulls 1.0 1.000000 1
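For context, the output above came from a tiny toy dataset. A hypothetical DataFrame along these lines (the names, ages, and departments here are made up for illustration) produces a similar, though not identical, summary:
import pandas as pd

# Hypothetical example data: four people plus one fully missing row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dan', None],
    'Age': [25, 30, 40, 45, None],
    'Department': ['IT', 'IT', 'HR', 'HR', None],
})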
Once you've looked at some basic summary statistics for each field and have started to get a sense of which fields you may want to focus on, you can start digging a bit deeper.
Univariate analysis is the study of one (uni) variable (variate) at a time, independently of other variables; in other words, it's just studying one field at a time. Technically, the summary statistics above are a form of univariate analysis, but some univariate methods are worth calling out specifically.
The single most important form of univariate analysis for exploratory data analysis is the histogram. Histograms count the number of occurrences in "bins," which is a fancy way of saying they show you how your values are distributed. Let's say we're looking at order basket sizes, and the summary statistics tell us that the minimum value is $10, the maximum is $200, and the average is $80. We might assume that most baskets are in the $80 range, but there may be baskets with very different profiles hidden behind these numbers:
As we can see in the example histogram above, there are clearly three different basket types and very few of them are $80. Histograms are powerful tools because you can also group the counts by a category field (more on that below). If you’re looking to learn more about how to create histograms in Python and some powerful alternatives, we’ve written a whole post on just that.
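In the meantime, here's a minimal sketch of plotting such a histogram with pandas and matplotlib, assuming a hypothetical orders DataFrame with a basket_size column (both names are made up for illustration):
import matplotlib.pyplot as plt

# Distribution of order basket sizes, counted into 20 bins
orders['basket_size'].plot(kind='hist', bins=20, edgecolor='black')
plt.xlabel('Basket size ($)')
plt.ylabel('Number of orders')
plt.title('Order basket size distribution')
plt.show()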
If the field you’re studying is a category field, histograms won’t work. The equivalent would be a bar plot showing the count of unique values.
In the example above, we can see the count by U.S. states, which gives us a good sense of which states the data represents.
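A minimal sketch of that kind of bar plot, assuming the states live in a column called state:
import matplotlib.pyplot as plt

# Count of records per state, plotted as a bar chart
df['state'].value_counts().plot(kind='bar')
plt.xlabel('State')
plt.ylabel('Record count')
plt.show()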
The next step after univariate analysis is to analyze how two (bi) or more (multi) variables relate to each other. Let's revisit the basket-size example from the histogram above. We saw that there seemed to be multiple distributions hidden in the dataset. If we play around with the data a bit and overlay histograms by different categories, we may notice something interesting.
We can now see that the three distributions each correspond to a channel type: mobile orders have the lowest basket sizes, and in-store purchases are worth more than both web and mobile orders. Careful, though: correlation does not mean causation!
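A sketch of how those overlaid histograms could be built, assuming the same hypothetical orders DataFrame with a channel column:
import matplotlib.pyplot as plt

# One semi-transparent histogram per channel, drawn on the same axes
for channel, group in orders.groupby('channel'):
    plt.hist(group['basket_size'], bins=20, alpha=0.5, label=channel)
plt.xlabel('Basket size ($)')
plt.ylabel('Number of orders')
plt.legend(title='Channel')
plt.show()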
With multivariate analysis, we can start exploring some fun ways to slice the data. Boxplots are another great way to visualize ranges and distributions of values broken out by a category field.
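As a sketch, seaborn makes this a one-liner, again using the hypothetical orders DataFrame and column names from above:
import seaborn as sns
import matplotlib.pyplot as plt

# Basket size ranges and quartiles, broken out by channel
sns.boxplot(data=orders, x='channel', y='basket_size')
plt.show()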
You can use scatter plots or hexbin plots to understand the relationship between two numeric fields.
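A minimal sketch, assuming a hypothetical people DataFrame with height and weight columns:
import matplotlib.pyplot as plt

# Scatter plot of two numeric fields; for very large datasets,
# swap the scatter for a hexbin to avoid overplotting
people.plot(kind='scatter', x='height', y='weight', alpha=0.3)
# people.plot(kind='hexbin', x='height', y='weight', gridsize=25)
plt.show()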
Beyond seeing how numeric values relate to each other, you may want to check for patterns based on a third, categorical variable.
Now we can see not only that there is a correlation between height and weight, but also that there are two distinct groups in the data.
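A sketch of how that grouping could be added, assuming a hypothetical group column on the same people DataFrame:
import seaborn as sns
import matplotlib.pyplot as plt

# Color each point by a third, categorical variable to reveal subgroups
sns.scatterplot(data=people, x='height', y='weight', hue='group', alpha=0.5)
plt.show()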
Moving beyond two, three, or even four variables in a multivariate analysis can start to get tedious. If you're unsure where to start, a correlation analysis can help you find which variables are worth studying together.
A correlation matrix hides some of the detail of how variables relate, but it can hint at which variables may be related.
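A minimal sketch of a correlation matrix heatmap with pandas and seaborn, for whatever DataFrame df you're exploring:
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between numeric columns, visualized as a heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()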
All these charts are relatively easy to generate in spreadsheets or with Python, especially when you're leveraging AI for code generation.
AI can really supercharge your EDA workflow in two big ways: it can profile the data and surface initial insights, and it can generate the code you need to quickly experiment with different approaches and questions.
Properly leveraging AI for EDA can 10X your productivity and get you to insights much faster than you would without it. However, AI does have its pitfalls, and you do need to make sure you supervise it closely. We’ll touch on this in more detail later on.
A note on Fabi.ai: The team behind Fabi.ai has lived and breathed data analysis their entire careers, and we understand what it takes to explore data. Fabi.ai was designed as an AI-powered data analysis platform specifically to supercharge data practitioners and bring all these EDA methods together into a single tool.
Let's walk through a practical example of EDA in action, step by step, using a publicly available dataset.
To get started, let's download the data: contributions to candidates in the 2013 New York City election.
Before diving into the data, we can take a quick look at the fields available in the dataset. We have things like AMNT (probably the donation amount), the office and election ID (which may break down contributions for specific campaigns), and the donor and recipient.
Just from the field names, we can already start forming questions we may be able to ask of the data.
To kick things off, let’s first ask ChatGPT what it thinks we can get from the data and what fields may be interesting. After uploading the data, let’s ask “What are some questions I might be able to ask of this data? Here's some context: A list of campaign contributions for candidates in the 2013 City election.”
ChatGPT called out a few ideas I hadn't thought of.
We can also ask AI to give us some insights into the fields: “What seems reliable and important if I want to study contribution patterns by candidate and contributor?”
It came back with a very specific list of fields, but it also flagged some really interesting gotchas.
It's calling out potentially important fields with a significant amount of missing data, which we'll need to watch out for. Information about employment and intermediaries, in particular, looks unreliable. That said, most campaign contributions may simply not go through an intermediary, while larger donations may, so we shouldn't write these fields off entirely; we should just be prepared to dig in.
With our questions in hand and a basic understanding of the data, let’s start digging in. For this part of the analysis, we’re going to use Python. Let’s first filter down the key fields we want to focus on and rename them for usability:
df = dataframe1[['AMNT', 'PREVAMNT', 'DATE', 'REFUNDDATE', 'NAME', 'C_CODE',
'OCCUPATION', 'CITY', 'STATE', 'RECIPID', 'RECIPNAME',
'CANCLASS', 'COMMITTEE']].rename(columns={
'AMNT': 'contribution_amount',
'PREVAMNT': 'previous_contribution_amount',
'DATE': 'contribution_date',
'REFUNDDATE': 'refund_date',
'NAME': 'contributor_name',
'C_CODE': 'contributor_code',
'OCCUPATION': 'contributor_occupation',
'CITY': 'contributor_city',
'STATE': 'contributor_state',
'RECIPID': 'recipient_id',
'RECIPNAME': 'recipient_name',
'CANCLASS': 'candidate_class',
'COMMITTEE': 'committee_type'
})
display(df)
The df DataFrame should now be a bit more manageable. Let’s ask AI to generate some code to provide an overview of these fields. Here’s what it produced:
print("=== Data Types ===")
for idx, (col, dtype) in enumerate(df.dtypes.items(), 1):
    print(f"{idx}. {col}: {dtype}")

print("\n=== Null Counts ===")
for idx, (col, nulls) in enumerate(df.isnull().sum().items(), 1):
    print(f"{idx}. {col}: {nulls}")

print("\n=== Basic Statistics ===")
display(df.describe(include='all').style.set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}]
))
We’re working with a mix of field types here, so we need to do a bit of extra work to get a preview of the nulls and a preview of basic stats for all fields.
These stats contain some interesting information. Even just looking at the null counts, it's pretty clear that most transactions are not refunds (173,211 null refund dates out of 176,642 records). This also tells me that I may need to filter out contributions that were later refunded for my analysis to be accurate.
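As a quick sketch, that filter could look like this (treating any row with a non-null refund_date as refunded):
# Keep only contributions that were never refunded
non_refunded = df[df['refund_date'].isnull()]
print(f"{len(non_refunded)} of {len(df)} contributions have no refund date")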
When I look at other stats, I can also see that contribution amounts range from -$19,084 to $3,000,000, with a mean of $451. This is an incredibly wide range that skews to the lower end. Let’s actually take a look at the distribution.
This chart isn't very helpful: the outliers stretch the scale so much that we need to zoom in before any pattern becomes visible.
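One way to zoom in is simply to cap the range before plotting. Here's a sketch, with an arbitrary $1,000 cutoff chosen for illustration:
import matplotlib.pyplot as plt

# Focus on the bulk of the distribution: contributions between $0 and $1,000
small = df[df['contribution_amount'].between(0, 1000)]
small['contribution_amount'].plot(kind='hist', bins=50)
plt.xlabel('Contribution amount ($)')
plt.ylabel('Number of contributions')
plt.show()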
This raises a question: is it worth analyzing the many small transactions or the few large ones? Let's ask the AI to generate a chart showing cumulative contribution amounts by contribution band. This will tell us what makes up the lion's share of contributions.
This tells us an interesting story. Although there are fewer of them (only 13), contributions of more than $750k make up the majority of the total contributions. So, in this analysis, simply focusing on the larger transactions will suffice.
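For reference, here's a rough sketch of the kind of banded, cumulative view described above; the band edges are arbitrary choices for illustration, not the ones from the original analysis:
import pandas as pd

# Bucket contributions into bands and compute each band's cumulative share of the total
bins = [0, 100, 1_000, 10_000, 100_000, 750_000, float('inf')]
labels = ['<$100', '$100-1K', '$1K-10K', '$10K-100K', '$100K-750K', '>$750K']
positive = df[df['contribution_amount'] > 0].copy()
positive['band'] = pd.cut(positive['contribution_amount'], bins=bins, labels=labels)
by_band = positive.groupby('band', observed=True)['contribution_amount'].sum()
print((by_band.cumsum() / by_band.sum()).round(3))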
We will end this example here. It shows the challenges of working with real data: it's messy and confusing. Hopefully, this also shows the power of using AI for exploratory data analysis. These steps took approximately 10 minutes, but they would normally take much longer.
EDA is as much about knowing how to use the right tools as it is about knowing what questions to ask.
Sometimes you’re presented with some new data that you’re just trying to explore to understand what you’re working with. In that case, it’s important to understand where this data comes from and read any notes or documentation that came with the data.
From there, take a minute to think about the types of questions you can ask of the data. This doesn’t have to be a list of questions that you’re going to explore to the end, but it will help you think about what fields you may want to focus on. This is also a great way to leverage ChatGPT or Claude for the initial phases. Without uploading the data, you can tell the AI your fields and any context. Then, ask, "What insights could I extract from this data?"
However, if you're working in an enterprise setting, chances are you're looking at data for the first time because a stakeholder asked you a question. This is effectively a prepackaged hypothesis for you.
Before taking the question at face value, we recommend a couple of quick sanity checks to save yourself a headache later.
Once you know the question is worth exploring, the next thing to ask is: which fields can I count on? The data itself often holds the answer.
As we've touched on above, there are some obvious ways to spot issues, such as checking null counts and looking for outliers in the summary statistics.
But there are also more subtle traps. If you’ve worked on enterprise data, you’ve likely seen a table that has four different “amount” fields. All these fields may look like perfectly reasonable candidates for your analysis, but only one is right. Understanding the data lineage can become very important. This is where more advanced data engineering tools like dbt, Dagster, or Monte Carlo come in handy.
Ah… we love data, and we always want more. But for any data analysis there's an optimal, "Goldilocks" amount of data: enough to reach statistical significance, but not so much that it becomes unwieldy.
This is where the upfront work of framing your question and hypothesis really pays off: understanding them will generally give you a sense of the scope of data you need.
Data teams have a wide range of analysis tools at their disposal, and knowing which to use, and when, can boost both productivity and quality.
Going forward, EDA is going to look very different from how it has in the past.
With AI readily available to anyone, it must become a core part of your workflow. To maximize your efficiency and efficacy when conducting EDA, here are 9 best practices to incorporate into your work this year.
Finally, if you’re ready to supercharge your exploratory data analysis even more, Fabi.ai combines the power of AI with enterprise-grade data tools to help you uncover insights faster. Get started today with a free trial and discover why leading data teams trust us for their EDA needs.