4 practical ways to use SQL and Python in tandem
Collaborative data analysis: Why it matters and what great looks like
TL;DR: Collaborative data analysis is a new wave that brings the technical and non-technical stakeholders together around the data. It fosters a data-driven culture and increases trust in the data. Collaborative data analysis requires an analysis platform that offers SQL, Python, AI and spreadsheet-like functionality in a single environment and can push data to where your stakeholders live.
When I was a product manager I was constantly using data to help make decisions or investigate customer or product issues. I had access to traditional BI, product analytics, PostgreSQL and Snowflake along with a myriad of other tools, but more often than not I would export the data and dump it into a spreadsheet, or a Jupyter notebook (for practical reasons it was actually usually Google Colab) and take screenshots to share in Slack or slides. And I noticed that this pattern was common across every team and almost every organization.
Reflecting on this, there are a few reasons why this happened:
- Usually my insights required mashing together data from a few different sources. For example, I would need account information from Salesforce joined with customer data in our product database.
- Most of these tools fundamentally were not collaborative. So if I found something that I wanted to share with my coworkers, I needed to be in Google Sheets or Slack. If I didn’t do this, I would simply be analyzing data in a vacuum, which reduced the power of data in our decision making.
- Most of these tools don’t make it easy to keep track of changes and edits to reports. So if someone else came in and made a change, I would have no idea and I wouldn’t be able to figure out what change had been made, so I would lose trust in the data.
- I usually needed a variety of different tools to complete my analysis. SQL to extract the data, spreadsheets to do some pivoting and formatting and sometimes Python for more advanced analysis or custom visualizations.
Collaborative data analysis platforms solve these challenges by providing a complete data analysis environment with real-time collaboration and version control, reducing the need to bounce between different tools and allowing the team to discuss insights where they live. Let’s explore a bit more why this matters and what traits make a great collaborative data analysis platform.
Why does collaborative data analysis matter?
The workflow described above, involving exporting data, taking screenshots and putting it in Slack, email, slides, Jupyter notebooks/Google Colab or spreadsheets probably feels familiar. Aside from the obvious inconvenience, the lack of collaborative data analysis also tends to reduce the use of data for strategic decision making, erodes trust in data and reports and hinders data literacy across the organization.
Let’s dive into each of these in more detail.
1. Fostering a data-driven culture
When teams and individuals are able to effectively collaborate around insights, data will make its way into more conversations and decisions. Generally speaking, most enterprise leaders want to be more data-driven, but this can be difficult to achieve when it takes multiple steps to find and share insights which then get lost in the corporate ether - a Snowflake survey shows that only 32% of managers primarily use data to make decisions while 75% agree that there’s a need to increase access to data.
If the insight is directly embedded where the team works, for example in Slack, email or Google Sheets or it’s readily available in reports where you can leave comments and tag coworkers, conversations will flourish around data.
2. Improved efficiency and faster time to insights
Insights have a producer and a consumer. Oftentimes, it can take multiple iterations before the producer and consumer are aligned. If the producer is able to share their report or insight in the same environment where the consumer is viewing the data, iteration cycles will decrease significantly, boosting the team’s efficiency. The further the data is from where it’s being consumed, the more work it takes to produce a useful insight. Fabi.ai customers, for example, are able to reduce time to insight by up to 90% by working in a collaborative data analysis environment.
3. Increased trust in data
Sometimes screenshots are necessary, but as soon as the data delivered to the stakeholder is disconnected from its source, some amount of trust is lost. If a data analyst shares a report with a business stakeholder, who then exports the data and reshares it with a colleague, that colleague is likely going to question the source and accuracy of the data. On the other hand, if they saw the data along with the original author, and perhaps even the underlying data source, this immediately increases confidence in the data.
4. Increased data literacy across the organization
A lot of business stakeholders actually have a strong desire to better understand the data and are generally capable of even deciphering SQL enough to understand where the data is from and what it's showing. This is great! The more individuals across your organization have some level of understanding of the data (or at least your gold or wide tables) the better. This will help them better understand what data is available, the sorts of questions they can ask, and maybe even empower them to answer some of their own questions.
Collaborative data analysis platforms increase literacy by bringing the producers and the consumers of insights in a same environment with the underlying code and data modelling readily available. This in turn also increases trust in the data.
Characteristics of collaborative data analysis tools
Before we talk about what makes a data analysis platform collaborative, we need to understand the different types of collaboration. There are three different types of collaboration relationships:
- Collaboration between producers of insights (for example two data analysts)
- Collaboration between consumers of insights (for example between two customer success managers)
- Collaboration between the insight producer and the consumer (the data team sharing a report with the sales team)
Although there are similarities between these, for example real-time commenting and updates are important in all cases, certain personas have specific needs. For instance, effective collaboration between data team members requires version control. This allows them to work in parallel on the same analysis without overriding each other’s code.
So let’s take a look at the characteristics of a collaborative data analysis platform that serves all these stakeholders.
1. All-in-one: Spreadsheet, SQL, Python and AI
A truly collaborative data analysis platform should have all tools in a single place. Individuals bring data to where it’s easiest to analyze and collaborate. If the work can’t all be done in one place, the collaboration workflow will be broken.
Data analysis typically requires some combination of spreadsheet, SQL, Python (for advanced data analysis and visualization) and AI. The more tools the platform offers the better.
2. Connected to your spreadsheets and communication channels
Collaborative data analysis platforms should also encourage data sharing. Pushing insights and reports to where your stakeholders live, and maintaining those reports, should be easy. In particular, a good collaborative data analysis platform should offer a connector to and from Google Sheets as well as a way to push alerts to Slack or Microsoft Teams.
3. Commenting and tagging
In order to bring individuals together around a report, the data must be up to date in an environment where users can comment and tag each other. Short of this, your stakeholders will export or screenshot the data and bring it to where discussions happen: documents and Slack.
4. Reproducibility (and state management)
Reproducibility is a complex topic that merits its own article. But for the sake of this topic, it’s important to find a collaborative data analysis platform that ensures that if the same report is run twice by different individuals or at different times, that it shows the same results. This sounds trivial, but proper reproducibility and variable statement management is the exception, not the rule. Especially when more advanced data analysis using Python.
5. Version controlled
Version control with tools like GitHub, Bitbucket or Gitlab has been a staple to good software development for the past two or more decades. Collaboration between two technical stakeholders working on a same report is near impossible without version control. One analyst making updates to another data analyst’s report without versioning can quickly lead to confusion and lots of wasted time.
A good collaborative data analysis platform should offer dashboard-as-code with git support.
Technical note: What’s changed in the past 10 years to enable a mixed SQL, Python and AI collaborative data analysis environment?
If at this stage you’re thinking “Yes, this sounds awesome, but why aren’t all data analysis platforms offering all this functionality?” this section is a technical detour that gets into what’s changed in the past 5-10 years enabling new data analysis platforms to offer functionality that we could only dream of a few years ago.
Many of the legacy BI solutions were founded 10 or more years ago (yes, actually!), and there have been some massive changes since then that have changed the game. Specifically, in order to create a collaborative data analysis platform that brings together SQL, Python, AI and no-code, you need the following:
- A single-node Python environment hosted on a remote kernel. Without Python, you’re restricted to SQL and no-code, which limits the breath of data analysis possible. Just as importantly, Python is also the backbone of modern AI. AI excels at generating Python code for data analysis, which was not a major consideration prior to the launch of ChatGPT in 2022.
- Infrastructure that can quickly scale up or down. If you’re sharing a report with a stakeholder that requires its own Python kernel, the platform needs to be able to spin up kernels incredibly fast and efficiently. Barring this, the experience would be painfully slow, or worse, one user viewing a report that uses Python or AI would override what another viewer sees if they’re in the report at the same time. And that would be bad!
So why is this now possible?
- Compute power is cheap: In the past decade, the cost of compute relative to CPUs has plummeted. This means that systems designed to work on large, distributed datasets feel (and are) clunky, relative to the work that needs to be done. Affordable compute is necessary in order for a service provider to be able to offer on-demand, virtual Python kernels, and Python is a fundamental building block for AI-powered data analysis.
- Small data with columnar data storage is having a moment: In the past few years, columnar storage solutions like DuckDB and Clickhouse have started taking off, allowing data teams to crunch through gigabytes of data on single node machines. This is important, because it now means that we can spin up mini, virtual machines for each individual and still power through large amounts of data, ensuring that everyone sees the same data but in their own instance so there’s no state conflict.
- Python performance improvements: Python has continued to get faster. Especially with new libraries like Polars that offer a 10X performance improvements over Pandas. Running custom code on the fly that isn’t just SQL is now trivial and cheap.
- AI (LLMs specifically) have entered the zeitgeist and are quickly becoming a must-have for any data analysis. AI can be used to generate SQL queries, but that’s not where it shines. It shines in processing unstructured data and generating Python. Unfortunately, most data analytics platforms have omitted Python and so are not poised to properly leverage AI for data analysis.
- Kubernetes was launched ten years ago and has matured a lot since then. With cheap compute that can process millions of records on single nodes, you still need a way to quickly turn machines on and off. If you share your analysis with 20 users and your report leverages Python and AI, 20 machines need to be available in milliseconds with the right environment and libraries installed. Then you want to be able to just as quickly shut these machines down to avoid overspending. This requires careful containerization and orchestration, enabled by Kubernetes
Take all of these advances in aggregate, only now can we truly build modern, ai-powered data analysis platforms that combine all the tools you need in a single place with real-time collaboration.
Key characteristics to look for in a collaborative data analysis platform: SQL, Python, AI, versioning and real-time collaboration
In the very near future, all data analysis will be expected to be collaborative. Collaboration means both collaboration between technical and non-technical team members. So it’s just as important that the platform you choose provides best-in-class version control as it does a way to push data to Google Sheets, Slack or interactive reports.
Let’s quickly recap the key characteristics of a collaborative data analysis platform:
- All-in-one environment with SQL, Python and AI - In order to keep an analysis all in one place. From simple ad hoc data pulls to advanced data analysis, the platform needs to combine SQL for data pulling, Python for data analysis and AI assistance to accelerate the time to insights.
- Connected to Slack and Google Sheets - The platform should offer real-time and scheduled syncs to Google Sheets and Slack, two places that your team spends most of their time in.
- Provides real-time commenting and user tagging - Individuals should be able to see who viewed a report and leave comments and tag coworkers.
- Robust variable state management - Every time a report is run with specific parameters, it should return the same results. When Python is involved, this requires dependency tracking and careful state management.
- Version control - Two technical individuals should be able to work on the same report and use version control to manage code conflicts.
At Fabi.ai, we believe that data analysis should fundamentally be collaborative and that this requires a new way of thinking about data analysis and the tools we use. Our mission is to transform the way data is explored, used and shared. We’re building the first AI-powered collaborative data analysis platform designed to bring all stakeholders together. If you want to try it out, you can get started for free in less than 2 minutes.