Causal Inference Made Practical: Unlocking Business Insights with the DoWhy Library

How to Move Beyond Correlation with Python’s DoWhy Library

Every analyst has stared at a dashboard showing a seemingly clear pattern. Premium customers churn more, a marketing campaign decreased revenue, a new feature decreased engagement. The numbers don’t lie… Or do they?

This is one of the biggest problems in data analytics: correlation is everywhere, causation is rare. Regression models can easily return coefficients, but without clear assumptions and a clear understanding of the problem, they won’t tell you the cause of a given effect. For that, we need a fundamentally different kind of reasoning: causal inference.

For decades, causal inference lived more in academia than in business settings. Fortunately, this is changing, and this article aims to cover an important part of the practical applications of causal inference. We will explore the DoWhy library created by Microsoft Research.

DoWhy is a Python library that bridges the gap between causal theory and practical data science. It provides a systematic, four-step framework for answering causal questions, grounded in Judea Pearl’s foundations of causal inference.

What makes DoWhy exceptionally useful is that it forces us to be transparent. We can’t hide behind black-box models or hidden assumptions. We have to think structurally about the problem we have to tackle. 

In this article, we’re going to walk through a hands-on scenario where a standard data analysis points to a disastrous business decision, and show how DoWhy’s workflow (Model, Identify, Estimate, Refute) reveals the causal effect.

Scenario

Let’s start with preparing a real-world scenario. We will become data scientists at a growing SaaS company. The product team just introduced a new ‘Premium Support’ tier, promising faster resolution of technical issues. For an additional fee, of course. And the executives would like to know the answer to the usual question: ‘Is this product actually working?’ And our job is to answer it.

We will use the following simulated dataset to answer the question. It consists of 1000 recent support tickets with information on resolution time, company size, and a flag indicating whether the customer paid for the upgrade to the Premium Support program. 

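import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the simulation is reproducible
num_users = 1000

# Company size, drawn uniformly between 10 and 4999
company_size = np.random.randint(10, 5000, num_users)

# Larger companies are more likely to buy premium support (logistic model)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)

# Resolution time grows with company size; premium support cuts it by 5 hours
resolution_time = 10 + (0.005 * company_size) - (5 * premium_support) + np.random.normal(0, 1, num_users)

df = pd.DataFrame({
    'company_size': company_size,
    'premium_support': premium_support,
    'resolution_time': resolution_time
})

print(df.head())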

The biggest advantage of using simulated data is that we already know the answer. Company size drives adoption of premium support. Bigger companies are modelled to be more likely to subscribe. We already see that subscription to the premium support is not random. And what’s more, the company’s size increases resolution time purely because of its greater complexity.

And since we’re playing the role of God, we can see that the premium support program, by itself, reduces resolution time by 5 hours (the -5 coefficient in the resolution_time formula). We know this because we generated the data, and we can see the ground truth. In real life, of course, we never get to peek at this number. We have to estimate it from the data, which is precisely what we’re going to do now.

The Naive Trap

Now that we have our data, let’s do what every analyst does first – run a simple comparison between the two support tiers. 
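A minimal sketch of that first comparison, assuming the simulated data from the Scenario section is regenerated with the same seed. The names tier_means and naive_diff are ours, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))
df = pd.DataFrame({
    "company_size": company_size,
    "premium_support": premium_support,
    "resolution_time": resolution_time,
})

# Average resolution time per tier (0 = standard, 1 = premium).
tier_means = df.groupby("premium_support")["resolution_time"].mean()

# Naive "effect": the raw gap between the two averages, with no adjustment.
naive_diff = tier_means.loc[1] - tier_means.loc[0]

print(tier_means.round(2))
print(f"Naive difference (premium - standard): {naive_diff:.2f} hours")
```

Nothing in this comparison accounts for who buys the premium tier, which is why its result should be read with suspicion.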

The data suggests that paying for premium support increases the wait time. If we presented this to our stakeholders, the decision would be obvious. The premium support program will be terminated. It appears to be degrading the customer experience.

[Figure: naive comparison of resolution time between the two support tiers]

But we can still dig deeper. We know the program isn’t for everyone. Only the largest companies with big budgets can afford to purchase it. And larger companies usually have longer resolution times. This is a classic confounding variable that influences both the treatment and the outcome. 

When we split customers into tiers based on participation in the premium support program, we are not using comparable populations. The premium group is disproportionately composed of larger enterprises, which naturally lengthens its resolution time.
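One quick way to check this imbalance is to compare company size across the two tiers. The sketch below regenerates the simulated data and uses a hypothetical summary table called size_by_tier; it is a sanity check, not part of the DoWhy workflow.

```python
import numpy as np
import pandas as pd

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))
df = pd.DataFrame({
    "company_size": company_size,
    "premium_support": premium_support,
    "resolution_time": resolution_time,
})

# Ticket count and typical company size per tier (0 = standard, 1 = premium).
size_by_tier = df.groupby("premium_support")["company_size"].agg(
    ["count", "mean", "median"]
)
print(size_by_tier.round(0))
```

If the premium row shows a clearly larger average company size, the two groups are not comparable and a raw difference in resolution time mixes the effect of the tier with the effect of size.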

Observational data almost always have this nature, which makes discovering the true causal effect particularly tricky. The treatment has not been assigned randomly, because companies self-select into the premium support tier. 

This is where standard machine learning often fails us. If we fed our data into a regression model or any other model from an ML library such as scikit-learn, the algorithm, without any adjustments, would use program participation to predict that a customer will have a longer waiting time. That’s where causal inference comes to our rescue. We have to bring the causal mechanism behind the data into the analysis to discover the proper effect.

DoWhy Framework

DoWhy changes the game by forcing us to stop calculating and start modelling. It introduces a rigorous, four-step framework for identifying the true causal effect. It consists of the following steps, which we will discuss in more detail:

  1. Model. We encode our causal assumptions in the form of a DAG (directed acyclic graph). We use nodes and arrows to show which variables cause which in our model.
  2. Identify. Given our causal model, the library determines what we can estimate. DoWhy analyses our graph from step 1 and automatically determines whether the causal effect can be estimated from the data.
  3. Estimate. Calculate the causal effect based on the identification strategy.
  4. Refute. Test whether the estimated effect is robust, using placebo tests or sensitivity analysis.

This framework is universal and flexible. As we will see, each step connects to the others, making the causal inference project very transparent. The true advantage of DoWhy is that it forces us to be very explicit. You can’t do causal inference without assumptions.

And this library forces us to clearly state our assumptions. And each of them can be questioned or validated. It makes it easy to reproduce, falsify and modify our analytical process. 

In this section, we will work through all the steps of the process and see how DoWhy helps us discover the true causal effect.

Building the causal model

The first step of the DoWhy process is also the most philosophical one. Here, we don’t even have to use data, but rather think about the structure of the problem at hand. It might sound quite unnecessary at first, but it gives us a great opportunity to think about the problem we have and the ways to tackle it. We will basically create a picture of how the world works.

And the tool to do so is a simple graph, a DAG. This concept is quite simple. It is just a set of nodes connected with arrows. Each variable is represented by one node. And arrows represent a causal relationship. An arrow from node A to node B means that variable A causes variable B.

Conceptualising a DAG forces us to think about the theory before moving to the coding or estimation. We have to assume which variables cause which for our model to work. At first, this may look like an unnecessary and non-scientific approach. After all, why are we assuming anything before running any models? However, any machine learning or statistical model, or even a simple analysis, has hidden assumptions built in. Modelling them beforehand just forces us to be more transparent and clear. And the assumptions are not final. We can always modify them and test different variations. That’s actually the core of the scientific process, which should also be a part of daily analytical work.

For our scenario, the DAG has three nodes: company size, premium support, and resolution time. We have to think about how those variables should be connected. To start with, company size has to influence the resolution time. We already established that company size additionally influences the decision to join the premium support program. And one last arrow, the one we care about the most, goes from premium support to resolution time. And let’s remember: we are listing our assumptions. At a later stage, we will see whether this structure holds for our dataset.

Our simple diagram is depicted below. Please note that company size is a confounder. It affects both treatment and the outcome. Once we identify its influence, we will be able to control for it at the next stage.

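[Figure: causal diagram with arrows company size → premium support, company size → resolution time, and premium support → resolution time]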
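To make the diagram concrete before handing it to DoWhy, the three arrows can be written down as a plain edge list. The sketch below is our own illustration in pure Python (the helper names parents and is_acyclic are hypothetical). It checks that the graph has no cycles and finds the variables that point directly into both the treatment and the outcome, which is a simplified notion of a confounder.

```python
# The three arrows of the causal diagram, written as (cause, effect) pairs.
edges = [
    ("company_size", "premium_support"),
    ("company_size", "resolution_time"),
    ("premium_support", "resolution_time"),
]


def parents(node, edges):
    """Return the direct causes (parents) of a node."""
    return {src for src, dst in edges if dst == node}


def is_acyclic(edges):
    """Kahn's algorithm: the graph is a DAG if all nodes can be removed in topological order."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)
    return removed == len(nodes)


treatment, outcome = "premium_support", "resolution_time"

# Direct common causes: variables with an arrow into both treatment and outcome.
common_causes = parents(treatment, edges) & parents(outcome, edges)

print("Acyclic:", is_acyclic(edges))    # Acyclic: True
print("Common causes:", common_causes)  # Common causes: {'company_size'}
```

In larger graphs a confounder can also act through intermediate variables, so this direct-parent check is only a starting point.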

Coding the diagram above in the DoWhy language is very easy. The following code achieves it using a simplified approach. We could also pass an explicit graph string listing all the causal relations in our model, but for this example, it is enough to list the three variables from the chart above. We have to specify our treatment, outcome, and the confounder as common_causes. This is a simple model, but nothing stops us from adding more variables if we have a more complex dataset.

from dowhy import CausalModel

model = CausalModel(
    data=df,
    treatment='premium_support',
    outcome='resolution_time',
    common_causes=['company_size']
)
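For comparison, here is a sketch of the explicit graph route. It builds a GML description of the same three arrows and passes it through the graph argument, assuming the installed DoWhy version accepts a GML string there, as its documentation examples do. The helpers to_gml and build_model are hypothetical names of ours, and DoWhy is imported lazily so the string can be inspected without the library installed.

```python
# The same three arrows as in the diagram, written as (cause, effect) pairs.
edges = [
    ("company_size", "premium_support"),
    ("company_size", "resolution_time"),
    ("premium_support", "resolution_time"),
]


def to_gml(edges):
    """Turn an edge list into a single-line GML description of a directed graph."""
    nodes = sorted({n for edge in edges for n in edge})
    node_part = " ".join(f'node[id "{n}" label "{n}"]' for n in nodes)
    edge_part = " ".join(f'edge[source "{s}" target "{t}"]' for s, t in edges)
    return f"graph[directed 1 {node_part} {edge_part}]"


causal_graph = to_gml(edges)
print(causal_graph)


def build_model(df):
    """Build the causal model from the explicit graph (requires DoWhy)."""
    from dowhy import CausalModel  # lazy import: only needed when the model is built

    return CausalModel(
        data=df,
        treatment="premium_support",
        outcome="resolution_time",
        graph=causal_graph,  # assumption: this DoWhy version parses GML strings
    )
```

Calling build_model(df) with the simulated DataFrame should give a model equivalent to the common_causes version, with the advantage that every arrow is spelled out and can be reviewed.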

This step looks simple, but in reality, it is the most crucial step of every causal analysis. This graph doesn’t come directly from the data but from the domain knowledge about the problem at hand. It means that the graph can and should be discussed, modified, and updated by anyone involved in solving the question. It’s also a great time to involve less technical stakeholders, like product managers and marketing teams, as their knowledge can be indispensable here. 

With the graph prepared and the model built, we are ready for the next stage: determining whether the effect we want to measure is actually recoverable from the data we have.

Identification

The next step in the DoWhy world is very important and showcases one of the library’s best advantages. At this stage, most analysts would just run a linear regression controlling for confounding variables. But DoWhy forces us to pause for a moment and think whether the causal effect we are after can be calculated from the existing data. This is the goal of the identification step. We can think about it as a bridge between the causal diagram and the statistical analysis.

We already know that company size acts as a confounder, preventing us from comparing the two groups directly. In causal inference terms, it means that there is an open backdoor path. A backdoor path is any path from treatment to the outcome that starts with an arrow pointing into the treatment. It represents a non-causal association that contaminates the effect we are trying to measure. In our DAG, company size points to premium support and resolution time. 

That means there’s a path from premium support to resolution time that runs backwards through company size, and that path has nothing to do with whether premium support actually works. 
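The definition of a backdoor path is mechanical enough to sketch in a few lines of pure Python. The function below is our own illustration (backdoor_paths is a hypothetical name, not a DoWhy function). It walks the graph ignoring arrow direction, but only accepts paths whose first step travels against an arrow pointing into the treatment.

```python
# The three arrows of the causal diagram, written as (cause, effect) pairs.
edges = [
    ("company_size", "premium_support"),
    ("company_size", "resolution_time"),
    ("premium_support", "resolution_time"),
]


def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome that start with an arrow INTO the treatment."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    directed = set(edges)
    found = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            found.append(path)
            return
        for nxt in sorted(neighbours.get(node, ())):
            if nxt in path:
                continue  # keep paths simple: no node is visited twice
            # The first step must go against an arrow, i.e. nxt -> treatment.
            if len(path) == 1 and (nxt, node) not in directed:
                continue
            walk(path + [nxt])

    walk([treatment])
    return found


paths = backdoor_paths(edges, "premium_support", "resolution_time")
for path in paths:
    print(" - ".join(path))  # premium_support - company_size - resolution_time
```

The direct arrow from premium support to resolution time is not listed, because it starts with an arrow leaving the treatment. That is the causal path we want to keep.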

To solve it, we have to close the backdoor path using the so-called backdoor criterion. The formal definition is relatively abstract, but we don’t have to worry about it, as DoWhy can automatically check whether it can be applied here. Basically, we have to find a set of variables that blocks every backdoor path. If we can, then conditioning on them will give us a clean causal effect. In our case, we have one such variable – company size. If we control for it, the non-causal relationship disappears, and the only remaining path from premium support to the resolution time is the direct causal one.
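To see what conditioning on company size means in practice, here is a minimal stratification sketch on the simulated data. It compares premium and standard tickets only within groups of similar-sized companies and then averages those gaps. The choice of 10 bins and the variable names are our assumptions, and this is an illustration of the idea rather than what DoWhy runs internally.

```python
import numpy as np
import pandas as pd

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))
df = pd.DataFrame({
    "company_size": company_size,
    "premium_support": premium_support,
    "resolution_time": resolution_time,
})

# Split companies into 10 equally sized bins by company size.
df["size_bin"] = pd.qcut(df["company_size"], q=10, labels=False)

# Within each bin, compare premium vs standard tickets of similar-sized companies.
by_bin = (
    df.groupby(["size_bin", "premium_support"])["resolution_time"]
    .mean()
    .unstack()
)
within_bin_diff = by_bin[1] - by_bin[0]

# Weight each bin by its share of tickets to get one adjusted difference.
weights = df["size_bin"].value_counts(normalize=True).sort_index()
adjusted_diff = (within_bin_diff * weights).sum()

print(within_bin_diff.round(2))
print(f"Size-adjusted difference: {adjusted_diff:.2f} hours")
```

Because the bins still contain companies of slightly different sizes, a small amount of confounding remains inside each bin. Finer bins or a regression adjustment reduce it further.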

This will, of course, only work if we can measure such variables. If company size were unobserved, the backdoor path would remain open, and we wouldn’t be able to detect or close it. This is a crucial point about any causal inference technique. We can only control for variables we can measure. In practice, we have to assume that there are no unmeasured confounders in our causal structure.

Running identification in DoWhy is very simple and takes only one line.

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

print(identified_estimand)

The output of the code above looks like this, and it’s worth unpacking.

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
─────────────────(E[resolution_time|company_size])
d[premiumₛᵤₚₚₒᵣₜ]
Estimand assumption 1, Unconfoundedness: If U→{premium_support} and U→resolution_time then P(resolution_time|premium_support,company_size,U) = P(resolution_time|premium_support,company_size)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

### Estimand : 4
Estimand name: general_adjustment
Estimand expression:
d
─────────────────(E[resolution_time|company_size])
d[premiumₛᵤₚₚₒᵣₜ]
Estimand assumption 1, Unconfoundedness: If U→{premium_support} and U→resolution_time then P(resolution_time|premium_support,company_size,U) = P(resolution_time|premium_support,company_size)

In the first part of the output, DoWhy reports that the backdoor criterion applies to our causal diagram. It correctly identifies that company size is the variable we need to control for to estimate the causal effect. The rest of the output is less relevant for us at this point. It shows that the instrumental-variable (iv) and front-door strategies aren’t available in our case, which is entirely true, and that the general adjustment estimand repeats the same backdoor expression.

At this stage, we confirmed that the causal question is answerable and that we should answer it by applying the backdoor criterion. We are now ready to finally calculate our causal effect.

Estimation

Identification told us what to calculate. At the estimation step, we will actually calculate it. 

The recipe DoWhy handed to us in the previous step was a backdoor adjustment. We have to control for company size, and the remaining causal effect will be attributable to the premium program. Now it’s time to apply it to the data. It is worth noting that this is the first step in the DoWhy approach where we calculate anything from the dataset. The simplest and fastest way to apply the backdoor adjustment is the good, old linear regression. This is done by the following code.

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)

print(f"Causal Estimate: {estimate.value:.2f} hours")

The output shows -4.93 hours, indicating that the premium program saves nearly 5 hours of resolution time. And since we know the actual effect from the data-generating code, we can see that it aligns with the 5-hour reduction we hard-coded. The small difference comes from the random noise in our simulated sample.
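For intuition, the backdoor adjustment with a linear model boils down to an ordinary least-squares regression of resolution time on the treatment and the confounder. The sketch below does this by hand with NumPy on the regenerated data and contrasts it with a regression that leaves company size out. It is our own illustration with hypothetical variable names, not DoWhy’s internal code.

```python
import numpy as np

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))

# Unadjusted model: resolution_time ~ intercept + premium_support
X_naive = np.column_stack([np.ones(num_users), premium_support])
naive_coef, *_ = np.linalg.lstsq(X_naive, resolution_time, rcond=None)

# Adjusted model: resolution_time ~ intercept + premium_support + company_size
X_adj = np.column_stack([np.ones(num_users), premium_support, company_size])
adj_coef, *_ = np.linalg.lstsq(X_adj, resolution_time, rcond=None)

naive_effect = naive_coef[1]
adjusted_effect = adj_coef[1]
size_effect = adj_coef[2]

print(f"Unadjusted coefficient on premium support: {naive_effect:.2f}")
print(f"Adjusted coefficient on premium support:   {adjusted_effect:.2f}")
print(f"Coefficient on company size:               {size_effect:.4f}")
```

The adjusted coefficient should land close to the -5 we hard-coded, while the unadjusted one absorbs the effect of company size.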

Compare this with the naive analysis, which showed that premium support has a longer resolution time. We have found a way to estimate the true causal effect. This way, we also saved a valuable product for the company, which could have been lost if we had relied only on the simple, naive comparison.

The following chart showcases it intuitively. Resolution time increases as the company size increases. However, companies using the premium support program have consistently lower resolution times at each level of company size. That’s exactly what we discovered while applying the backdoor adjustment. The gap between the two lines is the causal effect of the premium support program.

[Figure: resolution time versus company size, with separate lines for premium and standard support]

Linear regression is the most straightforward and common estimation method and works well when the relationship in our data is approximately linear. But it is not the only option in DoWhy’s estimation toolbox. The library has additional estimation methods, such as propensity score matching. We will explore those methods in the upcoming articles.

We have to remember one very important thing when applying causal inference. Our estimate is only as good as the causal graph that we use to create the estimand. If we hadn’t included company size as a confounder, the backdoor adjustment wouldn’t have produced the correct results. Everything in the DoWhy approach is a chain, and all subsequent steps depend on the previous ones. That’s why we have an additional step to help assess the robustness of our estimate.

Refutation

DoWhy doesn’t treat our estimate as the endpoint of our analysis. In the last step, we need to determine whether the result we obtained is reliable. That’s exactly what we’re going to do at the refutation step.

We are going to try to find ways to make our results appear invalid. The goal here is not to invalidate the work we’ve done, but to ensure it is reliable. If our estimate survives this step, our confidence in its validity will be much higher. 

DoWhy offers a few refutation tests, and we will go through the main types.

Placebo Test

The first test is called a placebo treatment refuter, and it follows a similar logic to the placebo testing in medical trials. It asks a simple question: what happens if we replace the actual treatment with a random one?

refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    num_simulations=20
)

print(refutation)

The logic is straightforward. If the premium support program truly reduces resolution time, then replacing it with a randomly generated variable should yield a result close to zero. After all, only an impactful treatment should change the outcome of interest. If it changed even with random treatment, it would mean the effect is due to other factors, not the treatment we are analysing. 

Refute: Use a Placebo Treatment
Estimated effect:-4.934458380783667
New effect:0.009059204058587334
p value:0.47035401890486206

In our case, the placebo effect is close to zero, which is exactly what we want to see. That’s a green light for the results of our analysis. 
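The same idea can be reproduced by hand to see why it works. The sketch below shuffles the treatment column, so the placebo keeps the same share of subscribers but has no link to the outcome, and re-runs a size-adjusted regression 20 times. Shuffling is one possible way to build a placebo, and the helper adjusted_effect is a hypothetical name of ours, so this is an approximation of the test rather than DoWhy’s exact procedure.

```python
import numpy as np

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))


def adjusted_effect(treatment, size, outcome):
    """Coefficient on the treatment in an OLS regression that also controls for size."""
    X = np.column_stack([np.ones(len(outcome)), treatment, size])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]


original = adjusted_effect(premium_support, company_size, resolution_time)

# Replace the real treatment with a shuffled copy and re-estimate, 20 times.
rng = np.random.default_rng(0)
placebo_effects = [
    adjusted_effect(rng.permutation(premium_support), company_size, resolution_time)
    for _ in range(20)
]
mean_placebo = float(np.mean(placebo_effects))

print(f"Original effect:     {original:.2f}")
print(f"Mean placebo effect: {mean_placebo:.2f}")
```

A placebo effect near zero next to an original effect near -5 is the pattern we hope for. A placebo effect far from zero would suggest the model finds effects where none exist.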

Random Common Cause Test

The second test adds a randomly generated variable as a common cause to our dataset and reestimates the effect. The rationale here is simple. Such a variable should have no effect on either treatment or outcome by design. If the estimate changed meaningfully when adding such a variable, it would suggest that the estimate was never stable. If it changed due to the presence of a random variable, it could also change simply by including or excluding variables in the model. It would be a large red flag for the validity of our estimates.

refutation_rcc = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause",
    num_simulations=20
)
print(refutation_rcc)

In our case, adding a random common cause didn’t change the result, as the new effect is virtually identical to the original one. In other words, adding a meaningless variable changed nothing. And it’s exactly what we wanted to see.

Refute: Add a random common cause
Estimated effect:-4.934458380783667
New effect:-4.934622268604485
p value:0.4435080914538394
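A hand-rolled version of this check makes the logic tangible. The sketch below adds a column of pure noise as an extra control, re-runs a size-adjusted regression, and repeats this 20 times. The helper effect_with_controls is a hypothetical name of ours, and the code is an approximation of the idea, not DoWhy’s implementation.

```python
import numpy as np

# Recreate the simulated tickets from the Scenario section.
np.random.seed(42)
num_users = 1000
company_size = np.random.randint(10, 5000, num_users)
size_norm = (company_size - company_size.mean()) / company_size.std()
p_buy = 1 / (1 + np.exp(-(size_norm + 0.5)))
premium_support = np.random.binomial(1, p_buy)
resolution_time = (10 + 0.005 * company_size - 5 * premium_support
                   + np.random.normal(0, 1, num_users))


def effect_with_controls(treatment, controls, outcome):
    """Coefficient on the treatment in an OLS regression with the given control columns."""
    X = np.column_stack([np.ones(len(outcome)), treatment, controls])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]


original = effect_with_controls(premium_support, company_size, resolution_time)

# Add a random "common cause" that is unrelated to everything, then re-estimate.
rng = np.random.default_rng(0)
new_effects = []
for _ in range(20):
    noise = rng.normal(size=num_users)
    controls = np.column_stack([company_size, noise])
    new_effects.append(effect_with_controls(premium_support, controls, resolution_time))
mean_new = float(np.mean(new_effects))

print(f"Original effect:                 {original:.3f}")
print(f"Effect with random common cause: {mean_new:.3f}")
```

The two numbers should be nearly indistinguishable, since a variable made of noise has nothing to explain.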

Sensitivity Analysis

The third test analyses our results from a slightly different angle. Sensitivity analysis asks how strong an unobserved confounder would have to be to make our result disappear.

This is important because the unconfoundedness assumption, the claim that we have measured and accounted for all the confounding variables, can never be fully verified from data alone. In reality, there might be other variables we didn’t measure.

Sensitivity analysis quantifies how much hidden confounding our result can tolerate before the effect disappears.

res = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="add_unobserved_common_cause",
    confounders_effect_on_treatment="binary_flip",
    confounders_effect_on_outcome="linear",
    effect_strength_on_treatment=[0.1, 0.2, 0.3],
    effect_strength_on_outcome=[1, 2, 3]
)

print(res)

This code introduces a few new features, so let’s focus on understanding what we are trying to do here. With the method name add_unobserved_common_cause, we tell DoWhy to simulate a hidden confounder, a variable that influences both treatment and outcome but is not present in our dataset.

The next two parameters describe the shape of that simulated confounder. Setting confounders_effect_on_treatment to binary_flip means the hidden variable occasionally flips a customer’s premium support status, turning a subscriber into a non-subscriber and vice versa.

The strength values [0.1, 0.2, 0.3] control how aggressively this happens: at 0.1, it disrupts 10% of treatment assignments, etc.

Setting confounders_effect_on_outcome to linear means the hidden variable also adds a direct linear effect to resolution time, with strength values of 1, 2, and 3 additional hours, respectively. DoWhy tests all combinations of these strengths, from a mild hidden confounder, all the way to a severe one dominating both sides of the relationship simultaneously. All of this leads to the following output.

Refute: Add an Unobserved Common Cause
Estimated effect:-4.934458380783667
New effect:(np.float64(-3.2578649198201752), np.float64(0.33982470348470173))

The new effect shows the range of estimates across the nine confounding scenarios we specified. At the mildest end, with a weak confounder, the estimated effect is still relatively large at -3.26. At the most severe corner case, it moves to slightly above 0 (0.34). The upper bound is what deserves most of our attention. It marks the point at which the treatment’s impact disappears, where the hidden confounder is strong enough to absorb the treatment effect.

However, it represents the most drastic scenario, with 30% of treatment assignments flipped and 3 hours of resolution time added. A confounder that powerful is rare in reality, especially if we know the domain well.

So, we have to ask ourselves a question at this point. Could our domain contain an unmeasured confounder strong enough to exert such influence? Of course, it is not always so simple, and we can never be sure. But sensitivity analysis provides a useful tool for quantifying this.
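To build intuition for why a hidden confounder moves the estimate, we can simulate one ourselves. The sketch below is a simplified stand-in, not DoWhy’s simulation: it generates a fresh dataset with the same structure as ours plus a hypothetical variable called hidden that pushes companies towards premium support and also lengthens resolution time. We then estimate the effect while adjusting for company size only. The sample size of 5000 and the strength grids are arbitrary choices.

```python
import numpy as np


def estimate_with_hidden_confounder(strength_t, strength_y, n=5000, seed=0):
    """Size-adjusted effect estimate when a hidden confounder is left out of the model."""
    rng = np.random.default_rng(seed)
    company_size = rng.integers(10, 5000, n)
    size_norm = (company_size - company_size.mean()) / company_size.std()
    hidden = rng.normal(size=n)  # hypothetical confounder we pretend not to observe

    # The hidden variable raises the chance of buying premium support...
    p_buy = 1 / (1 + np.exp(-(size_norm + 0.5 + strength_t * hidden)))
    premium = rng.binomial(1, p_buy)

    # ...and also adds to resolution time. The true effect of premium stays at -5.
    resolution = (10 + 0.005 * company_size - 5 * premium
                  + strength_y * hidden + rng.normal(0, 1, n))

    # Adjust for company size only; `hidden` is deliberately omitted.
    X = np.column_stack([np.ones(n), premium, company_size])
    coef, *_ = np.linalg.lstsq(X, resolution, rcond=None)
    return coef[1]


strengths_t = (0.0, 0.5, 1.0)
strengths_y = (0.0, 1.0, 3.0)
grid = {
    (s_t, s_y): estimate_with_hidden_confounder(s_t, s_y)
    for s_t in strengths_t
    for s_y in strengths_y
}

for (s_t, s_y), effect in grid.items():
    print(f"treatment strength {s_t:.1f}, outcome strength {s_y:.1f}: {effect:.2f}")
```

With both strengths at zero, the estimate should sit near the true -5. As both strengths grow, the estimate drifts towards zero even though the true effect never changes, which is exactly the failure mode sensitivity analysis is meant to expose.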

Taken together, all the refutation tests tell a consistent story. The estimate we obtained seems very reliable. It is not an artefact of randomness, and although it is not robust to every conceivable confounder, it holds up against all but the strongest ones we simulated. That’s almost as much confidence as we can get from observational data. And way more than any naive comparison would ever offer.

Summary

Most data science libraries give us a function and a result. DoWhy gives us not only this, but also a framework for thinking.

The four-step workflow is simple and designed around a powerful idea. The assumptions behind any causal analysis are too important to remain hidden. By forcing us to draw a causal graph, DoWhy makes those assumptions concrete, allowing them to be read, challenged, and modified as needed. Such transparency is rare and often underappreciated.

And what’s even more important about this library is its ability to validate our results through refutation tests. Any estimate that survives all those refutation ‘attacks’ gains a lot of additional credibility.

Taken together, DoWhy is a very useful library for causal inference. It not only helps us get the right answer, but also helps us understand how we reached it. The article covered only the basics of DoWhy, but it is certainly enough to start exploring it and developing a deeper understanding of its features. 
