Evaluating the impact of treatments or interventions is critical in various fields, including business and healthcare. Determining whether a specific action produces the desired effect is essential for making informed decisions. While randomized experiments are considered the gold standard for such evaluations, they are not always feasible.

Various causal inference methods can be utilized to estimate treatment effects in these cases. Propensity score matching is one of the most widely used and important of them. This method allows us to create comparable treatment and control groups based on observed characteristics.

This article provides a guide to one of the most powerful methods in the causal inference toolkit: propensity score matching.

**What is Propensity Score Matching?**

Propensity score matching (PSM) allows us to construct an artificial control group based on the similarity of the treated and non-treated individuals. When applying PSM, we match each treated unit with a non-treated unit of similar characteristics.

This way, we can obtain a control group without the randomized experiment. This artificial control group would consist of the non-treated units that resemble the treated group as much as possible.

While the concept of PSM may seem straightforward, its successful implementation is often complex. The key challenge lies in the details of the method, particularly in finding suitable matches based on the available data and each unit’s pre-treatment characteristics.

**Evaluating the Job Training Program**

To explore propensity score matching, we will use the famous Lalonde dataset. The dataset originates from a study by Robert LaLonde (1986), which aimed to assess the effectiveness of a job training program on earnings. We will use the dataset used by Dehejia and Wahba in their paper *“Causal Effects in Non-Experimental Studies: Reevaluating the Evaluation of Training Programs.”*

The study evaluated the National Supported Work (NSW), a program designed to help disadvantaged workers, including ex-offenders, former drug addicts, and high school dropouts, find stable employment. The primary goal was to determine whether the program positively impacted participants’ earnings.

The data we will use does not come from a randomized experiment. The authors of the study mentioned above combined the data on randomly selected program participants with information about non-treated individuals from the Current Population Survey. They also enhanced the dataset with pre-treatment variables, making it a perfect testbed for the propensity score method.

**Causal Diagrams**

Our goal is to estimate the effect of the job training program on income. To conceptualize the analyzed program, let’s briefly stay in a more abstract, symbolic notation.

We will name all pre-treatment variables (age, pre-treatment earnings, etc.) X.

The treatment, which will be assigned to the selected group, will be denoted as T in our symbolic notation. This variable is crucial as it represents the core intervention of the job training program.

And we will denote our outcome of interest, post-treatment earnings, as Y.

We can visualize the data set using causal diagrams. On the abstract level, our data is structured as follows: Pre-treatment variables influence treatment participation decisions and future earnings:

For example, let’s assume that people with higher education (X) are more likely to opt into the treatment than people without a degree. Generally, the higher the education level, the higher the earnings. Hence, people with a degree will most likely earn more after the treatment period, regardless of whether they participate in the program.

We call such a variable a **confounder**, as it affects both the treatment decision and the outcome. The presence of an unaddressed confounder makes an unbiased calculation of causal effects impossible.

One effective method for dealing with confounders is to conduct a randomized experiment. This approach severs the link between X and T, as the treatment is now independent of the pre-treatment characteristics.

In our example, this would ensure that the distribution of education levels is balanced across the treated and untreated groups, allowing for a more accurate evaluation of the causal effect.

However, we don’t have the comfort of conducting a randomized experiment to solve the problem we’re analyzing. The treatment and control groups differ in terms of the pre-treatment variables. For example (and we will explore this more soon), the treatment group might contain a larger proportion of older people. Therefore, we can’t simply compare the earnings in both groups to calculate the treatment effect.

Enter propensity score matching. This method allows us to create an artificial control group by matching both groups. Specifically, for each unit of the treatment group, we will assign the most similar unit from the control group. This technique helps us balance the pre-treatment characteristics in a non-randomized setting, which enables a more accurate estimate of the average treatment effect.

**Dataset**

Enough of the theoretical background. It’s an excellent time to delve into the data. We will explore the dataset and then apply the propensity score matching to estimate the effect of the job program.

The dataset is relatively simple, which makes it perfect for exploring propensity score matching. It contains information about job training program participants and the control group, with variables describing each unit.

The dataset includes the following variables:

- Treat: treatment status (1 if treated, 0 if not treated). This variable indicates whether an individual participated in the job training program.
- Age: age in years.
- Edu: years of education.
- Race: race of the individual.
- Married: marital status (1 if married, 0 otherwise).
- Nodegree: indicator for educational level (1 if the individual does not have a high school degree, 0 otherwise).
- Re74: earnings in 1974.
- Re75: earnings in 1975.
- Re78: earnings in 1978.

As part of the data preparation, we will transform the race variable into a set of dummy variables, as it will make the upcoming analysis easier:
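A minimal sketch of that step with pandas’ `get_dummies`, shown on a few toy rows that mimic the dataset’s schema (in practice, you would load your own copy of the data instead):

```python
import pandas as pd

# Toy rows mimicking the Lalonde schema (the real data would be loaded
# from your own copy of the dataset)
df = pd.DataFrame({
    "treat": [1, 0, 0],
    "age": [25, 31, 28],
    "race": ["black", "white", "hispan"],
})

# Turn the categorical race column into a set of 0/1 dummy variables
df = pd.get_dummies(df, columns=["race"], prefix="race", dtype=int)
```

After this step, `race` is replaced by columns such as `race_black`, `race_hispan`, and `race_white`, which the regression and matching steps can consume directly.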

**Initial Comparison**

Our primary goal is to estimate the treatment’s effect on the post-intervention earnings. However, we cannot compare the treated and non-treated groups directly due to the absence of a randomly selected control group. Nonetheless, we will use this comparison as a starting point for our analysis.

The initial findings are quite unexpected. Contrary to our expectations, the job training program seems to have decreased average earnings. The difference between the two groups is nearly 500 USD, with the untreated individuals seemingly benefiting more.

We can also compare the results by running a simple regression analysis with the treatment indicator as the only independent variable. It will also reveal the effect of a given treatment—provided that all the critical assumptions are met.
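A sketch of that naive regression with statsmodels, run on simulated data where the treated group earns slightly less on average (the variable names mirror the dataset’s columns, but the numbers are synthetic):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for the real dataset: treated units
# earn about 500 less on average, mirroring the naive comparison
rng = np.random.default_rng(0)
df = pd.DataFrame({"treat": rng.integers(0, 2, 500)})
df["re78"] = 5000 - 500 * df["treat"] + rng.normal(0, 1000, 500)

# Naive regression: with a single binary regressor, the treat
# coefficient equals the raw difference in group means
model = smf.ols("re78 ~ treat", data=df).fit()
print(model.params["treat"])
```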

The regression output quantifies the data from the chart above. Without controlling for any additional variables, the treatment has a negative effect. Can we trust those results and conclude that the job training negatively impacted future earnings?

If that were the case, I wouldn’t be writing an article about propensity score matching. 🙂 Those results would be trustworthy only if both groups had the same distribution of observed and unobserved variables. This situation is almost only achievable through randomization.

**The Balance Between Groups**

One way to check if the groups in an experiment are comparable is to check the distribution of the pre-treatment variables. If they differ, we can conclude that both groups are different, and we have to dig deeper to explore any causal relationship.

The section below shows density plots, graphical representations of the distribution of each continuous variable by treatment group assignment. All the pre-treatment variables show somewhat different distributions across the two groups, indicating potential imbalance.

Upon analysis, it becomes apparent that the average age of individuals in the control group is slightly higher than that of the treated group – 28 years old compared to 25 years old. However, the treated group does exhibit a longer tail of older individuals.

The years of education variable is similar across the treated and control groups. Both groups have very similar average years of education – 10 years. Nevertheless, the distribution of this variable across both groups still does not look very similar.

We also have access to data measuring pre-treatment earnings from 1974 and 1975. In both years, individuals in the treatment groups earned, on average, less than the control group.

Our dataset also contains categorical variables, which are even easier to compare: we can simply compare their means in the control and treated groups. We can see that individuals assigned to the treated group are likelier to be unmarried and have a lower education level. The race composition of the two groups is also different, which might affect other socioeconomic variables.

Another standard way to check the similarity of groups in experiments is to run a statistical test comparing the means or proportions between both groups. The null hypothesis in all our tests states that the means or proportions in the control and treated groups are equal. Hence, a low p-value will lead us to reject this hypothesis and conclude that the groups differ.

For the numerical variables, we will conduct the t-test using the SciPy library:
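A sketch of such a comparison, here using Welch’s t-test on simulated pre-treatment earnings (the group sizes and distribution parameters are illustrative assumptions, not the study’s numbers):

```python
import numpy as np
from scipy import stats

# Simulated pre-treatment earnings: treated units earn less on average
rng = np.random.default_rng(42)
re74_treated = rng.normal(2100, 1500, 185)   # hypothetical group sizes
re74_control = rng.normal(5600, 2500, 429)

# Welch's t-test (does not assume equal variances in the two groups)
t_stat, p_value = stats.ttest_ind(re74_treated, re74_control, equal_var=False)
```

A small p-value here leads us to reject the null hypothesis of equal means for that covariate.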

The results indicate insufficient evidence to conclude that units in the treated and control groups are different regarding years of education. Age and earnings before the treatment are significantly different, which biases the results of the simple comparison between groups.

Let’s follow a similar procedure for the categorical variables.
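For a 0/1 categorical variable, a two-sample proportions z-test does the job. A sketch with statsmodels, using hypothetical counts of married individuals in each group:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: number of married individuals in each group
married_counts = np.array([35, 220])   # treated, control
group_sizes = np.array([185, 429])

# Two-sided z-test comparing the proportion of married units
z_stat, p_value = proportions_ztest(married_counts, group_sizes)
```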

All categorical variables display significant differences between units assigned to the control and treatment groups. Thus, we can confidently conclude that there are substantial differences between them.

Those disparities significantly affect our analysis. Based on the comparison of the pre-treatment variables, we can see that both groups are very different in terms of observable characteristics. Hence, we can’t use a simple comparison to estimate the effect of the treatment. In the causal inference universe, both groups are not **exchangeable**.

Exchangeability implies that the treated and control groups are equivalent in the distribution of confounders. If two groups are exchangeable, any differences in outcomes between them can be attributed to the treatment effect rather than to confounding variables.

In such a case, a control group can be used as a counterfactual to the treatment group. We assume the control group would have behaved the same as the treatment group had it been treated. Likewise, the treatment group would exhibit the same values in the outcome variable as the control group had it not been treated. This is not the case in our example.

I will simplify a lot here, but the crucial part of the causal inference is finding the equivalent control group in cases like ours (unless we have data from a randomized experiment). We have to find a way to control for confounders and make both groups comparable.

**Linear Regression Adjustment**

The most straightforward and traditional way to adjust for differences in the distributions of confounding variables is to use all of them as independent variables in the linear regression model.

In simplified terms, regression isolates the effect of each variable on the outcome at fixed levels of all other independent variables. We can expand the regression we ran before by adding all the remaining variables.

The coefficient of the treatment indicator will showcase the effect of job training on earnings, controlling for the confounders.
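A sketch of that adjusted regression on simulated data with a single confounder (age), where the true effect is set to 1,500 so we can see the adjustment recover it while the naive model understates it:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated confounding: older people are less likely to be treated
# but earn more, so the naive comparison understates the true +1500 effect
rng = np.random.default_rng(1)
n = 2000
age = rng.normal(30, 8, n)
p_treat = 1 / (1 + np.exp(0.15 * (age - 30)))   # older -> less likely treated
treat = rng.binomial(1, p_treat)
re78 = 3000 + 150 * age + 1500 * treat + rng.normal(0, 1000, n)
df = pd.DataFrame({"age": age, "treat": treat, "re78": re78})

# Adjusted regression: controlling for the confounder recovers the effect
adjusted = smf.ols("re78 ~ treat + age", data=df).fit()
naive = smf.ols("re78 ~ treat", data=df).fit()
```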

We can see that the effect is entirely different from the naive comparison of both groups. After controlling for the differences in observed variables, the job training program achieved positive results: it increased average earnings by over 1,500 USD.

While regression is a commendable method for adjusting for unbalanced confounders and accurately indicating the treatment’s direction, it does come with a caveat. It assumes a linear relationship between the covariates and the outcome. Expanding the linear model by experimenting with different functional forms of the independent variables is an option, but it complicates the model’s interpretation.

However, to overcome the linearity assumption and to explore a causal effect more intuitively, we can turn to the hero of this article – the propensity score matching method. It offers a different approach that could address the limitations of the linear model.

**Propensity Score Matching**

Propensity score matching is one of the most intuitive causal inference methods. Its application is also relatively straightforward and can yield valuable and practical insights. The general idea of this approach is to estimate the probability of receiving the treatment based on measured variables. Estimating this likelihood allows us to compute the **propensity score** for each unit.

Afterward, we will match units from the control group with the most similar propensity scores to the units in the treatment group. This is the **matching** step. This way, we will create an artificial control group with a distribution of pre-treatment variables similar to the treatment group’s.

After matching, we can compare the outcome variable in the treatment groups to the one in the matched control groups to obtain the effect of the treatment.

To summarize, we will apply the following steps:

- Estimating propensity scores
- Checking the balance of the propensity scores
- Matching the units based on the propensity score
- Calculating the effect of the treatment using the matched units

Let’s get it done!

**Propensity Score Estimation**

To estimate the propensity score, we have to calculate the probability of being treated for each unit given the set of pre-treatment variables. This step is usually done by applying a machine learning algorithm with the binary treatment indicator as the dependent variable. Any classification algorithm will do the job; however, in the causal inference world, the simpler, the better.

Estimating the propensity score allows us to control for the imbalance in the pre-treatment variables in the treatment and control groups. Since the groups are imbalanced, controlling for confounders will allow us to overcome this imbalance by finding an actual probability of being assigned to the treatment group, given the data we have at hand.

In other words, propensity scores will allow us to make the distributions of pre-treatment variables comparable, similar to the outcome we would have if we conducted a randomized experiment. It sounds compelling—using observational data, we will have results similar to the one from A/B testing.

We will see that the standard logistic regression is more than enough to calculate propensity scores. We could expand this method by applying more complex tree-based algorithms like random forest or gradient boosting. However, this would require combating overfitting. In the case of causal inference, the accuracy of the model or its predictive power is not crucial. Given all the critical variables, we just have to find a probability of treatment.

Applying the logistic regression in the case of PSM is easy. We can use the LogisticRegression() class from the sklearn library and apply the model to our data. Since we are not very interested in the predictive accuracy of this step and logistic regression is a relatively simple model, there is no need to conduct any split into training and test sets.
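A minimal sketch of this step, fitting `LogisticRegression` on simulated covariates (the column names mirror the dataset, but the data itself is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated covariates and treatment (stand-ins for the real columns)
rng = np.random.default_rng(7)
n = 1000
X = pd.DataFrame({
    "age": rng.normal(28, 7, n),
    "educ": rng.normal(10, 2, n),
    "re74": rng.exponential(3000, n),
})
logits = 0.5 - 0.05 * X["age"] - 0.0002 * X["re74"]
treat = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Fit a plain logistic regression of treatment on the covariates and
# store each unit's propensity score (predicted probability of treatment)
ps_model = LogisticRegression(max_iter=1000).fit(X, treat)
X["propensity_score"] = ps_model.predict_proba(X)[:, 1]
```

Note that we fit and predict on the same data: we are not forecasting, only summarizing each unit’s treatment probability.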

The last column in the data frame contains the propensity score. How can we interpret this number? Given our covariates, it shows the probability that a given individual would receive the treatment. For example, in the case of the first unit in the table, the likelihood of receiving the treatment given the covariates equals 0.53, even though this person actually received the treatment.

**Checking Common Support**

After obtaining the propensity scores, we must check how well they are distributed across the treatment and control groups. There has to be a substantial overlap in the propensity score distributions—some units in the control group must have a relatively high probability of receiving the treatment. Only in such a case can we proceed to the matching step. Such an overlap is called **common support**.

The range of the propensity scores in both groups must overlap. We need this to find similar units across the treatment and control groups. Without such an overlap, there would not be enough information to calculate the causal effect.

Plotting histograms of the propensity scores by the original group assignment is the best visual aid for checking common support.
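A sketch of such a plot with matplotlib, using illustrative Beta-distributed scores for each group (the real scores would come from the fitted model):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Illustrative propensity scores for each group
rng = np.random.default_rng(3)
ps_treated = rng.beta(4, 3, 185)   # skewed toward higher scores
ps_control = rng.beta(3, 4, 429)   # skewed toward lower scores

# Overlaid histograms reveal the region of common support
fig, ax = plt.subplots()
ax.hist(ps_control, bins=30, alpha=0.5, label="control")
ax.hist(ps_treated, bins=30, alpha=0.5, label="treated")
ax.set_xlabel("propensity score")
ax.legend()
# fig.savefig("common_support.png") or plt.show() would render it
```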

The chart below shows that individuals in the control group generally have a lower probability of being treated, and vice versa: individuals in the treated group generally have a higher likelihood of being treated. This is something we could expect, and it shows that the logistic regression is relatively well-calibrated.

The chart’s most crucial information is the existence of the overlapping bars. A substantial portion of units have a relatively high probability of being treated despite being assigned to the control group. The same goes another way: there are units in the treated groups that are less likely to be treated. It will enable us to find similar individuals in both groups, which is crucial for calculating the treatment effect.

A few paragraphs back, I mentioned that logistic regression would be our go-to algorithm. Now is an excellent time to explain that statement, since fancier algorithms are often preferred in the machine learning world. To showcase this, we can calculate the propensity scores using a random forest.

The following chart shows the propensity scores generated with the random forest algorithm.

This chart shows what a flawed propensity score distribution looks like. As with the logistic regression, the probability of being treated is highly correlated with the actual treatment assignment. However, contrary to the previous approach, the overlap between groups is minimal. Calculating the treatment effect based on such a distribution of propensity scores is not feasible.

The random forest overfits the data. We could overcome this problem by conducting cross-validation or restricting the model. Nevertheless, this is unnecessary, as the logistic regression worked well for our case. It shows that a simple algorithm is often enough to estimate the probability of being treated. From this point onwards, we will use the scores generated by the logistic regression.

**Matching**

The ‘propensity score’ part of the PSM approach is done. It is time to move to the ‘matching’ step to construct a new and better control group.

We saw previously that the original treatment and control groups were unbalanced. Almost all pre-treatment variables were different between them. The matching step aims to match units based on the previously generated propensity scores to create a balanced distribution of pre-treatment variables.

Given that the propensity score indicates the probability of being treated, individuals with similar propensity scores are likely to exhibit similar characteristics. When we identify a treated individual and someone similar but originally in the control group, the difference between them could only be attributed to the treatment. The disparity in the outcome variable between such units will indicate the treatment effect.

To understand this step, we can look at the following dummy table. It contains ten units, five assigned to the treatment group and five to the control group (**Treated** column). The data does not come from a randomized experiment, so each unit has an assigned propensity score. We can assume it was calculated based on pre-treatment variables.

The unit with ID 1 is in the treated group and has a high propensity score of 0.8. The goal of the matching step is to find the individual from the control group who is as similar as possible, i.e., the control unit with a propensity score as close as possible to 0.8. That would be unit 6, with a propensity score of 0.78.

Then, we move to the unit with ID 2. Unit 7 from the control group is the most similar, with a propensity score of 0.74. The process continues until every treated unit has a match. The matched units from the control group then form the modified control group.

The modified control group will be significantly more similar to the treatment group than the original, non-random group. This improvement in similarity enhances the accuracy of calculating the difference in the outcome variables, thereby better indicating the treatment’s effect. The matching process balances both groups well and offers promising potential for more robust research outcomes.

**Treatment Effect on The Treated**

What happens to control group units that are not matched with a treated unit? Nothing; we simply disregard them for the rest of the analysis, as they are not similar to anyone in the treated group. This version of propensity score matching trims the control group down to only the units similar to those in the treatment group. It is one of the drawbacks of propensity score matching, as we end up with a smaller (but more accurate) dataset.

Finding matches this way leads us to a discussion of the different treatment effects in causal inference. This kind of propensity score matching allows us to calculate **the average treatment effect on the treated** (**ATT**), as opposed to the average treatment effect (ATE).

I will not delve into a theoretical discussion about different treatment effects here. For this post, we only have to intuitively cover the difference between ATT and ATE. Let’s start with the latter. The average treatment effect refers to the entire population of interest; it measures the impact of the treatment in both the treatment and control groups. We can consider it the treatment effect if everyone in a given population received a treatment. For example, if the job training was available to the entire population of a given region.

What is the treatment effect on the treated, then? It measures the impact of the treatment, but only on the individuals who actually received it. We do this by matching similar units from the control group to the treated group. We might ask whether such a treatment effect is inferior to the general average treatment effect.

I can use the favorite reply in the social science universe: *it depends*. If we would like to discover the effect of the given treatment on the entire population, then we won’t be able to use the method described here. But before you close this article feeling your time was wasted, hang on for a moment. Calculating the average treatment effect on the treated is exactly what we need here!

Usually, job training programs are not rolled out to an entire population. Only a subset of people who need them receives such a program. This group differs from the general population, as we have seen before. We are interested in understanding the impact of the treatment on those who received it. The effect on the entire population is of less interest, as such programs are targeted. We are on the right track.

To summarize – propensity score matching involves finding matches for the treated units from the control units to estimate the treatment effect only for the treated group. We will calculate how the job training program affected those participating. This will be of great practical importance, as it will help policy-makers evaluate and optimize the program.

**Nearest Neighbors**

Let’s get back to the technical part. To find suitable matches, we can use unsupervised learning. Combining different machine learning methods is a very cool part of this approach. One of the best ways to conduct this step is to use the classic nearest neighbors algorithm.

To conduct the matching step, we will create two data frames, one exclusively for units from the treatment group and another for the control group. Then, we can initiate the NN algorithm, setting the **n_neighbors** parameter to 1, as we seek the single most similar match for each unit in the treatment group.

The n_neighbors parameter in the NN algorithm represents the number of nearest neighbors to consider. By setting it to 1, we ensure that each unit in the treatment group is matched with its single closest counterpart in the control group.

Then, we fit the model on the control group. This step was counterintuitive for me at first, but we have to teach the model the structure of the propensity scores in the control group.

Afterward, we apply the *kneighbors* method of the fitted model to the treatment group. This step enables us to find the units from the control group that are most similar to the treatment group. We store the results in two variables:

- Distances – indicating distance (or similarity) to the nearest neighbor
- Indices – location (data frame index) of the most similar unit in the control group

Subsequently, we create the **new_control** data frame based on the indices above. Our new control group contains only the units selected as most similar to the treatment group based on the propensity score. The only step left is to join the original treatment group data with the newly matched control group. This will be the final dataset used to evaluate the effect of the treatment.
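The whole matching step can be sketched as follows, using scikit-learn’s `NearestNeighbors` on toy propensity scores (in practice, the `ps` column would come from the logistic regression):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy data: 50 treated and 200 control units with illustrative scores
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "treat": np.r_[np.ones(50, int), np.zeros(200, int)],
    "ps": np.r_[rng.beta(4, 3, 50), rng.beta(3, 4, 200)],
})

treated = df[df["treat"] == 1]
control = df[df["treat"] == 0]

# Fit 1-NN on the control group's propensity scores...
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
# ...then, for each treated unit, look up its closest control unit
distances, indices = nn.kneighbors(treated[["ps"]])

# Build the matched control group and stack it with the treated units
new_control = control.iloc[indices.ravel()]
matched = pd.concat([treated, new_control])
```

Note this sketch matches with replacement: a control unit can serve as the match for several treated units.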

After applying the matching step, we are almost ready to evaluate the treatment’s effect. However, before doing so, we must check if the matched data led to a balanced dataset.

**Checking Balance**

The ultimate goal of the propensity score matching method is to balance the confounders between the treatment group and the control group. There are many ways to check this. In the previous sections, we did this by comparing the distribution of each variable between both groups. That approach requires a lot of data visualization and checking each variable separately.

There is an easier and faster way to compare the balance between variables commonly used in the matching method. For continuous variables, we can apply a simple metric called **standardized mean difference**:
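A standard formulation of this metric is:

$$\mathrm{SMD} = \frac{\bar{x}_{T} - \bar{x}_{C}}{\sqrt{\left(s_{T}^{2} + s_{C}^{2}\right)/2}}$$

where $\bar{x}_{T}, \bar{x}_{C}$ are the covariate means and $s_{T}, s_{C}$ the standard deviations in the treated and control groups.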

The numerator calculates the difference between the mean of a given covariate in the treated and control groups. The purpose of the denominator is to standardize this difference. It is done by calculating the pooled standard deviation, a weighted average of the standard deviations in both groups. This division makes the SMD independent of the units of the variable, making comparisons across covariates fair.

Calculating the mean difference for categorical variables is a more straightforward process. This is because these variables are already on the same scale. The mean of the categorical variable is the proportion of the positive rows. To calculate the mean difference, we only need to compare the differences between the proportions of its values before and after matching.

As we will reuse the SMD for all the covariates, it is a good idea to encapsulate it in simple functions like those below. The following code creates two functions:

- standardized_mean_difference – for the continuous variables
- calculate_proportions – mean difference for the categorical variables
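A minimal sketch of those two helpers with NumPy (function names are the ones listed above):

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """SMD for a continuous covariate, using the pooled standard deviation."""
    x_treated = np.asarray(x_treated, float)
    x_control = np.asarray(x_control, float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

def calculate_proportions(x_treated, x_control):
    """Difference in proportions for a 0/1 categorical covariate."""
    return np.mean(x_treated) - np.mean(x_control)
```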

The interpretation of the SMD is straightforward; in the propensity score matching case, the lower, the better. Low values of the standardized mean difference show that the difference between groups in a given variable is small, which is precisely the situation we would like to have after matching. A commonly used threshold is 0.1, below which the differences between the groups are considered low.

Furthermore, SMD values around 0.2 are also considered acceptable. These values signify only slight differences between variables. To illustrate, let’s visualize the situation in our analysis.

The balance after matching significantly improved compared to the initial lack of similarity between groups. The chart demonstrates that matching has notably enhanced the balance of most covariates. With most of them hovering around the 0.1 threshold, it’s evident that the control group, formed after applying the propensity score matching, is much more akin to the treatment group.

PSM did not improve the balance of age and years of education. However, both of those variables were already quite similar before the matching. We could have achieved further optimization by applying more sophisticated machine learning algorithms. We must remember that we will never achieve the perfect balance due to the lack of randomization. And even randomized experiments can showcase an imbalance in covariates due to random occurrences.

The new balance is practical enough and shows that comparable groups can be obtained even from observational data. We’re ready to calculate the treatment’s effect.

**Treatment Effect**

Average post-treatment earnings look much more intuitive now, and they finally showcase the program’s efficiency. After adjusting for differences in confounding variables, we can see that earnings in the group affected by the employment program are much higher. Based on the matched control group, we can assume the treated group would have earned about 5,300 USD on average had it not been affected by the program. The job training program led to an increase in the average earnings of affected individuals.

We can also calculate the treatment effect by subtracting the average post-treatment earnings of both groups using the following code:
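A sketch of that difference-in-means calculation on a toy matched dataset (the numbers are illustrative, not the study’s results):

```python
import pandas as pd

# Toy matched dataset (the real one comes from the matching step)
matched = pd.DataFrame({
    "treat": [1, 1, 1, 0, 0, 0],
    "re78":  [6500, 6300, 6100, 5400, 5300, 5200],
})

# ATT estimate: difference in average post-treatment earnings
att = (matched.loc[matched["treat"] == 1, "re78"].mean()
       - matched.loc[matched["treat"] == 0, "re78"].mean())
```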

Comparing the difference in means is only a one-dimensional way to calculate the treatment’s effect. A better way is to run a regression similar to the one we used above, but this time on the matched data.

Why is this? Firstly, the matching is not 100% perfect. Simply comparing means assumes the balance between covariates is perfect. It is much better than before, but a tiny covariate discrepancy remains. Running the regression model allows us to adjust for the remaining imbalance. Controlling for the other variables also helps isolate the treatment effect while keeping the rest of the confounders fixed.

Another benefit of conducting a regression analysis is the better precision of the estimated parameters. It also gives us confidence intervals and allows us to calculate the estimates’ errors.
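A sketch of such a regression on simulated matched data with a small leftover imbalance in age (the true effect is set to 1,200 purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative matched data with a small residual imbalance in age
rng = np.random.default_rng(9)
n = 400
treat = np.r_[np.ones(n // 2, int), np.zeros(n // 2, int)]
age = rng.normal(26, 3, n) + 0.5 * treat        # slight leftover imbalance
re78 = 4000 + 120 * age + 1200 * treat + rng.normal(0, 800, n)
matched = pd.DataFrame({"treat": treat, "age": age, "re78": re78})

# Regression on the matched sample adjusts for the remaining imbalance
# and also yields standard errors and confidence intervals for the effect
result = smf.ols("re78 ~ treat + age", data=matched).fit()
effect = result.params["treat"]
ci_low, ci_high = result.conf_int().loc["treat"]
```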

The job training program increased the earnings of its participants by 1,176 USD. It has undoubtedly had a significant impact, showing the solid performance of the program – and we are talking about dollars from 1978.

The regression model we ran before matching returned a treatment effect of 1,548 USD, so propensity score matching decreased the estimated impact of the treatment.

In the case of propensity score matching, we are comparing only similar units. The matched data doesn’t include individuals entirely different from the ones in the treatment groups. Hence, the treatment’s effect is more precise because we only compare similar individuals.

Propensity score matching data also leads to a smaller imbalance between covariates. Unmatched regression might lead to an overestimation of the treatment effect. Using the matched data, the groups are more comparable, and the bias due to imbalance is lower.

For those reasons, propensity score matching leads to a better estimate of the treatment’s effect on the treated. Although the effect obtained this way is smaller, it still shows a positive impact of the job training program, which is an excellent signal to run similar activities in the future.

**Limitations**

Before we conclude, we have to discuss the limitations and dangers associated with propensity score matching.

The crucial limitation of this approach lies in its dependence on the data at hand. When conducting propensity score matching, we are willingly (or unwillingly) assuming that we have accounted for all relevant confounders. It’s imperative to include all relevant covariates. In the job training case, we believe we included all variables influencing both treatment and outcome, but we must remain aware of the potential for oversight.

If another variable, unrelated to the measured ones, strongly affected both the treatment and the outcome, we would have to include it in the analysis. If such information is unavailable, the treatment effect estimate will be biased.

In the language of the causal diagrams, similar to the ones from the beginning of the post, we have to include information about all relevant Xs when calculating the propensity score.

Can we test for it? Unfortunately not. This is the domain of knowledge, intuition, and literature. Generally, we should obtain as much data as possible (although too much can sometimes be harmful). This is usually more straightforward in commercial settings, where we have access to plenty of transactional information about customers.

Of course, this does not mean that propensity score matching is useless. We have to use it with care and analyze results from different angles. Analytics work, especially causal inference, is a mix of science and intuition, and testing various approaches will help overcome the limitations of each method. It is also a good practice to share a list of assumptions in each study.

Solving the causal problem by applying different methods (like regression adjustment, PSM, difference-in-difference, synthetic control group, etc.) can also be helpful, as we can compare the treatment effects obtained from different studies. However, not every causal inference method applies to every scenario.

**Summary**

Propensity score matching (PSM) is a robust solution for estimating causal effects in observational studies. Its unique ability to balance out differences between treatment and (non-random) control groups enhances the reliability of our inferences about the impact of interventions, making it a valuable tool in our research arsenal.

Our case study on the job training program illustrates this method’s practical application and benefits. While PSM has limitations, such as dependency on the observed variables, it is an essential tool in evaluating different activities that can be used in all settings—from academia, public policy settings, and commercial applications.

**References**

https://bookdown.org/paul/applied-causal-analysis/att.html

Dehejia, R. H., & Wahba, S. (1999). Causal Effects in Non-Experimental Studies: Reevaluating the Evaluation of Training Programs. *Journal of the American Statistical Association*, 94(448), 1053–1062.

https://cran.r-project.org/web/packages/tableone/vignettes/smd.html

Angrist, J. D., & Pischke, J.-S. (2009). *Mostly Harmless Econometrics: An Empiricist’s Companion*. Princeton University Press.

https://onlinelibrary.wiley.com/doi/full/10.1002/cesm.12047

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. *Biometrika*, 70(1), 41-55.

https://rdrr.io/cran/Matching/man/lalonde.html

https://vincentarelbundock.github.io/Rdatasets/datasets.html