A Review of Copula Correction Methods to Address Regressor–Error Correlation

The omnipresent error term in regression models does not always receive careful attention from model builders. What factors are included in this error? Ideally, the error would be entirely due to random shocks. However, the error sometimes also contains factors that should be explicitly incorporated in the model but cannot be observed or are unavailable for use as explanatory variables. Worse, our accumulated knowledge and theories often indicate that the variables seeping into the error term are systematically related to the explanatory variables included in the model. This results in regressor–error correlation, which, if ignored, leads to biased estimates.

Why Does Regressor–Error Correlation Arise?

As an example, Heitmann et al. (2020) investigate the effect of two key visual design decisions: brand typicality (similarity within the brand’s range) and segment typicality (similarity to the competitive set) on consumer purchasing. In the car market, visual appearance is a vital determinant of success; hence, automakers track consumers’ changing tastes and strategically incorporate these into design changes in their newest models. However, because researchers typically cannot observe these changing tastes, they are encompassed in the error term. As a result, empirical models intended to measure the impact of design changes suffer from the problem that the key regressor capturing design changes is correlated with the error. Put differently, the design change regressor is endogenous. If not corrected, the estimated impact of design changes will be biased. As our academic field matures, we continue to discover reasons for regressor–error correlations that were previously overlooked. Examples of such phenomena are (1) advertising endogeneity due to self-selection by consumers in advertising response and (2) pricing endogeneity, because firms and consumers know aspects of product quality that researchers do not see in the data. While additional information such as instrumental variables or exogenous shocks can help address the endogeneity issue, obtaining such information is often challenging. In such situations, the copula correction method provides an alternative approach.

How Does the Copula Correction Work?

The copula correction method directly addresses the issue of regressor–error correlation by assuming a plausible relationship between the endogenous regressor and the error. This additional structure enables the researcher to estimate the model parameters without bias. However, the crucial underlying condition is that the assumed relationship between the endogenous regressor and the error is appropriate. Park and Gupta’s (2012) copula correction method (P&G method hereinafter) assumes a general and convenient Gaussian copula–based relationship between the regressor and the error. The various advantages of the Gaussian copula are well known (Danaher and Smith 2011). The Gaussian copula covers nearly the full (−1, 1) range in pairwise correlation, making it a general and robust copula for most applications. Additionally, its complexity increases at a much slower rate than other multivariate models as the number of dimensions increases.

The P&G method has been extensively used in marketing in diverse contexts such as addressing potential endogeneity of product design changes (e.g., Heitmann et al. 2020), advertising content decisions (e.g., Guitart and Stremersch 2021), and marketing-mix variables (Datta et al. 2022). Since publication of Park and Gupta (2012), various methods that directly model the regressor–error relationship to avoid bias have evolved through subsequent studies. Interestingly, recent developments in this area explore the assumptions of the P&G method and make meaningful improvements by either relaxing them or suggesting alternatives that offer methodological benefits. Accordingly, with the goal of assisting applied researchers interested in employing the copula correction method, in this paper we revisit each assumption of the P&G method and illustrate how new methods enhance them.

The P&G method makes the following assumptions: (1) the endogenous regressor (let’s call it X_en) is nonnormal, (2) the error follows a normal distribution, and (3) the dependence between the endogenous regressor and the error can be captured by a Gaussian copula. The model may include exogenous regressors (let’s call them X_ex) along with the endogenous one. An implicit assumption made in the P&G method is (4) there is no correlation between exogenous and endogenous regressors.
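
To make these assumptions concrete, the model and the Gaussian copula structure can be sketched as follows. The notation is assumed here rather than taken from the text: H and G denote the marginal distribution functions of X_en and the error, Φ is the standard normal distribution function, ρ is the copula correlation, and ω is a standard normal shock.

```latex
\begin{align}
y_t &= \beta_0 + X_{ex,t}'\beta + \alpha X_{en,t} + \varepsilon_t,
  \qquad \varepsilon_t \sim N(0, \sigma^2) \\
P_t^* &= \Phi^{-1}\bigl(H(X_{en,t})\bigr),
  \qquad \varepsilon_t^* = \Phi^{-1}\bigl(G(\varepsilon_t)\bigr) = \varepsilon_t / \sigma \\
\begin{pmatrix} P_t^* \\ \varepsilon_t^* \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \\
\varepsilon_t &= \sigma \rho P_t^* + \sigma \sqrt{1 - \rho^2}\, \omega_t,
  \qquad \omega_t \sim N(0, 1) \text{ independent of } P_t^*
\end{align}
```

Under this structure, Assumption 1 matters for identification: if X_en were normal, P* would be a linear function of X_en, and the two could not be separated in the regression.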

Assumption 1 is easily testable and is satisfied in many applications (below, we discuss methods that relax this assumption). Assumption 2 is plausible and is commonly used in likelihood-based and Bayesian models, but it can be violated, and testing it empirically is challenging, especially in the presence of regressor–error correlation. Similarly, Assumption 3 is plausible but cannot be easily tested (we also discuss a method that relaxes Assumptions 2 and 3). Fortunately, Assumption 4 is easily testable. If X_ex and X_en are highly correlated, the model must incorporate this correlation appropriately; otherwise, bias may arise. Haschka (2022) proposes a likelihood-based estimation method for this situation that constructs the joint distribution of the error and all explanatory variables. A number of other recently proposed methods also account for the correlation between X_ex and X_en: the nonparametric control function method (Breitung, Mayer, and Wied 2024), the 2sCOPE model (Yang, Qian, and Xie 2023), and the SORE model (Qian and Xie 2023). Table 1 summarizes the assumptions of the P&G method, indicates whether each is testable, and suggests methods to consider if an assumption is violated or to enhance robustness.

Table 1. Assumptions of the P&G Method and Recent Developments

No.	Assumption of the P&G Method	Testable?	Methods to Consider if the Assumption Is Violated or to Enhance Robustness
1	The endogenous regressor is nonnormal	Yes	• Yang, Qian, and Xie (2023)—2sCOPE model
2	The error follows a normal distribution	No	• Breitung, Mayer, and Wied (2024)—nonparametric control function method
3	The dependence between the endogenous regressor and the error can be captured by a Gaussian copula	No	• Qian and Xie (2023)—SORE model • Breitung, Mayer, and Wied (2024)—nonparametric control function method
4	There is no correlation between exogenous and endogenous regressors	Yes	• Haschka (2022) • Breitung, Mayer, and Wied (2024)—nonparametric control function method • Yang, Qian, and Xie (2023)—2sCOPE model • Qian and Xie (2023)—SORE model
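
The two assumptions marked as testable in Table 1 can be checked with simple diagnostics. The sketch below is one possible way to do so, not a procedure prescribed by the methods above: it uses the Jarque-Bera statistic as the normality test (an assumed choice; other tests would serve equally well) and the Pearson correlation between X_en and X_ex. The simulated data and the names `x_en`, `x_ex`, and `x_normal` are hypothetical.

```python
import math
import random


def jarque_bera(x):
    """Return the Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square with 2 df
    return jb, p_value


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data: a skewed endogenous regressor, a normal benchmark,
# and an exogenous regressor that is correlated with x_en.
rng = random.Random(1)
n = 1000
x_en = [rng.expovariate(1.0) for _ in range(n)]
x_normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_ex = [0.6 * v + rng.gauss(0.0, 1.0) for v in x_en]

jb_en, p_en = jarque_bera(x_en)            # Assumption 1: is x_en nonnormal?
jb_normal, p_normal = jarque_bera(x_normal)
corr = pearson(x_en, x_ex)                 # Assumption 4: are x_en and x_ex correlated?

print("Jarque-Bera, skewed regressor:", round(jb_en, 1), "p-value:", p_en)
print("Jarque-Bera, normal benchmark:", round(jb_normal, 1), "p-value:", round(p_normal, 3))
print("Correlation between x_en and x_ex:", round(corr, 3))
```

A rejection of normality is necessary but may not be sufficient in practice: a regressor that is only mildly nonnormal can still leave the correction term close to collinear with the regressor itself.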

Park and Gupta (2012) demonstrate that the copula correction method can be applied to discrete choice models as well. The crucial first step in applying the copula correction method is appropriately deriving the linear form of regressor–error dependence. For instance, in the analysis of aggregate sales data, prominent models such as BLP (Berry, Levinsohn, and Pakes 1995) include linear regressor–error dependence between price and common shocks (i.e., price endogeneity). Once we obtain this linear form of regressor–error dependence, applying a copula correction method can address estimation issues.

Recent Developments in Copula Correction Methods

Table 2 summarizes the key strengths of recently proposed methods. The data used in empirical marketing analyses often have a panel structure. When panel data sets have numerous cross-sectional units and relatively few time periods per unit, the challenges of estimation are addressed through fixed-effect transformation. Haschka (2022) extends Park and Gupta’s (2012) approach to panel data where fixed-effect transformation is necessary. The concern when applying fixed-effect transformation is the presence of nonspherical errors. After resolving the problem of nonspherical errors through a generalized least squares transformation, Haschka develops a copula correction method based on the joint distribution of the error and all explanatory variables.

Table 2. Key Strengths of Recently Proposed Methods

Method	Strength
Haschka (2022)	Provides fixed-effect transformation to handle data with numerous cross-sectional units but relatively few time periods per unit
Breitung, Mayer, and Wied (2024)—nonparametric control function method	Provides a robustness check of the P&G method in cases where researchers cannot justify that (1) the error follows a normal distribution, and/or (2) the dependence between the endogenous regressor and the error follows the Gaussian copula (e.g., previous studies may argue that the error deviates from normality)
Yang, Qian, and Xie (2023)—2sCOPE model	Allows for the application of the copula correction method when X_en follows a normal distribution, X_en and X_ex are correlated, and X_ex deviates from normality
Qian and Xie (2023)—SORE model	Handles discrete endogenous regressors with only a few levels, such as binary regressors or count-valued regressors with small means

The copula correction method obtains unbiased estimates of model parameters by modeling the relationship between the regressor and the error. Of course, the true relationship between the two is unknown. The P&G method provides a plausible starting point, and adding other options is naturally beneficial for empirical research. By considering models based on alternative relationships between regressors and errors, researchers can conduct more robust analyses.

In the P&G method, the assumed regressor–error correlation based on a Gaussian copula allows us to decompose the error into (a) the part correlated with the endogenous regressor and (b) pure exogenous shocks that are unrelated to all the regressors. Part (a) is expressed as a nonlinear function of the endogenous regressor, and this part plays a role very similar to a control function (for an overview of control functions, see, e.g., Navarro [2010] and Wooldridge [2015]). Breitung, Mayer, and Wied (2024) propose a novel “nonparametric control function method.” In this approach, the control function that constitutes Part (a) follows a normal distribution, and Part (b) is a mean-zero shock that does not necessarily have to be normal. Consequently, Assumption 2 of the P&G method is relaxed. Similar to the P&G method, which assumes nonnormality of the endogenous regressor for model identification, the Breitung, Mayer, and Wied model requires that specific assumptions related to the distribution of the endogenous regressor be satisfied. While this approach originates from the idea of the copula correction approach, it has the advantage of not assuming a specific copula. Furthermore, Breitung, Mayer, and Wied formally demonstrate the consistency, asymptotic normality, and validity of bootstrap standard errors for the model parameters.
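
For the P&G case, Part (a) can be added to the regression as a generated regressor, P* = Φ^{-1}(H(X_en)), where H is the empirical distribution function of the endogenous regressor. The following is a minimal simulation sketch of that idea under stated assumptions, not the authors' implementation: the data-generating process, the `rank / (n + 1)` rescaling of the empirical distribution function, and the helper names `copula_term` and `ols` are all illustrative choices.

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()


def copula_term(x):
    """P* = Phi^{-1}(H(x)), with H the empirical CDF rescaled to rank / (n + 1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    pstar = [0.0] * n
    for rank, i in enumerate(order, start=1):
        pstar[i] = ND.inv_cdf(rank / (n + 1))
    return pstar


def ols(columns, y):
    """Least squares coefficients from the normal equations (Gauss-Jordan)."""
    k, n = len(columns), len(y)
    a = [[sum(columns[i][t] * columns[j][t] for t in range(n)) for j in range(k)]
         + [sum(columns[i][t] * y[t] for t in range(n))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data: x_en has an exponential marginal and is linked to the
# normal error through a Gaussian copula with correlation rho.
rng = random.Random(0)
n, alpha, rho = 5000, 1.0, 0.5
x_en, y = [], []
for _ in range(n):
    z = rng.gauss(0.0, 1.0)                                        # latent normal score of x_en
    eps = rho * z + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)  # error with sigma = 1
    x = -math.log(ND.cdf(-z))                                      # exponential(1) marginal
    x_en.append(x)
    y.append(1.0 + alpha * x + eps)

ones = [1.0] * n
naive = ols([ones, x_en], y)                          # ignores endogeneity
corrected = ols([ones, x_en, copula_term(x_en)], y)   # adds the copula correction term

print("true slope:", alpha)
print("naive slope:", round(naive[1], 3))
print("corrected slope:", round(corrected[1], 3))
print("coefficient on copula term:", round(corrected[2], 3))
```

In this setup the coefficient on the copula term estimates σρ. Because P* is itself estimated from the data, standard errors from a plain regression are not reliable here; bootstrapping is the usual remedy.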

We turn next to Assumption 1, which is that the endogenous regressor has a nonnormal distribution. The recently proposed “two-stage copula endogeneity correction” (2sCOPE) method relaxes this requirement (Yang, Qian, and Xie 2023). Like Haschka (2022), 2sCOPE assumes that the endogenous regressors, exogenous regressors, and errors are interrelated through a Gaussian copula. For estimation, it employs a two-stage approach using control functions derived from the assumed model. An advantage of the method is that it allows for consistent parameter estimation even if the endogenous regressor follows a normal distribution, as long as one of the correlated exogenous regressors deviates from normality.

As noted, the essence of the copula correction approach lies in directly modeling the correlation between regressors and errors to estimate model parameters without bias. The semiparametric odds ratio (SOR) has often been used in applied research in marketing and related fields as a flexible method to capture dependence between variables (see, e.g., Chen 2007; Qian and Xie 2011). The semiparametric odds ratio endogeneity (SORE) model has recently been proposed as a method that utilizes SOR to capture regressor–error dependence (Qian and Xie 2023). One notable advantage of SOR is its ability to handle the association between discrete endogenous regressors and the error effectively. While the P&G method can be applied to discrete endogenous regressors, it does not handle endogenous regressors with only a few levels well; examples are binary regressors or count-valued regressors with small means. This limitation arises because the P&G method treats discrete endogenous regressors as realizations from underlying continuous latent variables and performs an inverse mapping from the cumulative distribution functions of endogenous regressors to the latent variables. The SORE model addresses this issue. However, this benefit comes at a cost: SORE constructs a conditional distribution from the odds ratio (OR) function and nonparametric baseline distribution functions. If the OR function is misspecified, it can lead to bias and/or issues of model nonidentification.

One of the primary reasons researchers may choose to use SORE is its ability to handle binary endogenous regressors. A more classical solution in such cases is to employ a Gaussian copula–based approach with a structure similar to the models proposed by Heckman (1976) or Lee (1983). These models assume a specific relationship between the binary endogenous regressor and the error based on Gaussian copula. In this scenario, researchers can estimate the model without bias using conditional likelihood instead of the reverse mapping proposed in Park and Gupta (2012).

The robustness of the P&G method has been stress-tested in multiple subsequent studies. Park and Gupta (2012) demonstrated the copula correction method’s performance in a simple setting without an intercept. Becker, Proksch, and Ringle (2022) show that in a more general setting that includes an intercept, the performance of copula correction is diminished when the sample size is small. However, Qian, Xie, and Koschmann (2024) find that the substantial bias identified in Becker, Proksch, and Ringle is primarily due to their method of constructing the empirical copula. Specifically, the correction term for the empirical copula, which is based on a fixed-value percentile for the highest rank, can significantly distort the distribution of the copula correction terms, resulting in suboptimal performance of the copula correction method. When the P&G method is applied more precisely, as suggested by Qian, Xie, and Koschmann, the bias in the coefficient estimate of the endogenous regressor becomes negligible when the sample size reaches 400, rather than 4,000. Becker, Proksch, and Ringle also carefully examine the nonnormality assumption and how this assumption affects the results. In a similar vein, Eckert and Hohberger (2023) investigate the performance of the P&G method when various assumptions are violated, especially in cases of near-normal endogenous regressors, nonnormal and skewed errors, and regressor–error correlation based on non-Gaussian copulas, and provide guidelines for such scenarios. Like all models, copula correction methods rely on assumptions, and their use requires significant caution, especially when the sample size is small. Fortunately, the series of recent papers that have extended the original P&G method address many of these situations. More specifically, the issue of insufficient nonnormality of the endogenous regressor can be mitigated with the 2sCOPE method. Problems related to skewed or nonnormal errors can be addressed through the nonparametric control function method. Moreover, an advantage of both SORE and the nonparametric control function method is their flexibility to consider relationships between regressors and errors that do not necessarily follow a Gaussian copula.
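
The sensitivity to how the empirical copula is constructed can be illustrated with a small sketch. The empirical distribution function equals 1 at the largest observation, where the inverse normal is infinite, so some adjustment is needed. The code below compares two adjustments on the sorted ranks 1, ..., n: capping the top value at a fixed percentile, and rescaling all ranks to `rank / (n + 1)`. Both the cap of 0.9999999 and the rescaling rule are assumptions chosen for illustration; they are not claimed to be the exact constructions used in the studies discussed above.

```python
from statistics import NormalDist

ND = NormalDist()


def pstar_fixed_cap(n, cap=0.9999999):
    """Copula terms for ranks 1..n, with the ECDF capped at a fixed percentile."""
    return [ND.inv_cdf(min(rank / n, cap)) for rank in range(1, n + 1)]


def pstar_rescaled(n):
    """Copula terms for ranks 1..n, with the ECDF rescaled to rank / (n + 1)."""
    return [ND.inv_cdf(rank / (n + 1)) for rank in range(1, n + 1)]


for n in (100, 400, 4000):
    capped = pstar_fixed_cap(n)
    rescaled = pstar_rescaled(n)
    print(
        "n =", n,
        "| capped: top", round(capped[-1], 2), "second", round(capped[-2], 2),
        "| rescaled: top", round(rescaled[-1], 2), "second", round(rescaled[-2], 2),
    )
```

With the fixed cap, the largest copula term does not depend on n and sits far above the second largest, so a single observation acts as an outlier in the generated regressor. With the rescaled ranks, the top value grows smoothly with n and the terms are symmetric around zero.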

Guidance for the Applied Researcher

To wrap up, we suggest the following three-step procedure for researchers interested in applying the copula correction method.

Step 1: Check whether the endogenous regressor follows a nonnormal distribution. If it is near normal, researchers can try the 2sCOPE model. If the endogenous variable is discrete and has only a few levels, such as binary regressors or count-valued regressors with small means, one can apply the SORE model. If the endogenous regressor follows a nonnormal distribution, proceed to Step 2.
Step 2: Check for correlations between X_en and X_ex. If the correlations are large, apply the 2sCOPE model.^[1] If the data set has a panel structure and requires fixed-effect transformation to handle numerous cross-sectional units and relatively few time periods, apply the method proposed by Haschka (2022). If there is low correlation between X_en and X_ex, apply the P&G method.
Step 3: As a robustness check, consider running the nonparametric control function method if the endogenous regressor is continuous. Unfortunately, Assumptions 2 and 3 of the P&G method are not easily testable. The nonparametric control function method does not require the normality of the error (Assumption 2) or assume a specific copula structure between the endogenous regressor and the error (Assumption 3). However, it does require an alternative set of assumptions, and some of these assumptions are also difficult to test using data. We suggest the nonparametric control function method as a robustness check because, like the P&G method, it is relatively easy to apply. Finding consistent results between the P&G method and the nonparametric control function method provides greater assurance of validity.
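
The procedure above can be summarized as a small decision function. This is only a sketch of the guidance as worded here: the function name, the boolean inputs, and the order in which the Step 2 conditions are checked are assumptions, and the judgments behind each input (for example, what counts as "near normal" or "high" correlation) remain with the researcher.

```python
def suggest_method(near_normal, few_level_discrete, high_corr_with_exog,
                   panel_needs_fe, continuous=True):
    """Return (primary method, robustness check) following the three steps.

    All arguments are booleans supplied by the researcher; this function does
    not test anything itself.
    """
    # Step 1: distribution of the endogenous regressor.
    if near_normal:
        primary = "2sCOPE"
    elif few_level_discrete:
        primary = "SORE"
    # Step 2: correlation with exogenous regressors and panel structure.
    elif high_corr_with_exog:
        primary = "2sCOPE"
    elif panel_needs_fe:
        primary = "Haschka (2022)"
    else:
        primary = "P&G"
    # Step 3: robustness check, suggested for continuous endogenous regressors.
    robustness = "nonparametric control function" if continuous else None
    return primary, robustness


# Example: a skewed continuous regressor with low correlation to X_ex.
print(suggest_method(near_normal=False, few_level_discrete=False,
                     high_corr_with_exog=False, panel_needs_fe=False))

# Example: a binary endogenous regressor.
print(suggest_method(near_normal=False, few_level_discrete=True,
                     high_corr_with_exog=False, panel_needs_fe=False,
                     continuous=False))
```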

FAQ

Additionally, we provide below answers to some frequently asked questions regarding the use of the copula correction method in practice.

Q1: Is it correct to use multiple copula correction terms for multiple endogenous variables in the same model?

Answer: This is correct. One advantage of the copula correction method based on the Gaussian copula is that it can include multiple copula correction terms to handle multiple endogenous regressors.

Q2: In estimating a model with higher-order terms (e.g., interaction and quadratic terms) of the endogenous variable, should we generate additional copula correction terms for them?

Answer: Qian, Xie, and Koschmann (2022) address this issue formally. They show that once copula correction terms for the main effects of endogenous regressors are included as generated regressors, there is no need to include additional correction terms for the interaction terms or higher-order terms. This simplicity in handling higher-order endogenous regression terms is a merit of the copula correction approach. More importantly, adding these unnecessary correction terms is harmful and leads to suboptimal correction of endogeneity bias.
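
In practice, this means the design matrix contains the higher-order terms themselves but only one generated regressor per endogenous variable. The sketch below builds such a set of columns; the column names, the simulated data, and the `rank / (n + 1)` rescaling inside `copula_term` are illustrative assumptions.

```python
import random
from statistics import NormalDist


def copula_term(x):
    """P* = Phi^{-1}(H(x)), with H the empirical CDF rescaled to rank / (n + 1)."""
    n = len(x)
    nd = NormalDist()
    order = sorted(range(n), key=lambda i: x[i])
    pstar = [0.0] * n
    for rank, i in enumerate(order, start=1):
        pstar[i] = nd.inv_cdf(rank / (n + 1))
    return pstar


def design_columns(x_en, x_ex):
    """Columns for y ~ 1 + x_en + x_en^2 + x_en * x_ex + x_ex + P*(x_en)."""
    n = len(x_en)
    return {
        "intercept": [1.0] * n,
        "x_en": list(x_en),
        "x_en_squared": [v * v for v in x_en],
        "x_en_times_x_ex": [u * v for u, v in zip(x_en, x_ex)],
        "x_ex": list(x_ex),
        # Only the main effect of x_en gets a copula correction term;
        # no separate terms are generated for x_en^2 or x_en * x_ex.
        "pstar_x_en": copula_term(x_en),
    }


rng = random.Random(2)
x_en = [rng.expovariate(1.0) for _ in range(200)]
x_ex = [rng.gauss(0.0, 1.0) for _ in range(200)]
columns = design_columns(x_en, x_ex)
print(list(columns))
```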

Q3: Is it acceptable to exclude nonsignificant copula correction terms from the final model when it involves multiple copula correction terms?

Answer: This issue is similar to a common challenge in statistical analysis for which the answer is not clear-cut: Should nonsignificant regressors be excluded or included when building the final model? Considering factors such as model complexity, influence on other variables, theoretical implications, and model fit, researchers may choose to drop nonsignificant regressors or leave them in the model. If a copula correction term is not significant, removing it from the final model can have positive effects in terms of model simplicity, degrees of freedom, and multicollinearity. We suggest examining how sensitive the estimates of key variables are to removing nonsignificant copula correction terms. If the effects of key variables are not very sensitive, removal may be harmless.

Q4: Is it acceptable to utilize the significance of the copula correction term as an indicator to determine whether endogeneity is a concern?

Answer: If the P&G assumptions are correct, the nonsignificance of the copula correction term implies that there is no endogeneity caused by the regressor–error correlation. While the assumptions of P&G can serve as a plausible starting point, one cannot conclusively determine the absence of endogeneity based on this result alone. Therefore, it is advisable to consider other methods as a robustness check (e.g., nonparametric control function method, 2sCOPE).

Conclusion

The copula correction method has extended beyond marketing and is increasingly being introduced and widely used in various fields, including management, economics, and psychology. Open-source code is also becoming widely available to implement the method (Gui et al. 2023). Concurrently, there has been substantial additional research on the assumptions and weaknesses of the original P&G model, leading to its development and evolution. As we know, there is no free lunch. To be able to conduct analysis without instrumental variables or additional information, copula correction methods must make assumptions about the relationship between regressors and errors. Through further research, we need to understand the relationship between regressors and errors better, both theoretically and empirically, and leverage this additional knowledge to develop a copula correction model that captures the regressor–error correlation more completely.

Footnote

^[1] Determining a precise threshold for “high” correlation is difficult and requires further research. Please refer to the last row of Table 1. We know that the bias resulting from the ignored correlation between X_en and X_ex depends on (1) the correlation between X_en and the error, (2) the correlation between X_en and X_ex, and (3) the variance of the error. If minimal regressor–error correlation is expected (based on previous results and/or theory) and the explained part in the variation of the dependent variable is large (i.e., the explanatory power of the model is high and thus the error variance is small), we can expect that the impact of the correlation between X_en and X_ex is minimal. See the appendices of Haschka (2022) and Yang, Qian, and Xie (2023). Moreover, if X_ex is highly correlated with X_en, we need to meticulously double-check the exogeneity of X_ex. Finding a suitable instrumental variable is challenging because it must be correlated with the endogenous regressor yet uncorrelated with the error term. Similarly, it is unlikely that a variable is truly exogenous if it is highly correlated with the endogenous variable.

Citation

Park, Sungho, and Sachin Gupta (2024), “A Review of Copula Correction Methods to Address Regressor–Error Correlation,” Impact at JMR. Available at: https://www.ama.org/marketing-news/a-review-of-copula-correction-methods-to-address-regressorerror-correlation/.

Acknowledgment

We would like to express our gratitude to Kapil Tuli and Rebecca Hamilton for their valuable feedback and numerous helpful suggestions during the review process, which have contributed to enhancing the utility of this article.

References

Becker, Jan-Michael, Dorian Proksch, and Christian M. Ringle (2022), “Revisiting Gaussian Copulas to Handle Endogenous Regressors,” Journal of the Academy of Marketing Science, 50, 46–66.

Berry, Steven, James Levinsohn, and Ariel Pakes (1995), “Automobile Prices in Market Equilibrium,” Econometrica, 63 (4), 841–90.

Breitung, Jörg, Alexander Mayer, and Dominik Wied (2024), “Asymptotic Properties of Endogeneity Corrections Using Nonlinear Transformations,” Econometrics Journal (published online January 24), https://doi.org/10.1093/ectj/utae002.

Chen, Hua Yun (2007), “A Semiparametric Odds Ratio Model for Measuring Association,” Biometrics, 63 (2), 413–21.

Danaher, Peter J. and Michael S. Smith (2011), “Modeling Multivariate Distributions Using Copulas: Applications in Marketing,” Marketing Science, 30 (1), 4–21.

Datta, Hannes, Harald J. van Heerde, Marnik G. Dekimpe, and Jan-Benedict E.M. Steenkamp (2022), “Cross-National Differences in Market Response: Line-Length, Price, and Distribution Elasticities in 14 Indo-Pacific Rim Economies,” Journal of Marketing Research, 59 (2), 251–70.

Eckert, Christine and Jan Hohberger (2023), “Addressing Endogeneity Without Instrumental Variables: An Evaluation of the Gaussian Copula Approach for Management Research,” Journal of Management, 49 (4), 1460–95.

Gui, Raluca, Markus Meierer, Patrik Schilter, and René Algesheimer (2023), “REndo: Internal Instrumental Variables to Address Endogeneity,” Journal of Statistical Software, 107 (3), 1–43.

Guitart, Ivan A. and Stefan Stremersch (2021), “The Impact of Informational and Emotional Television Ad Content on Online Search and Sales,” Journal of Marketing Research, 58 (2), 299–320.

Haschka, Rouven E. (2022), “Handling Endogenous Regressors Using Copulas: A Generalization to Linear Panel Models with Fixed Effects and Correlated Regressors,” Journal of Marketing Research, 59 (4), 860–81.

Heckman, James J. (1976), “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models,” Annals of Economic and Social Measurement, 5 (4), 475–92.

Heitmann, Mark, Jan R. Landwehr, Thomas F. Schreiner, and Harald J. Van Heerde (2020), “Leveraging Brand Equity for Effective Visual Product Design,” Journal of Marketing Research, 57 (2), 257–77.

Lee, Lung-Fei (1983), “Generalized Econometric Models with Selectivity,” Econometrica, 51 (2), 507–12.

Navarro, Salvador (2010), “Control Functions,” in Microeconometrics, Steven N. Durlauf and Lawrence E. Blume, eds. The New Palgrave Economics Collection. Palgrave Macmillan, 20–28.

Park, Sungho and Sachin Gupta (2012), “Handling Endogenous Regressors by Joint Estimation Using Copulas,” Marketing Science, 31 (4), 567–86.

Qian, Yi and Hui Xie (2011), “No Customer Left Behind: A Distribution-Free Bayesian Approach to Accounting for Missing Xs in Marketing Models,” Marketing Science, 30 (4), 717–36.

Qian, Yi and Hui Xie (2023), “Correcting Regressor-Endogeneity Bias via Instrument-Free Joint Estimation Using Semiparametric Odds Ratio Models,” Journal of Marketing Research, (published online August 3), https://doi.org/10.1177/00222437231195577.

Qian, Yi, Hui Xie, and Anthony Koschmann (2022), “Should Copula Endogeneity Correction Include Generated Regressors For Higher Order Terms? No, It Hurts,” NBER Working Paper 29978, http://www.nber.org/papers/w29978.

Qian, Yi, Hui Xie, and Anthony Koschmann (2024), “A Practical Guide to Endogeneity Correction Using Copulas,” NBER Working Paper 32231, http://www.nber.org/papers/w32231.

Wooldridge, Jeffrey M. (2015), “Control Function Methods in Applied Econometrics,” Journal of Human Resources, 50 (2), 420–45.

Yang, Fan, Yi Qian, and Hui Xie (2023), “Addressing Endogeneity Using a Two-Stage Copula Generated Regressor Approach,” NBER Working Paper 29708, https://www.nber.org/papers/w29708.

Sungho Park

Sungho Park is Korbit Chaired Professor of Marketing, Seoul National University, South Korea.

Sachin Gupta

Sachin Gupta is Henrietta Johnson Louis Professor of Management and Professor of Marketing, Cornell University, USA.