Opinion-Policy Nexus

Still more about my experiences this term as a political theorist teaching methods.

How to introduce the core ideas of regression analysis: via concrete visual examples of bivariate relationships, culminating in the Gauss-Markov theorem and the classical regression model? Via a more abstract but philosophically satisfying story about inference and uncertainty, models and distributions? Some combination of the two?

I took my lead here from my first teacher of statistics, and I want to describe and praise that approach, which still impresses me as quite beautiful in its way.

I remember with some fondness stumbling through Gary King's course on the likelihood theory of inference just over twenty years ago. That course, in turn, drew heavily on King's Unifying Political Methodology, first published in 1989.

I'm too far removed from the methods community to have a sense of how this book is now received. I remember at the time, when I took King's course, thinking that the discussion of Bayesian inference was philosophically ... well, a bit dismissive, whereas nowadays Bayes seems just fine. Revisiting the relevant sections of UPM (especially pp. 28-30) I now think my earlier assessment was unfair.

Still, UPM is easily recognizable as the approach that led Chris Achen to say the following in surveying the state of political methods little more than a decade after King's book first appeared ...

... Even at the most quantitative end of the profession, much contemporary empirical work has little long-term scientific value. “Theoretical models” are too often long lists of independent variables from social psychology, sociology, or just casual empiricism, tossed helter-skelter into canned linear regression packages. Among better empiricists, these “garbage-can regressions” have become a little less common, but they have too frequently been replaced by garbage-can maximum-likelihood estimates (MLEs). ...

Given this, it wouldn't have surprised me if, upon querying methods colleagues, I'd found that UPM remains widely liked, its historical importance for political science acknowledged, but its position in cutting-edge methods syllabi quietly shuffled to the "suggested readings" list.

Is this the case? I doubt it, but even if all that were true, UPM is the book I learned from, and it's the book I keep taking off the shelf, year after year, to see how certain basic ideas in distribution and estimation theory play out specifically for political questions.

Of course I say that as a theorist: whenever I've pondered high (statistical) theory, nothing much has ever been at stake for me personally, as a scholar and teacher. Now, with some pressure to actually do something constructive with my dilettante's interest in statistics, I wanted to teach with this familiar book ready at hand.

I haven't been disappointed, and I want to share an illustration of why I think this book should stand the test of time: King's treatment of the classical regression framework and the Gauss-Markov theorem.

Try googling "the Classical Regression Model" and you'll get a seemingly endless stream of (typically excellent) lecture notes from all over the world, no small number of which probably owe significant credit to the discussion in William Greene's ubiquitous econometrics text. High up on the list will almost certainly be Wikipedia's (actually rather decent) explanation of linear regression. The intuition behind the model is most powerfully conveyed in the bivariate case: here is the relationship, in a single year, between a measure of human capital performance and per capita GDP for a sample of countries ...


Now, let's look at that again but with logged GDP per capita for each country in the sample (this is taken, by the way, from the most recent Penn World Table) ...


The straight line is, of course, universally understood as "the line of best fit," but that interpretation requires some restrictions, which define the conditions under which calculating that line using a particular algorithm, ordinary least squares (OLS, or simply LS), results in the best linear unbiased estimator of the coefficients (thus the acronym BLUE, so common in introductory treatments of the CLRM). OLS minimizes the sum of squared errors, measured vertically at each value of x (rather than, say, perpendicular to the line). Together, those conditions are the Gauss-Markov assumptions, named thus thanks to the Gauss-Markov theorem, which, given those conditions (very roughly: uncorrelated errors with mean zero and constant variance, and those errors uncorrelated with x, or with the columns in the multivariate matrix X), establishes OLS as the best linear unbiased estimator of the coefficients in the equation that describes that ubiquitous illustrative line,

[latex]y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}[/latex]
or, in matrix notation for multiple x variables,

[latex]y = X\beta + \epsilon[/latex]
... and that's how generations of statistics and econometrics students first encountered regression analysis: via this powerful visual intuition.
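For readers who like to see the arithmetic, here is a minimal sketch in Python of that textbook story (the data are made up; nothing here comes from UPM): the closed-form OLS fit for the bivariate case, which minimizes the sum of squared vertical errors.

```python
# Toy illustration (invented data): closed-form OLS for a bivariate fit.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# OLS slope and intercept from the usual closed-form expressions.
beta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
beta0 = mean_y - beta1 * mean_x

def ssr(b0, b1):
    """Sum of squared vertical errors for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Any nearby line has a larger sum of squared errors than the OLS line.
assert ssr(beta0, beta1) <= ssr(beta0 + 0.1, beta1)
assert ssr(beta0, beta1) <= ssr(beta0, beta1 + 0.1)
```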

But as King notes in UPM, the intuition was never entirely satisfying upon more careful reflection. Why the sum of squared errors, rather than, say, the sum of the absolute values of the errors? And why measure the respective errors vertically, at each value of x, rather than, again, perpendicular to the line we want to fit?
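One way to make the first of those questions concrete (this example is mine, not King's): fit a single constant to some made-up data. The squared-error loss is minimized at the mean, while the absolute-error loss is minimized at the median, so the choice of loss is a substantive choice, not a formality.

```python
# A minimal sketch of why the loss function matters: data are invented,
# with one deliberate outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

def sse(c):   # sum of squared errors for the constant "model" c
    return sum((y - c) ** 2 for y in ys)

def sae(c):   # sum of absolute errors for the constant "model" c
    return sum(abs(y - c) for y in ys)

grid = [i / 100 for i in range(0, 10001)]  # candidate constants 0..100
best_sq = min(grid, key=sse)
best_abs = min(grid, key=sae)

print(best_sq)   # 22.0, the mean -- pulled toward the outlier
print(best_abs)  # 3.0, the median -- robust to the outlier
```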

UPM is, so far as I know, unique (or at the very least, extraordinarily rare) in beginning not with these visual intuitions, but instead with a story about inference: how do we infer things about the world given uncertainty? How can we be clear about uncertainty itself? This is, after all, the point of an account of probability: to be precise about uncertainty, and the whole point of UPM was (is) to introduce statistical methods most useful for political science via a particular approach to inference.

So, instead of beginning with the usual story about convenient bivariate relationships and lines of best fit, UPM starts with the fundamental problem of statistical inference: we have evidence generated by mechanisms and processes in the world. We want to know how confident we should be in our model of those mechanisms and processes, given the evidence we have.

More precisely, we want to estimate some parameter [latex]\theta[/latex], taking much of the world as given. That is, we'd like to know how confident we can be in our model of that parameter [latex]\theta[/latex], given the evidence we have. So what we want to know is [latex]p( \theta | y)[/latex], but what we actually have is knowledge of the world given some parameter [latex]\theta[/latex], that is, [latex]p( y | \theta )[/latex].

Bayes's Theorem famously gives us the relationship between a conditional probability and its inverse:

[latex]p(\theta|y) = \dfrac{p(\theta)p(y|\theta)}{p(y)}[/latex]
We could contrive to render [latex]p(y)[/latex] as a function of [latex]p(\theta)[/latex] and [latex]p(y | \theta)[/latex] by integrating [latex]p(\theta)p(y|\theta)[/latex] over the whole parameter space [latex]\Theta[/latex], [latex]\int_\Theta p(\theta) p(y| \theta) d\theta[/latex], but this still leaves us with the question of how to interpret [latex]p(\theta)[/latex].
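To make that integral concrete, here is a hedged numerical sketch (my own toy setup, not King's): a flat prior over a bounded parameter space, a unit-variance normal likelihood, and [latex]p(y)[/latex] computed as a Riemann sum over a grid of candidate parameter values.

```python
import math

# Invented example: one observation, flat prior on [-5, 5],
# unit-variance normal likelihood.
y = 1.5

def norm_pdf(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

thetas = [t / 100 for t in range(-500, 501)]  # discretized Theta
d = 0.01                                      # grid step
prior = 1.0 / 10.0                            # flat density on [-5, 5]

# p(y) = integral over Theta of p(theta) * p(y | theta) d theta,
# approximated by a Riemann sum over the grid.
p_y = sum(prior * norm_pdf(y, t) * d for t in thetas)

# The posterior p(theta | y) via Bayes's theorem integrates to one.
posterior = [prior * norm_pdf(y, t) / p_y for t in thetas]
assert abs(sum(p * d for p in posterior) - 1.0) < 1e-6
```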

These days that interpretive task hardly seems much of a philosophical or practical hurdle, but Fisher's famous approach to likelihood is still appealing. Instead of arguing about (variously informative) priors, we could proceed instead from an intuitive implication of Bayes's result: that [latex]p(\theta |y)[/latex] might be represented as some function of our evidence and our background understanding (such as a theoretically plausible model) of the parameter of interest. What if we took much of that background understanding as an unknown function of the evidence that is constant across rival models of the parameter [latex]\theta[/latex]?

Following King's convention in UPM, let's call these varied hypothetical models [latex]\tilde{\theta}[/latex], and then define a likelihood function as follows:

[latex]L(\tilde{\theta}|y) = g(y) p(y|\tilde{\theta})[/latex]

This gives us an appealing way to think about relative likelihoods associated with rival models of the parameter we're interested in, given the same data ...

[latex]\dfrac{L(\tilde{\theta_{i}}|y)}{L(\tilde{\theta_{j}}|y)} = \dfrac{g(y) p(y|\tilde{\theta_{i}})}{g(y) p(y|\tilde{\theta_{j}})}[/latex]

[latex]g(y)[/latex] cancels out here, but that is more than a mere computational convenience: our estimate of the parameter [latex]\theta[/latex] is relative to the data in question, where many features of the world are taken as ceteris paribus for our purposes. These features are represented by that constant function (g) of the data (y). We can drop [latex]g(y)[/latex] when considering the ratio

[latex]\dfrac{L(\tilde{\theta_{i}}|y)}{L(\tilde{\theta_{j}}|y)} = \dfrac{p(y|\tilde{\theta_{i}})}{p(y|\tilde{\theta_{j}})}[/latex]
because our use of that ratio, to evaluate our parameter estimates, is always relative to the data at hand.
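A quick numerical illustration of that cancellation (the numbers are invented, and I use a unit-variance normal density purely for concreteness): the ratio of likelihoods for two rival parameter values can be computed without ever knowing [latex]g(y)[/latex].

```python
import math

ys = [1.2, 0.8, 1.5]  # invented data

def log_density_sum(mu):
    # log of the product of unit-variance normal densities f(y_i | mu)
    return sum(-0.5 * (y - mu) ** 2 - 0.5 * math.log(2 * math.pi)
               for y in ys)

theta_i, theta_j = 1.0, 0.0  # two rival models of the parameter
# g(y) would multiply both numerator and denominator, so it cancels:
ratio = math.exp(log_density_sum(theta_i) - log_density_sum(theta_j))
assert ratio > 1.0  # theta = 1.0 is more likely than 0.0 on these data
```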

With this in mind, think about a variable like height or temperature. Or, say, the diameter of a steel ring. More relevant to the kinds of questions many social researchers grapple with: imagine a survey question on reported happiness using a thermometer scale ("If 0 is very unhappy and 10 is very happy indeed, how happy are you right now?"). We can appeal to the Central Limit Theorem to justify a working assumption that

[latex]y_{i} \sim f_{stn} (y_{i} | \mu_{i}) = \dfrac{e^{-\frac{1}{2}(y_{i}-\mu_{i})^{2}}}{\sqrt{2\pi}}[/latex]

which is just to say that our variable is distributed as a special case of the Gaussian normal distribution, but with [latex]\sigma^{2}=1[/latex].
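In code (a small sketch of my own, just to pin down the formula), that density and two sanity checks on it look like this:

```python
import math

# f_stn(y | mu): the "standardized" (unit-variance) normal density.
def f_stn(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

# Sanity checks: the density peaks at y = mu, and a crude Riemann sum
# over a wide grid comes out close to one (grid bounds are arbitrary).
total = sum(f_stn(t / 100, 0.0) * 0.01 for t in range(-600, 601))
assert f_stn(0.0, 0.0) > f_stn(1.0, 0.0)
assert abs(total - 1.0) < 1e-3
```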

By now you may already be seeing where King is going with this illustration. The use of a normally distributed random variable to illustrate the concept of likelihood is just that: an illustrative simplification. We could have developed the concept with any of a number of possible distributions.

Now for a further illustrative simplification: suppose (implausibly) that the central tendency associated with our random variable is constant. Suppose, for instance, that everyone in our data actually felt the same level of subjective happiness on the thermometer scale we gave them, but there was some variation in the specific number they assigned to the same subjective mental state. So, the reported numbers cluster within a range.

I say this is an implausible assumption for the example at hand, and it is, but think about this in light of the exercise I mentioned above (and posted about earlier): there really is a (relatively) fixed diameter for a steel ring we're tasked to measure, but we should expect measurement error, and that error will likely differ depending on the method we use to do the measuring.

We can formalize this idea as follows: we are assuming [latex]E(Y_{i})=\mu_{i}[/latex] for each observation i. Further suppose that [latex]Y_{i}, Y_{j}[/latex] are independent for all [latex]i \not= j[/latex]. So, let's take the constant mean to be the parameter we want to estimate, and we'll use some familiar notation for this, replacing [latex]\theta[/latex] with [latex]\beta[/latex], so that [latex]\mu_{i} = \beta[/latex] for every observation.

Given what we've assumed so far (constant mean [latex]\mu = \beta[/latex], independent observations), what does the joint distribution of the data look like? Since [latex]p(e_{i}e_{j}) = p(e_{i})p(e_{j})[/latex] for independent events [latex]e_{i}, e_{j}[/latex], the joint distribution over all of our observations is given by

[latex]\prod_{i}^{n} \dfrac{e^{-\frac{1}{2}(y_{i}-\beta)^{2}}}{\sqrt{2\pi}}[/latex]

Let's use this expression to define a likelihood function for [latex]\beta[/latex]:

[latex]L(\tilde{\beta}|y) = g(y) \prod_{i}^{n} f_{stn}(y_{i}|\tilde{\beta})[/latex]

Now, the idea here is to estimate [latex]\beta[/latex], and we're doing that by supposing that a lot of background information cannot be known, but can be taken as roughly constant with respect to the part of the world we are examining to estimate that parameter. Thus we'll ignore [latex]g(y)[/latex], which represents that unknown background that is constant across rival hypothetical values of [latex]\beta[/latex]. Then we'll define the likelihood of [latex]\beta[/latex] given our data, y, with the expression [latex]\prod_{i}^{n} f_{stn}(y_{i}|\tilde{\beta})[/latex] and substitute in the full specification of the standardized normal distribution with [latex]\mu_{i} = \beta[/latex],

[latex]L(\tilde{\beta}|y) = \prod_{i}^{n} \dfrac{e^{-\frac{1}{2}(y_{i}-\beta)^{2}}}{\sqrt{2\pi}}[/latex]

Remember that we're less interested here in the specific functional form of L(.) than in relative likelihoods, so any transformation of the probability function that preserves the properties of interest to us, the relative likelihoods of parameter estimates [latex]\tilde{\beta}[/latex], won't affect our use of L(.). Suppose, then, that we took the natural logarithm of [latex]L(\tilde{\beta}|y)[/latex]. Recall that [latex]ln(ab) = ln(a) + ln(b)[/latex], so for some constant [latex]\alpha[/latex], [latex]ln(\alpha ab) = ln(\alpha) + ln(a) + ln(b)[/latex], where [latex]ln(\alpha)[/latex] is itself just a constant. So, the natural logarithm of our likelihood function is

[latex]ln L(\tilde{\beta}|y) = ln(g(y)) + \sum_{i}^{n} ln(\dfrac{e^{-\frac{1}{2}(y_{i}-\tilde{\beta})^{2}}}{\sqrt{2\pi}})[/latex]


[latex]= ln(g(y)) + \sum_{i}^{n} ln(\dfrac{1}{\sqrt{2\pi}}) - \dfrac{1}{2}\sum_{i}^{n}(y_{i}-\tilde{\beta})^{2}[/latex]


[latex]= ln(g(y)) - \dfrac{n}{2}ln(2\pi) - \dfrac{1}{2}\sum_{i}^{n}(y_{i}-\tilde{\beta})^{2}[/latex]

Notice that the first two terms on the right-hand side don't include [latex]\tilde{\beta}[/latex]. Think of them, then, as a constant that may shift the level of the likelihood function, but that doesn't affect its shape, which is what we really care about. That shape of the log-likelihood function is given by

[latex]ln L(\tilde{\beta}|y) = -\dfrac{1}{2} \sum_{i}^{n} (y_{i} - \tilde{\beta})^{2}[/latex]
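A quick numerical check, with invented numbers, that this delivers what it promises: the [latex]\tilde{\beta}[/latex] maximizing the log-likelihood, equivalently minimizing the sum of squared deviations, is just the sample mean.

```python
# Invented data: maximize the log-likelihood -0.5 * sum (y_i - beta)^2
# over a grid of candidate betas.
ys = [9.2, 10.1, 9.8, 10.4, 9.9]

def log_lik(b):
    return -0.5 * sum((y - b) ** 2 for y in ys)

grid = [i / 1000 for i in range(8000, 12001)]  # candidates 8.0..12.0
beta_hat = max(grid, key=log_lik)

# The maximizer coincides with the sample mean.
assert abs(beta_hat - sum(ys) / len(ys)) < 1e-6
```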

Now, there are still several steps left to get to the classical regression model (most obviously, weakening the assumption of a constant mean and instead setting [latex]\mu_{i}=x_{i}\beta[/latex]), but this probably suffices to make the general point: using analytic or numeric techniques (or both), we can estimate parameters of interest in our statistical model by maximizing the likelihood function (thus MLE: maximum likelihood estimation), and that function itself can be defined in ways that reflect the distributional properties of our variables.

This is the sense in which likelihood is a theory of inference: it lets us infer not only the most plausible values of parameters in our model given evidence about the world, but also measures of uncertainty associated with those estimates.

While vitally important, however, this is not really the point of my post.

Look at the tail end of the right-hand side of this last equation above. The expression there ought to be familiar: it looks suspiciously like the sum of squared residuals from the classical regression model!

So, rather than simply appealing to the pleasing visual intuitions of line-fitting, or, alternatively, appealing to the Gauss-Markov theorem as the justification for least squares (LS), by virtue of its yielding the best linear unbiased estimator of the parameters [latex]\beta[/latex] (but why insist on linearity? or unbiasedness, for that matter?), the likelihood approach provides a deeper justification, showing the conditions (normally distributed errors among them) under which LS is the maximum likelihood estimator of our model parameters.

This strikes me as a quite beautiful point, and it frames King's entire pedagogical enterprise in UPM.
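To close the loop numerically (my own toy example, using a no-intercept model [latex]\mu_{i} = x_{i}\beta[/latex] for simplicity): the slope that maximizes the normal log-likelihood matches the least-squares slope.

```python
# Invented data; normal errors assumed, mu_i = x_i * beta (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.9, 6.1, 8.2]

def log_lik(b):
    return -0.5 * sum((y - b * x) ** 2 for x, y in zip(xs, ys))

grid = [i / 1000 for i in range(1000, 3001)]  # candidate slopes 1.0..3.0
beta_mle = max(grid, key=log_lik)

# Closed-form least squares for the no-intercept model.
beta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The two estimates agree, up to the resolution of the grid.
assert abs(beta_mle - beta_ls) < 1e-3
```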

Again, there's more to the demonstration in UPM, but in our seminar at Laurier this sufficed (I hope). The point was not to convince my (math-cautious-to-outright-phobic) students that they need to derive their own estimators if they want to do this stuff. Rather, what I hope they took away is a sense of how the tools we use in the social sciences have deep, even elegant, justifications beyond pretty pictures and venerable theorems.

Furthermore, and perhaps most importantly, understanding at least the broad brush-strokes of those justifications helps us understand the assumptions we have to satisfy if we want those tools to do what we ask of them.



Thursday, April 10, 2014 - 19:36