Standard Deviation of Binary Dummies

When I read manuscripts submitted to or published in economics journals, it is not uncommon for authors to report the standard deviation of binary dummy variables (dummies with values equal to 0 or 1). I’m interested in the TSE readership’s take on whether reporting these standard deviations is useful.

Author: Phil Miller

Published in: General

4 thoughts on “Standard Deviation of Binary Dummies”

  1. In the dummy variable case, the SD doesn’t really give any additional information beyond what N and p give us if you’re talking about, say, a summary table of your independent variables. The SD is directly implied by p and N. It’s largely redundant, if you ask me, and the interpretation isn’t very interesting.

    If you don’t know N, then sure. But you should be reporting N, and we know that SD decreases in N for a given p. Interpreting the SD isn’t very interesting if you ask me.

    There is probably a case where it might be helpful: when you’re interested in a confidence interval around p, for example when checking whether or not you have a fair coin. But I’m not sure this is what you’re asking about, and we don’t care much about an explicit confidence interval around the dummy variables themselves that we use in a regression, do we?

    My guess is that in a lot of these, the authors are just doing:

    summarize x y z

    And reporting what it says without thinking much about whether it’s worth doing. I specifically leave out SD in my tables for summarizing dummy variables. Curious what others think, though.

  2. The mean already tells you the proportion of ones in the sample, p. For a binary variable, var = p(1-p), so the SD adds little information, and it no longer gives you the usual insight into how the observations are spread around the mean. Meh.

  3. I don’t include them because the standard deviations should be 0.5, should they not? If so, the standard deviation provides no additional information. Means, of course, are informative, and I do provide them.

    I calculate standard errors on binary variables in whatever data set I am using because it is a quick way to check for errors in entering at least those particular variables.

  4. A standard deviation is one of the two defining parameters of a Gaussian curve, whose domain is all the real numbers. I find its use in this context absurd.
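
The arithmetic behind several of the comments above can be checked directly. What follows is a minimal Python sketch, not something any commenter posted. The function names `dummy_sd` and `proportion_ci` and the sample of 30 ones in 100 observations are made up for illustration, and the interval uses the simple normal approximation. Under those assumptions it shows that the sample SD of a 0/1 variable can be recovered from p and N alone, that the population SD equals 0.5 only when p = 0.5, and that the standard error of p gives a confidence interval for a fair-coin style check.

```python
import math
import statistics


def dummy_sd(p, n, sample=True):
    """SD of a 0/1 variable whose mean is p across n observations."""
    var = p * (1 - p)  # population variance of a binary variable
    if sample:
        var *= n / (n - 1)  # n-1 correction used for the sample SD
    return math.sqrt(var)


def proportion_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for p (z=1.96 is about 95%)."""
    se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return p - z * se, p + z * se


# Hypothetical dummy: 30 ones and 70 zeros.
x = [1] * 30 + [0] * 70
n = len(x)
p = statistics.mean(x)

sd_from_data = statistics.stdev(x)  # computed from the raw 0/1 values
sd_from_p_n = dummy_sd(p, n)        # computed from p and N only

print(sd_from_data, sd_from_p_n)       # the two values agree
print(dummy_sd(0.5, n, sample=False))  # population SD at p = 0.5
print(proportion_ci(p, n))             # interval around p = 0.3
```

With 30 ones in 100 observations, the interval lies entirely below 0.5, so under this approximation a fair coin would be rejected at roughly the 5% level.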
