What’s Wrong with Probability Notation? (lingpipe-blog.com)
49 points by yarapavan on Oct 21, 2009 | 21 comments


Huh. What I was expecting to see here was a critique of the notation for expressing causality (and then expecting to cite Judea Pearl's do notation).

But regarding the post...

1) Subscripts basically solve the problem of the identical p's.

2) For iterated expectations, again, subscripts solve that problem as well.

E_X[X^Y]

indicates that it's the expectation over X.

3) However, for dummy variables it definitely can be annoying to use P(X=x), especially when writing stuff by hand. Your mental dialogue is saying "x equals x" and it's often very important to distinguish the variable from the value during manipulation.

That's why I tend to use a different letter for the dummy variable -- P(X=k) when X is discrete and either f_X(u) or P(X \in [u-du,u+du]) when X is continuous.
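
To make (1)-(3) concrete, here is a minimal side-by-side sketch of the conventions I'm advocating (my own shorthand, in the thread's LaTeX-ish style):

    p_X(x)                        (subscript names the distribution; x is just a value)
    E_Y[E_{X|Y}[X|Y]] = E[X]      (iterated expectation, each E labeled with its variable)
    P(X = k)                      (discrete: dummy variable k, never x)
    f_X(u), P(X \in [u, u+du])    (continuous: dummy variable u)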


P(X=x) is probably one of the worst ideas ever. Try reading that off a board from a professor who doesn't care too much about making a distinction because it should be obvious which "x" he is referring to.


I don't really see the problem in the first place: we're just writing less by assuming that the subscript is identical to the argument (except maybe for capitalization) - and if it's not, people usually disambiguate by adding the subscripts back in.

You would have to be a twisted soul to write P(x|y) if x is drawn from r.v. Y and y from r.v. X and P[Y|X] is the distribution in question.


> You would have to be a twisted soul to write P(x|y) if x is drawn from r.v. Y and y from r.v. X and P[Y|X] is the distribution in question.

Sure, that would be perverse. What I was referring to was more the fact that capital X and lowercase x look similar on the page and (more importantly) sound similar in my head.

Pedagogically I've found that saying "X takes on the value x" confuses a LOT of undergraduates.

Also, at least for me, if I start slinging around RVs and need to get closed form solutions, it can start to be very important to distinguish between X and the value it takes on as k or u, particularly when trying to do conditional expectations or get explicit distributions on functions of multiple RVs. It's sort of an aural Hungarian notation.
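
For example (a standard textbook derivation, not something from this thread): to get the distribution of Z = X + Y for independent discrete X and Y, the dummy variables do real work:

    P(Z = k) = \sum_j P(X = j) P(Y = k - j)

Try writing that with x doing double duty as both the random variable and its value, and it falls apart.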


Sorry to re-open this so late, but I thought about this discussion today as I was reading a paper which had a typo in the subscript specifying a distribution: http://dx.doi.org/10.1103/PhysRevLett.103.138101 (it never ceases to amaze me that people get away with publishing glaring errors like that one).

In eq. (4), they specify P_{k|v}[x|v]. Thankfully, here, it's easy to spot the typo, because k is discrete and x is continuous. But this made me realize that my objection to these subscripts is really akin to wanting to write DRY, self-commenting code.

The fundamental information is already there in the equation. Adding extra subscripts is then like adding unnecessary comments to code: if they're right, they merely add redundancy (though they may help the uninitiated); if they're wrong, they're infinitely worse than nothing at all (someone who didn't know that PRL lets all kinds of crap fly could really be thrown for a loop figuring out how eq. (4) is possible).

> Pedagogically I've found that saying "X takes on the value x" confuses a LOT of undergraduates.

I haven't taught this to anyone, so your experience is more valuable than mine here. Nonetheless - I noticed in my undergraduate statistics class that programmers (i.e. people who are accustomed to strict rules regarding case sensitivity) had no problem with this, while people accustomed to playing fast and loose with notation (economists and physicists in particular) were somewhat put off.


Our professor said that he preferred working with distributions only, instead of all that confusing random variable business.


There is nothing really wrong with probability notation. As the inferential steps between concepts multiply in math, abuse of notation becomes indispensable.

All that probability shorthand can be unambiguously translated to formal definitions quite easily. But doing so would be analogous to writing a complex program in assembly - doable (indeed, the shorthand is well defined precisely because the translation is possible) but not very productive (and thus not worth doing unless you are debugging or something).


> All that probability shorthand can be unambiguously translated to formal definitions quite easily. But doing so would be analogous to writing a complex program in assembly - doable (and defined pretty much by the very fact that this is doable) but not very productive (and thus not worth doing unless you are debugging or something).

Actually I kind of disagree here.

With R or Haskell you can easily work directly with probability densities learned from data. One frequently uses the exact Bayes' rule expression with P(X), P(Y), and P(X|Y) all being known functions to get P(Y|X).
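
Here's a minimal R sketch of that workflow - the data, the two-class setup, and the prior are all invented for illustration:

    # Estimate p(x|y) for two classes from (made-up) samples, then apply
    # Bayes' rule pointwise to get P(Y=1|X=x).
    set.seed(1)
    x_given_y1 <- rnorm(500, mean = 2)    # samples of X where Y = 1
    x_given_y0 <- rnorm(500, mean = 0)    # samples of X where Y = 0
    p_y1 <- 0.3                           # assumed prior P(Y = 1)

    f1 <- approxfun(density(x_given_y1))  # kernel estimate of p(x | Y = 1), as an actual function
    f0 <- approxfun(density(x_given_y0))  # kernel estimate of p(x | Y = 0)

    posterior_y1 <- function(x) {
      num <- f1(x) * p_y1
      num / (num + f0(x) * (1 - p_y1))    # Bayes' rule, exactly as written on paper
    }
    posterior_y1(1.0)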

See for example functions like ecdf, which takes in a vector of N points on the real line and returns an actual function, namely the empirical cumulative distribution function.

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ec...

Can be very handy when you want empirical quantiles (e.g. "what percentage of the time do I expect to see 12000 hits in a day, given this single column with the hits for each of the last 200 days").
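
A toy version of that calculation, with 'hits' standing in for the real 200-day column:

    hits <- rpois(200, lambda = 11500)  # invented daily hit counts
    Fhat <- ecdf(hits)                  # an actual R function: the empirical CDF
    1 - Fhat(12000)                     # estimated fraction of days with more than 12000 hits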


I don't really understand why what you said disagrees with what your parent said?


Perhaps I read too quickly -- when he said:

> All that probability shorthand can be unambiguously translated to formal definitions quite easily. But doing so would be analogous to writing a complex program in assembly

One possible interpretation (probably, in retrospect, the right one) is that he meant that Whitehead/Russell style axiomatization of probability was in theory possible, but would not be of much value.

I read it initially (likely wrongly, in retrospect) as saying that translating the equations into an unambiguous, formal, computer-readable definition would be intractable and/or only of theoretical interest.


I thought I'd jump in as the author of the original post.

The context is that I'm trying to write an introduction to Bayesian stats for people who know calc and matrices, but may not have taken or understood math stats. Specifically, I want to (a) use the notation that's commonly used in the field (e.g. in Andrew Gelman et al.'s books, Michael Jordan et al.'s papers, etc.), and (b) not confuse readers with a long introduction to sample spaces and a sketchy description of measures, just so I could introduce precise random variable notation only to abuse it.

The big problem with trying to define continuous densities is that you never get enough measure theory in an intro to probability (e.g. DeGroot and Schervish, Larsen and Marx) to bottom out in a real definition. It's not that complex, so if you're interested, I'd highly recommend Kolmogorov's own intro to analysis, which has great coverage of both Lebesgue integration (so you can understand the usual R^n case) and general measure theory (so you can impress your friends with your knowledge of analysis).


I don't think that the author's points are valid. He seems to be using references with sloppy notation.

Suffice it to say that if he picks up a mathematical probability textbook he should be satisfied.

* I do agree that people use shortcuts to make equations seem simpler, that some standard equations look complicated, and that you need to think hard about which scenario is appropriate for your application.


Just a quick rebuttal of the author's specific points:

(1) Given a set of elements X, E(X) = \sum_{x \in X} x p(x). The problem the author mentioned is solved, since we are now summing over all elements in X rather than using the input variable inappropriately.

(2) Given sets of elements X and Y, and the set of ALL elements O, then p(X), p(Y), and p(X|Y) are all computed in the same manner. p(X) is shorthand for p(X|O) -- so we are now given three analogous functions, p(X|O), p(Y|O), and p(X|Y). So, Bayes' can be used to compute all three in the exact same manner, if you so wish.

The above rebuttals are obviously discrete, but there are analogous continuous variable scenarios.
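
To spell out point (2): since p(O) = 1 by definition, the conditional form collapses to the unconditional one,

    p(X|O) = p(X \cap O) / p(O) = p(X) / 1 = p(X)

so all three really are instances of the same function.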


Oh yes, I agree. Standard probability notation is handy and fast but just depends too much on the context.

I wonder if someone has created a more orthogonal notation for probability, like Sussman did with the Schemish/functional notation for differential geometry.


Any good pointers to Sussman's notation?




Thanks.


The author of the article already abused the notation before he said what is wrong with it. P(A) usually denotes the probability function and p(x) is a probability density function.

The problem is that the distinction between events and variables isn't always clear.
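
For the record, the two notions are linked by integration: for a continuous random variable X with density p,

    P(X \in A) = \int_A p(x) dx

which is presumably why the two symbols get conflated so often.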


The P in P(X) and P(Y) IS actually the same P. It is the probability measure on the underlying sample space. X and Y are random variables mapping from that sample space to the real line. P(X=x) is shorthand for P(X^{-1}({x})).
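
Spelled out: given a probability space (\Omega, F, P) and a random variable X : \Omega -> R,

    P(X = x) := P({\omega \in \Omega : X(\omega) = x}) = P(X^{-1}({x}))

so both P(X) and P(Y) bottom out in that single measure P on \Omega.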


I was hoping to see something like "It's unambiguous only around three-quarters of the time." Alas.



