
I looked through Downey's book.

(A) Definition of Probability.

As far as I could tell, the book defined probability in only one of two ways:

(1) Relative frequency assuming finitely many trials (page 13) and

(2) Bayesian prior intuitive belief (page 52).

Neither definition is adequate; here the book is doing readers a disservice.

(B) Exponential Distribution

The 'derivation' of the exponential distribution on page 37:

"I’ll start with the exponential distribution because it is easy to work with. In the real world, exponential distributions come up when we look at a series of events and measure the times between events, which are called inter-arrival times. If the events are equally likely to occur at any time, the distribution of inter-arrival times tends to look like an exponential distribution."

is a bit too vague.

(C) Random.

The book also uses 'random' too frequently, loosely, and unnecessarily. Instead, we can essentially just drop the word 'random' and avoid questions about what 'random' means and whether some data is 'truly random'.

For (A), essentially universally in advanced work, there is only one approach:

We have a non-empty set of 'trials'. Then an 'event' is a subset of the set of all trials. The set of all trials is also an event. The set of all events is closed under countable unions and relative complements. In particular, the empty set is an event.

A 'probability' is a function P from the set of all events to the interval [0,1]. For an event A, P(A) is its probability. The probability of the set of all trials is 1.

P is 'countably additive': For countably infinitely many pair-wise disjoint events A_i, for i = 1, 2, ..., the probability of the union of the A_i is the sum of the P(A_i).

A 'random variable' X is a function from the set of all trials to the real numbers such that for any real number x the set of all trials w such that X(w) <= x is an event. Usually we suppress the notation of the trials and just write the event X <= x. This is our only use of 'random', and we give no more definition of it.
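This definition is concrete enough to compute with; here is a minimal sketch for a finite set of trials (two fair coin flips; the names and the example are mine, not from the book):

```python
from itertools import product

# The set of all trials: outcomes of two fair coin flips.
trials = list(product(['H', 'T'], repeat=2))

# P assigns each event (a subset of the trials) its probability;
# with equally likely trials this is just a proportion.
def P(event):
    return len(event) / len(trials)

# A random variable X: a function from trials to the reals.
# Here X counts the heads in a trial.
def X(w):
    return sum(1 for flip in w if flip == 'H')

# The event X <= x is the set of all trials w with X(w) <= x.
def event_X_le(x):
    return {w for w in trials if X(w) <= x}

print(P(event_X_le(1)))  # P(X <= 1) = 3/4
```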

In practice, essentially any number at all that we observe can be regarded as a random variable. In particular, the number might have been something observed about a person. The statement on page 52 that

"Anything involving people is pretty much off the table."

would be a silly foundation for probability.

For Bayesian and prior beliefs, we can regard such a belief as an estimate of a conditional probability where we condition on what we know. E.g., we know a lot about people, so that given a person, we can guess that with probability over 99% the person is less than 10 feet tall. This point does not mean that we have a 'Bayesian' foundation for probability. It's better just to drop mention of 'Bayesian'.

The 'distribution' of a random variable X is the function F_X(x) = P(X <= x).
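Given data, the natural estimate of F_X is the empirical distribution function; a short sketch (the sample values are hand-picked, purely illustrative):

```python
def empirical_cdf(sample, x):
    # Fraction of observations at or below x: an estimate of F_X(x) = P(X <= x).
    return sum(1 for v in sample if v <= x) / len(sample)

heights = [62.0, 65.5, 68.0, 70.5, 74.0]  # illustrative data
print(empirical_cdf(heights, 68.0))  # 0.6
```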

Full details are in any of, say,

Jacques Neveu, 'Mathematical Foundations of the Calculus of Probability', Holden-Day.

Leo Breiman, 'Probability', ISBN 0-89871-296-3, SIAM.

M. Loeve, 'Probability Theory, I and II, 4th Edition', Springer-Verlag.

Kai Lai Chung, 'A Course in Probability Theory, Second Edition', ISBN 0-12-174650-X, Academic Press.

Yuan Shih Chow and Henry Teicher, 'Probability Theory: Independence, Interchangeability, Martingales', ISBN 0-387-90331-3, Springer-Verlag.

Essentially what is going on is that we are defining P as a non-negative real 'measure' with 'total mass 1' as in any of:

Paul R. Halmos, 'Measure Theory', D. Van Nostrand Company, Inc.

H. L. Royden, 'Real Analysis: Second Edition', Macmillan.

Walter Rudin, 'Real and Complex Analysis', ISBN 07-054232-5, McGraw-Hill.

This approach to probability is the one used in essentially all advanced work, e.g.,

Ioannis Karatzas and Steven E. Shreve, 'Brownian Motion and Stochastic Calculus, Second Edition', ISBN 0-387-97655-8.

Jean-Rene Barra, 'Mathematical Basis of Statistics', ISBN 0-12-079240-0, Academic Press.

Early in Neveu one can see some of the ways we get essentially pushed into this foundation for probability.

With this approach, for a random variable X, we define its expectation E[X] as just its integral with respect to measure P in the sense of measure theory. Then essentially by a routine 'change of variable' we can get E[X] as an integral in terms of the distribution F_X and the measure it defines on the reals.
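As a numerical illustration of that change of variable, we can compute E[X] two ways and compare: as an average over many simulated trials, and as the integral of x against the density that F_X defines on the reals. Here X is exponential with rate 1, so both numbers should be near 1 (the rate and the integration cutoff are arbitrary choices of mine):

```python
import random
import math

random.seed(0)

# E[X] as an average over many simulated trials.
n = 100_000
mc = sum(random.expovariate(1.0) for _ in range(n)) / n

# E[X] as the integral of x dF_X(x) = x e^{-x} dx on [0, infinity),
# approximated by a Riemann sum on [0, 50].
dx = 0.001
integral = sum(x * math.exp(-x) * dx
               for x in (i * dx for i in range(int(50 / dx))))

print(mc, integral)  # both close to 1
```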

With this approach, if, say, for some positive integer n we measure the height of n people, then we say that we have the values of the n random variables

X_1, X_2, ..., X_n

and do not say that we have the values of n trials of one random variable X.

Indeed, with this approach, all our experience, all of the universe, is only one trial.

Given a set of events, we can define what it means for the set to be 'independent'. And we can proceed similarly for random variables.

Then we can give a more careful statement of the central limit theorem and also state the law of large numbers. We can also say why we use an average to estimate an expectation, e.g., as in

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.

On page 12 there is:

"The mean of this sample is 100 pounds, but if I told you 'The average pumpkin in my garden is 100 pounds,' that would be wrong, or at least misleading. In this example, there is no meaningful average because there is no typical pumpkin."

No: An 'average' does not have to be 'typical' to be 'meaningful'. The expectation of the weight X of a pumpkin in the garden remains meaningful. And the law of large numbers will still apply.
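A quick simulated check of that point (the garden here is made up, with weights chosen so the expectation is 100): even with no pumpkin anywhere near the 'typical' weight, the sample average settles near the expectation.

```python
import random

random.seed(1)

# A bimodal garden: mostly 1-pound pumpkins, a few 991-pound prize winners.
# E[X] = 0.9 * 1 + 0.1 * 991 = 100, though no pumpkin weighs near 100.
def pumpkin():
    return 1.0 if random.random() < 0.9 else 991.0

n = 200_000
avg = sum(pumpkin() for _ in range(n)) / n
print(avg)  # near 100, by the law of large numbers
```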

For the exponential distribution, there is a good 'qualitative' derivation: If we have arrivals where the numbers of arrivals in disjoint time intervals are independent and the distribution of the number of arrivals in an interval depends only on the length of the interval, then the inter-arrival times are independent and identically distributed with an exponential distribution. Good details are early in the chapter on Poisson processes in

Erhan Cinlar, 'Introduction to Stochastic Processes', ISBN 0-13-498089-1, Prentice-Hall.
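A simulation sketch of that derivation (the rate and window are arbitrary choices of mine): scatter arrival times uniformly at random over a window, a simple model of 'equally likely to occur at any time', and the inter-arrival times come out approximately exponential.

```python
import random
import math

random.seed(2)

rate, window = 2.0, 10_000.0
n = int(rate * window)

# Arrivals equally likely anywhere in the window.
arrivals = sorted(random.uniform(0, window) for _ in range(n))
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]

# For an exponential with rate 2: mean 1/2, and P(gap > 0.5) = e^{-1}.
mean_gap = sum(gaps) / len(gaps)
frac_long = sum(1 for g in gaps if g > 0.5) / len(gaps)
print(mean_gap, frac_long, math.exp(-1))
```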

Downey's book places a lot of emphasis on distributions, e.g., Weibull, and descriptions, e.g., skewness. So an implication is that, given some data, we should try to find its distribution and various summary properties, but usually this effort is unpromising. It is good to have the concept of a distribution but, given some data, usually not good to try to find its distribution. Instead, we get our results by manipulating the data in ways where we need to know little or nothing about the distribution.

The strong emphasis on hypothesis testing is curious. There is more in:

E. L. Lehmann, 'Testing Statistical Hypotheses', John Wiley and Sons.

E. L. Lehmann, 'Nonparametrics: Statistical Methods Based on Ranks', ISBN 0-8162-4994-6.

Downey's book mentioned the chi-square distribution: The main point is that if X_1, X_2, ..., X_k are random variables with Gaussian distribution with mean 0 and variance 1, then

Y = X_1^2 + X_2^2 + ... + X_k^2

has chi-squared distribution with k degrees of freedom.
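A quick simulated check of that statement (k is an arbitrary choice of mine): the chi-squared distribution with k degrees of freedom has mean k and variance 2k, and the sum of k squared standard Gaussians matches.

```python
import random

random.seed(3)

k, n = 5, 100_000
# n observations of Y = X_1^2 + ... + X_k^2 with the X_i standard Gaussian.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(n)]

mean = sum(samples) / n
var = sum((y - mean) ** 2 for y in samples) / (n - 1)
print(mean, var)  # near k = 5 and 2k = 10
```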

Downey's book mentioned the variance of a finite population; while variance is of high interest, the variance of a finite population is not. What is usually wanted is an estimate of variance, and for that the formula in the book should divide by n - 1 instead of n; this change yields an 'unbiased' estimate.
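A short simulation of the n versus n - 1 point (the sample size and distribution are arbitrary choices of mine): averaging each estimator over many small samples, the n - 1 version lands on the true variance and the n version undershoots by the factor (n - 1)/n.

```python
import random

random.seed(4)

n, reps = 5, 200_000
# Sampling from a standard Gaussian, so the true variance is 1.
biased_sum = unbiased_sum = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n          # divide by n
    unbiased_sum += ss / (n - 1)  # divide by n - 1

# Expect (n - 1)/n = 0.8 for the biased version, 1.0 for the unbiased.
print(biased_sum / reps, unbiased_sum / reps)
```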

It would be good for students to get at least an intuitive understanding of most powerful hypothesis testing via the Neyman-Pearson result. Using an analogy from real estate investing: regard the false alarm rate as money to be spent, and spend the money to get the best return on investment by buying the property with the highest ROI first, then the one with the second highest, etc. Yes, in principle this process leads to a knapsack problem, which is NP-complete. Likely the nicest proof of the Neyman-Pearson result is from the Hahn decomposition; this follows from the Radon-Nikodym result, with a famous proof by von Neumann in Rudin.
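A toy sketch of that greedy picture (the finite sample space and probabilities are made up): rank the outcomes by likelihood ratio and 'spend' the false alarm budget on the highest-ratio outcomes first. (Taking whole outcomes only, with no randomization, is what gives the knapsack flavor.)

```python
# Outcome probabilities under H0 (null) and H1 (alternative); made up.
p0 = [0.4, 0.3, 0.2, 0.1]
p1 = [0.1, 0.2, 0.3, 0.4]

alpha = 0.3  # false alarm budget: total H0 probability we may reject on

# Greedy Neyman-Pearson: reject on the outcomes with the largest
# likelihood ratio p1/p0 first, until the budget is spent.
order = sorted(range(len(p0)), key=lambda i: p1[i] / p0[i], reverse=True)
reject, spent, power = [], 0.0, 0.0
for i in order:
    if spent + p0[i] <= alpha + 1e-12:
        reject.append(i)
        spent += p0[i]   # false alarm probability used
        power += p1[i]   # detection probability gained

print(sorted(reject), spent, power)  # [2, 3], 0.3, 0.7
```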

The mention of 'resampling' is curious: There are cases where we won't have independent random variables but will have 'exchangeable' random variables, and exchangeability can be enough.
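One standard case where exchangeability is exactly the needed assumption is the permutation test; a minimal sketch (the two-group data are made up): under the null hypothesis the combined observations are exchangeable, so randomly relabeling them gives the null distribution of the test statistic.

```python
import random

random.seed(5)

# Made-up two-group data.
a = [5.1, 6.2, 5.8, 6.5, 6.0]
b = [4.8, 5.0, 4.6, 5.2, 4.9]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(a) - mean(b)
pooled = a + b

# Under the null the labels are exchangeable: permute them at random
# and see how often the relabeled difference reaches the observed one.
count, reps = 0, 10_000
for _ in range(reps):
    random.shuffle(pooled)
    diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
    if diff >= observed:
        count += 1

p_value = count / reps
print(observed, p_value)
```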

One example is extending nonparametric tests to multidimensional data and using it for anomaly detection in server farms and networks where multidimensional data is ubiquitous, e.g.,

N. B. Waite, "A Real-Time System-Adapted Anomaly Detector", 'Information Sciences', volume 115, April, 1999, pages 221-259.


