This book is available for 34.95 dollars (plus shipping and taxes) at: Amazon or
Overseas Link Inc
1006 Merrywood Dr
Edison, NJ 08817

Tel: 732-393-0132
Contact person: Ms Shivani Narang
Order here. Below is the introduction to the text.

What is probability theory?

Oliver Knill

Probability theory is a fundamental pillar of modern mathematics with relations to other mathematical areas like algebra, topology, analysis, geometry or dynamical systems. As with any fundamental mathematical construction, the theory starts by adding more structure to a set $\Omega$. In a similar way as introducing algebraic operations, a topology, or a time evolution on a set, probability theory adds a measure theoretical structure to $\Omega$ which generalizes "counting" on finite sets: in order to measure the probability of a subset $A \subset \Omega$, one singles out a class of subsets $\Acal$, on which one can hope to do so. This leads to the notion of a $\sigma$-algebra $\Acal$. It is a set of subsets of $\Omega$ in which on can perform finitely or countably many operations like taking unions, complements or intersections. The elements in $\Acal$ are called events. If a point $\omega$ in the "laboratory" $\Omega$ denotes an "experiment", an "event" $A \in \Acal$ is a subset of $\Omega$, for which one can assign a probability $\Prob[A] \in [0,1]$. For example, if $\Prob[A]=1/3$, the event happens with probability $1/3$. If $\Prob[A]=1$, the event takes place almost certainly. The probability measure $\Prob$ has to satisfy obvious properties like that the union $A \cup B$ of two disjoint events $A,B$ satisfies $\Prob[A \cup B]
= \Prob[A] + \Prob[B]$ or that the complement $A^c$ of an event $A$ has the probability $\Prob[A^c]=1-\Prob[A]$. With a probability space $(\Omega,\Acal,\Prob)$ alone, there is already some interesting mathematics: one has for example the combinatorial problem to find the probabilities of events like the event to get a "royal flush" in poker. If $\Omega$ is a subset of an Euclidean space like the plane, $\Prob[A] = \int_A f(x,y) \; dx dy$ for a suitable nonnegative function $f$, we are led to integration problems in calculus. Actually, in many applications, the probability space is part of Euclidean space and the $\sigma$-algebra is the smallest which contains all open sets. It is called the Borel $\sigma$-algebra. An important example is the Borel $\sigma$-algebra on the real line.

Given a probability space $(\Omega,\Acal,\Prob)$, one can define random variables $X$. A random variable is a function $X$ from $\Omega$ to the real line $\RR$ which is measurable in the sense that the inverse of a measurable Borel set $B$ in $\RR$ is in $\Acal$. The interpretation is that if $\omega$ is an experiment, then $X(\omega)$ measures an observable quantity of the experiment. The technical condition of measurability resembles the notion of a continuity for a function $f$ from a topological space $(\Omega,\Ocal)$ to the topological space $(\RR,\Ucal)$. A function is continuous if $f^{-1}(U) \in \Ocal$ for all open sets $U \in \Ucal$. In probability theory, where functions are often denoted with capital letters, like $X,Y, \dots $, a random variable $X$ is measurable if $X^{-1}(B) \in \Acal$ for all Borel sets $B \in \Bcal$. Any continuous function is measurable for the Borel $\sigma$-algebra. As in calculus, where one does not have to worry about continuity most of the time, also in probability theory, one often does not have to sweat about measurability issues. Indeed, one could suspect that notions like $\sigma$-algebras or measurability were introduced by mathematicians to scare normal folks away from their realms. This is not the case. Serious issues are avoided with those constructions. Mathematics is eternal: a once established result will be true also in thousands of years. A theory in which one could prove a theorem as well as its negation would be worthless: it would formally allow to prove any other result, whether true or false. So, these notions are not only introduced to keep the theory "clean", they are essential for the "survival" of the theory. We give some examples of "paradoxes" to illustrate the need for building a careful theory. Back to the fundamental notion of random variables: because they are just functions, one can add and multiply them by defining $(X+Y)(\omega) = X(\omega) + Y(\omega)$ or $(X Y)(\omega) = X(\omega) Y(\omega)$. Random variables form so an algebra $\Lcal$. The expectation of a random variable $X$ is denoted by $\E[X]$ if it exists. It is a real number which indicates the "mean" or "average" of the observation $X$. It is the value, one would expect to measure in the experiment. If $X = 1_B$ is the random variable which has the value $1$ if $\omega$ is in the event $B$ and $0$ if $\omega$ is not in the event $B$, then the expectation of $X$ is just the probability of $B$. The constant random variable $X(\omega) = a$ has the expectation $\E[X]=a$. These two basic examples as well as the linearity requirement $\E[a X + b Y]= a \E[X] + b \E[Y]$ determine the expectation for all random variables in the algebra $\Lcal$: first one defines expectation for finite sums $\sum_{i=1}^n a_i 1_{B_i}$ called elementary random variables, which approximate general measurable functions. Extending the expectation to a subset $\Lcal^1$ of the entire algebra is part of integration theory. While in calculus, one can live with the Riemann integral on the real line, which defines the integral by Riemann sums $\int_a^b f(x) \; dx \sim \frac{1}{n} \sum_{i/n \in [a,b]} f(i/n)$, the integral defined in measure theory is the Lebesgue integral. The later is more fundamental and probability theory is a major motivator for using it. It allows to make statements like that the probability of the set of real numbers with periodic decimal expansion has probability $0$. In general, the probability of $A$ is the expectation of the random variable $X(x) = f(x) = 1_A(x)$. In calculus, the integral $\int_0^1 f(x) \; dx$ would not be defined because a Riemann integral can give $1$ or $0$ depending on how the Riemann approximation is done. Probability theory allows to introduce the Lebesgue integral by defining $\int_a^b f(x) \; dx$ as the limit of $\frac{1}{n} \sum_{i=1}^n f(x_i)$ for $n \to \infty$, where $x_i$ are random uniformly distributed points in the interval $[a,b]$. This Monte Carlo definition of the Lebesgue integral is based on the law of large numbers and is as intuitive to state as the Riemann integral which is the limit of $\frac{1}{n} \sum_{x_j=j/n \in [a,b]} f(x_j)$ for $n \to \infty$.

With the fundamental notion of expectation one can define the variance, $\Var[X] = \E[X^2]-\E[X]^2$ and the standard deviation $\sigma[X] = \sqrt{\Var[X]}$ of a random variable $X$ for which $X^2 \in \Lcal^1$. One can also look at the covariance $\Cov[X Y] = \E[X Y] - \E[X] \E[Y]$ of two random variables $X,Y$ for which $X^2,Y^2 \in \Lcal^1$. The correlation $\Corr[X,Y] = \Cov[X Y]/(\sigma[X] \sigma[Y])$ of two random variables with positive variance is a number which tells how much the random variable $X$ is related to the random variable $Y$. If $\E[X Y]$ is interpreted as an inner product, then the standard deviation is the length of $X-\E[X]$ and the correlation has the geometric interpretation as $\cos(\alpha)$, where $\alpha$ is the angle between the centered random variables $X-\E[X]$ and $Y-\E[Y]$. For example, if $\Cov[X,Y]=1$, then $Y = \lambda X$ for some $\lambda>0$, if $\Cov[X,Y]=-1$, they are anti-parallel. If the correlation is zero, the geometric interpretation is that the two random variables are perpendicular. Decorrelated random variables still can have relations to each other but if for any measurable real functions $f$ and $g$, the random variables $f(X)$ and $g(X)$ are uncorrelated, then the random variables $X,Y$ are independent.

A random variable $X$ can be described well by its distribution function $F_X$. This is a real-valued function defined as $F_X(s) = \Prob[ X \leq s ]$ on $\RR$, where $\{ X \leq s \; \}$ is the event of all experiments $\omega$ satisfying $X(\omega) \leq s$. The distribution function does not encode the internal structure of the random variable $X$; it does not reveal the structure of the probability space for example. But the function $F_X$ allows the construction of a probability space with exactly this distribution function. There are two important types of distributions, continuous distributions with a probability density function $f_X=F'_X$ and discrete distributions for which $F$ is piecewise constant. An example of a continuous distribution is the standard normal distribution, where $f_X(x) = e^{-x^2/2}/\sqrt{2\pi}$. One can characterize it as the distribution with maximal entropy $I(f) = - \int \log(f(x)) f(x) \; dx$ among all distributions which have zero mean and variance $1$. An example of a discrete distribution is the Poisson distribution $\Prob[X=k] = e^{-\lambda} \frac{\lambda^k}{k!}$ on $\NN=\{0,1,2, \dots \; \}$. One can describe random variables by their moment generating functions $M_X(t) = \E[e^{X t}]$ or by their characteristic function $\phi_X(t) = \E[e^{i X t}]$. The later is the Fourier transform of the law $\mu_X = F'_X$ which is a measure on the real line $\RR$.

The law $\mu_X$ of the random variable is a probability measure on the real line satisfying $\mu_X( (a,b] ) = F_X(b)-F_X(a)$. By the Lebesgue decomposition theorem, one can decompose any measure $\mu$ into a discrete part $\mu_{pp}$, an absolutely continuous part $\mu_{ac}$ and a singular continuous part $\mu_{sc}$. Random variables $X$ for which $\mu_X$ is a discrete measure are called discrete random variables, random variables with a continuous law are called continuous random variables. Traditionally, these two type of random variables are the most important ones. But singular continuous random variables appear too: in spectral theory, dynamical systems or fractal geometry. Of course, the law of a random variable $X$ does not need to be pure. It can mix the three types. A random variable can be mixed discrete and continuous for example.

Inequalities play an important role in probability theory. The Chebychev inequality $\Prob[\vert X-\E[X]\vert \geq c] \leq \frac{\Var[X]}{c^2}$ is used very often. It is a special case of the Chebychev-Markov inequality $h(c) \cdot \Prob[X \geq c] \leq \E[h(X)]$ for monotone nonnegative functions $h$. Other inequalities are the Jensen inequality $\E[h(X)] \geq h(\E[X])$ for convex functions $h$, the Minkowski inequality $\vert\vert X+Y\vert\vert _p \leq \vert\vert X\vert\vert _p + \vert\vert Y\vert\vert _p$ or the Hölder inequality $\vert\vert X Y\vert\vert _1 \leq \vert\vert X\vert\vert _p \vert\vert Y\vert\vert _q, 1/p+1/q=1$ for random variables, $X,Y$, for which $\vert\vert X\vert\vert _p=\E[\vert X\vert^p], \vert\vert Y\vert\vert _q=\E[\vert Y\vert^q]$ are finite. Any inequality which appears in analysis can be useful in the toolbox of probability theory.

Independence is an central notion in probability theory. Two events $A,B$ are called independent, if $\Prob[A \cap B] = \Prob[A] \cdot \Prob[B]$. An arbitrary set of events $A_i$ is called independent, if for any finite subset of them, the probability of their intersection is the product of their probabilities. Two $\sigma$-algebras $\Acal,\Bcal$ are called independent, if for any pair $A \in \Acal, B \in \Bcal$, the events $A,B$ are independent. Two random variables $X,Y$ are independent, if they generate independent $\sigma$-algebras. It is enough to check that the events $A =\{ X \in (a,b) \}$ and $B=\{ Y \in (c,d) \}$ are independent for all intervals $(a,b)$ and $(c,d)$. One should think of independent random variables as two aspects of the laboratory $\Omega$ which do not influence each other. Each event $A =\{ a < X(\omega) < b \; \}$ is independent of the event $B = \{ c < Y(\omega) <d \; \}$. While the distribution function $F_{X+Y}$ of the sum of two independent random variables is a convolution $\int_{\RR} F_X(t-s) \; dF_Y(s)$, the moment generating functions and characteristic functions satisfy the formulas $M_{X+Y}(t) = M_X(t) M_Y(t)$ and $\phi_{X+Y}(t) = \phi_X(t) \phi_Y(t)$. These identities make $M_X,\phi_X$ valuable tools to compute the distribution of an arbitrary finite sum of independent random variables.

Independence can also be explained using conditional probability with respect to an event $B$ of positive probability: the conditional probability $\Prob[A \vert B] = \Prob[A \cap B]/\Prob[B]$ of $A$ is the probability that $A$ happens when we know that $B$ takes place. If $B$ is independent of $A$, then $\Prob[A\vert B] = \Prob[A]$ but in general, the conditional probability is larger. The notion of conditional probability leads to the important notion of conditional expectation $\E[X \vert \Bcal]$ of a random variable $X$ with respect to some sub-$\sigma$-algebra $\Bcal$ of the $\sigma$ algebra $\Acal$; it is a new random variable which is $\Bcal$-measurable. For $\Bcal=\Acal$, it is the random variable itself, for the trivial algebra $\Bcal=\{ \emptyset, \Omega \; \}$, we obtain the usual expectation $\E[X] = \E[X \vert \{\emptyset,\Omega \;\}]$. If $\Bcal$ is generated by a finite partition $B_1,\dots ,B_n$ of $\Omega$ of pairwise disjoint sets covering $\Omega$, then $\E[X \vert \Bcal]$ is piecewise constant on the sets $B_i$ and the value on $B_i$ is the average value of $X$ on $B_i$. If $\Bcal$ is the $\sigma$-algebra of an independent random variable $Y$, then $\E[X \vert Y ] = \E[X \vert \Bcal] = \E[X]$. In general, the conditional expectation with respect to $\Bcal$ is a new random variable obtained by averaging on the elements of $\Bcal$. One has $\E[X \vert Y ]=h(Y)$ for some function $h$, extreme cases being $\E[X \vert 1] = \E[X], \E[X \vert X] = X$. An illustrative example is the situation where $X(x,y)$ is a continuous function on the unit square with $\Prob=dx dy$ as a probability measure and where $Y(x,y) = x$. In that case, $\E[X \vert Y]$ is a function of $x$ alone, given by $\E[X \vert Y ](x) = \int_0^1 f(x,y) \; dy$. This is called a conditional integral.

A set $\{ X_t \}_{t \in T}$ of random variables defines a stochastic process. The variable $t \in T$ is a parameter called "time". Stochastic processes are to probability theory what differential equations are to calculus. An example is a family $X_n$ of random variables which evolve with discrete time $n \in \NN$. Deterministic dynamical system theory branches into discrete time systems, the iteration of maps and continuous time systems, the theory of ordinary and partial differential equations. Similarly, in probability theory, one distinguishes between discrete time stochastic processes and continuous time stochastic processes. A discrete time stochastic process is a sequence of random variables $X_n$ with certain properties. An important example is when $X_n$ are independent, identically distributed random variables. A continuous time stochastic process is given by a family of random variables $X_t$, where $t$ is real time. An example is a solution of a stochastic differential equation. With more general time like $\ZZ^d$ or $\RR^d$ random variables are called random fields which play a role in statistical physics. Examples of such processes are percolation processes.

While one can realize every discrete time stochastic process $X_n$ by a measure-preserving transformation $T: \Omega \to \Omega$ and $X_n(\omega) = X(T^n(\omega))$, probability theory often focuses a special subclass of systems called martingales, where one has a filtration $\Acal_n \subset \Acal_{n+1}$ of $\sigma$-algebras such that $X_n$ is $\Acal_n$-measurable and $\E[X_n \vert \Acal_{n-1}] = X_{n-1}$, where $\E[X_n \vert \Acal_{n-1}]$ is the conditional expectation with respect to the sub-algebra $\Acal_{n-1}$. Martingales are a powerful generalization of the random walk, the process of summing up IID random variables with zero mean. Similar as ergodic theory, martingale theory is a natural extension of probability theory and has many applications.

The language of probability fits well into the classical theory of dynamical systems. For example, the ergodic theorem of Birkhoff for measure-preserving transformations has as a special case the law of large numbers which describes the average of partial sums of random variables $\frac{1}{n} \sum_{k=1}^m X_k$. There are different versions of the law of large numbers. "Weak laws" make statements about convergence in probability, "strong laws" make statements about almost everywhere convergence. There are versions of the law of large numbers for which the random variables do not need to have a common distribution and which go beyond Birkhoff's theorem. An other important theorem is the central limit theorem which shows that $S_n=X_1+X_2+ \cdots +X_n$ normalized to have zero mean and variance $1$ converges in law to the normal distribution or the law of the iterated logarithm which says that for centered independent and identically distributed $X_k$, the scaled sum $S_n/\Lambda_n$ has accumulation points in the interval $[-\sigma,\sigma]$ if $\Lambda_n = \sqrt{2 n \log \log n}$ and $\sigma$ is the standard deviation of $X_k$. While stating the weak and strong law of large numbers and the central limit theorem, different convergence notions for random variables appear: almost sure convergence is the strongest, it implies convergence in probability and the later implies convergence convergence in law. There is also $\Lcal^1$-convergence which is stronger than convergence in probability.

As in the deterministic case, where the theory of differential equations is more technical than the theory of maps, building up the formalism for continuous time stochastic processes $X_t$ is more elaborate. Similarly as for differential equations, one has first to prove the existence of the objects. The most important continuous time stochastic process definitely is Brownian motion $B_t$. Standard Brownian motion is a stochastic process which satisfies $B_0=0$, $\E[B_t]=0$, $\Cov[B_s,B_t] = s$ for $s \leq t$ and for any sequence of times, $0=t_0<t_1< \cdots <t_i<t_{i+1}$, the increments $B_{t_{i+1}}-B_{t_i}$ are all independent random vectors with normal distribution. Brownian motion $B_t$ is a solution of the stochastic differential equation $\frac{d}{dt} B_t = \zeta(t)$, where $\zeta(t)$ is called white noise. Because white noise is only defined as a generalized function and is not a stochastic process by itself, this stochastic differential equation has to be understood in its integrated form $B_t = \int_0^t \; dB_s = \int_0^t \zeta(s) \; ds$.

More generally, a solution to a stochastic differential equation $\frac{d}{dt} X_t = f(X_t) \zeta(t) + g(X_t)$ is defined as the solution to the integral equation $X_t = X_0 +\int_0^t f(X_s) \; dB_t + \int_0^t g(X_s) \; ds$. Stochastic differential equations can be defined in different ways. The expression $\int_0^t f(X_s) \; dB_t$ can either be defined as an Ito integral, which leads to martingale solutions, or the Stratonovich integral, which has similar integration rules than classical differentiation equations. Examples of stochastic differential equations are $\frac{d}{dt} X_t = X_t \zeta(t)$ which has the solution $X_t = e^{B_t-t/2}$. Or $\frac{d}{dt} X_t = B_t^4 \zeta(t)$ which has as the solution the process $X_t = B_t^5-10 B_t^3 + 15 B_t$. The key tool to solve stochastic differential equations is Ito's formula $f(B_t) - f(B_0) = \int_0^t f'(B_s) dB_s + \frac{1}{2} \int_0^t f''(B_s) \; ds$, which is the stochastic analog of the fundamental theorem of calculus. Solutions to stochastic differential equations are examples of Markov processes which show diffusion. Especially, the solutions can be used to solve classical partial differential equations like the Dirichlet problem $\Delta u =0$ in a bounded domain $D$ with $u=f$ on the boundary $\delta D$. One can get the solution by computing the expectation of $f$ at the end points of Brownian motion starting at $x$ and ending at the boundary $u = \E_x[f(B_T)]$. On a discrete graph, if Brownian motion is replaced by random walk, the same formula holds too. Stochastic calculus is also useful to interpret quantum mechanics as a diffusion processes [2,1] or as a tool to compute solutions to quantum mechanical problems using Feynman-Kac formulas.

Some features of stochastic process can be described using the language of Markov operators $P$, which are positive and expectation-preserving transformations on $\Lcal^1$. Examples of such operators are Perron-Frobenius operators $X \to X(T)$ for a measure preserving transformation $T$ defining a discrete time evolution or stochastic matrices describing a random walk on a finite graph. Markov operators can be defined by transition probability functions which are measure-valued random variables. The interpretation is that from a given point $\omega$, there are different possibilities to go to. A transition probability measure $\Pcal(\omega,\cdot)$ gives the distribution of the target. The relation with Markov operators is assured by the Chapman-Kolmogorov equation $\Prob^{n+m}=\Prob^n \circ \Prob^m$. Markov processes can be obtained from random transformations, random walks or by stochastic differential equations. In the case of a finite or countable target space $S$, one obtains Markov chains which can be described by probability matrices $P$, which are the simplest Markov operators. For Markov operators, there is an arrow of time: the relative entropy with respect to a background measure is non-increasing. Markov processes often are attracted by fixed points of the Markov operator. Such fixed points are called stationary states. They describe equilibria and often they are measures with maximal entropy. An example is the Markov operator $P$, which assigns to a probability density $f_Y$ the probability density of $f_{\overline{Y+X}}$ where $\overline{Y+X}$ is the random variable $Y+X$ normalized so that it has mean $0$ and variance $1$. For the initial function $f=1$, the function $P^n(f_X)$ is the distribution of $S_n^*$ the normalized sum of $n$ IID random variables $X_i$. This Markov operator has a unique equilibrium point, the standard normal distribution. It has maximal entropy among all distributions on the real line with variance $1$ and mean $0$. The central limit theorem tells that the Markov operator $P$ has the normal distribution as a unique attracting fixed point if one takes the weaker topology of convergence in distribution on $\Lcal^1$. This works in other situations too. For circle-valued random variables for example, the uniform distribution maximizes entropy. It is not surprising therefore, that there is a central limit theorem for circle-valued random variables with the uniform distribution as the limiting distribution.

In the same way as mathematics reaches out into other scientific areas, probability theory has connections with many other branches of mathematics. The last chapter of these notes give some examples. The section on percolation shows how probability theory can help to understand critical phenomena. In solid state physics, one considers operator-valued random variables. The spectrum of random operators are random objects too. One is interested what happens with probability one. Localization is the phenomenon in solid state physics that sufficiently random operators often have pure point spectrum. The section on estimation theory gives a glimpse of what mathematical statistics is about. In statistics one often does not know the probability space itself so that one has to make a statistical model and look at a parameterization of probability spaces. The goal is to give maximum likelihood estimates for the parameters from data and to understand how small the quadratic estimation error can be made. A section on Vlasov dynamics shows how probability theory appears in problems of geometric evolution. Vlasov dynamics is a generalization of the $n$-body problem to the evolution of of probability measures. One can look at the evolution of smooth measures or measures located on surfaces. This deterministic stochastic system produces an evolution of densities which can form singularities without doing harm to the formalism. It also defines the evolution of surfaces. The section on moment problems is part of multivariate statistics. As for random variables, random vectors can be described by their moments. Since moments define the law of the random variable, the question arises how one can see from the moments, whether we have a continuous random variable. The section of random maps is an other part of dynamical systems theory. Randomized versions of diffeomorphisms can be considered idealization of their undisturbed versions. They often can be understood better than their deterministic versions. For example, many random diffeomorphisms have only finitely many ergodic components. In the section in circular random variables, we see that the Mises distribution has extremal entropy among all circle-valued random variables with given circular mean and variance. There is also a central limit theorem on the circle: the sum of IID circular random variables converges in law to the uniform distribution. We then look at a problem in the geometry of numbers: how many lattice points are there in a neighborhood of the graph of one-dimensional Brownian motion? The analysis of this problem needs a law of large numbers for independent random variables $X_k$ with uniform distribution on $[0,1]$: for $0 \leq \delta<1$, and $A_n=[0,1/n^{\delta}]$ one has $\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \frac{1_{A_n}(X_k)}{n^{\delta}} =1$. Probability theory also matters in complexity theory as a section on arithmetic random variables shows. It turns out that random variables like $X_n(k) = k$, $Y_n(k) = k^2+3 \; {\rm mod} \; n$ defined on finite probability spaces become independent in the limit $n \to \infty$. Such considerations matter in complexity theory: arithmetic functions defined on large but finite sets behave very much like random functions. This is reflected by the fact that the inverse of arithmetic functions is in general difficult to compute and belong to the complexity class of NP. Indeed, if one could invert arithmetic functions easily, one could solve problems like factoring integers fast. A short section on Diophantine equations indicates how the distribution of random variables can shed light on the solution of Diophantine equations. Finally, we look at a topic in harmonic analysis which was initiated by Norbert Wiener. It deals with the relation of the characteristic function $\phi_X$ and the continuity properties of the random variable $X$.

Some paradoxes in probability theory

Colloquial language is not always precise enough to tackle problems in probability theory. Paradoxes appear, when definitions allow different interpretations. Ambiguous language can lead to wrong conclusions or contradicting solutions. To illustrate this, we mention a few problems. The following four examples should serve as a motivation to introduce probability theory on a rigorous mathematical footing.

1) Bertrand's paradox (Bertrand 1889)
We throw at random lines onto the unit disc. What is the probability that the line intersects the disc with a length $\geq \sqrt{3}$, the length of the inscribed equilateral triangle?

First answer: take an arbitrary point $\Prob$ on the boundary of the disc. The set of all lines through that point are parameterized by an angle $\phi$. In order that the chord is longer than $\sqrt{3}$, the line has to lie within a sector of $60^{\circ}$ within a range of $180^{\circ}$. The probability is $1/3$.

Second answer: take all lines perpendicular to a fixed diameter. The chord is longer than $\sqrt{3}$ if the point of intersection lies on the middle half of the diameter. The probability is $1/2$.

Third answer: if the midpoints of the chords lie in a disc of radius $1/2$, the chord is longer than $\sqrt{3}$. Because the disc has a radius which is half the radius of the unit disc, the probability is $1/4$.

Like most paradoxes in mathematics, a part of the question in Bertrand's problem is not well defined. Here it is the term "random line". The solution of the paradox lies in the fact that the three answers depend on the chosen probability distribution. There are several "natural" distributions. The actual answer depends on how the experiment is performed.

2) Petersburg paradox (D.Bernoulli, 1738)
In the Petersburg casino, you pay an entrance fee $c$ and you get the prize $2^T$, where $T$ is the number of times, the casino flips a coin until "head" appears. For example, if the sequence of coin experiments would give "tail, tail, tail, head", you would win $2^3-c = 8-c$, the win minus the entrance fee. Fair would be an entrance fee which is equal to the expectation of the win, which is

\begin{displaymath}\sum_{k=1}^{\infty} 2^k \Prob[T=k] = \sum_{k=1}^{\infty} 1 = \infty \; . \end{displaymath}

The paradox is that nobody would agree to pay even an entrance fee $c=10$.

The problem with this casino is that it is not quite clear, what is "fair". For example, the situation $T=20$ is so improbable that it never occurs in the life-time of a person. Therefore, for any practical reason, one has not to worry about large values of $T$. This, as well as the finiteness of money resources is the reason, why casinos do not have to worry about the following bullet proof martingale strategy in roulette: bet $c$ dollars on red. If you win, stop, if you lose, bet $2c$ dollars on red. If you win, stop. If you lose, bet $4c$ dollars on red. Keep doubling the bet. Eventually after $n$ steps, red will occur and you will win $2^n c - (c+ 2c + \cdots + 2^{n-1}c)=c$ dollars. This example motivates the concept of martingales. Theorem ([*]) or proposition ([*]) will shed some light on this. Back to the Petersburg paradox. How does one resolve it? What would be a reasonable entrance fee in "real life"? Bernoulli proposed to replace the expectation $\E[G]$ of the profit $G=2^T$ with the expectation $(\E[\sqrt{G}])^2$, where $u(x)=\sqrt{x}$ is called a utility function. This would lead to a fair entrance

\begin{displaymath}(\E[\sqrt{G}])^2=(\sum_{k=1}^{\infty} 2^{k/2} 2^{-k})^2
= \frac{1}{(\sqrt{2}-1)^2} \sim 5.828... \; . \end{displaymath}

It is not so clear if that is a way out of the paradox because for any proposed utility function $u(k)$, one can modify the casino rule so that the paradox reappears: pay $(2^k)^2$ if the utility function $u(k)=\sqrt{k}$ or pay $e^{2^k}$ dollars, if the utility function is $u(k) = \log(k)$. Such reasoning plays a role in economics and social sciences.

The picture to the right shows the average profit development during a typical tournament of 4000 Petersburg games. After these 4000 games, the player would have lost about 10 thousand dollars, when paying a 10 dollar entrance fee each game. The player would have to play a very, very long time to catch up. Mathematically, the player will do so and have a profit in the long run, but it is unlikely that it will happen in his or her life time.

3) The three door problem (1991) Suppose you're on a game show and you are given a choice of three doors. Behind one door is a car and behind the others are goats. You pick a door-say No. 1 - and the host, who knows what's behind the doors, opens another door-say, No. 3-which has a goat. (In all games, he opens a door to reveal a goat). He then says to you, "Do you want to pick door No. 2?" (In all games he always offers an option to switch). Is it to your advantage to switch your choice?

The problem is also called "Monty Hall problem" and was discussed by Marilyn vos Savant in a "Parade" column in 1991 and provoked a big controversy. (See [3] for pointers and similar examples.) The problem is that intuitive argumentation can easily lead to the conclusion that it does not matter whether to change the door or not. Switching the door doubles the chances to win:

No switching: you choose a door and win with probability 1/3. The opening of the host does not affect any more your choice.

Switching: when choosing the door with the car, you loose since you switch. If you choose a door with a goat. The host opens the other door with the goat and you win. There are two such cases, where you win. The probability to win is 2/3.

4) The Banach-Tarski paradox (1924)

It is possible to cut the standard unit ball $\Omega = \{ x \in \RR^3 \; \vert \; \vert x\vert \leq 1 \; \}$ into 5 disjoint pieces $\Omega = Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \cup Y_5$ and rotate and translate the pieces with transformations $T_i$ so that $T_1(Y_1) \cup T_2(Y_2) = \Omega$ and $T_3(Y_3) \cup T_4(Y_4) \cup T_5(Y_5) = \Omega'$ is a second unit ball $\Omega' = \{ x \in \RR^3 \; \vert \; \vert x-(3,0,0)\vert \leq 1 \}$ and all the transformed sets again don't intersect.
While this example of Banach-Tarski is spectacular, the existence of bounded subsets $A$ of the circle for which one can not assign a translational invariant probability $\Prob[A]$ can already be achieved in one dimension. The Italian mathematician Giuseppe Vitali gave in 1905 the following example: define an equivalence relation on the circle $\TT = [0,2\pi)$ by saying that two angles are equivalent $x \sim y$ if $(x-y)/\pi$ is a rational angle. Let $A$ be a subset in the circle which contains exactly one number from each equivalence class. The axiom of choice assures the existence of $A$. If $x_1,x_2, \dots $ is a enumeration of the set of rational angles in the circle, then the sets $A_i = A + x_i$ are pairwise disjoint and satisfy $\bigcup_{i=1}^{\infty} A_i = \TT$. If we could assign a translational invariant probability $\Prob[A_i]$ to $A$, then the basic rules of probability would give

\begin{displaymath}1 = \Prob[\TT] = \Prob[ \bigcup_{i=1}^{\infty} A_i ] = \sum_{i=1}^{\infty} \Prob[A_i] = \sum_{i=1}^{\infty} p \; . \end{displaymath}

But there is no real number $p=\Prob[A]=\Prob[A_i]$ which makes this possible.

Both the Banach-Tarski as well as Vitalis result shows that one can not hope to define a probability space on the algebra $\Acal$ of all subsets of the unit ball or the unit circle such that the probability measure is translational and rotational invariant. The natural concepts of "length" or "volume", which are rotational and translational invariant only makes sense for a smaller algebra. This will lead to the notion of $\sigma$-algebra. In the context of topological spaces like Euclidean spaces, it leads to Borel $\sigma$-algebras, algebras of sets generated by the compact sets of the topological space. This language will be developed in the next chapter.

Some applications of probability theory

Probability theory is a central topic in mathematics. There are close relations and intersections with other fields like computer science, ergodic theory and dynamical systems, cryptology, game theory, analysis, partial differential equation, mathematical physics, economical sciences, statistical mechanics and even number theory. As a motivation, we give some problems and topics which can be treated with probabilistic methods.

1) Random walks: (statistical mechanics, gambling, stock markets, quantum field theory).

Assume you walk through a lattice. At each vertex, you choose a direction at random. What is the probability that you return back to your starting point? Polya's theorem ([*]) says that in two dimensions, a random walker almost certainly returns to the origin arbitrarily often, while in three dimensions, the walker with probability $1$ only returns a finite number of times and then escapes for ever.

A random walk in one dimensions displayed as a graph $(t,B_t)$

2) Percolation problems (model of a porous medium, statistical mechanics, critical phenomena).

Each bond of a rectangular lattice in the plane is connected with probability $p$ and disconnected with probability $1-p$. Two lattice points $x,y$ in the lattice are in the same cluster, if there is a path from $x$ to $y$. One says that "percolation occurs" if there is a positive probability that an infinite cluster appears. One problem is to find the critical probability $p_c$, the infimum of all $p$, for which percolation occurs. The problem can be extended to situations, where the switch probabilities are not independent to each other. Some random variables like the size of the largest cluster are of interest near the critical probability $p_c$.
A variant of bond percolation is site percolation where the nodes of the lattice are switched on with probability $p$.

Generalized percolation problems are obtained, when the independence of the individual nodes is relaxed. A class of such dependent percolation problems can be obtained by choosing two irrational numbers $\alpha,\beta$ like $\alpha=\sqrt{2}-1$ and $\beta=\sqrt{3}-1$ and switching the node $(n,m)$ on if $(n \alpha + m \beta) \; {\rm mod} \; 1 \in [0,p)$. The probability of switching a node on is again $p$, but the random variables

\begin{displaymath}X_{n,m} = 1_{(n \alpha + m \beta) \; {\rm mod} \; 1 \in [0,p)} \end{displaymath}

are no more independent.

Even more general percolation problems are obtained, if also the distribution of the random variables $X_{n,m}$ can depend on the position $(n,m)$.

3) Random Schrödinger operators. (quantum mechanics, functional analysis, disordered systems, solid state physics)

Consider the linear map $Lu(n)= \sum_{\vert m-n\vert=1} u(n) + V(n) u(n)$ on the space of sequences $u=(\dots,u_{-2},u_{-1},u_0,u_1,u_2,\dots)$. We assume that $V(n)$ takes random values in $\{0,1\}$. The function $V$ is called the potential. The problem is to determine the spectrum or spectral type of the infinite matrix $L$ on the Hilbert space $l^2$ of all sequences $u$ with finite $\vert\vert u\vert\vert _2^2 = \sum_{n=-\infty}^{\infty} u_n^2$. The operator $L$ is the Hamiltonian of an electron in a one-dimensional disordered crystal. The spectral properties of $L$ have a relation with the conductivity properties of the crystal. Of special interest is the situation, where the values $V(n)$ are all independent random variables. It turns out that if $V(n)$ are IID random variables with a continuous distribution, there are many eigenvalues for the infinite dimensional matrix $L$ - at least with probability $1$. This phenomenon is called localization.

A wave $\psi(t) = e^{i L t} \psi(0)$ evolving in a random potential at $t=0$. Shown are both the potential $V_n$ and the wave $\psi(0)$ A wave $\psi(t) = e^{i L t} \psi(0)$ evolving in a random potential at $t=1$. Shown are both the potential $V_n$ and the wave $\psi(1)$ A wave $\psi(t) = e^{i L t} \psi(0)$ evolving in a random potential at $t=2$. Shown are both the potential $V_n$ and the wave $\psi(2)$

More general operators are obtained by allowing $V(n)$ to be random variables with the same distribution but where one does not persist on independence any more. A well studied example is the almost Mathieu operator, where $V(n) = \lambda \cos(\theta + n \alpha)$ and for which $\alpha/(2\pi)$ is irrational.

4) Classical dynamical systems (celestial mechanics, fluid dynamics, mechanics, population models)

The study of deterministic dynamical systems like the logistic map $x \mapsto 4 x (1-x)$ on the interval $[0,1]$ or the three body problem in celestial mechanics has shown that such systems or subsets of it can behave like random systems. Many effects can be described by ergodic theory, which can be seen as a brother of probability theory. Many results in probability theory generalize to the more general setup of ergodic theory. An example is Birkhoff's ergodic theorem which generalizes the law of large numbers.

Iterating the logistic map


on $[0,1]$ produces independent random variables. The invariant measure $\Prob$ is continuous. The simple mechanical system of a double pendulum exhibits complicated dynamics. The differential equation defines a measure preserving flow $T_t$ on a probability A short time evolution of the Newtonian three body problem. There are energies and subsets of the energy surface which are invariant and on which there is an invariant probability measure.

Given a dynamical system given by a map $T$ or a flow $T_t$ on a subset $\Omega$ of some Euclidean space, one obtains for every invariant probability measure $\Prob$ a probability space $(\Omega,\Acal,\Prob)$. An observed quantity like a coordinate of an individual particle is a random variable $X$ and defines a stochastic process $X_n(\omega) = X(T^n \omega)$. For many dynamical systems including also some 3 body problems, there are invariant measures and observables $X$ for which $X_n$ are IID random variables. Probability theory is therefore intrinsically relevant also in classical dynamical systems.

5) Cryptology. (computer science, coding theory, data encryption)

Coding theory deals with the mathematics of encrypting codes or deals with the design of error correcting codes. Both aspects of coding theory have important applications. A good code can repair loss of information due to bad channels and hide the information in an encrypted way. While many aspects of coding theory are based in discrete mathematics, number theory, algebra and algebraic geometry, there are probabilistic and combinatorial aspects to the problem. We illustrate this with the example of a public key encryption algorithm whose security is based on the fact that it is hard to factor a large integer $N=pq$ into its prime factors $p,q$ but easy to verify that $p,q$ are factors, if one knows them. The number $N$ can be public but only the person, who knows the factors $p,q$ can read the message. Assume, we want to crack the code and find the factors $p$ and $q$.

The simplest method is to try to find the factors by trial and error but this is impractical already if $N$ has 50 digits. We would have to search through $10^{25}$ numbers to find the factor $p$. This corresponds to probe 100 million times every second over a time span of 15 billion years. There are better methods known and we want to illustrate one of them now: assume we want to find the factors of $N=11111111111111111111111111111111111111111111111$. The method goes as follows: start with an integer $a$ and iterate the quadratic map $T(x)=x^2+c \; {\rm mod} \; N$ on $\{0,1.,,,.N-1 \;\}$. If we assume the numbers $x_0=a,x_1=T(a),x_2=T(T(a)) \dots$ to be random, how many such numbers do we have to generate, until two of them are the same modulo one of the prime factors $p$? The answer is surprisingly small and based on the birthday paradox: the probability that in a group of 23 students, two of them have the same birthday is larger than $1/2$: the probability of the event that we have no birthday match is $1 (364/365) (363/365) \cdots (343/365) = 0.492703\dots $, so that the probability of a birthday match is $1-0.492703=0.507292$. This is larger than $1/2$. If we apply this thinking to the sequence of numbers $x_i$ generated by the pseudo random number generator $T$, then we expect to have a chance of $1/2$ for finding a match modulo $p$ in $\sqrt{p}$ iterations. Because $p \leq \sqrt{n}$, we have to try $N^{1/4}$ numbers, to get a factor: if $x_n$ and $x_m$ are the same modulo $p$, then ${\rm gcd}(x_n-x_m,N)$ produces the factor $p$ of $N$. In the above example of the 46 digit number $N$, there is a prime factor $p=35121409$. The Pollard algorithm finds this factor with probability $1/2$ in $\sqrt{p}=5926$ steps. This is an estimate only which gives the order of magnitude. With the above $N$, if we start with $a=17$ and take $a=3$, then we have a match $x_{27720} = x_{13860}$. It can be found very fast.

This probabilistic argument would give a rigorous probabilistic estimate if we would pick truly random numbers. The algorithm of course generates such numbers in a deterministic way and they are not truly random. The generator is called a pseudo random number generator. It produces numbers which are random in the sense that many statistical tests can not distinguish them from true random numbers. Actually, many random number generators built into computer operating systems and programming languages are pseudo random number generators.

Probabilistic thinking is often involved in designing, investigating and attacking data encryption codes or random number generators.

6) Numerical methods. (integration, Monte Carlo experiments, algorithms)

In applied situations, it is often very difficult to find integrals directly. This happens for example in statistical mechanics or quantum electrodynamics, where one wants to find integrals in spaces with a large number of dimensions. One can nevertheless compute numerical values using Monte Carlo Methods with a manageable amount of effort. Limit theorems assure that these numerical values are reasonable. Let us illustrate this with a very simple but famous example, the Buffon needle problem.

A stick of length 2 is thrown onto the plane filled with parallel lines, all of which are distance $d=2$ apart. If the center of the stick falls within distance $y$ of a line, then the interval of angles leading to an intersection with a grid line has length $2 \arccos(y)$ among a possible range of angles $[0,\pi]$. The probability of hitting a line is therefore $\int_0^1 2 \arccos(y)/\pi = 2/\pi$. This leads to a Monte Carlo method to compute $\pi$. Just throw randomly $n$ sticks onto the plane and count the number $k$ of times, it hits a line. The number $2n/k$ is an approximation of $\pi$. This is of course not an effective way to compute $\pi$ but it illustrates the principle.

The Buffon needle problem is a Monte Carlo method to compute $\pi$. By counting the number of hits in a sequence of experiments, one can get random approximations of $\pi$. The law of large numbers assures that the approximations will converge to the expected limit. All Monte Carlo computations are theoretically based on limit theorems.