|
Preface 7
1Relative Entropy 9
1.1A Brief Review of Probability 9
1.2Shannon Entropy Is Plausible for Discrete Random Variable 9
1.3Shannon Entropy Fails for Continuous Random Variable 10
1.4Relative Entropy Is the Unique Solution to the Axiom 11
2Master Equation and Detailed Balance 13
2.1Conventions in This Chapter 13
2.2Master Equation Describes the Evolution of Markov Process 13
2.3Transition Rate Determines Transition Density 14
2.4Detailed Balance Provides Stationary Distribution 17
2.5Detailed Balance with Connectivity Monotonically Reduces Relative Entropy 18
2.6Monte-Carlo Simulation and Guarantee of Relaxation 19
2.7Example: Metropolis-Hastings Algorithm 22
2.8* Existence of Stationary Density Function 23
3Kramers-Moyal Expansion and Langevin Process 25
3.1Conventions in This Chapter 25
3.2Cut-off in the Moments of Transition Rate Is Essential for Spatial Smoothness 25
3.3Kramers–Moyal Expansion Formulates Transition Rate by Its Moments 29
3.4Randomness Is Absent in the First Moment of Transition Rate 30
3.5Randomness Appears in the Second Moment of Transition Rate 31
3.6Langevin Process Is a Markovian Process with \(N_{\operatorname{cut}} = 2\) 32
3.7Stationary Solution of Langevin Process Has Source-Free Degree of Freedom 33
3.8Detailed Balance of Langevin Process Lacks Source-Free Degree of Freedom 33
4Least-Action Principle 35
4.1Conventions in This Chapter 35
4.2A Brief Review of Least-Action Principle in Classical Mechanics 35
4.3Preliminary: Grassmann Variable and Berezin Integral 36
4.4Langevin Process Can Be Formulated as Path Integral 38
4.5Least-Action Principle of Distribution Has No Redundancy 41
4.6Data Fitting Is Equivalent to Least-Action Principle of Distribution 42
4.7How Far Will Information Propagate in Langevin Process? 44
4.7.1The Generic Action 44
4.7.2Renormalization Group Equations 44
4.7.3Fixed Point and Scale-Invariance 46
4.7.4* Appendix: Perturbative Method 46
4.8Example: Linear System Cannot Be Scale-Invariant 48
This is a little book about stochastic process. We start with the axiomatic system of information (chapter 1). Then using information, we introduce the continuous-time Markovian process, and show how it relax to equilibrium (chapter 2). We then move on to add spatial smoothness to Markovian process, which results in many interesting results including Langevin process (chapter 3). In the end, we generalize least-action principle to distribution and discuss the propagation of information in Langevin process (chapter 4).
The mathematical techniques employed here will not go beyond the basic calculus (Taylor expansion, improper integral, and integration by parts) and linear algebra (matrix manipulations, orthogonal diagonalization, and determinant). Knowing the basic probability theory (normal distribution and Gaussian integral) and Fourier transformation (its definition) will be helpful. We try to make it self-contained, and introduce new concept or technique only when it is essential. Statements like “obviously...” and “apparently...” are avoided; and we try to display all the steps of calculation without omitting any of them.
For each section, the title is a sentence that briefly summarizes the whole section. We use bold font for definition and italic font for emphasis.
At last, this book is written by TeXmacs, on the GPLv3 license.
1. Footnotes in the HTML export of TeXmacs are problematic. The content of footnote is incorrectly placed in front of the footnote label in the main text.
In this chapter, we discuss the axiomatization of information. By figuring out how Shannon entropy fails for continuous random variable, we build a proper axiomatic system on which the expression of information is found to be unique.
Those that are not deterministic are denoted by capital letters. But, a capital letter may also denote something that is determined. For example, a random variable has to be denoted by capital letter, like \(X\), while we can also use \(F\) to denote something determined, such as a functional.
The set of all possible values of a random variable is called the alphabet .
1.1. Some textures call it sample space. But “space” usually hints for extra structures such as vector space or topological space. So, we use “alphabet” instead (following David Mackay, see his book Information Theory, Inference, and Learning Algorithms, section 2.1. Link to free PDF: https://www.inference.org.uk/itprnn/book.pdf).
1.2. In many textures, the density or mass function is non-negative (rather than being positive). Being positive is beneficial because, for example, we will discuss the logarithm of density or mass function, for which being zero is invalid. For any value on which density or mass function vanishes, we throw it out of \(\mathcal{X}\), which in turn guarantees the positivity.
The density of a value \(x\) is usually denoted by \(p (x)\), which, as a function, is called density function. Notice that \(p (x)\) is deterministic, thus not capital. The same for mass, where \(p (x)\) is called mass function. Thus, we can say the expectation of a function \(f\) on distribution \(P\), denoted by \(\mathbb{E}_P [f]\) or \(\mathbb{E}_{x \sim P} [f (x)]\). If the alphabet \(\mathcal{X}\) is of continuum, then it is \(\int_{\mathcal{X}} \mathrm{d} x p (x) f (x)\), otherwise \(\sum_{x \in \mathcal{X}} p (x) f (x)\).
If there exists random variables \(Y\) and \(Z\), with alphabets \(\mathcal{Y}\) and \(\mathcal{Z}\) respectively, such that \(X = Y \oplus Z\) (for example, let \(X\) two-dimensional, \(Y\) and \(Z\) are the components), then we have marginal distributions, denoted by \(P_Y\) and \(P_Z\), where \(p (y) := \int_{\mathcal{Z}} \mathrm{d} z p (y, z)\) and \(p (z) := \int_{\mathcal{Y}} \mathrm{d} y p (y, z)\) if \(X\) is of continuum, and the same for mass function. Notice that we have omitted the subscript \(Y\) in \(p_Y\) (and the same for \(p_Z\)) since the \(y\) in \(p (y)\) has clearly indicated this. We marginalize \(Z\) so as to get \(P_Y\).
We further have the conditional distribution of \(Y\) given \(Z\), denoted by \(P_{Y|Z}\), where \(p (y|z) := p (y, z) / p (z)\) (we omit the subscript of \(p_{Y|Z}\) too). Suppose that we samples lots of \((Y, Z)\) values from \(P\), and then filters the pairs with \(Z = z\). The frequency of \(Y = y\) found in the filtered samples is approximated by \(p (y|z)\).
The Shannon entropy is well-defined for discrete random variable. Let \(X\) a discrete random variables with alphabet \(\{ 1, \ldots, n \}\) with \(p_i\) the mass of \(X = i\). The Shannon entropy is thus a function of \((p_1, \ldots, p_n)\) defined by
where \(k\) is a positive constant. Interestingly, this expression is unique given some plausible axioms, which can be qualitatively expressed as
\(H\) is a continuous function of \((p_1, \ldots, p_n)\);
larger alphabet has higher uncertainty (information or entropy); and
if we have known some information, and based on this knowledge we know further, the total information shall be the sum of all that we know.
Here, we use uncertainty, surprise, information, and entropy as interchangeable.
The third axiom is also called the additivity of information. For two independent variables \(X\) and \(Y\) with distributions \(P\) and \(Q\) respectively, the third axiom indicates that the total information of \(H (P Q)\) is \(H (P) + H (Q)\). But, the third axiom indicates more than this. It also defines a “conditional entropy” for dealing with the situation where \(X\) and \(Y\) are dependent. Jaynes gives a detailed declaration to these axioms.
1.3. See the appendix A of Information Theory and Statistical Mechanics by E. T. Jaynes, 1957. A free PDF version can be found on Internet: https://bayes.wustl.edu/etj/articles/theory.1.pdf.
1.4. On measures of information and entropy by Alfréd Rényi, 1961. A free PDF version can be found on Internet: http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf.
As we will see, when extending the alphabet to continuum, this problem naturally ceases.
The Shannon entropy, however, cannot be directly generalized to continuous random variable. Usually, the entropy for continuous random variable \(X\) with alphabet \(\mathcal{X}\) and distribution \(P\) is given as a functional of the density function \(p (x)\),
which, however, is not well-defined. The first issue is that the \(p\) has dimension, indicated by \(\int_{\mathcal{X}} \mathrm{d} x p (x) = 1\). This means we put a dimensional quantity into logarithm which is invalid. The second issue is that the \(H\) is not invariant under coordinate transformation \(X \rightarrow Y := \varphi (X)\) where \(\varphi\) is a diffeomorphism. But as a “physical” quantity, \(H\) should be invariant under “non-physical” transformations.
To eliminate the two issues, we shall extends the axiomatic description of entropy. The key to this extension is introducing another distribution, \(Q\), which has the same alphabet as \(P\); and instead considering the uncertainty (surprise) caused by \(P\) when prior knowledge has been given by \(Q\). As we will see, this will solve the two issues altogether. Explicitly, we extend and quantify the axioms in section 1.2 as follow.
Axiom
\(H\) is a smooth and local functional of \(p\) and \(q\);
\(H (P, Q) > 0\) with \(P \neq Q\) and \(H (P, P) = 0\); and
If \(X = Y \oplus Z\), and if \(Y\) and \(Z\) independent, then \(H (P, Q) = H (P_Y, Q_Y) + H (P_Z, Q_Z)\), where \(P_Y, \ldots, Q_Z\) are marginal distributions.
The first axiom employs the locality of \(H\), which is thought as natural since \(H\) has been a functional. The second axiom indicates that \(H\) vanishes only when there is no surprise caused by \(P\) (thus \(P = Q\)). It is a little like the second axiom for Shannon entropy. The third axiom, like the third in Shannon entropy, claims the additivity of surprise: if \(X\) has two independent parts, the total surprise shall be the sum of each.
We are to derive the explicit expression of \(H\) based on the three axioms. The result is found to be unique.
Based on the first axiom, there is a function \(h : (0, + \infty) \times (0, + \infty) \rightarrow [0, + \infty)\) such that \(H\) can be expressed as
We are to determine the explicit form of \(h\). Thus, from second axiom,
holds for all distribution \(P\). Since \(p\) is positive and \(h\) is non-negative, then we have \(h (p (x), p (x)) = 0\) for all \(x \in \mathcal{X}\). The distribution \(P\) is arbitrary, thus we find \(h (x, x) = 0\) for any \(x \in (0, + \infty)\).
Now come to the third axiom. Since \(Y\) and \(Z\) are independent, \(H (P, Q)\) can be written as \(\int_{\mathcal{X}} \mathrm{d} y \mathrm{d} z p_Y (y) p_Z (z) h (p_Y (y) p_Z (z), q_Y (y) q_Z (z))\). Thus, the third axiom implies
Following the previous argument, we find \(h (a x, b y) = h (a, b) + h (x, y)\) for any \(a, b, x, y \in (0, + \infty)\). Taking derivative on \(a\) and \(b\) results in \(\partial_1 h (a x, b y) x = \partial_1 h (a, b)\) and \(\partial_2 h (a x, b y) y = \partial_2 h (a, b)\). Since \(\partial_1 h (a, a) + \partial_2 h (a, a) = (\mathrm{d} / \mathrm{d} a) h (a, a) = 0\), we get \(\partial_1 h (a x, a y) x + \partial_2 h (a x, a y) y = 0\). Letting \(a = 1\), it becomes a first order partial differential equation \(\partial_1 h (x, y) x + \partial_2 h (x, y) y = 0\), which has a unique solution that \(h (x \mathrm{e}^t, y \mathrm{e}^t)\) is constant for all \(t\). Choosing \(t = - \ln y\), we find \(h (x, y) = h (x / y, 1)\). Now \(h\) reduces from two variables to one. So, plugging this result back to \(h (a x, b y) = h (a, b) + h (x, y)\), we have \(h (x y, 1) = h (x, 1) + h (y, 1)\). It looks like a logarithm. We are to show that it is indeed so. By taking derivative on \(x\) and then letting \(y = 1\), we get an first order ordinary differential equation \(\partial_1 h (x, 1) = \partial_1 h (1, 1) / x\), which has a unique solution that \(h (x, 1) = \partial_1 h (1, 1) \ln (x) + C\), where \(C\) is a constant. Combined with \(h (x, y) = h (x / y, 1)\), we finally arrive at \(h (x, y) = \partial_1 h (1, 1) \ln (x / y) + C\). To determine the \(\partial_1 h (1, 1)\) and \(C\), we use the second axiom \(\partial_1 h (1, 1) \int \mathrm{d} x p (x) \ln (p (x) / q (x)) + C > 0\) when \(p \neq q\) and \(\partial_1 h (1, 1) \int \mathrm{d} x p (x) \ln (p (x) / p (x)) + C = 0\). The second equation results in \(C = 0\). By Jensen's inequality, the integral \(\int \mathrm{d} x p (x) \ln (p (x) / q (x))\) is non-negative, thus from the first equation, \(\partial_1 h (1, 1) > 0\). Up to now, all things about \(h\) have been settled. We conclude that there is a unique expression that satisfies all the three axioms, which is
where \(k > 0\). This was first derived by Solomon Kullback and Richard Leibler in 1951, so it is called Kullback–Leibler divergence (KL-divergence for short), denoted by \(D_{\operatorname{KL}} (P\|Q)\). Since it characterizes the relative surprise, it is also called relative entropy (entropy for surprise).
The locality is essential for relative entropy to arise. For example, Renyi divergence, defined by
also satisfies the three axioms when locality is absent.
In the end, we examine the two issues appeared in Shannon entropy (section 1.3). In \(H (P, Q)\), the logarithm is \(\ln (p / q)\) which is dimensionless. And a coordinate transformation \(X \rightarrow Y := \varphi (X)\) makes \(\int \mathrm{d} x p (x) = \int \mathrm{d} y | \det (\partial \varphi^{- 1}) (y) | p (\varphi^{- 1} (y)) =: \int \mathrm{d} y \tilde{p} (y)\), thus \(p \rightarrow \tilde{p} := | \det (\partial \varphi^{- 1}) | p \circ \varphi^{- 1}\). The same for \(q \rightarrow \tilde{q} := | \det (\partial \varphi^{- 1}) | q \circ \varphi^{- 1}\). The common factor \(| \det (\partial \varphi^{- 1}) |\) will be eliminated in \(\ln (p / q)\), leaving \(H (P, Q)\) invariant (since \(\int \mathrm{d} x p \ln (p / q) \rightarrow \int \mathrm{d} y \tilde{p} \ln (\tilde{p} / \tilde{q})\), which equals to \(\int \mathrm{d} x p \ln (p / q)\)). So, the two issues of Shannon entropy cease in relative entropy.
In this chapter, we discuss continuous-time Markovian process. Using the result obtained in chapter 1, we declare how a Markovian process relaxes to its stationary equilibrium.
Let \(X\) a multi-dimensional random variables, being, discrete, continuous, or partially discrete and partially continuous, with alphabet \(\mathcal{X}\) and distribution \(P\). Even though the discussion in this section applies to both discrete and continuous random variables, we use the notation of the continuous. The reason is that converting from discrete to continuous may cause problems (section 1.3), while the inverse will be safe and direct as long as any smooth structure of \(X\) is not employed throughout the discussion.
Without losing generality, consider a pile of sand on a desk. The desk has been fenced in so that the sands will not flow out of the desk. Imagine that these sands are magic, having free will to move on the desk. The distribution of sands changes with time. In the language of probability, the density of sands at position \(x\) of the desk is described by a time-dependent density function \(p (x, t)\), where the total mass of the sands on the desk is normalized to one, and the position on the desk characterizes the alphabet \(\mathcal{X}\).
Let \(q_{t \rightarrow t'} (y|x)\) denote the portion of density at position \(x\) that transits to position \(y\), from \(t\) to \(t'\). Then, the transited density will be \(q_{t \rightarrow t'} (y|x) p (x, t)\). There may be some portion of density at position \(x\) that does not transit during \(t \rightarrow t'\) (the lazy sands). In this case we imagine the sands transit from position \(x\) to \(x\) (stay on \(x\)), which is \(q_{t \rightarrow t'} (x|x)\). Now, every sand at position \(x\) has transited during \(t \rightarrow t'\), and the total portion shall be \(100\%\), which means
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} y q_{t \rightarrow t'} (y|x) = 1. \) | (2.1) |
As portion, \(q_{t \rightarrow t'}\) cannot be negative, thus \(q_{t \rightarrow t'} (x|y) \geqslant 0\) for each \(x\) and \(y\) in \(\mathcal{X}\). We call \(q_{t \rightarrow t'}\) the transition density. Not like the density function of distribution, transition density can be zero in a subset of \(\mathcal{X}\).
The transition makes a difference on density at position \(x\). The difference is caused by the density transited from \(x\), which is \(\int_{\mathcal{X}} \mathrm{d} y q_{t \rightarrow t'} (y|x) p (x, t)\), and that transited to \(x\), which is \(\)\(\int_{\mathcal{X}} \mathrm{d} y q_{t \rightarrow t'} (x|y) p (y, t)\). Thus, we have
By inserting equation (2.1), we find
\(\displaystyle p (x, t') = \int_{\mathcal{X}} \mathrm{d} y q_{t \rightarrow t'} (x|y) p (y, t), \) | (2.2) |
which is called the discrete time master equation. When \(t' = t\), we have \(p (x, t) = \int_{\mathcal{X}} \mathrm{d} y q_{t \rightarrow t} (x|y) p (y, t)\), indicating that
where \(\delta (x - y)\) indicates Kronecker's delta \(\delta_{x, y}\) when \(\mathcal{X}\) is discrete, or Dirac's delta function when \(\mathcal{X}\) is continuous. Even though we call it Dirac's delta function, it is in fact a generalized function. A generalized function has meaning only when it is integrated together with other (not generalized) functions. For example, Dirac's delta function has \(\int_{\mathcal{X}} \mathrm{d} x \delta (x - y) f (x) = f (y)\) for any usual function \(f\).
In addition, if the change of the distribution of sands is smooth, that is, there is not a sand lump that jumping from one place to another in an arbitrarily short period of time, then \(q_{t \rightarrow t'}\) is smooth on \(t'\). Taking derivative on \(t'\) and then setting \(t'\) to \(t\), we have
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = \int_{\mathcal{X}} \mathrm{d} y r_t (x, y) p (y, t), \) | (2.3) |
where \(r_t (x, y) := \lim_{t' \rightarrow t} (\partial q_{t \rightarrow t'} / \partial t') (x|y)\), called transition rate. It is called the continuous time master equation, or simply master equation. The word “master” indicates that the transition rate has completely determined (mastered) the evolutionary behavior of distribution.
Even though all these concepts are born of the pile of sand, they are applicable to any stochastic process where the distribution \(P (t)\) is time-dependent (but the alphabet \(\mathcal{X}\) is time-invariant), no matter whether the random variable is discrete or continuous.
A stochastic process is Markovian if the transition density \(q_{t \rightarrow t'}\) depends only on the time interval \(\Delta t := t' - t\), thus \(q_{\Delta t}\). In this case, transition rate \(r\) is time-independent, so the master equation becomes
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = \int_{\mathcal{X}} \mathrm{d} y r (x, y) p (y, t) . \) | (2.4) |
Since we only deal with Markovian stochastic process throughout this note, when referring to master equation, we mean equation 2.4. And to discrete time master equation, equation 2.5:
\(\displaystyle p (x, t + \Delta t) = \int_{\mathcal{X}} \mathrm{d} y q_{\Delta t} (x, y) p (y, t) . \) | (2.5) |
Before finishing this section, we discuss the demanded conditions for transition rate. The normalization of transition density 2.1 implies that \(\int_{\mathcal{X}} \mathrm{d} x r (x, y) = 0\). This can be seen by Taylor expanding \(q_{\Delta t}\) by \(\Delta t\), as \(q_{\Delta t} (x|y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\), where we have inserted \(q_0 (x|y) = \delta (x - y)\) and the definition of \(r\). Also from this Taylor expansion, we see that the non-negativity of \(q_{\Delta t}\) implies \(r (x, y) \geqslant 0\) when \(x \neq y\). Since \(p\) is a density function of distribution, and density function is defined to be positive (see section 1.1), the equation 2.2 must conserve this positivity. We are to show that this is guaranteed by the master equation itself, without any extra condition demanded for the transition rate. It is convenient to use discrete notations, thus replace \(x \rightarrow i\), \(y \rightarrow j\), and \(\int \rightarrow \sum\). The master equation turns to be \((\mathrm{d} p_i / \mathrm{d} t) (t) = \sum_j r_{i j} p_j (t)\). Notice that it becomes an ordinary differential equation. Recall that \(r_{i j} \geqslant 0\) when \(i \neq j\), and thus \(r_{i i} \leqslant 0\) (since \(\sum_j r_{j i} = 0\)). We separate the right hand side to \(r_{i i} p_i (t) + \sum_{j : j \neq i} r_{i j} p_j (t)\), and the worst situation is that \(r_{i j} = 0\) for each \(j \neq i\) and \(r_{i i} < 0\). In this case, the master equation reduces to \((\mathrm{d} p_i / \mathrm{d} t) (t) = r_{i i} p_i (t)\), which has the solution \(p_i (t) = p_i (0) \exp (r_{i i} t)\). It implies that \(p_i (t) > 0\) as long as \(p_i (0) > 0\), indicating that master equation conserves the positivity of density function. As a summary, we demand transition rate \(r\) to be \(r (x, y) \geqslant 0\) when \(x \neq y\) and \(\int_{\mathcal{X}} \mathrm{d} x r (x, y) = 0\).
We wonder, given a transition rate, can we obtain the corresponding transition density? Generally, we cannot get the global (finite) from the local (infinitesimal). For example, we cannot determine a function only by its first derivative at the origin. But, master equation has a group-like structure, by which the local accumulates to be global. We are to show how this happens.
We can use the master equation 2.4 to calculate \(\partial^n p / \partial t^n\) for any \(n\). For \(n = 2\), by inserting master equation 2.4 (to the blue term), we have
We then insert master equation 2.4 again (to the green term), and find
Following the same steps, it can be generalized to higher order derivatives, as
Notice the pattern: a sequence of \(r\) and a rightmost \(p (x, t)\). The reason for this pattern to arise is that \(q_{\Delta t}\), thus \(r\), is independent of \(t\): a Markovian property.
On the other hand, Taylor expand the both sides of equation 2.5 by \(\Delta t\) gives, at \((\Delta t)^{n + 1}\) order,
where, for simplifying notation, we have denoted the \(n\)th-order derivatives of \(q_{\Delta t}\) by
So,\(\) by equaling the two expressions of \((\partial^{n + 1} p / \partial t^{n + 1}) (z, t)\), we find
For \(n = 1, 2, \ldots\). This holds for all \(p (x, t)\), thus
Recalling that \(q_{\Delta t} (z|x) = \delta (z - x) + r (z, x) \Delta t + \omicron (\Delta t)\), we have the Taylor expansion of \(q_{\Delta t}\), as
2.1. Another derivation uses exponential mapping. By regarding \(p\) a time-dependent element in functional space, and \(r\) as a linear operator, it becomes (we add a hat for indicating operator, using dot \(\cdot\) for its operation)
This operator differential equation has a famous solution, called exponential mapping, \(p (t) = \exp (\hat{r} t) p (0)\), where the exponential operator is defined by Taylor expansion \(\exp (\hat{L}) := \hat{1} + \hat{L} + (1 / 2!) \hat{L}^2 + \cdots\) for any linear operator \(\hat{L}\). Indeed, by taking derivative on \(t\) on both sides, we find \((\mathrm{d} p / \mathrm{d} t) (t) = \hat{r} \cdot \exp (\hat{r} t) p (0) = \hat{r} \cdot p (t)\). Recall the discrete time master equation, \(p (\Delta t) = \hat{q}_{\Delta t} \cdot p (0)\), where the transition density \(\hat{q}_{\Delta t}\) is regarded as a linear operator too (so we put a hat on it). We find \(\exp (\hat{r} \Delta t) \cdot p (0) = \hat{q}_{\Delta t} \cdot p (0)\), which holds for arbitrary \(p (0)\), implying \(\hat{q}_{\Delta t} = \exp (\hat{r} \Delta t) = 1 + \hat{r} \Delta t + (1 / 2!) (\hat{r} \cdot \hat{r}) (\Delta t)^2 + \cdots\). Going back to functional representation, we have the correspondences \(\hat{q}_{\Delta t} \rightarrow q_{\Delta t} (z|x)\), \(\hat{r} \rightarrow r (z, x)\), \(\hat{r} \cdot \hat{r} \rightarrow \int \mathrm{d} y r (z, y) r (y, x)\), \(\hat{r} \cdot \hat{r} \cdot \hat{r} \rightarrow \int \mathrm{d} y_1 \mathrm{d} y_2 r (z, y_2) r (y_2, y_1) r (y_1, x)\), and so on, thus recover the relation between \(q_{\Delta t}\) and \(r\).
\(\displaystyle \begin{array}{rl} q_{\Delta t} (z|x) = & \delta (z - x)\\ + & (\Delta t) r (z, x)\\ + & \frac{(\Delta t)^2}{2!} \int_{\mathcal{X}} \mathrm{d} y r (z, y) r (y, x)\\ + & \cdots\\ + & \frac{(\Delta t)^{n + 1}}{(n + 1) !} \int_{\mathcal{X}} \mathrm{d} y_1 \cdots \int_{\mathcal{X}} \mathrm{d} y_n r (z, y_n) r (y_n, y_{n - 1}) \cdots r (y_1, x)\\ + & \cdots .\\ & \end{array} \) | (2.6) |
Well, this is a complicated formula, but its implication is straight forward and very impressive: the transition density is equivalent to transition rate, even though transition rate is derived from infinitesimal time-interval transition density.
This may be a little weird at the first sight. For example, consider \(q'_{\Delta t} (y|x) := q_{\Delta t} (y|x) + f (y, x) \Delta t^2\), where \(f\) is any function ensuring that \(q'_{\Delta t}\) is non-negative and normalized (thus \(\int_{\mathcal{X}} \mathrm{d} y f (y, x) = 0\)). Following the previous derivation, we find that the discrete time master equation
also leads to the (continuous time) master equation 2.4 with the same \(r\) as that of \(q_{\Delta t}\). So, we should have \(q'_{\Delta t} = q_{\Delta t}\), which means \(f\) is not free, but should vanish.
The answer to this question is that, a transition density is not free to choose, but sharing the same degree of freedom as that of its transition rate. The fundamental quantity that describes the evolution of a continuous time Markov process is transition rate. For example, consider \(p (z, t + \Delta t + \Delta t')\) for any \(\Delta t\) and \(\Delta t'\). Directly, we have
but on the other hand, by applying discrete time master equation twice, we find
By equaling the two expressions of \(p (z, t + \Delta t + \Delta t')\), we find
Since \(p (x, t)\) can be arbitrary, we arrive at
\(\displaystyle q_{\Delta t + \Delta t'} (z|x) = \int_{\mathcal{X}} \mathrm{d} y q_{\Delta t'} (z|y) q_{\Delta t} (y|x) . \) | (2.7) |
This provides an addition restriction to the transition density.
Interestingly, obeying equation 2.7 is sufficient for \(q_{\Delta t}\) to satisfy equation 2.6. Precisely, let \(q_{\Delta t} (x|y)\) is a function that is smooth on \(\Delta t\) with \(q_0 (x|y) = \delta (x - y)\), we are to show that, if \(q_{\Delta t}\) satisfies equation 2.7, then it will obey equation 2.6. To do so, we take derivative on equation 2.7 by \(\Delta t'\) at \(\Delta t' = 0\), resulting in
where \(r (z, y) := q_0^{(1)} (z|y)\). Then, we are to Taylor expand both sides by \(\Delta t\). On the right hand side, we have
and on the left hand side,
So, we get the Taylor expansion on both sides. At \((\Delta t)^0\) order,
which is just the definition of \(r\). At \((\Delta t)^1\) order,
Iteratively at \((\Delta t)^{n + 1}\) order, we will find
again. And this implies equation 2.6. So, we conclude this paragraph as follow: obeying equation 2.7 is the sufficient and essential condition for a function \(q_{\Delta t} (x|y)\), which is smooth on \(\Delta t\) with \(q_0 (x|y) = \delta (x - y)\), to satisfy equation 2.6; additionally, if \(q_{\Delta t} (x|y)\) is non-negative and normalized on \(x\) (namely, \(\int_{\mathcal{X}} \mathrm{d} x q_{\Delta t} (x|y) = 1\)), then \(q_{\Delta t}\) is a transition density.
Let \(\Pi\) a stationary solution of master equation 2.4. Then, its density function \(\pi\) satisfies \(\int_{\mathcal{X}} \mathrm{d} y r (x, y) \pi (y) = 0\). Since we have demanded that \(\int_{\mathcal{X}} \mathrm{d} y r (y, x) = 0\), the stationary master equation can be re-written as
But, this condition is too weak to be used. A more useful condition, which is stronger than this, is that the integrand vanishes everywhere:
\(\displaystyle r (x, y) \pi (y) = r (y, x) \pi (x), \) | (2.8) |
which is called the detailed balance condition.
Interestingly, for a transition rate \(r\) that satisfies detailed balance condition 2.8, the transition density \(q_{\Delta t}\) generated by \(r\) using equation 2.6 satisfies a similar relation
\(\displaystyle q_{\Delta t} (x|y) \pi (y) = q_{\Delta t} (y|x) \pi (x) . \) | (2.9) |
To see this, consider the third line in equation (2.6), where the main factor is
Following the same steps, we can show that all terms in equation 2.6 share the same relation, indicating \(q_{\Delta t} (z|x) \pi (x) = q_{\Delta t} (x|z) \pi (z)\).
Given the time \(t\), if the time-dependent distribution \(P (t)\) and the stationary distribution \(\Pi\) share the same alphabet \(\mathcal{X}\), which means \(p (x, t) > 0\) and \(\pi (x) > 0\) for each \(x \in \mathcal{X}\), we have defined the relative entropy between them, as
\(\displaystyle H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) \ln \frac{p (x, t)}{\pi (x)} .\) | (2.10) |
It describes the uncertainty (surprise) caused by \(P (t)\) when prior knowledge is given by \(\Pi\). It is a plausible generalization of Shannon entropy to continuous random variables.
We can calculate the time-derivative of relative entropy by master equation 2.4 . Generally, the time-derivative of relative entropy has no interesting property. But, if the \(\pi\) is more than stationary but satisfying a stronger condition: detailed balance, then \(\mathrm{d} H (P (t), \Pi) / \mathrm{d} t\) will have a regular form
2.2. The proof is given as follow. Directly, we have
Since \(\int_{\mathcal{X}} \mathrm{d} x (\partial p / \partial t) (x, t) = (\partial / \partial t) \int_{\mathcal{X}} \mathrm{d} x p (x, t) = 0\), the second term vanishes. Then, we get
Now, we replace \(\partial p / \partial t\) by master equation 2.4, as
Then, insert detailed balance condition \(r (y, x) = r (x, y) \pi (y) / \pi (x)\), as
Since \(x\) and \(y\) are dummy, we interchange them in the integrand, and then insert detailed balance condition again, as
By adding the two previous results together, we find
from which we directly get the result. Notice that this proof is very tricky: it uses detailed balance condition twice, between which the expression is symmetrized. It is an ingenious mathematical engineering.
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = - \frac{1}{2} \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y r (x, y) \pi (x) \left( \frac{p (x, t)}{\pi (x)} - \frac{p (y, t)}{\pi (y)} \right) \left( \ln \frac{p (x, t)}{\pi (x)} - \ln \frac{p (y, t)}{\pi (y)} \right) . \) | (2.11) |
We are to check the sign of the integrand. The \(r (x, y)\) is negative only when \(x = y\), on which the integrand vanishes. Thus, \(r (x, y)\) can be treated as non-negative, so is the \(r (x, y) \pi (y)\) (since \(\pi (x) > 0\) for all \(x \in \mathcal{X}\)). Now, we check the sign of the last two terms. If \(p (x, t) / \pi (x) > p (y, t) / \pi (y)\), then \(\ln [p (x, t) / \pi (x)] > \ln [p (y, t) / \pi (y)]\), thus the sign of the last two terms is positive. The same goes for \(p (x, t) / \pi (x) < p (y, t) / \pi (y)\). Only when \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) can it be zero. Altogether, the integrand is non-positive, thus \(\mathrm{d} H / \mathrm{d} t \leqslant 0\).
The integrand vanishes when either \(r (x, y) = 0\) or \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\). If \(r (x, y) > 0\) for each \(x \neq y\), then \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) only when \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) for all \(x, y \in \mathcal{X}\), which implies that \(p (\cdot, t) = \pi\) (since \(\)\(\int_{\mathcal{X}} \mathrm{d} x p (x, t) = \int_{\mathcal{X}} \mathrm{d} x \pi (x) = 1\)), or \(P (t) = \Pi\).
Contrarily, if \(r (x, y) = 0\) on some subset \(U \subset \mathcal{X} \times \mathcal{X}\), it seems that \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) cannot imply \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) on \(U\). But, if there is a \(z \in \mathcal{X}\) such that both \((x, z)\) and \((y, z)\) are not in \(U\), then \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) implies \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\) and \(p (y, t) / \pi (y) = \pi (z, t) / \pi (z)\), thus implies \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\). It hints for connectivity. Precisely, for each \(x, z \in \mathcal{X}\), if there is a series \((y_1, \ldots, y_n)\) from \(x\) (\(y_1 := x\)) to \(z\) (\(y_n := z\)) with both \(r (y_{i + 1}, y_i)\) and \(r (y_i, y_{i + 1})\) are positive for each \(i\), then we say \(x\) and \(z\) are connected , and the series is called a path . It means there are densities transiting along the forward and backward directions of the path . In this situation, \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) implies \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\).
2.3. We have, along the path, \(p (y_1, t) / \pi (y_1) = p (y_2, t) / \pi (y_2) = \cdots = p (y_n, t) / \pi (y_n)\), thus \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\) since \(x = y_1\) and \(z = y_n\).
Let us examine the connectivity further. We additionally define that every element in \(\mathcal{X}\) is connected to itself, then connectivity forms an equivalence relation. So, it separates \(\mathcal{X}\) into subsets (equivalence classes) \(\mathcal{X}_1, \ldots, \mathcal{X}_n\) with \(\mathcal{X}_i \cap \mathcal{X}_j = \varnothing\) for each \(i \neq j\) and \(\mathcal{X}= \cup_{i = 1}^n \mathcal{X}_i\). In each subset \(\mathcal{X}_i\), every two elements are connected. In this way, the whole random system are separated into many independent subsystems. The distributions \(P_i (t)\) and \(\Pi_i\) defined in the subsystem \(i\) have the alphabet \(\mathcal{X}_i\) and densities functions \(p_i (x, t) := p (x, t) / \int_{\mathcal{X}_i} \mathrm{d} x p (x, t)\) and \(\pi_i (x) := \pi (x) / \int_{\mathcal{X}_i} \mathrm{d} x \pi (x)\) respectively (the denominators are used for normalization). Applying the previous discussion to this subsystem, we find \(P_i (t) = \Pi_i\) at \((\mathrm{d} / \mathrm{d} t) H (P_i (t), \Pi_i) = 0\).
So, for the whole random system or each of its subsystems, the following theorem holds.
Theorem
Many textures use Fokker-Planck equation to prove the monotonic reduction of relative entropy. After an integration by parts, they arrive at a negative definite expression, which means the monotonic reduction. This proof needs smooth structure on \(X\), which is essential for integration by parts. In this section, we provides a more generic alternative to the proof, for which smooth structure on \(X\) is unnecessary.
How to numerically simulate the evolution of master equation 2.4 that tends to equilibrium (without which the simulation will not terminate)? Using the metaphor of sands (see section 2.2 ), we simulate each sand, but replace its free will by a transition probability. Explicitly, we initialize the sands (that is, their positions) randomly. Then iteratively update the position of each sand. In each iteration, a sand jumps from position \(x\) to position \(y\) with the probability \(q_{\Delta t} (y|x) \approx \delta (y - x) + r (y, x) \Delta t\) where \(\Delta t\) is sufficiently small. Not every jump is valid. On one hand, we have to ensure that computer has a sampler that makes random sampling for \(q_{\Delta t} (y|x)\). On the other hand, to ensure the termination, the transition rate \(r\), together with the density function \(\pi\), shall satisfy the detailed balance condition 2.8 . (Section 2.7 will provide a method that constructs such a transition rate from the density function.) Then, we expect that the simulation will iteratively decrease the difference between the distribution of the sands and the \(\Pi\). We terminate the iteration when they have been close enough. In this way, we simulate a collection of sands evolves with the master equation to equilibrium, and finally distributes as \(\Pi\). This process is called Monte-Carlo simulation , first developed by Stanislaw Ulam in 1940s while he was working on the project of nuclear weapons at Los Alamos National Laboratory. The name is in memory of Ulam's uncle who lost all his personal assets in Monte Carlo Casino, Monaco.
2.4. There are multiple motivations for Monte-Carlo simulation. An important one comes from numerical integration. The problem is calculating the integral \(\int_{\mathcal{X}} \mathrm{d} x \pi (x) f (x)\) for a density function \(\pi\) and an arbitrary function \(f : \mathcal{X} \rightarrow \mathbb{R}\). When \(\mathcal{X}\) has finite elements, this integral is easy to compute, which is \(\sum_{x \in \mathcal{X}} \pi (x) f (x)\). Otherwise, this integral will be intractable. Numerically, this integral becomes the expectation \((1 / | \mathcal{S} |) \sum_{x \in \mathcal{S}} f (x)\) where \(\mathcal{S}\) is a collection of elements randomly sampled from distribution \(\Pi\), whose density function is the \(\pi\). By central limit theorem (briefly, the mean of i.i.d. random variables \(X_1, \ldots, X_N\) with mean \(\mathbb{E} [X_i] = 0\) and variance \(\operatorname{Var} [X_i] = \sigma^2\) for some \(\sigma\), has standard derivation \(\sigma / \sqrt{N}\) when \(N\) is large enough), the numerical error \(\left| \int_{\mathcal{X}} \mathrm{d} x \pi (x) f (x) - (1 / | \mathcal{S} |) \sum_{x \in \mathcal{S}} f (x) \right|\) is proportional to \(1 / \sqrt{| \mathcal{S} |}\), which can be properly bounded as long as \(| \mathcal{S} |\) is large enough. But, how to sample from a distribution if you only know its density function (recall in section 1.1, a distribution is the combination of its density function and its sampler)? The answer is using Monte-Carlo simulation.
Like the Euler method in solving dynamical system, however, a finite time step results in a residual error. This residual error must be analyzed an controlled, so that the distribution will evolve toward \(\Pi\), as we have expected. To examine this, we calculate the \(H (P (t + \Delta t), \Pi) - H (P (t), \Pi)\) where \(\Delta t\) is small but still finite, and check when it is negative (such that \(H (P (t))\) monotonically decreases to \(P (t) \rightarrow \Pi\)).
By definition, we have
Inserting \({\int_{\mathcal{X}} \mathrm{d} x p (x, t + \Delta t) \ln (p (x, t) / \pi (x, t))}\) gives
The first line is recognized as \(H (P (t + \Delta t), P (t))\), which is non-negative. Following the same steps in section 2.5 (but using discrete time master equation 2.5 instead, and detailed balance condition 2.9 for transition density), the second line reduces to
which is non-positive (suppose that \(r\) connects every two elements in \(\mathcal{X}\)). So, the sign of \(\Delta H\) is determined by that which line has greater absolute value. The first line depends only on the difference between \(P (t)\) and \(P (t + \Delta t)\), thus \(\Delta t\), while the second line additionally depends on the difference between \(P (t)\) and \(\Pi\) (the factor \(q_{\Delta t} (x|y)\) also depends on \(\Delta t\)). When \(\Delta t \rightarrow 0\), the first line vanishes, while the second does not until \(P (t) \rightarrow \Pi\). This suggests us to investigate how fast each term converges as \(\Delta t \rightarrow 0\).
To examine the speed of convergence, we calculate the leading order of \(\Delta t\) in each line. To make it clear, we denote the first line by \(\Delta H_1\) and the second line \(\Delta H_2\). Taylor expanding \(\Delta H_1\) by \(\Delta t\) gives
2.5. The first line
To Taylor expand the right hand side by \(\Delta t\), we expand \(p (x, t + \Delta t)\) to \(\omicron (\Delta t^2)\), as
and the same for \(\ln p (x, t + \Delta t)\), as
Plugging in \((\mathrm{d} / \mathrm{d} x) \ln f (x) = f' (x) / f (x)\) and then \((\mathrm{d}^2 / \mathrm{d} x^2) \ln f (x) = f'' (x) / f (x) - (f' (x) / f (x))^2\), we find
So, the \(\Delta t\) order term in \(\Delta H_1\) is
where we used the normalization of \(p\). The \(\Delta t^2\) term in \(\Delta H_1\) is
Using the normalization of \(p\) as before, it is reduced to
Altogether, we arrive at
where, by master equation 2.4, \((\partial / \partial t) \ln p (x, t) = \int_{\mathcal{X}} \mathrm{d} x r (x, y) p (y, t) / p (x, t)\). For \(\Delta H_2\), we insert equation 2.11 after Taylor expanding \(q_{\Delta t}\) by \(\Delta t\), and obtain
We find \(\Delta H_1\) converges with speed \(\Delta t^2\) while \(\Delta H_2\) has speed \(\Delta t\).
Thus, given \(P (t) \neq \Pi\) (so that \(\Delta H_2 \neq 0\), recall section 2.5), there must be a \(\delta > 0\) such that for any \(\Delta t < \delta\), we have \(| \Delta H_1 | < | \Delta H_2 |\), in which case the \(\Delta H = \Delta H_1 + \Delta H_2 < 0\) (recall that \(\Delta H_1 \geqslant 0\) and \(\Delta H_2 \leqslant 0\)). The \(\delta\) is bounded by
This bound is proportional to the difference between \(P (t)\) and \(\Pi\) (represented by the first factor). When \(P (t)\) has approached \(\Pi\) (that is, \(P (t) \approx \Pi\) but not exactly equal), \(\delta\) has to be extremely small. (This is a little like supervised machine learning where \(\Delta t\) acts as learning rate and \(H (P (t), \Pi)\) as loss. In the early stage of training, the loss function has a greater slope and we can safely employ a relatively larger learning rate to speed up the decreasing of loss. But, we have to tune the learning rate to be smaller and smaller during the training, in which the slope of loss function is gradually decreasing. Otherwise, the loss will not decrease but keep fluctuating when it has been sufficiently small, since the learning rate now becomes relatively too big.)
Metropolis-Hastings algorithm is a simple method that constructs transition rate for any given stationary distribution such that detailed balance condition holds. Explicitly, given a stationary distribution \(\Pi\), and an auxiliary transition rate \(\gamma\), ensuring that \(\gamma (x, y) > 0\) for each \(x\) and \(y\) in alphabet \(\mathcal{X}\) such that \(x \neq y\), the transition rate \(r\) is given by
\(\displaystyle r (x, y) = \min \left( 1, \frac{\gamma (y, x) \pi (x)}{\gamma (x, y) \pi (y)} \right) \gamma (x, y) . \) | (2.12) |
This transition rate connects every two elements in \(\mathcal{X}\) (since \(\gamma (y, x) > 0\) for each \(x \neq y\)). In addition, together with \(\pi\), it satisfies the detailed balance condition 2.8. Directly,
Thus detailed balance condition holds. So, theorem 2.1 states that, evolved by the master equation 2.4, any initial distribution will finally relax to the stationary distribution \(\Pi\).
Metropolis-Hastings algorithm was first proposed by Nicholas Metropolis and others in 1953 in Los Alamos, and then improved by Canadian statistician Wilfred Hastings in 1970. This algorithm was first defined for transition density. Together with a positive auxiliary transition density \(g\), the transition density is defined as
\(\displaystyle q (x|y) := \min \left( 1, \frac{g (y|x) \pi (x)}{g (x|y) \pi (y)} \right) g (x|y), \) | (2.13) |
where \(g\) is positive-definite on \(\mathcal{X}\). Notice that, in equation 2.13 there is no extra time parameter like the \(q_{\Delta t} (x|y)\) in section 2.2. It can be seen as a fixed time interval, which can only be used for discrete time master equation.
This definition has an intuitive and practical explanation. The two factors can be seen as two conditional probability. The factor \(g (x|y)\) first proposes a transition from \(y\) to \(x\). (In numerical simulation, we have to ensure that computer has a sampler for sampling an \(x\) from the conditional probability \(g (x|y)\).) Then, this proposal will be accepted by Bernoulli probability with the ratio given by the first factor in the right hand side. If accepted, then transit to \(x\), otherwise stay on \(y\). Altogether, we get a conditional probability jumping from \(y\) to \(x\), the \(q (x|y)\).
It is straight forward to check that, if, in addition, \(g\) smoothly depends on a parameter \(\Delta t\) as \(g_{\Delta t}\), so is \(q\) as \(q_{\Delta t}\), and if we expand \(g_{\Delta t}\) at \(\Delta t \rightarrow 0\) as \(g_{\Delta t} (x|y) = \delta (x - y) + \gamma (x, y) \Delta t + \omicron (\Delta t)\), then we will find \(q_{\Delta t} (x|y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\). Indeed, when \(x = y\), we have \(q_{\Delta t} (x|x) = g_{\Delta t} (x, x)\). And when \(x \neq y\), \(\delta (x - y) = 0\), we find
Altogether, for each \(x, y \in \mathcal{X}\), we find \(q_{\Delta t} (x|y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\). In practice, we use the Metropolis-Hastings algorithm 2.13 to numerically simulate master equation 2.4. But, based on the discussion in section 2.6, the \(\Delta t\) in \(g_{\Delta t}\) shall be properly bounded to be small (or equivalently speaking, \(g\) shall be “principal diagonal”) so as to ensure the relaxation \(P (t) \rightarrow \Pi\).
Given a transition rate, we wonder if there exists a density function such that detailed balance condition 2.8 holds. Actually, equation 2.8 defines a density function. For example, if both \(r (x, y)\) and \(r (y, x)\) are not zero, we can construct \(\pi (y)\) by given \(\pi (x)\) as \(\pi (y) = \pi (x) r (y, x) / r (x, y)\). Generally, if \(x\) and \(y\) are connected, then there is a path \(P := (p_0, \ldots, p_n)\) from \(x\) to \(y\) with \(p_0 = x\) and \(p_n = y\) (path and connectivity are defined in section 2.5), and define
Thus, \(\pi (y)\) (the \(\pi (p_n)\)) is constructed out of \(\pi (x)\) (the \(\pi (p_0)\)). Let \(\rho (x, y) := \ln r (x, y) - \ln r (y, x)\), it becomes
or in continuous format,
\(\displaystyle \ln \pi (y) = \ln \pi (x) + \int_P \mathrm{d} s \rho (s), \) | (2.14) |
where \(\rho (s)\) is short for \(\rho (p_{s + 1}, p_s)\) along the path \(P\). In this way, given \(x_0 \in \mathcal{X}\), we define any \(x \in \mathcal{X}\) that is connected to \(x_0\) by \(\ln \pi (x) := \ln \pi (x_0) + \int_P \mathrm{d} s \rho (s)\). And \(\pi (x_0)\) is determined by the normalization of \(\pi\).
But, there can be multiple paths from \(x\) to \(y\) which are connected in \(\mathcal{X}\). For example, consider two paths \(P\) and \(P'\), then we have \(\int_P \mathrm{d} s \rho (s) = \int_{P'} \mathrm{d} s \rho (s)\). Generally, if \(C\) is a circle which is a path starting at an element \(x \in \mathcal{X}\) and finally end at \(x\) (but not simply standing at \(x\)), then
\(\displaystyle \oint_C \mathrm{d} s \rho (s) = 0. \) | (2.15) |
It means every path along two connected elements in \(\mathcal{X}\) is equivalent. If the condition 2.15 holds, we can simplify the notation in equation 2.14 by
where \(\int_x^y\) indicates any path from \(x\) to \(y\) (if \(x\) and \(y\) are connected).
Condition 2.15 implies that the previous construction does define a \(\pi\) that holds the detailed balance condition. Given \(x, y \in \mathcal{X}\), we have \(\ln \pi (x) = \ln \pi (x_0) + \int_{x_0}^x \mathrm{d} s \rho (s)\) and \(\ln \pi (y) = \ln \pi (x_0) + \int_{x_0}^y \mathrm{d} s \rho (s)\). If \(x\) and \(y\) are connected, then, by condition 2.15,\(\) \(\rho (y, x) = \int_x^{x_0} \mathrm{d} s \rho (s) + \int_{x_0}^y \mathrm{d} s \rho (s)\) (the \(\rho (y, x)\) indicates the path \((x, y)\), “jumping” directly from \(x\) to \(y\)), thus \(\ln \pi (y) = \ln \pi (x) + \rho (y, x)\), which is just the detailed balance condition 2.8. And if \(x\) and \(y\) are not connected, then both \(r (x, y)\) and \(r (y, x)\) shall vanish (recall the requirements of transition rate in section 2.2: if \(r (x, y) = 0\), then \(r (y, x) = 0\)), and detailed balance condition holds naturally.
So, condition 2.15 is essential and sufficient for the existence of \(\pi\) that holds the detailed balance condition 2.8. If \(\mathcal{X}\) is a simply connected smooth manifold, then using Stokes's theorem, we have \(\nabla \times \rho = 0\) on \(\mathcal{X}\). But, generally \(\mathcal{X}\) is neither simply connected nor smooth, but involving independent subsystems and discrete. In these cases, condition 2.15 becomes very complicated.
In many applications, we consider the inverse question: given a density function, if there exists a transition rate such that detailed balance condition holds. This inverse problem is much easier, and a proper transition rate can be constructed out of the density function (such as in Metropolis-Hastings algorithm).
We follow the discussion in chapter 2, but focusing on the specific situation where there is extra smooth structure on \(X\). This smoothness reflects on the connectivity of the alphabet \(\mathcal{X}\), and on the smooth “spatial” dependence of the density function and transition rate. This indicates that the conclusions in chapter 2 hold in this section, but the inverse is not guaranteed.
Follow the conventions in chapter 2. In addition, thought this chapter, we set the alphabet to be Euclidean to get sufficient connectivity. Namely, \(\mathcal{X}=\mathbb{R}^d\) for some integer \(d \geqslant 1\).
We employ the Einstein's convention. That is, we omit the sum notation for the duplicated indices as long as they are “balanced”. For example, \(x_{\alpha} y^{\alpha}\) represents \(\sum_{\alpha} x_{\alpha} y^{\alpha}\). The \(\alpha\) appears twice in the expression, once in subscript (the \(x_{\alpha}\)) and once in superscript (the \(y^{\alpha}\)), for which we say indices are balanced. Expression like \(x_{\alpha} y_{\alpha}\), however, does not represent a summation over \(\alpha\), because indices are not balanced (both are subscript). A more complicated example is \(\partial_{\alpha} A^{\alpha}_{\beta} x^{\beta}\), which means \(\sum_{\alpha} \sum_{\beta} \partial_{\alpha} A^{\alpha}_{\beta} x^{\beta}\).
For spatial smoothness, we assume that the density function \(p (x, t)\) of a time-dependent distribution \(P (t)\) and the transition rate \(r (x, y)\) are smooth on \(x\) and \(y\). Besides the smoothness of functions, however, spatial smoothness demands more. To declare this, we consider the situation where the mass of \(P (t)\) is centered at \(x\) initially, namely \(p (x', 0) = \delta (x - x')\). Then, after a small temporal period \(\Delta t\), there is some portion of mass transits to elsewhere. By master equation 2.4, the change in density is
Inserting the initial condition \(p (x', 0) = \delta (x - x')\) and denoting \(\epsilon := y - x\), we get
We assume that the portion of mass does not transit far away from \(x\), but in its neighbor, namely \(\epsilon\) is really small in scale. Quantitatively speaking, the scale is reflected by moments, where the \(n\)th-moment is defined by
Thus, the assumption turns to be
\(\lim_{\Delta t \rightarrow 0} \mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}] = 0\), and
as \(\Delta t\) tends to zero, \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n} \epsilon^{\alpha_{n + 1}} \cdots \epsilon^{\alpha_{n + m}}]\) converges (to zero) faster than \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\).
For the second condition, we shall expect that \((\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n})\) will become much smaller by multiplying more small (random) variables.
Plugging in the expression of \(p (x + \epsilon, \Delta t)\), we find
Then, introducing (to distinguish moments, which is defined on density, we employ \(K\) instead of \(M\) for denoting the “moments of transition rate”)
we have
So the first condition simply implies \(| K_n^{\alpha_1 \cdots \alpha_n} (x) | < + \infty\). The second condition is non-trivial. We are to show that it indicates a cut-off. That is, there shall exist an positive integer \(N_{\operatorname{cut}}\), such that \(K_n = 0\) for any \(n > N_{\operatorname{cut}}\). And we will find the estimation \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}] = \mathcal{O} (\Delta t^{\sharp (n / N_{\operatorname{cut}})})\), where \(\sharp\) is the ceiling function (which rounds its argument to the nearest greater integer).
As an example for exploration, we first cut-off at \(N_{\operatorname{cut}} = 2\), namely \(K_n = 0\) for any \(n > 2\). We are to calculate \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\) up to \(\omicron (\Delta t^2)\). This demands the relation between transition rate and transition density (equation 2.6),
Starting at \(p (y, 0) = \delta (x - y)\), master equation gives
Inserting the expansion of \(q_{\Delta t}\) makes
The first term is \(\Delta t K_n^{\alpha_1 \cdots \alpha_n} (x)\), so it is \(\Delta t K_1 (x)\) and \(\Delta t K_2 (x)\) for \(n = 1, 2\) respectively, and vanishes otherwise. In the rest, we focus on the second term, denoting the integral as \(F_n^{\alpha_1 \cdots \alpha_n} (x)\), thus the second term is \(\Delta t^2 / 2 F_n^{\alpha_1 \cdots \alpha_n} (x)\). First notice that the integral has identity
Thus
We explore the calculation of \(F_n\) by evaluating \(F_1 (x)\). By inserting an \((\epsilon - y)\) term, we get
While integrating over \(\epsilon\), the first line gives \(\int \mathrm{d} y K_1^{\alpha_1} (x + y) r (x + y, x)\), and the second vanishes because of the property \(\int \mathrm{d} x r (x, y) = 0\). So, we find
Taylor expansion of \(K_1\) at \(x\) gives
The first term on the right hand side vanishes. The second term is \(K_1^{\beta_1} (x) \partial_{\beta_1} K_1^{\alpha_1} (x)\), and the third is \(\)\((1 / 2!) K_2^{\beta_1 \beta_2} (x) \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha_1} (x)\). The rest terms are all vanishing because \(K_n = 0\) for any \(n > 2\). So, we find
Following the same process, we can obtain for \(F_2 (x)\),
3.1. Since
we have
Again, by integrating over \(\epsilon\), the first line on the right hand side is \(\int_{\mathbb{R}^d} \mathrm{d} y K_2^{\alpha_1 \alpha_2} (x + y) r (x + y, x)\) and the last line vanishes. The second line is \(\int_{\mathbb{R}^d} \mathrm{d} y K_1^{\alpha} (x + y) r (x + y, x) y^{\beta} +\operatorname{perm}\). Thus,
Then, again, Taylor expansion of \(K\)s at \(x\) gives
and
where we have used \(K_n = 0\) for any \(n > 2\) to cut the Taylor series. So, we find
where \(\operatorname{perm} (\alpha_1, \alpha_2)\) permutes the \(\alpha_1\) and \(\alpha_2\) for any term that is not symmetric (which is the forth term on the right hand side). Then for \(F_3 (x)\), we have
3.2. Since
we have
Again, using \(K_n = 0\) for any \(n > 2\), we get
Taylor expansion of \(K\)s at \(x\) gives
And for \(F_4 (x)\),
At higher order expansion of \(q_{\Delta t}\) by \(r\) (equation 2.6), the same goes. As an example, we examine the expansion at \(\mathcal{O} (\Delta t^3)\), as
where we have omitted the \(\mathcal{O} (\Delta t^2)\) terms. Following the same derivation, we find it contributes to \(\mathbb{E} [\epsilon^{\alpha}]\) by the term
We insert an \((\epsilon - y)\) term again, and get
The second line vanishes after integrating over \(\epsilon\) because \(\int \mathrm{d} x r (x, y) = 0\). The first line can be re-written as
And integrating over \(\epsilon\) gives
Again, we Taylor expand \(K_1\) at \(x\), resulting in
By integrating over \(y\), we find the first line vanishes because \(\int \mathrm{d} x r (x, y) = 0\). We recognize that the second line is just the \(F^{\beta}_1 (x)\), and the third line is just the \(F_2^{\beta_1 \beta_2} (x)\). Namely,
So, the problem is deduced to that we have solved at lower order expansion of \(q_{\Delta t}\). It means that we can calculate to arbitrary order of expansion iteratively, and the process is the same at each order.
Observing these results, we find the following rules (with explanations).
The superscripts are assigned to two \(K\)s together with partial derivatives, ensuring that the extra indices (such as \(\beta_1\)) are all summed over (namely, contracted).
The reason why there are two \(K\)s in \(F_n\) is that there are two \(r\)s in the expansion of \(q_{\Delta t}\) at \(\Delta t^2\) order.
For each \(n\)th order partial derivative, multiply it by a factor \(1 / n!\)
Because partial derivative comes from Taylor expansion, and \(1 / n!\) is the coefficient.
Non-symmetric assignments have to be permuted.
For example, the \(K_2^{\alpha_1 \beta_1} (x) \partial_{\beta_1} K_1^{\alpha_2} (x)\) term in \(F_2^{\alpha_1 \alpha_2} (x)\) is not symmetric on \(\alpha_1\) and \(\alpha_2\), thus has to be permuted in the \(\operatorname{perm} (\alpha_1, \alpha_2)\).
Symmetric assignments appear only once.
Also in \(F_2^{\alpha_1 \alpha_2} (x)\), the \(K_1^{\alpha_1} (x) K_1^{\alpha_2} (x)\) term is symmetric on \(\alpha_1\) and \(\alpha_2\), so it appears only once, being absent in the \(\operatorname{perm} (\alpha_1, \alpha_2)\).
We add all the possible assignments together as the final result.
These rules can be directly generalized to \(\mathcal{O} (\Delta t^m)\) in the expansion of \(q_{\Delta t}\) by \(r\), in which there can be \(m\) \(K\)s.
We have found that both \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_3}]\) and \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_4}]\) are of \(\mathcal{O} (\Delta t^2)\), since both \(F_3 (x)\) and \(F_4 (x)\) are non-zero. But \(F_5 (x)\) must vanish since we cannot get five superscripts with only two \(K_n\)s with \(n = 1, 2\). This further implies that any \(F_n\) with \(n > 4\) is zero, leading \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\) to \(\omicron (\Delta t^2)\). But at higher (than 2) order of \(\Delta t\), there will be more (than two) \(K\)s in \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\). Then, based on the rules we just found, the number of combination of indices will be greater (than \(4\)). These combinations, however, will always be “exhausted” when \(n\) has been sufficiently large. That is, there will be finite \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\)s at any given order of \(\Delta t\). This is just the formal expression of the second condition we assumed. So, the assumption is guaranteed with cut-off. And conversely, only with a cut-off can we guarantee the assumption. This can be generalized to cut-off at any \(N_{\operatorname{cut}}\), in which \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}] = \mathcal{O} (\Delta t^{\sharp (n / N_{\operatorname{cut}})})\) (recall that \(\sharp\) is the ceiling function).
Since all \(K\)s are finite, and we only have finite \(K\)s, we can relate the \(K\)s to the transition rate \(r\) explicitly. To do so, we first introduce an arbitrary test function \(\varphi : \mathbb{R}^d \rightarrow \mathbb{R}\) in Schwarts space \(S (\mathbb{R}^d)\), which is a functional space in which function mapping from \(\mathbb{R}^d\) to \(\mathbb{R}\) is smooth and rapidly falls to zero in the region far from origin. For example, Gaussian function (the density function of normal distribution) is in Schwarts space \(S (\mathbb{R})\). Since \(\varphi\) is in Schwarts space, in which functions are smooth, we can Taylor expanding \(\varphi\) at origin, which gives
Because of the normalization of transition density, we have \(\int_{\mathbb{R}^d} \mathrm{d} \epsilon r (x + \epsilon, x) = 0\), thus
where we have inserted the definition of \(K\)s and that \(K_n = 0\) for any \(n > N_{\operatorname{cut}}\) (section 3.2). Then, because of the identity
integration by parts on the right hand side gives
Plugging this back, we find
Since \(\varphi\) is arbitrary, we finall arrive at
\(\displaystyle r (x + \epsilon, x) = \sum_{n = 1}^{N_{\operatorname{cut}}} \frac{(- 1)^n}{n!} K_n^{\alpha_1 \cdots \alpha_n} (x) (\partial_{\alpha_1} \cdots \partial_{\alpha_n} \delta) (\epsilon) . \) | (3.1) |
This is called Kramers–Moyal expansion .
3.3. If we plug expansion 3.1 into master equation, then we get
Notice that \((\partial / \partial y^{\alpha_1}) \cdots (\partial / \partial y^{\alpha_n}) \delta (x - y) = (- 1)^n (\partial_{\alpha_1} \cdots \partial_{\alpha_n} \delta) (x - y)\), we get
Taking integration by parts on the right hand side and then integrating over \(y\), we get the Kramers–Moyal expansion
This is the form of Kramers–Moyal expansion that appears in many textures.
For example, to evaluate \(\partial_{\alpha} \delta (- x)\), we have to employ an arbitrary test function \(\varphi \in S (\mathbb{R}^d, \mathbb{R}^d)\), and calculate \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi^{\alpha} (x)\). First, notice that \(\partial_{\alpha} \delta (- x)\) is in fact \((\partial_{\alpha} \delta) (- x)\) and that \((\partial \delta / \partial x^{\alpha}) (- x) = - (\partial / \partial x^{\alpha}) \delta (- x)\), thus
Then, integration by parts gives \(- \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} [\delta (- x)] \varphi^{\alpha} (x) = \int_{\mathbb{R}^d} \mathrm{d} x \delta (- x) \partial_{\alpha} \varphi^{\alpha} (x)\). After inserting the relation \(\delta (x) = \delta (- x)\), we arrive at \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi^{\alpha} (x) = \partial_{\alpha} \varphi^{\alpha} (0)\). On the other hand, we have, by integration by parts, \(- \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (x) \varphi^{\alpha} (x) = \int_{\mathbb{R}^d} \mathrm{d} x \delta (x) \partial_{\alpha} \varphi^{\alpha} (x) = \partial_{\alpha} \varphi^{\alpha} (0)\). Altogether, we find \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi^{\alpha} (x) = - \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (x) \varphi^{\alpha} (x)\), for any \(\varphi \in S (\mathbb{R}^d, \mathbb{R}^d)\). Thus, \(\partial_{\alpha} \delta (- x)\) is evaluated to be \(- \partial_{\alpha} \delta (x)\). That is, \(\partial_{\alpha} \delta\) is odd . Following the same process, we can show that \(\partial_{\alpha} \partial_{\beta} \delta\) is even .
3.4. We are to calculate \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \partial_{\beta} \delta (- x) f^{\alpha \beta} (x)\), where \(f \in S (\mathbb{R}^d, \mathbb{R}^{d \times d})\). Again, noticing that \((\partial_{\alpha} \partial_{\beta} \delta) (- x) = \partial_{\alpha} \partial_{\beta} [\delta (- x)]\), we have
Then integration by parts gives
That is, \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \partial_{\beta} \delta (- x) f^{\alpha \beta} (x) = \partial_{\alpha} \partial_{\beta} f^{\alpha \beta} (0)\). On the other hand, we have
So,
holds for any \(f \in S (\mathbb{R}^d, \mathbb{R}^{d \times d})\), thus \(\)\(\partial_{\alpha} \partial_{\beta} \delta (- x) = \partial_{\alpha} \partial_{\beta} \delta (x)\).
In section 3.2, we have found that a finite cut-off on the moments of transition rate is essential for spatial smoothness. We are to show that cut-off at \(N_{\operatorname{cut}} = 1\) (namely \(K_n = 0\) for any \(n > 1\)) only results in a deterministic evolution. To do so, we plug Kramers–Moyal expansion 3.1 into master equation, and find (re-denote \(K_1\) to \(f\) for simplicity)
Since \((\partial_{\alpha} \delta) (x - y) = - \partial_{\alpha} [\delta (x - y)]\), integration by parts on the right hand side gives
Thus,
This is the continuity equation or transport equation. It was used for describing the evolution of the density of incompressible liquids. We are to solve this equation explicitly. As a first order partial differential equation, we can use the method of characteristics. At the first step, we fully expand the equation, as
The next step is constructing a parameterized curve \((x (s), t (s))\) for \(s \in [0, + \infty)\) called characteristic, obeying
and
It has solution \(t (s) = s + t (0)\). If we set \(t (0) = 0\), then we have \(t = s\) and
from which we solve \(x (t)\), leading to
which is the left hand side of the original equation along characteristic. Thus, by equaling to the right hand side of the original equation, we get
Now, the original partial differential equation is converted into an ordinary differential equation. It has the unique solution
It indicates that the density at \(x (0)\) will “transport” along the curve \(x (t)\) as time evolves. For example, consider \(p (y, 0) = \delta (y - x (0))\), that is all mass is centered at \(x (0)\). Then \(p (x, t)\) will have density only at \(x (t)\), and \(p (y, t) = 0\) for any \(y \neq x (t)\), since \(y\) is traced back to \(y (0)\) along the characteristic \(y (t)\) and \(y (0) \neq x (0)\). That is to say, transport equation is deterministic.
In section 3.4, we have analyzed the cut-off at \(N_{\operatorname{cut}} = 1\) and found it deterministic, thus not a stochastic process. It indicates that we have to cut-off at least at \(N_{\operatorname{cut}} = 2\). We are to show that, if \(K_2\) as a matrix-valued field is positive definite, then the randomness of Markovian process is guaranteed.
We examine this by using an example. Let \(K_1\) is everywhere vanishing and \(K_2\) is a constant identity matrix. Then, Kramers–Moyal expansion 3.1 becomes
Plugging into master equation, we find
Integration by parts and then integrating over \(y\) gives
This equation is the famous heat equation or diffusion equation, first investigated by French mathematician Joseph Fourier in 1822 for modeling how heat diffuses. For initial value \(p (x, 0)\), it has the solution
where the factor \(1 / \sqrt{\cdots}\) comes from normalization \(\int_{\mathbb{R}^d} \mathrm{d} x p (x, t) = 1\). From this expression, we can read out the transition density directly, as
\(\displaystyle q_t (x|y) = \frac{1}{\sqrt{(2 \pi t)^n}} \exp \left( - \frac{1}{2 t} \delta_{\alpha \beta} (x^{\alpha} - y^{\alpha}) (x^{\beta} - y^{\beta}) \right), \) | (3.2) |
which obeys a normal distribution with \(y\) as its mean and \(t\) as its variance. So, we find randomness in the cut-off \(N_{\operatorname{cut}} = 2\).
Historically, in 1827, botanist Robert Brown noticed that pollen particles automatically shakes in water. It was first explained by Albert Einstein in 1905. He argued that the pollen particles are constantly stricken by water molecules, and found the transition density to be equation 3.2. Hence, the stochastic process described by this transition density is named by Brownian motion. Even though the techniques used for deriving this transition density had been mature when Brown first observed this phenomenon, but almost one hundred years after Brown's discover, in 1918, Norbert Wiener first constructed a complete mathematical theory for this stochastic process. So, it is also called Wiener process.
The transition can be seen as an accumulation of a series tiny transitions, each is caused by one strike from a water molecule. The strike obeys a distribution which is identical and independent (since each strike is individual) with zero mean (as a result of \(K_1 = 0\)). This distribution, however, is unknown. Although, we find that the accumulative effect always obeys a normal distribution. We can abstract this and conclude a corollary as follow.
Corollary
Each \(X_i\) can be seen as a strike by water molecule. Further, the mean of \(Y\) can be calculated by the linearity of expectation, as \(\mathbb{E} [Y] =\mathbb{E} [X_1] + \cdots +\mathbb{E} [X_N] = 0\). And because of independency, we have \(\mathbb{E} [Y^{\alpha} Y^{\beta}] =\mathbb{E} [X_1^{\alpha} X_1^{\beta}] + \cdots +\mathbb{E} [X_N^{\alpha} X_N^{\beta}]\). Let \(\Sigma^{\alpha \beta} := \mathbb{E} [X_i^{\alpha} X^{\beta}_i]\), which is the same for all \(i\) because \(X_i\)s are identical, we find \(\mathbb{E} [Y^{\alpha} Y^{\beta}] = N \Sigma^{\alpha \beta}\). This is the central limit theorem, the most famous theorem in probability theory. Now, we have found for central limit theorem a physical description, the Brownian motion, and found it as a corollary of expansion 3.1.
The most generic form of transition rate with \(N_{\operatorname{cut}} = 2\) is (by Kramers–Moyal expansion 3.1 with \(N_{\operatorname{cut}} = 2\), and re-denote \(K_1\) by \(f\) and \(K_2\) by \(\Sigma\))
\(\displaystyle r (x + \epsilon, x) = - f^{\alpha} (x) \partial_{\alpha} \delta (\epsilon) + \frac{1}{2} \Sigma^{\alpha \beta} (x) \partial_{\alpha} \partial_{\beta} \delta (\epsilon), \) | (3.3) |
A (continuous time) Markovian process with this transition rate is called a Langevin process or Langevin dynamics. It was first developed by French physicist Paul Langevin in 1908.
Plugging equation 3.3 into master equation, and applying integration by parts as usual, we find
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = - \partial_{\alpha} (f^{\alpha} (x) p (x, t)) + \frac{1}{2} \partial_{\alpha} \partial_{\beta} (\Sigma^{\alpha \beta} (x) p (x, t)) . \) | (3.4) |
This equation is called Fokker-Planck equation, found by Adriaan Fokker and Max Planck in 1914 and 1917 respectively, or Kolmogorov forward equation, independently discovered in 1931.
As a matrix-valued field, \(\Sigma\) is symmetric and everywhere positive definite. Symmetry means \(\Sigma^{\alpha \beta} (x) = \Sigma^{\beta} (x)\). This is a direct result of its definition \(\int \mathrm{d} \epsilon r (x + \epsilon, x) \epsilon^{\alpha} \epsilon^{\beta}\). To see why it is positive definite, we consider the expectation
Under a proper coordinates, it becomes diagonalized with the diagonal element
So, \(\Sigma (x)\) is positive definite for any \(x \in \mathbb{R}^d\).
In this section and section 3.8, we use Langevin process as an example to explicitly show what the detailed balance condition appends to the stationary condition.
The master equation of Langevin process (equation 3.4) has stationary solution \(\Pi\) which satisfies (since there is only one variable \(x\), we use \(\partial\) instead of \(\nabla\))
which means
\(\displaystyle f^{\alpha} (x) \pi (x) = \frac{1}{2} \partial_{\beta} (\Sigma^{\alpha \beta} (x) \pi (x)) + \nu^{\alpha} (x), \) | (3.5) |
where \(\nu : \mathbb{R}^d \rightarrow \mathbb{R}^d\) is an arbitrary vector field such that \(\partial_{\alpha} \nu^{\alpha} (x) = 0\).
The vector field \(\nu\) has an intuitive explanation. Regarding \(\nu\) as a flux on \(\mathbb{R}^d\), we find that there is not net flux flowing out of anywhere in \(\mathbb{R}^d\). Otherwise, suppose there is \(x \in \mathbb{R}^d\) and a closed surface \(S\) around \(x\) such that the net flux \(\int \mathrm{d} S \cdot \nu (x)\) does not vanish. Then, by Stokes theorem, the surface integral \(\int \mathrm{d} S \cdot \nu (x) = \int \mathrm{d} x \nabla \cdot v (x) = 0\), thus conflicts. Such a vector field \(\nu\) is called free of source or source-free.
After discussing stationary distribution of Fokker-Planck equation (as a master equation), we continue investigate when will Langevin process relax an initial distribution to the stationary. By theorem 2.1, this is equivalent to ask: when will the transition rate of Langevin process satisfy detailed balance condition? Detailed balance condition reads \(r (x + \epsilon, x) \pi (x) = r (x, x + \epsilon) \pi (x + \epsilon)\). Directly inserting equation 3.3, we get for the left hand side,
and, for the right hand side,
where in the last line, we have used \(\partial_{\alpha} \delta (- x) = - \partial_{\alpha} \delta (x)\) and \(\partial_{\alpha} \partial_{\beta} \delta (- x) = \partial_{\alpha} \partial_{\beta} \delta (x)\) (derived in the end of section 3.3).
As generalized functions, we are to examine these two expressions by using an arbitrary test function \(\varphi \in S (\mathbb{R}^d)\). Thus, for the left hand side,
Integration by parts gives (note that the \(\partial\) is applied on \(\epsilon\))
The right hand side is a little complicated,
Again, integration by parts results in (again, the \(\partial\) operator is applied on \(\epsilon\))
By equaling \(\int_{\mathbb{R}^d} \mathrm{d} \epsilon r (x + \epsilon, x) \pi (x) \varphi (\epsilon)\) and \(\int_{\mathbb{R}^d} \mathrm{d} \epsilon r (x, x + \epsilon) \pi (x + \epsilon) \varphi (\epsilon)\), since \(\varphi\) is arbitrary, we find, for the \(\varphi (0)\) terms,
and for \(\partial \varphi (0)\) terms,
The \(\partial \partial \varphi (0)\) terms vanishes automatically. Altogether, we find the detailed balance condition for Langevin process to be
\(\displaystyle f^{\alpha} (x) \pi (x) = \frac{1}{2} \partial_{\beta} (K^{\alpha \beta} (x) \pi (x)) . \) | (3.6) |
Comparing with the stationary solution of Langevin process (equation 3.5), the source-free vector field \(\nu\) is absent here. Recall in section 2.4 where detailed balance condition was first encountered, we said that detailed balance condition is stronger than just being stationary. Now, in Langevin process, this becomes concrete: detailed balance condition is stronger than stationary condition in the sense that it lacks the source-free degree of freedom that appears in the stationary condition. The lost degree of freedom is the cost of ensuring that any initial distribution will finally relax to the stationary.
We apply the discussions in chapter 3 to least-action principle. After deriving the action of Langevin process, we investigate the criticality.
Follow the conventions in chapter 3 (except for section 4.6 where the alphabet \(\mathcal{X}\) can be discrete). In addition, we use \(P (\theta)\) for a parameterized distribution, where \(\theta\) is the collection of parameters. Its density function is \(p (x, \theta)\), where random variable \(X\) takes the value \(x\).
In physics, least-action principle gives the dynamics of the state of an evolutionary system, determining how it evolves with time. The state of an evolutionary system is called a configuration. As the state changes with time, the evolution of configuration can be seen as a path in a space, like a contrail in the sky, indicating the movement of an airplane. This space is called configuration space, which is generally Euclidean, \(\mathbb{R}^d\) for some \(n\). A path is a function with single parameter \(x : [t_i, t_f] \rightarrow \mathbb{R}^d\), where \(t_i\) and \(t_f\) denote the initial and final time respectively. Without losing generality, we standardize the time interval from \([t_i, t_f]\) to \([0, 1]\). To introduce the least-action principle, consider the collection of paths with fixed boundaries, that is, \(\mathcal{P} (x_0, x_1) := \{ x : [0, 1] \rightarrow \mathbb{R}^d |x (0) = x_0, x (1) = x_1 \}\) given the boundaries \((x_0, x_1)\). An action is a scalar functional of path with fixed boundaries, thus an action \(S (\cdot |x_0, x_1) : \mathcal{P} (x_0, x_1) \rightarrow \mathbb{R}\), where we use a vertical line to separate variables and those that are given as constants (the boundaries \((x_0, x_1)\)), which should not be confused with the vertical line in conditional probability, like \(p (x|y)\). For example, the configuration space of an (one-dimensional) harmonic oscillator is \(\mathbb{R}\), and the evolution is characterized by a path \(x : [0, 1] \rightarrow \mathbb{R}\). The action of harmonic oscillator is given by the functional
\(\displaystyle S_{\operatorname{HO}} (x|x_0, x_1) = \frac{1}{2} \int_0^1 \mathrm{d} t [\dot{x}^2 (t) - \omega^2 x^2 (t)], \) | (4.1) |
where \(\dot{x} := \mathrm{d} x / \mathrm{d} t\), \(\omega \in \mathbb{R}\), and \(x (0) = x_0\), \(x (1) = x_1\).
Roughly, least-action principle states that, in the real world, the paths with the fixed boundaries are those that minimize the action. To quantitatively declare the least-action principle, we have to describe the minimum of an action mathematically. Recall that a local minimum, or generally an extremum, \(x_{\star}\) of a function \(f\) is characterized by \((\partial f / \partial x^{\alpha}) (x_{\star}) = 0\) for each component \(\alpha\). How can we generalize this from function to functional (action is a functional)? The trick is discretizing the time. Precisely, we uniformly separate the time interval \([0, 1]\) into \(T\) fragments. Thus, the path \(x\) is discretized as a vector \((x (0), x (1 / T), \ldots, x ((T - 1) / T), x (1))\), each component is an endpoint of a fragment. Since the boundaries are fixed in least-action principle, \(x (0)\) and \(x (1)\) are constant rather than variables. Hence, the true degree of freedom is \((x (1 / T), \ldots, x ((T - 1) / T))\). Least-action principle in classical mechanics then states that, given the (discretized) action \(S\) and the boundaries \((x_0, x_1)\), there is at most one path \(x_{\star} \in \mathcal{P} (x_0, x_1)\) such that
\(\displaystyle \frac{\partial S}{\partial x (i / T)} (x_{\star} |x_0, x_1) = 0, \) | (4.2) |
for each \(i = 1, \ldots, T - 1\) and any \(T > 1\), and that \(x_{\star}\) is the path in real world.
Take harmonic oscillator as example. To discretize its action (equation 4.1), we replace the integral \(\int_0^1 \mathrm{d} t\) by mean \((1 / T) \sum_{i = 0}^T\) and \(x (t)\) by \(x (i / T)\). Thus the second term becomes \((\omega^2 / 2 T) \sum_{i = 0}^T x^2 (i / T)\). For the first term, the derivative \(\dot{x} (t)\) is replaced by its difference \(T [x ((i + 1) / T) - x (i / T)]\), hence the summation shall terminated at \(T - 1\) instead of \(T\). Altogether, the action 4.1 is discretized as
Given \(i\), \(x (i / T)\) appears in two terms in \(S_{\operatorname{HO}}\), the \(i\) and \(i + 1\) terms in the summation. They have derivatives \(T [- x ((i + 1) / T) + x (i / T)] - (\omega^2 / T) x (i / T)\) and \(T [x (i / T) - x ((i - 1) / T)]\) respectively. So, we find
for \(i = 1, \ldots, T - 1\). The right hand side is the discretized \(\ddot{x}_{\star} (t) + \omega^2 x_{\star} (t)\), for \(t \in (0, 1)\) (notice we have excluded the \(t = 0, 1\), corresponding to \(i = 0, T\) respectively). So, least-action principle, \(\partial S_{\operatorname{HO}} / \partial x (i / T) (x_{\star} |x_0, x_1) = 0\), implies the correct dynamics of harmonic oscillator in textbooks, which is \(\ddot{x}_{\star} (t) + \omega^2 x_{\star} (t) = 0\).
4.1. The dynamics with fixed boundaries is called boundary value problem. But in physics, the dynamics we obtained from the least-action principle is applied to initial value problem, where the initial “phase” (for physical system, it involves initial position and velocity), instead of boundaries, is fixed. This mysterious application leads to some interesting results. For an \(m\)th-order dynamics (for example, harmonic oscillator is a second order dynamics since it involves at most the second derivative of path), an initial value problem has \((T + 1 - m)\) variables (there are \(T + 1\) endpoints on the path), since the \(m\) degree of freedom has been assigned to the initial values. On the other hand, the boundary value problem has \((T + 1 - 2)\) degree of freedom, since there are always two boundaries (\(t = 0\) and \(t = 1\)). So, for the success of this mysterious application, we must have \(m = 2\). That is, the initial value problem has to be second order.
We can generalize the least-action principle to any system, evolutionary or not, where variables locate in a high-dimensional Euclidean space and, given some conditions, action is a scalar function on it. It states that the real world datum locates in the minimum of the action. Precisely, given the conditioned action \(S\) (we may hide the condition \(y\) into \(S\) instead of explicitly writing it out), there is a at most one \(x_{\star}\) such that
\(\displaystyle \frac{\partial S}{\partial x^{\alpha}} (x_{\star}) = 0, \) | (4.3) |
and that \(x_{\star}\) is the real world datum.
There are, however, redundant degrees of freedom in action \(S\). We may construct multiple actions all satisfying equation 4.3. Knowing the extremum of a function cannot imply the shape of the function. The action has much more degrees of freedom than that is needed for revealing the real world datum in classical mechanics. But combined with uncertainty, as we will see in section 4.5, action is completely determined by the real world distribution (the correspondence of real world datum), with nothing redundant.
We have to briefly introduce Grassmann variable, on which Berezin integral is based. Grassmann variable is an extension of real (or complex) variable,by introducing in the anti-commutative variables. Given a set of variables \(\{ \zeta_i |i = 1, \ldots, n \}\), we demand that the anti-commutative relation between \(\zeta\)s
But for any real (or complex) variable \(x\), we demand a commutative relation
Because of anti-commutation, we have \(\zeta_i \zeta_i = - \zeta_i \zeta_i\), thus \(\zeta_i \zeta_i = 0\). So, a polynomial of single Grassmann variable is always linear, as \(f (\zeta) = a + b \zeta\) where coefficients \(a, b \in \mathbb{R}\). And a polynomial of two Grassmann variables, \(\zeta\) and \(\eta\), is
where coefficients are real numbers. Extra term like \(\zeta \eta \zeta = - \zeta \zeta \eta = \zeta \zeta \eta\), thus vanishes. Generally, a polynomial of \(n\) Grassmann variables has terms with no more than \(n\) factors.
A function can be defined via its Taylor expansion in real number. So, the exponential function for Grassmann number is defined by
If \(\zeta\) is a single Grassmann variable, we have \(\exp (\zeta) = 1 + \zeta\) since other terms are all vanishing. This linearity, however, breaks when consider more Grassmann variables. For example, consider \(\exp \left( \sum_{i = 1}^n \zeta_i \eta_i \right)\) where \(\zeta\) and \(\eta\) are multiple Grassmann variables. The maximal order is
since \(\zeta_i \eta_i\) and \(\zeta_j \eta_j\) is commutative.
Now, we introduce the integral on Grassmann variables. The integral is a linear operator, defined by
The integral measures \(\{ \mathrm{d} \zeta_i |i = 1, \ldots, n \}\) are also anti-commutative, namely
So, as an example, we have
\(\displaystyle \int \mathrm{d} \eta_n \int \mathrm{d} \zeta_n \cdots \int \mathrm{d} \eta_1 \int \mathrm{d} \zeta_1 \exp \left( \sum_{i = 1}^n \zeta_i \eta_i \right) = \cdots + \left[ \int \mathrm{d} \eta_n \int \mathrm{d} \zeta_n \cdots \int \mathrm{d} \eta_1 \int \mathrm{d} \zeta_1 (\zeta_1 \eta_1 \cdots \zeta_n \eta_n) \right] = 1, \) | (4.4) |
where the \([\cdots]\) terms are all vanishing because they do not have sufficiently many Grassmann variables.
Next, we investigate how linear transformation effects the integral measure. To do so, consider a \(d\)-dimensional Grassmann variable, in which each component is a single Grassmann variable, \(\zeta = (\zeta^1, \ldots, \zeta^d)\). If we take the linear transformation
then the product
where the summation includes all permutations of \((\beta_1, \ldots, \beta_d)\). If we re-arrange the \(\zeta^{\beta_1} \cdots \zeta^{\beta_d}\) factor to \(\zeta^1 \cdots \zeta^d\), then there comes a factor \((- 1)^{\operatorname{sign} (\beta)}\), where \(\operatorname{sign} (\beta)\) is the signature of the permutation \((\beta_1, \ldots, \beta_d)\). For example, \((2, 1, 3, 4, \ldots, d)\) has signature \(1\) since by only one permutation \(1 \leftrightarrow 2\), can we recover the natural order \((1, 2, 3, 4, \ldots, d)\). So,
The right hand side is recognized as \(\)\(\zeta^1 \cdots \zeta^d \times \det A\). On the other hand, since both \(\int \mathrm{d} \zeta_d \cdots \int \mathrm{d} \zeta_1 (\zeta_1 \cdots \zeta_d)\) and \(\int \mathrm{d} \zeta'_d \cdots \int \mathrm{d} \zeta'_1 (\zeta'_1 \cdots \zeta'_d)\) results in one, namely
we find the transformation of integral measure, as
\(\displaystyle \int \mathrm{d} \zeta'_d \cdots \int \mathrm{d} \zeta'_1 = \det A \times \int \mathrm{d} \zeta_d \cdots \int \mathrm{d} \zeta_1 . \) | (4.5) |
After introducing Grassmann variable and the integration on multiple such variables, we come to the formula of Berezin integral, which is an analogy of Gaussian integral for Grassmann variables. Consider the Gaussian-like integral
where \(A \in \mathbb{R}^{d \times d}\) (or \(\mathbb{C}^{d \times d}\)) and we have briefly denoted
\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta := \int \mathrm{d} \zeta_d \int \mathrm{d} \eta_d \cdots \int \mathrm{d} \zeta_1 \int \mathrm{d} \eta_1 . \) | (4.6) |
Defining \(\zeta'_{\beta} := A_{\alpha \beta} \zeta^{\alpha}\) and using equation 4.5, it becomes
The rightmost factor has been evaluated in equation 4.4, which results in \(1\), thus
\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta \exp (A_{\alpha \beta} \zeta^{\alpha} \eta^{\beta}) = \det A. \) | (4.7) |
where
In this section, we are to formulate the master equation into path integral. The path integral formulation was found by Paul Dirac in 1933 who was trying to using Lagrangian in quantum mechanism. It was then developed by physicist Richard Feynman and mathematician Mark Kac in 1947. Now, path integral is applied not only to quantum field theory, but also many other areas such as stochastic process. Path integral generally has the formalism
\(\displaystyle \int_{\mathbb{R}^d} \mathrm{d} x_1 \cdots \int_{\mathbb{R}^d} \mathrm{d} x_n \exp (- S (x_1, \ldots, x_n)), \) | (4.8) |
where a series \((x_1, \ldots, x_n)\) is called a “path”, and the \(S\) is called the “action” of path. Apparently, it is an integral of all possible paths, thus named as “path integral”.
To derive a path integral formulation for master equation, we follow the standard derivation of path integral for quantum mechanism.
4.2. For example, section 1.2 of Quantum Field Theory in a Nutshell, by A. Zee, second edition.
\(\displaystyle p (x_N, N \Delta t) = \int_{\mathcal{X}} \mathrm{d} x_{N - 1} \cdots \int_{\mathcal{X}} \mathrm{d} x_0 q_{\Delta t} (x_N |x_{N - 1}) \cdots q_{\Delta t} (x_1 |x_0) p (x_0, 0) . \) | (4.9) |
The next step is making \(\Delta t\) small and re-expressing \(q_{\Delta t} (x_{i + 1} |x_i)\) in exponential. This, however, cannot be straight-forward since the leading term of \(q_{\Delta t} (x|y)\) is \(\delta (x - y)\) which cannot be converted into exponential. But, we can consider its Fourier transform, since \(\delta (x - y)\), if regarding as a Dirac's delta function, has Fourier coefficient \(\exp (- \mathrm{i} k_{\alpha} (x^{\alpha} - y^{\alpha}))\). This suggest us to consider the Fourier transform of transition rate (as discussed in section 2.3, transition rate has completely determined the Markovian process), as
This forces the alphabet \(\mathcal{X}\) to be Euclidean space \(\mathbb{R}^d\), because we cannot perform the same thing on Kronecker's delta when the alphabet is discrete, or when the alphabet is continuous but not Euclidean. Then, we have
Now, the \([\cdots]\) part can be converted into exponential as \(\exp (\hat{r} (x, k) \Delta t + \omicron (\Delta t))\), so that
We are to integrate over \(k\). If we Taylor expand \(\hat{r} (x, k)\) by \(\mathrm{i} k\) at \(k = 0\), then the expansion coefficient is
We meet the moments of transition rate again (it first appears in section 3.2). Thus, we have
where the zeroth order term vanishes because of the property of transition rate \(\int_{\mathbb{R}^d} \mathrm{d} \epsilon r (x + \epsilon |x) = 0\). Then, we find \(q_{\Delta t} (x + \epsilon |x)\) to be
\(\displaystyle \int_{\mathbb{R}^d} \mathrm{d} k \exp \left( [K_1^{\alpha} (x) \Delta t - \epsilon^{\alpha}] \mathrm{i} k_{\alpha} - \frac{\Delta t}{2!} K_2^{\alpha \beta} (x) k_{\alpha} k_{\beta} - \frac{\mathrm{i} \Delta t}{3!} K_3^{\alpha \beta \gamma} (x) k_{\alpha} k_{\beta} k_{\gamma} + \cdots + \omicron (\Delta t) \right) . \) | (4.10) |
The summation terminates at the cut-off \(N_{\operatorname{cut}}\) of \(K_n\) (for the necessity of cut-off, see section 3.2).
This integral is complicated except for \(N_{\operatorname{cut}} = 2\) where it becomes a Gaussian integral, and the Markovian process deduces to a Langevin process, defined in section 3.6. In this situation, we have (re-denote \(K_1\) by \(f\) and \(K_2\) by \(\Sigma\) as in section 3.6)
As proved in the same section, \(\Sigma\) is everywhere positive definite. Then by the formula of Gaussian integral, which holds for any positive definite matrix \(A\),
we find (replace \(A_{\alpha \beta}\) by \(\Sigma^{\alpha \beta} (x) \Delta t\) and \(b_{\alpha}\) by \(\mathrm{i} [f^{\alpha} (x) \Delta t - \epsilon^{\alpha}]\))
It indicates that the transition density of Langevin process is approximately Gaussian.
To match the path integral formulism 4.8, in which all integration variables are in the exponential, we have to convert the \([\det \Sigma (x)]^{- 1 / 2}\) factor into exponential too. To do so, we first use Cholesky factorization to remove the root, and then use Berezin integral to convert determinant to exponential.
To introduce Cholesky factorization, we fix the argument \(x\) and omit it for simplicity, so \(\Sigma (x)\) is written as \(\Sigma\). Since \(\Sigma\) is symmetric and positive definite, we can diagonalize it using an orthogonal matrix \(E\) as \(\Sigma = E^T \Lambda E\), where the diagonal \(\Lambda_{\alpha \beta} = \delta_{\alpha \beta} \lambda_{\beta}\) with \(\lambda_{\beta} > 0\). Define \(\sqrt{\Lambda}_{\alpha \beta} := \delta_{\alpha \beta} \sqrt{\lambda_{\beta}}\), thus \(\Lambda = \sqrt{\Lambda}^T \sqrt{\Lambda}\), and \(\Sigma = C^T C\) where \(C := \sqrt{\Lambda} E\). We thus factorize \(\Sigma\) into the “square” of \(C\). This was first discovered by French military officer André-Louis Cholesky, who was killed in battle a few months before the end of World War I, dead at age 31. Instead of the matrix-valued field \(C\), we use its inverse (since both \(E\) and \(\sqrt{\Lambda}\) are invertible) \(R := C^{- 1}\), thus \(\Sigma^{- 1} = R^T R\), or \((\Sigma^{- 1})_{\alpha \beta} = \sum_{\gamma = 1}^d R_{\gamma \alpha} R_{\gamma \beta}\). So, we have (insert the omitted \(x\) again)
and thus
Now, the determinant gets rid of root (and fraction). Remark that \(R (x)\) may not be a symmetric matrix.
Next, we use Berezin integral to convert the determinant \(\det R (x)\) to exponential. Recall the formula of Berezin integral for any \(A \in \mathbb{R}^{d \times d}\) (introduced in section 4.3),
where \(\zeta\) and \(\eta\) are multi-dimensional Grassmann variables, and the integral measure \(\int \mathrm{d} \zeta \mathrm{d} \eta\) is defined by equation 4.6. Replacing \(A\) by \(R (x) \Delta t\), we find
we convert the determinant into exponential, as
In physics, the Grassmann variables \(\zeta\) and \(\eta\) are called “ghost variables”.
Plugging back to equation 4.9, we get
where the integral measure is defined by
and the “action” of Langevin process is
\(\displaystyle S (x, \zeta, \eta) := \sum_{i = 0}^{N - 1} \Delta t \left\{ \frac{1}{2} \sum_{\alpha = 1}^d \left[ R_{\alpha \beta} (x_i) \left( \frac{x_{i + 1}^{\beta} - x_i^{\beta}}{\Delta t} - f^{\beta} (x_i) \right) \right]^2 + R_{\alpha \beta} (x_i) \zeta^{\alpha}_i \eta_i^{\beta} + \omicron (1) \right\} . \) | (4.11) |
The \(C := (d / 2) (\ln 2 \pi - 3 \ln \Delta t)\) is independent of \(x\), \(\zeta\), or \(\eta\), thus is regarded as constant.
When \(R (x)\) is constant, we can choose a proper coordinates of \(x\) such that \(R\) becomes identity matrix, thus the ghost variables disappear since \(\int \mathrm{d} \zeta \int \mathrm{d} \eta \exp \left( \sum_{\alpha} \zeta^{\alpha} \eta^{\alpha} \right) = 1\). In this situation, the \(S\) becomes
\(\displaystyle S (x) := \sum_{i = 0}^{N - 1} \Delta t \left\{ \frac{1}{2} \sum_{\alpha = 1}^d \left[ \left( \frac{x_{i + 1}^{\alpha} - x_i^{\alpha}}{\Delta t} - f^{\alpha} (x_i) \right) \right]^2 + \omicron (1) \right\}, \) | (4.12) |
and accordingly, the integral measure becomes \(\int D (x) := \int_{\mathbb{R}^n} \mathrm{d} x_0 \cdots \int_{\mathbb{R}^n} \mathrm{d} x_{N - 1}\). The expression is greatly simplified when \(R\) is constant.
The final step is formally taking \(\Delta t \rightarrow 0\) and \(N \rightarrow + \infty\) while keeping \(N \Delta t\) finite, and replacing the discrete index \(i\) with a continuous variable \(t \in [0, t_f]\), the action becomes
in which \((\mathrm{d} x / \mathrm{d} t) (t) := (x_{i + 1} - x_i) / \Delta t\). These are formal definitions, and we will never really take the limit \(\Delta t \rightarrow 0\) and \(N \rightarrow + \infty\) since it is numerically not well-defined. For example, the constant \(C\) diverges, so is the \((\mathrm{d} x / \mathrm{d} t) (t)\).
As a summary, we find the master equation of Langevin process can be formulated as a path integral, in which the action is given by equation 4.11, in which \(\zeta\) and \(\eta\) are Grassmann variables.
Final remark on cut-off \(N_{\operatorname{cut}} = 2\). If choose \(N_{\operatorname{cut}} > 2\), it is hard to see how to integrate the improper integral 4.10, and even to show why it is finite. For example, if \(N_{\operatorname{cut}} = 4\), the \((\Delta t / 4!) K^{\alpha \beta \gamma \sigma}_4 (x) k_{\alpha} k_{\beta} k_{\gamma} k_{\sigma}\) term will dominate the integral when \(k\) is far from origin. But we cannot ensure that this term will suppress the integrand as \(\| k \| \rightarrow + \infty\) so as to make the improper integral finite. We cannot even diagonalize the fourth order symmetric tensor \(K_4 (x)\) (because diagonalizing a fourth order symmetric tensor has \(\mathcal{O} (d^4)\) restrictions, but a coordinate transformation has only \(\mathcal{O} (d^2)\) degrees of freedom, so this cannot be done except for specific situations).
Dynamics in classical mechanics are always deterministic. That is, once the initial conditions (for initial value problem) or the boundaries (for boundary value problem) are fixed, then the path is fully determined, in which randomness is forbidden. There are, however, many phenomena in nature that have intrinsic randomness. For example, molecular movement obeys a normal distribution with variance proportional to time interval. The dynamics of starling flocks also has intrinsic randomness, which is the “free will” of each bird, so is ant colony, human society, and any interactive system in which each element has some level of intrinsic uncertainty. For these cases, the real world datum is not simply a path, but a distribution of path. Precisely, we use a distribution \(Q\) to describe real world phenomenon that has intrinsic randomness.
This is what we have derived in section 4.4. From probabilistic perspective, the right hand side of equation 4.9 can be viewed as marginalizing the random variables \((X_0, \ldots, X_{N - 1})\), and the product \(q (x_N |x_{N - 1}) \cdots q (x_1 |x_0)\) can be seen as the density function of \((X_1, \ldots, X_N)\), where we have omitted the subscript \(\Delta t\) for simplicity. To see this clearly, we first notice that \(q (x_2 |x_1) = q (x_2 |x_0, x_1)\) holds for any \(x_0\), because \(q (x_2 |x_1)\) is not explicitly dependent on \(x_0\). Then, we have \(q (x_2 |x_1) q (x_1 |x_0) q (x_0) = q (x_2 |x_0, x_1) q (x_1 |x_0) q (x_0)\). Repeatedly using the definition of conditional density, it becomes \(q (x_2 |x_0, x_1) q (x_0, x_1) = q (x_0, x_1, x_2)\). Dividing \(q (x_0)\) on both sides, we get \(q (x_1, x_2 |x_0) = q (x_2 |x_1) q (x_1 |x_0)\). Repeating this step, we will find
recognized as a conditional density of random variables \((X_1, \ldots, X_N)\) given \(x_0\). If we regard the series \((x_1, \ldots, x_N)\) as a “movie” of evolution of the stochastic system, in which each \(x_i\) can be seen as a “frame”, then the density function \(q\) characterizes the distribution of evolution.
For any density function \(q (x)\) with \(x \in \mathbb{R}^d\), we can always define, up to an arbitrary constant,
\(\displaystyle S (x) := - \ln q (x) +\operatorname{const}. \) | (4.13) |
Thus, \(q (x) = \exp (- S (x)) / Z\) where \(Z := \int \mathrm{d} x \exp (- S (x))\). This \(S\) has some properties that can be analog to the action in classical mechanics. Indeed, by plugging in the definition of \(S\), we find
The integrand of the right most expression is a divergence, so it results in a boundary integral. But since \(q\), as a density function, is normalized, the boundary integral shall vanish as \(\| x \| \rightarrow + \infty\). So, we conclude that
This is analog to equation 4.3, where the minimum \(x_{\star}\) is replaced by the expectation \(\mathbb{E}_Q\). Secondly, distribution \(Q\) most likely samples (recall section 1.1 that distribution has a sampler) the \(x\) that maximizes \(q\), thus minimizes \(S\). For these reasons, we illustrate the \(S\) defined by \(q\) as the action of \(Q\). Contrary to the action in classical mechanics, the \(S\) here is completely determined by the real world distribution \(Q\) (because it is defined by the density function \(q\)), without any redundancy. This is the direct implication that distribution involves more information than its most likely datum.
Given a collection of real world data, we are to find a distribution that fits the data. These data can be seen as samples from an unknown distribution which characterizes the real world. We are to figure out a method to fit the real world distribution by given some samples of it.
Let \(P (\theta)\) represent a distribution parametrized by \(\theta \in \mathbb{R}^m\). Its alphabet \(\mathcal{X}\) is either discrete or continuous. From its density function, \(p (\cdot, \theta)\), we get a parameterized action \(S (\cdot, \theta)\) such that
\(\displaystyle p (x, \theta) = \exp (- S (x, \theta)) / Z (\theta), \) | (4.14) |
where \(Z (\theta) = \int \mathrm{d} x \exp (- S (x, \theta))\) for ensuring the normalization \(\int \mathrm{d} x p (x, \theta) = 1\). This is consistent with the action defined by equation 4.13, except that the action here is parameterized, and that we omit the constant \(\beta\) since it is irrelevant throughout this section.
What we have is a collection of data, sampled from an unknown distribution \(Q\). And we are to adjust the parameters \(\theta\) so that \(P (\theta)\) approximates \(Q\). To do so, we minimize the relative entropy between \(Q\) and \(P (\theta)\), which is defined as \(H (Q, P (\theta)) := \int \mathrm{d} x q (x) \ln (q (x) / p (x, \theta))\). This expression is formal. Since we do not know the density function of \(Q\), all that we can do with \(Q\) is computing the expectation \(\mathbb{E}_Q [f] = (1 / | Q |) \sum_{x \in Q} f (x)\) for any function \(f\), where we use \(Q\) as a set of data. With this realization, we have, after plugging equation 4.14 into \(H (Q, P (\theta))\),
By omitting the \(\theta\)-independent terms, we get the loss function
The parameters that minimize \(L (\theta)\) also minimize \(H (Q, P (\theta))\), and vice versa. We can find the \(\theta_{\star} := \operatorname{argmin}L\) by iteratively updating \(\theta\) along the direction \(- \partial L / \partial \theta\). To calculate \(- \partial L / \partial \theta\), we start at
The first term is recognized as \(-\mathbb{E}_Q [\partial S / \partial \theta^{\alpha}]\). For the second term, since \(Z (\theta) = \int_{\mathcal{X}} \mathrm{d} x \exp (- S (x, \theta))\), we have
where in the last equality, we used the definition of \(p (x, \theta)\) (the green factor). This final expression is just the \(\mathbb{E}_{P (\theta)} [\partial S / \partial \theta^{\alpha}]\). Altogether, we arrive at
\(\displaystyle - \frac{\partial L}{\partial \theta^{\alpha}} (\theta) =\mathbb{E}_{P (\theta)} \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta) \right] -\mathbb{E}_Q \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta) \right] . \) | (4.15) |
At the minimum, we shall have \(\partial L / \partial \theta = 0\). Then, we find that \(\theta_{\star}\) obeys
\(\displaystyle \mathbb{E}_{P (\theta_{\star})} \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta_{\star}) \right] =\mathbb{E}_Q \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta_{\star}) \right] . \) | (4.16) |
It can be read from equation 4.15 that minimizing \(L\) is to increase \(S (\cdot, \theta)\) on the sampled points (the first term) while decrease it on data points (the second term). As figure 4.1 illustrates, this way of optimization will site real world data onto local minima of \(S (\cdot, \theta)\), in statistical sense.
In this way, we find an analytical distribution \(P (\theta)\) that approximates the empirical distribution \(Q\). The \(S (\cdot, \theta)\) that defines \(P (\theta)\) describes the interaction between the different components of an entity. This entity may be of physics, like a collection of particles. But it can also be words, genes, flock of birds, and so on.
As an example, if we want to get the action that characterizes the stochastic dynamics of starling flocks, we take movies for many flocks. Each movie is a series of frames that log the positions of each bird at each time instant. These movies provide the real world data. The parameterized action \(S\) can be expressed by a neural network. Then, iterating by equation 4.15 until \(\| \partial L / \partial \theta \|\) has been small enough gives an \(S (\cdot, \theta_{\star})\) that mimics the stochastic dynamics of starling flocks. To compute the expectation \(\mathbb{E}_{P (\theta)} [\ldots]\) in equation 4.15, we can employ Monte-Carlo simulation with the transition rate satisfying detailed balance condition with \(P (\theta)\) as the stationary distribution. For discrete random variables, Monte-Carlo simulation with Metropolis-Hastings algorithm (section 2.7) is available; and for continuous random variables, Langevin dynamics (section 3.8) will be more efficient.
We are to determine how far information will propagate in Langevin process. For this kind of problem, physicists have invented a technique called renormalization group. This technique was first proposed by Murray Gell-Mann and Francis Low in 1954, applied to quantum field theory of fundamental particles. Following this research, Kenneth Wilson, who was a PhD student of Gell-Mann, started his malathion in 1961. He published his first paper on renormalization group eight years later, in 1969. This technique was then further developed and applied to many areas in and even out of physics, such as biology, society, and finance.
To show how it works, we start with an action that is generalized from action 4.12, as
\(\displaystyle S (x) = \sum_{i = - \infty}^{+ \infty} \sum_{\alpha = 1}^d \left[ \frac{(x^{\alpha}_{i + 1} - x^{\alpha}_i)^2}{2 \epsilon} - (x^{\alpha}_{i + 1} - x^{\alpha}_i) \varphi^{\alpha} (x_{i + 1}, x_i) + \epsilon \xi^{\alpha} (x_{i + 1}, x_i) + \omicron (\epsilon) \right], \) | (4.17) |
where \(\varphi, \xi : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^d\). Comparing with action 4.12, we find \(\epsilon = \Delta t\), \(\varphi (x_{i + 1}, x_i) = f (x_i)\), and \(\xi (x_{i + 1}, x_i) = f^2 (x_i) / 2\). There are another two differences between them. Here, we do not fix boundary (namely, the fixed boundary \(x_N\) in action 4.12, resulted from master equation 4.9), and let the index \(i\) run from \(- \infty\) to \(+ \infty\) rather than from \(0\) to \(N\). As we will see later in this section, these generalizations are crucial for deriving renormalization group.
To add the boundaries back and restrict the range of index \(i\), all we need to do is calculating the expectation \(\int \mathrm{d} x_0 \int \mathrm{d} x_N \exp (- S (x)) \delta (x_0 - y_0) \delta (x_N - y_N)\), which fixed the boundaries at \(x_0 = y_0\) and \(x_N = y_N\). The Dirac's delta functions also separate the indices before \(0\), those between \(0\) and \(N\), and those after \(N\). In this way, the indices are restricted between \(0\) and \(N\).
Renormalization group technique bases on the fact that there are as many even numbers as integers. This is a famous result that was first claimed by George Cantor. For our purpose, we marginalize all the variable \(x_i\) in \(q (x)\) where \(i\) is odd. Namely, we are to compute an “effective action” \(S'\) defined by
\(\displaystyle S' (x) := - \ln \left[ \prod_{i \in \mathbb{Z}} \int_{\mathbb{R}^d} \mathrm{d} x_{2 i + 1} \exp (- S (x)) \right], \) | (4.18) |
where \(S' (\ldots, x_{- 4}, x_{- 2}, x_0, x_2, x_4, \ldots)\) contains only the variables with even index. Interestingly, it is to be revealed that, by a proper re-scaling of \(x\), \(S'\) has exactly the same format as \(S\).
Given \(i\), we are to show how to marginalize \(x_{2 i + 1}\). This variables appear in two terms in action 4.17, with indices \(2 i + 1\) and \(2 i\). So, we are to integrate \(\int_{\mathbb{R}^d} \mathrm{d} x_{2 i + 1} \exp \left( \sum_{\alpha} J^{\alpha} \right)\) where
This integral is hard to calculate. A general strategy is using perturbative method. In our situation, \(\epsilon\) serves as the small quantity for perturbation.
First, we have an algebraic identity
4.3. Directly, expand \((x - y)^2 + (y - z)^2 = x^2 - 2 x y + y^2 + y^2 - 2 y z + z^2\). Then, collect the \(y\) terms together, as \(2 (y^2 - (x + z) y) = 2 (y - (x + z) y + (x + z)^2 / 4) - (x + z)^2 / 2 = 2 (y - (x + z) / 2)^2 - (x + z)^2 / 2\), in which the last term can be further combined with the rest terms \(x^2 + z^2\), as \(- (x + z)^2 / 2 + x^2 + z^2 = (x - z)^2 / 2\). Altogether, we find
If replace \(x \rightarrow x^{\alpha}_{2 i}\), \(y \rightarrow x^{\alpha}_{2 i + 1}\), and \(z \rightarrow x^{\alpha}_{2 i + 2}\), then we get what we need.
Remark that the second term looks like the first term in action 4.17, except for an \(1 / 2\) factor. Then, \(J^{\alpha}\) becomes
The first term is a quadratic form of \(x_{2 i + 1}\). It suggests that we shall treat the integral as a perturbation to the Gaussian integral, and use perturbative method to integrate it out. Following this strategy, we define \(\bar{x}_{2 i + 1} := (x_{2 i + 2} + x_{2 i}) / 2\) and \(y := x_{2 i + 1} - \bar{x}_{2 i + 1}\). \(\)And the integral becomes
It means the \(y\) obeys a normal distribution with zero mean and diagonal covariance \(\Sigma_{\alpha \beta} = (\epsilon / 2) \delta_{\alpha \beta}\). We have a rough estimation \(y = \mathcal{O} \left( \sqrt{\epsilon} \right)\). Following the standard perturbative method, we find (details given in appendix 4.7.4)
where we have defined
\(\displaystyle x'_i := x_{2 i} / \sqrt{2}, \) | (4.19) |
\(\displaystyle \varphi' (x'_{i + 1}, x'_i) := \frac{1}{\sqrt{2}} [\varphi (\bar{x}_{2 i + 1}, x_{2 i}) + \varphi (x_{2 i + 2}, \bar{x}_{2 i + 1})], \) | (4.20) |
and
\(\displaystyle \begin{array}{rl} \xi' (x'_{i + 1}, x'_i) := & \xi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) + \xi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})\\ - & \frac{1}{2} \partial_{\alpha} \varphi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) - \partial_{\alpha}' \varphi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})\\ - & \frac{1}{8} (x^{\alpha}_{2 i + 2} - x^{\alpha}_{2 i}) [\Delta \varphi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) + \Delta' \varphi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})]\\ - & \frac{1}{4} [\varphi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) - \varphi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})]^2\\ - & \frac{1}{4} \sum_{\alpha' = 1}^d (x^{\alpha}_{2 i + 2} - x^{\alpha}_{2 i}) [\partial_{\alpha'} \varphi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) + \partial_{\alpha'}' \varphi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})] \times\\ & \times [\varphi^{\alpha'} (\bar{x}_{2 i + 1}, x_{2 i}) - \varphi^{\alpha'} (x_{2 i + 2}, \bar{x}_{2 i + 1})]\\ - & \frac{1}{16} \sum_{\alpha', \beta = 1}^d (x^{\alpha}_{2 i + 2} - x^{\alpha}_{2 i}) [\partial_{\beta} \varphi^{\alpha} (\bar{x}_{2 i + 1}, x_{2 i}) + \partial_{\beta}' \varphi^{\alpha} (x_{2 i + 2}, \bar{x}_{2 i + 1})] \times\\ & \times (x^{\alpha'}_{2 i + 2} - x^{\alpha'}_{2 i}) [\partial_{\beta} \varphi^{\alpha'} (\bar{x}_{2 i + 1}, x_{2 i}) + \partial_{\beta}' \varphi^{\alpha'} (x_{2 i + 2}, \bar{x}_{2 i + 1})] . \end{array} \) | (4.21) |
So, up to an irrelevant constant term,
\(\displaystyle S' (x') = \sum_{i = - \infty}^{+ \infty} \sum_{\alpha = 1}^n \left[ \frac{1}{2 \epsilon} \left( {x'}^{\alpha}_{i + 1} {- x'}^{\alpha}_i \right)^2 - \left( {x'}^{\alpha}_{i + 1} {- x'}^{\alpha}_i \right) \varphi^{\prime \alpha} (x'_{i + 1}, x'_i) {+ \epsilon \xi'}^{\alpha} (x_{i + 1}', x_i') + \omicron (\epsilon) \right], \) | (4.22) |
which has exactly the same format as \(S\) (equation 4.17). The iterative equations 4.19, 4.20, 4.21, and 4.22 are called renormalization group equations.
Now we use the renormalization group equations to study the information propagation qualitatively. The distance that information propagates is reflected by the “normalized” correlation (also called Pearson coefficient)
As usual, the covariance is given by
in which the expectation is
The denominators in covariance and expectation are used for eliminating the irrelevant constant term in \(S\). If information can propagate between indices \(0\) and \(2^n\), then the correlation \(\operatorname{Corr} (X_0^{\alpha}, X^{\beta}_{2^n})\) cannot not tend to zero.
To calculate correlation, we first marginalize the indices between \(0\) and \(2^n\) using renormalization group equations, which requires \(n\) iterations. At each iteration, for example, the expatiation becomes
where the factor \(\sqrt{2}\) comes from \(x_{2 i} = \sqrt{2} x'_i\).
The iteration of \(S'\) can be very complicated such that only numerical computation is possible. But, there is a special situation where the iteration result in a fixed point, namely \(\varphi' = \varphi\) and \(\xi' = \xi\), thus \(S' = S\) after an iteration. The expression of action then keeps invariant during iteration, so is the correlation. The system on the fixed point is called scale-invariant or scale-free. In a scale-invariant system, information can be propagated to infinity without loosing, an optimal communication system. Because of this, scale-invariant systems are expected to be everywhere in Nature.
Using \(x_{2 i + 1} = y + \bar{x}_{2 i + 1}\), the second line in \(J^{\alpha}\) becomes
Since \(\bar{x}_{2 i + 1} - x_{2 i} = x_{2 i + 2} - \bar{x}_{2 i + 1} = (x_{2 i + 2} - x_{2 i}) / 2\), it turns to be
Taylor expansion by \(y\) results in (recall the estimation \(y = \mathcal{O} \left( \sqrt{\epsilon} \right)\))
where we have denoted \(\partial \varphi\) as the partial derivative taken on the first argument, and \(\partial' \varphi\) on the second. Notice that we have used Einstein's convention in this expansion (see the conventions in chapter 3), hiding the summations of indices \(\beta\) and \(\gamma\). Also up to \(\omicron (\epsilon)\), the third line simply \(\operatorname{becomes}\)
Altogether, the integral becomes
with the “independent part” (the color is for later usage)
which is independent of \(y\), and the “interactive part”
which depends on \(y\) and is \(\mathcal{O} \left( \sqrt{\epsilon} \right)\). We Taylor expands \(\exp \left( \sum_{\alpha} V^{\alpha} (y) \right)\) as
where \(\mathbb{E}_Y [\ldots]\) represents the Gaussian integral of \(y\). We will neglect the constant factor \((\pi \epsilon)^{- d / 2}\), so that \(\mathbb{E}_Y [1] = 1\). This constant factor can be absorbed into the action as an irrelevant constant term. Plugging in the definition of \(V^{\alpha} (y)\), together with \(\mathbb{E}_Y [y^{\alpha}] = 0\) and \(\mathbb{E}_Y [y^{\alpha} y^{\beta}] = (\epsilon / 2) \delta_{\alpha \beta}\), we get (color for later usage)
where the Laplacian \(\Delta := \sum_{\alpha = 1}^n \partial_{\alpha} \partial_{\alpha}\), and the same \(\Delta' := \sum_{\alpha = 1}^n \partial'_{\alpha} \partial'_{\alpha}\). Next, we first notice that \(\mathbb{E}_Y \left[ \left( \sum_{\alpha = 1}^n V^{\alpha} (y) \right)^2 \right] = \sum_{\alpha, \alpha' = 1}^n \mathbb{E}_Y [V^{\alpha} (y) V^{\alpha'} (y)]\), where
Thus (color for later usage)
Plugging all these back to the integral, up to an irrelevant constant term, we get (color indicates where the term comes from)
Define \(x'_i := x_{2 i} / \sqrt{2}\) and then \(\varphi' (x'_{i + 1}, x'_i) := [\varphi (\bar{x}_{2 i + 1}, x_{2 i}) + \varphi (x_{2 i + 2}, \bar{x}_{2 i + 1})] / \sqrt{2}\), it becomes
where we have collected all \(\mathcal{O} (\epsilon)\) terms into \(\epsilon \xi' (x'_{i + 1}, x'_i)\), in which
As an example, consider the linear function \(f_{\alpha} (x) = A_{\alpha \beta} x^{\beta}\), where \(A \in \mathbb{R}^{d \times d}\). Then, initially we have \(\varphi_{\alpha} (x_{i + 1}, x_i) = A_{\alpha \beta} x_i^{\beta}\) and \(\xi_{\alpha} (x_{i + 1}, x_i) = (A_{\alpha A \beta} x_i^{\beta})^2 / 2\). We focus on the iteration of \(\varphi\) (namely, equation 4.20). The generic form of \(\varphi\) is
where \(u\) and \(v\) are coefficients (“couplings” in physics). With this generic form, we find (recall that \(\bar{x}_{2 i + 1} = (x_{2 i + 2} + x_{2 i}) / 2\))
Inserting the \(x_i' = x_{2 i} / \sqrt{2}\), we get
So, we have the iterations
It has fixed point at \(u_{\star} = (3 u_{\star} + v_{\star}) / 2\) and \(v_{\star} = (3 v_{\star} + u_{\star}) / 2\), which have solution \(u_{\star} = - v_{\star}\). This fixed point, however, is not stable. As it is shown in figure 4.2, a tiny departure from the line \(u + v = 0\) will push the \((u, v)\) away from the fixed point. So, a linear system cannot arrive at the fixed point, thus cannot be scale-invariant.