Prologue
1 Relative Entropy
1.1 Conventions in This Chapter
1.2 A Brief Review of Probability
1.3 Shannon Entropy Is Plausible for Discrete Random Variable
1.4 Shannon Entropy Fails for Continuous Random Variable
1.5 Relative Entropy Is the Unique Solution to the Axioms
2 Master Equation and Detailed Balance
2.1 Conventions in This Chapter
2.2 Master Equation Describes the Evolution of Markov Process
2.3 Transition Rate Determines Transition Density
2.4 Detailed Balance Provides Stationary Distribution
2.5 Detailed Balance with Connectivity Monotonically Reduces Relative Entropy
2.6 Monte-Carlo Simulation and Guarantee of Relaxation
2.7 Example: Metropolis-Hastings Algorithm
3 Kramers-Moyal Expansion and Langevin Process
3.1 Conventions in This Chapter
3.2 Cut-off in the Moments of Transition Rate Is Essential for Spatial Smoothness
3.3 Transition Rate Is Determined by Its Moments
3.4 Randomness Is Absent in the First Moment of Transition Rate
3.5 Randomness Appears in the Second Moment of Transition Rate
3.6 Langevin Process Is a Markovian Process with \(N_{\operatorname{cut}} = 2\)
3.7 Transition Density of Langevin Process Is Nearly Gaussian
3.8 Stationary Solution of Langevin Process Has Source-Free Degree of Freedom
3.9 Detailed Balance of Langevin Process Lacks Source-Free Degree of Freedom
4 Least-Action Principle
4.1 Conventions in This Chapter
4.2 A Brief Review of Least-Action Principle in Classical Mechanics
4.3 Least-Action Principle of Distribution Has No Redundancy
4.4 Data Fitting Is Equivalent to Least-Action Principle of Distribution
4.5 \(\clubsuit\) Example: Least-Action Principle in Supervised Machine Learning
5 Path Integral
5.1 Conventions in This Chapter
5.2 Markovian Process with Euclidean Alphabet Can Be Formulated as Path Integral
5.2.1 \(\clubsuit\) Estimation of the Residue
5.3 Langevin Process with Constant Covariance Has a Path Integral on Alphabet
5.4 \(\clubsuit\) Grassmann Variable, Berezin Integral, and Ghosts
Epilogue
This is a little book about stochastic processes. We start with the axiomatic system of information (chapter 1). Then, using information, we introduce the continuous time Markovian process and show how it relaxes to equilibrium (chapter 2). We then move on to add spatial smoothness to the Markovian process, which yields many interesting results, including the Langevin process (chapter 3). Finally, we generalize the least-action principle to distributions and compare data fitting with the least-action principle (chapter 4). Readers may omit the sections whose titles start with \(\clubsuit\); they are interesting digressions.
The mathematical techniques employed here do not go beyond basic calculus (Taylor expansion, improper integrals, and integration by parts) and linear algebra (equivalence relations and equivalence classes, matrix manipulations, orthogonal diagonalization, and determinants). Knowing basic probability theory (the normal distribution and the Gaussian integral) will be beneficial. We try to make the book self-contained, introducing a new concept or technique only when it is essential. Statements like “obviously...” and “apparently...” are avoided, and we try to display all the steps of a calculation without omitting any of them.
For each section, the title is a sentence that briefly summarizes the whole section. We use bold font for definitions and italic font for emphasis. Only important equations are numbered. So, readers can quickly skim through by reviewing section titles, bold and italic text, and numbered equations.
Finally, this book was written with TeXmacs and is released under the GFDL-1.3 license. You can contact me by e-mail: shuiruge@whu.edu.cn.
In this chapter, we discuss the axiomatization of information. By figuring out how Shannon entropy fails for continuous random variables, we build a proper axiomatic system, on which the expression of information is found to be unique.
Quantities that are not deterministic are denoted by capital letters. But a capital letter may also denote something that is determined. For example, a random variable has to be denoted by a capital letter, like \(X\), while we can also use \(F\) to denote something determined, such as a functional.
Perhaps the best analogy for probability is a deterministic one. Consider a pile of sand distributed on a table. At each location of the table, there is a density of sand; and for each area of the table, we can compute the total mass of sand in that area. When we move some portion of sand from one area to another on the table, the total mass of sand stays constant. Even though the pile-of-sand model lacks any randomness, it does furnish a visual picture for the main concepts that constitute probability.
The set of all possible values of a random variable is called the alphabet (in our analogy, the table).
1.1. Many textbooks call it the sample space. But “space” usually hints at extra structure, such as a vector space or a topological space. So, we use “alphabet” instead, following David MacKay. (See his remarkable book Information Theory, Inference, and Learning Algorithms, section 2.1. Link to free PDF: https://www.inference.org.uk/itprnn/book.pdf)
1.2. In many textbooks, the density or mass function is non-negative (rather than being positive). Being positive is beneficial because, for example, we will discuss the logarithm of the density or mass function, for which being zero is invalid. For any value on which the density or mass function vanishes, we throw it out of \(\mathcal{X}\), which in turn guarantees the positivity.
The density at a value \(x\) is usually denoted by \(p (x)\), which, as a function, is called the density function. Notice that \(p (x)\) is deterministic, thus not capital. The same goes for mass, where \(p (x)\) is called the mass function. We can then speak of the expectation of a function \(f\) on a distribution \(P\), denoted by \(\mathbb{E}_P [f]\) or \(\mathbb{E}_{x \sim P} [f (x)]\). If the alphabet \(\mathcal{X}\) is a continuum, it is \(\int_{\mathcal{X}} \mathrm{d} x p (x) f (x)\); otherwise it is \(\sum_{x \in \mathcal{X}} p (x) f (x)\).
If there exist random variables \(Y\) and \(Z\), with alphabets \(\mathcal{Y}\) and \(\mathcal{Z}\) respectively, such that \(X = Y \oplus Z\) (for example, let \(X\) be two-dimensional, with \(Y\) and \(Z\) its components), then we have marginal distributions, denoted by \(P_Y\) and \(P_Z\), where \(p (y) := \int_{\mathcal{Z}} \mathrm{d} z p (y, z)\) and \(p (z) := \int_{\mathcal{Y}} \mathrm{d} y p (y, z)\) if \(X\) is continuous, and the same for the mass function. Notice that we have omitted the subscript \(Y\) in \(p_Y\) (and the same for \(p_Z\)), since the \(y\) in \(p (y)\) clearly indicates this. We marginalize \(Z\) so as to get \(P_Y\).
We further have the conditional distribution of \(Y\) given \(Z\), denoted by \(P_{Y|Z}\), where \(p (y|z) := p (y, z) / p (z)\) (we omit the subscript of \(p_{Y|Z}\) too). Suppose that we sample lots of \((Y, Z)\) values from \(P\) and then filter out the pairs with \(Z = z\). The frequency of \(Y = y\) among the filtered samples is approximated by \(p (y|z)\).
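As a quick illustration (ours, not the book's; a toy joint mass table in Python), marginalization and conditioning look as follows:

```python
# A minimal sketch: marginal and conditional mass functions computed
# from a joint mass table p(y, z) on finite alphabets (toy values).
import numpy as np

rng = np.random.default_rng(0)
p_yz = rng.random((3, 4))
p_yz /= p_yz.sum()             # normalize: sum over (y, z) of p(y, z) = 1

p_y = p_yz.sum(axis=1)         # marginalize Z: p(y) = sum_z p(y, z)
p_z = p_yz.sum(axis=0)         # marginalize Y: p(z) = sum_y p(y, z)

p_y_given_z = p_yz / p_z       # p(y|z) = p(y, z) / p(z), column-wise
assert np.allclose(p_y_given_z.sum(axis=0), 1.0)  # normalized on y for each z
```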
The Shannon entropy is well-defined for discrete random variables. Let \(X\) be a discrete random variable with alphabet \(\{ 1, \ldots, n \}\) and with \(p_i\) the mass of \(X = i\). The Shannon entropy is thus a function of \((p_1, \ldots, p_n)\), defined by
\(\displaystyle H (p_1, \ldots, p_n) = - k \sum_{i = 1}^n p_i \ln p_i, \)
where \(k\) is a positive constant. Interestingly, this expression is unique given some plausible axioms, which can be qualitatively expressed as follows:
\(H\) is a continuous function of \((p_1, \ldots, p_n)\);
larger alphabet has higher uncertainty (information or entropy); and
if we have learned some information, and based on this knowledge we learn further, the total information shall be the sum of all that we have learned.
Here, we use uncertainty, surprise, information, and entropy interchangeably.
The third axiom is also called the additivity of information. For two independent variables \(X\) and \(Y\) with distributions \(P\) and \(Q\) respectively, the third axiom indicates that the total information \(H (P Q)\) is \(H (P) + H (Q)\). But the third axiom indicates more than this. It also defines a “conditional entropy” for dealing with the situation where \(X\) and \(Y\) are dependent. Jaynes gives a detailed account of these axioms.
1.3. See appendix A of Information Theory and Statistical Mechanics by E. T. Jaynes, 1957. A free PDF version can be found on the Internet: https://bayes.wustl.edu/etj/articles/theory.1.pdf.
1.4. On Measures of Information and Entropy by Alfréd Rényi, 1961. A free PDF version can be found on the Internet: http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf.
As we will see, when the alphabet is extended to a continuum, this problem naturally disappears.
The Shannon entropy, however, cannot be directly generalized to continuous random variables. Usually, the entropy for a continuous random variable \(X\) with alphabet \(\mathcal{X}\) and distribution \(P\) is given as a functional of the density function \(p (x)\),
\(\displaystyle H (P) = - k \int_{\mathcal{X}} \mathrm{d} x\, p (x) \ln p (x), \)
which, however, is not well-defined. The first issue is that \(p\) has a dimension, as indicated by \(\int_{\mathcal{X}} \mathrm{d} x p (x) = 1\) (if \(x\) carries a dimension, then \(p (x)\) carries the inverse dimension). This means we put a dimensional quantity into a logarithm, which is invalid. The second issue is that \(H\) is not invariant under a coordinate transformation \(X \rightarrow Y := \varphi (X)\), where \(\varphi\) is a diffeomorphism. But as a “physical” quantity, \(H\) should be invariant under such “non-physical” transformations.
To eliminate the two issues, we shall extend the axiomatic description of entropy. The key to this extension is introducing another distribution, \(Q\), which shares the alphabet of \(P\), and considering instead the uncertainty (surprise) caused by \(P\) when prior knowledge is given by \(Q\). As we will see, this solves the two issues altogether. Explicitly, we extend and quantify the axioms in section 1.3 as follows.
Axiom
\(H\) is a smooth and local functional of \(p\) and \(q\);
\(H (P, Q) > 0\) with \(P \neq Q\) and \(H (P, P) = 0\); and
If \(X = Y \oplus Z\), and if \(Y\) and \(Z\) are independent, then \(H (P, Q) = H (P_Y, Q_Y) + H (P_Z, Q_Z)\), where \(P_Y, \ldots, Q_Z\) are marginal distributions.
The first axiom employs the locality of \(H\), which is thought of as natural since \(H\) is already a functional. The second axiom indicates that \(H\) vanishes only when there is no surprise caused by \(P\) (thus \(P = Q\)). It is a little like the second axiom for Shannon entropy. The third axiom, like the third for Shannon entropy, claims the additivity of surprise: if \(X\) has two independent parts, the total surprise shall be the sum of each part's.
We are to derive the explicit expression of \(H\) based on the three axioms. The result is found to be unique.
Based on the first axiom, there is a function \(h : (0, + \infty) \times (0, + \infty) \rightarrow [0, + \infty)\) such that \(H\) can be expressed as
\(\displaystyle H (P, Q) = \int_{\mathcal{X}} \mathrm{d} x\, p (x)\, h (p (x), q (x)) . \)
We are to determine the explicit form of \(h\). From the second axiom,
\(\displaystyle H (P, P) = \int_{\mathcal{X}} \mathrm{d} x\, p (x)\, h (p (x), p (x)) = 0 \)
holds for all distributions \(P\). Since \(p\) is positive and \(h\) is non-negative, we have \(h (p (x), p (x)) = 0\) for all \(x \in \mathcal{X}\). The distribution \(P\) is arbitrary, thus we find \(h (x, x) = 0\) for any \(x \in (0, + \infty)\).
Now come to the third axiom. Since \(Y\) and \(Z\) are independent, \(H (P, Q)\) can be written as \(\int \mathrm{d} y \mathrm{d} z\, p_Y (y) p_Z (z)\, h (p_Y (y) p_Z (z), q_Y (y) q_Z (z))\). Thus, the third axiom implies
\(\displaystyle \int \mathrm{d} y \mathrm{d} z\, p_Y (y)\, p_Z (z)\, h (p_Y (y) p_Z (z), q_Y (y) q_Z (z)) = \int \mathrm{d} y\, p_Y (y)\, h (p_Y (y), q_Y (y)) + \int \mathrm{d} z\, p_Z (z)\, h (p_Z (z), q_Z (z)) . \)
Following the previous argument, we find \(h (a x, b y) = h (a, b) + h (x, y)\) for any \(a, b, x, y \in (0, + \infty)\). Taking derivatives with respect to \(a\) and \(b\) results in \(\partial_1 h (a x, b y) x = \partial_1 h (a, b)\) and \(\partial_2 h (a x, b y) y = \partial_2 h (a, b)\). Since \(\partial_1 h (a, a) + \partial_2 h (a, a) = (\mathrm{d} / \mathrm{d} a) h (a, a) = 0\), we get \(\partial_1 h (a x, a y) x + \partial_2 h (a x, a y) y = 0\). Letting \(a = 1\), it becomes a first-order partial differential equation \(\partial_1 h (x, y) x + \partial_2 h (x, y) y = 0\), whose unique solution is that \(h (x \mathrm{e}^t, y \mathrm{e}^t)\) is constant for all \(t\). Choosing \(t = - \ln y\), we find \(h (x, y) = h (x / y, 1)\). Now \(h\) reduces from two variables to one. So, plugging this result back into \(h (a x, b y) = h (a, b) + h (x, y)\), we have \(h (x y, 1) = h (x, 1) + h (y, 1)\). It looks like a logarithm. We are to show that it is indeed so. Taking the derivative with respect to \(x\) and then setting \(x = 1\), we get a first-order ordinary differential equation \(\partial_1 h (x, 1) = \partial_1 h (1, 1) / x\), which has the unique solution \(h (x, 1) = \partial_1 h (1, 1) \ln (x) + C\), where \(C\) is a constant. Combined with \(h (x, y) = h (x / y, 1)\), we finally arrive at \(h (x, y) = \partial_1 h (1, 1) \ln (x / y) + C\). To determine \(\partial_1 h (1, 1)\) and \(C\), we use the second axiom: \(\partial_1 h (1, 1) \int \mathrm{d} x p (x) \ln (p (x) / q (x)) + C > 0\) when \(p \neq q\), and \(\partial_1 h (1, 1) \int \mathrm{d} x p (x) \ln (p (x) / p (x)) + C = 0\). The second equation results in \(C = 0\). By Gibbs' inequality, the integral \(\int \mathrm{d} x p (x) \ln (p (x) / q (x))\) is non-negative, thus from the first equation, \(\partial_1 h (1, 1) > 0\). Up to now, everything about \(h\) has been settled. We conclude that there is a unique expression that satisfies all three axioms, which is
\(\displaystyle H (P, Q) = k \int_{\mathcal{X}} \mathrm{d} x\, p (x) \ln \frac{p (x)}{q (x)}, \)
where \(k > 0\). This was first derived by Solomon Kullback and Richard Leibler in 1951, so it is called Kullback–Leibler divergence (KL-divergence for short), denoted by \(D_{\operatorname{KL}} (P\|Q)\). Since it characterizes the relative surprise, it is also called relative entropy (entropy for surprise).
The locality is essential for relative entropy to arise. For example, the Rényi divergence, defined by
\(\displaystyle H_{\alpha} (P, Q) := \frac{1}{\alpha - 1} \ln \int_{\mathcal{X}} \mathrm{d} x\, p^{\alpha} (x)\, q^{1 - \alpha} (x) \)
for \(\alpha > 0\) and \(\alpha \neq 1\), also satisfies the three axioms when locality is absent.
In the end, we examine the two issues that appeared for Shannon entropy (section 1.4). In \(H (P, Q)\), the logarithm is \(\ln (p / q)\), which is dimensionless. And a coordinate transformation \(X \rightarrow Y := \varphi (X)\) makes \(\int \mathrm{d} x p (x) = \int \mathrm{d} y | \det (\partial \varphi^{- 1}) (y) | p (\varphi^{- 1} (y)) =: \int \mathrm{d} y \tilde{p} (y)\), thus \(p \rightarrow \tilde{p} := | \det (\partial \varphi^{- 1}) | p \circ \varphi^{- 1}\). The same goes for \(q \rightarrow \tilde{q} := | \det (\partial \varphi^{- 1}) | q \circ \varphi^{- 1}\). The common factor \(| \det (\partial \varphi^{- 1}) |\) is eliminated in \(\ln (p / q)\), leaving \(H (P, Q)\) invariant (since \(\int \mathrm{d} x p \ln (p / q) \rightarrow \int \mathrm{d} y \tilde{p} \ln (\tilde{p} / \tilde{q})\), which equals \(\int \mathrm{d} x p \ln (p / q)\)). So, the two issues of Shannon entropy cease in relative entropy.
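Here is a small numerical check of this invariance, an illustration of our own using two Gaussian densities and the diffeomorphism \(y = \mathrm{e}^x\):

```python
# Relative entropy is invariant under the coordinate change y = exp(x),
# while the naive entropy -int p ln p is not (toy Gaussians of our own).
import numpy as np
from scipy.stats import norm

def integrate(f, x):                 # simple trapezoidal rule
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

x = np.linspace(-8.0, 8.0, 40001)
p, q = norm.pdf(x, 1.0, 1.0), norm.pdf(x, 0.5, 1.2)

y = np.exp(x)                        # the diffeomorphism
p_y, q_y = p / y, q / y              # densities pick up |dx/dy| = 1/y

print(integrate(p * np.log(p / q), x),
      integrate(p_y * np.log(p_y / q_y), y))  # equal: H(P, Q) is invariant
print(integrate(-p * np.log(p), x),
      integrate(-p_y * np.log(p_y), y))       # differ: Shannon-like H is not
```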
In this chapter, we discuss continuous time Markovian processes. Using the result obtained in chapter 1, we show how a Markovian process relaxes to its stationary equilibrium.
We follow the conventions in chapter 1. Let \(X\) be a multi-dimensional random variable, either discrete, continuous, or partially discrete and partially continuous, with alphabet \(\mathcal{X}\) and distribution \(P\). Even though the discussion in this chapter applies to both discrete and continuous random variables, we use the continuous notation. The reason is that converting from discrete to continuous may cause problems (section 1.4), while the inverse is safe and direct.
When using a conditional density function to describe the probability of transition, like \(p_{t \rightarrow t'} (x' |x)\), which denotes the transition from \(x\) at time \(t\) to \(x'\) at time \(t'\), we find it convenient to adopt the notation \(p_{t \rightarrow t'} (x \rightarrow x')\).
We generalize the pile of sand model in section 1.2, imagining that these sands are magic, having free will to move on the table. The distribution of sands changes with time. In the language of probability, the density of sands at position \(x\) on the table is described by a time-dependent density function \(p (x, t)\) where \(t\) represents time. The total mass of the sands is kept invariant and normalized to one. The alphabet \(\mathcal{X}\) is the table itself (namely, every position on the table).
Let \(q_{t \rightarrow t'} (x \rightarrow x')\) denote the portion of density at position \(x\) at time \(t\) that transits to position \(x'\) at time \(t'\). Then, the transited density will be \(p (x, t) q_{t \rightarrow t'} (x \rightarrow x')\). There may be some portion of density at position \(x\) that does not transit during \(t \rightarrow t'\) (the lazy sands). In this case we regard the sands as transiting from position \(x\) to \(x\) (staying on \(x\)), which is \(q_{t \rightarrow t'} (x \rightarrow x)\). Now, every sand at position \(x\) has transited during \(t \rightarrow t'\), and the total portion shall be \(100\%\), which means
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} x' q_{t \rightarrow t'} (x \rightarrow x') = 1. \) | (2.1) |
As a portion, \(q_{t \rightarrow t'}\) cannot be negative, thus \(q_{t \rightarrow t'} (x \rightarrow x') \geqslant 0\) for each \(x\) and \(x'\) in \(\mathcal{X}\). We call \(q_{t \rightarrow t'}\) the transition density. Unlike the density function of a distribution, a transition density can vanish on a subset of \(\mathcal{X}\).
The transition makes a difference in the density at position \(x'\). The difference is caused by the density transited out of \(x'\), which is \(\int_{\mathcal{X}} \mathrm{d} x\, p (x', t)\, q_{t \rightarrow t'} (x' \rightarrow x)\), and by that transited into \(x'\), which is \(\int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q_{t \rightarrow t'} (x \rightarrow x')\). Thus, we have
\(\displaystyle p (x', t') - p (x', t) = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q_{t \rightarrow t'} (x \rightarrow x') - \int_{\mathcal{X}} \mathrm{d} x\, p (x', t)\, q_{t \rightarrow t'} (x' \rightarrow x) . \)
By inserting equation 2.1, we find
\(\displaystyle p (x', t') = \int_{\mathcal{X}} \mathrm{d} x p (x, t) q_{t \rightarrow t'} (x \rightarrow x'), \) | (2.2) |
which is called the discrete time master equation. When \(t' = t\), we have \(p (x', t) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) q_{t \rightarrow t} (x \rightarrow x')\), indicating that
\(\displaystyle q_{t \rightarrow t} (x \rightarrow x') = \delta (x - x'), \)
where \(\delta (x - x')\) indicates Kronecker's delta \(\delta_{x, x'}\) when \(\mathcal{X}\) is discrete, or Dirac's delta function when \(\mathcal{X}\) is continuous. Even though we call it Dirac's delta function, it is in fact a generalized function. A generalized function has meaning only when it is integrated together with other (not generalized) functions. For example, Dirac's delta function has \(\int_{\mathcal{X}} \mathrm{d} x \delta (x - y) f (x) = f (y)\) for any usual function \(f\).
In addition, if the change of the distribution of sands is smooth, that is, if there is no lump of sand jumping from one place to another within an arbitrarily short period of time, then \(q_{t \rightarrow t'}\) is smooth in \(t'\). Taking the derivative with respect to \(t'\) and then setting \(t'\) to \(t\), we have
\(\displaystyle \frac{\partial p}{\partial t} (x', t) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) r_t (x, x'), \) | (2.3) |
where
\(\displaystyle r_t (x, x') := \frac{\partial q_{t \rightarrow t'}}{\partial t'} (x \rightarrow x') \bigg|_{t' = t} \)
is the transition rate. Equation 2.3 is called the continuous time master equation, or simply the master equation. The word “master” indicates that the transition rate completely determines (masters) the evolutionary behavior of the distribution.
Even though all these concepts are born of the pile of sand, they are applicable to any stochastic process where the distribution \(P (t)\) is time-dependent (but the alphabet \(\mathcal{X}\) is time-invariant), no matter whether the random variable is discrete or continuous.
A stochastic process is Markovian if the transition density \(q_{t \rightarrow t'}\) depends only on the time difference \(\Delta t := t' - t\), thus \(q_{t \rightarrow t'}\) is re-denoted by \(q_{\Delta t}\). In this situation, transition rate \(r\) becomes time-independent, so the master equation turns to be
\(\displaystyle \frac{\partial p}{\partial t} (x', t) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) r (x, x') . \) | (2.4) |
Since we only deal with Markovian stochastic processes throughout this note, when referring to the master equation, we mean equation 2.4; and when referring to the discrete time master equation, we mean:
\(\displaystyle p (x', t + \Delta t) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) q_{\Delta t} (x \rightarrow x') . \) | (2.5) |
Before finishing this section, we discuss the conditions demanded of a transition rate. The normalization of transition density 2.1 implies that \(\int_{\mathcal{X}} \mathrm{d} y r (x, y) = 0\). This can be seen by Taylor expanding \(q_{\Delta t}\) as \(q_{\Delta t} (x \rightarrow y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\), where we have inserted \(q_0 (x \rightarrow y) = \delta (x - y)\) and the definition of \(r\). Also from this Taylor expansion, we see that the non-negativity of \(q_{\Delta t}\) implies \(r (x, y) \geqslant 0\) when \(x \neq y\). Since \(p\) is the density function of a distribution, and a density function is defined to be positive (see section 1.2), equation 2.2 must conserve this positivity. We are to show that this is guaranteed by the master equation itself, without any extra condition demanded of the transition rate. It is convenient to use discrete notation, thus we replace \(x \rightarrow i\), \(y \rightarrow j\), and \(\int \mathrm{d} x \rightarrow \sum\). The master equation becomes \((\mathrm{d} p_j / \mathrm{d} t) (t) = \sum_i p_i (t) r_{i j}\). It is an ordinary differential equation. Recall that \(r_{i j} \geqslant 0\) when \(i \neq j\), and thus \(r_{i i} \leqslant 0\) (since \(\sum_j r_{i j} = 0\)). We separate the right hand side into \(p_j (t) r_{j j} + \sum_{i : i \neq j} p_i (t) r_{i j}\); the worst situation is that \(r_{i j} = 0\) for each \(i \neq j\) and \(r_{j j} < 0\). In this case, the master equation reduces to \((\mathrm{d} p_j / \mathrm{d} t) (t) = p_j (t) r_{j j}\), which has the solution \(p_j (t) = p_j (0) \exp (r_{j j} t)\). It implies that \(p_j (t) > 0\) as long as \(p_j (0) > 0\), indicating that the master equation conserves the positivity of the density function. As a summary, we demand of a transition rate \(r\) that \(r (x, y) \geqslant 0\) when \(x \neq y\) and \(\int_{\mathcal{X}} \mathrm{d} y r (x, y) = 0\).
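A minimal simulation sketch (with a toy three-state rate matrix of our own choosing) of the conditions just derived:

```python
# Toy check: a valid rate matrix (off-diagonal entries >= 0, rows summing
# to zero) conserves total mass and positivity under dp/dt = p r.
import numpy as np

r = np.array([[-1.0, 0.7, 0.3],
              [ 0.4, -0.9, 0.5],
              [ 0.2, 0.6, -0.8]])     # rows sum to zero
p = np.array([0.5, 0.3, 0.2])
dt = 1e-3
for _ in range(20000):                # forward-Euler integration to t = 20
    p = p + dt * p @ r                # (p r)_j = sum_i p_i r_ij
assert abs(p.sum() - 1.0) < 1e-9 and (p > 0).all()
print(p)                              # approaches the stationary distribution
```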
We wonder: given a transition rate, can we obtain the corresponding transition density? Generally, we cannot get the global (or the finite) from the local (or the infinitesimal). For example, we cannot determine a function only by its first derivative at the origin. But the master equation has a group-like structure, by which the local accumulates into the global. We are to show how this happens.
We can use the master equation 2.4 to calculate \(\partial^n p / \partial t^n\) for any \(n\). For \(n = 2\), taking the time derivative of the master equation gives
\(\displaystyle \frac{\partial^2 p}{\partial t^2} (x_2, t) = \int_{\mathcal{X}} \mathrm{d} x_1 \frac{\partial p}{\partial t} (x_1, t)\, r (x_1, x_2) . \)
We then insert the master equation 2.4 again (for the \(\partial p / \partial t\) in the integrand), and find
\(\displaystyle \frac{\partial^2 p}{\partial t^2} (x_2, t) = \int_{\mathcal{X}} \mathrm{d} x_0 \int_{\mathcal{X}} \mathrm{d} x_1\, p (x_0, t)\, r (x_0, x_1)\, r (x_1, x_2) . \)
Following the same steps, this generalizes to higher-order derivatives, as
\(\displaystyle \frac{\partial^n p}{\partial t^n} (x_n, t) = \int_{\mathcal{X}} \mathrm{d} x_0 \cdots \int_{\mathcal{X}} \mathrm{d} x_{n - 1}\, p (x_0, t)\, r (x_0, x_1) \cdots r (x_{n - 1}, x_n) . \)
Notice the pattern: a leftmost \(p (x_0, t)\) followed by a sequence of \(r\)s. The reason for this pattern to arise is that the transition rate \(r\) is independent of \(t\): a Markovian property.
On the other hand, Taylor expanding both sides of the discrete time master equation 2.5 in \(\Delta t\) gives, at \((\Delta t)^n\) order,
\(\displaystyle \frac{1}{n!} \frac{\partial^n p}{\partial t^n} (x', t) = \frac{1}{n!} \int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q_0^{(n)} (x' |x), \)
where, for simplifying notation, we have denoted the \(n\)th-order derivative of \(q_{\Delta t}\) by
\(\displaystyle q_{\Delta t}^{(n)} (x' |x) := \frac{\partial^n q_{\Delta t}}{\partial \Delta t^n} (x \rightarrow x') . \)
So, by equating the two expressions of \((\partial^n p / \partial t^n) (x', t)\), we find
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} x_0 \cdots \int_{\mathcal{X}} \mathrm{d} x_{n - 1}\, p (x_0, t)\, r (x_0, x_1) \cdots r (x_{n - 1}, x') = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q_0^{(n)} (x' |x) \)
for any \(n \in \{ 2, 3, \ldots \}\). This holds for all \(p\), thus
\(\displaystyle q_0^{(n)} (x' |x) = \int_{\mathcal{X}} \mathrm{d} x_1 \cdots \int_{\mathcal{X}} \mathrm{d} x_{n - 1}\, r (x, x_1) \cdots r (x_{n - 1}, x') . \)
Recalling that \(q_{\Delta t} (x \rightarrow x') = \delta (x - x') + r (x, x') \Delta t + \omicron (\Delta t)\), we have the Taylor expansion of \(q_{\Delta t}\), as
2.1. Another derivation uses the exponential mapping. By regarding \(p\) as a time-dependent element of a function space, and \(r\) as a linear operator (we add a hat to indicate an operator, using a dot \(\cdot\) for its operation), the master equation becomes
\(\displaystyle \frac{\mathrm{d} p}{\mathrm{d} t} (t) = \hat{r} \cdot p (t) . \)
This operator differential equation has a famous solution, called the exponential mapping, \(p (t) = \exp (\hat{r} t) \cdot p (0)\), where the exponential operator is defined by the Taylor expansion \(\exp (\hat{L}) := \hat{1} + \hat{L} + (1 / 2!) \hat{L}^2 + \cdots\) for any linear operator \(\hat{L}\). Indeed, by taking the derivative on \(t\) on both sides, we find \((\mathrm{d} p / \mathrm{d} t) (t) = \hat{r} \cdot \exp (\hat{r} t) \cdot p (0) = \hat{r} \cdot p (t)\). Recall the discrete time master equation, \(p (\Delta t) = \hat{q}_{\Delta t} \cdot p (0)\), where the transition density \(\hat{q}_{\Delta t}\) is regarded as a linear operator too (so we put a hat on it). We find \(\exp (\hat{r} \Delta t) \cdot p (0) = \hat{q}_{\Delta t} \cdot p (0)\), which holds for arbitrary \(p (0)\), implying \(\hat{q}_{\Delta t} = \exp (\hat{r} \Delta t) = 1 + \hat{r} \Delta t + (1 / 2!) (\hat{r} \cdot \hat{r}) (\Delta t)^2 + \cdots\). Going back to the functional representation, we have the correspondences \(\hat{q}_{\Delta t} \rightarrow q_{\Delta t} (z|x)\), \(\hat{r} \rightarrow r (z, x)\), \(\hat{r} \cdot \hat{r} \rightarrow \int \mathrm{d} y r (z, y) r (y, x)\), \(\hat{r} \cdot \hat{r} \cdot \hat{r} \rightarrow \int \mathrm{d} y_1 \mathrm{d} y_2 r (z, y_2) r (y_2, y_1) r (y_1, x)\), and so on, thus recovering the relation between \(q_{\Delta t}\) and \(r\).
\(\displaystyle \begin{array}{rl} q_{\Delta t} (x \rightarrow x') = & \delta (x - x')\\ + & (\Delta t) r (x, x')\\ + & \frac{(\Delta t)^2}{2!} \int_{\mathcal{X}} \mathrm{d} x_1 r (x, x_1) r (x_1, x')\\ + & \cdots\\ + & \frac{(\Delta t)^n}{n!} \int_{\mathcal{X}} \mathrm{d} x_1 \cdots \int_{\mathcal{X}} \mathrm{d} x_{n - 1} r (x, x_1) \cdots r (x_{n - 1}, x')\\ + & \cdots . \end{array} \) | (2.6) |
Well, this is a complicated formula, but its implication is straightforward and very impressive: the transition density is equivalent to the transition rate, even though the transition rate is derived from the transition density with an infinitesimal time interval.
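In a discrete (finite-alphabet) setting, equation 2.6 is just the matrix exponential, which we can check numerically (with toy rates of our own):

```python
# Toy check of eq. 2.6: the transition density over a finite interval is
# the summed series delta + r dt + r^2 dt^2/2! + ..., i.e. exp(r dt).
import numpy as np
from math import factorial
from scipy.linalg import expm

r = np.array([[-1.0, 0.7, 0.3],
              [ 0.4, -0.9, 0.5],
              [ 0.2, 0.6, -0.8]])       # rows sum to zero
dt = 0.5
q = expm(r * dt)                         # q_dt = exp(r dt)
series = sum(np.linalg.matrix_power(r * dt, n) / factorial(n)
             for n in range(30))         # the series 2.6, truncated
assert np.allclose(q, series)
assert np.allclose(q.sum(axis=1), 1.0)   # normalized: eq. 2.1
assert (q >= -1e-12).all()               # non-negative
```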
This equivalence may seem a little weird at first sight. For example, consider \(q'_{\Delta t} (x \rightarrow x') := q_{\Delta t} (x \rightarrow x') + f (x, x') \Delta t^2\), where \(f\) is any function ensuring that \(q'_{\Delta t}\) is non-negative and normalized (thus \(\int_{\mathcal{X}} \mathrm{d} y f (x, y) = 0\)). Following the previous derivation, we find the discrete time master equation
\(\displaystyle p (x', t + \Delta t) = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q'_{\Delta t} (x \rightarrow x') \)
also leads to the (continuous time) master equation 2.4 with the same \(r\) as that of \(q_{\Delta t}\). So, we should have \(q'_{\Delta t} = q_{\Delta t}\), which means \(f\) is not free, but must vanish. Why?
The answer to this question is that a transition density is not free to choose, but shares the same degrees of freedom as its transition rate. The fundamental quantity that describes the evolution of a continuous time Markov process is the transition rate. For example, consider \(p (z, t + \Delta t + \Delta t')\) for any \(\Delta t\) and \(\Delta t'\). Directly, we have
\(\displaystyle p (z, t + \Delta t + \Delta t') = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t)\, q_{\Delta t + \Delta t'} (x \rightarrow z), \)
but on the other hand, by applying the discrete time master equation twice, we find
\(\displaystyle p (z, t + \Delta t + \Delta t') = \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, p (x, t)\, q_{\Delta t} (x \rightarrow y)\, q_{\Delta t'} (y \rightarrow z) . \)
By equating the two expressions of \(p (z, t + \Delta t + \Delta t')\), we find
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \left[ q_{\Delta t + \Delta t'} (x \rightarrow z) - \int_{\mathcal{X}} \mathrm{d} y\, q_{\Delta t} (x \rightarrow y)\, q_{\Delta t'} (y \rightarrow z) \right] = 0 . \)
Since \(p\) can be arbitrary, we arrive at
\(\displaystyle q_{\Delta t + \Delta t'} (x \rightarrow z) = \int_{\mathcal{X}} \mathrm{d} y q_{\Delta t} (x \rightarrow y) q_{\Delta t'} (y \rightarrow z) . \) | (2.7) |
This provides an additional restriction on the transition density.
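Equation 2.7 can also be checked numerically in the discrete setting (again with a toy rate matrix of our own):

```python
# Toy check of eq. 2.7: densities generated by a fixed rate matrix
# compose over time intervals (the group-like structure).
import numpy as np
from scipy.linalg import expm

r = np.array([[-1.0, 0.7, 0.3],
              [ 0.4, -0.9, 0.5],
              [ 0.2, 0.6, -0.8]])
q = lambda dt: expm(r * dt)                     # q_dt generated by r
assert np.allclose(q(0.3 + 0.9), q(0.3) @ q(0.9))
```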
Interestingly, obeying equation 2.7 is sufficient for \(q_{\Delta t}\) to satisfy equation 2.6. Precisely, let \(q_{\Delta t} (x \rightarrow y)\) be a function that is smooth in \(\Delta t\) with \(q_0 (x \rightarrow y) = \delta (x - y)\); we are to show that, if \(q_{\Delta t}\) satisfies equation 2.7, then it obeys equation 2.6. To do so, we take the derivative of equation 2.7 with respect to \(\Delta t'\) at the origin, resulting in
\(\displaystyle q_{\Delta t}^{(1)} (z|x) = \int_{\mathcal{X}} \mathrm{d} y\, q_{\Delta t} (y|x)\, r (y, z), \)
where \(r (y, z) := q_0^{(1)} (z|y)\), as in section 2.2. Then, we are to Taylor expand both sides in \(\Delta t\). On the right hand side, we have
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} y \left[ \sum_{n = 0}^{+ \infty} \frac{(\Delta t)^n}{n!} q_0^{(n)} (y|x) \right] r (y, z), \)
and on the left hand side,
\(\displaystyle \sum_{n = 0}^{+ \infty} \frac{(\Delta t)^n}{n!} q_0^{(n + 1)} (z|x) . \)
So, we get the Taylor expansions of both sides. At \((\Delta t)^0\) order,
\(\displaystyle q_0^{(1)} (z|x) = \int_{\mathcal{X}} \mathrm{d} y\, \delta (y - x)\, r (y, z) = r (x, z), \)
which is just the definition of \(r\). At \((\Delta t)^1\) order,
\(\displaystyle q_0^{(2)} (z|x) = \int_{\mathcal{X}} \mathrm{d} y\, q_0^{(1)} (y|x)\, r (y, z) = \int_{\mathcal{X}} \mathrm{d} y\, r (x, y)\, r (y, z) . \)
Iterating, at \((\Delta t)^n\) order, we will find
\(\displaystyle q_0^{(n + 1)} (z|x) = \int_{\mathcal{X}} \mathrm{d} y\, q_0^{(n)} (y|x)\, r (y, z) = \int_{\mathcal{X}} \mathrm{d} y_1 \cdots \int_{\mathcal{X}} \mathrm{d} y_n\, r (x, y_1) \cdots r (y_n, z) \)
again. And this implies equation 2.6. So, we conclude this paragraph as follows: obeying equation 2.7 is the necessary and sufficient condition for a function \(q_{\Delta t} (x|y)\), which is smooth in \(\Delta t\) with \(q_0 (x|y) = \delta (x - y)\), to satisfy equation 2.6; additionally, if \(q_{\Delta t} (x|y)\) is non-negative and normalized on \(x\) (namely, \(\int_{\mathcal{X}} \mathrm{d} x q_{\Delta t} (x|y) = 1\)), then \(q_{\Delta t}\) is a transition density.
Let \(\Pi\) be a stationary solution of the master equation 2.4. Then, its density function \(\pi\) satisfies \(\int_{\mathcal{X}} \mathrm{d} x \pi (x) r (x, y) = 0\). Since we have demanded that \(\int_{\mathcal{X}} \mathrm{d} x r (y, x) = 0\), the stationary master equation can be re-written as
\(\displaystyle \int_{\mathcal{X}} \mathrm{d} x \left[ \pi (x)\, r (x, y) - \pi (y)\, r (y, x) \right] = 0 . \)
But this condition is too weak to be useful. A more useful condition, which is stronger than this, is that the integrand vanishes everywhere:
\(\displaystyle \pi (x) r (x, y) = \pi (y) r (y, x), \) | (2.8) |
which is called the detailed balance condition.
Interestingly, for a transition rate \(r\) that satisfies detailed balance condition 2.8, the transition density \(q_{\Delta t}\) generated by \(r\) using equation 2.6 satisfies a similar relation
\(\displaystyle \pi (x) q_{\Delta t} (x \rightarrow y) = \pi (y) q_{\Delta t} (y \rightarrow x) . \) | (2.9) |
To see this, consider the third line in equation 2.6. It contributes to \(\pi (x) q_{\Delta t} (x \rightarrow z)\) by the main factor \(\int \mathrm{d} y \pi (x) r (x, y) r (y, z)\). By inserting the detailed balance condition for \(r (x, y)\) as \(\pi (x) r (x, y) = \pi (y) r (y, x)\), the main factor becomes \(\int \mathrm{d} y r (y, x) \pi (y) r (y, z)\). Then inserting the detailed balance condition again, for \(r (y, z)\) as \(\pi (y) r (y, z) = \pi (z) r (z, y)\), results in \(\int \mathrm{d} y \pi (z) r (z, y) r (y, x)\). It is the corresponding main factor that contributes to \(\pi (z) q_{\Delta t} (z \rightarrow x)\). Applying the same process to other lines in equation 2.6, we arrive at equation 2.9.
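A numerical check of this inheritance (a toy example of our own, with a Metropolis-style rate anticipating section 2.7):

```python
# Toy check: if r satisfies detailed balance with pi (eq. 2.8), then the
# generated q_dt = exp(r dt) satisfies eq. 2.9.
import numpy as np
from scipy.linalg import expm

pi = np.array([0.2, 0.3, 0.5])
r = np.minimum(1.0, pi[None, :] / pi[:, None])  # Metropolis-style rates
np.fill_diagonal(r, 0.0)
np.fill_diagonal(r, -r.sum(axis=1))             # rows sum to zero

db = pi[:, None] * r                            # pi(x) r(x, y)
assert np.allclose(db, db.T)                    # eq. 2.8
q = expm(r * 0.7)
dbq = pi[:, None] * q                           # pi(x) q_dt(x -> y)
assert np.allclose(dbq, dbq.T)                  # eq. 2.9
```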
Given the time \(t\), if the time-dependent distribution \(P (t)\) and the stationary distribution \(\Pi\) share the same alphabet \(\mathcal{X}\), which means \(p (x, t) > 0\) and \(\pi (x) > 0\) for each \(x \in \mathcal{X}\), then the relative entropy between them is well-defined, as
\(\displaystyle H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x p (x, t) \ln \frac{p (x, t)}{\pi (x)} .\) | (2.10) |
It characterizes the difference between distributions \(P (t)\) and \(\Pi\). It is a plausible generalization of Shannon entropy to continuous random variables (see chapter 1).
We can calculate the time-derivative of relative entropy by the master equation 2.4. Generally, the time-derivative of relative entropy has no interesting property. But if \(\pi\) is more than stationary, satisfying the stronger detailed balance condition, then \(\mathrm{d} H (P (t), \Pi) / \mathrm{d} t\) will have the regular form
2.2. The proof is given as follows. Directly, we have
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x \frac{\partial p}{\partial t} (x, t) \ln \frac{p (x, t)}{\pi (x)} + \int_{\mathcal{X}} \mathrm{d} x \frac{\partial p}{\partial t} (x, t) . \)
Since \(\int_{\mathcal{X}} \mathrm{d} x (\partial p / \partial t) (x, t) = (\partial / \partial t) \int_{\mathcal{X}} \mathrm{d} x p (x, t) = 0\), the second term vanishes. Then, we get
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x \frac{\partial p}{\partial t} (x, t) \ln \frac{p (x, t)}{\pi (x)} . \)
Now, we replace \(\partial p / \partial t\) by the master equation 2.4, as
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, p (y, t)\, r (y, x) \ln \frac{p (x, t)}{\pi (x)} . \)
Then, insert the detailed balance condition \(r (y, x) = \pi (x) r (x, y) / \pi (y)\), as
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, \pi (x)\, r (x, y)\, \frac{p (y, t)}{\pi (y)} \ln \frac{p (x, t)}{\pi (x)} . \)
Since \(x\) and \(y\) are dummy, we interchange them in the integrand, and then insert the detailed balance condition again, as
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, \pi (x)\, r (x, y)\, \frac{p (x, t)}{\pi (x)} \ln \frac{p (y, t)}{\pi (y)} . \)
By adding the two previous results together, we find
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = \frac{1}{2} \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, \pi (x)\, r (x, y) \left( \frac{p (y, t)}{\pi (y)} \ln \frac{p (x, t)}{\pi (x)} + \frac{p (x, t)}{\pi (x)} \ln \frac{p (y, t)}{\pi (y)} \right), \)
from which we directly get the result: the two “diagonal” terms needed to complete equation 2.11 can be added freely, since both vanish, one by \(\int_{\mathcal{X}} \mathrm{d} y r (x, y) = 0\) and the other by detailed balance together with the same property. Notice that this proof is very tricky: it uses the detailed balance condition twice, between which the expression is symmetrized. It is an ingenious piece of mathematical engineering.
\(\displaystyle \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) = - \frac{1}{2} \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y \pi (x) r (x, y) \left( \frac{p (x, t)}{\pi (x)} - \frac{p (y, t)}{\pi (y)} \right) \left( \ln \frac{p (x, t)}{\pi (x)} - \ln \frac{p (y, t)}{\pi (y)} \right) . \) | (2.11) |
We are to check the sign of the integrand. The \(r (x, y)\) can be negative only when \(x = y\), where the last two factors vanish. Thus, \(r (x, y)\) can be treated as non-negative, and so can the factor \(\pi (x) r (x, y)\) (since \(\pi (x) > 0\) for all \(x \in \mathcal{X}\)). Now, we check the sign of the last two factors. If \(p (x, t) / \pi (x) > p (y, t) / \pi (y)\), then \(\ln [p (x, t) / \pi (x)] > \ln [p (y, t) / \pi (y)]\), thus the product of the last two factors is positive. The same goes for \(p (x, t) / \pi (x) < p (y, t) / \pi (y)\). Only when \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) can it be zero. Altogether, because of the prefactor \(- 1 / 2\), the integrand is non-positive, thus \(\mathrm{d} H / \mathrm{d} t \leqslant 0\).
The integrand vanishes when either \(r (x, y) = 0\) or \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\). If \(r (x, y) > 0\) for each \(x \neq y\), then \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) only when \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) for all \(x, y \in \mathcal{X}\), which implies that \(p (\cdot, t) = \pi\) (since \(\int_{\mathcal{X}} \mathrm{d} x p (x, t) = \int_{\mathcal{X}} \mathrm{d} x \pi (x) = 1\)), or \(P (t) = \Pi\).
Conversely, if \(r (x, y) = 0\) on some subset \(U \subset \mathcal{X} \times \mathcal{X}\), it seems that \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) cannot imply \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\) on \(U\). But if there is a \(z \in \mathcal{X}\) such that both \((x, z)\) and \((y, z)\) are not in \(U\), then \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) implies \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\) and \(p (y, t) / \pi (y) = p (z, t) / \pi (z)\), thus implies \(p (x, t) / \pi (x) = p (y, t) / \pi (y)\). This hints at connectivity. Precisely, for each \(x, z \in \mathcal{X}\), if there is a series \((y_1, \ldots, y_n)\) from \(x\) (\(y_1 := x\)) to \(z\) (\(y_n := z\)) with both \(r (y_i, y_{i + 1})\) and \(r (y_{i + 1}, y_i)\) positive for each \(i\), then we say \(x\) and \(z\) are connected, and the series is called a path. It means there are densities transiting along the forward and backward directions of the path. In this situation, \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) implies \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\): we have, along the path, \(p (y_1, t) / \pi (y_1) = p (y_2, t) / \pi (y_2) = \cdots = p (y_n, t) / \pi (y_n)\), thus \(p (x, t) / \pi (x) = p (z, t) / \pi (z)\) since \(x = y_1\) and \(z = y_n\). So, by repeating the previous discussion in the case “\(r (x, y) > 0\) for each \(x \neq y\)”, we find that \(P (t) = \Pi\) at \((\mathrm{d} / \mathrm{d} t) H (P (t), \Pi) = 0\) if every two elements in \(\mathcal{X}\) are connected.
Let us examine the connectivity further. We additionally define that every element in \(\mathcal{X}\) is connected to itself; then connectivity forms an equivalence relation. So, it separates \(\mathcal{X}\) into subsets (equivalence classes) \(\mathcal{X}_1, \ldots, \mathcal{X}_n\) with \(\mathcal{X}_i \cap \mathcal{X}_j = \varnothing\) for each \(i \neq j\) and \(\mathcal{X}= \cup_{i = 1}^n \mathcal{X}_i\). In each subset \(\mathcal{X}_i\), every two elements are connected. In this way, the whole random system is separated into many independent subsystems. The distributions \(P_i (t)\) and \(\Pi_i\) defined on the subsystem \(i\) have the alphabet \(\mathcal{X}_i\) and density functions \(p_i (x, t) := p (x, t) / \int_{\mathcal{X}_i} \mathrm{d} x p (x, t)\) and \(\pi_i (x) := \pi (x) / \int_{\mathcal{X}_i} \mathrm{d} x \pi (x)\) respectively (the denominators are used for normalization). Applying the previous discussion to this subsystem, we find \(P_i (t) = \Pi_i\) at \((\mathrm{d} / \mathrm{d} t) H (P_i (t), \Pi_i) = 0\).
So, for the whole random system or each of its subsystems, the following theorem holds.
Theorem 2.1. Let \(r\) be a transition rate that, together with the density function \(\pi\) of a distribution \(\Pi\), satisfies the detailed balance condition 2.8, and suppose that \(r\) connects every two elements of \(\mathcal{X}\). Then, evolving under the master equation 2.4, the relative entropy \(H (P (t), \Pi)\) monotonically decreases, and any initial distribution \(P (0)\) finally relaxes to \(\Pi\).
Many textbooks use the Fokker-Planck equation to prove the monotonic reduction of relative entropy. After an integration by parts, they arrive at a negative definite expression, which implies the monotonic reduction. That proof needs a smooth structure on \(X\), which is essential for integration by parts. In this section, we have provided a more generic alternative, for which a smooth structure on \(X\) is unnecessary.
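As a sanity check of the theorem (a toy example of our own), one can integrate the master equation and watch \(H (P (t), \Pi)\) decrease:

```python
# Toy check: relative entropy H(P(t), Pi) decreases monotonically under
# a rate matrix satisfying detailed balance with pi.
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
r = np.minimum(1.0, pi[None, :] / pi[:, None])  # Metropolis-style rates
np.fill_diagonal(r, 0.0)
np.fill_diagonal(r, -r.sum(axis=1))             # rows sum to zero

p, dt, H = np.array([0.98, 0.01, 0.01]), 1e-2, []
for _ in range(2000):
    H.append(float(np.sum(p * np.log(p / pi))))
    p = p + dt * p @ r                          # one Euler step of eq. 2.4
assert all(b <= a + 1e-12 for a, b in zip(H, H[1:]))  # monotone decrease
```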
How do we numerically simulate the evolution of the master equation 2.4 toward equilibrium (at which the simulation terminates)? Using the simile of sands (see section 2.2), we simulate each sand, but replace its free will by a transition probability. Explicitly, we initialize the position of each sand randomly, and then iteratively update the positions. In each iteration, a sand jumps from position \(x\) to position \(y\) with the density function \(q_{\Delta t} (x \rightarrow y) \approx \delta (x - y) + r (x, y) \Delta t\), where \(\Delta t\) is sufficiently small. Not every transition rate is usable, however. On one hand, we have to ensure that the computer has a sampler that randomly samples from \(q_{\Delta t} (x \rightarrow y)\). On the other hand, to ensure termination, the transition rate \(r\), together with the density function \(\pi\), shall satisfy the detailed balance condition 2.8. Section 2.7 provides an algorithm that constructs such a transition rate from the density function \(\pi\). Then, we expect that the simulation will iteratively decrease the difference between the distribution of the sands and \(\Pi\). We terminate the iteration when they have become close enough. In this way, we simulate a collection of sands evolving with the master equation to equilibrium, finally distributed as \(\Pi\). This process is called Monte-Carlo simulation, first developed by Stanislaw Ulam in the 1940s while he was working on the project of nuclear weapons at Los Alamos National Laboratory. The name is in memory of Ulam's uncle, who lost all his personal assets in the Monte Carlo Casino, Monaco.
2.3. There are multiple motivations for Monte-Carlo simulation. An important one comes from numerical integration. The problem is calculating the integral \(\int_{\mathcal{X}} \mathrm{d} x \pi (x) f (x)\) for a density function \(\pi\) and an arbitrary function \(f : \mathcal{X} \rightarrow \mathbb{R}\). When \(\mathcal{X}\) has finitely many elements, this integral is easy to compute, being \(\sum_{x \in \mathcal{X}} \pi (x) f (x)\). Otherwise, the integral is generally intractable. Numerically, it is approximated by the sample average \((1 / | \mathcal{S} |) \sum_{x \in \mathcal{S}} f (x)\), where \(\mathcal{S}\) is a collection of elements randomly sampled from the distribution \(\Pi\) (whose density function is the \(\pi\)). By the central limit theorem, the numerical error \(\left| \int_{\mathcal{X}} \mathrm{d} x \pi (x) f (x) - (1 / | \mathcal{S} |) \sum_{x \in \mathcal{S}} f (x) \right|\) is proportional to \(1 / \sqrt{| \mathcal{S} |}\), which can be properly bounded as long as \(| \mathcal{S} |\) is large enough. But how do we sample from a distribution if we only know its density function (recall from section 1.2 that a distribution is the combination of its density function and its sampler)? The answer is to use Monte-Carlo simulation.
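As a sketch of the footnote's point (our own toy choice: \(f = \cos\) and \(\pi\) the standard normal, whose sampler NumPy provides directly):

```python
# The footnote's integral, estimated by sampling: E[f(X)] for X ~ N(0, 1)
# and f = cos, whose exact value is exp(-1/2). Error shrinks like 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
exact = np.exp(-0.5)
for n in (100, 10_000, 1_000_000):
    s = rng.standard_normal(n)                 # samples from pi
    print(n, abs(np.cos(s).mean() - exact))    # error ~ 1 / sqrt(n)
```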
Like the Euler method for solving dynamical systems, however, a finite time step results in a residual error. This residual error must be analyzed and controlled, so that the distribution will evolve toward \(\Pi\), as we expect. To examine this, we calculate \(\Delta H := H (P (t + \Delta t), \Pi) - H (P (t), \Pi)\), where \(\Delta t\) is small but still finite, and check when it is negative (such that \(H (P (t), \Pi)\) monotonically decreases until \(P (t) \rightarrow \Pi\)).
By definition, we have
\(\displaystyle \Delta H = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t + \Delta t) \ln \frac{p (x, t + \Delta t)}{\pi (x)} - \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \ln \frac{p (x, t)}{\pi (x)} . \)
Inserting \(\int_{\mathcal{X}} \mathrm{d} x p (x, t + \Delta t) \ln (p (x, t) / \pi (x))\) (subtracting it and adding it back) gives
\(\displaystyle \begin{array}{rl} \Delta H = & \int_{\mathcal{X}} \mathrm{d} x\, p (x, t + \Delta t) \ln \dfrac{p (x, t + \Delta t)}{p (x, t)}\\ + & \int_{\mathcal{X}} \mathrm{d} x\, [p (x, t + \Delta t) - p (x, t)] \ln \dfrac{p (x, t)}{\pi (x)} . \end{array} \)
The first line is recognized as \(H (P (t + \Delta t), P (t))\), which is non-negative. Following the same steps as in section 2.5 (but using the discrete time master equation 2.5 instead, and the detailed balance condition 2.9 for the transition density), the second line reduces to
\(\displaystyle - \frac{1}{2} \int_{\mathcal{X}} \mathrm{d} x \int_{\mathcal{X}} \mathrm{d} y\, \pi (x)\, q_{\Delta t} (x \rightarrow y) \left( \frac{p (x, t)}{\pi (x)} - \frac{p (y, t)}{\pi (y)} \right) \left( \ln \frac{p (x, t)}{\pi (x)} - \ln \frac{p (y, t)}{\pi (y)} \right), \)
which is non-positive (suppose that \(r\) connects every two elements in \(\mathcal{X}\)). So, the sign of \(\Delta H\) is determined by which line has the greater absolute value. The first line depends only on the difference between \(P (t)\) and \(P (t + \Delta t)\), thus on \(\Delta t\), while the second line additionally depends on the difference between \(P (t)\) and \(\Pi\) (the factor \(q_{\Delta t} (x \rightarrow y)\) also depends on \(\Delta t\)). When \(\Delta t \rightarrow 0\), the first line vanishes, while the second does not until \(P (t) \rightarrow \Pi\). This suggests investigating how fast each term converges as \(\Delta t \rightarrow 0\).
To examine the speed of convergence, we calculate the leading order of \(\Delta t\) in each line. To make it clear, we denote the first line by \(\Delta H_1\) and the second line by \(\Delta H_2\). Taylor expanding \(\Delta H_1\) in \(\Delta t\) gives
\(\displaystyle \Delta H_1 = \frac{(\Delta t)^2}{2} \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \left( \frac{\partial}{\partial t} \ln p (x, t) \right)^2 + \omicron (\Delta t^2), \)
2.4. The first line is
\(\displaystyle \Delta H_1 = \int_{\mathcal{X}} \mathrm{d} x\, p (x, t + \Delta t) \ln \frac{p (x, t + \Delta t)}{p (x, t)} . \)
To Taylor expand the right hand side in \(\Delta t\), we expand \(p (x, t + \Delta t)\) to \(\omicron (\Delta t^2)\), as
\(\displaystyle p (x, t + \Delta t) = p (x, t) + \Delta t \frac{\partial p}{\partial t} (x, t) + \frac{(\Delta t)^2}{2} \frac{\partial^2 p}{\partial t^2} (x, t) + \omicron (\Delta t^2), \)
and the same for \(\ln p (x, t + \Delta t)\), as
\(\displaystyle \ln p (x, t + \Delta t) = \ln p (x, t) + \Delta t \frac{\partial \ln p}{\partial t} (x, t) + \frac{(\Delta t)^2}{2} \frac{\partial^2 \ln p}{\partial t^2} (x, t) + \omicron (\Delta t^2) . \)
Plugging in \((\mathrm{d} / \mathrm{d} x) \ln f (x) = f' (x) / f (x)\) and then \((\mathrm{d}^2 / \mathrm{d} x^2) \ln f (x) = f'' (x) / f (x) - (f' (x) / f (x))^2\), we find
\(\displaystyle \ln \frac{p (x, t + \Delta t)}{p (x, t)} = \Delta t \frac{\partial_t p}{p} (x, t) + \frac{(\Delta t)^2}{2} \left[ \frac{\partial_t^2 p}{p} (x, t) - \left( \frac{\partial_t p}{p} (x, t) \right)^2 \right] + \omicron (\Delta t^2) . \)
So, the \(\Delta t\) order term in \(\Delta H_1\) is
\(\displaystyle \Delta t \int_{\mathcal{X}} \mathrm{d} x\, \partial_t p (x, t) = 0, \)
where we used the normalization of \(p\). The \(\Delta t^2\) order term in \(\Delta H_1\) is
\(\displaystyle \frac{(\Delta t)^2}{2} \int_{\mathcal{X}} \mathrm{d} x \left[ \partial_t^2 p (x, t) + \frac{(\partial_t p (x, t))^2}{p (x, t)} \right] . \)
Using the normalization of \(p\) as before (so that \(\int_{\mathcal{X}} \mathrm{d} x \partial_t^2 p (x, t) = 0\)), it is reduced to
\(\displaystyle \frac{(\Delta t)^2}{2} \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \left( \frac{\partial}{\partial t} \ln p (x, t) \right)^2 . \)
Altogether, we arrive at
\(\displaystyle \Delta H_1 = \frac{(\Delta t)^2}{2} \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \left( \frac{\partial}{\partial t} \ln p (x, t) \right)^2 + \omicron (\Delta t^2) . \)
where, by the master equation 2.4, \((\partial / \partial t) \ln p (x, t) = \int_{\mathcal{X}} \mathrm{d} w p (w, t) r (w, x) / p (x, t)\). For \(\Delta H_2\), we insert equation 2.11 after Taylor expanding \(q_{\Delta t}\) in \(\Delta t\), and obtain
\(\displaystyle \Delta H_2 = \Delta t \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) + \omicron (\Delta t) . \)
We find \(\Delta H_1\) converges with speed \(\Delta t^2\) while \(\Delta H_2\) has speed \(\Delta t\).
Thus, given \(P (t) \neq \Pi\) (so that \(\Delta H_2 \neq 0\), recall section 2.5), there must be a \(\delta > 0\) such that for any \(\Delta t < \delta\), we have \(| \Delta H_1 | < | \Delta H_2 |\), in which case \(\Delta H = \Delta H_1 + \Delta H_2 < 0\) (recall that \(\Delta H_1 \geqslant 0\) and \(\Delta H_2 \leqslant 0\)). At leading order, the \(\delta\) is bounded by
\(\displaystyle \delta \approx \left| \frac{\mathrm{d}}{\mathrm{d} t} H (P (t), \Pi) \right| \times \left[ \frac{1}{2} \int_{\mathcal{X}} \mathrm{d} x\, p (x, t) \left( \frac{\partial}{\partial t} \ln p (x, t) \right)^2 \right]^{- 1} . \)
This bound is proportional to the difference between \(P (t)\) and \(\Pi\) (namely, the first factor). When \(P (t)\) has approached \(\Pi\) (that is, \(P (t) \approx \Pi\) but not exactly equal), \(\delta\) has to be extremely small. (This is a little like supervised machine learning, where \(\Delta t\) acts as the learning rate and \(H (P (t), \Pi)\) as the loss. In the early stage of training, the loss has a greater slope of decrease, and we can safely employ a relatively large learning rate to speed up the training. But we have to tune the learning rate to be smaller and smaller during the training, as the slope of the loss gradually decreases. Otherwise, the loss will not decrease but keep fluctuating once it has become sufficiently small, since the learning rate is now relatively too big.)
The Metropolis-Hastings algorithm is a simple method that constructs a transition rate for any given stationary distribution such that the detailed balance condition holds. Explicitly, given the density function \(\pi\) of a stationary distribution \(\Pi\), and an auxiliary transition rate \(\gamma\) (ensuring that \(\gamma (x, y) > 0\) for each \(x\) and \(y\) in the alphabet \(\mathcal{X}\) with \(x \neq y\)), the transition rate \(r\) is constructed by
\(\displaystyle r (x, y) = \gamma (x, y) \min \left( 1, \frac{\pi (y) \gamma (y, x)}{\pi (x) \gamma (x, y)} \right) . \) | (2.12) |
This transition rate connects every two elements in \(\mathcal{X}\) (since \(\gamma (x, y) > 0\) for each \(x \neq y\)). In addition, together with \(\pi\), it satisfies the detailed balance condition 2.8. By equation 2.12, we have
\(\displaystyle \pi (x)\, r (x, y) = \pi (x)\, \gamma (x, y) \min \left( 1, \frac{\pi (y) \gamma (y, x)}{\pi (x) \gamma (x, y)} \right) = \min \left( \pi (x)\, \gamma (x, y), \pi (y)\, \gamma (y, x) \right) . \)
We extract the second argument out of the \(\min\) function, making it
\(\displaystyle \pi (x)\, r (x, y) = \pi (y)\, \gamma (y, x) \min \left( \frac{\pi (x) \gamma (x, y)}{\pi (y) \gamma (y, x)}, 1 \right) . \)
The right hand side is recognized as \(\pi (y) r (y, x)\), implying that the detailed balance condition 2.8 holds for the \(r\) constructed by equation 2.12. Theorem 2.1 then states that, evolved by the master equation 2.4, any initial distribution will finally relax to the stationary distribution \(\Pi\).
The Metropolis-Hastings algorithm was first proposed by Nicholas Metropolis and others in 1953 at Los Alamos, and then improved by the Canadian statistician Wilfred Hastings in 1970. The algorithm was first defined for a transition density. Given a positive auxiliary transition density \(g\), the transition density is defined as
\(\displaystyle q (x \rightarrow y) := g (x \rightarrow y) \min \left( 1, \frac{\pi (y) g (y \rightarrow x)}{\pi (x) g (x \rightarrow y)} \right), \) | (2.13) |
where \(g\) is positive on \(\mathcal{X}\). Notice that in equation 2.13 there is no extra time parameter like that of the \(q_{\Delta t} (x \rightarrow y)\) in section 2.2. It can be seen as having a fixed time interval, so it can be used only in the discrete time master equation.
This definition has an intuitive and practical explanation. The two factors can be seen as two conditional probabilities. The factor \(g (x \rightarrow y)\) first proposes a transition from \(x\) to \(y\). (In numerical simulation, we have to ensure that the computer has a sampler for sampling a \(y\) from the conditional density function \(g (x \rightarrow y)\).) Then, this proposal is accepted with the Bernoulli probability given by the \(\min\) factor on the right hand side. If accepted, we transit to \(y\); otherwise, we stay at \(x\). Altogether, we get a conditional probability of jumping from \(x\) to \(y\), the \(q (x \rightarrow y)\).
It is straightforward to check that if, in addition, \(g\) smoothly depends on a parameter \(\Delta t\) as \(g_{\Delta t}\), then so does \(q\) as \(q_{\Delta t}\); and if we Taylor expand \(g_{\Delta t}\) at \(\Delta t \rightarrow 0\) as \(g_{\Delta t} (x \rightarrow y) = \delta (x - y) + \gamma (x, y) \Delta t + \omicron (\Delta t)\), then we will find \(q_{\Delta t} (x \rightarrow y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\), with \(r\) given by equation 2.12. Indeed, when \(x = y\), we have \(q_{\Delta t} (x \rightarrow x) = g_{\Delta t} (x \rightarrow x)\). And when \(x \neq y\), \(\delta (x - y) = 0\), and we find
\(\displaystyle q_{\Delta t} (x \rightarrow y) = \gamma (x, y) \min \left( 1, \frac{\pi (y) \gamma (y, x)}{\pi (x) \gamma (x, y)} \right) \Delta t + \omicron (\Delta t) = r (x, y)\, \Delta t + \omicron (\Delta t) . \)
Altogether, for each \(x, y \in \mathcal{X}\), we find \(q_{\Delta t} (x \rightarrow y) = \delta (x - y) + r (x, y) \Delta t + \omicron (\Delta t)\). In practice, we use the Metropolis-Hastings algorithm 2.13 to numerically simulate the master equation 2.4. But, based on the discussion in section 2.6, the \(\Delta t\) in \(g_{\Delta t}\) shall be bounded to be properly small, so as to ensure the relaxation \(P (t) \rightarrow \Pi\).
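A minimal sketch of the algorithm of equation 2.13 (our own illustration; we take a symmetric Gaussian proposal \(g\), so the \(g\)-ratio cancels and the acceptance probability reduces to \(\min (1, \pi (y) / \pi (x))\); note that \(\pi\) need only be known up to a normalization constant):

```python
# Metropolis-Hastings in the sense of eq. 2.13, with a symmetric Gaussian
# proposal so the acceptance ratio reduces to pi(y)/pi(x) (toy target).
import numpy as np

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * (x - 3.0) ** 2       # target N(3, 1), unnormalized

x, chain = 0.0, []
for _ in range(100_000):
    y = x + 0.5 * rng.standard_normal()        # propose from g(x -> y)
    if np.log(rng.random()) < log_pi(y) - log_pi(x):
        x = y                                  # accept: transit to y
    chain.append(x)                            # on rejection: stay at x

samples = np.array(chain[10_000:])             # discard burn-in
print(samples.mean(), samples.std())           # should be near 3.0 and 1.0
```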
We follow the discussion in chapter 2, but focus on the specific situation where there is extra smooth structure on \(X\). This smoothness is reflected in the connectivity of the alphabet \(\mathcal{X}\), and in the smooth “spatial” dependence of the density function and transition rate. Accordingly, the conclusions in chapter 2 hold in this chapter, but the converse is not guaranteed.
We follow the conventions in chapter 2. In addition, throughout this chapter, we set the alphabet to be Euclidean so as to get sufficient connectivity. Namely, \(\mathcal{X}=\mathbb{R}^d\) for some integer \(d \geqslant 1\).
We employ Einstein's convention. That is, we omit the sum notation for duplicated indices as long as they are “balanced”. For example, \(x_{\alpha} y^{\alpha}\) represents \(\sum_{\alpha} x_{\alpha} y^{\alpha}\). The \(\alpha\) appears twice in the expression, once in subscript (the \(x_{\alpha}\)) and once in superscript (the \(y^{\alpha}\)), for which we say the indices are balanced. An expression like \(x_{\alpha} y_{\alpha}\), however, does not represent a summation over \(\alpha\), because the indices are not balanced (both are subscripts). A more complicated example is \(\partial_{\alpha} A^{\alpha}_{\beta} x^{\beta}\), which means \(\sum_{\alpha} \sum_{\beta} \partial_{\alpha} A^{\alpha}_{\beta} x^{\beta}\). Einstein's convention is extremely helpful when taking Taylor expansions.
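Incidentally, Einstein's convention maps directly onto NumPy's einsum, which may help readers who compute (an aside of our own; einsum does not distinguish upper from lower indices):

```python
# Einstein's convention, spelled out with numpy.einsum: repeated,
# balanced indices are summed over.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(3), rng.random(3)
A = rng.random((3, 3))

s = np.einsum('a,a->', x, y)        # x_a y^a  = sum_a x_a y_a
v = np.einsum('ab,b->a', A, x)      # A^a_b x^b, one free index a
t = np.einsum('a,ab,b->', x, A, y)  # x_a A^a_b y^b, all indices summed
```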
For spatial smoothness, we assume that the density function \(p (x, t)\) of the time-dependent distribution \(P (t)\) and the transition rate \(r (x, y)\) are smooth in \(x\) and \(y\). Besides the smoothness of functions, however, spatial smoothness demands more. To state this, we consider the situation where the mass of \(P (t)\) is initially centered at \(x\), namely \(p (x', 0) = \delta (x - x')\). Then, after a small temporal period \(\Delta t\), some portion of the mass transits elsewhere. By the master equation 2.4, the change in density is
\(\displaystyle p (x', \Delta t) - p (x', 0) = \Delta t \int_{\mathbb{R}^d} \mathrm{d} y\, p (y, 0)\, r (y, x') + \omicron (\Delta t) . \)
Inserting the initial condition \(p (x', 0) = \delta (x - x')\) and denoting \(\epsilon := x' - x\), we get
\(\displaystyle p (x + \epsilon, \Delta t) = \delta (\epsilon) + r (x, x + \epsilon)\, \Delta t + \omicron (\Delta t) . \)
We assume that the portion of mass does not transit far away from \(x\), but in its neighbor, namely \(\epsilon\) is really small in scale. Quantitatively, the scale is reflected by moments, where the \(n\)th-moment is defined by
\(\displaystyle M^{\alpha_1 \cdots \alpha_n}_n (x, \Delta t) := \int_{\mathbb{R}^d} \mathrm{d} \epsilon p (x + \epsilon, \Delta t) (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}) .\) | (3.1) |
Thus, the assumption becomes:
\(M^{\alpha_1 \cdots \alpha_n}_n (x, 0) = 0\), and
as \(\Delta t\) tends to zero, \(M_{n + m} (x, \Delta t)\) converges (to zero) faster than \(M_n (x, \Delta t)\) for any \(m \geqslant 1\).
For the second condition, we should expect that \((\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n})\) becomes much smaller when multiplied by more small (random) variables.
Plugging in the expression of \(p (x + \epsilon, \Delta t)\), we find
\(\displaystyle M^{\alpha_1 \cdots \alpha_n}_n (x, \Delta t) = \Delta t \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x, x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}) + \omicron (\Delta t) \)
for \(n \geqslant 1\), since the \(\delta (\epsilon)\) term contributes nothing. Then, introducing (for distinguishing from the moments, which are defined by the density function, we use \(K\) instead of \(M\) for denoting the “moments of transition rate”)
\(\displaystyle K^{\alpha_1 \cdots \alpha_n}_n (x) := \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x, x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}), \)
we have
\(\displaystyle M^{\alpha_1 \cdots \alpha_n}_n (x, \Delta t) = K^{\alpha_1 \cdots \alpha_n}_n (x)\, \Delta t + \omicron (\Delta t) . \)
So the first condition simply implies
\(\displaystyle \sup_{x \in \mathbb{R}^d} | K_n (x) | < + \infty . \) | (3.2) |
The second condition is non-trivial. We are to show that it indicates a cut-off. That is, there shall exist a positive integer \(N_{\operatorname{cut}}\) such that \(K_n = 0\) for any \(n > N_{\operatorname{cut}}\). And we will find the estimate
\(\displaystyle M_n (x, \Delta t) = \mathcal{O} (\Delta t^{\sharp (n / N_{\operatorname{cut}})}), \) | (3.3) |
where \(\sharp\) is the ceiling function (which rounds its argument up to the nearest integer). To obtain equation 3.3, we have to demand that (this will be justified later)
\(\displaystyle \sup_{x \in \mathbb{R}^d} | \partial_{\alpha_1} \cdots \partial_{\alpha_m} K_n (x) | < + \infty \) | (3.4) |
holds for any indices \((\alpha_1, \ldots, \alpha_m)\). Together with equation 3.2, we demand that the moments of the transition rate and their derivatives are uniformly bounded on \(\mathbb{R}^d\). We devote the rest of this section to proving equation 3.3.
As an example for exploration, we first cut off at \(N_{\operatorname{cut}} = 2\), namely \(K_n = 0\) for any \(n > 2\). We are to calculate \(\mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}]\) up to \(\omicron (\Delta t^2)\). This demands the relation between transition rate and transition density (equation 2.6),
\(\displaystyle q_{\Delta t} (x \rightarrow y) = \delta (x - y) + \Delta t\, r (x, y) + \frac{(\Delta t)^2}{2} \int_{\mathbb{R}^d} \mathrm{d} x_1\, r (x, x_1)\, r (x_1, y) + \mathcal{O} (\Delta t^3) . \)
Starting at \(p (y, 0) = \delta (x - y)\), the master equation gives
\(\displaystyle \mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}] = \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, q_{\Delta t} (x \rightarrow x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}) . \)
Inserting the expansion of \(q_{\Delta t}\) makes it
\(\displaystyle \mathbb{E} [\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}] = \Delta t\, K^{\alpha_1 \cdots \alpha_n}_n (x) + \frac{(\Delta t)^2}{2} \int_{\mathbb{R}^d} \mathrm{d} \epsilon \int_{\mathbb{R}^d} \mathrm{d} x_1\, r (x, x_1)\, r (x_1, x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}) + \mathcal{O} (\Delta t^3) . \)
The first term is \(\Delta t K_n^{\alpha_1 \cdots \alpha_n} (x)\), so it is \(\Delta t K_1 (x)\) and \(\Delta t K_2 (x)\) for \(n = 1, 2\) respectively, and vanishes otherwise. In the rest, we focus on the second term, denoting the integral as
\(\displaystyle F_n^{\alpha_1 \cdots \alpha_n} (x) := \int_{\mathbb{R}^d} \mathrm{d} \epsilon \int_{\mathbb{R}^d} \mathrm{d} x_1\, r (x, x_1)\, r (x_1, x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}), \)
thus the second term is \((\Delta t^2 / 2) F_n^{\alpha_1 \cdots \alpha_n} (x)\). First notice that the integral has the identity (by translating the integration variable, \(x_1 = x + y\))
\(\displaystyle \int_{\mathbb{R}^d} \mathrm{d} x_1\, r (x, x_1)\, f (x_1) = \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y)\, f (x + y) \)
for any function \(f\). Thus
\(\displaystyle F_n^{\alpha_1 \cdots \alpha_n} (x) = \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y) \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x + y, x + \epsilon)\, (\epsilon^{\alpha_1} \cdots \epsilon^{\alpha_n}) . \)
We explore the calculation of \(F_n\) by evaluating \(F_1 (x)\). By inserting an \((\epsilon - y)\) term, we get
\(\displaystyle F_1^{\alpha_1} (x) = \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y) \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x + y, x + \epsilon)\, (\epsilon - y)^{\alpha_1} + \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y)\, y^{\alpha_1} \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x + y, x + \epsilon) . \)
While integrating over \(\epsilon\), the first line gives \(\int \mathrm{d} y r (x, x + y) K_1^{\alpha_1} (x + y)\), and the second vanishes because of the property \(\int \mathrm{d} y r (x, y) = 0\). So, we find
\(\displaystyle F_1^{\alpha_1} (x) = \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y)\, K_1^{\alpha_1} (x + y) . \)
Taylor expansion of \(K_1\) at \(x\) gives
\(\displaystyle F_1^{\alpha_1} (x) = \int_{\mathbb{R}^d} \mathrm{d} y\, r (x, x + y) \left[ K_1^{\alpha_1} (x) + y^{\beta_1} \partial_{\beta_1} K_1^{\alpha_1} (x) + \frac{1}{2!} y^{\beta_1} y^{\beta_2} \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha_1} (x) + \cdots \right] . \)
The first term on the right hand side vanishes (again by \(\int \mathrm{d} y r (x, y) = 0\)). The second term is \(K_1^{\beta_1} (x) \partial_{\beta_1} K_1^{\alpha_1} (x)\), and the third is \((1 / 2!) K_2^{\beta_1 \beta_2} (x) \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha_1} (x)\). The remaining terms all vanish because \(K_n = 0\) for any \(n > 2\). So, we find
\(\displaystyle F_1^{\alpha_1} (x) = K_1^{\beta_1} (x) \partial_{\beta_1} K_1^{\alpha_1} (x) + \frac{1}{2} K_2^{\beta_1 \beta_2} (x) \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha_1} (x) . \) | (3.5) |
Following the same process, we can obtain for \(F_2 (x)\),
3.1. Since
we have
Again, by integrating over \(\epsilon\), the first line on the right hand side is \(\int_{\mathbb{R}^d} \mathrm{d} y r (x, x + y) K_2^{\alpha_1 \alpha_2} (x + y)\) and the last line vanishes. The second line is \(\int_{\mathbb{R}^d} \mathrm{d} y r (x, x + y) K_1^{\alpha} (x + y) y^{\beta} +\operatorname{perm}\). Thus,
Then, again, Taylor expansion of \(K\)s at \(x\) gives
and
where we have used \(K_n = 0\) for any \(n > 2\) to cut the Taylor series. So, we find
where \(\operatorname{perm} (\alpha_1, \alpha_2)\) permutes \(\alpha_1\) and \(\alpha_2\) in any term that is not symmetric (which is the fourth term on the right hand side). Then for \(F_3 (x)\), we have
3.2. Since
we have
Again, using \(K_n = 0\) for any \(n > 2\), we get
Taylor expansion of \(K\)s at \(x\) gives
And for \(F_4 (x)\),
When \(n > 4\), \(F_n = 0\), because \(K_n = 0\) for \(n > 2\). So, the \(F_n\) series terminates at \(n = 4\).
The same goes for higher-order expansions of \(q_{\Delta t}\) in \(\Delta t\) (equation 2.6). As an example, we examine the expansion at \(\mathcal{O} (\Delta t^3)\), as
\(\displaystyle q_{\Delta t} (x \rightarrow y) = \cdots + \frac{(\Delta t)^3}{3!} \int_{\mathbb{R}^d} \mathrm{d} x_1 \int_{\mathbb{R}^d} \mathrm{d} x_2\, r (x, x_1)\, r (x_1, x_2)\, r (x_2, y), \)
where we have omitted the \(\mathcal{O} (\Delta t^2)\) terms. Following the same derivation, we find it contributes to \(\mathbb{E} [\epsilon^{\alpha}]\) by the term \((\Delta t^3 / 3!) G_1^{\alpha} (x)\), where
\(\displaystyle G_1^{\alpha} (x) := \int_{\mathbb{R}^d} \mathrm{d} y \int_{\mathbb{R}^d} \mathrm{d} y' \int_{\mathbb{R}^d} \mathrm{d} \epsilon\, r (x, x + y)\, r (x + y, x + y')\, r (x + y', x + \epsilon)\, \epsilon^{\alpha} . \)
We insert an \((\epsilon - y')\) term again, and get
\(\displaystyle G_1^{\alpha} (x) = \int \mathrm{d} y \int \mathrm{d} y'\, r (x, x + y)\, r (x + y, x + y') \left[ \int \mathrm{d} \epsilon\, r (x + y', x + \epsilon)\, (\epsilon - y')^{\alpha} + y'^{\alpha} \int \mathrm{d} \epsilon\, r (x + y', x + \epsilon) \right] . \)
The second term in the square brackets vanishes after integrating over \(\epsilon\), because \(\int \mathrm{d} y\, r (x, y) = 0\). The first term, after integrating over \(\epsilon\), gives
\(\displaystyle G_1^{\alpha} (x) = \int \mathrm{d} y \int \mathrm{d} y'\, r (x, x + y)\, r (x + y, x + y')\, K_1^{\alpha} (x + y') . \)
Again, we Taylor expand \(K_1\) at \(x\), resulting in
\(\displaystyle G_1^{\alpha} (x) = \int \mathrm{d} y \int \mathrm{d} y'\, r (x, x + y)\, r (x + y, x + y') \left[ K_1^{\alpha} (x) + y'^{\beta_1} \partial_{\beta_1} K_1^{\alpha} (x) + \frac{1}{2!} y'^{\beta_1} y'^{\beta_2} \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha} (x) + \cdots \right] . \)
By integrating over \(y'\), we find the first term vanishes, again because \(\int \mathrm{d} y\, r (x, y) = 0\). We recognize that the coefficient of \(\partial_{\beta} K_1^{\alpha} (x)\) is just the \(F^{\beta}_1 (x)\), and the coefficient of \((1 / 2!) \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha} (x)\) is just the \(F_2^{\beta_1 \beta_2} (x)\). Since \(F_n = 0\) for any \(n > 4\), the series terminates at \(F_4\). Thus, we arrive at
\(\displaystyle \begin{array}{ll} G^{\alpha}_1 (x) = & F_1^{\beta_1} (x) \partial_{\beta_1} K_1^{\alpha} (x) + \frac{1}{2!} F_2^{\beta_1 \beta_2} (x) \partial_{\beta_1} \partial_{\beta_2} K_1^{\alpha} (x) +\\ & \frac{1}{3!} F_3^{\beta_1 \beta_2 \beta_3} (x) \partial_{\beta_1} \partial_{\beta_2} \partial_{\beta_3} K_1^{\alpha} (x) + \frac{1}{4!} F_4^{\beta_1 \beta_2 \beta_3 \beta_4} (x) \partial_{\beta_1} \partial_{\beta_2} \partial_{\beta_3} \partial_{\beta_4} K_1^{\alpha} (x) . \end{array} \) | (3.6) |
Comparing with \(F_1\) (equation 3.5), we find the first line is the same as \(F_1\) except that the \(K_n\)s (rather than the \(\partial K_n\)s) are replaced by \(F_n\). Since \(F_n\) terminates at \(n = 4\) (while \(K_n\) terminates at \(n = 2\)), there are more terms in \(G_1\) than in \(F_1\), namely the second line.
So, the problem for the higher-order expansion of \(q_{\Delta t}\) is reduced to the one we have solved at lower order. It means that we can calculate to arbitrary order of expansion iteratively, and the process is the same at each order.
Observing these results, we find the following rules (with explanations).
The superscripts are assigned to two \(K\)s together with partial derivatives, ensuring that the extra indices (such as \(\beta_1\)) are all summed over (namely, contracted).
The reason why there are two \(K\)s in \(F_n\) is that there are two \(r\)s in the expansion of \(q_{\Delta t}\) at \(\Delta t^2\) order.
For each \(n\)th order partial derivative, multiply it by a factor \(1 / n!\)
Because partial derivative comes from Taylor expansion, and \(1 / n!\) is the coefficient.
Non-symmetric assignments have to be permuted.
For example, the \(K_2^{\alpha_1 \beta_1} (x) \partial_{\beta_1} K_1^{\alpha_2} (x)\) term in \(F_2^{\alpha_1 \alpha_2} (x)\) is not symmetric on \(\alpha_1\) and \(\alpha_2\), thus has to be permuted in the \(\operatorname{perm} (\alpha_1, \alpha_2)\).
Symmetric assignments appear only once.
Also in \(F_2^{\alpha_1 \alpha_2} (x)\), the \(K_1^{\alpha_1} (x) K_1^{\alpha_2} (x)\) term is symmetric on \(\alpha_1\) and \(\alpha_2\), so it appears only once, being absent in the \(\operatorname{perm} (\alpha_1, \alpha_2)\).
We add all the possible assignments together as the final result.
These rules can be directly generalized to \(\mathcal{O} (\Delta t^m)\) in the expansion of \(q_{\Delta t}\) by \(r\), in which there can be \(m\) \(K\)s.
We have found that both \(M_3 (x, \Delta t)\) and \(M_4 (x, \Delta t)\) are of \(\mathcal{O} (\Delta t^2)\), since both \(F_3 (x)\) and \(F_4 (x)\) are non-zero. But \(F_5 (x)\) must vanish since we cannot get five superscripts with only two \(K_n\)s with \(n = 1, 2\). This further implies that any \(F_n\) with \(n > 4\) is zero, leading \(M_n (x, \Delta t)\) to be \(\omicron (\Delta t^2)\). But at higher (than 2) order of \(\Delta t\), there will be more (than two) \(K\)s in \(M_n (x, \Delta t)\). Then, based on the rules we just found, the number of combinations of indices will be greater (than \(4\)). These combinations, however, will always be “exhausted” when \(n\) becomes sufficiently large. That is, there are finitely many non-vanishing \(M_n (x, \Delta t)\)s at any given order of \(\Delta t\). This is just the formal expression of the second condition we assumed. So, the assumption is guaranteed by the cut-off; and conversely, only with a cut-off can we guarantee the assumption. This can be directly generalized to a cut-off at any positive integer \(N_{\operatorname{cut}}\), namely \(K_n = 0\) for any \(n > N_{\operatorname{cut}}\). In this situation, we have (recall that \(\sharp\) is the ceiling function) \(M_n (x, \Delta t) = \mathcal{O} (\Delta t^{\sharp (n / N_{\operatorname{cut}})})\).
Besides, there is an additional requirement on \(K_n\). During the calculation, we have employed the condition that the partial derivatives of \(K_n\) are well-defined. That is, for any indices \((\alpha_1, \ldots, \alpha_m)\),
\(\displaystyle \sup_{x \in \mathbb{R}^d} | \partial_{\alpha_1} \cdots \partial_{\alpha_m} K_n (x) | < + \infty . \) | (3.7) |
Together with equation 3.2, we conclude that the moments of transition rate and their derivatives are uniformly bounded on \(\mathbb{R}^d\).
Since all \(K\)s are finite, and we only have finitely many \(K\)s, we can relate the \(K\)s to the transition rate \(r\) explicitly. To do so, we first introduce an arbitrary test function \(\varphi : \mathbb{R}^d \rightarrow \mathbb{R}\) in Schwartz space \(S (\mathbb{R}^d)\), the space of smooth functions from \(\mathbb{R}^d\) to \(\mathbb{R}\) that rapidly tend to zero in the region far from the origin.
3.3. Precisely, a function \(\varphi \in S (\mathbb{R}^d)\) is infinitely differentiable (namely \(\varphi \in C^{\infty} (\mathbb{R}^d)\)) and satisfies the condition
for any indices \((\alpha_1, \ldots, \alpha_m)\) and \((\beta_1, \ldots, \beta_n)\). As \(\| x \| \rightarrow + \infty\), \(\varphi\) falls faster than any polynomial (namely, the \(x^{\alpha_1} \cdots x^{\alpha_m}\)), and so do its partial derivatives.
Because of the normalization of transition density, we have \(\int_{\mathbb{R}^d} \mathrm{d} \epsilon r (x, x + \epsilon) = 0\), thus
where we have inserted the definition of \(K\)s and that \(K_n = 0\) for any \(n > N_{\operatorname{cut}}\) (section 3.2). Then, because of the identity
integration by parts on the right hand side gives
Plugging this back, we find
Since \(\varphi\) is arbitrary, we finally arrive at
\(\displaystyle r (x, x + \epsilon) = \sum_{n = 1}^{N_{\operatorname{cut}}} \frac{(- 1)^n}{n!} K_n^{\alpha_1 \cdots \alpha_n} (x) (\partial_{\alpha_1} \cdots \partial_{\alpha_n} \delta) (\epsilon) . \) | (3.8) |
This is called the Kramers–Moyal expansion. It indicates that the transition rate is determined by its moments \(K_n\). The transition rate has two continuous variables (namely, the \(x\) and \(y\) in \(r (x, y)\)), but each \(K_n\) has only one variable, and there are only a finite number of \(K_n\)s (section 3.2). So, spatial smoothness greatly reduces the degrees of freedom in a stochastic process.
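To make this formula concrete, here is a small sympy sketch (one-dimensional, \(N_{\operatorname{cut}} = 2\), with hypothetical symbols K1 and K2 standing for the moments at a fixed \(x\)). It checks that the expansion 3.8, applied to the test monomials \(\epsilon\), \(\epsilon^2\), and \(\epsilon^3\), returns \(K_1\), \(K_2\), and \(0\) respectively, as the definition of the moments requires.

```python
import sympy as sp

eps = sp.symbols('epsilon')
K1, K2 = sp.symbols('K1 K2')  # hypothetical moments at a fixed point x

def apply_r(phi):
    """Evaluate the integral of r(x, x+eps) phi(eps) using (3.8) with N_cut = 2.

    The n-th derivative of the delta acts on a test function by
    (d^n delta)[phi] = (-1)^n phi^(n)(0), which cancels the (-1)^n in (3.8).
    """
    return sum(K * sp.diff(phi, eps, n).subs(eps, 0) / sp.factorial(n)
               for n, K in [(1, K1), (2, K2)])

print(apply_r(eps))      # -> K1, the first moment
print(apply_r(eps**2))   # -> K2, the second moment
print(apply_r(eps**3))   # -> 0, consistent with the cut-off
```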
Because of the Dirac delta functions, this transition rate is a generalized function. That is, it can be evaluated only when applied to a test function. For example, to evaluate \(\partial_{\alpha} \delta (- x)\), we have to employ an arbitrary test function \(\varphi \in S (\mathbb{R}^d)\), and calculate \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi (x)\). First, notice that \(\partial_{\alpha} \delta (- x)\) is in fact \((\partial_{\alpha} \delta) (- x)\) and that \((\partial \delta / \partial x^{\alpha}) (- x) = - (\partial / \partial x^{\alpha}) \delta (- x)\), thus
Then, integration by parts gives \(- \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} [\delta (- x)] \varphi (x) = \int_{\mathbb{R}^d} \mathrm{d} x \delta (- x) \partial_{\alpha} \varphi (x)\). After inserting the relation \(\delta (x) = \delta (- x)\), we arrive at \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi (x) = \partial_{\alpha} \varphi (0)\). On the other hand, we have, by integration by parts, \(- \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (x) \varphi (x) = \int_{\mathbb{R}^d} \mathrm{d} x \delta (x) \partial_{\alpha} \varphi (x) = \partial_{\alpha} \varphi (0)\). Altogether, we find \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (- x) \varphi (x) = - \int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \delta (x) \varphi (x)\), for any \(\varphi \in S (\mathbb{R}^d)\). Thus, \(\partial_{\alpha} \delta (- x)\) is evaluated to be \(- \partial_{\alpha} \delta (x)\). That is, \(\partial_{\alpha} \delta\) is odd. Following the same process, we can show that \(\partial_{\alpha} \partial_{\beta} \delta\) is even.
3.4. We are to calculate \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \partial_{\beta} \delta (- x) f (x)\), where \(f \in S (\mathbb{R}^d)\). Again, noticing that \((\partial_{\alpha} \partial_{\beta} \delta) (- x) = \partial_{\alpha} \partial_{\beta} [\delta (- x)]\), we have
Then integration by parts gives
That is, \(\int_{\mathbb{R}^d} \mathrm{d} x \partial_{\alpha} \partial_{\beta} \delta (- x) f (x) = \partial_{\alpha} \partial_{\beta} f (0)\). On the other hand, we have
So,
holds for any \(f \in S (\mathbb{R}^d)\), thus \(\partial_{\alpha} \partial_{\beta} \delta (- x) = \partial_{\alpha} \partial_{\beta} \delta (x)\).
If we plug expansion 3.8 into the master equation, then we get
Noticing that \((\partial / \partial w^{\alpha_1}) \cdots (\partial / \partial w^{\alpha_n}) \delta (x - w) = (- 1)^n (\partial_{\alpha_1} \cdots \partial_{\alpha_n} \delta) (x - w)\), we get
If \(p (\cdot, t) \in S (\mathbb{R}^d)\), then we have \(p (x, t) \rightarrow 0\) as \(\| x \| \rightarrow + \infty\). It means we can take integration by parts on the right hand side as (boundary terms vanish)
Then, integrating over \(w\) gives
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = \sum_{n = 1}^{N_{\operatorname{cut}}} \frac{(- 1)^n}{n!} \left( \frac{\partial}{\partial x^{\alpha_1}} \cdots \frac{\partial}{\partial x^{\alpha_n}} \right) [K_n^{\alpha_1 \cdots \alpha_n} (x) p (x, t)] . \) | (3.9) |
This is the form of the Kramers–Moyal expansion that appears in many textbooks.
If \(p (\cdot, t) \in S (\mathbb{R}^d)\), then \(p (\cdot, t)\) as well as its partial derivatives rapidly tend to zero in the region far from origin. Together with condition 3.7 (\(K_n\) and its partial derivatives are bounded), we can show that \((\partial p / \partial t) (\cdot, t) \in S (\mathbb{R}^d)\) too. It then implies that \(p (\cdot, t') \in S (\mathbb{R}^d)\) for any \(t' > t\). That is, if a time-dependent density function \(p (\cdot, t)\) sits in Schwartz space initially, then it stays in Schwartz space during the evolution.
In section 3.2, we have found that a finite cut-off on the moments of transition rate is essential for spatial smoothness. We are to show that a cut-off at \(N_{\operatorname{cut}} = 1\) (namely \(K_n = 0\) for any \(n > 1\)) only results in a deterministic evolution. With \(N_{\operatorname{cut}} = 1\), equation 3.9 reduces to (re-denoting \(K_1\) by \(f\) for simplicity)
This is the continuity equation or transport equation. It has been used for describing the evolution of the density of incompressible liquids. We are to solve this equation explicitly. Since it is a first order partial differential equation, we can use the method of characteristics. At the first step, we fully expand the equation, as
The next step is constructing a parameterized curve \((x (s), t (s))\) for \(s \in [0, + \infty)\) called characteristic, obeying
and
It has solution \(t (s) = s + t (0)\). If we set \(t (0) = 0\), then we have \(t = s\) and
from which we solve \(x (t)\), leading to
which is the left hand side of the original equation along the characteristic. Thus, by equating it to the right hand side of the original equation, we get
Now, the original partial differential equation is converted into an ordinary differential equation. It has the unique solution
It indicates that the density at \(x (0)\) will “transport” along the curve \(x (t)\) as time evolves. For example, consider \(p (y, 0) = \delta (y - x (0))\), that is, all mass is concentrated at \(x (0)\). Then \(p (x, t)\) will have density only at \(x (t)\), and \(p (y, t) = 0\) for any \(y \neq x (t)\), since \(y\) is traced back to \(y (0)\) along the characteristic \(y (t)\) and \(y (0) \neq x (0)\). That is to say, the transport equation is deterministic.
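A minimal numerical sketch of this transport picture, with a hypothetical drift \(f (x) = - x\): integrating the characteristic \(\dot{x} = f (x)\) carries every initial point deterministically, with no spreading at all.

```python
def characteristic(x0, f, t_final, dt=1e-3):
    """Integrate dx/dt = f(x) by the Euler method, starting from x(0) = x0."""
    x = x0
    for _ in range(int(t_final / dt)):
        x = x + f(x) * dt
    return x

f = lambda x: -x
print(characteristic(1.0, f, t_final=2.0))   # close to exp(-2); no randomness
```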
In section 3.4, we have analyzed the cut-off at \(N_{\operatorname{cut}} = 1\) and found it deterministic, thus not a stochastic process. It indicates that we have to cut off at least at \(N_{\operatorname{cut}} = 2\). We are to show that, if \(K_2\) as a matrix-valued field is positive definite, then the randomness of the Markovian process is guaranteed.
We examine this by using an example. Let \(K_1\) be everywhere vanishing and \(K_2\) be a constant identity matrix. Then, equation 3.9 reduces to
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = \frac{1}{2} \delta^{\alpha \beta} \partial_{\alpha} \partial_{\beta} p (x, t) . \) | (3.10) |
This equation is the famous heat equation or diffusion equation, first investigated by the French mathematician Joseph Fourier in 1822 for modeling how heat diffuses. His method is now called the Fourier transformation.
Define Fourier transformation of \(p (x, t)\) on \(x\) as
It has an inverse transformation
This relation holds because, by plugging \(\hat{p} (k, t)\) into the right hand side, we find it
Integrating over \(k\) and using the relation
3.5. This relation is the Fourier representation of the Dirac delta function: integrating the left hand side against a test function \(\varphi (y)\) and exchanging the order of integration yields the Fourier transformation of \(\varphi\) followed by its inverse, which returns \(\varphi (x)\), exactly as \(\delta (x - y)\) does.
we find the right hand side \(\int \mathrm{d} y \delta (x - y) p (y, t) = p (x, t)\) which is the left hand side. By plugging this inverse transformation into heat equation 3.10, we get
Multiplying both sides by \(\exp (- \mathrm{i} k_{\alpha} x^{\alpha})\) and integrating over \(x\) and then \(k'\), we arrive at
This is an ordinary differential equation for each \(k\). It has solution
Thus, by the inverse transformation
Inserting the definition of \(\hat{p} (k, 0)\), we get
The second integral is Gaussian. By the formula of Gaussian integral, which holds for any positive definite matrix \(A\),
we can integrate the second integral and find (replacing \(A_{\alpha \beta}\) by \(\delta^{\alpha \beta} t\) and \(b_{\alpha}\) by \(\mathrm{i} (x^{\alpha} - w^{\alpha})\))
From this expression, we can read out the transition density directly, as
\(\displaystyle q_t (w \rightarrow x) = \prod_{\alpha = 1}^d \frac{1}{\sqrt{2 \pi t}} \exp \left( - \frac{1}{2 t} (x^{\alpha} - w^{\alpha})^2 \right), \) | (3.11) |
which is a normal distribution with mean \(w\) and variance \(t\) along each dimension. So, we find randomness in the cut-off \(N_{\operatorname{cut}} = 2\).
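As a quick numerical illustration (one-dimensional, with all walkers starting at \(w = 0\)), a Monte-Carlo sketch reproduces the mean and variance predicted by equation 3.11.

```python
import numpy as np

rng = np.random.default_rng(0)
n_walkers, dt, t = 100_000, 1e-3, 1.0
x = np.zeros(n_walkers)                       # all walkers start at w = 0
for _ in range(int(t / dt)):
    x += rng.normal(scale=np.sqrt(dt), size=n_walkers)
print(x.mean(), x.var())                      # close to 0 and to t = 1
```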
Historically, in 1827, the botanist Robert Brown noticed that pollen particles jiggle spontaneously in water. This was first explained by Albert Einstein in 1905. He argued that the pollen particles are constantly struck by water molecules, and found the transition density to be equation 3.11. Hence, the stochastic process described by this transition density is named Brownian motion. Even though the techniques used for deriving this transition density had already matured when Brown first observed the phenomenon, it was almost one hundred years after Brown's discovery, in 1923, that Norbert Wiener first constructed a complete mathematical theory for this stochastic process. So, it is also called the Wiener process.
The most generic form of transition rate with \(N_{\operatorname{cut}} = 2\) is (by the Kramers–Moyal expansion 3.8 with \(N_{\operatorname{cut}} = 2\), re-denoting \(K_1\) by \(f\) and \(K_2\) by \(\Sigma\))
\(\displaystyle r (x, x + \epsilon) = - f^{\alpha} (x) \partial_{\alpha} \delta (\epsilon) + \frac{1}{2} \Sigma^{\alpha \beta} (x) \partial_{\alpha} \partial_{\beta} \delta (\epsilon) . \) | (3.12) |
A (continuous time) Markovian process with this transition rate is called a Langevin process or Langevin dynamics. It was first developed by French physicist Paul Langevin in 1908.
Plugging equation 3.12 into master equation 3.9, we find
\(\displaystyle \frac{\partial p}{\partial t} (x, t) = - \partial_{\alpha} (f^{\alpha} (x) p (x, t)) + \frac{1}{2} \partial_{\alpha} \partial_{\beta} (\Sigma^{\alpha \beta} (x) p (x, t)) . \) | (3.13) |
This equation is called the Fokker-Planck equation, found by Adriaan Fokker and Max Planck in 1914 and 1917 respectively, or the Kolmogorov forward equation, independently discovered by Andrey Kolmogorov in 1931.
As a matrix-valued field, \(\Sigma\) is symmetric and everywhere positive definite. Symmetry means \(\Sigma^{\alpha \beta} (x) = \Sigma^{\beta \alpha} (x)\). This is a direct result of its definition \(\int \mathrm{d} \epsilon r (x, x + \epsilon) \epsilon^{\alpha} \epsilon^{\beta}\). To see why it is positive definite, we consider the expectation
Under a proper coordinate of \(\epsilon\), it becomes diagonalized, with the diagonal elements (which are the eigenvalues)
So, \(\Sigma (x)\) is positive definite for any \(x \in \mathbb{R}^d\).
The transition density of a Langevin process is hard to obtain, except for some very simple situations (such as that in section 3.5). Even so, it can be well approximated by a Gaussian density function. Explicitly, let \(q_{\Delta t} (x \rightarrow y)\) denote the transition density of a Langevin process. As in section 3.6, we re-denote \(K_1\) by \(f\) and \(K_2\) by \(\Sigma\). Then, consider the Gaussian conditional density function (which may not be a transition density)
We are to show that, for any function \(\varphi\) in Schwartz space \(S (\mathbb{R}^d)\),
\(\displaystyle \int_{\mathbb{R}^d} \mathrm{d} \epsilon q_{\Delta t} (x \rightarrow x + \epsilon) \varphi (\epsilon) = \int_{\mathbb{R}^d} \mathrm{d} \epsilon \tilde{q}_{\Delta t} (x \rightarrow x + \epsilon) \varphi (\epsilon) + \omicron (\Delta t) . \) | (3.14) |
That is, the transition density of Langevin process is approximated by a Gaussian conditional density function, in the sense of applying onto Schwartz space.
To prove equation 3.14, we first notice that the left hand side can be expanded by \(\Delta t\) as
So, all we need to do is to prove that
To show this, we Taylor expand \(\varphi\) at the origin. The right hand side becomes
where we have inserted the moments of transition rate with cut-off \(N_{\operatorname{cut}} = 2\). Since \(\tilde{q}_{\Delta t} (x \rightarrow y)\) is the density function of the normal distribution with mean \(x + f (x) \Delta t\) and covariance \(\Sigma (x) \Delta t\), the left hand side is evaluated to be
which equals the right hand side. Thus equation 3.14 holds.
Because of this Gaussian approximation, we can approximate the Langevin process using a stochastic difference equation
where \(\eta (X)\) obeys a normal distribution with zero mean and covariance \(\Sigma (X) \Delta t\) (this is why \(\eta\) depends on \(X\)). We can separate the dependence on \(X\) out of \(\eta\) using Cholesky factorization.
To introduce Cholesky factorization, we fix the argument \(x\) and omit it for simplicity, so \(\Sigma (x)\) is written as \(\Sigma\). Since \(\Sigma\) is symmetric and positive definite (proved in section 3.6), we can diagonalize it using an orthogonal matrix \(E\) as \(\Sigma = E^T \Lambda E\), where the diagonal \(\Lambda_{\alpha \beta} = \delta_{\alpha \beta} \lambda_{\beta}\) with \(\lambda_{\beta} > 0\). Define \(\sqrt{\Lambda}_{\alpha \beta} := \delta_{\alpha \beta} \sqrt{\lambda_{\beta}}\), thus \(\Lambda = \sqrt{\Lambda}^T \sqrt{\Lambda}\), and \(\Sigma = C^T C\) where \(C := \sqrt{\Lambda} E\). We thus factorize \(\Sigma\) into the “square” of \(C\). Notice that \(C\) is invertible, with \(C^{- 1} = E^T \left( \sqrt{\Lambda} \right)^{- 1}\). This factorization was first discovered by the French military officer André-Louis Cholesky, who was killed in battle a few months before the end of World War I, at the age of 31. So, we have (inserting the omitted \(x\) again) \(\Sigma (x) = C^T (x) C (x)\), and the stochastic difference equation becomes
where each \(\Delta W^{\alpha}\) obeys a normal distribution with zero mean and variance \(\Delta t\) (we use \(W\) to indicate the Wiener process).
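Here is a sketch of one step of this stochastic difference equation, with the noise decorrelated through a factorization of \(\Sigma\). The drift \(f\) and covariance \(\Sigma\) below are hypothetical stand-ins, and numpy's Cholesky routine is used, which returns an equivalent factor with \(\Sigma = C C^T\).

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(x, f, Sigma, dt):
    """One step: X <- X + f(X) dt + C dW, with dW ~ N(0, dt I) and Sigma = C C^T
    (numpy's Cholesky factor, an equivalent choice of the factor C)."""
    C = np.linalg.cholesky(Sigma(x))
    dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
    return x + f(x) * dt + C @ dW

f = lambda x: -x                                       # a linear restoring drift
Sigma = lambda x: np.array([[1.0, 0.3], [0.3, 2.0]])   # constant, positive definite
x = np.array([1.0, -1.0])
for _ in range(1000):
    x = langevin_step(x, f, Sigma, dt=1e-2)
print(x)
```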
In the limit \(\Delta t \rightarrow 0\), the approximation becomes exact (so \(\approx\) is replaced by \(=\)), and the difference equation turns into a differential equation, as
\(\displaystyle \frac{\mathrm{d} X^{\alpha}}{\mathrm{d} t} (t) = f^{\alpha} (X (t)) + C^{\alpha}_{\beta} (X (t)) \frac{\mathrm{d} W^{\beta}}{\mathrm{d} t} (t) \) | (3.15) |
with
\(\displaystyle \mathbb{E} \left[ \frac{\mathrm{d} W^{\alpha}}{\mathrm{d} t} (t) \frac{\mathrm{d} W^{\beta}}{\mathrm{d} t} (t') \right] = \delta^{\alpha \beta} \delta (t - t') . \) | (3.16) |
This is the form in which the Langevin process appears in many textbooks.
In this section and section 3.9, we use Langevin process as an example to explicitly show what the detailed balance condition appends to the stationary condition.
The master equation of the Langevin process (equation 3.13) has a stationary solution \(\Pi\), which satisfies (since there is only one variable \(x\), we use \(\partial\) instead of \(\nabla\))
which means
\(\displaystyle f^{\alpha} (x) \pi (x) = \frac{1}{2} \partial_{\beta} (\Sigma^{\alpha \beta} (x) \pi (x)) + \nu^{\alpha} (x), \) | (3.17) |
where \(\nu : \mathbb{R}^d \rightarrow \mathbb{R}^d\) is an arbitrary vector field such that \(\partial_{\alpha} \nu^{\alpha} (x) = 0\).
The vector field \(\nu\) has an intuitive explanation. Regarding \(\nu\) as a flux on \(\mathbb{R}^d\), we find that there is no net flux flowing out of anywhere in \(\mathbb{R}^d\). Otherwise, suppose there were \(x \in \mathbb{R}^d\) and a closed surface \(S\) around \(x\) such that the net flux \(\int \mathrm{d} S \cdot \nu (x)\) does not vanish. Then, by Stokes theorem, the surface integral \(\int \mathrm{d} S \cdot \nu (x) = \int \mathrm{d} x \nabla \cdot \nu (x) = 0\), a contradiction. Such a vector field \(\nu\) is called free of source or source-free.
After discussing the stationary distribution of the Fokker-Planck equation (as a master equation), we continue to investigate when a Langevin process relaxes an initial distribution to the stationary one. By theorem 2.1, this is equivalent to asking: when does the transition rate of a Langevin process satisfy the detailed balance condition? The detailed balance condition reads \(r (x, x + \epsilon) \pi (x) = r (x + \epsilon, x) \pi (x + \epsilon)\). Directly inserting equation 3.12, we get, for the left hand side,
For the right hand side, we first have
Then, inserting equation 3.12 gives
Since \(\partial_{\alpha} \delta (- x) = - \partial_{\alpha} \delta (x)\) and \(\partial_{\alpha} \partial_{\beta} \delta (- x) = \partial_{\alpha} \partial_{\beta} \delta (x)\) (derived at the end of section 3.3), it turns out to be
As generalized functions, we are to examine these two expressions by using an arbitrary test function \(\varphi\) in Schwartz space \(S (\mathbb{R}^d)\). Applying \(\pi (x) r (x, x + \epsilon)\) to \(\varphi\) gives
Integration by parts gives (note that the \(\partial\) is applied on \(\epsilon\))
By applying \(\pi (x + \epsilon) r (x + \epsilon, x)\) to \(\varphi\), we get
Again, integration by parts results in (again, the \(\partial\) operator is applied on \(\epsilon\))
By equating \(\int \mathrm{d} \epsilon \pi (x) r (x, x + \epsilon) \varphi (\epsilon)\) and \(\int \mathrm{d} \epsilon \pi (x + \epsilon) r (x + \epsilon, x) \varphi (\epsilon)\), since \(\varphi\) is arbitrary, we find, for the \(\varphi (0)\) terms,
and for \(\partial \varphi (0)\) terms,
The \(\partial \partial \varphi (0)\) terms vanish automatically. Altogether, we find the detailed balance condition for the Langevin process to be
\(\displaystyle f^{\alpha} (x) \pi (x) = \frac{1}{2} \partial_{\beta} (\Sigma^{\alpha \beta} (x) \pi (x)) . \) | (3.18) |
Comparing with the stationary solution of the Langevin process (equation 3.17), the source-free vector field \(\nu\) is absent here. Recall section 2.4, where the detailed balance condition was first encountered: we said that detailed balance is stronger than just being stationary. Now, in the Langevin process, this becomes concrete: the detailed balance condition is stronger than the stationary condition in the sense that it lacks the source-free degree of freedom that appears in the stationary condition. The lost degree of freedom is the cost of ensuring that any initial distribution will finally relax to the stationary one.
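As a one-dimensional sanity check of condition 3.18 (with constant \(\Sigma = \sigma\) and a hypothetical drift \(f\)), the density \(\pi \propto \exp ((2 / \sigma) \int f \mathrm{d} x)\) satisfies the detailed balance condition identically; the sympy sketch below verifies this symbolically.

```python
import sympy as sp

x = sp.symbols('x', real=True)
sigma = sp.symbols('sigma', positive=True)
f = x - x**3                                   # any smooth drift would do here
pi = sp.exp(2 / sigma * sp.integrate(f, x))    # candidate stationary density
print(sp.simplify(f * pi - sp.diff(sigma * pi, x) / 2))   # -> 0, i.e. (3.18) holds
```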
We apply the discussions in chapter 3 to least-action principle.
We follow the conventions in chapter 3 (except for section 4.4, where the alphabet \(\mathcal{X}\) can be discrete). In addition, we use \(P (\theta)\) for a parameterized distribution, where \(\theta\) is the collection of parameters. Its density function is \(p (x, \theta)\), where the random variable \(X\) takes the value \(x\).
In physics, the least-action principle gives the dynamics of the state of an evolutionary system, determining how it evolves with time. The state of an evolutionary system is called a configuration. As the state changes with time, the evolution of the configuration can be seen as a path in a space, like a contrail in the sky indicating the movement of an airplane. This space is called the configuration space, which is generally Euclidean, \(\mathbb{R}^d\) for some \(d\). A path is a function with a single parameter, \(x : [t_i, t_f] \rightarrow \mathbb{R}^d\), where \(t_i\) and \(t_f\) denote the initial and final time respectively. Without loss of generality, we standardize the time interval from \([t_i, t_f]\) to \([0, 1]\). To introduce the least-action principle, consider the collection of paths with fixed boundaries, that is, \(\mathcal{P} (x_0, x_1) := \{ x : [0, 1] \rightarrow \mathbb{R}^d |x (0) = x_0, x (1) = x_1 \}\) given the boundaries \((x_0, x_1)\). An action is a scalar functional of paths with fixed boundaries, thus an action \(S (\cdot |x_0, x_1) : \mathcal{P} (x_0, x_1) \rightarrow \mathbb{R}\), where we use a vertical line to separate the variables from those that are given as constants (the boundaries \((x_0, x_1)\)); this should not be confused with the vertical line in conditional probability, like \(p (x|y)\). For example, the configuration space of a (one-dimensional) harmonic oscillator is \(\mathbb{R}\), and the evolution is characterized by a path \(x : [0, 1] \rightarrow \mathbb{R}\). The action of the harmonic oscillator is given by the functional
\(\displaystyle S_{\operatorname{HO}} (x|x_0, x_1) = \frac{1}{2} \int_0^1 \mathrm{d} t [\dot{x}^2 (t) - \omega^2 x^2 (t)], \) | (4.1) |
where \(\dot{x} := \mathrm{d} x / \mathrm{d} t\), \(\omega \in \mathbb{R}\), and \(x (0) = x_0\), \(x (1) = x_1\).
Roughly, the least-action principle states that, in the real world, the paths with the fixed boundaries are those that minimize the action. To state the least-action principle quantitatively, we have to describe the minimum of an action mathematically. Recall that a local minimum, or generally an extremum, \(x_{\star}\) of a function \(f\) is characterized by \((\partial f / \partial x^{\alpha}) (x_{\star}) = 0\) for each component \(\alpha\). How can we generalize this from function to functional (an action is a functional)? The trick is discretizing the time. Precisely, we uniformly separate the time interval \([0, 1]\) into \(T\) fragments. Thus, the path \(x\) is discretized as a vector \((x (0), x (1 / T), \ldots, x ((T - 1) / T), x (1))\), each component being an endpoint of a fragment. Since the boundaries are fixed in the least-action principle, \(x (0)\) and \(x (1)\) are constants rather than variables. Hence, the true degrees of freedom are \((x (1 / T), \ldots, x ((T - 1) / T))\). The least-action principle in classical mechanics then states that, given the (discretized) action \(S\) and the boundaries \((x_0, x_1)\), there is at most one path \(x_{\star} \in \mathcal{P} (x_0, x_1)\) such that
\(\displaystyle \frac{\partial S}{\partial x (i / T)} (x_{\star} |x_0, x_1) = 0, \) | (4.2) |
for each \(i = 1, \ldots, T - 1\) and any \(T > 1\), and that \(x_{\star}\) is the path in the real world.
Take the harmonic oscillator as an example. To discretize its action (equation 4.1), we replace the integral \(\int_0^1 \mathrm{d} t\) by the mean \((1 / T) \sum_{i = 0}^T\) and \(x (t)\) by \(x (i / T)\). Thus the second term becomes \((\omega^2 / 2 T) \sum_{i = 0}^T x^2 (i / T)\). For the first term, the derivative \(\dot{x} (t)\) is replaced by its difference \(T [x ((i + 1) / T) - x (i / T)]\), hence the summation shall terminate at \(T - 1\) instead of \(T\). Altogether, the action 4.1 is discretized as
Given \(i\), \(x (i / T)\) appears in two terms in \(S_{\operatorname{HO}}\), the \(i\)th and \((i + 1)\)th terms in the summation. They have derivatives \(T [- x ((i + 1) / T) + x (i / T)] - (\omega^2 / T) x (i / T)\) and \(T [x (i / T) - x ((i - 1) / T)]\) respectively. So, we find
for \(i = 1, \ldots, T - 1\). The right hand side is the discretized \(\ddot{x}_{\star} (t) + \omega^2 x_{\star} (t)\), for \(t \in (0, 1)\) (notice that we have excluded \(t = 0, 1\), corresponding to \(i = 0, T\) respectively). So, the least-action principle, \(\partial S_{\operatorname{HO}} / \partial x (i / T) (x_{\star} |x_0, x_1) = 0\), implies the correct textbook dynamics of the harmonic oscillator, \(\ddot{x}_{\star} (t) + \omega^2 x_{\star} (t) = 0\).
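The discretized condition can be checked numerically. The sketch below (with hypothetical parameters) evaluates \(\partial S_{\operatorname{HO}} / \partial x (i / T)\) on the known solution \(x (t) = \sin (\omega t)\) and finds it close to zero at all interior points.

```python
import numpy as np

T, omega = 200, 1.0
t = np.arange(T + 1) / T
x = np.sin(omega * t)                  # exact solution of x'' + omega^2 x = 0

def grad_action(x):
    """dS_HO/dx(i/T) at the interior points i = 1, ..., T-1 (see the text)."""
    return (T * (2 * x[1:-1] - x[2:] - x[:-2])
            - (omega ** 2 / T) * x[1:-1])

print(np.abs(grad_action(x)).max())    # of order 1/T^3: nearly zero
```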
4.1. The dynamics with fixed boundaries is called a boundary value problem. But in physics, the dynamics we obtain from the least-action principle is applied to initial value problems, where the initial “phase” (for a physical system, it involves the initial position and velocity), instead of the boundaries, is fixed. This mysterious application leads to some interesting results. For an \(m\)th-order dynamics (for example, the harmonic oscillator is a second order dynamics since it involves at most the second derivative of the path), an initial value problem has \((T + 1 - m)\) variables (there are \(T + 1\) endpoints on the path), since \(m\) degrees of freedom have been assigned to the initial values. On the other hand, the boundary value problem has \((T + 1 - 2)\) degrees of freedom, since there are always two boundaries (\(t = 0\) and \(t = 1\)). So, for the success of this mysterious application, we must have \(m = 2\). That is, the initial value problem has to be second order.
We can generalize the least-action principle to any system, evolutionary or not, where the variables live in a high-dimensional Euclidean space and, given some conditions, the action is a scalar function on it. It states that the real world datum is located at the minimum of the action. Precisely, given the conditioned action \(S\) (we may hide the condition \(y\) in \(S\) instead of explicitly writing it out), there is at most one \(x_{\star}\) such that
\(\displaystyle \frac{\partial S}{\partial x^{\alpha}} (x_{\star}) = 0, \) | (4.3) |
and that \(x_{\star}\) is the real world datum.
There are, however, redundant degrees of freedom in the action \(S\). We may construct multiple actions all satisfying equation 4.3, since knowing an extremum of a function does not determine the shape of the function. The action has many more degrees of freedom than are needed for revealing the real world datum in classical mechanics. But combined with uncertainty, as we will see in section 4.3, the action is completely determined by the real world distribution (the correspondence of the real world datum), with nothing redundant.
Dynamics in classical mechanics is always deterministic. That is, once the initial conditions (for an initial value problem) or the boundaries (for a boundary value problem) are fixed, the path is fully determined. Randomness is forbidden. There are, however, many phenomena in nature that have intrinsic randomness. For example, molecular movement obeys a normal distribution with variance proportional to the time interval. The dynamics of starling flocks also has intrinsic randomness, which is the “free will” of each bird; so do ant colonies, human society, and any interactive system in which each element has some level of intrinsic uncertainty. For these cases, the real world datum is not simply a path, but a distribution of paths. In this section, we are to describe these phenomena in the language of Markovian processes.
Temporarily, we go back to the old-fashioned notation for conditional density functions instead of the arrowed one, so \(q_{\Delta t} (x \rightarrow y)\) is written as \(q_{\Delta t} (y|x)\). By repeatedly applying the (discrete time) master equation 2.5, we get
\(\displaystyle p (x_N, N \Delta t) = \int_{\mathbb{R}^d} \mathrm{d} x_0 \cdots \int_{\mathbb{R}^d} \mathrm{d} x_{N - 1} p (x_0, 0) q_{\Delta t} (x_1 |x_0) \cdots q_{\Delta t} (x_N |x_{N - 1}) . \) | (4.4) |
The right hand side can be viewed as marginalizing the random variables \((X_0, \ldots, X_{N - 1})\), and the product \(q_{\Delta t} (x_1 |x_0) \cdots q_{\Delta t} (x_N |x_{N - 1})\) can be seen as the density function of \((X_1, \ldots, X_N)\), where we have omitted the subscript \(\Delta t\) for simplicity. To see this clearly, we first notice that \(q (x_2 |x_1) = q (x_2 |x_0, x_1)\) holds for any \(x_0\), because \(q (x_2 |x_1)\) is not explicitly dependent on \(x_0\). Then, we have \(q (x_2 |x_1) q (x_1 |x_0) q (x_0) = q (x_2 |x_0, x_1) q (x_1 |x_0) q (x_0)\). Repeatedly using the definition of conditional density, it becomes \(q (x_2 |x_0, x_1) q (x_0, x_1) = q (x_0, x_1, x_2)\). Dividing both sides by \(q (x_0)\), we get \(q (x_1, x_2 |x_0) = q (x_2 |x_1) q (x_1 |x_0)\). Repeating this step, we will find
recognized as a conditional density of random variables \((X_1, \ldots, X_N)\) given \(x_0\). If we regard the series \((x_1, \ldots, x_N)\) as a “movie” or a “path” of evolution of the stochastic system, in which each \(x_i\) can be seen as a “frame”, then the density function \(q (x_1, \ldots, x_N |x_0)\) characterizes the distribution of evolution.
If we define \(S (x_0, x_1, \ldots, x_N) := - \ln q (x_1, \ldots, x_N |x_0)\), then the master equation becomes
So, the \([\cdots]\) part carries the density function \(p (\cdot, 0)\) to \(p (\cdot, N \Delta t)\). It is an integration over all possible ways of evolution. If we treat \((x_0, \ldots, x_N)\) together as a single \(x\), then we can generalize this to any density function \(q (x)\) with \(x \in \mathbb{R}^d\). We can always define
\(\displaystyle S (x) := - \ln q (x) . \) | (4.5) |
Thus, \(q (x) = \exp (- S (x))\).
The \(S\) defined by 4.5 has some properties analogous to the action in classical mechanics. Indeed, by plugging in the definition of \(S\), we find
The integrand of the rightmost expression is a divergence, so it results in a boundary integral. But since \(q\), as a density function, is normalized, the boundary integral shall vanish as \(\| x \| \rightarrow + \infty\). So, we conclude that
This is analogous to equation 4.3, where the minimum \(x_{\star}\) is replaced by the expectation \(\mathbb{E}_Q\). Secondly, the distribution \(Q\) (whose density function is \(q\)) most likely samples (recall from section 1.2 that a distribution has a sampler) the \(x\) that maximizes \(q\), thus minimizing \(S\). For these reasons, we regard the \(S\) defined by \(q\) as the action of \(Q\). Contrary to the action in classical mechanics, the \(S\) here is completely determined by the real world distribution \(Q\) (because it is defined by the density function \(q\)), without any redundancy. This directly reflects the fact that a distribution involves more information than its most likely datum.
Given a collection of real world data, we are to find a distribution that fits the data. These data can be seen as samples from an unknown distribution which characterizes the real world. We are to figure out a method to fit the real world distribution, given some samples of it.
Let \(P (\theta)\) represent a distribution parametrized by \(\theta \in \mathbb{R}^m\). Its alphabet \(\mathcal{X}\) can be a discrete set or a continuous space like \(\mathbb{R}^d\). From its density function, \(p (\cdot, \theta)\), we get a parameterized action \(S (\cdot, \theta)\) such that
\(\displaystyle p (x, \theta) = \exp (- S (x, \theta)) / Z (\theta), \) | (4.6) |
where \(Z (\theta) := \int \mathrm{d} x \exp (- S (x, \theta))\) for ensuring \(\int \mathrm{d} x p (x, \theta) = 1\). This is consistent with the action defined by equation 4.5, except that the action here is parameterized, and that we add a normalization factor \(Z (\theta)\) for convenience (even though it can be absorbed into \(S (x, \theta)\)).
What we have is a collection of data, sampled from an unknown distribution \(Q\). And we are to adjust the parameters \(\theta\) so that \(P (\theta)\) approximates \(Q\). To do so, we minimize the relative entropy between \(Q\) and \(P (\theta)\), which is defined as \(H (Q, P (\theta)) := \int \mathrm{d} x q (x) \ln (q (x) / p (x, \theta))\). This expression is formal. Since we do not know the density function of \(Q\), all that we can do with \(Q\) is computing the expectation \(\mathbb{E}_Q [f] = (1 / | Q |) \sum_{x \in Q} f (x)\) for any function \(f\), where we use \(Q\) as a set of data. With this realization, we have, after plugging equation 4.6 into \(H (Q, P (\theta))\),
By omitting the \(\theta\)-independent terms, we get the loss function
The parameters that minimize \(L (\theta)\) also minimize \(H (Q, P (\theta))\), and vice versa. We can find the \(\theta_{\star} := \operatorname{argmin}L\) by iteratively updating \(\theta\) along the direction \(- \partial L / \partial \theta\). To calculate \(- \partial L / \partial \theta\), we start at
The first term is recognized as \(-\mathbb{E}_Q [\partial S / \partial \theta^{\alpha}]\). For the second term, by inserting the definition of \(Z (\theta)\), we get
where in the last equality we used the definition of \(p (x, \theta)\) (equation 4.6). This final expression is just \(\mathbb{E}_{P (\theta)} [\partial S / \partial \theta^{\alpha}]\). Altogether, we arrive at
\(\displaystyle - \frac{\partial L}{\partial \theta^{\alpha}} (\theta) =\mathbb{E}_{P (\theta)} \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta) \right] -\mathbb{E}_Q \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta) \right] . \) | (4.7) |
At the minimum, we shall have \(\partial L / \partial \theta = 0\). Then, we find that \(\theta_{\star}\) obeys
\(\displaystyle \mathbb{E}_{P (\theta_{\star})} \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta_{\star}) \right] =\mathbb{E}_Q \left[ \frac{\partial S}{\partial \theta^{\alpha}} (\cdot, \theta_{\star}) \right] . \) | (4.8) |
It can be read from equation 4.7 that minimizing \(L\) increases \(S (\cdot, \theta)\) at the sampled points (the first term) while decreasing it at the data points (the second term). As figure 4.1 illustrates, this way of optimization will settle the real world data onto local minima of \(S (\cdot, \theta)\), in a statistical sense.
The parameterized action \(S\) can be constructed out of universal functions such as neural networks. Then, iterating by equation 4.7 until \(\| \partial L / \partial \theta \|\) is small enough gives an \(S (\cdot, \theta_{\star})\) that mimics the real world distribution \(Q\). To compute the expectation \(\mathbb{E}_{P (\theta)} [\ldots]\) in equation 4.7, we employ Monte-Carlo simulation. The transition rate shall satisfy the detailed balance condition with \(P (\theta)\) as the stationary distribution. For discrete random variables, Monte-Carlo simulation with the Metropolis-Hastings algorithm (section 2.7) is available; for continuous random variables, the Langevin process (sections 3.6, 3.9, and 5.3) will be more efficient. In addition, many numerical computation modules have built-in samplers for many distributions, such as the categorical and normal distributions; when \(P (\theta)\) happens to be one of them, computing \(\mathbb{E}_{P (\theta)} [\ldots]\) is straightforward.
As an example, if we want to get the action that characterizes the stochastic dynamics of starling flocks, we take movies of many flocks. Each movie is a series of frames that log the position of each bird at each time instant. These movies provide the real world data, namely the distribution \(Q\).
We could construct the parameterized action \(S\) by a neural network that outputs a scalar float value. But we can go deeper, assuming that the stochastic dynamics of flocks obeys a Langevin process with constant covariance. Namely, we employ the action 5.9, and construct the \(f\) therein using a neural network, which outputs a float vector (array).
To evaluate the \(\mathbb{E}_{P (\theta)} [\ldots]\) in equation 4.7, we first randomly initialize a group of samples, each representing a movie of a flock. Then, we use a Langevin process to efficiently simulate the master equation. By section 3.7, we iterate each sample by
\(\displaystyle x^{\alpha} \leftarrow x^{\alpha} - \nabla^{\alpha} S (x, \theta) \Delta t + \Delta W^{\alpha}, \) | (4.9) |
for \(0 < \Delta t \ll 1\). The gradient \(\nabla\) is taken on \(x\), and the random variable \(\Delta W^{\alpha}\) obeys a normal distribution with zero mean and variance \(2 \Delta t\). This specific transition density satisfies the detailed balance condition (see section 3.9, especially equation 3.18). Then, theorem 2.1 claims that the iteration 4.9 will relax the samples towards the stationary distribution, which has density function proportional to \(\exp (- S (x, \theta))\). There is no need to wait until the samples are fully relaxed. In practice, we find that ten or a few more iterations are sufficient for a good approximation. Finally, out of these iterated samples, \(\mathbb{E}_{P (\theta)} [\cdots]\) is evaluated. Repeating this process and iterating \(\theta\) by gradient descent, we arrive at the best-fit \(\theta_{\star}\), which encodes the stochastic dynamics of flocks.
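A compact sketch of this whole fitting loop, in one dimension with a hypothetical quadratic action \(S (x, \theta) = \theta_1 x^2 + \theta_2 x\) (chosen so that the exact answer is known): model samples are refreshed by a few Langevin iterations 4.9, and \(\theta\) follows the gradient 4.7.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=0.5, size=5000)   # stand-in for the dataset Q

def dS_dtheta(x):
    return np.stack([x ** 2, x])                   # dS/dtheta for S = t1*x^2 + t2*x

def dS_dx(x, theta):
    return 2 * theta[0] * x + theta[1]             # dS/dx

theta, dt, lr = np.array([0.1, 0.0]), 0.01, 0.05
samples = rng.normal(size=5000)                    # persistent model samples
for _ in range(500):
    for _ in range(10):                            # a few Langevin iterations (4.9)
        samples = (samples - dS_dx(samples, theta) * dt
                   + rng.normal(scale=np.sqrt(2 * dt), size=samples.shape))
    # gradient descent on L: -dL/dtheta = E_P[dS/dtheta] - E_Q[dS/dtheta]  (4.7)
    theta = theta + lr * (dS_dtheta(samples).mean(axis=1)
                          - dS_dtheta(data).mean(axis=1))
print(theta)
```

For data drawn from \(N (1, 0.5^2)\), \(\theta\) should approach \((2, -4)\), since \(S (x) = 2 (x - 1)^2 = 2 x^2 - 4 x + \operatorname{const}\).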
In this section, we apply the results of section 4.4 to supervised machine learning, including both regression and classification tasks. We suppose that readers of this section have basic familiarity with machine learning.
As usual, the dataset is characterized by an empirical distribution \(Q\). In supervised machine learning, each datum is a pair consisting of an input and an output. By marginalizing the output, we get the input distribution \(Q_X\). Then, dividing \(Q\) by \(Q_X\) gives the conditional distribution \(Q_{Y|X}\), which samples an output given an input.
The task of supervised machine learning is to search for an analytical distribution that approximates \(Q_{Y|X}\). To do so, we use a parameterized distribution \(P (x, \theta)\), where \(\theta\) denotes the parameters, and find the best-fit parameter \(\theta_{\star}\) by minimizing the relative entropy \(H (P (x, \theta), Q_{Y|X} (x))\), where \(x\) is sampled from \(Q_X\). Then, we have the loss function
For regression tasks, the alphabet is \(\mathbb{R}\), and \(P (x, \theta)\) is a normal distribution with unit variance. Thus, the density function \(p (y|x, \theta)\) is
for some scale function \(h (x, \theta)\). We read out the action in section 4.4 as
and the normalization factor \(Z (x, \theta)\) is constant. So, omitting the \(\theta\)-independent terms in \(L (\theta)\), we get
For classification tasks, the alphabet is \(\{ 1, \ldots, C \}\) for some positive integer \(C\), and \(P (x, \theta)\) is a categorical distribution. Thus, the density function \(p (y|x, \theta)\) is a softmax function of \(h (x, \theta)\). That is,
We read out the action in section 4.4 as
and the normalization factor \(Z (x, \theta) = \sum_{\beta} \exp (h^{\beta} (x, \theta))\). The loss function is evaluated to be
which is the usual cross-entropy loss in classification tasks. As discussed in section 4.4, for minimizing the loss function, we have to evaluate the expectation \(\mathbb{E}_{P (x, \theta)} [\partial S / \partial \theta]\) by sampling from \(P (x, \theta)\). This corresponds to the contrastive learning technique used in training models such as word2vec, in which the number of classes is extremely large.
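The two losses read off above can be written down directly. The sketch below (numpy, with h_out standing for the hypothetical network output \(h (x, \theta)\) on a batch) shows the squared error coming from the Gaussian action, and the cross-entropy, with its log-sum-exp normalization factor \(Z\), coming from the categorical action.

```python
import numpy as np

def regression_loss(h_out, y):
    """L = E_Q[(y - h)^2 / 2], read off from the Gaussian action."""
    return 0.5 * np.mean((y - h_out) ** 2)

def classification_loss(h_out, y):
    """L = E_Q[-h^y + ln Z], with Z = sum_beta exp(h^beta): the cross-entropy."""
    log_Z = np.log(np.exp(h_out).sum(axis=1))
    return np.mean(log_Z - h_out[np.arange(len(y)), y])

# a toy usage with random stand-ins for the network outputs and labels
rng = np.random.default_rng(0)
print(regression_loss(rng.normal(size=8), rng.normal(size=8)))
print(classification_loss(rng.normal(size=(8, 3)), rng.integers(0, 3, size=8)))
```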
We follow the conventions in chapter 3.
In section 4.3, we have shown how to define an action using a distribution. Also in the same section, we found that the master equation can be written as an integral over all possible paths, as
where the action (of distribution) is given by
If we regard the series \((x_0, \ldots, x_N)\) as a path from \(x_0\) to \(x_N\), then it is an integral over all possible paths.
The path integral formulation was found by Paul Dirac in 1933, who was trying to use the Lagrangian in quantum mechanics. It was then developed by the physicist Richard Feynman and the mathematician Mark Kac in the 1940s. Now, path integrals are applied not only to quantum field theory, but also to many other areas such as stochastic processes. A path integral has the general formalism
\(\displaystyle \int_{\mathbb{R}^d} \mathrm{d} x_0 \cdots \int_{\mathbb{R}^d} \mathrm{d} x_{N - 1} \exp (- S (x_0, \ldots, x_N)) f (x_0, \ldots, x_N), \) | (5.1) |
where a series \((x_0, \ldots, x_N)\) is called a “path”, the \(S\) is called the “action” of path (in section 4.3, we explained why we should call \(S\) an action), and the \(f\) is an arbitrary function called an “observable”. So, we just found a path integral formulation for master equation.
To obtain the action, we have to evaluate the logarithm of the transition density, \(\ln q_{\Delta t} (x \rightarrow y)\), when \(\Delta t\) is small. This, however, is not straightforward, since the leading term of \(q_{\Delta t} (x \rightarrow y)\) is \(\delta (x - y)\), which cannot be converted into an exponential. But we can consider its Fourier transformation (introduced in section 3.5), since \(\delta (x - y)\), regarded as a Dirac delta function, has Fourier coefficient \(\exp (- \mathrm{i} k_{\alpha} (x^{\alpha} - y^{\alpha}))\). This suggests considering the Fourier transformation of \(q_{\Delta t}\), as
This forces the alphabet to be the Euclidean space \(\mathbb{R}^d\), because we cannot do the same with the Kronecker delta when the alphabet is discrete, or when the alphabet is continuous but not Euclidean. Roughly, we have \(q_{\Delta t} (x \rightarrow x + \epsilon) \approx \delta (\epsilon) + r (x, x + \epsilon) \Delta t\). Thus, we may expect that
where we have inserted the Fourier transformation of transition rate, \(\hat{r} (x, k) = \int \mathrm{d} \epsilon \exp (- \mathrm{i} k_{\alpha} \epsilon^{\alpha}) r (x, x + \epsilon)\). It has inverse Fourier transformation
Plugging this back into equation 4.4, we get a path integral formulation
with abbreviation
The residue of this approximation turns out to be non-trivial. We tackle this in section 5.2.1. As the result, we find
\(\displaystyle p (x_N, N \Delta t) = \int D (x, k) p (x_0, 0) \exp (- S (x, k)) + \omicron (N \Delta t), \) | (5.2) |
where the action is
If we Taylor expand \(\hat{r} (x, k)\) by \(k\) at the origin, then the coefficient is
which is recognized as \((- \mathrm{i})^n K_n^{\alpha_1 \cdots \alpha_n} (x)\). We meet the moments of transition rate again (they first appeared in section 3.2). Thus, we have
where the zeroth order term vanishes (thus the summation begins at \(n = 1\)) because of the property of transition rate \(\int \mathrm{d} \epsilon r (x, x + \epsilon) = 0\). As discussed in section 3.2, the summation terminates at \(N_{\operatorname{cut}}\). Then, the action becomes
\(\displaystyle S (x, k) = - \sum_{i = 0}^{N - 1} \Delta t \left\{ \mathrm{i} (k_i)_{\alpha} \left( \frac{x^{\alpha}_{i + 1} - x_i^{\alpha}}{\Delta t} \right) + \sum_{n = 1}^{N_{\operatorname{cut}}} \frac{(- \mathrm{i})^n}{n!} K_n^{\alpha_1 \cdots \alpha_n} (x_i) (k_i)_{\alpha_1} \cdots (k_i)_{\alpha_n} \right\} . \) | (5.3) |
The path is a polyline in \(\mathbb{R}^{2 d}\) (involving both \(x \in \mathbb{R}^d\) and \(k \in \mathbb{R}^d\)). But in the specific situation where \(N_{\operatorname{cut}} = 2\) and \(K_2 (x)\) is a constant (matrix), the \(k\)-components can be analytically marginalized (section 5.3). Then, the path integral is taken on the alphabet \(\mathbb{R}^d\) (namely, the path is a series \((x_0, x_1, \ldots, x_N)\) with each \(x_i \in \mathbb{R}^d\)).
To explicitly evaluate the residue left in section 5.2, we have to consider the full expansion of \(q_{\Delta t}\) (equation 2.6), rather than expanding to the first order of \(\Delta t\). We have evaluated the zeroth and the first order coefficients in the full expansion. Now for higher orders, define
for \(n \geqslant 1\). Then, the full expansion of \(q_{\Delta t}\) gives
The residue hides in the \(\hat{g}_n (x, k)\)s. We are to calculate \(\hat{g}_n (x, k)\) by induction, starting at \(\hat{g}_1 (x, k) = \hat{r} (x, k)\). Directly,
Denoting \(\epsilon' := \epsilon + x - y_1\), we find
The \([\cdots]\) factor is recognized as \(\hat{g}_n (x, k)\). Thus, (reuse the \(\epsilon\) notation for simplicity)
Then, Taylor expanding \(\hat{g}_n (y_1 + \epsilon, k)\) by \(\epsilon\) at the origin results in (\(\partial\) acts on the first variable, and \(\partial'\) on the second)
The integral is recognized as partial derivatives \(\partial^m \hat{r} / [\partial (- \mathrm{i} k_{\alpha_1}) \cdots \partial (- \mathrm{i} k_{\alpha_m})]\), thus we arrive at
\(\displaystyle \hat{g}_{n + 1} (x, k) = \sum_{m = 0}^{+ \infty} \frac{\mathrm{i}^m}{m!} \frac{\partial^m \hat{g}_n}{\partial x^{\alpha_1} \cdots \partial x^{\alpha_m}} (x, k) \frac{\partial^m \hat{r}}{\partial k_{\alpha_1} \cdots \partial k_{\alpha_m}} (x, k) . \) | (5.4) |
In \(\hat{g}_n (x, k)\), the term with \(m = 0\) is \([\hat{r} (x, k)]^n\). So, we conclude that
where \(\hat{\zeta} (x, k, \Delta t)\) collects the terms in each \(\hat{g}_n (x, k)\) except for the \([\hat{r} (x, k)]^n\). Together with equation 5.4, we get
\(\displaystyle \hat{\zeta} (x, k, \Delta t) = \sum_{n = 2}^{+ \infty} \frac{\Delta t^n}{n!} \sum_{m = 1}^{+ \infty} \frac{\mathrm{i}^m}{m!} \frac{\partial^m \hat{g}_{n - 1}}{\partial x^{\alpha_1} \cdots \partial x^{\alpha_m}} (x, k) \frac{\partial^m \hat{r}}{\partial k_{\alpha_1} \cdots \partial k_{\alpha_m}} (x, k) . \) | (5.5) |
Since terms like \([\hat{r} (x, k)]^n\) are absent in \(\hat{\zeta} (x, k, \Delta t)\), \(m\) starts at \(1\). And since \(\hat{g}_1 (x, k) = \hat{r} (x, k)\) contributes nothing to \(\hat{\zeta} (x, k, \Delta t)\), \(n\) starts at \(2\). Thus, \(\hat{\zeta} (x, k, \Delta t) = \omicron (\Delta t)\), and it is the residue we are seeking. Temporarily, we have no idea whether the series in \(\hat{\zeta} (x, k, \Delta t)\) converges or not, but we keep it formal (we will examine this later).
To go back to \(q_{\Delta t} (x \rightarrow x + \epsilon)\), we perform the inverse Fourier transformation. This relies on the expansion of a function \(f\) as
5.1. This can be viewed as a generalization of the Kramers-Moyal expansion. In fact, we are to prove this expansion by following the steps in section 3.3, in which we proved the Kramers-Moyal expansion. Explicitly, consider a function \(\varphi \in S (\mathbb{R}^d)\); Taylor expanding \(\varphi\) at the origin gives
We relate the integral in the \([\cdots]\) to the Fourier transformation \(\hat{f}\) by
Thus,
On the other hand, because of the identity
integration by parts on the right hand side gives
Plugging this back, we find
Since \(\varphi\) is arbitrary, we finally arrive at
\(\displaystyle f (x) = \sum_{n = 0}^{+ \infty} \frac{(- \mathrm{i})^n}{n!} (\partial^{\alpha_1} \cdots \partial^{\alpha_n} \hat{f}) (0) (\partial_{\alpha_1} \cdots \partial_{\alpha_n} \delta) (x) .\) | (5.6) |
Notice that the left hand side of this expansion is a function, while the right hand side is a series of generalized functions (the derivatives of the Dirac delta function). So, this expansion has meaning only when it is applied to a test function in Schwartz space. This is just our situation, where \(q_{\Delta t}\) is applied to its right hand side in equation 4.4, which, as we have discussed at the end of section 3.3, is in Schwartz space if \(p (\cdot, 0)\) initially is. Replacing \(\hat{f} (k)\) by \(\hat{\zeta} (x, k, \Delta t)\), we find
Applying to \(\varphi \in S (\mathbb{R}^d)\) and taking integration by parts gives
So, we have to show that the coefficients are well-defined. Inserting equation 5.5, we find the coefficient
The terms in the series on the right hand side are all proportional to partial derivatives of \(\hat{r} (x, k)\) on \(k\) at the origin, which are proportional to the moments of transition rate, \(K (x)\). As concluded in section 3.2, \(K_n (x)\) vanishes for all \(n > N_{\operatorname{cut}}\). So, the series also terminates at finite \(m\). In the series over \(n\), the \(n\)-th term (as a series over \(m\)) terminates at approximately \(m = n N_{\operatorname{cut}}\). And since all \(K (x)\) are bounded, the number of terms in the series over \(m\) grows only linearly with \(n\), while the prefactor \(\Delta t^n / n!\) decays factorially. Hence, the series over \(n\) converges. It means that the application onto \(\varphi\) results in a well-defined residue, proportional to \(\Delta t^2\).
The path integral 5.2 integrates over paths in both \(x\) and \(k\). In this section, we are to see when we can marginalize (integrate out) the \(k\)-component, leaving only the \(x\), namely the path on the alphabet.
Given \(i \in \{ 0, \ldots, N - 1 \}\), we are to integrate out the \(k_i\) in equation 5.2, together with the action 5.3. Now we can safely neglect the residue, and write the integral as (replacing \(x_i\) by \(x\), \(k_i\) by \(k\), and \(x_{i + 1} - x_i\) by \(\epsilon\) for simplicity)
\(\displaystyle I (x) := \int_{\mathbb{R}^d} \frac{\mathrm{d} k}{(2 \pi)^d} \exp \left( \mathrm{i} [\epsilon^{\alpha} - K_1^{\alpha} (x) \Delta t] k_{\alpha} - \frac{\Delta t}{2!} K_2^{\alpha \beta} (x) k_{\alpha} k_{\beta} + \frac{\mathrm{i} \Delta t}{3!} K_3^{\alpha \beta \gamma} (x) k_{\alpha} k_{\beta} k_{\gamma} + \cdots \right) . \) | (5.7) |
The series terminates at the cut-off \(N_{\operatorname{cut}}\) of \(K_n\).
This integral is complicated except for \(N_{\operatorname{cut}} = 2\), where it becomes a Gaussian integral, and the Markovian process reduces to a Langevin process, defined in section 3.6. In this situation, we have (re-denoting \(K_1\) by \(f\) and \(K_2\) by \(\Sigma\) as in section 3.6)
As proved in the same section, \(\Sigma\) is everywhere positive definite. Then by the formula of Gaussian integral, which holds for any positive definite matrix \(A\),
we find (replacing \(A_{\alpha \beta}\) by \(\Sigma^{\alpha \beta} (x) \Delta t\) and \(b_{\alpha}\) by \(\mathrm{i} [\epsilon^{\alpha} - f^{\alpha} (x) \Delta t]\))
\(\displaystyle I (x) = \frac{1}{\sqrt{(2 \pi \Delta t)^d \det \Sigma (x)}} \exp \left( - \frac{\Delta t}{2} [\Sigma^{- 1} (x)]_{\alpha \beta} \left[ \frac{\epsilon^{\alpha}}{\Delta t} - f^{\alpha} (x) \right] \left[ \frac{\epsilon^{\beta}}{\Delta t} - f^{\beta} (x) \right] \right) . \) | (5.8) |
Notice that this is consistent with the result in section 3.7.
But there is an extra factor \(\sqrt{\det \Sigma (x)}\) outside the exponential. To match the path integral formulation 5.1, in which all integration variables are in the exponential, we have to convert this factor into an exponential too. It is found that only when \(\Sigma\) is constant can we do so (we leave the general situation to section 5.4). Since \(\Sigma\) is a real symmetric matrix, we perform the Cholesky factorization introduced in section 3.7, so that \(\Sigma = C^T C\) for some \(d \times d\) matrix \(C\). The coordinate transformation \(x \rightarrow x C\) (also taken on \(\epsilon\) and \(f\)) then eliminates the \(\Sigma\) matrix in the exponential, leaving
up to a constant term in the exponential. Plugging \(I (x)\) back to equation 5.2, we find
with
and
\(\displaystyle S (x) = \sum_{i = 0}^{N - 1} \Delta t \left\{ \frac{1}{2} \sum_{\alpha = 1}^d \left[ \left( \frac{x_{i + 1}^{\alpha} - x_i^{\alpha}}{\Delta t} - f^{\alpha} (x_i) \right) \right]^2 \right\} . \) | (5.9) |
It is a path integral on the alphabet.
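A quick numerical check of equation 5.9 (one-dimensional, \(\Sigma = 1\), with a hypothetical drift \(f\)): on any test path, \(- S\) equals the log of the product of the Gaussian transition densities of section 3.7, up to the \((2 \pi \Delta t)^{- N / 2}\) prefactor that was absorbed into the measure \(D (x)\).

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -x
N, dt = 50, 0.01
x = np.cumsum(rng.normal(scale=np.sqrt(dt), size=N + 1))   # an arbitrary test path

S = dt * np.sum(0.5 * (np.diff(x) / dt - f(x[:-1])) ** 2)  # the action (5.9)
log_q = np.sum(-0.5 * (np.diff(x) - f(x[:-1]) * dt) ** 2 / dt
               - 0.5 * np.log(2 * np.pi * dt))             # log of Gaussian product
print(-S, log_q + 0.5 * N * np.log(2 * np.pi * dt))        # the two agree
```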
A final remark on the cut-off \(N_{\operatorname{cut}} = 2\). If we choose \(N_{\operatorname{cut}} > 2\), it is hard to see how to integrate the improper integral 5.7, or even to show why it is finite. For example, if \(N_{\operatorname{cut}} = 4\), the \((\Delta t / 4!) K^{\alpha \beta \gamma \sigma}_4 (x) k_{\alpha} k_{\beta} k_{\gamma} k_{\sigma}\) term will dominate the integral when \(k\) is far from the origin. But we cannot ensure that this term suppresses the integrand as \(\| k \| \rightarrow + \infty\) so as to make the improper integral finite. We cannot even diagonalize the fourth order symmetric tensor \(K_4 (x)\) (diagonalizing a fourth order symmetric tensor imposes \(\mathcal{O} (d^4)\) restrictions, while a coordinate transformation has only \(\mathcal{O} (d^2)\) degrees of freedom, so this cannot be done except in specific situations).
We have to briefly introduce Grassmann variables, on which the Berezin integral is based. Grassmann variables extend the real (or complex) variables by introducing anti-commutativity. Given a set of variables \(\{ \zeta_i |i = 1, \ldots, n \}\), we demand the anti-commutation relation between the \(\zeta\)s
But for any real (or complex) variable \(x\), we demand a commutative relation
Because of anti-commutation, we have \(\zeta_i \zeta_i = - \zeta_i \zeta_i\), thus \(\zeta_i \zeta_i = 0\). So, a polynomial of a single Grassmann variable is always linear, as \(f (\zeta) = a + b \zeta\), where the coefficients \(a, b \in \mathbb{R}\). And a polynomial of two Grassmann variables, \(\zeta\) and \(\eta\), is
where the coefficients are real numbers. An extra term like \(\zeta \eta \zeta = - \zeta \zeta \eta\) vanishes, since \(\zeta \zeta = 0\). Generally, a polynomial of \(n\) Grassmann variables has terms with no more than \(n\) factors.
A function of real variables can be defined via its Taylor expansion. So, the exponential function of Grassmann variables is defined by
If \(\zeta\) is a single Grassmann variable, we have \(\exp (\zeta) = 1 + \zeta\), since all the other terms vanish. This linearity, however, breaks down when considering more Grassmann variables. For example, consider \(\exp \left( \sum_{i = 1}^n \zeta_i \eta_i \right)\), where \(\zeta\) and \(\eta\) are multiple Grassmann variables. The maximal order is
since \(\zeta_i \eta_i\) and \(\zeta_j \eta_j\) are commutative.
Now, we introduce the integral on Grassmann variables. The integral is a linear operator, defined by
The integral measures \(\{ \mathrm{d} \zeta_i |i = 1, \ldots, n \}\) are also anti-commutative, namely
So, as an example, we have
\(\displaystyle \int \mathrm{d} \eta_n \int \mathrm{d} \zeta_n \cdots \int \mathrm{d} \eta_1 \int \mathrm{d} \zeta_1 \exp \left( \sum_{i = 1}^n \zeta_i \eta_i \right) = \cdots + \left[ \int \mathrm{d} \eta_n \int \mathrm{d} \zeta_n \cdots \int \mathrm{d} \eta_1 \int \mathrm{d} \zeta_1 (\zeta_1 \eta_1 \cdots \zeta_n \eta_n) \right] = 1, \) | (5.10) |
where the \([\cdots]\) terms are all vanishing because they do not have sufficiently many Grassmann variables.
Next, we investigate how a linear transformation affects the integral measure. To do so, consider a \(d\)-dimensional Grassmann variable, in which each component is a single Grassmann variable, \(\zeta = (\zeta^1, \ldots, \zeta^d)\). If we take the linear transformation
then the product
where the summation includes all permutations of \((\beta_1, \ldots, \beta_d)\). If we re-arrange the \(\zeta^{\beta_1} \cdots \zeta^{\beta_d}\) factor into \(\zeta^1 \cdots \zeta^d\), then there comes a factor \((- 1)^{\operatorname{sign} (\beta)}\), where \(\operatorname{sign} (\beta)\) is the signature of the permutation \((\beta_1, \ldots, \beta_d)\). For example, \((2, 1, 3, 4, \ldots, d)\) has signature \(1\), since by only one transposition, \(1 \leftrightarrow 2\), we recover the natural order \((1, 2, 3, 4, \ldots, d)\). So,
The right hand side is recognized as \(\zeta^1 \cdots \zeta^d \times \det A\). On the other hand, since both \(\int \mathrm{d} \zeta_d \cdots \int \mathrm{d} \zeta_1 (\zeta_1 \cdots \zeta_d)\) and \(\int \mathrm{d} \zeta'_d \cdots \int \mathrm{d} \zeta'_1 (\zeta'_1 \cdots \zeta'_d)\) result in one, namely
we find the transformation of integral measure, as
\(\displaystyle \int \mathrm{d} \zeta'_d \cdots \int \mathrm{d} \zeta'_1 = \det A \times \int \mathrm{d} \zeta_d \cdots \int \mathrm{d} \zeta_1 . \) | (5.11) |
After introducing the Grassmann variable and the integration over multiple such variables, we come to the formula of the Berezin integral, which is an analog of the Gaussian integral for Grassmann variables. Consider the Gaussian-like integral

\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta \exp (A_{\alpha \beta} \zeta^{\alpha} \eta^{\beta}),\)
where \(A \in \mathbb{R}^{d \times d}\) (or \(\mathbb{C}^{d \times d}\)) and we have briefly denoted
\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta := \int \mathrm{d} \eta_d \int \mathrm{d} \zeta_d \cdots \int \mathrm{d} \eta_1 \int \mathrm{d} \zeta_1 . \) | (5.12) |
Defining \(\zeta'_{\beta} := A_{\alpha \beta} \zeta^{\alpha}\) and using equation 5.11, it becomes

\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta \exp (A_{\alpha \beta} \zeta^{\alpha} \eta^{\beta}) = \det A \times \int \mathrm{d} \zeta' \mathrm{d} \eta \exp (\zeta'_{\beta} \eta^{\beta}) .\)

The rightmost factor has been evaluated in equation 5.10, where it results in \(1\); thus
\(\displaystyle \int \mathrm{d} \zeta \mathrm{d} \eta \exp (A_{\alpha \beta} \zeta^{\alpha} \eta^{\beta}) = \det A. \) | (5.13) |
This is called the Berezin integral, named after the Soviet Russian mathematician and physicist Felix Berezin.
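Equation 5.13 can likewise be verified with the toy implementation above (reusing `Grassmann`, `berezin`, and `gexp` from the earlier sketch; the random matrix is our own choice), integrating with the measure ordered as in equation 5.12:

```python
import numpy as np
# Reuses Grassmann, berezin, and gexp from the sketch above.
rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d))
zeta = [Grassmann.gen(2 * b) for b in range(d)]      # label zeta^b -> 2b
eta = [Grassmann.gen(2 * b + 1) for b in range(d)]   # label eta^b  -> 2b + 1

exponent = Grassmann.scalar(0.0)
for a in range(d):
    for b in range(d):                               # A_{ab} zeta^a eta^b
        exponent = exponent + float(A[a, b]) * (zeta[a] * eta[b])

result = gexp(exponent, d)                 # powers beyond d vanish
for b in range(d):                         # innermost first, as in equation 5.12
    result = berezin(result, 2 * b)        # d(zeta_b)
    result = berezin(result, 2 * b + 1)    # then d(eta_b)
print(result.terms.get((), 0.0), np.linalg.det(A))   # the two values agree
```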
After introducing the Berezin integral, we go back to deal with equation 5.8, where \(\Sigma\) is not a constant. We first use the Cholesky factorization to remove the square root and the fraction. Then we use the Berezin integral to convert the determinant into an exponential.
Introduced in section 3.7, the Cholesky factorization decomposes \(\Sigma (x)\) into \(C^T (x) C (x)\). Instead of the matrix-valued field \(C\), it is more convenient to use its inverse \(R (x) := C^{- 1} (x)\), thus \(\Sigma^{- 1} (x) = R^T (x) R (x)\). So, writing \(v^{\beta} := \epsilon^{\beta} / \Delta t - f^{\beta} (x)\), we have

\(\displaystyle \Sigma^{- 1}_{\alpha \beta} (x) \, v^{\alpha} v^{\beta} = \sum_{\alpha = 1}^d [R_{\alpha \beta} (x) v^{\beta}]^2 \qquad \text{and} \qquad \frac{1}{\sqrt{\det \Sigma (x)}} = \det R (x),\)
and thus
\(\displaystyle I (x) = \frac{1}{\sqrt{(2 \pi \Delta t)^d}} [\det R (x)] \exp \left( - \frac{\Delta t}{2} \sum_{\alpha = 1}^d \left[ R_{\alpha \beta} (x) \left( \frac{\epsilon^{\beta}}{\Delta t} - f^{\beta} (x) \right) \right]^2 \right) . \) | (5.14) |
The determinant thus replaces the square root and the fraction. Remark that \(R (x)\) may not be a symmetric matrix.
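Numerically, this decomposition can be sketched as follows (a minimal illustration with `numpy`; the matrix and all names are toy choices). Since numpy's Cholesky routine returns a lower-triangular \(L\) with \(\Sigma = L L^T\), we take \(C := L^T\):

```python
import numpy as np

# Sigma = C^T C with C from the Cholesky factorization, R = C^{-1},
# so Sigma^{-1} = R^T R and det R = 1 / sqrt(det Sigma), as used in
# equation 5.14.
rng = np.random.default_rng(42)
B = rng.normal(size=(3, 3))
Sigma = B @ B.T + 3.0 * np.eye(3)   # a symmetric positive-definite covariance

L = np.linalg.cholesky(Sigma)       # lower-triangular L with Sigma = L L^T
C = L.T                             # so C := L^T satisfies Sigma = C^T C
R = np.linalg.inv(C)

print(np.allclose(np.linalg.inv(Sigma), R.T @ R))                        # True
print(np.isclose(np.linalg.det(R), 1 / np.sqrt(np.linalg.det(Sigma))))   # True
```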
Then, using the Berezin integral, we can convert the \(\det R (x)\) factor in equation 5.14 into an exponential. Replacing \(A\) by \(R (x) \Delta t\) in equation 5.13, we find

\(\displaystyle \det [R (x) \Delta t] = \int \mathrm{d} \zeta \mathrm{d} \eta \exp (\Delta t R_{\alpha \beta} (x) \zeta^{\alpha} \eta^{\beta}) .\)

Since \(\det [R (x) \Delta t] = \Delta t^d \det R (x)\), we convert the determinant into an exponential, as

\(\displaystyle \det R (x) = \Delta t^{- d} \int \mathrm{d} \zeta \mathrm{d} \eta \exp (\Delta t R_{\alpha \beta} (x) \zeta^{\alpha} \eta^{\beta}) .\)
You may try to convert the determinant directly into an exponential by using the logarithm, namely \(\det R (x) = \exp \{ \ln [\det R (x)] \}\). This fails in our situation because we expect the term in the exponential to be proportional to \(\Delta t\).
In physics, the Grassmann variables \(\zeta\) and \(\eta\) are called “ghost variables”.
Plugging this back into equation 5.2, we get
with the abbreviation
and the “action” of the Langevin process is
\(\displaystyle S (x, \zeta, \eta) := \sum_{i = 0}^{N - 1} \Delta t \left\{ \frac{1}{2} \sum_{\alpha = 1}^d \left[ R_{\alpha \beta} (x_i) \left( \frac{x_{i + 1}^{\beta} - x_i^{\beta}}{\Delta t} - f^{\beta} (x_i) \right) \right]^2 + R_{\alpha \beta} (x_i) \zeta^{\alpha}_i \eta_i^{\beta} \right\} . \) | (5.15) |
The quantity \(C := - (d / 2) (\ln 2 \pi + 3 \ln \Delta t)\) is independent of \(x\), \(\zeta\), and \(\eta\), and is thus regarded as a constant.
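As an illustration of how one would evaluate the discretized action numerically, here is a sketch (all names and the toy drift, covariance, and path are our own choices). The ghost term \(R_{\alpha \beta} \zeta^{\alpha} \eta^{\beta}\) is omitted, since \(\zeta\) and \(\eta\) are Grassmann-valued and have no numerical representation; only the quadratic part of equation 5.15 is computed:

```python
import numpy as np

def langevin_action(xs, f, R, dt):
    """Sum over steps of (dt/2) * || R(x_i) ((x_{i+1} - x_i)/dt - f(x_i)) ||^2."""
    S = 0.0
    for i in range(len(xs) - 1):
        v = (xs[i + 1] - xs[i]) / dt - f(xs[i])
        S += 0.5 * dt * float(np.sum((R(xs[i]) @ v) ** 2))
    return S

# Example: a one-dimensional linear drift f(x) = -x with unit covariance.
f = lambda x: -x
R = lambda x: np.eye(1)
xs = [np.array([1.0]), np.array([0.9]), np.array([0.85])]
print(langevin_action(xs, f, R, dt=0.01))
```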
In the summer of 2024, I was wondering how a stochastic system relaxes to its equilibrium. I did not find a textbook that matched my expectation: traditional textbooks are either too complicated for me or not rigorous enough. So, I decided to build up the theory by hand. I found doing the mathematics much more fascinating than reading textbooks. The proof of relaxation generalizes the one found by Ludwig Boltzmann in 1872, which I had read in a textbook on statistical mechanics fifteen years ago. Then, to make the notes self-contained, I wrote an introduction to relative entropy, which was the starting point of the proof. Surprisingly, by appending locality to the axioms of information, relative entropy turns out to be the unique possibility. The next surprise came from introducing smooth structure into stochastic systems: I found the cut-off through a series of complicated calculations. Later, in the winter, I found a way of expanding a function in a series of generalized functions. It led to a direct proof of the Kramers-Moyal expansion. It also gave rise to a rigorous proof of the path integral formulation of stochastic systems, resulting in a new perspective on data-fitting.
After climbing to the top of a mountain, the horizon broadens. Out of what we have explored, there arise more interesting questions. How does information propagate in a stochastic system? Why do the starlings in a flock, the ants in a colony, and even the trees in a forest behave as a whole, like an individual organism? There are many more stories to be told, many more treasures to be sought, and much more beauty to be enjoyed. The new journey lies ahead.