Commonly Used Formulas
Entropy formula of the multivariate normal distribution
Assume \(x \sim N(\mu, \Sigma)\) with \(x\in R^p\); then the entropy is \[E_x(-\ln P(x; \mu, \Sigma))=\frac{1}{2}\ln|\Sigma| + \frac{p}{2}(1+\ln(2\pi)).\]
For more details, please see https://math.stackexchange.com/questions/2029707/entropy-of-the-multivariate-gaussian .
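As a sanity check, the closed-form entropy can be compared against scipy's built-in value (a minimal sketch; the dimension \(p\) and the covariance below are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 3
M = np.random.randn(p, p)
Sigma = M @ M.T + p * np.eye(p)  # a random positive-definite covariance

# Closed form: (1/2) ln|Sigma| + (p/2)(1 + ln(2*pi))
sign, logdet = np.linalg.slogdet(Sigma)
entropy_formula = 0.5 * logdet + 0.5 * p * (1 + np.log(2 * np.pi))

# scipy's entropy of the same distribution
entropy_scipy = multivariate_normal(mean=np.zeros(p), cov=Sigma).entropy()

print(np.isclose(entropy_formula, entropy_scipy))  # True
```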
Matrix Decompositions
Cholesky decomposition, QR decomposition, LU decomposition, SVD, and Jordan decomposition: see https://blog.csdn.net/mucai1/article/details/85242098
Matrix inverse trick: assume \(A \in R^{p\times p}\) is symmetric positive definite. We can speed up the computation via the Cholesky decomposition \[A = L L^T.\] Then we have \[A^{-1} = (L^{-1})^T L^{-1}.\]
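A minimal sketch of this trick with scipy (assuming a random positive-definite \(A\); `cho_factor`/`cho_solve` reuse the triangular factor, which is where the speed-up comes from):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

p = 500
M = np.random.randn(p, p)
A = M @ M.T + p * np.eye(p)  # random symmetric positive-definite matrix

# Factor once: A = L L^T (about half the cost of a general LU-based inverse)
c, low = cho_factor(A)

# A^{-1} = (L^{-1})^T L^{-1}, obtained here by solving A X = I with the factor
A_inv = cho_solve((c, low), np.eye(p))

print(np.allclose(A @ A_inv, np.eye(p)))  # True
```

In practice it is usually better to keep the factor and call `cho_solve` on right-hand sides directly, rather than forming \(A^{-1}\) explicitly.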
Computational Complexity of Matrix Operations
We list the complexities of common operations; see https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Transforms for more details.
The complexity of computing \(|A|\) is \(O(n^3)\) if \(A \in R^{n\times n}\).
The complexity of computing \(A^{-1}\) is \(O(n^3)\) if \(A \in R^{n\times n}\).
The complexity of \(AB\) is \(O(nmp)\) if \(A \in R^{n\times m}, B \in R^{m \times p}\).
The complexity of the singular value decomposition of \(A\) is \(O(mn^2)\) if \(A \in R^{m\times n}\), where \(m\geq n\).
The complexity of QR decomposition of \(A\) is \(O(\min(mn^2,nm^2))\) if \(A \in R^{m\times n}\).

Formula for Computing How Much RAM a Matrix Occupies
On a 64-bit machine, a double-precision scalar occupies 64 bits, i.e., 8 bytes. In C++, however, the size of long double is implementation-defined; on some platforms it is 128 bits, twice the size of an ordinary double.
On the difference between bit and byte: one byte equals 8 bits, i.e., 1 B = 8 b. When writing units, pay attention to the case and meaning of the letter B: MByte means megabyte, while Mbit means megabit. 1 KByte = 1024 Bytes and 1 MB = 1024 KB.
For back-of-the-envelope computations I use the simpler formula: an \(r \times c\) double-precision matrix requires \(rc \cdot 8/10^9\) gigabytes (GB) of RAM.
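A quick check of this formula against numpy's actual buffer size (a sketch; note the \(10^9\) divisor gives decimal GB rather than binary GiB, and `matrix_gb` is a hypothetical helper name):

```python
import numpy as np

def matrix_gb(r, c, bytes_per_entry=8):
    """Approximate RAM used by an r x c double-precision matrix, in GB."""
    return r * c * bytes_per_entry / 1e9

r, c = 10_000, 5_000
X = np.zeros((r, c))      # float64 by default, 8 bytes per entry
print(matrix_gb(r, c))    # 0.4 (GB)
print(X.nbytes / 1e9)     # 0.4, matching the formula
```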
Stirling’s Formula
This formula replaces the factorial with a closed-form expression that is easy to manipulate analytically. The formula is given by \[n! \approx \left(\frac{n}{e}\right)^n \sqrt{2\pi n},\] where the two sides are asymptotically equal as \(n \to \infty\).
References:
James Stirling, 1730, Methodus Differentialis (English translation: The Differential Method, with a Tract on Summation and Interpolation of Infinite Series).
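A quick numerical check (a sketch using Python's exact integer factorial): the relative error of the approximation behaves like \(1/(12n)\), so it vanishes as \(n\) grows.

```python
import math

for n in [5, 10, 50, 100]:
    stirling = (n / math.e) ** n * math.sqrt(2 * math.pi * n)
    rel_err = abs(math.factorial(n) - stirling) / math.factorial(n)
    print(n, rel_err)  # roughly 1/(12n): 1.7e-2, 8.3e-3, 1.7e-3, 8.3e-4
```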
Sherman–Morrison–Woodbury Formula
This formula is commonly used in statistical computing; it converts a high-dimensional matrix inverse into a low-dimensional one.
Assume \(A \in R^{p \times p}, U \in R^{p\times n}, B \in R^{n \times n}, V \in R^{p \times n}\), where \(A\) and \(B\) are invertible; then \[(A + UBV^T)^{-1}=A^{-1} - A^{-1} U(B^{-1} + V^T A^{-1} U)^{-1} V^T A^{-1}. \tag{2.1}\] See http://fourier.eng.hmc.edu/e176/lectures/algebra/node6.html for a detailed proof.
Special Case I: In statistics, we care about \(R=(\lambda I_p + X^T X)^{-1}\), where \(X \in R^{n \times p}\). We know \(R = \lambda^{-1} (I_p + X^T \lambda^{-1} I_n X)^{-1}\). Let \(A=I_p, U=V=X^T, B =\lambda^{-1} I_n\). By \((2.1)\), we can easily obtain \[R = \lambda^{-1} \{I_p - X^T(\lambda I_n + X X^T)^{-1} X\}.\]
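A numerical check of Special Case I (a sketch; the payoff comes when \(p \gg n\), since a \(p \times p\) inverse is replaced by an \(n \times n\) one):

```python
import numpy as np

n, p, lam = 20, 500, 0.5
X = np.random.randn(n, p)

# Direct p x p inverse: O(p^3)
R_direct = np.linalg.inv(lam * np.eye(p) + X.T @ X)

# Woodbury version: only an n x n inverse plus matrix products
R_woodbury = (np.eye(p) - X.T @ np.linalg.inv(lam * np.eye(n) + X @ X.T) @ X) / lam

print(np.allclose(R_direct, R_woodbury))  # True
```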
Matrix multiplication formulas
Vector calculation plays an important role in statistical computing.
Suppose \(A \in R^{I \times J}, B \in R^{K \times L}, C \in R^{J \times M}, D \in R^{L \times N}\); then \[(A \otimes B) (C \otimes D)=(AC) \otimes (BD), \] \[(A \otimes B)^+= A^+ \otimes B^+, \quad (A \otimes B)^T= A^T \otimes B^T,\] \[A \odot B \odot C = (A \odot B) \odot C = A \odot (B \odot C),\] \[(A \odot B)^T (A \odot B) = A^TA * B^TB, \quad A \in R^{I \times J}, B \in R^{K \times J},\] \[(A \odot B)^+= (A^TA * B^TB)^+ (A \odot B)^T,\] where \(\otimes\) denotes the Kronecker product, \(\odot\) the (column-wise) Khatri–Rao product, \(*\) the Hadamard product, and \(A^+\) the Moore–Penrose pseudoinverse of \(A\).
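These identities are easy to verify numerically (a sketch; `scipy.linalg.khatri_rao` implements the column-wise Khatri–Rao product, and the dimensions below are arbitrary):

```python
import numpy as np
from scipy.linalg import khatri_rao

I, J, K, L = 3, 4, 5, 6
A = np.random.randn(I, J); B = np.random.randn(K, L)
C = np.random.randn(J, 2); D = np.random.randn(L, 3)

# (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))  # True

# (A ⊗ B)^+ = A^+ ⊗ B^+
print(np.allclose(np.linalg.pinv(np.kron(A, B)),
                  np.kron(np.linalg.pinv(A), np.linalg.pinv(B))))  # True

# (A ⊙ B)^T (A ⊙ B) = A^T A * B^T B, with matching column counts
A2 = np.random.randn(I, J); B2 = np.random.randn(K, J)
KR = khatri_rao(A2, B2)
print(np.allclose(KR.T @ KR, (A2.T @ A2) * (B2.T @ B2)))  # True
```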
Reordering formula for the vectorization of a matrix product: \[vec(AXB) = (B^T \otimes A) vec (X). \tag{A1}\]
Special case 1: let \(B=b\) be a column vector; then \[AXb= (b^T \otimes A)vec(X). \tag{A2}\]
Special case 2: if \(X \in R^{I \times J}\) and \(b\in R^{J\times 1}\), then \[Xb = (b^T \otimes I_I) vec(X) \in R^I, \tag{A3}\] where \(I_I\) denotes the \(I \times I\) identity matrix. (A3) follows directly from (A1) because \(Xb = I_I X b\) and a column vector is a special case of a matrix.
Let \(A\in R^{K\times I}, B \in R^{I \times m}, C \in R^{m \times n}\). Based on (A1), we can derive the commonly used edge-rotation formula for matrix vectorization: \[vec(ABC) = (I_n \otimes AB) vec(C)=(C^T B^T \otimes I_K) vec(A). \tag{A4}\] Note that the first equality rotates the vectorized version of \(C\) to the rightmost edge, while the second rotates the vectorized version of \(A\) to the rightmost edge. The benefit in statistics is that if \(A\) or \(C\) is the parameter of interest, (A4) separates the parameter from the data, helping us derive a closed-form expression for the parameter, or an explicit update in an iterative algorithm.
A special case of (A4): \[vec(AB) = (I_m \otimes A) vec(B) = (B^T \otimes I_K) vec(A). \tag{A5}\] Using (A5), we obtain \[vec(ABC)= (I_n \otimes A) (C^T \otimes I_I) vec(B). \tag{A6}\] (A6) tells us how to rotate a parameter sandwiched between two matrices out to the edge.
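A numerical check of (A1) and (A4) (a sketch; vec stacks columns, so numpy needs `order='F'`):

```python
import numpy as np

def vec(M):
    # Column-stacking vectorization, matching the math convention
    return M.reshape(-1, order='F')

K, I, m, n = 3, 4, 5, 6
A = np.random.randn(K, I); B = np.random.randn(I, m); C = np.random.randn(m, n)
X = np.random.randn(I, m)

# (A1): vec(AXB) = (B^T ⊗ A) vec(X), with C playing the role of B here
print(np.allclose(vec(A @ X @ C), np.kron(C.T, A) @ vec(X)))  # True

# (A4): vec(ABC) = (I_n ⊗ AB) vec(C) = (C^T B^T ⊗ I_K) vec(A)
print(np.allclose(vec(A @ B @ C), np.kron(np.eye(n), A @ B) @ vec(C)))      # True
print(np.allclose(vec(A @ B @ C), np.kron(C.T @ B.T, np.eye(K)) @ vec(A)))  # True
```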
Other formulas: 1) Hadamard product: for \(A, B \in R^{n \times m}\), \(vec(A * B)=vec(A) * vec(B)\).
2) Inner product: for \(A, B \in R^{n \times m}\), \[\langle A, B\rangle = tr(A^TB)=vec(A)^T vec(B)=vec(B)^T vec(A).\]
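Both identities follow directly from the definitions; a one-line numerical check of each (a sketch):

```python
import numpy as np

n, m = 4, 3
A = np.random.randn(n, m); B = np.random.randn(n, m)
vec = lambda M: M.reshape(-1, order='F')  # column-stacking vectorization

# vec(A * B) = vec(A) * vec(B): the Hadamard product commutes with vec
print(np.allclose(vec(A * B), vec(A) * vec(B)))         # True

# <A, B> = tr(A^T B) = vec(A)^T vec(B)
print(np.allclose(np.trace(A.T @ B), vec(A) @ vec(B)))  # True
```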
Commonly used matrix (vector) derivative formulas
As mentioned above, there are two commonly used layout conventions for derivatives with respect to vectors and matrices, and no standard appears to be emerging yet. Here the numerator layout convention is adopted for convenience. By default, a vector is laid out as a column vector.
It is worthwhile to note that if \(y=F(w)\) with \(y\in R, w\in R^q\), then \[\frac{\partial y}{\partial w}= (\frac{\partial y}{\partial w_1}, \cdots, \frac{\partial y}{\partial w_q}) \in R^{1\times q}.\]
Differentiating a scalar with respect to a column vector produces a row vector, while differentiating a column-vector function with respect to a scalar produces a column vector, e.g. \[\frac{\partial w}{\partial y}= (\frac{\partial w_1}{\partial y}, \cdots, \frac{\partial w_q}{\partial y})^T \in R^{q \times 1}.\]
The derivative of a vector function (a vector whose components are functions) \(y\in R^p\) with respect to an input vector \(x\in R^q\) is written (in numerator layout notation) as \[\frac{\partial y}{\partial x}=[\frac{\partial y_1}{\partial x_1}, \cdots, \frac{\partial y_p}{\partial x_1}; \cdots; \frac{\partial y_1}{\partial x_q}, \cdots, \frac{\partial y_p}{\partial x_q}]^T \in R^{p\times q},\] whose \(j\)-th column is \(\frac{\partial y}{\partial x_j} \in R^{p \times 1}\); the semicolon notation mimics how matrices are written in Matlab.
The derivative of a scalar function \(y\in R\) w.r.t. a matrix \(X\in R^{p\times q}\) is written as \[\frac{\partial y}{\partial X}=[\frac{\partial y}{\partial x_{11}}, \cdots, \frac{\partial y}{\partial x_{p1}}; \cdots; \frac{\partial y}{\partial x_{1q}}, \cdots, \frac{\partial y}{\partial x_{pq}}] \in R^{q\times p}.\]
The derivative of a matrix function \(Y\in R^{p\times q}\) w.r.t. a scalar \(x\in R\) is known as the tangent matrix and is written as \[\frac{\partial Y}{\partial x}=[\frac{\partial y_{11}}{\partial x}, \cdots, \frac{\partial y_{1q}}{\partial x}; \cdots; \frac{\partial y_{p1}}{\partial x}, \cdots, \frac{\partial y_{pq}}{\partial x}] \in R^{p\times q}.\]
In a word, the size of a derivative is the dimension of the dependent variable times that of the transpose of the independent variable, i.e., \(\dim(\partial y/\partial x)=\dim(y) \times \dim(x^T)\). For example, if \(y\in R\) and \(X\in R^{p\times q}\), then \(\dim(\partial y/\partial X)= 1\times (q\times p)=q\times p\).
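A finite-difference sketch that makes the numerator-layout shapes concrete (the linear map below is an arbitrary example; `num_jacobian` is a hypothetical helper):

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    """Numerator-layout Jacobian of f: R^q -> R^p, returned as a p x q matrix."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        xp = x.copy(); xp[j] += eps
        J[:, j] = (np.atleast_1d(f(xp)) - f0) / eps  # j-th column is df/dx_j
    return J

A = np.random.randn(2, 3)
f = lambda x: A @ x                  # y = Ax, so dy/dx = A in numerator layout
J = num_jacobian(f, np.random.randn(3))
print(J.shape)                       # (2, 3): dim(y) x dim(x^T)
print(np.allclose(J, A, atol=1e-4))  # True
```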
(1) Scalar-to-matrix derivatives: let \(W\in R^{p\times q}\).
\(F(W)= a^TWb: \frac{\partial F}{\partial W}= b a^T;\)
\(F(W)= a^TW^Tb: \frac{\partial F}{\partial W}= a b^T;\)
\(F(W)=a(W)b(W):\frac{\partial F}{\partial W}=a(W)\frac{\partial b(W)}{\partial W}+ b(W)\frac{\partial a(W)}{\partial W},\) where \(a(W)\) and \(b(W)\) are scalar-valued functions;
The trace function is only well-defined for square matrices, so if \(p=q\), \(F(W)=tr(W): \frac{\partial F}{\partial W}= I_q;\)
\(F(W)=tr(U(W)+ V(W)): \frac{\partial F}{\partial W}= \frac{\partial tr(U)}{\partial W}+\frac{\partial tr(V)}{\partial W};\)
\(F(W)=tr(AW)=tr(WA): \frac{\partial F}{\partial W}= A;\)
\(F(W)=tr(AW^T)=tr(W^TA): \frac{\partial F}{\partial W}= A^T;\)
\(F(W)=tr(W^TAW): \frac{\partial F}{\partial W}= W^T(A+A^T);\) I found that we can use the LRD (Left and Right Derivative) method to obtain the result, i.e. \(\frac{\partial tr(W^TAW)}{\partial W}= \frac{\partial tr((W^TA)W)}{\partial W}+ \frac{\partial tr(W^T(AW))}{\partial W}=W^TA + W^T A^T\), where the term in \((\cdot)\) is regarded as constant.
\(F(W)=tr(W^{-1}A): \frac{\partial F}{\partial W}= -W^{-1}AW^{-1};\)
\(F(W)=tr(AWB)=tr(WBA): \frac{\partial F}{\partial W}= BA;\)
\(F(W)=tr(AWBW^TC)=tr(CAWBW^T): \frac{\partial F}{\partial W}= (CAWB)^T+ BW^TCA;\) This can be obtained by the LRD method.
\(F(W)=|W|: \frac{\partial F}{\partial W}= |W|W^{-1};\)
\(F(W)=\ln(a|W|): \frac{\partial F}{\partial W}= W^{-1};\)
\(F(W)=\|Wa +b\|^2: \frac{\partial F}{\partial W}= 2a (Wa+b)^T;\)
\(F(W)=(Wa)^T C W b: \frac{\partial F}{\partial W}= ab^TW^TC^T + b a^T W^TC;\) This can be obtained from the first two formulas and the LRD method.
I conclude that 1) the Layout-Match (LM) method, \(nrow(\frac{\partial F}{\partial W})=ncol(W)\), can be used to check that the dimension of a result is correct; 2) the LRD method helps to deduce the derivatives of product terms.
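The table entries can also be verified by finite differences in numerator layout, where differentiating a scalar by \(W \in R^{p \times q}\) yields a \(q \times p\) matrix (a sketch; `num_grad` is a hypothetical helper and \(tr(W^TAW)\) is the entry checked):

```python
import numpy as np

def num_grad(F, W, eps=1e-6):
    """Numerator-layout derivative of a scalar F(W): entry (j, i) is dF/dW_ij."""
    G = np.zeros((W.shape[1], W.shape[0]))
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp = W.copy(); Wp[i, j] += eps
            G[j, i] = (F(Wp) - F(W)) / eps
    return G

p, q = 4, 3
A = np.random.randn(p, p)
W = np.random.randn(p, q)

F = lambda W: np.trace(W.T @ A @ W)
closed_form = W.T @ (A + A.T)  # the tabulated derivative of tr(W^T A W)
print(np.allclose(num_grad(F, W), closed_form, atol=1e-3))  # True
```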
See https://en.wikipedia.org/wiki/Matrix_calculus#Derivatives_with_vectors for more information.