Khoa Blog

CVX_2: Convex set and Cone

07/04/2021 00:20

cvx2

1. Convex sets

Definition 1: A set $\mathbf{C}$ is convex if the line segment between any two distinct points in $\mathbf{C}$ lies in $\mathbf{C}$.
- That is, for any $\mathbf{x_1},\mathbf{x_2}\in\mathbf{C}$ and any $\theta\in[0,1]$, we have $\theta\mathbf{x_1}+(1-\theta)\mathbf{x_2}\in\mathbf{C}$.
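The definition can be probed numerically. The sketch below is only an illustration, not a proof: it assumes the closed Euclidean unit ball as the convex set, and the helper names (`in_unit_ball`, `convex_combination`, `random_point_in_ball`) are hypothetical, chosen for this example.

```python
import numpy as np

def in_unit_ball(x, tol=1e-12):
    """Membership test for the closed Euclidean unit ball (assumed example set)."""
    return np.linalg.norm(x) <= 1.0 + tol

def convex_combination(x1, x2, theta):
    """Return theta * x1 + (1 - theta) * x2."""
    return theta * x1 + (1.0 - theta) * x2

def random_point_in_ball(rng, dim=3):
    """Draw a random direction and scale it to a radius in [0, 1]."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return rng.uniform(0.0, 1.0) * direction

rng = np.random.default_rng(0)
x1 = random_point_in_ball(rng)
x2 = random_point_in_ball(rng)

# Check a grid of theta values in [0, 1]; every point on the segment should stay inside.
all_inside = all(
    in_unit_ball(convex_combination(x1, x2, theta))
    for theta in np.linspace(0.0, 1.0, 11)
)
print(all_inside)
```

A finite grid of $\theta$ values can only fail to find a counterexample; convexity of the ball itself follows from the triangle inequality.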

CVX_1: Affine set

15/03/2021 16:03

cvx1

1. Line and line segment:

Suppose we have two points $\mathbf{x_1}$ and $\mathbf{x_2}$:
- The line through these two points: $\mathbf{y}=\theta\mathbf{x_1}+(1-\theta)\mathbf{x_2} \ \ \ \ \ \ \ \ \forall\theta\in\mathbb{R}.\tag1$
- The line segment between these two points: $\mathbf{y}=\theta\mathbf{x_1}+(1-\theta)\mathbf{x_2} \ \ \ \ \ \ \ \ \forall\theta\in[0,1].\tag2$
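As a small illustration of the difference between the two cases, the sketch below evaluates the same formula for a $\theta$ inside $[0,1]$ and one outside it. The function name `point_on_line` and the two example points are assumptions made for this example.

```python
import numpy as np

def point_on_line(x1, x2, theta):
    """Evaluate y = theta * x1 + (1 - theta) * x2."""
    return theta * x1 + (1.0 - theta) * x2

x1 = np.array([0.0, 0.0])
x2 = np.array([2.0, 2.0])

# theta in [0, 1]: the point lies on the segment between x1 and x2.
midpoint = point_on_line(x1, x2, 0.5)

# theta outside [0, 1]: still on the line, but beyond the segment (here past x1).
beyond = point_on_line(x1, x2, 2.0)

print(midpoint, beyond)
```

Note that $\theta=1$ gives $\mathbf{x_1}$ and $\theta=0$ gives $\mathbf{x_2}$, so the endpoints belong to the segment.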

ML&PR_10: Linear Regression in statistic view

14/01/2021 16:03

mlpr1x

In this post, we discuss Linear Regression from a statistical view, and thereby answer why we should use Mean Square Error as a loss function.
We suppose that the target variable $t$ is given by a deterministic function $y(\mathbf{x},\mathbf{w})$ with additive Gaussian noise $\epsilon$: $t=y(\mathbf{x},\mathbf{w})+\epsilon\tag1$ where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$. We can write: $p(t|\mathbf{x},\mathbf{w},\beta)=\mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}).\tag2$
Now consider a data set of inputs $\mathbf{X}=\{\mathbf{x}_1,...,\mathbf{x}_N\}$ with corresponding target values $\mathbf{t}=(t_1,...,t_N)^\top$. For a linear model $y(\mathbf{x},\mathbf{w})=\mathbf{w}^\top\phi(\mathbf{x})$ with feature map $\phi$, and assuming the data points are drawn independently, we have: $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)=\prod_{n=1}^N\mathcal{N}(t_n|\mathbf{w}^\top\phi(\mathbf{x}_n),\beta^{-1}).\tag3$
We wish to maximize $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)$, which is equivalent to maximizing $\ln p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)$ because the logarithm is monotonically increasing: $\begin{align} \ln p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)&=\sum_{n=1}^N\ln\mathcal{N}(t_n|\mathbf{w}^\top\phi(\mathbf{x}_n),\beta^{-1})\tag4\\ &=\sum_{n=1}^N\ln\left[\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{(t_n-\mathbf{w}^\top\phi(\mathbf{x}_n))^2}{2\beta^{-1}}\right)\right]\tag5\\ &=N\ln\frac{1}{\sqrt{2\pi\beta^{-1}}}-\beta E_D(\mathbf{w})\tag6 \end{align}$ where $E_D(\mathbf{w})=\frac{1}{2}\sum_{n=1}^N(t_n-\mathbf{w}^\top\phi(\mathbf{x}_n))^2.\tag7$ Only the second term of $(6)$ depends on $\mathbf{w}$, and $\beta>0$, so maximizing the likelihood with respect to $\mathbf{w}$ is the same as minimizing $E_D(\mathbf{w})$, the sum-of-squares error, which equals the MSE loss up to the constant factor $N/2$. And we are done!
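The equivalence can be checked numerically. The following is a minimal sketch under stated assumptions: the polynomial feature map, the synthetic sine data, the noise precision `beta_true`, and the helper names `design_matrix` and `log_likelihood` are all hypothetical choices for this example, not part of the derivation above.

```python
import numpy as np

def design_matrix(x, degree=3):
    """Assumed feature map: phi(x) = (1, x, ..., x^degree), one row per data point."""
    return np.vander(x, degree + 1, increasing=True)

def log_likelihood(w, Phi, t, beta):
    """Gaussian log-likelihood: N * ln(1 / sqrt(2*pi/beta)) - beta * E_D(w)."""
    n = len(t)
    e_d = 0.5 * np.sum((t - Phi @ w) ** 2)
    return 0.5 * n * np.log(beta / (2.0 * np.pi)) - beta * e_d

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
beta_true = 25.0  # assumed noise precision, i.e. noise std = 0.2
t = np.sin(np.pi * x) + rng.normal(scale=beta_true ** -0.5, size=x.shape)

Phi = design_matrix(x)

# Least-squares solution: minimizes E_D(w), hence maximizes the log-likelihood in w.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

ll_ml = log_likelihood(w_ml, Phi, t, beta_true)
ll_perturbed = log_likelihood(w_ml + 0.1, Phi, t, beta_true)
print(ll_ml > ll_perturbed)
```

Any perturbation of the least-squares weights lowers the log-likelihood here, whatever positive value of $\beta$ is used, because $\beta$ only scales $E_D(\mathbf{w})$.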

ML&PR_9: Bayes’ theorem for Gaussian variables

05/01/2021 16:02

mlpr9

In this post, we discuss Bayes' theorem for Gaussian variables. With $\mathbf{x}= \begin{pmatrix} \mathbf{x}_a\\\mathbf{x}_b \end{pmatrix},\tag1$ given a Gaussian marginal distribution $p(\mathbf{x}_a)$ and a Gaussian conditional distribution $p(\mathbf{x}_b|\mathbf{x}_a)$, we wish to find the joint distribution $p(\mathbf{x}_a,\mathbf{x}_b)=p(\mathbf{x})$. As shown in MLPR8, the mean of a Gaussian conditional distribution is a linear function of the conditioning variable, so we write: $\begin{align} p(\mathbf{x}_a)&=\mathcal{N}(\mathbf{x}_a|\mu,\mathbf{\Lambda}^{-1})\tag2\\ p(\mathbf{x}_b|\mathbf{x}_a)&=\mathcal{N}(\mathbf{x}_b|\mathbf{Ax}_a+\mathbf{b},\mathbf{L}^{-1}).\tag3 \end{align}$
We consider the joint distribution: $p(\mathbf{x})=p(\mathbf{x}_a)p(\mathbf{x}_b|\mathbf{x}_a).\tag4$ Taking the log, we obtain: $\begin{align} \ln p(\mathbf{x})&=\ln p(\mathbf{x}_a)+\ln p(\mathbf{x}_b|\mathbf{x}_a)\\ &=-\frac{1}{2}(\mathbf{x}_a-\mu)^\top\mathbf{\Lambda}(\mathbf{x}_a-\mu)-\frac{1}{2}(\mathbf{x}_b-\mathbf{Ax}_a-\mathbf{b})^\top\mathbf{L}(\mathbf{x}_b-\mathbf{Ax}_a-\mathbf{b})+\text{const}.\tag5 \end{align}$
To find the covariance matrix of $p(\mathbf{x})$, we collect the terms of $(5)$ that are second order in $\mathbf{x}_a$ and $\mathbf{x}_b$ (terms that are linear or constant in $\mathbf{x}$ do not affect the covariance) and write them as a quadratic form: $\begin{align} &-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda x}_a-\frac{1}{2}(\mathbf{x}_b-\mathbf{Ax}_a)^\top\mathbf{L}(\mathbf{x}_b-\mathbf{Ax}_a)\\ &=-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda x}_a-\frac{1}{2}\mathbf{x}_b^\top \mathbf{Lx}_b-\frac{1}{2}\mathbf{x}_a^\top \mathbf{A^\top LAx}_a+\frac{1}{2}\mathbf{x}_b^\top \mathbf{LAx}_a+\frac{1}{2}\mathbf{x}_a^\top\mathbf{A^\top Lx}_b\tag6\\ &=-\frac{1}{2}\mathbf{x}_a^\top\mathbf{(\Lambda+A^\top LA) x}_a-\frac{1}{2}\mathbf{x}_b^\top \mathbf{Lx}_b+\frac{1}{2}\mathbf{x}_b^\top\mathbf{ LAx}_a+\frac{1}{2}\mathbf{x}_a^\top\mathbf{A^\top Lx}_b\tag7\\ &=-\frac{1}{2}\begin{pmatrix}\mathbf{x}_a\\\mathbf{x}_b\end{pmatrix}^\top\begin{pmatrix}\mathbf{\Lambda}+\mathbf{A^\top LA}&-\mathbf{A^\top L}\\-\mathbf{LA}&\mathbf{L}\end{pmatrix}\begin{pmatrix}\mathbf{x}_a\\\mathbf{x}_b\end{pmatrix}.\tag9 \end{align}$
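The matrix in $(9)$ is the precision matrix of $p(\mathbf{x})$, so the covariance is its inverse. The mean follows from $\mathbb{E}[\mathbf{x}_a]=\mu$ and from averaging the conditional mean $\mathbf{Ax}_a+\mathbf{b}$ over $\mathbf{x}_a$. Finally, we get: $\begin{align} \mathbb{E}[\mathbf{x}]&=\begin{pmatrix}\mu\\ \mathbf{A}\mu+\mathbf{b}\end{pmatrix}\tag{10}\\ \text{cov}[\mathbf{x}]&=\begin{pmatrix}\mathbf{\Lambda}+\mathbf{A^\top LA}&-\mathbf{A^\top L}\\-\mathbf{LA}&\mathbf{L}\end{pmatrix}^{-1}.\tag{11} \end{align}$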
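To make $(10)$ and $(11)$ concrete, the sketch below builds the joint precision matrix for randomly generated parameters and inverts it numerically. The dimensions, the random parameters, and the helper `random_spd` are assumptions for illustration. The blocks it is compared against ($\mathbf{\Lambda}^{-1}$, $\mathbf{A\Lambda}^{-1}$ and $\mathbf{L}^{-1}+\mathbf{A\Lambda}^{-1}\mathbf{A}^\top$) are the closed form of the inverse that the block-inverse formula yields; it is not derived in the text above.

```python
import numpy as np

def random_spd(n, rng):
    """Hypothetical helper: a random symmetric positive-definite matrix."""
    m = rng.normal(size=(n, n))
    return m @ m.T + n * np.eye(n)

rng = np.random.default_rng(0)
da, db = 2, 3                     # assumed dimensions of x_a and x_b
Lam = random_spd(da, rng)         # precision of p(x_a)
L = random_spd(db, rng)           # precision of p(x_b | x_a)
A = rng.normal(size=(db, da))
mu = rng.normal(size=da)
b = rng.normal(size=db)

# Joint precision matrix from (9) and joint mean from (10).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A, L]])
mean = np.concatenate([mu, A @ mu + b])

# Joint covariance from (11).
cov = np.linalg.inv(R)

Lam_inv = np.linalg.inv(Lam)
cov_aa = cov[:da, :da]
cov_ba = cov[da:, :da]
cov_bb = cov[da:, da:]
print(np.allclose(cov_aa, Lam_inv))
```

The upper-left block of the covariance recovers $\mathbf{\Lambda}^{-1}$, the covariance of $p(\mathbf{x}_a)$, as it should for a marginal.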

ML&PR_8: Conditional Gaussian

30/11/2020 23:02

mlpr8

If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.
Suppose $\mathbf{x}$ is a $D$-dimensional vector with Gaussian distribution $\mathcal{N}(\mathbf{x}|\mu,\mathbf{\Sigma})$. We split $\mathbf{x}$ into two parts, $\mathbf{x}_a$ and $\mathbf{x}_b$, where $\mathbf{x}_a$ takes the first $M$ components of $\mathbf{x}$ and $\mathbf{x}_b$ takes the remaining $D-M$ components: $\mathbf{x}= \begin{pmatrix} \mathbf{x}_a\\\mathbf{x}_b \end{pmatrix}.\tag1$
We partition the mean of $\mathbf{x}$ in the same way: $\mathbf{\mu}= \begin{pmatrix} \mathbf{\mu}_a\\\mathbf{\mu}_b \end{pmatrix}\tag2$ and likewise the covariance matrix of $\mathbf{x}$: $\mathbf{\Sigma}= \begin{pmatrix} \mathbf{\Sigma}_{aa}&\mathbf{\Sigma}_{ab}\\\mathbf{\Sigma}_{ba}&\mathbf{\Sigma}_{bb} \end{pmatrix}.\tag3$ Here, $\mathbf{\Sigma}$, $\mathbf{\Sigma}_{aa}$ and $\mathbf{\Sigma}_{bb}$ are symmetric and $\mathbf{\Sigma}_{ab}=\mathbf{\Sigma}_{ba}^\top$.
We now define the precision matrix: $\mathbf{\Lambda} = \mathbf{\Sigma}^{-1}.\tag4$ Because $\mathbf{\Sigma}$ is symmetric, $\mathbf{\Lambda}$ is also symmetric $(\text{Appendix1})$. We partition it in the same way: $\mathbf{\Lambda}= \begin{pmatrix} \mathbf{\Lambda}_{aa}&\mathbf{\Lambda}_{ab}\\ \mathbf{\Lambda}_{ba}&\mathbf{\Lambda}_{bb} \end{pmatrix}\tag5$ where $\mathbf{\Lambda}$, $\mathbf{\Lambda}_{aa}$ and $\mathbf{\Lambda}_{bb}$ are symmetric and $\mathbf{\Lambda}_{ab}=\mathbf{\Lambda}_{ba}^\top$. Note that $\mathbf{\Lambda}_{aa}$ is not the inverse of $\mathbf{\Sigma}_{aa}$, and likewise $\mathbf{\Lambda}_{bb}$ is not the inverse of $\mathbf{\Sigma}_{bb}$. We will discuss this later.
Now we discuss the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$, treating $\mathbf{x}_b$ as the observed value. We start from the joint distribution $p(\mathbf{x})=p(\mathbf{x}_a,\mathbf{x}_b)$. To explore it, we consider the quadratic form in the exponent of the Gaussian distribution (as mentioned in MLPR8) combined with the partitioning $(3)$ and $(5)$: $\begin{align} -\frac{1}{2}(\mathbf{x}-\mu)^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mu)&=-\frac{1}{2}(\mathbf{x}_a-\mu_a)^\top\mathbf{\Lambda}_{aa}(\mathbf{x}_a-\mu_a)\\ &\quad -\frac{1}{2}(\mathbf{x}_a-\mu_a)^\top\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b)\\ &\quad -\frac{1}{2}(\mathbf{x}_b-\mu_b)^\top\mathbf{\Lambda}_{ba}(\mathbf{x}_a-\mu_a)\\ &\quad -\frac{1}{2}(\mathbf{x}_b-\mu_b)^\top\mathbf{\Lambda}_{bb}(\mathbf{x}_b-\mu_b).\tag6 \end{align}$
With $\mathbf{x}_b$ fixed, $(6)$ is a function of $\mathbf{x}_a$. We can use the expansion $\begin{align} -\frac{1}{2}(\mathbf{a}-\mu)^\top\mathbf{\Sigma}^{-1}(\mathbf{a}-\mu)&=-\frac{1}{2}\mathbf{a}^\top\mathbf{\Sigma}^{-1}\mathbf{a} +\frac{1}{2}\mathbf{a}^\top\mathbf{\Sigma}^{-1}\mu+\frac{1}{2}\mu^\top\mathbf{\Sigma}^{-1}\mathbf{a} -\frac{1}{2}\mu^\top\mathbf{\Sigma}^{-1}\mu\tag7\\ &=-\frac{1}{2}\mathbf{a}^\top\mathbf{\Sigma}^{-1}\mathbf{a} +\mathbf{a}^\top\mathbf{\Sigma}^{-1}\mu-\frac{1}{2}\mu^\top\mathbf{\Sigma}^{-1}\mu\tag8 \end{align}$ (because $\mathbf{\Sigma}^{-1}$ is symmetric, $\mathbf{a}^\top\mathbf{\Sigma}^{-1}\mu=\mu^\top\mathbf{\Sigma}^{-1}\mathbf{a}$) to rewrite it as: $\begin{align} -\frac{1}{2}(\mathbf{x}-\mu)^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\mu)&=-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda}_{aa} \mathbf{x}_a+\mathbf{x}_a^\top\mathbf{\Lambda}_{aa}\mu_a-\frac{1}{2}\mu_a^\top\mathbf{\Lambda}_{aa}\mu_a\\ &\quad -\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda}_{ab} \mathbf{x}_b+\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda}_{ab}\mu_b+\frac{1}{2}\mu_a^\top\mathbf{\Lambda}_{ab}\mathbf{x}_b-\frac{1}{2}\mu_a^\top\mathbf{\Lambda}_{ab}\mu_b\\ &\quad -\frac{1}{2}\mathbf{x}_b^\top\mathbf{\Lambda}_{ba} \mathbf{x}_a+\frac{1}{2}\mathbf{x}_b^\top\mathbf{\Lambda}_{ba}\mu_a+\frac{1}{2}\mu_b^\top\mathbf{\Lambda}_{ba}\mathbf{x}_a-\frac{1}{2}\mu_b^\top\mathbf{\Lambda}_{ba}\mu_a\\ &\quad -\frac{1}{2}(\mathbf{x}_b-\mu_b)^\top\mathbf{\Lambda}_{bb}(\mathbf{x}_b-\mu_b)\tag9\\ &=-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda}_{aa} \mathbf{x}_a+\mathbf{x}_a^\top\{\mathbf{\Lambda}_{aa}\mu_a-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b)\}+C\tag{10} \end{align}$ where $C=-\frac{1}{2}\mu_a^\top\mathbf{\Lambda}_{aa}\mu_a+\mu_a^\top\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b)-\frac{1}{2}(\mathbf{x}_b-\mu_b)^\top\mathbf{\Lambda}_{bb}(\mathbf{x}_b-\mu_b).\tag{11}$
We obtain $(10)$ from $(9)$ by using the property $\mathbf{\Lambda}_{ab}=\mathbf{\Lambda}_{ba}^\top$. This is again a quadratic form in $\mathbf{x}_a$, with $C$ independent of $\mathbf{x}_a$, so the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ is Gaussian.
We use the quadratic form to determine the mean and covariance of the conditional distribution, denoted $\mu_{a|b}$ and $\mathbf{\Sigma}_{a|b}$ respectively, by matching $(10)$ against the general Gaussian exponent $-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Sigma}_{a|b}^{-1}\mathbf{x}_a+\mathbf{x}_a^\top\mathbf{\Sigma}_{a|b}^{-1}\mu_{a|b}+\text{const}$. Consider the second-order term in $\mathbf{x}_a$: $-\frac{1}{2}\mathbf{x}_a^\top\mathbf{\Lambda}_{aa}\mathbf{x}_a\tag{12}$ from which we infer that the covariance of $p(\mathbf{x}_a|\mathbf{x}_b)$ is: $\mathbf{\Sigma}_{a|b}=\mathbf{\Lambda}_{aa}^{-1}.\tag{13}$ Next, we consider the term that is linear in $\mathbf{x}_a$: $\mathbf{x}_a^\top\{\mathbf{\Lambda}_{aa}\mu_a-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b)\}\tag{14}$ which must equal $\mathbf{x}_a^\top\mathbf{\Sigma}_{a|b}^{-1}\mu_{a|b}$, so we obtain: $\begin{align} \mu_{a|b}&=\mathbf{\Sigma}_{a|b}\{\mathbf{\Lambda}_{aa}\mu_a-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b)\}\\ &=\mu_a-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\mu_b).\tag{15} \end{align}$
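Formulas $(13)$ and $(15)$ translate directly into code. The sketch below is a minimal illustration; the function name `conditional_from_precision` and the bivariate example with correlation $0.8$ are assumptions made here, and inverting $\mathbf{\Sigma}$ explicitly is done for clarity rather than numerical efficiency.

```python
import numpy as np

def conditional_from_precision(mu, Sigma, m, x_b):
    """Mean and covariance of p(x_a | x_b), where x_a is the first m components.

    Uses Sigma_{a|b} = Lambda_aa^{-1} and
    mu_{a|b} = mu_a - Lambda_aa^{-1} Lambda_ab (x_b - mu_b).
    """
    Lam = np.linalg.inv(Sigma)          # precision matrix
    Lam_aa = Lam[:m, :m]
    Lam_ab = Lam[:m, m:]
    mu_a, mu_b = mu[:m], mu[m:]
    Sigma_cond = np.linalg.inv(Lam_aa)
    mu_cond = mu_a - Sigma_cond @ Lam_ab @ (x_b - mu_b)
    return mu_cond, Sigma_cond

# Assumed example: zero-mean bivariate Gaussian, unit variances, correlation 0.8.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Condition on observing x_b = 1.
mu_cond, Sigma_cond = conditional_from_precision(mu, Sigma, 1, np.array([1.0]))
print(mu_cond, Sigma_cond)
```

For this example the conditional mean is $0.8$ and the conditional variance is $1-0.8^2=0.36$, which matches the familiar bivariate result. Observing $\mathbf{x}_b$ shifts the mean and shrinks the variance.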
Next, we determine each block of $\mathbf{\Lambda}$ based on the following lemma: $\begin{pmatrix} \mathbf{A}&\mathbf{B}\\\mathbf{C}&\mathbf{D} \end{pmatrix}^{-1}=\begin{pmatrix} \mathbf{M}&-\mathbf{MBD}^{-1}\\-\mathbf{D}^{-1}\mathbf{CM}&\mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{CMBD}^{-1} \end{pmatrix}\tag{16}$ where $\mathbf{M}$ is defined as: $\mathbf{M}=(\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1}.\tag{17}$ The matrix $\mathbf{M}^{-1}=\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C}$ is the Schur complement $(\text{Appendix2})$ of the left-hand matrix with respect to the block $\mathbf{D}$.
Back to $(4)$, we have: $\begin{pmatrix} \mathbf{\Sigma}_{aa}&\mathbf{\Sigma}_{ab}\\\mathbf{\Sigma}_{ba}&\mathbf{\Sigma}_{bb} \end{pmatrix}^{-1}=\begin{pmatrix} \mathbf{\Lambda}_{aa}&\mathbf{\Lambda}_{ab}\\ \mathbf{\Lambda}_{ba}&\mathbf{\Lambda}_{bb} \end{pmatrix}.\tag{18}$ So, applying $(16)$ with $\mathbf{A}=\mathbf{\Sigma}_{aa}$, $\mathbf{B}=\mathbf{\Sigma}_{ab}$, $\mathbf{C}=\mathbf{\Sigma}_{ba}$ and $\mathbf{D}=\mathbf{\Sigma}_{bb}$: $\begin{align} \mathbf{\Lambda}_{aa}&=(\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba})^{-1}\tag{19}\\ \mathbf{\Lambda}_{ab}&=-(\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba})^{-1}\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}.\tag{20} \end{align}$
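Identities $(19)$ and $(20)$ can be sanity-checked against a direct matrix inverse. This is a sketch under assumptions: the covariance matrix is randomly generated, and the dimensions (`d = 5`, `m = 2`) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 2                              # assumed total dimension and size of x_a
G = rng.normal(size=(d, d))
Sigma = G @ G.T + d * np.eye(d)          # random symmetric positive-definite covariance

S_aa, S_ab = Sigma[:m, :m], Sigma[:m, m:]
S_ba, S_bb = Sigma[m:, :m], Sigma[m:, m:]
S_bb_inv = np.linalg.inv(S_bb)

# Blocks of the precision matrix from (19) and (20).
Lam_aa = np.linalg.inv(S_aa - S_ab @ S_bb_inv @ S_ba)
Lam_ab = -Lam_aa @ S_ab @ S_bb_inv

# Reference: invert the full covariance and slice out the same blocks.
Lam = np.linalg.inv(Sigma)

# Coefficient of (x_b - mu_b) in the conditional mean (15).
gain = -np.linalg.inv(Lam_aa) @ Lam_ab
print(np.allclose(Lam_aa, Lam[:m, :m]), np.allclose(Lam_ab, Lam[:m, m:]))
```

Substituting $(19)$ and $(20)$ into $(15)$ gives $-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}=\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}$, so `gain` should agree with `S_ab @ S_bb_inv`. The check also shows why $\mathbf{\Lambda}_{aa}$ differs from $\mathbf{\Sigma}_{aa}^{-1}$ whenever $\mathbf{\Sigma}_{ab}\neq\mathbf{0}$.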