This page shares things that I have digested with my comment including books, reaseach papers, blog posts, online courses and everything as input to my brain that I have paid real attention to. It serves two purposes. 1. I am what I “read”, from the input of me, I hope you can have an understanding of what I am. 2. I wish this list can provide an index for people who hold similar interest with me. Therefore I will try my best to be honest on listing items that I have spend significant time on, and refrain from wasting time on things that serves no good to make me what I really want to be.

• The Future of the Mind: The Scientific Quest to Understand, Enhance, and Empower the Mind [upd. 2018/12/27] This is a video from Michio Kaku talking about his book with the same name as the title in MSR.
• Variational Inference for Machine Learning [upd. 2018/12/17] This is a slide from Shakir Mohamed. When we talk about inference, most of the time we are talking about marginalization $p(y)=\int p(y,\theta)d\theta$, expectation $\mathbb E[f(y)\mid x]=\int f(y)p(y\mid x)dy$ or prediction $p(y_{t+1})=\int p(y_{t+1}\mid y_t)p(y_t)dy_t$. In most cases, we want to do marginalization and obtain a posterior given the marginalization. Variational inference borrows the idea of variational calculus, where we want to obtain a function that maximize a scalar function, in this case, we want to minimize the KL divergence by optimizing variational parameters. From another perspective, we can also use importance sampling to tackle the problem of marginalization, however it won’t generalize to high dimensional situation, therefore we need to rely on other integral techniques. By taking a log to the marginal likelihood we are interested in, Jensen’s inequality comes to help and gives a lower-bound once we have a proposal distribution for the latent variable. Having the lower-bound leaves space for bounded optimization, since we essentially turn the calculation problem into a directly optimization for the proposal distribution, and enables tons of techniques such as variational EM, Monte Carlo approximation and amortized variational inference, in which stochastic VI can be implemented very easily and efficiently. However there are still a lot of topics that is not yet covered, such as the information bottleneck, belief propagation and expectation propagation. Hope I can learn them in the future.
• Variational methods in Iain Murray’s MLPR notes [upd. 2018/12/04]
Variational methods utilize the KL divergence to approximate the unknown posterior with a distribution whose structure is known by us, therefore we can manipulate the parameter to fit the posterior.
• A First Course in Machine Learning, Chapter 7. Principal Components Analysis and Latent Variable Models [upd. 2018/12/03]
This chapter introduces PCA in a very intuitive way: the pricipal components are the directions in the original $D$ dimension space that have the largest variance if you project the data in these directions. To find the direction represented by $\mathrm w$ with the largest variance of $y^{(n)}=\mathrm w^t\mathrm x^{(n)}$, we can derive it from the definition of variance, $\sigma^2=\frac{1}{N}\sum_{n=1}^N(y^{(n)}-\bar y)^2$, where $\bar y={1\over N}\sum_{n=1}^N\mathrm w^t\mathrm x^{(n)} = \mathrm w^T\mathrm{ \bar y} = \mathrm w^T\mathrm 0$ if we make the assumption that the data is zero-mean which can be achieved by subtracting the mean before processing. Therefore $\sigma^2={1\over N}\sum_{n=1}^Ny_n^2={1\over N}\sum_{n=1}^N\mathrm w^T\mathrm x^{(n)}\mathrm x^{(n)T}\mathrm w=\mathrm w^T\left({1\over N}\sum_n=1^N \mathrm x^{(n)}\mathrm x^{(n)T}\right)\mathrm w=\mathrm w^TC\mathrm w$, where $C={1\over N}\sum_{n=1}^N(\mathrm x^{(n)}-\mathrm{\bar x}) (\mathrm x^{(n)}-\mathrm{\bar x})^T$ is the variance of the data which is not affected whether we centralize it or not. To maximize $C$ in a reasonable way, we have to add constraint on $\mathrm w$ otherwise it goes to infinite. Under the constraint that $\mathrm w^T\mathrm w=1$ we can subtract the Lagrangian term $\lambda (\mathrm w^T\mathrm w-1)$ from the expression and set the gradient to 0, this reveals $C\mathrm w=\lambda\mathrm w$ as the exact definition of eigenvectors and eigenvalue, to show the variance is actually correspond to the eigenvalue, one can multiply each side of $\sigma^2=\mathrm w^TC\mathrm w$ by $1=\mathrm w^t\mathrm w$ and eliminate the $\mathrm w^T$ term, resulted in $\sigma^2\mathrm w=C\mathrm w​$. In this case, we can choose the eigenvector that corresponds to the largest eigenvalue, therefore we can maximize the variance which is exactly equals to the eigenvalue. As for the second direction of project, we want it can result in the largest variance too, however we want the direction to be orthogonal to the first direction. Thankfully the eigenvectors are themselves orthogonal, therefore we can just pick the eigenvector with the second largest eigenvalue, and so far so forth.
• Perspectives on Universal Dependencies [upd. 2018/11/30]
Since I have took part in this year’s CoNLL shared task on universal dependency parsing, I find this slide from its organizers really helpful, it provides the annotation details of the universal dependency project as well as the philosophy of the UD project. For example they take the content words into primary account which constructs the foundational layer that is applicable for every language, and use function words as the language strategies that can vary among different languages. I learned a lot about how people design a NLP dataset and realized the universal dependency project is really a great contribution, since they have to come up with an universal annotation that is applicable to all the languages, and annotate them in such a massive scale! However I am not quite clear with the second part which introduces UDepLambda, a transition from syntax into semantics. I am not very familiar with semantics, I need more study on it.
• A First Course in Machine Learning, Chapter 4. Bayesian Inference [upd. 2018/11/29]
This chapter introduces 3 approaches to deal with the inference of Bayesian model, the MAP estimation, Laplace approximation and MCMC. Direct calculation of the posterior requires normalizing the product of the likelihood and prior $g(\mathrm w;\mathcal D)=p(\mathcal D\mid \mathrm w)p(\mathrm w)$ by the marginal likelihood $p(\mathcal D)=\int g(\mathrm w;\mathcal D)\ \mathrm {dw}$, however, if we are only interested in the point estimation of the posterior, we don’t necessarily need to normalize, as long as we can detect the “mode” of the posterior then we are finished. Therefore we want to maximize $g(\mathrm w;\mathcal D)$, this is called maximum a posterior estimation in contrast with the maximum likelihood estimation. In bayesian logistic regression, we can use Newton-Raphson method to find the zero point of $\nabla_{\mathrm w} \log g(\mathrm w;\mathcal D)$, therefore by iteratively update $\hat {\mathrm w}=\mathrm w-\nabla_{\mathrm w} \log g(\mathrm w;\mathcal D)\left(\frac{\partial^2 \log g(\mathrm w;\mathcal D)}{\partial\mathrm w\partial \mathrm w^T}\right)^{-1}$ we can find the maximum by noticing that the Hessian matrix is negative defined. With the MAP estimation $\hat{\mathrm w}$ and Hessian matrix, one can use Taylor expansion near $\hat{\mathrm w}$ to expand the posterior to the third order, and then approximate it with a multivariate Gaussian with $\hat{\mathrm w}$ as its mean and the negative of the inverse of the Hessian matrix as its covariance matrix. With the two method above, although we can approximate the posterior
• A First Course in Machine Learning, Chapter 3. The Bayesian Approach to Machine Learning [upd. 2018/11/27]
I think AFCML by Simon Rogers and Mark Girolami is the best introductory machine learing book you can find, and this chapter together with the next chapter about Bayesian Inference are the my most favorite. In this chapter, they use a very concrete example of estimate the distribution of the parameter $r$ of a binomial distribution of a tossing game. A beta distribution $p(r\mid \alpha, \beta)=\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}r^{\alpha-1}(1-r)^{\beta-1}$ is a conjugate prior of binomial likelihood, which means the posterior is also a beta distribution, therefore we can get rid of evaluating the marginal likelihood because we can normalize the posterior by making it a beta density. Since we can update the posterior each time we do an experiment, the authors visualize the gradual update of the posterior, which makes the bayesian approach tangible and gives us intuition of interpreting the prior as the belief from previous experiments. More specifically, in the beta distribution, $\alpha$ and $\beta$ can be interpreted as we have observed $\alpha$ heads and $\beta$ tails within $\alpha+\beta$ times of experiment, because the update of $N$ further experiements will add $y_N$ onto $\alpha$ and $N-y_N$ onto $\beta$ where $y_N$ stands for number of heads in the $N$ experiments. Given the mean and variance of beta distribution, $\mathbb E[\mathcal B(\alpha, \beta)]=\frac{\alpha}{\alpha+\beta}$ and $\mathtt{var}(\mathcal B(\alpha, \beta))=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha + \beta + 1)}$, it is easy to see that the larger $\alpha$ and $\beta$ are, the less the variance of the posterior is, and the more even the more unbiased. Once we have the posterior, we are able to compute the marginal likelihood conversely. Thinking about the parameter of choosing the pror, e.g. $\alpha, \beta$ in this case, the marginal likelihood is essentially $p(\mathrm y_N\mid \alpha, \beta)$, therefore the marginal likelihood actually tells us how appropriate of the prior belief is given the the data. We can then choose the model by maximize the marginal likelihood, this is known as the type 2 maximum likelihood method (the type 1 marximum likelihood method is the typical one).
We can also generalize the bayesian approach to regression, where we treat the prior and likelihood to be Gaussian so that they form conjugate prior again.
• AT&T Archives: The UNIX Operating System [upd. 2018/11/27]
I watched this video before sleep on bed. People from the Bell lab who create the Unix system illustrate the design of this system to us. The three layer kernel, shell and utilities demonstrate the philosophy of unix: to combine small things together. Pipeline and file system which seems to be very natural to me is at that time quite innovative what makes the Unix OS outstanding. I found this video from a MIT open course 6.828: Operating System Engineering. I would like to investigate this course in more depth in the future if I want to take the OS course in Fudan U.
• PRML Chapter 6 Kernel Method [upd. 2018/11/26]
I have tried to read PRML the book for several times, however I gave up as many times as I try. This time, after nearly one year of learning machine learning, I finally found that this book appear to me and I am super excited. This chapter talks about the so called kernel method, which can be seen as a non-parametric extension to parametric method by basis function. Any positive semidefinite matrix for any input $\mathrm{x}_N$ makes a kernel function $k(\mathrm{x}^{(n)}, \mathrm{x}^{(m)}) = \phi(\mathrm x^{(n)})^T \phi(\mathrm x^{(m)})$ for some basis function $\phi$. We can build the kernel function instead of the basis function to make it possible to have more complex basis function that we can not write down such as infinite dimension basis function. Then we comes to Gaussian process, from which the kernel rises naturally. GP is essentially to find a $N$ dimension Gaussian distribution of a function $f$ over $N$ data point $\mathrm{x}_N$, e.g. $\mathtt{Y}_N={f(\mathrm x^{(1)}, \cdots, \mathrm x^{(N)}}$. We model $\mathtt{Y}_N$ by a linear model $\mathrm Y^{(n)}=\mathrm{w}^T\phi(\mathrm x^{(n)})$, where $\mathrm w$ as parameter has Gaussian prior $\mathcal{N}(\mathrm{w}\mid \mathrm 0, \alpha^{-1}\mathrm{I})$. Because $\mathrm Y=\Phi\mathrm w$ is a linear combination of Gaussians, it is also Gaussian. Therefore we can estimate the mean and covariance of $\mathrm Y$. Apparently the mean is $\mathrm 0$ and the covariance is $\mathbb E(\Phi\mathrm w)=\Phi\mathbb E(\mathrm w)\Phi^T=\alpha^{-1}\Phi\Phi^T=\mathrm K$, where $\mathrm K_{n,m}=\phi(\mathrm x^{(n)})^T\phi(\mathrm x^{(m)})$ is indeed a kernel. From this point, we can ignore the basis function $\phi$ and construct the kernel instead. So far, we have come to $\mathrm Y\sim \mathcal N(\mathrm 0, \mathrm K)$, additionally we can have a noise $\epsilon$ to allow for variation to model the target $\mathrm t_N={t^{(1)},\cdots,t^{(N)}}$. In making use of GP for regression, we can draw the posterior distribution of $p(t^{(N+1)}\mid \mathrm t_N,\mathrm x_N, \mathrm x^{(N+1)})$ from the joint distribution $p(\mathrm t_{N+1})$, by noticing that the conditional distribution of Gaussian distribution from a joint Gaussian distribution is also a Gaussian. To use GP for classification, we can use a sigmoid function $\sigma$ over the output of GP. However this makes posterior $p(t^{(N+1)}\mid \mathrm a_N)$ where $t=\sigma(a)$ intractable. We have to use approximation. We can use Laplace Approximation which approximate the posterior by a Gaussian, therefore we only need to estimate the mean and variance, the mean is the mode of the posterior which can be find by take the gradient and set it to zero. In terms of the optimization, Newton-Raphon method can be used, where we need to calculate the second derivatives which can be used to estimate the variance by Taylor expansion. This is the story of the PRML chapter 6. I really appreciate it!
• Tutorial - What is a variational autoencoder? [upd. 2018/11/25]
My beloved post talking about VAE from two different perspectives, neural network and probabilistic model, in a unified style. In NN perspective, we want to minimize the loss term made of a reconstruction loss and a regularization term in the form of KLD. In the PM aspect, it derives the motivation of the KLD by introducing the ELBO as the proxy for another intractable KLD between the variational distribution and the true posterior of interest that we want to minimize. And it shows the maximization of the ELBO is equivalent to maximize the negation of the loss in the NN setting， which unite the two aspects together.
• Simulating A Biased Coin [upd. 2018/10/12]
You can compare the binary expansion of $p$ along with the outcome of tossing a fair coin, until a mismatch happened, you can prove the correctness and effectiveness.
• Gumbel Softmax and Straight Through Trick [upd. 2018/10/11]
with convincing deductions
• Markov Chain Monte Carlo [upd. 2018/10/08]
MCMC can be view as a sampling trick for a very high dimension distribution. Like generating names when you have a probability distribution of accepting a name. I learned about the stationary distribution, how to get it (by multiple the transition matrix many times to any initial distribution). Once you have a graph with its vertices have a stationary distribution of a given distribution need to be sampled from, you can just randomly walk on the graph and stop when you walked for a long long path, the stop point is the sample you have. This is called Metropolis-Hastings algorithm, and the way to construct this graph is given in this post. Spoiler: The transition matrix is given by $\displaystyle p_{i,j} = \frac1r \min(1, p(j) / p(i)), \\ p_{i,i} = 1 - \sum_{(i,j) \in E(G); j \neq i} p_{i,j}$.