Language isn’t a formal system. Language is glorious chaos.
How do we represent the meaning of a word?
$$\mbox{symbol} \rightarrow \mbox{idea or thing}$$
\(\mbox{WordNet}\): a useful resource, but it misses nuance and context, requires human labor so it is hard to keep up to date, and is awkward for computers to ingest or to compute word similarity from.
One-hot encoding: a sparse, localist representation with no notion of similarity (every pair of distinct one-hot vectors is orthogonal). Distributional semantics: a word's meaning is given by the words that frequently appear nearby, so we embed each word in a dense vector built from its contexts (see the sketch after the quote below).
You shall know a word by the company it keeps
J. R. Firth
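A minimal numpy sketch of the difference: the toy vocabulary and the dense vectors below are made up for illustration (real embeddings would be learned from contexts), but they show why one-hot vectors carry no notion of similarity while dense vectors can.

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary.
vocab = ["hotel", "motel", "banana"]

def one_hot(word):
    """Localist representation: a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Every pair of distinct one-hot vectors is orthogonal, so "hotel"
# looks no more similar to "motel" than to "banana".
print(one_hot("hotel") @ one_hot("motel"))   # 0.0
print(one_hot("hotel") @ one_hot("banana"))  # 0.0

# Dense vectors (made-up numbers standing in for learned embeddings)
# can encode similarity as a large cosine similarity.
dense = {
    "hotel":  np.array([0.9, 0.1, 0.4]),
    "motel":  np.array([0.8, 0.2, 0.5]),
    "banana": np.array([-0.3, 0.9, 0.0]),
}
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(dense["hotel"], dense["motel"]))   # close to 1
print(cos(dense["hotel"], dense["banana"]))  # much smaller
```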
Word2vec
^c7f69e
Fun fact: it turns out corpus is actually a third-declension Latin noun, not a fourth-declension one (hence the plural corpora).
Likelihood: for each position \(t\), predict the context words within a window of fixed size \(m\) around the center word \(w_t\):
$$L(\theta) = \prod_{t=1}^{T}\prod_{ -m \leq j \leq m, j \neq 0}P(w_{t+j}|w_t;\theta)$$
Taking the negative log likelihood and averaging over the corpus turns this into the objective (loss) we minimize:
$$J(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{ -m \leq j \leq m, j \neq 0}\log P(w_{t+j}|w_t;\theta)$$
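A small sketch of what the double product/sum ranges over; the toy sentence and the window size \(m = 2\) are assumptions for illustration only.

```python
# Enumerate the (center, outside) word pairs that appear in L(theta) / J(theta).
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # assumed window size

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue  # skip the center word itself and out-of-range positions
        pairs.append((center, corpus[t + j]))

# Each pair contributes one -log P(w_{t+j} | w_t; theta) term to J(theta).
print(pairs[:6])
```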
Calculating the probability with the softmax: $$P(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}$$ where \(v_c\) is the vector of the center word \(c\) and \(u_o\) is the vector of the outside word \(o\).
The dot product \(u_o^Tv_c\) measures how similar the outside word \(o\) is to the center word \(c\), and the softmax normalizes that similarity over the whole vocabulary.
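A numpy sketch of this softmax, with randomly initialized vectors standing in for the learned parameters (the vocabulary size, dimension, and outside-word index are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))      # outside vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector

def p_outside(U, v_c):
    """Softmax over dot products: P(o | c) for every o in the vocabulary."""
    scores = U @ v_c             # u_w^T v_c for all w
    scores -= scores.max()       # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_outside(U, v_c)
print(probs, probs.sum())        # a distribution over the vocabulary, sums to 1

o = 2                            # index of some observed outside word
print(-np.log(probs[o]))         # its contribution to J(theta)
```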
Derivatives
We want to minimize \(J(\theta)\) (equivalently, maximize the likelihood). Taking the derivative of one \(\log P(o|c)\) term with respect to the center word vector \(v_c\):
$$\log P(o|c) = u_o^Tv_c - \log \sum_{w \in V}\exp(u_w^Tv_c)$$
$$\frac{\partial \log P(o|c)}{\partial{v_c}} = u_o - \frac{1}{\sum_{w \in V}\exp(u_w^Tv_c)} \sum_{x \in V} \frac{\partial}{\partial v_c} \exp(u_x^Tv_c)$$
$$\frac{\partial \log P(o|c)}{\partial{v_c}} = u_o - \frac{1}{\sum_{w \in V}\exp(u_w^Tv_c)} \sum_{x \in V} \exp(u_x^Tv_c)\, u_x$$
$$\frac{\partial \log P(o|c)}{\partial{v_c}} = u_o - \sum_{x \in V} \frac{\exp(u_x^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}\, u_x$$
$$\frac{\partial \log P(o|c)}{\partial{v_c}} = u_o - \sum_{x \in V} P(x|c)\, u_x$$
That is, the gradient is the observed outside vector \(u_o\) minus the expected outside vector \(\sum_{x} P(x|c)\, u_x\) under the model.
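A quick numpy sketch (toy sizes and random vectors, all assumed) that checks this analytic gradient against a central finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 3
U = rng.normal(size=(V, d))      # outside vectors u_w
v_c = rng.normal(size=d)         # center word vector
o = 2                            # index of the observed outside word

def log_p(v):
    """log P(o | c) as a function of the center vector."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o minus the expected outside vector under P(x | c).
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - probs @ U

# Numerical estimate by central finite differences along each coordinate.
eps = 1e-6
numeric = np.array([
    (log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```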