Lecture 2: Word Vectors and Word Senses

1. Recap of Word2vec

  • Words tend to have high similarity with high-frequency function words (of, the, and), since these co-occur with almost everything
  • A word can be similar to many other words along different directions of the high-dimensional space

2. Optimization: Gradient Descent

Update rule, with learning rate \(\alpha\): $$\theta^{\mbox{new}} = \theta^{\mbox{old}} - \alpha\nabla_\theta J(\theta)$$
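
As a minimal illustration (not from the lecture), the same update rule in code, with `grad_J` standing in for whatever computes \(\nabla_\theta J(\theta)\):

```python
import numpy as np

def gradient_descent(grad_J, theta, alpha=0.01, steps=1000):
    """Batch gradient descent: theta <- theta - alpha * grad_J(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)
    return theta

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_opt = gradient_descent(lambda th: 2 * th, theta=np.array([3.0, -2.0]))
```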

Stochastic Gradient Descent

Estimate the gradient on a small mini-batch of training windows and update the parameters immediately, instead of waiting for a whole pass over the corpus before each update.
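
A minimal sketch of the mini-batch update loop, assuming a hypothetical `grad_J_on_batch(theta, batch)` that returns the gradient estimated on a small batch of training windows (names and hyperparameters are illustrative):

```python
import numpy as np

def sgd(grad_J_on_batch, theta, data, alpha=0.01, batch_size=32, epochs=5):
    """Mini-batch SGD: update theta after each small batch, not after a full pass."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)  # shuffle the training windows each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - alpha * grad_J_on_batch(theta, batch)  # noisy gradient step
    return theta
```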

Details

  • Two vectors per word (i.e., two embedding matrices, one for center words and one for context words) to simplify the optimization.
  • Two variants: Skip-gram (center word -> context words) and Continuous Bag of Words (context words -> center word)
  • Negative sampling instead of naïve softmax
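
A minimal sketch of the skip-gram negative-sampling loss for a single (center, context) pair; the vector dimensionality and the number of negative samples below are arbitrary placeholders, not values from the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, u_context, U_negatives):
    """Negative-sampling loss for one (center, context) pair.

    v_center    : vector of the center word
    u_context   : vector of the observed context word
    U_negatives : matrix whose rows are vectors of k sampled noise words
    """
    pos = np.log(sigmoid(u_context @ v_center))             # push the real pair together
    neg = np.sum(np.log(sigmoid(-U_negatives @ v_center)))  # push noise pairs apart
    return -(pos + neg)

# Toy usage: random 50-dimensional vectors and k = 5 negative samples.
rng = np.random.default_rng(0)
loss = neg_sampling_loss(rng.normal(size=50), rng.normal(size=50), rng.normal(size=(5, 50)))
```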

3. Co-occurrence

  • The co-occurrence matrix is sparse and very large -> reduce its dimensionality with SVD (see the sketch after this list)
  • Hacks:
    • Ramped/weighted windows (closer words count more)
    • Cap the counts of frequent function words
    • Use Pearson correlation instead of raw counts
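
A minimal sketch of the count-then-SVD idea on a toy corpus (the corpus, the window size of 1, and the dimensionality k = 2 are illustrative choices, not from these notes):

```python
import numpy as np

# Toy corpus and a window-1 co-occurrence matrix.
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 1), min(len(words), i + 2)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]   # one k-dimensional vector per word
```
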
| Count based | Direct prediction |
| --- | --- |
| LSA, HAL, … | Skip-gram, CBOW, … |
| Fast training | Scales with corpus size |
| Efficient usage of statistics | Inefficient usage of statistics |
| Word similarity | Beyond word similarity |
| Issues with large counts | Transferability |

4. GloVe

\(P(x|\mbox{ice})\) should be relatively large for \(x=\mbox{solid}\) and \(x=\mbox{water}\), while \(P(x|\mbox{steam})\) should be relatively large for \(x=\mbox{gas}\) and \(x=\mbox{water}\). The ratio \(\frac{P(x|\mbox{ice})}{P(x|\mbox{steam})}\) is therefore high when \(x=\mbox{solid}\), low when \(x=\mbox{gas}\), and close to 1 when \(x=\mbox{water}\) or an unrelated word, so the ratio isolates the component of meaning that distinguishes ice from steam.

If we assume \(w_i \cdot w_j = \log P(i|j)\), then \(w_x \cdot (w_a - w_b) = \log \frac{P(x|a)}{P(x|b)}\), i.e. vector differences encode log ratios of co-occurrence probabilities.
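
For reference, the full GloVe objective that builds on this assumption (as in the GloVe paper): \(X_{ij}\) is the co-occurrence count, \(b_i\) and \(\tilde{b}_j\) are bias terms, and the weighting function \(f\) limits the influence of very large counts.

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$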

5. Evaluation

| Intrinsic | Extrinsic |
| --- | --- |
| Evaluation on a specific subtask | Evaluation on a real task |
| Fast to compute | Slow / difficult to compute |
| Helps understand the system itself | Helps understand whether a change improves the real task |
| Unclear whether it reflects real-task performance | Unclear which subsystem is responsible for the result |

Intrinsic evaluations

  • analogies (vector arithmetic; see the sketch after this list)
  • word similarity
  • word sense disambiguation
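
A minimal sketch of the analogy evaluation via vector arithmetic and cosine similarity; the `vectors` dictionary of pretrained, unit-normalized embeddings is assumed rather than provided:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d maximizing cos(d, b - a + c), excluding the query words.

    vectors: dict mapping word -> unit-normalized numpy vector (assumed input).
    Solves "a is to b as c is to ?", e.g. man : king :: woman : ?
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ target)   # cosine similarity, since vectors are unit length
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Usage (assuming `vectors` holds pretrained, normalized embeddings):
# analogy("man", "king", "woman", vectors)  # ideally returns "queen"
```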