NLP.page


TODO merge in notes from 6.864: shingles, markov chains, POS tagging, logistic, stemming, porter stemming (exclusion lists)

general NLP problems

  • POS tagging
  • sentence detection
  • tokenization/segmentation: figuring out word/sentence boundaries
    • simplest: over all possible segmentations, return seg w max prod over all words of P(word)
  • chunking
  • parsing
  • named entity recognition
  • coreference
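The simplest segmentation approach above (maximize the product of unigram probabilities over candidate segmentations) can be sketched with dynamic programming over split points; the toy vocabulary probabilities here are invented for illustration:

```python
import math
from functools import lru_cache

# Made-up unigram probabilities for illustration only.
P = {"the": 0.06, "them": 0.01, "me": 0.02, "men": 0.01, "theme": 0.005, "a": 0.05}

def best_segmentation(text):
    # DP over split points; sum log-probs instead of multiplying to avoid underflow.
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return (0.0, [])
        candidates = []
        for j in range(i + 1, len(text) + 1):
            w = text[i:j]
            if w in P:
                score, rest = best(j)
                candidates.append((score + math.log(P[w]), [w] + rest))
        return max(candidates) if candidates else (float("-inf"), [])
    return best(0)[1]
```

For example, `best_segmentation("themen")` prefers `["the", "men"]` over the dead-end prefixes `"them"` and `"theme"`.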

misc

  • $n$-grams of chars: useful for language guessing
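A minimal sketch of language guessing via character $n$-grams: build a trigram profile per language from sample text (the tiny samples here are made up) and pick the profile with the most overlapping trigram occurrences:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Pad with spaces so word boundaries contribute n-grams too.
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy one-sentence "corpora"; a real system would use much more text.
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
}
profiles = {lang: char_ngrams(s) for lang, s in samples.items()}

def guess_language(text):
    grams = char_ngrams(text)
    # Score by shared trigram occurrences (Counter intersection takes minima).
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))
```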

term frequency inverse document frequency (TFIDF)

  • term frequency is num occurrences of term $t_i$ in doc $d_j$ over total num words in $d_j$ (how prominent $t_i$ is in $d_j$):

    $$ tf_{i,j} = \frac{ n_{i,j} }{ \sum_k n_{k,j} } $$

  • inverse document frequency is num docs over num docs with $t_i$ (more common words are less important):

    $$ idf_i = \log \frac{ |D| }{ |\{\, d : t_i \in d \,\}| } $$

  • tf-idf: $\mathrm{tfidf}_{i,j} = tf_{i,j} \cdot idf_i$
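The two formulas above translate directly into code; the tiny corpus is invented for illustration (note idf is undefined for a term appearing in no document):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    # Occurrences of term in doc over total words in doc.
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term, docs):
    # log of (num docs / num docs containing term).
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```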

soundex algorithm

  • rudimentary "sounds-alike" phonetic algorithm that encodes similar-sounding names to the same representation, for matching
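A sketch of classic American Soundex: keep the first letter, code the rest (BFPV→1, CGJKQSXZ→2, DT→3, L→4, MN→5, R→6), drop vowels, collapse adjacent duplicate codes (H and W are transparent to the duplicate rule), and pad to four characters:

```python
def soundex(name):
    codes = {c: d for d, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0])  # first letter's code still blocks duplicates
    for c in name[1:]:
        if c in "HW":
            continue  # H/W don't separate duplicate codes
        code = codes.get(c)
        if code is None:
            prev = None  # vowels do separate duplicate codes
        elif code != prev:
            result += str(code)
            prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

So "Robert" and "Rupert" both encode to R163, which is the point: near-homophones collide.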

vector space model

  • represent text docs as vectors of identifiers
  • eg index terms: each dim corresponds to a term; some possible values are:
    • bits indicating presence
    • counts
    • tf-idf
  • relevancy rankings: treat query as doc, find most cosine-similar doc
  • limitations
    • long docs poorly represented, have poor similarity values (small scalar product/numerator and large dimensionality/denominator)
    • order information not preserved

cosine similarity

  • $\frac{A \cdot B}{|A| |B|}$
  • intuitively, find angle between vector representations of docs
  • don't actually need to calc cosine; can just leave as is; ordering preserved
  • $-1$ means opposite, $0$ means orthogonal (no overlap), $1$ means identical direction; in IR, components are always non-neg, so range is $[0,1]$
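The two sections above combine into a minimal retrieval sketch: represent docs as term-count vectors, treat the query as a doc, and rank by cosine similarity (docs invented for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    # a, b: Counter term-count vectors; missing terms count as 0.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the cat sat", "dogs chase cats", "the dog sat on the mat"]
vectors = [Counter(d.split()) for d in docs]

def rank(query):
    q = Counter(query.split())
    return sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]),
                  reverse=True)
```

As the notes say, the unnormalized angle is never needed: cosine itself is monotone in the angle, so ranking by it directly preserves the ordering.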

jaccard similarity/coefficient/index

  • for binary features
  • simple: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
  • extended jaccard coefficient aka tanimoto coefficient: $T(A,B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B}$
    • extended cosine similarity that yields jaccard for binary features
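Both coefficients above in code; on binary (0/1) vectors the Tanimoto form gives the same value as set-based Jaccard:

```python
def jaccard(a, b):
    # a, b: sets of features.
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    # a, b: equal-length numeric vectors; reduces to jaccard for 0/1 vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)
```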

NLP at Google (Katja Filippova)

  • NLP problems at google: translation, speech recognition, language modeling, information extraction
    • 7-gram language models on 2T+ tokens using simplified smoothing
  • graph-based multi-sentence (sentence fusion) summarization
    • each unique (token,pos) is a vertex; also start & end vertices
    • merge graphs of multiple sentences together (merging vertices may require their neighbors to also be the same)
    • edges are inverse frequencies of transitions (bigram freqs)
    • find $k$ shortest paths from start to end
    • constraint: path must contain a verb
    • use google news' clusters as dataset
  • http://videolectures.net/russir2010_filippova_nlp/
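A rough sketch of the word-graph fusion idea above, not Filippova's actual implementation: for brevity it keys vertices by token alone (dropping the POS component), weights edges by inverse bigram frequency, omits the verb constraint, and finds a single lightest start-to-end path with Dijkstra instead of $k$ shortest paths:

```python
import heapq
from collections import defaultdict

def fuse(sentences):
    bigrams = defaultdict(int)
    graph = defaultdict(set)
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
            graph[a].add(b)
    # Dijkstra with edge weight 1/freq: frequent transitions are cheap,
    # so the lightest path favors words shared across sentences.
    dist, prev = {"<s>": 0.0}, {}
    heap = [(0.0, "<s>")]
    while heap:
        d, u = heapq.heappop(heap)
        if u == "</s>":
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in graph[u]:
            nd = d + 1.0 / bigrams[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Walk predecessors back from </s>, dropping the sentinels.
    path, node = [], prev["</s>"]
    while node != "<s>":
        path.append(node)
        node = prev[node]
    return " ".join(reversed(path))
```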