A First Summary of the TF-IDF Algorithm

I have only half understood TF-IDF for a long time, so this time I am trying to summarize it and put it into practice.

Abstract: TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

Tf(w) = (Number of times the word appears in a document)/(Total number of words in a document)

Idf(w) = log (Number of documents/Number of documents that contain word w)
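To make the two formulas above concrete, here is a minimal Python sketch that computes Tf, Idf, and their product. Several things in it are my own assumptions and not part of the definitions above: the function names (`tf`, `idf`, `tf_idf`), whitespace tokenization, the natural logarithm as the log base, the toy three-document corpus, and returning 0 when a word appears in no document (to avoid dividing by zero).

```python
import math
from collections import Counter


def tf(word, doc_tokens):
    """Term frequency: occurrences of `word` / total number of words in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)


def idf(word, corpus):
    """Inverse document frequency: log(number of documents / documents containing `word`)."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        # Assumption: a word found in no document gets 0 instead of a division by zero.
        return 0.0
    # Assumption: natural log; the formula above does not fix the base.
    return math.log(len(corpus) / containing)


def tf_idf(word, doc_tokens, corpus):
    """TF-IDF score: the product of the two metrics."""
    return tf(word, doc_tokens) * idf(word, corpus)


# Toy corpus (hypothetical), tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

if __name__ == "__main__":
    for word in ("the", "cat"):
        print(word, tf_idf(word, corpus[0], corpus))
```

In the first document, "the" occurs more often than "cat", but it also appears in two of the three documents, so its Idf is smaller. With this toy corpus "cat" ends up with the higher score, which is the behavior the abstract describes: frequent-everywhere words are down-weighted.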

To be continued...