Friday, March 29, 2013
Reading Note for Week 11
The main idea of clustering is to divide a set into subsets according to shared properties. In the domain of Information Retrieval, we are trying to divide documents into categories. Each document can be represented as a vector over the words that occur in the collection. Documents within a cluster are supposed to be as similar as possible to each other. The key to this task is how to measure the similarity between document representations. Distance between vectors is the primary measure, used together with algorithms such as K-Means (the most popular one), hierarchical algorithms, and spectral algorithms. The other topic discussed in the book is text classification. In clustering, the machine divides documents into subsets without knowing anything about the categories, so it has to decide both what the groups are and where each document goes. In text classification, the categories are predefined and the machine only has to decide which category each document belongs to. Naive Bayes and SVM are used for this task.
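To make the clustering part concrete, here is a minimal sketch of K-Means over word-count vectors. It is only an illustration under my own assumptions, not the book's implementation: the helper names (`bag_of_words`, `squared_distance`, `mean_vector`, `k_means`), the four toy documents, the use of squared Euclidean distance, and the choice to seed the centroids from two given documents are all made up for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a document as a word -> count mapping (a sparse vector)."""
    return Counter(text.lower().split())


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse word vectors."""
    words = set(a) | set(b)
    return sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse word vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts
    return {w: c / len(vectors) for w, c in total.items()}


def k_means(vectors, seeds, iterations=10):
    """Toy K-Means. `seeds` are indices of documents used as initial centroids
    (an assumption for reproducibility; real runs usually pick them randomly)."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels


docs = [
    "cat dog cat",
    "dog cat dog",
    "stock market price",
    "market price stock stock",
]
vectors = [bag_of_words(d) for d in docs]
labels = k_means(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The two animal documents end up in one cluster and the two finance documents in the other, even though no category names were given in advance. That is the difference from classification, where the labels would be fixed before the machine sees the documents.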