Document clustering with explicit semantic analysis (ESA)

by: Muhammad Adnan, Muhammad Rafi

Format:	Article
Published:	Shaheed Zulfikar Ali Bhutto Institute of Science and Technology 2014-07-01

Description

Document clustering has become a vital technique as the number of documents on the web and in proprietary repositories grows at an unprecedented rate. Documents written in human language generally carry some context, and word usage depends largely on that context. Researchers have recently tried to enrich document representations with external knowledge bases, which can bring contextual information into the clustering process. We propose an enrichment process based on explicit semantic analysis (ESA), using Wikipedia as the knowledge base. Our approach is distinct in that it uses only the conceptual words from a document, together with their frequencies, to embed the contextual information; hence, it does not over-enrich the documents. A vector-based representation with cosine similarity and agglomerative hierarchical clustering is used to perform the actual document clustering. We compare the proposed method with existing relevant approaches on the NEWS20 dataset, using the clustering evaluation measures F-Score, Entropy, and Purity.
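
The clustering stage described above (vector representation, cosine similarity, agglomerative hierarchical clustering, then Purity) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain term-frequency vectors in place of the Wikipedia-based ESA enrichment, average linkage (the abstract does not name a linkage criterion), and a four-document toy corpus with hypothetical labels. All function names are illustrative.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector; a stand-in for the concept-enriched vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, k):
    """Merge the two most similar clusters until k clusters remain.

    Assumption: average linkage, i.e. cluster similarity is the mean
    pairwise cosine similarity between their members.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                sim = sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)  # j > i, so index i is stable
    return clusters


def purity(clusters, labels):
    """Fraction of documents that belong to their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return majority / total


# Hypothetical toy corpus; the paper itself evaluates on the NEWS20 dataset.
docs = [
    "stock market shares trading",
    "market shares investors trading stock",
    "football match goal team",
    "team goal football league",
]
labels = ["finance", "finance", "sport", "sport"]

vectors = [tf_vector(d) for d in docs]
clusters = agglomerative(vectors, k=2)
print(clusters, purity(clusters, labels))
```

In the approach the abstract describes, the input vectors would first be enriched with Wikipedia concepts derived from the document's conceptual words. The similarity and merging steps would then operate on those enriched vectors in the same way.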

