Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
by: Yang Yuan, Xiao Li, Ya-Ting Yang
Format: Article
Published: MDPI AG, 2019-12-01
Description
To overcome data sparseness in word embeddings trained on low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate a global word-pair co-occurrence matrix using a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves the quality of word embeddings for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves the word analogy task score by 0.71 percentage points and achieves the best results on all of the word similarity tasks.
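The abstract does not spell out the attenuation function, but the core idea of a punctuation-aware co-occurrence matrix can be sketched as follows: as in GloVe, a word pair's contribution decays with the distance between the two words, and here it is additionally attenuated for every punctuation mark that separates them. The `punct_penalty` factor and the multiplicative form of the attenuation are assumptions for illustration, not the paper's exact function.

```python
from collections import defaultdict

# Punctuation set used to attenuate co-occurrence weights (assumed).
PUNCT = {",", ".", ";", ":", "!", "?"}

def cooccurrence(tokens, window=5, punct_penalty=0.5):
    """Build a global word-pair co-occurrence dictionary.

    Each pair's contribution decays with distance (1/d, as in GloVe)
    and is further multiplied by `punct_penalty` for every punctuation
    mark between the two words -- a hypothetical attenuation function,
    since the abstract does not give the paper's exact form.
    """
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w in PUNCT:
            continue  # punctuation marks are not embedded themselves
        n_punct = 0
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            c = tokens[j]
            if c in PUNCT:
                n_punct += 1  # crossing punctuation weakens later pairs
                continue
            weight = (1.0 / d) * (punct_penalty ** n_punct)
            counts[(w, c)] += weight
            counts[(c, w)] += weight  # keep the matrix symmetric
    return counts

# Example: "sat" is close to "cat" but separated by a comma,
# so its weight is attenuated relative to an unpunctuated pair.
pairs = cooccurrence("the cat , sat".split())
```

Such a matrix could then replace the plain distance-weighted counts in a GloVe-style objective, letting sentence-internal boundaries reduce spurious long-range co-occurrences.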