BaNeL: an encoder-decoder based Bangla neural lemmatizer
by: Md. Ashraful Islam, Md. Towhiduzzaman, Md. Tauhidul Islam Bhuiyan, Abdullah Al Maruf, Jesan Ahammed Ovi
Format: | Article |
---|---|
Published: | Springer 2022-04-01 |
Description
Abstract
This study presents an efficient framework for deriving the lemma of an inflected Bangla word, using its part of speech as context. Bangla is a morphologically rich Indo-Aryan language in which around 70% of words are inflected and some words have around 90 different inflected forms, making it one of the most challenging languages for lemmatization. The lack of a sufficiently large, appropriate Bangla dataset makes the task even harder. A reliable, robust Bangla lemmatizer will open new possibilities for dependent fields such as automatic translation and grammatical correction in Bangla. In this paper, we describe a new, larger Bangla dataset for lemmatization and an encoder-decoder-based sequence-to-sequence framework for the task. After hyper-parameter tuning, the proposed framework achieved 95.75% character accuracy and 91.81% exact match on the test split of the prepared dataset, significantly higher than existing lemmatization approaches for Bangla.

Article Highlights
This article:
- Discusses the lemmatization task in Bangla and demonstrates how it differs from stemming
- Presents an efficient artificial neural network based model for lemmatization that performs better than existing ones
- Describes a new large dataset for lemmatization in Bangla
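The abstract reports two evaluation metrics, character accuracy and exact match, without defining them. The following is a minimal sketch of how such metrics could be computed, not the authors' evaluation code. It assumes that exact match is the fraction of predicted lemmas identical to the reference, and that character accuracy is a position-wise comparison normalized by the longer of the two strings; the function names and the romanized toy words are hypothetical.

```python
def exact_match(predictions, references):
    """Fraction of predicted lemmas that equal the reference lemma exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def character_accuracy(predictions, references):
    """Position-wise character accuracy (an assumed definition).

    Characters are compared index by index; the denominator is the longer
    of the two strings, so missing or extra characters count as errors.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        total += max(len(pred), len(ref))
        correct += sum(a == b for a, b in zip(pred, ref))
    return correct / total if total else 0.0


# Toy, romanized placeholder data (not taken from the paper's dataset).
preds = ["kora", "bola", "jawa"]
refs = ["kora", "bola", "jaoa"]

print(f"exact match:        {exact_match(preds, refs):.4f}")
print(f"character accuracy: {character_accuracy(preds, refs):.4f}")
```

Note that Python iterates over Unicode code points, so for Bangla text a dependent vowel sign or hasanta counts as its own character under this sketch; whether the paper counts code points or grapheme clusters is not stated in the abstract.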