Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination

by: Hyein Seo, Yong-Joon Song, Kiho Cho, Dong-Ho Cho

Format:	Article
Published:	IEEE 2020-01-01

Description

Goal: An organism's characteristics are determined by its genome, which is a complex combination of bases. This combination is reflected in the k-word profile, which records how often each word of k consecutive bases occurs in the sequence. Analyzing the genome-specific statistical properties of the k-word profile is therefore important for understanding the characteristics of a genome. In this paper, we propose a new k-word-based method for analyzing genome-specific properties. Methods: We define k-words that contain the same number of each base, that is, the same base combination (for example, AAC, ACA, and CAA), as statistically identical k-words. Under a random-sequence model, statistically identical k-words are predicted to appear at similar frequencies. This need not hold in a real genome, because a genome is not a random list of bases. The ratio between the frequencies of two statistically identical k-words can therefore be used to investigate the statistical specificity of the genome as reflected in the k-word profile. To find the ratios that best represent genomic characteristics, we calculate for each ratio a reference value that yields the minimum error when the data are classified by that ratio alone. Finally, we propose a genetic-algorithm-based search to select a minimal set of ratios useful for classification. Results: The proposed method was applied to full-length sequences of microorganisms for pathogenicity classification. The classification accuracy of the proposed algorithm was similar to that of conventional methods while using only a few features. Conclusions: We proposed a new method for investigating genome-specific statistical specificity in the k-word profile. The method can be applied to find important properties of a genome and to classify genome sequences.
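The first steps of the method described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kword_profile`, `ratio_features`, `best_threshold`), the pseudocount used to avoid division by zero, and the midpoint rule for choosing the reference value are all assumptions made for illustration, and the genetic-algorithm search for a minimal ratio set is left out.

```python
from collections import Counter, defaultdict
from itertools import combinations, product


def kword_profile(seq, k):
    """Count every overlapping k-word in seq, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def ratio_features(counts, k, pseudocount=1.0):
    """Frequency ratio for each pair of k-words sharing the same base composition.

    The pseudocount is an assumption of this sketch, added so that a
    k-word absent from the sequence does not cause a division by zero.
    """
    # Group all 4**k possible k-words by their sorted letters, so that
    # e.g. AAC, ACA and CAA fall into the same group "AAC".
    groups = defaultdict(list)
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        groups["".join(sorted(word))].append(word)
    feats = {}
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            feats[(a, b)] = (counts[a] + pseudocount) / (counts[b] + pseudocount)
    return feats


def best_threshold(values, labels):
    """Return (threshold, error) minimizing the error of a one-ratio classifier.

    Candidate thresholds are midpoints between consecutive distinct values
    (an assumed rule); both directions of the decision are considered.
    """
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or xs
    best = (None, 1.0)
    for t in candidates:
        err = sum((v > t) != bool(y) for v, y in zip(values, labels)) / len(values)
        err = min(err, 1 - err)  # allow the flipped rule "value <= t"
        if err < best[1]:
            best = (t, err)
    return best
```

In use, one would compute `ratio_features(kword_profile(genome, k), k)` for every labeled genome, then call `best_threshold` on each ratio across genomes to rank ratios by how well each one separates the classes on its own.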
