A taxonomic classifier for 16s and ITS sequences based on deep learning

Conference: BIBE 2018 - International Conference on Biological Information and Biomedical Engineering
06/06/2018 - 06/08/2018 at Shanghai, China

Proceedings: BIBE 2018

Pages: 6
Language: English
Type: PDF


Authors:
Tang, Darong (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, China & Research Center for Biomedical Informatics, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China)
Yang, Shuxin (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, China)
Liu, Zhihua (Research Center for Biomedical Informatics, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China)
Cai, Yunpeng (Research Center for Biomedical Informatics, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China & Shenzhen Engineering Laboratory of Health Big Data Analyses Technology, Shenzhen, China)

Abstract:
High-throughput sequencing technology is now widely used, and microbial data are usually generated by sequencing the ribosomal 16S and ITS regions. Classifying the sequence data into the correct microbial taxonomic units enables a wide range of microbial research. Many methods have been proposed for this purpose, but classification accuracy still needs improvement. In this study, a deep learning approach is proposed to address this problem. The model consists of a 4-layer convolutional neural network and a 2-layer fully connected neural network. Benchmark experiments were conducted using the RDP and Greengenes databases as training and testing sets, respectively. Experimental results indicate that, compared to the classical RDP Naïve Bayesian Classifier, this model extracts features effectively and achieves better accuracy, with a highest accuracy of 96.30% at the genus level. Once the model is trained, taxonomy can be assigned to multiple query sequences in parallel on a GPU, without requiring the sequence data of the reference database.
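
To make the architecture concrete, the following is a minimal, untrained sketch of a forward pass through four convolutional layers and two fully connected layers, written in plain Python. The abstract does not specify the input encoding, kernel sizes, channel counts, pooling, or training procedure, so the one-hot encoding, the fixed sequence length, and every name and parameter below (`build_model`, `channels`, `kernel`, `hidden`) are assumptions for illustration only, not the authors' implementation.

```python
import random

ALPHABET = "ACGT"  # assumed encoding; the abstract does not state one


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq.upper()]


def conv1d_relu(x, weights, bias):
    """Valid 1D convolution plus ReLU. x: [length][in_ch]; weights: [out_ch][kernel][in_ch]."""
    k, in_ch = len(weights[0]), len(x[0])
    out = []
    for start in range(len(x) - k + 1):
        row = []
        for w, b in zip(weights, bias):
            s = b + sum(w[i][c] * x[start + i][c] for i in range(k) for c in range(in_ch))
            row.append(max(0.0, s))
        out.append(row)
    return out


def dense(v, weights, bias, relu=True):
    """Fully connected layer. weights: [out_dim][in_dim]."""
    out = [b + sum(wi * vi for wi, vi in zip(w, v)) for w, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def rand_tensor(rng, *shape):
    """Nested lists of small random weights with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]


def build_model(seq_len, n_classes, channels=(8, 8, 16, 16), kernel=3, hidden=32, seed=0):
    """Hypothetical layer sizes: 4 conv layers, then 2 fully connected layers."""
    rng = random.Random(seed)
    convs, in_ch, length = [], len(ALPHABET), seq_len
    for out_ch in channels:
        convs.append((rand_tensor(rng, out_ch, kernel, in_ch), [0.0] * out_ch))
        in_ch, length = out_ch, length - kernel + 1
    fc1 = (rand_tensor(rng, hidden, length * in_ch), [0.0] * hidden)
    fc2 = (rand_tensor(rng, n_classes, hidden), [0.0] * n_classes)
    return convs, fc1, fc2


def forward(model, seq):
    """Return one raw score per taxonomic class for a single fixed-length sequence."""
    convs, fc1, fc2 = model
    x = one_hot(seq)
    for w, b in convs:
        x = conv1d_relu(x, w, b)
    flat = [v for row in x for v in row]
    if len(flat) != len(fc1[0][0]):
        raise ValueError("sequence length does not match the model's input length")
    return dense(dense(flat, *fc1), *fc2, relu=False)


def predict(model, seq):
    """Index of the highest-scoring class (meaningless until the weights are trained)."""
    scores = forward(model, seq)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the weights here are random, `predict` only demonstrates the data flow. The point of the sketch is that inference needs nothing but the learned weights, which is consistent with the abstract's remark that queries can be classified without the reference sequences; a practical version would use a GPU tensor library and batch many sequences per call.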