Predicting protein-RNA interaction amino acids using random forest based on submodularity subset selection

Research output: Contribution to journalJournal articleResearchpeer-review

  • Xiaoyong Pan
  • Lin Zhu
  • Yong-Xian Fan
  • Junchi Yan

Protein-RNA interaction plays a very crucial role in many biological processes, such as protein synthesis, transcription and post-transcription of gene expression and pathogenesis of disease. Especially RNAs always function through binding to proteins. Identification of binding interface region is especially useful for cellular pathways analysis and drug design. In this study, we proposed a novel approach for binding sites identification in proteins, which not only integrates local features and global features from protein sequence directly, but also constructed a balanced training dataset using sub-sampling based on submodularity subset selection. Firstly we extracted local features and global features from protein sequence, such as evolution information and molecule weight. Secondly, the number of non-interaction sites is much more than interaction sites, which leads to a sample imbalance problem, and hence biased machine learning model with preference to non-interaction sites. To better resolve this problem, instead of previous randomly sub-sampling over-represented non-interaction sites, a novel sampling approach based on submodularity subset selection was employed, which can select more representative data subset. Finally random forest were trained on optimally selected training subsets to predict interaction sites. Our result showed that our proposed method is very promising for predicting protein-RNA interaction residues, it achieved an accuracy of 0.863, which is better than other state-of-the-art methods. Furthermore, it also indicated the extracted global features have very strong discriminate ability for identifying interaction residues from random forest feature importance analysis.

Original languageEnglish
JournalComputational Biology and Chemistry
Pages (from-to)324-330
Number of pages7
Publication statusE-pub ahead of print - 13 Nov 2014

ID: 196174697