With the explosive growth of protein sequence databanks, there is an increasing need to annotate the many
newly discovered enzyme sequences. Given a protein sequence, two questions arise: is it an enzyme or a
non-enzyme, and, if it is an enzyme, to which main functional class does it belong?
Since experimental methods are both time-consuming and expensive, it is highly desirable to develop an in
silico method to address these problems. In this paper, two effective methods are combined to constitute
the 2-layer predictor: the 1st-layer prediction engine extracts 188-D features based on the composition and
physicochemical properties of the protein and, separately, 20-D features derived from the position-specific scoring matrix (PSSM), to
determine whether a query protein is an enzyme or a non-enzyme; the 2nd-layer prediction engine extracts 20-D features from the
PSSM and predicts the main family class of the enzyme. In our experiments, multifunctional enzymes,
because of their specific characteristics, are treated as the 7th category of enzyme. The accuracy of the 1st-layer prediction reaches 98.99% (188-D) and 98.25% (20-D) under 10-fold cross-validation, and the 2nd-layer prediction achieves 97.12% with Random Forest and 98.39% with IB1. These high accuracies indicate that the method could be an effective and promising high-throughput tool for enzyme research. Furthermore, we developed an online web server, which can be accessed at http://datamining.xmu.edu.cn:8080/PredictE/.
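
The 2-layer decision flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy 2-D vectors (standing in for the 188-D and 20-D features) and the class labels are all hypothetical, and a plain 1-nearest-neighbour rule is used on the assumption that it behaves like the IB1 learner mentioned in the abstract.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Classifier = Callable[[Vector], str]


def nearest_neighbor(train: List[Tuple[Vector, str]]) -> Classifier:
    """Build a 1-nearest-neighbour classifier from labelled vectors.

    Assumption: this stands in for an IB1-style learner; it is not
    the classifier configuration used in the paper.
    """
    def predict(x: Vector) -> str:
        # Return the label of the training vector closest to x
        # (Euclidean distance).
        _, label = min(train, key=lambda item: dist(item[0], x))
        return label
    return predict


def two_layer_predict(x_layer1: Vector, x_layer2: Vector,
                      layer1: Classifier, layer2: Classifier) -> str:
    """Layer 1 decides enzyme vs. non-enzyme; layer 2 runs only for enzymes."""
    if layer1(x_layer1) == "non-enzyme":
        return "non-enzyme"
    return layer2(x_layer2)


# Toy training data (hypothetical): 2-D vectors replace the real features.
layer1 = nearest_neighbor([
    ([0.0, 0.0], "non-enzyme"),
    ([1.0, 1.0], "enzyme"),
])
layer2 = nearest_neighbor([
    ([0.0, 1.0], "class-1"),
    ([1.0, 0.0], "class-2"),
    ([1.0, 1.0], "multifunctional"),  # the extra category for multifunctional enzymes
])

result_a = two_layer_predict([0.1, 0.1], [1.0, 0.9], layer1, layer2)
result_b = two_layer_predict([0.9, 0.8], [0.9, 0.1], layer1, layer2)
print(result_a, result_b)
```

In this sketch the second layer is never consulted for a query rejected by the first layer, which mirrors the order of the two questions posed in the abstract; each layer receives its own feature vector because the two layers may use different feature sets.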
Keywords: Bioinformatics, enzyme family class, machine learning, multifunctional enzyme.
School of Software, Xiamen University, Xiamen, 361005, P.R. China.