P0850 Towards a Universal in silico Predictor of Protein-Protein Interactions Based Only on Protein Sequence Information

Guilherme Targino Valente, Universidade Estadual Paulista - UNESP, Botucatu, Brazil
Marcio Luis Acencio, Universidade Estadual Paulista - UNESP, Botucatu, Brazil
Ney Lemke, Universidade Estadual Paulista - UNESP, Botucatu, Brazil
Cesar Martins, Universidade Estadual Paulista - UNESP, Botucatu, Brazil
Computational methods can contribute to the prediction of protein-protein interactions (PPIs). In an effort to create a species-independent PPI predictor, we propose a machine learning method based on physicochemical features of protein sequences. We retrieved 117,823 experimentally validated PPIs from a range of eukaryotes from the BioGRID database, and 1,423 non-interacting pairs from the Negatome database. The amino acid (aa) composition was calculated for every protein, and the resulting values were discretized into four bins and then combined to form 400 learning attributes. The machine learning algorithm J48 was trained independently on 30 balanced datasets, and the resulting predictors were tested by 10-fold cross-validation using the software WEKA. The results suggest that the learning attribute representing the fraction of cysteine in a pair of proteins is the most relevant attribute for classifying two proteins as interacting or non-interacting (average recall of 0.72; average ROC area of 0.79). We also trained J48 with bagging, and the recall and ROC area increased by ~24% (to 0.89) and ~18% (to 0.93), respectively. The role of cysteine in PPI prediction is supported by the biological importance of this aa: its sulfur atom can form covalent disulfide bridges, which play an important role in tertiary structure stability. The biological plausibility of the role of cysteine in PPIs, together with the satisfactory predictive performance, indicates that the present approach is a promising tool for predicting PPIs in newly sequenced organisms.
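
The feature construction and dataset balancing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions, because the abstract does not give them: the bin edges in `discretize` are hypothetical, the 400 attributes are assumed to come from pairing each of the 20 residues of one protein with each of the 20 residues of the other, and "balanced" is assumed to mean equal numbers of interacting and non-interacting pairs drawn by random subsampling. All function names are illustrative.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the fraction of each standard amino acid in a sequence."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: residues.count(aa) / len(residues) for aa in AMINO_ACIDS}


def discretize(fraction, edges=(0.025, 0.05, 0.10)):
    """Map a fraction to one of four bins (0 to 3).

    The three edges are hypothetical; the abstract does not state them.
    """
    return sum(fraction >= edge for edge in edges)


def pair_attributes(seq_a, seq_b):
    """Build 20 x 20 = 400 attributes for an ordered protein pair.

    Assumption: each attribute combines the bin of residue x in protein A
    with the bin of residue y in protein B. Swapping A and B changes the
    attributes, so a real pipeline would need a convention for pair order.
    """
    bins_a = {aa: discretize(f) for aa, f in composition(seq_a).items()}
    bins_b = {aa: discretize(f) for aa, f in composition(seq_b).items()}
    return {
        f"{x}_{y}": (bins_a[x], bins_b[y])
        for x, y in product(AMINO_ACIDS, repeat=2)
    }


def balanced_sets(positives, negatives, n_sets=30, seed=0):
    """Draw n_sets datasets with equal numbers of positives and negatives.

    Assumption: the larger class is randomly subsampled to the size of
    the smaller one, independently for each dataset.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return [
        (rng.sample(positives, k), rng.sample(negatives, k))
        for _ in range(n_sets)
    ]
```

Under these assumptions, `pair_attributes(seq_a, seq_b)["C_C"]` would be the attribute that captures the cysteine content of both proteins in a pair. The resulting attribute tables could then be exported to a format that WEKA reads, for example ARFF or CSV, for training J48.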