Abstract

Protein-ligand interaction (PLI) prediction is an important topic in computational biology and bioinformatics, and is particularly useful for function annotation. Given the complexity of determining how flexible three-dimensional objects interact in space, machine learning is an adequate choice, since for any provided sequence there is a large set of features that can influence each other in that interaction. Our work explores the use of Random Forest (RF), a machine learning (ML) method chosen as a practical and robust approach to PLI prediction. We selected 5020 proteins as our initial database and 10 features to create an RF model, which generates a prediction and creates a modified PDB file containing the atoms of the predicted binding site. We obtained several models with different weights for the positive predictions, which makes the method applicable to a wider range of proteins. None of the models can make predictions for proteins that include non-protein elements, nor can they handle large proteins with several subunits; for small proteins the method overestimates the extent of the binding site. Overall, the model makes good predictions of the binding site of a protein across a wide range of protein families.

Introduction

Protein–ligand interactions (PLIs) are central to biological systems, and predicting the interacting residues is useful for constructing PLI networks, analyzing mutations, drug design and discovery, and improving the annotation of protein function (1). Proteins can interact with many kinds of molecules: interaction partners include ions, small organic molecules, membrane lipids, nucleic acids, small peptides, and other proteins, generating homo- and hetero-complexes. In the crowded cellular environment, proteins have evolved to develop and maintain binding efficiency and specificity for their functions (2).

Experimental techniques commonly employed to determine the structure of protein complexes at atomic-scale resolution include X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). Information about interface residues can also be obtained from alanine scanning mutagenesis experiments or various footprinting experiments, such as hydrogen/deuterium exchange or hydroxyl radical footprinting. Useful as these techniques are, they remain expensive and low throughput. In-silico techniques could therefore fill the gap and determine 3D structures and interactions, especially now that accumulated knowledge, larger datasets, and GPU acceleration have enabled the training of deeper neural network architectures. Broadly, PLI site prediction methods fall into three categories: machine learning-based, structure-based, and sequence-based. Structure-based methods predict interaction sites by leveraging protein structural information. Sequence-based methods make predictions from protein sequences alone and form the bulk of the existing body of work due to the relative abundance of protein sequence data; examples include PROFEAT, ProPy, ASAquick, HHblits, and ANCHOR. While docking and structure-based methods typically require structural data, sequence-based approaches benefit from the greater availability of data.

[Figure: overview]

Machine learning and Random Forest

According to Arthur Samuel, ML is the field of study that gives computers the ability to learn without being explicitly programmed (3). ML becomes useful when data are abundant and interactions are expected among variables and instances. Powerful ML algorithms can extract information from data sets and infer properties of never-before-seen examples. ML tools address the PLI problem with different data sets, input features, and architectures:

  • Support vector machines (SVM) use kernels to estimate the optimal linear separation between two classes of data.
  • Hidden Markov Models (HMMs) adopt probabilistic models to learn the most probable labeling of input samples, taking complex contexts into consideration.
  • Shallow feed-forward neural networks (NN) consist of neurons that communicate through connections whose weights are trained with the back-propagation algorithm.
  • Deep learning methods are NNs with many hidden layers that extract complex relations among input features.
  • Recurrent networks extract relationships in sequential data through memory layers, feedback, and time-delay loops.
  • Convolutional networks consist of several filters that extract and pool local relations from input layers organized as matrices or tensors.
  • Graph convolutional networks extend learning to structures where the relations among neurons are described by graphs, while attention networks use an additional layer to identify the most relevant parts of the input flexibly (2).

RF, or random decision forests, is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees; for regression tasks, the mean prediction of the individual trees is returned. RFs are attractive because of their advantages: straightforward training, fast classification, and a robustness that surpasses single decision trees, since aggregating multiple classifiers makes them less sensitive to changes in individual instances (4).
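
As a minimal sketch of this aggregation on toy data (not the project's real features; note that scikit-learn averages per-tree probabilities, which for fully grown trees amounts to a majority vote):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # 200 instances, 10 features
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy binary labels
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_cls)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_reg)

x_new = rng.normal(size=(1, 10))
votes = np.array([t.predict(x_new)[0] for t in clf.estimators_])
print("per-tree votes:", np.bincount(votes.astype(int)))  # tally of tree votes
print("forest class:  ", clf.predict(x_new)[0])           # class most trees chose
print("forest value:  ", reg.predict(x_new)[0])           # mean of tree outputs
```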

Materials and Methods

Data extraction

In this project, we used the subset of the scPDB dataset generated by the PUResNET team as our primary training dataset. This dataset contains a large number of protein structures and their corresponding ligands, making it well suited to training a prediction model (5). The PUResNET team developed an independent training dataset as a subset of scPDB (6). First, they grouped the protein structures from scPDB by UniProt ID and calculated the Tanimoto coefficient between structures; structures with a Tanimoto coefficient of 80% or greater were considered similar. Second, they selected the protein structure with the longest sequence from each UniProt ID cluster. Finally, they performed manual inspection using PyMOL and selected 5020 protein structures out of 16034. From the PUResNET database we extracted only the protein and binding-site files, and we downloaded the corresponding proteins in PDB format from the RCSB database; these files were used only to extract the features for model generation. By using a diverse range of proteins for training and validation, we aimed to develop a robust and accurate model for predicting protein-ligand binding sites. Additionally, we used the BindingDB dataset (BDB) as a simple visual validation set to evaluate the performance of our model on separate data, extracting PDB files from different BDB subsets such as articles, ChEMBL (7), and patents. The secondary-structure information was extracted using DSSP (8), a longstanding tool for calculating secondary structural descriptors of proteins from their structures.
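
As an illustration of the secondary-structure extraction step, the following sketch runs DSSP through Biopython (the file name is hypothetical and the `dssp`/`mkdssp` binary must be installed):

```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

# Parse a structure and run DSSP on its first model.
structure = PDBParser(QUIET=True).get_structure("prot", "1abc.pdb")  # hypothetical file
model = structure[0]
dssp = DSSP(model, "1abc.pdb")

# Each DSSP entry is keyed by (chain id, residue id) and holds, among other
# values, the amino acid, the secondary-structure code, and the relative ASA.
for key in list(dssp.keys())[:5]:
    aa, ss, rel_asa = dssp[key][1], dssp[key][2], dssp[key][3]
    print(key, aa, ss, rel_asa)
```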

Features selection

It has been observed that a single protein feature alone is inadequate for predicting PLI sites, as it lacks sufficient information. Combining several features has therefore been found to be a more effective way to improve the performance of machine learning in PLI site prediction. The residue features we used for our model were:

  • Coordinates: the 3D spatial position of the residue within the protein structure.
  • Amino acid: the type of amino acid that makes up the residue.
  • Binding atom: the atom within the residue that interacts with other molecules, such as ligands or other proteins. This is the variable we are trying to predict, so it is only present in the training set.
  • Sequence entropy: a measure of the variability of the amino acid observed at the residue's position across related sequences.
  • Isoelectric point: the pH at which the residue carries no net charge.
  • Hydrophobicity: a property arising from the non-polar nature of the molecule, or part of the molecule, which lacks an electrical charge or dipole moment that can interact with water's polar nature.
  • Secondary structure: the local structural arrangement of the residue, such as alpha helices or beta sheets.
  • Solvent accessible surface area (SASA): the area of the residue's surface that is accessible to solvent molecules, which can affect its interactions with other molecules.
  • B-factor: a measure of the thermal motion or flexibility of the residue.
  • Phi and psi angles: the backbone rotation angles around the bonds connecting the residue to its neighbors, which affect the protein's overall structure.
  • Alpha-carbon distance: the distance between the alpha-carbon atoms of the residue and its neighbors, which also affects the protein's overall structure.

These features were selected on the basis of availability and ease of use.
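
A hedged sketch of how several of these per-residue features can be pulled with Biopython (the file name is hypothetical; this is illustrative, not the project's exact code):

```python
import math
from Bio.PDB import PDBParser, PPBuilder

structure = PDBParser(QUIET=True).get_structure("prot", "1abc.pdb")  # hypothetical file
chain = structure[0]["A"]

for pp in PPBuilder().build_peptides(chain):
    phi_psi = pp.get_phi_psi_list()              # radians; None at chain termini
    for res, (phi, psi) in zip(pp, phi_psi):
        ca = res["CA"]                           # alpha-carbon atom
        print(res.get_resname(),                 # amino acid feature
              ca.coord,                          # 3D coordinates feature
              ca.get_bfactor(),                  # B-factor (flexibility) feature
              None if phi is None else round(math.degrees(phi), 1),
              None if psi is None else round(math.degrees(psi), 1))
```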

Other potential features described in the literature, but not implemented in our model, are: ASA-related values, average depth index (DPX), average protrusion index (CX), minimal protrusion index, maximal protrusion index, maximal depth index (9), evolutionary conservation, marginal essentiality, co-essentiality, the MIPS functional catalog, position-specific scoring matrices (PSSMs), and residue interface propensity (10).

ML tool

Regarding machine learning and Random Forest, we decided to use scikit-learn (11), a popular Python library for machine learning tasks such as classification, regression, and clustering. It provides a range of tools and algorithms for data preprocessing, feature extraction, model selection, and model evaluation; additionally, it is easy to use and comprehensively documented. As a caveat, scikit-learn has limited scalability and limited flexibility and features for deep learning, but for the purposes of this project we determined that it fit our needs and the pros exceeded the cons, mainly following the principle that whatever can be done with machine learning rather than deep learning should be, to save computing power. For deep learning, popular frameworks include PyTorch, TensorFlow, and Theano.
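
A minimal sketch of the scikit-learn workflow this implies, using synthetic data in place of the real per-residue feature matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in for the residue feature matrix and binding labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                           # 10 features, as in our model
y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)   # rare positive class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```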

Discussion

The prediction method ended up using 4 models, each with a different weight for positive binding-site predictions. We found that this approach allows for better results. The optimal weight overall was found to be between 6 and 7; these two weights work very well for predicting small to regular-sized proteins. For larger proteins we use the weight-8 and weight-10 models.

As mentioned before, the code has been designed to be easy to use from the command line so that users can develop their own models. The main consideration is time, since it takes around 30 minutes to build the model itself. Users might also want to build their own pickle with the data, but the time needed for this step must also be considered: in our experience, with our limited hardware, this task takes around 3 hours for 2000 samples, which ended up yielding only 614 usable tables. This discrepancy is caused by the fact that the PDB files and the mol2 files were not exactly the same. In some cases the protein is simply saved in a different way: mol2 files always contained a single subunit of the protein, while the PDB counterpart had many more repetitions of the same structure, and the coordinates were slightly different. Because the binding-site file contained the atoms of the mol2 structure, we had to extract the binding-site residues from the mol2 files and all the other features from the PDB files. This resulted in our limited training set. With more powerful hardware we might have been able to extract the features of all 5020 proteins in the dataset, but we had to settle for a 2000-protein sample, which may explain some of the variation in the results.

It is also recommended to add the file 'pdb_testing.py' to the path so that a prediction can be created from anywhere on the computer. The output files are always saved in the path of the PDB file or, if a PDB code is given, in a newly created directory with all the predicted information.
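
One plausible way to realize this per-model weighting in scikit-learn is the `class_weight` parameter, up-weighting positive (binding-site) residues; this is an assumption for illustration, since the exact mechanism is not spelled out here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; binding residues are rare (~10% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)

# Four variants that differ only in the weight given to the positive class;
# weights 6-7 suit small/regular proteins, 8 and 10 target larger ones.
models = {
    w: RandomForestClassifier(n_estimators=200,
                              class_weight={0: 1, 1: w},  # up-weight binding residues
                              random_state=0).fit(X, y)
    for w in (6, 7, 8, 10)
}
```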

Limitations

Given the nature of machine learning, the model is only as good as the information provided, and it will not work on all protein families with the same degree of confidence. Due to local hardware limitations, the model was trained on a sample of 2000 proteins from the original 5020 elements in the database, and only those that did not raise any errors were kept, leaving 616 proteins for the model. This set was then further subdivided, so some protein families may be more represented than others in the training set. Nevertheless, the results are promising, and we foresee even better results if the model's training data were increased.

The models cannot properly predict the binding site of proteins of extreme lengths. If the protein is too short, its binding site is very large relative to the length of the whole protein, and the software will tend to interpret the whole protein as a binding site. On the other hand, the weight-8 and weight-10 models, which are supposed to work better for larger proteins, did not work as anticipated: if the protein is very long, the prediction struggles with the added complexity and produces worse results. We expect that a model with a weight of 15 would achieve much better results for large proteins.

Some caveats in the usage of the software:

  • In some rare cases, the presence of non-protein components in the amino acid sequence produces errors.
  • In less common situations, the software was unable to extract the alpha carbons needed for the calculations.
  • In less than 5% of our testing, with amino acids carrying special added biochemical groups, or where DSSP is unable to calculate secondary-structure features, a feature-count mismatch occurs, creating a dimension error in the generated matrix.

Conclusions

We can conclude that the models have proven to be fairly useful for predicting binding sites across a wide set of proteins, with varying levels of accuracy, and that with more training even better results can be achieved. RF has proven to be a versatile machine learning method for the purposes of this project. It would be interesting to repeat this work with different ML algorithms; this way we could go further in our predictions and compare how the different algorithms perform against each other. A potential extension of this project would be to implement another model that, rather than protein-ligand interactions, predicts protein-protein or protein-inhibitor interactions. This would produce a more robust prediction method with a wider range of applicable subjects. It would also be interesting to redo this project using deep learning algorithms; we tried that approach first but failed in the process, and we would be willing to try again in the future to see how it compares with our current approach. In conclusion, with new methods and technology, developing new tools for protein-ligand binding prediction is an endeavor that requires constant reevaluation and iteration.