Compound-Protein Interaction Prediction

Overview
This project predicts interactions between compounds and proteins with machine learning models. It combines protein language models with custom embedding pooling techniques to improve accuracy on function prediction tasks relative to traditional methods.

Methodology

Protein Encoding: Various protein language models with advanced embedding pooling methods were employed to capture function-specific information, improving prediction performance.
Compound Encoding: My colleague Zhiqing Xu developed a novel directed message-passing neural network (d-MPNN) using Extended Connectivity Fingerprints (ECFP) to represent compounds.
Interaction Simulation: Transformer-inspired architectures model the interaction between proteins and compounds, enabling accurate prediction across multiple kinetic parameters.
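
To make the protein pooling and interaction steps above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the project's implementation: the function names, the tensor shapes, the choice of mean and max pooling, and the single-head cross-attention are all hypothetical stand-ins for the "advanced pooling methods" and "transformer-inspired architectures" described above. Random arrays take the place of real protein language model and compound encoder outputs.

```python
import numpy as np

def mean_pool(residue_emb, mask):
    """Average per-residue embeddings over real (unpadded) positions."""
    m = mask[:, None].astype(residue_emb.dtype)
    return (residue_emb * m).sum(axis=0) / max(m.sum(), 1.0)

def max_pool(residue_emb, mask):
    """Element-wise maximum over real (unpadded) positions."""
    masked = np.where(mask[:, None], residue_emb, -np.inf)
    return masked.max(axis=0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values, key_mask):
    """Scaled dot-product attention: each compound token attends over residues."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Padded residues get a large negative score, so their weight becomes ~0.
    scores = np.where(key_mask[None, :], scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

# Assumed toy shapes: 6 residue slots (4 real, 2 padding), 4 compound tokens, dim 8.
rng = np.random.default_rng(0)
num_residues, num_tokens, dim = 6, 4, 8
protein = rng.normal(size=(num_residues, dim))   # stand-in for language model output
mask = np.array([True, True, True, True, False, False])
compound = rng.normal(size=(num_tokens, dim))    # stand-in for compound encoder output

# Fixed-size protein vector built from two pooling views of the same embeddings.
pooled = np.concatenate([mean_pool(protein, mask), max_pool(protein, mask)])

# Compound tokens query the protein residues; the result has one row per token.
context, weights = cross_attention(compound, protein, protein, mask)
```

Concatenating several pooled views is one simple way to keep information that a single average would wash out. In a full model, the attention output would typically feed a regression head that predicts the kinetic parameter.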

Key Contributions

Demonstrated improved function prediction, especially for kinetic parameters, with advanced pooling methods.
Introduced transformer-inspired architectures for better compound-protein interaction simulation.
Developed machine-learning-ready kinetic datasets for benchmarking sequence-to-function models.
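
As a rough picture of what "machine-learning-ready" can mean for kinetic data, the sketch below cleans a few made-up records and splits them by protein. Everything in it is an assumption for illustration: the record layout, the example sequences and values, the log10 transform (a common choice because kinetic constants span orders of magnitude), and the sequence-level split are not taken from the project's datasets.

```python
import math
import random
from collections import defaultdict

# Hypothetical raw records: (protein sequence, compound SMILES, measured value).
raw = [
    ("MKTAYIAK", "CCO", 12.0),
    ("MKTAYIAK", "CCO", 8.0),        # replicate measurement of the same pair
    ("MKTAYIAK", "CC(=O)O", 0.5),
    ("MSLLTEVE", "CCO", 150.0),
    ("MSLLTEVE", "c1ccccc1", -1.0),  # non-positive value, unusable on a log scale
]

def build_dataset(records):
    """Drop non-positive values, log10-transform, and average replicates per pair."""
    grouped = defaultdict(list)
    for seq, smiles, value in records:
        if value > 0:
            grouped[(seq, smiles)].append(math.log10(value))
    return [
        {"sequence": s, "smiles": c, "log10_value": sum(v) / len(v)}
        for (s, c), v in grouped.items()
    ]

def split_by_sequence(dataset, test_fraction=0.5, seed=0):
    """Hold out whole proteins so no test sequence also appears in training."""
    seqs = sorted({row["sequence"] for row in dataset})
    random.Random(seed).shuffle(seqs)
    n_test = max(1, int(len(seqs) * test_fraction))
    test_seqs = set(seqs[:n_test])
    train = [r for r in dataset if r["sequence"] not in test_seqs]
    test = [r for r in dataset if r["sequence"] in test_seqs]
    return train, test

dataset = build_dataset(raw)
train, test = split_by_sequence(dataset)
```

Splitting on whole sequences rather than on individual records is one way to keep a benchmark from rewarding memorization of proteins seen during training.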

Further Information
This project provides a robust framework for compound-protein interaction prediction, aiding research in drug discovery, enzyme engineering, and other biotechnology applications. The datasets created are accessible for benchmarking in related research.

Figure: Components of the compound-protein interaction model and its performance, including protein embedding pooling methods, the performance on kinetic parameter datasets, and the ML-ready datasets.

GitHub Repository
The repository will be posted once the publication is out! Please stay tuned.