Module 02: Prediction of ZWHQD Compound Targets

Authors

Affiliations

Kun Hou

Health Science Center, Xi’an Jiaotong University

Hanzhong Traditional Chinese Medicine Hospital

Supervisor’s name

Health Science Center, Xi’an Jiaotong University

The First Affiliated Hospital of Xi’an Jiaotong University

1 Overview

This module identifies potential therapeutic targets of the blood-absorbed bioactive ingredients through integrative target prediction. Predictions were retrieved from seven online platforms, comprising three widely used databases and four additional target-fishing servers: BATMAN-TCM 2.0¹, SuperPred 3.0², SwissTargetPrediction³, PharmMapper⁴,⁵, TargetNet (Yao et al., 2016), PPB3⁶, and SEA⁷.

1.1 Database Introduction

Seven target prediction web servers were selected to cover different prediction principles (machine learning, pharmacophore mapping, similarity search, etc.), so that their predictions complement one another. Each server is described below:

1.1.1 BATMAN-TCM 2.0 (http://bionet.ncpsb.org.cn/batman-tcm/)

An updated web server dedicated to network pharmacology-based prediction and analysis of traditional Chinese medicine (TCM)¹. It integrates multiple algorithms to predict potential targets of TCM compounds, with a focus on the compatibility and synergistic effects of TCM ingredients, and provides target-related functional annotations, making it suitable for target prediction of blood-absorbed TCM compounds.

1.1.2 SuperPred 3.0 (https://prediction.charite.de/)

A web server for predicting Anatomical Therapeutic Chemical (ATC) codes and potential targets of small molecules². Its ATC classification is based on a linear logistic regression model trained on Morgan fingerprints (length 2048) of 1552 different drugs in 233 level 4 ATC classes. It ranks ATC classes and target candidates by score, providing a reference for compound classification and target identification.

1.1.3 SwissTargetPrediction (https://www.swisstargetprediction.ch/)

A widely used target prediction tool developed by the Swiss Institute of Bioinformatics³. It predicts potential targets of small molecules from the similarity between the query compound and known active molecules, and supports multiple species (Homo sapiens, Mus musculus, Rattus norvegicus, etc.). Its underlying methods and parameters have remained consistent across interface updates, which supports the stability and reproducibility of prediction results.

1.1.4 PharmMapper (http://www.lilab-ecust.cn/pharmmapper/)

A web server for potential drug target identification using the pharmacophore mapping approach⁴,⁵. It constructs a comprehensive target pharmacophore database, matches the pharmacophore of query compounds with the pharmacophores of known targets, and predicts potential targets by calculating the matching degree, which is particularly suitable for the prediction of small molecule targets with specific spatial structures.

1.1.5 TargetNet (https://targetnet.scbdd.com/)

A web service for predicting potential drug-target interaction profiles via multi-target structure-activity relationship (SAR) models (Yao et al., 2016). It combines multiple machine learning algorithms with SAR models to predict targets of small molecules, and reports interaction scores and target functional annotations.

1.1.6 PPB3 (https://ppb3.genome-mining.com/)

A web-based deep learning tool for target prediction using ChEMBL data⁶. It adopts deep learning algorithms to train on a large number of compound-target interaction data in ChEMBL, and can predict potential targets of small molecules with high accuracy, especially suitable for polypharmacology research of compounds.

1.1.7 SEA (https://sea.bkslab.org/)

A target prediction tool based on ligand chemistry similarity⁷. It infers potential targets of query compounds by analyzing the similarity between the compound and known ligands of target proteins, and establishes the relationship between protein structure and function through ligand information, with high prediction efficiency and wide coverage of target types.
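
As a rough illustration of the ligand-similarity idea behind tools such as SEA and SwissTargetPrediction, the sketch below computes a Tanimoto coefficient between two binary fingerprints. The fingerprints and the `tanimoto()` helper are hypothetical; the servers use their own molecular fingerprints and add statistical scoring on top of raw similarity, so this is not their implementation.

```r
# Hypothetical helper: Tanimoto coefficient between two equal-length binary fingerprints
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  shared <- sum(a & b)   # bits set in both fingerprints
  total  <- sum(a | b)   # bits set in at least one fingerprint
  if (total == 0) return(0)
  shared / total
}

# Toy 8-bit fingerprints (illustrative only)
query  <- c(1, 1, 0, 1, 0, 0, 1, 0)
ligand <- c(1, 0, 0, 1, 0, 1, 1, 0)

similarity <- tanimoto(query, ligand)
print(similarity)
```

A query compound that is highly similar to many known ligands of a protein is then treated as a candidate ligand of that protein.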

1.2 Target Screening Rules

To ensure the reliability and specificity of the predicted targets, strict screening rules were formulated based on the characteristics of each database, and the following steps were implemented sequentially:

Raw Data Import

Raw target prediction data were imported uniformly from the seven above-mentioned web servers, including compound information, target protein names, prediction scores, confidence levels, and other related parameters. For BATMAN-TCM 2.0, due to temporary web page parsing failure, prediction data were collected after the server was restored or by alternative reliable channels.

1.2.1 Confidence Score Filtering (Database-Specific Thresholds)

According to the scoring system of each database, low-confidence predictions were filtered to retain only high-confidence target candidates, with the following specific thresholds:

BATMAN-TCM 2.0: Retain targets with a prediction score ≥ 0.84 (the cutoff applied in Step 1).
SuperPred: Retain targets with a predicted probability > 70% and a model accuracy > 90% (the probability estimates how likely the compound is to interact with the target; these are the cutoffs applied in Step 2).
SwissTargetPrediction: Retain targets with a “Probability” score > 0.10 (the score reflects the similarity between the query compound and known ligands; this is the cutoff applied in Step 3).
PharmMapper: Retain human targets with a “Norm Fit” score > 0.9 and a z-score > 0 (the fit score reflects the matching degree between the compound’s pharmacophore and the target’s pharmacophore; these are the cutoffs applied in Step 4).
SEA: Retain targets with a “Score” ≥ 20 (the score is based on ligand similarity, and targets with scores ≥ 20 have significant interaction potential, as recommended by the original literature).
TargetNet: Retain targets with a predicted probability > 0.5 (the score is calculated by multi-target SAR models; this is the cutoff applied in Step 5).
PPB3: Retain targets with a “Confidence Score” ≥ 0.7 (the deep learning-based score, with scores ≥ 0.7 corresponding to high prediction reliability).
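
Database-specific filtering of this kind can be sketched by keeping the cutoffs in a small lookup table and joining it to the predictions. The server names, scores, and cutoffs below are placeholders chosen for illustration, not the values used in this study, and the column names are assumptions.

```r
library(dplyr)

# Hypothetical cutoff table: one row per server, on that server's own score scale
thresholds <- tibble::tribble(
  ~database, ~cutoff,
  "ServerA", 0.8,
  "ServerB", 20
)

# Toy predictions; each score is on the native scale of its server
predictions <- tibble::tribble(
  ~database, ~CID, ~target, ~score,
  "ServerA", "1",  "P1",    0.91,
  "ServerA", "1",  "P2",    0.42,
  "ServerB", "2",  "P3",    35,
  "ServerB", "2",  "P4",    12
)

# Attach each server's cutoff, then keep only predictions that reach it
filtered <- predictions %>%
  inner_join(thresholds, by = "database") %>%
  filter(score >= cutoff) %>%
  select(-cutoff)

print(filtered)
```

Keeping the cutoffs in one table makes it easier to check that the thresholds stated in the text and those applied in the code stay in agreement.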

1.2.2 Gene Symbol Standardization

All retained candidate target proteins were mapped and standardized to official human gene symbols using the UniProt knowledgebase (https://www.uniprot.org/). For targets with non-standard names or aliases, the corresponding official gene symbols were confirmed by searching the UniProt database, and targets that could not be standardized (no corresponding official gene symbols) were excluded.
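
The mapping step can be sketched as a join against a UniProt lookup table, followed by removal of unmapped rows. The two-row lookup table below is a toy stand-in for the reviewed human UniProt table used later in this module, and the accession `X99999` is deliberately invalid to show the exclusion.

```r
library(dplyr)

# Toy stand-in for a UniProt lookup table (column names mirror those used in this module)
uniprot_lookup <- tibble::tibble(
  Entry = c("P00918", "P35354"),
  Gene.Names.primary = c("CA2", "PTGS2")
)

# Candidate targets; the last accession has no entry in the lookup table
candidates <- tibble::tibble(
  CID = c("100", "100", "200"),
  Uniprot.ID = c("P00918", "P35354", "X99999")
)

# Map accessions to primary gene symbols and drop targets that cannot be mapped
standardized <- candidates %>%
  left_join(uniprot_lookup, by = c("Uniprot.ID" = "Entry")) %>%
  filter(!is.na(Gene.Names.primary)) %>%
  rename(Symbol = Gene.Names.primary)

print(standardized)
```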

1.3 Merge & Deduplication

Target data from different databases were merged, and redundant targets (the same official gene symbol corresponding to multiple prediction results) were removed. For the same target predicted by multiple databases, the highest prediction score among all databases was retained as the final confidence score of the target, to enhance the reliability of the target.
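
A minimal sketch of this merge is shown below, assuming the per-server scores have already been brought onto a comparable scale (raw scores from different servers are not directly comparable). The server names, scores, and the extra `n_sources` and `sources` columns are illustrative additions, not part of the pipeline described here.

```r
library(dplyr)

# Toy predictions from two hypothetical servers for the same compound
merged <- bind_rows(
  tibble::tibble(source = "ServerA", CID = "100",
                 Symbol = c("CA2", "PTGS2"), score = c(0.90, 0.85)),
  tibble::tibble(source = "ServerB", CID = "100",
                 Symbol = c("CA2", "EGFR"), score = c(0.95, 0.70))
)

# One row per compound-target pair: keep the highest score and record the supporting servers
deduplicated <- merged %>%
  group_by(CID, Symbol) %>%
  summarise(
    best_score = max(score),
    n_sources  = n_distinct(source),
    sources    = paste(sort(unique(source)), collapse = ";"),
    .groups    = "drop"
  )

print(deduplicated)
```

Recording which servers support each pair keeps the provenance that a plain `distinct()` call would discard.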

1.3.1 Final Target Confirmation

After the above steps, the final high-confidence compound-target interaction pairs were sorted and consolidated, ensuring that each pair has clear prediction confidence and standardized gene symbols, and excluding pairs with ambiguous target information or low confidence.

1.4 Workflow Summary

Import raw target prediction data from seven authoritative web servers.
Filter low-confidence targets according to database-specific scoring thresholds.
Standardize target gene symbols using the UniProt knowledgebase.
Merge cross-database target data and remove redundancy.
Confirm and output final high-quality compound-target pairs.

1.5 Main Outputs

Standardized high-quality compound-target interaction dataset, detailed prediction score annotation tables, and supplementary target screening statistics tables.

2 Load Packages

Code

library(tidyverse)
library(openxlsx)
library(data.table)

3 Step 1: Process BATMAN-TCM Targets

3.1 Load Data

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/tcm/batman/batman_target_known.RData")
load("../../data/raw/tcm/batman/batman_target_pred.RData")

3.2 Known Targets

Code

compound_target_known_batman <- compounds_final %>%
  left_join(batman_target_known, by = "CID", relationship = "many-to-many") %>%
  select(-ends_with(".x")) %>%
  rename_with(~ str_replace(., "\\.y$", ""), ends_with(".y")) %>%
  distinct(Herb.Name.Pinyin, CID, Uniprot.ID, .keep_all = TRUE) %>%
  filter(!is.na(Symbol))

3.3 Predicted Targets (Score ≥ 0.84)

Code

compound_target_pred_batman <- compounds_final %>%
  left_join(batman_target_pred, by = "CID", relationship = "many-to-many") %>%
  select(-ends_with(".x")) %>%
  rename_with(~ str_replace(., "\\.y$", ""), ends_with(".y")) %>%
  distinct(Herb.Name.Pinyin, CID, Uniprot.ID, .keep_all = TRUE) %>%
  filter(!is.na(Symbol), Score >= 0.84)

3.4 Save

Code

save(
  compound_target_known_batman,
  compound_target_pred_batman,
  file = "../../data/processed/tcm/07_batman_targets.RData"
)

# Table S3
write.xlsx(
  list(
    Known = compound_target_known_batman,
    Predicted = compound_target_pred_batman
  ),
  "../../tables/supplementary/Table_S3_batman_targets.xlsx"
)

4 Step 2: Process SuperPred Targets

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/uniprot/uniprot_human_reviewed.RData")

# Known targets
known <- read_compound_target("../../data/raw/tcm/super/known") %>%
  select(CID, Protein.name = `Target Name`, Uniprot.ID = `UniProt ID`)

# Predicted targets (Prob > 70, Model Accuracy > 90)
pred <- read_compound_target("../../data/raw/tcm/super") %>%
  filter(Probability > 70, `Model accuracy` > 90) %>%
  select(CID, Protein.name = `Target Name`, Uniprot.ID = `UniProt ID`)

4.1 Merge & Annotate

Code

process_super <- function(df) {
  df %>%
    left_join(uniprot_human_reviewed, by = c("Uniprot.ID" = "Entry")) %>%
    filter(!is.na(Gene.Names.primary)) %>%
    mutate(Symbol = Gene.Names.primary) %>%
    distinct(CID, Uniprot.ID, .keep_all = TRUE)
}

known_clean <- process_super(known)
pred_clean <- process_super(pred)

# Merge with compounds
compound_target_known_super <- compounds_final %>%
  left_join(known_clean, by = "CID", relationship = "many-to-many") %>%
  distinct(Herb.Name.Pinyin, CID, Uniprot.ID, .keep_all = TRUE) %>%
  filter(!is.na(Symbol))

compound_target_pred_super <- compounds_final %>%
  left_join(pred_clean, by = "CID", relationship = "many-to-many") %>%
  distinct(Herb.Name.Pinyin, CID, Uniprot.ID, .keep_all = TRUE) %>%
  filter(!is.na(Symbol))

4.2 Save

Code

save(
  compound_target_known_super,
  compound_target_pred_super,
  file = "../../data/processed/tcm/08_super_targets.RData"
)

# Table S4: super targets
write.xlsx(
  list(Known = compound_target_known_super, Predicted = compound_target_pred_super),
  "../../tables/supplementary/Table_S4_super_targets.xlsx"
)

5 Step 3: Process SwissTargetPrediction

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/uniprot/uniprot_human_reviewed.RData")

result <- read_compound_target("../../data/raw/tcm/swiss") %>%
  filter(`Probability*` > 0.10) %>%
  separate_rows(`Common name`, `Uniprot ID`, sep = " ") %>%
  mutate(
    CID = clean_cids(CID),
    Uniprot.ID = gsub("[[:space:]]", "", `Uniprot ID`),
    Symbol = gsub("[[:space:]]", "", `Common name`)
  ) %>%
  select(CID, Uniprot.ID) %>%
  distinct(CID, Uniprot.ID, .keep_all = TRUE) %>%
  left_join(uniprot_human_reviewed, by = c("Uniprot.ID" = "Entry")) %>%
  filter(!is.na(Gene.Names.primary))

compound_target_pred_swiss <- compounds_final %>%
  left_join(result, by = "CID", relationship = "many-to-many") %>%
  filter(!is.na(Gene.Names.primary)) %>%
  dplyr::rename(Symbol = Gene.Names.primary) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE)

5.1 Save

Code

save(
  compound_target_pred_swiss,
  file = "../../data/processed/tcm/09_swiss_targets.RData"
)

# Table S5: swiss_targets
write.xlsx(
  compound_target_pred_swiss,
  "../../tables/supplementary/Table_S5_swiss_targets.xlsx"
)

6 Step 4: Process PharmMapper

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/uniprot/uniprot_human_reviewed.RData")

result <- read_compound_target2("../../data/raw/tcm/pharm", skip_row = 1) %>%
  # Keep human targets with a positive z-score and a normalized fit above 0.9
  dplyr::filter(grepl("_HUMAN", Uniplot), zscore > 0, `Norm Fit` > 0.9) %>%
  # The `Uniplot` column holds UniProt entry names (e.g. CAH2_HUMAN), not accessions
  mutate(CID = clean_cids(CID), Entry.Name = Uniplot) %>%
  select(CID, Entry.Name) %>%
  distinct(CID, Entry.Name) %>%
  # Assumes uniprot_human_reviewed stores entry names with the _HUMAN suffix
  left_join(uniprot_human_reviewed, by = "Entry.Name") %>%
  filter(!is.na(Gene.Names.primary)) %>%
  dplyr::rename(Uniprot.ID = Entry)

compound_target_pred_pharm <- compounds_final %>%
  left_join(result, by = "CID", relationship = "many-to-many") %>%
  filter(!is.na(Gene.Names.primary)) %>%
  dplyr::rename(Symbol = Gene.Names.primary) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE)

6.1 Save

Code

save(
  compound_target_pred_pharm,
  file = "../../data/processed/tcm/10_pharm_targets.RData"
)

# Table S6: pharm_targets
write.xlsx(
  compound_target_pred_pharm,
  "../../tables/supplementary/Table_S6_pharm_targets.xlsx"
)

7 Step 5: Process TargetNet

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/uniprot/uniprot_human_reviewed.RData")


query <- read.xlsx("../../data/raw/tcm/targetnet/query.xlsx")

cids <- query$CID
col_names <- c("Uniprot.ID", "Protein", cids)

result <- read.xlsx("../../data/raw/tcm/targetnet/results.xlsx") %>%
  dplyr::select(-Details) %>%
  setNames(col_names) %>%
  # Reshape from one column per compound to one row per compound-target pair
  pivot_longer(
    cols = -c(Uniprot.ID, Protein),
    names_to = "CID",
    values_to = "Probability"
  ) %>%
  dplyr::filter(!is.na(Probability), Probability > 0.5) %>%
  mutate(CID = as.numeric(CID)) %>%
  # Sort descending so that distinct() keeps the highest probability per pair
  arrange(CID, desc(Probability)) %>%
  mutate(CID = as.character(CID)) %>%
  dplyr::select(CID, Uniprot.ID, Probability) %>%
  distinct(CID, Uniprot.ID, .keep_all = TRUE) %>%
  left_join(uniprot_human_reviewed, by = c("Uniprot.ID" = "Entry")) %>%
  filter(!is.na(Gene.Names.primary))

compound_target_pred_targetnet <- compounds_final %>%
  left_join(result, by = "CID", relationship = "many-to-many") %>%
  filter(!is.na(Gene.Names.primary)) %>%
  dplyr::rename(Symbol = Gene.Names.primary) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE)
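
The wide-to-long reshaping used for the TargetNet results can be illustrated on a toy table. The layout assumed here (one row per target, one probability column per compound CID) is inferred from the code above; the values are invented.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per target, one probability column per compound CID
wide <- tibble::tibble(
  Uniprot.ID = c("P00918", "P35354"),
  Protein    = c("Carbonic anhydrase 2", "Prostaglandin G/H synthase 2"),
  `100`      = c(0.92, 0.30),
  `200`      = c(NA, 0.75)
)

# One row per compound-target pair, keeping only probabilities above 0.5
long <- wide %>%
  pivot_longer(
    cols = -c(Uniprot.ID, Protein),
    names_to = "CID",
    values_to = "Probability"
  ) %>%
  filter(!is.na(Probability), Probability > 0.5) %>%
  arrange(CID, desc(Probability))

print(long)
```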

8 Step 6: Process PPB3

Code

rm(list = ls())

source("../../scripts/utils.R")

load("../../data/processed/tcm/06_compounds_final.RData")
load("../../data/raw/uniprot/uniprot_human_reviewed.RData")

result <- read_compound_target("../../data/raw/tcm/ppb3") %>%
  dplyr::filter( `Target Type` == 'SINGLE PROTEIN', `Target Organism` == "Homo sapiens")

9 Step 7: Process SEA

10 Step 8: Merge All Targets & Deduplicate

10.1 Common Columns

Code

cnames <- c(
  "Herb.Name.Chinese", "Herb.Name.Pinyin", "Herb.Name.English", "Herb.Name.Latin",
  "Compound.Name", "IUPACName", "Compound.Category",
  "MolecularFormula", "MolecularWeight", "CID",
  "SMILES", "InChI", "InChIKey", "Symbol", "Uniprot.ID"
)

10.2 Known Targets

Code

load("../../data/processed/tcm/07_batman_targets.RData")
load("../../data/processed/tcm/08_super_targets.RData")

compound_target_known <- rbind(
  compound_target_known_batman[, cnames],
  compound_target_known_super[, cnames]
) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE) %>%
  filter(!is.na(Symbol))

10.3 Predicted Targets

Code

load("../../data/processed/tcm/09_swiss_targets.RData")

compound_target_pred <- rbind(
  compound_target_pred_batman[, cnames],
  compound_target_pred_super[, cnames],
  compound_target_pred_swiss[, cnames]
) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE) %>%
  filter(!is.na(Symbol))

10.4 All Targets

Code

compound_target_all <- rbind(compound_target_known, compound_target_pred) %>%
  distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE)

save(
  compound_target_known,
  compound_target_pred,
  compound_target_all,
  file = "../../data/processed/tcm/11_all_compound_targets.RData"
)

11 Step 9: Final Standardization with UniProt

Code

load("../../data/raw/uniprot/uniprot_human_reviewed.RData")

compound_target_final <- compound_target_all %>%
  left_join(uniprot_human_reviewed, by = c("Uniprot.ID" = "Entry")) %>%
  select(-Reviewed, -Entry.Name)

compound_target_export <- compound_target_final %>%
  select(-Sequence)

save(
  compound_target_final,
  file = "../../data/processed/tcm/12_compound_target_final.RData"
)

11.1 Export Final Supplementary Table

Code

write.xlsx(
  list(
    Known = compound_target_known,
    Predicted = compound_target_pred,
    All = compound_target_export
  ),
  "../../tables/supplementary/Table_S6_all_compound_targets.xlsx"
)

# write.xlsx(
#   compound_target_export,
#   "../../tables/main/Table_Compound_Target_Pairs.xlsx"
# )

12 Results

12.1 Target Statistics

Code

load("../../data/processed/tcm/07_batman_targets.RData")
load("../../data/processed/tcm/08_super_targets.RData")
load("../../data/processed/tcm/09_swiss_targets.RData")
load("../../data/processed/tcm/11_all_compound_targets.RData")

cat("===== BATMAN-TCM =====\n")
cat("Known compounds:", n_distinct(compound_target_known_batman$CID), "\n")
cat("Known targets:", n_distinct(compound_target_known_batman$Uniprot.ID), "\n\n")

cat("===== Super-Pred =====\n")
cat("Known compounds:", n_distinct(compound_target_known_super$CID), "\n")
cat("Known targets:", n_distinct(compound_target_known_super$Uniprot.ID), "\n\n")

cat("===== SwissTargetPrediction =====\n")
cat("Pred compounds:", n_distinct(compound_target_pred_swiss$CID), "\n")
cat("Pred targets:", n_distinct(compound_target_pred_swiss$Uniprot.ID), "\n\n")

cat("===== FINAL SUMMARY =====\n")
cat("Total compounds:", n_distinct(compound_target_all$CID), "\n")
cat("Total targets:", n_distinct(compound_target_all$Symbol), "\n")


References

(1)

Kong, X.; Liu, C.; Zhang, Z.; Cheng, M.; Mei, Z.; Li, X.; Liu, P.; Diao, L.; Ma, Y.; Jiang, P.; Kong, X.; Nie, S.; Guo, Y.; Wang, Z.; Zhang, X.; Wang, Y.; Tang, L.; Guo, S.; Liu, Z.; Li, D. BATMAN-TCM 2.0: An Enhanced Integrative Database for Known and Predicted Interactions Between Traditional Chinese Medicine Ingredients and Target Proteins. Nucleic Acids Research 2024, 52 (D1), D1110–D1120. https://doi.org/10.1093/nar/gkad926.

(2)

Gallo, K.; Goede, A.; Preissner, R.; Gohlke, B.-O. SuperPred 3.0: Drug Classification and Target Prediction—a Machine Learning Approach. Nucleic Acids Research 2022, 50 (W1), W726–W731. https://doi.org/10.1093/nar/gkac297.

(3)

Daina, A.; Michielin, O.; Zoete, V. SwissTargetPrediction: Updated Data and New Features for Efficient Prediction of Protein Targets of Small Molecules. Nucleic Acids Research 2019, 47 (W1), W357–W364. https://doi.org/10.1093/nar/gkz382.

(4)

Liu, X.; Ouyang, S.; Yu, B.; Liu, Y.; Huang, K.; Gong, J.; Zheng, S.; Li, Z.; Li, H.; Jiang, H. PharmMapper Server: A Web Server for Potential Drug Target Identification Using Pharmacophore Mapping Approach. Nucleic Acids Research 2010, 38 (suppl_2), W609–W614. https://doi.org/10.1093/nar/gkq300.

(5)

Wang, X.; Shen, Y.; Wang, S.; Li, S.; Zhang, W.; Liu, X.; Lai, L.; Pei, J.; Li, H. PharmMapper 2017 Update: A Web Server for Potential Drug Target Identification with a Comprehensive Target Pharmacophore Database. Nucleic Acids Research 2017, 45 (W1), W356–W360. https://doi.org/10.1093/nar/gkx374.

(6)

Darsaraee, M.; Javor, S.; Reymond, J.-L. Polypharmacology Browser PPB3: A Web-Based Deep Learning Tool for Target Prediction Using ChEMBL Data. Journal of Chemical Information and Modeling 2026, 66 (5), 2466–2473. https://doi.org/10.1021/acs.jcim.6c00299.

(7)

Keiser, M. J.; Roth, B. L.; Armbruster, B. N.; Ernsberger, P.; Irwin, J. J.; Shoichet, B. K. Relating Protein Structure and Function by Ligand Chemistry. Nature Biotechnology 2007, 25 (2), 197–206.

"Herb.Name.Chinese", "Herb.Name.Pinyin", "Herb.Name.English", "Herb.Name.Latin", "Compound.Name", "IUPACName", "Compound.Category", "MolecularFormula", "MolecularWeight", "CID", "SMILES", "InChI", "InChIKey", "Symbol", "Uniprot.ID" ) ``` ## Known Targets ```{r} load("../../data/processed/tcm/07_batman_targets.RData") load("../../data/processed/tcm/08_super_targets.RData") compound_target_known <- rbind( compound_target_known_batman[, cnames], compound_target_known_super[, cnames] ) %>% distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE) %>% filter(!is.na(Symbol)) ``` ## Predicted Targets ```{r} load("../../data/processed/tcm/09_swiss_targets.RData") compound_target_pred <- rbind( compound_target_pred_batman[, cnames], compound_target_pred_super[, cnames], compound_target_pred_swiss[, cnames] ) %>% distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE) %>% filter(!is.na(Symbol)) ``` ## All Targets ```{r} compound_target_all <- rbind(compound_target_known, compound_target_pred) %>% distinct(Herb.Name.Pinyin, CID, Symbol, .keep_all = TRUE) save( compound_target_known, compound_target_pred, compound_target_all, file = "../../data/processed/tcm/11_all_compound_targets.RData" ) ``` --- # Step 9: Final Standardization with UniProt ```{r} load("../../data/raw/uniprot/uniprot_human_reviewed.RData") compound_target_final <- compound_target_all %>% left_join(uniprot_human_reviewed, by = c("Uniprot.ID" = "Entry")) %>% select(-Reviewed, -Entry.Name) compound_target_export <- compound_target_final %>% select(-Sequence) save( compound_target_final, file = "../../data/processed/tcm/12_compound_target_final.RData" ) ``` ## Export Final Supplementary Table ```{r} write.xlsx( list( Known = compound_target_known, Predicted = compound_target_pred, All = compound_target_export ), "../../tables/supplementary/Table_S6_all_compound_targets.xlsx" ) # write.xlsx( # compound_target_export, # "../../tables/main/Table_Compound_Target_Pairs.xlsx" # ) ``` --- # Results ## Target 
Statistics ```{r} load("../../data/processed/tcm/07_batman_targets.RData") load("../../data/processed/tcm/08_super_targets.RData") load("../../data/processed/tcm/09_swiss_targets.RData") load("../../data/processed/tcm/11_all_compound_targets.RData") cat("===== BATMAN-TCM =====\n") cat("Known compounds:", n_distinct(compound_target_known_batman$CID), "\n") cat("Known targets:", n_distinct(compound_target_known_batman$Uniprot.ID), "\n\n") cat("===== Super-Pred =====\n") cat("Known compounds:", n_distinct(compound_target_known_super$CID), "\n") cat("Known targets:", n_distinct(compound_target_known_super$Uniprot.ID), "\n\n") cat("===== SwissTargetPrediction =====\n") cat("Pred compounds:", n_distinct(compound_target_pred_swiss$CID), "\n") cat("Pred targets:", n_distinct(compound_target_pred_swiss$Uniprot.ID), "\n\n") cat("===== FINAL SUMMARY =====\n") cat("Total compounds:", n_distinct(compound_target_all$CID), "\n") cat("Total targets:", n_distinct(compound_target_all$Symbol), "\n") ``` --- <table class="nav-table" width="100%"> <tr> <td align="left"> [Home](../../index.qmd) | [About](../../about.qmd) | [Methods](../../methods.qmd) | [Results](../../results.qmd) </td> <td align="right"> [Previous Module](01_active_component_screening.qmd) | [Next Module](../Part2_GEO_Disease/03_geo_data_preprocessing.qmd) </td> </tr> </table> # References {-}