This vignette will take you through all important steps for the analysis of dose-response experiments with protti.
Before analysing your data please ensure that it is of sufficient quality by using protti’s quality control functions (see Quality control (QC) workflow) If you would like to compare a control condition with one treatment condition (or multiple unrelated treatments) you can check out the data analysis vignette for single dose treatment data. For experiments with treatment titrations (i.e. samples treated with different amounts of the same compound, protein, metabolite, element, ion, etc.) you can fit sigmoidal log-logistic regression curves to the data and determine significantly changing precursors, peptides or proteins based on Pearson correlation and ANOVA q-value. For help with data input please click here.
A typical dose-response experiment contains multiple samples that were treated with different amounts of e.g. a drug. Replicates of samples treated with the same dose make up a condition. Commonly, the first concentration is 0 ( = control; e.g. the treatment with the solvent of the drug). Dose-response treatments require a minimal number of treatments to fit curves with sufficient quality. In protti at least 5 different conditions should be present. Another consideration for your experiment is the range of concentrations. On the one hand they should not be too close together since effects are usually best identified over a larger concentration range. On the other hand you should make sure to not space them out too far from each other. It is generally advised to space them evenly on a logarithmic scale of the base 10 or \(e\) (Euler’s number). You can also include steps in between e.g. 100, 500, 1000 or 100, 200, 1000 for a log10 scale. Using a rather broad concentration range for an experiment in which you do not know what to expect is advisable.
protti is fitting four-parameter log-logistic dose-response models to your data. It utilizes the drm()
and LL.4()
functions from the drc
package for this. You can also select a non logarithmic model using the L.4()
function in case your data does not follow a log-logistic but a logistic regression.
For limited proteolysis-coupled to mass spectrometry (LiP-MS) data, dose-response curves have been used previously for the identification of drug binding sites in complex proteomes (Piazza 2020). Since LiP-MS data is usually analysed on the peptide or precursor level, utilising additional information about peptide behaviour from multiple conditions reduces false discovery rate.
Before starting the analysis of your data you need to load protti and additional packages utilised for the analysis. As described on the main page of protti, the tidyverse package collection is by design working well together with the rest of the package. You can load packages after you installed them with the library()
function.
# Load packages
library(protti)
library(dplyr)
library(magrittr)
library(ggplot2)
For this vignette we are using a subset of proteins from an experiment of HeLa cell lysates treated with 9 doses of Rapamycin followed by LiP-MS. Rapamycin forms a gain-of-function complex with the FK506-binding protein (FKBP12) that binds and allosterically inhibits mTORC1 (Sabatini 1994). Since Rapamycin is known to be a highly specific drug, we expect FKBP12 to show up as one of the only interacting proteins.
We included 39 random proteins and FKBP12 in this sample data set. The proteins were sampled using the seed 123.
If you want to read your data into R, we suggest using the read_protti()
function. This function is a wrapper around the fast fread()
function from the data.table
package and the clean_names()
function from the janitor
package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis. To ensure that your data input has the right format you can check out our input preparation vignette.
# Load data
<- read_protti("your_data.csv") rapamycin_dose_response
In this example the rapamycin dose-response data set is included in protti and you can easily use it by calling the data()
function. This will load the data into your R session.
::data("rapamycin_dose_response") utils
After loading data into R we would like to clean it up in order to remove rows that are problematic for our analysis. The rapamycin_dose_response
data is based on a report from Spectronaut™. It includes convenient columns that label peptides as being decoys (eg_is_decoy
). Other software reports might have similar columns that should be checked. MaxQuant for example includes information on contaminating proteins that you should remove prior to data analysis. In this case we would like to remove any peptides left in the report that are decoys (used for false discovery rate estimation). It is a good practice to just run this even though usually decoys are already filtered out.
Due to the fact that with increasing raw intensities (fg_quantity
) also variances increase, statistical tests would have a bias towards lower intense peptides or proteins. Therefore you should log2 transform your data to correct for this mean-variance relationship. We do this by using dplyr
’s mutate()
together with log2()
.
In addition to filtering and log2 transformation it is also advised to median normalise your data to equal out small differences in overall sample intensities that result from unequal sample concentrations. protti provides the median_normalisation()
function for this purpose. This function generates an additional column called normalised_intensity_log2
that contains the normalised intensities.
Note: If your search tool already normalised your data you should not normalise it again.
In addition to the removal of decoys we also remove any non-proteotypic peptides (pep_is_proteotypic
) from our analysis. This is specific for the analysis of LiP-MS data. If a peptide is non-proteotypic it is part of two or more proteins. If we detect a change in non-proteotypic peptides it is not possible to clearly assign which of the proteins it is coming from.
# Filter, log2 transform and normalise data
<- rapamycin_dose_response %>%
data_normalised filter(eg_is_decoy == FALSE) %>%
mutate(intensity_log2 = log2(fg_quantity)) %>%
median_normalisation(sample = r_file_name,
intensity_log2 = intensity_log2) %>%
filter(pep_is_proteotypic == TRUE)
It is also useful to check the intensity distribution of all precursors with the protti function qc_intensity_distribution()
. Usually, the distribution of intensities is a skewed normal distribution of values missing on the left flank. This is due to the detection limit of the mass spectrometer. The probability of not observing a value gets higher with decreasing intensities.
For experiments measured in data independent acquisition (DIA) mode. It is good to filter out very low intensity values that are not part of the distribution. These values are likely false assignments of peaks. If you look closely you can see a small peak around 0 corresponding to these values. In the case of our distribution we could chose a cutoff at a log2 intensity of 5, which corresponds to a raw intensity of 32.
qc_intensity_distribution(data = data_normalised,
grouping = eg_precursor_id,
intensity = normalised_intensity_log2,
plot_style = "histogram")
<- data_normalised %>%
data_normalised filter(normalised_intensity_log2 > 5)
Before we fit dose-response models to our data, we can check how samples cluster. Ideally, samples that belong to the same treatment concentration cluster together. A common way that you can analyse this is correlation based hierachical clustering using the qc_sample_correlation()
function from protti. Basically, the correlation of samples is calculated based on the correlation of precursor (or peptide/protein) intensities. By default qc_sample_correlation()
uses spearman correlation, however you can change that to any correlation supported by the cor()
function.
Note: Many of protti’s plotting function also have the option to display an interactive version of the plot with an interactive
argument.
qc_sample_correlation(data = data_normalised,
sample = r_file_name,
grouping = eg_precursor_id,
intensity_log2 = normalised_intensity_log2,
condition = r_condition)
For our Rapamycin data set we cannot identify any clear sample clustering based on correlation, because only a small subset of proteins was selected for this example. Furthermore, data sets in which only a few changes are expected commonly do not cluster nicely. This is because precursors/peptides/proteins have very similar intensity values throughout all conditions and clustering is in this case based on variance. In case of more global changes samples should usually cluster nicely.
In addition to correlation based clustering a principal component analysis can be performed using the protti function qc_pca()
. Similar to before, good clustering is usually dependent on the amount of changing precursors (or peptides/proteins).
qc_pca(data = data_normalised,
sample = r_file_name,
grouping = eg_precursor_id,
intensity = normalised_intensity_log2,
condition = r_condition)
The main part of a dose-response data analysis is the fit_drc_4p()
function. It fits four-parameter log-logistic dose-response curves for every precursor (or peptide/protein) with the following equation:
\[y = d + \frac{a - d}{1 + (\frac{x}{c})^{-b}}\] \(x\) is the dependent variable
\(y\) is the independen variable
\(a\) is the minimum value that can be obtained
\(d\) is the maximum value that can be obtained
\(c\) is the point of inflection
\(b\) is the Hill’s coefficient (e.g. the negative slope at the inflection point)
The output of fit_drc_4p()
provides extensive information on the goodness of fit. Furthermore, the function filters and ranks fits based on several criteria which include completeness of data and a significance cutoff based on an adjusted p-value obtained from ANOVA. For details about the filtering steps you can read the function documentation by calling ?fit_drc_4p
. If you only care about your potential hits (and exclude precursors/peptides/proteins that do not pass the filtering e.g. due to too little observations), you can choose filter = "pre"
, which filters the data before model fitting. This speeds up the process because less models need to be fit. If you want to perform enrichment analysis later on you should keep the default filter = "post"
. This will fit all possible models and then only annotate models based on whether they passed or failed the filtering step (passed_filter
).
<- data_normalised %>%
fit fit_drc_4p(sample = r_file_name,
grouping = eg_precursor_id,
response = normalised_intensity_log2,
dose = r_condition,
filter = "post",
retain_columns = c(pg_protein_accessions)) # make sure to retain columns that you need later but that are not part of the function
If you have to fit many models this function may take about an hour or longer. If you have the computational resources you can speed this up by using parallel_fit_drc_4p()
, a parallelised version of this function.
Note: Keep in mind that by spreading the model fitting over multiple cores, more memory than usual is needed. Furthermore, using the maximal number of available cores may not be the most efficient, because it takes time to copy the data to and from each core - this process will not be performed in parallel. Therefore, it is advised that the run time on each core is at least as long as the time to set up (data transfer, environment preparation etc.) all cores.
# setup of cores. Make sure you have the future package installed
::plan(future::multiprocess, workers = 3)
future
# fit models in parallel
<- data_normalised %>%
parallel_fit parallel_fit_drc_4p(sample = r_file_name,
grouping = eg_precursor_id,
response = normalised_intensity_log2,
dose = r_condition,
retain_columns = c(pg_protein_accessions),
n_cores = 3)
# remove workers again after you are done
::plan(future::sequential) future
If we have a look at all precursors based on their rank (calculated from correlation and ANOVA q-value), we can see that as expected most of them are FKBP12 (P62942) peptides. One benefit of analysing LiP-MS data on the precursor level (rather than peptide level) is that multiple lines of evidence for a change in the specific peptide can be used (we have data for each different charge and modification state). Therefore, it is always good to check if there are other precursors of a good scoring peptide that do not show any regulation at all. This would mean that the reason for the observed response is not based on a biological effect.
rank | score | eg_precursor_id | pg_protein_accessions | anova_adj_pval | correlation | ec_50 |
---|---|---|---|---|---|---|
1 | 0.919 | VFDVELLKLE.2 | P62942 | 3.98e-13 | 0.967 | 3.6e+06 |
2 | 0.914 | RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.3 | P62942 | 4.73e-14 | 0.947 | 3.0e+05 |
3 | 0.888 | GWEEGVAQMSVGQR.2 | P62942 | 1.08e-13 | 0.947 | 4.7e+05 |
4 | 0.823 | GWEEGVAQ.1 | P62942 | 2.47e-10 | 0.976 | 7.8e+06 |
5 | 0.798 | LVFDVELLK.2 | P62942 | 6.00e-11 | 0.959 | 4.4e+06 |
6 | 0.778 | GWEEGVAQMSVGQR.3 | P62942 | 1.71e-11 | 0.944 | 7.7e+05 |
7 | 0.775 | RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.4 | P62942 | 1.05e-12 | 0.926 | 5.7e+05 |
8 | 0.719 | YTGMLEDGK.2 | P62942 | 4.19e-10 | 0.945 | 3.5e+06 |
9 | 0.703 | LTISPDYAYGAT.2 | P62942 | 1.02e-09 | 0.946 | 3.0e+06 |
10 | 0.693 | RGQTC[Carbamidomethyl (C)]VVH.2 | P62942 | 1.02e-09 | 0.943 | 7.6e+06 |
11 | 0.623 | LVFDVELLKLE.2 | P62942 | 1.19e-08 | 0.934 | 5.2e+06 |
12 | 0.550 | VFDVELLK.2 | P62942 | 8.99e-09 | 0.908 | 3.0e+06 |
13 | 0.500 | VVHYTGMLEDGK.3 | P62942 | 3.26e-08 | 0.900 | 6.7e+06 |
14 | 0.405 | RGQTC[Carbamidomethyl (C)]VVHYTGM[Oxidation (M)]LEDGK.3 | P62942 | 1.37e-04 | 0.916 | 4.3e+05 |
15 | 0.381 | VVHYTGMLEDGK.2 | P62942 | 9.64e-07 | 0.880 | 7.4e+06 |
16 | 0.230 | GWEEGVAQM.2 | P62942 | 8.26e-06 | 0.843 | 1.6e+06 |
17 | 0.143 | DTVATQLSEAVDATR.2 | O60664 | 4.89e-04 | 0.838 | 4.8e+07 |
18 | 0.100 | DYFEEYGKIDTIEIITDR.3 | P22626 | 1.39e-04 | 0.816 | 3.2e+07 |
19 | 0.000 | RGQTC[Carbamidomethyl (C)]VVHYTGMLEDGK.2 | P62942 | 1.24e-02 | 0.811 | 1.3e+05 |
Since we know the true positive binding partner of Rapamycin to be FKBP12 we can plot the true positive rate based on the rank each precursor received.
It may happen that many precursors pass the filtering threshold and are thus potentially affected by the treatment. One possible reason for this can be that filtering thresholds are not very strict. If you have information about real targets of your treatment, you can set a manual rank cutoff based on the true positive rate evaluation. Another likely explanation is that other precursors/peptides/proteins may also be affected by the treatment. This can happen through direct interaction with the treatment or as a secondary effect: the potential hit might be an interaction partner of your target and a change in your target might also lead to a change the potential hit. Precursors that have a low rank with not many true positive hits in their vicinity might not be actual hits. You can set a manual rank cutoff at a point where the true positive rate flattens.
If the true positive rate does not drop and almost all of your hits are true positives, you can try to reduce the correlation cutoff in fit_drc_4p()
to be even less conservative. This might allow you to find novel interaction partners. If the above true positive rate evaluation would have been observed for the complete Rapamycin data set, such a step might have been a good practice. However, in the complete data set a lot more precursors pass the filtering and the true positive rate drops after about the first 25 ranks. This would mean that many of the additionally ranked precursors could be false positive interactions.
Note: This kind of analysis is only suited for experiments where you know the true interaction partner. In most analyses you will not know the target (or all targets) of your treatment so you might want to use this kind of analysis with caution.
After model fitting, you can easily visualise model fits by plotting them with the plot_drc_4p()
function. You can provide one or more precursor IDs to the targets
argument or just return plots for all precursors with targets = "all"
. Make sure that the amount of plots is reasonable to return within R. You can also directly export your plots by setting export = TRUE
.
Note: The output of fit_drc_4p()
includes two columns with data frames containing the model and original data points required for the plot. These data frames contain the original names for dose
and response
that you provided to fit_drc_4p()
.
# Model plotting
plot_drc_4p(fit,
grouping = eg_precursor_id,
dose = r_condition,
response = normalised_intensity_log2,
targets = "_VFDVELLKLE_.2",
unit = "pM",
export = FALSE)
After models are fit, you can have a deeper look into the hits you obtained. protti provides several functions that help you to identify patterns in your data.
You can test your data set for gene ontology (GO) term enrichment (go_enrichment()
). Alternatively, pathway annotations provided in the KEGG database can be used to test for pathway enrichments (kegg_enrichment()
).
If you know which proteins bind or interact with your specific treatment, you can provide your own list of true positive hits and check if these are enriched in your significant hits by using protti’s treatment_enrichment()
function. For our LiP-MS experiment using Rapamycin we are probing direct interaction with proteins in contrast to functional effects. The only protein that Rapamycin is known to bind to is FKBP12 and testing the significance of enrichment for a single or even a few proteins does not make much sense. However, testing for enrichment is especially useful if your treatment affects many proteins, since it can help you to reduce the complexity of your result.
The STRING database provides a good resource for the analysis of protein interaction networks. It is often very useful to check for interactions within your significant hits. For LiP-MS data this sometimes explains why proteins that do not directly interact with your treatment are still significantly affected. With network_analysis()
, protti provides a useful wrapper around some STRINGdb
package functions.
protti has several functions for fetching database information. One commonly used database in the proteomics community is UniProt. By providing the fetch_uniprot()
function with a character vector of your protein IDs, you can fetch additional information about the proteins. These include names, IDs from other databases, gene ontology annotations, information about the active metabolite and metal binding sites, protein length and sequence. You can freely select which information you are most interested in by providing it to the columns
arguments of the fetch_uniprot()
function based on the information they provide for download.
# fetching of UniProt information
<- unique(fit$pg_protein_accessions)
unis <- fetch_uniprot(unis)
uniprot
# annotation of fit data based on information from UniProt
<- fit %>%
fit_annotated left_join(uniprot, by = c("pg_protein_accessions" = "id")) %>% # columns containing proteins IDs are named differently
mutate(binds_treatment = pg_protein_accessions == "P62942") # create new column with prior knowledge about binding partners of treatment
As mentioned earlier, for this specific data set enrichment and also network analysis does not make a lot of sense. This is mainly because we only use a very small random subset of the complete data set. Furthermore, Rapamycin is known to only interact with FKBP12 and enrichment analysis will not properly work if your treatment only interacts with a few protein.
Nevertheless, we will demonstrate below how you could use some additional functions on your own data.
### GO enrichment using "molecular function" annotation from UniProt
go_enrichment(fit_annotated,
protein_id = pg_protein_accessions,
is_significant = passed_filter,
go_annotations_uniprot = go_molecular_function) # column obtained from UniProt
### KEGG pathway enrichment
# First you need to load KEGG pathway annotations from the KEGG database for your specific organism of interest.
# In this case HeLa cells were used, therefore the organism of interest is homo sapiens (hsa)
<- fetch_kegg(species = "hsa")
kegg
# Next we need to annotate our data with KEGG pathway IDs and perform enrichment analysis
%>%
fit left_join(kegg, by = c("pg_protein_accessions" = "uniprot_id")) %>% # columns containing proteins IDs are named differently
kegg_enrichment(protein_id = pg_protein_accessions,
is_significant = passed_filter,
pathway_id = pathway_id, # column name from kegg data frame
pathway_name = pathway_name) # column name from kegg data frame
### Treatment enrichment analysis
treatment_enrichment(fit_annotated,
protein_id = pg_protein_accessions,
is_significant = passed_filter,
binds_treatment = binds_treatment,
treatment_name = "Rapamycin")
### Network analysis
%>%
fit_annotated filter(passed_filter == TRUE) %>% # only analyse hits that were significant
network_analysis(protein_id = pg_protein_accessions,
string_id = database_string, # column from UniProt containing STRING IDs
organism_id = 9606, # tax ID can be found in function documentation or STRING website
binds_treatment = binds_treatment,
plot = TRUE)