readyomics

readyomics streamlines single- or multi-omics data analysis pipelines.

The starting input should be a pre-processed raw omics data matrix (e.g. gene counts, metabolites or protein abundances, etc.) and a table of related sample information to be used for additional processing and statistical analysis.

readyomics splits processing and statistical analysis steps into different functions. This has the following advantages:

Example

The following example shows the overall workflow of readyomics.

The example data is a 16S rRNA gene sequencing metataxonomics dataset from human fecal samples. They are a subset of 29 samples from a T2DM prospective cohort screened for MASLD:

Forlano, R.*, Martinez-Gili, L.*, Takis, P., Miguens-Blanco, J., Liu, T., Triantafyllou, E., … Manousou, P. (2024). Disruption of gut barrier integrity and host–microbiome interactions underlie MASLD severity in patients with type-2 diabetes mellitus. Gut Microbes, 16(1). https://doi.org/10.1080/19490976.2024.2304157

Example input files can be found in the inst/extdata folder of the readyomics source, and can be imported using base::system.file() command as shown below in Import and inspect data.

Note: this example is meant to showcase basic package usage. It is not meant to provide guidelines on suitable study-specific choices for data processing and analysis.

Installation

install.packages("readyomics")

Load the package

library(readyomics)

Import and inspect data

# Raw matrix of ASV counts
asv_counts <- read.csv(system.file("extdata", "asv_raw_counts.csv", package = "readyomics"), 
                       check.names = FALSE, 
                       row.names = 1)

# Taxonomy table
taxa <- read.csv(system.file("extdata", "taxonomy.csv", package = "readyomics"), 
                 check.names = FALSE, 
                 row.names = 1)

# Sample data
sample_data <- read.csv(system.file("extdata", "sample_data.csv", package = "readyomics"), 
                        check.names = FALSE, 
                        row.names = 1)
head(asv_counts)
head(taxa)
head(sample_data)

Format sample_data

sample_data must include a column named sample_id with unique IDs for each sample in the study. For longitudinal studies, sample_id still needs to be unique, therefore a column such as subject_id can then be used to indicate samples coming from the same individual.

The omics X data matrix must also use the same sample_id labels as row names, so that they can be matched before analysis. readyomics always checks sample ID matching.

In addition, categorical variables must be set as factors for downstream statistical analyses.

sample_data <- sample_data |>
  dplyr::mutate(sample_id = rownames(sample_data),
         groups_liver = factor(groups_liver, levels = c("FL", "FL_HS")),
         sex = factor(sex),
         PPI = factor(PPI),
         smoking = factor(smoking),
         alcohol_imp = factor(alcohol_imp))

Process data using process_ngs()

The choice of processing function will depend on the type of omics being analysed. We will use process_ngs() in the example, as it is suitable for metataxonomics.

See also process_ms() if you need to process mass spectrometry (MS) data like metabolomics or proteomics. process_ms() is compatible with nuclear magnetic resonance spectroscopy (NMR) data, although with limited processing options.

# Samples as rows required for process_ngs
asv_counts <- t(asv_counts)

# Process asv_counts data
asv_ready <- process_ngs(X = asv_counts, 
                         sample_data = sample_data, 
                         taxa_table = taxa, 
                         normalise = "none",
                         transform = "clr", 
                         eco_phyloseq = FALSE)

When verbose = TRUE we can see that some useful information is printed in the console (see process_ngs() for all settings and parameters options). For example, that a phylogenetic tree has not been provided, that 1 sample has been removed due to falling under the default 500 reads threshold for sequencing depth, and that 372 features were kept after filtering for 10 % prevalence.

process_ngs() generates a list object that contains:

We will see how to apply the processed data for common omics analysis.

PCA visualisation with mva()

pca <- mva(X = asv_ready$X_processed, 
           sample_data = asv_ready$sdata_final, 
           group_colour = "groups_liver", 
           plot_title = "Beta diversity (Aitchison)")

By default, mva() will show a PCA diagnostics plot, which is very useful to examine potential outliers, batch effects and clusters related to the groups of interest.

In addition to storing the ropls::opls() object, mva() generates a publication-ready scores plot, in this case color-coded by the variable groups_liver as we indicated:

pca$scores_plot

Because the plot is generated using ggplot2, it can be further tweaked using ggplot2 functions before export.

Permutational analysis of variance with permanova()

PERMANOVA is quite popular in microbiome datasets to inspect differences in beta diversity, but it can also be applied to any type of omics data.

By default, the function sets parameter indenpendent = TRUE, meaning permanova() also calculates the overall variance for each covariable in the model independently, in addition to the vegan::adonis2() output.

# Define model
rhs_model <- ~ groups_liver + alcohol_imp + sex + age + PPI + smoking

# Run PERMANOVA
pova <- permanova(X = asv_ready$X_processed, 
                  sample_data = asv_ready$sdata_final,
                  formula_rhs = rhs_model, 
                  platform = "ngs", 
                  assay = "Taxonomy", 
                  seed = 165)

We see a warning and related verbose message (when verbose = TRUE; the default) about permutation parameters being used for all the variables. This is because you can set up specific permutation control parameters for each covariable with perm_control. This is particularly useful for example in longitudinal studies, where you might need to apply permutation blocks according to certain variables. In our case, we are good with the default 999 unrestricted random permutations, so we can ignore the warning and verbose. See permanova() for more information.

The output is a list that contains the distance matrix and permutation matrix used, as well as two tables of results permanova_joint, which is the default vegan::adonis2() result, and permanova_indep, when independent = TRUE.

pova$permanova_joint
pova$permanova_indep

Differential ANAlysis with dana()

A common analysis goal with omics data is to try and find omic features associated with traits of interest. dana() is readyomics function for differential analysis. At present, dana() supports linear regressions with stats::lm() and linear mixed effects models with lme4::lmer(). dana() main strengths are:

We will fit a linear regression with the main variables in our dataset:

# future::plan(multisession, workers = 4) # For running 4 processes in parallel
# progressr::handlers(global = TRUE) # To show progress bar
dana_asv <- dana(X = asv_ready$X_processed, 
                 sample_data = asv_ready$sdata_final,
                 formula_rhs = rhs_model, 
                 term_LRT = "groups_liver",
                 platform = "ngs")
# plan(sequential) # To disable parallel processing

We first see a warning message suggesting to use parallel computing due to the X matrix having more than 100 features. However this can be ignored as stats::lm() is very quick anyway.

The dana() function generates a dana object including the input data used, and the publication-ready fit table summarizing the main model results for each feature analysed ("feat_id" column).

head(dana_asv$fit)

If working with multiple omics, is useful to also specify platform and assay columns for seamless downstream omics results integration.

When one or more terms are specified for likelihood ratio testing with term_LRT, dana also includes the lrt table with the LRT results. In our case we used LRT to test groups_liver.

head(dana_asv$lrt)

Adjusting for multiple comparisons with adjust_pval()

We leverage R base pipes to conveniently add adjusted P values to the dana object results with adjust_pval().

dana_asv <- dana_asv |>
  adjust_pval(padj_by = "terms", 
              padj_method = "BH", 
              padj_method_LRT = "BH")

The fit table now has new columns with the adjusted P values. Because we chose to adjust by terms in the fit table (padj_by = "terms"), this will adjust each covariable P value individually. We see more than one column of adjusted P values with the format padj_[padj_method]_[term] (e.g. "padj_BH_sex2").

The alternative would be (padj_by = "all") in which all nominal P values in fit will be adjusted together. In this case, a single column would be added to fit, with the format padj_[padj_method].

If a formula fixed term was also tested via LRT, adjust_pval() will also add the adjusted LRT P value to the fit table. LRT P values have the "_LRT" suffix (see last column "padj_BH_groups_liver_LRT").

head(dana_asv$fit)

Similarly, we can inspect the lrt table with the added adjusted P values.

head(dana_asv$lrt)

While a single adjustment method should be chosen a priori, it is possible to add more than one P value adjustment method (e.g. padj_method = c("BH", "storey"), padj_method_LRT = c("BH", "bonferroni")). See adjust_pval() for more information.

Adding feature labels with add_feat_name() or add_taxa()

Original labels for some omics data can contain non-alphanumeric characters and symbols that could be problematic to fit dana() models. process_ms() allows to rename features as "feat_1", "feat_2", and so on, by setting rename_feat = TRUE, while storing the original labels.

add_feat_name() can then be used to add the original labels to the dana object before plotting significant results.

Similarly, for metataxonomics and metagenomics data, it is more interesting to read the genus/species/strain of origin rather than a cryptic "ASV1". In this case, we would use add_taxa(). Which also adds the corresponding higher hierarchy taxa names.

dana_asv <- dana_asv |>
  add_taxa(taxa_table = taxa, 
           taxa_rank = "asv")

We can see that various columns have been added to the fit table. Because taxa_rank = "asv", the column for the feature label taxon_name by default adds the species (if provided) or the genus name to the ASV ID, collapsed by “_“, a format quite common in publication plots.

head(dana_asv$fit)

Visualise results with ready_plots()

We can pipe ready_plots() to the previous command, or skip the previous step and directly apply ready_plots() to the dana object to explore the results with the default feature IDs ("feat_id" column).

ready_plots() automatically generates and stores the most commonly used publication-ready plots:

  • Differential analysis significant coefficients (volcano, dotplot, heatmap - capped at 50 features),
  • Significant features abundance (capped at the 10 most significant, but plots for specific features can be requested with X_colnames parameter).
# Generates warning due to lack of significant features:
dana_asv <- dana_asv |>
  ready_plots(term_name = "groups_liver", # Formula fit term of interest
              pval_match = "groups_liver_LRT", # LRT adjusted P values will be used
              sdata_var = "groups_liver", # Grouping variable for individual feature plots
              group_colours = c(FL = "#4daf4a", # Optional color customization
                                FL_HS = "#377eb8"))

In our example, no features were significant after multiple comparison correction. This prompts a warning message from ready_plots() which returns the dana object with an empty plots list. This is somewhat expected as we are using a small portion of the original dataset as a toy example.

A formal analysis would end here, concluding that the null hypothesis could not be rejected. In our case, we will use the nominal P value to showcase the ready_plots() functionality.

dana_asv <- dana_asv |>
  ready_plots(term_name = "groups_liver",
              pval_match = "Pr", # Nominal P value for illustration purposes
              sdata_var = "groups_liver", 
              group_colours = c(FL = "#4daf4a", 
                                FL_HS = "#377eb8"))

When verbose = TRUE (the default) ready_plots() displays the total number of significant features at the chosen P value threshold. We can inspect all plots at once by simply running:

dana_asv$plots

Because plots are generated with ggplot2, they can easily be further customized.

Bonus (microbiome): dana() for alpha diversity

# Compute alpha diversity measures for the ASV phyloseq object
alpha <- phyloseq::estimate_richness(asv_ready$phyloseq_raw$asv) |>
  dplyr::select(-matches("ace|chao|fisher")) |>
  scale()

# Calculate alpha diversity differences among groups with dana()
dana_alpha <- dana(X = alpha, 
                   sample_data = asv_ready$sdata_final,
                   formula_rhs = rhs_model, 
                   term_LRT = "groups_liver",
                   platform = "ngs")

Bonus (microbiome): dana() for higher taxonomic ranks

The following example shows only the code for genus rank, but this code could be easily wrapped into a function that iterates across all taxa ranks to analyse all at once, using parallel computation with future::plan().

# CLR-transform genus rank counts table
genus_ready <- process_ngs(X = as.data.frame(asv_ready$phyloseq_raw$genus@otu_table), 
                           sample_data = asv_ready$sdata_final, 
                           taxa_table = taxa, 
                           normalise = "none",
                           transform = "clr",
                           raw_phyloseq = FALSE,
                           eco_phyloseq = FALSE,
                           verbose = FALSE)

# Run dana
dana_genus <- dana(X = genus_ready$X_processed, 
                   sample_data = genus_ready$sdata_final,
                   formula_rhs = rhs_model, 
                   term_LRT = "groups_liver",
                   platform = "ngs") |>
  adjust_pval(padj_by = "terms", 
              padj_method = "BH", 
              padj_method_LRT = "BH") |>
  add_taxa(taxa_table = taxa, 
           taxa_rank = "genus") |>
  ready_plots(term_name = "groups_liver",
              pval_match = "Pr", # Nominal P value for illustration purposes
              sdata_var = "groups_liver",
              group_colours = c(FL = "#4daf4a",
                                FL_HS = "#377eb8"))

# Inspect plots
dana_genus$plots