Practical applications of learned concepts in R

Construct a phyloseq Object

Here, we show how to construct the phyloseq object based on the taxonomic table, the ASV table, and the metadata. We use the 88 soil dataset (Lauber et al. 2009) obtained from the gglasso (Schaipp, Vlasovets, and Müller 2021) tutorial. The raw data can be downloaded from the GitHub repositories. According the to tutorial the data is already processed in the following way:

  1. Filter for OTUs with minimum abundance of \(100\) and

  2. add pseudo-count of \(1\).

Data sources:

  • Taxonomy: https://github.com/Vlasovets/GGLasso/blob/cfbf01535c88bbcd3ba60f24b5b867472d549f89/data/soil/original/88soils_taxonomy.txt

  • OTU table: https://github.com/Vlasovets/GGLasso/blob/master/data/soil/processed/soil_116.csv

  • Metadata: https://github.com/Vlasovets/GGLasso/blob/master/data/soil/processed/ph.csv

A detailed tutorial on constructing a phyloseq object can be found on the official phyloseq website.

We will need 2 packages for analysis:

  1. phyloseq for the data structure and

  2. tidyverse for data wranging.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(phyloseq)

Load the data

Read the different data tables.

# The OTU table contains rownames in the column X
otu_raw <- read.csv("./data/soil_raw/soil_116.csv", header = TRUE, row.names = "X")
tax_raw <- read.table("./data/soil_raw/88soils_taxonomy.txt", 
                      header = TRUE, sep = "\t")
# The ph data also contains rownames in X.SampleID
ph <- read.csv("./data/soil_raw/ph.csv", header = TRUE, row.names = "X.SampleID")

Print the data.

head(otu_raw[, 1:6])
        X1124701 X697997 X203969 X205391 X843189 X3431064
103.CA2       16       3       1       1       1        4
103.CO3       15       5       1       1       1        2
103.SR3        2       1       1       1       1        2
103.IE2        9       1       1       2       1        1
103.BP1       14      68       1       1       1        4
103.VC2        8       1       1       2       1        8

The OTU data looks good, but we must fix the column names since they start with X.

head(tax_raw)
  Feature.ID
1    1000512
2    1000547
3    1000654
4    1000757
5    1000876
6    1001333
                                                                                                      Taxon
1                     k__Bacteria;p__Actinobacteria;c__Thermoleophilia;o__Gaiellales;f__Gaiellaceae;g__;s__
2          k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__
3     k__Bacteria;p__Bacteroidetes;c__Sphingobacteriia;o__Sphingobacteriales;f__Sphingobacteriaceae;g__;s__
4          k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__;s__
5 k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__Nocardioides;s__
6                       k__Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__;s__
dim(tax_raw)
[1] 7396    2

The taxonomic information is for the unfiltered OTUs; therefore, we need to filter this file as well. We also need to divide the data frame into seven taxonomic levels instead of one string for phyloseq.

head(ph)
          ph
103.BZ1 5.12
103.CR1 8.00
103.GB2 7.57
103.GB3 7.18
103.GB1 6.84
103.GB5 8.22

This data frame looks fine.

Transform the data

Remove the X in the column name so they match with the taxonomy data.

colnames(otu_raw) <- str_remove_all(colnames(otu_raw), pattern = "X")

Filter the taxonomic information and split the data into the different taxonomic ranks.

tax_raw <- tax_raw %>%
  filter(Feature.ID %in% colnames(otu_raw))

# Sanity check: do the dimensions match? 116 Taxonomic assignments
# and 116 OTUs.
dim(tax_raw)
[1] 116   2
dim(otu_raw)
[1]  89 116
taxonomic_ranks <-
  c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "Species")

tax_raw <- tax_raw %>%
  separate_wider_delim(., cols = "Taxon",
                       delim = ";", names = taxonomic_ranks)

# Phyloseq needs the OTU ids as rownames
tax_raw <- tax_raw %>%
  column_to_rownames("Feature.ID")

Construct the phylo object

We need the otu table and taxonomic table in matrix form to construct the phylo object. Also, we need to remove the pseudo-count of 1

# remove pseudo-count of 1
otu_raw <- otu_raw - 1
# transform to matrix
otu <- as.matrix(otu_raw)
tax <- as.matrix(tax_raw)

Construct the objects for the final phyloseq object

# We have taxa in the columns so we need to specifiy this
otu <- otu_table(otu, taxa_are_rows = FALSE)
tax <- tax_table(tax)
sample_df <- sample_data(ph)

Combine everything in the phylo object.

soil <- phyloseq(otu, tax, sample_df)

Finally, save the data.

saveRDS(soil, file = "./data/soil_processed/soil.RDS")

Computational environment

sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Germany.utf8  LC_CTYPE=English_Germany.utf8   
[3] LC_MONETARY=English_Germany.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Germany.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] phyloseq_1.40.0 lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
 [5] dplyr_1.1.1     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
 [9] tibble_3.2.1    ggplot2_3.4.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] Biobase_2.56.0         jsonlite_1.8.4         splines_4.2.3         
 [4] foreach_1.5.2          stats4_4.2.3           GenomeInfoDbData_1.2.8
 [7] yaml_2.3.7             pillar_1.9.0           lattice_0.20-45       
[10] glue_1.6.2             digest_0.6.31          XVector_0.36.0        
[13] colorspace_2.1-0       htmltools_0.5.5        Matrix_1.5-3          
[16] plyr_1.8.8             pkgconfig_2.0.3        zlibbioc_1.42.0       
[19] scales_1.2.1           tzdb_0.3.0             timechange_0.2.0      
[22] mgcv_1.8-42            generics_0.1.3         IRanges_2.30.0        
[25] withr_2.5.0            BiocGenerics_0.42.0    cli_3.6.1             
[28] survival_3.5-3         magrittr_2.0.3         crayon_1.5.2          
[31] evaluate_0.20          fansi_1.0.4            nlme_3.1-162          
[34] MASS_7.3-58.2          vegan_2.6-4            tools_4.2.3           
[37] data.table_1.14.8      hms_1.1.3              lifecycle_1.0.3       
[40] Rhdf5lib_1.18.2        S4Vectors_0.34.0       munsell_0.5.0         
[43] cluster_2.1.4          Biostrings_2.64.0      ade4_1.7-22           
[46] compiler_4.2.3         GenomeInfoDb_1.32.2    rlang_1.1.0           
[49] rhdf5_2.40.0           grid_4.2.3             RCurl_1.98-1.12       
[52] iterators_1.0.14       rhdf5filters_1.8.0     biomformat_1.24.0     
[55] rstudioapi_0.14        htmlwidgets_1.6.2      igraph_1.4.1          
[58] bitops_1.0-7           rmarkdown_2.21         gtable_0.3.3          
[61] codetools_0.2-19       multtest_2.52.0        DBI_1.1.3             
[64] reshape2_1.4.4         R6_2.5.1               knitr_1.42            
[67] fastmap_1.1.1          utf8_1.2.3             permute_0.9-7         
[70] ape_5.7-1              stringi_1.7.12         parallel_4.2.3        
[73] Rcpp_1.0.10            vctrs_0.6.1            tidyselect_1.2.0      
[76] xfun_0.38             

References

Lauber, Christian L, Micah Hamady, Rob Knight, and Noah Fierer. 2009. “Pyrosequencing-Based Assessment of Soil pH as a Predictor of Soil Bacterial Community Structure at the Continental Scale.” Applied and Environmental Microbiology 75 (15): 5111–20.
Schaipp, Fabian, Oleg Vlasovets, and Christian L. Müller. 2021. “GGLasso - a Python Package for General Graphical Lasso Computation.” Journal of Open Source Software 6 (68): 3865. https://doi.org/10.21105/joss.03865.