7. Functional analysis
Sequedex not only places each 10-mer on a phylogenetic tree, but also searches 962 example sets of genes for a functional assignment. For the seed_0911.m1 set of functional assignments, we used the SEED classification of functions, which has the added benefit of a well-defined hierarchical rollup. The names and sets of genes are enumerated in the section Definition of functional classifications, where clicking on each label provides the annotation for each gene included (across the phylogeny) in the subsystem. For the ribosome, category si_0962, both the large and small subunits from across the 1550 species in the tree of life were translated into all three forward reading frames to make 10-mer ‘amino acid’ signatures, while the bacteria and archaea also had the tRNAs translated in a similar fashion. If a gene with a 10-mer signature appears in multiple categories, or a metagenomic read contains 10-mers from different genes in different categories, the read is apportioned equally among all of those categories.
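The equal-apportionment rule can be illustrated with a small base R sketch. This is not Sequedex's internal code; the read names, the category hits, and the helper `apportion` are hypothetical, and the sketch assumes that each read carries a total weight of one, split evenly over the distinct categories it hits.

```r
# Hypothetical toy input: for each read, the categories hit by its 10-mers.
read_hits <- list(
  read1 = c("si_0042"),
  read2 = c("si_0707", "si_0851"),
  read3 = c("si_0707", "si_0851", "si_0962")
)

# Assumption: each read has weight 1, split equally among its distinct categories.
apportion <- function(hits) {
  cats <- unique(hits)
  setNames(rep(1 / length(cats), length(cats)), cats)
}

# Collect the per-read weights and sum them by category.
weights <- unlist(unname(lapply(read_hits, apportion)))
category_counts <- tapply(weights, names(weights), sum)
print(round(category_counts, 3))
```

Because every read contributes exactly one unit in total, the category counts always sum to the number of reads, which keeps later normalizations comparable across samples.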
In addition to the genomic (DNA) metagenomic data sets with the phylogenetic profiles examined in the previous chapter:
- the set of synthetic data from reference genomes, with labels,
- a set of environmental microbiomes, with labels, and
- a set of human microbiomes, with labels,
we provide transcriptomic (RNA) data sets from publicly available sources to illustrate the comparisons:
- a set of human tissue-specific expression data, with labels,
- a set of marine eukaryotic transcriptomes, with labels, and
- a set of transcriptomes from plaque microbiomes, with labels.
7.1. Visualizing SEED functional profiles with Sequestat
Graphical visualization can be done with Sequestat and the igraph package in R. You will first need to download the functional definitions (or use any what-Life2550-xxGB.0xseed_0911.m1.tsv file). Here, we assume both it and sequestat.r are available in the user’s home directory (~/), and that the working directory of the R session is a directory containing Sequedex output files. We also assume the user is using the .Rdata file with the graph layout in it. Some of the commands below use paths under ~/Sequedex-docs/dl/; adjust them to match where the files were saved. Since we often do not want Ribosomal or Unclassified read counts to skew normalizations when plotting counts, we set them to zero. Analysis of all of the data sets will start with the following:
source("~/Sequedex-docs/dl/sequestat.r")
library(igraph)
load("~/Sequedex-docs/dl/seed.layout.Rdata")
This loads the graph layout shown below. We have added labels to this figure to aid in reading the later graphs, which omit the labels.
Reading in particular Sequedex output files will begin with:
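# Read the Sequedex output files found in the working directory
expt <- Read.functional("./", "~/Sequedex-docs/dl/what", type.ref="Life2550-40", data.type="what")
# Attach the saved SEED graph layout
expt$layout=seed.layout
# Zero the two rows excluded from normalization (Ribosomal and Unclassified, see above)
expt$data[963,]=0
expt$data[964,]=0
# Plot the graph for the first sample
plot.graph(expt,1, simple.name= T,scol= "blue",sign= F, dimension= 2,cex=0.6)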
To read in the tsv data sets listed above, download them and read them into the data structures as follows:
source("~/Sequedex-docs/dl/sequestat.r")
library(igraph)
# Synthetic data from reference genomes
syn=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/syn.Life.what",sep="\t",header=F)
syn$data=data
load("~/seed.layout.Rdata")
syn$layout=seed.layout
syn$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/syn.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(syn$data)=lbl
plot.graph(syn,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(syn,1,2, dim=2, alpha= 0.000000001)
# Environmental microbiomes
env=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/env.Life.what",sep="\t",header=F)
env$data=data
load("~/seed.layout.Rdata")
env$layout=seed.layout
env$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/env.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(env$data)=lbl
plot.graph(env,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(env,1,2, dim=2, alpha= 0.000000001)
# Human microbiomes
hmb=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/hmb.Life.what",sep="\t",header=F)
hmb$data=data
load("~/seed.layout.Rdata")
hmb$layout=seed.layout
hmb$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/hmb.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(hmb$data)=lbl
plot.graph(hmb,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(hmb,1,2, dim=2, alpha= 0.000000001)
# Human tissue-specific expression
hatlas=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/hatlas.Life.what",sep="\t",header=F)
hatlas$data=data
load("~/seed.layout.Rdata")
hatlas$layout=seed.layout
hatlas$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/hatlas.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(hatlas$data)=lbl
plot.graph(hatlas,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(hatlas,1,2, dim=2, alpha= 0.000000001)
# Marine eukaryotic transcriptomes
ocean=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/ocean.Life.what",sep="\t",header=F)
ocean$data=data
load("~/seed.layout.Rdata")
ocean$layout=seed.layout
ocean$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/ocean.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(ocean$data)=lbl
plot.graph(ocean,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(ocean,1,2, dim=2, alpha= 0.000000001)
# Transcriptomes from plaque microbiomes
caries=Read.graph("~/Sequedex-docs/dl/what", col=TRUE, sat=1)
data=read.table("~/Sequedex-docs/dl/caries.Life.what",sep="\t",header=F)
caries$data=data
load("~/seed.layout.Rdata")
caries$layout=seed.layout
caries$data[963,]=0
lbl=read.table("~/Sequedex-docs/dl/caries.Life.lbl",sep="\t",header=F)
lbl=lbl[!is.na(lbl)]
colnames(caries$data)=lbl
plot.graph(caries,1, simple.name= F,scol= "blue",sign= T, dimension= 2,cex=0.6)
Diff.graph(caries,1,2, dim=2, alpha= 0.000000001)
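The label-handling steps repeated in each block above (reading the .lbl file, dropping NA entries, and assigning column names) can be tried on toy data in base R. This is only a sketch: the layout of the stand-in label text is an assumption, and the count table is made up.

```r
# Hypothetical stand-in for a *.lbl file: one tab-separated row of labels,
# padded with an NA field (the real file layout is an assumption here).
lbl_text <- "soil_1\tsoil_2\tocean_1\tNA"
lbl <- read.table(text = lbl_text, sep = "\t", header = FALSE)

# Indexing a data frame with a logical matrix returns a plain vector,
# so this drops the NA padding and flattens the labels in one step.
lbl <- lbl[!is.na(lbl)]

# Toy count table: 3 categories (rows) x 3 samples (columns)
data <- as.data.frame(matrix(1:9, nrow = 3))

# Guard against a label/column mismatch before renaming
stopifnot(length(lbl) == ncol(data))
colnames(data) <- lbl
print(data)
```

The length check is an addition for illustration; it fails early if the label file and the data file do not describe the same number of samples.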
Alternatively, the graph can be looked at in three dimensions:
hatlas$layout <- layout.fruchterman.reingold(hatlas, dim=3)
plot.graph(hatlas,Val = 3, simple.name= F,scol= "blue",sign= T, dimension= 3,cex=0.6)
It is frequently the case that the ribosomal reads dominate the functional classifications. To eliminate the ribosomal reads from consideration, enabling a re-scaling of the other categories, type:
hatlas$data[963,]=0
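The effect of zeroing a dominant row on the normalized values can be seen in a small base R sketch. The counts and the helper `fractions` are hypothetical, and column-wise normalization to fractions is an assumption about how the rescaling is done.

```r
# Hypothetical counts: rows are functional categories, columns are samples.
counts <- matrix(
  c(800, 900,   # ribosome (dominant)
     50,  20,   # category A
    150,  80),  # category B
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ribosome", "catA", "catB"), c("sample1", "sample2"))
)

# Column-wise fraction of reads in each category
fractions <- function(m) sweep(m, 2, colSums(m), "/")

before <- fractions(counts)

# Zero the dominant row (analogous to hatlas$data[963,]=0), then renormalize
masked <- counts
masked["ribosome", ] <- 0
after <- fractions(masked)

print(round(before, 3))
print(round(after, 3))
```

With the ribosomal row removed, the remaining categories fill the whole scale, so differences among them are easier to see in a plot.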
7.2. Top 100 functional categories
In this first section, we examine the functional categories with the most reads mapping to them, using DNA sequenced from seven groups of microbiomes and mRNA sequenced from two eukaryotic marine samples. Functional categories attract more reads for several reasons: first, because they may contain many genes (e.g., the citric acid cycle); second, because the genes may be highly conserved across numerous organisms (e.g., the ribosome); and third, because the genes may be highly expressed in a particular microbial environment (e.g., photosystem II). The seven DNA groups include three groups from the human microbiome data shown earlier: the skin (ear and nose), the mouth, and ‘cavities’, which includes stool samples and vaginal samples. They also include the environmental microbiomes, grouped into four categories: soils, fresh water, salt water, and fermented samples. The two transcriptome samples were the green algae and dinoflagellates from the eukaryotic marine algae transcriptome project, sponsored by the Moore Foundation and sequenced at NCGR.
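Pooling individual samples into groups such as skin, mouth, and cavities can be sketched in base R as follows. The sample names, counts, and the mapping `sample_group` are hypothetical, and summing counts within a group is an assumption; the grouping used for the plots may have been computed differently.

```r
# Hypothetical counts: rows are categories, columns are individual samples.
counts <- matrix(
  c(10, 20, 30, 40, 50,
     1,  2,  3,  4,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("catA", "catB"),
                  c("ear", "nose", "tongue", "stool", "vaginal"))
)

# Hypothetical mapping from sample to group
sample_group <- c(ear = "skin", nose = "skin", tongue = "mouth",
                  stool = "cavities", vaginal = "cavities")

# rowsum() sums rows by group, so transpose to put samples on rows,
# sum within each group, and transpose back to categories x groups.
grouped <- t(rowsum(t(counts), group = sample_group[colnames(counts)]))
print(grouped)
```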
The goal of this section is to provide an understanding of how the 100 most prevalent functional categories (roughly 10 percent of the total) are defined, how they are represented in a variety of microbial environments, and how they relate to some of the other community resources available for functional annotation of proteins. We plot them in groups of ten, with the y-axis labeled in parts per thousand, skipping the most prevalent functional category, si_0962, the large and small subunits of the ribosome.
SEED descriptions are available for many of these subsystems.
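A ranking of this kind can be sketched in base R: convert counts to parts per thousand, rank the categories, drop the most prevalent one, and split the rest into groups of ten. The counts below are simulated, the category names are placeholders, and ranking by the mean across groups is an assumption rather than the documented procedure.

```r
set.seed(1)

# Simulated counts: 35 placeholder categories (rows) x 3 sample groups (columns)
counts <- matrix(rpois(35 * 3, lambda = 50), nrow = 35,
                 dimnames = list(sprintf("cat_%02d", 1:35),
                                 c("soil", "mouth", "algae")))
counts["cat_01", ] <- 5000  # stand-in for the dominant ribosome category

# Parts per thousand within each sample group
ppt <- sweep(counts, 2, colSums(counts), "/") * 1000

# Rank categories by mean abundance across groups, then skip the top one
ranked <- names(sort(rowMeans(ppt), decreasing = TRUE))
ranked <- ranked[-1]

# Split the ranking into consecutive groups of ten for plotting
groups <- split(ranked, ceiling(seq_along(ranked) / 10))

# Values for the first group of ten
print(round(ppt[groups[[1]], ], 1))
```

Each element of `groups` could then be passed to a plotting call such as `barplot(t(ppt[groups[[1]], ]), beside = TRUE)` to draw one panel per group of ten.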
The ten most prevalent categories include:
si_0042; Methionine biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 270. The SEED description by Dmitry Rodionov describes a variety of pathways leading to methionine.
si_0068; Glycolysis and Gluconeogenesis
in the Carbohydrates rollup category. It can be found on KEGG map 10. The SEED description by Svetlana Gerdes and Ross Overbeek describes glycolysis and gluconeogenesis.
si_0660; Ribosome LSU bacterial
in the Protein Metabolism rollup category. It can be found on KEGG map 3010.
si_0645; Photosystem II
in the Photosynthesis rollup category. It can be found on KEGG map 195.
si_0707; Protein chaperones
in the Protein Metabolism rollup category. DnaJ is HSP-40 and DnaK is HSP-70. See the Wikipedia page and PMID 16952052.
si_0415; DNA replication
in the DNA Metabolism rollup category.
si_0673; Universal GTPases
in the Protein Metabolism rollup category. Caldon et al. suggest that the 11 universal GTPases are either necessary for ribosome function or transmit information from the ribosome to downstream targets for the purpose of generating specific cellular responses.
si_0791; F0F1 type ATP synthase
in the Respiration rollup category. It can be found on KEGG map 195.
si_0602; Purine conversions
in the Nucleosides and Nucleotides rollup category. It can be found on KEGG map 230.
si_0851; dnaK Heat shock cluster
in the Stress Response rollup category. DnaJ is HSP-40 and DnaK is HSP-70. See the Wikipedia page and PMID 16952052.
Categories 11-20 include:
si_0043; Methionine Degradation
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 270.
si_0279; Bacterial Cell Division
in the Clustering-based subsystems rollup category. It can be found on KEGG map 4112.
si_0640; Phosphate metabolism
in the Phosphorus Metabolism rollup category. Some description can be found in Gebhard et al.
si_0600; De Novo Purine Biosynthesis
in the Nucleosides and Nucleotides rollup category. It can be found on KEGG map 230.
si_0448; FAS II; Fatty Acid Biosynthesis
in the Fatty Acids, Lipids, and Isoprenoids rollup category. It can be found on KEGG map 61 for biosynthesis and KEGG map 62 for elongation. The SEED description by Andrei Osterman describes fatty acid biosynthesis through FAS II, which is largely homologous to FAS I.
si_0060; Calvin-Benson cycle
in the Carbohydrates rollup category. It can be found on KEGG map 710.
si_0122; Serine-glyoxylate cycle
in the Carbohydrates rollup category. It can be found on KEGG map 630.
si_0746; RNA polymerase bacterial
in the RNA Metabolism rollup category. It can be found on KEGG map 3020.
si_0080; TCA Cycle
in the Carbohydrates rollup category. It can be found on KEGG map 20.
si_0032; Gln, Glu, Asn, & Asp Biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 250.
Categories 21-30 include:
si_0662; Ribosome SSU bacterial
in the Protein Metabolism rollup category. It can be found on KEGG map 3010.
si_0526; Ton and Tol transport systems
in the Membrane Transport rollup category. Danese et al. describe the Ton system used to obtain iron in Brucella spp. Housden et al. describe how Ton and Tol facilitate uptake of chelated iron or group B colicins through active transport.
si_0614; Ribonucleotide reduction
in the Nucleosides and Nucleotides rollup category. The Wikipedia article describes how ribonucleotide reductase converts ribonucleotides to deoxyribonucleotides, the building blocks of DNA, helping maintain an appropriate amount of DNA throughout the cell cycle. The SEED description by Dmitry Rodionov describes the three classes of ribonucleotide reductases.
si_0205; Peptidoglycan Biosynthesis
in the Cell Wall and Capsule rollup category. It can be found on KEGG map 550. The SEED description by Vassily Portnoy, Olga Vassieva, and Rick Stevens describes peptidoglycan biosynthesis.
si_0725; Queuosine - Archaeosine Biosynthesis
in the RNA Metabolism rollup category. The SEED description describes the synthesis and incorporation of the modified tRNA bases queuosine and archaeosine.
si_0377; Pyridoxin (Vitamin B6) Biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 750.
si_0065; Entner-Doudoroff Pathway
in the Carbohydrates rollup category. According to Wikipedia, the Entner-Doudoroff pathway is a low-efficiency pathway from glucose to pyruvate, found mostly in Gram-negative organisms such as Pseudomonas, Rhizobium, Azotobacter, and Agrobacterium.
si_0077; anaplerotic pyruvate metabolism I: PEP
in the Carbohydrates rollup category. It can be found on KEGG map 20. See the Wikipedia article.
si_0701; ATP-dependent proteolysis in bacteria
in the Protein Metabolism rollup category.
si_0589; Ammonia assimilation
in the Nitrogen Metabolism rollup category. It can be found on KEGG map 910. The SEED description by Ed Frank describes the glutamate dehydrogenase and GS-GOGAT pathways.
Categories 31-40 include:
si_0812; Respiratory Complex I
in the Respiration rollup category.
si_0405; DNA repair, bacterial
in the DNA Metabolism rollup category. The SEED description by Michael Kubal describes a process of DNA base excision repair through detection, breakage, exonuclease activity, DNA polymerase, and DNA ligase.
si_0606; De Novo Pyrimidine Synthesis
in the Nucleosides and Nucleotides rollup category. It can be found on KEGG map 240.
si_0609; pyrimidine conversions
in the Nucleosides and Nucleotides rollup category. It can be found on KEGG map 240.
si_0023; Branched-Chain Amino Acid Biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 290.
si_0363; Folate Biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 790. The SEED description by Valerie de Crecy-Lagard and Andrew Hanson describes folate biosynthesis.
si_0459; Glycerolipid and Glycerophospholipid Metabolism
in the Fatty Acids, Lipids, and Isoprenoids rollup category. It can be found on KEGG map 561 and KEGG map 564. The SEED description by Vasiliy Portnoy describes glycerolipid and glycerophospholipid biosynthesis.
si_0404; DNA repair, UvrABC system
in the DNA Metabolism rollup category.
si_0558; ZZ gjo need homes
in the Miscellaneous rollup category.
si_0334; Met-tRNA formyltransferase gene cluster
in the Clustering-based subsystems rollup category.
Categories 41-50 include:
si_0073; Pentose phosphate pathway
in the Carbohydrates rollup category. It can be found on KEGG map 30.
si_0331; Cell division-ribosomal stress proteins cluster
in the Clustering-based subsystems rollup category.
si_0062; Photorespiration (oxidative C2 cycle)
in the Carbohydrates rollup category.
si_0393; Heme - Siroheme Biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 860. The SEED description by Svetlana Gerdes describes tetrapyrrole biosynthesis.
si_0357; Biotin biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 780. The SEED description by Dmitry Rodionov describes biotin (vitamin H), an essential cofactor for a class of important metabolic enzymes.
si_0362; 5-FCL-like protein
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category.
si_0038; Lysine Biosynthesis DAP Pathway
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 300.
si_0006; Arginine Biosynthesis, extended
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 330.
si_0078; Pyruvate metabolism II: acetyl-CoA, acetogenesis from pyruvate
in the Carbohydrates rollup category. It can be found on KEGG map 620 and KEGG map 770.
si_0085; Maltose and Maltodextrin Utilization
in the Carbohydrates rollup category. It can be found on KEGG map 500.
Categories 51-60 include:
si_0059; CO2 uptake, carboxysome
in the Carbohydrates rollup category.
si_0654; Potassium homeostasis
in the Potassium metabolism rollup category.
si_0002; Glycine and Serine Utilization
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 260.
si_0668; Translation elongation factors bacterial
in the Protein Metabolism rollup category.
si_0455; Isoprenoid Biosynthesis
in the Fatty Acids, Lipids, and Isoprenoids rollup category. The SEED description by Olga Zagnitko describes how the major terpenoid building blocks, isopentenyl diphosphate and dimethylallyl diphosphate, are produced by the mevalonate and non-mevalonate pathways.
si_0178; Sialic Acid Metabolism
in the Cell Wall and Capsule rollup category.
si_0135; Glycogen metabolism
in the Carbohydrates rollup category. It can be found on KEGG map 500.
si_0153; Bacterial Cytoskeleton
in the Cell Division and Cell Cycle rollup category.
si_0046; Threonine and Homoserine Biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 260.
si_0403; DNA Repair; Base Excision
in the DNA Metabolism rollup category.
Categories 61-70 include:
si_0035; Histidine Biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 340.
si_0772; ATP-dependent RNA helicases, bacterial
in the RNA Metabolism rollup category.
si_0738; tRNA modification Archaea
in the RNA Metabolism rollup category.
si_0492; ABC transporter oligopeptide (TC_3.A.1.5.1)
in the Membrane Transport rollup category.
si_0704; GroEL GroES
in the Protein Metabolism rollup category.
si_0389; Chlorophyll Biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 860. The SEED description by Svetlana Gerdes and Veronika Vonstein describes the biosynthesis of chlorophyll, used in photosynthesis.
si_0593; Nitrate and nitrite ammonification
in the Nitrogen Metabolism rollup category. It can be found on KEGG map 910.
si_0392; Coenzyme B12 biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category.
si_0568; Iojap
in the Miscellaneous rollup category.
si_0749; bacterial transcription factors
in the RNA Metabolism rollup category.
Categories 71-80 include:
si_0939; Multidrug Resistance Efflux Pumps
in the Virulence, Disease, and Defense rollup category.
si_0663; Ribosome SSU eukaryotic and archaeal
in the Protein Metabolism rollup category.
si_0160; chromosome partitioning
in the Cell Division and Cell Cycle rollup category.
si_0358; Coenzyme A Biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. It can be found on KEGG map 770.
si_0280; RNA-metabolizing Zn-dependent hydrolases
in the Clustering-based subsystems rollup category.
si_0853; Choline and Betaine Uptake and Betaine Biosynthesis
in the Stress Response rollup category.
si_0866; Redox-dependent regulation of nucleus processes
in the Stress Response rollup category.
si_0158; Macromolecular synthesis operon
in the Cell Division and Cell Cycle rollup category.
si_0010; Polyamine Metabolism
in the Amino Acids and Derivatives rollup category.
si_0929; Cobalt-zinc-cadmium resistance
in the Virulence, Disease, and Defense rollup category.
Categories 81-90 include:
si_0419; DNA topoisomerases, Type II, ATP-dependent
in the DNA Metabolism rollup category. They are described in the Wikipedia article.
si_0945; Resistance to fluoroquinolones
in the Virulence, Disease, and Defense rollup category. Fluoroquinolones target DNA gyrase and topoisomerase IV. See Hooper et al.
si_0187; KDO2-Lipid A biosynthesis
in the Cell Wall and Capsule rollup category.
si_0672; Translation termination factors bacterial
in the Protein Metabolism rollup category.
si_0739; tRNA modification Bacteria
in the RNA Metabolism rollup category.
si_0930; Copper homeostasis
in the Virulence, Disease, and Defense rollup category.
si_0826; Soluble cytochromes and functionally related electron carriers
in the Respiration rollup category.
si_0396; Thiamin biosynthesis
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. The SEED description by Dmitry Rodionov describes how vitamin B1, thiamin, is synthesized.
si_0676; tRNA aminoacylation, Asp and Asn
in the Protein Metabolism rollup category.
si_0083; Lactose and Galactose Uptake and Utilization
in the Carbohydrates rollup category. It can be found on KEGG map 52.
Categories 91-100 include:
si_0037; Cysteine Biosynthesis
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 270.
si_0210; UDP-N-acetylmuramate from Fructose-6-phosphate Biosynthesis
in the Cell Wall and Capsule rollup category. The SEED description by Vasiliy Portnoy and Olga Zagnitko describes production of a major building block for biosynthesis of peptidoglycan.
si_0373; NAD and NADP cofactor biosynthesis, global
in the Cofactors, Vitamins, Prosthetic Groups, Pigments rollup category. The SEED description by Andrei Osterman describes NAD and NADP biosynthesis.
si_0097; Fermentations: Mixed acid
in the Carbohydrates rollup category.
si_0795; Terminal cytochrome C oxidases
in the Respiration rollup category. It can be found on KEGG map 190.
si_0063; Dehydrogenase complexes
in the Carbohydrates rollup category.
si_0407; MutL-MutS DNA repair system, bacterial
in the DNA Metabolism rollup category. It can be found on KEGG map 3430.
si_0665; Ribosome biogenesis bacterial
in the Protein Metabolism rollup category.
si_0008; Arginine and Ornithine Degradation
in the Amino Acids and Derivatives rollup category. It can be found on KEGG map 330.
7.3. Identifying enriched functions
Using field replicates to identify important determinants of ecosystem function
face figure
Algal transcriptomes
Tooth decay study