Greengenes 12_10 is released!

16 10 2012

We’re pleased to announce that the Greengenes 12_10 release is live. This release takes Greengenes from 408k sequences to over 1 million. At the OTU level, Greengenes has grown from 35k 97% OTUs to 85k 97% OTUs.

The large increase in the number of reference sequences is particularly exciting for researchers working on less well-characterized environments. For example, when performing reference-based OTU picking for the “88 soils” (Lauber et al., 2009) data, the number of reads assigned to the reference database rose from 53.7k when picking OTUs against the 4feb2011 OTUs to 66.4k when picking OTUs against the 12_10 OTUs. This means that we have reliable, detailed taxonomic information for many more of the OTUs, as well as a more reliable tree relating these OTUs (relative to a de novo constructed tree). We see a similar, albeit less dramatic, increase in the number of assigned OTUs in the Moving Pictures of the Human Microbiome (Caporaso et al., 2011) study, going from 18.9k with 4feb2011 to 20.0k with 12_10. It is likely that we will continue to see large gains in the number of OTUs that match Greengenes in less well-characterized environments with subsequent releases.
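To put the two gains above on the same scale, here is a small sketch that turns the quoted counts into relative increases. The function and variable names are ours, chosen for illustration; the only inputs are the figures quoted in this post.

```python
def relative_gain(before, after):
    """Return the fractional increase going from `before` to `after`."""
    return (after - before) / before


# Counts quoted in the post, in thousands (4feb2011 value, 12_10 value).
comparisons = {
    "88 soils": (53.7, 66.4),
    "Moving Pictures": (18.9, 20.0),
}

gains = {
    name: relative_gain(before, after)
    for name, (before, after) in comparisons.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.1%} more assigned with 12_10")
```

This works out to roughly a 24% gain for the soil data versus roughly 6% for the human-associated data, which is the pattern described above.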

We are also excited to announce that Greengenes is now protected under the Creative Commons Share-Alike license, permanently placing the resource in the public domain. In addition, a guiding body, the Greengenes Consortium, has been formed to direct the resource into the future. The Consortium represents academic and biotech interests with members from the Australian Centre for Ecogenomics at the University of Queensland, the BioFrontiers Institute at the University of Colorado, and Second Genome, Inc.

The release can be obtained from http://greengenes.secondgenome.com.

Due to the substantial increase in the size of Greengenes, we performed several comparisons of the OTUs and taxonomy against the prior Greengenes release to confirm that the results match what we expect. Specifically, we performed closed- and open-reference OTU picking on the Moving Pictures of the Human Microbiome (Caporaso et al., 2011) data and on the “88 soils” (Lauber et al., 2009) data. Procrustes analysis (performed with transform_coordinate_matrices.py and compare_3d_plots.py) shows a very strong concordance between the UniFrac PCoA plots resulting from both methods of OTU picking with both the 4feb2011 and 12_10 Greengenes OTUs. For example, comparing closed-reference OTU picking against the 4feb2011 Greengenes OTUs and the 12_10 Greengenes OTUs on the 88 soils data yielded a highly significant Procrustes result (M²=0.028, p<0.01; samples colored by soil pH):

[Figure: Procrustes plot of the 88 soils samples, colored by soil pH, comparing closed-reference results for 4feb2011 and 12_10.]
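For readers unfamiliar with Procrustes statistics, the following is a minimal sketch of how an M² value and a permutation-based p-value can be computed. It uses `scipy.spatial.procrustes` on synthetic coordinates; it is not the QIIME implementation, and the helper name `procrustes_permutation_test` and the toy data are assumptions made for illustration only.

```python
import numpy as np
from scipy.spatial import procrustes


def procrustes_permutation_test(coords_a, coords_b, n_permutations=999, seed=0):
    """Return (M^2, p) for two sample-by-axis coordinate matrices.

    Rows must refer to the same samples in the same order. The p-value is
    the fraction of row-shuffled comparisons that fit at least as well as
    the observed one (with the usual +1 correction).
    """
    rng = np.random.default_rng(seed)
    _, _, observed_m2 = procrustes(coords_a, coords_b)

    at_least_as_good = 0
    for _ in range(n_permutations):
        # Break the sample pairing by shuffling the rows of one matrix.
        shuffled = coords_b[rng.permutation(len(coords_b))]
        _, _, permuted_m2 = procrustes(coords_a, shuffled)
        if permuted_m2 <= observed_m2:
            at_least_as_good += 1

    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed_m2, p_value


# Synthetic stand-ins for two PCoA results (30 samples, 3 axes each):
# the second is a rotated, slightly noisy copy of the first.
rng = np.random.default_rng(42)
coords_old = rng.normal(size=(30, 3))
theta = np.pi / 4
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
coords_new = coords_old @ rotation + rng.normal(scale=0.05, size=(30, 3))

m2, p = procrustes_permutation_test(coords_old, coords_new)
print(f"M^2 = {m2:.4f}, p = {p:.3f}")
```

A small M² means the two ordinations nearly coincide after translation, scaling and rotation, and a small p-value means that fit is much better than expected when the sample pairing is randomized.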

Taxonomy summaries of the Moving Pictures samples are highly correlated as well (p<0.001; performed with compare_taxa_summaries.py), where the largest difference results from the reclassification of some Tenericutes sequences from 4feb2011 (top panel) as Firmicutes in 12_10 (bottom panel):

[Figure: taxonomy summaries of the Moving Pictures samples; top panel 4feb2011, bottom panel 12_10.]
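As a rough illustration of what comparing two taxonomy summaries involves, the sketch below correlates two phylum-level abundance vectors and reports the phylum that shifted most. The abundance values are invented for illustration (they are not taken from the Moving Pictures data), and this is a plain NumPy Pearson correlation, not the compare_taxa_summaries.py implementation.

```python
import numpy as np

phyla = ["Firmicutes", "Bacteroidetes", "Proteobacteria",
         "Actinobacteria", "Tenericutes"]

# Hypothetical relative abundances for one sample under each release.
old_release = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
new_release = np.array([0.43, 0.31, 0.15, 0.10, 0.01])

# Pearson correlation between the two summaries.
pearson_r = np.corrcoef(old_release, new_release)[0, 1]

# Phylum with the largest absolute change between releases.
differences = new_release - old_release
largest_shift = phyla[int(np.argmax(np.abs(differences)))]

print(f"Pearson r = {pearson_r:.3f}; largest shift: {largest_shift}")
```

In this toy example the summaries stay highly correlated even though one phylum loses most of its share, which mirrors the kind of localized reclassification described above.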

You can use the 12_10 Greengenes OTUs with QIIME in the same way as the 4feb2011 release. You can pass representative set fasta files for reference-based OTU picking (open-reference OTU picking discussed here and closed-reference OTU picking discussed here), or use the sequences and taxonomy files to retrain the RDP classifier as described here.

Enjoy!

Greg and Daniel
