These results are of great practical significance
for studies on similar environmental samples, and new primer formulations could be designed using our results. One strategy is to increase coverage through the introduction of proper degenerate nucleotides. Although the total number of sequences in a metagenomic dataset may be very large, the number of 16S rRNA gene sequences is limited, and may account for only approximately 0.2% of all sequence reads [33, 34]. In contrast, the metatranscriptomic analysis of environmental samples generates a large number of small subunit sequences [35]. Although the short length (approximately 200bp) of the sequences currently
deposited in metatranscriptomic datasets is not appropriate for assessing primer coverage, the further development of pyrosequencing will make such assessments possible in the near future.

Methods

Retrieval of 16S rRNA gene sequences from the RDP

A FASTA file for all bacterial 16S rRNA gene sequences was downloaded from the “RESOURCES” section of the RDP website (release 10.18; http://rdp.cme.msu.edu/) [14]. With the help of the “BROWSERS” service, good-quality, almost full-length (size ≥ 1200bp) sequences were obtained. These sequences were extracted from the FASTA file by Perl scripts. A final dataset with 462,719 bacterial 16S rRNA gene sequences was constructed (referred to as the “RDP dataset”).

Elimination of primer contamination in the RDP dataset

Most sequences deposited in the RDP dataset were generated by PCR. However, as described by Frank et al. [18], many of these sequences lack correct primer trimming. Only sequence fragments extending at least 3 nucleotides past the start (the 5′ end) of the longest version of each primer were considered uncontaminated by the PCR primers. Because the sequences selected from the RDP were all longer than 1200bp, only the primer-binding sites for 27F, 1390R and 1492R could be contaminated (Additional file 4: Figure S3). Thus, 15,045, 188,792 and 35,462 sequences were selected for the primers 27F, 1390R and 1492R, respectively, as containing authentic primer-binding sites.

Retrieval of 16S rDNA sequences from the metagenomic datasets

Selection of metagenomic datasets

Metagenomic datasets were selected from the CAMERA website (release v.1.3.2.30; http://camera.calit2.net/) [15]. Given the read length and the diversity of sample sources, 7 microbial metagenomic datasets constructed by shotgun sequencing were chosen (average sequence length ≫ 900bp, sequence number ≫ 300,000): AntarcticaAquatic, AcidMine, BisonMetagenome, GOS, GutlessWorm, HumanGut and HOT. Detailed descriptions for each dataset are listed in Table 2.
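
The primer-contamination rule described for the RDP dataset (a sequence counts as uncontaminated only if it extends at least 3 nucleotides past the 5′ end of the primer) can be illustrated with a minimal Python sketch. This is not the Perl code used in the study. It assumes exact matching of degenerate primers via IUPAC codes, and the primer strings below are commonly cited versions of 27F and 1492R that may differ from the "longest version" used here. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complements of IUPAC codes, used to map a reverse primer onto the sense strand.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}

# Assumed primer sequences (commonly cited versions, not taken from the text).
PRIMER_27F = "AGAGTTTGATCMTGGCTCAG"
PRIMER_1492R = "GGTTACCTTGTTACGACTT"


def primer_regex(primer):
    """Compile a degenerate primer into a regex that matches exactly."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(primer):
    """Reverse-complement a (possibly degenerate) primer."""
    return "".join(COMPLEMENT[base] for base in reversed(primer.upper()))


def is_uncontaminated_forward(seq, primer, min_flank=3):
    """True if the first binding site has >= min_flank nt upstream of it."""
    match = primer_regex(primer).search(seq.upper())
    return match is not None and match.start() >= min_flank


def is_uncontaminated_reverse(seq, primer, min_flank=3):
    """True if the first binding site has >= min_flank nt downstream of it.

    On the sense strand the site appears as the reverse complement of the
    primer, so the primer's 5' end corresponds to the site's downstream end.
    """
    site = reverse_complement(primer)
    match = primer_regex(site).search(seq.upper())
    return match is not None and len(seq) - match.end() >= min_flank
```

Calling `is_uncontaminated_forward(sequence, PRIMER_27F)` on each record of a FASTA file would keep only sequences whose 27F binding site is preceded by at least 3 nucleotides. A real pipeline would probably also need to tolerate mismatches when locating the binding site, which this sketch does not attempt.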