1 Example data
I will use a metagenomic dataset as the running example in the following sections, to help you get familiar with downstream analysis of omics data and with the usage of pctax.
Let's look at the example microbiome data (simulated):
Metadata
First, the metadata. This is where we record essential information about the study samples. Typically, each row is one sample (with the sample ID as the row name), and the columns hold sample-level features: experimental group (control or treatment), sampling time, location, environmental factors at the sampling point, phenotypic characteristics of the host, and so on.
print(metadata)
| | Id | Group | env1 | env2 | env3 | env4 | env5 | env6 | lat | long |
---|---|---|---|---|---|---|---|---|---|---|
NS1 | NS1 | NS | 3.057248 | 10.235708 | 5.554576 | 8.084997 | 25.007946 | -1.1545668 | 26.94422 | 103.4767 |
NS2 | NS2 | NS | 4.830219 | 11.134527 | 5.613455 | 8.556829 | 16.676898 | 0.8116874 | 29.08733 | 109.6196 |
NS3 | NS3 | NS | 3.753133 | 10.062318 | 5.582916 | 10.226572 | 21.689255 | 1.4073321 | 28.25164 | 104.0361 |
NS4 | NS4 | NS | 4.262264 | 10.844010 | 5.258419 | 9.002256 | 24.810460 | 1.4780532 | 33.82415 | 106.8651 |
NS5 | NS5 | NS | 2.476135 | 7.525840 | 6.255314 | 9.357587 | 19.705527 | 0.0581309 | 33.51011 | 105.4571 |
NS6 | NS6 | NS | 5.131004 | 10.827615 | 5.180966 | 8.141506 | 18.390209 | -1.7003257 | 31.86864 | 102.7832 |
WS1 | WS1 | WS | 4.690185 | 8.868384 | 5.534423 | 2.922556 | 13.066594 | -0.9073270 | 25.67656 | 102.2946 |
WS2 | WS2 | WS | 5.500007 | 8.270563 | 6.698076 | 3.711924 | 6.344009 | -0.1699797 | 27.69990 | 106.0343 |
WS3 | WS3 | WS | 3.220505 | 8.435364 | 7.462542 | 3.906052 | 15.703113 | -1.5205620 | 28.04572 | 108.9124 |
WS4 | WS4 | WS | 5.624307 | 7.174707 | 5.387799 | 2.777254 | 12.503655 | 1.6144087 | 33.86966 | 110.1844 |
WS5 | WS5 | WS | 5.013274 | 7.678983 | 6.478364 | 3.527165 | 7.391619 | -0.6876136 | 28.36314 | 107.0412 |
WS6 | WS6 | WS | 6.321235 | 7.822989 | 6.262504 | 3.238742 | 10.298175 | 0.0661551 | 30.07997 | 105.0054 |
CS1 | CS1 | CS | 5.242789 | 12.053449 | 8.383412 | 7.175002 | 17.666552 | 1.0230426 | 32.83965 | 103.8978 |
CS2 | CS2 | CS | 5.402243 | 9.865916 | 6.760709 | 5.050641 | 19.775379 | 1.7248702 | 30.29499 | 101.6969 |
CS3 | CS3 | CS | 5.474717 | 12.489934 | 5.729690 | 4.215989 | 16.861294 | -0.8506381 | 29.90803 | 106.0819 |
CS4 | CS4 | CS | 6.915080 | 12.492414 | 6.845870 | 5.280682 | 15.011610 | 0.5285857 | 31.87761 | 104.2137 |
CS5 | CS5 | CS | 6.355684 | 13.085380 | 6.474958 | 5.893205 | 17.686923 | -0.5588746 | 27.94134 | 103.1896 |
CS6 | CS6 | CS | 6.381007 | 10.461389 | 7.432614 | 7.173710 | 17.387503 | -0.0904096 | 35.29004 | 106.2336 |
Here, the metadata simulates a study on soil microbiome:
- Id: Unique identifier for each sample (name).
- Group: Experimental grouping (NS, WS, or CS; what the labels actually stand for does not matter for this simulation).
- env1 to env6: Environmental factors at the sampling points (standing in for variables such as pH, temperature, or humidity).
- lat and long: Latitude and longitude of the simulated sampling location (with no actual significance).
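To make the layout concrete, here is a minimal sketch in base R that builds a small metadata table with the same shape and runs a few basic checks. The object `metadata_toy` and its values are made up for illustration; they are not part of the example dataset.

```r
# Toy metadata mimicking the layout described above (values are invented)
metadata_toy <- data.frame(
  Id    = c("NS1", "NS2", "WS1", "WS2", "CS1", "CS2"),
  Group = c("NS", "NS", "WS", "WS", "CS", "CS"),
  env1  = c(3.1, 4.8, 4.7, 5.5, 5.2, 5.4),
  lat   = c(26.9, 29.1, 25.7, 27.7, 32.8, 30.3),
  long  = c(103.5, 109.6, 102.3, 106.0, 103.9, 101.7),
  stringsAsFactors = FALSE
)

# Use the sample IDs as row names, as in the table shown above
rownames(metadata_toy) <- metadata_toy$Id

# Sample IDs must be unique, otherwise samples cannot be matched reliably
stopifnot(!anyDuplicated(rownames(metadata_toy)))

# Store the grouping variable as a factor with an explicit level order
metadata_toy$Group <- factor(metadata_toy$Group, levels = c("NS", "WS", "CS"))

# Number of samples per group
group_sizes <- table(metadata_toy$Group)
print(group_sizes)

# Column types at a glance
str(metadata_toy)
```

The same checks (unique row names, group sizes, column types) can be run on the real `metadata` object once it is loaded.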
Feature abundance table
Next is the feature abundance table generated through upstream processing, such as microbial abundance in metagenomics, gene abundance in transcriptomics, or metabolite abundance in metabolomics.
Typically, rows represent features and columns represent samples. It is common practice to make the column names of the abundance table match the row names of the metadata exactly, which greatly simplifies subsequent analyses.
head(otutab)
| | NS1 | NS2 | NS3 | NS4 | NS5 | NS6 | WS1 | WS2 | WS3 | WS4 | WS5 | WS6 | CS1 | CS2 | CS3 | CS4 | CS5 | CS6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
s__un_f__Thermomonosporaceae | 1092 | 1920 | 810 | 1354 | 1064 | 1070 | 1252 | 1597 | 1330 | 941 | 1233 | 1011 | 2313 | 2518 | 1709 | 1975 | 1431 | 1527 |
s__Pelomonas_puraquae | 1962 | 1234 | 2362 | 2236 | 2903 | 1829 | 644 | 495 | 1230 | 1284 | 953 | 635 | 1305 | 1516 | 844 | 1128 | 1483 | 1174 |
s__Rhizobacter_bergeniae | 588 | 458 | 889 | 901 | 1226 | 853 | 604 | 470 | 1070 | 1028 | 846 | 670 | 1029 | 1802 | 1002 | 1200 | 1194 | 762 |
s__Flavobacterium_terrae | 244 | 234 | 1810 | 673 | 1445 | 491 | 318 | 1926 | 1493 | 995 | 577 | 359 | 1080 | 1218 | 754 | 423 | 1032 | 1412 |
s__un_g__Rhizobacter | 1432 | 412 | 533 | 759 | 1289 | 506 | 503 | 590 | 445 | 620 | 657 | 429 | 1132 | 1447 | 550 | 583 | 1105 | 903 |
s__un_o__Burkholderiales | 886 | 683 | 824 | 912 | 1502 | 1029 | 235 | 252 | 359 | 381 | 387 | 351 | 551 | 540 | 477 | 559 | 513 | 496 |
s__un_g__Streptomyces | 516 | 510 | 621 | 424 | 205 | 322 | 340 | 548 | 1590 | 776 | 493 | 508 | 624 | 757 | 560 | 1058 | 449 | 512 |
s__Lentzea_flaviverrucosa | 424 | 1033 | 310 | 440 | 311 | 485 | 414 | 416 | 309 | 505 | 673 | 407 | 805 | 600 | 815 | 415 | 683 | 463 |
s__un_g__Actinoplanes | 338 | 805 | 349 | 443 | 261 | 549 | 297 | 448 | 632 | 382 | 552 | 417 | 579 | 322 | 439 | 441 | 752 | 512 |
s__un_g__Rhizobium | 369 | 357 | 684 | 774 | 1033 | 666 | 213 | 186 | 281 | 274 | 408 | 279 | 360 | 598 | 243 | 274 | 517 | 273 |
s__un_g__Noviherbaspirillum | 321 | 344 | 317 | 364 | 561 | 364 | 470 | 386 | 235 | 415 | 351 | 184 | 435 | 497 | 511 | 419 | 320 | 383 |
s__un_f__Comamonadaceae | 170 | 176 | 375 | 367 | 521 | 385 | 194 | 509 | 484 | 304 | 503 | 194 | 386 | 285 | 410 | 281 | 578 | 544 |
s__Bradyrhizobium_neotropicale | 318 | 415 | 449 | 330 | 371 | 380 | 365 | 315 | 279 | 238 | 406 | 375 | 306 | 274 | 330 | 358 | 352 | 368 |
s__Streptomyces_ederensis | 234 | 262 | 524 | 248 | 148 | 211 | 145 | 232 | 593 | 289 | 224 | 175 | 445 | 296 | 245 | 305 | 817 | 354 |
s__Actinocorallia_herbida | 260 | 315 | 58 | 454 | 144 | 184 | 162 | 277 | 151 | 268 | 253 | 194 | 396 | 470 | 240 | 310 | 463 | 233 |
s__un_g__Amycolatopsis | 198 | 429 | 90 | 258 | 154 | 150 | 81 | 115 | 59 | 184 | 106 | 107 | 243 | 284 | 99 | 142 | 1547 | 103 |
s__Actinophytocola_burenkhanensis | 117 | 140 | 1152 | 58 | 30 | 64 | 268 | 140 | 74 | 186 | 175 | 125 | 139 | 31 | 296 | 251 | 201 | 368 |
s__un_p__Proteobacteria | 210 | 173 | 144 | 130 | 87 | 192 | 256 | 193 | 182 | 171 | 227 | 273 | 220 | 183 | 325 | 252 | 251 | 320 |
s__Kribbella_catacumbae | 152 | 370 | 194 | 121 | 99 | 129 | 174 | 194 | 163 | 166 | 196 | 158 | 209 | 313 | 195 | 295 | 377 | 222 |
s__un_o__Rhizobiales | 202 | 205 | 322 | 237 | 235 | 215 | 254 | 161 | 147 | 161 | 178 | 215 | 156 | 183 | 222 | 166 | 146 | 203 |
Here, the otutab table represents the abundance of each identified microbial species across all samples.
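The alignment between abundance-table columns and metadata rows mentioned above can be checked and enforced with a few lines of base R. This is only a sketch on toy objects (`otutab_toy` and `metadata_toy` are hypothetical names with invented counts), not a pctax function.

```r
# Toy metadata: sample IDs as row names
sample_ids <- c("NS1", "NS2", "WS1", "WS2")
metadata_toy <- data.frame(
  Group = c("NS", "NS", "WS", "WS"),
  row.names = sample_ids
)

# Toy abundance table whose columns arrive in a different order
otutab_toy <- data.frame(
  WS2 = c(10, 0, 5),
  NS1 = c(3, 7, 1),
  WS1 = c(8, 2, 0),
  NS2 = c(4, 6, 2),
  row.names = c("s__A", "s__B", "s__C")
)

# Samples present in both tables, in metadata order
shared <- intersect(rownames(metadata_toy), colnames(otutab_toy))

# Subset and reorder so that column i of the table matches row i of the metadata
metadata_toy <- metadata_toy[shared, , drop = FALSE]
otutab_toy <- otutab_toy[, shared, drop = FALSE]
stopifnot(identical(colnames(otutab_toy), rownames(metadata_toy)))

# Per-sample totals (column sums), a common first sanity check
depth <- colSums(otutab_toy)
print(depth)
```

Using `intersect()` before reordering also drops samples that appear in only one of the two tables, which is usually what you want before any group comparison.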
Feature annotation (optional)
Having both metadata and a feature abundance table allows for various analyses.
Sometimes, additional information comes in the form of feature annotation, containing details about each feature. For instance, in metagenomic data, this might include taxonomic information such as phylum, class, order, family, and genus for each microbial species. In transcriptomics, it could involve functional descriptions and classifications corresponding to gene IDs.
This additional layer of annotation enhances our understanding of the features being analyzed. Typically, aligning the row names of feature annotation with the row names of the feature abundance table is advantageous for subsequent analyses.
head(taxonomy)
| | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
s__un_f__Thermomonosporaceae | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Thermomonosporaceae | g__un_f__Thermomonosporaceae | s__un_f__Thermomonosporaceae |
s__Pelomonas_puraquae | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria | o__Burkholderiales | f__Comamonadaceae | g__Pelomonas | s__Pelomonas_puraquae |
s__Rhizobacter_bergeniae | k__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Pseudomonadales | f__Pseudomonadaceae | g__Rhizobacter | s__Rhizobacter_bergeniae |
s__Flavobacterium_terrae | k__Bacteria | p__Bacteroidetes | c__Flavobacteriia | o__Flavobacteriales | f__Flavobacteriaceae | g__Flavobacterium | s__Flavobacterium_terrae |
s__un_g__Rhizobacter | k__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Pseudomonadales | f__Pseudomonadaceae | g__Rhizobacter | s__un_g__Rhizobacter |
s__un_o__Burkholderiales | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria | o__Burkholderiales | f__un_o__Burkholderiales | g__un_o__Burkholderiales | s__un_o__Burkholderiales |
s__un_g__Streptomyces | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Streptomycetaceae | g__Streptomyces | s__un_g__Streptomyces |
s__Lentzea_flaviverrucosa | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Pseudonocardiaceae | g__Lentzea | s__Lentzea_flaviverrucosa |
s__un_g__Actinoplanes | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Micromonosporaceae | g__Actinoplanes | s__un_g__Actinoplanes |
s__un_g__Rhizobium | k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rhizobiales | f__Rhizobiaceae | g__Rhizobium | s__un_g__Rhizobium |
s__un_g__Noviherbaspirillum | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria | o__Burkholderiales | f__Oxalobacteraceae | g__Noviherbaspirillum | s__un_g__Noviherbaspirillum |
s__un_f__Comamonadaceae | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria | o__Burkholderiales | f__Comamonadaceae | g__un_f__Comamonadaceae | s__un_f__Comamonadaceae |
s__Bradyrhizobium_neotropicale | k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rhizobiales | f__Bradyrhizobiaceae | g__Bradyrhizobium | s__Bradyrhizobium_neotropicale |
s__Streptomyces_ederensis | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Streptomycetaceae | g__Streptomyces | s__Streptomyces_ederensis |
s__Actinocorallia_herbida | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Thermomonosporaceae | g__Actinocorallia | s__Actinocorallia_herbida |
s__un_g__Amycolatopsis | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Pseudonocardiaceae | g__Amycolatopsis | s__un_g__Amycolatopsis |
s__Actinophytocola_burenkhanensis | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Pseudonocardiaceae | g__Actinophytocola | s__Actinophytocola_burenkhanensis |
s__un_p__Proteobacteria | k__Bacteria | p__Proteobacteria | c__un_p__Proteobacteria | o__un_p__Proteobacteria | f__un_p__Proteobacteria | g__un_p__Proteobacteria | s__un_p__Proteobacteria |
s__Kribbella_catacumbae | k__Bacteria | p__Actinobacteria | c__Actinobacteria | o__Actinomycetales | f__Nocardioidaceae | g__Kribbella | s__Kribbella_catacumbae |
s__un_o__Rhizobiales | k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rhizobiales | f__un_o__Rhizobiales | g__un_o__Rhizobiales | s__un_o__Rhizobiales |
Here, the taxonomy table includes taxonomic information for each species, providing valuable insights when exploring species composition.
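One typical use of the annotation is to collapse species-level counts to a higher rank. The sketch below uses only base R (`rowsum()`) on toy objects; `otutab_toy` and `taxonomy_toy` are hypothetical, and pctax may provide its own helpers for this step.

```r
# Toy species-by-sample abundance table (invented counts)
otutab_toy <- data.frame(
  S1 = c(10, 5, 3),
  S2 = c(2, 8, 4),
  row.names = c("s__A", "s__B", "s__C")
)

# Toy annotation, deliberately stored in a different row order
taxonomy_toy <- data.frame(
  Phylum = c("p__Actinobacteria", "p__Proteobacteria", "p__Proteobacteria"),
  Genus  = c("g__X", "g__Y", "g__Z"),
  row.names = c("s__C", "s__A", "s__B")
)

# Align annotation rows with the rows of the abundance table
taxonomy_toy <- taxonomy_toy[rownames(otutab_toy), , drop = FALSE]
stopifnot(identical(rownames(taxonomy_toy), rownames(otutab_toy)))

# Sum species counts within each phylum, separately for every sample
phylum_tab <- rowsum(otutab_toy, group = taxonomy_toy$Phylum)
print(phylum_tab)
```

Aligning the row names first matters: `rowsum()` pairs each abundance row with the group label at the same position, so a misordered annotation would silently assign counts to the wrong phylum.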