BDB-Lab Summer 2023 Updates
Looking forward.
Anna will attend FEMS 2023 in Hamburg, Germany (July 9 - 13). She will give a flash poster presentation titled “Exploring the dog gut microbiome within a large-scale investigation of animal gut metagenomes” on Monday (July 10 at 16:30; Genetics and Genomics session). Feel free to get in touch with Anna if you will be at #FEMS2023!
Luis will attend the European Conference on Computational Biology in Lyon, France (July 23 - 27) and talk about SemiBin2 on Tuesday (July 25 at 13:50; MICROBIOME COSI). Feel free to get in touch with Luis if you will be at #ISMBECCB2023!
Focus of the Quarter: small proteins in the global microbiome
There is no strong consensus as to what qualifies as a small protein.[1] We use an upper limit of 100 amino acids, but other authors use 50 amino acids. Although the existence of these proteins has been known for a long time, relatively few of them have been well characterized and, most relevant for the type of work we do, they were ignored in large-scale studies. Small sequences were ignored not because we were convinced that there was nothing there. Rather, ignoring them was the easiest way to deal with the problem of false positives! Fundamentally, while a long coding sequence (without an intermediate STOP) is very unlikely to emerge by chance and thus can be seen as evidence of selection, short open reading frames (smORFs[2]) can be the result of random DNA. Thus, state-of-the-art gene prediction tools such as Prodigal set lower limits on the size of the ORFs they predict, as otherwise the false positive rate would be too high. Nonetheless, there have been a few recent enterprising efforts to tackle this blind spot. Ami Bhatt’s group was responsible for a landmark study in 2019, looking at the human microbiome, which has been followed by others.
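To make the false-positive argument concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified null model that is not part of our actual analyses: every codon is drawn uniformly at random, so 3 of the 64 possible codons are STOPs. Real genomes have skewed base composition, so the numbers are only indicative.

```python
# Simplified null model (an assumption for illustration): uniformly random DNA,
# where 3 of the 64 possible codons are STOP codons.
STOP_CODONS = 3
TOTAL_CODONS = 64


def prob_stop_free(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no STOP."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


# Compare lengths around the small-protein cutoffs with canonical gene lengths.
for n in (50, 100, 300, 1000):
    print(f"{n:>5} codons: P(no STOP by chance) = {prob_stop_free(n):.2e}")
```

Under this model, a stop-free stretch of 50 to 100 codons is not rare at the scale of a metagenome, whereas a stretch of several hundred codons essentially never appears by chance. This is why length alone is good evidence for long genes but not for short ones.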
In the Big Data Biology Lab, we have been exploring small proteins in the global microbiome since 2019, and this is now one of the main areas of focus in the group.
The AMPSphere (which we covered in more depth two years ago) catalogs antimicrobial small proteins (AMPs[3]) with the dual objective of obtaining novel sequences with biotechnological potential and understanding the ecological roles they play in natura (after all, microbes are not producing these molecules for our benefit). To get at the ecological role of AMPs, we measured the AMP density, that is, the number of predicted AMPs per assembled megabasepair, for different samples and different bacterial species. One cool finding was that bacteria from animal microbiomes produce more AMPs than those from non-host samples. We were even able to relate the AMP density of a species to how much that species is predicted to transmit between humans (obtained from an excellent recent paper on this question). We think that AMPs may be playing an under-appreciated role in a lot of microbial ecology processes.[4]
The AMPSphere resource is freely available at https://ampsphere.big-data-biology.org/
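As a minimal illustration of the AMP density measure described above (predicted AMPs per assembled megabasepair), the following sketch shows the arithmetic. The function name and the sample numbers are hypothetical and chosen for illustration only; they are not taken from the AMPSphere data or code.

```python
def amp_density(n_predicted_amps: int, assembled_bp: int) -> float:
    """Number of predicted AMPs per assembled megabasepair (Mbp)."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_predicted_amps / (assembled_bp / 1_000_000)


# Hypothetical samples: (predicted AMPs, assembled basepairs)
samples = {
    "host_associated_example": (12, 4_000_000),
    "non_host_example": (5, 10_000_000),
}

for name, (n_amps, n_bp) in samples.items():
    print(f"{name}: {amp_density(n_amps, n_bp):.2f} AMPs per Mbp")
```

Normalizing by assembled length is what makes samples (or species) with very different assembly sizes comparable.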
We also constructed a global microbial smORFs catalog (GMSC, covered before) from global metagenomes and genomes from microbial isolates. This parallels our earlier effort to catalog canonical-length proteins[5], GMGCv1. We found that archaea harbor more small proteins than bacteria (as a fraction of their genome size), and archaeal small proteins are more likely to be transmembrane or secreted.
To facilitate the use of this resource, we provide a tool called GMSC-mapper to annotate small proteins from microbial genomes through homology searching. The full resource is still a work in progress, but the beta version is available at https://gmsc.big-data-biology.org/.
People.
Marija defended her Ph.D. thesis on “Prokaryotic Evolution in the Age of Computational Genomics: From One Patient to Global HGT Trends.” Congratulations Marija! (And she just opened a Twitter account – give her a follow!)
Luis is moving to the Centre for Microbiome Research at the Queensland University of Technology. Get in touch if you want to do a PhD in sunny Brisbane working on the microbiome.
Papers & Preprints.
The ResFinderFG2 manuscript is now published in Nucleic Acids Research. Svetlana and Luis collaborated with the EMBARK colleagues to build a new version of the ResFinderFG database, which collates data on antibiotic resistance genes obtained by functional metagenomics.
Celio has submitted a manuscript on the AMPSphere, which is now in review. We will also post a preprint soon.
Marija has submitted her work on horizontal gene transfer for publication (preprint).
Tools.
Shaojun and Luis contributed a new indexing structure to strobealign, which drastically reduced memory consumption. The main authors included this work in the recently released strobealign 0.11.0 (as well as several other improvements). We will soon add strobealign as an optional aligner in NGLess as well.
Jug 2.3.0 was released with several minor improvements.
Conferences/talks.
Luis presented ongoing work by Svetlana on antibiotic resistance gene annotation of metagenomes at the EMBARK spring webinar.
Yiqian and Luis attended the Microproteins 2023 conference in Denmark (May 31 - June 2). Yiqian presented a poster and a flash talk titled “With small ORFs come great datasets: the global microbial small ORFs catalog (GMSC),” and she won a best poster prize!
Marija was at ASM Microbe 2023 in Houston (TX, USA), where she gave a rapid-fire talk and presented a poster titled “A global survey of eco-evolutionary pressures acting on horizontal gene transfer”.
Other.
We have a new logo. The last one lasted 5 years and it was time to change.
This last quarter we have been doing some field work and science away from the computer. We collected soil and dog fecal samples and are now waiting for the sequencing results. Stay tuned!
We have continued to upload videos to our YouTube channel. Check out our latest one: Get eggnog-mapper annotations for GMGC data.
That is all for now, but we are always happy to hear from people. So email us or schedule a Zoom with Luis to chat science.
[1] Or even whether we should call them small proteins or microproteins or peptides. Some authors will insist that peptide (as per its etymology) should only be used for molecules that result from breaking up larger ones, but this is not accepted by everyone. The field is, generally speaking, in need of a terminology cleanup.
[2] smORFs or sORFs. As mentioned in the previous footnote, this is a field in need of some nomenclature standardization.
[3] AMPs stands for antimicrobial peptides; see the previous footnotes on how the nomenclature is not standardized.
[4] We will soon make our density estimates public so you can look up the values for any bugs you wish; we really just need to solve a few technical issues. Email us if you want early access, though!
[5] Again, there is no consensus on what to call things. In this case, we lack a good name for “non-small proteins.”