Summer 2021 Update

AMPSphere: what? why? and by whom? Also, we're at #WorldMicrobeForum

This is our second quarterly update. We will send four of these a year, coinciding (roughly) with the two equinoxes and the two solstices. Every three months, we will highlight one of the projects in the group and give short updates on everything else that has been happening.

Several of us are currently attending the World Microbe Forum (we have posters!), so you can “find us” there.

Summer 2021 Focus: AMPSphere by Célio Dias Santos-Júnior

What is the AMPSphere?

AMPSphere is a resource that we created in collaboration with the Bork group at EMBL. It is a global catalog of antimicrobial peptides (AMPs). Currently, AMPSphere consists of almost 1 million non-redundant peptides (~890k) from ~64k metagenomes from several different environments and about 100k microbial genomes from proGenomes v2. These AMPs cluster into almost 5k families.

To generate AMPSphere, we used a pipeline derived from the original Global Microbial Gene Catalog (GMGC) project and Macrel to predict peptides and screen them for AMPs using random forests. We created Macrel to perform with high precision on metagenome-like datasets, which makes our approach very robust.

What can be done with the AMPSphere now?

It is now possible to trace AMPs back to their producers and targets using correlation-based approaches. We can also better understand microbial competition in some environmental microbiomes. It is also possible to map AMPs across animal body sites and even to try to correlate them with conditions such as diseases. AMP candidates from these approaches can be synthesized and tested against known pathogens and, in the future, be integrated into therapeutic protocols. The AMPSphere dataset may also be useful for developing and validating methods to detect spurious gene predictions.
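As a rough illustration of what a correlation-based approach can look like (this is not the AMPSphere pipeline itself), the sketch below ranks taxa by the Pearson correlation between their abundance and the abundance of one AMP across samples. The function names, the toy abundances, and the reading that strongly positive correlations hint at candidate producers while strongly negative ones hint at candidate targets are all assumptions made for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(vx * vy)

def rank_taxa(amp_profile, taxa_profiles):
    """Return (taxon, r) pairs sorted from most positive to most negative r."""
    scores = [(taxon, pearson(amp_profile, profile))
              for taxon, profile in taxa_profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy, made-up abundances across five samples (hypothetical, not real data).
amp = [0.0, 1.0, 2.0, 3.0, 4.0]
taxa = {
    "taxon_A": [0.1, 1.1, 2.1, 3.1, 4.1],  # rises together with the AMP
    "taxon_B": [4.0, 3.0, 2.0, 1.0, 0.0],  # falls as the AMP rises
    "taxon_C": [1.0, 1.0, 1.0, 1.0, 1.0],  # constant across samples
}
ranking = rank_taxa(amp, taxa)
```

In practice, abundance data are compositional and sparse, so a real analysis would likely need normalization, a rank-based or compositionality-aware measure, and correction for multiple testing. Correlation alone does not establish who produces or is targeted by a peptide.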

Where do you plan to take the project in the future?

The idea is to continuously expand the AMP set so that it becomes a central reference for AMP research. We also intend to annotate these sequences better, using correlation-based approaches to obtain information that is hard to predict from the raw sequence, such as their targets. We also intend to test a set of these sequences against the World Health Organization selection of priority pathogens, for example MRSA, E. coli, and Pseudomonas aeruginosa. As more people join the project, the possibilities for work with AMPSphere, which are very broad, can be better explored.

Can you also tell us a bit about yourself: what was your path to get here?

I joined the Big Data Biology group in 2019, right after obtaining my Ph.D. degree. My work until then focused on the degradation of terrestrial organic matter by Amazon River microbes, studied through metagenomics and population genomes at the Federal University of São Carlos (UFSCar) in Brazil and the Institut de Ciències del Mar (ICM) in Barcelona, Spain. Before that, I completed my master's degree at the same institution on the use of computer technologies for ab initio protein production, and I got my bachelor's degree in Biotechnology from the Federal University of Uberlândia (UFU) in Brazil and the Faculty of Science and Technology of the University of Coimbra (FCTUC) in Portugal. My main interests are at the interface of bioinformatics and the wet lab. I have always worked with applied omics and recently expanded into data science and machine learning. My expertise includes (meta)genomics/transcriptomics/proteomics, bioprospecting, and microbial genomics.

What are your future (scientific) plans?

My interests range a bit but always converge on molecular ecology, or even the molecular tools used by microbes to thrive in different environments. So far, my opportunities have been related to academic research in big data analysis and metagenomics. As a long-term goal, I want to become a senior researcher. Besides the possibilities in academia, I have always been curious about the industrial environment, so I cannot exclude that from the list, at least not yet. In terms of projects, I have collaborated with some groups around the world for some time now. Their work is mainly related to the microbiomes of insects, rivers, and the sea. I enjoy these projects because they have a biotechnology background with potential applications, such as bioprocesses, biocontrol, and bioengineering. Continuous development is also one of my mottos. Thus, continuing to collaborate on projects such as AMPSphere is also one of my priorities.

Where can people find you and get in touch?

My publications are mostly listed on Google Scholar and ResearchGate. People can follow interesting projects or code via GitHub. I am always open to discussing science, and a good way to get in touch is via email or Twitter.

BDB-Lab Updates

People. This summer we have three remote interns joining us: Anna Vines, Ariana Thakurdyal, and Nilesh Gupta.

Tool. Our semi-supervised binning tool is now known as SemiBin and is available on PyPI and on bioconda. The project is available at

Manuscript. The GUNC manuscript (a Bork group manuscript that we collaborated on) was published in Genome Biology. See our earlier blogpost on it.

Manuscript. Luis was quoted discussing the future of bioimaging.

Seminar. In April, at the first EMBARK AMR webinar, Luis talked about “Quantifying AMR at very large scales”.

Tutorial. We announced a tutorial series where we will hold online hands-on sessions to train people on using our tools.

Online resource. Svetlana is publicly curating microbiome events.

Looking Forward