08 July 2024

Protein Space and the Protein Universe: Introduction

[Image: Blue dots and smears in a circular graph, depicting structural links between protein clusters]

Evolution is easier than we think, and one great way to see why is to look at what we know about protein evolution.

Proteins have been evolving on our planet for about 4 billion years. Their appearance almost certainly precedes the beginning of life itself. We still don't know how the whole thing got off the ground, but once the stage was set (in living cells), evolution began exploring Protein Space. As it did so, it slowly created the Protein Universe. Since these two concepts — Protein Space and the Protein Universe — are so central to understanding and picturing protein evolution, we should carefully define what they mean.

Protein Space is the set of all possible proteins. More precisely, it's the set of all possible strings of amino acids up to a particular maximum length. Estimating the size of this set produces numbers so vast that they defy description. As I've described before, while adapting Dan Dennett's metaphor of the "Library of Mendel," possible proteins are far more numerous than elementary particles in the universe.

It's easy to specify a Protein Space: all you need is an "alphabet" and a maximum length. In our current reality, the alphabet consists of 20 amino acids. The maximum length is up to you — I chose 1000 amino acids in order to capture about 80% of known proteins, but the biggest known protein (the aptly-named titin) contains more than 30,000 amino acids. No matter what, you'll get a set of possibilities bigger than your human mind is equipped to comprehend.
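To make that size concrete, here is a minimal sketch in Python that counts every amino acid string from length 1 up to the chosen maximum. The function name `protein_space_size` is my own invention for illustration, and the figure of about 10^80 particles in the observable universe is a commonly quoted rough estimate that I am assuming here, not a number from the sources above.

```python
def protein_space_size(alphabet_size: int = 20, max_length: int = 1000) -> int:
    """Count all strings of length 1..max_length over the given alphabet.

    Python integers have arbitrary precision, so the result is exact.
    """
    return sum(alphabet_size ** k for k in range(1, max_length + 1))


size = protein_space_size()

# Assumption: ~10^80 particles in the observable universe (rough estimate).
PARTICLES_IN_UNIVERSE = 10 ** 80

print(f"Protein Space (max length 1000) has a size with {len(str(size))} digits")
print(f"Bigger than the particle count? {size > PARTICLES_IN_UNIVERSE}")
```

With 20 amino acids and a maximum length of 1000, the count runs to more than 1,300 digits. Even a maximum length of 100 already dwarfs the particle estimate, which is why the exact cutoff hardly matters to the argument.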

Protein Space is therefore a lot like the Library of Babel (which contains texts) and the Library of Mendel (which contains gene sequences), in that it is a fantastical construct that is nevertheless fully grounded in reality. (Because Protein Space is so similar to those two fantastical libraries, it has been named the Library of Maynard Smith in honor of the great evolutionary thinker who first conceived of it.) In other words, the things in each of those libraries are straightforward possible things of our world: books, DNA sequences, amino acid sequences. Protein Space is impossibly large but utterly down to earth.

The Protein Universe is a subset of Protein Space, defined simply as all proteins known to exist (and to have existed).

Like the universe we live in, the Protein Universe is continually expanding, in two different ways. First, previously unknown natural proteins are being discovered. In fact, the past few years will be remembered as an era of explosive growth of the Protein Universe, as large-scale efforts to identify new proteins proceed far faster than any effort to study them in the lab. Second, completely new proteins are being created by enterprising humans.

There is actually an atlas of the Protein Universe, complete with tools to find outposts of protein function. It's big, so big that I had to use the simplified version on my Chromebook. There is a kinder, gentler atlas of the human protein universe — it's a little more fun but of course it's a tiny subset of the full Protein Universe.

We can see that the Protein Universe is huge (as of summer 2024, there are just under 250 million protein sequences in the authoritative UniProt database). It contains protein sequences that are either known to exist on Earth or are at least likely to exist, and it will continue to expand. But it will never contain even a tiny fraction of Protein Space. If you were to randomly sample from Protein Space, the likelihood that you would choose a protein in the Protein Universe is effectively zero.
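A rough back-of-the-envelope sketch shows how close to zero that likelihood is. It assumes a Protein Space with 20 amino acids and a maximum length of 1000, and it treats all 250 million known sequences as if they fell inside that space, which overstates the odds slightly because some known proteins are longer. The function name `log10_hit_probability` is hypothetical, and the calculation works in base-10 logarithms because the probability itself is far too small for ordinary floating-point numbers.

```python
import math


def log10_hit_probability(known_sequences: int,
                          alphabet_size: int = 20,
                          max_length: int = 1000) -> float:
    """Return log10 of the chance that one uniform random draw from
    Protein Space lands on a known sequence."""
    # Exact count of all strings of length 1..max_length.
    space = sum(alphabet_size ** k for k in range(1, max_length + 1))
    # math.log10 accepts arbitrarily large Python integers.
    return math.log10(known_sequences) - math.log10(space)


# Assumption: roughly 250 million sequences, per the UniProt figure above.
log10_p = log10_hit_probability(250_000_000)
print(f"Chance of hitting a known protein: about 10^{log10_p:.0f}")
```

Under these assumptions the exponent comes out near -1293. Growing the Protein Universe a millionfold would move it by only six, which is the sense in which the Universe will never be more than a vanishing fraction of the Space.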

With those definitions established, we can start to explore questions about the nature of the Protein Universe and how evolution built it. Or did evolution discover it?


Image credit: Figure 2a from Durairaj, J., Waterhouse, A.M., Mets, T. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
