05 August 2023

Contemplating libraries in biology. Not that kind. Not that one either.

What is a library? If you ask a biologist (especially a molecular biologist) this question, they are likely to ask for clarification. In their work, they are likely to make regular use of two very different kinds of libraries.

The first is the kind that we've had for millennia: a collection of books, journals, and media that is ordered and curated by people. These are the OG libraries, with 'book' at the very root of the word. They're rapidly evolving in our digital world, but I think they are still essentially what they've always been. Your friend the molecular biologist may not regularly go to a separate room or building to find materials, but they will use the library often.

The second is an extension of the OG concept of a library, but is still called a 'library' by your friend. It contains information, perhaps in vast amounts, but is not ordered or curated. Crucially, it is a specific collection of a particular type of information: genetic information. And while it's neither ordered nor curated, it is physical, and is designed to be searched. The contents of the library might be DNA sequences (genes or even just chunks of some interesting genome) or protein sequences. Unlike your favorite public library, this one doesn't come with a search feature: you have to do that yourself. The process of searching a library is called screening. Your molecular biologist friend can go to the institutional library to read about these kinds of libraries, and find techniques on how to screen one, then perhaps go to a colleague or a vendor to obtain a library. Or she will obtain tools to make one herself.

In my previous post, I talked of an even more radical extension of the concept of a library: a collection of all the versions of any kind of text (a book, a genome, a set of proteins). This is a library in the sense that it is a collection, and it could be ordered (alphabetically, for example) but it really can't be curated. In fact, this conception of a library is not curated, and that's the point. These libraries only exist in principle, because they are so incomprehensibly vast that they could not exist physically (the universe isn't nearly big enough).

The canonical example of this kind of library is the Library of Babel. It was conceived in a short story (by Jorge Luis Borges), based on ideas that had existed before. What matters to us here is how the library was "constructed." The author arbitrarily defined a 'book' as a string of text of a particular length. (Information-wise, that's all any book is, if you ignore pictures.) He defined his alphabet as 25 characters. Then he envisioned every single possible combination of those characters in a string of that length. This means, for example, that the library contains the full text of Virginia Woolf's "A Room of One's Own," but it also contains a version with all the names changed and a version that ends with Sonnet 130 (one of those obnoxious "Dark Lady" sonnets), and indescribably many versions with alterations that eventually obliterate the essay entirely. But importantly, the library contains almost exclusively gibberish. Borges' story describes people who navigate this universe looking for meaning. Spoiler alert: most go mad.
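To make the construction concrete, here is a minimal Python sketch of a toy version of such a library. The three-character alphabet and the book length of 3 are chosen purely for illustration (they are not Borges' numbers), and `tiny_library` is a hypothetical helper name, not anything from the story.

```python
from itertools import product


def tiny_library(alphabet, length):
    """Return every possible 'book': each string of exactly `length`
    characters drawn from `alphabet`, with repetition allowed."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


# Toy assumptions: a 3-character alphabet (a, b, space) and 3-character books.
books = tiny_library("ab ", 3)

print(len(books))   # 3 ** 3 = 27 books in total
print(books[:5])    # the first few books, most of them gibberish
```

Even this toy library is mostly nonsense, and every extra character of book length multiplies its size by the size of the alphabet. With 25 characters and books of any real length, listing the contents stops being possible, which is why the real thing can only exist in principle.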

Such a library is a useful thought experiment in contemplating the universe of possibilities in biology, and especially when considering the universe of possible texts written in DNA or in protein sequences. As I wrote yesterday, Dan Dennett conceived the Library of Mendel, which is just like the Library of Babel but with a different (smaller) alphabet: the DNA alphabet of A, G, T, and C. Then I asked us to consider a library of protein sequences, with an alphabet of 20 letters (the 20 amino acids specified by the genetic code). The message of yesterday's post was simply this: all of those libraries are so large that we are tempted to invent new words to somehow communicate their vastness. That vastness, I argue, can mislead us into a sense that evolution, which must explore libraries of indescribable magnitude, is so hard that it is effectively impossible.
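The arithmetic behind that vastness is simple: a library of every string of length n over an alphabet of k letters holds k^n entries. Here is a rough Python sketch comparing the three alphabets. The sequence length of 100 is an arbitrary choice for illustration, the function names are hypothetical, and the figure of about 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate, not a number from this post.

```python
import math

# Alphabet sizes mentioned in the post.
ALPHABETS = {
    "Babel (text)": 25,
    "Mendel (DNA)": 4,
    "Maynard Smith (protein)": 20,
}

# Assumption: a commonly quoted rough estimate, as a power of ten.
LOG10_ATOMS_IN_UNIVERSE = 80


def library_size(alphabet_size, length):
    """Exact number of distinct strings of `length` letters (k ** n)."""
    return alphabet_size ** length


def log10_library_size(alphabet_size, length):
    """Order of magnitude of the library size, as a power of ten."""
    return length * math.log10(alphabet_size)


LENGTH = 100  # arbitrary illustrative length, short for a real gene or protein

for name, k in ALPHABETS.items():
    exponent = log10_library_size(k, LENGTH)
    verdict = "more" if exponent > LOG10_ATOMS_IN_UNIVERSE else "fewer"
    print(f"{name}: about 10^{exponent:.0f} entries, {verdict} than atoms in the universe")
```

Under these assumptions, even a protein library restricted to sequences of just 100 amino acids has far more entries than the universe has atoms, and real proteins are often several times longer than that.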

Now, in yesterday's post I suggested the "Library of Crick" as a name for the protein library, since Francis Crick was instrumental in unraveling the genetic code (DNA via RNA to the amino acids that make up proteins). But the protein sequence library has already been conceived, by a legendary evolutionary biologist named John Maynard Smith. One of today's legendary evolutionary biologists, Frances Arnold of Caltech, wrote beautifully about the protein universe, referring to Dennett and Borges, in 2011 in the newsletter of the American Society for Microbiology (ASM). She dubbed this library the Library of Maynard Smith, and that will be its name henceforth.

Prof Arnold's essay seems to have been lost when the newsletter (called Microbe) ceased to exist. The ASM site shows a new newsletter and seems not to have archives of Microbe. I hope I'm wrong, but for now you can find the essay on Prof Arnold's site. I can't tell whether it's licensed to share. The piece mentions that it is "one of a series that are adapted from an upcoming ASM Press book on Darwin, evolution, and microbiology." I haven't seen the book yet but would buy it the second I did!


Image Credit: "The Bodleian Library, Oxford," Line engraving by J. Le Keux after F. Mackenzie, 1836. From Wellcome Images, public domain.
