29 June 2010

Introns. Let's think about this, people. Part II.

Before we explore what introns are and how they work, let me correct the misuse of my words by one of the ID attack kittens. Months ago, referring to Steve Meyer's claim that introns "are now known to play many important functional roles in the cell," I sought to put intron "function" into context as follows:
The human genome contains at least 190,000 introns (though it's been recently estimated to contain almost 210,000). Together those introns comprise almost 1/4 of the human genome. One fourth. That's 768 million base pairs. And biologists have identified "important functional roles" for a handful of them. How many? Oh, probably a dozen, but let's be really generous. Let's say that a hundred introns in the human genome are known to have "important functional roles." Oh fine, let's make it a thousand. Well, guys, that leaves at least 189,000 introns without function, and gosh, they're snipped out of the transcripts and discarded before the darn things even leave the nucleus.
One critic has interpreted me as claiming that I know that 189,000 introns have no function. That's not my point, and I think most people know that.

No, the point is simply this: finding a "function" for a small number of introns doesn't cut it, at least not if one seeks to claim that introns are generally functional elements in a genome characterized by efficient, streamlined packaging of functional information. And the reason is that introns make up a huge part of the human genome. So even if one had identified function for a thousand of them, that would be about half a percent of the 190,000 or more in the genome. It would be like finding a Yugo that's still running.

But let's back up. What is an intron, and how would we know whether it had a "function" or not?

Picture a gene as a set of instructions for making a protein. The instructions are in the form of a code. The code is copied from the DNA in the genome, safely stored in the nucleus. The copy is made of RNA, not DNA, and that copy is shipped to the cellular worksites where the code is translated into a protein. Simple, so far, because we skipped at least one important step. That copy of the coded instructions is often far, far longer than it needs to be. What's all the excess? The excess code is spliced out of the copy before it even leaves the nucleus. It comes in chunks called introns.

Okay, so there's a little excess information that's removed. So? Heh. Hold on. You need to see just how much code is cut out of some genes. Before we look at some specific examples, keep this in mind: less than two percent of the human genome specifically codes for protein, and around 25% is devoted to introns. Come on, think about it: the genes in your genome are interrupted by chunks of information that are spliced out and discarded before the copy is even sent to the worksite for manufacturing protein. Those chunks of information are more than ten times bigger (in aggregate) than the coding sequences.
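For readers who want to check that arithmetic, here is a minimal Python sketch. It uses only the round numbers quoted in this post (768 million base pairs of intron, about a quarter of the genome, under two percent coding). The variable names are mine, and because the 2% figure is an upper bound, the ratio it prints is a floor rather than an exact value.

```python
# Round figures quoted in the post; variable names are illustrative.
INTRON_BP = 768_000_000        # total intron sequence, in base pairs
INTRON_FRACTION = 0.25         # "almost 1/4 of the human genome"
CODING_FRACTION_MAX = 0.02     # "less than two percent" codes for protein

# Genome size implied by the two intron figures.
genome_bp = INTRON_BP / INTRON_FRACTION

# Upper bound on protein-coding sequence, and the resulting lower bound
# on how much bigger the introns are in aggregate.
coding_bp_max = genome_bp * CODING_FRACTION_MAX
intron_to_coding_min = INTRON_BP / coding_bp_max

print(f"implied genome size:      {genome_bp:,.0f} bp")
print(f"coding sequence, at most: {coding_bp_max:,.0f} bp")
print(f"introns vs. coding:       at least {intron_to_coding_min:.1f}x")
```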

We have good reasons for suspecting that relatively little of that huge pile of digital information participates in what we would call "function" – that's for the next post. For now, to show you just how pervasive intron sequences are in your genes, let's look at two examples, one from a textbook and one from my own research.

Example 1: the Factor VIII gene


Factor VIII is a protein involved in blood clotting, and it's present in mammals and birds (at least) among the vertebrates. Here is a diagram of the gene from Molecular Biology of the Cell:

Image from Molecular Biology of the Cell online, Alberts et al., 2002.

The red stuff is the coding sequence. The yellow stuff is all introns. The gene is quite large: roughly 190,000 base pairs from start to finish. The coding sequence (in the form of 26 exons) is less than 10,000 base pairs. In other words, about 5% of the gene for Factor VIII is devoted to making the protein. The other 95% is introns.
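The percentage above is simple division, sketched here in Python with the post's round numbers. The function name `coding_percent` is my own invention, and since the coding sequence is "less than 10,000" base pairs, the result is an upper bound on the coding share.

```python
def coding_percent(gene_bp: int, coding_bp: int) -> float:
    """Return the percentage of a gene's length that is coding sequence."""
    if gene_bp <= 0 or not 0 <= coding_bp <= gene_bp:
        raise ValueError("expected gene_bp > 0 and 0 <= coding_bp <= gene_bp")
    return 100.0 * coding_bp / gene_bp


# Round numbers from the post for the human Factor VIII gene.
F8_GENE_BP = 190_000
F8_CODING_BP_MAX = 10_000   # "less than 10,000", so an upper bound

f8_coding = coding_percent(F8_GENE_BP, F8_CODING_BP_MAX)
print(f"coding:  at most {f8_coding:.1f}% of the gene")
print(f"introns: at least {100 - f8_coding:.1f}% of the gene")
```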

What functions reside in those 180,000 base pairs of code that are snipped out of the copy and discarded in the nucleus? We don't know. But here are some things we do know. The codes for alternative splicing, which Richard Sternberg points to as evidence for "function" of at least 90% of all introns, are tiny (typically just a few base pairs), and as Sternberg's post notes, only occasionally are these codes more than a short distance from the end of the intron. In other words, those enormous introns in the Factor VIII gene, which average roughly 7,000 base pairs apiece, can harbor tiny codes near their ends which signal the cell to splice in different ways. Folks, that leaves something like 99% of the intronic sequence yet unaccounted for. My assertion is not that we know that the sequence has no function. My point is that we don't know what the overwhelming bulk of that sequence is doing. To say that there are a few thousand base pairs of sequence that even look functional would be a stretch, and we'd still have tens of thousands more to go.

Sternberg mentions microRNAs, which are very interesting genomic elements that influence gene expression. They're called microRNAs for a reason: they're less than 100 base pairs in length. How many microRNAs do you think we can find in the Factor VIII gene? You'll need a lot to make any dent in the huge swaths of discarded sequence in that gene. It's starting to sound like a fiber cereal commercial.
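To put rough numbers on that, here is a back-of-the-envelope budget in Python. The gene figures come from this post. The 40 base pairs of splicing signal per intron end is purely my assumption, chosen to be far more generous than "a few base pairs", so treat the output as an illustration and not a measurement.

```python
# Figures from the post.
INTRON_BP_TOTAL = 180_000   # intron sequence in human Factor VIII
N_INTRONS = 25              # 26 exons imply 25 introns
MIRNA_BP_MAX = 100          # microRNAs are "less than 100 base pairs"

# ASSUMPTION (not from the post): a deliberately generous 40 bp of
# splicing signal at each end of every intron.
SIGNAL_BP_PER_END = 40

# Sequence that splicing signals could account for under that assumption.
signal_bp = N_INTRONS * 2 * SIGNAL_BP_PER_END
unaccounted_bp = INTRON_BP_TOTAL - signal_bp
unaccounted_pct = 100.0 * unaccounted_bp / INTRON_BP_TOTAL

# How many maximum-length microRNAs would it take to fill the remainder?
mirnas_needed = unaccounted_bp // MIRNA_BP_MAX

print(f"splicing signals:  {signal_bp:,} bp")
print(f"still unexplained: {unaccounted_bp:,} bp ({unaccounted_pct:.1f}%)")
print(f"microRNAs needed to fill the rest: {mirnas_needed:,}")
```

Even with the inflated signal size, the splicing codes cover only about one percent of the intron sequence, and filling the remainder would take well over a thousand microRNAs in a single gene.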

Here's one way we could get a handle on the amount of intron sequence that might have an important function. We could look at similar organisms to see if they have the same stuff, and whether it seems to be conserved. If the introns look the same in species that are similar but not extremely closely related, we might start to suspect that they're important for function. And if they vary a lot in size, we might start to suspect that their makeup isn't mission-critical. Let's see whether the introns in this gene are important to birds, for example.

The chicken has a Factor VIII gene, too. The chicken gene is around 1/10 the size of the human gene: it's less than 20,000 base pairs in length. The coding sequence is just a little smaller than in human, between 5000 and 6000 base pairs. What does this mean? It means that the chicken gene is a tiny fraction of the size of the human gene, and the difference is almost completely due to introns. The chicken introns are far smaller. They're not less numerous – as near as I can tell, both genes have the same number of introns – they're just a whole lot smaller.
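Here is that comparison as a short Python sketch, using the round numbers from this post. I am assuming 25 introns in both species (26 exons, and the same intron count as near as I can tell), and I take the low end of the chicken coding range, which makes the chicken introns as large as the numbers allow. The helper `mean_intron_bp` is just an illustrative name.

```python
N_INTRONS = 25  # assumed equal in both species, per the post


def mean_intron_bp(gene_bp: int, coding_bp: int, n_introns: int = N_INTRONS) -> float:
    """Average intron length: non-coding gene length split across the introns."""
    return (gene_bp - coding_bp) / n_introns


# (gene length, coding length) in base pairs, round numbers from the post.
HUMAN_F8 = (190_000, 10_000)
CHICKEN_F8 = (20_000, 5_000)   # gene "less than 20,000"; coding 5,000 to 6,000

human_mean = mean_intron_bp(*HUMAN_F8)
chicken_mean = mean_intron_bp(*CHICKEN_F8)

print(f"human mean intron:   {human_mean:,.0f} bp")
print(f"chicken mean intron: {chicken_mean:,.0f} bp")
print(f"human introns are about {human_mean / chicken_mean:.0f}x longer")
```

Because the chicken figures are upper bounds on intron content, the ratio printed here is, if anything, an underestimate.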

Do you see how the "introns have function" thing is a problem here? It's not helping us understand the size of the introns, which dwarf the coding sequence in the human, and are suddenly an order of magnitude smaller in a bird. Whether you're a Darwinian hyperadaptationist or just a fan of intelligent design, you're a long way from providing a coherent explanation for this pattern of arrangement of genetic information.

And I didn't pick the most outrageous example. Look up the human dystrophin gene someday. Reputed to be the biggest gene in nature, it's 2.4 million base pairs long. Something like 15,000 base pairs are devoted to making the protein. Yeah. You do the math.

Example 2: the mDia1 gene


This gene encodes a signaling protein that is one of our central research topics. The gene is about 104,000 base pairs long, and the coding sequence is distributed over about 26 chunks (again, these are the exons) which account for less than 6,000 of those 104,000 base pairs. Among the 25 or so introns are two very large ones: right at the beginning of the gene is an intron over 30,000 base pairs long, and about halfway through is another of about 35,000. So, my favorite human gene is about 6% coding sequence. The rest is composed of introns, including two monsters, either of which could easily harbor the entire chicken Factor VIII gene.
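A quick Python check of those mDia1 numbers, again with the round figures from this post and variable names of my own choosing. The intron lengths are the approximate values given above ("over 30,000" and "about 35,000"), so the output is a rough picture only.

```python
# Round numbers from the post.
MDIA1_GENE_BP = 104_000
MDIA1_CODING_BP_MAX = 6_000          # "less than 6,000"
BIG_INTRONS_BP = (30_000, 35_000)    # the two largest introns, approximate
CHICKEN_F8_GENE_BP_MAX = 20_000      # whole chicken Factor VIII gene, upper bound

# Share of the gene that is coding, and share taken up by the two giants.
coding_pct = 100.0 * MDIA1_CODING_BP_MAX / MDIA1_GENE_BP
big_intron_pct = 100.0 * sum(BIG_INTRONS_BP) / MDIA1_GENE_BP

# Could each large intron hold the entire chicken Factor VIII gene?
each_fits_chicken_gene = all(bp >= CHICKEN_F8_GENE_BP_MAX for bp in BIG_INTRONS_BP)

print(f"coding sequence:       at most {coding_pct:.1f}% of the gene")
print(f"two largest introns:   about {big_intron_pct:.1f}% of the gene")
print(f"each holds chicken F8: {each_fits_chicken_gene}")
```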

I think that's enough genomics for one overlong post. I hope the point is clear. Those who wish to assign functions to introns need to explain huge stretches of non-coding DNA, sometimes tens of thousands of base pairs long, vastly more than any splicing code or microRNA collection or anything else can account for. I'm not saying I know they do nothing. I'm saying that those who wish to describe genomes as wonders of information storage have an awful lot of work to do.

Next time: the kinds of "functions" we do find in introns.