Twenty years ago, an enormous scientific effort revealed that the human genome contains 20,000 protein-coding genes, but they account for just 2% of our DNA. The rest was written off as junk – but we are now realising it has a crucial role to play.
When the 13-year-long effort to sequence the entire "book of life" encoded within the human genome was declared "complete" in April 2003, there were high expectations. It was hoped that the Human Genome Project, at a cost of around $3bn (£2.5bn), would yield treatments for chronic illnesses, and shed light on everything that is genetically determined about our lives.
But even as the press conferences were being held to herald the triumph of this new era of biological understanding, this instruction manual for human life had already thrown up a surprise.
At the time, the prevailing belief was that the vast majority of the human genome would consist of instructions for making proteins, the building blocks of all living organisms that perform a bewildering range of roles within and between our cells. With over 200 different types of cells in the human body, it seemed to make sense that each would need its own genes to carry out its necessary functions. The appearance of unique sets of proteins was thought to have been vital in the evolution of our species and our cognitive powers. (We are, after all, the only species capable of sequencing our own genome.)
Instead, it transpired that less than 2% of the three billion letters of the human genome are dedicated to proteins. Only around 20,000 distinct protein-coding genes were found to exist in the long lines of molecules known as base pairs that make up our DNA sequences. Geneticists were astonished to find that humans have similar numbers of protein-making genes to some of the simplest creatures on the planet. Suddenly the scientific world was faced with an uncomfortable truth: perhaps much of our understanding of what makes us human was actually wrong?
"I just remember the incredible shock," says Samir Ounzain, a molecular biologist and chief executive of a company called Haya Therapeutics, which is attempting to use our understanding of human genetics to develop new treatments for cardiovascular disease, cancer and other chronic illnesses. "That was the moment where people started wondering, 'maybe we have the wrong conceptualisation of biology?'"
The remaining 98% of our DNA became known as dark matter, or the dark genome, a mysterious melee of letters with no obvious meaning or purpose. Initially some geneticists suggested that the dark genome was simply junk DNA or the rubbish bin of human evolution – the remnants of broken genes which had long ceased to be relevant.
For others though, it was always obvious that the dark genome was crucial to our understanding of humanity. "Evolution has absolutely no tolerance for junk," says Kári Stefánsson, chief executive of the Icelandic company deCODE genetics, which has sequenced more whole genomes than any other institution in the world. "There must be an evolutionary reason to maintain the size of the genome."
Now, two decades on, we have the first inklings of the role of the dark genome. Its primary function appears to be regulating the decoding process, or expression, of protein-making genes. It helps to control how our genes behave in response to all the environmental pressures our bodies face throughout our lives, ranging from diet to stress, pollution, exercise, and how much we sleep. The study of these effects is a field known as epigenetics.
Ounzain says he likes to think of proteins as the hardware components of life, while the dark genome is the software, processing and responding to external information. As a result, the more we learn about the dark genome, the more we understand human complexity, and how we became who we are.
"If you think of us as a species, we're master adapters to the environment at every level," says Ounzain. "And that adaptation is the information processing. When you go back to the question of what makes us different to a fly or a worm, we've increasingly realised that the answers lie in the dark genome."
Transposons and our evolutionary past
As scientists first began sifting through the book of life in the mid-2000s, one of the biggest challenges was that the non-protein coding regions of the human genome appeared to be littered with sequences of repetitive DNA known as transposons. These repetitive sequences are so ubiquitous that they comprise nearly half the genome in all living mammals.
"Even assembling the first human genome was made more problematic by the presence of these repetitive sequences," says Jef Boeke, who runs the Dark Matter Project at New York University Langone, an academic medical centre in New York City. "Just analysing any kind of sequence is much easier if it's a unique sequence."
Initially, transposons were ignored by geneticists. Most genetic studies chose to focus purely on the exome – the small, protein-coding region of the genome. But over the past decade, the rise of more sophisticated DNA sequencing technologies has allowed geneticists to study the dark genome in greater detail than ever before. One experiment, in which researchers deleted a specific transposon fragment in mice, leading to half of the animals' pups dying before birth, illustrated that some transposon sequences may be critical to our survival.
Perhaps the best explanation for why transposons exist in our genomes could be that they are extremely ancient, dating back to the earliest life forms, says Boeke. Other scientists have suggested that they come from viruses which have invaded our DNA over the course of human history, before gradually being repurposed in the body to confer some useful purpose.
"Most of the time, transposons are pathogens which infect us, and they can infect cells in the germline, the type of cells that we pass on to the next generation," says Dirk Hockemeyer, assistant professor of cell biology at University of California, Berkley. "Then they can get inherited, and lead to the stable integration into the genome."
Boeke describes the dark genome as acting like a living fossil record of crucial alterations in our DNA which occurred long ago in ancient history. One of the most fascinating elements of transposons is that they can move from one part of the genome to another – a behaviour which gives them their name – creating or reversing mutations in genes, sometimes with dramatic consequences.
The movement of a transposon into a different gene may have been responsible for the loss of the tail in the great ape family, which led to our species developing the ability to walk upright. "Here you have this one-time thing that happened which had a huge effect on evolution, giving rise to a whole lineage of great apes including us," says Boeke.
But just as our growing understanding of the dark genome is explaining more about evolution, it can also shed new light on why diseases emerge. Ounzain points out that if you look at genome-wide association studies (GWAS), which search for genetic variations among large numbers of people to identify those linked to disease, the vast majority of the variations linked to chronic illnesses like Alzheimer's, diabetes, and heart disease are not in the protein-coding regions, but in the dark genome.
The dark genome and disease
Panay in the Philippines is best known for shimmering white sands and a regular influx of tourists, but this idyllic setting hides a tragic secret. The island has the highest number of cases in the world of an incurable movement disorder called X-linked dystonia Parkinsonism (XDP). As with Parkinson's disease, people with XDP develop a range of symptoms affecting their ability to walk, as well as their capacity to respond quickly to various situations.
Since XDP was first discovered in the 1970s, it has only ever been found in people of Filipino descent, something which had long been a mystery until geneticists found that these individuals all have the same unique variant of a gene called TAF1. The onset of symptoms seems to be driven by a transposon in the middle of the gene, which is able to regulate its function in a way which causes harm to the body over time. It is thought that this gene variant first emerged around 2,000 years ago, before being passed on and becoming established in the population.
"The TAF1 gene is an essential gene, meaning it's required for the growth and multiplication of all cell types," says Boeke. "When you tweak its expression, you get this very specific defect that manifests as this horrible form of Parkinsonism."
This is a simple example of how some DNA sequences in the dark genome can control the function of various genes, either activating or repressing the process of turning genetic information into proteins in response to environmental cues.
How your genes shape who you are
The dark genome also provides instructions for the formation of various kinds of molecules, known as non-coding RNAs, whose roles include helping to assemble proteins, blocking the process of protein production, and helping to regulate gene activity. "The RNAs produced by the dark genome act as the conductors in the orchestra, conducting how your DNA responds to the environment," says Ounzain.
It is these non-coding RNAs which are now increasingly being seen as the link between the dark genome and various chronic illnesses. The thinking goes that if we consistently give the dark genome the wrong signals, for example through a lifestyle of smoking, poor diet, and inactivity, the RNA molecules it produces can send the body into a disease state, altering gene activity in a way which increases inflammation in the body or promotes cell death. It is thought that certain non-coding RNAs can enhance the activity of or switch off a gene called p53, which normally acts to prevent the formation of tumours. In complex diseases like schizophrenia or depression, an entire cacophony of non-coding RNAs may be acting in synchrony to decrease or increase the expression of certain genes.
But our growing appreciation of the dark genome's importance is already leading to new approaches for treating these illnesses. While the drug development industry has typically fixated on targeting proteins, some are realising that it may prove more effective to try and disrupt the non-coding RNAs which are controlling the genes in charge of these processes.
In the field of cancer vaccines, where companies conduct DNA sequencing on a patient's tumour sample to try and identify a suitable target for the immune system to attack, the majority of approaches have focused only on the protein-coding regions of the genome. However, German-based biotech CureVac is pioneering an approach where they analyse the non-protein coding regions as well in the hope of finding a target which can disrupt the cancer at its source.
Ounzain's company, Haya Therapeutics, are currently pursuing a drug development programme targeting a series of non-coding RNAs which drive scar tissue formation, or fibrosis, in the heart, a process which can lead to heart failure. One of the hopes is that this approach could minimise the side effects which come with many common medicines.
"The problem with drugging proteins is that there are only 20,000 or so in your body, and most of them are expressed in many different cells and pathways which are unrelated to the disease," says Ounzain. "But the dark genome is exquisitely specific in its activity. There are non-coding RNAs which regulate fibrosis only in the heart, so by drugging them, you have a potentially very safe medicine."
The unknowns
At the same time, some of the excitement has to be tempered by the fact that we have barely scratched the surface in terms of understanding how the dark genome functions. We know very little about what geneticists describe as the basic rules – how do these non-protein coding sequences communicate with each other to regulate gene activity? And how exactly do these complex webs of interactions manifest over long periods of time into disease traits, such as the neurodegeneration seen in Alzheimer's?
"We're just at the beginning right now," says Hockemeyer. "The next 15 to 20 years will be all that – identifying specific behaviours in cells that could lead to disease, and then trying to identify the parts of the dark genome which could be involved in modifying these behaviours. But we now have tools to probe into this, which we didn't have before."
One such tool is gene editing. Boeke and his team are currently attempting to learn more about how the symptoms of XDP develop by replicating the TAF1 gene transposon insertion in mice. In future, a more ambitious version of this project could attempt to understand how non-protein coding DNA sequences regulate genes by building chunks of synthetic DNA from scratch and transplanting them into mouse cells.
"We are now involved in at least two projects where we take a huge chunk of DNA that does nothing, and then we try and install all these elements into it," says Boeke. "We place a gene there, put a non-coding sequence just in front of it, and another one far away, and see how this gene now behaves. We now have the tools to actually build bits of the dark genome from the bottom up and try to understand it."
Hockemeyer predicts that as we learn more, the genetic book of life will continue to throw up surprises, just as it did when the first genome was sequenced 20 years ago.
"There are a lot of questions," he says. "Is our genome still evolving over time? Will we be able to completely decode it? We are still in this dark, open space that we're venturing into, and there are a lot of really cool discoveries to be made."