The Genomic Landscape circa 2012 – Eric Green


Andy Baxevanis:
Okay, good morning everyone and thanks for coming. My name’s Andy Baxevanis and on behalf
of my co-chairs Tyra Wolfsberg and Eric Green I’d like to welcome you to this 10th edition
of Current Topics in Genome Analysis. This course is really intended to provide a survey
of major areas within the fields of genomics and bioinformatics and the individual lectures
are all going to be presented by our colleagues who are leaders in their respective fields.
For those of you who have not been with us before, we hope that this course will help
bring you up to speed in areas that are becoming more and more prominent in biological research
and for those of you who are joining us once again we hope that the lectures fill in some
of the gaps in your background and update you on some important changes in genomic technologies
and approaches since the last time this course was offered two years ago. Before turning the platform over to today’s
speaker, I’d like to spend a few minutes going over some logistical information for the course.
There are 13 lectures in the series, an hour and a half each starting today and ending
on April 25. We’ll be meeting here in the Lipsett Amphitheatre promptly at 9:30, please.
You’ll notice hopefully from the course syllabus that you picked up on the way in that all
of the lectures are intermingled between the laboratory-based and the computationally-based
lectures and we hope that this serves to convey to all of you that you really need to use
these kinds of approaches in concert with one another to do cutting edge biological
research in the future. Now, one of the primary ways we’re going to
be providing you information over the next 13 weeks is through the course’s website,
which you can just find at genome.gov/course2012. From the main page there are a series of links
here that will take you to the syllabus so you’ll see what all the lectures are and the
handouts and, well, our intent is that the course handouts will be put up on the website
a couple of days before each lecture, allowing you to download them, print them out and read
ahead. Just by way of reminder, we won’t be having copies of the handouts available here
in the lecture hall. So, please be sure to print out a hard copy before you come and
join us in the hall. Of course, we hope that you’ll be able to
join us in person each week and have the opportunity to interact with all of the lecturers but
if you happen to miss a lecture, we’ve made arrangements to have each one of the lectures
videotaped and once the YouTube version is available on Genome TV you’ll be able to watch
that at your desktop and we anticipate that the lectures will be available probably about
one to two weeks after the live lecture. There’s also a mailing list for the course that many
of you have already subscribed to. If you haven’t, I strongly encourage you to subscribe.
We’ll be sending out reminders of each of the upcoming lectures as well as any information
about changes in the schedule or cancellation. It is winter time; there undoubtedly will
be a snow day somewhere along the line so we’ll give you a heads up before you come
to the hall if there’s any changes in the schedule. With respect to continuing — medication,
that’s good —

[laughter]

Andy Baxevanis:
— [laughs] continuing medical education —

Male Speaker:
— Same thing.

Andy Baxevanis:
— credits — it could be the same thing. [laughs] Here is the accreditation statement:
you can earn one and a half credits per session for a maximum of 19.5 AMA PRA Category 1 Credits
trademark for the course. So any physicians who are in the hall, please make sure to sign
in on the sign-up sheets that are at the back of the hall to get your CME credits. You actually
have to be in the hall to earn the credits; you can’t earn the credits by watching the
videos online. Just by way of disclosure, none of the three of us are — as the planners,
have any financial interests or relationships with a commercial entity that is relevant
to the course. One final detail, if you’re carrying a mobile
phone, BlackBerry, pager, please take a moment to put them on silent, please just as a courtesy
to the speakers. Okay, so with that by way of introduction, it’s my pleasure to introduce
to you today, our first speaker in the course, Dr. Eric Green, one of our, my fellow course
organizers. He is the director of the National Human Genome Research Institute. Prior to
his appointment as NHGRI director in 2009, he served for many years as NHGRI’s scientific
director beginning in 2002 and I had the pleasure of serving as his deputy during his time in
that role. He was also the founding director of the NIH Intramural Sequencing Center, a
state of the art DNA sequencing facility that’s played an important role in the advancement
of genomic science, particularly in the area of comparative genomics, something that we’re
going to be talking about many times throughout the next 13 weeks. During the almost two decades that he spent
directing his own independent research program, he and his group made major contributions
towards our understanding of the human genome, having had significant involvement in the
sequencing of the human genome going back to the very beginning of the Human Genome
Project and having developed technologies and strategies for the large-scale analysis
of vertebrate genomes which really have provided us great insights into genome structure, function
and evolution. Because of his work in the field of genomics,
Eric’s received numerous awards and recognitions including his induction into the American
Society for Clinical Investigation and the Association of American Physicians. Today,
Dr. Green will be presenting his perspective of the current genomic landscape, thereby
setting the stage for many of the talks that will follow his over the next 13 weeks. Those
of you who have had an opportunity to listen to Eric lecture in the past already know he’s
a wonderful speaker and I’m very sure that you’re very much going to enjoy today’s talk.
So, with that, please join me in welcoming today’s speaker, Dr. Eric Green.

[applause]

Eric Green:
Thank you, Andy. Let me, let me start out by offering my own personal thanks to Andy
and Tyra for organizing this series. I am honorifically included as one of the three
organizers. I did almost nothing, essentially nothing, but the historic involvement I think
is why my name is still left as one of the organizers. I reflect back, Andy and I have
— started this series back in the late ’90s. If our numbers are right, this is the 10th
time we’ve done this series and it just started one day over a discussion I had with Andy
back a long time ago, we said, “Do you think people would be interested in hearing sort
of a survey set of lectures about genomics?” And sure enough, it’s been wildly successful
now in its 10th iteration and by any metrics, we think this outreach effort we do at NHGRI
on behalf of the NIH community and actually broader now with the reach we get by the web
and by our YouTube channel is well worth the effort we put into it. But, the thanks really
should go to them for organizing this series. I wanted to also start off with two disclosures.
The first disclosure is, as I put this lecture together I realized that the topic was even
— my title was a little more grandiose than what was realistic for this lecture. So, in
fact, the genomic landscape is particularly huge, so I’m going to really pretty much limit
my lecture to the human genomic landscape which you’ll see after about 70 minutes or
75 minutes is also incredibly huge and just is enough just to deal with just the human
side of it. So that’s disclosure number one. Disclosure number two is that I’m fairly boring
and I have no relevant financial relationships with commercial interests, so I had to do
that for the CME part. So with those disclosures in mind, let me
tell you what my plans are for this lecture and they really have evolved in the 10 times
I’ve done this, in part because there’s just been so much that’s happened in genomics.
So I’m actually going to start by providing a historic context for genomics and particularly
human genomics, and I’m going to run that through the Human Genome Project as if that’s
a historic event in the distant past, amazingly enough because it feels like yesterday. I
want to spend actually the bulk of the time setting the stage for what has gone on in
genomics since the end of the Human Genome Project, and in doing so, setting us up for
a lot of the other lectures you’re going to be getting in more specific areas. Then I’m
going to end the talk giving you sort of a landscape view towards the future. So really this lecture, which was deliberately
designed as the first lecture of this series, is just a big tour. It’s a tour of the past,
it’s a tour of the present and it’s a projection into the future. It really is and I really
am nothing more than a warm-up act here for the other 12 lecturers that you’re going to
hear from; they’re going to really give you a lot of meat and details. I’m really just
setting the context for everything you’re about to hear. Well, starting at the beginning of genomics,
there’s a rich history and a lot of territory I could potentially cover. There are
many places that I could highlight in the history of genetics and genomics. And in thinking
about what to really emphasize, I really wanted to ensure that the examples
I gave in many ways set the context for what you’re going to hear in the coming weeks.
In picking specific examples, I guess you could and probably should start with Mendel
and his contributions to untangling the basic principles of genetics, which nowadays are
incredibly relevant in thinking about the genetic basis of human disease. We got to
get into DNA a little bit and so Miescher deserves credit for discovering the chemical
of DNA. But then people like Avery and his colleagues deserve credit for figuring out
that DNA was actually the hereditary material, figuring that out by doing all those strange experiments
with bacteria and injecting them in different forms into mice and rats and seeing which
was lethal and which was not. But, if I had to pick one single historic
accomplishment that set the stage for genomics it would clearly have to be Watson and Crick
and the discovery of the double helical structure in 1953. Arguably this publication and this
accomplishment, I believe, was the single most important scientific publication, scientific
discovery of the last century because it just set the stage for so much of what was going
to take place in biomedical research since 1953. And in many ways, even though the word
genomics had certainly not been invented then, it really set the stage for genomics because
it was through the insight provided by the double helical structure of DNA that it immediately
became apparent how it is that DNA was the information molecule necessary for biological
life. And it also then set up a series of studies that answered innumerable questions
related to how it was that DNA encoded information for making the building blocks
of cells, proteins in particular. Coming out of that, of course, was the central
dogma of molecular biology, that DNA made RNA and made protein. We now know it’s a lot
more complicated than that but at the time it was an incredibly important fundamental
principle to appreciate. And also coming out, if you fast forward to the 1960s, was of course
the elucidation of the genetic code and understanding how it was that DNA encoded the information
for proteins. And I can’t help but point out that as I wandered here from the front of
the Clinical Center and back through here towards the lobby of Lipsett Auditorium, I passed
by the remarkably nice museum display about Marshall Nirenberg and his accomplishments
in elucidating the genetic code, which puts the NIH firmly in the historic context of
this important discovery on the way to our knowledge of how DNA works. And his work here
at NIH and the intramural program in elucidating the genetic code will forever be an important
part of biomedical history. You can then fast forward of course to the
late 1970s, early 1980s, with that came the molecular biology revolution. We learned how
to manipulate, clone DNA and be able to use it in all sorts of applications for biomedical
research. And the molecular biology revolution also included the development of methods
for actually sequencing DNA. Because of course what we had learned along the way was that
DNA was incredibly simple, basically consists of four chemicals. We don’t even have to say
the chemicals because we could just abbreviate them as their first letters, G, A, T and C.
And with the ability to actually sequence long stretches of DNA, it became obviously
possible to really now start getting at the underpinnings of how DNA becomes an information
molecule. So by about the, you know, late 1980s and
then certainly ideas started bubbling up, recognizing that the whole concept of a genome,
the entire genetic complement, the entire DNA complement of a cell, of an organism and
so forth is a finite problem, and that the human genome, for example just consists of
three billion of these Gs, As, Ts and Cs. And with methods available for now being able
to sequence DNA and now increasingly more sophisticated molecular biology methods available
for being able to manipulate large stretches of DNA in a laboratory, the audacious idea
of actually determining the complete sequence of the human genome, all three billion letters,
sort of came to the forefront. And so indeed, that set the stage for the next revolution,
the genomic revolution which really took place throughout the 1990s. And of course, the centerpiece for the genomic
revolution was this endeavor, the Human Genome Project, which began about 21 years ago. This
large international effort, highly coordinated across multiple countries but with a major
leadership role provided by people here at the NIH had as a major focus around just getting
the complete sequence of a reference human genome. I will tell you, because this is, I
guess, where I sort of entered the picture a little bit: I was a post-doctoral fellow, a recent
M.D., Ph.D. graduate, and got involved, as a post-doctoral project, working on some of the
earliest technologies that were then used in the Human Genome Project, and
got involved in the project itself as part of one of the first funded centers, then at Washington
University, and got out of the gates on day one in the Human Genome Project and then participated
in it throughout. I’ll tell you a couple things about the genome
project when it began as a participant’s view, especially a young, impressionable post-doctoral
fellow participant, is that we had no idea what we were doing. There really was this
audacious goal at the end of sequencing the human genome and there were some cursory methods
but there was fundamentally no plan how we were actually going to get there, it was just
a compelling, incredible sense of purpose of what this was all about and then sprinkle
in a little bit of fear of actually not being sure you’re actually going to be successful.
It’s the perfect mixture if you bring a lot of good people together, they figure out a
way. And sure enough that way was figured out more rapidly than we could’ve ever anticipated. A short 10 years later, of course, came the
announcement that a draft sequence of the human genome had been generated, capturing lots of
attention from leaders around the world. Some of our current leaders even, I bet you can
even see the White House at the time was involved and even the popular press picked this up
as a story of major historic significance. That moment was a press moment in many ways,
the scientific moment comes with publications and of course just a few months later came
out this historic publication in Nature 11 years ago reporting the initial analysis and
generation of that draft sequence of the human genome. It wasn’t a complete story then, it
was just a draft sequence, and lots of refinement needed to be done; back to the laboratories
we all went. We refined that sequence and then completed it and in April of 2003 declared
completion of a reference sequence of the human genome and with that, came an end to
the Human Genome Project. So that’s a rapid pace historic review of
what took place getting us through the Human Genome Project. It was also important to take
a pause for a minute and think about what the genome project was and what it wasn’t.
What it fundamentally wasn’t was the completion of the field of genomics; if anything it was
really just the beginning. And there has been a tremendous amount that’s been written about
the historic significance of the genome project. I could show you lots of different slides,
I happen to like this one which is fairly recent from April of this past year where
this individual Adam Rutherford in The Guardian who writes a lot about genomics pointed out
how the Human Genome Project was just the starting point. And he wrote the following:
he said, “The mistake that we often make,” and I’ve heard lots of people make it, “is
we say that the Human Genome Project was an end point.” In fact, the Human Genome Project
was a pregnancy. Ten years later we now have a clue, what we don’t yet know, the Human
Genome Project may be finished but understanding our genome is only just beginning. And I would
actually say that is a very important thing to keep in mind and in many ways sets the
stage for this entire lecture series, is that we have the genome project as a starting line
and we have so much in front of it and so much has developed. And what this course is
going to do, what these lectures are going to do, is teach you a little bit about some of the topical areas
and drill into detail on those, because you will quickly see, and I will show you
over the next hour, all the new opportunities this is creating and all the new challenges
that still remain. So with that as the starting line, the starting
point if you will, since the end of the genome project around here, just wave after wave
after wave, every single year there seemed to be accomplishment after accomplishment.
And what you are looking at here is actually the pull out piece from the reprint that I’m
going to talk about later. That’s available in the back of the amphitheater and this is
just sort of a nice view of it, but you have a version that you can now pull out and put
up on your refrigerator in your laboratory or your office. But it does illustrate the
fact that the genome project as a starting point really did represent the beginning of
a tremendous number of accomplishments that have taken place since then. And so that’s what I want to now transition
into, tell you about what have been some of the major accomplishments in genomics since
the end of the genome project. Now I will focus again, my attention on human. I will
focus my attention on health, this is the National Institutes of Health, and needless
to say that’s the area that obviously we’re focusing on in particular at the National Human
Genome Research Institute. And in fact, when we sort of looked at that moment in time,
where genomics was going, we were very much aligned with what the popular impression was
as well, is that the reason we did the genome project was because we saw the opportunities
to understand how our genome worked and figuring out how to use that knowledge for improving
the way we practice medicine. And so, really essentially as soon as the
genome project ended, the popular press picked up on this and even the scientific press picked
up on this, marrying the idea of genomics and medicine leading to phrases such as “Genomic
Medicine,” featured here on these two covers. There are other phrases that have been applied
to this, but by genomic medicine, what I mean is health care tailored to the individual
based on genomic information. Not treating patients as generic individuals, but having
some insights about their own unique genetic makeup that may allow you to tailor how you
take care of them based on that genomic knowledge. That’s largely synonymous with things like personalized
medicine, individualized medicine, you’ll even hear it referred to as precision medicine.
You could argue around the edges about what these different phrases mean; the one
we tend to gravitate to, I tend to use is genomic medicine. But they’re all largely
meaning fundamentally the same thing: personalizing, individualizing in a more precise fashion
how you take care of patients based on knowledge of their own unique genomic makeup. Well you might imagine as the National Human
Genome Research Institute we take this sort of a thing very seriously and see as our mission,
having accomplished the Human Genome Project, figuring out how to make genomic medicine
a reality. And so we think a lot about that journey that we now are all on. Defining the
path that’s going to get us from that starting point of the Human Genome Project to the finish
line, vaguely defined as realizing genomic medicine as I just defined it. Now we go into
this journey, which I admit will be a long, hard journey, a little optimistic. We were
quite successful initially at being able to come up with a successful end to the Human
Genome Project, and while the number of steps might be unknown, and I wouldn’t even pretend
to define all of them, I remain reasonably optimistic that this is going to be a successful
journey. But we have to prove it in order to sort of really put a check mark there.
I believe that carrying out this journey is sort of required if we’re going to truly
fulfill the promise of why we sequenced the human genome in the first place. But this is a very useful framework for us
to think about and the steps that are going to be needed as we inch closer and closer
to realizing genomic medicine. And the other thing about those steps, and this is what
I’m going to describe to you, is as you make this journey, with every step you get a little
more data, a little more knowledge. And with that comes a little bit more insights about
disease and about medicine and how you might be using genomics to improve the way you take
care of patients. So what I’m going to describe to you now are
five of these steps. These are not comprehensive in nature, these five steps I’ve just chosen
as major topical areas and you’re going to see they relate in many ways to things you’re
going to hear about more and more from future lectures. But these five are just meant to
illustrate the kinds of steps that are needed to make this journey a successful one. So let’s start with something very fundamental:
understanding the function of the human genome sequence. Now, let me remind you once again
what the genome project was about and what it wasn’t about. The genome project was really
about first mapping the human genome, organizing our understanding of it and getting organized
in a fashion that would allow us to sequence the genome and then going through a phase
where we actually rolled up our sleeves and actually determined the order of the three
billion letters in the human genome. That was the genome project. That was not about
understanding that sequence, because interpreting the human genome is really very much an activity
that’s going to go well beyond the Human Genome Project and it’s going to take a number of
years that I couldn’t even guess right now, and I wouldn’t dare want to put a specific
number on it. And the reason why is just a reminder that
DNA sequence is fairly complicated stuff. On the one hand it’s simple because it’s only
four letters. It’s just complicated because it just goes on and on and on and on. Shown
here is real sequence of the human genome; it’s only about .0001 percent of the human
genome. But it immediately reveals the fact that coming out of that text is hardly an
immediate interpretation of how it is that it actually functions. Well, when it came time to start looking at
the human genome sequence and trying to interpret it to figure out its function,
you had to go with what you knew at the time, when the genome project ended. And indeed
that’s what we did. What did we know the most about at that time? Well, the thing we knew
the most about when the genome project ended was genes. We knew about coding
sequences. We actually were fairly sophisticated, in part because of Marshall Nirenberg’s contributions,
to understanding how it is that DNA actually was able to encode information for making
proteins. And we were even more sophisticated because of the great knowledge that we had gained about
intermediate molecules such as RNA. And we even knew that genes consisted of exons,
such as these colored boxes, that actually had the exact nucleotides that were encoding
specific amino acids, and they were broken up every once in a while by blocks of DNA
that were introns. And we even knew that when RNA got made you would then splice together
all your exons to sort of make them adjacent. And there was even alternate splicing whereby
some exons got put into messages and others did not. So we had some knowledge about that but even
more important we sort of fundamentally understood the language of DNA when it came to encoding
information about proteins because we had the famous genetic code lookup table that
you see right outside this auditorium. And so with that came quite a bit of information
about being able to go through a sequence and use various tools: both knowledge
about RNA sequences, for which we were able to generate large datasets, but also predictive
tools, computer tools that allow us to go through and systematically review the human
genome sequence and now just start highlighting all the coding sequences that we were aware
of. And so this was actually a situation where
we got a lot out of the gates fairly soon because of our knowledge of how genes work.
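
As a rough illustration of the two ideas just described, splicing exons together and then reading the result through the genetic code lookup table, here is a minimal Python sketch. The toy gene, its exon coordinates, and the function names are invented for this example and do not come from the lecture; only the codon table itself is the standard genetic code.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical toy gene: three exons separated by two short introns.
TOY_GENE = (
    "ATGGCC"        # exon 1, positions 0-6
    + "GTAAGTCAG"   # intron 1
    + "AAAGGT"      # exon 2, positions 15-21
    + "GTAAGTCAG"   # intron 2
    + "TGGTAA"      # exon 3, positions 30-36
)
TOY_EXONS = [(0, 6), (15, 21), (30, 36)]  # 0-based, end-exclusive


def splice(gene, exons):
    """Join the chosen exon intervals, dropping everything in between."""
    return "".join(gene[start:end] for start, end in exons)


def translate(coding_sequence):
    """Read codons three letters at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(coding_sequence) - 2, 3):
        amino_acid = CODON_TABLE[coding_sequence[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # All three exons included.
    print(translate(splice(TOY_GENE, TOY_EXONS)))  # MAKGW
    # Alternate splicing: the middle exon is skipped.
    print(translate(splice(TOY_GENE, [TOY_EXONS[0], TOY_EXONS[2]])))  # MAW
```

The second call shows the point about alternate splicing in miniature: the same stretch of DNA yields a different protein depending on which exons make it into the message.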
Now, I don’t mean to imply for a minute that we fully understand the full repertoire of
genes, nor do I mean to imply for a minute that we fully understand the complexity associated with gene expression:
which genes get expressed where, all the alternate forms, and all the different
issues associated with gene expression are an incredibly large set of complicated
topics. And one lecture you’re going to hear is from Paul Meltzer, who will later in
this course describe for you some of the genomic approaches that are
being used for better understanding gene expression. But beyond these yellow highlighted sequences
that represented coding sequences, of course, came tremendous desire to understand what
else is out there in DNA sequence besides DNA directly coding for protein. And there
were lots of clues coming up that there was a lot of important choreography being
orchestrated by non-coding DNA sequences, non-coding meaning they didn’t code for proteins,
but how are we going to find those? We didn’t have a genetic code, we didn’t have knowledge
about many of these things and we knew lurking in DNA sequence were lots of surprises about
how DNA might function. Well here we actually needed a consultant,
because we needed help. We didn’t really have computer tools available, we didn’t have a
lot of knowledge, and we sort of surveyed the available consultants, and ironically the consultants
that had — the consultant that had taught us the most about what we needed at that moment
in time even pre-dated Mendel. And in fact, the consultant we needed was Darwin because
it was Darwin who actually laid the foundation intellectually for what was going to be needed
for being able to take on the next challenge in genomics. And Darwin said a lot of things
and this is a quote that allegedly was attributed to him. Although recently I’ve been told that
it was unclear whether he really said it or not, but it sets the stage for what his contributions
were, where the quote says, “It’s not the strongest of the species that survives nor
the most intelligent that survives, it’s the one that’s most adaptable to change.” Because
what Darwin taught us was that species are able to somehow adapt to change in environments.
He didn’t know about genomes, he didn’t know about DNA, but he knew something was going
on. The something that was going on was that DNA was changing and that there were evolutionary
processes in play that scrambled up the DNA and then individuals within species that adapted
well to the environment because of those genetic changes are the ones that thrived, the ones
that were able to survive and as a result they adapted the best. But all this was being kept track of in their
genomes. And so a contemporary genomicist, in this case wrote, “For the last three and
a half billion years evolution has been taking notes and those notes are all kept in the
genome sequences.” So, what was very clear was that we could learn a lot about our own
genome by reviewing the laboratory notebooks of species that have undergone various biological
innovations to look for things that have changed and things that have not changed. In particular,
the things that have not changed that are common across many, many, many species are
likely to be those that are the most biologically important; otherwise evolution would have
stepped in and changed them. So the whole notion of comparative genomics
became a very important area of research immediately following the Human Genome Project and it
was the realization that we are just, as a species, this small, teeny, little, insignificant
twig in a very sophisticated, complicated tree of biological innovation just across
the mammals and the vertebrates. And the genome project recognized this and, in parallel
with sequencing the human genome, other species had been selected for genome sequencing
exactly for that purpose. But these had been mostly biological systems such as mouse and
rat and so forth that had been used as laboratory models and whose importance we saw. But we recognized
that to fully harness the power of evolution’s notebooks and to truly
take advantage of these lessons that Darwin has taught us, we needed the statistical power
to be able to go in and look at lots and lots and lots of species’ genomes and be able to
figure out what has and has not changed over tens of millions of years of evolutionary
time. And to do so you wanted to sample many different branches, not just ones you were
interested in; you actually wanted all different branches across the phylogenetic tree of mammals.
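
The core idea here, looking across many species for positions that have not changed, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the species labels and aligned sequences below are invented, the function names are hypothetical, and requiring exact identity in every column is a crude stand-in for the statistical models that real comparative analyses rely on.

```python
# Hypothetical, already-aligned toy sequences (one row per species).
TOY_ALIGNMENT = {
    "human": "ATGGCCAAAT",
    "mouse": "ATGGCTAAAT",
    "dog":   "ATGGCGAAAC",
}


def conserved_columns(alignment):
    """Return the column indexes where every species has the same base."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i
        for i in range(length)
        # A gap ("-") is treated as a change, not as conservation.
        if sequences[0][i] != "-" and len({seq[i] for seq in sequences}) == 1
    ]


def fraction_conserved(alignment):
    """Share of alignment columns that are identical across all species."""
    length = len(next(iter(alignment.values())))
    return len(conserved_columns(alignment)) / length


if __name__ == "__main__":
    print(conserved_columns(TOY_ALIGNMENT))   # [0, 1, 2, 3, 4, 6, 7, 8]
    print(fraction_conserved(TOY_ALIGNMENT))  # 0.8
```

With only three species, many columns match by chance, which is one way to see why sampling many branches of the tree matters: each added species makes an unchanged column more meaningful.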
And in fact, that’s exactly what has taken place. And fast forwarding, lots of studies have been
done in order to sort of accomplish the kinds of comparative genomic studies that were of
interest. Upwards of 30 mammalian species have now had their complete genomes sequenced.
Among a huge literature that you could find, I would just point to this recent paper that
came out in Nature describing what I think are the most robust analyses so far across
as many species, and this includes 29 mammalian species. And with that has come tremendous
insights about the most highly conserved parts of the human genome based upon analyses of
many different mammals. What has that taught us, in terms of just sheer numbers? If you
look at the human genome, what is this now teaching us? Well, what we now have tremendously
good data for is that something like five percent of the human genome is constrained,
evolutionarily conserved across virtually all mammals. So about five percent of our three
billion letters are constrained to such a high degree across so many different species
that are widely separated in evolution that they’re almost for certain going to be biologically
important. Evolution just would’ve never tolerated to keep them the same if it wasn’t for the
fact that they were evolutionarily important. Now that’s still a lot of bases, that’s about
150 million bases as a minimum that we’re going to need to really understand at a biological
level and it’s probably a lower bound and there’s lots of reasons I can give you, it’s
probably not five percent, it’s probably higher than that. But it’s on that sort of order
of magnitude, so keep that in mind. Well, what have we learned? What’s consuming that five
percent? Is most of it protein coding and a small part non-coding, or is it just the opposite? Well, we have a pretty good inventory of the
protein coding parts of our genome, the yellow stuff. And we now know that only
about one and a half percent of our genome encodes protein. So out of that five percent
that’s incredibly conserved, only about one and a half percent, in other words, of
the five percent, one and a half percent across the whole genome is protein coding. Now that
corresponds to something on the order of 20,000 genes; you can argue a little about what the
exact number is but that’s about what it’s looking like. But of course we make many,
many more than 20,000 proteins because of all the different alternate splicing that
goes on across different mRNAs, and meanwhile there are lots of different ways we decorate
our proteins in terms of post-translational modifications. Where we as a species get our complexity is
not in our gene number; it’s what we do with every gene. That’s why we’re more complicated
than a lot of other organisms that have smaller genomes, similar gene counts but are just
not as complicated: they don’t know PowerPoint for example and we know PowerPoint and so
that’s [unintelligible]. We do a lot more with every gene than they do. Well, wait a second. If five percent is important
but only one and a half percent is protein coding, do the math. That’s three and a half
percent that remains. That is not protein coding but it is important. In fact, it’s
so important it’s evolutionarily constrained to the same degree as protein coding regions.
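As a rough check on the numbers in this part of the talk, the arithmetic can be sketched in a few lines of Python. The inputs are the speaker’s round figures (three billion letters, about five percent constrained, about one and a half percent protein coding); the function and variable names are illustrative assumptions, not anything from the lecture.

```python
from fractions import Fraction

# Round figures quoted in the talk; these are approximations, not measurements.
GENOME_SIZE_BP = 3_000_000_000             # ~3 billion letters in one genome copy
CONSTRAINED_FRACTION = Fraction(5, 100)    # ~5% constrained across mammals
CODING_FRACTION = Fraction(15, 1000)       # ~1.5% protein coding


def genome_breakdown(genome_size=GENOME_SIZE_BP,
                     constrained=CONSTRAINED_FRACTION,
                     coding=CODING_FRACTION):
    """Return base counts for constrained, coding, and non-coding functional DNA."""
    noncoding = constrained - coding  # "do the math": 5% - 1.5% = 3.5%
    return {
        "constrained_bp": int(genome_size * constrained),
        "coding_bp": int(genome_size * coding),
        "noncoding_functional_bp": int(genome_size * noncoding),
        "noncoding_fraction": noncoding,
    }


if __name__ == "__main__":
    for name, value in genome_breakdown().items():
        print(name, value)
```

Exact fractions are used instead of floats so the base counts come out as whole numbers. With the talk’s figures, the constrained portion is the 150 million bases mentioned earlier, and the non-coding functional share is the three and a half percent described here.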
So we need to color in an additional three and a half percent of the genome with another
color and that is functionally important, but in ways that are other than directly coding
for proteins. Well, what are those non-coding functional sequences, and what are they doing?
Well we know about quite a bit of it, as broad classes. We know for example, that there’s
this incredibly complicated choreography of gene regulation involving lots of different
kinds of elements, non-coding elements, promoters, enhancers, silencers, insulators and so forth.
And a huge amount has been learned and lots of information is now available about the
complicated circuitry involved in regulating genes and all of that circuitry is non-coding
functional sequences. We also know that there’s a lot of important
functional sequences that are involved in packaging up our chromosomes, that a lot of
these sequences are also involved in segregating chromosomes and also in replicating chromosomes.
And so most of the non-coding sequences are elements that are relevant for all of these.
And meanwhile, wow have we learned a lot about RNAs. Remember, central dogma taught us that
DNA made RNA and virtually all that RNA went on to make protein, few exceptions like ribosomal
RNA. Well, we now know that RNA molecules can do all sorts of things
and they can function in biologically important ways, and that field is just exploding. And
these are all non-coding RNAs. In this particular case they fall in that category. And of course
I have to add a question mark because there is no way I believe today that we have discovered
all the ways that DNA can confer function and I’m sure this slide will expand in the
coming decades. So, what does that leave us? That leaves us
with simply, take five percent minus one and a half percent, we’re left with three and a
half percent of our genome as functional non-coding sequences. These are gene regulatory elements,
as I described, chromosomal functional elements I described. Oh, and of course then there’s
the question mark because I believe there are undiscovered functional elements. You’re
just not reading about it yet in text books but they’re out there and we’re going to find
them. We’re going to characterize them and then we’re going to catalogue them. Well, of course it gets more complicated than
that because what’s transpired since the end of the genome project is a massive upturn
in our knowledge about ways that DNA confers function beyond its primary sequence, because
everything I’m talking about here is a primary sequence of DNA. We are now learning more
and more about this other language of DNA, the epigenomic language where DNA gets decorated,
gets decorated with methyl groups, it gets methylated with histone proteins and this
is now coming to the fore because of knowledge that epigenomic changes are very relevant
in disease processes and developmental
processes, and therefore very, very relevant for biology more broadly. Laura Elnitski will be coming next month and
describing both the epigenomic landscape of the human genome, she’ll also be describing
some of these gene regulatory elements and their importance in understanding non-coding
parts of the genome. Well, we recognized as a community of genomicists that this was really
important stuff that we needed in order to start interpreting the human
genome sequence, making that publicly available, and helping the biological community understand
the sequence. We needed to know what is functional at the primary sequence level, the epigenomic level and so forth.
That is the reason why, and I’m sure Laura will mention this in her talk, that our institute,
for example, launched major projects revolving around cataloging functional elements in genomes. The major one that we launched is called the
ENCODE Project, the Encyclopedia of DNA Elements, which focused on the human genome, but we also
kicked off a complementary project called modENCODE for model organisms ENCODE, which
focused on similar studies but looking at the much smaller genomes of laboratory models,
specifically Drosophila and the nematode worm, and also some projects with mouse as well.
And these projects, especially, let me emphasize again, the human aspects of ENCODE, have now
published a significant pilot effort, and later this year you’ll be reading
a major paper that will come out of ENCODE on its accomplishments. And what this means
for any of you is that nowadays if you have a particular genomic region of interest, and
you want to know what has been established about its functional significance, we will
overwhelm you. That’s what I will tell you. You will go to a browser and you will open
up that browser and you will dial in those regions, let’s say you’ll do a couple of regions
and you’ll see things like this, which you will find overwhelming, which is fine, because
all of this data represents laboratory and computational data showing where there are gene models,
where there are RNA molecules being made across that stretch of DNA, where there are transcription
factors binding, where there are regions of open chromatin, where there are various epigenomic
marks, and every one of these tracks reflects that. And from that, you could try to interpret
what it all means and more and more this will become an issue of interpreting the massive
datasets that have been generated by ENCODE and other efforts. And earlier this year, or
I guess last year at this point, the consortium put out a users’ guide, basically a manual
on how to interpret the ENCODE data, and I would point you to that if you’re interested in
actually navigating and using ENCODE data for your own uses. And of course, ENCODE is not the only project
involved; there are other projects even here at NIH, including a major Roadmap, or Common Fund, project
looking specifically at epigenomics, very complementary to what ENCODE is doing, and
more and more data is getting generated along those lines. Oh, and by the way, it’s not just
about the primary DNA sequence and it’s not just about epigenomic marks on DNA. We are
learning increasingly that there’s yet a whole additional level of complexity because we’re
learning and we already knew that, you know, DNA is a three-dimensional molecule that’s
existing within the nucleus and there’s probably a lot of stuff going on there in the nucleus
that might be very relevant to genome function and in fact more and more, such as described
in this review article last year in Nature, the genome has a three-dimensional structure
and with it comes some interactions that also become very relevant for genome function. So that is a quick whirlwind view of just
that first step where all I’m emphasizing is interpreting the human genome sequence.
This is an effort that, you know, x number of years out from the Human Genome Project
we’ve gone about this far, it’s like a great novel; decades from now we’ll still be interpreting
the human genome sequence. This is going to be an effort I’m sure will take place for
all of our lifetimes and even then we’ll be refining it more and more. At best, right
now we’re sort of at a CliffsNotes stage of this; we’re just understanding the fundamentals.
But trust me, there’s a lot more to be learned. Before I move on past this first step, I should
at least emphasize the fact that there are other interesting things coming out of some
of the studies I just quickly reviewed for you, especially in comparative genomics, that
may not be directly on a trajectory toward human health, but indirectly help us understand
more about fundamental biological principles. For example, there’s certainly been a major
upswing in understanding of human evolution by using genomic tools, and,
featured both in the popular press and in the scientific press, there certainly is great interest
in our evolutionary origins, including species that are no longer here but that the tools of DNA
sequencing allow us to explore. And increasingly, I think, there’s excitement around understanding fundamental
principles of evolution, first with our own species, but then more broadly across all
animal species. And just as an example, an effort known as Genome 10K is an effort that
is attempting to collect DNA samples from 10,000 vertebrate species and having them
available so that when the cost of sequencing drops sufficiently it will mean that we can
actually just generate complete genome sequences of every available animal, vertebrate. And
with that, one could imagine, comes a rich set of information for being able to explore
biological innovation and fundamental principles of evolution. And one could imagine the next generation,
either our kids or our grandkids, the way they’ll learn evolution is not from textbooks,
but by sitting at computer screens and surveying the complete genome sequences of thousands
and thousands of vertebrates, understanding the innovations that took place. And it would
just be a far more robust, sophisticated way to understand evolutionary processes. Okay, so that was the first step. What’s the
next step along this journey? Well, the next step is not just understanding how a hypothetical
human genome sequence operates but understanding how, because we’re not just interested in
some hypothetical reference sequence, we’re interested in our sequences. We’re interested
in our patients’ sequences. We want to know how we differ probably more than anything
because that’s the underlying issue associated with understanding how to better treat patients
down the road. So understanding human genomic variation became a very major priority shortly
after the genome project. And the fundamental idea of course is that each of us has two
genomes. And thus, we don’t have just three billion letters, we have six billion letters.
We got three billion from mom, three billion from dad and across those six billion letters,
we vary every once in a while. Compared to the person sitting next to you, across your
six billion letters, there are probably about three to five million places
in your genome where that single nucleotide is different. So about three to five million
single nucleotide differences between you and the person sitting next to you. There’s
probably tens of thousands of places where there actually are larger structural variants,
either things that have come in or things that are gone or places that have been duplicated
or you carry multiple copies and the person sitting next to you only has one copy. That
sort of thing, these are known as structural variants. But the fact of the matter is, we know that
these variants, indicated here by “V,” are sprinkled throughout. But we also know that
the great, great, great, great, great, great majority of these have really no phenotypic
consequences whatsoever, they’re completely innocent. But a subset are very relevant,
a subset, a small subset of them might be sort of one of these metaphorical time bombs
that might influence your getting a particular disease and give you increased risk for a
disease. Or there might be other variants that are good variants that might be more
attributable to positive phenotypic features. But, we would like to know which of these
are completely phenotypically neutral and which ones are phenotypically consequential. And the other thing we also believe is that
there are a lot of variants we all share in common. Compared to the person sitting next to you,
it’s not like you’ve got three to five million and it’s a completely different set compared
to the person next to you; there are probably a lot that are in common. So the idea was
could we catalog a lot of these variants and find out at a large scale what all the variants
are that are at least common above some threshold and then study them and figure out which of
those we can sort of ignore and which are the ones that we might be really interested
in because they might have a disease or other phenotypic consequence. So this was the rationale for launching another
international project that began shortly after the genome project ended, called the International
HapMap Project. And its goals were to not only develop very deep catalogs of genomic
variation, but also to understand a little bit about the relationship of those variants
across stretches of human chromosomes. We now know that all these variants across a
stretch of chromosome are not completely random in how they move from one generation
to the next, but rather they’re clustered together in what are called haplotype blocks,
whereby within a given stretch of DNA, a series of variants tend to be inherited as a block
from one generation to the next, and knowing that structural relationship across variants
would be very valuable. And so, through a series of studies from this large international
effort, published in ’05, ’07 and then in ’10, millions and millions
of common variants across different human populations were catalogued and made available
publicly, along with additional information about their relationship to one another across
these haplotype blocks. When better technologies became available,
a more ambitious endeavor was launched called the 1000 Genomes Project, which attempted to
now use new sequencing technologies I’ll be talking about in a minute, to basically get
deeper and deeper catalogues of genomic variation, again across different human populations.
This is going so well that its name is actually sort of outdated. So now the project involves
several thousand genomes that are having their complete sequence established, or at
least a sequence across parts of their genomes, so it can basically get to rarer and
rarer variants. A pilot phase of this was reported in 2010, in the very last issue of Nature,
which you can read about, and you’ll be reading more about this in many publications that will
be coming out of 1000 Genomes in the next year or two. Lynn Jorde is going to be here
on March 7 and really dig much deeper into population genomics, all this about human
genetic variation and I’m sure he’ll be talking about HapMap Project and 1000 Genomes Project
as well. So, we now have lots of information about
functional sequences in the human genome, lots of catalogs about common variants and
increasingly rare variants across the human genome, across different human populations.
And with that, comes the opportunity for the third step along this journey. The third step
being, attempting to now understand the genomic basis for human disease. Which of those variants
play a role in human disease? And in describing what has been accomplished in genomics since
the end of the genome project in the area of human disease work and genomic applications
that have advanced the field, it is actually very useful to describe sort of a framework
once again that sort of summarizes what I call the genomic architecture of genetic diseases.
And this is an oversimplified view of human disease, but it’s a useful one for what I’m
going to describe to you. There really are two classes of diseases to
think about. All diseases have a genetic component associated with them, some to a greater degree,
some to a lesser degree. All diseases have a genetic influence. But there’s one class
of diseases that are fundamentally rare, rare across the human population. But these are
genetically simple; they’re simple because they really involve one gene. Also,
Gregor Mendel gets his name associated with them, so they are also referred to as Mendelian disorders. So,
these are diseases where the predominant risk is a change or mutation in a single gene.
Yes, there might be other genetic variants that influence the severity of disease and
yes, there might be some environmental contributions that influence the disease. But fundamentally
it’s mutations in a single gene that cause that particular disease. But these are rare. These are not what fill
hospitals and clinics around the world, they don’t represent the major health care burden,
they’re important, but they pale by comparison in terms of overall health burden
worldwide compared to these diseases. These are common diseases. Oh, by the way, the rare diseases
of course are ones like sickle cell disease, cystic fibrosis and Huntington’s disease and
so forth. But the common ones are diseases that all of us have, or all of us have family members
who have: it’s hypertension, it’s diabetes, it’s heart disease, it’s mental illness, it’s
different kinds of cancer and so forth. And these are the more common diseases, unfortunately
they’re more complicated because they involve multiple genes. They’re non-Mendelian because
it’s not a single gene disorder, instead it’s usually a series of genes that are involved
each with a genetic variant that confers risk, that all conspire together with what is typically
a larger influence of the environment to confer overall risk for getting that disease. So, these are sort of the two major classes.
Now, I want to point out because people often when I give talks like this will say, “Ah,
you’re only talking about the genetic contributions to disease and there are all these important
environmental contributions to disease.” So, I just want to emphasize, before I get that
criticism, that there is absolutely a role for both the genome and the environment in
human disease, that’s why I represent the pie charts the way I do. The fact of the matter
is, on the genetics side, there have been remarkable genome analysis technologies that
have evolved in the past five years in particular, or 10 years. We certainly have seen significant
advances in environmental monitoring technologies, but nobody would dispute that the last decade has
brought significantly more advances in the technologies for analyzing genomes than in environmental
monitoring. So, I’m going to emphasize the genomic side of this equation but it’s not
out of disrespect for the environmental contributions. That’s critically important, it’s just I don’t
particularly have much expertise in that and I also don’t have as much to report based
on technology advances in the last decade. So what’s happened with rare diseases and
common diseases since the end of the genome project? Well, what I can tell you is that
there has been an explosion in our ability to identify the genetic basis of single gene
disorders since the end of the genome project actually, even since the beginning of the
genome project. So here’s a cumulative graph that shows the number of genes that have been
identified that, when mutated, cause a single gene disorder. Note the genome
project began here; there were only a handful of successful examples before the data from
the genome project became available, early on as maps and clones and eventually sequence. And
then it’s just taken off ever since, and I didn’t update this slide yet for 2011, but it absolutely
continues to trend upwards. It has been remarkable and unpredictable that there’s a relatively
simple path to be able to go from having individuals with rare genetic diseases nowadays to being
able to figure out the genetic basis. Not in every case, but in general you can see
what the trend has been. And what this has resulted in is a fairly
impressive accumulation of knowledge because we now know the molecular basis of something
like 3,500 rare Mendelian diseases and traits. So that is the absolute number, and you can see before
the genome project it was like five. So, that’s pretty impressive. Now that is absolutely
the glass half full; there is a glass half empty side of this pie chart and that is that
there still remains about a couple thousand where we know the disease but we don’t yet
know the molecular basis for it and then there’s another couple of thousand where we think
it’s a single gene disorder or trait but we don’t yet know the genetic basis. So this
is the glass half empty, remember this slide; you will see it later in my talk. So, that’s success in many ways. And with
that comes tremendous knowledge about gene function; we can now attribute specific
functions to individual genes because we have individuals with defects in that gene and
we can see what it causes when mutated. What about common genetic diseases? Well,
what I will tell you is there was a lot of skepticism about whether we would ever be able to line up enough
analytical and laboratory-based horsepower to be able to unravel the complexities of
the common diseases with all their minor contributions from a whole lot of individual variants. But the idea behind the HapMap Project was
to simplify that process. I’m just going to give you a very quick review of what happened,
but one of our speakers is going to give this in greater detail. The fundamental idea was
this: with knowledge about these haplotype blocks across every human chromosome,
could we line up individuals with common genetic diseases, such as hypertensive individuals?
Take a thousand people with hypertension and a thousand people without hypertension,
scan across the genomes of all those individuals, and figure out whether there are
particular variants that are inherited more often in those with hypertension than in those without
hypertension, and with that get clues of where to look to see where there might
be genetic variants that are causing greater risk for hypertension. But doing that across millions and millions
of variants was simply not approachable with the cost of doing genotyping. But since you
knew about these haplotype blocks, could you imagine taking not all the markers, like
all these little black lines up here, which of course are the individual places of variation
across this particular human chromosome, but instead of taking all, let’s say, thousands
or hundreds of markers across this particular block, just picking one
or two, and having those one or two variants be proxies for this entire haplotype block?
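The proxy idea can be made concrete with a small sketch. The talk does not say how “being a good proxy” is measured; the code below assumes the r² statistic of linkage disequilibrium, computed from phased haplotypes coded as 0/1 alleles. The toy data, the function names, and the 0.8 cutoff are all illustrative assumptions.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between markers i and j.

    `haplotypes` is a list of equal-length tuples of 0/1 alleles,
    one tuple per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at marker i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at marker j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a marker that never varies cannot be a proxy for anything
    return d * d / denom


def proxies_for(haplotypes, tag, threshold=0.8):
    """Indices of markers that the chosen tag marker stands in for (assumed cutoff)."""
    n_markers = len(haplotypes[0])
    return [j for j in range(n_markers)
            if r_squared(haplotypes, tag, j) >= threshold]


# Toy block: markers 0 and 1 always travel together; marker 2 is independent of them.
TOY_HAPLOTYPES = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]

if __name__ == "__main__":
    print(proxies_for(TOY_HAPLOTYPES, tag=0))
```

In this toy block, genotyping marker 0 alone tells you the state of marker 1, so only one of the two needs to be typed, while marker 2 would need its own assay.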
This inverted red triangle is a given haplotype block on a given chromosome. And could you do that systematically, so you’d
simplify the process to just having to look at hundreds of thousands of markers and
having those serve as proxies for their original haplotype blocks? So what I mean by that
is a simple experiment: you take individuals and, let’s say, you
take a marker that, for simplicity, we’ll say comes in a green flavor and a purple flavor,
and you take it from this haplotype block. Those with hypertension are here, those
without hypertension are there. And just by eyeballing, you can see there’s no correlation.
You would rule out this block as being relevant to the disease. But what happens if you looked
at this block and now the variant you picked happened to come in an orange flavor and a
blue flavor. Wow, those with hypertension tended to get the orange block more than the
blue block, or the orange marker more than the blue marker; perhaps therefore somewhere
within this haplotype block might be a variant that might end up conferring risk for hypertension.
It may not be the orange one, it might be one a little way over, but you’d just be basically
correlating the inheritance of this block with getting hypertension. So, you’d rule out regions like this, you’d
rule in regions like that. If you did this across the entire genome, this is called,
this is genome-wide, and what we’re basically doing here is an association study. Associating
this haplotype block with being hypertensive and ruling out this haplotype block with hypertension.
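The “eyeballing” step described here can be replaced by a simple statistical test. What follows is a minimal sketch, assuming a plain chi-square test on a 2x2 table of allele counts in cases versus controls; the counts are invented to mirror the green/purple and orange/blue examples, and the function name is hypothetical. Real studies also have to deal with multiple testing and population structure, which this sketch ignores.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    case_a / case_b: counts of the two marker alleles among people with the disease
    ctrl_a / ctrl_b: counts of the same alleles among people without it
    Returns (chi2, p_value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("every row and column of the table needs a nonzero count")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # 1,000 cases and 1,000 controls, two alleles each (made-up numbers).
    print("green/purple:", allelic_chi_square(1000, 1000, 1000, 1000))
    print("orange/blue: ", allelic_chi_square(1300, 700, 1000, 1000))
```

In the first call the allele is equally common in both groups, so the block would be ruled out; in the second, the excess of one allele among cases gives a very small p-value, which is the kind of signal that rules a block in.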
So it’s a genome-wide association study, and that’s called a GWAS. Well, what I will tell you is I just gave
you the one-minute version. It is really complicated, what goes on, it’s not a simple PowerPoint
like this. So, we’re going to bring Karen Mohlke up here from the University of North Carolina, and
she’s going to explain it to you in a far more sophisticated way than I just did. What I’m going
to do is tell you that this has been an impressive success. Because what has happened is PowerPoint
slides like this are easy to make but actually demonstrating scientifically that the strategy
works was a question mark. But the good news is that it did work. In fact, this was the
first example of it, which sort of became the poster child for GWAS studies: age-related
macular degeneration, a genetically complex disorder for which some of the earliest HapMap
data was used to demonstrate that in fact a region on chromosome one actually had a
gene that had a variant in it that conferred risk for getting this particular disorder. At NHGRI, we actually started cataloging this.
We’re very interested in monitoring this field as it evolved. So, what we started doing is
every time a successful genome-wide association study was published in the literature, we
would survey it, our Office of Population Genomics would curate it and then, in each particular
case, would mark the place in the genome where that association had been demonstrated, sticking
a little lollipop at the particular region of the chromosome. That was that success story
in 2005, and in 2006 there were a couple more. By 2007, it became quite crazy. It seemed
like every single time you would open an issue of Nature Genetics or Science or Nature or
increasingly Human Molecular Genetics or PLoS Genetics, and this continued throughout 2008,
you’d find paper after paper after paper reporting successful genome-wide association studies,
in each case, sticking one or more lollipops in discrete regions of the genome. Now it’s
important to emphasize that they didn’t necessarily know the exact genetic cause, but what they
were doing is basically getting it down to an individual neighborhood of a chromosome
that would need to be searched in greater and greater detail to actually figure out what
the causative variant might be. And these phenotypes that are associated with these
lollipops are all these common diseases that are filling hospitals and clinics around the
world. This trend absolutely continued throughout 2010 and also 2011, oh and let’s just pause
there just to see, you can see our genome is littered with lollipops, with all these
successful regions being demonstrated to perhaps happen to be relevant for an important human
disease. Where once upon a time there were essentially
no successful genome-wide association studies, you can see already that the threshold of
a thousand successful publications was crossed last year, and that has left behind a tremendous
amount of work to be done, because you still don’t know the genetic basis, but you now
have a much more limited search to try to figure out what’s going on. Now this is, once again, glass half full.
Lots of successful genome-wide association studies. There is a glass half empty side
of the story; there are actually a couple of glass half empty sides of the story, new challenges.
The other thing we’ve learned, which is sort of interesting, is that as we’ve really made
successful forays into understanding the genomic basis of rare diseases and common diseases,
there’s a pattern that’s emerging. And that pattern is when it comes to rare
genetic diseases, single gene disorders, the great majority, again not all of them, but
the great majority of them turn out to be coding mutations. They are changes in the
protein coding portions of genes. But the exact opposite is turning out to be true in
these common complex diseases, where again it’s not exclusive, but the majority
of them seem to be out in non-coding portions of the genome. Remember that purple stuff
which I told you we barely understand and we have a lot more to learn, regulatory regions
and so forth? That seems to be where the variants are residing associated with this very important
class of common diseases. Now there is another glass half empty side
of the story because despite the fact of having a thousand successful genome-wide association
studies and lots of knowledge of where to look, it’s still not accounting for all the
heritability associated with these common diseases. So there is still a lot of mystery
associated with what you’ll hear Karen Mohlke describe as heritability. And it could be
that it’s just not the common variants that we’re familiar with working with, and increasingly
there’s lots of people who believe that a lot of these variants that are conferring
risk for complex diseases are very, very rare variants but together across the population,
each of us harbor some very rare variants that have not turned up yet in any of these
variation studies, and those are the ones that are conferring risk. And what all that is pointing to, whether
it’s doing the next set of analyses for genome-wide association study to sort of drill down into
these neighborhoods and find out all the variants, figure out which ones are positive, or the
recognition that indeed you need to go in deeper and get more and more rare variants
from all those people with hypertension. Either way you go, what it’s pointing to is that
we need to sequence a lot of people’s genomes. We need to go through those thousand people
with hypertension and sequence their whole genomes, or at least sequence all their coding regions.
And so this leads us to the fourth major step along this journey which now has really become
the dominant force in genomics and that is we need technologies to routinely sequence
whole genomes. We knew it was necessary, we knew it was necessary back then when the genome
project ended but what I will tell you is we never thought we were going to be as successful
as we’ve turned out to be. What do I mean by that? Well, when the genome
project ended in April 2003, I made sure I put out a new publication that described a
vision for the future of genomics research and we said, “Wow, we have the sequence in
hand, what do we need to do with it?” And we described all sorts of crazy things we
wanted to do. Some were even more crazy than others. And one of the craziest things we
said, I was one of the authors on this so I could really make fun of it because at the
time I can’t believe we really put this into press. But we actually put into print, in
Nature of all places, that we absolutely needed technological leaps that seemed so far off
as to be almost fictional but which, if they could be achieved, would revolutionize biomedical
research and clinical practice. Now, we didn’t just stop there, we really
had to go sort of the next level and even be more audacious because we said as an example,
we need the ability to sequence DNA at costs that are lower by four to five orders of magnitude
than current costs, allowing the human genome to be sequenced for $1,000 or less. This is
the first time we put into print the idea of getting the cost of sequencing a genome down
to something that was quite affordable, $1,000 was the marker we put in the sand. $1,000
seemed like a very reasonable price for a clinical test and that’s the reason why we
picked that number. But why were we sort of a little exuberant and a little overambitious?
Well, it’s because genome sequencing at the time we wrote that was still quite expensive.
You know, for example, sequencing that first human genome by the Human Genome Project cost
something on the order of a billion dollars. And what we were putting into print was basically
the idea that having now done this one time, one time, one human genome sequence, one billion
dollars, that somehow in the not too distant future we were going to develop fancy technologies
that would lop a lot of zeroes off that billion and deliver something, a genome sequence
for $1,000. Well, this became a bit of a rallying cry
in the community. In fact, the phrase “a thousand dollar genome” sort of became the battle cry
for technology development. Our institute put out lots of grants to try to stimulate
this field that were actually quite successful. Fortunately, the private sector got quite
involved in this, many companies sprouted up and an incredibly intense effort to develop
newer and newer, better and better technologies sort of came to the forefront. Because the
idea was to just get rid of these factories that had generated that first reference sequence
as part of the human genome project and develop something really fancy shown here in icon
form: some nano this, some micro that, some mini channel, whatever. Something that would
be so efficient and so scalable that would allow you to sequence an individual patient’s
genome, individual clinical subject’s genome, for something like $1,000. Well, I can just tell you, there’s nobody,
anybody who tells you otherwise, they’re just not telling you the truth. Nobody
expected things to happen as well as they’ve happened within the past eight or nine years.
Because it’s not just one or two or three or four different new technologies but it’s
really more like five, six, seven, eight, nine new technologies. Shown here are just
some of the platforms that you can now purchase and get in your own laboratory. These
are what are referred to as next gen or next generation DNA sequencing technologies and
I’m not even necessarily talking about any one of them, in fact, I’m not going to talk
about these at all. That’s why we’re bringing Elaine Mardis here from the Wash U Sequencing
Center to talk about this and describe it in great detail. These technologies are fast
evolving, they’re incredibly sophisticated and they’re remarkably efficient. As an example,
a couple of these machines down here, one in particular, you know, in one week can generate
the sequence of a human genome; that’s something that took 10 years and thousands of people
to do as part of the Human Genome Project that is now routine in many places around
the world including even here at NIH. By the way, the reason it’s particularly exciting
is this slide will not be used in the next time I give a lecture in this series. It will
have to be a new slide because there’s new technologies that are coming. It is like sitting
in an airport looking out on the horizon, yes you have 10 planes on the ground, but
you know what in six months, there will be another plane, about a year later another
one, maybe two years, three years and just yesterday there was a flurry of email, in
fact, Wall Street Journal wrote an article and I think other journals wrote or other
newspapers wrote articles because one of the companies came out with a new technology and
they’re commercializing and they say that they’re going to cross the thousand dollar
threshold this year and there just are many more technologies. I’ve heard more and more
about nanopores and it’s just featured as one example on the cover of Nature that maybe
three, four years from now will be commercialized. And again, we’ll just continue to step down
the cost of sequencing. Well, has it materialized? Do we have a thousand
dollar genome? Have costs gone down? Where are we at? Well, we know about this a lot
because at least in NHGRI we fund three very big centers that do a lot of sequencing. They
did it for the genome project, they still do now and we give them money and then they
give us data. And every three months they tell us how many genomes they sequenced or
how much DNA they sequenced and how much money they spent. And we’ve tracked that for, like,
over a decade. So let me show you what real data looks like. So, cost for sequencing the
human genome and before I tell you that let me tell you about Moore’s law. So Moore’s
law is the law of the computer industry that basically says that computer power doubles
every 24 months or so. And nobody keeps up with Moore’s law, say technology development
people, except for the computer industry. So, they’re your benchmark; you try to keep
up with them if you can. So, here’s our data. So, shown in white, now
notice the y-axis is logarithmic. Shown in white is Moore’s law. In orange is the data
provided by our sequencing centers dating back to 2001 or so. So, from here
to here they were using that old fashioned method of dideoxy chain-termination sequencing
developed by Fred Sanger in 1977. This was the method that was used for sequencing the
genome in the Human Genome Project and they used it up until this point. Remarkably, while
they were using it they were actually keeping up with Moore’s law. So that was pretty impressive
in and of itself. But right here they switched to next generation sequencing platforms and
ever since then and up to the present time, they’ve blown Moore’s law out of the water.
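To put the Moore's law comparison in numbers, here is a rough sketch. It uses only the round figures quoted in this talk (about a billion dollars for the first genome, a $1,000 target, and a doubling every 24 months), and it assumes that keeping pace with Moore's law means cost halves every 24 months. The function names are hypothetical, and this is not the sequencing centers' actual data.

```python
import math


def years_at_halving_pace(start_cost, target_cost, months_per_halving=24.0):
    """Years needed for cost to fall from start_cost to target_cost
    if it halves once every `months_per_halving` months (assumed pace)."""
    halvings = math.log2(start_cost / target_cost)
    return halvings * months_per_halving / 12.0


def cost_after(start_cost, years, months_per_halving=24.0):
    """Cost remaining after `years` at the same assumed halving pace."""
    halvings = years * 12.0 / months_per_halving
    return start_cost / (2.0 ** halvings)


if __name__ == "__main__":
    # Round figures from the talk: ~$1 billion first genome, $1,000 target.
    years = years_at_halving_pace(1e9, 1e3)
    print(f"Years to reach $1,000 at a Moore's-law pace: {years:.1f}")

    # Where a Moore's-law pace alone would leave the cost after one decade.
    print(f"Cost after 10 years at that pace: ${cost_after(1e9, 10):,.0f}")
```

Under these assumptions a Moore's-law pace alone would take roughly 40 years to go from a billion dollars to a thousand, and would still leave the cost in the tens of millions of dollars after a decade. That gap is why a curve that drops below the Moore's law line on a logarithmic axis is so striking.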
So we exceed Moore’s law, which in many ways was unprecedented and in many ways is actually
incredibly impressive and if you want to continue to follow this trend, we’re
going to continue to update this slide on our website, and we continue to believe
that it will go down and down. So where are we right now in our quest towards
a thousand dollar genome? Well, you got to tell me, right today where we are, we’re somewhere
around there. So not quite at a thousand dollars but we’re really close to it. There actually
are shortcuts where you could just sequence the exome, just the coding sequences. That’s
below a thousand dollars now, pretty much. A whole genome, it depends who you ask, it depends
on the accuracy, it depends on if they’re telling you the truth, it’s, you know, three,
four, five thousand, but rapidly heading towards a thousand. It’s not a big deal anymore, this
is not what I stay up at night worrying about. We will get to a thousand dollar genome and
it’s just not the big problem. Before I tell you about the big problem, because
we have big problems, let me tell you one other thing to think about because I think
this is very relevant including to an audience like this. How are we going to be generating
genome sequences over the next decade? Is everybody going to buy one of these instruments
and put it in their lab? Are we going to have centers set up to do this? So I don’t know,
I actually can’t predict that completely. What I do know is that market forces will
step in when this becomes a commodity, in fact it really is a commodity. At this point,
genome sequencing could be obtained through companies, just show a couple here. All you
have to do is to open the journals and read the advertisements and there’s other companies,
you know — by the way, look at this company here and notice their price because I took
this maybe a few months ago. We’re going to come back in a few slides to a point. And
here is what I was telling you about, you can get a whole exome sequencing done commercially
for just under a thousand dollars. So, I don’t know what it’s going to look like
in the future in terms of whether we will be sequencing genomes in our lab or whether
we will be outsourcing it. It’s becoming a commodity and that’s actually a good thing
because we have far more important things to worry about than generating data if we
have a big challenge of what to do with that information. So, that is actually a great segue into our
fifth step and the last one I’m going to describe before I start describing the future because
the fifth step sort of is a little bit of cold water in the face kind of thing. It’s
actually a little deceptive for me to tell you that sequencing a genome is getting close
to a thousand dollars because that’s just getting you the data. The real bottleneck
nowadays in genomics is not getting the data. The real bottleneck is dealing with the onslaught
of the data that comes flying out of these machines such as sort of shown here in a humorous
fashion. These sequencing instruments are able to generate data far faster than we could
possibly assimilate it and it has put genomics right in the middle of a situation we really
were never in for a long time and that’s big data. But when the genome project was going on,
we could — we didn’t have big data yet. We were just trying to generate data. But now
all of a sudden, we find ourselves smack in a significant set of issues associated with
having a big data circumstance that has created a pretty substantial bottleneck. I refer to
this as a computational bottleneck. That bottleneck has several elements associated with it. There’s
hardware issues, just having enough stuff, enough storage capacity, enough processors to analyze
that data. There’s lots of issues around software, being able to deal with an onslaught of data
and interpret it. And of course there’s all sorts of issues around workforce, just having
enough people trained to deal with this. There is a reason why Andy and Tyra are giving three
lectures total. It’s to deal with all of these issues around the computational analysis of
data because it’s become sort of the biggest issue right now in genomics. So you’ll be
hearing two lectures from Andy, one from Tyra to sort of address this. So it’s sort of a computational bottleneck.
Hand in hand with that, which also overlaps with what Andy and Tyra will talk about, is
just a sheer informational bottleneck. The fact is that let’s say we get you past the idea
of generating the data; you can assimilate the data, you can analyze the data, you can even
get, using these fancy technologies, the sequence of the individual genome of an individual patient,
an individual subject. And let’s even say you get to the point of being able to analyze
it and filter it and get to the point where you just have your list of three to five million
variants in that particular person who’s sitting across the room from you for example, what
do those variants mean? I mean you see these changes, are they detrimental variants? Are
they innocent variants? And if you did it on a patient for example, here in the clinical
center, and you have that genome sequence, and you have that list of three to five million
variants and you rounded on that patient in the morning, is this how you’d feel? Would
you just sort of stare at that list and wonder what it all means? Probably, you would, at
least right now. There’s also that informational bottleneck, simply knowing what the sequence
means when you have individual variants and individual patients. I can’t help under this circumstance but quote
Harold Varmus, known to I’m sure all of you. Former director of NIH, current director of
NCI who wrote a commemorative article about the genome sequence at 10 years where he said,
“Physicians are still a long way from submitting their patients’ full genomes for sequencing.
Not because the price is high but because the data are difficult to interpret.” So that’s
the circumstance we find ourselves in and this is where I just last week saw this advertisement,
I couldn’t help but throw it in, same company. Notice the price has come down since the last
time I took their ad. But they talked about Ben, Ben Franklin obviously, and he
didn’t have, you know, an informatics bottleneck, making fun of
the fact that we do. And of course they’re a company that wants you to give them money
and they’ll help you solve that bottleneck and we’ll see, but we need to solve this bottleneck
and — but it is interesting that even their price went down since the last time I scanned
one of their ads. So those are the five steps I wanted to tell
you about in going from the genome project until today. Now there are other steps that
I could be starting to talk about, as certainly relevant for the future. Developing new diagnostics,
much more relevant to genomic medicine than anything I’ve talked about. Obviously therapeutics,
preventative measures based on genomic information. Of course there’s probably other stops that
we’re going to have to journey our way through to eventually realize genomic medicine. What we have at the present time is just a
tremendous amount of data. We have great technologies for analyzing genomes and we have, such as
shown here, for the first time in many ways, incredible opportunities to apply these data,
apply these technologies to clinical circumstances, clinical research immediately, hopefully,
eventually clinical care. This makes us remarkably well poised for a revolution
that brings about genomic medicine but with this comes just inordinate numbers of challenges
that we all have to face. This is why Bruce Korf is going to come up here and just specifically
talk about genomic medicine in the series later in April. But, what I want to do now, and I’m sure it will
complement much of what Bruce is going to talk about, is to just spend the last
20 minutes or so gazing into the future. Because what I’ve described at
first was up through the genome project, since the genome project; now it’s really all about
the future. And what we believe the future is going to bring has come about from a strategic
planning process that NHGRI did on behalf of the field of genomics that went on for
several years and then just about 11 months ago was published in the 10th anniversary
issue of Nature commemorating the 10th anniversary of having the sequence of the human genome in
hand and described. And this was the reprint that was available to all of you and if you
didn’t pick it up on your way in, please pick it up on your way out and if there’s extras
back there take them to other people in your lab or take them home, they’re great stocking
stuffers for next Christmas if you want — [laughter] — but we don’t want any more of them so if
you just take them, because we don’t want to take them back to the offices. But, oh
also and if you want a PDF version you could go to this website and in fact, you could
read all about our strategic planning process that went on. This is very much about the future, and Nature
was kind in giving us the headline on the front cover that the future is bright, and
in many ways, we do think the future’s bright. So let me describe to you what we sort of
derived based on consultation with hundreds of people around the world in the field of
genetics and genomics and beyond in trying to formulate this 2011 vision for the future
of genomics and based on lots of workshops and consultation, iterative processes of
writing documents. And it’s all described in great detail in the reprint which you’re
crazy not to read from end to end if you’re going to participate in this lecture series.
Because so much of what you’re going to hear about in the other lectures is described at
least superficially in that document. What we heard from the strategic planning
process was that it was an exciting time in genomics to be even more specific and more
sophisticated in describing the journey from base pairs to bedside or if you prefer the
metaphor, from helix to health. But in doing so, we can now start to divide this work into
a series of domains that both reflect our history but importantly also reflect our future.
For example, you can think at the more proximal side of this, a domain of research activity
that involves understanding the structure of genomes, sounds familiar. Makes sense,
that’s what we’ve done for a while. Also a set of research activities that get you to
the biology of genomes, on understanding how genomes work and then increasingly start to
apply that knowledge to use genomics to understand the biology of disease, makes a lot of sense
based on what we’ve done. But what becomes more ambitious, now thinking about the future
more and more, is using that knowledge to advance medical science, the science of medicine.
And also being cognizant that just because you have some great medical advance doesn’t
mean you change the practice of medicine, because you also have to do research that
will actually demonstrate that you improved the effectiveness of health care based on
those genomic advances. And so this became sort of five domains of
research activity that provided an outline, if you will, for our strategic plan; I will
tell you it actually provides an outline for basically everything that our institute is
doing as we think about our genomics program. Now it’s not the only thing we’re doing, it’s not
the only thing in genomics because there are important cross-cutting elements that are
also very relevant. You heard about one of them and you’ll be lectured on some of these.
Obviously, computational biology and bioinformatics are pervasively important for all these domains
of activity as is education and training. This lecture series, if nothing else, is an
example of an outreach education effort to educate people across all of these domains.
And then there’s lots of genomics in society issues. Historically, we described them as
our Ethical, Legal And Social Implications Research Program, but it includes lots of
other things including behavioral research and other areas that fall under the general
umbrella of genomics in society. What was very useful, though, in thinking
about these cross-cutting elements, but first returning to these five domains of activity,
is that it’s very helpful in planning and projecting to think about these five domains
of activities and think about what has been accomplished over the last 20 years and then
what’s going to happen over the next 20 years as we predicted. What do I mean by that? Well,
we found a useful way to represent this is by hypothetical genomic accomplishments that
are graphed as density plots, such as shown here with each blue dot representing a hypothetical
genomic accomplishment and then when they pile up on each other they change color until
they get red. So what do I mean by that? Well take the time interval of the genome project
which I told you about. Well, basically it was all about this first domain. It was all
about understanding the structure of genomes. Yeah, we learned a little bit about how genomes
work and maybe even a couple things about disease, but really, the real density was
right here and that’s smack on that first domain. I’ve led you through five steps of what’s
taken place since the end of the genome project and that’s reflected here. Because we continue
to learn a lot about the structure of genomes, but yeah well since the genome project we
mostly were spending our time learning about the biology of genomes and starting to dabble
in the biology of disease, rare diseases, increasingly common diseases, yeah and maybe
there are even a few homeruns out here in the more clinically oriented domains. The
center of gravity though was firmly placed on the first two domains. But people want to know about the future.
What’s the next decade going to bring? We think the next decade’s going to look something
like this. We believe the center of gravity will shift more and if anything the next decade
is going to be about refining our knowledge about how genomes work but increasingly applying
that knowledge to understand the genomic basis of disease. With that will come many more
opportunities for advancing medical science and even more homeruns than previously seen
for improving the effectiveness of health care. But being realistic, center of gravity
is going to remain on domains two and three. We’re optimistic that beyond 2020, you’ll
see the change in the practice of medicine first by advancing medical science, eventually
improving the effectiveness of health care. But this is going to take decades realistically,
it’s not going to happen in the next five years or 10 years. But, we really believe
we’re on a trajectory that we will see the center of gravity of these accomplishments
shift rightward over time. Now these are huge research areas. This is not
just about NHGRI, this is not just about NIH; let’s be frank, this is not just about the
United States. What we describe in this document and use as an organizing framework by this
figure, which is figure two, is a far more expansive view of genomics. It is absolutely
a world view of genomics that I don’t claim for a minute to be just about one institute,
or one agency, or one country. With that said, we absolutely believe that there’s
going to be significant accomplishments that contribute to those five domains of research
activity coming from across the world. At the same time, we think a lot about what we
want to do here at NHGRI, what we want to help have happen here at NIH. So, I thought
I would spend just the last few minutes just glimpsing a little further into the future
and specifically telling you what I think are some of the most compelling opportunities
in genomic medicine and things that we are doing to try to accelerate them. We actually
have this outlined in one of the text boxes, it’s actually text box number two, we call
it imperatives for genomic medicine. The subtext to that was that these are the no-brainers,
these are the things that are so obvious we absolutely want to support and they really
represent the future, we think the immediate future because we think these are opportunities
that we absolutely could facilitate over the next decade. What I will tell you
about this glimpse into the future is that technology drives it and maybe that’s not
a surprise. I think the history of science has shown that technology advances drive science.
I think we’re going to see that more than ever over the next decade. Just like you know
the telescope just drove astronomy and the microscope drove cell biology and then different
imaging technologies drove radiology, absolutely these sequencing instruments are driving genomics
and they’re driving the field forward and we’re going to continue to see these technology
advances. And with that will come the sequencing, now,
not of hundreds of people, no, not of thousands of people. It’s really tens of thousands,
hundreds of thousands of people. We can imagine over the next decade, a million or more, maybe
many more than that that are going to be sequenced and when you start to deal with those sorts
of numbers, you start thinking about how that’ll be done in clinical research contexts such
as here at the clinical center. It will be done perhaps over the next decade and in some
ways as part of clinical care and we need to do research to understand that. But I was
going to point you to an additional article to read about this general vision for the
future, specifically around clinical applications. I was asked and partnered with Teri Manolio
at our institute to write a perspective about this in Cell last year, talking about how
genomics is going to reach the clinic and how these basic discoveries are going to drive
that forward through technology advances. What are the specific areas you’re going to
see this in? What are things that we are directly supporting that you can absolutely expect
to see in the coming handful of years, all, again, driven by these fancy sequencing technologies.
For example, I told you you’d see this pie chart again. We want to fill in this other
half of the glass. I just spent the last day and a half meeting with a new consortium of
centers that we’ve now put together, sequencing groups whose charge is simply going to be
to use genomic sequencing technologies to identify the genomic basis of the remaining
Mendelian disorders for which the gene is not known. We think we could industrialize
this and start to identify these genes, and we’ve put this consortium together, just formed
over the last months, and we’ve now just spent a couple of days strategizing with them to
move this forward aggressively. Similarly, earlier this week I met
with our large sequencing centers who are going to have a major hand in sequencing tens
of thousands of genomes in the coming years, probably hundreds of thousands perhaps, and
they are going to be, among many things, tackling the challenges of moving from information
about regions of the genome that might contain variants for complex diseases to actually
getting down to the causative variants and doing this by industrializing the sequencing
of individuals with particular phenotypic features and different diseases and so forth. So, I
would expect major strides. Then the other major disease area, specifically
focusing on one disease area which is absolutely a no brainer and I’m sure you’ve heard a lot
about is in cancer. And here, cancer fundamentally being a disease of the genome we’ve gotten
out of the gates already here at NIH through The Cancer Genome Atlas. A joint venture between
our institute and the cancer institute really is a prototype for applying genome sequencing
for, in this case, a perfectly appropriate target of different kinds of cancers. And
of course, this is such a no-brainer. It’s not just the NIH that’s doing this, TCGA
or The Cancer Genome Atlas is just one of many projects that now fall as part of an
international effort, and in fact, an entire consortium has formed, and groups in many
countries are now involved in tackling different tumor types and using genomics to sort of
develop catalogues of changes that take place in tumors and use that knowledge to better
guide diagnostic and therapeutic development. And so what is that future going to look like?
I think a lot about diagnostics in particular, it’s probably because I was trained as a pathologist.
But, right now thinking about how we look and deal with cancer, it’s mostly histopathology,
looking under a microscope and looking at tumors. And in the future, sure, we’ll be
doing that in the future, no question, but with it will come an augmented set of knowledge
about individual genome analyses of the specific tumor that you’re looking at and its rearrangements.
And I am convinced, and we already have data and know it, that that will provide a much
more robust diagnostic tool for predicting the nature of the cancer, the prognosis for
that cancer and perhaps treatment options for that particular cancer. You’re going to have a very special lecture
that’s sort of a special part of this course because Bert Vogelstein will be lecturing
in this time slot on February 29th actually as part of a separate lecture for our institute,
an annual lecture that we give. But Bert’s going to come down from Hopkins to give this
and I guarantee you the vision he will articulate will align very much with this slide
as a real pioneer in the area of cancer genomics. By the way, it’s these technologies, these
sequencing technologies, I’ve mostly spoken about how they could be used for sequencing
human DNA but we shouldn’t forget the fact that these technologies can also be used to
sequence other DNA. And the DNA in particular that I’m thinking about are the DNA of microbes
that live in us and on us. And it turns out the whole community of microbes that live
in us and on us is known as the microbiome. And just two quick statistics to make you
a little uneasy, you’re outnumbered by microbes in terms of cells 10 to one, so your little
body ecosystem is only 10 percent human, 90 percent microbes. And another thing that should
make you feel uneasy is of those microbes only about 10 percent have ever been isolated
and studied in a laboratory. Ninety percent of it, we’ve been blind to. Well, why is that relevant to genomics? Well,
we can sequence those microbes now, we can sequence the microbiome using these fancy
technologies and we can catalogue that community and learn about that community and
figure out if that community has any role in health and disease. And so where once upon
a time we were blind to our microbes or many of them, now all of a sudden we can monitor
them. We can study them and this has led to a whole area of research of microbiome analysis
and NIH has a Common Fund project called the Human Microbiome Project. And once again,
it’s just one component of international efforts, our one microbiome project interdigitates
with the whole international consortium of investigators and countries involved in doing
human microbiome research. And we look out in the future, especially through a pathologist’s
eyes, and nowadays we pretty much deal with petri dishes and gram stains and try our best
to diagnose the microbes that are associated with disease, even though we know we can only
culture a small fraction of them. And in the future, we’ll be doing the same thing but
wow, if we could get some sequence data on samples and see microbes by sequence data
that we could never see in the laboratory otherwise, you got to believe it’s going to
bring insights about their role in health and disease, and so Julie Segre from our institute
will come and describe microbiome analyses in some of the work she’s done and the community
has done and I think you’re going to be amazed at the idea that these new technologies
are just sort of changing the face of clinical microbiology. But there’s other things beyond that that
are going to be brought by these new technologies. One could certainly imagine
the idea of genome sequencing of newborns. I mean, all newborns born in the United States
get genetic tests done, just a small number of genetic tests. Could you imagine doing
a more comprehensive survey by sequencing their genomes or some part of them? While
there’s lots of questions to think about and we’re thinking about research to sort of help
answer those questions, it’s something to consider. Of course the whole idea of the
interplay of genetics and drugs and the genetic basis of drug response becomes very important.
Why is that? Well, we don’t all respond to drugs the same. Just like a lot of things
in life, everybody responds a little differently to lots of things and all of us respond differently
to medications. All the medications that come out to market, they all work; they just don’t
work on everyone. And the whole notion of pharmacogenomics, understanding the genomic
basis of drug response has really now taken on a very exciting phase where we’re getting
at the genetic basis of drug response that perhaps might lead to better ways of managing
medications for individuals. So Howard McLeod, another individual from North Carolina, we
keep bringing them up from UNC, is going to come and give a lecture; he is a real world expert
on pharmacogenomics and I think you will enjoy that lecture tremendously. But there’s other challenges that come with
this technology. Lots of genome sequences being generated, but these are going to eventually
be generated on patients. We’re going to have to communicate that information to patients,
and it’s pretty complicated stuff. We don’t understand it ourselves yet, and yet we’re
going to have to communicate this through health care professionals, to try and describe what
their unique genome might bring with respect to disease, with respect to drug response,
with respect to their children. So, there’s a lot of issues associated with communicating
that information and there’s a lot of science behind trying to think about how to sort of
create that future in a very productive way. And Colleen McBride from our institute will
come and give a lecture that will be very relevant to this particular area and I think
that will be a very important one for you to consider as well.

In the meanwhile, we need to communicate,
one, to patients, but we need to disseminate this knowledge as it accumulates out to the
health care professionals, and I can tell you that, as I talk about this a lot, individuals
are very concerned about whether we will have robust enough clinical genomic information systems
that will allow health care professionals, physicians, nurses, genetic counselors, pharmacists,
physicians’ assistants and so forth, to interpret this tsunami of new information
as it becomes available. And at least in the United States, we’re playing catch up compared
to some countries. This will all interdigitate, likely, with acceleration in the use of electronic
health records which in some ways might be good because genomic information might flow
in nicely to health records if we organize it properly. But maybe it won’t, we have to
deal with that. But of course we also just need tools that are going to be readily available
to health care professionals, that allow them to look at the three to five million variants
of a given patient, figure out which ones of those should we do something about, which
ones should be ignored. And that’s just going to be a lot of information. So, we’re very
much involved in trying to think through what should be developed to try to help professionals
deal at a practical level with this onslaught of new clinical data.

So I’m sort of at the end of this journey,
I knew it would take me the full hour and a half. I had a feeling. These five domains,
I think you’ll see in various forms in the coming lectures. I just want to remind you
that some of this is actually new for the field of genomics. And that’s why some of
the last things I talked about are actually very new areas for us. Sort of think about
these domains: the first one to two and a half are really basic science endeavors and
I’m sure many of you regard yourself as basic scientists. But as we’re sort of thinking
about moving more into medical science, it starts to deal with what is called translational
science, really thinking about the application of genomics to medical problems and medical
circumstances. But, even some of the last things I was talking about with getting this
information out to health care professionals starts to become implementation science, which
I think is very new to many of us. Actually demonstrating the effectiveness of health
care, it’s really about implementing things and changing the practice of health care.
And so I think among the lectures that you hear, you’re going to hear samplings across
the full spectrum of different scientific endeavors, basic, translational and implementation.

And finally, what I would say is I hope I’ve
given you the impression that there’s been incredible accomplishments, incredible optimism
and just remarkable successes in genomics, but at the same time there’s a lot of big
challenges, especially some of the ones I alluded to earlier. This is going to require
a Herculean effort by not just people in genomics, but actually more broadly as it gets disseminated
and there’s no reason to believe this is going to be a simple journey; I didn’t mean to imply
that even when I was telling you about some of the good successes.

I can’t help but provide in closing sort of
a quote I found from Winston Churchill that I thought was very appropriate. “A pessimist
sees the difficulty in every opportunity but an optimist sees the opportunity in every
difficulty.” One thing I will tell you is that Tyra and Andy in particular have enriched
this lecture series for optimists, so you’re going to hear about some incredibly
exciting opportunities that are coming with all the difficulties that are moving this
rightward. But with that said, at a practical level we should all recognize, and I hope you find it
inspiring yourself, that this is going to require a community of scientists
and health care professionals to really see this vision through. And we’ve got to stay
optimistic but we also have to realize there’s some significant heavy lifting ahead of us.

So, I’m up against the end. I will just stop
there. At a practical level, I know people have appointments to go to. I’ll stick around
and just take questions from the platform and thank you for your attention. Bye bye. [applause]
