I've been seeing "Genome of X has been fully sequenced" articles over the course of some years. Can someone who understands it explain to a layman like myself how exactly this benefits us?
I think the best big picture explanation of why this is important is this: our ability to feed and care for humanity is dependent on biology. Genomes are the source code for biology. Reading the source code is the first step to understanding the source code. Understanding the source code will allow us to build better expressions of biology (those being both ourselves and the animals and plants we depend on).
I hope you won't take it personally because you're far from the only one to say it (and I would have refrained from posting if I hadn't seen people taking this further in this thread), but I really wish people stopped using the "DNA is the source code of your body" analogy. It is used far more than it's actually worth, even though I can see how that analogy can be seductive. The structure of a program's source code is nothing like the way DNA encodes and propagates information. Even if you add a bunch of adjectives such as 'self-modifying', 'full of GOTOs', 'nondeterministic', 'live monkeypatched', and thus twist the picture many people have of the way source code translates into execution, it would still be misleading. We are a product of both our genome and environment, at the same time, all the time. Not one then the other. Or, if you will, the source code is its own executable. It doesn't make sense because the analogy doesn't either.
And that's not even taking into account non-DNA things that encode and propagate information, such as microbiota or anything linked to epigenetics.
It’s actually a great analogy. Analogies aren’t supposed to be exact representations; they are supposed to allow you to quickly communicate a decent approximation of what something is by relying on preexisting understanding of how another thing works. The source code analogy does that well.
And, by the way, running programs are also the result of their source code and the environment in which they execute. It’s just that the environments which computer programs run in are very homogenous and predictable. But flip some bits in memory or have a user thrash the UI and you’ll see that a running program interacts with its environment in much the same way as a biological organism.
I never said it was a bad analogy, I even used a rather positive word ('seductive'). It is overrated, though, and it does stop making sense sooner than many people think. What I'm really saying is that relying on this analogy to understand how DNA works in relation to organism development, function, etc. won't get you past a very cursory understanding of it. It may be enough for many people but:
-I like to think that HN is filled with people who take interest in many things, especially science-related subjects, and would hope that many of them on this site would like to know more than an analogy-based understanding of genomics
-For better or for worse, the recent bloom and advances in genetics and biology in general have taken over the modern world and been hailed (probably with reason) by all kinds of circles, from the media to governments or the tech world, as a promising new era that's going to revolutionize our understanding of life, disease, what it means to be human, cognition, what have you. It does feel pretty cool as an insider to know that you're working in the 'hot new thing' but a side effect of that is, I have seen people develop a pretty bizarre fascination with DNA, whether it be their own or other people's. And it's not something that can be pinned on entirely rational reasons. I don't want to go off on tangents any further, but let's say it can't hurt to occasionally remind people what DNA is and what it isn't, what it's like and what it's definitely not like. I dislike pedants as much as the next person but seeing so many (presumably very educated) people take up that analogy in the same thread made me think that the reminder wasn't that out of place.
I disagree, DNA is pretty much the source code. Any differences in how the DNA is expressed are due to histone modification, proteins, or RNA (and rarely CpG DNA modification in humans), so DNA in its purest form is indeed the source code.
Source code for what, though? How about pre-natal influences in the womb, does this mean the womb is a sort of compiler? What about the structure of DNA itself, known to influence gene expression levels? What about the microbiota, which is arguably more 'you' than your own body in terms of raw cell count? What about the phages found in the microbiota (the phageome)? Why are they not part of the so-called source code?
Source code/machine code, I am using them interchangeably, but in this analogy they are equivalent. Pre-natal influences change the epigenetics, i.e. what sits "on" the genome. The structure of DNA is almost completely determined by histones, which are proteins that DNA wraps around, so once again, not the DNA. Microbiota affect the environment of the cells, meaning they are an input to the cell, not the source code. Any phages affecting the microbiota will affect the input to the cell. These factors aren't considered source code simply because they only affect the execution of the cell's source code.
And even more so, a product of our cells and RNA soup and so on. There's a truly huge operating system that DNA operates in. I think of DNA, when I think of it in a coding way at all, as a top-level script on top of enormous machinery. Like a switch on a giant factory, that says 'man' when you flip it one way and 'mouse' when flipped the other. Does the switch encode what it means to be a man? a mouse? No.
Yeah, but the question is: having the source code in the form of a bunch of files (each containing some routines) OR having it in the form of a bag of routines, is that a significant difference?
(To see the analogy, substitute file = chromosome, and routine = gene)
It is the very first step to all kinds of genetic analyses, especially for anything comparative. Usually with most species, we only have access to 'bits' of the genome rather than the whole chromosomes - a shortcoming of current short-read sequencing technologies. This means it's hard to get a broad picture of things like large-scale sequence rearrangements within the genome. There's a whole range of emerging technologies trying to alleviate that, but aggregating them all together is a mind-numbing hassle.
The end result, though, is that now that we have access to a high quality picture of what sequences belong to what chromosomes, we can compare other varieties to that reference, see what structural differences (that really means 'sequences moving around') occurred over the course of evolution, when they occurred, why (in evolution terms) they may have occurred, and so on. Understanding that is key to doing all kinds of cool experiments (or from an industrial perspective, optimizations) that weren't accessible to us previously, and the stakes are very high when dealing with such a staple crop that feeds billions of people.
Also, many varieties of wheat are hexaploid - their chromosomes come in sets of six, instead of two for us humans. This isn't that exceptional for plants, but it means sorting out the chromosomes with the right alleles when almost each sequence is more or less replicated six times is, as the article title says, a nightmare of complexity. So even if you skip what others and I have said before, it is a technical feat and it sets a very hopeful precedent for future projects involving other species of similar complexity, and many crops and plants happen to belong to that category.
Genomes are used as a reference point for all sorts of genetic analysis. It's comparable to having the source code to a binary. Plant breeding in particular benefits from having better analytical tools on hand, since it's such a critical field with such a gargantuan and diverse pool of organisms to study.
In that analogy, perhaps it's more like having successfully read out every byte of a binary that was encoded onto a medium that is extremely hard to read off of.
Having the source code... I'm not sure what that would be analogous to, but it would probably be quite a long way off.
The genome is akin to the software of an organism. It's not exactly obfuscated, so we can actually learn a lot just by reading it. The next big benefits are all about relating the genome of one organism to others.
A single "reference" genome can serve as a high quality basis for interpreting new genomes. This is called resequencing. Small pieces of the new genomes are sequenced with very cheap techniques and mapped into the most-similar bits of the big genome, in effect allowing for a kind of guided reconstruction of the new genome. The short read lengths make it difficult to reconstruct a new genome de novo, but that's OK because if we have a reference we've already paid that price once for a related individual.
This process has bias but it's also expedient, say when you have thousands or millions of genomes in a study (as happens in agriculture).
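To make the "mapped into the most-similar bits of the big genome" step concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the sequences are invented, and `map_read` is a brute-force helper I wrote for this comment, not how real aligners work (they use indexes and handle insertions and deletions).

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read):
    """Return (position, mismatches) of the best placement of read on reference.

    Brute force: try every offset and keep the first one with the fewest
    mismatches. No insertions or deletions are considered.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


# Invented toy data: one "reference" and three short reads from a "new" genome.
reference = "ACGTTAGCCGATAGGCTTAACG"
reads = ["TAGCCGAT", "GATAGGGT", "CTTAACG"]

placements = [map_read(reference, r) for r in reads]
for read, (pos, mm) in zip(reads, placements):
    print(f"{read} -> position {pos}, {mm} mismatch(es)")
```

The second read lands with one mismatch, which is the kind of small difference resequencing picks up easily. A read from a region that the reference lacks entirely would still get forced onto its "least bad" spot, which is one way the reference bias creeps in.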
One big problem in the field now is this obsession with reference genomes and resequencing against them. Researchers forget how much the reference genome can bias their resolution of a new genome. This has resulted in a lot of fussing and fixation on high-entropy regions of genomes and small variations (point mutations, or SNPs) within them. There is a growing body of evidence that suggests that large changes are probably more important for adaptation, but they remain somewhat underappreciated because of this short read resequencing modality.
I recently started working in a related field (RNA sequencing) and none of the other responses have really captured "what" it is that's going on here - and that's kinda necessary for the "why". When we say that the human genome project sequenced the human genome and completed in 2001/2002, what we mean is that there are databases with 23 sequences of A,C,T,G's for each of the 23 chromosomes plus another for the mitochondrial DNA. This is "half" the DNA for a single person - or more accurately it was a composite of various parts of a couple people, in the case of the human genome project. It's only half because it's just one of the two sets of chromosomes that we get from our parents. It's not the genetic code of every person out there, just of one "reference" one. Variation within a species is quite low, so this is still extremely useful: chances are my genome looks a lot like the reference genome so if you want to talk about my genome, you can talk about it in terms of the reference genome. We can say that I have a SNP (single nucleotide polymorphism) where my DNA differs from yours in a single nucleotide (like switching an A for a C). It gives us a framework for talking about the genetic code of individuals of the species - we can say that the protein that does X is encoded in the gene at position Y on chromosome 12.
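Here is a minimal sketch of what "talking about my genome in terms of the reference" can look like. The two sequences are invented, and `find_snps` is a hypothetical helper that assumes the sequences are already lined up with no insertions or deletions, which real data does not guarantee.

```python
def find_snps(reference, sample):
    """List (position, ref_base, sample_base) wherever two sequences differ.

    Assumes both sequences are already aligned and equal in length
    (no insertions or deletions). Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Invented toy sequences standing in for a reference and one individual.
reference = "ATGGCCATTGTA"
sample = "ATGGCCCTTGTA"

snps = find_snps(reference, sample)
for pos, ref_base, alt_base in snps:
    print(f"position {pos}: reference has {ref_base}, sample has {alt_base}")
```

Instead of storing the whole sample, you only need the short list of differences plus the shared reference.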
You can view the human genome at UCSC genome browser [1] and zoom in to the "base" level to see the actual sequence. It has lots of "annotations" where projects have marked where genes are (most of the genetic code does not encode proteins, genes are the parts that do, the rest is less well understood). You can see data there that shows you which parts of the genome vary from one species to another (generally mammals will be very similar to us in genes and less so between genes where the DNA is "less important" and so changes there don't result in death, so looking at where things are "conserved" by evolution tells you where the most fragile and important parts of the genome are, to a first guess).
What "they don't tell you" is that it's still not really complete. For example, there are bits of genetic sequence that we know go somewhere in the genome, but we don't know where. The way we sequence DNA, you only get to read little pieces at once and then have to string them together to reconstruct the whole thing. But one problem is repeats: sometimes the genome just goes "ATATATATATATATAT...." and if the repeat is longer than each piece you can read, it becomes impossible to know how many copies there are in total. Ambiguities like this exist because our DNA isn't just random strings of data, it has lots of structure that repeats from one place to another. So it can be impossible to tell where exactly a piece of DNA belongs, given the limited information we have. Also the annotations of where genes are are very much not finalized.
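The repeat problem is easy to show with a toy example. Below, two invented "genomes" differ only in how many times "AT" repeats. With idealized, error-free reads shorter than the repeat, both produce exactly the same set of reads, so nothing in the data can tell them apart; with reads long enough to span the repeat, they differ. The flanking sequences and the `make_genome` helper are hypothetical.

```python
def reads_of(genome, read_len):
    """Set of every substring of length read_len (idealized, error-free reads)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def make_genome(repeat_units):
    """Hypothetical genome: unique flank + 'AT' repeated + unique flank."""
    return "GGCTCAGTCC" + "AT" * repeat_units + "CGTTGACCAG"


genome_a = make_genome(10)  # 20 bases of repeat
genome_b = make_genome(15)  # 30 bases of repeat

# Reads of 8 bases are shorter than either repeat.
short_same = reads_of(genome_a, 8) == reads_of(genome_b, 8)

# Reads of 36 bases can cover a whole repeat plus some flank on both sides.
long_same = reads_of(genome_a, 36) == reads_of(genome_b, 36)

print("short reads identical:", short_same)
print("long reads identical:", long_same)
```

Real sequencing also gives you coverage depth, which can hint at copy number, but the basic ambiguity is the one shown here.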
There are other problems with the sequences. For example, they'll include "N" to mean "some base, we don't know which" - there are millions of N's at the start of the chromosomes I've looked at. In reality, we all have some actual code there, but it's never been sequenced for anyone. Other structures confuse us: there is DNA encoding for ribosomal RNA which is used in ribosomes which are used to go from genetic code to actual proteins. So your body needs lots of ribosomes to keep everything being made so it needs lots of copies of the ribosomal DNA to produce enough ribosomal RNA - like 500 copies in a row. And it has those not just in one chromosome but in multiple. But the human genome project doesn't show it like that as it really should be: it has about half of one copy in the right spot. And then the rest of it is just spread in short bits throughout the rest of the places the code occurs. This causes problems if you are doing research and don't know what it is you're looking at, since it looks out of place. It's a complete mess. But despite that, it's good enough to do so much with.
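If you want to see where the unknown stretches are in a sequence you've downloaded, a few lines of Python will do it. This is just a sketch on an invented string; `n_runs` is my own helper name, and it assumes unknown bases are written as upper-case "N".

```python
import re


def n_runs(sequence):
    """Return (start, length) for every run of unknown bases ('N'), 0-based."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"N+", sequence)]


# Invented toy sequence with unknown stretches at the start, middle and end.
toy = "NNNNNNACGTACGTNNNACGGTANN"

runs = n_runs(toy)
unknown_fraction = sum(length for _, length in runs) / len(toy)

for start, length in runs:
    print(f"{length} unknown base(s) starting at index {start}")
```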
Here's one use for a fully sequenced genome: RNA sequencing (which, I'll reiterate, I just started working on a month ago so I've probably said at least one false thing already). RNA is made from DNA and proteins are made from RNA. So if you want to know what proteins are in a cell - and they're the things that actually do the work - then you can look at what RNA is present. This is easier these days than checking the proteins themselves. But RNA on its own doesn't tell you a lot, so you "align" the RNA to a reference genome: given a little piece of RNA (like 100 nucleotide bases long), where does it fit in the genome? It often has a unique spot it could have come from, so now you know what DNA was turned into what RNA. If there's a gene there, then you know what gene is being "expressed," meaning turned into protein via RNA. Genes don't do much if they're not expressed, so this is very useful to know to understand what is happening in a cell. Looking at RNA sequencing versus DNA sequencing is like looking at what software is running on your computer versus looking at what code is stored on the hard disk. Both tell you a lot and understanding what code is on the hard disk will help you understand what you're looking at when you're looking at what code is running, like "Oh, these operations belong to the bash executable - so bash must be running".
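As a rough sketch of that "align, then see which gene it hit" idea: the genome, the gene coordinates and the reads below are all invented, and the matching is exact-string only. I'm also assuming the reads are written with T rather than U, as if they had already been converted to DNA letters. Real pipelines tolerate mismatches and deal with splicing, which this ignores.

```python
from collections import Counter

# Invented toy genome and annotation: gene name -> (start, end), 0-based, end exclusive.
genome = "TTATGGCCATTGAACCGTGCTAGCTAGGATCCAA"
genes = {"geneA": (2, 12), "geneB": (18, 30)}


def gene_at(position, annotation):
    """Name of the gene whose interval contains position, or None."""
    for name, (start, end) in annotation.items():
        if start <= position < end:
            return name
    return None


def count_expression(genome, reads, annotation):
    """Count reads per gene, using only reads with exactly one exact match."""
    counts = Counter()
    for read in reads:
        pos = genome.find(read)
        if pos == -1:
            continue  # read does not occur in the reference at all
        if genome.find(read, pos + 1) != -1:
            continue  # read fits in more than one place: ambiguous, skip it
        name = gene_at(pos, annotation)
        if name is not None:
            counts[name] += 1
    return dict(counts)


reads = ["ATGGCC", "GGCCATT", "CTAGGAT", "GCTAG", "AACCGT", "GGGGGG"]
expression = count_expression(genome, reads, genes)
print(expression)
```

Of the six reads, two land uniquely in geneA and one in geneB. "GCTAG" fits in two places so it is dropped, "AACCGT" lands between genes, and "GGGGGG" is not in the reference at all.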
[1] https://genome.ucsc.edu/cgi-bin/hgGateway (hit GO if you just want a piece of human DNA, hit the "base" button under zoom in if you want to see actual sequences, but that's too close to see much of the structure that the annotations provide)
The current treatment is strict dietary exclusion of all protein from wheat, barley, rye and other related grains. It’s conceivable that wheat could be modified to produce proteins that are not immunogenic to celiac sufferers.
The challenge would be making sure that only that modified wheat was harvested. I'm sure it could be done, but a needle in a haystack seems much easier to find than a conventional wheat in a celiac friendly wheat stack.
I'm not sure genetic mapping is really the core of the issue anyway. What we need is the equivalent of a reverse vaccination, since celiac is essentially an immunity to wheat. Such a concept might do for chronic disease what vaccination did for acute.
Sure, that would be difficult. Keeping wheat out of even dissimilar grains is currently a challenge for food producers and processors. The test would be the same, though, as testing any other grain for contamination, which is routine - test for gluten.
“Future potential treatments for patients with RCD include the development of genetically detoxified grains, oral and intranasal 'coeliac vaccines' to induce tolerance, inhibitors of TTG, and detoxification of immunogenic gliadin peptides via oral peptidase supplement therapy.”