It now takes just 17 hours to sequence a human’s DNA. Once complete, a blueprint emerges with 6 billion pieces of data—or 3 billion base pairs—which take up nearly 1TB of storage.
But what would one do with all that data?
Essentially, we are entering a new era of personalized medicine where people will be able to make empowered decisions about their health based on genes. Whereas now you might have an instinct to avoid certain foods or substances, in the future you might actually know the specific molecular variant that makes you feel unwell and act accordingly.
For example, you could go for the regular latte or, knowing you possess a polymorphic variant, LCT-13910C>T, in intron 13 of the MCM6 gene that is 13,910 bp upstream of the initiation codon of LCT, choose a non-dairy alternative and switch up your vitamin regimen to avoid calcium deficiency.
To learn more, PCMag met with Liz Worthey, PhD, director for software development and informatics at the HudsonAlpha Institute for Biotechnology, a nonprofit institute built to be the most advanced lab for genomics in the world. No, we weren’t in Boston—or even Silicon Valley; we were nestled inside a 152-acre research park in Huntsville, Alabama.
Dr. Worthey is originally from Scotland and joined HudsonAlpha after several years as director of genomic informatics for the Human and Molecular Genetics Center at the Medical College of Wisconsin. She is also co-founder and Chief Product Officer at Envision Genomics, a company helping clinicians diagnose rare disease through the integration of genomic data into clinical care.
Here are edited and condensed excerpts from our conversation.
Although you started your career in immunology, you’re now focused on the data science side of genomics.
Yes, I don’t have a wet lab. My work is all about using computers to extract knowledge, and we have created our own software tools to allow us to do cutting-edge genomics.
What brought you to HudsonAlpha specifically?
The sort of questions I like to answer in genomics require a lot of data, storage, and computation. HudsonAlpha had already invested heavily in the sort of high-end sequencers my work requires, and they’re now moving over to even higher-end NovaSeq machines. We have two data centers here, on site, in Huntsville, with 12 petabytes of data and a 4,800-core cluster. We have a National Institutes of Health Undiagnosed Diseases Network genomic medicine grant for patients with undiagnosed diseases, or Mendelian disorders, which are caused by a single gene. HudsonAlpha is also focused on how to do genomics for a larger part of society: for those who have cancer, or other diseases.
Would you say genomics is moving more into preventative health and personalized medicine as it gets cheaper and faster to sequence DNA?
We’re definitely turning a corner, applying our work to much more complex situations. The real goal is to have a blueprint for people, so that, at the very first symptom, we can check that against the knowledge we’ve uncovered in the genome. We also want to serve people who don’t self-identify as having a genetic disease at all, but want to know more about their genomic inheritance to help themselves, and their families, in terms of both disease and health.
Have you had your own DNA sequenced?
Absolutely. In fact my genome is probably one of the most studied in the world and it’s all stored here, on a hard drive in my office.
Did you have any concerns beforehand?
I had maybe 15 minutes of apprehension before looking at it. My daughter was just two years old at the time. When you get your genome sequenced, it’s not just about you, but your whole family, and I was a bit worried about predispositions she might have inherited from me. But it’s fascinating stuff. When I’m at a conference and someone mentions a new disease-associated variant, I think: ‘Oh, let’s have a look,’ and start digging into my own data on my laptop to see if I have it.
I’m from the west coast of Scotland and no one in my family has ever had cardiovascular disease, which is a bit of a shocker, for that area. It turned out there are four or five variants which protect against this, and I have all of them. So that was a cool find.
Do you think it’s something future generations will have done as a matter of course?
In my opinion everyone should have it—and I believe everyone will have it in the future—but it’s a question of scale, right now. No-brainers include kids who are in the ICU, or who present with a genetic disorder. Definitely adults who have gone un- or misdiagnosed for years. It’s also useful to have your genome sequenced to help avoid misdiagnoses and issues with drug interactions, because what might look like something isn’t that at all, and we can often tell from the genome.
But how do you roll that out in a population which doesn’t have a geneticist and genetics counselor on hand?
We have all these, including a bioethicist on staff, here at HudsonAlpha. So let’s make sure we have controlled populations that we learn from first before we go out into the wider community.
Let’s talk about the specific computer science tools you use here.
We developed our own Java-based tertiary analysis tool called CODICEM, or CODI for short. It’s designed for clinical diagnostics as well as research use, so the users’ interactions conform to the way a clinical lab does its interpretations. Our goal is to make CODI easier to use, faster, and more efficient than anything out there on the market.
How does CODI work?
You select the patient you want to analyze, then start to filter the data. The patient I’m looking at now has 5.5 million variants, which is a tad too many to comb through, and possesses a rare undiagnosed disease. So we start by filtering out certain alleles—sequence variants—that have been seen before too often, changes not likely to be bad, or regions we already know are not associated with disease. We also prioritize certain variants based on various algorithms—i.e. where the likely consequence of the change is bad.
For example, this one is a polymorphism; it’s why their eyes are blue. But this one is why they’re at risk from autoimmune diseases, and it is causing the problem. We can also check whether the same variant is present in dad, for example, and thus not interesting, because we have their data, too. I can then prioritize for phenotypes related to specific disease characteristics, and so on.
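The filtering Dr. Worthey describes can be sketched in a few lines of Python. This is only an illustration of the general idea, not CODI's actual code (CODI is Java-based and its internals are not described here). The `Variant` fields, the `filter_variants` function, the 1 percent frequency cutoff, the gene names, and the rule that a variant shared with the father is uninteresting (which assumes the father is unaffected) are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A hypothetical, heavily simplified variant record."""
    gene: str
    change: str               # placeholder label, not real variant notation
    population_freq: float    # fraction of a reference population carrying it
    predicted_damaging: bool  # verdict of some assumed prediction algorithm
    in_father: bool           # also present in the father's data


def filter_variants(variants, max_freq=0.01, ignored_genes=frozenset()):
    """Drop unlikely candidates, then put predicted-damaging variants first."""
    kept = []
    for v in variants:
        if v.population_freq > max_freq:
            continue  # seen too often to explain a rare disease
        if v.gene in ignored_genes:
            continue  # region assumed not to be associated with disease
        if v.in_father:
            continue  # assumption: father is unaffected, so shared variants are dropped
        kept.append(v)
    # sorted() is stable, so the original order is kept within each group
    return sorted(kept, key=lambda v: not v.predicted_damaging)


# Made-up data for illustration only
variants = [
    Variant("GENE_A", "var1", 0.30, False, False),    # too common
    Variant("GENE_D", "var4", 0.0005, False, False),  # rare, not predicted damaging
    Variant("GENE_B", "var2", 0.001, True, False),    # rare, predicted damaging
    Variant("GENE_C", "var3", 0.002, False, True),    # also in father
    Variant("GENE_E", "var5", 0.0001, True, False),   # in an ignored gene
]

shortlist = filter_variants(variants, ignored_genes=frozenset({"GENE_E"}))
for v in shortlist:
    print(v.gene, v.change, v.predicted_damaging)
```

A real pipeline would draw frequencies and damage predictions from curated databases and apply many more rules; the sketch only shows how successive filters shrink the list before a human looks at it.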
Is CODI doing the interpretation itself?
No, it’s not. What it’s doing is helping highly experienced humans, who are very expensive, to do their work more efficiently. Before CODI, this process took hours and hours. Now we can get to a usable data set for skilled interpretation in minutes, and identifying the disease-causing change usually takes less than 30 minutes. Basically what you’re doing, through all these steps, is whittling down the data to get from, in this case, 5.5 million variants to 305. Those are grouped by gene, shown by color, and linked to codes set out by the American College of Medical Genetics and Genomics, and those codes are rolled up into the ACMG classification for reporting of globally known genetic variants.
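Grouping a shortlist by gene, as described above, is a simple operation. The Python sketch below shows one way it might be done; the `group_by_gene` function and the gene and variant labels are hypothetical and do not come from CODI. The ACMG code roll-up is left out because its rules are not described in this conversation.

```python
from collections import defaultdict


def group_by_gene(variants):
    """Group (gene, change) pairs into a dict mapping gene -> list of changes."""
    groups = defaultdict(list)
    for gene, change in variants:
        groups[gene].append(change)
    return dict(groups)


# Made-up shortlist for illustration only
shortlisted = [
    ("GENE_B", "var2"),
    ("GENE_D", "var4"),
    ("GENE_B", "var7"),
]

by_gene = group_by_gene(shortlisted)
for gene, changes in by_gene.items():
    print(gene, len(changes), changes)
```

Presenting variants gene by gene lets a reviewer see at a glance when several candidate changes fall in the same gene.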
Can CODI allow multiple users/authors?
Yes, and that’s vital—so everyone that works on a patient’s case can submit comments or pick out ‘novel variants’, i.e. new knowledge gained through our analysis, and we get to collaborate together in real time, as a group. Another really clever aspect of CODI is that it produces reports in a natural language format.
As if a human wrote it?
Yes. Using full sentences and excellent grammar—it saves us a huge amount of time. These reports used to take a couple of hours for a human to write up. Now it takes CODI a couple of minutes to compile.
Any non-embodied AIs working on your clinical research team?
Yes. There are two parts to interpreting a genome: one is a stack of text detailing the patient’s clinical notes; the other is the set of molecular variants. The goal is to combine the profile and the molecular variants into a dataset which you can feed into software like CODI. We’ve created a tool called PyxusMap to help us apply machine learning and AI to uncover novel relationships between these types of data. It’s very fast at combining these two areas: unstructured clinical data, which can be messy, and structured genomic data. PyxusMap makes sense of it.
Finally, can you explain how you’re working with NASA?
I ended up consulting for NASA in Houston on the topic of ‘How can genomics help with sending humans to Mars?’ We know that there are many health issues humans face when going into space, and staying there for any length of time. Some get ocular or musculoskeletal problems, and quite a lot develop autoimmune diseases up there. However, interestingly, not everyone gets them, and that’s genetic. So they asked if we could use genomics to improve their outcomes.
Understandably NASA is very sensitive about testing astronauts with a view to not sending them because they’re ‘defective.’
Right. So we focused on looking at what they might be predisposed to, in terms of diseases, so we can send the right drugs with them, and enough in terms of quantity, in the payload. It’s unlikely there’ll be many doctors on the first mission to Mars so could we, by looking at someone’s genome, do the pre-mission clinical work to make sure they have what they need? That’s what we focused on at NASA, and it’s a very cool new area of study. One I’m hoping to continue working with them on.
Want to learn more? HudsonAlpha is hosting the Genomic Medicine Conference: Empowering Personal Health from March 26-28 in Huntsville.