Genetic databases offer some fascinating opportunities in medicine. By tracking variations in genes across populations, researchers can learn who’s at greater risk for diseases, and which treatments might work best for a patient’s specific genome. The challenge is figuring out how to store and sift through all this data. That’s where recently developed cloud computing services for DNA, such as Google Genomics and Amazon Web Services (AWS) for genomics, come in. These services are paving the way for a whole new era of genetic data analysis on a petabyte scale. Genetic research, welcome to the future.
DNA is Big Data
Don’t be fooled by DNA’s small size; it contains a lot of information. A single genome has 6 billion nucleotide letters: As and Ts, Cs and Gs pairing together to write your unique code. Sequencing the tiny strand of DNA in your cells produces about 100 gigabytes of raw scientific data (about 1 gigabyte once the data are polished). On that scale, it’s easy for genetic databases to cruise into the petabyte range.
Let’s talk size for a moment. A petabyte is a million gigabytes. It’s hard to visualize that size, so I did some math. My parents have a monthly 2 GB limit on Internet usage. At that rate, they’d use the equivalent of a single raw genome in about four years. It would take them roughly 41,667 years to use a full petabyte.
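For anyone who wants to check that math, here is a minimal sketch in Python. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and uses the 100 GB raw-genome figure and the 2 GB monthly cap from above; the constant and function names are made up for this illustration.

```python
# Back-of-the-envelope check of the figures above.
# Assumption: decimal units, so 1 petabyte = 1,000,000 gigabytes.
GB_PER_PETABYTE = 1_000_000
RAW_GENOME_GB = 100   # raw data per genome, as quoted in the article
MONTHLY_CAP_GB = 2    # the 2 GB monthly Internet limit from the example


def years_to_use(total_gb, monthly_gb=MONTHLY_CAP_GB):
    """Years needed to consume total_gb at monthly_gb per month."""
    return total_gb / monthly_gb / 12


genome_years = years_to_use(RAW_GENOME_GB)
petabyte_years = years_to_use(GB_PER_PETABYTE)
genomes_per_petabyte = GB_PER_PETABYTE // RAW_GENOME_GB

print(f"One raw genome: {genome_years:.1f} years")
print(f"One petabyte: {petabyte_years:,.0f} years")
print(f"Raw genomes per petabyte: {genomes_per_petabyte:,}")
```

The last figure is the one that matters for researchers: at 100 GB apiece, about ten thousand raw genomes are enough to fill a petabyte.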
Why the cloud?
Like my parents, research facilities don’t have data capabilities on a petabyte level. But companies like Google, Amazon, IBM, and Microsoft do, and they can work with data at that scale quickly. The Google search index is around 100 petabytes, and it’s searchable in less than a second. This makes cloud computing services for DNA an amazing opportunity for storing, analyzing, and sharing genetic data across the research community.
Handling big data is a huge priority for genetic researchers. In spring 2013, the National Cancer Institute (NCI) asked scientists what computing resources they need to work on large-scale genomic projects. Specifically, the NCI wanted to know whether it’d be useful to create cloud-based stores of cancer data. The most popular response was that data storage without a cloud poses several issues.
Data limitations mean downloading common data sets, such as The Cancer Genome Atlas (TCGA), takes days for some researchers. The bulkiness of these databases makes them impractical for answering research questions or running comparisons against local research data. If researchers are going to make any headway, they need room to explore data, visualize it, analyze it, and share it. Cloud computing services give researchers that room, and the speed.
Google Genomics and AWS rely on MapReduce processes for faster analysis. Rather than using a single server to scrutinize data, the task is assigned to several servers that process the data together at the same time. In June, Google released a sneak peek of Cloud Dataflow, which works similarly to allow real-time data analysis of sequenced genomes (or other data) in the cloud.
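To make the MapReduce idea concrete, here is a toy sketch in Python. It is not how Google Genomics or AWS implement it: threads on one machine stand in for separate servers, the task (counting nucleotide letters) is chosen only for illustration, and every function name here is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(sequence, chunk_size):
    """Split the input so each worker gets its own slice."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def map_count(chunk):
    """Map step: count the nucleotide letters in one chunk."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_nucleotides(sequence, chunk_size=4, workers=3):
    chunks = split_into_chunks(sequence, chunk_size)
    # Threads stand in for the separate servers a real cluster would use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_count, chunks))
    return reduce(reduce_counts, partial_counts, Counter())


totals = count_nucleotides("GATTACAGATTACACCGT")
print(dict(totals))
```

Because each chunk is counted independently, adding more workers (or, in a real cluster, more servers) shortens the map step without changing the final answer.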
What’s garnering the most excitement, however, is Google’s BigQuery, a tool for searching for specific information within large databases in seconds. For genome research, this would help researchers pinpoint variation between sequences or find rare variants (those with a low minor allele frequency) within a large population database.
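Here is a small sketch of the kind of question such a search answers, written in plain Python rather than as a BigQuery query. The variant names, genotype data, and the 10% cutoff are all invented for illustration; a real population database would hold thousands or millions of genomes.

```python
def minor_allele_frequency(alt_allele_counts):
    """Frequency of the less common allele at one position.

    alt_allele_counts holds, for each person, how many copies (0, 1, or 2)
    of the alternate allele they carry.
    """
    total_alleles = 2 * len(alt_allele_counts)  # two copies per person
    alt_freq = sum(alt_allele_counts) / total_alleles
    return min(alt_freq, 1 - alt_freq)


def rare_variants(genotypes, threshold=0.1):
    """Return variants that do vary, but whose minor allele is rare."""
    rare = {}
    for variant, counts in genotypes.items():
        maf = minor_allele_frequency(counts)
        if 0 < maf <= threshold:
            rare[variant] = maf
    return rare


# Hypothetical genotypes for ten people at four positions.
genotypes = {
    "variant_A": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # alternate allele is rare
    "variant_B": [1, 2, 1, 0, 1, 2, 0, 1, 1, 1],  # both alleles are common
    "variant_C": [2, 2, 2, 2, 2, 2, 2, 2, 2, 1],  # reference allele is rare
    "variant_D": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # no variation at all
}

rare = rare_variants(genotypes)
print(sorted(rare))
```

On ten people this is instant anywhere. The appeal of a cloud query engine is running the same filter across an entire population database in seconds.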
Some tech startups are also offering platforms
Hosting room in the cloud for sequenced genomes has led the way to innovations elsewhere in the tech community. Startups like Seven Bridges Genomics, DNAnexus, and NextCODE Health offer platforms or “browsers” to search and analyze enormous genetic libraries. These companies foresee a future where doctors can use search databases to find treatments that work for a patient by comparing the patient’s DNA against thousands—or millions—of other sequenced genomes. In research, these browsers will be useful for finding patterns in what makes certain populations more susceptible to a disease, what causes rare diseases, or why treatments work for some patients but not others.
Pondering the progress, it’s astounding to think where research could be in 10 years. Think about it. A hundred years ago, no one even knew what DNA looked like. Even after the double-helix shape was discovered, the intricacy and complexity of DNA proved an enormous challenge to scientists. The Human Genome Project spent 13 years sequencing the first genome. 13 years! Today, it’s possible to sequence a genome in under a day. Tomorrow, it may take seconds.
Scientists working to unravel the mysteries of DNA have made incredible leaps in the past century. With Google, Amazon, and other tech companies on board, and competition driving down prices, it’s quite likely that great strides in genetic research will continue as cloud computing makes data analysis easier and faster.