The Human Genome Project was a vast, long-running, internationally collaborative effort to determine the sequence of nucleotide base pairs that make up human DNA, and to identify and map all the genes of the human genome, both physically and functionally. It remains the world’s largest collaborative biological project.
The genome itself, however, is massive. Who’d have thought humans are so complex? With a genome of roughly three billion DNA base pairs, it can take as much as 100GB to store the sequence data for any individual human being as text using the letters A, T, C and G, which refer to the bases adenine, thymine, cytosine and guanine.
Researchers have already dealt with the issues of collecting and storing genome data, but actually analysing the data, to understand and identify disease markers and explore the difference between healthy and cancerous cells, has previously been a slow and complex affair. Specialised high-performance computing (HPC) hardware has been employed, but because it is expensive to buy, researchers end up queuing to use the few machines they have.
The Australian National University (ANU) Department of Genome Science had access to 30-40 local servers and a number of shared HPC environments. The researchers had workstations armed with 16 cores, 10TB storage and 128GB RAM. Originally, the cloud was viewed as a way to get temporary, usage-charged boosts in computing power during peak demand instead of having to buy another HPC server.
While the cost reduction was expected (the cloud cost a quarter as much as managing their own hardware), ANU also found that, for that cost, it received four times the computational power it had on-premises.
Biological data science research fellow, Dr Sebastian Kurscheid, of ANU’s John Curtin School of Medical Research, explained that cloud computing promised to accelerate the focus on the health aspects of genome research. “There are questions about how medical genomics has a more increasing relevance to clinical practice. It’s already important in the field of rare diseases but it’s becoming more and more relevant also in more common diseases,” he said.
Dr Kurscheid says he has spent about a third of his three-and-a-half-year research program getting technical elements in place so that he can run HPC analytics over many large data sets. Based on his work within Azure so far, he says that moving to a cloud-based solution earlier would have saved nine months, freeing him to focus on research.
“Our focus at BizData has been to deliver a seamless experience for researchers using the Microsoft Cloud. For example, today we enable a researcher to take an existing pipeline (for example in Snakemake or Galaxy) that they have already built and allow them to run secondary analysis in the cloud with as much computing power as needed, without changing a line of code. We also make it easy to analyse and collaborate on the research outputs, without having to wait for large volumes of data to download again.”
Dr Kurscheid notes: “The general infrastructure is available for going from raw data, as primary as it gets, to a highly analysed and visualised result and that would probably be used for some work that we are currently finalising that’s actually looking at the 3D structure of the genome in cancer cells. I’m envisaging that if we conduct all this analysis using Azure then also doing some really nice visualisation and exploratory analysis using the platform.”
Making genome analytics more accessible and affordable would open new clinical applications, he said.
“Part of the long-term vision is that in the medical field genomics becomes more widely available – it’s already important in rare diseases. As it becomes more common smaller hospitals or pathology services might see demand for this.
“I think that making these workflows and tools and analysis pipelines publicly available in a manner that is adaptable for others would support the broader uptake of genomics in the medical field.”