As extremely large datasets become more common, one way to keep them usable is an Open Data Commons. An Open Data Commons gives researchers and scientists open (free) access to a large quantity of data, along with the tools and APIs to sift through it.
Three weeks ago I was lucky enough to sit down with Dr. Robert Grossman at the University of Chicago. Dr. Grossman has been solving big data problems for the last 30 years. These days his major work involves creating Open Data Commons. Let me explain.
As a researcher or scientist, your first step often involves gathering data, or maybe downloading it. But what do you do when the data you need to access is 5 petabytes? That’s 5,120 terabytes, or more than 5 million gigabytes. Stored on Amazon Web Services, it would easily cost you over $100,000 per month, and that doesn’t include the computational power you’d need to parse and use the data. If you’re working with biomedical data, you’d have to worry about security compliance as well.
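To make the scale concrete, here is a rough back-of-the-envelope sketch of the storage math. It assumes binary units (1 PB = 1,024 TB) and a flat, hypothetical price of $0.021 per gigabyte per month; that rate is an illustrative assumption, not a quoted AWS price, and real pricing varies by tier, region, and usage.

```python
# Back-of-the-envelope storage math for a 5 PB dataset.
# ASSUMPTION: binary units (1 PB = 1024 TB, 1 TB = 1024 GB).
# ASSUMPTION: a flat, hypothetical storage price; not an official AWS figure.

PETABYTES = 5
TB_PER_PB = 1024
GB_PER_TB = 1024
ASSUMED_USD_PER_GB_MONTH = 0.021


def storage_estimate(petabytes, usd_per_gb_month=ASSUMED_USD_PER_GB_MONTH):
    """Return (terabytes, gigabytes, estimated monthly storage cost in USD)."""
    terabytes = petabytes * TB_PER_PB
    gigabytes = terabytes * GB_PER_TB
    monthly_cost = gigabytes * usd_per_gb_month
    return terabytes, gigabytes, monthly_cost


terabytes, gigabytes, monthly_cost = storage_estimate(PETABYTES)
print(f"{PETABYTES} PB = {terabytes:,} TB = {gigabytes:,} GB")
print(f"Estimated storage cost: ${monthly_cost:,.0f} per month")
```

Under those assumptions, 5 petabytes works out to 5,120 terabytes, a bit over 5.2 million gigabytes, and a storage bill above $100,000 per month before any compute or data transfer is counted.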
Dr. Grossman is the Founder and Director of the Open Commons Consortium, a not-for-profit that manages and operates cloud computing and data commons infrastructure to support scientific, medical, health care, and environmental research. With the support of the National Cancer Institute, he’s helped create one of the biggest data commons (5 petabytes), the Genomic Data Commons.
Bob sees real connections and commonalities in data-intensive work and has an authentic desire to make things easier for scientists of all disciplines by enabling them with technology. – Maria Patterson, Research Scientist, University of Washington
Cancer is a disease of the genome caused by changes in the DNA, RNA, and proteins of a cell. If researchers can identify the genomic alterations that arise in cancer, that could lead to improvements in diagnosis and treatment. The Genomic Data Commons contains over 14,000 cases where solid tumors were biopsied, measurements were taken, and DNA was sequenced. Per patient, this data can easily amount to half a terabyte. Over 1,000 scientists and researchers access this data on the Genomic Data Commons every month, and it’s managed at the University of Chicago.
Dr. Grossman suggested three ways to help further the work of the Open Commons Consortium (OCC):
- If you’re familiar with DevOps/BioOps, the OCC could use help standardizing and improving the software and architecture of Data Commons.
- The existing Data Commons provide access to data through their APIs. Dr. Grossman wants to find more developers to build applications using this data.
- You can become a citizen scientist and contribute your own data to the existing projects.
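For developers curious about the second suggestion, here is a minimal sketch of what querying a data commons API might look like. The base URL, the `projects` endpoint, and the `fields` and `size` parameters are my assumptions about the Genomic Data Commons’ public REST API; none of them come from my conversation with Dr. Grossman, so check the official API documentation before relying on them. The helper names `build_query_url` and `fetch_json` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: this base URL and the endpoint/parameter names used below are
# not given in the article; verify them against the official API docs.
ASSUMED_BASE_URL = "https://api.gdc.cancer.gov"


def build_query_url(endpoint, fields, size=5, base_url=ASSUMED_BASE_URL):
    """Build a query URL asking for `size` records with the given fields."""
    params = urlencode({"fields": ",".join(fields), "size": size})
    return f"{base_url}/{endpoint}?{params}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON response (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Build the URL only; no request is sent until fetch_json(url) is called.
url = build_query_url("projects", ["project_id", "name"], size=3)
print(url)
```

Calling `fetch_json(url)` would send the request. From there, an application could page through results, filter them, or join them with its own data.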
When I asked Dr. Grossman about things he learned the hard way in his career, he mentioned two ideas. The first was taking the long view and staying the course: not giving up on projects after two years, but staying persistent over longer periods. The other was not getting discouraged by your failures. As he puts it: “You only need one or two successes to justify quite a few failures.”
I couldn’t agree more.