Big Data and the Deep Blue Sea
The scale and impact of the Oceanic Observatories Initiative is closely followed by the magnitude of the computer science needed to make that data accessible and actionable by scientists. In a sense, the OOI and its infrastructure program, a major undertaking by the National Science Foundation, are constructing a programmable and integratable cloud fabric for oceanography on a big-data scale.
The Ocean Observatories Initiative (OOI) and its accompanying Cyberinfrastructure Program aims to provide an unprecedented ability to study the Earth's oceans and climate using myriad distributed data centers and literally oceans' worth of data.
The scale and impact of the science's importance is closely followed by the magnitude of the computer science needed to make that data accessible and actionable by scientists. In a sense, the OOI and its infrastructure program, a major undertaking by the National Science Foundation, are constructing a big data-scale programmable and integratable cloud fabric for oceanography.
We've gathered three leaders to explain the OOI and how the Cyberinfrastructure Program may not only solve this set of data and compute problems, but perhaps establish a path to how future massive data and analysis problems are solved.
Here to share their story on OOI are:
- Matthew Arrott, project manager at the OOI Cyberinfrastructure. Matthew's career spans more than 20 years in design leadership and engineering management for software and network systems. He's held leadership positions at Currenex, DreamWorks SKG, Autodesk, and the National Center for Supercomputing Applications. His most recent work has been with the University of California as e-Science Program Manager while focusing on delivering the OOI Cyberinfrastructure capabilities.
- Michael Meisinger, managing systems architect for the Ocean Observatories Initiative Cyberinfrastructure. Since 2007, Michael has been employed by the University of California, San Diego. He leads a team of systems architects on the OOI Project. Prior to UC San Diego, Michael was a lead developer in an Internet startup, developing a platform for automated customer interactions and data analysis. Michael holds a master's degree in computer science from the Technical University of Munich and will soon complete a PhD in formal services-oriented computing and distributed systems architecture.
- Alexis Richardson, senior director for the VMware Cloud Application Platform. He is a serial entrepreneur and a technologist. Previously, he was a founder of RabbitMQ and the CEO of Rabbit Technologies Limited, which was acquired by VMware in April of 2010. Alexis plays a role in both the cloud and messaging communities, as well as a leading role in Advanced Message Queuing Protocol (AMQP). He is a cofounder of the CloudCamp conferences, and a cochair of the Open Cloud Computing Interface at the Open Grid Forum.
The discussion is moderated by BriefingsDirect's Dana Gardner, principal analyst at Interarbor Solutions.
Listen to the podcast (40:00 minutes).
Here are some excerpts:
Michael Meisinger:The Ocean Observatories Initiative is a large, U.S. National Science Foundation project intended to build a platform for ocean sciences with an operational life span of 30 years.
It comprises a construction period of five years and will integrate a large number of resources and assets. These range from typical oceanographic assets, like instruments that are mounted on buoys deployed in the ocean, to networking infrastructure on the cyberinfrastructure side. It also includes a large number of sophisticated software systems.
I'm the managing architect for the cyberinfrastructure, so I'm primarily concerned with the interfaces through the oceanographic infrastructure, including beta interfaces, networking interfaces, and then primarily, the design of the system that is the network hardware and software system that comprises the cyberinfrastructure.
OOI's goals include serving the science and education communities with their needs for receiving, analyzing, and manipulating ocean sciences and environmental data. This will have a large impact on the science community and the overall public, as a whole, because ocean sciences data is very important in understanding the changes and processes of the earth, the environment, and the climate as a whole.
Ocean sciences, as a discipline, hasn't yet received as much infrastructure and central attention as other communities. So the OOI initiative is a very important to bring this to the community. It has an almost (US)$400 million construction budget, and an annual operations budget of $70 million for a planned lifetime of 25 to 30 years.
Dana Gardner: What are the big hurdles here in terms of a compute requirements? What makes this so challenging?
Matthew Arrott: It has a number of key aspects that we had to address. It's best to start at the top of the functional requirements, which is to provide interactive mission planning and control of the overall instrumentation on the 65 independent platforms that are deployed throughout the ocean.
The issue there is how to provide a standard command-and-control infrastructure over a core set of 800 instruments, about 50 different classes of instrumentation, as well as be able to deploy -- over the 30-year lifecycle -- new instrumentation brought to us by different scientific communities for experimentation.
The next is that the mission planning and control is meant to be interactive and respond to emergent changes. So we needed an event-response infrastructure that allowed us to operate on scales from microseconds to hours in being able to detect and respond to the changes. We needed an ability to move computing throughout the network to deal with the different latency requirements that were needed for the event-response analysis.
Finally, we have computational nodes all the way down in the ocean, as well as on the shore stations, that are accepting or acquiring the data coming off the network. And we're distributing that data in real time to any one who wants to listen to the signals to develop their own sense-and-response mechanisms, whether they're in the cloud, in their local institutions, or on their laptop.
The fundamental challenge was the ability to create a domain of control over instrumentation that is deployed by operators and for processing and data distribution to be agile in its deployment anywhere in the global network.
Gardner: Why is this a good time to try to solve this from a software distribution and data distribution perspective?
Alexis Richardson: It's the scale that's changed the architecture and deployment patterns that people have been using for these applications.
We can see the OOI project is essentially bringing the science needed to collaborate between vast numbers of sensors and signals and a comparatively smaller number of scientists, research institutions, and scientific applications to do analytics in a similar way as to how Facebook combines what people say, what pictures they post, what music they listen to with everybody's friends, and then allow an application to be attached to that.
So it's a huge technology challenge that would have been simply infeasible 12 years ago in the year 2000, when we thought things were big, but they were not. Now, when we talk about big data being masses of terabytes and petabytes that need to be analyzed all the time, then we're starting to glimpse what's possible with the technology that's been created in the last 10 years.
If we had been talking about this 12 years ago, in the year 2000, we would have been talking about companies like Google and Yahoo, which we would not have considered to be of moderate scale.
Since then, many companies have appeared. For example, Facebook, which has many hundreds of millions of users connecting throughout the world, shares vast amounts of data all the time.
In addition to that, many of these companies have brought out essentially a platform capability, whereby others, such as Zynga, in the case of Facebook, can create applications that run inside these networks -- social networks in the case of Facebook.
Arrott: The challenge goes beyond just the big data challenge. It also now introduces, as Alexis said, the concept of the instrument as an equal partner with the human in the participation in the network.
So you now have to think about what it means to have a device that's acting like a human in the network, and the notion that the instrument is, in fact, owned by someone and must be governed by someone, which is not the case with the human, because the human governs themselves. So it represents the notion of an autonomous agent in the network, as well as that agent having a notion of control that has to stay on the network.
Gardner: I'd like to try to explain for our audience a bit more about what is going on here. We understand that we have a tremendous diversity of sensors gathering in real-time a tremendous scale of data. But we're also talking about automating the gathering and distribution of that data to a variety of applications.
We're talking about having applications within this fabric, so that the output is not necessarily data, but is a computational numerical framework that's then distributed. So there's a lot of data, a lot of logic, and a lot of scale. Can one of you help step me through it all a bit more to understand the architecture of what's being conducted here?
Meisinger: The challenge, as you mentioned, is very heterogeneous. We deal with various classes of sensors, classes of data, classes of users, or even communities of users, and with classes of technological problems and solution spaces.
So the architecture is based on a tiered model or in a layered model of most invariant things at the bottom, things that shouldn't change over the lifetime of 30 years to serve the highest level of attention.
Then, we go into our more specialized layered architecture where we try to find optimal solutions using today's technologies for high-speed messaging, big data, and so on. Then, we go into specialized solutions for specific groups of users and specific sensors that are there as last-mile technologies to integrate them into the system.
So you basically see an onion layer model of the architecture, externalization outside. Then as you go toward the core, you approach the invariants of the system.
This architecture is based on defining a common interaction format. It's based on defining a common data format. Our architecture is strongly communication-oriented, service-oriented, message-oriented, and federated.
As Matthew mentioned, it's an important means to have the individual resources, agents, provide their own policies, not having a central bottleneck in the system or central governing entity in the system that defines policies.
Arrott: Think of it as its four core layers. There is the underlying network resource management layer. We talk about agents. They supply that capability to any process in the system, and we create devices that process.
The next layer up is the data layer, and the data layer consists of two core parts. One is the distribution system that allows for data to be moved in real-time from the source to the interested parties. It's fundamentally a publish-subscribe (pub-sub) model. We're currently using point-to-point as well as topic-based subscriptions, but we're quickly moving toward content-based routing, which is more based on the selector that is provided by the consumer to direct traffic toward them.
The other part of the data layer is the traditional harvesting or retrieval of data from historical repositories.
The next layer up is the analytic layer. It looks a lot like the device layer, but is focused on the management of processes that are using the big data and responding to new arrival of data in the network or change in data in the network. Finally, there is the fourth layer, which is the mission planning and control layer, which we'll talk about later.