Big Data, Big Open Source Tools
Feb 25, 2011 5:00 AM PT
Enterprises are grappling with the skyrocketing amount of data they have to handle as it grows into the terabyte and petabyte range.
Datasets that large are known as "big data" to IT practitioners.
Relational databases and desktop statistics or visualization packages can't handle big data; instead, massively parallel software running on up to thousands of servers is needed to do the job.
Many businesses turn to open source tools, such as Apache's Hadoop, when working with big data. For example, Twitter sends logging messages to Hadoop and writes the data directly into HDFS, the Hadoop Distributed File System.
Hadoop can support data-intensive applications ranging up to thousands of nodes and multiple petabytes, David Hill, principal at Mesabi Group, told LinuxInsider. It has received wide acceptance.
However, the term "big data" is just a general term for many different types of applications, and Hadoop won't be suitable in every case, Hill warned.
The capture, storage and analysis of big data depend on the nature of the particular application, Hill stated. For example, scale-out network attached storage such as EMC's Isilon or IBM's SONAS (Scale Out Network Attached Storage) might be a better fit than a tool such as Hadoop for unstructured data such as photographs or videos, he suggested.
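For readers unfamiliar with the map/reduce pattern that Hadoop runs in parallel across many servers, the following is a minimal single-machine sketch in plain Python. It counts log messages by severity level, loosely echoing the logging use case mentioned above. The log lines, the function names and the log format are all hypothetical, and nothing here uses Hadoop's actual API; it illustrates only the map, shuffle and reduce steps.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines; a real deployment would read these from a
# distributed file system rather than from an in-memory list.
LOG_LINES = [
    "2011-02-25 INFO user login",
    "2011-02-25 ERROR timeout",
    "2011-02-25 INFO user logout",
    "2011-02-25 WARN slow response",
    "2011-02-25 ERROR disk full",
]


def map_phase(line):
    """Emit one (level, 1) pair per log line."""
    level = line.split()[1]
    return [(level, 1)]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_levels(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_levels(LOG_LINES))
```

In a cluster, the map calls would run on the servers holding each slice of the data and the reduce calls would run per key, which is what lets the same logic scale to far larger datasets.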
Types of Big Data Work
Working with big data can be classified into three basic categories, Mike Minelli, executive vice president at Revolution Analytics, told LinuxInsider.
One is information management, a second is business intelligence, and the third is advanced analytics, Minelli said.
Information management captures and stores the information, BI analyzes data to see what has happened in the past, and advanced analytics is predictive, looking at what the data indicates for the future, Minelli said.
Revolution Analytics offers the open source R language and Revolution R Enterprise. These provide advanced analytics for terabyte-class datasets. Revolution Analytics is developing connectors to Hadoop and capabilities for R to run jobs on Google's Map/Reduce framework, Minelli said.
Tools for Working With Big Data
IBM's Netezza in its InfoSphere products, Oracle's Exadata, and EMC's Greenplum are other proprietary tools for working on big data.
EMC has introduced a free community edition of its Greenplum database. This community edition is software-only, Mesabi Group's Hill remarked.
Greenplum Community Edition doesn't compete with Hadoop; instead, it's "a project whose aims are to incorporate best of breed technologies together to provide the best choice of platform," Luke Lonergan, vice president and chief technology officer of the EMC Data Computing Products Division, told LinuxInsider.
The initial release of Greenplum Community Edition includes three collaborative modules -- Greenplum DB, MADlib, and Alpine Miner, Lonergan said.
"The version of Greenplum DB included is an advanced development version that will provide rapid innovation, MADlib provides a collection of machine learning and data mining algorithms, and Alpine Miner provides a visual data mining environment that runs its algorithms directly inside Greenplum DB," Lonergan elaborated.
Open source tools for big data include Hadoop, Map/Reduce, and Jaspersoft business intelligence tools.
Jaspersoft offers business intelligence tools that provide reporting, analysis and ETL (extract, transform and load) for massively parallel analytic databases including EMC Greenplum and HP Vertica. A version for IBM Netezza is in the works, Andrew Lampitt, director of business intelligence at Jaspersoft, told LinuxInsider.
Jaspersoft also provides native reporting through open source connectors for Hadoop and various types of NoSQL databases including MongoDB, Riak, CouchDB and Infinispan.
Further, Jaspersoft has an open source bridge to the R advanced analytics product from Revolution Analytics.
Open Source vs Proprietary Tools
Open source tools provide insight into the code so developers can find out what's inside when they do integration, Jaspersoft's Lampitt said.
"In almost every instance, open source analytics will be more cost-effective and more flexible than traditional proprietary solutions," Revolution Analytics' Minelli said.
"Data volumes are growing to the point where companies are being forced to scale up their infrastructure, and the proprietary license costs skyrocket along the way. With open source technology, you get the job done quicker and more accurately at a fraction of the price," he added.
Twitter is a case in point: It opted for Hadoop because proprietary tools would have been too expensive.
Further, open source tools let enterprises create new analytic techniques to better handle unstructured data such as images and photographs, Minelli said.
"Open source analytics tools let you create innovative analytics that you can bake into your enterprise. In today's ultra-competitive global economy, you just can't wait for a traditional vendor to develop a new analytic technique," Minelli added.
As in other spheres of IT, we're likely to see a mix of open source and proprietary technologies being used to work with big data.
"Short-term, open source analytics will become more and more widely used and will grow virally," Minelli opined. "Over the longer term you'll see a mix or a blend of techniques in highly competitive markets. My guess is that both will remain viable and necessary."