IBM InfoSphere BigInsights is an industry-standard Hadoop offering that combines the best of open source software with enterprise-grade capabilities.
It helps organizations cost-effectively manage and analyze big data – the volume and variety of data that customers and businesses create and collect every day.
Big SQL: An ANSI SQL interface that gives BigInsights users a way to explore and analyze data stored in Hadoop using existing SQL skills, rather than requiring knowledge of MapReduce.
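To illustrate the idea, the query below uses only standard ANSI SQL of the kind Big SQL accepts. Since a live BigInsights cluster is not assumed here, the same query style is demonstrated against an in-memory SQLite database; the table and column names (`web_logs`, `url`, `hits`) are invented for the example.

```python
import sqlite3

# Stand-in for a Hadoop-backed table: Big SQL would run this same ANSI SQL
# over data in HDFS, with no MapReduce programming required by the user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO web_logs VALUES (?, ?)",
                 [("/home", 120), ("/about", 30), ("/home", 80)])

query = """
    SELECT url, SUM(hits) AS total_hits
    FROM web_logs
    GROUP BY url
    ORDER BY total_hits DESC
"""
for url, total in conn.execute(query):
    print(url, total)
```

The point is that an analyst writes familiar aggregate SQL (GROUP BY, ORDER BY) and the engine handles the distributed execution.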
BigIndex helps make Hadoop-based indexing easy by including it as a native capability in BigInsights. Based on Apache Lucene, BigIndex delivers low-latency, full-text search capabilities for big data. Indexes can be built, scanned and queried using the BigIndex module as part of a workflow.
Data Explorer’s search combines content and data from many different systems throughout the enterprise and presents it to users in a single view, dramatically reducing the amount of time spent looking for information and increasing their ability to work smarter.
Big R is a library of functions that provides end-to-end integration with the R language. Big R functions are similar to existing R functions, but are able to scale for use with big data.
InfoSphere Streams delivers a highly scalable, agile software infrastructure to perform in-motion analytics on a wide variety of relational and non-relational data types that enter the enterprise at unprecedented volume and speed, from thousands of real-time sources.
Provides POSIX-compliant, enterprise-class distributed file system support (IBM GPFS) that brings proven big data distributed file system capabilities to the Hadoop and MapReduce environment.
With more than 20 sample applications and two accelerators, InfoSphere BigInsights helps firms quickly benefit from their big data platform. Some of the sample applications include web crawling, data import/export, data sampling, social media data collection and analysis, machine data processing and analysis, ad hoc queries and more.
Accelerators (extensive toolkits with dozens of pre-built software artifacts) enable firms to quickly deploy solutions for analyzing social media and machine data, such as log records, sensor data and more.
BigSheets: Web-based analysis and visualization tool with a familiar, spreadsheet-like interface, featuring D3 graphs, that enables analysis of large amounts of data and helps design and manage long-running data collection jobs.
Sophisticated text analytics with a vast library of extractors enabling actionable insights from large amounts of native textual data.
Mitigate risk with sensitive data discovery. Maintain an acceptable risk tolerance with data monitoring, within source systems and on Hadoop itself.
Adaptive MapReduce automatically adapts to user needs and system workloads to improve performance and simplify job tuning, while the workload scheduler provides optimization and control of job scheduling based on user-selected metrics.
Oozie: Management application that simplifies workflow and coordination between MapReduce jobs.
Jaql: Query language designed for JavaScript Object Notation (JSON), used primarily to analyze large-scale semi-structured data. Core features include user extensibility and parallelism.
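Jaql's own syntax cannot be run outside the product, but the style of pipeline it expresses – filter JSON records, then transform and aggregate them – can be sketched in plain Python. The sample records and field names below are invented for the illustration.

```python
import json

# Emulating a filter -> transform -> aggregate pipeline over JSON records,
# in the spirit of a Jaql query (not Jaql syntax).
records = json.loads("""[
    {"user": "alice", "action": "click", "ms": 120},
    {"user": "bob",   "action": "view",  "ms": 340},
    {"user": "alice", "action": "click", "ms": 80}
]""")

clicks = [r for r in records if r["action"] == "click"]   # filter step
summary = {}
for r in clicks:                                          # aggregate step
    summary[r["user"]] = summary.get(r["user"], 0) + r["ms"]

print(summary)  # total click time per user
```

Because the data is semi-structured JSON, no schema has to be declared up front; records simply carry the fields they have.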
ZooKeeper: Centralized infrastructure and set of services that enable synchronization across a cluster.
Hive: Warehouse infrastructure that supports ETL for data stored in HDFS.
HCatalog: Table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored.
Pig: Programming language designed to handle any type of data. It helps users focus more on analyzing large data sets and spend less time writing map and reduce programs.
Hadoop Distributed File System (HDFS) is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
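The core idea behind HDFS – files split into fixed-size blocks, each replicated across several nodes – can be sketched in a few lines. This is a conceptual toy, not the HDFS API; the block size, replication factor, and node names are made up for the example (real HDFS blocks default to 64–128 MB).

```python
BLOCK_SIZE = 4          # toy value; real HDFS defaults are 64-128 MB
REPLICATION = 3         # each block stored on 3 nodes
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data: bytes):
    """Split data into fixed-size blocks and assign each to REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for n in range(len(blocks)):
        # simple round-robin placement; real HDFS placement is rack-aware
        placement[n] = [NODES[(n + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

blocks, placement = place_blocks(b"hello hdfs world")
```

Replication is what lets the cluster survive the loss of a node: every block still exists on the remaining replicas.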
MapReduce: Framework for writing applications that process large amounts of structured and unstructured data in parallel across clusters of thousands of machines, in a reliable, fault-tolerant manner.
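The programming model itself is small: a map function emits key/value pairs, a shuffle groups them by key, and a reduce function aggregates each group. The classic word-count example below runs in-process rather than on a cluster, purely to show the shape of the model.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    for word in line.split():
        yield (word, 1)          # emit intermediate key/value pairs

def reduce_phase(word, counts):
    return (word, sum(counts))   # aggregate all values for one key

lines = ["big data big insights", "big sql"]
pairs = [kv for line in lines for kv in map_phase(line)]
pairs.sort(key=itemgetter(0))    # the "shuffle" step groups pairs by key
result = dict(reduce_phase(k, (c for _, c in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
```

On Hadoop the same two functions would run on many machines at once, with the framework handling the shuffle, scheduling, and recovery from node failures.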
Column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL.
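HBase's sparseness comes from its data model: a cell is addressed by a row key plus a column family:qualifier name, and a row stores only the columns it actually has. The sketch below mimics that model with nested dictionaries; it is a conceptual illustration, not the HBase client API, and the table/column names are invented.

```python
# Toy model of an HBase table: {row_key: {"family:qualifier": value}}
table = {}

def put(row, column, value):
    """Write one cell; absent columns simply take no space."""
    table.setdefault(row, {})[column] = value

def get(row, column, default=None):
    """Read one cell by row key and column name."""
    return table.get(row, {}).get(column, default)

put("user1", "info:name", "Alice")
put("user1", "info:email", "alice@example.com")
put("user2", "info:name", "Bob")   # no email cell -- nothing stored for it
```

Access is by key lookup (get/put/scan), which is why HBase, unlike a relational database, needs no SQL layer of its own.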
Flume: Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS.
Sqoop: ETL tool that supports transfer of data between Hadoop and structured data sources.
Solr: Enterprise search tool from the Apache Lucene project that offers powerful search features, including hit highlighting, as well as indexing capabilities, reliability and scalability, a central configuration system, and failover and recovery.