IBM Analytics Hadoop Architecture

Hadoop relies on the products below to support massive amounts of data. From colossal storage to mighty processing power, these technologies enable Hadoop to manage a very large number of concurrent tasks.

HDFS

A distributed file system that provides high-throughput access to application data

HDFS is scalable, fault-tolerant, cost-efficient storage for big data.

Ambari

A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, as well as supporting other Apache systems

Apache Ambari takes the guesswork out of operating Hadoop. It simplifies managing and monitoring Hadoop clusters by providing an easy-to-use web UI and REST API.

YARN

A framework for job scheduling and cluster resource management

YARN enables greater sharing, scalability, and reliability of a Hadoop cluster.

Oozie

A workflow scheduler system to manage Apache Hadoop jobs

Oozie provides users with the ability to define actions and dependencies between actions.
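
The idea of actions and dependencies between actions can be sketched with Python's standard `graphlib` module. This is only an illustration of dependency ordering, not Oozie's own workflow format or API, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def execution_order(actions):
    """Return an order in which every action runs after its dependencies."""
    return list(TopologicalSorter(actions).static_order())


order = execution_order(workflow)
print(order)
```

Because each action in this example depends on exactly one predecessor, only one valid order exists; a workflow with independent branches could be ordered in several ways.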

Slider

An application that deploys existing distributed applications on a Hadoop YARN cluster while simultaneously allowing users to make clusters larger or smaller

Slider keeps the size of managed applications consistent with the specified configuration, even if server or application failure occurs.
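
Holding an application at its configured size can be pictured as a reconciliation step that compares the desired instance count with the instances still running. The sketch below is a simplified illustration under that assumption, not Slider's actual implementation; the function and instance names are hypothetical.

```python
def reconcile(desired_count, running):
    """Return (instances_to_start, instances_to_stop) for a desired size.

    `running` is a list of names of instances that are currently healthy.
    """
    running = list(running)
    if len(running) < desired_count:
        # Too few instances (for example after a failure): start replacements.
        return desired_count - len(running), []
    # Enough or too many instances: stop any surplus beyond the desired count.
    return 0, running[desired_count:]


# One of three configured instances has failed, so one replacement is needed.
print(reconcile(3, ["worker-1", "worker-3"]))
```

The same function covers scaling down: lowering the desired count yields a list of surplus instances to stop.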

Spark

A fast and general in-memory compute engine for Hadoop data

Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

MapReduce

A reliable software framework for easily writing applications that batch process vast amounts of data in parallel on large clusters of commodity hardware

MapReduce makes it easy to scale data processing over multiple computing nodes.
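
The programming model can be illustrated with a single-process word count. This is a minimal sketch of the map, shuffle, and reduce steps in plain Python; it does not use the Hadoop API, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for each word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big clusters"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on different nodes; because each call only sees its own input, the work can be spread across machines without changing the functions.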

Ranger

A framework to enable, monitor, and manage comprehensive data security across the Hadoop platform

Ranger provides central security policy administration that covers the enterprise security requirements of authorization, authentication, audit, and data protection.

Knox

A gateway that provides a single access point for all REST interactions with Apache Hadoop clusters

Knox integrates with enterprise identity management solutions, protects the details of the cluster deployment, and reduces the number of services that clients need to interact with.

Pig

A high-level data-flow language and execution framework for parallel computation

Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin.

Solr

A popular, open-source enterprise search platform built on Apache Lucene

Solr is highly reliable, fast, scalable, and fault tolerant.

Sqoop

A tool that allows users to transfer structured data between relational databases and HDFS in either direction

Sqoop automates most of the data transfer process, relying on the database to describe the schema for the data to be imported.

Flume

A distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of unstructured log data

Flume is robust and fault tolerant with tunable reliability mechanisms, as well as many failover and recovery mechanisms.

Kafka

A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system

Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop.
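
The publish-subscribe pattern can be sketched with a toy in-memory broker that keeps an append-only log per topic and a read position per consumer. This is a simplified illustration of the pattern, not the Kafka client API; the class, method, and topic names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: one append-only message log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)
        # Maps (consumer, topic) to the index of the next unread message.
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not yet read on the topic."""
        start = self._offsets[(consumer, topic)]
        messages = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = start + len(messages)
        return messages


broker = Broker()
broker.publish("clicks", "page-1")
broker.publish("clicks", "page-2")
first_read = broker.poll("analytics", "clicks")
second_read = broker.poll("analytics", "clicks")
other_consumer = broker.poll("audit", "clicks")
print(first_read, second_read, other_consumer)
```

Each consumer tracks its own position, so a second subscriber receives the full stream even after the first has read it.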

Phoenix

An open-source, massively parallel, relational database engine supporting OLTP for Hadoop

Phoenix aims to ease HBase access by supporting SQL syntax and allowing inputs and outputs using standard JDBC APIs.

HBase

A non-relational, distributed database that runs on top of HDFS

HBase is well suited to providing random, real-time read/write access to big data.

Hive

A data warehouse infrastructure that provides data summarization and ad hoc querying

Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements.

SystemML

A machine learning system that automatically generates hybrid runtime plans for MapReduce or Spark

SystemML increases the productivity of data scientists through its flexibility and data independence.

Hydra R

An ecosystem of components that allows institutions to build and deploy robust digital repositories supporting digital asset management applications and workflows

Hydra R provides an environment for institutions to combine their individual repository development efforts into a collective solution.

SparkR

A distributed data frame implementation that supports operations like selection, filtering, and aggregation

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.

Titan

A distributed analytics engine for processing property graphs with Hadoop

Titan offers support for very large graphs. Titan-supported graphs scale with the number of machines in the cluster.