The products below help Hadoop support massive amounts of data. From colossal storage to mighty processing power, these are the technologies that enable Hadoop to manage a very large number of concurrent tasks.
HDFS
A distributed file system that provides high-throughput access to application data
HDFS is scalable, fault-tolerant, cost-efficient storage for big data.
Ambari
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, as well as supporting other Apache systems
Apache Ambari takes the guesswork out of operating Hadoop. It simplifies managing and monitoring Hadoop clusters by providing an easy-to-use web UI and REST API.
YARN
A framework for job scheduling and cluster resource management
YARN enables greater sharing, scalability, and reliability of a Hadoop cluster.
Oozie
A workflow scheduler system to manage Apache Hadoop jobs
Oozie provides users with the ability to define actions and dependencies between actions.
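The idea of actions linked by dependencies can be illustrated with a small dependency graph. The sketch below is plain Python, not Oozie's own workflow format, and the action names are hypothetical. It only shows how declared dependencies determine a valid run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "validate": {"ingest"},
    "aggregate": {"clean", "validate"},
    "report": {"aggregate"},
}

def run_order(actions):
    """Return one order in which every action runs after its dependencies."""
    return list(TopologicalSorter(actions).static_order())

order = run_order(workflow)
print(order)
```

Because "clean" and "validate" do not depend on each other, either may come first; a scheduler is free to run such actions in parallel.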
Slider
An application that deploys existing distributed applications on a Hadoop YARN cluster while simultaneously allowing users to make clusters larger or smaller
Slider keeps the size of managed applications consistent with the specified configuration, even if server or application failure occurs.
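Keeping an application at its configured size amounts to comparing the desired state with what is actually running and correcting the difference. The following is a minimal sketch of that comparison, assuming hypothetical component names and a hypothetical `reconcile` helper; it is not Slider's API.

```python
def reconcile(desired, running):
    """Return the instance-count changes needed to match the desired configuration.

    A positive value means instances to start, a negative value means instances to stop.
    """
    actions = {}
    for component, want in desired.items():
        have = running.get(component, 0)
        if have != want:
            actions[component] = want - have
    return actions

# Hypothetical configuration and cluster state: one worker was lost to a failure.
desired = {"master": 1, "worker": 3}
running = {"master": 1, "worker": 2}
print(reconcile(desired, running))
```

The same comparison covers resizing: changing the desired count up or down produces the start or stop actions needed to reach the new size.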
Spark
A fast and general in-memory compute engine for Hadoop data
Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
MapReduce
A reliable software framework for easily writing applications that batch process vast amounts of data in parallel on large clusters of commodity hardware
MapReduce makes it easy to scale data processing over multiple computing nodes.
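The programming model can be shown in miniature with a word count. This is a single-process Python sketch of the map, shuffle, and reduce steps, not Hadoop code; the function names are illustrative. In a real cluster, the map and reduce calls would be spread across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data in parallel"])
print(counts)
```

Each line is mapped independently and each word is reduced independently, which is what lets the work be divided across nodes.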
Ranger
A framework to enable, monitor, and manage comprehensive data security across the Hadoop platform
Ranger provides central security policy administration across the enterprise security requirements of authorization, authentication, audit and data protection.
Knox
A gateway that provides a single access point for all REST interactions with Apache Hadoop clusters
Knox is a solution that integrates with enterprise identity management solutions, protects the details of the cluster deployment, and reduces the number of services that clients need to interact with.
Pig
A high-level data-flow language and execution framework for parallel computation
Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin.
Solr
A popular, open-source enterprise search platform built on Apache Lucene
Solr is highly reliable, fast, scalable, and fault tolerant.
Sqoop
A tool that allows users to import structured data from their relational databases into HDFS and vice versa
Sqoop automates most of the data transfer process, relying on the database to describe the schema for the data to be imported.
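Letting the database describe the schema means the export step never hard-codes column names. The sketch below illustrates that idea with Python's built-in `sqlite3` module standing in for a relational database and CSV text standing in for files in HDFS; the table and the `export_table` helper are hypothetical, and this is not how Sqoop itself is invoked.

```python
import csv
import io
import sqlite3

def export_table(conn, table):
    """Dump a table to CSV text, taking the column names from the database itself."""
    # The table name is trusted in this sketch; it is interpolated directly.
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(conn.execute(f"SELECT * FROM {table} ORDER BY rowid"))
    return out.getvalue()

# Hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

csv_text = export_table(conn, "customers")
print(csv_text)
```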
Flume
A distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of unstructured log data
Flume is robust and fault tolerant with tunable reliability mechanisms, as well as many failover and recovery mechanisms.
Kafka
A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop.
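Publish-subscribe messaging can be pictured as producers appending to a named log while each subscriber reads from its own position in that log. The toy broker below is an in-memory Python sketch of that pattern under those assumptions; the `TinyBroker` class, topic, and subscriber names are hypothetical, and it is not the Kafka client API.

```python
from collections import defaultdict

class TinyBroker:
    """Toy in-memory broker: each topic is an append-only list of messages."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self.topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not seen yet, then advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(subscriber, topic)]
        self.offsets[(subscriber, topic)] = len(log)
        return log[start:]

broker = TinyBroker()
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")
first = broker.poll("analytics", "clicks")   # both messages published so far
broker.publish("clicks", "page=/pay")
second = broker.poll("analytics", "clicks")  # only the message published since
audit = broker.poll("audit", "clicks")       # a new subscriber reads the whole log
```

Because messages stay in the log after being read, subscribers are independent of one another: a slow or late subscriber does not affect what the others receive.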
Phoenix
An open source, massively parallel, relational database engine supporting OLTP for Hadoop
Phoenix aims to ease HBase access by supporting SQL syntax and allowing inputs and outputs using standard JDBC APIs.
HBase
A non-relational, distributed database that runs on top of HDFS
HBase is excellent for providing random, real-time read/write access to your big data.
Hive
A data warehouse infrastructure that provides data summarization and ad hoc querying
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements.
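A summarization query of the kind described here looks much the same in standard SQL and in HQL. The sketch below runs such a query with Python's built-in `sqlite3` module as a stand-in engine, since no Hive cluster is assumed; the `sales` table and its data are hypothetical. A `SELECT ... GROUP BY` statement like this one should carry over to Hive with little or no change.

```python
import sqlite3

# Plain aggregate query: order count and revenue per region.
QUERY = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY region
"""

# Hypothetical data, loaded into an in-memory database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7)],
)

rows = conn.execute(QUERY).fetchall()
print(rows)
```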
SystemML
A system that can automatically generate an array of hybrid runtime plans on MapReduce or Spark
SystemML increases the productivity of data scientists through its flexibility and data independence.
Hydra R
An ecosystem of components that allows institutions to build and deploy robust digital repositories supporting digital asset management applications and workflows
Hydra R provides an environment for institutions to combine their individual repository development efforts into a collective solution.
SparkR
A distributed data frame implementation that supports operations like selection, filtering and aggregation
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
Titan
A distributed analytics engine for processing property graphs with Hadoop
Titan offers support for very large graphs. Titan-supported graphs scale with the number of machines in the cluster.