Apache Hadoop is an open-source data platform, or framework, developed in Java and dedicated to storing and analyzing large sets of unstructured data.
With data exploding from digital media, the world is getting flooded with cutting-edge big data technologies, and Apache Hadoop was among the first to catch this wave of innovation. Let’s find out what the Hadoop software and the Hadoop system are. We will learn about the entire Hadoop ecosystem: Hadoop applications, Hadoop Common, and the Hadoop framework.
MapReduce™ is the heart of Apache™ Hadoop®. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.
For people new to this topic, it can be somewhat difficult to grasp, because it’s not typically something people have been exposed to previously. If you’re new to Hadoop’s MapReduce jobs, don’t worry: we’re going to describe it in a way that gets you up to speed quickly.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
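The two phases above can be sketched in plain Python. This is a toy stand-in for a real Hadoop job (which would be written in Java against the MapReduce API), counting words to show how the map job emits key/value tuples and the reduce job combines them into a smaller set:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map job: break each line down into (word, 1) key/value tuples."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce job: combine the tuples for each key into a smaller set."""
    # Hadoop sorts and shuffles by key between the phases; sorted() +
    # groupby() plays that role here.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["the"])  # 2
```

Note that, as in the text, the reduce phase only runs once the map phase has produced its tuples.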
SQL programmers required a language that was relatively easy to learn for someone with a SQL background and that, at the same time, was free of SQL’s excess baggage mentioned above and could easily handle large data sets.
Originally developed at Yahoo Research in 2006, Pig addressed all these issues and provided better optimization scope and extensibility. Apache Pig also allows developers to follow a multi-query approach, which reduces the number of data-scan iterations. It has provisions for a number of nested data types (maps, tuples, and bags) and commonly used data operations such as filters, ordering, and joins. These advantages have seen Pig adopted by a large number of users around the globe; its simplicity has led Yahoo and Twitter to rely on Pig for the majority of their MapReduce operations.
The DBMS systems that SQL operates on are generally considered faster than MapReduce (which Pig targets through the Pig Latin language). However, loading data is more challenging in an RDBMS, which makes the setup difficult. Pig Latin offers a number of advantages in terms of declaring execution plans, writing ETL routines, and modifying pipelines.
SQL is declarative, while Pig Latin is procedural to a large extent. What we mean by this is that in SQL we largely specify “what” is to be accomplished, whereas in Pig we specify “how” a task is to be performed. A script written in Pig is essentially converted into a MapReduce job in the background before it is executed. A Pig script is also shorter than the corresponding MapReduce job, which significantly cuts down development time.
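To see the procedural flavor, here is a toy Python sketch of the kind of step-by-step pipeline a Pig Latin script expresses. The data and steps are made up for illustration; a real script would use Pig’s LOAD, FILTER, and GROUP operators rather than Python:

```python
# Each step names an intermediate result, the way a Pig Latin script
# names intermediate relations (the "how", one step at a time).
records = [("alice", 34), ("bob", 17), ("carol", 25), ("dave", 16)]

# Step 1: filter - keep only rows where age >= 18.
adults = [(name, age) for name, age in records if age >= 18]

# Step 2: count the surviving rows.
adult_count = len(adults)

# SQL, being declarative, would state only the "what" in one shot:
#   SELECT COUNT(*) FROM records WHERE age >= 18;
print(adult_count)  # 2
```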
Apache Hive is a data warehouse management and analytics system built for Hadoop. Hive was initially developed by Facebook but soon became an open-source project and has been used by many other companies ever since. Apache Hive uses a SQL-like query language called HiveQL, whose queries are converted into MapReduce, Apache Tez, or Spark jobs.
HBase is a column-oriented database management system that runs on top of Hadoop Distributed File System (HDFS). It is well suited for sparse data sets, which are common in many big data use cases.
Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java much like a typical Apache™ MapReduce application. HBase does support writing applications in Apache™ Avro™, REST, and Thrift.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key, and all access attempts to HBase tables must use this Primary Key.
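A toy Python model makes this layout concrete. This is not the HBase API, just an illustration: each table maps a row key to a sparse dictionary of columns, and every read and write goes through that key:

```python
# Toy model of an HBase table: {row_key: {column: value}}.
# Rows may carry different columns, which is what makes the
# data set "sparse".
table = {}

def put(row_key, column, value):
    """All writes are addressed by the row key."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    """All reads are addressed by the row key."""
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

put("user#1", "info:name", "Alice")
put("user#1", "info:email", "alice@example.com")
put("user#2", "info:name", "Bob")   # no email column: a sparse row

print(get("user#1", "info:name"))   # Alice
print(get("user#2", "info:email"))  # None (column simply absent)
```

A missing column costs nothing to store, which is why HBase suits sparse data sets well.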
Avro, as a component, supports a rich set of primitive data types, including numeric types, binary data, and strings, as well as a number of complex types, including arrays, maps, enumerations, and records. A sort order can also be defined for the data.
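Avro schemas are themselves written in JSON, so the type mix above can be sketched with nothing but the standard library. The record and field names below are made up for illustration:

```python
import json

# A hypothetical Avro record schema mixing primitive types
# (long, string, bytes) with complex types (array, map, enum).
schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",     "type": "long"},
    {"name": "name",   "type": "string"},
    {"name": "avatar", "type": "bytes"},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "props",  "type": {"type": "map", "values": "string"}},
    {"name": "status", "type": {"type": "enum", "name": "Status",
                                "symbols": ["ACTIVE", "INACTIVE"]}}
  ]
}
""")

field_types = {f["name"]: f["type"] for f in schema["fields"]}
print(field_types["id"])  # long
```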
Hadoop Distributed File System (HDFS) is a highly reliable storage system. HDFS is the file system of Hadoop, designed for storing very large files on a cluster of commodity hardware. It is designed on the principle of storing a small number of large files rather than a huge number of small files. Hadoop HDFS also provides a fault-tolerant storage layer for Hadoop and its other components; HDFS replication of data helps us attain this feature.
It stores data reliably even in the case of hardware failure, and it provides high-throughput access to application data by serving data in parallel. Let us move ahead in this Hadoop HDFS tutorial with the major areas of the Hadoop Distributed File System.
As we know, Hadoop works in a master-slave fashion, and HDFS has two types of nodes that work in the same manner: the NameNode (the master) and the DataNodes (the slaves) in the cluster.
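A toy sketch of how replication backs the fault tolerance described above: files are split into fixed-size blocks and each block is copied to several DataNodes, with the NameNode keeping the metadata. The block size and round-robin placement here are simplified assumptions for the demo; real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement:

```python
# Toy model: the NameNode records which DataNodes hold each block.
BLOCK_SIZE = 4          # bytes; absurdly small, for demonstration only
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_replicas(blocks):
    """NameNode metadata: block index -> DataNodes holding a copy."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")   # 11 bytes -> 3 blocks
placement = place_replicas(blocks)
print(len(blocks))    # 3
print(placement[0])   # ['dn1', 'dn2', 'dn3']
```

Because every block lives on three different nodes, losing any single DataNode leaves at least two copies of each block intact.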