Understanding and Using Hadoop
2 days
This course is designed to teach software researchers, engineers, and managers the basics of Hadoop.
What is Hadoop?
Apache Hadoop is an open-source software platform that enables distributed processing of large data sets across clusters of commodity servers. Hadoop can inexpensively process large amounts of data by applying the MapReduce programming model in a distributed computing environment. It is currently used by companies such as Facebook and the New York Times for tasks such as e-commerce and image processing, and by IBM in building the Watson supercomputer that won the Jeopardy! game show.
What Will I Learn?
As a result of attending this course, you will learn what Hadoop is, its strategic role in solving Big Data problems, and what types of problems are good candidates to solve with Hadoop. Topics covered in this course include:
- An introduction to Hadoop and its role in solving Big Data problems
- The MapReduce programming model
- The Hadoop Distributed File System (HDFS)
- The Hadoop ecosystem (e.g., Pig, Hive, HBase)
- Case studies of using Hadoop in industry
- An overview of cloud computing
- Installing and running Hadoop on a cluster and in the cloud
Who Should Attend?
This is an introductory course on Apache Hadoop. It is suitable for software researchers, engineers (developers and testers), and managers who wish to gain an understanding of the basic aspects of the Hadoop core (MapReduce, HDFS) and the Hadoop ecosystem (Pig, Hive, HBase).
What are the Prerequisites?
Basic knowledge of programming (e.g., Java) is helpful but not necessary.
About the Instructor
Tauhida Parveen, PhD, is an independent consultant specializing in cloud computing and software testing. She has worked in quality assurance with organizations such as the WikiMedia Foundation, MEI, Yahoo!, Sabre, and Progressive. She is an adjunct faculty member in the Department of Engineering Systems at the Florida Institute of Technology. She is co-author of the book Software Testing in the Cloud: Migration & Execution (Springer, 2012) and co-editor of the book Software Testing in the Cloud: Perspectives on an Emerging Discipline (IGI Global, 2012).
Course Outline
Introduction to Hadoop
- What is Hadoop?
- Hadoop’s history and relation to distributed computing
- Where Hadoop is used today
The MapReduce Programming Model
- Functional programming and its relation to MapReduce
- MapReduce programming model
- Map
- Reduce
- Shuffle and sort
- Combiner
- Writing a MapReduce program
- MapReduce examples (see the word-count sketch at the end of this section)
- Analyzing data with UNIX tools vs. MapReduce
- Job failure
- Job scheduling
- Task execution
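To give a flavor of the model before the hands-on sessions, here is a minimal sketch of the classic word-count example in Java against the org.apache.hadoop.mapreduce API; the class names are illustrative, not part of the course materials.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in every input line.
    class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle groups values by word, sum the counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Because summing is associative and commutative, the same reducer class can also be registered as a combiner, pre-aggregating counts on each map node before the shuffle.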
The Hadoop Distributed File System (HDFS)
- Distributed File Systems
- HDFS overview (a Java API sketch follows this section)
- HDFS Architecture
- Data organization
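As a taste of how HDFS is accessed programmatically, here is a minimal sketch using Hadoop's Java FileSystem API; the file paths are hypothetical and a configured cluster is assumed.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTour {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS (paths are hypothetical).
            fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                                 new Path("/user/student/sample.txt"));

            // Stream it back, line by line.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/user/student/sample.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }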
Using the Hadoop Core: MapReduce and HDFS
- Sample MapReduce program (see the driver sketch at the end of this section)
- MapReduce workflow
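A minimal driver sketch of the workflow of configuring and submitting a job follows; it reuses the word-count mapper and reducer sketched earlier and assumes the input and output paths are passed on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // local pre-aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Blocks until the job finishes; the exit code reflects success.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }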
The Hadoop Ecosystem: Pig, Hive, and HBase
- The Hadoop ecosystem
- Using Pig, Hive, and HBase (see the HBase client sketch at the end of this section)
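Pig and Hive are driven by their own languages (Pig Latin and HiveQL); as one concrete ecosystem example, here is a minimal HBase client sketch in Java, assuming the HBase 1.x+ client API and a hypothetical table named "users" with a column family "info".

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {
                // Write one cell: row "row1", family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Ada"));
                table.put(put);

                // Read the cell back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"),
                                               Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }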
Case Studies
- IBM
- Last.fm
Using Hadoop for Problem Solving (Hands on)
- Developing a MapReduce application (Weather data example)
- Other hands on examples
Cloud Computing
- What is cloud computing?
- The role of Hadoop in cloud computing
- Amazon Web Services (AWS)
Installing and Running Hadoop
- Flavors of Hadoop: Apache, Cloudera, and Amazon Elastic MapReduce
- Hadoop modes
- Building a Hadoop cluster
- Running a MapReduce job on a Hadoop cluster
- Running Hadoop in the cloud