MTech Managing Big Data syllabus for 2 Sem 2020 scheme 20SCE21

Module-1 Meet Hadoop 0 hours

Meet Hadoop:

Data!, Data Storage and Analysis, Querying All Your Data, Beyond Batch, Comparison with Other Systems: Relational Database Management Systems, Grid Computing, Volunteer Computing. Hadoop Fundamentals. MapReduce: A Weather Dataset: Data Format, Analysing the Data with Unix Tools, Analysing the Data with Hadoop: Map and Reduce, Java MapReduce; Scaling Out: Data Flow, Combiner Functions, Running a Distributed MapReduce Job, Hadoop Streaming.
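The weather example above is usually introduced as a Hadoop Streaming-style map and reduce over temperature records. A minimal sketch of that idea follows; the tab-separated "year, temperature" record format is a simplifying assumption, not the fixed-width NCDC format used in the textbook.

```python
# Streaming-style map and reduce for a max-temperature job.
# Assumes a simplified "year<TAB>temperature" record format (hypothetical),
# not the fixed-width NCDC records used in the textbook's dataset.

def mapper(line):
    """Emit a (year, temperature) pair for one input record."""
    year, temp = line.strip().split("\t")
    return year, int(temp)

def reducer(year, temps):
    """Emit the maximum temperature observed for a year."""
    return year, max(temps)

def run_job(lines):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    groups = {}
    for line in lines:
        year, temp = mapper(line)
        groups.setdefault(year, []).append(temp)
    return {year: reducer(year, temps)[1] for year, temps in sorted(groups.items())}

records = ["1949\t111", "1949\t78", "1950\t0", "1950\t22", "1950\t-11"]
print(run_job(records))  # {'1949': 111, '1950': 22}
```

In real Hadoop Streaming the mapper and reducer are separate scripts reading stdin and writing stdout; the framework performs the grouping that `run_job` simulates here.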

 

The Hadoop Distributed Filesystem

The Design of HDFS, HDFS Concepts: Blocks, Namenodes and Datanodes, HDFS Federation, HDFS High-Availability, The Command-Line Interface, Basic Filesystem Operations, Hadoop Filesystems, Interfaces, The Java Interface, Reading Data from a Hadoop URL, Reading Data Using the FileSystem API, Writing Data, Directories, Querying the Filesystem, Deleting Data, Data Flow: Anatomy of a File Read, Anatomy of a File Write.
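The block concept above lends itself to a quick back-of-envelope calculation: a file is stored as fixed-size blocks, each replicated across datanodes. A small sketch, assuming the common 128 MB default block size and replication factor 3 (both are configurable per cluster):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # common HDFS default block size; configurable
REPLICATION = 3                  # common default replication factor; configurable

def hdfs_blocks(file_size_bytes):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def raw_storage(file_size_bytes):
    """Total bytes stored across the cluster, counting all replicas."""
    return file_size_bytes * REPLICATION

# A 1 GB file occupies 8 blocks and 3 GB of raw cluster storage.
one_gb = 1024 ** 3
print(hdfs_blocks(one_gb), raw_storage(one_gb))  # 8 3221225472
```

Unlike a disk filesystem, a file smaller than a block does not occupy a full block of storage; only the block's metadata is fixed-size.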

Module-2 YARN 0 hours

YARN

Anatomy of a YARN Application Run: Resource Requests, Application Lifespan, Building YARN Applications, YARN Compared to MapReduce, Scheduling in YARN: The FIFO Scheduler, The Capacity Scheduler, The Fair Scheduler, Delay Scheduling, Dominant Resource Fairness
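Dominant Resource Fairness, the last topic above, can be illustrated with a small calculation: a user's dominant share is their largest fractional use of any cluster resource, and the scheduler offers the next container to the user whose dominant share is smallest. The cluster capacities and allocations below are illustrative assumptions.

```python
# Dominant Resource Fairness (DRF) sketch: pick the user whose dominant
# share (largest fractional use of any cluster resource) is smallest.
# The cluster size and per-user allocations are made-up illustrations.

CLUSTER = {"cpu": 100, "memory": 1000}  # e.g. 100 cores, 1000 GB

def dominant_share(allocation):
    """A user's dominant share: max over resources of used/capacity."""
    return max(allocation[r] / CLUSTER[r] for r in CLUSTER)

def next_to_schedule(allocations):
    """DRF offers the next container to the lowest dominant share."""
    return min(allocations, key=lambda user: dominant_share(allocations[user]))

allocations = {
    "alice": {"cpu": 6, "memory": 30},   # dominant share 6/100 = 0.06 (CPU)
    "bob":   {"cpu": 2, "memory": 100},  # dominant share 100/1000 = 0.10 (memory)
}
print(next_to_schedule(allocations))  # alice
```

Note how alice wins even though she uses more CPU in absolute terms: her CPU-dominated share (0.06) is still below bob's memory-dominated share (0.10).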

 

Hadoop I/O

Data Integrity, Data Integrity in HDFS, LocalFileSystem, ChecksumFileSystem, Compression, Codecs, Compression and Input Splits, Using Compression in MapReduce, Serialization, The Writable Interface, Writable Classes, Implementing a Custom Writable, Serialization Frameworks, File-Based Data Structures: SequenceFile
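The Writable interface above is Java's mechanism for compact binary serialization; as a language-neutral analogue, the sketch below packs an IntWritable-like value the way Java's `DataOutput.writeInt` lays it out: 4 bytes, big-endian. The Python form is only an illustration of the byte layout, not the Hadoop API.

```python
import struct

# An IntWritable-like value serialized the way Java's DataOutput.writeInt
# does: 4 bytes, big-endian. Real Writables are Java classes implementing
# write()/readFields(); this Python analogue only shows the byte layout.

def write_int(value):
    """Serialize a 32-bit signed int to its Writable byte layout."""
    return struct.pack(">i", value)

def read_int(data):
    """Deserialize 4 big-endian bytes back to an int."""
    return struct.unpack(">i", data)[0]

raw = write_int(163)
print(raw.hex())      # 000000a3
print(read_int(raw))  # 163
```

The fixed, compact layout is why Writables are cheap to compare byte-by-byte during the sort phase without deserializing.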

Module-3 Developing a MapReduce Application 0 hours

Developing a MapReduce Application

The Configuration API, Combining Resources, Variable Expansion, Setting Up the Development Environment, Managing Configuration, GenericOptionsParser, Tool, and ToolRunner, Writing a Unit Test with MRUnit: Mapper, Reducer, Running Locally on Test Data, Running a Job in a Local Job Runner, Testing the Driver, Running on a Cluster, Packaging a Job, Launching a Job, The MapReduce Web UI, Retrieving the Results, Debugging a Job, Hadoop Logs, Tuning a Job, Profiling Tasks, MapReduce Workflows: Decomposing a Problem into MapReduce Jobs, JobControl, Apache Oozie
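The MRUnit idea above (drive a mapper or reducer with known inputs and assert on the emitted pairs, before ever touching a cluster) translates to any test framework. A Python sketch of that local-testing style, with word-count logic standing in for whatever job is under test:

```python
# MRUnit-style local testing idea: feed the mapper/reducer known records
# and assert on the emitted key-value pairs. The word-count logic is a
# stand-in example, not the job from the textbook.

def wordcount_mapper(line):
    return [(word.lower(), 1) for word in line.split()]

def wordcount_reducer(word, counts):
    return word, sum(counts)

def test_mapper():
    assert wordcount_mapper("Hello hello world") == [
        ("hello", 1), ("hello", 1), ("world", 1)]

def test_reducer():
    assert wordcount_reducer("hello", [1, 1, 1]) == ("hello", 3)

test_mapper()
test_reducer()
print("all tests passed")
```

The same tests then double as a specification when the logic is moved into real Mapper/Reducer classes and run in the local job runner.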

 

How MapReduce Works

Anatomy of a MapReduce Job Run, Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates, Job Completion, Failures: Task Failure, Application Master Failure, Node Manager Failure, Resource Manager Failure, Shuffle and Sort: The Map Side, The Reduce Side, Configuration Tuning, Task Execution: The Task Execution Environment, Speculative Execution, Output Committers
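The shuffle topics above rest on one mechanism: the map side partitions its output by key so that each reducer receives a disjoint, key-sorted slice. A sketch of that idea, using a hash partitioner analogous to Hadoop's default HashPartitioner (Python's `hash()` stands in for Java's `hashCode()`):

```python
# Shuffle sketch: partition map output by hash(key) mod numReducers (the
# idea behind Hadoop's default HashPartitioner), then sort each partition
# by key before it is handed to a reducer.

def partition(key, num_reducers):
    # Python's built-in hash() is a stand-in for Java's hashCode().
    return hash(key) % num_reducers

def shuffle(map_output, num_reducers):
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p) for p in partitions]  # each reducer's input is key-sorted

pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4)]
parts = shuffle(pairs, 2)
# Invariant: every occurrence of a key lands in exactly one partition.
assert sum(len(p) for p in parts) == len(pairs)
print(parts)
```

Which partition a given key lands in varies (string hashing is process-dependent in Python), but the invariants the real shuffle guarantees hold: all values for one key go to one reducer, in key order.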

Module-4 MapReduce Types and Formats 0 hours

MapReduce Types and Formats:

MapReduce Types, Input Formats: Input Splits and Records, Text Input, Binary Input, Multiple Inputs, Database Input (and Output); Output Formats: Text Output, Binary Output, Multiple Outputs, Lazy Output, Database Output.
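The "Input Splits and Records" topic is easiest to see with text input: TextInputFormat turns a split into records whose key is the byte offset of each line in the file and whose value is the line itself. A sketch of that keying, assuming `\n`-terminated lines:

```python
# TextInputFormat-style records: key = byte offset of the line within the
# file, value = the line's text (newline stripped). Assumes '\n' line ends.

def text_records(data: bytes):
    offset = 0
    for line in data.split(b"\n"):
        if line or offset < len(data):  # skip the phantom record after a trailing \n
            yield offset, line.decode()
        offset += len(line) + 1  # +1 for the consumed newline

data = b"first line\nsecond\nthird\n"
for key, value in text_records(data):
    print(key, value)
# 0 first line
# 11 second
# 18 third
```

The offset key is why mappers over text input typically ignore the key entirely and work only on the value.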

 

Flume

Installing Flume, An Example, Transactions and Reliability, Batching, The HDFS Sink, Partitioning and Interceptors, File Formats, Fan Out, Delivery Guarantees, Replicating and Multiplexing Selectors, Distribution: Agent Tiers, Delivery Guarantees, Sink Groups, Integrating Flume with Applications, Component Catalogue
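A Flume agent is wired together declaratively in a properties file naming its sources, channels, and sinks. A minimal single-agent sketch follows; the component names (`a1`, `r1`, `c1`, `k1`) and the HDFS path are illustrative assumptions, not required values.

```properties
# Minimal single-agent sketch: a netcat source feeding an HDFS sink
# through a memory channel. Names a1/r1/c1/k1 and the path are examples.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /tmp/flume/events
a1.sinks.k1.channel = c1
```

Note the asymmetry in the wiring: a source can feed several channels (`channels`, plural), but a sink drains exactly one (`channel`, singular).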

Module-5 Pig 0 hours

Pig

Installing and Running Pig, Execution Types, Running Pig Programs, Grunt, Pig Latin Editors, An Example: Generating Examples, Comparison with Databases, Pig Latin: Structure, Statements, Expressions, Types, Schemas, Functions, Data Processing Operators: Loading and Storing Data, Filtering Data, Grouping and Joining Data, Sorting Data, Combining and Splitting Data.
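The flavour of the Pig Latin topics above is visible in a short max-temperature style script. The input path and schema below are illustrative assumptions for the sketch:

```pig
-- Illustrative Pig Latin: load tab-separated (year, temperature) records,
-- filter bad readings, group by year, and keep each year's maximum.
-- The path and schema are assumptions for this sketch.
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
good    = FILTER records BY temperature IS NOT NULL AND temperature != 9999;
byyear  = GROUP good BY year;
maxima  = FOREACH byyear GENERATE group, MAX(good.temperature);
DUMP maxima;
```

Each statement names a relation, and Pig compiles the whole pipeline into MapReduce jobs only when an output operator such as DUMP or STORE is reached.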

 

Spark

An Example: Spark Applications, Jobs, Stages, and Tasks; A Java Example, A Python Example; Resilient Distributed Datasets: Creation, Transformations and Actions, Persistence, Serialization; Shared Variables: Broadcast Variables, Accumulators; Anatomy of a Spark Job Run: Job Submission, DAG Construction, Task Scheduling, Task Execution; Executors and Cluster Managers: Spark on YARN.
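The central RDD idea listed above, transformations recorded lazily until an action forces evaluation, can be mimicked in plain Python. This is an analogy to illustrate the idea, not Spark's actual API:

```python
# RDD-flavoured sketch: transformations (map/filter) are recorded lazily
# and nothing runs until an action (collect/count) is called. A plain-
# Python analogy of the idea, not Spark's actual API.

class MiniRDD:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = list(ops)          # deferred transformation pipeline

    def map(self, f):                  # transformation: lazy
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, f):               # transformation: lazy
        return MiniRDD(self._data, self._ops + [("filter", f)])

    def collect(self):                 # action: forces evaluation
        items = iter(self._data)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

    def count(self):                   # action built on collect
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

In real Spark the deferred pipeline is the DAG the scheduler turns into stages and tasks; laziness is what lets it fuse the map and filter into one pass.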

 

Course outcomes:

At the end of the course the student will be able to:

  • Understand the management of big data using Hadoop and Spark technologies
  • Explain HDFS and MapReduce concepts
  • Install, configure, and run Hadoop and HDFS
  • Perform MapReduce analytics using Hadoop and related tools
  • Explain Spark concepts

 

Question paper pattern:

The SEE question paper will be set for 100 marks and the marks scored will be proportionately reduced to 60.

  • The question paper will have ten full questions carrying equal marks.
  • Each full question is for 20 marks.
  • There will be two full questions (with a maximum of four sub questions) from each module.
  • Each full question will have sub-questions covering all the topics under a module.
  • The students will have to answer five full questions, selecting one full question from each module.

 

Textbook/ Textbooks

1. Hadoop: The Definitive Guide, Tom White, O'Reilly, Third Edition, 2012

 

Reference Books

1. Spark: The Definitive Guide, Matei Zaharia and Bill Chambers, O'Reilly, 2018

2. Apache Flume: Distributed Log Collection for Hadoop, D'Souza and Steve Hoffman, O'Reilly, 2014