- I will start with a set of objective questions, which will give me an idea of the participants' level and encourage them to learn
- Every fundamental concept is backed by objective questions and hands-on exercises
- At the end of every day there will be an objective test
- There will be a number of real-life stories related to big data and Spark
- I will end the course with a set of multiple-choice questions that demonstrate the participants' improvement
Introduction to Big Data and Distributed Computing :
Big data analysis is the future. This section of the course will help you understand the need for distributed computation.
- Introduction to data.
- Data science: a vision.
- Introduction to big data.
- Parallel computation.
- Problems with parallel computation.
- Traditional parallel computation systems.
- Introduction to Hadoop.
- Hadoop Components.
- HDFS and its architecture.
- HDFS Commands
◦ rmdir and rm
- fsimage and edits log files.
- Hadoop property files.
- Introduction to MapReduce.
- Shortcomings of MapReduce.
- Introduction to Scala
- Scala variables
- Operators in Scala
- Introduction to interactive mode and script-based programming
- Scala data types and operations on them
- Scala collections (Tuple, Map, etc.)
- Control Flow and looping in Scala
- Functions in Scala (declaration, definition, types and calling)
- Object oriented Scala
- Introduction to functional programming in Scala.
- Pattern matching: an introduction.
Spark Introduction :
- Introduction to Spark.
- Spark and Hadoop (Similarity and Differences)
- Spark execution (master-slave system, driver, cluster manager and executors)
- Spark Shell
- Resilient Distributed Dataset (RDD)
Operations On RDD :
- Creation of RDD
- Transformation and Action Introduction
- Lazy evaluation
- Some important transformations
- Some important actions
- Creation of pair RDDs
- Some important transformations on pair RDDs
- Joins and their types
- Some important actions on pair RDDs
- Hands-on practice with all the functions
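To make the transformation, action and lazy evaluation topics above concrete, here is a minimal sketch in plain Python. It is not the Spark API: the `LazyDataset` class and the `square` helper are hypothetical names invented for this illustration. Transformations only record what should happen; nothing runs until the action `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []  # records every time the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = pipeline.collect()       # the action triggers the computation
```

Building `pipeline` does no work at all; `square` is called only while `collect` runs.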
Fault tolerance and Persistence :
- RDD lineage
- Benefits of persistence
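The idea behind lineage and persistence can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how Spark is implemented: the `Node` class and the `load` and `expensive` functions are hypothetical. Each node remembers its parent and the function that produced it, so a result can always be recomputed, and persisting a node avoids repeating that work.

```python
class Node:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, compute_fn, parent=None):
        self.compute_fn = compute_fn
        self.parent = parent
        self.persisted = False
        self._cache = None

    def persist(self):
        self.persisted = True
        return self

    def compute(self):
        if self.persisted and self._cache is not None:
            return self._cache  # reuse the stored result
        parent_data = self.parent.compute() if self.parent else None
        data = self.compute_fn(parent_data)  # rebuild from the lineage
        if self.persisted:
            self._cache = data
        return data


runs = {"expensive": 0}  # counts how often the costly step executes


def load(_):
    return [1, 2, 3, 4]


def expensive(data):
    runs["expensive"] += 1
    return [x * 10 for x in data]


source = Node(load)

# Without persistence, every use recomputes the expensive step.
derived = Node(expensive, parent=source)
derived.compute()
derived.compute()
runs_without_persist = runs["expensive"]

# With persistence, the second use is served from the stored result.
runs["expensive"] = 0
cached = Node(expensive, parent=source).persist()
first = cached.compute()
second = cached.compute()
runs_with_persist = runs["expensive"]

# If the stored copy is lost, the lineage is enough to rebuild it.
cached._cache = None
recovered = cached.compute()
```

The last three lines mimic a lost partition: the cached data is gone, yet the same result comes back because the recipe for producing it was kept.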
Optimizing Spark Programs :
- Introduction to partitioning
- Inbuilt partitioners (Hash and Range)
- Benefits of partitioning
- groupByKey and reduceByKey comparison
- Broadcast variables and accumulators
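The groupByKey and reduceByKey comparison above comes down to how many records have to cross the shuffle. The following plain-Python sketch simulates two partitions as lists of (key, value) pairs; the function names are hypothetical and the "shuffle" is just a list, so treat it as an illustration of the idea rather than of Spark's internals.

```python
from collections import defaultdict


def group_then_sum(partitions):
    """groupByKey-style: every record is shuffled, then values are summed."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def reduce_by_key(partitions):
    """reduceByKey-style: sum inside each partition, then merge partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

grouped_result, grouped_shuffled = group_then_sum(partitions)
reduced_result, reduced_shuffled = reduce_by_key(partitions)
```

Both approaches produce the same totals, but the reduceByKey-style version moves 4 records instead of 6 because each partition sends only one partial sum per key.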
IO in Spark :
- CSV files
- Data from HDFS
Spark Streaming :
- Introduction to Spark Streaming
- Reading from HDFS
- Window Concept
- Push-based and pull-based receivers
- Kafka integration with Streaming.
- Introduction to SparkSQL
- SparkSQL datatype
- DataFrame: an introduction.
- Creation of a DataFrame.
- Summary statistics on DataFrame.
- Aggregation on given data.
- SparkSQL and SQL
- Introduction to Hive.
- Using data from Hive and HiveQL.
- Optimizing SparkSQL code.
Spark Code Deployment and Cluster Managers :
- Submitting Spark code on the Standalone cluster manager.
- Submitting Spark code on YARN
- Submitting Spark code on Mesos
Note : Every part of the course will include hands-on exercises. A number of objective questions will help you exercise your brain.
Project 1 : Data preparation and aggregation using Spark Core. The aggregation will be implemented using Spark Core APIs, with MovieLens data as the input.
Project 2 : Implementing a word-frequency visualization of streaming data using Kafka and Spark Streaming integration.
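The core of the word-frequency computation in Project 2 can be sketched in plain Python, with no Kafka or Spark involved. Everything here is an assumption made for illustration: the function name, the tiny hard-coded batches, and the window length of 2 batches. Each inner list stands for one micro-batch of text lines, and the counts always cover the most recent batches only.

```python
from collections import Counter, deque


def windowed_word_counts(batches, window_length):
    """Word counts over the last `window_length` batches, after each batch."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    snapshots = []
    for batch in batches:
        counts = Counter(
            word for line in batch for word in line.lower().split()
        )
        window.append(counts)
        total = Counter()
        for batch_counts in window:
            total.update(batch_counts)
        snapshots.append(dict(total))
    return snapshots


batches = [
    ["spark streaming"],
    ["spark kafka"],
    ["kafka kafka"],
]
snapshots = windowed_word_counts(batches, window_length=2)
```

After the third batch the first batch has left the window, so "streaming" no longer appears in the final snapshot.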
Project 3 : Implementation of a moving average using SparkSQL.
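As a reference for what Project 3 computes, here is a minimal plain-Python sketch of a trailing moving average. The function name, the sample values, and the window size of 3 are assumptions for illustration. In SQL terms it corresponds roughly to an average over a window frame such as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.

```python
from collections import deque


def moving_average(values, window):
    """Average of the current value and up to `window - 1` preceding values."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    running_sum = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to fall out of the window
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


averages = moving_average([10, 20, 30, 40, 50], window=3)
```

The first two outputs average fewer than 3 values because the window is not yet full, which matches how a trailing window frame behaves at the start of a partition.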
Project 4 : Data preprocessing, data manipulation and aggregation using SparkSQL, carried out on real-time data.