Spark Fundamentals I

Course enrollment

Starts on: 11 February 2020
Enrollment closes on: 31 December 2021

Course Fee: US$99 - US$199

About Spark Fundamentals I Course

Learn the fundamentals of Spark, the technology that is revolutionizing the analytics and big data world! Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is the way to go.

  • Learn how it performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
  • Learn how it provides in-memory cluster computing for lightning-fast speed and supports Java, Python, R, and Scala APIs for ease of development.
  • Learn how it can handle a wide range of data processing scenarios by combining SQL, streaming, and complex analytics seamlessly in the same application.
  • Learn how it runs on top of Hadoop, on Mesos, standalone, or in the cloud, and can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
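
As a taste of the programming model described above, the classic word count is written as a chain of transformations. The sketch below mimics Spark’s flatMap/map/reduceByKey pipeline in plain Python so it runs without a Spark installation; in real PySpark the same chain would execute distributed across a cluster, with the data and file name here being only illustrative.

```python
from collections import Counter

# Plain-Python analogue of a Spark word count. In PySpark the equivalent is:
#   sc.textFile("data.txt").flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
lines = ["spark is fast", "spark is easy to use"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # 2
print(counts["is"])     # 2
```

The lazy-evaluation and in-memory caching that make the real Spark version fast are, of course, not captured by this single-machine analogue.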

Course Syllabus

  • Module 1 - Introduction to Spark - Getting started
    • What is Spark and what is its purpose?
    • Components of the Spark unified stack
    • Resilient Distributed Dataset (RDD)
    • Downloading and installing Spark standalone
    • Scala and Python overview
    • Launching and using Spark’s Scala and Python shells
  • Module 2 - Resilient Distributed Dataset and DataFrames
    • Understand how to create parallelized collections and external datasets
    • Work with Resilient Distributed Dataset (RDD) operations
    • Utilize shared variables and key-value pairs
    • Describe how data is stored in an HDFS cluster
  • Module 3 - Spark application programming
    • Understand the purpose and usage of the SparkContext
    • Initialize Spark with the various programming languages
    • Describe and run some Spark examples
    • Pass functions to Spark
    • Create and run a Spark standalone application
    • Submit applications to the cluster
  • Module 4 - Introduction to Spark libraries
    • Understand and use the various Spark libraries
  • Module 5 - Spark configuration, monitoring and tuning
    • Understand components of the Spark cluster
    • Configure Spark by modifying Spark properties, environment variables, or logging properties
    • Monitor Spark using the web UIs, metrics, and external instrumentation
    • Understand performance tuning considerations
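
The configuration topics in Module 5 center on Spark properties. As an illustrative sketch (the property names below are standard Spark settings, but the host name and values are only examples, not recommendations), a spark-defaults.conf file might look like:

```properties
# Point applications at a standalone cluster master (hypothetical host name)
spark.master                     spark://master:7077
# Memory allocated to each executor
spark.executor.memory            2g
# Record application events for the history server / web UI
spark.eventLog.enabled           true
# Use Kryo serialization instead of default Java serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```

The same properties can also be set per application with --conf flags on spark-submit, or programmatically on a SparkConf object.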

General Information

  • It is self-paced.
  • It can be taken at any time.

Recommended skills prior to taking this course

  • Basic understanding of Apache Hadoop and Big Data.
  • Basic Linux Operating System knowledge.
  • Basic understanding of the Scala, Python, R, or Java programming languages.

Course Staff

Henry L. Quach

Henry L. Quach is the Technical Curriculum Developer Lead for Big Data. He has been with IBM for 9 years, focusing on education development. Henry likes to dabble in a number of things, including being part of the original team that developed and designed the concept for the IBM Open Badges program. He has a Bachelor of Science in Computer Science and a Master of Science in Software Engineering from San Jose State University.

Alan Barnes

Alan Barnes is a Senior IBM Information Management Course Developer / Consultant. He has worked in several companies as a Senior Technical Consultant, Database Team Manager, Application Programmer, Systems Programmer, Business Analyst, DB2 Team Lead and more. His career in IT spans more than 35 years.

Course Outline

  • Getting started
    • General Information
    • Learning objectives
    • Syllabus
    • Grading Scheme
    • Change Log
    • Copyright and Trademarks
  • Resilient Distributed Dataset and DataFrames
    • Learning objectives
    • Resilient Distributed Dataset - Part 1
    • Resilient Distributed Dataset - Part 2
    • Resilient Distributed Dataset - Part 3
    • Lab - RDD and DataFrames
    • Python RDD Solution
    • Scala RDD Solution
    • DataFrames Solution
    • Graded Review Questions
  • Spark libraries
    • Learning objectives
    • Spark Libraries - Part 1
    • Spark Libraries - Part 2
    • Spark Libraries - Part 3
    • Lab - Scala Libraries
    • Solution - Part 1
    • Solution - Part 2
    • Solution - Part 3
    • Solution - Part 4
    • Graded Review Questions
  • Spark configuration, monitoring, and tuning
    • Learning objectives
    • Configuration, monitoring, and tuning - Part 1
    • Configuration, monitoring, and tuning - Part 2
    • Lab - Spark Fundamentals
    • Solution
    • Graded Review Questions
  • Course Certificate

Earn your certificate

Once you have completed this course, you will earn your certificate.

FAQs

How do I access the course, and when can I start?

Spark Fundamentals I is provided 100% online, so you will need access to the internet to use the course materials. When you enroll for this course, you will be able to access the course materials immediately from the course link in your dashboard. Please note that this course has been designed to be taken with Spark Fundamentals II; we therefore recommend that you complete this course first and then enroll for Spark Fundamentals II when you are ready. This will ensure you have covered the required topics for this subject.

What will I learn, and what prior knowledge do I need?

Spark Fundamentals I is intended to help you develop critical Spark skills, including working with distributed datasets and DataFrame operations. You will use Scala, Java, and Python to create and run a Spark application, create applications using Spark SQL, and configure and tune Spark. We therefore recommend a basic understanding of Apache Hadoop and big data, basic knowledge of Linux, and basic skills in the Scala, Python, and Java programming languages.

Will I earn a certificate?

Yes, once you have successfully completed the course, you will earn a Certificate of Completion. Remember, you will also have gained valuable skills that you can refer to in interviews and in your profile on LinkedIn!

Is the course completely online?

Yes. Spark Fundamentals I is totally online, so you do not need to turn up to any classes in person. This means, however, that you need access to the internet and the necessary technology to use the course materials.

The great thing is that this means you can take this course wherever you live. And though you’ll be sitting in your room alone, you won’t be learning alone, for you will be encouraged to communicate and chat with your peers through the discussion space.

Why learn Apache Spark?

Apache Spark is a fantastic data processing framework that can process large datasets quickly and distribute processing tasks across many computers. The capacity to do both makes Apache Spark an important tool for processing big data and developing machine learning applications. It also has an easy-to-use API that reduces the burden on developers. It’s therefore a great skill to have on your resumé and LinkedIn profile.