Analyzing Big Data with Microsoft R

Learn how to use Microsoft R Server to analyze large datasets using R, one of the most powerful programming languages.

Analyzing Big Data with Microsoft R Highlights

Course Enrollment

Starts on: 01 January 2019

Enrollment closes on: 30 September 2019

Duration: Total 8 to 16 hours

Fee: Free

Enrollment is Closed

About this course

This course is part of the Microsoft Professional Program Certificate in Data Science and the Microsoft Professional Program Certificate in Big Data.

The open-source programming language R has long been popular, particularly in academia, for data processing and statistical analysis. Among R's strengths are that it is a succinct programming language and has an extensive repository of third-party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to go very quickly from raw data to summaries, charts, and even full-blown reports. However, one deficiency of R is that it traditionally uses a lot of memory, both because it needs to load a copy of the data in its entirety as a data.frame object and because processing the data often involves making further copies (sometimes referred to as copy-on-modify). This is one of the reasons industry has been slower than academia to adopt R.

The main component of Microsoft R Server (MRS) is the RevoScaleR package, an R library that offers a set of functions for processing large datasets without having to load them into memory all at once. RevoScaleR also offers a rich set of distributed statistical and machine learning algorithms, which continues to grow over time. Finally, RevoScaleR offers a mechanism by which we can take code that we developed on our laptop and deploy it, with minimal effort, on a remote server such as SQL Server or Spark, where the infrastructure is very different under the hood.

In this course, we will show you how to use MRS to run an analysis on a large dataset and provide some examples of how to deploy it on a Spark cluster or a SQL Server database. Upon completion, you will know how to use R for big-data problems.

Since RevoScaleR is an R package, we assume that course participants are familiar with R. A solid understanding of R data structures (vectors, matrices, lists, data frames, environments) is required. Familiarity with third-party packages such as dplyr is also helpful.

What you'll learn

You will learn how to use MRS to read, process, and analyze large datasets, including how to:

  • Read data from flat files into R’s data frame object, investigate the structure of the dataset and make corrections, and store prepared datasets for later use
  • Prepare and transform the data
  • Calculate essential summary statistics, do crosstabulation, write your own summary functions, and visualize data with the ggplot2 package
  • Build predictive models, evaluate and compare models, and generate predictions on new data

Course Syllabus

  • Familiarity with R

Meet the instructors

Liberty J. Munson

Jonathan Sanito

Senior Content Developer Microsoft

Jonathan works as a content developer and project manager for Microsoft, focusing on Data and Analytics online training. He has worked on training for developer and IT pro audiences, from Microsoft Dynamics NAV to Windows Active Directory.

Before coming to Microsoft, Jonathan worked as a consultant for a Microsoft partner, implementing Microsoft Dynamics NAV solutions.

Authman Apatira

Seth Mottaghinejad

Data Scientist Microsoft

Seth is a data scientist at Microsoft who specializes in training and consulting clients who use Microsoft R Server. His past work includes training teams of data scientists to use R and MRS, showing how MRS fits in the big-data architecture, helping with migration from tools such as SAS to R and MRS, and optimizing R performance. Before joining Microsoft, Seth worked as an analytics consultant at Revolution Analytics, the R-based big data and analytics company that was acquired by Microsoft in May 2015. Seth also has experience in marketing and customer analytics from prior jobs at American Express and Saks Fifth Avenue. He is a passionate "R-vangelist", an avid outdoorsman (who moved to Seattle to be close to lakes and mountains), and an amateur globetrotter.

Course Outline

Welcome
Syllabus
Grading
How to use edX
Pre-course Survey
Objectives
Introduction to Microsoft R
Overview of RevoScaleR
Analytics Life Cycle
Benefit of RevoScaleR
Installing the Microsoft R Client
Getting the Data
Understanding the Data
Installing the Required Packages
Knowledge Check
Objectives
Loading the Top 1000 Rows
Reading the Whole Data
XDF vs CSV
Knowledge Checks
Preparing the Data
Checking Column Types
A Simple Transformation
Complex Transformations
Examining New Columns
Plotting Neighborhoods
Adding Neighborhoods
Module Wrap-up
Knowledge Checks
Lab
Objectives
Examining Neighborhoods
Focusing on Manhattan
Examining Trip Distance
Examining Outliers
Filtering by Manhattan
Knowledge Checks
Objectives
Reordering Neighborhoods
Neighborhood Trends
Refactoring Neighborhoods
Trip Distribution Across Neighborhoods
Visualizing Trip Distribution
Time Related Patterns
Knowledge Checks
Lab 1
Lab 2
Objectives
Looking at Maps
Creating Clusters
Visualizing Clusters
Clustering Wrap-up
Objectives
A Linear Model for Tip Percent
Examining Predictions
Choosing Between Models
Using Other Algorithms
Comparing Predictions
Judging Predictive Performance
Knowledge Checks
Lab
Objectives
Deploying to SQL Server
Working with Spark (Part 1)
Working with Spark (Part 2)
Working with Spark (Part 3)
Knowledge Checks
Final Exam
Post-Course Survey
Congratulations!
Course Certificate

Earn your certificate

Once you have completed this course, you will earn your certificate.
