Introduction
Predictive analytics is a branch of analytics that applies statistical techniques to historical data in order to predict future events. It is often the final output of data mining and machine learning processes, and draws on methods such as time series analysis and non-linear least squares. Predictive analytics can benefit many businesses: it uncovers relationships in the collected data and, based on those relationships, builds a pattern or model, which in turn produces predictive intelligence.
R is already one of the most popular programming languages for statistics and data analytics. In this article, we introduce a few trending R packages that are actively developed and already in use for predictive analytics.
Forecast
This package provides methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA (Autoregressive Integrated Moving Average) modeling.
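As a minimal sketch of these capabilities (using R's built-in AirPassengers series; the 24-month horizon is an illustrative choice), the following fits an automatic ARIMA model and an exponential smoothing state space model and produces forecasts:

# Assumes the forecast package is installed from CRAN
library(forecast)

# Automatic ARIMA model on the built-in AirPassengers series
fit_arima <- auto.arima(AirPassengers)
forecast(fit_arima, h = 24)        # forecast the next 24 months

# Exponential smoothing via a state space model
fit_ets <- ets(AirPassengers)
plot(forecast(fit_ets, h = 24))    # plot point forecasts and prediction intervals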
Further development of the forecast package has stopped now that the fable framework has taken its place. The forecast package will remain available and will be maintained with bug fixes only. For the latest features and ongoing development, it is recommended to do your forecasting with the fable package.
You can install the stable version of forecast from CRAN or the development version from GitHub.
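For example (the GitHub route assumes the remotes package and that the development repository is robjhyndman/forecast):

# Stable release from CRAN
install.packages("forecast")

# Development version from GitHub
# install.packages("remotes")
remotes::install_github("robjhyndman/forecast")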
Follow this link to find all the functions available in the forecast R package.
Fable
As mentioned above, fable provides a collection of commonly used univariate and multivariate time series forecasting models, including exponential smoothing via state space models and automatic ARIMA (Autoregressive Integrated Moving Average) modeling. These models work well within the fable R package, which also provides tools to evaluate, visualize, and combine models in a workflow compatible with the tidyverse.
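A minimal sketch of such a tidy forecasting workflow (assuming the fable, tsibble, and dplyr packages are installed; the two-year horizon is illustrative) might look like this:

library(fable)
library(tsibble)
library(dplyr)

# Convert a built-in time series to a tsibble and fit two models
as_tsibble(AirPassengers) %>%
  model(
    arima = ARIMA(value),   # automatic ARIMA selection
    ets   = ETS(value)      # exponential smoothing state space model
  ) %>%
  forecast(h = "2 years") %>%
  autoplot(as_tsibble(AirPassengers))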
Find installation instructions, examples, and learning material at the following link to the fable framework.
RPART
The decision tree is a powerful and popular predictive machine learning technique used for classification and regression in data science. This method is also known as Classification and Regression Trees (CART). The R package that implements the CART algorithm is called rpart (Recursive Partitioning And Regression Trees). rpart is a powerful machine learning library, yet it is very easy to use.
Mainly, you can use it to carry out the following tasks, sketched in the example after the list.
- Data Partition: split the data into training and test sets. This is called the holdout validation method for evaluating model performance.
- Feature Scaling: numeric features need to be scaled because the units of the variables may differ significantly, which can affect the modeling process. The usual approach is to center and scale the numeric features.
- Model Building: the next step is to build the classification decision tree. Start by setting the random seed, then specify the parameters that control the model training process.
- Model Evaluation: evaluate the model's performance on the training and test datasets.
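A brief sketch of these steps on the built-in iris dataset (the split proportion and control parameters are illustrative; explicit feature scaling, e.g. via scale(), is omitted here because decision trees are not sensitive to it):

library(rpart)

set.seed(123)

# Data partition: 70/30 holdout split
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Model building: classification tree with illustrative control parameters
fit <- rpart(Species ~ ., data = train, method = "class",
             control = rpart.control(minsplit = 10, cp = 0.01))

# Model evaluation: accuracy on the held-out test set
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$Species)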
Use the following link to read more about how to carry out the above steps, with an example, in the rpart package.
GGally
ggplot2 is a plotting system for R based on the grammar of graphics. GGally is an extension of ggplot2 that adds functions to reduce the complexity of combining geoms with transformed data. These functions include a pairwise plot matrix, a scatterplot matrix, a parallel coordinates plot, a survival plot, and several functions for plotting networks.
You can install this package either from CRAN or from GitHub. Find the installation syntax, further documentation, and the most commonly used GGally functions and their usage at this link.
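For instance, a pairwise plot matrix of the built-in iris dataset takes a single call (the colour and alpha mappings are illustrative):

# install.packages("GGally")   # from CRAN
library(GGally)
library(ggplot2)

# Pairwise plot matrix, coloured by species
ggpairs(iris, mapping = aes(colour = Species, alpha = 0.5))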
H2O
‘H2O’ is a scalable open-source machine learning platform. It provides parallelized implementations of many supervised and unsupervised machine learning algorithms. Some of them are mentioned below:
- Generalized Linear Models (GLM)
- Gradient Boosting Machines (including XGBoost)
- Random Forests, Deep Neural Networks (Deep Learning)
- Stacked Ensembles
- Naive Bayes
- Generalized Additive Models (GAM)
- ANOVA GLM
- Cox Proportional Hazards
- K-Means
- PCA
- ModelSelection
- Word2Vec
- A fully automatic machine learning algorithm (H2O AutoML)
All versions of R except R 3.1.0 are compatible with H2O. If you are currently using that version, you will have to upgrade R before using H2O.
The following R packages are required to use H2O from the R interface:
- methods
- statmod
- stats
- graphics
- RCurl
- jsonlite
- tools
- utils
Read further about ‘H2O’ and find out how to use it with code examples.
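A minimal sketch (assuming the h2o package is installed and a compatible Java runtime is available; the split ratio is illustrative) that starts a local H2O cluster and trains a gradient boosting classifier:

library(h2o)

h2o.init()                          # start a local H2O cluster

# Upload the built-in iris data to the cluster and split it
iris_hf <- as.h2o(iris)
splits  <- h2o.splitFrame(iris_hf, ratios = 0.8, seed = 1)

# Train a gradient boosting machine to predict Species
gbm <- h2o.gbm(x = 1:4, y = "Species",
               training_frame = splits[[1]],
               validation_frame = splits[[2]])

h2o.performance(gbm, valid = TRUE)  # evaluate on the validation split

h2o.shutdown(prompt = FALSE)        # stop the cluster when finished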
randomForest
In this method, many decision trees are created. Every observation is fed into every decision tree, and the most common outcome across the trees is used as the final output. For classification, a new observation is passed to each decision tree and the majority vote of the trees determines its predicted class.
An error estimate is computed from the cases that were not used while building each tree. This is called the out-of-bag (OOB) error and is reported as a percentage.
The R package randomForest is used to create random forests, i.e. ensembles of many decision trees. The package provides the function randomForest(), which is used to create and analyse random forests.
The basic syntax for creating a random forest in R is
randomForest(formula, data)
formula : a formula describing the predictor and response variables.
data : the name of the data set used.
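For example, fitting a classifier on the built-in iris dataset and inspecting the OOB error estimate (the number of trees is an illustrative choice):

library(randomForest)

set.seed(42)

# Grow 500 trees to predict Species from the other variables
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

print(rf)                          # includes the OOB error estimate
importance(rf)                     # variable importance measures
predict(rf, newdata = iris[1:5, ]) # predictions for new observations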
Read further about the usage, arguments, and examples for the randomForest R package through the link.
Igraph
igraph can handle large graphs very well. It provides functions for generating random and regular graphs, graph visualization, centrality methods, and more.
The igraph library provides functions for creating and manipulating graphs and analysing networks. Although it is written in C, compatible versions exist as Python and R packages.
The three most important properties of igraph are as follows:
- igraph is capable of handling large networks efficiently
- it can be productively used with a high-level programming language
- interactive and non-interactive usage are both supported
igraph is open source, and its source code can be downloaded from GitHub. Some R packages depend on igraph, such as tnet, igraphtosonia, and cccd. igraph runs on many operating systems, and its R library is well documented.
Basic functions of igraph include generating graphs and computing centrality measures, path-length-based properties, graph components, and graph motifs. It is also used for degree-preserving randomization. igraph can read and write Pajek and GraphML files, as well as simple edge lists.
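A minimal sketch of these basics with the igraph R package (the graph size and edge probability are illustrative):

library(igraph)

set.seed(1)

# Generate a random graph with 20 vertices and edge probability 0.15
g <- sample_gnp(20, p = 0.15)

degree(g)                         # degree centrality
betweenness(g)                    # betweenness centrality
components(g)$no                  # number of connected components

plot(g, vertex.size = 12, vertex.label = NA)

# Write the graph as a GraphML file
write_graph(g, "example.graphml", format = "graphml")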
[Example of a plot created using igraph. Source: Wikipedia]
nnet
An Artificial Neural Network (ANN) is a network of groups of small processing units modelled on the behaviour of biological neural networks. ANN algorithms have often been applied to noise identification, wavelet estimation, speed analysis, shear wave analysis, automatic reflector tracking, hydrocarbon prediction, reservoir characterization, and more.
The nnet R package can be used in predictive analytics scenarios where an ANN approach is appropriate. It allows you to create a neural network classifier. For better results, follow the recommended process of preparing the data, creating the neural network, evaluating the model's accuracy, and making predictions with the nnet package.
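A minimal sketch of that process on the built-in iris dataset (the hidden layer size and iteration count are illustrative choices):

library(nnet)

set.seed(7)

# Prepare the data: 70/30 holdout split (feature scaling could be added here)
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Create the neural network: one hidden layer with 5 units
nn <- nnet(Species ~ ., data = train, size = 5, maxit = 200, trace = FALSE)

# Evaluate accuracy on the test set and make predictions
pred <- predict(nn, newdata = test, type = "class")
mean(pred == test$Species)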
Follow this link for another example of applying the nnet package to a basic neural network task.
Conclusion
Predictive analytics is now essential for businesses that want to stay ahead of the competition. The R language offers exceptionally strong support for predictive analytics and continues to evolve, with many new packages being introduced to the R community and existing packages being developed further. This blog introduced some trending R packages for predictive analytics so that you can make the best use of them. Understand your requirements, find the R package that best suits them, and make data-driven decisions that deliver the greatest value for your business.