MLlib (RDD-based)
Classification
| 
 | Classification model trained using Multinomial/Binary Logistic Regression. | 
| Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. | |
| Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS. | |
| 
 | Model for Support Vector Machines (SVMs). | 
| Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. | |
| 
 | Model for Naive Bayes classifiers. | 
| Train a Multinomial Naive Bayes model. | |
| Train or predict a logistic regression model on streaming data. | 
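As an illustration of what training binary logistic regression with stochastic gradient descent involves, here is a minimal pure-Python sketch. It is not MLlib's implementation: the function names (`train_logistic_sgd`, `predict`), the fixed step size, and the absence of regularization and mini-batching are all assumptions made for brevity. Labels are taken to be 0 or 1.

```python
import math


def sigmoid(z):
    """Logistic function, split by sign to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(points, iterations=100, step=0.5):
    """Fit weights and intercept; `points` is a list of (label, features)."""
    weights = [0.0] * len(points[0][1])
    intercept = 0.0
    for _ in range(iterations):
        for label, features in points:
            margin = intercept + sum(w * x for w, x in zip(weights, features))
            # Derivative of the log loss with respect to the margin.
            error = sigmoid(margin) - label
            weights = [w - step * error * x for w, x in zip(weights, features)]
            intercept -= step * error
    return weights, intercept


def predict(weights, intercept, features, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(margin) > threshold else 0


# Hypothetical, linearly separable toy data.
data = [(0, [-2.0, -1.0]), (0, [-1.0, -2.0]), (1, [1.0, 2.0]), (1, [2.0, 1.0])]
weights, intercept = train_logistic_sgd(data)
predictions = [predict(weights, intercept, x) for _, x in data]
```

The L-BFGS trainer listed above optimizes the same kind of objective with a quasi-Newton method instead of per-point gradient steps.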
Clustering
| 
 | A clustering model derived from the bisecting k-means method. | 
| A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark. | |
| 
 | A clustering model derived from the k-means method. | 
| K-means clustering. | |
| 
 | A clustering model derived from the Gaussian Mixture Model method. | 
| Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm. | |
| 
 | Model produced by PowerIterationClustering. | 
| Power Iteration Clustering (PIC), a scalable graph clustering algorithm. | |
| 
 | Provides methods to set k, decayFactor, and timeUnit to configure the KMeans algorithm for fitting and predicting on incoming DStreams. | 
| 
 | Clustering model which can perform an online update of the centroids. | 
| Train Latent Dirichlet Allocation (LDA) model. | |
| 
 | A clustering model derived from the LDA method. | 
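The k-means entries above can be pictured with a small pure-Python sketch of the standard Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its points. This is only an assumption-laden illustration; the helper names are hypothetical, the initial centers are supplied by hand, and MLlib's own initialization and distributed execution are not modeled.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, initial_centers, max_iterations=20):
    """Run Lloyd iterations until the centers stop moving."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else center
            for members, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
centers = kmeans(points, initial_centers=[[0.0, 0.0], [9.0, 8.0]])
assignments = [closest(p, centers) for p in points]
```

Predicting the cluster of a new point then amounts to calling `closest` with the fitted centers.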
Evaluation
| 
 | Evaluator for binary classification. | 
| 
 | Evaluator for regression. | 
| 
 | Evaluator for multiclass classification. | 
| 
 | Evaluator for ranking algorithms. | 
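To make the regression evaluator concrete, the following sketch computes three common error metrics from (prediction, observation) pairs. The function name and the dictionary keys are hypothetical and do not mirror the evaluator's attribute names; only the standard definitions of mean squared error, root mean squared error, and mean absolute error are assumed.

```python
import math


def regression_metrics(prediction_and_observations):
    """Compute MSE, RMSE and MAE from (prediction, observation) pairs."""
    pairs = list(prediction_and_observations)
    n = len(pairs)
    errors = [prediction - observation for prediction, observation in pairs]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


metrics = regression_metrics([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
```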
Feature
| 
 | Normalizes samples individually to unit Lp norm. | 
| 
 | Represents a StandardScaler model that can transform vectors. | 
| 
 | Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. | 
| 
 | Maps a sequence of terms to their term frequencies using the hashing trick. | 
| 
 | Represents an IDF model that can transform term frequency vectors. | 
| 
 | Inverse document frequency (IDF). | 
| 
 | Word2Vec creates vector representation of words in a text corpus. | 
| 
 | Class for the Word2Vec model. | 
| 
 | Creates a chi-squared feature selector. | 
| 
 | Represents a chi-squared selector model. | 
| 
 | Scales each element of the vector by the corresponding entry of the supplied weight vector (an elementwise product). | 
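Two of the transformations above, Lp normalization and the hashing trick for term frequencies, are easy to sketch in plain Python. The code below is illustrative only: `normalize` and `hashing_tf` are hypothetical helpers, and `zlib.crc32` is used as a stand-in hash so the result is reproducible, which is an assumption rather than the hash MLlib uses.

```python
import zlib


def normalize(vector, p=2.0):
    """Scale a vector to unit Lp norm; a zero vector is returned unchanged."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector] if norm > 0 else list(vector)


def hashing_tf(terms, num_features=16):
    """Map terms to a sparse {index: count} dict using the hashing trick."""
    counts = {}
    for term in terms:
        # Hash the term into one of `num_features` buckets; collisions add up.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


unit = normalize([3.0, 4.0])           # the L2 norm of [3, 4] is 5
tf = hashing_tf(["a", "b", "a", "c"])  # four terms spread over 16 buckets
```

Because no vocabulary is stored, the mapping needs no fitting step, at the cost of possible collisions between distinct terms when `num_features` is small.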
Frequency Pattern Mining
| A Parallel FP-growth algorithm to mine frequent itemsets. | |
| 
 | An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm. | 
| A parallel PrefixSpan algorithm to mine frequent sequential patterns. | |
| 
 | Model fitted by PrefixSpan. | 
Vector and Matrix
| 
 | A dense vector represented by a value array. | 
| 
 | A simple sparse vector class for passing data to MLlib. | 
| Factory methods for working with vectors. | |
| 
 | |
| 
 | Column-major dense matrix. | 
| 
 | Sparse Matrix stored in CSC format. | 
| 
 | Represents QR factors. | 
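The two local matrix layouts named above, column-major dense storage and compressed sparse column (CSC) storage, can be compared with a short sketch. The helper names are hypothetical; the only assumption is the usual CSC convention of column pointers, row indices, and values.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC matrix into a list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries of column `col` live in positions col_ptrs[col]..col_ptrs[col + 1] - 1.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense


def column_major_to_dense(num_rows, num_cols, values):
    """Expand a flat column-major value array into a list of rows."""
    return [[values[col * num_rows + row] for col in range(num_cols)]
            for row in range(num_rows)]


# The same 3 x 2 matrix in both layouts:
#   9 0
#   0 8
#   0 6
from_sparse = csc_to_dense(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0])
from_dense = column_major_to_dense(3, 2, [9.0, 0.0, 0.0, 0.0, 8.0, 6.0])
```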
Distributed Representation
| 
 | Represents a distributed matrix in blocks of local matrices. | 
| 
 | Represents a matrix in coordinate format. | 
| Represents a distributively stored matrix backed by one or more RDDs. | |
| 
 | Represents a row of an IndexedRowMatrix. | 
| 
 | Represents a row-oriented distributed Matrix with indexed rows. | 
| 
 | Represents an entry of a CoordinateMatrix. | 
| 
 | Represents a row-oriented distributed Matrix with no meaningful row indices. | 
| 
 | Represents singular value decomposition (SVD) factors. | 
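The relationship between coordinate entries and indexed rows can be sketched locally. In the hypothetical helper below, each entry is a plain (row, column, value) tuple and each indexed row is an (index, dense values) pair; the distributed RDD backing is deliberately left out.

```python
from collections import defaultdict


def entries_to_indexed_rows(entries, num_cols):
    """Group (row, col, value) entries into sorted (row_index, dense_row) pairs."""
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, value in entries:
        rows[i][j] = value
    return sorted(rows.items())


entries = [(0, 0, 1.2), (2, 1, 3.7), (0, 2, 4.5)]
indexed_rows = entries_to_indexed_rows(entries, num_cols=3)
```

In this sketch, a row with no entries (row 1 here) simply does not appear, which is why the row index has to travel with the row.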
Random
| Generator methods for creating RDDs composed of i.i.d. samples from some distribution. | 
Recommendation
| 
 | A matrix factorization model trained by regularized alternating least squares. | 
| Alternating least squares matrix factorization. | |
| Represents a (user, product, rating) tuple. | 
Regression
| 
 | Class that represents the features and labels of a data point. | 
| 
 | A linear model that has a vector of coefficients and an intercept. | 
| 
 | A linear regression model derived from a least-squares fit. | 
| Train a linear regression model with no regularization using Stochastic Gradient Descent. | |
| 
 | A linear regression model derived from a least-squares fit with an L2 penalty term. | 
| Train a regression model with L2-regularization using Stochastic Gradient Descent. | |
| 
 | A linear regression model derived from a least-squares fit with an L1 penalty term. | 
| Train a regression model with L1-regularization using Stochastic Gradient Descent. | |
| 
 | Regression model for isotonic regression. | 
| Isotonic regression. | |
| 
 | Base class that has to be inherited by any StreamingLinearAlgorithm. | 
| 
 | Train or predict a linear regression model on streaming data. | 
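Isotonic regression, listed above, fits a non-decreasing sequence to the labels. A common way to compute it is the pool adjacent violators algorithm, sketched below in pure Python. This is a sequential illustration under stated assumptions (inputs already sorted by feature, hypothetical function name), not a description of how MLlib parallelizes the fit.

```python
def isotonic_regression(values, weights=None):
    """Return the non-decreasing fit to `values` in weighted least squares."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while two adjacent blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, weight_b, count_b = blocks.pop()
            mean_a, weight_a, count_a = blocks.pop()
            total = weight_a + weight_b
            pooled = (mean_a * weight_a + mean_b * weight_b) / total
            blocks.append([pooled, total, count_a + count_b])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fitted = isotonic_regression([1.0, 3.0, 2.0, 4.0, 3.5])
```

Here the out-of-order pairs (3.0, 2.0) and (4.0, 3.5) are each pooled to their mean, giving a monotone result.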
Statistics
| 
 | Trait for multivariate statistical summary of a data matrix. | 
| 
 | Contains test results for the chi-squared hypothesis test. | 
| Represents a (mu, sigma) tuple. | |
| Estimate probability density at required points given an RDD of samples from the population. | |
| 
 | Contains test results for the chi-squared hypothesis test. | 
| 
 | Contains test results for the Kolmogorov-Smirnov test. | 
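Estimating a probability density at given points from a sample, as the kernel density entry above describes, can be sketched as follows. The sketch assumes a Gaussian kernel whose standard deviation equals the bandwidth; the function name and default bandwidth are hypothetical.

```python
import math


def gaussian_kde(samples, points, bandwidth=1.0):
    """Estimate the density at each point as an average of Gaussian bumps."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in points
    ]


# With a single sample at 5.0, the estimate is one Gaussian centered there.
densities = gaussian_kde([5.0], [5.0, 6.0], bandwidth=1.0)
```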
Tree
| 
 | A decision tree model for classification or regression. | 
| Learning algorithm for a decision tree model for classification or regression. | |
| 
 | Represents a random forest model. | 
| Learning algorithm for a random forest model for classification or regression. | |
| 
 | Represents a gradient-boosted tree model. | 
| Learning algorithm for a gradient boosted trees model for classification or regression. | 
Utilities
| Mixin for classes which can load saved models using their Scala implementation. | |
| Mixin for models that provide save() through their Scala implementation. | |
| Utils for generating linear data. | |
| Mixin for classes which can load saved models from files. | |
| Helper methods to load, save and pre-process data used in MLlib. | |
| Mixin for models and transformers which may be saved as files. |