As part of my PhD work, I've developed an open source Python library to experiment with latent factor models. It implements SVD-based rating prediction and recommendation, along with a k-fold cross-validation method to evaluate the model and to compare the SVD baseline against any further improvements I plan to add based on the social dimension. The cross-validation method has also been extended to identify the number of latent factors that minimizes the mean absolute error.

Download PyJLD from GitHub by cloning my PhD repository. You can find the library and the CLI tools under recommender-systems/matrix-factorization.

git clone git@github.com:jldevezas/phd.git

You can also find an R ggplot2 script named evaluation-charts.R under recommender-systems/r-workspace, which produces a cool looking chart like the one on the right. The script can be used directly with the CSV generated by cross_validation.py through the --output option, when the --feature-sampling-interval argument is used.

Figure 1: Average, over 10 folds, of the mean absolute error for a variable number of latent factors, sampled logarithmically. The sampling frequency is therefore higher for lower numbers of latent factors, since the error tends to stabilize for larger numbers of features. The ideal number of features, corresponding to the minimum error value, is highlighted.
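One way to generate such a logarithmic sampling of feature counts is with NumPy's logspace; this is a sketch of the idea, not necessarily the exact scheme used by cross_validation.py:

```python
import numpy as np

def log_sampled_feature_counts(max_features, n_samples):
    """Return up to n_samples feature counts spaced logarithmically
    between 1 and max_features (duplicates from rounding are dropped)."""
    points = np.logspace(0, np.log10(max_features), num=n_samples)
    return sorted(set(int(round(p)) for p in points))

# Denser at the low end, sparser at the high end, which matches
# where the error curve changes fastest.
counts = log_sampled_feature_counts(1000, 10)
```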

Manual

PyJLD consists of the library jld.py and the following command line tools that implement and serve as an example of the methods in this library: train.py, predict.py, recommend.py, recommend_by_query.py, nn.py, cross_validation.py and server.py.

jld.py

This library will eventually comprise all the Python classes I develop during my doctoral project. For now, however, it contains a single class, LatentFactorsModel, which implements rating prediction and recommendation using Singular Value Decomposition (SVD), as well as k-fold cross-validation to evaluate the model, compare it to future implementations, and find the number of latent factors that minimizes the error.
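The core idea behind SVD-based rating prediction can be sketched with plain NumPy; this is a simplified in-memory version for illustration, whereas LatentFactorsModel itself works on HDF5-backed matrices:

```python
import numpy as np

# A tiny user-item rating matrix (rows: users, columns: items);
# zeros stand in for unobserved ratings in this toy example.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])

# Factorize, then keep only the top-k latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat[u, i] is the predicted rating of user u for item i,
# including cells that were unobserved in R.
prediction = R_hat[1, 1]
```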

Available methods for LatentFactorsModel are listed below, along with an explanation of how to use them:

LatentFactorsModel(h5filename)
Creates a new latent factors model that enables you to train, predict, recommend and validate, with the model stored entirely on disk using the HDF5 format at the provided h5filename file path.
enable_normalization()
Turns on rating normalization for the user vector, scaling values to the range from zero to one.
disable_normalization()
Turns off rating normalization; all ratings are used exactly as they are given.
set_training_rank(rank)
Currently unused, but will later be useful to limit the number of latent factors and to support non-negative matrix factorization, to be offered as an option.
set_training_sample_size(sample_size)
Limit the number of ratings (lines) to be read from the training set CSV file.
set_training_csv_delimiter(delimiter)
Set the CSV delimiter to something other than the comma character.
get_training_rank()
Return the currently set rank value.
get_training_sample_size()
Return the currently set training set sample size, corresponding to the number of lines read from the training data CSV.
get_training_csv_delimiter()
Return the currently set CSV delimiter character, which by default is a comma.
train(cvs_path)
Load CSV data in the format (user,item,rating) into a bidimensional HDF5 dataset, as a user-item matrix; factorize this matrix using singular value decomposition, backed by disk storage; and store the resulting matrices. For more details on how SVD works, please read the hopefully quite clear SVD for dummies in the context of recommender systems.
predict(user_id, item_id)
Given a user_id and an item_id in the same format as the original training data CSV, return the predicted rating, which may be precomputed. Note that this method always returns the predicted rating, even if the user explicitly rated the item in the training set.
get_rating(user_id, item_id)
Given a user_id and an item_id in the same format as the original training data CSV, return the original rating, or None if the user didn't rate the specified item or if another error occurred, in which case a message with the error description will be logged to the command line.
precompute_predictions()
Assuming that the training has been done and is stored in the currently defined HDF5 model file, calculate and store the rating predictions for each user.
recommend(user_id, limit=None, all=False)
Given a user_id in the same format as the original training data CSV, return a ranked list of items according to their predicted ratings for the user. If limit is set, only the top limit recommendations are returned. If all is set, items explicitly rated by the user are also included in the results, ranked according to the predicted rating.
recommend_by_query(item_ratings, all=False)
Given a user vector of item_ratings, which must be the same dimension as the number of items available in the model, return an iterator of items ranked according to the predicted rating. This is useful if you want to train a model and then query it later for new users of a system.
nearest_neighbors(item_ratings, distance=scipy.spatial.distance.euclidean, limit=5)
Given a user vector of item_ratings, which must be the same dimension as the number of items available in the model, return the top limit nearest neighbors according to the distance metric provided, which defaults to the euclidean distance.
nearest_neighbor(self, item_ratings, distance=scipy.spatial.distance.euclidean)
Given a user vector of item_ratings, which must be the same dimension as the number of items available in the model, return the nearest neighbor, according to the distance metric provided, which defaults to the euclidean distance. This is equivalent to nearest_neighbors(item_ratings, distance, limit=1).
mean_absolute_error(original_prediction_indices_tuples)
Given a tuple original_prediction_indices_tuples containing the original ratings vector, the predicted ratings vector and the indices of the held ratings, return the mean absolute error. You can read more about evaluation in the k-fold cross-validation for dummies section where you can also find the formula for this metric.
root_mean_squared_error(original_prediction_indices_tuples)
Given a tuple original_prediction_indices_tuples containing the original ratings vector, the predicted ratings vector and the indices of the held ratings, return the root mean squared error. You can read more about evaluation in the k-fold cross-validation for dummies section where you can also find the formula for this metric.
k_fold_cross_validation(original_csv_path, k=10, given_fraction=0.8, feature_sampling=None, max_features=None, output_filename=None)
Given the original_csv_path for the training set CSV, the number k of folds and the given_fraction of ratings to use in the prediction, calculate and print the mean absolute error and the root mean squared error of the SVD recommendation algorithm. If feature_sampling is set (e.g. n=50), the system is validated over an interval of features varying from zero to the maximum possible value, sampled logarithmically for n different numbers of latent factors. Since the error tends to converge for large numbers of features, you can also use max_features to discard latent factors above that value. Validation results can be saved to a CSV given by output_filename.
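The hold-out evaluation underlying cross-validation can be sketched as follows; this is a simplified, self-contained version of the two metrics, not the library's exact signatures:

```python
def mae(original, predicted, held_indices):
    # Mean absolute error over the held-out ratings only.
    diffs = [abs(original[i] - predicted[i]) for i in held_indices]
    return sum(diffs) / len(diffs)

def rmse(original, predicted, held_indices):
    # Root mean squared error over the held-out ratings only.
    sq = [(original[i] - predicted[i]) ** 2 for i in held_indices]
    return (sum(sq) / len(sq)) ** 0.5

original = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]
held = [0, 2]  # indices of the ratings hidden during training

print(mae(original, predicted, held))   # (0.5 + 1.0) / 2 = 0.75
print(rmse(original, predicted, held))
```

In a k-fold run, this scoring is repeated for each of the k held-out partitions and the results are averaged.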

train.py

usage: train.py [-h] [-d DELIMITER] [-r RANK] [-s SIZE] [--precompute]
ratings_path model_path

Train model based on SVD matrix factorization for user-item rating prediction.

positional arguments:
ratings_path          a CSV file with no header and three columns: user_id,
	item_id, rating number
model_path            HDF5 file to store the trained model containing the
	factorized matrices

optional arguments:
-h, --help            show this help message and exit
-d DELIMITER, --delimiter DELIMITER
	the CSV column delimiter character (DEFAULT=',')
-r RANK, --rank RANK  the number of latent factors (DEFAULT=1000)
-s SIZE, --size SIZE  the size of the sample to take from the ratings CSV
	(DEFAULT=None)
--precompute          precompute and store prediction values in HDF5
	(DEFAULT=False)

predict.py

usage: predict.py [-h] [-f] user item model_path

Predict a normalized rating for any unrated item of an existing user.

positional arguments:
user                 user identification string, as defined in the training
   set
item                 item identification string, as defined in the training
   set
model_path           HDF5 file with the trained model containing the
   factorized matrices

optional arguments:
-h, --help           show this help message and exit
-f, --force-predict  returns the predicted value even when the original
   rating is available

recommend.py

usage: recommend.py [-h] user model_path

Recommend new items to an existing user ordered by predicted rating.

positional arguments:
user        user identification string, as defined in the training set
model_path  HDF5 file with the trained model containing the factorized
matrices

optional arguments:
-h, --help  show this help message and exit

recommend_by_query.py

usage: recommend_by_query.py [-h] [--love LOVE] [--like LIKE]
		 [--neutral NEUTRAL] [--dislike DISLIKE]
		 [--hate HATE]
		 model_path

Use qualitative user preferences to recommend new items.

positional arguments:
model_path         HDF5 file with the trained model containing the
 factorized matrices

optional arguments:
-h, --help         show this help message and exit
--love LOVE        comma-separated item IDs for loved items
--like LIKE        comma-separated item IDs for liked items
--neutral NEUTRAL  comma-separated item IDs for neutral items
--dislike DISLIKE  comma-separated item IDs for disliked items
--hate HATE        comma-separated item IDs for hated items
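Under some assumed mapping from the qualitative levels to normalized ratings (the values below, from love at 1.0 down to hate at 0.0, are illustrative; the tool's actual mapping is not documented here), building the query vector might look like:

```python
# Hypothetical mapping from qualitative preference to a normalized rating.
LEVELS = {"love": 1.0, "like": 0.75, "neutral": 0.5, "dislike": 0.25, "hate": 0.0}

def build_query_vector(items, preferences):
    """items: ordered list of all item IDs in the model;
    preferences: dict mapping a level name to a list of item IDs."""
    vector = [0.0] * len(items)
    index = {item: pos for pos, item in enumerate(items)}
    for level, item_ids in preferences.items():
        for item_id in item_ids:
            vector[index[item_id]] = LEVELS[level]
    return vector

items = ["i1", "i2", "i3", "i4"]
query = build_query_vector(items, {"love": ["i2"], "dislike": ["i4"]})
# query == [0.0, 1.0, 0.0, 0.25]
```

The resulting vector has the same dimension as the number of items in the model, which is what recommend_by_query expects.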

nn.py

usage: nn.py [-h] [--love LOVE] [--like LIKE] [--neutral NEUTRAL]
[--dislike DISLIKE] [--hate HATE]
model_path

Discover nearest neighbor in trained model based on qualitative user input.

positional arguments:
model_path         HDF5 file with the trained model containing the
 factorized matrices

optional arguments:
-h, --help         show this help message and exit
--love LOVE        comma-separated item IDs for loved items
--like LIKE        comma-separated item IDs for liked items
--neutral NEUTRAL  comma-separated item IDs for neutral items
--dislike DISLIKE  comma-separated item IDs for disliked items
--hate HATE        comma-separated item IDs for hated items

cross_validation.py

usage: cross_validation.py [-h] [-d DELIMITER] [-r RANK] [-k FOLDS]
	   [-n FEATURE_SAMPLING_INTERVAL] [-m MAX_FEATURES]
	   [-o OUTPUT]
	   ratings_path

Do k-fold cross-validation by creating k training CSVs from the original.

positional arguments:
ratings_path          a CSV file with no header and three columns: user_id,
	item_id, rating number

optional arguments:
-h, --help            show this help message and exit
-d DELIMITER, --delimiter DELIMITER
	the CSV column delimiter character (DEFAULT=',')
-r RANK, --rank RANK  the number of latent factors (DEFAULT=1000)
-k FOLDS, --folds FOLDS
	the number of folds to use in cross-validation
	(DEFAULT=10)
-n FEATURE_SAMPLING_INTERVAL, --feature-sampling-interval FEATURE_SAMPLING_INTERVAL
	the sampling interval in a log space of the number of
	features to use in cross-validation
-m MAX_FEATURES, --max-features MAX_FEATURES
	the maximum number of features to use in cross-
	validation (ignored if sampling size is not defined)
-o OUTPUT, --output OUTPUT
	output CSV filename, to store validation scores (MAE)

server.py

A minimalistic example of how to provide PyJLD recommendation as a Flask service.

$ python server.py model_path
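A minimal Flask endpoint along these lines might look like the sketch below; this is not the actual server.py, and the route, response shape and the fake recommend function are all illustrative:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In the real tool the model would be loaded from the HDF5 file given on
# the command line; here we fabricate recommendations for illustration.
def recommend(user_id, limit=5):
    return [{"item": "item-%d" % i, "score": 1.0 - 0.1 * i} for i in range(limit)]

@app.route("/recommend/<user_id>")
def recommend_endpoint(user_id):
    # Return the ranked recommendations for the given user as JSON.
    return jsonify({"user": user_id, "recommendations": recommend(user_id)})

if __name__ == "__main__":
    app.run()
```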