As part of the PhD work, we've developed an open source Python library for experimenting with latent factor models. It implements SVD-based rating prediction and recommendation, along with a k-fold cross-validation method to evaluate the model and to compare the SVD baseline against any future improvements based on the social dimension. We've also extended the cross-validation method to identify the number of latent factors that minimizes the mean absolute error.
Download PyJLD from GitHub
by cloning my PhD repository. You can find the library and the CLI tools under
recommender-systems/matrix-factorization.
git clone git@github.com:jldevezas/phd.git
You can also find an R ggplot2 script named evaluation-charts.R under
recommender-systems/r-workspace, which produces a cool looking chart like the one on the right.
That script can be used directly with the CSV generated by cross_validation.py
through the --output option, when the
--feature-sampling-interval argument is also used.
Figure 1: Mean absolute error, averaged over 10 folds, for a variable number
of latent factors sampled logarithmically. The sampling frequency was therefore higher
for lower numbers of latent factors, since the error tends to stabilize for larger numbers
of features. The ideal number of features, corresponding to the minimum error
value, is highlighted.
PyJLD consists of the library jld.py and the following command line tools, which
exercise and serve as examples of the methods in this library: train.py,
predict.py, recommend.py, recommend_by_query.py,
nn.py, cross_validation.py and server.py.
jld.py
This library will eventually comprise all the Python classes that I develop during my doctoral project.
For now, however, it only contains a single class, named LatentFactorsModel,
that implements rating prediction and recommendation using Singular Value Decomposition
(SVD), as well as k-fold cross-validation to evaluate the model, compare it to any
future implementations, and find the number of latent factors that minimizes the
error.
Available methods for LatentFactorsModel are listed below, along with an
explanation of how to use them:
LatentFactorsModel(h5filename): Constructor; receives the h5filename file path
of the HDF5 file that backs the model.
enable_normalization() / disable_normalization(): Enable or disable rating normalization.
set_training_rank(rank) / get_training_rank(): Set or get the number of latent
factors used for training.
set_training_sample_size(sample_size) / get_training_sample_size(): Set or get the
size of the sample to take from the ratings CSV.
set_training_csv_delimiter(delimiter) / get_training_csv_delimiter(): Set or get the
CSV column delimiter character.
train(csv_path): Load the training CSV, with rows of the form (user, item, rating), into a bidimensional
HDF5 dataset, as a user-item matrix; factorize this matrix, supported on disk
storage, using singular value decomposition; and store the resulting matrices.
For more details on how SVD works, please read the hopefully quite clear SVD for dummies in the context of recommender
systems.
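To illustrate the idea behind SVD-based rating prediction, here is a minimal NumPy sketch over a toy in-memory user-item matrix (an illustration of the technique, not PyJLD's disk-backed implementation):

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items; 0 = unrated).
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 4.0, 4.0]])

# Factorize with SVD and keep only the top-k latent factors.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# The rank-k reconstruction fills in the zero entries with predictions.
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(R_hat.round(2))
```

The predicted rating for a (user, item) pair is simply the corresponding entry of the reconstructed matrix.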
predict(user_id, item_id): Given a user_id and an item_id in the same format as the
original training data CSV, return the (possibly precomputed) predicted rating.
Note that this method always returns the predicted rating, even if the user has
explicitly rated the item in the training set.
get_rating(user_id, item_id): Given a user_id and an item_id in the same format as the
original training data CSV, return the original rating, or None if the
user didn't rate the specified item or if another error occurred, in which case a
message with the error description will be logged to the command line.
precompute_predictions(): Precompute and store the predicted ratings in the HDF5 file.
recommend(user_id, limit=None, all=False): Given a user_id in the same format as the original training data CSV,
return a ranked list of items according to their predicted ratings for the user. If
limit is set, only the top limit
recommendations are returned. If all is set, then all items, including those
explicitly rated by the user, are included in the results, ranked
according to the predicted rating.
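The ranking behavior described above can be sketched as follows (toy predicted ratings and item IDs, not PyJLD's internal data):

```python
import numpy as np

# Hypothetical predicted ratings for one user over five items.
item_ids = ["a", "b", "c", "d", "e"]
predicted = np.array([2.1, 4.7, 3.3, 4.9, 1.0])

# Items the user already rated in the training set.
rated = {"a", "d"}

# Rank unrated items by predicted rating, descending: the behavior
# of recommend() when all is not set.
ranking = sorted(
    (i for i in item_ids if i not in rated),
    key=lambda i: predicted[item_ids.index(i)],
    reverse=True,
)
print(ranking[:2])  # top-2 recommendations: ['b', 'c']
```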
recommend_by_query(item_ratings, all=False): Given a vector item_ratings, which must have the same dimension
as the number of items available in the model, return an iterator of items ranked
according to the predicted rating. This is useful if you want to train a model and
then query it later for new users of a system.
nearest_neighbors(item_ratings, distance=scipy.spatial.distance.euclidean,
limit=5): Given a vector item_ratings, which must have the same dimension
as the number of items available in the model, return the top limit
nearest neighbors according to the distance metric provided, which defaults to the
euclidean distance.
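The underlying search can be sketched with SciPy's distance functions over a toy ratings matrix (the matrix and query values are made up for the example):

```python
import numpy as np
from scipy.spatial import distance

# Toy item-ratings matrix: one ratings vector per known user.
known = np.array([[5.0, 1.0, 0.0],
                  [4.0, 2.0, 1.0],
                  [1.0, 5.0, 4.0]])
query = np.array([5.0, 2.0, 0.0])

# Compute the distance from the query to every known vector
# (euclidean by default, as in nearest_neighbors()) and sort.
dists = [distance.euclidean(query, row) for row in known]
order = np.argsort(dists)
print(order[:2])  # indices of the two nearest neighbors
```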
nearest_neighbor(item_ratings,
distance=scipy.spatial.distance.euclidean): Given a vector item_ratings, which must have the same dimension
as the number of items available in the model, return the nearest neighbor,
according to the distance metric provided, which defaults to the euclidean
distance. This is equivalent to nearest_neighbors(item_ratings, distance,
limit=1).
mean_absolute_error(original_prediction_indices_tuples): Given a list original_prediction_indices_tuples of tuples containing the
original ratings vector, the predicted ratings vector and the indices of the held-out
ratings, return the mean absolute error. You can read more about evaluation in the
k-fold cross-validation for dummies
section, where you can also find the formula for this metric.
root_mean_squared_error(original_prediction_indices_tuples): Given a list original_prediction_indices_tuples of tuples containing the
original ratings vector, the predicted ratings vector and the indices of the held-out
ratings, return the root mean squared error. You can read more about evaluation in
the k-fold cross-validation for dummies
section, where you can also find the formula for this metric.
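Both metrics follow the standard formulas, sketched here with NumPy over toy held-out ratings:

```python
import numpy as np

original = np.array([4.0, 3.0, 5.0, 2.0])   # held-out ratings
predicted = np.array([3.5, 3.0, 4.0, 3.0])  # predictions for them

# MAE: mean of absolute errors; RMSE: square root of the mean squared error.
mae = np.mean(np.abs(original - predicted))
rmse = np.sqrt(np.mean((original - predicted) ** 2))
print(mae, rmse)  # prints 0.625 0.75
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two are usually reported together.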
k_fold_cross_validation(original_csv_path, k=10, given_fraction=0.8,
feature_sampling=None, max_features=None, output_filename=None): Given the original_csv_path of the training set CSV, the number
k of folds and the given_fraction of ratings to use in
the prediction, calculate and print the mean absolute error and the root mean
squared error of the SVD recommendation algorithm. If feature_sampling
is set (e.g. n=50), then the system is validated for an interval of
features varying from zero to the maximum possible value, sampled logarithmically
for n different numbers of latent factors. Since the error tends to
converge for large numbers of features, you can also use max_features
to discard latent factors above that value. Validation results can be saved to a
CSV given by output_filename.
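The logarithmic sampling of latent factor counts can be sketched with np.logspace; the exact spacing PyJLD uses may differ:

```python
import numpy as np

# Sample n candidate numbers of latent factors on a log scale, so that
# small ranks are probed more densely than large ones, where the error
# tends to have stabilized already.
n, max_features = 10, 1000
candidates = np.unique(
    np.rint(np.logspace(0, np.log10(max_features), num=n)).astype(int)
)
print(candidates)
```

Cross-validating the model once per candidate and keeping the rank with the lowest MAE yields the "ideal number of features" highlighted in Figure 1.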
train.py

usage: train.py [-h] [-d DELIMITER] [-r RANK] [-s SIZE] [--precompute]
                ratings_path model_path

Train model based on SVD matrix factorization for user-item rating prediction.

positional arguments:
  ratings_path          a CSV file with no header and three columns: user_id,
                        item_id, rating number
  model_path            HDF5 file to store the trained model containing the
                        factorized matrices

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        the CSV column delimiter character (DEFAULT=',')
  -r RANK, --rank RANK  the number of latent factors (DEFAULT=1000)
  -s SIZE, --size SIZE  the size of the sample to take from the ratings CSV
                        (DEFAULT=None)
  --precompute          precompute and store prediction values in HDF5
                        (DEFAULT=False)
predict.py

usage: predict.py [-h] [-f] user item model_path

Predict a normalized rating for any unrated item of an existing user.

positional arguments:
  user                  user identification string, as defined in the
                        training set
  item                  item identification string, as defined in the
                        training set
  model_path            HDF5 file with the trained model containing the
                        factorized matrices

optional arguments:
  -h, --help            show this help message and exit
  -f, --force-predict   returns the predicted value even when the original
                        rating is available
recommend.py

usage: recommend.py [-h] user model_path

Recommend new items to an existing user ordered by predicted rating.

positional arguments:
  user                  user identification string, as defined in the
                        training set
  model_path            HDF5 file with the trained model containing the
                        factorized matrices

optional arguments:
  -h, --help            show this help message and exit
recommend_by_query.py

usage: recommend_by_query.py [-h] [--love LOVE] [--like LIKE]
                             [--neutral NEUTRAL] [--dislike DISLIKE]
                             [--hate HATE]
                             model_path

Use qualitative user preferences to recommend new items.

positional arguments:
  model_path            HDF5 file with the trained model containing the
                        factorized matrices

optional arguments:
  -h, --help            show this help message and exit
  --love LOVE           comma-separated item IDs for loved items
  --like LIKE           comma-separated item IDs for liked items
  --neutral NEUTRAL     comma-separated item IDs for neutral items
  --dislike DISLIKE     comma-separated item IDs for disliked items
  --hate HATE           comma-separated item IDs for hated items
nn.py

usage: nn.py [-h] [--love LOVE] [--like LIKE] [--neutral NEUTRAL]
             [--dislike DISLIKE] [--hate HATE]
             model_path

Discover nearest neighbor in trained model based on qualitative user input.

positional arguments:
  model_path            HDF5 file with the trained model containing the
                        factorized matrices

optional arguments:
  -h, --help            show this help message and exit
  --love LOVE           comma-separated item IDs for loved items
  --like LIKE           comma-separated item IDs for liked items
  --neutral NEUTRAL     comma-separated item IDs for neutral items
  --dislike DISLIKE     comma-separated item IDs for disliked items
  --hate HATE           comma-separated item IDs for hated items
cross_validation.py

usage: cross_validation.py [-h] [-d DELIMITER] [-r RANK] [-k FOLDS]
                           [-n FEATURE_SAMPLING_INTERVAL] [-m MAX_FEATURES]
                           [-o OUTPUT]
                           ratings_path

Do k-fold cross-validation by creating k training CSVs from the original.

positional arguments:
  ratings_path          a CSV file with no header and three columns: user_id,
                        item_id, rating number

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        the CSV column delimiter character (DEFAULT=',')
  -r RANK, --rank RANK  the number of latent factors (DEFAULT=1000)
  -k FOLDS, --folds FOLDS
                        the number of folds to use in cross-validation
                        (DEFAULT=10)
  -n FEATURE_SAMPLING_INTERVAL, --feature-sampling-interval FEATURE_SAMPLING_INTERVAL
                        the sampling interval in a log space of the number of
                        features to use in cross-validation
  -m MAX_FEATURES, --max-features MAX_FEATURES
                        the maximum number of features to use in
                        cross-validation (ignored if sampling size is not
                        defined)
  -o OUTPUT, --output OUTPUT
                        output CSV filename, to store validation scores (MAE)
server.py

A minimalistic example of how to provide PyJLD recommendation as a Flask service.
$ python server.py model_path
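A hypothetical sketch of such a Flask endpoint follows; the toy lookup table stands in for a trained LatentFactorsModel, and the route and names are assumptions for the example, not PyJLD's actual server.py:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Toy precomputed recommendations keyed by user ID (assumption for the demo;
# a real service would call model.recommend(user_id) instead).
RECOMMENDATIONS = {"alice": ["item42", "item7"]}

@app.route("/recommend/<user_id>")
def recommend(user_id):
    # Return the user's ranked recommendations as JSON.
    return jsonify(items=RECOMMENDATIONS.get(user_id, []))

# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    print(client.get("/recommend/alice").get_json())
```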