Music Recommender System for Multi-lingual Users

Introduction:
Recommender systems solve an interesting problem: learning users’ preferences and behavior from historical user data. Internet companies use them to surface the right content. For example, e-commerce websites use them to suggest products and services, and social media platforms use them to build feeds for their consumers. Specifically, a recommender system is a subclass of information filtering system that seeks to ‘predict the rating a user would give to an item’. It solves the decision-making problem of pairing two types of things (for example, users and items) together.
Data-driven solutions for building recommender systems can be categorized into content-based, collaborative, or hybrid approaches, each with its own advantages and disadvantages. Traditionally, collaborative filtering is the workhorse of internet-based recommender engines for video and music streaming services. It follows the idea that users with a common set of consumed items (songs or movies) probably have similar tastes. However, such approaches are far from perfect. They have a hard time surfacing and recommending content that is new or unpopular, because such items have few interactions to learn from (this is referred to as the cold-start problem).
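
To make the collaborative filtering idea concrete, here is a minimal sketch in Python. It is not the system described in this post: the listening history, the Jaccard overlap measure and the function names are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical listening history: user -> set of song IDs they played.
history = {
    "u1": {"s1", "s2", "s3"},
    "u2": {"s2", "s3", "s4"},
    "u3": {"s9"},
}

def jaccard(a, b):
    """Overlap between two sets of consumed items, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, history, k=2):
    """Score unseen songs by the similarity of the users who played them."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        sim = jaccard(seen, items)
        for song in items - seen:
            scores[song] += sim
    # Songs reachable only through users with zero overlap are never suggested.
    return [song for song, score in scores.most_common(k) if score > 0]

recs = recommend("u1", history)
```

In this toy data, "s9" is only played by a user who shares nothing with "u1", so it is never recommended to "u1". That is a small-scale version of the cold-start problem.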



One of the challenges in building a music streaming website is catering to the needs of multi-lingual users (e.g. users from the sub-continent). Such users like listening to songs of multiple genres in one session. Recommending an effective playlist to such users is a challenging problem due to the sheer variety of styles and genres. Typical approaches (such as collaborative filtering) don’t consider the language, social and geographic factors that influence listener preferences. In such a scenario, a better approach for making personalized recommendations is to model the sequential dependencies in the user-item interactions. Sequential pattern-based approaches try to understand the relations between frequently occurring items and assign a weight to a sequence of items according to its importance among the past sequences of a given user.

User Interaction Dataset:

Our dataset is composed of implicit user listening behavior from one of Pakistan’s popular music streaming websites (Patari.pk). Entries in the dataset represent sequential events in the form of user listening sessions. We hypothesize that this dataset can be used for understanding and modeling users’ preferences, the evolution of listening behavior, and item popularity over time. It is challenging to build personalized recommender systems from such data because users rarely take the time to rate items in the collection, and listening to a song doesn't necessarily mean that the user likes it. Furthermore, the data is sparse: each user interacts with only a subset of the items (the available songs, in our case). Users occasionally rate items, but most of the data represents implicit, high-level user interaction or ‘presence’ data.

In data pre-processing, we used several strategies to filter out noise: we removed short sessions (fewer than 6 songs) and plays shorter than 30 seconds, as they won’t contribute to meaningful sequential learning. We also experimented with filtering out overly popular items. Next, we followed an approach similar to word embeddings and encoded the song information that affects listening preferences. In our case we have the following metadata available:

"UserID", "SongID", "AlbumID", "ArtistID", "TimeStamp", "UNIXTIME", "SessionID"
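
The session and play-length filters described above can be sketched roughly as follows. This is only an illustration: the "play_seconds" field is hypothetical (play length is not among the metadata fields listed), and applying the play-length filter before the session-length filter is an assumption about the order of operations.

```python
MIN_SESSION_LENGTH = 6   # sessions with fewer songs are dropped
MIN_PLAY_SECONDS = 30    # plays shorter than this are treated as noise

def filter_sessions(events):
    """Group play events by session, dropping short plays and short sessions.

    Each event is a dict with "SessionID", "SongID", "UNIXTIME" and a
    hypothetical "play_seconds" field (not part of the listed metadata).
    """
    sessions = {}
    # Sort by time so each session keeps its listening order.
    for event in sorted(events, key=lambda e: e["UNIXTIME"]):
        if event["play_seconds"] < MIN_PLAY_SECONDS:
            continue
        sessions.setdefault(event["SessionID"], []).append(event["SongID"])
    return {sid: songs for sid, songs in sessions.items()
            if len(songs) >= MIN_SESSION_LENGTH}

# Toy data: session "a" has six full plays and one skip, session "b" is too short.
events = [
    {"SessionID": "a", "SongID": f"s{i}", "UNIXTIME": i, "play_seconds": 200}
    for i in range(6)
]
events.append({"SessionID": "a", "SongID": "skip", "UNIXTIME": 6, "play_seconds": 5})
events += [
    {"SessionID": "b", "SongID": f"t{i}", "UNIXTIME": 10 + i, "play_seconds": 200}
    for i in range(3)
]
clean = filter_sessions(events)
```

Only session "a" survives here, with the skipped song removed from its sequence.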


This encoding allows us to project both the listeners and the songs into a shared low-dimensional latent space. The cosine similarity between two song vectors indicates how similar the songs are, and therefore how good a potential recommendation one is for a listener of the other. We can visualize the space using techniques like t-SNE. We also set aside some data for testing by splitting the filtered dataset.
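
As a rough illustration of how similarity in the latent space can drive recommendations, the sketch below ranks songs by cosine similarity to a query song. The two-dimensional vectors and song names are made up; real embeddings would be learned from the listening sessions and have more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_songs(song_id, embeddings, k=2):
    """Return the k songs whose vectors are most similar to song_id's vector."""
    query = embeddings[song_id]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in embeddings.items() if other != song_id]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Hypothetical embeddings, for illustration only.
embeddings = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],
    "song_c": [0.0, 1.0],
}
neighbours = nearest_songs("song_a", embeddings, k=1)
```

Here "song_b" points in nearly the same direction as "song_a", so it is returned as the closest neighbour, while "song_c" is orthogonal and scores zero.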

Modeling with LSTM Network:

Deep learning is used to solve complex tasks in a wide variety of application domains such as speech recognition, computer vision and natural language processing. In theory, a Recurrent Neural Network (RNN) can simulate a universal Turing machine and therefore has the capacity to implement arbitrarily complex algorithms. Long Short-Term Memory (LSTM) networks are an extension of RNNs designed to learn longer-range dependencies. Chris Olah has written a great explanation of how LSTM networks work.

Based on this architecture, we experimented with neural networks for modelling relations between items and users. We followed a language modelling approach, where a model is trained to predict the next word. Such a network takes in a sequence of events (just like words in a document) and can therefore model a song sequence. This approach allows us to learn from our input data how users themselves create multi-lingual or multi-genre playlists. We note that such an approach is content-agnostic, as we only consider consumption patterns. It allows us to extract latent factors from the training set and to consider the dynamic song sequence of each user session. Lastly, we take into account the sequential dependencies that capture the preferences of a user, and ideally extract high-level properties from the available data. This means that in our final learned model, similar songs are placed close together in the latent space. We can then use the model to generate song sequences made up of songs that appeared next to each other several times during training.
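
To show what "treating songs like words" looks like in practice, the sketch below turns a listening session into (context, next song) training pairs, which is the kind of input a next-item sequence model consumes. The vocabulary scheme, the window size of 3 and the left-padding with index 0 are assumptions for illustration, not details of our actual pipeline.

```python
def build_vocab(sessions):
    """Map each song ID to an integer index, reserving 0 for padding."""
    vocab = {}
    for songs in sessions:
        for song in songs:
            vocab.setdefault(song, len(vocab) + 1)
    return vocab

def next_song_pairs(sessions, vocab, window=3):
    """Build (context, target) pairs: the previous `window` songs and the next one."""
    pairs = []
    for songs in sessions:
        ids = [vocab[s] for s in songs]
        for i in range(1, len(ids)):
            context = ids[max(0, i - window):i]
            context = [0] * (window - len(context)) + context  # left-pad with 0
            pairs.append((context, ids[i]))
    return pairs

# Hypothetical mixed-language session.
sessions = [["urdu_pop_1", "punjabi_folk_1", "urdu_pop_2", "english_rock_1"]]
vocab = build_vocab(sessions)
pairs = next_song_pairs(sessions, vocab)
```

Each pair asks the model to predict the song a user actually played next, given only the IDs of the songs that came before it. No audio or language metadata is involved, which is what makes the approach content-agnostic.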


Results and A/B Testing:

We implemented the LSTM model (stacked RNNs) using TensorFlow. Our model consisted of several layers. The network is trained to minimize perplexity, using mini-batch gradient descent on an AWS GPU machine (see details below).
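
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the true next item, so minimizing cross-entropy loss also minimizes perplexity. A small sketch of the computation, with made-up probabilities:

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the true next songs.

    target_probs holds the probability the model assigned to each song
    that was actually played next (all values must be > 0).
    """
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

# A model that spreads its belief evenly over 4 songs at every step.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 songs. A perfect model would reach a perplexity of 1.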

The existing recommender system was implemented using a built-in recommendation algorithm available in the Surprise scikit.
Before full-scale deployment, we have to show that our model can produce sensible recommendations. Our standard training metrics and test setup are designed to evaluate how well the model can predict the sequence and positively affect future user behaviour. But our challenge is to point our users to songs that they might not have heard before and that suit their taste (i.e. we care about the diversity and novelty of the content presented). In this awesome talk [1], Chris Johnson encourages us to define our success metrics, e.g. retention rate. Once we had a working prototype, we A/B tested our method against the existing recommender engine and measured retention. This was done by making it available to 1% of the active users. We then kept track of the weekly performance of our updates on dashboards, using metrics like reach, retention and depth. Following this online A/B testing strategy, we tested our approach on live production traffic by collecting statistical benchmark rankings for various experiments and chose the one with maximum retention.
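
One common way to expose a new model to a stable 1% of users is to hash the user ID into buckets. The sketch below illustrates that idea together with a simple retention calculation. The experiment name, bucket count and retention definition are assumptions, not a description of our production setup.

```python
import hashlib

def in_treatment(user_id, experiment="lstm_recs", fraction=0.01):
    """Deterministically assign a stable share of users to the new model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < fraction * 10_000

def retention_rate(active_before, active_after):
    """Share of users active in one period who were active again in the next."""
    if not active_before:
        return 0.0
    return len(active_before & active_after) / len(active_before)

# Hypothetical weekly activity sets for one experiment arm.
week_1 = {"u1", "u2", "u3", "u4"}
week_2 = {"u1", "u2", "u9"}
weekly_retention = retention_rate(week_1, week_2)
```

Because the assignment depends only on the user ID and the experiment name, a user sees the same variant on every visit, which keeps the retention comparison between the two arms clean.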

Once we deploy a particular recommendation approach, we can continue to improve and retrain our models using the constant feedback loop. Sophisticated recommender systems like Spotify's use multiple signals, including the audio content.

Conclusion and Future Improvements:
Recommender systems are an interesting application of machine learning in which we analyze users’ past behavior in order to predict how they might act in the future. In this work, our goal was to help our users discover new music that matches their tastes using a state-of-the-art deep learning approach. This remains an open problem with multiple promising approaches. In future, we’d like to improve our approach by learning from existing user- and expert-curated playlists. We can also make it context-aware by adding signals such as time of day, location and social relations, which can potentially improve performance and enrich the user experience. We also have the option of incorporating other signals, e.g. user feedback (recorded in the form of thumbs-up and thumbs-down), instead of relying on a single approach to add value. Lastly, adding exploration (e.g. based on an epsilon-greedy approach) to the mix, and training multiple models and combining them, can help resolve some of the problems associated with ML (e.g. bias).
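
The epsilon-greedy idea mentioned above can be sketched in a few lines: most of the time we serve the model's top-ranked song, but with a small probability epsilon we serve a random song from the catalogue so that new or unpopular items get a chance to collect feedback. The function name, the default epsilon and the toy catalogue are assumptions for illustration.

```python
import random

def epsilon_greedy(ranked_songs, catalogue, epsilon=0.1, rng=random):
    """Return the model's top song, or a random catalogue song with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(catalogue)   # explore
    return ranked_songs[0]             # exploit

# Hypothetical model ranking and catalogue.
ranked = ["hit_song", "second_best"]
catalogue = ["hit_song", "second_best", "new_release", "rare_folk"]
rng = random.Random(0)
picks = [epsilon_greedy(ranked, catalogue, epsilon=0.1, rng=rng) for _ in range(20)]
```

Setting epsilon to 0 recovers the purely greedy recommender, while larger values trade short-term accuracy for more exposure of less popular songs.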


References:


[1] From Idea to Execution: Spotify's Discover Weekly: https://www.youtube.com/watch?v=A259Yo8hBRs
[2] Towards Cognitive Recommender Systems
[3] Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions: http://pages.stern.nyu.edu/~atuzhili/pdf/TKDE-Paper-as-Printed.pdf
[4] Logistic Matrix Factorization for Implicit Feedback Data: https://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
[5] Deep content-based music recommendation: https://papers.nips.cc/paper/2013/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf
[6] https://benanne.github.io/2014/08/05/spotify-cnns.html
[7] Collaborative Filtering for Implicit Feedback Datasets: http://yifanhu.net/PUB/cf.pdf
[8] Sequential Recommender Systems: Challenges, Progress and Prospects: https://www.ijcai.org/Proceedings/2019/0883.pdf
[9] https://www.nature.com/articles/nature14539#auth-3
[10] https://blog.fastforwardlabs.com/2018/01/22/exploring-recommendation-systems.html