KDnuggets
2025-06-11 13:00:00
www.kdnuggets.com
Sponsored Content
Recommender systems rely on data, but access to truly representative data has long been a challenge for researchers. Most academic datasets pale in comparison to the complexity and volume of user interactions in real-world environments, where data is typically locked away inside companies due to privacy concerns and commercial value.
That’s beginning to change.
In recent years, several new datasets have been made public that aim to better reflect real-world usage patterns, spanning music, e-commerce, advertising, and beyond. One notable recent release is Yambda-5B, a 5-billion-event dataset contributed by Yandex, based on data from its music streaming service, now available via Hugging Face. Yambda comes in three sizes (50M, 500M, and 5B events) and ships with baselines that make it easier to pick up and use. It joins a growing list of resources helping to close the research-to-production gap in recommender systems.
Below is a brief survey of key datasets currently shaping the field.
A Look at Publicly Available Datasets in Recommender Research
MovieLens
One of the earliest and most widely used datasets. It includes user-provided movie ratings (1–5 stars) but is limited in scale and diversity—ideal for initial prototyping but not representative of today’s dynamic content platforms.
Netflix Prize
A landmark dataset in recommender history (~100M ratings), though now dated. Its static snapshot and lack of detailed metadata limit modern applicability.
Yelp Open Dataset
Contains 8.6M reviews, but coverage is sparse and city-specific. Valuable for local business research, yet not optimal for large-scale generalizable models.
Spotify Million Playlist
Released for RecSys 2018, this dataset helps analyze short-term and sequential listening behavior. However, it lacks long-term history and explicit feedback.
Criteo 1TB
A massive ad click dataset that showcases industrial-scale interactions. While impressive in volume, it offers minimal metadata and prioritizes click-through rate (CTR) over recommendation logic.
Amazon Reviews
Rich in content and widely used for sentiment analysis and long-tail recommendation. However, the data is notoriously sparse, with a steep drop-off in interaction for most users and products.
Last.fm (LFM-1B)
Previously a go-to for music recommendations. Licensing limitations have since restricted access to newer versions of the dataset.
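The sparsity noted for several of these datasets (Amazon Reviews in particular) can be quantified before any modeling starts. The snippet below is a minimal sketch, assuming interactions arrive as plain (user, item) pairs; the function name `interaction_stats` and the toy data are hypothetical and are not taken from any of the datasets above.

```python
from collections import Counter


def interaction_stats(interactions):
    """Summarize how sparse a set of (user_id, item_id) pairs is.

    Assumes a non-empty iterable; duplicate pairs are counted once.
    """
    pairs = set(interactions)
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}
    # Density = observed pairs / all possible user-item pairs.
    density = len(pairs) / (len(users) * len(items))
    per_user = Counter(user for user, _ in pairs)
    return {
        "users": len(users),
        "items": len(items),
        "density": density,
        "sparsity": 1.0 - density,
        "max_per_user": max(per_user.values()),
        "min_per_user": min(per_user.values()),
    }


# Toy data (hypothetical): one active user and two users with a single interaction.
toy = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u3", "d")]
stats = interaction_stats(toy)
print(stats)
```

On real review data the density is typically a tiny fraction of one percent, which is why the gap between the most and least active users is worth checking alongside the overall figure.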
Moving Toward Industrial-Scale Research
While each of these datasets has helped shape the field, they all present limitations—either in scale, data freshness, user diversity, or metadata completeness. That’s where new entries, such as Yambda-5B, are particularly promising.
This dataset offers anonymized, large-scale user-item interaction data across music streaming sessions, including metadata such as timestamps, feedback type (explicit vs. implicit), and recommendation context (organic vs. suggested). Importantly, it includes a global temporal split, enabling more realistic model evaluation that mirrors online system deployment. Researchers will also find value in the multimodal nature of the dataset, which includes precomputed audio embeddings for over 7.7 million tracks, enabling content-aware recommendation strategies out of the box.
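To make the idea of a global temporal split concrete, here is a minimal sketch in plain Python. It assumes each event is a dict with a numeric `timestamp`; the field names (`user_id`, `item_id`, `event_type`, `is_organic`) and the cutoff value are illustrative assumptions, not the dataset's actual schema or its official split.

```python
def global_temporal_split(events, cutoff):
    """Split events at a single timestamp shared by all users.

    Everything strictly before `cutoff` goes to train, everything at or
    after it goes to test, so no training event lies in the test period's future.
    """
    train = [e for e in events if e["timestamp"] < cutoff]
    test = [e for e in events if e["timestamp"] >= cutoff]
    return train, test


# Hypothetical toy log; field names are assumptions for illustration only.
events = [
    {"user_id": 1, "item_id": 10, "timestamp": 100, "event_type": "listen", "is_organic": True},
    {"user_id": 2, "item_id": 11, "timestamp": 150, "event_type": "like", "is_organic": False},
    {"user_id": 1, "item_id": 12, "timestamp": 200, "event_type": "listen", "is_organic": False},
    {"user_id": 3, "item_id": 10, "timestamp": 250, "event_type": "like", "is_organic": True},
    {"user_id": 2, "item_id": 13, "timestamp": 300, "event_type": "listen", "is_organic": True},
]

train, test = global_temporal_split(events, cutoff=200)

# Users that appear only after the cutoff are cold-start cases at evaluation time.
train_users = {e["user_id"] for e in train}
cold_start_users = {e["user_id"] for e in test} - train_users
print(len(train), len(test), cold_start_users)
```

Compared with a per-user "leave the last item out" split, a single cutoff keeps every test event later than every training event, which is closer to how a deployed system only ever sees the past.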
Privacy has been carefully considered in the design of the dataset. Unlike earlier examples such as the Netflix Prize dataset, which was eventually withdrawn due to re-identification risks, all user and track data in the Yambda dataset is anonymized, using numeric identifiers to meet privacy standards.
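As a rough illustration of what replacing raw identifiers with numeric ones can look like, the sketch below re-indexes users and tracks to integers. This is an assumption-laden toy, not the procedure used for Yambda: the helper `build_id_map`, the field names, and the sample records are all hypothetical, and re-indexing on its own does not rule out re-identification from behavioral patterns.

```python
import random


def build_id_map(raw_ids, seed=0):
    """Map each distinct raw ID to an integer in shuffled order.

    Shuffling avoids leaking the original ordering (for example sign-up
    order) through the assigned numbers. The seed is only for repeatability
    in this toy example.
    """
    unique_ids = sorted(set(raw_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    return {raw: index for index, raw in enumerate(unique_ids)}


# Hypothetical raw log with identifying strings.
raw_events = [
    {"user": "alice@example.com", "track": "artist-x/song-1"},
    {"user": "bob@example.com", "track": "artist-x/song-1"},
    {"user": "alice@example.com", "track": "artist-y/song-2"},
]

user_map = build_id_map(e["user"] for e in raw_events)
track_map = build_id_map(e["track"] for e in raw_events)

# The released log carries only the integer IDs; the maps stay private.
anonymized = [
    {"user_id": user_map[e["user"]], "track_id": track_map[e["track"]]}
    for e in raw_events
]
print(anonymized)
```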
Closing the Loop: From Theory to Production
As recommender research moves toward practical application at scale, access to robust, varied, and ethically sourced datasets is essential. Resources like MovieLens and Netflix Prize remain foundational for benchmarking and testing ideas. But newer datasets—such as Amazon’s, Criteo’s, and now Yambda—offer the kind of scale and nuance needed to push models from academic novelty to real-world utility.
Read the original article at Turing Post, the newsletter for over 90,000 professionals who are serious about AI and ML.
By Avi Chawla, who is highly passionate about approaching and explaining data science problems with intuition. Avi has been working in data science and machine learning for over 6 years, across both academia and industry.