About Metaphor: Intelligent Podcast Discovery
In April 2020, there were 1,000,000 podcasts. Just one year later, by the spring of 2021, that number had doubled to 2,000,000. This rapid growth reflects the tremendous vitality of the podcast ecosystem. But the enormous number of podcasts creates a problem: with so much out there, how do you decide what to listen to?
Metaphor is an intelligent podcast search engine, cataloging over 2.5 million podcasts and 80 million episodes. It is driven by natural-language processing, so the search index and recommendations are based on what the podcasts are actually about, rather than their pre-defined categories or popularity.
- Podcast Search: Metaphor indexes over 2.5 million podcasts, and you can search for podcasts based on topic, episode, etc. Note: The search engine is currently restricted to English-language podcasts, so not all indexed podcasts are available in search.
- Browse Podcasts By Topic: Metaphor analyzes the contents of each podcast and categorizes it with an AI-driven topic model. This supplements the traditional categories provided by podcasters (e.g. Sports, or Society and Culture), and allows you to drill down to much more specific topics and themes. We have extracted a hierarchical topic model with broad topics at the top, and more specific topics as you drill down. This is a great way to browse podcasts, and it also offers deep insight into the broader podcasting ecosystem.
- Search Podcasts By Books: Metaphor indexes podcasts that discuss books and other products. You can search for a specific book, or browse books based on Amazon categories.
- Podcast Recommendations: Metaphor uses its analysis of each podcast's contents and themes to identify other related podcasts.
For instance, if you listen to PLANET MONEY, you will likely also want to listen to podcasts including:
- The Indicator from Planet Money (94.57%), by NPR
- The Supply Side (94.26%), by Jonathan Doyle
- Marketplace All-in-One (92.91%), by Marketplace
- Debunking Economics - the podcast (91.93%), by Steve Keen & Phil Dobbie
- Rethinking the Dollar (91.88%), by Mike
- The Kākā by Bernard Hickey (91.87%), by Bernard Hickey
- Global Economy (91.71%), author not listed
- Listen Headlines (91.48%), by Listen Headlines
- U.S. Economy (91.47%), author not listed
- American Institute for Economic Research (91.35%), by American Institute for Economic Research
- Mises Wire (91.33%), author not listed
- Mises Wire (91.19%), author not listed
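The percentages above read like similarity scores between podcasts. This page does not say how they are computed. As a minimal sketch, assuming each podcast is represented by a numeric vector (for example a document embedding) and that the score is cosine similarity, a ranking could look like the following. The podcast names, the toy vectors, and the `recommend` helper are invented for illustration and are not Metaphor's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_title, vectors, top_n=3):
    """Rank every other podcast by similarity to the query podcast, best first."""
    query = vectors[query_title]
    scored = [
        (title, cosine_similarity(query, vector))
        for title, vector in vectors.items()
        if title != query_title
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "Economics Show": [0.9, 0.1, 0.0],
    "Markets Daily": [0.8, 0.2, 0.1],
    "Cooking Hour": [0.0, 0.1, 0.9],
}

for title, score in recommend("Economics Show", toy_vectors):
    print(f"{title} ({score:.2%})")
```

With these toy vectors, the economics-flavored podcast ranks above the cooking one because its vector points in nearly the same direction as the query's.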
How It Works
Metaphor is built with natural language and AI toolkits to analyze a podcast’s content and provide intelligent recommendations and other smart data on podcasts. It has four main components:
- Podcast indexing: Metaphor looks for podcasts to index by crawling the internet for podcast feeds and also drawing from major directories, and it uses the RSS feeds to keep the search index up to date. On the back-end, Metaphor holds an index of all podcasts (about 2.5 million), but the search engine only makes available those which are in English (over 1.3 million).
- Sophisticated de-duplication and analysis: There are many, many duplicate podcasts. Without de-duplication, the top suggestions from the similarity algorithm would simply be other copies of the same podcast. In addition, RSS feeds are often incomplete. By aggregating podcast data from multiple RSS feeds, we get a more complete picture of each podcast.
- Natural language preprocessing and analysis: Metaphor analyzes both podcasts and individual episodes using bag-of-words and document-embedding techniques. This includes proof-of-concept transcription, which is not fully implemented, primarily because it is cost-prohibitive to index tens of millions of podcast episodes by transcript. We made custom modifications to Gensim and other natural language toolkits to improve natural language processing for the podcasting space, and also developed a massively multiprocessed system to scale up our ability to process hundreds of thousands of podcasts whose metadata are constantly changing as new episodes are published.
- Topic modelling: Building on Metaphor's natural language capabilities, we create a "topic model" that extracts key themes and issues discussed across a corpus of about 750,000 podcasts. (While the index as a whole contains over 2 million podcasts, we restrict this sophisticated algorithm to podcasts with more than 3 episodes; podcasts with fewer episodes do not have enough data associated with them for this to work effectively.) Using this corpus, we extract a hierarchical topic model with broad topics at the top, and more specific topics as you drill down.
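To make the de-duplication and corpus-filtering steps above more concrete, here is a minimal sketch. It assumes duplicates can be detected by a normalized title and author, and that episodes carry a GUID so copies from several feeds can be merged. The matching rule, the dictionary layout, and the function names are all hypothetical. The page describes the real de-duplication as "sophisticated", so it is likely more involved than this.

```python
import re
from collections import defaultdict

MIN_EPISODES = 4  # "more than 3 episodes", as stated for the topic-model corpus


def normalize(text):
    """Lowercase and collapse punctuation and whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def merge_duplicate_feeds(feeds):
    """Group feeds by normalized (title, author) and union their episodes by GUID."""
    merged = defaultdict(dict)
    for feed in feeds:
        key = (normalize(feed["title"]), normalize(feed.get("author")))
        for episode in feed["episodes"]:
            # Keep the first copy of each episode; later feeds only fill gaps.
            merged[key].setdefault(episode["guid"], episode)
    return {key: list(episodes.values()) for key, episodes in merged.items()}


def topic_model_corpus(merged):
    """Keep only podcasts with enough episodes to model reliably."""
    return {key: eps for key, eps in merged.items() if len(eps) >= MIN_EPISODES}


# Invented example: two incomplete copies of one podcast, plus a one-episode podcast.
feeds = [
    {"title": "Example Show", "author": "Jane Doe",
     "episodes": [{"guid": "ep1"}, {"guid": "ep2"}]},
    {"title": "EXAMPLE SHOW!", "author": "jane doe",
     "episodes": [{"guid": "ep2"}, {"guid": "ep3"}, {"guid": "ep4"}]},
    {"title": "Another Show", "author": None,
     "episodes": [{"guid": "a1"}]},
]

merged = merge_duplicate_feeds(feeds)
corpus = topic_model_corpus(merged)
print(len(merged), "podcasts after merging;", len(corpus), "eligible for topic modelling")
```

Neither copy of "Example Show" has all four episodes on its own. Only the merged record clears the episode threshold, which illustrates why aggregating feeds gives a more complete picture.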
The prototype took about 2 weeks to build, and another 6 weeks to crunch the numbers as the data scaled up. We probably could have done it faster with a better laptop. However, the lack of resources forced us to write the entire system in a massively multiprocessed manner that could scale up to a cluster, given the resources.
Under the hood, Metaphor is written in Python with Django, along with a massive PostgreSQL database and an Elasticsearch document store that have catalogued all the world’s podcasts (this also includes non-English podcasts, but for the time being we have only built NLP tools for English). Other key technologies include Ray, Gensim, SentenceTransformers, BERT, BigARTM, HuggingFace, spaCy, NLTK, ZFS, and more.
What's Next
We have published an initial minimum viable product, the search engine, and are actively working on additional components for Metaphor that will leverage growing AI and natural-language processing technologies to process and present podcast data in novel ways. This includes:
- Transcription: We have built a prototype transcription engine using HuggingFace and a custom-trained language model, which allows us to use unedited transcripts of episodes to compare podcasts. We can also use the transcripts to offer in-episode search, finding specific clips that discuss a given topic.
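As a rough illustration of in-episode search, the sketch below assumes a transcript is available as a list of timestamped segments and uses plain keyword matching to find clips. The segment format, the sample text, and the `find_clips` helper are hypothetical. The actual prototype may work quite differently, for example by matching on embeddings instead of keywords.

```python
import re


def find_clips(segments, query):
    """Return (start_seconds, text) for each segment that contains every query word."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = []
    for start_seconds, text in segments:
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        if all(term in words for term in terms):
            matches.append((start_seconds, text))
    return matches


# Invented transcript segments: (start time in seconds, spoken text).
segments = [
    (0.0, "Welcome back to the show."),
    (12.5, "Today we look at inflation and interest rates."),
    (47.0, "Interest in podcasts keeps growing."),
]

for start_seconds, text in find_clips(segments, "interest rates"):
    print(f"{start_seconds:>6.1f}s  {text}")
```

Matching on whole words, not substrings, keeps a query such as "rates" from matching unrelated words that merely contain it.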
Contact
Questions? Reach out to the developer, Jason Lustig.