About Metaphor: Intelligent Podcast Discovery

In April 2020, there were 1,000,000 podcasts. One year later, by the spring of 2021, that number had doubled to 2,000,000. This rapid growth reflects the tremendous vitality of the podcast ecosystem. But the enormous number of podcasts creates a problem: with so much out there, how do you decide what to listen to?

Metaphor is an intelligent podcast search engine, cataloging over 2.5 million podcasts and 80 million episodes. It is driven by natural-language processing, so the search index and recommendations are based on what the podcasts are actually about, rather than on their pre-defined categories or popularity.

For instance, if you listen to PLANET MONEY, you will likely also want to listen to podcasts including:

The Supply Side (94.26%), by Jonathan Doyle

Marketplace All-in-One (92.91%), by Marketplace

Debunking Economics - the podcast (91.93%), by Steve Keen & Phil Dobbie

Rethinking the Dollar (91.88%), by Mike

The Kākā by Bernard Hickey (91.87%), by Bernard Hickey

Global Economy (91.71%)

Listen Headlines (91.48%), by Listen Headlines

U.S. Economy (91.47%)

American Institute for Economic Research (91.35%), by American Institute for Economic Research

Mises Wire (91.33%)

Mises Wire (91.19%)
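The percentages in the list above are presumably content-similarity scores. As a rough illustration of how such a score can be computed from text alone, here is a minimal sketch using cosine similarity over bag-of-words counts. It is an assumption that Metaphor's scores work this way; the function names and the sample descriptions are hypothetical, and a production system would more likely use weighted terms or document embeddings.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query: str, catalog: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every catalog description against the query, best match first."""
    query_bag = bag_of_words(query)
    scored = [
        (title, cosine_similarity(query_bag, bag_of_words(description)))
        for title, description in catalog.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, not taken from the Metaphor index.
catalog = {
    "Show A": "economics markets money and the economy",
    "Show B": "true crime mysteries and cold cases",
}
ranked = rank_similar("money markets and the global economy", catalog)
for title, score in ranked:
    print(f"{title} ({score:.2%})")
```

Because the score depends only on the words in each description, two shows in different directory categories can still rank as close matches if they talk about the same things.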

How It Works

Metaphor is built with natural-language and AI toolkits that analyze a podcast’s content and provide intelligent recommendations and other smart data on podcasts. It has four main components:

  1. Podcast indexing: Metaphor looks for podcasts to index by crawling the internet for podcast feeds and also drawing from major directories, and it uses the RSS feeds to keep the search index up to date. On the back-end, Metaphor holds an index of all podcasts (about 2.5 million), but the search engine only makes available those which are in English (over 1.3 million).
  2. Sophisticated de-duplication and analysis: There are many, many duplicate podcasts. Without de-duplication, the similarity algorithm's top suggestions would simply be other copies of the same podcast. In addition, RSS feeds are often incomplete. By aggregating podcast data from multiple RSS feeds, we get a more complete picture of each podcast.
  3. Natural language preprocessing and analysis, of both podcasts and individual episodes, using both bag-of-words and document-embedding techniques. This includes proof-of-concept transcription, which is not fully implemented, primarily because it is cost-prohibitive to index tens of millions of podcast episodes by transcript. We made custom modifications to Gensim and other natural language toolkits to improve natural language processing specific to the podcasting space, and also developed a massively multiprocessed system to scale up our ability to process hundreds of thousands of podcasts whose metadata are constantly changing as new episodes are published.
  4. Topic modelling: Building on Metaphor's natural language capabilities, we can create a "topic model" that extracts key themes and issues discussed across a corpus of about 750,000 podcasts. (While the index as a whole contains over 2 million podcasts, we restrict this sophisticated algorithm to podcasts with more than 3 episodes; podcasts with fewer episodes do not have enough data associated with them for this to work effectively.) Using this corpus, we can extract a hierarchical topic model with broad topics at the top, and more specific topics as you drill down.
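
Steps 2 and 4 above can be sketched in a few lines: merge feeds that describe the same podcast, then keep only podcasts with more than 3 episodes for the topic-model corpus. This is a minimal sketch under stated assumptions, not Metaphor's actual method. The `Podcast` class, the normalized title-and-author key, and the sample feeds are all hypothetical; real de-duplication would likely need fuzzier matching.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class Podcast:
    # Hypothetical record type; field names are illustrative only.
    title: str
    author: str
    episode_guids: Set[str] = field(default_factory=set)


def dedup_key(title: str, author: str) -> Tuple[str, str]:
    """Normalize case, punctuation, and spacing so trivial variants collide."""
    def normalize(value: str) -> str:
        return re.sub(r"[\W_]+", " ", value.casefold()).strip()
    return normalize(title), normalize(author)


def merge_feeds(feeds: Iterable[Podcast]) -> List[Podcast]:
    """Collapse duplicate feeds, pooling their episodes into one record."""
    merged: Dict[Tuple[str, str], Podcast] = {}
    for feed in feeds:
        key = dedup_key(feed.title, feed.author)
        if key in merged:
            # An incomplete feed still contributes the episodes it does have.
            merged[key].episode_guids |= feed.episode_guids
        else:
            merged[key] = Podcast(feed.title, feed.author, set(feed.episode_guids))
    return list(merged.values())


def topic_model_corpus(podcasts: Iterable[Podcast], min_episodes: int = 3) -> List[Podcast]:
    """Keep only podcasts with MORE than `min_episodes` episodes."""
    return [p for p in podcasts if len(p.episode_guids) > min_episodes]


# Hypothetical sample feeds: two copies of one show plus a one-episode show.
feeds = [
    Podcast("Example Economics Show", "Example Host", {"e1", "e2", "e3"}),
    Podcast("EXAMPLE ECONOMICS SHOW ", "example host", {"e3", "e4"}),
    Podcast("Tiny Show", "Someone", {"x1"}),
]
merged = merge_feeds(feeds)
corpus = topic_model_corpus(merged)
```

Neither copy of the first show has more than 3 episodes on its own, so it only qualifies for the corpus after the two feeds are merged.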

The prototype took about 2 weeks to build, and another 6 weeks to crunch the numbers as the data scaled up. We probably could have done it faster with a better laptop. However, the lack of resources forced us to write the entire system in a massively multiprocessed manner that could scale up to a cluster, given the resources.
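
The multiprocessed design described above can be illustrated with Python's standard library. This is only a sketch: Ray is listed among Metaphor's technologies, and `concurrent.futures` is swapped in here to keep the example dependency-free. The function names, the chunk size, and the token-counting work are hypothetical stand-ins for real NLP preprocessing.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_tokens(descriptions: Sequence[str]) -> List[int]:
    """Hypothetical per-chunk work: count whitespace-separated tokens."""
    return [len(text.split()) for text in descriptions]


def process_all(
    descriptions: Sequence[str],
    chunk_size: int = 1000,
    make_executor: Callable[[], Executor] = ProcessPoolExecutor,
) -> List[int]:
    """Fan chunks out to a pool of workers and flatten the results in order.

    Pass `make_executor=ThreadPoolExecutor` to run the same pipeline with
    threads, for example in tests or interactive sessions.
    """
    chunks = list(chunked(descriptions, chunk_size))
    with make_executor() as pool:
        # Executor.map returns results in the same order as the input chunks.
        results = pool.map(count_tokens, chunks)
        return [count for chunk_result in results for count in chunk_result]
```

When using the default process pool, call `process_all` from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly. Because each chunk is processed independently, the same structure carries over to a cluster scheduler with little change.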

Under the hood, Metaphor is written in Python with Django, backed by a massive PostgreSQL database and an Elasticsearch document store that together catalogue all the world’s podcasts (including non-English podcasts, although for the time being we have only built NLP tools for English). Other key technologies include Ray, Gensim, SentenceTransformers, BERT, BigARTM, Hugging Face, spaCy, NLTK, ZFS, and more.

What's Next

We have published an initial minimum viable product, the search engine, and are actively working on additional components for Metaphor that will leverage growing AI and natural-language processing technologies to process and present the podcast data in novel ways. This includes:

Contact

Questions? Reach out to the developer, Jason Lustig.