Semantic Search with AWS & Pinecone

Podcasts Semantic Search

Completed
Vectors · Embeddings · AWS · Pinecone · Semantic Search · Python · Backend

Project Description

Over the course of a month, I delivered an end-to-end semantic search capability that turns long-form podcast audio into searchable, attribution-ready text snippets at sentence/paragraph granularity.

I owned the backend architecture and implementation: an ingestion pipeline that lands raw episode text (or transcriptions) in Amazon S3; S3 event notifications that kick off an AWS Step Functions state machine; and a set of containerized AWS Lambda functions that validate and clean the text, chunk it into paragraphs and sentences, generate dense embeddings, and upsert those vectors—together with rich metadata—into Pinecone. The search path is equally lean: a stateless API Gateway + Lambda endpoint embeds the user’s free-text query on the fly, executes a cosine-similarity vector search, and returns ranked results that include the original sentences/paragraphs and deep links back to the parent episode (and indices within it).

The result is a “podcasting 2.0” search experience that moves beyond title/description keywords into the semantic content of episodes, which is core to the product vision.


Business Use Case

Company

Early-stage, pre-Series A startup focused on building a “Podcasting 2.0” experience—semantic search, richer episode enrichment, and shareable content to power discovery for listeners and growth/monetization for creators.

Business need

Podcast discovery on mainstream platforms still leans on shallow metadata and brittle keyword search. Listeners struggle to find relevant segments inside long episodes, creators lack viral, snippet-first sharing, and there’s no unified, content-level search across the corpus. The team’s thesis: unlock in-episode data (transcripts, summaries, quotes) and semantic relationships to deliver a unified search & discovery layer and modern sharing UX.

Goals

  • Unify search across thousands of transcribed episodes at the sentence/paragraph level.

  • Deliver relevance via dense embeddings and vector similarity (cosine).

  • Make content shareable by returning original text + deep links to the parent episode.

  • Time-to-index: reliably under ~2 hours from episode drop to searchable result.

  • Operate lean with pay-per-use, serverless infrastructure to control costs while validating product-market fit.

Market context (why it matters): the category is large and growing (hundreds of millions of listeners, tens of millions of episodes, and over a million new episodes monthly), yet discovery remains fragmented—precisely the gap this system addresses.


Solution

Approach

  1. Ingest new episode text (or transcribe if starting from audio), land into S3.

  2. Trigger a Step Functions workflow via S3 Event Notifications (a minimal trigger sketch follows this list).

  3. Normalize & chunk into short paragraphs → sentences; attach episode/show metadata.

  4. Embed each sentence (or short paragraph) using sentence-transformers/paraphrase-MiniLM-L6-v2 (384-dim).

  5. Index into Pinecone with cosine similarity and rich metadata for precise attribution.

  6. Search API: for a free-text query, generate the query embedding on the fly and perform a vector search; return ranked hits with original text, location (episode/paragraph/sentence indices), and links back to the source.
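
Since S3 Event Notifications do not target Step Functions directly, step 2 is typically implemented with a thin trigger Lambda (or an EventBridge rule) that starts one state-machine execution per new object. The sketch below assumes the trigger-Lambda variant; the INGEST_STATE_MACHINE_ARN environment variable and the event handling are illustrative, not the exact production code.

```python
import json
import os
import urllib.parse

import boto3

sfn = boto3.client("stepfunctions")
# Hypothetical environment variable name for illustration.
STATE_MACHINE_ARN = os.environ["INGEST_STATE_MACHINE_ARN"]


def handler(event, context):
    """Thin trigger: start one Step Functions execution per new S3 object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            # The object key doubles as a traceable execution name (trimmed to the 80-char limit).
            name=key.replace("/", "-")[:80],
            input=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"statusCode": 200}
```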

How we arrived at the solution

  • Relevance vs. cost/latency trade-off: The data scientist and I evaluated multiple embedding families for retrieval quality and throughput. paraphrase-MiniLM-L6-v2 emerged as the best balance of speed, compact vectors (384-d), and semantic quality, which mattered for serving low-latency inference inside a Lambda container and keeping index sizes/costs in check.

  • Operational simplicity: A fully managed vector DB minimized ops work so I could deliver end-to-end functionality in weeks, not months.

  • Pay-per-use: Serverless orchestration (S3 + Step Functions + Lambda) aligned infrastructure cost with ingestion volume.

Tech stack

  • AWS: S3 (landing), Event Notifications, Step Functions (orchestration), Lambda (Python), API Gateway (search endpoint), CloudWatch (metrics/alarms), IAM (least privilege), Parameter Store/Secrets Manager (keys & config).

  • Vector DB: Pinecone (cosine similarity), per-record metadata for episode/author/paragraph/sentence references.

  • Embeddings: sentence-transformers/paraphrase-MiniLM-L6-v2.

  • Packaging & CI/CD: Containerized Lambdas; IaC (CloudFormation/SAM); per-env configs and staged deploys.

Architecture overview

Design choices were guided by speed, cost, and operational simplicity. Sentence-transformers MiniLM-L6-v2 provided compact 384-dimensional embeddings with solid relevance for conversational text, keeping Lambda cold-start and Pinecone index costs down while maintaining high recall. Metadata captured with each vector—episode/show identifiers, author, published date, paragraph/sentence indices, and canonical URLs—ensures that what we retrieve is both context-preserving and immediately usable in the UI, snippet sharing, or downstream features like chapterization and quote extraction. This architecture lines up with the company’s intent to make “full transcripts, summaries, extracted insights & key quotes” modular building blocks for discovery, sharing, and creator analytics.

[Architecture diagram: deepcast_semantic_search_architecture_transparent.svg]

Data model (index record, conceptual)

  • id: deterministic hash of {episode_id, paragraph_idx, sentence_idx}

  • vector: 384-dim float32

  • metadata: episode_id, show_id, author, published_at, paragraph_idx, sentence_idx, char_start/char_end (if available), display_text, episode_url, source (a record-construction sketch follows).
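
As a hedged illustration of this record shape, the helper below assembles one upsert payload with a deterministic ID; build_record, its episode dict, and the omission of char_start/char_end are assumptions for brevity, not the production schema.

```python
import hashlib


def build_record(episode, paragraph_idx, sentence_idx, sentence_text, vector):
    """Illustrative only: mirrors the conceptual index record described above."""
    raw_id = f"{episode['episode_id']}:{paragraph_idx}:{sentence_idx}"
    return {
        # Deterministic hash: retried upserts overwrite the same ID instead of duplicating.
        "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest(),
        "values": vector,  # 384-dim float list from MiniLM
        "metadata": {
            "episode_id": episode["episode_id"],
            "show_id": episode["show_id"],
            "author": episode["author"],
            "published_at": episode["published_at"],
            "paragraph_idx": paragraph_idx,
            "sentence_idx": sentence_idx,
            "display_text": sentence_text,
            "episode_url": episode["episode_url"],
            "source": episode.get("source", "transcript"),
        },
    }
```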

Considered alternatives

  • Vector store: We weighed Pinecone against OpenSearch k-NN/Elasticsearch HNSW, Weaviate, Qdrant, and pgvector, and chose Pinecone for minimal ops, predictable latency at startup scale, robust metadata filters, and favorable pricing (including startup incentives).

  • Embeddings: We explored the small e5 family and hosted embedding APIs. Local MiniLM won on packaging simplicity (Lambda container), cost, and adequate retrieval quality for podcast text.

Architecture details

Orchestration & idempotency

  • Step Functions coordinates stateless Lambdas: validate → chunk → embed+upsert.

  • Idempotency keys derived from {episode_id, paragraph_idx, sentence_idx} to guard against duplicate upserts on retries.

  • Exponential backoff and DLQs for transient failures (network, Pinecone rate limits); a retry sketch follows this list.
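
The retry sketch below is illustrative: Step Functions retry policies and DLQs handle most failures, but an application-level backoff around the Pinecone upsert (with assumed names) shows why deterministic IDs make retries safe.

```python
import random
import time


def upsert_with_backoff(index, records, max_attempts=5):
    """Sketch of exponential backoff with jitter around a Pinecone upsert."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Deterministic IDs make this idempotent: a retry overwrites, it never duplicates.
            index.upsert(vectors=records)
            return
        except Exception:
            if attempt == max_attempts:
                raise  # surface to Step Functions so its retry/DLQ policy takes over
            time.sleep((2 ** attempt) + random.random())
```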

Chunking strategy

  • Paragraph pre-splitting (~150–300 words), then sentence splitting for fine-grained retrieval.

  • Maintain both granularities in metadata so the UI can expand from a single sentence to its parent paragraph for context (a chunking sketch follows this list).
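
The chunking sketch below is a naive approximation of this strategy, assuming plain transcript text: it groups sentences into roughly paragraph-sized windows and emits both indices so a sentence can be expanded to its parent paragraph. A production pipeline would likely use a proper sentence tokenizer instead of the regex shown here.

```python
import re

# Naive sentence boundary: whitespace preceded by ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_transcript(text, max_paragraph_words=300):
    """Group sentences into ~paragraph-sized windows, then yield sentence-level chunks."""
    paragraphs, current = [], []
    for sentence in SENTENCE_END.split(text.strip()):
        current.append(sentence)
        if sum(len(s.split()) for s in current) >= max_paragraph_words:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)

    # Emit both granularities so metadata can point from each sentence to its paragraph.
    for p_idx, sentences in enumerate(paragraphs):
        for s_idx, sentence in enumerate(sentences):
            yield {"paragraph_idx": p_idx, "sentence_idx": s_idx, "text": sentence}
```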

Embedding & cold-start mitigation

  • Lambda packaged as a container image with the MiniLM model baked in; provisioned memory tuned for BLAS throughput (a handler sketch follows this list).

  • Provisioned Concurrency on the search Lambda to eliminate P95 cold starts during interactive queries.
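
A minimal sketch of the embedding handler under the assumptions above (the model name comes from earlier in this document; the event shape and batch size are illustrative): loading the model at module scope means each warm or provisioned container pays the load cost only once.

```python
from sentence_transformers import SentenceTransformer

# Loaded once per container at import time, so warm invocations (and
# Provisioned Concurrency instances) reuse the model across requests.
MODEL = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")


def handler(event, context):
    sentences = event["sentences"]  # hypothetical input shape for illustration
    # Batched encoding keeps BLAS throughput high; returns one 384-dim vector per sentence.
    vectors = MODEL.encode(sentences, batch_size=64, show_progress_bar=False)
    return {"vectors": [v.tolist() for v in vectors]}
```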

Search API

  • POST /search with query, optional filters (e.g., show, author, date range), and topK.

  • Pipeline: embed query → Pinecone vector search (cosine) → hydrate with stored metadata → return original text + deep links back to episode context (paragraph/sentence indices, episode URL); a search-handler sketch follows this list.
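
The search-handler sketch below assumes the current Pinecone Python client and API Gateway's Lambda proxy event format; the environment-variable names, filter example, and response shape are illustrative rather than the production contract.

```python
import json
import os

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Model and index handles live at module scope for warm reuse.
MODEL = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
INDEX = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(os.environ["PINECONE_INDEX"])


def handler(event, context):
    body = json.loads(event["body"])  # API Gateway Lambda proxy payload
    query_vector = MODEL.encode(body["query"]).tolist()

    results = INDEX.query(
        vector=query_vector,
        top_k=body.get("topK", 10),
        filter=body.get("filters"),  # e.g. {"show_id": {"$eq": "..."}}
        include_metadata=True,
    )
    hits = [
        {
            "score": match.score,
            "text": match.metadata["display_text"],
            "episode_url": match.metadata["episode_url"],
            "paragraph_idx": match.metadata["paragraph_idx"],
            "sentence_idx": match.metadata["sentence_idx"],
        }
        for match in results.matches
    ]
    return {"statusCode": 200, "body": json.dumps({"results": hits})}
```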

Security, ops, and deployments

  • Least-privilege IAM for each Lambda; KMS-encrypted params for Pinecone keys.

  • CloudWatch metrics/alarms on Step Functions failures, Lambda errors/duration, Pinecone timeouts.

  • Blue/green deploys of Lambdas via SAM/CloudFormation; per-env staging for safe rollouts.


Challenges

  • Long-form variability: Podcast episodes vary from minutes to hours; chunking had to maximize recall without exploding index size. Solution: paragraph→sentence pipeline with compact vectors and de-dupe.

  • Metadata integrity: Ensuring every vector record maps back to high-quality display text and an episode deep link (critical for UX and creator value).

  • Latency budget: Keep end-to-end query time low while embedding on demand; solved via small, fast model, containerized Lambda, and tuned Pinecone top-K.

  • Migration context: Backends were moving from Vercel to AWS; I created a clean, serverless boundary (S3 → Step Functions → Lambda) that didn’t depend on legacy infra.


Scaling

  • Throughput: The company’s ingestion targets were already several hundred episodes daily; the pipeline parallelizes at the episode level and further within chunking/embedding to absorb bursts (fan-out with safe concurrency caps).

  • Index growth: Compact 384-d vectors + sentence-level granularity keeps cost predictable. Metadata filters prevent scanning large swaths of the corpus.

  • Ops posture: No servers to manage; horizontal scaling via Lambda concurrency and Pinecone capacity.

  • Market-driven scaling: The broader landscape (70M+ episodes, ~1M+ added monthly) supports the need for semantic search as coverage expands.


Conclusion

Goals achieved

  • Unified, content-level search across transcribed episodes with sentence-granularity snippets.

  • High-relevance retrieval using embeddings and cosine similarity.

  • Attribution-ready results (original text + links to episode/paragraph/sentence).

  • Fast freshness: typical ingestion-to-search availability in under ~2 hours.

Cost impact

  • Serverless pay-per-use minimized idle spend; infra scaled down naturally during off-hours.

  • Pinecone offloaded cluster ops, letting a one-engineer backend team deliver quickly and keep TCO low while validating the product.

Efficiency gains

  • Automated end-to-end ingestion from S3 drop to searchable vectors.

  • Developers and product could iterate on ranking & UX without touching ingestion.

  • The same metadata supports future features: entity linking, quote extraction, summaries, and creator analytics.

Overall business impact

This project delivered the technical foundation for “Podcasting 2.0”: semantic search, deeper discovery, and shareable, high-signal snippets. It addresses the core market gap (no unified search, restricted sharing, lost creator value), enabling a step-change in listener engagement and new monetization surfaces for creators.


Appendix (quick reference)

Key AWS resources

  • S3 (landing buckets), Event Notifications

  • Step Functions (workflow), Lambda (validate/chunk/embed/upsert, search)

  • API Gateway (search endpoint), CloudWatch (observability), IAM, Secrets Manager/Parameter Store

Key decisions

  • Vector DB: Pinecone for simplicity, latency, and metadata.

  • Embeddings: MiniLM-L6-v2 for speed/size/quality balance.

  • Granularity: Sentence-level indexing for precise matches with paragraph context.