Product Matching via Semantic Search

Calibrated retrieval and ranking system, Fetch Rewards

PyTorch · Hugging Face · SBERT · FAISS · Cross-Encoders · Amazon SageMaker · Snowflake · SQL · Python

Business Problem

  • Fetch processes 11M+ receipts daily, extracting hundreds of millions of line items that need to be matched to an internal product catalog
  • The matching system needed to operate at scale with low latency, handle noisy and inconsistent input data, and maintain high precision to avoid revenue misattribution
  • For the full business context and problem reframing, see the Product & Strategy Case Study

Example: Receipt Line Item to Catalog Match

Receipt Line Item      Catalog Product                   Match Type
ORG MILK 1GAL          Organic Whole Milk 1 Gallon       SKU match
CHKN BRST BNLS         Fresh Chicken Breast              Semantic
EVOO 500ML             Extra Virgin Olive Oil 500ml      Semantic
PNT BTR CRMY 16OZ      Peanut Butter Creamy 16oz         Semantic
GRD BEEF 80/20         Ground Beef 80/20                 UPC match

Items with valid SKU/UPC identifiers are resolved via direct lookup. Items with abbreviated, noisy, or missing identifiers require semantic matching.

Accuracy Was Not the Bottleneck

  • The root-cause analysis and the upper bound established through a human benchmark study (details in the case study) made the direction clear: improve the semantic search algorithms and sentence transformer models to close the gap to that ceiling
  • The core technical challenge was not model accuracy alone, but designing a system that balanced precision and recall under revenue risk constraints, handling catalog gaps, noisy input data, and entity resolution failures at scale

Modeling Approach

Retrieval & Ranking Pipeline

Receipt Line Items
  → Direct Match Lookup (SKU / UPC / Barcode)
      → match found → Matched Product
      → no match → Candidate Retrieval (SBERT Bi-Encoder + FAISS, Top-K)
          → Cross-Encoder Reranking (ms-marco-MiniLM-L-6-v2)
              → Confidence Calibration (precision-coverage threshold)
                  → high confidence → Auto-Assigned
                  → low confidence → Manual Review

Gold Dataset

  • Created a 50K+ labeled dataset, stratified by merchant, frequency (head vs. long-tail), OCR quality, and category
  • Defined annotation guidelines, achieved high inter-annotator agreement, and established the evaluation foundation

Existing Direct Match

  • Fetch already had a deterministic lookup system matching receipt items via SKU, UPC, and barcode identifiers
  • Handled a portion of items with exact matches, but left a significant gap for items with noisy, abbreviated, or missing identifiers

Semantic Search Algorithm

  • Designed a semantic matching pipeline using Sentence-BERT (SBERT) embeddings
  • Experimented with multiple sentence transformer models: all-MiniLM-L6-v2 (384-dim), all-mpnet-base-v2 (768-dim), and BERT-based variants, evaluating tradeoffs between embedding quality, latency, and memory footprint
  • Fine-tuned using cosine similarity loss and MultipleNegativesRankingLoss (MNR) on domain-specific product pairs to learn receipt-to-catalog semantic similarity
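MultipleNegativesRankingLoss treats every other positive in a batch as a negative for a given query, so each receipt item is pushed toward its matching catalog entry and away from the rest of the batch. A minimal numpy sketch of the loss computation (the embeddings and scale here are illustrative; in practice this is `sentence_transformers.losses.MultipleNegativesRankingLoss` applied during fine-tuning):

```python
import numpy as np

def mnr_loss(query_emb, pos_emb, scale=20.0):
    """MNR loss: for query i, catalog item i is the positive and every
    other item j != i in the batch acts as an in-batch negative."""
    # Cosine similarity matrix between all queries and all positives
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                       # (batch, batch)
    # Cross-entropy with the diagonal (the true pair) as the target class
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8)).astype(np.float32)
positives = queries + 0.1 * rng.normal(size=(4, 8)).astype(np.float32)
loss = mnr_loss(queries, positives)  # near zero when pairs align well
```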

Input Representation Experiments

  • Experimented with different input text representations to enrich the embedding model with more contextual information
  • Tested product name alone, and various combinations of product name + size, unit, retailer, SKU, and price
  • Richer input representations improved matching for ambiguous or abbreviated items where name alone was insufficient
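One way the enriched input text can be assembled; the field names and the `" | "` template below are illustrative, not the exact production format:

```python
def build_input_text(item: dict) -> str:
    """Concatenate available fields into one text input for the embedding
    model; missing fields are skipped, so sparse items degrade gracefully
    to a name-only representation."""
    parts = [
        item.get("name"),
        item.get("size"),
        item.get("unit"),
        item.get("retailer"),
        item.get("sku"),
    ]
    return " | ".join(str(p) for p in parts if p)

rich = build_input_text({"name": "PNT BTR CRMY", "size": "16",
                         "unit": "OZ", "retailer": "Walmart"})
sparse = build_input_text({"name": "PNT BTR CRMY"})
```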

Data Augmentation

  • Synthetic generation of noisy product descriptions: abbreviations, misspellings, reordered tokens, and dropped fields
  • Helped the model generalize across retailers and receipt formats without overfitting to clean catalog text
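A minimal sketch of the kinds of corruption operators involved (the exact augmentation recipes were domain-tuned; these four are simplified stand-ins):

```python
import random

VOWELS = set("aeiouAEIOU")

def corrupt(text: str, rng: random.Random) -> str:
    """Generate one noisy variant of a clean catalog description by
    applying abbreviation (vowel dropping), a character-swap typo,
    token reordering, or field dropping."""
    tokens = text.split()
    op = rng.choice(["abbreviate", "typo", "reorder", "drop"])
    if op == "abbreviate":
        tokens = ["".join(c for c in t if c not in VOWELS) or t for t in tokens]
    elif op == "typo":
        i = rng.randrange(len(tokens))
        t = tokens[i]
        if len(t) > 2:
            j = rng.randrange(len(t) - 1)
            tokens[i] = t[:j] + t[j + 1] + t[j] + t[j + 2:]  # swap adjacent chars
    elif op == "reorder":
        rng.shuffle(tokens)
    elif op == "drop" and len(tokens) > 1:
        tokens.pop(rng.randrange(len(tokens)))
    return " ".join(tokens)

rng = random.Random(42)
variants = [corrupt("Organic Whole Milk 1 Gallon", rng) for _ in range(4)]
```

Pairing such variants with their clean source text yields the (noisy, clean) training pairs used for fine-tuning.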

Vector Index (FAISS)

  • Normalized embeddings indexed using FAISS (Facebook AI Similarity Search) with an inner product index for cosine similarity search
  • Sub-millisecond approximate nearest neighbor retrieval across the full catalog, scaling efficiently as the taxonomy grew without requiring model retraining

Candidate Retrieval

  • Multi-strategy candidate generation combining direct lookup, fuzzy matching (Levenshtein, TF-IDF), and SBERT + FAISS semantic retrieval
  • Optimized for Recall@10, ensuring the correct product appeared in the candidate pool
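The pooling step can be sketched as a union of strategies with order-preserving de-duplication; here `difflib` stands in for the Levenshtein/TF-IDF scorers and the semantic retriever is a stub (in production it is the SBERT + FAISS lookup):

```python
from difflib import SequenceMatcher

CATALOG = ["Organic Whole Milk 1 Gallon", "Fresh Chicken Breast",
           "Extra Virgin Olive Oil 500ml", "Ground Beef 80/20"]

def fuzzy_candidates(query: str, k: int = 2) -> list:
    """Character-level fuzzy match; stand-in for Levenshtein/TF-IDF."""
    return sorted(CATALOG, reverse=True,
                  key=lambda p: SequenceMatcher(None, query.lower(), p.lower()).ratio())[:k]

def semantic_candidates(query: str, k: int = 2) -> list:
    """Stub for SBERT + FAISS retrieval; same scorer for this sketch."""
    return fuzzy_candidates(query, k)

def candidate_pool(query: str) -> list:
    # Union of strategies, order-preserving de-dup; sized to hit Recall@K targets
    seen, pool = set(), []
    for cand in fuzzy_candidates(query) + semantic_candidates(query):
        if cand not in seen:
            seen.add(cand)
            pool.append(cand)
    return pool

pool = candidate_pool("ORG MILK 1GAL")
```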

Ranking & Precision Optimization

  • Retrieval stage (SBERT + FAISS) uses a bi-encoder architecture, encoding query and catalog items independently for fast similarity search
  • Reranking via cross-encoder model (ms-marco-MiniLM-L-6-v2) that takes both receipt item and candidate product as a single input pair, enabling deeper token-level interaction and more accurate relevance scoring
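The reranking step, sketched with a toy token-overlap scorer standing in for ms-marco-MiniLM-L-6-v2 (in production the pair scores come from `sentence_transformers.CrossEncoder.predict` over the concatenated pairs):

```python
def pair_score(query: str, candidate: str) -> float:
    """Toy stand-in for the cross-encoder: scores the (query, candidate)
    pair jointly. The real model consumes both texts as one input and
    attends across them token by token."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank(query, candidates):
    # Score every (query, candidate) pair, then sort best-first
    return sorted(candidates, key=lambda cand: pair_score(query, cand), reverse=True)

ranked = rerank("ground beef 80/20",
                ["Fresh Chicken Breast", "Ground Beef 80/20",
                 "Peanut Butter Creamy 16oz"])
```

The cross-encoder is too slow to score the full catalog, which is why it only sees the Top-K pool produced by the bi-encoder.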

Confidence Calibration

  • Tuned confidence thresholds on the 50K+ gold dataset to find the optimal operating point on the precision-coverage tradeoff curve
  • Only high-confidence matches auto-assigned; lower-confidence items routed for review
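The threshold search reduces to a sweep over held-out scores: pick the lowest threshold whose precision still clears the floor, which maximizes coverage. A minimal sketch with illustrative numbers (the production floor was set by revenue-risk constraints, not the 0.85 used here):

```python
def pick_threshold(scores, correct, min_precision):
    """Sweep thresholds over held-out (score, is_correct) pairs and keep
    the lowest threshold whose running precision meets the floor --
    i.e. the max-coverage point on the precision-coverage curve."""
    pairs = sorted(zip(scores, correct), reverse=True)
    best, tp, total = None, 0, 0
    for score, is_correct in pairs:
        total += 1
        tp += is_correct
        precision = tp / total
        coverage = total / len(pairs)
        if precision >= min_precision:
            best = (score, precision, coverage)
    return best  # (threshold, precision at it, coverage at it)

scores  = [0.99, 0.97, 0.95, 0.90, 0.80, 0.60, 0.55, 0.40]
correct = [1,    1,    1,    1,    1,    0,    1,    0]
threshold, precision, coverage = pick_threshold(scores, correct, min_precision=0.85)
# threshold=0.55: 87.5% of items auto-assigned at ~85.7% precision
```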

Evaluation Strategy

Offline ML Metrics

  • Recall@10: measuring whether the correct product appeared in the top-10 candidates
  • Top-1 accuracy: measuring exact match at the top rank
  • MRR (Mean Reciprocal Rank): measuring ranking quality across candidates
  • Precision at operating threshold: ensuring automation didn't introduce revenue risk
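The ranking metrics above are straightforward to compute from ranked candidate lists; a self-contained sketch with a toy three-item eval set:

```python
def recall_at_k(ranked_ids, true_id, k=10):
    """1.0 if the gold product appears in the top-k candidates."""
    return float(true_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, true_id):
    """1/rank of the gold product, 0 if it never appears."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == true_id:
            return 1.0 / rank
    return 0.0

# Each row: the model's ranked candidate ids and the gold product id
evals = [(["a", "b", "c"], "a"),   # correct at rank 1
         (["b", "a", "c"], "a"),   # correct at rank 2
         (["b", "c", "d"], "a")]   # correct product missing

top1   = sum(r[0] == t for r, t in evals) / len(evals)
recall = sum(recall_at_k(r, t, k=3) for r, t in evals) / len(evals)
mrr    = sum(reciprocal_rank(r, t) for r, t in evals) / len(evals)
# → top1 = 1/3, recall@3 = 2/3, MRR = 0.5
```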

Business Metrics

  • Points awarded accuracy: correctness of reward points attributed to users based on matched products
  • Dollar value of rewards: financial accuracy of rewards distributed, directly tied to partner billing and revenue reconciliation

The 50K+ stratified gold dataset served as the foundation for all offline evaluation, and shadow deployments enabled safe comparison on live traffic before production release.

Key Takeaways & Challenges

  • Noisy identifiers and short descriptions: receipt line items often contained abbreviated product names (e.g., "GV 2% MLK 1GL" for "Great Value 2% Reduced Fat Milk, 1 Gallon") and partial UPC/SKU codes with trailing zeros or mismatched formats, making direct text matching unreliable
  • OCR errors in physical receipts: scanned paper receipts introduced misspellings, merged words, and missing characters (e.g., "CHBNI VAN YGT" for "Chobani Vanilla Greek Yogurt"), further degrading match quality
  • Large, evolving taxonomy: the product catalog was continuously growing, requiring the system to generalize to new products without retraining
  • Data drift and continuous inference on unseen items: millions of new and unknown items were inferred daily against fixed catalog embeddings, requiring iterative retraining, gold dataset refresh, and ongoing monitoring to maintain match quality as product distributions shifted over time

Continuous improvement: error analysis and active learning via uncertainty sampling to identify high-value training examples and iteratively improve model performance.
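Uncertainty sampling can be sketched as selecting the items whose confidence lands closest to the decision threshold; the items and threshold below are illustrative:

```python
def uncertainty_sample(items, threshold, budget=2):
    """Pick the items whose confidence sits nearest the decision
    threshold -- where the model is least certain -- so new labels
    there are most informative for the next fine-tuning round."""
    return sorted(items, key=lambda it: abs(it["score"] - threshold))[:budget]

items = [{"id": 1, "score": 0.99}, {"id": 2, "score": 0.71},
         {"id": 3, "score": 0.69}, {"id": 4, "score": 0.20}]
to_label = uncertainty_sample(items, threshold=0.70)  # ids 2 and 3
```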

Production Deployment

Load Testing & Latency Optimization

  • Conducted load testing to validate throughput and latency under production traffic volumes
  • Converted trained PyTorch models to ONNX format for optimized inference, reducing latency and increasing throughput for real-time serving

Telemetry & Monitoring

  • Set up telemetry for model performance metrics, tracking latency percentiles (p90, p95, p99), throughput, and error rates across inference endpoints

Scaling

  • Horizontal scaling with auto-scaling SageMaker endpoints to handle traffic spikes and sustained high-volume processing across millions of daily line items

Impact

  • Coverage: 45% improvement in automated product assignment
  • Operations: ~30% reduction in manual review workload
  • Attribution: significantly improved partner attribution reliability while maintaining high precision at the operating threshold, enabling more accurate offer targeting and reporting for CPG partners
  • Scale: scalable architecture supporting millions of line items from 11M+ daily receipts, covering over $400M GMV, rewarding $500K+ every day

Read the Product & Strategy Case Study →

← Back to Projects