Product Matching via Semantic Search
Calibrated retrieval and ranking system, Fetch Rewards
Business Problem
- Fetch processes 11M+ receipts daily, extracting hundreds of millions of line items that need to be matched to an internal product catalog
- The matching system needed to operate at scale with low latency, handle noisy and inconsistent input data, and maintain high precision to avoid revenue misattribution
- For the full business context and problem reframing, see the Product & Strategy Case Study
Example: Receipt Line Item to Catalog Match
| Receipt Line Item | Catalog Product | Match Type |
|---|---|---|
| ORG MILK 1GAL | Organic Whole Milk 1 Gallon | SKU match |
| CHKN BRST BNLS | Fresh Chicken Breast | Semantic |
| EVOO 500ML | Extra Virgin Olive Oil 500ml | Semantic |
| PNT BTR CRMY 16OZ | Peanut Butter Creamy 16oz | Semantic |
| GRD BEEF 80/20 | Ground Beef 80/20 | UPC match |
Items with valid SKU/UPC identifiers are resolved via direct lookup. Items with abbreviated, noisy, or missing identifiers require semantic matching.
Accuracy Was Not the Bottleneck
- The root-cause analysis and the upper bound established by a human benchmark study (details in case study) made the direction clear: build better semantic search algorithms and sentence transformer models to close the gap to that ceiling
- The core technical challenge was not model accuracy alone, but designing a system that balanced precision and recall under revenue risk constraints, handling catalog gaps, noisy input data, and entity resolution failures at scale
Modeling Approach
Retrieval & Ranking Pipeline
SKU / UPC / Barcode → SBERT Bi-Encoder + FAISS (Top-K) → Cross-Encoder Rerank (ms-marco-MiniLM-L-6-v2) → Precision-Coverage Threshold
Gold Dataset
- Created a labeled dataset of 50K+ examples, stratified by merchant, frequency (head vs. long-tail), OCR quality, and category
- Defined annotation guidelines, achieved high inter-annotator agreement, and established the evaluation foundation
Existing Direct Match
- Fetch already had a deterministic lookup system matching receipt items via SKU, UPC, and barcode identifiers
- Handled a portion of items with exact matches, but left a significant gap for items with noisy, abbreviated, or missing identifiers
Semantic Search Algorithm
- Designed a semantic matching pipeline using Sentence-BERT (SBERT) embeddings
- Experimented with multiple sentence transformer models: all-MiniLM-L6-v2 (384-dim), all-mpnet-base-v2 (768-dim), and BERT-based variants, evaluating tradeoffs between embedding quality, latency, and memory footprint
- Fine-tuned using cosine similarity loss and MultipleNegativesRankingLoss (MNR) on domain-specific product pairs to learn receipt-to-catalog semantic similarity
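The MNR objective treats every other catalog item in a batch as a negative for each receipt item. A minimal PyTorch sketch of that loss follows; the production training loop used sentence-transformers' `MultipleNegativesRankingLoss`, so this is only the underlying math, with an illustrative `scale` value:

```python
import torch
import torch.nn.functional as F

def mnr_loss(receipt_emb: torch.Tensor, catalog_emb: torch.Tensor,
             scale: float = 20.0) -> torch.Tensor:
    """Multiple-negatives ranking loss: each receipt embedding should score
    highest against its own catalog match; the other catalog items in the
    batch serve as in-batch negatives."""
    # Cosine similarity matrix between all receipt/catalog pairs in the batch
    sims = F.cosine_similarity(receipt_emb.unsqueeze(1),
                               catalog_emb.unsqueeze(0), dim=-1) * scale
    # Positive pairs sit on the diagonal
    labels = torch.arange(sims.size(0))
    return F.cross_entropy(sims, labels)
```

Maximizing the diagonal relative to each row pushes receipt and catalog embeddings of true pairs together without needing explicitly mined negatives.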
Input Representation Experiments
- Experimented with different input text representations to enrich the embedding model with more contextual information
- Tested product name alone, and various combinations of product name + size, unit, retailer, SKU, and price
- Richer input representations improved matching for ambiguous or abbreviated items where name alone was insufficient
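One such enriched representation can be sketched as a field-concatenation helper; the field names, separator, and formatting here are illustrative, not the production schema:

```python
def build_input_text(name, size=None, unit=None, retailer=None, price=None):
    """Compose a single embedding-input string from product fields.
    Missing fields are simply omitted, so the same function covers
    name-only and fully enriched representations."""
    parts = [name]
    if size and unit:
        parts.append(f"{size}{unit}")
    if retailer:
        parts.append(f"retailer:{retailer}")
    if price is not None:
        parts.append(f"price:{price:.2f}")
    return " | ".join(parts)
```

Keeping the composition in one place makes it cheap to A/B different field combinations against the gold dataset.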
Data Augmentation
- Synthetic generation of noisy product descriptions: abbreviations, misspellings, reordered tokens, and dropped fields
- Helped the model generalize across retailers and receipt formats without overfitting to clean catalog text
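A sketch of the kind of noise generator involved; the three operations below (vowel-dropping abbreviation, token drop, token swap) are illustrative heuristics, not the production pipeline:

```python
import random

def noisy_variants(text: str, rng: random.Random, n: int = 3) -> list[str]:
    """Generate synthetic receipt-style variants of a clean catalog
    description by applying one random corruption per variant."""
    def abbreviate(tok: str) -> str:
        # Keep the first letter, drop later vowels: "CHICKEN" -> "CHCKN"
        return tok[0] + "".join(c for c in tok[1:] if c.lower() not in "aeiou")

    tokens = text.upper().split()
    variants = []
    for _ in range(n):
        toks = tokens[:]
        op = rng.choice(["abbrev", "drop", "swap"])
        if op == "abbrev":
            i = rng.randrange(len(toks))
            toks[i] = abbreviate(toks[i])
        elif op == "drop" and len(toks) > 1:
            toks.pop(rng.randrange(len(toks)))
        elif op == "swap" and len(toks) > 1:
            i = rng.randrange(len(toks) - 1)
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
        variants.append(" ".join(toks))
    return variants
```

Training on (clean catalog text, synthetic noisy variant) pairs exposes the model to receipt-style corruption it will see at inference time.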
Vector Index (FAISS)
- Normalized embeddings indexed using FAISS (Facebook AI Similarity Search) with an inner product index for cosine similarity search
- Sub-millisecond approximate nearest neighbor retrieval across the full catalog, scaling efficiently as the taxonomy grew without requiring model retraining
Candidate Retrieval
- Multi-strategy candidate generation combining direct lookup, fuzzy matching (Levenshtein, TF-IDF), and SBERT + FAISS semantic retrieval
- Optimized for Recall@10, ensuring the correct product appeared in the candidate pool
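The multi-strategy pool can be sketched as a round-robin merge over per-strategy rankings; `difflib` serves here as a stdlib stand-in for the Levenshtein/TF-IDF fuzzy stage:

```python
from difflib import SequenceMatcher

def fuzzy_candidates(item: str, catalog: list[str], k: int = 5) -> list[str]:
    """Fuzzy-match stage: rank catalog entries by string similarity."""
    return sorted(
        catalog,
        key=lambda p: SequenceMatcher(None, item.lower(), p.lower()).ratio(),
        reverse=True,
    )[:k]

def merge_candidates(strategy_results: list[list[str]], k: int = 10) -> list[str]:
    """Round-robin across strategies so each contributes to the pool,
    deduplicating and capping at k candidates for the reranker."""
    pool, seen = [], set()
    for rank in range(max(map(len, strategy_results), default=0)):
        for results in strategy_results:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                pool.append(results[rank])
                if len(pool) == k:
                    return pool
    return pool
```

Interleaving strategies rather than concatenating them keeps the pool diverse, which is what Recall@10 optimization rewards.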
Ranking & Precision Optimization
- Retrieval stage (SBERT + FAISS) uses a bi-encoder architecture, encoding query and catalog items independently for fast similarity search
- Reranking via cross-encoder model (ms-marco-MiniLM-L-6-v2) that takes both receipt item and candidate product as a single input pair, enabling deeper token-level interaction and more accurate relevance scoring
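The rerank step can be sketched as a thin wrapper over a pair scorer. In production the scorer would be a cross-encoder's `predict()` (e.g., sentence-transformers' `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")`); here it is injected so the flow can be shown without model weights:

```python
def rerank(receipt_item, candidates, score_pairs, k=1):
    """Score (receipt_item, candidate) pairs jointly and return the
    top-k candidates with their scores, highest first."""
    pairs = [(receipt_item, c) for c in candidates]
    scores = score_pairs(pairs)  # cross-encoder sees both texts at once
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]
```

Unlike the bi-encoder, the scorer sees receipt and candidate together, so token-level interactions (sizes, brands, abbreviations) can influence the relevance score.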
Confidence Calibration
- Tuned confidence thresholds on the 50K+ gold dataset to find the optimal operating point on the precision-coverage tradeoff curve
- Only high-confidence matches auto-assigned; lower-confidence items routed for review
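The threshold tuning amounts to sweeping the precision-coverage curve on the gold dataset and picking the highest-coverage point that clears a precision floor; a minimal sketch:

```python
def precision_coverage_curve(confidences, correct, thresholds):
    """For each threshold, compute precision (fraction of auto-assigned
    matches that are correct) and coverage (fraction auto-assigned)."""
    points = []
    n = len(confidences)
    for t in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        coverage = len(kept) / n
        precision = sum(kept) / len(kept) if kept else 1.0
        points.append((t, precision, coverage))
    return points

def pick_threshold(points, min_precision):
    """Highest-coverage operating point that meets the precision floor."""
    feasible = [p for p in points if p[1] >= min_precision]
    return max(feasible, key=lambda p: p[2]) if feasible else None
```

Items below the chosen threshold fall through to the review queue rather than being auto-assigned.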
Evaluation Strategy
Offline ML Metrics
- Recall@10: measuring whether the correct product appeared in the top-10 candidates
- Top-1 accuracy: measuring exact match at the top rank
- MRR (Mean Reciprocal Rank): measuring ranking quality across candidates
- Precision at operating threshold: ensuring automation didn't introduce revenue risk
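These retrieval metrics are straightforward to compute from ranked candidate lists; note that Top-1 accuracy is simply Recall@k with k=1:

```python
def recall_at_k(ranked_lists, gold, k=10):
    """Fraction of queries whose gold product appears in the top-k candidates."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold product (0 when it is absent)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```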
Business Metrics
- Points awarded accuracy: correctness of reward points attributed to users based on matched products
- Dollar value of rewards: financial accuracy of rewards distributed, directly tied to partner billing and revenue reconciliation
The 50K+ stratified gold dataset served as the foundation for all offline evaluation, and shadow deployments enabled safe comparison on live traffic before production release.
Key Takeaways & Challenges
- Noisy identifiers and short descriptions: receipt line items often contained abbreviated product names (e.g., "GV 2% MLK 1GL" for "Great Value 2% Reduced Fat Milk, 1 Gallon") and partial UPC/SKU codes with trailing zeros or mismatched formats, making direct text matching unreliable
- OCR errors in physical receipts: scanned paper receipts introduced misspellings, merged words, and missing characters (e.g., "CHBNI VAN YGT" for "Chobani Vanilla Greek Yogurt"), further degrading match quality
- Large, evolving taxonomy: the product catalog was continuously growing, requiring the system to generalize to new products without retraining
- Data drift and continuous inference on unseen items: millions of new and unknown items were inferred daily against fixed catalog embeddings, requiring iterative retraining, gold dataset refresh, and ongoing monitoring to maintain match quality as product distributions shifted over time
- Continuous improvement: error analysis and active learning via uncertainty sampling to identify high-value training examples and iteratively improve model performance
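One simple form of uncertainty sampling is to prioritize items whose confidence falls closest to the operating threshold, where a human label most changes model behavior; a hedged sketch (the production selection logic may differ):

```python
def uncertainty_sample(items, confidences, threshold, budget=100):
    """Select up to `budget` items whose confidence is nearest the
    auto-assignment threshold; these are the most informative to label."""
    by_margin = sorted(zip(items, confidences),
                       key=lambda x: abs(x[1] - threshold))
    return [item for item, _ in by_margin[:budget]]
```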
Production Deployment
Load Testing & Latency Optimization
- Conducted load testing to validate throughput and latency under production traffic volumes
- Converted trained PyTorch models to ONNX format for optimized inference, reducing latency and increasing throughput for real-time serving
Telemetry & Monitoring
- Set up telemetry for model performance metrics, tracking latency percentiles (p90, p95, p99), throughput, and error rates across inference endpoints
Scaling
- Horizontal scaling with auto-scaling SageMaker endpoints to handle traffic spikes and sustained high-volume processing across millions of daily line items
Impact
- Coverage: 45% improvement in automated product assignment
- Operations: ~30% reduction in manual review workload
- Attribution: maintained high precision at the operating threshold, significantly improving partner attribution reliability and enabling more accurate offer targeting and partner reporting for CPG partners
- Scale: scalable architecture supporting millions of line items from 11M+ daily receipts, covering over $400M GMV, rewarding $500K+ every day