Product Matching via Semantic Search
Calibrated retrieval and ranking system, Fetch Rewards
Business Problem
- Fetch processes 11M+ receipts daily, extracting hundreds of millions of line items that need to be matched to an internal product catalog
- The matching system needed to operate at scale with low latency, handle noisy and inconsistent input data, and maintain high precision to avoid revenue misattribution
- For the full business context and problem reframing, see the Product & Strategy Case Study
Example: Receipt Line Item to Catalog Match
| Receipt Line Item | Catalog Product | Match Type |
|---|---|---|
| ORG MILK 1GAL | Organic Whole Milk 1 Gallon | SKU match |
| CHKN BRST BNLS | Fresh Chicken Breast | Semantic |
| EVOO 500ML | Extra Virgin Olive Oil 500ml | Semantic |
| PNT BTR CRMY 16OZ | Peanut Butter Creamy 16oz | Semantic |
| GRD BEEF 80/20 | Ground Beef 80/20 | UPC match |
Items with valid SKU/UPC identifiers are resolved via direct lookup. Items with abbreviated, noisy, or missing identifiers require semantic matching.
Accuracy Was Not the Bottleneck
- The root-cause analysis and the upper bound established by a human benchmark study (details in case study) made the direction clear: build better semantic search algorithms and sentence transformer models to close the gap to that ceiling
- The core technical challenge was not model accuracy alone, but designing a system that balanced precision and recall under revenue risk constraints, handling catalog gaps, noisy input data, and entity resolution failures at scale
Modeling Approach
Retrieval & Ranking Pipeline
SKU / UPC / Barcode → SBERT Bi-Encoder + FAISS (Top-K) → Cross-Encoder Rerank (ms-marco-MiniLM-L-6-v2) → Precision-Coverage Threshold
Gold Dataset
- Created a labeled dataset of 50K+ examples, stratified by merchant, frequency (head vs. long-tail), OCR quality, and category
- Defined annotation guidelines, achieved high inter-annotator agreement, and established the evaluation foundation
Existing Direct Match
- Fetch already had a deterministic lookup system matching receipt items via SKU, UPC, and barcode identifiers
- Handled a portion of items with exact matches, but left a significant gap for items with noisy, abbreviated, or missing identifiers
Semantic Search Algorithm
- Designed a semantic matching pipeline using Sentence-BERT (SBERT) embeddings
- Experimented with multiple sentence transformer models: all-MiniLM-L6-v2 (384-dim), all-mpnet-base-v2 (768-dim), and BERT-based variants, evaluating tradeoffs between embedding quality, latency, and memory footprint
- Fine-tuned using cosine similarity loss and MultipleNegativesRankingLoss (MNR) on domain-specific product pairs to learn receipt-to-catalog semantic similarity
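The MNR objective treats every other catalog item in a batch as a negative for each receipt item. A minimal PyTorch sketch of that loss follows; the production training loop used sentence-transformers' `MultipleNegativesRankingLoss`, so this is only the underlying math, with an illustrative `scale` value:

```python
import torch
import torch.nn.functional as F

def mnr_loss(receipt_emb: torch.Tensor, catalog_emb: torch.Tensor,
             scale: float = 20.0) -> torch.Tensor:
    """Multiple-negatives ranking loss: each receipt embedding should score
    highest against its own catalog match; the other catalog items in the
    batch serve as in-batch negatives."""
    # Cosine similarity matrix between all receipt/catalog pairs in the batch
    sims = F.cosine_similarity(receipt_emb.unsqueeze(1),
                               catalog_emb.unsqueeze(0), dim=-1) * scale
    # Positive pairs sit on the diagonal
    labels = torch.arange(sims.size(0))
    return F.cross_entropy(sims, labels)
```

Maximizing the diagonal relative to each row pushes receipt and catalog embeddings of true pairs together without needing explicitly mined negatives.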
Input Representation Experiments
- Experimented with different input text representations to enrich the embedding model with more contextual information
- Tested product name alone, and various combinations of product name + size, unit, retailer, SKU, and price
- Richer input representations improved matching for ambiguous or abbreviated items where name alone was insufficient
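One such enriched representation can be sketched as a field-concatenation helper; the field names, separator, and formatting here are illustrative, not the production schema:

```python
def build_input_text(name, size=None, unit=None, retailer=None, price=None):
    """Compose a single embedding-input string from product fields.
    Missing fields are simply omitted, so the same function covers
    name-only and fully enriched representations."""
    parts = [name]
    if size and unit:
        parts.append(f"{size}{unit}")
    if retailer:
        parts.append(f"retailer:{retailer}")
    if price is not None:
        parts.append(f"price:{price:.2f}")
    return " | ".join(parts)
```

Keeping the composition in one place makes it cheap to A/B different field combinations against the gold dataset.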
Data Augmentation
- Synthetic generation of noisy product descriptions: abbreviations, misspellings, reordered tokens, and dropped fields
- Helped the model generalize across retailers and receipt formats without overfitting to clean catalog text
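A sketch of the kind of noise generator involved; the three operations below (vowel-dropping abbreviation, token drop, token swap) are illustrative heuristics, not the production pipeline:

```python
import random

def noisy_variants(text: str, rng: random.Random, n: int = 3) -> list[str]:
    """Generate synthetic receipt-style variants of a clean catalog
    description by applying one random corruption per variant."""
    def abbreviate(tok: str) -> str:
        # Keep the first letter, drop later vowels: "CHICKEN" -> "CHCKN"
        return tok[0] + "".join(c for c in tok[1:] if c.lower() not in "aeiou")

    tokens = text.upper().split()
    variants = []
    for _ in range(n):
        toks = tokens[:]
        op = rng.choice(["abbrev", "drop", "swap"])
        if op == "abbrev":
            i = rng.randrange(len(toks))
            toks[i] = abbreviate(toks[i])
        elif op == "drop" and len(toks) > 1:
            toks.pop(rng.randrange(len(toks)))
        elif op == "swap" and len(toks) > 1:
            i = rng.randrange(len(toks) - 1)
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
        variants.append(" ".join(toks))
    return variants
```

Training on (clean catalog text, synthetic noisy variant) pairs exposes the model to receipt-style corruption it will see at inference time.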
Vector Index (FAISS)
- Normalized embeddings indexed using FAISS (Facebook AI Similarity Search) with an inner product index for cosine similarity search
- Sub-millisecond approximate nearest neighbor retrieval across the full catalog, scaling efficiently as the taxonomy grew without requiring model retraining
Candidate Retrieval
- Multi-strategy candidate generation combining direct lookup, fuzzy matching (Levenshtein, TF-IDF), and SBERT + FAISS semantic retrieval
- Optimized for Recall@10, ensuring the correct product appeared in the candidate pool
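The multi-strategy pool can be sketched as a round-robin merge over per-strategy rankings; `difflib` serves here as a stdlib stand-in for the Levenshtein/TF-IDF fuzzy stage:

```python
from difflib import SequenceMatcher

def fuzzy_candidates(item: str, catalog: list[str], k: int = 5) -> list[str]:
    """Fuzzy-match stage: rank catalog entries by string similarity."""
    return sorted(
        catalog,
        key=lambda p: SequenceMatcher(None, item.lower(), p.lower()).ratio(),
        reverse=True,
    )[:k]

def merge_candidates(strategy_results: list[list[str]], k: int = 10) -> list[str]:
    """Round-robin across strategies so each contributes to the pool,
    deduplicating and capping at k candidates for the reranker."""
    pool, seen = [], set()
    for rank in range(max(map(len, strategy_results), default=0)):
        for results in strategy_results:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                pool.append(results[rank])
                if len(pool) == k:
                    return pool
    return pool
```

Interleaving strategies rather than concatenating them keeps the pool diverse, which is what Recall@10 optimization rewards.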
Ranking & Precision Optimization
- Retrieval stage (SBERT + FAISS) uses a bi-encoder architecture, encoding query and catalog items independently for fast similarity search
- Reranking via cross-encoder model (ms-marco-MiniLM-L-6-v2) that takes both receipt item and candidate product as a single input pair, enabling deeper token-level interaction and more accurate relevance scoring
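The rerank step can be sketched as a thin wrapper over a pair scorer. In production the scorer would be a cross-encoder's `predict()` (e.g., sentence-transformers' `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")`); here it is injected so the flow can be shown without model weights:

```python
def rerank(receipt_item, candidates, score_pairs, k=1):
    """Score (receipt_item, candidate) pairs jointly and return the
    top-k candidates with their scores, highest first."""
    pairs = [(receipt_item, c) for c in candidates]
    scores = score_pairs(pairs)  # cross-encoder sees both texts at once
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]
```

Unlike the bi-encoder, the scorer sees receipt and candidate together, so token-level interactions (sizes, brands, abbreviations) can influence the relevance score.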
Confidence Calibration
- Tuned confidence thresholds on the 50K+ gold dataset to find the optimal operating point on the precision-coverage tradeoff curve
- Only high-confidence matches auto-assigned; lower-confidence items routed for review
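The threshold tuning amounts to sweeping the precision-coverage curve on the gold dataset and picking the highest-coverage point that clears a precision floor; a minimal sketch:

```python
def precision_coverage_curve(confidences, correct, thresholds):
    """For each threshold, compute precision (fraction of auto-assigned
    matches that are correct) and coverage (fraction auto-assigned)."""
    points = []
    n = len(confidences)
    for t in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        coverage = len(kept) / n
        precision = sum(kept) / len(kept) if kept else 1.0
        points.append((t, precision, coverage))
    return points

def pick_threshold(points, min_precision):
    """Highest-coverage operating point that meets the precision floor."""
    feasible = [p for p in points if p[1] >= min_precision]
    return max(feasible, key=lambda p: p[2]) if feasible else None
```

Items below the chosen threshold fall through to the review queue rather than being auto-assigned.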
Evaluation Strategy
Offline ML Metrics
- Recall@10: measuring whether the correct product appeared in the top-10 candidates
- Top-1 accuracy: measuring exact match at the top rank
- MRR (Mean Reciprocal Rank): measuring ranking quality across candidates
- Precision at operating threshold: ensuring automation didn't introduce revenue risk
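These retrieval metrics are straightforward to compute from ranked candidate lists; note that Top-1 accuracy is simply Recall@k with k=1:

```python
def recall_at_k(ranked_lists, gold, k=10):
    """Fraction of queries whose gold product appears in the top-k candidates."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold product (0 when it is absent)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```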
Business Metrics
- Points awarded accuracy: correctness of reward points attributed to users based on matched products
- Dollar value of rewards: financial accuracy of rewards distributed, directly tied to partner billing and revenue reconciliation
The 50K+ stratified gold dataset served as the foundation for all offline evaluation, and shadow deployments enabled safe comparison on live traffic before production release.
Key Takeaways & Challenges
- Noisy identifiers and short descriptions: receipt line items often contained abbreviated product names (e.g., "GV 2% MLK 1GL" for "Great Value 2% Reduced Fat Milk, 1 Gallon") and partial UPC/SKU codes with trailing zeros or mismatched formats, making direct text matching unreliable
- OCR errors in physical receipts: scanned paper receipts introduced misspellings, merged words, and missing characters (e.g., "CHBNI VAN YGT" for "Chobani Vanilla Greek Yogurt"), further degrading match quality
- Large, evolving taxonomy: the product catalog was continuously growing, requiring the system to generalize to new products without retraining
- Data drift and continuous inference on unseen items: millions of new and unknown items were inferred daily against fixed catalog embeddings, requiring iterative retraining, gold dataset refresh, and ongoing monitoring to maintain match quality as product distributions shifted over time
- Continuous improvement: error analysis and active learning via uncertainty sampling to identify high-value training examples and iteratively improve model performance
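One simple form of uncertainty sampling is to prioritize items whose confidence falls closest to the operating threshold, where a human label most changes model behavior; a hedged sketch (the production selection logic may differ):

```python
def uncertainty_sample(items, confidences, threshold, budget=100):
    """Select up to `budget` items whose confidence is nearest the
    auto-assignment threshold; these are the most informative to label."""
    by_margin = sorted(zip(items, confidences),
                       key=lambda x: abs(x[1] - threshold))
    return [item for item, _ in by_margin[:budget]]
```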
Production Deployment
Load Testing & Latency Optimization
- Conducted load testing to validate throughput and latency under production traffic volumes
- Converted trained PyTorch models to ONNX format for optimized inference, reducing latency and increasing throughput for real-time serving
Telemetry & Monitoring
- Set up telemetry for model performance metrics, tracking latency percentiles (p90, p95, p99), throughput, and error rates across inference endpoints
Scaling
- Horizontal scaling with auto-scaling SageMaker endpoints to handle traffic spikes and sustained high-volume processing across millions of daily line items
Impact
- Coverage: 45% improvement in automated product assignment
- Operations: ~30% reduction in manual review workload
- Attribution: maintained high precision at the operating threshold, significantly improving partner attribution reliability and enabling more accurate offer targeting and partner reporting for CPG partners
- Scale: scalable architecture supporting millions of line items from 11M+ daily receipts, covering over $400M GMV, rewarding $500K+ every day