Digital Receipt Information Extraction System

Patented production ML system, Fetch Rewards

PyTorch · BERT · Hugging Face · Amazon SageMaker · MLflow · Comet

Business Problem

Millions of digital receipts (emails, e-receipts) flow into Fetch daily as unstructured HTML documents. The system needed to automatically extract structured purchase data at two levels: item-level information (product description, price, quantity, product number/UPC/SKU) and retailer-level information (retailer name, order total, store address, email, and other available metadata). Accurate extraction at both levels was critical for product matching, reward attribution, and partner reporting. For the full business context and product strategy, see the Product & Leadership Case Study.

Challenges

This was a greenfield problem with several unique constraints:

  • No existing research for HTML receipts: SOTA models like LayoutLM relied on OCR and bounding boxes and were designed for scanned documents, not HTML
  • Latency tradeoff: an OCR-based pipeline added significant latency and degraded user experience
  • Token length limitation: transformer models were capped at 512 tokens, and receipts often exceeded that. This was 2021, well before generative AI models with long-context capabilities
  • Complex non-purchase content in HTML: digital receipts contained substituted items, returned items, recommended/promoted products, and unicode characters embedded in the markup, all of which needed to be distinguished from actual purchased items
  • Restaurant and fast food ambiguity: toppings, modifiers, customizations, and combo sub-items created confusion for the model in distinguishing actual line items from their sub-items (e.g., is "Extra Cheese" a separate purchased item or a topping?)
  • Marketing noise mixed with purchase data: promotional banners, suggested products, and loyalty program messages were embedded directly in the HTML alongside real order items

Approach

I designed an approach that worked directly on HTML, bypassing OCR entirely. Raw unstructured HTML receipts were pre-processed by stripping CSS styling, extracting text, and replacing certain HTML tags with special tokens to preserve the visually rich layout structure of receipts.
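
The pre-processing step can be sketched with Python's built-in HTML parser. The tag-to-token mapping, helper names, and special tokens below are illustrative assumptions, not the production implementation:

```python
from html.parser import HTMLParser

# Structural tags whose boundaries carry layout meaning on a receipt;
# each is mapped to a special token the tokenizer keeps as one unit.
# (Illustrative mapping -- the real token set is an assumption here.)
LAYOUT_TOKENS = {"table": "[TABLE]", "tr": "[ROW]", "td": "[CELL]", "br": "[BR]"}
SKIP = {"style", "script", "head"}  # CSS and non-visible content are dropped

class ReceiptStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a SKIP tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in LAYOUT_TOKENS:
            self.parts.append(LAYOUT_TOKENS[tag])

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def preprocess(html: str) -> str:
    parser = ReceiptStripper()
    parser.feed(html)
    return " ".join(parser.parts)

receipt = ("<html><head><style>td{color:red}</style></head><body>"
           "<table><tr><td>Milk</td><td>$3.49</td></tr></table></body></html>")
print(preprocess(receipt))
# -> [TABLE] [ROW] [CELL] Milk [CELL] $3.49
```

The idea is that structural tokens such as [ROW] and [CELL] survive into the model input, standing in for the visual layout that OCR-based systems recover from bounding boxes.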

To build domain knowledge, I performed Masked Language Modeling (MLM) on transformer models using millions of digital receipts, training the language model to understand eReceipt-specific patterns and vocabulary.
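
The corruption rule behind MLM is BERT's standard scheme: 15% of tokens are picked as prediction targets, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged. A minimal PyTorch sketch (MASK_ID and VOCAB_SIZE are illustrative placeholders, not production values):

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30522  # illustrative BERT-style constants

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    labels = input_ids.clone()
    # Pick ~15% of positions as prediction targets.
    picked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~picked] = -100  # un-picked positions are ignored by the loss
    # 80% of picked positions become [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & picked
    input_ids[masked] = MASK_ID
    # Half of the remainder (10% overall) become a random vocabulary id;
    # the rest are left as-is so the model can't rely on [MASK] alone.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & picked & ~masked
    input_ids[randomized] = torch.randint(VOCAB_SIZE, labels.shape)[randomized]
    return input_ids, labels
```

In practice this is what Hugging Face's `DataCollatorForLanguageModeling` applies on the fly during pre-training; the sketch just makes the rule explicit.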

For the labeled dataset, I used stratified random sampling and clustering techniques to ensure coverage across different receipt formats and to handle the long-tail distribution of retailers. The model was then supervised fine-tuned for a Named Entity Recognition (NER) task using BIO encoding, with an encoder-based transformer backbone and a token classification head to extract entities like product name, price, quantity, and retailer.
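
The BIO encoding itself is simple: the first token of an entity span gets a B- tag, continuation tokens get I-, and everything else is O. A minimal sketch with illustrative entity names and tokens:

```python
def bio_encode(tokens, spans):
    """spans: list of (start_idx, end_idx_exclusive, entity_type)."""
    tags = ["O"] * len(tokens)
    for start, end, ent in spans:
        tags[start] = f"B-{ent}"          # span start
        for i in range(start + 1, end):
            tags[i] = f"I-{ent}"          # span continuation
    return tags

tokens = ["[ROW]", "Organic", "Milk", "1", "gal", "$3.49"]
spans = [(1, 5, "PRODUCT"), (5, 6, "PRICE")]
print(bio_encode(tokens, spans))
# -> ['O', 'B-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'B-PRICE']
```

The token classification head then predicts one of these tags per token, and contiguous B-/I- runs are decoded back into entity spans.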

The system started with a placeholder model, allowing my team and me to build the training and inference infrastructure in parallel while iteratively optimizing model accuracy and latency.

Iterative improvement strategy: model accuracy work was phased by business impact. First came high-volume retailers (covering the majority of receipt traffic), then partner retailers and restaurants (directly tied to brand offer revenue), and finally the long tail of less common formats and edge cases.

Development & Evaluation Pipeline

Building this system required standing up the entire ML lifecycle from scratch. I started by hand-labeling training data myself to establish ground truth, then used that foundation to build and train an internal data annotation team for supervised fine-tuning at scale. Since receipts contain sensitive user data, I developed ML models to mask PII from digital receipts before handing them over to the annotation team.

To improve performance on edge cases and low-volume formats, I leveraged active learning and data augmentation techniques:

  • Training data enrichment via active learning: ran the model's predictions on unseen receipts, ingested those predictions into the annotation tool for human correction, and incorporated the corrected examples back into the training set. This bootstrapping strategy rapidly scaled labeled data coverage for underrepresented retailers and formats
  • Data augmentation: randomly substituted entity values (product names, prices, totals, retailer names) within receipt templates to generate synthetic training examples. This helped the model learn structural patterns independent of specific content, improving generalization on complex, edge-case, and low-volume retailer formats
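
The augmentation in the second bullet can be sketched as template substitution over BIO-tagged sequences: entity values are swapped with values drawn from per-entity pools, so the model learns the surrounding structure rather than specific strings. Pools and the tag format below are illustrative:

```python
import random

# Illustrative replacement pools; each value is a pre-tokenized sequence.
POOLS = {
    "PRODUCT": [["Whole", "Milk"], ["Cereal"], ["Paper", "Towels"]],
    "PRICE": [["$2.99"], ["$10.49"], ["$0.79"]],
}

def augment(tokens, tags, rng=random):
    new_tokens, new_tags, i = [], [], 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-") and tag[2:] in POOLS:
            ent = tag[2:]
            # Consume the whole original entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{ent}":
                j += 1
            # Substitute a random replacement, re-emitting consistent BIO tags.
            repl = rng.choice(POOLS[ent])
            new_tokens += repl
            new_tags += [f"B-{ent}"] + [f"I-{ent}"] * (len(repl) - 1)
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags
```

Because the surrounding non-entity tokens (the receipt "template") are untouched, each synthetic example preserves the structural pattern while varying the content.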

Evaluation Strategy:

  • Offline evaluation: held-out test datasets tracking precision, recall, and entity-level accuracy across model versions
  • Shadow deployment infrastructure: built a separate shadow pipeline to run candidate models in parallel against the existing system, enabling side-by-side comparison on live traffic without impacting users
  • Dashboards & reporting: built evaluation dashboards to track and compare metrics across model versions
  • Entity-level error analysis: tracked false positives and false negatives by entity type (product description, price, quantity, retailer). For product entities, analyzed whether false positives were actual purchased items versus returned items, substituted products, or promotional recommendations misclassified as purchases
  • Price and amount confusion matrix: evaluated model confusion across price-related fields (item price, discount, coupon, subtotal, tax, order total) at both item-level and retailer-level, where misclassification between these fields had direct impact on reward attribution accuracy
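
Entity-level precision and recall under exact-span matching can be sketched as follows (a simplified stand-in for what a library like seqeval computes; entity names are illustrative):

```python
from collections import Counter

def extract_spans(tags):
    """Decode BIO tags into a set of (start, end_exclusive, entity) spans."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(tags):
        if ent is not None and tag != f"I-{ent}":
            spans.append((start, i, ent))   # current span ends here
            ent = None
        if tag.startswith("B-"):
            start, ent = i, tag[2:]
    if ent is not None:
        spans.append((start, len(tags), ent))
    return set(spans)

def per_entity_prf(gold_tags, pred_tags):
    """Precision/recall per entity type; a span counts only on exact match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = Counter(e for (_, _, e) in gold & pred)
    fp = Counter(e for (_, _, e) in pred - gold)
    fn = Counter(e for (_, _, e) in gold - pred)
    report = {}
    for ent in {e for (_, _, e) in gold | pred}:
        p = tp[ent] / (tp[ent] + fp[ent]) if tp[ent] + fp[ent] else 0.0
        r = tp[ent] / (tp[ent] + fn[ent]) if tp[ent] + fn[ent] else 0.0
        report[ent] = {"precision": p, "recall": r}
    return report
```

Breaking false positives and false negatives out per entity type, as above, is what makes analyses like the price/amount confusion tractable: each misclassified field shows up as a paired FP/FN between two entity types.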

Post-deployment, I established a human-in-the-loop mechanism (weekly audits on real-time production data by an internal data integrity team, supported by a custom app for auditing and correcting model responses). Corrected examples were fed back into the training data for model retraining, creating a continuous improvement loop.

Training Infrastructure

Model training was built on PyTorch, Hugging Face Transformers, and distributed GPU training. MLflow and Comet provided experiment tracking and reproducibility, while Amazon SageMaker powered training pipelines, real-time inference, and auto-scaling endpoints. An automated retraining pipeline with CI/CD handled model deployment, backed by dataset versioning, model versioning, and continuous monitoring for data and model drift.

Impact

  • Scale: production-grade ML system processing hundreds of millions of receipts across thousands of retailers and formats
  • Cost: replaced external vendor dependency, saving millions in annual operational costs
  • Accuracy: met senior leadership accuracy targets, validated by human evaluations on production data
  • Business: enabled accurate offer matching and reward attribution for millions of active users
  • Innovation: novel approach to HTML-based receipt understanding resulted in a US patent

Read the Product & Leadership Case Study →

← Back to Projects