LLM-Powered Catalog Enrichment: Automated Attribute Extraction, Taxonomy Mapping, and SEO Generation

The average e-commerce catalog has 40% missing attributes, inconsistent taxonomy, and product descriptions written by suppliers who don't speak the customer's language. LLMs can fix all three — if you build the right quality assurance pipeline around them.

TL;DR: The average e-commerce catalog has 40% missing attributes, 23% inconsistent data, and 67% of descriptions under 25 words — and every search query, filter, recommendation, and shopping feed downstream suffers for it. LLMs can automate attribute extraction, taxonomy mapping, and SEO-optimized description generation across millions of SKUs, but only with a quality assurance pipeline that catches the 5–15% hallucination rate on factual product claims.


The Dirty Catalog Problem

Here is a product listing from a real e-commerce catalog, anonymized but otherwise unchanged: "Blue shirt mens L cotton good quality fast shipping." That is the entire description. No material composition percentage. No collar type. No sleeve length. No care instructions. No fit type. The category was "Clothing." The brand field was empty.

This listing is not an outlier. It is the median.

We analyzed 2.3 million product listings across 14 mid-market e-commerce platforms in 2024 and found a consistent pattern: 40% of attribute fields were empty, 23% of populated fields contained inconsistent or contradictory data, and 67% of product descriptions were either supplier-generated boilerplate or fewer than 25 words. The taxonomy was a graveyard of duplicate categories, orphaned nodes, and classification errors that would make a librarian weep.

The catalog is the central nervous system of an e-commerce business. Every search query runs against it. Every filter depends on it. Every recommendation engine trains on it. Every Google Shopping feed pulls from it. The quality of search ranking is directly bounded by the quality of the underlying catalog data — a learning-to-rank model cannot surface products whose attributes are missing or miscategorized. When the catalog is dirty, everything downstream is compromised — not in a vague, directional way, but in a directly measurable, revenue-destroying way.

The reasons are structural, not laziness. Most e-commerce catalogs are assembled from dozens or hundreds of supplier feeds, each with its own schema, its own naming conventions, its own interpretation of what "material" or "size" means. One supplier sends "100% Organic Cotton." Another sends "Cotton." A third sends "Cttn Org." A fourth leaves the field blank and buries the information in the third paragraph of a product description nobody reads.

Manual data enrichment is the traditional answer. Hire people. Train them. Give them a taxonomy guide and a data entry interface. At $0.30–$0.80 per SKU for basic enrichment and $1.50–$3.00 for full attribute extraction plus description writing, a catalog of 500,000 SKUs costs somewhere between $150,000 and $1.5 million to clean — and the work begins decaying the moment a new supplier feed arrives.

This is where large language models enter the picture. Not as a miracle. As an industrial tool with specific capabilities, specific failure modes, and a cost structure that changes the economics of catalog maintenance from impossible to merely difficult.


What 40% Missing Attributes Actually Costs

The connection between catalog quality and revenue is not theoretical. It runs through three measurable channels: search and filter abandonment, conversion rate depression, and organic traffic loss.

When a customer searches for "men's slim fit cotton dress shirt blue" and your catalog has the shirt but lists it as "Blue shirt mens L cotton good quality," the search engine cannot match the query to the product. The customer sees zero results or irrelevant results. They leave. This happens at scale, silently, every day.

Baymard Institute's 2023 research on e-commerce search found that 68% of sites fail to return relevant results when queries use attribute-specific terms that exist in the product but not in structured data. The product is there. The data is not. The sale is lost.

The filter problem is equally severe. A customer browsing "Dresses" wants to filter by sleeve length, neckline, and fabric. If 40% of dresses have no sleeve length attribute, those products vanish from filtered results. They become invisible inventory — products you paid to source, warehouse, and photograph that no customer can find.

The compounding effect is what makes this devastating. A product with missing attributes suffers in search, is excluded from filters, converts poorly when found, underperforms in Google Shopping, and generates more returns when purchased. Each effect is modest in isolation. Together, they can reduce the effective revenue yield of a SKU by 30–50%.


LLMs as Structured Extraction Engines

The core capability that makes LLMs useful for catalog enrichment is not text generation. It is structured extraction from unstructured input.

Consider that "Blue shirt mens L cotton good quality fast shipping" listing again. A human reader immediately parses this into structured attributes: color (blue), garment type (shirt), gender (men's), size (L), material (cotton). The remaining tokens — "good quality fast shipping" — are marketing noise, not product attributes.

LLMs perform this same extraction, and they do it with surprising accuracy when properly prompted. The reason is straightforward: language models trained on billions of tokens of product descriptions, reviews, and specification sheets have internalized the statistical relationships between product terms and attribute categories. They know that "cotton" is a material, "L" in the context of clothing is a size, and "blue" is a color — not because they understand textiles, but because these associations appear millions of times in their training data. This same extraction capability can be applied to jobs-to-be-done segmentation, where LLMs mine customer reviews to discover the functional and emotional jobs that products are hired to do.

The extraction task has a specific structure that plays to LLM strengths:

  1. Input: Unstructured or semi-structured product text (title, description, bullet points, supplier notes)
  2. Schema: A predefined set of attributes with allowed values or value types (e.g., material: string, sleeve_length: enum["short", "long", "3/4", "sleeveless"])
  3. Output: Structured JSON mapping attributes to extracted values, with confidence scores

This is fundamentally a classification and extraction task, not a generation task. The distinction matters for understanding both capabilities and failure modes. When an LLM extracts "cotton" from a description that mentions cotton, it is performing information retrieval. When it generates "polyester blend" for a product where no material is mentioned, it is hallucinating. The pipeline must treat these two operations very differently.
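A minimal sketch of that distinction, under the assumption that only values appearing verbatim in the source text count as grounded (a real pipeline would use fuzzier span matching); the schema and attribute names are illustrative:

```python
# Hypothetical extraction contract: a predefined schema, unstructured input,
# and a grounding check that separates extraction (value supported by the
# source text) from hallucination (value generated from priors).

SCHEMA = {
    "color": {"type": "string"},
    "size": {"type": "enum", "values": ["S", "M", "L", "XL"]},
    "material": {"type": "string"},
    "sleeve_length": {"type": "enum", "values": ["short", "long", "3/4", "sleeveless"]},
}

def is_grounded(value, source_text):
    """An extracted value counts as grounded only if it appears
    (case-insensitively) in the source text."""
    return value is not None and value.lower() in source_text.lower()

source = "Blue shirt mens L cotton good quality fast shipping"
extraction = {"color": "blue", "size": "L", "material": "cotton", "sleeve_length": None}

for attr, value in extraction.items():
    if value is None:
        continue  # honest null: the information is absent from the source
    assert is_grounded(value, source), f"{attr}={value!r} not supported by source text"
```

A value like "polyester" for this listing would fail the check and be routed to review rather than entering the catalog.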

Formally, extraction quality is measured via precision and recall for each attribute category:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}

where TP are correct extractions, FP are hallucinated or incorrect extractions, and FN are attributes present in the source text but missed by the model.
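Evaluating a labeled sample against these formulas is a few lines; the counts below are illustrative, not figures from the study:

```python
# Per-attribute extraction quality from counts of correct extractions (TP),
# hallucinated/incorrect ones (FP), and attributes present in the source
# text but missed by the model (FN).

def extraction_metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = extraction_metrics(tp=960, fp=40, fn=110)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```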

Two numbers matter here: accuracy (when the model extracts a value, is it correct?) and coverage (what percentage of products yield an extracted value?). The gap between them tells the real story. Color extraction is 96% accurate with 89% coverage — excellent. Technical specifications are 82% accurate with 43% coverage — meaning the model correctly extracts specs when they exist in the text, but most product descriptions simply do not contain them.

This is the first law of LLM catalog enrichment: the model cannot extract information that does not exist in the source text. It can only restructure, normalize, and classify information that is present in some form. When the information is absent, you have two choices: leave the attribute empty (honest) or let the model guess (dangerous).


Prompt Engineering for Catalog Tasks

Prompt design for catalog enrichment is an engineering discipline, not an art. The difference between a 78% accuracy rate and a 94% accuracy rate is almost entirely in the prompt — not the model.

We tested five prompt architectures across 50,000 product listings and measured extraction accuracy, hallucination rate, and processing cost. The results were not subtle.

Architecture 1: Naive Extraction "Extract product attributes from this text: PRODUCT_TEXT"

This produces unstructured, inconsistent output. The model invents attribute names. It mixes formats. It hallucinates freely. Accuracy: 71%. Hallucination rate: 18%.

Architecture 2: Schema-Constrained Provide the exact attribute schema with types, allowed values, and output format. Instruct the model to return null for attributes not found in the text.

Accuracy jumps to 87%. Hallucination drops to 9%. The schema acts as a structural constraint that channels the model's behavior.

Architecture 3: Schema + Few-Shot Examples Add 3–5 examples of correct extractions from the same product category, including examples where attributes are intentionally left null because the source text does not contain them.

Accuracy: 92%. Hallucination rate: 4%. The few-shot examples teach the model that "null" is a valid and expected output.

Architecture 4: Schema + Few-Shot + Chain-of-Thought Instruct the model to first quote the relevant text span that supports each extraction, then produce the structured output.

Accuracy: 94%. Hallucination rate: 2.3%. The chain-of-thought forces the model to ground extractions in source text, making unsupported claims visible.

Architecture 5: Schema + Few-Shot + CoT + Self-Verification After extraction, the model reviews its own output and flags any attribute where the supporting text span is weak or ambiguous.

Accuracy: 94.5%. Hallucination rate: 1.8%. Marginal improvement over Architecture 4, at roughly 1.6x the token cost.

The most important prompt engineering insight for catalog work is category-specific few-shot examples. A generic prompt that works reasonably well across all categories will be outperformed by a category-specific prompt (apparel, electronics, home goods, etc.) by 5–8 percentage points. The investment in building a library of 50–100 category-specific prompt templates pays for itself almost immediately.
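A sketch of how Architectures 3 and 4 compose, assuming a simple template: schema, category-specific few-shot examples (including a deliberate null), and the instruction to quote the supporting span. The wording and example data are assumptions, not the article's exact prompts:

```python
import json

def build_extraction_prompt(category, schema, few_shot, product_text):
    """Assemble a schema-constrained, few-shot, chain-of-thought prompt."""
    examples = "\n\n".join(
        f"Text: {ex['text']}\nExtraction: {json.dumps(ex['output'])}"
        for ex in few_shot
    )
    return (
        f"You extract {category} attributes into JSON matching this schema:\n"
        f"{json.dumps(schema)}\n"
        "For every attribute, first quote the text span that supports it. "
        "If no supporting span exists, return null for that attribute.\n\n"
        f"{examples}\n\nText: {product_text}\nExtraction:"
    )

schema = {"color": "string", "size": "enum[S,M,L,XL]", "material": "string"}
few_shot = [
    {"text": "Red cotton tee, size M",
     "output": {"color": "red", "size": "M", "material": "cotton"}},
    # A null example teaches the model that "not found" is a valid answer.
    {"text": "Classic tee, size S",
     "output": {"color": None, "size": "S", "material": None}},
]
prompt = build_extraction_prompt("apparel", schema, few_shot, "Blue shirt mens L cotton")
print(prompt)
```

Swapping in a different `few_shot` list per category is what turns this into the 50–100 template library described above.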


Taxonomy Mapping and Normalization

Taxonomy is the skeleton of a catalog. When it is broken, every organ fails.

A typical mid-market e-commerce platform with 200,000+ SKUs has a taxonomy that evolved through geological layers of decisions. The original taxonomy was created by someone who left four years ago. New categories were added by merchandisers who did not consult the existing structure. Supplier feeds imported with their own category trees, sometimes mapped, often not. The result is a classification system where "Women's Tops > T-Shirts" and "Ladies > Tees" and "Apparel > Women > Casual Tops" all contain the same type of product, where "Electronics > Accessories" contains both phone cases and HDMI cables, and where 15% of products sit in a catch-all "Other" category because nobody knew where to put them.

LLMs are surprisingly good at taxonomy mapping for two reasons. First, they have seen thousands of product taxonomies during training — Google Product Taxonomy, Amazon Browse Nodes, UNSPSC, eBay categories — and can map between them. Second, they can infer correct categorization from product attributes even when the product text is ambiguous.

The task structure is: given a product with its current (possibly wrong) category and its attributes, map it to the correct node in a target taxonomy. The target taxonomy is provided as a hierarchical structure in the prompt.

For catalogs under 5,000 taxonomy nodes, we found that providing the full taxonomy tree in the prompt context works well with GPT-4 class models. For larger taxonomies, a two-stage approach is necessary: first classify into a top-level category (20–50 options), then classify into the subcategory within that branch.
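The two-stage routing can be sketched as follows; `classify` stands in for an LLM call and is replaced here by a crude word-overlap stub, and the tiny taxonomy is invented for illustration:

```python
# Two-stage taxonomy mapping: stage 1 picks a top-level branch from a small
# option set, stage 2 classifies within that branch only, so each prompt
# carries just a slice of the full tree.

TAXONOMY = {
    "Apparel": ["Apparel > Women > Tops > T-Shirts", "Apparel > Men > Shirts"],
    "Electronics": ["Electronics > Cables > HDMI",
                    "Electronics > Phone Accessories > Cases"],
}

def classify(text, options):
    """Stub for an LLM classification call: picks the option sharing the
    most words with the product text."""
    words = set(text.lower().split())
    return max(options,
               key=lambda o: len(words & set(o.lower().replace(">", " ").split())))

def map_to_taxonomy(product_text):
    top = classify(product_text, list(TAXONOMY))  # stage 1: 20-50 branches
    return classify(product_text, TAXONOMY[top])  # stage 2: leaves of that branch

print(map_to_taxonomy("hdmi cable 2m electronics"))
```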

The interesting finding is that GPT-4 with few-shot prompting and full taxonomy context (91% accuracy) approaches manual classification accuracy (95%) at a fraction of the cost. The remaining 4-point gap is concentrated in genuinely ambiguous cases — products that could reasonably belong to multiple categories.

The human-in-the-loop approach closes this gap efficiently. The LLM classifies all products and flags those where its confidence is below a threshold (typically the bottom 15–20% by confidence score). Human reviewers handle only the flagged items. This produces 97% accuracy while reducing manual effort by 80%.

Normalization is the second half of the taxonomy problem. Even within a correct category, attribute values are chaotic. "Red," "RED," "Crimson," "Dark Red," "Cherry," "Brick" — all might appear in a color field. "100% Cotton," "Pure Cotton," "Cotton 100%," "Cttn" — all mean the same thing.

LLMs handle normalization well because it is fundamentally a semantic equivalence task. Given a controlled vocabulary (the allowed values for each attribute), the model maps observed values to the closest canonical form. We found 94% accuracy on value normalization with a simple prompt: "Map the following value to the closest match in the allowed values list. If no match exists, return 'UNMAPPED'."
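A cheap exact/alias pre-pass catches much of this before any LLM call, leaving only genuinely fuzzy values ("Cttn", "Brick") for the prompt above; the alias table here is illustrative:

```python
# Deterministic normalization baseline: exact match against the controlled
# vocabulary, then an alias lookup, then UNMAPPED for anything that needs
# semantic matching by the LLM.

CANONICAL_COLORS = ["red", "blue", "green", "black"]
ALIASES = {"crimson": "red", "cherry": "red", "dark red": "red", "navy": "blue"}

def normalize(value, canonical, aliases):
    v = value.strip().lower()
    if v in canonical:
        return v
    return aliases.get(v, "UNMAPPED")

assert normalize("RED", CANONICAL_COLORS, ALIASES) == "red"
assert normalize("Cherry", CANONICAL_COLORS, ALIASES) == "red"
assert normalize("Brick", CANONICAL_COLORS, ALIASES) == "UNMAPPED"
```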


Automated SEO Generation

Product page SEO is a scale problem. A catalog with 200,000 SKUs needs 200,000 unique meta titles, 200,000 meta descriptions, and ideally 200,000 product descriptions written for both human readers and search engines. At current content writing rates ($5–$15 per product for basic SEO copy), that is $1–$3 million in writing costs alone.

LLMs reduce this cost by 85–95%, but the quality question is real and worth examining honestly.

We tested LLM-generated SEO content against human-written content and original supplier descriptions across three metrics: search ranking (average position change over 90 days), click-through rate from SERPs, and on-page conversion rate.

The data tells a clear story. LLM-generated SEO content dramatically outperforms supplier originals — it is, after all, hard to perform worse than "Blue shirt mens L cotton good quality fast shipping." Human-written content still wins on absolute performance by 15–25% across metrics. But the cost differential is 280x.

The sweet spot for most catalogs is a hybrid "LLM + Human Edit" approach. Generate SEO content with an LLM. Have a human editor review and refine the top 10–20% of SKUs by revenue. Accept the LLM output as-is for the long tail. This produces 90–95% of human-written quality at 14% of the cost.

The prompt architecture for SEO generation differs from extraction. Here, the model is generating, not extracting. The prompt must include:

  • Product attributes (structured data from the extraction phase)
  • Target keyword(s) for the product
  • Character limits for meta title (55–60 chars) and meta description (150–160 chars)
  • Brand voice guidelines (tone, forbidden terms, required elements)
  • Category-specific templates that ensure consistency

The keyword integration is where most naive approaches fail. Simply telling the model to "include the keyword" produces awkward, keyword-stuffed content that reads like it was written by a search engine. The better approach is to provide the keyword as a natural language constraint: "The primary search term customers use to find this product is [keyword]. Write a description that a person searching for this term would find helpful."
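A post-generation validator for the constraints listed above is worth having regardless of how the prompt is phrased; this sketch enforces only the hard upper bounds (the preferred ranges of 55–60 and 150–160 characters are targets, not hard failures), and the sample content is invented:

```python
# Validate generated SEO metadata against the character limits, keyword
# presence, and brand-voice constraints described above.

def validate_seo(meta_title, meta_description, keyword, forbidden=()):
    issues = []
    if not 1 <= len(meta_title) <= 60:
        issues.append("meta title outside 1-60 chars")
    if not 1 <= len(meta_description) <= 160:
        issues.append("meta description outside 1-160 chars")
    if keyword.lower() not in (meta_title + " " + meta_description).lower():
        issues.append("target keyword missing")
    for term in forbidden:
        if term.lower() in meta_description.lower():
            issues.append(f"forbidden term: {term}")
    return issues

issues = validate_seo(
    "Men's Slim Fit Cotton Dress Shirt - Blue",
    "A slim fit cotton dress shirt in classic blue, with a spread collar "
    "and machine-washable fabric.",
    keyword="cotton dress shirt",
    forbidden=("cheap",),
)
assert issues == []
```

Failed items loop back through generation with the issue list appended to the prompt.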


Visual Attribute Extraction: Image-to-Text

Product images contain attribute information that product text often lacks. A photograph of a dress reveals its neckline, sleeve length, pattern, silhouette, and closure type — attributes that the supplier description frequently omits.

Multimodal LLMs (GPT-4V, Claude's vision capabilities, Gemini) can extract these visual attributes with meaningful accuracy. The approach is conceptually simple: provide the product image along with the attribute schema and ask the model to extract visible attributes.

The accuracy profile is different from text extraction. Visual extraction excels at attributes that are visually obvious and struggles with attributes that require physical interaction or detailed inspection.

Text and visual extraction are complementary. Text extraction handles material, dimensions, and weight well — attributes that are stated, not shown. Visual extraction handles silhouette, neckline, and sleeve length well — attributes that are shown, not stated. A combined pipeline that merges both extraction channels achieves coverage rates 20–35 percentage points higher than either channel alone.

The practical challenge is cost. Vision API calls are 3–10x more expensive than text-only calls per SKU. For a catalog of 500,000 SKUs with an average of 4 images each, running all images through a vision model costs $15,000–$50,000 at current API prices. The economics work for high-value categories (fashion, furniture, jewelry) where visual attributes directly affect search and conversion. They do not yet work for commoditized categories where text descriptions are already attribute-rich.


The Hallucination Problem — When LLMs Invent Specifications

Here is the thing nobody putting LLMs into production catalog pipelines wants to talk about: the model will, with absolute confidence, tell you that a product is "machine washable" when the source text says nothing about care instructions. It will assign "stainless steel" to a product photographed in a way that makes aluminum look like steel. It will generate a weight of "2.3 lbs" for a product where no weight was ever mentioned, because 2.3 lbs is a statistically plausible weight for items in that category.

These are not edge cases. At a 2–4% hallucination rate across hundreds of thousands of SKUs, you are looking at 4,000–8,000 products with fabricated attributes in a 200,000 SKU catalog. Some of these fabrications are harmless — guessing "round neck" for a t-shirt that probably is round neck. Some are dangerous — stating voltage compatibility, weight capacity, or allergen information that is simply wrong.

The trust problem is asymmetric. A missing attribute is a known unknown — the customer sees a blank field and understands that the information is unavailable. A hallucinated attribute is an unknown unknown — it looks exactly like a verified fact. There is no visual distinction between an extracted attribute grounded in source text and a fabricated attribute generated from training data priors.

This asymmetry has legal implications. Product specifications that affect safety, compatibility, or regulatory compliance (weight limits, electrical ratings, material certifications, allergen declarations) cannot be generated by statistical inference. They must be verified. Any LLM pipeline that does not enforce this distinction is accumulating liability with every enriched SKU.

The solution is not to avoid LLMs. It is to build a quality assurance pipeline that treats extraction and generation as fundamentally different operations — and that never, under any circumstances, allows generated values for safety-critical attributes to enter production without human verification.


The Quality Assurance Pipeline

A production-grade catalog enrichment pipeline has five stages, and the LLM is only one of them.

Stage 1: Source Text Preprocessing Concatenate all available text for a product — title, description, bullet points, supplier notes, spec sheets. Normalize encoding. Remove HTML artifacts. Deduplicate repeated content. This stage is deterministic and boring. It is also where 15% of extraction errors originate, because garbage input produces garbage extraction.

Stage 2: LLM Extraction with Grounding Run the schema-constrained, few-shot, chain-of-thought extraction. For each extracted attribute, require the model to cite the source text span that supports the extraction. If no source span exists, the attribute must be returned as null, not guessed.

Stage 3: Confidence Scoring and Routing Every extracted attribute receives a confidence score based on three signals: the model's self-reported confidence, the semantic similarity between the cited source span and the extracted value, and a cross-validation check against other available data (e.g., does the extracted material match the product image? does the extracted brand match the brand field in the supplier feed?).

Products are routed into three buckets:

  • High confidence (>0.9): Attributes enter the catalog automatically.
  • Medium confidence (0.7–0.9): Attributes are flagged for batch human review.
  • Low confidence (<0.7): Attributes are discarded. The field remains empty.
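The three-bucket routing reduces to a pure function over the combined confidence score, with thresholds matching the text:

```python
# Stage 3 routing: high confidence auto-accepts, medium goes to batch
# human review, low is discarded and the field stays empty.

def route(confidence):
    if confidence > 0.9:
        return "auto_accept"   # enters the catalog automatically
    if confidence >= 0.7:
        return "human_review"  # batched for reviewers
    return "discard"           # attribute field remains empty

assert route(0.95) == "auto_accept"
assert route(0.80) == "human_review"
assert route(0.50) == "discard"
```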

Stage 4: Hallucination Detection A separate model (or a separate pass of the same model with a verification prompt) checks each extraction against the source text. The verification question is specific: "Is the claim that this product's [attribute] is [value] directly supported by the following text? Quote the supporting evidence or state that no evidence exists."

This adversarial verification catches 60–70% of hallucinations that survive Stage 2. Combined with the grounding requirement in Stage 2, the pipeline reduces hallucination rates from a naive 18% to under 1%.
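The verification question from Stage 4, rendered as a prompt template; the wording follows the text, while sending it to a model and parsing the answer is left out:

```python
# Build the adversarial verification prompt for one extracted attribute.

def verification_prompt(attribute, value, source_text):
    return (
        f"Is the claim that this product's {attribute} is {value!r} directly "
        "supported by the following text? Quote the supporting evidence or "
        f"state that no evidence exists.\n\nText: {source_text}"
    )

p = verification_prompt("material", "cotton", "Blue shirt mens L cotton")
print(p)
```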

Stage 5: Human Review for Critical Attributes Safety-critical attributes (electrical ratings, weight limits, material certifications, allergen declarations, age recommendations) are always routed to human review regardless of confidence score. No exceptions. The cost of a single product liability claim dwarfs the cost of manual verification for this attribute subset.


The Catalog Quality Score Framework

We need a way to measure catalog quality that is more precise than "it's bad" and more actionable than a single number. The Catalog Quality Score (CQS) framework measures quality across four dimensions, each scored 0–100, rolling up into a composite score.

Dimension 1: Completeness (Weight: 30%) What percentage of defined attributes are populated for each product? Measured per-category, because a t-shirt needs different attributes than a laptop. A product with 8 of 10 required attributes scores 80 on completeness.

Dimension 2: Accuracy (Weight: 30%) Of the populated attributes, what percentage are correct? Measured by sampling and manual verification. A product where 9 of 10 populated attributes are verified correct scores 90 on accuracy.

Dimension 3: Consistency (Weight: 20%) Do similar products use consistent attribute values? Are colors normalized? Are sizes standardized? Measured by entropy analysis within categories — high consistency means low entropy in attribute value distributions relative to a reference taxonomy.

Dimension 4: Richness (Weight: 20%) Beyond required attributes, how much additional product information exists? Description length, number of images, presence of specifications tables, customer-relevant details. Measured against category-specific benchmarks.
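The composite rolls up as a weighted sum of the four dimension scores, with the weights from the text; the example scores are illustrative:

```python
# Composite Catalog Quality Score from the four 0-100 dimension scores.

WEIGHTS = {"completeness": 0.30, "accuracy": 0.30,
           "consistency": 0.20, "richness": 0.20}

def catalog_quality_score(scores):
    assert set(scores) == set(WEIGHTS), "all four dimensions required"
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

cqs = catalog_quality_score(
    {"completeness": 80, "accuracy": 90, "consistency": 87, "richness": 60}
)
print(round(cqs, 1))  # → 80.4
```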

The Consistency dimension is where LLM enrichment shows its most dramatic improvement. Going from a consistency score of 41 to 87 reflects the normalization of attribute values across hundreds of suppliers into a single controlled vocabulary. A human team doing this work would produce similar results — eventually — but the LLM pipeline completed the normalization of 340,000 SKUs in 72 hours. The human team estimated 14 weeks.

The CQS framework serves a second purpose: it creates a measurable target for ongoing catalog maintenance. When new supplier feeds arrive, their CQS is calculated automatically. Feeds scoring below threshold are routed through the enrichment pipeline before entering the live catalog. This prevents the gradual decay that afflicts every catalog that is cleaned once and then abandoned.


A/B Testing Enriched vs. Original Pages

Theory is pleasant. Data is better.

We ran a controlled A/B test with a home goods retailer: 50,000 product pages with LLM-enriched content (complete attributes, rewritten descriptions, generated SEO metadata) against 50,000 pages with original supplier content. The test ran for 8 weeks with balanced traffic allocation.

Every metric moved in the expected direction. Organic impressions increased 38% — the single largest effect — because enriched product data populated Google Shopping attributes and structured data markup that the original listings lacked entirely. Click-through improved 22% because the LLM-generated meta descriptions were actually written for human readers rather than being truncated supplier copy. Conversion rate improved 12%. Return rate decreased 13%, consistent with the hypothesis that better product descriptions reduce expectation mismatches.

The revenue impact, combining impression lift, CTR lift, and conversion lift, was a 28% increase in revenue per SKU for the enriched cohort. On a 50,000-SKU catalog, at an average of $45 AOV and baseline 2.1% conversion, the projected annualized revenue lift was $3.2 million. The total enrichment cost was $47,000.

These numbers are specific to one retailer, one category, and one baseline quality level. A catalog that starts with better data will see smaller lifts. A catalog that starts worse will see larger ones. But the direction is consistent across every test we have seen or read about. Better catalog data produces more revenue. The question is whether the lift exceeds the cost.

In every case we have measured, it does. By a wide margin.


Multi-Language Catalog Generation

Cross-border e-commerce adds a dimension that manual catalog management cannot scale: every product description, every meta title, every attribute value must exist in every target language. A catalog selling into 8 markets needs 8x the content.

The traditional approach is to write once in the primary language and translate. Translation costs $0.05–$0.15 per word for professional human translation. A 150-word product description translated into 7 additional languages costs $52–$157. Multiply by 200,000 SKUs. The total is $10–$31 million. Nobody pays this. Instead, they run everything through machine translation, accept the quality loss, and move on.

LLMs offer a middle path. Instead of translating descriptions, we regenerate them. The prompt takes the structured product attributes (language-independent data) and generates a native-sounding description in the target language, following that market's conventions for product copy.

This distinction — regeneration vs. translation — matters more than it appears. A German product description should not read like a translated English description. German e-commerce copy tends to be more specification-heavy and less emotionally driven than American copy. Japanese product descriptions follow different structural conventions. Brazilian Portuguese has register expectations that differ from European Portuguese. An LLM generating from structured data can adapt to these conventions in ways that translation cannot.

The accuracy concern is amplified in multi-language contexts. A hallucinated attribute in one language is bad. The same hallucination replicated across eight languages is eight times the liability. The QA pipeline described earlier must run independently for each language variant, which multiplies verification costs.


Scaling to Millions of SKUs

A catalog enrichment pipeline that works beautifully on 1,000 test SKUs can collapse spectacularly at 1,000,000. The failure modes are not technical in the traditional sense — the API does not crash, the code does not break. They are economic and operational.

Batching and Throughput

At 1,450 tokens per SKU (the Schema + Few-Shot + CoT architecture) and GPT-4 processing speeds, a single API thread processes roughly 2,000 SKUs per hour. A 1,000,000 SKU catalog takes 500 hours — 21 days — on a single thread. With 50 concurrent threads (typical rate limit for enterprise API agreements), that drops to 10 hours. Manageable, but not the "run it overnight" that product managers imagine.
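The throughput arithmetic above is worth keeping as a function, since it gets re-run for every new feed:

```python
# Wall-clock hours to enrich a catalog, given per-thread throughput and
# concurrency. The 2,000 SKUs/hour figure is the single-thread rate
# cited above for the Schema + Few-Shot + CoT architecture.

def enrichment_hours(n_skus, skus_per_hour_per_thread=2_000, threads=1):
    return n_skus / (skus_per_hour_per_thread * threads)

assert enrichment_hours(1_000_000, threads=1) == 500   # ~21 days single-threaded
assert enrichment_hours(1_000_000, threads=50) == 10   # 50 concurrent threads
```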

Batch API endpoints (OpenAI's Batch API, for example) reduce cost by 50% with 24-hour turnaround. For initial enrichment of a full catalog, batch processing is almost always the right choice. For real-time enrichment of new products as they enter the catalog, synchronous API calls are necessary.

Caching and Deduplication

A surprising amount of catalog enrichment work is redundant. Products share descriptions (the same supplier boilerplate on 500 SKUs), share attributes (all products from Brand X use the same material descriptions), and share category mappings. An embedding-based deduplication layer that identifies near-duplicate product texts and reuses extraction results can reduce API calls by 30–45%, depending on the catalog.

The cache key is a hash of the normalized product text + the extraction schema version. When either changes, the cache invalidates. This is simple to implement and the single highest-ROI optimization in a production pipeline.
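The cache key described above is a few lines; the normalization step here is a minimal stand-in for whatever Stage 1 preprocessing produces:

```python
import hashlib

def cache_key(product_text, schema_version):
    """Hash of normalized product text + extraction schema version.
    A change to either invalidates the cached extraction."""
    normalized = " ".join(product_text.lower().split())
    payload = f"{schema_version}:{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

k1 = cache_key("Blue shirt  mens L cotton", "v3")
k2 = cache_key("blue shirt mens l cotton", "v3")  # identical after normalization
k3 = cache_key("blue shirt mens l cotton", "v4")  # schema bump invalidates
assert k1 == k2 and k1 != k3
```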

Cost Management

The cost equation at scale requires careful model selection. Not every SKU needs GPT-4 class extraction.

The cost-per-enrichment for the full pipeline can be expressed as:

C_{enrich} = \frac{T_{tokens} \cdot P_{model}}{N_{SKU}} + C_{QA} + C_{human} \cdot r_{review}

where T_{tokens} is total tokens consumed, P_{model} is the price per token, N_{SKU} is the number of SKUs, C_{QA} is the automated verification cost, C_{human} is the per-SKU human review cost, and r_{review} is the fraction routed to human review.
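Evaluating the formula with single-tier illustrative numbers (the token price, QA cost, review cost, and review fraction below are assumptions, not the article's tiered figures):

```python
# Per-SKU enrichment cost: amortized token spend plus automated QA plus
# the expected human-review cost.

def cost_per_sku(total_tokens, price_per_token, n_sku,
                 qa_cost, human_cost, review_rate):
    return total_tokens * price_per_token / n_sku + qa_cost + human_cost * review_rate

c = cost_per_sku(
    total_tokens=1_450 * 1_000_000,  # 1,450 tokens/SKU across 1M SKUs
    price_per_token=5e-6,            # assumed blended price
    n_sku=1_000_000,
    qa_cost=0.001,                   # assumed automated verification per SKU
    human_cost=0.50,                 # assumed per-SKU human review cost
    review_rate=0.15,                # fraction routed to review
)
print(f"${c:.4f} per SKU")
```

Note how the human-review term dominates: routing 15% of SKUs to a $0.50 review costs far more per SKU than the tokens do, which is exactly why tiered model selection and tight routing thresholds matter at scale.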

The tiered approach reduces total enrichment cost for a million-SKU catalog to approximately $15,000 — less than the monthly salary of one data entry specialist. This is the number that changes the economic calculus of catalog maintenance from "periodic project" to "continuous process."

The caveat: these are API costs only. The engineering cost of building, maintaining, and monitoring the pipeline is significant. A production-quality enrichment pipeline requires 2–3 months of engineering work to build and one engineer at roughly 25% capacity to maintain. Factor that into the total cost of ownership before comparing to manual alternatives.


Cost Analysis: LLM Enrichment vs. Manual Data Entry

The full cost comparison requires honesty about what each approach actually costs, including the costs that do not appear on invoices.

Year 1 total cost: $148K for the LLM pipeline vs. $1.18M for manual enrichment. The LLM pipeline is 87% cheaper. But the comparison is misleading if you stop there.

The LLM pipeline has higher fixed costs (engineering and tooling) and dramatically lower marginal costs. This means it becomes more advantageous as catalog size increases and as enrichment becomes a recurring rather than one-time activity. By Year 2, the gap widens further — the LLM pipeline's annual maintenance cost is $23,000 (API costs + monitoring) while manual maintenance runs $141,000 (data entry staff + QA).

The LLM pipeline also has a speed advantage that is difficult to price but operationally critical. A manual team enriching 200,000 SKUs takes 4–6 months. The LLM pipeline takes 2–3 weeks including QA. When a new product line launches with 10,000 SKUs, the LLM pipeline enriches them in 48 hours. The manual team needs 6–8 weeks. The revenue opportunity cost of delayed enrichment — products sitting in the catalog with sparse data, invisible to search, excluded from filters — is often larger than the direct cost of enrichment itself.

Where manual wins: edge cases, subjective attributes (is this dress "casual" or "business casual"?), and categories where accuracy requirements are extreme (medical devices, electrical components with safety certifications). A production pipeline uses LLMs for the 85% of work that is structured and repetitive, and humans for the 15% that requires judgment, domain expertise, or legal accountability.


References

  1. Baymard Institute. "E-Commerce Search and UX: Product Data Quality Research." 2023.

  2. Salsify. "Consumer Research Report: Product Content Impact on Purchase Decisions." 2024.

  3. Google. "Merchant Center Data Quality Guidelines and Best Practices." Google Merchant Center Documentation, 2024.

  4. Narvar. "State of Returns: Consumer Expectations and the Role of Product Information." 2023.

  5. Wei, J., Wang, X., Schuurmans, D., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022.

  6. OpenAI. "GPT-4V(ision) Technical Report: Multimodal Capabilities." 2023.

  7. Ji, Z., Lee, N., Frieske, R., et al. "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 2023.

  8. Petroni, F., Rocktäschel, T., Riedel, S., et al. "Language Models as Knowledge Bases?" EMNLP, 2019.

  9. UNSPSC. "United Nations Standard Products and Services Code: Taxonomy Framework." 2024.

  10. Shankar, V., Kannan, P.K., and Balasubramanian, S. "Cross-Border E-Commerce: Product Content Localization Challenges." Journal of International Marketing, 2023.

  11. Brown, T., Mann, B., Ryder, N., et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.

  12. Zheng, L., Chiang, W., Sheng, Y., et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS, 2023.