Enrichment pipeline

Last updated: 14 April 2026
Audience: anyone debugging "what does Clione actually do with my product?"


TL;DR

Clione transforms a raw product (title, description, price, vendor, images) into a multi-format semantic representation that:

  1. Lets LLMs (ChatGPT, Perplexity, Gemini) understand the product accurately
  2. Lets search engines crawl rich structured data (schema.org)
  3. Lets vector search retrieve products by intent (semantic similarity)
  4. Lets storefronts inject FAQ widgets and JSON-LD signals automatically

It produces 6 categories of artifacts from a single LLM call + an embedding call:

| # | Artifact | Where it lives | Who consumes it |
|---|----------|----------------|-----------------|
| 1 | Semantic text (core_identity, reasoning) | DB column on product_embeddings | LLMs, internal search, dashboard |
| 2 | Quality scores (durability, quality_perception, value_for_money, price_positioning) | DB column synthetic_properties | Dashboard, scoring, ranking |
| 3 | Search keywords (bilingual array) | DB column search_keywords | Full-text search, tag clouds |
| 4 | Vector embedding (1536-dim float array) | pgvector column | Cosine similarity search |
| 5 | JSON-LD (schema.org/Product) | Generated on-the-fly at /products/:id.jsonld | Search engines, LLM crawlers |
| 6 | SEO meta signals (description, keywords, canonical) | Generated on-the-fly at /products/:id.meta | SEO tools, headless storefronts |

Two tenant-level artifacts are also served: /.well-known/llms.txt (catalog manifest) and /.well-known/ai-offers.json (promotions context).


Pipeline (end to end)

┌─────────────────────────────────────────────────────────┐
│ Platform (Shopify, BigCommerce, etc.)                   │
│ Sends raw product:                                      │
│   title, description, price, vendor, images, categories │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 1 · Adapter normalizes platform-specific shape     │
│   packages/platform-shopify/src/adapters/product.ts     │
│   → RawProduct                                          │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 2 · LLM enrichment (Normalizer)                    │
│   packages/semantic-core/src/services/normalizer.ts     │
│ Calls llmClient.enrich() with the prompt at:            │
│   apps/products-api/src/lib/promptTemplates.ts          │
│                                                         │
│ Production model: gpt-4o-mini                           │
│ Dev fallback: Ollama (llama3.2:1b for the LLM,          │
│   nomic-embed-text for the embedding)                   │
│                                                         │
│ Returns:                                                │
│   core_identity: "1-2 sentence semantic description"    │
│   reasoning: { recommended_for, decision_logic,         │
│     objection_handler, seasonal_relevance }             │
│   synthetic_properties: { durability_score,             │
│     quality_perception, value_for_money,                │
│     price_positioning, typical_competitors,             │
│     search_keywords (bilingual EN+ES) }                 │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 3 · Embedding text assembly                        │
│   packages/storage/src/embeddings/embedding-text.ts     │
│ Concatenates with " | " separator:                      │
│   core_identity + recommended_for + decision_logic      │
│   + title + vendor + search_keywords                    │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 4 · Vector embedding generation                    │
│ Production: OpenAI text-embedding-3-small (1536 dims)   │
│ Dev: Ollama nomic-embed-text (768 dims)                 │
│ Result (production): float[1536], L2-normalized         │
│   (cosine ready)                                        │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 5 · Persistence + version snapshot                 │
│   packages/storage/src/vector/pgvector.ts               │
│ Upserts into product_embeddings table (single row)      │
│ Snapshots prior version into                            │
│   product_enrichment_versions (last 3 retained)         │
└─────────────────────────────────────────────────────────┘
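Step 3 can be sketched roughly as follows. This is a minimal illustration, not the contents of embedding-text.ts: the EmbeddingSource type and the buildEmbeddingText name are hypothetical, and it assumes that every field is a plain string, that blank fields are skipped, and that keywords are comma-joined before concatenation.

```typescript
// Hypothetical input shape; field names follow the pipeline diagram.
interface EmbeddingSource {
  core_identity: string;
  recommended_for: string;
  decision_logic: string;
  title: string;
  vendor: string;
  search_keywords: string[];
}

const SEPARATOR = " | ";

// Joins the fields in the documented order with " | ".
// Assumption: blank fields are dropped and keywords are comma-joined.
function buildEmbeddingText(src: EmbeddingSource): string {
  const parts: string[] = [
    src.core_identity,
    src.recommended_for,
    src.decision_logic,
    src.title,
    src.vendor,
    src.search_keywords.join(", "),
  ];
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(SEPARATOR);
}
```

Putting the LLM-generated fields first is the order the diagram gives; whether that order affects retrieval quality is not something this sketch claims.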

Storage schema

product_embeddings

| Column | Type | Content |
|--------|------|---------|
| id | TEXT | platform product ID |
| tenant_id | TEXT | multi-tenant isolation |
| tenant_platform_id | TEXT | per-store isolation |
| platform | TEXT | shopify / bigcommerce / etc. |
| core_identity | TEXT | LLM-generated semantic description |
| reasoning | JSONB | { recommended_for, decision_logic, objection_handler, seasonal_relevance } |
| synthetic_properties | JSONB | { durability_score, quality_perception, value_for_money, price_positioning, typical_competitors, search_keywords } |
| technical_details | JSONB | { original_title, vendor, price, currency, images } |
| metadata | JSONB | { transformed_at, llm_model, transformation_version, manually_edited?, last_manual_edit_at?, enrichment_pending? } |
| search_keywords | TEXT[] | denormalized array for full-text search |
| embedding | vector(1536) | the L2-normalized vector (cosine ready) |
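Because the stored vectors are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below shows that relationship in isolation; the function names are illustrative, and in the real system the comparison is presumably done by pgvector rather than in application code.

```typescript
// Scales a vector to unit length (L2 norm = 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return v.map((x) => x / norm);
}

// For unit-length vectors the dot product equals the cosine similarity.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

A vector compared with itself scores 1, and orthogonal vectors score 0.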

product_enrichment_versions (versioning, last 3 retained per product)

Each row is a snapshot of core_identity, reasoning, synthetic_properties, search_keywords, and metadata, plus version, reason ("manual re-enrich" / "bulk re-enrich pending" / "rollback to vN" / "initial sync"), created_by, and created_at. The History tab and the rollback endpoint read from this table.
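The "last 3 retained" rule can be illustrated with an in-memory list standing in for the table. This is a sketch under assumptions: the EnrichmentVersion type, the pushVersion helper, and the monotonically increasing version number are hypothetical and not taken from the storage package.

```typescript
// Hypothetical shape of one row in product_enrichment_versions.
interface EnrichmentVersion<T> {
  version: number;
  reason: string; // e.g. "manual re-enrich", "rollback to v2"
  created_by: string;
  created_at: string; // ISO 8601 timestamp
  snapshot: T;
}

const MAX_VERSIONS = 3;

// Appends a snapshot of the outgoing enrichment and keeps the newest three.
function pushVersion<T>(
  history: EnrichmentVersion<T>[],
  snapshot: T,
  reason: string,
  createdBy: string,
  now: Date = new Date(),
): EnrichmentVersion<T>[] {
  const nextVersion =
    history.reduce((max, v) => Math.max(max, v.version), 0) + 1;
  const entry: EnrichmentVersion<T> = {
    version: nextVersion,
    reason,
    created_by: createdBy,
    created_at: now.toISOString(),
    snapshot,
  };
  return [...history, entry]
    .sort((a, b) => a.version - b.version)
    .slice(-MAX_VERSIONS);
}
```

After a fourth snapshot, version 1 drops out and only versions 2 to 4 remain available for rollback.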


Endpoints exposing the artifacts

Per-product

| Endpoint | Content-Type | Purpose |
|----------|--------------|---------|
| GET /api/{platform}/products/:id | application/json | Full SemanticProduct payload (everything above) |
| GET /api/{platform}/products/:id.llm | application/json (X-Content-Format: llm-optimized) | LLM-optimized JSON: identity, reasoning, scores, market context, computed summary_for_ai text |
| GET /api/{platform}/products/:id.jsonld | application/ld+json | schema.org/Product with additionalProperty[] carrying scores as PropertyValue |
| GET /api/{platform}/products/:id.meta | application/json | SEO signals: meta_description (160/300/full), canonical URL, keywords[] |
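To make the .jsonld shape concrete, here is a hedged sketch of a schema.org/Product document that carries scores in additionalProperty as PropertyValue entries. The JsonLdInput type and the toProductJsonLd name are hypothetical; which other Product fields the real endpoint emits, and whether scores are numbers or strings, are assumptions.

```typescript
// Hypothetical input; the real SemanticProduct payload is richer.
interface JsonLdInput {
  title: string;
  coreIdentity: string;
  vendor: string;
  price: number;
  currency: string;
  // Assumption: each score is either a number or a short label.
  scores: Record<string, string | number>;
}

// Builds a schema.org/Product object with scores as PropertyValue items.
function toProductJsonLd(p: JsonLdInput) {
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: p.title,
    description: p.coreIdentity,
    brand: { "@type": "Brand", name: p.vendor },
    offers: { "@type": "Offer", price: p.price, priceCurrency: p.currency },
    additionalProperty: Object.keys(p.scores).map((name) => ({
      "@type": "PropertyValue",
      name,
      value: p.scores[name],
    })),
  };
}
```

Serializing the result with JSON.stringify gives a body suitable for an application/ld+json response.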

Tenant / domain level

| Endpoint | Format | Purpose |
|----------|--------|---------|
| GET /.well-known/llms.txt | text/plain (Markdown per llms.txt spec) | Catalog manifest: store name, product count, enrichment summary, API endpoints, usage guide |
| GET /.well-known/ai-offers.json | application/json | Promotions context (managed by promotions-engine) |

Storefront embed

embed.js is a single-file loader served at /embed.js that:

  • Auto-detects entity type/ID from URL patterns + OpenGraph meta tags
  • Fetches FAQ data via /api/v1/public/faq/render
  • Injects (a) a visible HTML accordion and (b) an invisible JSON-LD FAQPage schema

Requests are origin-validated against the API key's allowed_domains whitelist.
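The origin check might look something like the sketch below. The isOriginAllowed and hostnameOf names are hypothetical, and the matching rule is an assumption: allowed_domains is treated as a list of bare hostnames compared exactly and case-insensitively, with no wildcard or subdomain handling.

```typescript
// Extracts the lowercase hostname from an Origin header value,
// or returns null when the value is not a valid URL.
function hostnameOf(origin: string): string | null {
  try {
    return new URL(origin).hostname.toLowerCase();
  } catch {
    return null;
  }
}

// Returns true when the request's Origin matches an allowed domain.
// Assumption: exact hostname match only; no wildcards or subdomains.
function isOriginAllowed(
  originHeader: string | undefined,
  allowedDomains: string[],
): boolean {
  if (!originHeader) {
    return false;
  }
  const hostname = hostnameOf(originHeader);
  if (hostname === null) {
    return false;
  }
  return allowedDomains.some(
    (domain) => domain.trim().toLowerCase() === hostname,
  );
}
```

Rejecting missing or malformed Origin headers by default keeps the failure mode closed.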


Quality Scores — important limitations

Read this before showing scores to a customer.

The 5 quality fields (durability_score, quality_perception, value_for_money, price_positioning, typical_competitors) are 100% LLM-inferred today.

What the LLM uses

  • Title
  • Description
  • Price + currency
  • Vendor
  • Categories
  • enrichment_hints from the Wizard if filled

What the LLM does NOT use

  • Real reviews (Google / Trustpilot / internal)
  • Returns rate
  • Sales data / conversion rate
  • External catalog comparatives
  • Manufacturer certifications (ISO, MIL-STD, GOTS, etc.)
  • Lifecycle test results
  • Sustainability databases

Implications

  1. Scores are synthetic opinion, not verified facts.
  2. The same product can produce different scores between runs if temperature > 0.
  3. Scores are useful in relative terms: the LLM has common sense (Hermès → luxury, Primark → budget), so they help differentiate products within the same catalog, not against external benchmarks.
  4. typical_competitors has the same limitation: the entries are plausible inferences, not verified market data.

Grounding — shipped layers

  • Visual disclaimer badge on every Quality Scores block, resolving from metadata.scores_evidence_level: 🟢 Data-grounded · 🔵 AI + owner hints · 🟡 AI-inferred
  • Enrichment Wizard has a dedicated Evidence section with 5 questions (warranty_period, certifications, returns_rate, reviews_rating, manufacturing_origin). Filling 3 or more upgrades the badge to green (🟢 Data-grounded); filling 1-2 upgrades it to blue (🔵 AI + owner hints). The logic lives in lib/evidence-grounding.ts and applies to both the Shopify and BigCommerce enrich endpoints.
  • Owner-provided Typical Competitors structured as [{ brand, model, ref_price?, notes? }] in the product's Edit tab. When any row is saved, metadata.competitors_source = 'owner' and the LLM stops regenerating the list on re-enrichment (lib/preserve-owner-data.ts).
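The badge thresholds described above can be sketched as a small resolver. This is not the code in lib/evidence-grounding.ts: the EvidenceLevel string values and the resolveEvidenceLevel name are hypothetical, and it assumes an answer counts as filled when it is a non-blank string.

```typescript
// The five Evidence questions from the Enrichment Wizard.
const EVIDENCE_FIELDS = [
  "warranty_period",
  "certifications",
  "returns_rate",
  "reviews_rating",
  "manufacturing_origin",
] as const;

type EvidenceField = (typeof EVIDENCE_FIELDS)[number];
type EvidenceAnswers = Partial<Record<EvidenceField, string>>;

// Hypothetical values for metadata.scores_evidence_level.
type EvidenceLevel = "data-grounded" | "ai-plus-owner-hints" | "ai-inferred";

// 3+ answers: green badge; 1-2 answers: owner-hints badge; none: AI-inferred.
function resolveEvidenceLevel(answers: EvidenceAnswers): EvidenceLevel {
  const filled = EVIDENCE_FIELDS.filter(
    (field) => (answers[field] ?? "").trim().length > 0,
  ).length;
  if (filled >= 3) {
    return "data-grounded";
  }
  if (filled >= 1) {
    return "ai-plus-owner-hints";
  }
  return "ai-inferred";
}
```

Counting only non-blank answers avoids upgrading the badge when the owner opens the section but leaves the fields empty.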

Roadmap remaining

Two pieces remain on the roadmap: external data connectors (returns-rate webhooks, the Judge.me and Yotpo reviews APIs, GA conversion rate) and the per-score provenance schema (synthetic_properties.scores_provenance) with composite resolution. The full design is in the internal docs/SCORES_GROUNDING_ROADMAP.md.

Until external connectors land, assume scores reflect the LLM's reading of the catalog text, sharpened by whatever evidence and competitors you supplied through the wizard and the Edit tab. They are not ground truth, but they are better than pure synthesis.


Embedding model swap warning

The embedding column is fixed at vector(1536) to match text-embedding-3-small. Switching to a different embedding model with a different dimension (e.g. Ollama's nomic-embed-text at 768 dims) requires a full re-index — the column type would have to change and every existing embedding would need re-generation. It is not a runtime toggle.
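A guard like the following would surface a dimension mismatch before it reaches the database. It is a sketch only: the toVectorLiteral name is hypothetical and nothing here is taken from pgvector.ts. The bracketed, comma-separated text form is the input format pgvector accepts for vector values.

```typescript
// Must match the column definition vector(1536).
const EMBEDDING_DIMS = 1536;

// Validates the dimension, then formats the embedding as a pgvector
// text literal such as "[0.1,0.2,0.3]".
function toVectorLiteral(
  embedding: number[],
  expected: number = EMBEDDING_DIMS,
): string {
  if (embedding.length !== expected) {
    throw new Error(
      `embedding has ${embedding.length} dims but the column expects ` +
        `${expected}; changing models requires a full re-index`,
    );
  }
  return "[" + embedding.join(",") + "]";
}
```

With this in place, a 768-dim nomic-embed-text vector sent to the 1536-dim column fails with a clear message instead of a database error.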


When to re-enrich

| Scenario | Action |
|----------|--------|
| Initial sync from platform | Auto-enriched as part of import |
| Catalog title/description changed at the source | Re-sync triggers re-enrichment |
| Owner edits AI fields manually in dashboard | Saved with metadata.manually_edited = true. Future re-enrich shows an extra "OVERWRITE" confirmation to prevent losing manual work |
| Owner runs Enrichment Wizard with new context | Manual re-enrich with hints embedded in the prompt |
| LLM model upgrade (e.g. gpt-4o-mini → next version) | Bulk re-enrich recommended; old versions retained for 3 generations via History tab |

Re-enrichment is destructive but versioned: the previous output is snapshotted into product_enrichment_versions and can be restored from the History tab. Retention: last 3 versions per product.
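The manual-edit safeguard from the table can be expressed as a small pre-flight check. The checkReEnrich name and the ReEnrichDecision values are hypothetical; only the metadata.manually_edited flag comes from the schema above.

```typescript
// Subset of the metadata JSONB column relevant to this check.
interface EnrichmentMetadata {
  manually_edited?: boolean;
}

type ReEnrichDecision = "proceed" | "confirm-overwrite";

// Asks for an explicit OVERWRITE confirmation before a re-enrich
// replaces fields the owner edited by hand.
function checkReEnrich(
  metadata: EnrichmentMetadata,
  overwriteConfirmed: boolean,
): ReEnrichDecision {
  if (metadata.manually_edited === true && !overwriteConfirmed) {
    return "confirm-overwrite";
  }
  return "proceed";
}
```

Products that were never edited by hand pass straight through, so bulk re-enrichment only stops on the rows where manual work could be lost.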


Why this matters

The whole design assumes agent commerce is the next discovery channel. ChatGPT shops Shopify, Perplexity recommends, Gemini compares. Stores whose catalogs read clearly to LLMs win; those that don't are invisible.

Clione's job is to make every product readable to those agents in every format they prefer — without the merchant having to think about it.

That's why we generate 6 artifacts, not just one vector.