Enrichment pipeline
Last updated: 14 April 2026
Audience: anyone debugging "what does Clione actually do with my product?"
TL;DR
Clione transforms a raw product (title, description, price, vendor, images) into a multi-format semantic representation that:
- Lets LLMs (ChatGPT, Perplexity, Gemini) understand the product accurately
- Lets search engines crawl rich structured data (schema.org)
- Lets vector search retrieve products by intent (semantic similarity)
- Lets storefronts inject FAQ widgets and JSON-LD signals automatically
It produces 6 categories of artifacts from a single LLM call + an embedding call:
| # | Artifact | Where it lives | Who consumes it |
|---|---|---|---|
| 1 | Semantic text (core_identity, reasoning) | DB column on product_embeddings | LLMs, internal search, dashboard |
| 2 | Quality scores (durability, quality_perception, value_for_money, price_positioning) | DB column synthetic_properties | Dashboard, scoring, ranking |
| 3 | Search keywords (bilingual array) | DB column search_keywords | Full-text search, tag clouds |
| 4 | Vector embedding (1536-dim float array) | pgvector column | Cosine similarity search |
| 5 | JSON-LD (schema.org/Product) | Generated on-the-fly at /products/:id.jsonld | Search engines, LLM crawlers |
| 6 | SEO meta signals (description, keywords, canonical) | Generated on-the-fly at /products/:id.meta | SEO tools, headless storefronts |
Plus tenant-level: /.well-known/llms.txt (catalog manifest) and /.well-known/ai-offers.json (promotions context).
Pipeline (end to end)
┌─────────────────┐
│ Platform (Shopify, BigCommerce, etc.)
│ Sends raw product:
│ title, description, price, vendor, images, categories
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STEP 1 · Adapter normalizes platform-specific shape │
│ packages/platform-shopify/src/adapters/product.ts │
│ → RawProduct │
└────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STEP 2 · LLM enrichment (Normalizer) │
│ packages/semantic-core/src/services/normalizer.ts │
│ Calls llmClient.enrich() with the prompt at: │
│ apps/products-api/src/lib/promptTemplates.ts │
│ │
│ Production model: gpt-4o-mini │
│ Dev fallback: Ollama (llama3.2:1b for the LLM, │
│ nomic-embed-text for the embedding) │
│ │
│ Returns: │
│ core_identity: "1-2 sentence semantic description" │
│ reasoning: { recommended_for, decision_logic, │
│ objection_handler, seasonal_relevance } │
│ synthetic_properties: { durability_score, │
│ quality_perception, value_for_money, │
│ price_positioning, typical_competitors, │
│ search_keywords (bilingual EN+ES) } │
└────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STEP 3 · Embedding text assembly │
│ packages/storage/src/embeddings/embedding-text.ts │
│ Concatenates with " | " separator: │
│ core_identity + recommended_for + decision_logic │
│ + title + vendor + search_keywords │
└────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STEP 4 · Vector embedding generation │
│ Production: OpenAI text-embedding-3-small (1536 dims) │
│ Dev: Ollama nomic-embed-text (768 dims) │
│ Result: float[1536] in production, L2-normalized (cosine ready) │
└────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ STEP 5 · Persistence + version snapshot │
│ packages/storage/src/vector/pgvector.ts │
│ Upserts into product_embeddings table (single row) │
│ Snapshots prior version into │
│ product_enrichment_versions (last 3 retained) │
└─────────────────────────────────────────────────────────┘
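Step 3 can be sketched roughly as follows. This is a minimal illustration, not the contents of embedding-text.ts: the function name buildEmbeddingText, the EmbeddingTextInput shape, the decision to drop empty fields, and joining keywords with a space are all assumptions. Only the field order and the " | " separator come from the diagram above.

```typescript
// Hypothetical input shape; field order follows the STEP 3 box above.
interface EmbeddingTextInput {
  coreIdentity: string;
  recommendedFor?: string;
  decisionLogic?: string;
  title: string;
  vendor?: string;
  searchKeywords?: string[];
}

// Joins the fields in the documented order with " | ".
// Assumption: missing or blank fields are skipped instead of leaving
// empty segments in the embedding text.
function buildEmbeddingText(input: EmbeddingTextInput): string {
  const parts: Array<string | undefined> = [
    input.coreIdentity,
    input.recommendedFor,
    input.decisionLogic,
    input.title,
    input.vendor,
    input.searchKeywords?.join(" "),
  ];
  return parts
    .map((part) => part?.trim())
    .filter((part): part is string => Boolean(part))
    .join(" | ");
}
```

Skipping blank fields keeps the separator count stable for sparse products, which may or may not match the real behaviour; check the source file before relying on it.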
Storage schema
product_embeddings
| Column | Type | Content |
|---|---|---|
| id | TEXT | platform product ID |
| tenant_id | TEXT | multi-tenant isolation |
| tenant_platform_id | TEXT | per-store isolation |
| platform | TEXT | shopify / bigcommerce / etc. |
| core_identity | TEXT | LLM-generated semantic description |
| reasoning | JSONB | { recommended_for, decision_logic, objection_handler, seasonal_relevance } |
| synthetic_properties | JSONB | { durability_score, quality_perception, value_for_money, price_positioning, typical_competitors, search_keywords } |
| technical_details | JSONB | { original_title, vendor, price, currency, images } |
| metadata | JSONB | { transformed_at, llm_model, transformation_version, manually_edited?, last_manual_edit_at?, enrichment_pending? } |
| search_keywords | TEXT[] | denormalized array for FT search |
| embedding | vector(1536) | the actual cosine-normalized vector |
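
Because the stored vectors are L2-normalized, the cosine similarity between two of them reduces to a plain dot product. The sketch below illustrates that property in TypeScript. The function names are hypothetical, and this is not a description of how pgvector or the Clione storage layer computes distances.

```typescript
// Euclidean (L2) length of a vector.
function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// Scales a vector to unit length.
function l2Normalize(v: number[]): number[] {
  const norm = l2Norm(v);
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Dot product of two vectors of equal length.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// General cosine similarity, valid for vectors of any length.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (l2Norm(a) * l2Norm(b));
}
```

For a = [3, 4] and b = [4, 3], cosineSimilarity(a, b) and dot(l2Normalize(a), l2Normalize(b)) agree up to floating-point error (approximately 0.96), which is why normalizing once at write time is enough.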
product_enrichment_versions (versioning, last 3 retained per product)
Snapshot of core_identity + reasoning + synthetic_properties + search_keywords + metadata plus version, reason ("manual re-enrich" / "bulk re-enrich pending" / "rollback to vN" / "initial sync"), created_by, created_at. Used by the History tab + rollback endpoint.
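
The "last 3 retained" rule can be sketched as an in-memory function. This is an illustration under assumptions: the EnrichmentVersion shape, the appendVersion name, and the monotonically increasing version counter are hypothetical, and the real pruning presumably happens in SQL inside the storage package.

```typescript
// Hypothetical row shape, loosely following the columns described above.
interface EnrichmentVersion {
  version: number;
  reason: string;
  created_by: string;
  created_at: string;
  snapshot: Record<string, unknown>;
}

const MAX_RETAINED_VERSIONS = 3;

// Appends a snapshot and returns only the newest MAX_RETAINED_VERSIONS
// entries. Assumption: version numbers only ever grow.
function appendVersion(
  history: EnrichmentVersion[],
  snapshot: Record<string, unknown>,
  reason: string,
  createdBy: string,
  now: Date = new Date(),
): EnrichmentVersion[] {
  const nextVersion =
    history.reduce((max, v) => Math.max(max, v.version), 0) + 1;
  const entry: EnrichmentVersion = {
    version: nextVersion,
    reason,
    created_by: createdBy,
    created_at: now.toISOString(),
    snapshot,
  };
  return [...history, entry]
    .sort((a, b) => a.version - b.version)
    .slice(-MAX_RETAINED_VERSIONS);
}
```

After four consecutive calls starting from an empty history, only versions 2, 3 and 4 remain, so a rollback can reach at most three generations back.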
Endpoints exposing the artifacts
Per-product
| Endpoint | Content-Type | Purpose |
|---|---|---|
| GET /api/{platform}/products/:id | application/json | Full SemanticProduct payload (everything above) |
| GET /api/{platform}/products/:id.llm | application/json (X-Content-Format: llm-optimized) | LLM-optimized JSON: identity, reasoning, scores, market context, computed summary_for_ai text |
| GET /api/{platform}/products/:id.jsonld | application/ld+json | schema.org/Product with additionalProperty[] carrying scores as PropertyValue |
| GET /api/{platform}/products/:id.meta | application/json | SEO signals: meta_description (160/300/full), canonical URL, keywords[] |
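
To make the .jsonld row concrete, here is a minimal sketch of how scores could be carried as schema.org PropertyValue entries under additionalProperty. The function name toProductJsonLd, the input shape, and the choice of which fields map to name, description, brand and offers are assumptions; the actual endpoint output may include more or different properties.

```typescript
// Hypothetical input; scores may be numeric or categorical
// (for example price_positioning), so both are allowed here.
interface ProductForJsonLd {
  title: string;
  description: string;
  vendor: string;
  price: number;
  currency: string;
  scores: Record<string, number | string>;
}

// Builds a schema.org/Product object with one PropertyValue per score.
function toProductJsonLd(p: ProductForJsonLd) {
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: p.title,
    description: p.description,
    brand: { "@type": "Brand", name: p.vendor },
    offers: { "@type": "Offer", price: p.price, priceCurrency: p.currency },
    additionalProperty: Object.entries(p.scores).map(([name, value]) => ({
      "@type": "PropertyValue",
      name,
      value,
    })),
  };
}
```

Serializing the result with JSON.stringify gives a payload suitable for an application/ld+json response or an inline script tag.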
Tenant / domain level
| Endpoint | Format | Purpose |
|---|---|---|
| GET /.well-known/llms.txt | text/plain (Markdown per llms.txt spec) | Catalog manifest: store name, product count, enrichment summary, API endpoints, usage guide |
| GET /.well-known/ai-offers.json | application/json | Promotions context (managed by promotions-engine) |
Storefront embed
embed.js is a single-file loader served at /embed.js that:
- Auto-detects entity type/ID from URL patterns + OpenGraph meta tags
- Fetches FAQ data via /api/v1/public/faq/render
- Injects (a) visible HTML accordion (b) invisible JSON-LD FAQPage schema
Origin-validated against the API key's allowed_domains whitelist.
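
The origin check can be sketched as below. This is an assumption-heavy illustration: the function name isOriginAllowed is hypothetical, and whether subdomains of an allowed domain are accepted (as they are here) is a guess, not documented behaviour.

```typescript
// Returns true when the request Origin's hostname matches an entry in
// allowedDomains exactly, or is a subdomain of one.
// Assumption: allowed_domains holds bare hostnames such as "example.com".
function isOriginAllowed(
  origin: string | null | undefined,
  allowedDomains: string[],
): boolean {
  if (!origin) return false;
  let hostname: string;
  try {
    hostname = new URL(origin).hostname.toLowerCase();
  } catch {
    return false; // malformed Origin header
  }
  return allowedDomains.some((domain) => {
    const d = domain.trim().toLowerCase();
    return d.length > 0 && (hostname === d || hostname.endsWith("." + d));
  });
}
```

Matching on the leading dot matters: a naive suffix check would accept evil-example.com for an allowed domain of example.com.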
Quality Scores — important limitations
Read this before showing scores to a customer.
The 5 quality fields (durability_score, quality_perception, value_for_money, price_positioning, typical_competitors) are 100% LLM-inferred today.
What the LLM uses
- Title
- Description
- Price + currency
- Vendor
- Categories
- enrichment_hints from the Wizard, if filled
What the LLM does NOT use
- Real reviews (Google / Trustpilot / internal)
- Returns rate
- Sales data / conversion rate
- External catalog comparatives
- Manufacturer certifications (ISO, MIL-STD, GOTS, etc.)
- Lifecycle test results
- Sustainability databases
Implications
- Scores are synthetic opinion, not verified facts.
- The same product can produce different scores between runs if temperature > 0.
- Scores are useful in relative terms: the LLM has common sense (Hermès → luxury, Primark → budget), so scores help differentiate products within the same catalog, not against external benchmarks.
- typical_competitors has the same problem: the entries are plausible inferences, not verified market data.
Grounding — shipped layers
- Visual disclaimer badge on every Quality Scores block, resolving from metadata.scores_evidence_level: 🟢 Data-grounded · 🔵 AI + owner hints · 🟡 AI-inferred
- Enrichment Wizard has a dedicated Evidence section with 5 questions (warranty_period, certifications, returns_rate, reviews_rating, manufacturing_origin). Filling 3+ upgrades the badge to green; 1-2 to cyan. Logic lives in lib/evidence-grounding.ts and applies to both Shopify and BigCommerce enrich endpoints.
- Owner-provided Typical Competitors structured as [{ brand, model, ref_price?, notes? }] in the product's Edit tab. When any row is saved, metadata.competitors_source = 'owner' and the LLM stops regenerating the list on re-enrichment (lib/preserve-owner-data.ts).
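
The badge thresholds described above can be sketched as a small resolver. This is not the code in lib/evidence-grounding.ts: the function name, the EvidenceLevel string values, and the rule that blank strings do not count as answers are assumptions. Only the five field names and the 3+ / 1-2 thresholds come from the text.

```typescript
// Hypothetical level identifiers; the values actually stored in
// metadata.scores_evidence_level are not documented here.
type EvidenceLevel = "data_grounded" | "ai_owner_hints" | "ai_inferred";

const EVIDENCE_FIELDS = [
  "warranty_period",
  "certifications",
  "returns_rate",
  "reviews_rating",
  "manufacturing_origin",
] as const;

type EvidenceAnswers = Partial<
  Record<(typeof EVIDENCE_FIELDS)[number], string | number | null>
>;

// 3 or more answered questions: green. 1 or 2: cyan. None: AI-inferred.
// Assumption: null, undefined and whitespace-only answers are unanswered.
function resolveEvidenceLevel(answers: EvidenceAnswers): EvidenceLevel {
  const filled = EVIDENCE_FIELDS.filter((field) => {
    const value = answers[field];
    return value !== undefined && value !== null && String(value).trim() !== "";
  }).length;
  if (filled >= 3) return "data_grounded";
  if (filled >= 1) return "ai_owner_hints";
  return "ai_inferred";
}
```

Counting answered questions rather than weighting them keeps the rule easy to explain to merchants, at the cost of treating a warranty period and a verified returns rate as equally strong evidence.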
Roadmap remaining
External data connectors — Returns rate webhooks, Judge.me + Yotpo reviews APIs, GA conversion rate — and the per-score provenance schema (synthetic_properties.scores_provenance) with composite resolution. Full design in the internal docs/SCORES_GROUNDING_ROADMAP.md.
Until external connectors land, assume scores reflect the LLM's reading of the catalog text sharpened by whatever evidence + competitors you fed via the wizard — not ground truth, but better than pure synthesis.
Embedding model swap warning
The embedding column is fixed at vector(1536) to match text-embedding-3-small. Switching to a different embedding model with a different dimension (e.g. Ollama's nomic-embed-text at 768 dims) requires a full re-index — the column type would have to change and every existing embedding would need re-generation. It is not a runtime toggle.
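
One way to catch a mismatched model early is a dimension guard before the upsert. The sketch below is hypothetical (the function and constant names are not from the codebase) and only illustrates the check; it does not replace the re-index described above.

```typescript
// Must match the vector(1536) column type.
const EMBEDDING_DIMENSIONS = 1536;

// Throws if the vector cannot be stored in the embedding column.
function assertEmbeddingDimensions(
  embedding: number[],
  expected: number = EMBEDDING_DIMENSIONS,
): void {
  if (embedding.length !== expected) {
    throw new Error(
      `embedding has ${embedding.length} dimensions, column expects ${expected}`,
    );
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("embedding contains NaN or infinite values");
  }
}
```

A 768-dimension vector from a different model fails this check immediately, with a clearer message than a database type error at write time.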
When to re-enrich
| Scenario | Action |
|---|---|
| Initial sync from platform | Auto-enriched as part of import |
| Catalog title/description changed at the source | Re-sync triggers re-enrichment |
| Owner edits AI fields manually in dashboard | Saved with metadata.manually_edited = true. Future re-enrich shows an extra "OVERWRITE" confirmation to prevent losing manual work |
| Owner runs Enrichment Wizard with new context | Manual re-enrich with hints embedded in the prompt |
| LLM model upgrade (e.g. gpt-4o-mini → next version) | Bulk re-enrich recommended; old versions retained for 3 generations via History tab |
Re-enrichment is destructive but versioned: the previous output is snapshotted into product_enrichment_versions and can be restored from the History tab. Retention: last 3 versions per product.
Why this matters
The whole design assumes agent commerce is the next discovery channel. ChatGPT shops Shopify, Perplexity recommends, Gemini compares. Stores whose catalogs read clearly to LLMs win; those that don't are invisible.
Clione's job is to make every product readable to those agents in every format they prefer — without the merchant having to think about it.
That's why we generate 6 artifacts, not just one vector.