Enrichment pipeline

Last updated: 14 April 2026

Audience: anyone debugging "what does Clione actually do with my product?"

TL;DR

Clione transforms a raw product (title, description, price, vendor, images) into a multi-format semantic representation that:

Lets LLMs (ChatGPT, Perplexity, Gemini) understand the product accurately
Lets search engines crawl rich structured data (schema.org)
Lets vector search retrieve products by intent (semantic similarity)
Lets storefronts inject FAQ widgets and JSON-LD signals automatically

It produces 6 categories of artifacts from a single LLM call + an embedding call:

#	Artifact	Where it lives	Who consumes it
1	Semantic text (`core_identity`, `reasoning`)	DB column on `product_embeddings`	LLMs, internal search, dashboard
2	Quality scores (`durability_score`, `quality_perception`, `value_for_money`, `price_positioning`)	DB column `synthetic_properties`	Dashboard, scoring, ranking
3	Search keywords (bilingual array)	DB column `search_keywords`	Full-text search, tag clouds
4	Vector embedding (1536-dim float array)	`pgvector` column	Cosine similarity search
5	JSON-LD (schema.org/Product)	Generated on-the-fly at `/products/:id.jsonld`	Search engines, LLM crawlers
6	SEO meta signals (description, keywords, canonical)	Generated on-the-fly at `/products/:id.meta`	SEO tools, headless storefronts

Plus tenant-level: /.well-known/llms.txt (catalog manifest) and /.well-known/ai-offers.json (promotions context).

Pipeline (end to end)

┌─────────────────┐
│ Platform (Shopify, BigCommerce, etc.)
│ Sends raw product:
│   title, description, price, vendor, images, categories
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 1 · Adapter normalizes platform-specific shape     │
│ packages/platform-shopify/src/adapters/product.ts       │
│ → RawProduct                                            │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 2 · LLM enrichment (Normalizer)                    │
│ packages/semantic-core/src/services/normalizer.ts       │
│ Calls llmClient.enrich() with the prompt at:            │
│ apps/products-api/src/lib/promptTemplates.ts            │
│                                                          │
│ Production model: gpt-4o-mini                           │
│ Dev fallback: Ollama (llama3.2:1b for the LLM,          │
│   nomic-embed-text for the embedding)                   │
│                                                          │
│ Returns:                                                 │
│   core_identity: "1-2 sentence semantic description"    │
│   reasoning: { recommended_for, decision_logic,         │
│                objection_handler, seasonal_relevance }  │
│   synthetic_properties: { durability_score,             │
│     quality_perception, value_for_money,                │
│     price_positioning, typical_competitors,             │
│     search_keywords (bilingual EN+ES) }                 │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 3 · Embedding text assembly                        │
│ packages/storage/src/embeddings/embedding-text.ts       │
│ Concatenates with " | " separator:                      │
│   core_identity + recommended_for + decision_logic      │
│   + title + vendor + search_keywords                    │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 4 · Vector embedding generation                    │
│ Production: OpenAI text-embedding-3-small (1536 dims)   │
│ Dev: Ollama nomic-embed-text (768 dims)                 │
│ Result: float[1536] L2-normalized (cosine ready)        │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 5 · Persistence + version snapshot                 │
│ packages/storage/src/vector/pgvector.ts                 │
│ Upserts into product_embeddings table (single row)      │
│ Snapshots prior version into                            │
│ product_enrichment_versions (last 3 retained)           │
└─────────────────────────────────────────────────────────┘
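Step 3 can be sketched as follows. This is a minimal illustration, not the contents of `embedding-text.ts`: the interface names, the assumption that `recommended_for` and `decision_logic` are plain strings, the comma used to join keywords, and the dropping of empty segments are all hypothetical. Only the field order and the `" | "` separator come from the diagram above.

```typescript
// Hypothetical shapes; only the field names come from the pipeline diagram.
interface EnrichedFields {
  core_identity: string;
  reasoning: { recommended_for: string; decision_logic: string };
  search_keywords: string[];
}

interface RawFields {
  title: string;
  vendor: string;
}

// Joins the fields in the documented order with " | ".
// Assumption: empty segments are skipped so the separator never doubles up.
function buildEmbeddingText(enriched: EnrichedFields, raw: RawFields): string {
  const parts = [
    enriched.core_identity,
    enriched.reasoning.recommended_for,
    enriched.reasoning.decision_logic,
    raw.title,
    raw.vendor,
    enriched.search_keywords.join(", "), // join style is an assumption
  ];
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" | ");
}
```

Because the title and vendor are part of the embedded text, a change to either at the source changes the vector, which is one reason a re-sync triggers re-enrichment.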

Storage schema

`product_embeddings`

Column	Type	Content
`id`	TEXT	platform product ID
`tenant_id`	TEXT	multi-tenant isolation
`tenant_platform_id`	TEXT	per-store isolation
`platform`	TEXT	`shopify` / `bigcommerce` / etc.
`core_identity`	TEXT	LLM-generated semantic description
`reasoning`	JSONB	`{ recommended_for, decision_logic, objection_handler, seasonal_relevance }`
`synthetic_properties`	JSONB	`{ durability_score, quality_perception, value_for_money, price_positioning, typical_competitors, search_keywords }`
`technical_details`	JSONB	`{ original_title, vendor, price, currency, images }`
`metadata`	JSONB	`{ transformed_at, llm_model, transformation_version, manually_edited?, last_manual_edit_at?, enrichment_pending? }`
`search_keywords`	TEXT[]	denormalized array for full-text search
`embedding`	`vector(1536)`	the L2-normalized embedding vector (ready for cosine similarity)
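Why L2 normalization makes a vector "cosine ready": once both vectors have unit length, their cosine similarity equals their dot product. The sketch below shows this in plain TypeScript. The function names are illustrative and nothing here is taken from the storage package; in production the comparison happens inside pgvector.

```typescript
// Scales a vector to unit length (L2 norm = 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return v.map((x) => x / norm);
}

// Dot product; for unit-length inputs this is the cosine similarity.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  // Lengths are equal, so the `?? 0` fallback is never used at runtime.
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity for arbitrary (non-zero) vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(l2Normalize(a), l2Normalize(b));
}
```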

`product_enrichment_versions` (versioning, last 3 retained per product)

Snapshot of core_identity + reasoning + synthetic_properties + search_keywords + metadata plus version, reason ("manual re-enrich" / "bulk re-enrich pending" / "rollback to vN" / "initial sync"), created_by, created_at. Used by the History tab + rollback endpoint.
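The "last 3 retained" rule can be pictured with a small in-memory model. This is only a sketch under assumptions: the real table lives in Postgres, and the `EnrichmentVersion` shape, `MAX_VERSIONS` constant, and `pushVersion` helper are hypothetical names that mirror the columns listed above.

```typescript
// Hypothetical in-memory mirror of a product_enrichment_versions row.
interface EnrichmentVersion {
  version: number;
  reason: string; // e.g. "manual re-enrich", "rollback to v2"
  created_by: string;
  created_at: string; // ISO timestamp
  snapshot: Record<string, unknown>; // core_identity, reasoning, etc.
}

const MAX_VERSIONS = 3;

// Appends a new snapshot with the next version number and keeps only
// the newest MAX_VERSIONS entries.
function pushVersion(
  history: EnrichmentVersion[],
  next: Omit<EnrichmentVersion, "version">
): EnrichmentVersion[] {
  const lastVersion = history.reduce((max, v) => Math.max(max, v.version), 0);
  const updated = [...history, { ...next, version: lastVersion + 1 }];
  updated.sort((a, b) => a.version - b.version);
  return updated.slice(-MAX_VERSIONS);
}
```

In this model version numbers keep increasing after old rows are pruned, so a reason such as "rollback to vN" stays unambiguous for as long as vN is still retained.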

Endpoints exposing the artifacts

Per-product

Endpoint	Content-Type	Purpose
`GET /api/{platform}/products/:id`	`application/json`	Full SemanticProduct payload (everything above)
`GET /api/{platform}/products/:id.llm`	`application/json` (`X-Content-Format: llm-optimized`)	LLM-optimized JSON: identity, reasoning, scores, market context, computed `summary_for_ai` text
`GET /api/{platform}/products/:id.jsonld`	`application/ld+json`	schema.org/Product with `additionalProperty[]` carrying scores as `PropertyValue`
`GET /api/{platform}/products/:id.meta`	`application/json`	SEO signals: `meta_description` (160/300/full), `canonical` URL, `keywords[]`
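To make the `.jsonld` row concrete, here is a sketch of a schema.org/Product document that carries scores in `additionalProperty[]` as `PropertyValue` entries. The input shape and the `toProductJsonLd` name are hypothetical, and the actual endpoint may emit more or different properties. The `@type` values and property names used (`Product`, `Brand`, `Offer`, `PropertyValue`, `additionalProperty`) are standard schema.org vocabulary.

```typescript
// Hypothetical input; field names loosely follow the storage schema.
interface JsonLdInput {
  title: string;
  core_identity: string;
  vendor: string;
  price: number;
  currency: string;
  scores: Record<string, number | string>; // value types are an assumption
}

function toProductJsonLd(p: JsonLdInput) {
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: p.title,
    description: p.core_identity,
    brand: { "@type": "Brand", name: p.vendor },
    offers: { "@type": "Offer", price: p.price, priceCurrency: p.currency },
    // Each score becomes one PropertyValue entry.
    additionalProperty: Object.entries(p.scores).map(([name, value]) => ({
      "@type": "PropertyValue",
      name,
      value,
    })),
  };
}
```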

Tenant / domain level

Endpoint	Format	Purpose
`GET /.well-known/llms.txt`	`text/plain` (Markdown per llms.txt spec)	Catalog manifest: store name, product count, enrichment summary, API endpoints, usage guide
`GET /.well-known/ai-offers.json`	`application/json`	Promotions context (managed by `promotions-engine`)
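The routes in the two tables above follow a regular pattern: a format suffix on the product ID, and fixed `/.well-known/` paths per tenant. A small URL builder, sketched below, makes the pattern explicit. The helper names and the example base URLs are hypothetical; which host actually serves each route depends on your deployment.

```typescript
type ProductFormat = "json" | "llm" | "jsonld" | "meta";

// Builds a per-product URL; "json" maps to the bare /products/:id route.
function productUrl(
  baseUrl: string,
  platform: string,
  productId: string,
  format: ProductFormat = "json"
): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const suffix = format === "json" ? "" : `.${format}`;
  return `${root}/api/${encodeURIComponent(platform)}/products/${encodeURIComponent(productId)}${suffix}`;
}

// Builds a tenant-level well-known URL.
function wellKnownUrl(origin: string, file: "llms.txt" | "ai-offers.json"): string {
  return `${origin.replace(/\/+$/, "")}/.well-known/${file}`;
}
```

For example, `productUrl("https://api.example.com", "shopify", "123", "jsonld")` yields `https://api.example.com/api/shopify/products/123.jsonld`.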

Storefront embed

embed.js is a single-file loader served at /embed.js that:

Auto-detects entity type/ID from URL patterns + OpenGraph meta tags
Fetches FAQ data via /api/v1/public/faq/render
Injects (a) visible HTML accordion (b) invisible JSON-LD FAQPage schema

Origin-validated against the API key's allowed_domains whitelist.
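Two of the mechanisms above can be sketched briefly: the origin check against `allowed_domains` and the FAQPage JSON-LD that gets injected. Both functions are illustrative, not the code in `embed.js` or the API. In particular, exact hostname matching (no wildcard or subdomain rules) and the `FaqItem` shape are assumptions; the FAQPage structure itself is standard schema.org.

```typescript
// Returns true when the request origin's hostname is on the allow-list.
// Assumption: entries are bare hostnames and matching is exact.
function isOriginAllowed(origin: string, allowedDomains: string[]): boolean {
  let hostname: string;
  try {
    hostname = new URL(origin).hostname.toLowerCase();
  } catch {
    return false; // malformed Origin header
  }
  return allowedDomains.some((d) => d.trim().toLowerCase() === hostname);
}

// Hypothetical shape of one FAQ entry returned by the render endpoint.
interface FaqItem {
  question: string;
  answer: string;
}

// Builds the schema.org FAQPage object that would be serialized into a
// <script type="application/ld+json"> tag.
function buildFaqPageJsonLd(items: FaqItem[]) {
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: items.map((item) => ({
      "@type": "Question",
      name: item.question,
      acceptedAnswer: { "@type": "Answer", text: item.answer },
    })),
  };
}
```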

Quality Scores — important limitations

Read this before showing scores to a customer.

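The 5 quality fields (durability_score, quality_perception, value_for_money, price_positioning, typical_competitors) are 100% LLM-inferred today: the model estimates them from the catalog text and does not measure anything. The one exception is typical_competitors when the owner has entered the list manually (see the grounding section below).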

What the LLM uses

Title
Description
Price + currency
Vendor
Categories
enrichment_hints from the Wizard if filled

What the LLM does NOT use

Real reviews (Google / Trustpilot / internal)
Returns rate
Sales data / conversion rate
External catalog comparatives
Manufacturer certifications (ISO, MIL-STD, GOTS, etc.)
Lifecycle test results
Sustainability databases

Implications

Scores are synthetic opinion, not verified facts.
The same product can produce different scores between runs if temperature > 0.
Scores are useful in relative terms: the LLM has common sense (Hermès → luxury, Primark → budget), so scores help differentiate products within the same catalog, not against external benchmarks.
typical_competitors has the same problem: the entries are plausible inferences, not verified market data.

Grounding — shipped layers

Visual disclaimer badge on every Quality Scores block, resolving from metadata.scores_evidence_level: 🟢 Data-grounded · 🔵 AI + owner hints · 🟡 AI-inferred
Enrichment Wizard has a dedicated Evidence section with 5 questions (warranty_period, certifications, returns_rate, reviews_rating, manufacturing_origin). Filling 3+ upgrades the badge to green; 1-2 to cyan. Logic lives in lib/evidence-grounding.ts and applies to both Shopify and BigCommerce enrich endpoints.
Owner-provided Typical Competitors structured as [{ brand, model, ref_price?, notes? }] in the product's Edit tab. When any row is saved, metadata.competitors_source = 'owner' and the LLM stops regenerating the list on re-enrichment (lib/preserve-owner-data.ts).
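The badge rule (3 or more evidence answers for green, 1 or 2 for the middle tier, none for AI-inferred) is simple enough to sketch. This is not the code in `lib/evidence-grounding.ts`: the `EvidenceLevel` string values, the `EvidenceAnswers` type, and treating whitespace-only answers as unfilled are all assumptions. Only the five question keys and the thresholds come from the text above.

```typescript
// Hypothetical level names; the stored values of
// metadata.scores_evidence_level may differ.
type EvidenceLevel = "data-grounded" | "ai-owner-hints" | "ai-inferred";

const EVIDENCE_FIELDS = [
  "warranty_period",
  "certifications",
  "returns_rate",
  "reviews_rating",
  "manufacturing_origin",
] as const;

type EvidenceAnswers = Partial<Record<(typeof EVIDENCE_FIELDS)[number], string>>;

function resolveEvidenceLevel(answers: EvidenceAnswers): EvidenceLevel {
  // Assumption: blank or whitespace-only answers do not count as filled.
  const filled = EVIDENCE_FIELDS.filter(
    (field) => (answers[field] ?? "").trim().length > 0
  ).length;
  if (filled >= 3) return "data-grounded";
  if (filled >= 1) return "ai-owner-hints";
  return "ai-inferred";
}
```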

Roadmap remaining

Still to come: external data connectors (returns-rate webhooks, Judge.me and Yotpo reviews APIs, GA conversion rate) and the per-score provenance schema (synthetic_properties.scores_provenance) with composite resolution. The full design is in the internal docs/SCORES_GROUNDING_ROADMAP.md.

Until external connectors land, assume scores reflect the LLM's reading of the catalog text sharpened by whatever evidence + competitors you fed via the wizard — not ground truth, but better than pure synthesis.

Embedding model swap warning

The embedding column is fixed at vector(1536) to match text-embedding-3-small. Switching to a different embedding model with a different dimension (e.g. Ollama's nomic-embed-text at 768 dims) requires a full re-index — the column type would have to change and every existing embedding would need re-generation. It is not a runtime toggle.
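A cheap way to surface this early is to check the vector length before attempting a write, so that a 768-dim dev embedding fails with a clear message instead of a database error. The guard below is a sketch; `EMBEDDING_DIMS` and `assertEmbeddingDims` are hypothetical names, not part of the storage package.

```typescript
// Must match the vector(N) column definition.
const EMBEDDING_DIMS = 1536;

// Throws when the embedding length does not match the column dimension;
// returns the embedding unchanged otherwise.
function assertEmbeddingDims(
  embedding: number[],
  expected: number = EMBEDDING_DIMS
): number[] {
  if (embedding.length !== expected) {
    throw new Error(
      `embedding has ${embedding.length} dims, column expects ${expected}; ` +
        "switching models requires a full re-index"
    );
  }
  return embedding;
}
```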

When to re-enrich

Scenario	Action
Initial sync from platform	Auto-enriched as part of import
Catalog title/description changed at the source	Re-sync triggers re-enrichment
Owner edits AI fields manually in dashboard	Saved with `metadata.manually_edited = true`. Future re-enrich shows an extra "OVERWRITE" confirmation to prevent losing manual work
Owner runs Enrichment Wizard with new context	Manual re-enrich with hints embedded in the prompt
LLM model upgrade (e.g. gpt-4o-mini → next version)	Bulk re-enrich recommended; old versions retained for 3 generations via History tab

Re-enrichment is destructive but versioned: the previous output is snapshotted into product_enrichment_versions and can be restored from the History tab. Retention: last 3 versions per product.
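The "OVERWRITE" confirmation from the table above amounts to a guard that runs before any re-enrichment. The sketch below models that decision only; the type and function names are hypothetical and the real check may consider more state than `metadata.manually_edited`.

```typescript
// Outcome of the pre-flight check for a re-enrich request.
type ReEnrichDecision =
  | { proceed: true }
  | { proceed: false; needs: "overwrite_confirmation" };

// Blocks re-enrichment of manually edited products until the caller
// has explicitly confirmed the overwrite.
function checkReEnrich(
  metadata: { manually_edited?: boolean },
  confirmedOverwrite: boolean
): ReEnrichDecision {
  if (metadata.manually_edited === true && !confirmedOverwrite) {
    return { proceed: false, needs: "overwrite_confirmation" };
  }
  return { proceed: true };
}
```

Even when the guard lets the request through, the previous output is still snapshotted first, so a confirmed overwrite remains reversible within the 3-version retention window.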

Why this matters

The whole design assumes agent commerce is the next discovery channel. ChatGPT shops Shopify, Perplexity recommends, Gemini compares. Stores whose catalogs read clearly to LLMs win; those that don't are invisible.

Clione's job is to make every product readable to those agents in every format they prefer — without the merchant having to think about it.

That's why we generate 6 artifacts, not just one vector.
