
Vector embeddings

The vector embedding is one of the six artifacts Clione produces per product. It's a fixed-length array of floating-point numbers that encodes the meaning of the product text in a high-dimensional space, so two semantically similar products land close together geometrically even if they use different words.

What it's used for

  • Semantic search — "linen sheets for hot sleepers" matches a product titled "100% Belgian flax bedding, temperature-regulating" even without word overlap
  • Similar-product recommendations
  • Duplicate detection across catalogs
  • Clustering — grouping products that behave like each other

What feeds the embedding

A single string assembled per product (source: packages/storage/src/embeddings/embedding-text.ts):

<core_identity> | <reasoning.recommended_for> | <reasoning.decision_logic>
| <technical_details.original_title> | <technical_details.vendor>
| <search_keywords joined>

Notice: the enriched text feeds the embedding, not just the raw title. That's why enrichment quality matters — a bad enrichment produces a weak embedding.
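
To make the template above concrete, here is a minimal sketch of how such a string could be assembled. The interface, the function name buildEmbeddingText, the ", " keyword separator, the skipping of empty segments, and the sample product are all assumptions for illustration; the actual logic lives in embedding-text.ts and may differ.

```typescript
// Hypothetical shape: field names follow the template above, but the real
// types in embedding-text.ts may differ.
interface EnrichedProduct {
  core_identity: string;
  reasoning: { recommended_for: string; decision_logic: string };
  technical_details: { original_title: string; vendor: string };
  search_keywords: string[];
}

// Assumptions: segments are joined with " | ", keywords with ", ",
// and empty segments are dropped rather than left as blank slots.
function buildEmbeddingText(p: EnrichedProduct): string {
  const segments: string[] = [
    p.core_identity,
    p.reasoning.recommended_for,
    p.reasoning.decision_logic,
    p.technical_details.original_title,
    p.technical_details.vendor,
    p.search_keywords.join(", "),
  ];
  return segments
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .join(" | ");
}

// Made-up sample data, for illustration only.
const exampleProduct: EnrichedProduct = {
  core_identity: "Breathable flax bedding",
  reasoning: {
    recommended_for: "hot sleepers",
    decision_logic: "chosen for moisture wicking",
  },
  technical_details: {
    original_title: "100% Belgian flax bedding, temperature-regulating",
    vendor: "Acme Linen",
  },
  search_keywords: ["linen sheets", "cooling bedding"],
};

console.log(buildEmbeddingText(exampleProduct));
```

Because the enriched fields come first and the raw title is only one segment among six, a vague or wrong enrichment dominates the text the model sees.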

Which model

| Environment | Provider | Model | Dimensions |
| --- | --- | --- | --- |
| Production | OpenAI | text-embedding-3-small | 1536 |
| Dev (Ollama) | Local | nomic-embed-text | 768 |

Vectors are L2-normalised so cosine similarity = dot product (fast in pgvector).
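
The equivalence can be checked numerically. The sketch below uses hypothetical helper names (l2Normalize, dotProduct, cosineSim) and toy 2-dimensional vectors; it is not taken from the Clione codebase.

```typescript
// Scale a vector to unit length (L2 norm = 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalise a zero vector");
  return v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity computed the long way, with explicit norms.
function cosineSim(a: number[], b: number[]): number {
  return (
    dotProduct(a, b) /
    (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)))
  );
}

const rawA = [3, 4];
const rawB = [4, 3];

// Both values are 0.96 up to floating-point rounding.
console.log(cosineSim(rawA, rawB));
console.log(dotProduct(l2Normalize(rawA), l2Normalize(rawB)));
```

Once vectors are stored normalised, the two square roots and the division drop out of every comparison, which is where the speed-up comes from.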

Where they live

Column embedding on the product_embeddings table — typed vector(1536) via the pgvector extension.

Dimension swap warning

The column type vector(1536) is fixed at schema level. If you switch to an embedding model with a different dimension (e.g. the 768-dimension nomic-embed-text used in dev), you need:

  1. A schema migration to change the embedding column's dimension
  2. A full re-index: every existing embedding must be regenerated from the text
  3. Downtime or a dual-write period, depending on strategy

It's not a runtime toggle. Plan carefully.
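
One way to surface a mismatch early is to check the vector length in application code before writing. This is a hedged sketch, not Clione's actual code: the constant EMBEDDING_COLUMN_DIMS and the function assertEmbeddingDims are hypothetical names, and 1536 is taken from the column type described above.

```typescript
// Assumption: mirrors the vector(1536) column on product_embeddings.
const EMBEDDING_COLUMN_DIMS = 1536;

// Throws if a vector would not fit the column, so a model swap fails loudly
// in the application instead of at insert time.
function assertEmbeddingDims(
  vector: number[],
  expected: number = EMBEDDING_COLUMN_DIMS
): void {
  if (vector.length !== expected) {
    throw new Error(
      `embedding has ${vector.length} dimensions but the column expects ${expected}; ` +
        "a schema migration and full re-index are required"
    );
  }
}

// A 768-dimension vector (e.g. from nomic-embed-text) is rejected.
const devVector = new Array<number>(768).fill(0);
try {
  assertEmbeddingDims(devVector);
} catch (err) {
  console.log((err as Error).message);
}
```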

Querying

Internal:

-- Top 10 most similar to a reference vector ($2) within one tenant ($1)
SELECT id, core_identity,
       embedding <=> $2::vector AS distance
FROM product_embeddings
WHERE tenant_id = $1
ORDER BY embedding <=> $2::vector
LIMIT 10;

The <=> operator is cosine distance (lower = more similar).
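
For intuition, the sketch below reproduces the same ranking in memory and shows how a query vector can be serialised into pgvector's text form ('[x,y,...]') for a parameter cast with ::vector. The helper names toVectorLiteral and cosineDistance, and the toy 2-dimensional candidates, are illustrative assumptions.

```typescript
// Serialise a vector into pgvector's text input format, e.g. "[0.123,0.456]".
function toVectorLiteral(v: number[]): string {
  return `[${v.join(",")}]`;
}

// Cosine distance = 1 - cosine similarity, so 0 means identical direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dotAB = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotAB += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dotAB / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy data standing in for rows of product_embeddings.
const queryVector = [1, 0];
const candidates = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [0.9, 0.1] },
];

// Ascending sort: the smallest distance is the closest match.
const rankedIds = candidates
  .map((c) => ({ id: c.id, distance: cosineDistance(c.embedding, queryVector) }))
  .sort((x, y) => x.distance - y.distance)
  .map((c) => c.id);

console.log(rankedIds); // [ 'a', 'c', 'b' ]
console.log(toVectorLiteral([0.123, 0.456])); // [0.123,0.456]
```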

What embeddings cost

OpenAI text-embedding-3-small: $0.02 per 1M tokens as of writing. The assembled text per product averages ~200 tokens, which works out to ~$0.000004 per product, or about $0.40 per 100 000 products. Cheap compared to the LLM enrichment call itself.
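
The arithmetic can be reproduced with a small helper. The function name embeddingCostUsd is hypothetical; the price and the ~200-token average are the figures quoted above and should be re-checked against current pricing.

```typescript
const PRICE_PER_MILLION_TOKENS_USD = 0.02; // text-embedding-3-small, as of writing
const AVG_TOKENS_PER_PRODUCT = 200; // approximate average quoted above

// Estimated embedding cost in USD for a catalog of the given size.
function embeddingCostUsd(
  productCount: number,
  avgTokens: number = AVG_TOKENS_PER_PRODUCT,
  pricePerMillion: number = PRICE_PER_MILLION_TOKENS_USD
): number {
  return (productCount * avgTokens * pricePerMillion) / 1_000_000;
}

console.log(embeddingCostUsd(100_000).toFixed(2)); // 0.40
```

At this rate, 200 tokens cost 200 × $0.02 / 1 000 000 = $0.000004 per product.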