Understanding Vector Databases and Embedding Pipelines
Explore the mechanics of vector databases, text embedding (Dense, Sparse, Hybrid), and similarity metrics like Cosine Similarity with coding examples.
By Kuriko IWAI

Table of Contents
Introduction
What is a Vector Database - The Architecture of Semantic Meaning
The Vectorization Pipeline
Measuring Vector Relationships
Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics
Wrapping Up

Introduction
Modern Artificial Intelligence (AI) is built to understand the nuance, context, and semantic meaning of data.
But a significant challenge persists: traditional databases are built for keyword matching, whereas AI models think in high-dimensional patterns. This creates a gap between how humans store data and how AI actually processes it.
The vector database bridges this gap by storing information as mathematical coordinates, allowing AI to retrieve data based on similarity rather than just syntax.
This article explores the inner workings of vector databases and the critical process of vectorization with practical implementations of text embedding, comparing dense, sparse, and hybrid embedding techniques.
What is a Vector Database - The Architecture of Semantic Meaning
A vector database (DB) is a specialized system designed to store, index, and query vector embeddings: long arrays of numbers that represent the semantic meaning of unstructured data like text, images, audio, and video.
The diagram below illustrates how unstructured data is stored in the vector database:

Figure A. Diagram of the vector database workflow showing raw data ingestion, embedding generation, and storage in a 3D coordinate space (Created by Kuriko IWAI)
For example, in a simplified 3-dimensional space (right, Figure A), unstructured text data:
I am a cat.
becomes a vector embedding:

[0.12, -0.98, 0.45]    (Eq. a)
In real-world models like Gemini or BERT, these vectors are much larger with 768 or 1,536 dimensions.
Each number represents a specific feature of the sentence's semantic meaning that the AI has learned during pre-training.
◼ Why Similarity Search Trumps Traditional SQL
Traditional databases like SQL can only search for exact matches.
So when a user queries "cats", they'd miss documents about "animals", "pets", or "felines" if those documents don't contain the exact word "cats".
Vector databases, by contrast, can perform similarity searches by finding the vector embeddings closest to the given query.
Because "felines" sits close to "cats" in the vector space, a similarity search can surface documents that only mention "felines" and never the exact word "cats".
This is useful for:
Semantic search: Finds information based on semantic meaning, not just text.
Retrieval-Augmented Generation (RAG): Provides LLMs with long-term memory relevant to the user query.
Recommendation engines: Suggests products based on user behavior patterns.
The Vectorization Pipeline
Vectorization is the process of converting unstructured, raw data into vector embeddings to populate a vector database (left → middle, Figure A).
The process involves the following five steps:
Load: Pull data from sources (PDFs, Notion, SQL, Slack).
Clean: Remove noise like headers, footers, or HTML tags.
Chunk: Split long data into smaller pieces.
Embed: Pass the chunks through an embedding model to turn them into lists of vectors.
Index: Store the vectors in a Vector database.
Through these steps, vectorization dictates how data is fed into the model, a process as important as selecting the model itself.
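The five steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not a production pipeline: the loader works on in-memory strings, and `toy_model` is a stand-in for a real embedding model like all-MiniLM-L6-v2.

```python
import re

def load(sources):
    # Step 1 - Load: pull raw text from each source (stubbed as plain strings)
    return [s for s in sources]

def clean(text):
    # Step 2 - Clean: strip HTML tags and collapse whitespace
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text, size=50):
    # Step 3 - Chunk: naive fixed-size splitting
    return [text[i:i + size] for i in range(0, len(text), size)]

def toy_model(text):
    # stand-in embedding: vowel counts (a real pipeline calls a model here)
    return [text.count(ch) for ch in "aeiou"]

def embed(chunks, model):
    # Step 4 - Embed: turn each chunk into a vector
    return [model(c) for c in chunks]

def index(vectors, store):
    # Step 5 - Index: persist the vectors in the vector store
    store.extend(vectors)
    return store

store = []
for doc in load(["<p>I am a   cat.</p>"]):
    index(embed(chunk(clean(doc)), toy_model), store)
```

The same skeleton holds for real pipelines; only the loader, cleaner, chunker, and model get swapped for production-grade components.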
Embedding has three primary types: Text, Vision, and Audio embeddings.
◼ Text Embedding
Text embedding is the most common strategy to vectorize text data while capturing its semantic meaning and context.
Chunking (Step 3) plays a key role in the process: chunks that are too big confuse the model, while chunks that are too small lose context.
To tackle this challenge, there are two primary chunking strategies:
Late chunking: Embed the entire document first, then break it into small chunks, so each chunk retains the context of the whole document.
Semantic chunking: Use AI to find natural breaks in meaning, instead of splitting at a fixed number of characters (e.g., every 500 characters).
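As a toy illustration of the trade-off, the sketch below contrasts fixed-size splitting with a crude sentence-boundary proxy for semantic chunking. A real semantic chunker would compare embeddings of adjacent sentences rather than split on punctuation.

```python
import re

def fixed_chunking(text, size=40):
    # fixed-size chunking: split every `size` characters, ignoring meaning
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunking(text):
    # crude stand-in for semantic chunking: break at sentence boundaries
    # instead of a fixed character count
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "A cat is sitting outside. The new movie is awesome."
fixed = fixed_chunking(doc)        # may cut mid-sentence
semantic = sentence_chunking(doc)  # keeps each sentence intact
```

Here the fixed-size version slices the first sentence's context into the second chunk, while the boundary-aware version keeps each thought whole.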
The primary text embedding methods employ these chunking strategies as follows:
Table 1. Comparison of Embedding Methods: Dense, Sparse, and Hybrid.
◼ Vision Embedding
Vision embedding treats images as data points, enabling direct image search without manual labelling:

Figure B. Architecture of a Vision Embedding model (CNN) processing images into patches for vector representation (Created by Kuriko IWAI)
Its key methods include:
Contrastive Language-Image Pre-training (CLIP): Maps images and text into the same vector space to search images with words.
Vision Transformers (ViT): Breaks images into patches to process them like text tokens.
Major players like OpenAI (CLIP), Meta (DINOv2), Google (SigLIP) have developed their own models.
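The ViT patch-splitting step can be sketched with NumPy. The 4x4 grayscale "image" and 2x2 patch size below are assumptions chosen to keep the output readable; real ViTs use, e.g., 224x224 images and 16x16 patches.

```python
import numpy as np

# a hypothetical 4x4 grayscale "image"
image = np.arange(16).reshape(4, 4)

def to_patches(img, patch=2):
    # split the image into non-overlapping patch x patch tiles,
    # then flatten each tile into a token-like vector
    h, w = img.shape
    tiles = [
        img[r:r + patch, c:c + patch].flatten()
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]
    return np.stack(tiles)

tokens = to_patches(image)  # four patches, each flattened to four pixels
```

Each row of `tokens` then plays the role of a word token and is fed through a transformer encoder, just like text.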
◼ Audio Embedding
Audio embedding is useful for music recommendation, speech recognition, or identifying sound-alike patterns in industrial sensor data.

Figure C. Process flow of raw audio signals being converted into spectrograms and embedded via an encoder network (source)
Its key methods include:
Spectrogram Analysis: Converts sound waves into visual representations and then embedding them.
Contrastive Language-Audio Pre-training (CLAP): Similar to CLIP but applies to sound.
Key players include Microsoft (CLAP) and Meta (Audio2Vec).
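The spectrogram step can be sketched with SciPy. The 440 Hz sine wave below is synthetic stand-in audio; a real audio embedder would feed the resulting spectrogram into an encoder network rather than just inspect it.

```python
import numpy as np
from scipy.signal import spectrogram

# one second of a 440 Hz sine wave at a 16 kHz sample rate (synthetic audio)
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

# convert the waveform into a time-frequency image
freqs, times, spec = spectrogram(wave, fs=sr)

# sanity check: the dominant frequency bin should sit near 440 Hz
dominant = freqs[spec.mean(axis=1).argmax()]
```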
Measuring Vector Relationships
Once the raw data is vectorized, the model calculates the proximity between vectors using specific mathematical formulas.
Vectors that are closer together share similar semantic, audio, or visual meanings.
The diagram below shows how vector math interprets semantic meaning, taking as an example two text embeddings representing words like "cat" in a 2-dimensional space:

Figure D. Geometric representation of vector relationships showing small, wide, and zero-degree angles between concepts (Created by Kuriko IWAI)
When the two vectors represent related concepts (left, Cat vs. Dog, Figure D), they point in a similar direction with a small angle θ, indicating that "Cat" and "Dog" share many features (both are pets, mammals, have four legs, etc.) in the two-dimensional vector space.
When the two vectors represent unrelated concepts (middle, Cat vs. Crocodile, Figure D), the vectors are pointing in very different directions with a wide θ.
Lastly, when the vectors share the identical concept but with different magnitude (right, Cat vs. Cat Cat Cat, Figure D), both vectors sit on the exact same line (θ = 0), but vector B is much longer.
This represents a situation where the topic is identical but the magnitude differs (perhaps a document that mentions "Cat" many more times while covering the same concept).
But how can we measure the differences between the vectors?
This section explores the three mathematical metrics:
Dot Product,
Cosine Similarity, and
Euclidean Distance.
◼ Dot Product
The dot product is the fundamental operation that measures the relationship between two vectors, considering both their orientations and magnitudes.
The dot product of the two vectors A and B in an n-dimensional space is generalized as:

A · B = Σ_{i=1}^{n} A_i B_i    (Eq. 1.1)

where A_i and B_i represent the i-th entries of the vectors A and B, respectively.
Alternatively, when the angle θ between the two vectors is known, Eq. 1.1 can be written as:

A · B = ||A|| ||B|| cos θ    (Eq. 1.2)

where ||A|| and ||B|| represent the magnitudes of the vectors A and B, respectively.
In the case of Figure D, when Vector A (Cat) = [2, 2], each scenario is measured:
Table 2.1. Similarity Metric Comparison (Dot Product)
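The dot products can be verified with NumPy. Vector A (Cat) = [2, 2] comes from the text; the B coordinates for each scenario are illustrative assumptions matching the geometry of Figure D.

```python
import numpy as np

A = np.array([2, 2])  # Cat (given in the text)
scenarios = {
    "Dog (similar direction)": np.array([3, 1]),        # assumed coordinates
    "Crocodile (different direction)": np.array([2, -2]),
    "Cat Cat Cat (same line, 3x longer)": 3 * A,
}

results = {}
for name, B in scenarios.items():
    # Eq. 1.1: sum of element-wise products
    dot = int(np.dot(A, B))
    # Eq. 1.2 cross-check: cos(theta) recovered from the same dot product
    cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
    results[name] = (dot, round(float(cos_theta), 2))
```

The related concept yields a large positive dot product, the unrelated one lands near zero, and the scaled vector inflates the score purely through magnitude.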
◼ Cosine Similarity
The cosine similarity focuses purely on the orientation between two vectors A and B, effectively ignoring their magnitudes.
The formula is essentially the dot product (Eq. 1.1) divided by the product of the vectors' magnitudes ||A|| ||B||:

cos θ = (Σ_{i=1}^{n} A_i B_i) / (||A|| ||B||)    (Eq. 2)

where A_i and B_i represent the i-th entries of the vectors A and B, respectively.
This makes cosine similarity perfect for text analysis, where word frequency (magnitude) may vary but the context (direction) remains the same.
In the case of Figure D, using the same Vector A & B coordinates, the similarity of each scenario is measured:
Table 2.2. Similarity Metric Comparison (Cosine Similarity).
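A minimal sketch of cosine similarity's magnitude invariance, reusing Vector A = [2, 2] from the text and assumed coordinates for the other vectors:

```python
import numpy as np

def cosine(a, b):
    # dot product divided by the product of magnitudes: direction only
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([2, 2])   # "Cat"
B = 3 * A              # "Cat Cat Cat": same direction, 3x the magnitude
C = np.array([2, -2])  # illustrative unrelated direction

sim_same = cosine(A, B)  # ~1.0 despite the magnitude gap
sim_diff = cosine(A, C)  # 0.0: orthogonal directions
```

Tripling the vector's length leaves the cosine score untouched, which is exactly why word-frequency differences don't distort it.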
◼ Euclidean Distance (L2 Norm)
The Euclidean distance (L2) measures the absolute displacement in space between two points, p and q:

d(p, q) = √( Σ_{i=1}^{n} (p_i − q_i)² )    (Eq. 3)
where:
d: The Euclidean distance (the scalar result).
p, q: The two vectors (or points) in n-dimensional space.
i: The index of the current dimension (from 1 to n).
p_i, q_i: The specific coordinates of vectors p and q at the i-th dimension.
The Euclidean distance is highly sensitive to the magnitudes of the vectors because it measures the absolute positions of the points in space.
If you take two vectors pointing in the exact same direction and multiply the values of one by 10, the Euclidean distance increases significantly because the tip of the scaled vector is physically much further away in the coordinate system.
This means that when comparing two documents such that:
p: A short document.
q: A much longer version of document p that repeatedly uses the same words.
the Euclidean distance will place them far apart, even though the context is similar, simply because of q's much higher word count.
In the case of Figure D, the similarity of each scenario is measured, using the same Vector A & B coordinates:
Table 2.3. Similarity Metric Comparison (Euclidean distance).
The third scenario (Cat vs Cat Cat Cat) illustrates the trap of the Euclidean distance: it shows the largest gap even though the meaning is identical, simply because Vector B is much longer.
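The scaling sensitivity described above can be checked numerically. The coordinates below are illustrative:

```python
import numpy as np

p = np.array([2, 2])
q = 10 * p               # same direction, every value multiplied by 10

near = np.array([3, 1])  # a nearby point with a similar magnitude (illustrative)

d_near = float(np.linalg.norm(p - near))  # small gap
d_scaled = float(np.linalg.norm(p - q))   # large gap, despite identical direction
```

Even though `p` and `q` point the same way, the L2 distance between them dwarfs the distance to a genuinely different nearby point.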
◼ When to Use Which Metric
Given how sensitive these methods are to scale, each mathematical approach serves a specific use case:
Table 2.4. Use Cases by Similarity Metrics.
Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics
As Figure A shows, creating text embeddings involves converting human language into a vector that a computer can reason with.
These tactics range from simple keyword counting to advanced models that understand intent and instructions.
In every case, the system must vectorize the data and the query with the same embedding model, then calculate a similarity score:

Figure E. Workflow of computing similarity scores between a query embedding and a document corpus in a RAG pipeline (Created by Kuriko IWAI)
I'll use the following sample data and query to demonstrate the scores produced by the primary methods:
Dense embedding.
Sparse embedding.
Hybrid embedding.
Sample data:
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome"
]

query = "Tell me about felines"
Dense Embedding in Action
Dense embedding is the gold standard for modern AI.
The embedding model uses a small, fixed number of dimensions filled with non-zero numbers to capture broad semantic meaning. No single position maps to a specific word; instead, the combination of numbers represents the meaning.
In the case of Eq. a, each dimension can represent a weight on a scale, such as:
Dimension 1 (0.12): Living vs. Non-living (where 1.0 is a human and -1.0 is a rock).
Dimension 2 (-0.98): Size (where -1.0 is tiny and 1.0 is massive).
Dimension 3 (0.45): Domesticity (where 1.0 is a pet and -1.0 is a wild predator).
In this scenario, the first dimension 0.12 suggests the subject is "somewhat living/animate" but perhaps not as high-ranking as a human in the model's hierarchy.
Common tactics involve:
Bi-Encoder.
Instruction Tuning.
Late Interaction.
◼ Bi-Encoder
Bi-encoder is the most common method for text embedding.
It encodes the query and the document separately. It's fast because you can pre-calculate the document vectors.
from sentence_transformers import SentenceTransformer, util

# load bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# encode
doc_emb = model.encode(docs)
query_emb = model.encode(query)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_emb)
The resulting doc_emb (the vectorized docs) looks like the following:
[[ 0.1230927 -0.00072824 0.04190801 ... 0.03736668 -0.03583647 0.06841106]
[ 0.02273473 -0.02657051 0.03814451 ... 0.03001389 0.09356179 0.02145582]
[-0.10044324 -0.07739273 -0.00137412 ... -0.00104974 0.07181141 0.02205478]]
Results:
A cat is sitting outside: 0.2984
A dog is playing guitar: 0.1693
The new movie is awesome: 0.1015
◼ Instruction Tuning
Instruction tuning is the process of fine-tuning a pre-trained model on a dataset of explicit instructions (e.g., "Summarize the document," or "Translate to French").
The instruction tells the model (e.g., Hugging Face open-source models like E5 or BGE) how to shape the vector based on the task goal, ensuring the output aligns with the user's specific intent rather than just the next most likely word.
The following snippet uses BGE:
from sentence_transformers import SentenceTransformer, util

# load bge model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# bge instruction
instruction = "Represent this sentence for searching relevant passages: "

# encode (with instruction)
query_emb = model.encode(instruction + query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_embs)[0]
Results:
A cat is sitting outside: 0.4718
A dog is playing guitar: 0.4010
The new movie is awesome: 0.3493
◼ Late Interaction (ColBERT)
Instead of one vector per document, Late Interaction keeps multiple vectors (one per token), allowing for much more granular matching.
from ragatouille import RAGPretrainedModel

# load model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# colbert creates an index to store token-level embeddings
RAG.index(
    collection=docs,
    index_name="sample_index",
    max_document_length=180,
    split_documents=False
)

# search index
results = RAG.search(query=query, k=3)
Results:
A cat is sitting outside: 14.03
A dog is playing guitar: 9.33
The new movie is awesome: 5.30
Developer Note: RAGatouille
RAGatouille is one of the common libraries for using ColBERT in Python-based RAG pipelines.
It simplifies the complex original Stanford implementation into a few lines of code for indexing and retrieval, and integrates with major AI frameworks like LangChain and LlamaIndex.
It is currently transitioning from the original Stanford ColBERT backend to PyLate for better compatibility.
◼ Performance Summary
In all three cases, the sentence 'A cat is sitting outside' achieved the highest score, as it was identified as the closest match to the query embedding.
However, their practical applications diverge:
Table 3. Performance Matrix: Bi-Encoder vs. Instruction Tuning vs. Late Interaction.
Sparse Embedding in Action
Sparse embedding is a vector representation where most values are zero.
Unlike dense embedding, sparse embedding maps a specific token or keyword to a non-zero value, making them highly interpretable and excellent for exact keyword retrieval.
Common tactics involve:
Best Matching 25 (BM25).
Sparse Lexical and Expansion (SPLADE).
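Before diving into either method, a hand-rolled bag-of-words sketch shows what "mostly zero" looks like in practice for the sample docs. This is neither BM25 nor SPLADE, just the shared sparse shape; single-letter tokens like "a" are dropped for simplicity.

```python
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]

# build the vocabulary (ignoring single-letter tokens like "a")
vocab = sorted({w for d in docs for w in d.lower().split() if len(w) > 1})

# one row per document, one column per vocabulary word: mostly zeros
matrix = [[d.lower().split().count(w) for w in vocab] for d in docs]

zeros = sum(v == 0 for row in matrix for v in row)
sparsity = zeros / (len(matrix) * len(vocab))
```

Each non-zero entry maps directly to a real word, which is what makes sparse vectors interpretable, and the fraction of zeros only grows as the vocabulary does.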
◼ Best Matching 25 (BM25)
The industry standard for keyword search. It ranks documents based on the appearance of query terms, adjusting for document length.
In the following code snippet, I added "A cat" and "cat" as additional queries:
from rank_bm25 import BM25Okapi

# bm25 works on words (tokens) not vectors
tokenized_corpus = [doc.lower().split(" ") for doc in docs]

# initialize bm25
bm25 = BM25Okapi(tokenized_corpus)

queries = ["Tell me about felines", "A cat", "cat"]
for query in queries:
    tokenized_query = query.lower().split(" ")

    # score each doc
    doc_scores = bm25.get_scores(tokenized_query)

    # get top 3
    top_n = bm25.get_top_n(tokenized_query, docs, n=3)
▫ Results
Query: Tell me about felines
A cat is sitting outside: 0.0000
A dog is playing guitar: 0.0000
The new movie is awesome: 0.0000
Query: A cat
A cat is sitting outside: 0.5661
A dog is playing guitar: 0.0552
The new movie is awesome: 0.0000
Query: cat
A cat is sitting outside: 0.5108
A dog is playing guitar: 0.0000 (tie)
The new movie is awesome: 0.0000 (tie)
In the query "A cat", the score of the Rank 2 sentence "A dog is playing guitar" is far lower than the score of the Rank 1 sentence.
This is because BM25 uses a specific formula to ensure common words (like "the" or "is") don't drown out rare, important words, balancing:
Term Frequency (TF): How many times does the word appear? The more, the better.
Inverse Document Frequency (IDF): Is this word rare in the whole collection? Rare words like "feline" get more points than "the".
Document length normalization: Penalizes very long documents so that a 500-page book doesn't win just because it happens to repeat the keyword by accident.
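The IDF component can be sketched in a few lines, using BM25's IDF formula with the standard +0.5 smoothing over the sample docs:

```python
import math

docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

def idf(term):
    # BM25's inverse document frequency: rare terms get large weights,
    # terms appearing in every doc get weights near zero
    n = sum(term in doc for doc in tokenized)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

idf_cat = idf("cat")  # in 1 of 3 docs: high weight
idf_is = idf("is")    # in all 3 docs: near zero
</antml_code_interrupted>```

This weighting is why "cat" dominates the "A cat" query above while the ubiquitous "a" and "is" contribute almost nothing.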
◼ Sparse Lexical and Expansion (SPLADE)
Sparse Lexical and Expansion (SPLADE) is an AI-powered BM25: a neural network adds weight to latent words to expand a source document with related keywords.
Specifically, SPLADE takes the following two steps:
Sparsity: Like BM25, it transforms a sentence into a sparse vector, caring only about specific words, which keeps it efficient for standard search engines like Elasticsearch.
Expansion: It adds weight to latent words to fix the vocabulary mismatch problem.
For example, if a document says "A cat is sitting outside," SPLADE internally adds weights for words like "feline," "pet," or "animal," even if they don't appear in the text.
This can improve the precision of keyword search while keeping the intelligence of neural search.
I'll use the splade-v3 model from Naver Labs (creator of SPLADE) for demonstration:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load model
model_id = "naver/splade-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def get_splade_vector(text):
    # minimal SPLADE pooling: log(1 + ReLU(logits)), max-pooled over tokens
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
    weights = torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1)
    return torch.max(weights, dim=1).values.squeeze()

# encode
query_vec = get_splade_vector(query)
doc_vecs = [get_splade_vector(d) for d in docs]

# calc similarity score (dot product)
scores = []
for i, d_vec in enumerate(doc_vecs):
    score = torch.dot(query_vec, d_vec).item()
    scores.append((docs[i], score))
▫ Results
A cat is sitting outside: 9.4954
A dog is playing guitar: 0.1755
The new movie is awesome: 0.0000
SPLADE successfully finds the sentence most similar to the query: A cat is sitting outside.
◼ Comparison of BM25 vs. SPLADE
So, here is the comparison of BM25 and SPLADE:
Table 4. Comparison of BM25 and SPLADE.
Hybrid Embedding in Action
Relying on just one type of vector sometimes fails in production.
For example, dense vectors are great at finding "adorable kittens" when searching for "cute cats", but might fail to find a specific character like "Marie" from The Aristocats (even though she is an adorable kitten):

Figure. Marie from the Aristocats (Disney)
Hybrid embedding can avoid this challenge by combining dense and sparse embeddings.
Its common methods involve:
Reciprocal Rank Fusion (RRF).
Reranking.
◼ Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) only cares about a document's rank within each result list, rather than the absolute scores:

RRF(d) = Σ_R 1 / (k + r(d))    (Eq. 4)
Where:
r(d) is the rank of document d in list R, and
k is a constant (usually k = 60) to prevent top-ranked documents from overwhelming the rest.
I'll use the Bi-Encoder and SPLADE rankings for demonstration:
# sort the docs by rank
rank_list_biencoder = [
    'A cat is sitting outside',
    'The new movie is awesome',
    'A dog is playing guitar'
]
rank_list_splade = [
    'A cat is sitting outside',
    'A dog is playing guitar',
    'The new movie is awesome'
]

# compute rrf scores for each doc in both rank lists
k = 60
scores = {}
for rank_list in [rank_list_biencoder, rank_list_splade]:
    for rank, doc in enumerate(rank_list):
        scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
18
▫ Results
A cat is sitting outside: 0.5000
A dog is playing guitar: 0.0366 (tie)
The new movie is awesome: 0.0366 (tie)
Bi-encoder excels at understanding that “feline” and “cat” are related even if the words don’t match.
On the other hand, SPLADE excels at identifying specific relevant words.
RRF acts as the referee; if a document appears high in both lists, its RRF score will skyrocket. If it only appears in one, it stays in the middle.
Because the sentence "A cat is sitting outside" ranks first in the lists of both the bi-encoder and SPLADE, it yielded the highest score.
◼ Reranking
Reranking is a method where the system applies a quick dense search to get the top results first, and then passes the top results through cross-encoders to give a final score.
Cross-encoders are trained to rank items rather than output a probability of correctness, examining the query and the document together.
Negative scores mean the model considers the document unlikely to be a perfect match, while positive scores mean the model is confident it is a match.
For demonstration, I'll use the CrossEncoder module from the sentence_transformer library:
from sentence_transformers import CrossEncoder

# load the model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# create a query-doc pair
sentence_pairs = [[query, doc] for doc in docs]

# score
scores = model.predict(sentence_pairs)
11
▫ Results
A cat is sitting outside: -10.0513
A dog is playing guitar: -10.9967
The new movie is awesome: -11.3669
Although all the sentences are recognized as "unlikely to be a perfect match", the sentence "A cat is sitting outside" is still the closest match among all.
Wrapping Up
Vector databases and vectorization are key to handling unstructured data.
They can enable semantic search and provide long-term memory for LLMs through RAG.
◼ The Storage Landscape: Choosing Your Vector Storage Tier
If you are looking for where to store these vectors, the market is split into four camps:
Table 5. Market Analysis: Vector-Native vs. Traditional Database Providers.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
A Technical Roadmap to RAG Architectures and Decision Logic (2026 Edition)
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Understanding Vector Databases and Embedding Pipelines" in Kernel Labs
https://kuriko-iwai.com/vector-databases-and-embedding-strategies-guide
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
