Understanding Vector Databases and Embedding Pipelines
Explore the mechanics of vector databases, text embedding (Dense, Sparse, Hybrid), and similarity metrics like Cosine Similarity with coding examples.
By Kuriko IWAI

Table of Contents
Introduction
What is a Vector Database - The Architecture of Semantic Meaning
The Vectorization Pipeline
Measuring Vector Relationships
Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics
Wrapping Up

Introduction
Modern Artificial Intelligence (AI) is built to understand the nuance, context, and semantic meaning of data.
But a significant challenge persists: traditional databases are built for keyword matching, whereas AI models think in high-dimensional patterns. This creates a gap between how humans store data and how AI actually processes it.
The vector database bridges this gap by storing information as mathematical coordinates, allowing AI to retrieve data based on similarity rather than just syntax.
This article explores the inner workings of vector databases and the critical process of vectorization with practical implementations of text embedding, comparing dense, sparse, and hybrid embedding techniques.
What is a Vector Database - The Architecture of Semantic Meaning
A vector database (DB) is a specialized system designed to store, index, and query vector embeddings: long arrays of numbers that represent the semantic meaning of unstructured data like text, images, audio, and video.
The diagram below illustrates how unstructured data is stored in the vector database:

Figure A. Diagram of the vector database workflow showing raw data ingestion, embedding generation, and storage in a 3D coordinate space (Created by Kuriko IWAI)
For example, in a simplified 3-dimensional space (right, Figure A), unstructured text data:
I am a cat.
becomes a vector embedding:

[0.12, -0.98, 0.45]    (Eq. a)
In real-world models like Gemini or BERT, these vectors are much larger with 768 or 1,536 dimensions.
Each number represents a specific feature of the sentence's semantic meaning that the AI has learned during pre-training.
◼ Why Similarity Search Trumps Traditional SQL
Traditional databases like SQL can only search for exact matches.
So when a user queries "cats", they'd miss documents about "animals", "pets", or "felines" if those documents don't contain the exact word "cats".
Vector databases, by contrast, can perform similarity searches by finding the vector embeddings closest to the given query.
Because "felines" sits close to "cats" in the vector space, a similarity search can surface documents that only mention "felines" and never the exact word "cats".
This is useful for:
Semantic search: Finds information based on semantic meaning, not just text.
Retrieval-Augmented Generation (RAG): Provides LLMs with long-term memory relevant to the user query.
Recommendation engines: Suggests products based on user behavior patterns.
The Vectorization Pipeline
Vectorization is the process of converting unstructured, raw data into vector embeddings to populate a vector database (left → middle, Figure A).
The process involves the following five steps:
Load: Pull data from sources (PDFs, Notion, SQL, Slack).
Clean: Remove noise like headers, footers, or HTML tags.
Chunk: Split long data into smaller pieces.
Embed: Pass the chunks through an embedding model to turn them into lists of vectors.
Index: Store the vectors in a Vector database.
Through these steps, vectorization dictates how data is fed into the model, a process as important as selecting the model itself.
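The five steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not a production pipeline: the loader works on in-memory strings, and `toy_model` is a stand-in for a real embedding model like all-MiniLM-L6-v2.

```python
import re

def load(sources):
    # Step 1 - Load: pull raw text from each source (stubbed as plain strings)
    return [s for s in sources]

def clean(text):
    # Step 2 - Clean: strip HTML tags and collapse whitespace
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text, size=50):
    # Step 3 - Chunk: naive fixed-size splitting
    return [text[i:i + size] for i in range(0, len(text), size)]

def toy_model(text):
    # stand-in embedding: vowel counts (a real pipeline calls a model here)
    return [text.count(ch) for ch in "aeiou"]

def embed(chunks, model):
    # Step 4 - Embed: turn each chunk into a vector
    return [model(c) for c in chunks]

def index(vectors, store):
    # Step 5 - Index: persist the vectors in the vector store
    store.extend(vectors)
    return store

store = []
for doc in load(["<p>I am a   cat.</p>"]):
    index(embed(chunk(clean(doc)), toy_model), store)
```

The same skeleton holds for real pipelines; only the loader, cleaner, chunker, and model get swapped for production-grade components.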
Embedding has three primary types: Text, Vision, and Audio embeddings.
◼ Text Embedding
Text embedding is the most common strategy to vectorize text data while capturing its semantic meaning and context.
Chunking (Step 3) plays a key role in the process: chunks that are too big confuse the model, while chunks that are too small lose context.
To tackle this challenge, there are two primary chunking strategies:
Late chunking: Embed the entire document first, then break it into small chunks, so each chunk retains the context of the whole document.
Semantic chunking: Use AI to find natural breaks in meaning, instead of splitting at a fixed number of characters (e.g., every 500 characters).
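As a toy illustration of the trade-off, the sketch below contrasts fixed-size splitting with a crude sentence-boundary proxy for semantic chunking. A real semantic chunker would compare embeddings of adjacent sentences rather than split on punctuation.

```python
import re

def fixed_chunking(text, size=40):
    # fixed-size chunking: split every `size` characters, ignoring meaning
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunking(text):
    # crude stand-in for semantic chunking: break at sentence boundaries
    # instead of a fixed character count
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "A cat is sitting outside. The new movie is awesome."
fixed = fixed_chunking(doc)        # may cut mid-sentence
semantic = sentence_chunking(doc)  # keeps each sentence intact
```

Here the fixed-size version slices the first sentence's context into the second chunk, while the boundary-aware version keeps each thought whole.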
The primary text embedding methods employ these chunking strategies as follows:
Table 1. Comparison of Embedding Methods: Dense, Sparse, and Hybrid.
◼ Vision Embedding
Vision embedding treats images as data points, enabling direct image search without manual labelling:

Figure B. Architecture of a Vision Embedding model (CNN) processing images into patches for vector representation (Created by Kuriko IWAI)
Its key methods include:
Contrastive Language-Image Pre-training (CLIP): Maps images and text into the same vector space to search images with words.
Vision Transformers (ViT): Breaks images into patches to process them like text tokens.
Major players like OpenAI (CLIP), Meta (DINOv2), Google (SigLIP) have developed their own models.
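The ViT patch-splitting step can be sketched with NumPy. The 4x4 grayscale "image" and 2x2 patch size below are assumptions chosen to keep the output readable; real ViTs use, e.g., 224x224 images and 16x16 patches.

```python
import numpy as np

# a hypothetical 4x4 grayscale "image"
image = np.arange(16).reshape(4, 4)

def to_patches(img, patch=2):
    # split the image into non-overlapping patch x patch tiles,
    # then flatten each tile into a token-like vector
    h, w = img.shape
    tiles = [
        img[r:r + patch, c:c + patch].flatten()
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]
    return np.stack(tiles)

tokens = to_patches(image)  # four patches, each flattened to four pixels
```

Each row of `tokens` then plays the role of a word token and is fed through a transformer encoder, just like text.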
◼ Audio Embedding
Audio embedding is useful for music recommendation, speech recognition, or identifying sound-alike patterns in industrial sensor data.

Figure C. Process flow of raw audio signals being converted into spectrograms and embedded via an encoder network (source)
Its key methods include:
Spectrogram Analysis: Converts sound waves into visual representations and then embedding them.
Contrastive Language-Audio Pre-training (CLAP): Similar to CLIP but applies to sound.
Key players include Microsoft (CLAP) and Meta (Audio2Vec).
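The spectrogram step can be sketched with SciPy. The 440 Hz sine wave below is synthetic stand-in audio; a real audio embedder would feed the resulting spectrogram into an encoder network rather than just inspect it.

```python
import numpy as np
from scipy.signal import spectrogram

# one second of a 440 Hz sine wave at a 16 kHz sample rate (synthetic audio)
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

# convert the waveform into a time-frequency image
freqs, times, spec = spectrogram(wave, fs=sr)

# sanity check: the dominant frequency bin should sit near 440 Hz
dominant = freqs[spec.mean(axis=1).argmax()]
```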
Measuring Vector Relationships
Once the raw data is vectorized, the model calculates the proximity between vectors using specific mathematical formulas.
Vectors that are closer together share similar semantic, audio, or visual meanings.
The diagram below shows how vector math interprets semantic meaning, taking as an example two text embeddings representing words like "cat" in a 2-dimensional space:

Figure D. Geometric representation of vector relationships showing small, wide, and zero-degree angles between concepts (Created by Kuriko IWAI)
When the two vectors represent related concepts (left, Cat vs. Dog, Figure D), they point in a similar direction with a small angle θ, indicating that "Cat" and "Dog" share many features (both are pets, mammals, have four legs, etc.) in the two-dimensional vector space.
When the two vectors represent unrelated concepts (middle, Cat vs. Crocodile, Figure D), the vectors are pointing in very different directions with a wide θ.
Lastly, when the vectors share the identical concept but with different magnitude (right, Cat vs. Cat Cat Cat, Figure D), both vectors sit on the exact same line (θ = 0), but vector B is much longer.
This represents a situation where the topic is identical but the magnitude differs (perhaps a document that mentions "Cat" many more times while covering the same concept).
But how can we measure the differences between the vectors?
This section explores the three mathematical metrics:
Dot Product,
Cosine Similarity, and
Euclidean Distance.
◼ Dot Product
The dot product is the fundamental operation that measures the relationship between two vectors, considering both their orientations and magnitudes.
The dot product of the two vectors A and B in an n-dimensional space is generalized as:

A · B = Σ_{i=1}^{n} A_i B_i    (Eq. 1.1)

where A_i and B_i represent the i-th entries of the vectors A and B, respectively.
Alternatively, when the angle θ between the two vectors is known, Eq. 1.1 can be written as:

A · B = ||A|| ||B|| cos θ    (Eq. 1.2)

where ||A|| and ||B|| represent the magnitudes of the vectors A and B, respectively.
In the case of Figure D, when Vector A (Cat) = [2, 2], each scenario is measured:
Table 2.1. Similarity Metric Comparison (Dot Product)
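The dot products can be verified with NumPy. Vector A (Cat) = [2, 2] comes from the text; the B coordinates for each scenario are illustrative assumptions matching the geometry of Figure D.

```python
import numpy as np

A = np.array([2, 2])  # Cat (given in the text)
scenarios = {
    "Dog (similar direction)": np.array([3, 1]),        # assumed coordinates
    "Crocodile (different direction)": np.array([2, -2]),
    "Cat Cat Cat (same line, 3x longer)": 3 * A,
}

results = {}
for name, B in scenarios.items():
    # Eq. 1.1: sum of element-wise products
    dot = int(np.dot(A, B))
    # Eq. 1.2 cross-check: cos(theta) recovered from the same dot product
    cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
    results[name] = (dot, round(float(cos_theta), 2))
```

The related concept yields a large positive dot product, the unrelated one lands near zero, and the scaled vector inflates the score purely through magnitude.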
◼ Cosine Similarity
The cosine similarity focuses purely on the orientation between two vectors A and B, effectively ignoring their magnitudes.
The formula is essentially the dot product (Eq. 1.1) divided by the product of the vectors' magnitudes ||A|| ||B||:

cos θ = (Σ_{i=1}^{n} A_i B_i) / (||A|| ||B||)    (Eq. 2)

where A_i and B_i represent the i-th entries of the vectors A and B, respectively.
This makes cosine similarity perfect for text analysis, where word frequency (magnitude) may vary but the context (direction) remains the same.
In the case of Figure D, using the same Vector A & B coordinates, the similarity of each scenario is measured:
Table 2.2. Similarity Metric Comparison (Cosine Similarity).
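A minimal sketch of cosine similarity's magnitude invariance, reusing Vector A = [2, 2] from the text and assumed coordinates for the other vectors:

```python
import numpy as np

def cosine(a, b):
    # dot product divided by the product of magnitudes: direction only
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([2, 2])   # "Cat"
B = 3 * A              # "Cat Cat Cat": same direction, 3x the magnitude
C = np.array([2, -2])  # illustrative unrelated direction

sim_same = cosine(A, B)  # ~1.0 despite the magnitude gap
sim_diff = cosine(A, C)  # 0.0: orthogonal directions
```

Tripling the vector's length leaves the cosine score untouched, which is exactly why word-frequency differences don't distort it.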
◼ Euclidean Distance (L2 Norm)
The Euclidean distance (L2) measures the absolute displacement in space between two points, p and q:

d(p, q) = √( Σ_{i=1}^{n} (p_i − q_i)² )    (Eq. 3)
where:
d: The Euclidean distance (the scalar result).
p, q: The two vectors (or points) in n-dimensional space.
i: The index of the current dimension (from 1 to n).
p_i, q_i: The specific coordinates of vectors p and q at the i-th dimension.
The Euclidean distance is highly sensitive to the magnitudes of the vectors because it measures the absolute positions of the points in space.
If you take two vectors pointing in the exact same direction and multiply the values of one by 10, the Euclidean distance increases significantly because the tip of the scaled vector is physically much further away in the coordinate system.
This means that when comparing two documents such that:
p: A short document.
q: A much longer version of document p that repeatedly uses the same words.
the Euclidean distance will place them far apart, even though the context is similar, simply because of q's much higher word count.
In the case of Figure D, the similarity of each scenario is measured, using the same Vector A & B coordinates:
Table 2.3. Similarity Metric Comparison (Euclidean distance).
The third scenario (Cat vs Cat Cat Cat) illustrates the trap of the Euclidean distance: it shows the largest gap even though the meaning is identical, simply because Vector B is much longer.
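The scaling sensitivity described above can be checked numerically. The coordinates below are illustrative:

```python
import numpy as np

p = np.array([2, 2])
q = 10 * p               # same direction, every value multiplied by 10

near = np.array([3, 1])  # a nearby point with a similar magnitude (illustrative)

d_near = float(np.linalg.norm(p - near))  # small gap
d_scaled = float(np.linalg.norm(p - q))   # large gap, despite identical direction
```

Even though `p` and `q` point the same way, the L2 distance between them dwarfs the distance to a genuinely different nearby point.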
◼ When to Use Which Metric
Given how sensitive these methods are to scale, each mathematical approach serves a specific use case:
Table 2.4. Use Cases by Similarity Metrics.
Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics
As Figure A shows, creating text embeddings involves converting human language into a vector that a computer can reason with.
These tactics range from simple keyword counting to advanced models that understand intent and instructions.
In every case, the system must vectorize the data and the query with the same embedding model, then calculate a similarity score:

Figure E. Workflow of computing similarity scores between a query embedding and a document corpus in a RAG pipeline (Created by Kuriko IWAI)
I'll use the following sample data and query to demonstrate the scores produced by the primary methods:
Dense embedding.
Sparse embedding.
Hybrid embedding.
Sample data:
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome"
]

query = "Tell me about felines"
Dense Embedding in Action
Dense embedding is the gold standard for modern AI.
The embedding model uses a small, fixed number of dimensions filled with non-zero numbers to capture broad semantic meaning. No single position maps to a specific word; instead, the combination of numbers represents the meaning.
In the case of Eq. a, each dimension can represent a weight on a scale, such as:
Dimension 1 (0.12): Living vs. Non-living (where 1.0 is a human and -1.0 is a rock).
Dimension 2 (-0.98): Size (where -1.0 is tiny and 1.0 is massive).
Dimension 3 (0.45): Domesticity (where 1.0 is a pet and -1.0 is a wild predator).
In this scenario, the first dimension 0.12 suggests the subject is "somewhat living/animate" but perhaps not as high-ranking as a human in the model's hierarchy.
Common tactics involve:
Bi-Encoder.
Instruction Tuning.
Late Interaction.
◼ Bi-Encoder
Bi-encoder is the most common method for text embedding.
It encodes the query and the document separately. It's fast because you can pre-calculate the document vectors.
from sentence_transformers import SentenceTransformer, util

# load bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# encode
doc_emb = model.encode(docs)
query_emb = model.encode(query)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_emb)
The resulting doc_emb (the vectorized docs) looks like the following:
[[ 0.1230927 -0.00072824 0.04190801 ... 0.03736668 -0.03583647 0.06841106]
[ 0.02273473 -0.02657051 0.03814451 ... 0.03001389 0.09356179 0.02145582]
[-0.10044324 -0.07739273 -0.00137412 ... -0.00104974 0.07181141 0.02205478]]
Results:
A cat is sitting outside: 0.2984
A dog is playing guitar: 0.1693
The new movie is awesome: 0.1015
◼ Instruction Tuning
Instruction tuning is the process of fine-tuning a pre-trained model on a dataset of explicit instructions (e.g., "Summarize the document," or "Translate to French").
The instruction tells the model (e.g., Hugging Face open-source models like E5 or BGE) how to shape the vector based on the task goal, ensuring the output aligns with the user's specific intent rather than just the next most likely word.
The following snippet uses BGE:
from sentence_transformers import SentenceTransformer, util

# load bge model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# bge instruction
instruction = "Represent this sentence for searching relevant passages: "

# encode (with instruction)
query_emb = model.encode(instruction + query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_embs)[0]
Results:
A cat is sitting outside: 0.4718
A dog is playing guitar: 0.4010
The new movie is awesome: 0.3493
◼ Late Interaction (ColBERT)
Instead of one vector per document, Late Interaction keeps multiple vectors (one per token), allowing for much more granular matching.
from ragatouille import RAGPretrainedModel

# load model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# colbert creates an index to store token-level embeddings
RAG.index(
    collection=docs,
    index_name="sample_index",
    max_document_length=180,
    split_documents=False
)

# search index
results = RAG.search(query=query, k=3)
Results:
A cat is sitting outside: 14.03
A dog is playing guitar: 9.33
The new movie is awesome: 5.30
Developer Note: RAGatouille
RAGatouille is one of the common libraries for using ColBERT in Python-based RAG pipelines.
It simplifies the complex original Stanford implementation into a few lines of code for indexing and retrieval, and integrates with major AI frameworks like LangChain and LlamaIndex.
It is currently transitioning from the original Stanford ColBERT backend to PyLate for better compatibility.
◼ Performance Summary
In all three cases, the sentence 'A cat is sitting outside' achieved the highest score, as it was identified as the closest match to the query embedding.
However, their practical applications diverge:
Table 3. Performance Matrix: Bi-Encoder vs. Instruction Tuning vs. Late Interaction.
Sparse Embedding in Action
Sparse embedding is a vector representation where most values are zero.
Unlike dense embedding, sparse embedding maps a specific token or keyword to a non-zero value, making them highly interpretable and excellent for exact keyword retrieval.
Common tactics involve:
Best Matching 25 (BM25).
Sparse Lexical and Expansion (SPLADE).
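Before diving into either method, a hand-rolled bag-of-words sketch shows what "mostly zero" looks like in practice for the sample docs. This is neither BM25 nor SPLADE, just the shared sparse shape; single-letter tokens like "a" are dropped for simplicity.

```python
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]

# build the vocabulary (ignoring single-letter tokens like "a")
vocab = sorted({w for d in docs for w in d.lower().split() if len(w) > 1})

# one row per document, one column per vocabulary word: mostly zeros
matrix = [[d.lower().split().count(w) for w in vocab] for d in docs]

zeros = sum(v == 0 for row in matrix for v in row)
sparsity = zeros / (len(matrix) * len(vocab))
```

Each non-zero entry maps directly to a real word, which is what makes sparse vectors interpretable, and the fraction of zeros only grows as the vocabulary does.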
◼ Best Matching 25 (BM25)
The industry standard for keyword search. It ranks documents based on the appearance of query terms, adjusting for document length.
In the following code snippet, I added "A cat" and "cat" as additional queries:
from rank_bm25 import BM25Okapi

# bm25 works on words (tokens) not vectors
tokenized_corpus = [doc.lower().split(" ") for doc in docs]

# initialize bm25
bm25 = BM25Okapi(tokenized_corpus)

queries = ["Tell me about felines", "A cat", "cat"]
for query in queries:
    tokenized_query = query.lower().split(" ")

    # score each doc
    doc_scores = bm25.get_scores(tokenized_query)

    # get top 3
    top_n = bm25.get_top_n(tokenized_query, docs, n=3)
▫ Results
Query: Tell me about felines
A cat is sitting outside: 0.0000
A dog is playing guitar: 0.0000
The new movie is awesome: 0.0000
Query: A cat
A cat is sitting outside: 0.5661
A dog is playing guitar: 0.0552
The new movie is awesome: 0.0000
Query: cat
A cat is sitting outside: 0.5108
A dog is playing guitar: 0.0000 (tie)
The new movie is awesome: 0.0000 (tie)
In the query "A cat", the score of the Rank 2 sentence "A dog is playing guitar" is far lower than the score of the Rank 1 sentence.
This is because BM25 uses a specific formula to ensure common words (like "the" or "is") don't drown out rare, important words, balancing:
Term Frequency (TF): How many times does the word appear? The more, the better.
Inverse Document Frequency (IDF): Is this word rare in the whole collection? Rare words like "feline" get more points than "the".
Document length normalization: Penalizes very long documents so that a 500-page book doesn't win just because it happens to repeat the keyword by accident.
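The IDF component can be sketched in a few lines, using BM25's IDF formula with the standard +0.5 smoothing over the sample docs:

```python
import math

docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

def idf(term):
    # BM25's inverse document frequency: rare terms get large weights,
    # terms appearing in every doc get weights near zero
    n = sum(term in doc for doc in tokenized)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

idf_cat = idf("cat")  # in 1 of 3 docs: high weight
idf_is = idf("is")    # in all 3 docs: near zero
</antml_code_interrupted>```

This weighting is why "cat" dominates the "A cat" query above while the ubiquitous "a" and "is" contribute almost nothing.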
◼ Sparse Lexical and Expansion (SPLADE)
Sparse Lexical and Expansion (SPLADE) is an AI-powered BM25: a neural network adds weight to latent words to expand a source document with related keywords.
Specifically, SPLADE takes the following two steps:
Sparsity: Like BM25, it transforms a sentence into a sparse vector, caring only about specific words, which keeps it efficient for standard search engines like Elasticsearch.
Expansion: It adds weight to latent words to fix the vocabulary mismatch problem.
For example, if a document says "A cat is sitting outside," SPLADE internally adds weights for words like "feline," "pet," or "animal," even if they don't appear in the text.
This can improve the precision of keyword search while keeping the intelligence of neural search.
I'll use the splade-v3 model from Naver Labs (creator of SPLADE) for demonstration:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load model
model_id = "naver/splade-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def get_splade_vector(text):
    # minimal SPLADE pooling: log(1 + ReLU(logits)), max-pooled over tokens
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
    weights = torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1)
    return torch.max(weights, dim=1).values.squeeze()

# encode
query_vec = get_splade_vector(query)
doc_vecs = [get_splade_vector(d) for d in docs]

# calc similarity score (dot product)
scores = []
for i, d_vec in enumerate(doc_vecs):
    score = torch.dot(query_vec, d_vec).item()
    scores.append((docs[i], score))
▫ Results
A cat is sitting outside: 9.4954
A dog is playing guitar: 0.1755
The new movie is awesome: 0.0000
SPLADE successfully finds the sentence most similar to the query: A cat is sitting outside.
◼ Comparison of BM25 vs. SPLADE
So, here is the comparison of BM25 and SPLADE:
Table 4. Comparison of BM25 and SPLADE.
Hybrid Embedding in Action
Relying on just one type of vector sometimes fails in production.
For example, dense vectors are great at finding "adorable kittens" when searching for "cute cats", but might fail to find a specific character like "Marie" from The Aristocats (even though she is an adorable kitten):

Figure. Marie from the Aristocats (Disney)
Hybrid embedding can avoid this challenge by combining dense and sparse embeddings.
Its common methods involve:
Reciprocal Rank Fusion (RRF).
Reranking.
◼ Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) only cares about a document's rank within each result list, rather than the absolute scores:

RRF(d) = Σ_R 1 / (k + r(d))    (Eq. 4)
Where:
r(d) is the rank of document d in list R, and
k is a constant (usually k = 60) to prevent top-ranked documents from overwhelming the rest.
I'll use the Bi-Encoder and SPLADE rankings for demonstration:
# sort the docs by rank
rank_list_biencoder = [
    'A cat is sitting outside',
    'The new movie is awesome',
    'A dog is playing guitar'
]
rank_list_splade = [
    'A cat is sitting outside',
    'A dog is playing guitar',
    'The new movie is awesome'
]

# compute rrf scores for each doc in both rank lists
k = 60
scores = {}
for rank_list in [rank_list_biencoder, rank_list_splade]:
    for rank, doc in enumerate(rank_list):
        scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
18
▫ Results
A cat is sitting outside: 0.5000
A dog is playing guitar: 0.0366 (tie)
The new movie is awesome: 0.0366 (tie)
Bi-encoder excels at understanding that “feline” and “cat” are related even if the words don’t match.
On the other hand, SPLADE excels at identifying specific relevant words.
RRF acts as the referee; if a document appears high in both lists, its RRF score will skyrocket. If it only appears in one, it stays in the middle.
Because the sentence "A cat is sitting outside" ranks first in the lists of both the bi-encoder and SPLADE, it yielded the highest score.
◼ Reranking
Reranking is a method where the system applies a quick dense search to get the top results first, and then passes the top results through cross-encoders to give a final score.
Cross-encoders are trained to rank items rather than output a probability of correctness, examining the query and the document together.
Negative scores mean the model considers the document unlikely to be a perfect match, while positive scores mean the model is confident it is a match.
For demonstration, I'll use the CrossEncoder module from the sentence_transformer library:
from sentence_transformers import CrossEncoder

# load the model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# create a query-doc pair
sentence_pairs = [[query, doc] for doc in docs]

# score
scores = model.predict(sentence_pairs)
11
▫ Results
A cat is sitting outside: -10.0513
A dog is playing guitar: -10.9967
The new movie is awesome: -11.3669
Although all the sentences are recognized as "unlikely to be a perfect match", the sentence "A cat is sitting outside" is still the closest match among all.
Wrapping Up
Vector databases and vectorization are key to handling unstructured data.
They can enable semantic search and provide long-term memory for LLMs through RAG.
◼ The Storage Landscape: Choosing Your Vector Storage Tier
If you are looking for where to store these vectors, the market is split into four camps:
Table 5. Market Analysis: Vector-Native vs. Traditional Database Providers.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
A Technical Roadmap to RAG Architectures and Decision Logic (2026 Edition)
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Understanding Vector Databases and Embedding Pipelines" in Kernel Labs
https://kuriko-iwai.com/vector-databases-and-embedding-strategies-guide
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
