Understanding Vector Databases and Embedding Pipelines

Explore the mechanics of vector databases, text embedding (Dense, Sparse, Hybrid), and similarity metrics like Cosine Similarity with coding examples.

Machine Learning · Deep Learning · Data Science · Python · Agentic AI · LLM

By Kuriko IWAI

Table of Contents

Introduction
What is a Vector Database - The Architecture of Semantic Meaning
Why Similarity Search Trumps Traditional SQL
The Vectorization Pipeline
Text Embedding
Vision Embedding
Audio Embedding
Measuring Vector Relationships
Dot Product
Cosine Similarity
Euclidean Distance (L2 Norm)
When to Use Which Metric
Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics
Dense Embedding in Action
Bi-Encoder
Instruction Tuning
Late Interaction (ColBERT)
Performance Summary
Sparse Embedding in Action
Best Matching 25 (BM25)
Sparse Lexical and Expansion (SPLADE)
Comparison of BM25 vs. SPLADE
Hybrid Embedding in Action
Reciprocal Rank Fusion (RRF)
Reranking
Wrapping Up
The Storage Landscape: Choosing Your Vector Storage Tier

Introduction

Artificial Intelligence (AI) models can understand the nuance, context, and semantic meaning of data.

But a significant challenge persists: traditional databases are built for keyword matching, whereas AI models think in high-dimensional patterns. This creates a gap between how humans store data and how AI actually processes it.

The vector database bridges this gap by storing information as mathematical coordinates, allowing AI to retrieve data based on similarity rather than just syntax.

This article explores the inner workings of vector databases and the critical process of vectorization with practical implementations of text embedding, comparing dense, sparse, and hybrid embedding techniques.

What is a Vector Database - The Architecture of Semantic Meaning

A vector database (DB) is a specialized system designed to store, index, and query vector embeddings: long arrays of numbers that represent the semantic meaning of unstructured data like text, images, audio, and video.

The diagram below illustrates how unstructured data is stored in a vector database:

Figure A. Diagram of the vector database workflow showing raw data ingestion, embedding generation, and storage in a 3D coordinate space (Created by Kuriko IWAI)

For example, in a simplified 3-dimensional space (right, Figure A), unstructured text data:

I am a cat.

becomes a vector embedding:

V = [0.12, -0.98, 0.45] \quad \cdots \text{(a)}

In real-world models like Gemini or BERT, these vectors are much larger, with 768 or 1,536 dimensions.

Each number represents a specific feature of the sentence's semantic meaning that the AI has learned during pre-training.

Traditional databases like SQL can only search for exact matches.

So, when a user queries "cats", they'd miss documents about "animals", "pets", or "felines" if those documents don't include the exact word "cats".

Conversely, vector databases perform similarity searches by finding the vector embeddings closest to the given query.

Because "felines" sits close to "cats" in the vector space, a similarity search can find documents that mention only "felines" and never the exact word "cats".

This is useful for:

  • Semantic search: Finds information based on semantic meaning, not just text.

  • Retrieval-Augmented Generation (RAG): Provides LLMs with long-term memory relevant to the user query.

  • Recommendation engines: Suggests products based on user behavior patterns.

The Vectorization Pipeline

Vectorization is the process of converting unstructured, raw data into vector embeddings to populate a vector database (left → middle, Figure A).

The process involves the following five steps:

  1. Load: Pull data from sources (PDFs, Notion, SQL, Slack).

  2. Clean: Remove noise like headers, footers, or HTML tags.

  3. Chunk: Split long data into smaller pieces.

  4. Embed: Pass the chunks through an embedding model to turn them into lists of vectors.

  5. Index: Store the vectors in a Vector database.

Through these steps, vectorization dictates how data is fed into the model, a process as important as selecting the model itself.
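As a toy illustration of the five steps, here is a minimal sketch; the regex cleaner, fixed-size chunker, and the stand-in embed function are simplifications of my own, not a real loader or embedding model:

```python
import re

def clean(text: str) -> str:
    """Step 2: strip HTML tags and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip()

def chunk(text: str, size: int = 40) -> list[str]:
    """Step 3: naive fixed-size chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[float]:
    """Step 4: stand-in for a real embedding model."""
    return [len(chunk_text) / 100, chunk_text.count("cat") / 10]

# Step 1 (load) is mocked with a raw HTML string; Step 5 (index) is a plain dict
raw = "<p>A cat is   sitting outside.</p>"
index = {c: embed(c) for c in chunk(clean(raw))}
print(index)
```

A production pipeline swaps each stand-in for a real component (a document loader, a semantic chunker, an embedding model, and a vector database client) while keeping the same five-step shape.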

Embedding has three primary types: Text, Vision, and Audio embeddings.

Text Embedding

Text embedding is the most common strategy to vectorize text data while capturing its semantic meaning and context.

Chunking (Step 3) plays a key role in the process: chunks that are too large confuse the model, while chunks that are too small lose context.

To tackle this challenge, there are two primary chunking strategies:

  • Late chunking: Embed the entire document first, then break it into small chunks. This ensures each chunk retains the context of the entire document.

  • Semantic chunking: Use AI to find natural breaks in meaning, instead of chunking at a fixed number of characters (e.g., every 500 characters).
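As a rough illustration of the contrast, the sketch below compares fixed-size chunking with a crude stand-in for semantic chunking that splits at sentence boundaries; a real semantic chunker would use an embedding model to detect topic breaks:

```python
import re

text = ("A cat is sitting outside. It watches the birds. "
        "Meanwhile, a new movie premieres downtown.")

# fixed-size chunking: cuts every 40 characters, even mid-word
fixed_chunks = [text[i:i + 40] for i in range(0, len(text), 40)]

# crude semantic proxy: split at sentence boundaries instead
semantic_chunks = re.split(r"(?<=\.)\s+", text)

print(fixed_chunks)
print(semantic_chunks)
```

The fixed-size chunks slice through words and topics, while the sentence-based chunks keep each complete thought together.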

Primary text embedding methods pair with these chunking strategies as follows:

| Method | Preferred Chunking | Best For | Main Weakness |
| --- | --- | --- | --- |
| Dense | Late | Finding synonyms; maximizing context awareness. | Struggles with exact IDs or rare jargon |
| Sparse | Semantic | Exact matches, part numbers, and names; keeps relevant keywords together and avoids splitting terms like "Apple / Watch". | Fails if the user uses different words |
| Hybrid | Late + Semantic | Production-grade RAG systems; uses semantic boundaries while applying late-chunking logic to keep the document's global context. | More complex and expensive to run |

Table 1. Comparison of Embedding Methods: Dense, Sparse, and Hybrid.

Vision Embedding

Vision embedding treats images as data points, enabling direct image search without manual labeling:

Figure B. Architecture of a Vision Embedding model (CNN) processing images into patches for vector representation  (Created by Kuriko IWAI)

Its key methods include:

  • Contrastive Language-Image Pre-training (CLIP): Maps images and text into the same vector space to search images with words.

  • Vision Transformers (ViT): Breaks images into patches to process them like text tokens.

Major players like OpenAI (CLIP), Meta (DINOv2), and Google (SigLIP) have developed their own models.

Audio Embedding

Audio embedding is useful for music recommendation, speech recognition, and identifying sound patterns in industrial sensor data.

Figure C. Process flow of raw audio signals being converted into spectrograms and embedded via an encoder network (source)

Its key methods include:

  • Spectrogram Analysis: Converts sound waves into visual representations and then embeds them.

  • Contrastive Language-Audio Pre-training (CLAP): Similar to CLIP but applies to sound.

Key players include Microsoft (CLAP) and Meta (Audio2Vec).

Measuring Vector Relationships

Once the raw data is vectorized, the model calculates the proximity between vectors using specific mathematical formulas.

Vectors that are closer together share similar semantic, audio, or visual meanings.

The diagram below shows how vector math interprets semantic meaning, using two text embeddings representing words like "cat" in a two-dimensional space as an example:

Figure D. Geometric representation of vector relationships showing small, wide, and zero-degree angles between concepts (Created by Kuriko IWAI)

When the two vectors represent related concepts (left, Cat vs. Dog, Figure D), the vectors are pointing in a similar direction with a small orientation (angle) θ, indicating that "Cat" and "Dog" share many features (both are pets, mammals, have four legs, etc) in the two-dimensional vector space.

When the two vectors represent unrelated concepts (middle, Cat vs. Crocodile, Figure D), the vectors are pointing in very different directions with a wide θ.

Lastly, when the vectors share the identical concept but with different magnitude (right, Cat vs. Cat Cat Cat, Figure D), both vectors sit on the exact same line (θ = 0), but vector B is much longer.

This represents a situation where the topic is identical, but the magnitude is different (perhaps a document that mentions "Cat" many more times in a similar concept).

But how can we measure the differences between the vectors?

This section explores the three mathematical metrics:

  • Dot Product,

  • Cosine Similarity, and

  • Euclidean Distance.

Dot Product

The dot product is the fundamental operation that measures the relationship between two vectors, considering both their orientations and magnitudes.

The dot product of two vectors A and B in an n-dimensional space is generalized as:

\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^{n} A_i B_i = A_1B_1 + A_2B_2 + \dots + A_nB_n \quad \cdots \text{(1.1)}

where A_i and B_i represent the i-th entry of the vectors A and B, respectively.

Alternatively, when the angle θ between the two vectors is known, Eq. 1.1 can be written as:

\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\| \|\mathbf{B}\| \cos(\theta) \quad \cdots \text{(1.2)}

where ||A|| and ||B|| represent the magnitudes of the vectors A and B, respectively.

In the case of Figure D, when Vector A (Cat) = [2, 2], each scenario is measured:

| | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Vector B Coordinates | B = [3, 1.5] (similar direction, similar size) | B = [4, -2] (different direction, different size) | B = [6, 6] (exactly the same direction, but 3x longer) |
| Dot Product Score | 9.0 | 4.0 | 24.0 |
| Similarity Interpretation | Moderate/High | Low | Very High |

Table 2.1. Similarity Metric Comparison (Dot Product).
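The dot product scores above can be verified in a few lines of NumPy:

```python
import numpy as np

# vector A ("Cat") and the three comparison vectors from Figure D
A = np.array([2.0, 2.0])
B = {
    "Dog": np.array([3.0, 1.5]),
    "Crocodile": np.array([4.0, -2.0]),
    "Cat Cat Cat": np.array([6.0, 6.0]),
}

# Eq. 1.1: element-wise products summed across dimensions
scores = {name: float(np.dot(A, b)) for name, b in B.items()}
print(scores)  # {'Dog': 9.0, 'Crocodile': 4.0, 'Cat Cat Cat': 24.0}
```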

Cosine Similarity

The cosine similarity focuses purely on the orientation between two vectors A and B, effectively ignoring their magnitudes.

The formula is essentially the dot product (Eq. 1.1) divided by the product of the vectors' magnitudes ||A|| ||B||:

\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} \quad \cdots \text{(2)}

where A_i and B_i represent the i-th entry of the vector A and B, respectively.

This makes it perfect for text analysis, where word frequency (magnitude) may vary but the context (direction) remains the same.

In the case of Figure D, using the same Vector A & B coordinates, the similarity of each scenario is measured:

| | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Cosine Similarity | 0.95 | 0.32 | 1.00 |
| Similarity Interpretation | High | Low | Perfect Match |

Table 2.2. Similarity Metric Comparison (Cosine Similarity).
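The cosine similarity values can be reproduced by applying Eq. 2 directly:

```python
import numpy as np

def cosine(a, b):
    # Eq. 2: dot product divided by the product of the magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([2.0, 2.0])
print(round(cosine(A, np.array([3.0, 1.5])), 2))   # 0.95 (Dog)
print(round(cosine(A, np.array([4.0, -2.0])), 2))  # 0.32 (Crocodile)
print(round(cosine(A, np.array([6.0, 6.0])), 2))   # 1.0  (Cat Cat Cat)
```

Note that the "Cat Cat Cat" vector scores a perfect 1.0 despite being three times longer, because cosine similarity ignores magnitude.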

Euclidean Distance (L2 Norm)

The Euclidean distance (L2) measures the absolute displacement in space between two points, p and q:

d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \quad \cdots \text{(3)}

where:

  • d: The Euclidean distance (the scalar result).

  • p, q: The two vectors (or points) in n-dimensional space.

  • i: The index of the current dimension calculated (from 1 to n).

  • p_i, q_i: The specific coordinates of vectors p and q at the i-th dimension.

The Euclidean distance is highly sensitive to the magnitude of the vectors because it depends on where each point physically sits in the coordinate space.

If you take two vectors pointing in the exact same direction and multiply one of them by 10, the Euclidean distance increases significantly because the tip of the scaled vector is physically much further away in the coordinate system.

This means when comparing two documents such that:

  • p: A short document.

  • q: A much longer version of document p that repeatedly uses the same words.

The Euclidean distance will see them as far apart, even though the context is similar, simply because of q's much higher word count.

In the case of Figure D, the similarity of each scenario is measured, using the same Vector A & B coordinates:

| | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Euclidean Distance | 1.12 | 4.47 | 5.66 |
| Similarity Interpretation | Closest Match | Unrelated | Unrelated (The Trap) |

Table 2.3. Similarity Metric Comparison (Euclidean distance).

The third scenario (Cat vs Cat Cat Cat) is the trap of the Euclidean distance because it has the largest gap even though the meaning is identical, simply because Vector B is much longer.
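The distances in Table 2.3, including the trap case, can be checked with Eq. 3:

```python
import numpy as np

A = np.array([2.0, 2.0])
B = {
    "Dog": np.array([3.0, 1.5]),
    "Crocodile": np.array([4.0, -2.0]),
    "Cat Cat Cat": np.array([6.0, 6.0]),
}

# Eq. 3: straight-line distance between the two points
dists = {name: round(float(np.linalg.norm(A - b)), 2) for name, b in B.items()}
print(dists)  # {'Dog': 1.12, 'Crocodile': 4.47, 'Cat Cat Cat': 5.66}
```

"Cat Cat Cat" lands furthest away despite pointing in the same direction as A, which is exactly the magnitude sensitivity described above.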

When to Use Which Metric

Given how sensitive these methods are to scale, each mathematical approach serves a specific use case:

| Metric | Sensitive to Magnitude? | Best Use Case |
| --- | --- | --- |
| Dot Product | Yes | Neural network layers, signal processing. |
| Cosine Similarity | No | Document similarity, recommendation systems. |
| Euclidean Distance | Strongly yes | Clustering (k-means), physical sensor data. |

Table 2.4. Use Cases by Similarity Metrics.

Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics

As Figure A shows, creating text embeddings involves converting human language into a vector that a computer can reason with.

These tactics range from simple keyword counting to advanced models that understand intent and instructions.

In every case, the system must vectorize the data and the query with the same embedding model, then calculate the similarity score:

Figure E. Workflow of computing similarity scores between a query embedding and a document corpus in a RAG pipeline (Created by Kuriko IWAI)

I'll use the following sample data and query to demonstrate the scores produced by the primary methods:

  • Dense embedding.

  • Sparse embedding.

  • Hybrid embedding.

Sample data:

docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome"
]

query = "Tell me about felines"

Dense Embedding in Action

Dense embedding is the gold standard for modern AI.

The embedding model uses a small, fixed number of dimensions filled with non-zero numbers to capture broad semantic meaning: no single position maps to a specific word; instead, the combination of numbers represents the meaning.

In the case of Eq. a, each dimension can represent a weight on a scale, such as:

  • Dimension 1 (0.12): Living vs. Non-living (where 1.0 is a human and -1.0 is a rock).

  • Dimension 2 (-0.98): Size (where -1.0 is tiny and 1.0 is massive).

  • Dimension 3 (0.45): Domesticity (where 1.0 is a pet and -1.0 is a wild predator).

In this scenario, the first dimension 0.12 suggests the subject is "somewhat living/animate" but perhaps not as high-ranking as a human in the model's hierarchy.

Common tactics involve:

  • Bi-Encoder.

  • Instruction Tuning.

  • Late Interaction.

Bi-Encoder

Bi-encoder is the most common method for text embedding.

It encodes the query and the document separately. It's fast because you can pre-calculate the document vectors.

from sentence_transformers import SentenceTransformer, util

# load bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# encode
doc_emb = model.encode(docs)
query_emb = model.encode(query)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_emb)

The vectorized docs (doc_emb) look like the following:

[[ 0.1230927 -0.00072824 0.04190801 ... 0.03736668 -0.03583647 0.06841106]
[ 0.02273473 -0.02657051 0.03814451 ... 0.03001389 0.09356179 0.02145582]
[-0.10044324 -0.07739273 -0.00137412 ... -0.00104974 0.07181141 0.02205478]]

Results:

  1. A cat is sitting outside: 0.2984

  2. A dog is playing guitar: 0.1693

  3. The new movie is awesome: 0.1015

Instruction Tuning

Instruction tuning is the process of fine-tuning a pre-trained model on a dataset of explicit instructions (e.g., "Summarize the document," or "Translate to French").

The instruction tells the model (e.g., open-source models like E5 or BGE) how to shape the vector based on the task goal, ensuring the model's output aligns with the user's specific intent rather than just predicting the next most likely word.

The following snippet uses BGE:

from sentence_transformers import SentenceTransformer, util

# load bge model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# bge instruction
instruction = "Represent this sentence for searching relevant passages: "

# encode (with instruction)
query_emb = model.encode(instruction + query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_embs)[0]

Results:

  1. A cat is sitting outside: 0.4718

  2. A dog is playing guitar: 0.4010

  3. The new movie is awesome: 0.3493

Late Interaction (ColBERT)

Instead of one vector per document, Late Interaction keeps multiple vectors (one per token), allowing for much more granular matching.

from ragatouille import RAGPretrainedModel

# load model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# colbert creates an index to store token-level embeddings
RAG.index(
    collection=docs,
    index_name="sample_index",
    max_document_length=180,
    split_documents=False
)

# search index
results = RAG.search(query=query, k=3)

Results:

  1. A cat is sitting outside: 14.03

  2. A dog is playing guitar: 9.33

  3. The new movie is awesome: 5.30

Developer Note: RAGatouille

RAGatouille is one of the common libraries for using ColBERT in Python-based RAG pipelines.

It simplifies the complex original Stanford implementation into a few lines of code for indexing and retrieval, integrating with major AI frameworks like LangChain and LlamaIndex.

It is currently transitioning from the original Stanford ColBERT backend to PyLate for better compatibility.

Performance Summary

In all three cases, the sentence 'A cat is sitting outside' achieved the highest score, identified as the closest match to the query embedding.

However, their practical applications diverge:

| Tactic | Speed | Storage Cost | Accuracy | Primary Use Case |
| --- | --- | --- | --- | --- |
| Bi-Encoder | Blazing fast | Low | Good | Initial retrieval from massive datasets (millions of docs) where speed is the priority. |
| Instruction | Fast | Low | Very good | Task-specific search (e.g., "find code snippets" vs. "summarize") where intent matters. |
| Late | Moderate | High | Excellent | High-precision reranking or complex queries where term-level interaction is needed. |

Table 3. Performance Matrix: Bi-Encoder vs. Instruction Tuning vs. Late Interaction.

Sparse Embedding in Action

Sparse embedding is a vector representation where most values are zero.

Unlike dense embedding, sparse embedding maps a specific token or keyword to a non-zero value, making them highly interpretable and excellent for exact keyword retrieval.

Common tactics involve:

  • Best Matching 25 (BM25).

  • Sparse Lexical and Expansion (SPLADE).

Best Matching 25 (BM25)

The industry standard for keyword search. It ranks documents based on the appearance of query terms, adjusting for document length.

In the following code snippet, I added "A cat" and "cat" as additional queries:

from rank_bm25 import BM25Okapi

# bm25 works on words (tokens) not vectors
tokenized_corpus = [doc.lower().split(" ") for doc in docs]

# initialize bm25
bm25 = BM25Okapi(tokenized_corpus)

queries = ["Tell me about felines", "A cat", "cat"]
for query in queries:
    tokenized_query = query.lower().split(" ")

    # score each doc
    doc_scores = bm25.get_scores(tokenized_query)

    # get top 3
    top_n = bm25.get_top_n(tokenized_query, docs, n=3)

Results

Query: Tell me about felines

  • A cat is sitting outside: 0.0000

  • A dog is playing guitar: 0.0000

  • The new movie is awesome: 0.0000

Query: A cat

  1. A cat is sitting outside: 0.5661

  2. A dog is playing guitar: 0.0552

  3. The new movie is awesome: 0.0000

Query: cat

  1. A cat is sitting outside: 0.5108

  2 (tie). A dog is playing guitar: 0.0000

  2 (tie). The new movie is awesome: 0.0000

In the query "A cat", the score of the Rank 2 sentence "A dog is playing guitar" is far lower than the score of the Rank 1 sentence.

This is because BM25 uses a specific formula to ensure common words (like "the" or "is") don't drown out rare, important words, balancing:

  • Term Frequency (TF): How many times does the word appear? The more, the better.

  • Inverse Document Frequency (IDF): Is this word rare in the whole collection? Rare words like "feline" get more points than "the".

  • Document length normalization: Penalizes very long documents so that a 500-page book doesn't win just because it happens to repeat the keyword by accident.
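The three components above can be combined into a minimal scoring sketch. This reimplements the variant used by rank_bm25's BM25Okapi (its IDF form, default k1 = 1.5 and b = 0.75, and without its epsilon floor for negative IDFs), which is an assumption on my part rather than code from this article:

```python
import math

docs = [
    "a cat is sitting outside",
    "a dog is playing guitar",
    "the new movie is awesome",
]
corpus = [d.split() for d in docs]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N
k1, b = 1.5, 0.75  # common BM25 defaults

def bm25_score(term: str, doc: list[str]) -> float:
    n = sum(term in d for d in corpus)         # docs containing the term
    idf = math.log((N - n + 0.5) / (n + 0.5))  # IDF: rare terms score higher
    f = doc.count(term)                        # TF: raw term frequency
    norm = 1 - b + b * len(doc) / avgdl        # document length normalization
    return idf * f * (k1 + 1) / (f + k1 * norm)

print(round(bm25_score("cat", corpus[0]), 4))  # 0.5108, matching the "cat" query above
```

Because "cat" appears in only one of the three documents, its IDF is high; a term like "is", which appears everywhere, would get a non-positive IDF and contribute nothing.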

Sparse Lexical and Expansion (SPLADE)

Sparse Lexical and Expansion (SPLADE) is an AI-powered BM25 where a neural network adds weight to latent words to expand a source document with related keywords.

Specifically, SPLADE takes the following two steps:

  1. Sparsity: Like BM25, transforms a sentence into a sparse vector that cares only about specific words, making it efficient for standard search engines like Elasticsearch.

  2. Expansion: Adds weight to latent words to fix the vocabulary mismatch problem.

For example, if a document says "A cat is sitting outside," SPLADE internally adds weights for words like "feline," "pet," or "animal," even if they don't appear in the text.

This can improve the precision of keyword search while keeping the intelligence of neural search.

I'll use the splade-v3 model from Naver Labs (creator of SPLADE) for demonstration:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load model
model_id = "naver/splade-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def get_splade_vector(text):
    # standard SPLADE pooling: max of log(1 + ReLU(logits)) over tokens
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
    weights = torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze()

# encode
query_vec = get_splade_vector(query)
doc_vecs = [get_splade_vector(d) for d in docs]

# calc similarity score (dot product)
scores = []
for i, d_vec in enumerate(doc_vecs):
    score = torch.dot(query_vec, d_vec).item()
    scores.append((docs[i], score))

Results

  1. A cat is sitting outside: 9.4954

  2. A dog is playing guitar: 0.1755

  3. The new movie is awesome: 0.0000

SPLADE successfully finds the sentence most similar to the query: "A cat is sitting outside."

Comparison of BM25 vs. SPLADE

Here is a comparison of BM25 and SPLADE:

| Feature | BM25 | SPLADE |
| --- | --- | --- |
| Input | Raw text | Neural-weighted tokens |
| Expansion | No (only exact words) | Yes (adds synonyms/related terms) |
| Infrastructure | Standard CPU / database | Requires GPU for encoding |
| Vector Type | Sparse | Enriched sparse |

Table 4. Comparison of BM25 and SPLADE.

Hybrid Embedding in Action

Relying on just one type of vector sometimes fails in production.

For example, dense vectors are great at finding "adorable kittens" when searching for "cute cats", but might fail to find a specific character like "Marie" from The Aristocats (even though she is an adorable kitten):

Figure. Marie from the Aristocats (Disney)

Hybrid embedding can avoid this challenge by combining dense and sparse embeddings.

Its common methods involve:

  • Reciprocal Rank Fusion (RRF).

  • Reranking

Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) cares only about each document's rank within the result lists, rather than its absolute score:

\text{RRF}_{\text{score}}(d) = \sum_{r \in R} \frac{1}{k + r(d)} \quad \cdots \text{(4)}

Where:

  • r(d) is the rank of document d under ranking r, summed over the set of rankings R, and

  • k is a constant (usually k = 60) to prevent top-ranked documents from overwhelming the rest.

I'll use the Bi-Encoder and SPLADE rankings for demonstration:

# sort the docs by rank
rank_list_biencoder = [
    'A cat is sitting outside',
    'The new movie is awesome',
    'A dog is playing guitar'
]
rank_list_splade = [
    'A cat is sitting outside',
    'A dog is playing guitar',
    'The new movie is awesome'
]

# compute rrf scores for each doc across both rank lists
k = 60
scores = {}
for rank_list in [rank_list_biencoder, rank_list_splade]:
    for rank, doc in enumerate(rank_list):
        if doc not in scores:
            scores[doc] = 0
        scores[doc] += 1 / (k + (rank + 1))

Results

  1. A cat is sitting outside: 0.5000

  2 (tie). A dog is playing guitar: 0.0366

  2 (tie). The new movie is awesome: 0.0366

Bi-encoder excels at understanding that “feline” and “cat” are related even if the words don’t match.

On the other hand, SPLADE excels at identifying specific relevant words.

RRF acts as the referee; if a document appears high in both lists, its RRF score will skyrocket. If it only appears in one, it stays in the middle.

Because the sentence "A cat is sitting outside" ranks first in both the bi-encoder and SPLADE lists, it yielded the highest score.

Reranking

Reranking is a method where the system applies a quick dense search to get the top results first, and then passes the top results through cross-encoders to give a final score.

Cross-encoders examine the query and the document together, and are trained to rank items rather than output a probability of correctness.

Negative scores mean the model thinks the document is unlikely to be a perfect match, while positive scores would mean the model is very confident it's a match.

For demonstration, I'll use the CrossEncoder module from the sentence_transformer library:

from sentence_transformers import CrossEncoder

# load the model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# create a query-doc pair
sentence_pairs = [[query, doc] for doc in docs]

# score
scores = model.predict(sentence_pairs)

Results

  1. A cat is sitting outside: -10.0513

  2. A dog is playing guitar: -10.9967

  3. The new movie is awesome: -11.3669

Although all the sentences are recognized as "unlikely to be a perfect match", the sentence "A cat is sitting outside" is still the closest match among all.

Wrapping Up

Vector databases and vectorization are key to handling unstructured data.

They can enable semantic search and provide long-term memory for LLMs through RAG.

The Storage Landscape: Choosing Your Vector Storage Tier

If you are looking for where to store these vectors, the market is split into four camps:

| Category | Key Players | Best For |
| --- | --- | --- |
| Vector-Native | Pinecone, Weaviate, Milvus, Qdrant | High-performance, specialized AI applications with massive scale. |
| Cloud Providers | AWS OpenSearch, Google Vertex AI Search, Azure AI Search | Teams already locked into a specific cloud ecosystem. |
| Traditional/SQL | pgvector (PostgreSQL), Supabase, Oracle | Keeping an existing database while adding vector capabilities. |
| NoSQL/Document | MongoDB Atlas Vector Search, Cassandra, Redis | Real-time applications that keep JSON-like structures. |

Table 5. Market Analysis: Vector-Native vs. Traditional Database Providers.

Continue Your Learning


Related Books for Further Understanding

These books cover a wide range of theory and practice, from the fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

Share What You Learned

Kuriko IWAI, "Understanding Vector Databases and Embedding Pipelines" in Kernel Labs

https://kuriko-iwai.com/vector-databases-and-embedding-strategies-guide


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.