What Are Vector Databases and Why They Matter
Traditional databases search by exact match (WHERE email = ?) or pattern (LIKE %keyword%). Vector databases search by meaning — they find items that are semantically similar, even if they share no common words. This is the foundation of nearly every AI search feature, chatbot, and recommendation engine in 2026.
A vector database stores high-dimensional vectors (embeddings) and supports Approximate Nearest Neighbor (ANN) search — finding the k vectors closest to a query vector. The "approximate" part is key: exact nearest-neighbor search over millions of vectors is too slow for interactive queries, so these databases use clever data structures (HNSW, IVF) that trade a small accuracy loss for orders-of-magnitude speedups.
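To make that tradeoff concrete, here is what exact nearest-neighbor search looks like as a brute-force scan. This is a minimal sketch (the `cosineSimilarity` and `nearestNeighbors` helpers are illustrative, not from any library); ANN indexes like HNSW exist precisely to approximate this result without touching every vector:

```typescript
// Exact nearest-neighbor search: compare the query against every stored vector.
// This is O(n) per query — ANN indexes (HNSW, IVF) avoid the full scan.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function nearestNeighbors(
  query: number[],
  vectors: { id: number; embedding: number[] }[],
  k: number
): { id: number; score: number }[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.embedding) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, k);
}
```

At a few thousand vectors this scan is fine; at millions, per-query latency forces the move to an ANN index.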
The market: pgvector (free PostgreSQL extension, ideal if you already use Postgres), Pinecone (managed, scales to billions of vectors), Chroma (open-source, popular for prototyping), Weaviate (open-source, built-in vectorization), and Qdrant (open-source, Rust-based, fast).
pgvector: Vector Search in PostgreSQL
pgvector is a PostgreSQL extension that adds a vector column type and distance operators. It is the most practical choice for most projects because your vectors live alongside your relational data — no separate infrastructure to manage.
pgvector supports three distance operators: cosine distance (<=>), L2/Euclidean distance (<->), and negative inner product (<#> — negated so that ascending ORDER BY returns the closest matches). For text embeddings, cosine distance is almost always the right choice.
The HNSW index type (added in pgvector 0.5.0) provides fast approximate nearest-neighbor search suitable for production workloads.
-- Install pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE blog_chunks (
  id SERIAL PRIMARY KEY,
  post_id INTEGER REFERENCES posts(id),
  chunk_index INTEGER,
  content TEXT NOT NULL,
  embedding vector(1536),  -- OpenAI text-embedding-3-small
  metadata JSONB DEFAULT '{}'
);

-- HNSW index for fast cosine similarity search
CREATE INDEX ON blog_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Insert with embedding
INSERT INTO blog_chunks (post_id, chunk_index, content, embedding, metadata)
VALUES (
  1,
  0,
  'React Server Components render on the server and send zero JavaScript...',
  '[0.023, -0.041, 0.089, ...]'::vector,
  '{"category": "frontend", "tags": ["react", "rsc"]}'
);

-- Semantic search: find the 5 most similar chunks
SELECT
  bc.content,
  p.title AS post_title,
  1 - (bc.embedding <=> $1::vector) AS similarity
FROM blog_chunks bc
JOIN posts p ON p.id = bc.post_id
WHERE 1 - (bc.embedding <=> $1::vector) > 0.7
ORDER BY bc.embedding <=> $1::vector
LIMIT 5;
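When running these queries from application code, the query vector is bound as a parameter in pgvector's literal format: a bracketed, comma-separated list of numbers. A small helper makes that explicit (`toPgVector` is an illustrative name, not part of pgvector or node-postgres):

```typescript
// Serialize a JS number array into pgvector's input format, e.g. "[0.1,0.2,0.3]".
// JSON.stringify happens to produce the same bracketed format for flat number
// arrays, but an explicit helper documents the intent and validates the input.
function toPgVector(embedding: number[]): string {
  if (embedding.length === 0 || embedding.some((x) => !Number.isFinite(x))) {
    throw new Error('embedding must be a non-empty array of finite numbers');
  }
  return `[${embedding.join(',')}]`;
}
```

The resulting string can then be bound as `$1` and cast with `$1::vector`, as in the search query above.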
Chunking Strategies for Documents
Before embedding a document, you must split it into chunks — smaller pieces that can each be independently embedded and retrieved. The chunking strategy directly impacts search quality: too large and the embedding is diluted, too small and context is lost.
The recommended approach: split by semantic boundaries (paragraphs, sections, headings) with a fixed maximum size (500-1000 tokens) and overlap (50-100 tokens). The overlap ensures that information at chunk boundaries is not lost.
For code, split by function/class boundaries. For documentation, split by section headings. For chat logs, split by conversation turns.
// Semantic chunking with overlap
interface ChunkOptions {
  maxChunkSize: number; // max tokens per chunk
  overlapSize: number;  // tokens of overlap between chunks
  separator: string;    // split on paragraphs first
}

function chunkDocument(
  text: string,
  options: ChunkOptions = { maxChunkSize: 800, overlapSize: 100, separator: '\n\n' }
): string[] {
  const { maxChunkSize, overlapSize, separator } = options;

  // Step 1: Split by semantic boundaries (paragraphs)
  const paragraphs = text.split(separator);

  const chunks: string[] = [];
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    const combined = currentChunk ? currentChunk + separator + paragraph : paragraph;

    if (estimateTokens(combined) > maxChunkSize && currentChunk) {
      chunks.push(currentChunk.trim());

      // Start the next chunk with overlap from the end of the current chunk.
      // English averages ~1.3 tokens per word, so overlapSize tokens ≈ overlapSize / 1.3 words.
      const words = currentChunk.split(' ');
      const overlapWords = words.slice(-Math.floor(overlapSize / 1.3));
      currentChunk = overlapWords.join(' ') + separator + paragraph;
    } else {
      currentChunk = combined;
    }
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // Rough estimate: 1 token ≈ 4 chars
}

// Usage
const blogContent = await readFile('react-server-components.md', 'utf8');
const chunks = chunkDocument(blogContent);

// Embed each chunk separately
for (const chunk of chunks) {
  const embedding = await getEmbedding(chunk);
  await db.query(
    'INSERT INTO blog_chunks (content, embedding) VALUES ($1, $2::vector)',
    [chunk, JSON.stringify(embedding)]
  );
}
Production Pipeline: Ingest, Embed, Index
A production vector search pipeline has three stages: Ingest (load documents, clean, chunk), Embed (convert chunks to vectors using an embedding model), and Index (store vectors in the database with metadata for filtering).
The pipeline should be idempotent — running it twice on the same document should not create duplicates. Use content hashing or document versioning to detect changes and only re-embed modified content.
For large datasets, embed in batches (OpenAI accepts up to 2048 texts per API call) to minimize API calls and latency.
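Splitting a large chunk list into API-sized batches is a one-liner worth pulling out. A minimal sketch (`batched` is an illustrative helper; 2048 matches OpenAI's documented per-request input limit):

```typescript
// Split an array into consecutive batches of at most `size` items.
function batched<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch: one embeddings call per batch, preserving chunk order.
// for (const batch of batched(allChunks, 2048)) {
//   const response = await openai.embeddings.create({
//     model: 'text-embedding-3-small',
//     input: batch,
//   });
//   // response.data[i].embedding corresponds to batch[i]
// }
```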
import { createHash } from 'crypto';
import OpenAI from 'openai';

const openai = new OpenAI();

interface DocumentSource {
  id: string;
  title: string;
  content: string;
  category: string;
  tags: string[];
  updatedAt: Date;
}

async function ingestDocuments(documents: DocumentSource[]) {
  for (const doc of documents) {
    // 1. Check if already indexed (idempotency)
    const contentHash = createHash('sha256').update(doc.content).digest('hex');
    const existing = await db.query(
      "SELECT id FROM blog_chunks WHERE metadata->>'sourceId' = $1 AND metadata->>'hash' = $2",
      [doc.id, contentHash]
    );
    if (existing.rows.length > 0) {
      console.log(`Skipping ${doc.title} — unchanged`);
      continue;
    }

    // 2. Delete old chunks for this document
    await db.query("DELETE FROM blog_chunks WHERE metadata->>'sourceId' = $1", [doc.id]);

    // 3. Chunk the document
    const chunks = chunkDocument(doc.content);

    // 4. Embed all chunks in one batch (up to 2048 inputs per API call)
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks,
    });

    // 5. Store chunks with metadata
    for (let i = 0; i < chunks.length; i++) {
      await db.query(
        `INSERT INTO blog_chunks (content, embedding, metadata)
         VALUES ($1, $2::vector, $3)`,
        [
          chunks[i],
          JSON.stringify(response.data[i].embedding),
          JSON.stringify({
            sourceId: doc.id,
            title: doc.title,
            category: doc.category,
            tags: doc.tags,
            chunkIndex: i,
            hash: contentHash,
          }),
        ]
      );
    }

    console.log(`Indexed ${doc.title}: ${chunks.length} chunks`);
  }
}
Hybrid Search: Combining Vector + Full-Text
Pure vector search misses exact keyword matches (product codes, error messages, proper names). Pure full-text search misses semantic similarity. Hybrid search combines both: run a vector search AND a full-text search, then merge the results, either with Reciprocal Rank Fusion (RRF) or with a weighted combination of the two scores.
pgvector + PostgreSQL full-text search is a powerful combination: you can run both types of search in a single SQL query, using UNION ALL and a scoring function to rank results.
-- Hybrid search: vector + full-text in a single query
WITH vector_results AS (
  SELECT id, content,
         1 - (embedding <=> $1::vector) AS vector_score,
         NULL::float AS text_score
  FROM blog_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 20
),
text_results AS (
  SELECT id, content,
         NULL::float AS vector_score,
         ts_rank(to_tsvector('english', content),
                 plainto_tsquery('english', $2)) AS text_score
  FROM blog_chunks
  WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
  ORDER BY text_score DESC
  LIMIT 20
)
-- Combine: a chunk found by both searches gets credit from both scores
SELECT id, content,
       COALESCE(MAX(vector_score), 0) * 0.7
     + COALESCE(MAX(text_score), 0) * 0.3 AS hybrid_score
FROM (
  SELECT * FROM vector_results
  UNION ALL
  SELECT * FROM text_results
) all_results
GROUP BY id, content
ORDER BY hybrid_score DESC
LIMIT 10;
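The query above uses a weighted combination of scores. Reciprocal Rank Fusion is the alternative merge step, usually done in application code: each list contributes 1 / (k + rank) to an item's fused score, so it needs only rank positions, never comparable scores. A sketch (k = 60 is the conventional constant; the function name is illustrative):

```typescript
// Reciprocal Rank Fusion: merge ranked result lists using only rank positions.
// An item appearing near the top of several lists accumulates the highest score.
function reciprocalRankFusion(
  resultLists: number[][], // each inner array is a ranked list of item ids
  k = 60
): { id: number; score: number }[] {
  const scores = new Map<number, number>();
  for (const list of resultLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Feeding it the id lists from the vector query and the full-text query ranks chunks that appear high in both lists first, without having to normalize cosine similarity against ts_rank.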
Key Takeaways
Vector databases are essential infrastructure for AI applications. pgvector gives you vector search inside PostgreSQL with zero new infrastructure. Pinecone scales to billions of vectors with managed hosting.
The production pipeline: document → chunk (500-1000 tokens with overlap) → embed (batch API calls) → store (with metadata) → search (hybrid vector + full-text).
For interviews: explain embeddings (semantic meaning as vectors), HNSW indexing (approximate nearest neighbor), chunking strategies (why size and overlap matter), and hybrid search (combining vector + keyword search).