Benchmarking embedding models: when does dim=384 beat dim=1024 on retrieval quality (nDCG@10)?
Running a retrieval pipeline for a ~50K document corpus (technical docs, API references, troubleshooting guides). Comparing embedding models for semantic search quality vs. latency/cost tradeoffs.

Tested models:
- all-MiniLM-L6-v2 (384 dims, ~22M params)
- bge-small-en-v1.5 (384 dims, ~33M params)
- bge-base-en-v1.5 (768 dims, ~109M params)
- nomic-embed-text (768 dims, via Ollama, ~137M params)
- text-embedding-3-large via API (3072 dims, can truncate to 1024)

Dataset: 50K chunks (512 tokens each), 500 query-document relevance judgments (manually labeled).

Results so far (nDCG@10):
- MiniLM-L6: 0.61
- bge-small: 0.67
- bge-base: 0.72
- nomic-embed: 0.70
- text-embedding-3-large (truncated to 1024 dims): 0.74

Surprise finding: on "error message" queries (e.g., "what causes ECONNREFUSED in Node.js http.Agent"), bge-small (384 dims) actually scored higher (0.71) than text-embedding-3-large (0.68). The larger model seemed to over-generalize on technical terminology.

Questions:
1. Has anyone seen smaller models outperform larger ones on domain-specific retrieval?
2. Is there a systematic way to determine the optimal embedding dimensionality for a given corpus size, or is it always empirical?
3. For hybrid search (BM25 + dense), does the dense model's dimensionality matter less because BM25 catches the exact matches?

Running the evaluation with ranx + ir_datasets. Happy to share the full benchmark notebook if useful.
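For anyone who wants to reproduce a per-category breakdown like the "error message" one above, here is a minimal sketch of nDCG@10 grouped by query category. It is plain Python, not the ranx code from the notebook. The function names, the `categories` mapping, and the toy data are all hypothetical. It uses the linear-gain form of DCG, so values can differ slightly from a library's output depending on the gain variant and tie handling.

```python
import math
from collections import defaultdict


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """nDCG@k for a single query (linear gain).

    ranked_doc_ids: doc ids in the order the retriever returned them.
    qrels_for_query: {doc_id: graded relevance}; unlabeled docs count as 0.
    """
    gains = [qrels_for_query.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels_for_query.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def ndcg_by_category(run, qrels, categories, k=10):
    """Return {category: (mean nDCG@k, number of queries)}."""
    scores = defaultdict(list)
    for qid, ranked in run.items():
        scores[categories[qid]].append(ndcg_at_k(ranked, qrels.get(qid, {}), k))
    return {cat: (sum(vals) / len(vals), len(vals)) for cat, vals in scores.items()}


# Toy data, made up purely for illustration (not from the benchmark).
qrels = {"q1": {"d1": 1}, "q2": {"d3": 1}}
run = {"q1": ["d1", "d2"], "q2": ["d4", "d3"]}
categories = {"q1": "error_message", "q2": "how_to"}

for cat, (mean_ndcg, n) in ndcg_by_category(run, qrels, categories).items():
    print(f"{cat}: nDCG@10={mean_ndcg:.3f} over {n} queries")
```

The query count is returned next to the mean because a per-category subset of 500 judgments can be small, and a gap like 0.71 vs. 0.68 is easier to interpret when the number of queries behind it is visible.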