Benchmarking embedding models: when does dim=384 beat dim=1024 on recall?

Question

I'm running a retrieval pipeline over a ~50K document corpus (technical docs, API references, troubleshooting guides) and comparing embedding models on semantic search quality versus latency and cost.

Tested models:
- all-MiniLM-L6-v2 (384 dims, ~22M params)
- bge-small-en-v1.5 (384 dims, ~33M params)
- bge-base-en-v1.5 (768 dims, ~109M params)
- nomic-embed-text (768 dims, via Ollama, ~137M params)
- text-embedding-3-large via API (3072 dims, can truncate to 1024)
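
For context on the "can truncate to 1024" note: a minimal sketch of client-side truncation, assuming the full 3072-dim vectors are already fetched and cut down locally instead of requesting a shorter vector through the API's `dimensions` parameter. Shortened vectors are generally no longer unit length, so they need re-normalizing before cosine or dot-product search. The function name and the toy vector are illustrative only, not real model output.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components of `vec`, then L2-normalize the result."""
    if dim > len(vec):
        raise ValueError("dim exceeds the embedding width")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]

# Toy 3-dim stand-in for a full-width embedding (hypothetical values).
full = [3.0, 4.0, 12.0]
short = truncate_and_normalize(full, 2)
print(short)
```

The same function applies unchanged to 3072 -> 1024; for 50K vectors a vectorized version (e.g. with NumPy) would be the more practical choice.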

Dataset: 50K chunks (512 tokens each), 500 query-document relevance judgments (manually labeled).

Results so far (nDCG@10):
- MiniLM-L6: 0.61
- bge-small: 0.67
- bge-base: 0.72
- nomic-embed: 0.70
- text-embedding-3-large (1024): 0.74
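
For anyone wanting to sanity-check scores like these by hand: a rough sketch of per-query nDCG@10, assuming the linear-gain formulation with a log2 rank discount (some tools use a 2^rel - 1 gain instead, which gives different values for graded labels). The function names and the toy judgments are illustrative and are not taken from the benchmark notebook.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order the retriever returned them.
    relevance: dict of doc_id -> relevance label; unlabeled docs count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(runs, qrels, k=10):
    """Average nDCG@k over all judged queries; queries missing from `runs` score 0."""
    scores = [ndcg_at_k(runs.get(qid, []), rels, k) for qid, rels in qrels.items()]
    return sum(scores) / len(scores)

# Toy example (hypothetical ids and labels): two relevant docs, retrieved at ranks 2 and 3.
qrels = {"q1": {"d1": 1, "d2": 1}}
runs = {"q1": ["d3", "d1", "d2"]}
print(round(mean_ndcg(runs, qrels), 4))
```

Keeping the per-query scores around, not just the mean, is what makes per-category breakdowns and significance checks possible later.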

Surprise finding: on "error message" queries (e.g., "what causes ECONNREFUSED in Node.js http.Agent"), bge-small (384 dims) scored higher (0.71) than text-embedding-3-large (0.68). The larger model seemed to over-generalize on technical terminology.
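
One way to check whether a gap like 0.71 vs. 0.68 on a query subset is more than noise: a paired bootstrap over per-query score differences. This is only a sketch under assumptions. It expects per-query nDCG@10 lists for the two models in the same query order, and the function name, defaults, and toy scores below are made up for illustration, not benchmark data.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean_diff, ci_low, ci_high) for mean(scores_a - scores_b).

    Resamples queries with replacement and uses the percentile interval
    of the resampled mean differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same queries in the same order")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high

# Hypothetical per-query scores for two models on the same six queries.
small_model = [0.9, 0.4, 0.7, 0.8, 0.6, 0.75]
large_model = [0.8, 0.5, 0.6, 0.8, 0.7, 0.70]
mean_diff, low, high = paired_bootstrap_ci(small_model, large_model, n_resamples=2000)
print(f"mean diff {mean_diff:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

If the interval comfortably includes zero, the subset result is probably not strong evidence that the smaller model is better on that query type.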

Questions:
1. Has anyone seen smaller models outperform large ones on domain-specific retrieval?
2. Is there a systematic way to determine optimal embedding dimensionality for a given corpus size, or is it always empirical?
3. For hybrid search (BM25 + dense), does the dense model dimensionality matter less because BM25 catches the exact matches?
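
To make question 3 concrete: a minimal sketch of one common way to combine BM25 and dense results, reciprocal rank fusion (RRF). This is an assumption about how the hybrid step could be done, not a description of an existing pipeline. RRF only uses ranks, so it sidesteps the problem of BM25 and cosine scores living on different scales. The constant k=60 is the conventional default; the doc ids are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of doc ids into one list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

# Placeholder result lists for a single query.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Running the same fusion with each dense model in turn, and scoring the fused lists with the same metric, would show directly whether the gap between 384-dim and larger models shrinks once BM25 is in the mix.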

Running evaluation with ranx + ir_datasets. Happy to share the full benchmark notebook if useful.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback