blog.google
Google released Gemini Embedding 2, a single model that maps text, images, video, audio, and documents into one shared vector space, making cross-modal retrieval possible without a separate pipeline for each input type. It supports up to 8,192 tokens of text, 120 seconds of video, six images per request, and raw audio without transcription. Matryoshka representation learning lets developers truncate output dimensions from 3,072 down to 128 without retraining, which matters for teams balancing search quality against storage and latency costs. The model covers over 100 languages and is already available through LangChain, LlamaIndex, Weaviate, Qdrant, and ChromaDB. Until now, most production RAG stacks have treated each modality as a separate embedding problem, so a competitive multimodal option from Google puts real pressure on OpenAI and Cohere to catch up.
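The Matryoshka truncation mentioned above is simple in practice: because the leading dimensions carry the most information, you keep a prefix of the vector and re-normalize. A minimal sketch, assuming you already have a 3,072-dimensional embedding as a NumPy array (the vector here is synthetic, not real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize to unit length.

    Matryoshka-trained embeddings front-load the most important
    information, so prefix truncation preserves most retrieval quality.
    Re-normalizing keeps cosine similarity (dot product on unit vectors)
    well-defined after truncation.
    """
    truncated = vec[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Stand-in for a real 3,072-dim model output (values are synthetic).
full = np.random.default_rng(0).normal(size=3072)
small = truncate_embedding(full, 128)
print(small.shape)  # (128,)
```

Storing the 128-dim prefix instead of the full 3,072 dimensions cuts vector-store size and distance-computation cost by roughly 24x, which is the trade-off the paragraph above refers to.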
