After partitioning, chunking, and summarizing, the embedding step converts the text that
Unstructured extracts into arrays of numbers known as vectors.
These vector embeddings are generated by an embedding model that is offered by
an embedding provider, and they are typically stored alongside the text itself in a vector store.
When a user queries a retrieval-augmented generation (RAG) application, the application performs
a similarity search against that vector store
and returns the items whose embeddings are closest to the user's query.
Here is an example of a document element generated by Unstructured, along with vector embeddings produced by
the sentence-transformers/all-MiniLM-L6-v2 embedding model
on Hugging Face:
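The original example is not reproduced here; the following is an illustrative sketch of the shape of an embedded element. The field names follow Unstructured's JSON element output, but the ID, text, and vector values below are made up and truncated (all-MiniLM-L6-v2 actually emits 384 floating-point dimensions per element):

```python
# Illustrative shape of an Unstructured element after the embedding step.
# The values are placeholders; a real "embeddings" list from
# all-MiniLM-L6-v2 contains 384 floats, not 3.
element = {
    "type": "NarrativeText",
    "element_id": "5ef1d1117721f0472c1ad825991d7d37",  # hypothetical ID
    "text": "Example sentence from the source document.",
    "metadata": {"filename": "example.pdf", "page_number": 1},
    "embeddings": [0.0123, -0.0456, 0.0789],  # truncated for display
}

# The embedding is stored alongside the text it represents.
print(len(element["embeddings"]))  # 3 here; 384 in real output
```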
Generate embeddings
To generate embeddings, choose one of the available embedding providers and models in the Select Embedding Model section of an Embedder node in a workflow. When choosing an embedding model, pay attention to the number of dimensions listed next to each model: this number must match the number of dimensions in the embeddings field of your destination connector's table, collection, or index.

You can change a workflow's preconfigured provider only through Custom workflow settings.
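As a quick sanity check before running a workflow, you can compare a model's dimension count against your destination index. This is a hypothetical helper, not part of Unstructured's API; the dimension values come from the table below:

```python
# Hypothetical pre-flight check: the embedding model's output dimensions
# must equal the destination index's dimensions, or writes will fail.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,  # Azure OpenAI
    "text-embedding-3-small": 1536,  # Azure OpenAI
    "voyage-3": 1024,                # Voyage AI
}

def dimensions_match(model_name: str, index_dimensions: int) -> bool:
    """Return True if the model's vector size matches the index schema."""
    return MODEL_DIMENSIONS.get(model_name) == index_dimensions

# A 1536-dimension index only accepts 1536-dimension vectors.
print(dimensions_match("text-embedding-3-small", 1536))  # True
print(dimensions_match("text-embedding-3-large", 1536))  # False
```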
Chunk sizing and embedding models
If your workflow has an Embedder node, your workflow's Chunker node settings must stay within the selected embedding model's token limits. Exceeding these limits will cause workflow failures. Set your Chunker node's Max Characters to a value at or below Unstructured's recommended maximum chunk size for your selected embedding model, as listed in the last column of the following table.

| Embedding model | Dimensions | Tokens | Chunker Max Characters* |
|---|---|---|---|
| Amazon Bedrock | |||
| Cohere Embed English | 1024 | 512 | 1792 |
| Cohere Embed Multilingual | 1024 | 512 | 1792 |
| Titan Embeddings G1 - Text | 1536 | 8192 | 28672 |
| Titan Multimodal Embeddings G1 | 1024 | 256 | 896 |
| Titan Text Embeddings V2 | 1024 | 8192 | 28672 |
| Azure OpenAI | |||
| Text Embedding 3 Large | 3072 | 8192 | 28672 |
| Text Embedding 3 Small | 1536 | 8192 | 28672 |
| Text Embedding Ada 002 | 1536 | 8192 | 28672 |
| Together AI | |||
| M2-Bert 80M 32K Retrieval | 768 | 8192 | 28672 |
| Voyage AI | |||
| Voyage 3 | 1024 | 32000 | 112000 |
| Voyage 3 Large | 1024 | 32000 | 112000 |
| Voyage 3 Lite | 512 | 32000 | 112000 |
| Voyage Code 2 | 1536 | 16000 | 56000 |
| Voyage Code 3 | 1024 | 32000 | 112000 |
| Voyage Finance 2 | 1024 | 32000 | 112000 |
| Voyage Law 2 | 1024 | 16000 | 56000 |
| Voyage Multimodal 3 | 1024 | 32000 | 112000 |
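The recommended Max Characters values in the table above appear to follow a rule of thumb of 3.5 characters per token (for example, 8192 tokens × 3.5 = 28672). A minimal sketch, assuming that ratio (the constant and function are illustrative, not part of Unstructured's API):

```python
# Rule-of-thumb derivation of the "Chunker Max Characters" column:
# each value equals the model's token limit multiplied by an assumed
# 3.5 characters per token (an approximation; actual tokenization
# varies with the text and the model's tokenizer).
CHARS_PER_TOKEN = 3.5

def recommended_max_characters(token_limit: int) -> int:
    """Conservative character budget for a chunk given a token limit."""
    return int(token_limit * CHARS_PER_TOKEN)

print(recommended_max_characters(512))    # 1792   (Cohere Embed English)
print(recommended_max_characters(8192))   # 28672  (Titan Text Embeddings V2)
print(recommended_max_characters(32000))  # 112000 (Voyage 3)
```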

