Configs
- chunk_api_key: If chunk_by_api is set to True, requests sent to the Unstructured API use this Unstructured API key to make authenticated calls.
- chunk_by_api: Default: False. If set to True, chunking requests are sent to the Unstructured API. If set to False, chunking runs locally.
- chunk_combine_text_under_n_chars: Combine consecutive chunks when the first does not exceed length n and the second will fit without exceeding the hard-maximum length. Only operative for the by_title chunking strategy.
- chunk_include_orig_elements: Set to True to add the original elements that were consolidated to form each chunk to .metadata.orig_elements on that chunk.
- chunk_max_characters: Default: 500. The hard-maximum chunk length. No chunk will exceed this length. An oversized element is divided by text-splitting to fit this window.
- chunk_multipage_selections: Set to True to ignore page boundaries when chunking, so that elements from two different pages can appear in the same chunk. Only operative for the by_title chunking strategy.
- chunk_new_after_n_chars: The soft-maximum chunk length. Another element will not be added to a chunk that has reached length n, even when it would fit without exceeding the hard-maximum length.
- chunk_overlap: Default: 0. Prefix each chunk’s text with the last n characters of the prior chunk, where n is the overlap length. Only applies to oversized chunks divided by text-splitting. To apply overlap to non-oversized chunks, use chunk_overlap_all.
- chunk_overlap_all: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the chunk_overlap value.
- chunking_endpoint: If chunk_by_api is set to True, chunking requests are sent to this Unstructured API URL. By default, this URL is the Unstructured Partition Endpoint URL: https://api.unstructuredapp.io/general/v0/general. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at support@unstructured.io.
- chunking_strategy: One of basic or by_title. When omitted, no chunking is performed. The basic strategy maximally fills each chunk with whole elements, up to the size limits specified by max_characters and new_after_n_chars. A single element that exceeds this length is divided into two or more chunks using text-splitting. A Table element is never combined with any other element and appears as a chunk of its own, or as a sequence of TableChunk elements if splitting is required. The by_title behaviors are the same, except that section boundaries (and, optionally, page boundaries) are respected, so that two consecutive elements from different sections appear in separate chunks.
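
The interaction between the hard maximum, the soft maximum, and overlap described above can be illustrated with a small, self-contained sketch. This is not the Unstructured implementation: basic_chunks is a hypothetical helper that works on plain strings, splits oversized text at fixed character offsets rather than at word boundaries, ignores Table handling, and assumes that element texts are joined with a blank line and that the soft maximum falls back to the hard maximum when omitted.

```python
def basic_chunks(elements, max_characters=500, new_after_n_chars=None, overlap=0):
    """Toy model of the basic strategy over plain strings (hypothetical helper)."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    # Assumption: with no soft maximum given, only the hard maximum applies.
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    separator = "\n\n"  # assumption: how element texts are joined inside a chunk
    chunks, current = [], ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: close the open chunk, then text-split the element.
            if current:
                chunks.append(current)
                current = ""
            step = max_characters - overlap
            start = 0
            while True:
                # Each piece after the first starts with the last `overlap`
                # characters of the previous piece.
                chunks.append(text[start:start + max_characters])
                if start + max_characters >= len(text):
                    break
                start += step
            continue
        candidate = current + separator + text if current else text
        # Start a new chunk once the soft maximum has been reached, or when
        # adding this element would exceed the hard maximum.
        if current and (len(current) >= soft_max or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 200, "b" * 200, "c" * 200]
print([len(c) for c in basic_chunks(elements)])                         # [402, 200]
print([len(c) for c in basic_chunks(elements, new_after_n_chars=150)])  # [200, 200, 200]
```

With only the hard maximum in effect, the first two elements share a chunk and the third starts a new one. With a soft maximum of 150, each 200-character element already exceeds it, so no second element is added to any chunk.
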
Chunking will fail if you set partition_by_api to False and chunking_strategy to by_page or by_similarity. However, the rest of your data processing pipeline should be unaffected by this setting.
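
One way to catch this combination before running a pipeline is a small pre-flight check. The sketch below is only an illustration of the rule stated above: check_chunking_settings and STRATEGIES_REQUIRING_PARTITION_BY_API are hypothetical names, not part of any Unstructured library, and the check covers nothing beyond the two strategies named in the note.

```python
# Hypothetical names; this set mirrors the two strategies named in the note.
STRATEGIES_REQUIRING_PARTITION_BY_API = {"by_page", "by_similarity"}


def check_chunking_settings(partition_by_api, chunking_strategy):
    """Raise ValueError for the combination that the note says will fail."""
    if chunking_strategy is None:
        return  # when chunking_strategy is omitted, no chunking is performed
    if not partition_by_api and chunking_strategy in STRATEGIES_REQUIRING_PARTITION_BY_API:
        raise ValueError(
            f"chunking_strategy={chunking_strategy!r} will fail when "
            "partition_by_api is False"
        )


check_chunking_settings(partition_by_api=False, chunking_strategy="by_title")  # passes
try:
    check_chunking_settings(partition_by_api=False, chunking_strategy="by_page")
except ValueError as err:
    print(f"Rejected: {err}")
```
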
