Staging functions in the Unstructured open source library are being deprecated in favor of destination connectors in the Unstructured Ingest CLI and Unstructured Ingest Python library.
Staging functions in the unstructured package help prepare your data for ingestion into downstream systems. A staging function accepts a list of document elements as input and returns an appropriately formatted dictionary as output. In the example below, we get our narrative text samples prepared for ingestion into LabelStudio using the stage_for_label_studio function. We can take this data and directly upload it into LabelStudio to quickly get started with an NLP labeling task.
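A minimal sketch of that workflow (the sample NarrativeText element is a placeholder):

```python
import json

from unstructured.documents.elements import NarrativeText
from unstructured.staging.label_studio import stage_for_label_studio

elements = [NarrativeText(text="The 2022 hurricane season was the quietest in years.")]

# Convert the elements into LabelStudio task dictionaries
label_studio_data = stage_for_label_studio(elements)

# The resulting JSON file is ready to be uploaded to a new LabelStudio project
with open("label_studio.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
```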
convert_to_csv
Converts outputs to the initial structured data (ISD) format as a CSV string.
Examples:
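A minimal sketch (the sample elements are placeholders):

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.base import convert_to_csv

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]

# convert_to_csv returns the ISD representation as a CSV-formatted string
isd_csv = convert_to_csv(elements)

with open("isd-data.csv", "w") as csv_file:
    csv_file.write(isd_csv)
```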
For more information about the convert_to_csv function, you can check the source code here.
convert_to_dataframe
Converts a list of document Element objects to a pandas dataframe. The dataframe will have a text column with the text from the element and a type column indicating the element type, such as NarrativeText or Title.
Examples:
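A minimal sketch (the sample elements are placeholders):

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.base import convert_to_dataframe

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]

# The resulting dataframe has one row per element, with "type" and "text" columns
df = convert_to_dataframe(elements)
```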
For more information about the convert_to_dataframe function, you can check the source code here.
convert_to_dict
Converts a list of Element objects to a list of dictionaries. This is the default format for representing documents in unstructured.
Examples:
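A minimal sketch (the sample elements are placeholders):

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.base import convert_to_dict

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]

# isd contains one dictionary per element, e.g. {"type": "Title", "text": "Title 1", ...}
isd = convert_to_dict(elements)
```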
For more information about the convert_to_dict function, you can check the source code here.
dict_to_elements
Converts a list of dictionaries of the format produced by convert_to_dict back to a list of Element objects.
Examples:
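A minimal sketch:

```python
from unstructured.staging.base import dict_to_elements

isd = [
    {"text": "My Title", "type": "Title"},
    {"text": "My Narrative", "type": "NarrativeText"},
]

# elements contains a Title element and a NarrativeText element
elements = dict_to_elements(isd)
```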
For more information about the dict_to_elements function, you can check the source code here.
stage_csv_for_prodigy
Formats outputs in CSV format for use with Prodigy. After running stage_csv_for_prodigy, you can write the results to a CSV file that is ready to be used with Prodigy.
Examples:
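A minimal sketch; the optional metadata argument (one dictionary per element) is included for illustration:

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.prodigy import stage_csv_for_prodigy

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
metadata = [{"type": "title"}, {"type": "text"}]

prodigy_csv_data = stage_csv_for_prodigy(elements, metadata)

# The CSV string is ready to be written to a file and loaded into Prodigy
with open("prodigy.csv", "w") as csv_file:
    csv_file.write(prodigy_csv_data)
```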
For more information about the stage_csv_for_prodigy function, you can check the source code here.
stage_for_argilla
Converts a list of Text elements to an Argilla Dataset. The type of Argilla dataset to be generated can be specified with the argilla_task parameter. Valid values for argilla_task are "text_classification", "token_classification", and "text2text". If "token_classification" is selected and tokens is not included in the optional kwargs, the nltk word tokenizer is used by default.
Examples:
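A minimal sketch for a "text_classification" dataset; the metadata keyword argument is assumed to be an optional per-element list of dictionaries:

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.argilla import stage_for_argilla

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
metadata = [{"type": "title"}, {"type": "text"}]

# argilla_dataset is an Argilla dataset object ready to be logged to Argilla
argilla_dataset = stage_for_argilla(elements, "text_classification", metadata=metadata)
```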
For more information about the stage_for_argilla function, you can check the source code here.
stage_for_baseplate
The stage_for_baseplate staging function prepares a list of Element objects for ingestion into Baseplate, an LLM backend with a spreadsheet interface. After running the stage_for_baseplate function, you can use the Baseplate API to upload the documents to Baseplate. The following example code shows how to use the stage_for_baseplate function.
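A minimal sketch (the sample elements are placeholders):

```python
from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.baseplate import stage_for_baseplate

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]

# rows is a dictionary with a "rows" key whose value is a list of rows,
# matching the payload format for the Baseplate upload API
rows = stage_for_baseplate(elements)
```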
For more information about the stage_for_baseplate function, you can check the source code here.
stage_for_datasaur
Formats a list of Text elements as input to token-based tasks in Datasaur.
Example:
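A minimal sketch:

```python
from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur

elements = [Text(text="Text example one."), Text(text="Text example two.")]
datasaur_data = stage_for_datasaur(elements)
```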
You can also specify entities in the stage_for_datasaur function. Entities you specify in the input will be included in the entities key in the output. The list of entities is a list of dictionaries and must have all of the keys in the example below. The list of entities must be the same length as the list of elements. Use an empty list for any elements that do not have any entities.
Example:
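A minimal sketch; the entity dictionary keys shown here (text, type, start_idx, end_idx) are assumed to be the required set:

```python
from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur

elements = [Text(text="Hi my name is Matt.")]

# One list of entities per element; start_idx/end_idx are character offsets
entities = [[{"text": "Matt", "type": "PER", "start_idx": 14, "end_idx": 18}]]

datasaur_data = stage_for_datasaur(elements, entities)
```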
For more information about the stage_for_datasaur function, you can check the source code here.
stage_for_label_box
Formats outputs for use with LabelBox. LabelBox accepts cloud-hosted data and does not support importing text directly. The stage_for_label_box function does the following:
- Stages the data files in the output_directory specified in the function arguments to be uploaded to a cloud storage service.
- Returns a config of type List[Dict[str, Any]] that can be written to a json file and imported into LabelBox.
Note: stage_for_label_box does not upload the data to remote storage such as S3. Users can upload the data to S3 using aws s3 sync ${output_directory} ${url_prefix} after running the stage_for_label_box staging function.
Examples:
The following example demonstrates generating a config.json file that can be used with LabelBox and uploading the staged data files to an S3 bucket.
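A minimal sketch, assuming the staged files will be synced to an S3 bucket; the bucket name, key prefix, and the external_ids and create_directory keyword arguments are illustrative assumptions:

```python
import json

from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.label_box import stage_for_label_box

# Placeholder S3 location where the staged files will be hosted
S3_BUCKET_NAME = "labelbox-staging-bucket"
S3_BUCKET_KEY_PREFIX = "data/"
S3_URL_PREFIX = f"https://{S3_BUCKET_NAME}.s3.amazonaws.com/{S3_BUCKET_KEY_PREFIX}"

LOCAL_OUTPUT_DIRECTORY = "/tmp/labelbox-staging"

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]

labelbox_config = stage_for_label_box(
    elements,
    output_directory=LOCAL_OUTPUT_DIRECTORY,
    url_prefix=S3_URL_PREFIX,
    external_ids=["id1", "id2"],
    create_directory=True,
)

# Write the config to a JSON file that can be imported into LabelBox
with open("config.json", "w") as f:
    json.dump(labelbox_config, f, indent=4)

# The staged files can then be uploaded with:
#   aws s3 sync /tmp/labelbox-staging s3://labelbox-staging-bucket/data/
```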
For more information about the stage_for_label_box function, you can check the source code here.
stage_for_label_studio
Formats outputs for upload to LabelStudio. After running stage_for_label_studio, you can write the results to a JSON file that is ready to be included in a new LabelStudio project.
Examples:
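A minimal sketch; text_field and id_field override the default output field names:

```python
import json

from unstructured.documents.elements import NarrativeText
from unstructured.staging.label_studio import stage_for_label_studio

elements = [NarrativeText(text="Narrative 1")]

label_studio_data = stage_for_label_studio(elements, text_field="my_text", id_field="my_id")

# The resulting JSON file is ready to be included in a new LabelStudio project
with open("label_studio.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
```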
The annotations kwarg is a list of lists. If annotations is specified, there must be a list of annotations for each element in the elements list. If an element does not have any annotations, use an empty list. The following shows an example of how to upload annotations for the “Text Classification” task in LabelStudio:
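A minimal sketch, assuming the LabelStudioAnnotation and LabelStudioResult helpers exported by unstructured.staging.label_studio; the sentiment labeling config is a placeholder:

```python
from unstructured.documents.elements import NarrativeText
from unstructured.staging.label_studio import (
    LabelStudioAnnotation,
    LabelStudioResult,
    stage_for_label_studio,
)

elements = [NarrativeText(text="A big brown bear")]

# One list of annotations per element
annotations = [
    [
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ]
        )
    ]
]

label_studio_data = stage_for_label_studio(elements, annotations=annotations)
```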
The predictions kwarg is also a list of lists. A prediction is an annotation with the addition of a score value. If predictions is specified, there must be a list of predictions for each element in the elements list. If an element does not have any predictions, use an empty list. The following shows an example of how to upload predictions for the “Text Classification” task in LabelStudio:
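A minimal sketch, assuming a LabelStudioPrediction helper analogous to LabelStudioAnnotation; the score value is a placeholder:

```python
from unstructured.documents.elements import NarrativeText
from unstructured.staging.label_studio import (
    LabelStudioPrediction,
    LabelStudioResult,
    stage_for_label_studio,
)

elements = [NarrativeText(text="A big brown bear")]

# One list of predictions per element; each prediction carries a score
predictions = [
    [
        LabelStudioPrediction(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ],
            score=0.68,
        )
    ]
]

label_studio_data = stage_for_label_studio(elements, predictions=predictions)
```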
For more information about the stage_for_label_studio function, you can check the source code here.
stage_for_prodigy
Formats outputs in JSON format for use with Prodigy. After running stage_for_prodigy, you can write the results to a JSON file that is ready to be used with Prodigy.
Examples:
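A minimal sketch; the optional metadata argument (one dictionary per element) is included for illustration:

```python
import json

from unstructured.documents.elements import Title, NarrativeText
from unstructured.staging.prodigy import stage_for_prodigy

elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
metadata = [{"type": "title"}, {"type": "text"}]

prodigy_data = stage_for_prodigy(elements, metadata)

# The results can now be written to a JSON file for use with Prodigy
with open("prodigy.json", "w") as f:
    json.dump(prodigy_data, f, indent=4)
```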
Note: Prodigy recommends the .jsonl format for feeding data to API loaders. After running stage_for_prodigy, you can use the save_as_jsonl utility function to save the formatted data to a .jsonl file that is ready to be used with Prodigy.
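For example, continuing from the sketch above:

```python
from unstructured.utils import save_as_jsonl

save_as_jsonl(prodigy_data, "prodigy.jsonl")
```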
For more information about the stage_for_prodigy function, you can check the source code here.
stage_for_transformers
Prepares Text elements for processing in transformers pipelines by splitting the elements into chunks that fit into the model’s attention window.
Examples:
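A minimal sketch; the tiny test checkpoint and the sample text are placeholders:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

from unstructured.documents.elements import NarrativeText
from unstructured.staging.huggingface import stage_for_transformers

model_name = "hf-internal-testing/tiny-bert-for-token-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

text = (
    "From frost advisories this morning to a strong cold front expected later this "
    "week, the chance of fall showing up is real. There's a chance for isolated "
    "showers Monday afternoon through Sunday during the afternoon hours."
)
elements = [NarrativeText(text=text)]

# Split the elements into chunks that fit within the model's attention window
chunks = stage_for_transformers(elements, tokenizer)

# Run the pipeline over the text of each chunk
results = [nlp(str(chunk)) for chunk in chunks]
```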
The following optional keyword arguments can be specified in stage_for_transformers:
- buffer: Indicates the number of tokens to leave as a buffer for the attention window. This is to account for special tokens like [CLS] that can appear at the beginning or end of an input sequence.
- max_input_size: The size of the attention window for the model. If not specified, the default is the model_max_length attribute on the tokenizer object.
- split_function: The function used to split the text into chunks to consider for adding to the attention window. Splits on spaces by default.
- chunk_separator: The string used to concatenate adjacent chunks when reconstructing the text. Uses spaces by default.
If you need to operate on raw text instead of unstructured Text objects, use the chunk_by_attention_window helper function. Simply modify the example above to include the following (reusing the text, tokenizer, and nlp objects defined above):
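```python
from unstructured.staging.huggingface import chunk_by_attention_window

# chunk_by_attention_window operates on a raw string rather than Text elements
chunks = chunk_by_attention_window(text, tokenizer)

results = [nlp(chunk) for chunk in chunks]
```

For more information about the stage_for_transformers function, you can check the source code here.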
stage_for_weaviate
The stage_for_weaviate staging function prepares a list of Element objects for ingestion into the Weaviate vector database. You can create a schema in Weaviate for the unstructured outputs using the following workflow:
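A minimal sketch, assuming a Weaviate instance running locally on port 8080 and the v3 weaviate-client API; the class name and input file are placeholders:

```python
import weaviate
from weaviate.util import generate_uuid5

from unstructured.partition.auto import partition
from unstructured.staging.weaviate import (
    create_unstructured_weaviate_class,
    stage_for_weaviate,
)

# Build a schema class for unstructured outputs and register it with Weaviate
unstructured_class = create_unstructured_weaviate_class(class_name="UnstructuredDocument")
schema = {"classes": [unstructured_class]}

client = weaviate.Client("http://localhost:8080")
client.schema.create(schema)

# Partition a document and stage the elements as Weaviate data objects
elements = partition(filename="example-docs/layout-parser-paper-fast.pdf")
data_objects = stage_for_weaviate(elements)

# Batch-upload the data objects to the new class
with client.batch(batch_size=10) as batch:
    for data_object in data_objects:
        batch.add_data_object(
            data_object,
            "UnstructuredDocument",
            uuid=generate_uuid5(data_object),
        )
```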
For more information about the stage_for_weaviate function, you can check the source code here.
