snorkelflow_extensions.taxonomy_distillation.models.huggingface.HuggingfaceTextEncoder

class snorkelflow_extensions.taxonomy_distillation.models.huggingface.HuggingfaceTextEncoder(model_name: str, batch_size: int = 32)

Bases: object

Text encoder using Hugging Face sentence-transformers models.

This encoder wraps sentence-transformers models from the Hugging Face model hub to generate dense vector representations of text. The encoded vectors serve as input features for downstream classification tasks in the hierarchical text classification pipeline.

Supports batch processing for efficient encoding of multiple texts and automatic device management for GPU acceleration when available.
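
The batch processing mentioned above can be pictured with a minimal sketch. It assumes the encoder splits its input into consecutive chunks of at most `batch_size` texts; `iter_batches` is a hypothetical helper written for illustration and is not part of `snorkelflow_extensions`, whose actual implementation may differ.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 70 texts with the default batch size of 32 give batches of 32, 32, and 6.
texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(texts)]
print(batch_sizes)  # [32, 32, 6]
```

A larger `batch_size` means fewer forward passes but more memory per pass, so the default of 32 may need lowering on memory-constrained devices.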

__init__

__init__(model_name: str, batch_size: int = 32) -> None

Initialize the Hugging Face text encoder.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| model_name | str | | The model name. Must be a sentence-transformers model from the Hugging Face model hub: https://huggingface.co/models?library=sentence-transformers |
| batch_size | int | 32 | Optional. The batch size to use for encoding. |
| verbose | | 0 | The verbosity level of the encoder. Not present in the constructor signature above. |

Returns: None
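
As a usage illustration, the constructor can be called with the two documented arguments. The sketch below assumes `snorkelflow_extensions` is installed and that the chosen model can be downloaded from the Hugging Face model hub. `build_encoder` is a hypothetical wrapper, and `sentence-transformers/all-MiniLM-L6-v2` is only one example of a sentence-transformers model name.

```python
from typing import Any, Optional


def build_encoder(
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",  # example model, an assumption
    batch_size: int = 32,
) -> Optional[Any]:
    """Construct the encoder, or return None if the package is not installed."""
    try:
        # Imported lazily so that this sketch still loads without the package.
        from snorkelflow_extensions.taxonomy_distillation.models.huggingface import (
            HuggingfaceTextEncoder,
        )
    except ImportError:
        return None
    return HuggingfaceTextEncoder(model_name=model_name, batch_size=batch_size)


encoder = build_encoder()
if encoder is None:
    print("snorkelflow_extensions is not installed")
```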

Methods

| Method | Description |
| --- | --- |
| __init__(model_name[, batch_size]) | Initialize the Hugging Face text encoder. |
| encode_text(text) | Transform a single text. |
| encode_texts(texts) | Transform a list of texts. |
| get_embedding_dim() | Get the embedding dimension. |

encode_text

encode_text(text: str) -> Tensor

Encode a single text into a dense vector. The fit method must be called before calling this method (no fit method is listed for this class, so this requirement may not apply).

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| text | str | | The text to transform. |

Returns: The transformed text.

encode_texts

encode_texts(texts: List[str]) -> Tensor

Encode a list of texts into dense vectors. The fit method must be called before calling this method (no fit method is listed for this class, so this requirement may not apply).

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| texts | List[str] | | A list of texts. |

Returns: The transformed texts.
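
Because `encode_text` takes a single string and `encode_texts` takes a list, a caller handling both kinds of input may want a small dispatcher. The sketch below is hypothetical: `encode` and `StubEncoder` are illustrative names, and the stub returns text lengths rather than real embeddings so that the example runs without any model.

```python
from typing import Any, List, Protocol, Union


class TextEncoder(Protocol):
    """Structural type covering the two documented encoding methods."""

    def encode_text(self, text: str) -> Any: ...

    def encode_texts(self, texts: List[str]) -> Any: ...


def encode(encoder: TextEncoder, data: Union[str, List[str]]) -> Any:
    """Route a single string to encode_text and a list to encode_texts."""
    if isinstance(data, str):
        return encoder.encode_text(data)
    return encoder.encode_texts(list(data))


class StubEncoder:
    """Stand-in for demonstration only; not the real encoder."""

    def encode_text(self, text: str) -> List[float]:
        return [float(len(text))]

    def encode_texts(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


single = encode(StubEncoder(), "hello")
many = encode(StubEncoder(), ["hello", "hi"])
print(single, many)  # [5.0] [[5.0], [2.0]]
```

The explicit `isinstance` check guards against a common Python mistake: a bare string passed where a list is expected is itself iterable and would be treated as a sequence of characters.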

get_embedding_dim

get_embedding_dim() -> int

Get the embedding dimension. The fit method must be called before calling this method (no fit method is listed for this class, so this requirement may not apply).

Parameters

None

Returns: The embedding dimension, or 0 if the embedding dimension is not available.
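
Since 0 is used as a sentinel for "not available", code that sizes a downstream layer from this value may want to fail early instead of silently building a zero-width input. The following is a minimal sketch under that assumption; `require_embedding_dim` and the stub classes are hypothetical and not part of the library.

```python
from typing import Protocol


class HasEmbeddingDim(Protocol):
    """Structural type for any object exposing get_embedding_dim()."""

    def get_embedding_dim(self) -> int: ...


def require_embedding_dim(encoder: HasEmbeddingDim) -> int:
    """Return the embedding dimension, raising if the encoder reports 0."""
    dim = encoder.get_embedding_dim()
    if dim <= 0:
        raise RuntimeError("embedding dimension is not available")
    return dim


class _KnownDim:
    """Stub reporting a fixed dimension, for demonstration only."""

    def get_embedding_dim(self) -> int:
        return 384


class _UnknownDim:
    """Stub reporting the 0 sentinel, for demonstration only."""

    def get_embedding_dim(self) -> int:
        return 0


known = require_embedding_dim(_KnownDim())
try:
    require_embedding_dim(_UnknownDim())
    unavailable_raised = False
except RuntimeError:
    unavailable_raised = True
```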