Bag-of-words Transformer
class convokit.bag_of_words.bow_transformer.BoWTransformer(obj_type: str, vector_name='bow_vector', text_func: Callable[[convokit.model.corpusComponent.CorpusComponent], str] = None, vectorizer=None)
Bag-of-Words Transformer for annotating a Corpus’s objects with the bag-of-words vectorization of some textual element of the Corpus components.
Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used for each object type is:
- For utterances, the utterance text.
- For conversations, the joined texts of all the utterances in the conversation.
- For speakers, the joined texts of all the utterances by the speaker.
A custom text source can be supplied through the text_func argument.
Compatible with any type of vectorizer (e.g. bag-of-words, TF-IDF, etc.).
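The default text sources listed above can be pictured with a small, ConvoKit-free sketch. The `Utt` class and the helper functions are hypothetical stand-ins rather than ConvoKit's API, and joining texts with a single space is an assumption of this sketch. It only illustrates which utterances contribute text for each obj_type, and how a custom callable in the spirit of text_func could replace the default.

```python
from dataclasses import dataclass


@dataclass
class Utt:
    """Hypothetical stand-in for an utterance (not ConvoKit's Utterance class)."""
    text: str
    speaker: str
    conversation: str


utts = [
    Utt("hello there", "alice", "c1"),
    Utt("hi alice", "bob", "c1"),
    Utt("see you later", "alice", "c2"),
]


def utterance_text(utt):
    # obj_type="utterance": the utterance's own text
    return utt.text


def conversation_text(conv_id):
    # obj_type="conversation": texts of all utterances in the conversation, joined
    # (the single-space separator is an assumption of this sketch)
    return " ".join(u.text for u in utts if u.conversation == conv_id)


def speaker_text(speaker_id):
    # obj_type="speaker": texts of all utterances by the speaker, joined
    return " ".join(u.text for u in utts if u.speaker == speaker_id)


def first_word_only(utt):
    # A custom text function: any callable mapping a component to a string
    return utt.text.split()[0].lower()


print(utterance_text(utts[0]))    # hello there
print(conversation_text("c1"))    # hello there hi alice
print(speaker_text("alice"))      # hello there see you later
print(first_word_only(utts[1]))   # hi
```

The point of the custom function is that the vectorizer never needs to know where the string came from; it only sees one text per object.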
Parameters:
- obj_type – “speaker”, “utterance”, or “conversation”
- vectorizer – a sklearn vectorizer object; default is CountVectorizer(min_df=10, max_df=.5, ngram_range=(1, 1), binary=False, max_features=15000)
- vector_name – name for the vector matrix generated in the transform() step
- text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.
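To make the default vectorizer settings more concrete, here is a rough pure-Python sketch of document-frequency pruning as CountVectorizer's min_df, max_df, and max_features are documented to behave: an integer threshold is an absolute document count, a float is a proportion of documents, and max_features keeps the most frequent terms. The function `filter_vocabulary` is hypothetical, whitespace tokenization is a simplification, and the alphabetical tie-breaking is an assumption of this sketch, so results may differ from sklearn in edge cases.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Return the sorted vocabulary kept after document-frequency pruning."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = doc.lower().split()  # simplification: whitespace tokenization
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Integers are absolute document counts; floats are proportions of documents
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs

    kept = [term for term, df in doc_freq.items() if low <= df <= high]
    if max_features is not None:
        # Keep the most frequent terms overall (ties broken alphabetically here)
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]

# "the" appears in every document (above max_df); "dog" and "bird" appear in
# only one document (below min_df)
print(filter_vocabulary(docs, min_df=2, max_df=0.5))
# ['cat', 'flew', 'sat']
```

With the defaults quoted above (min_df=10), a small corpus can easily end up with an empty vocabulary, which is one reason to pass a custom vectorizer when experimenting on small samples.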
fit(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>)
Fit the Transformer’s internal vectorizer on the texts of the Corpus objects, with an optional selector that chooses which objects to fit on.
Parameters:
- corpus – the target Corpus
- selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns: the fitted BoWTransformer
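The effect of a selector on fitting can be sketched without ConvoKit. In the snippet below, `fit_vocabulary` and the dictionary-based utterances are hypothetical stand-ins; the sketch only shows that objects rejected by the selector contribute nothing to the learned vocabulary.

```python
def fit_vocabulary(objects, text_of, selector=lambda obj: True):
    """Build a term -> column index mapping from the selected objects only."""
    terms = set()
    for obj in objects:
        if selector(obj):  # objects rejected by the selector are skipped
            terms.update(text_of(obj).lower().split())
    return {term: idx for idx, term in enumerate(sorted(terms))}


utterances = [
    {"id": "u1", "text": "good morning", "speaker": "alice"},
    {"id": "u2", "text": "morning all", "speaker": "bob"},
    {"id": "u3", "text": "bye", "speaker": "alice"},
]

# Default selector: every object is included
all_vocab = fit_vocabulary(utterances, lambda u: u["text"])

# Custom selector: fit only on alice's utterances
alice_vocab = fit_vocabulary(
    utterances,
    lambda u: u["text"],
    selector=lambda u: u["speaker"] == "alice",
)

print(all_vocab)    # {'all': 0, 'bye': 1, 'good': 2, 'morning': 3}
print(alice_vocab)  # {'bye': 0, 'good': 1, 'morning': 2}
```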
fit_transform(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus
Fits the Transformer’s internal vectorizer on the Corpus component objects’ texts, then computes vector representations for them and stores these in the Corpus under vector_name.
Parameters:
- corpus – the target Corpus
- selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns: the Corpus with the computed vector matrix stored in it
get_vocabulary()
Get the vocabulary of the vectorizer object.
transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus
Computes the vector matrix for the Corpus component objects and stores it in a ConvoKitMatrix object, which is saved in the Corpus under vector_name.
Parameters:
- corpus – the target Corpus
- selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns: the annotated target Corpus
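As a rough illustration of the transform step, the sketch below turns each selected object's text into one row of term counts over a previously fitted vocabulary and files the result under a vector name. `transform_counts` is a hypothetical helper, and the plain dictionary is only a stand-in for the named matrix storage; neither reflects how ConvoKitMatrix is actually implemented.

```python
def transform_counts(objects, text_of, vocabulary, selector=lambda obj: True):
    """Return (row_ids, matrix) with one count row per selected object."""
    row_ids, matrix = [], []
    for obj in objects:
        if not selector(obj):
            continue
        row = [0] * len(vocabulary)
        for token in text_of(obj).lower().split():
            col = vocabulary.get(token)
            if col is not None:  # terms not in the fitted vocabulary are ignored
                row[col] += 1
        row_ids.append(obj["id"])
        matrix.append(row)
    return row_ids, matrix


# Vocabulary assumed to come from an earlier fitting step
vocabulary = {"bye": 0, "good": 1, "morning": 2}
utterances = [
    {"id": "u1", "text": "good good morning"},
    {"id": "u2", "text": "morning all"},
]

row_ids, matrix = transform_counts(utterances, lambda u: u["text"], vocabulary)

# A plain dict stands in for storing the matrix under vector_name
vectors = {"bow_vector": {"ids": row_ids, "matrix": matrix}}

print(vectors["bow_vector"]["ids"])     # ['u1', 'u2']
print(vectors["bow_vector"]["matrix"])  # [[0, 2, 1], [0, 0, 1]]
```

Keeping the row ids next to the matrix is what lets each row be traced back to the object it describes.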