Column normalized Tf-Idf¶
Implements a modified Tf-Idf transformer that normalizes by columns (i.e., term-wise).
-
class
convokit.expected_context_framework.col_normed_tfidf.
ColNormedTfidf
(**kwargs)¶ Model that derives tf-idf reweighted representations of utterances, which are normalized by column. Can be used in ConvoKit through the ColNormedTfidfTransformer transformer; see documentation of that transformer for further details.
-
fit_transform
(X, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X – array-like of shape (n_samples, n_features); input samples.
- y – array-like of shape (n_samples,) or (n_samples, n_outputs), default=None; target values (None for unsupervised transformations).
- **fit_params – dict; additional fit parameters.
Returns: X_new – ndarray of shape (n_samples, n_features_new); the transformed array.
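To make "normalized by column" concrete, the following is a minimal pure-Python sketch, not the library's implementation. It assumes raw term counts for tf, the smoothed idf ln((1 + n) / (1 + df)) + 1, no row normalization, and an L2 norm taken per term across all documents. The function name `col_normed_tfidf` and these weighting choices are illustrative assumptions.

```python
import math
from collections import Counter


def col_normed_tfidf(docs):
    """Return (vocabulary, matrix) for whitespace-tokenized docs.

    Illustrative assumptions: raw counts for tf, smoothed idf
    ln((1 + n) / (1 + df)) + 1, and an L2 norm per column (term).
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of docs that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    # Unnormalized tf-idf weights, one row per document.
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] * idf[t] for t in vocab])

    # One L2 norm per column; never zero, since every vocabulary
    # term occurs in at least one document.
    col_norms = [
        math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(len(vocab))
    ]
    matrix = [
        [row[j] / col_norms[j] for j in range(len(vocab))] for row in rows
    ]
    return vocab, matrix


vocab, matrix = col_normed_tfidf(
    ["thank you", "thank you very much", "no thank you"]
)
```

After this step every term's column has unit length, so a term that appears in every document ("thank") no longer outweighs a rare one ("much") simply because of its frequency.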
-
-
class
convokit.expected_context_framework.col_normed_tfidf.
ColNormedTfidfTransformer
(input_field, output_field='col_normed_tfidf', model=None, **kwargs)¶ Transformer that derives tf-idf reweighted representations of utterances, which are normalized by column, i.e., per term. This may be helpful in deriving downstream representations that are less sensitive to relative term frequency; for instance, it could be used to derive input representations to ExpectedContextModelWrapper.
Parameters: - input_field – the name of the attribute of utterances to use as input to fit. Note that unless token_pattern is specified as an additional argument, this attribute must be a string consisting of whitespace-separated features.
- output_field – the name of the attribute to write to in the transform step.
- model – optional, an existing ColNormedTfidfTransformer
- kwargs – other keyword arguments used to initialize the underlying TfidfVectorizer from scikit-learn; see that documentation for details.
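The requirement that the input be whitespace-separated features matters because scikit-learn's default token_pattern splits on punctuation and drops single-character tokens. The sketch below contrasts that default with a hypothetical whitespace-only pattern, using only the standard `re` module. The pattern `\S+` and the sample feature string are assumptions for illustration; this page does not state which pattern the transformer uses internally.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
SKLEARN_DEFAULT = r"(?u)\b\w\w+\b"

# Hypothetical pattern that treats each whitespace-separated chunk
# as a single feature (an assumption for this sketch).
WHITESPACE_FEATURES = r"\S+"

# Hypothetical input: features such as dependency arcs joined by ">".
features = "need>help i>need a"

default_tokens = re.findall(SKLEARN_DEFAULT, features)
whitespace_tokens = re.findall(WHITESPACE_FEATURES, features)

print(default_tokens)     # features are split apart at ">"
print(whitespace_tokens)  # each feature is kept intact
```

If the input attribute uses some other delimiter, passing a matching token_pattern through kwargs is the documented way to override this behavior.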
-
dump
(dirname)¶ Dumps model to disk.
Parameters: dirname – directory to write to Returns: None
-
fit
(corpus, y=None, selector=<function ColNormedTfidfTransformer.<lambda>>)¶ Fits a transformer over training data.
Parameters: - corpus – Corpus
- selector – which utterances to fit the transformer over. A boolean function of the form filter(utterance) that defaults to True (i.e., all utterances).
Returns: None
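A selector is any callable that takes an utterance and returns a boolean. The sketch below shows the shape of such a function using a hypothetical stand-in class, `FakeUtterance`, rather than a real ConvoKit Utterance. The `is_question` metadata field and the `select` helper are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeUtterance:
    """Hypothetical stand-in for a ConvoKit Utterance (illustration only)."""
    text: str
    meta: dict = field(default_factory=dict)


def select(utterances, selector=lambda utt: True):
    """Keep the utterances for which the boolean selector returns True."""
    return [utt for utt in utterances if selector(utt)]


utts = [
    FakeUtterance("thanks", {"is_question": False}),
    FakeUtterance("could you say more ?", {"is_question": True}),
]

# The default selector accepts every utterance.
everything = select(utts)

# A custom selector restricts processing to a subset.
questions_only = select(utts, selector=lambda utt: utt.meta["is_question"])
```

A function of the same shape as the lambda above can be passed as the selector argument to fit, transform, or fit_transform.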
-
fit_transform
(corpus, y=None, selector=<function ColNormedTfidfTransformer.<lambda>>)¶ Fit and run the Transformer on a single Corpus.
Parameters: corpus – the Corpus to use Returns: same as transform
-
get_vocabulary
()¶ Returns: array of feature names
-
load
(dirname)¶ Loads model from disk.
Parameters: dirname – directory to load from Returns: None
-
transform
(corpus, selector=<function ColNormedTfidfTransformer.<lambda>>)¶ Computes column-normalized tf-idf representations for utterances in a corpus, stored in the corpus as <output_field>. Also annotates each utterance with a metadata field, <output_field>__n_feats, indicating the number of vocabulary terms that the utterance contains.
Parameters: - corpus – Corpus
- selector – which utterances to transform
Returns: corpus, with per-utterance representations and vocabulary counts
-
transform_utterance
(utt)¶ Computes tf-idf representations for a single utterance. The representation is stored in the utterance as <output_field>__vect; the number of vocabulary terms that the utterance contains is stored as <output_field>__n_feats.
Parameters: utt – Utterance Returns: utterance, with representation and vocabulary count
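The <output_field>__n_feats annotation can be pictured as a count of the utterance's features that fall inside the fitted vocabulary. The sketch below assumes the count is over distinct terms, so a repeated feature is counted once; this page does not say whether repeats are counted. The helper `count_vocab_terms` and the sample vocabulary are hypothetical.

```python
def count_vocab_terms(feature_string, vocabulary):
    """Count distinct whitespace-separated features found in the vocabulary."""
    vocab = set(vocabulary)
    return len({tok for tok in feature_string.split() if tok in vocab})


# Hypothetical vocabulary, standing in for the output of get_vocabulary().
vocabulary = ["need>help", "i>need", "thank>you"]

# "i>need" is repeated and "now" is out of vocabulary.
n_feats = count_vocab_terms("i>need i>need need>help now", vocabulary)

# An utterance with no in-vocabulary features.
n_feats_empty = count_vocab_terms("hello there", vocabulary)
```

Under this assumption, a count of zero means none of the utterance's features are in the vocabulary, which is a useful signal for filtering out utterances with empty representations.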