TextParser
class convokit.text_processing.textParser.TextParser(output_field='parsed', input_field=None, mode='parse', input_filter=<function TextParser.<lambda>>, spacy_nlp=None, sent_tokenizer=None, verbosity=0)
Transformer that dependency-parses each Utterance in a Corpus. This parsing step is a prerequisite for some of the models included in ConvoKit.
By default, the transformer does the following:
- tokenizes words and sentences
- POS-tags words
- dependency-parses sentences
It also supports only tokenizing, or only tokenizing and tagging. These reduced modes use SpaCy together with nltk’s sentence tokenizer (since SpaCy requires dependency parses in order to tokenize sentences).
Parses are stored as JSON-serializable objects: a list of sentence-level parses, where each sentence-level parse is a dict containing:
- toks: a list of tokens in the sentence.
- rt: the index of the root of the dependency parse, in the list of tokens.
Each token, in turn, is a dict containing:
- tok: the text
- tag: the POS tag (if tagging is on)
- dep: the dependency between that token and its parent (‘ROOT’ if the token is the root); available if parsing is on.
- up: the index of the parent of the token in the sentence; does not exist for root tokens.
- dn: the indices of the children of the token in the sentence
Note that in principle, this data structure is readily extensible – arbitrary fields could be added to sentences and tokens (e.g., to support NER).
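The layout described above can be walked with ordinary list and dict operations. The following is a minimal sketch, assuming a hand-written parse of the sentence "I like cats."; the tag and dep labels are illustrative values rather than output captured from ConvoKit, and the helper names root_token, children, and path_to_root are hypothetical.

```python
# A hand-written parse of "I like cats." in the documented layout.
# The tag and dep labels are illustrative, not output captured from ConvoKit.
parse = [
    {
        "rt": 1,
        "toks": [
            {"tok": "I", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []},
            {"tok": "like", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]},
            {"tok": "cats", "tag": "NNS", "dep": "dobj", "up": 1, "dn": []},
            {"tok": ".", "tag": ".", "dep": "punct", "up": 1, "dn": []},
        ],
    }
]


def root_token(sentence):
    """Return the token dict at the sentence's root index ('rt')."""
    return sentence["toks"][sentence["rt"]]


def children(sentence, index):
    """Return the token dicts listed in the 'dn' field of the token at index."""
    return [sentence["toks"][i] for i in sentence["toks"][index]["dn"]]


def path_to_root(sentence, index):
    """Follow 'up' links from a token to the root, collecting token texts."""
    path = []
    while True:
        tok = sentence["toks"][index]
        path.append(tok["tok"])
        if "up" not in tok:  # root tokens have no 'up' field
            return path
        index = tok["up"]


sentence = parse[0]
print(root_token(sentence)["tok"])                            # like
print([t["tok"] for t in children(sentence, sentence["rt"])])  # ['I', 'cats', '.']
print(path_to_root(sentence, 2))                               # ['cats', 'like']
```

Because root tokens carry no up field, code that climbs the tree should test for the key's presence rather than index it directly.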
Parameters:
- output_field – name of the attribute to write the parse to; defaults to ‘parsed’.
- input_field – name of the field to use as input. The field must point to a string; defaults to utterance.text.
- mode – defaults to “parse”, which runs the entire parsing pipeline. If set to “tag”, only tokenizing and tagging are run; if set to “tokenize”, only tokenizing is run.
- input_filter – a boolean function with signature input_filter(utterance, aux_input). Parses are computed only for utterances where input_filter returns True. By default it always returns True, so parses are computed for all utterances.
- spacy_nlp – if provided, this SpaCy object is used for parsing; otherwise an object is initialized via load(‘en’).
- sent_tokenizer – if provided, will use this sentence tokenizer; otherwise will initialize nltk’s sentence tokenizer.
- verbosity – frequency of status messages.
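As an illustration of the input_filter parameter, the sketch below defines a predicate with the documented two-argument signature that accepts only utterances with non-blank text. The name has_text is hypothetical, and the objects it is tried on are simple stand-ins with a text attribute rather than real ConvoKit Utterances; a function of this shape would presumably be passed as input_filter=has_text when constructing the transformer.

```python
from types import SimpleNamespace


def has_text(utterance, aux_input=None):
    """Return True only for utterances whose text has non-whitespace content."""
    text = getattr(utterance, "text", None)
    return isinstance(text, str) and bool(text.strip())


# Stand-in objects with a .text attribute, used here in place of real Utterances.
utterances = [
    SimpleNamespace(text="Hello there."),
    SimpleNamespace(text="   "),
    SimpleNamespace(text=""),
]

# Keep only the utterances the filter accepts, as the transformer would.
kept = [u.text for u in utterances if has_text(u, {})]
print(kept)  # ['Hello there.']
```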
convokit.text_processing.textParser.process_text(text, mode='parse', sent_tokenizer=None, spacy_nlp=None)
Stand-alone function that computes the dependency parse of a string.
Parameters:
- text – string to parse
- mode – ‘parse’, ‘tag’, or ‘tokenize’
- sent_tokenizer – if provided, use this sentence tokenizer
- spacy_nlp – if provided, use this spacy object
Returns: the parse, in json-serializable form.
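Since the returned parse is JSON-serializable, it should survive a round trip through Python's standard json module unchanged. The sketch below assumes a hand-written parse in the documented layout (not actual output of process_text) and checks that encoding and decoding preserve it.

```python
import json

# Hand-written example in the documented layout; not actual process_text output.
parse = [
    {
        "rt": 0,
        "toks": [
            {"tok": "Stop", "tag": "VB", "dep": "ROOT", "dn": [1]},
            {"tok": "!", "tag": ".", "dep": "punct", "up": 0, "dn": []},
        ],
    }
]

encoded = json.dumps(parse)    # serialize to a JSON string
decoded = json.loads(encoded)  # parse it back into lists and dicts

# Flatten the decoded parse into a list of token texts.
tokens = [tok["tok"] for sent in decoded for tok in sent["toks"]]
print(decoded == parse)  # True
print(tokens)            # ['Stop', '!']
```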