This function tokenizes a given string based on the Penn Treebank rules and performs additional modifications to handle clitics (e.g., "n't", "'ll") and periods (trailing full stops) in the text.
Arguments
- x
A character vector containing the text you wish to tokenize.
- flatten
If TRUE, the tokenizer flattens the result into a single string, with white space separating the tokens.
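For example, a minimal sketch of the two output shapes (the tokens shown are illustrative and assume flatten defaults to FALSE):

# assumes the package providing dtag_tokenize() is attached
txt <- "She won't stay."
dtag_tokenize(txt)                  # e.g. "She" "wo" "n't" "stay" "."
dtag_tokenize(txt, flatten = TRUE)  # e.g. "She wo n't stay ."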
Details
The function is a wrapper for the tokenizers::tokenize_ptb() function, which generally works well but tends to keep sentence-ending full stops, and quotations opening with single quotation marks (apostrophes), attached to the neighbouring word as a single token.
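The underlying behaviour can be inspected by calling the wrapped function directly; the comments describe the tendency noted above rather than guaranteed output:

library(tokenizers)
tokenize_ptb("He said 'stop'. Then he left.")
# the sentence-ending full stop and the opening apostrophe-quote tend to
# stay attached to the neighbouring word; dtag_tokenize() splits them off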
It can be used as a tokenizer to prepare text for the add_st_tags() function, in which case you should use the flatten = TRUE argument in combination with add_st_tags(tokenizer = "horizontal"):

dtag_tokenize(text, flatten = TRUE) |> add_st_tags(tokenizer = "horizontal")
Note that the add_st_tags() function can also tokenize the text itself, using the udmodel, but this tends to separate hyphenated words, which might not be the desired result. Check the examples to compare the two approaches.
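The difference can be sketched roughly as follows (calling add_st_tags() directly on raw text is assumed to invoke the udmodel tokenization described above; the comments describe tendencies, not guaranteed output):

txt <- "a well-known rock-and-roll band"
# PTB-based route: hyphenated words stay intact
dtag_tokenize(txt, flatten = TRUE) |> add_st_tags(tokenizer = "horizontal")
# udmodel route: hyphenated words tend to be split at the hyphens
add_st_tags(txt)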