
This function tokenizes text based on the Penn Treebank rules and performs additional modifications to handle clitics (e.g., "n't", "'ll") and trailing periods (sentence-ending full stops).
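For illustration, Penn Treebank rules split clitics off as their own tokens (the output below is indicative only; exact tokens may vary):

dtag_tokenize("She'll say it isn't true.")
#> e.g. "She" "'ll" "say" "it" "is" "n't" "true" "."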

Usage

dtag_tokenize(x, flatten = FALSE)

Arguments

x

A character vector containing the text you wish to tokenize.

flatten

If TRUE, the tokenizer flattens the result into a single string, with white space separating the tokenized items.
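A minimal sketch of the two return shapes, assuming a single input string (the outputs shown are indicative only):

dtag_tokenize("Don't stop.")
#> e.g. "Do" "n't" "stop" "."

dtag_tokenize("Don't stop.", flatten = TRUE)
#> e.g. "Do n't stop ."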

Value

A character vector containing the tokenized text.

Details

The function is a wrapper around the tokenizers::tokenize_ptb() function, which generally works well but tends to leave sentence-ending full stops, and quotations opening with a single quotation mark (apostrophe), attached to the neighboring token instead of splitting them off.
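An illustrative comparison (exact tokens depend on the version of the tokenizers package):

txt <- "He said 'wait'. Then he left."

tokenizers::tokenize_ptb(txt)
#> the opening apostrophe and the sentence-internal full stop may stay
#> attached to the neighboring tokens

dtag_tokenize(txt)
#> the wrapper splits these off into separate tokens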

It can be used as a tokenizer to prepare text for the add_st_tags() function, in which case you should use the flatten = TRUE argument in combination with add_st_tags(tokenizer = "horizontal"):

dtag_tokenize(text, flatten = TRUE) |> add_st_tags(tokenizer = "horizontal")

Note that the add_st_tags() function can itself tokenize the text using the udmodel, but this tends to split hyphenated words, which might not be the desired result.

Check the examples below to compare the two approaches.

Examples
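
A sketch of the comparison mentioned in the Details, assuming add_st_tags() can also be called directly on raw text to use its udmodel-based tokenization (outputs omitted):

# Tokenize first with dtag_tokenize(), then tag: hyphenated words stay together
dtag_tokenize("a state-of-the-art solution", flatten = TRUE) |>
  add_st_tags(tokenizer = "horizontal")

# Let add_st_tags() tokenize with the udmodel instead: this may split
# "state-of-the-art" into separate tokens
add_st_tags("a state-of-the-art solution")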