
This function tokenizes text based on the Penn Treebank rules and performs additional modifications to handle clitics (e.g., "n't", "'ll") and trailing periods (sentence-ending full stops).
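For illustration, Penn Treebank rules split clitics off as their own tokens (the output below is indicative only; exact tokens may vary):

dtag_tokenize("She'll say it isn't true.")
#> e.g. "She" "'ll" "say" "it" "is" "n't" "true" "."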

Usage

dtag_tokenize(x, flatten = FALSE)

Arguments

x

A character vector containing the text you wish to tokenize.

flatten

If TRUE, the tokenizer flattens the result into a single string, with white space separating the tokenized items.
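A minimal sketch of the two return shapes, assuming a single input string (the outputs shown are indicative only):

dtag_tokenize("Don't stop.")
#> e.g. "Do" "n't" "stop" "."

dtag_tokenize("Don't stop.", flatten = TRUE)
#> e.g. "Do n't stop ."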

Value

A character vector containing the tokenized text.

Details

The function is a wrapper around the tokenizers::tokenize_ptb() function, which generally works well but tends to leave sentence-ending full stops, and quotations opening with a single quotation mark (apostrophe), attached to the neighboring token instead of splitting them off.
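An illustrative comparison (exact tokens depend on the version of the tokenizers package):

txt <- "He said 'wait'. Then he left."

tokenizers::tokenize_ptb(txt)
#> the opening apostrophe and the sentence-internal full stop may stay
#> attached to the neighboring tokens

dtag_tokenize(txt)
#> the wrapper splits these off into separate tokens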

It can be used as a tokenizer to prepare text for the add_st_tags() function, in which case you should use the flatten = TRUE argument in combination with add_st_tags(tokenizer = "horizontal"):

dtag_tokenize(text, flatten = TRUE) |> add_st_tags(tokenizer = "horizontal")

Note that the add_st_tags() function can itself tokenize the text using the udmodel, but this tends to split hyphenated words, which might not be the desired result.

Check the examples below to compare the two approaches.

Examples
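
A sketch of the comparison mentioned in the Details, assuming add_st_tags() can also be called directly on raw text to use its udmodel-based tokenization (outputs omitted):

# Tokenize first with dtag_tokenize(), then tag: hyphenated words stay together
dtag_tokenize("a state-of-the-art solution", flatten = TRUE) |>
  add_st_tags(tokenizer = "horizontal")

# Let add_st_tags() tokenize with the udmodel instead: this may split
# "state-of-the-art" into separate tokens
add_st_tags("a state-of-the-art solution")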