Add ST Tags to Text — add_st_tags

The add_st_tags function annotates text with a Universal Dependencies (UD) model via the udpipe package. It tokenizes text, tags it with part-of-speech (Stanford) tags, and can optionally extract and handle hesitation markers. Its options control whether parsing is run, which tokenizer is used, and whether the input is treated as flattened text.

Usage

add_st_tags(
  x,
  mdl = udmodel,
  st_hesitation = FALSE,
  flattened = TRUE,
  skip_parse = TRUE,
  ...
)

Arguments

x

A character vector of input text to be processed.

mdl

A udpipe model used to process the text. Default is udmodel.

st_hesitation

A logical value indicating whether or not to extract hesitation markers from the input text. If TRUE, the function will extract hesitation markers and return them separately. Default is FALSE.

flattened

A logical value indicating whether the input text is flattened, i.e. supplied as running text rather than as a vector of tokens. If FALSE, meaning x is a character vector in tokenized form, the function flattens the text before processing. Default is TRUE.

skip_parse

A logical value determining whether the function skips parsing and returns only tokenized and tagged text. If FALSE, the function runs the parser and returns the full udpipe output. Default is TRUE.

...

Additional arguments to be passed to the udpipe_annotate() function. For example:

tokenizer = "horizontal" forces udpipe_annotate() to split tokens on white space only. Words and their trailing punctuation marks stay together as one token unless they are already separated by white space in the input.

tokenizer = "vertical" forces udpipe_annotate() to split tokens on line breaks only. This is useful if you want the tokenizer to recognise multi-word entities as a single token, or to avoid splitting hyphenated words.
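The input shapes described above can be sketched in base R. This is only an illustration of what flattened, horizontal-ready, and vertical-ready strings look like. The variable names (flat, vertical, tokens_mwe, vertical_mwe, recovered) are made up for this sketch, and the way add_st_tags flattens input internally is an assumption, not something confirmed by this page.

```r
# Token vector taken from the Examples section of this page.
speech <- c("I", "don't", "know", "erm", ",", "whether", "to",
            "include", "hesitation", "markers", ".")

# Flattened form: one running string, tokens separated by single spaces.
# (Assumption: the function's own flattening may differ in detail.)
flat <- paste(speech, collapse = " ")

# Vertical form: one token per line, ending with a line break, so that
# a line-based tokenizer sees each line as exactly one token.
vertical <- paste0(paste(speech, collapse = "\n"), "\n")

# A multi-word expression stays a single token when it shares a line.
tokens_mwe <- c("I'm", "in", "a", "part-time", "job", ",",
                "at the moment", ".")
vertical_mwe <- paste0(paste(tokens_mwe, collapse = "\n"), "\n")

# Splitting on line breaks recovers exactly the tokens that went in.
recovered <- strsplit(vertical_mwe, "\n", fixed = TRUE)[[1]]
```

A string built like vertical_mwe has the same shape as text2 in the Examples section and is the kind of input that tokenizer = "vertical" expects. A string like flat is already white-space separated and suits tokenizer = "horizontal".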

Value

If skip_parse is FALSE, the function returns a tibble containing the full udpipe output, including the parse. If st_hesitation is TRUE (experimental), the function returns a character vector of tokenized and tagged text in which hesitation markers have been extracted and handled separately. Otherwise, the function returns a character vector of tokenized and tagged text.

Examples

if (FALSE) {
# Example text:
text <- "This is an example sentence to be tagged"
# Example speech, tokenized:
speech <- c("I", "don't", "know", "erm", ",", "whether", "to",
            "include", "hesitation", "markers", ".")
# Initiate udpipe model
init_udpipe_model()
# Tag text
add_st_tags(text)
# Tag speech
add_st_tags(speech, st_hesitation = TRUE, flattened = FALSE)
text <- "I'm in a part-time job, at the moment."
text2 <- "I'm\nin\na\npart-time\njob\n,\n\nat the moment\n.\n"
# tokenizes using default model - may separate some hyphenated words
add_st_tags(text)
# tokenizes on whitespaces - punctuation marks can be lumped in with words
add_st_tags(text, tokenizer = "horizontal")
# tokenizes on user-defined line breaks - possible to capture multi-word expressions
add_st_tags(text2, tokenizer = "vertical")
}