dtag_tbl (dtagger package)

Compute the Biber-style dimension scores of a data frame.

Usage

dtag_tbl(
  tbl,
  input = 1,
  text = 2,
  tokenized = FALSE,
  ttr = 400,
  deflated = TRUE,
  exclude = NULL
)

Arguments

tbl

A data frame with at least two columns: one holding the input id and another holding the _ST-tagged text.

input

The column holding the input id. Defaults to the 1st column, but can also be given by name, e.g. "colname1".

text

The column holding the text. Defaults to the 2nd column, but can also be given by name, e.g. "colname12". The text should be tagged with _ST tags and, unless tokenized = TRUE, be in flattened (not tokenized) form.

tokenized

Logical. The default is FALSE, in which case the function tokenizes the text with str_split(text, " "). Set to TRUE if text is already tokenized and in a list column.
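As a rough illustration of the two input shapes, the sketch below uses base R's strsplit() in place of str_split(). The column names and the word_TAG form of the sample text are assumptions made for the example, not taken from the package.

```r
# Flattened, space-separated text: one string per input (word_TAG form assumed).
tbl <- data.frame(
  id = c("a", "b"),
  text = c("It_PRP works_VBZ ._.", "We_PRP tested_VBD it_PRP ._."),
  stringsAsFactors = FALSE
)

# Base-R counterpart of str_split(text, " "): one character vector per row.
tokens <- strsplit(tbl$text, " ", fixed = TRUE)

# Storing the result as a list column gives the shape described for tokenized = TRUE.
tbl$text <- tokens

lengths(tbl$text)
```

With the default tokenized = FALSE this step is done by the function itself, so pre-splitting is only needed when the tokens already exist.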

ttr

Maximum number of tokens to consider when computing the type-token ratio (TTR). Defaults to 400.
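One possible reading of a capped TTR is sketched below. The helper ttr_first_n() is hypothetical and not part of dtagger. Lowercasing the tokens and returning a plain ratio are assumptions, since this page does not state how the package scales or reports the value.

```r
# Hypothetical helper, not part of dtagger: type-token ratio over the first n tokens.
ttr_first_n <- function(tokens, n = 400) {
  window <- head(tolower(tokens), n)       # cap the sample at n tokens
  length(unique(window)) / length(window)  # distinct types / tokens considered
}

tokens <- c("the", "cat", "saw", "the", "other", "cat")
ttr_first_n(tokens)         # 4 types over 6 tokens
ttr_first_n(tokens, n = 3)  # 3 types over the first 3 tokens
```

Capping the window keeps the measure comparable across texts of different lengths, which is why texts shorter than the cap are a concern.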

deflated

Logical. If TRUE (default), Dimension scores are calculated without using the low mean frequency features from Biber's original study, following the MAT tagger algorithm (Nini 2019).

exclude

A character vector of dimension tags to leave out of the analysis. Each tag must be written inside angle brackets and quoted. For example:

(1) If some of the word counts in the texts are below 400, you may want to exclude type token ratios from the analysis with exclude = "<TTR>".

(2) Some of the tags (such as <WZPRES> and <GER>) were manually checked in the original Biber study, but are automatically tagged here. You can exclude some of these tags by naming them in the exclude character vector such as exclude = c("<GER>", "<WZPRES>").

(3) To exclude all of the tags manually checked by Biber, use the argument exclude = "<MANUAL>". This is the same as the argument: exclude = c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>", "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")

(4) You can combine (1) and (3) with exclude = c("<MANUAL>", "<TTR>")
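To make the "<MANUAL>" shorthand concrete, the sketch below expands it into the tag list given in (3). The helper expand_exclude() is hypothetical and only illustrates the documented equivalence. It is not how the package necessarily handles the argument internally.

```r
# Tags listed in (3) as manually checked in Biber's original study.
manual_tags <- c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>",
                 "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")

# Hypothetical helper, not part of dtagger: replace "<MANUAL>" by the tags it stands for.
expand_exclude <- function(exclude) {
  if ("<MANUAL>" %in% exclude) {
    exclude <- union(setdiff(exclude, "<MANUAL>"), manual_tags)
  }
  exclude
}

expand_exclude(c("<MANUAL>", "<TTR>"))
```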

Value

A tibble containing:

wordcount - number of non-punctuation tokens found in the text
dimension - the dimension (Dimension1 to Dimension6, from Biber 1988) that each feature belongs to
feature - the <MDA> tag, or AWL or TTR
detail - brief description of the feature
count - number of times the feature occurs in the text
value - for <MDA> tags, the normalized frequency per 100 tokens
z-score - value standardized against biber_mean and biber_sd
d-score - same as z-score, but with the sign reversed for features that load negatively on a dimension
biber_mean, biber_sd - mean and standard deviation of each feature, based on Biber 1988
closest matching text type for each input, based on Biber 1989
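The relationship between value, z-score and d-score can be sketched as follows. The feature names, the positive flag and all numbers are made up for illustration and are not taken from Biber (1988). The column names use underscores here only to keep the base R code simple.

```r
# Illustrative values only: two hypothetical features on the same dimension.
features <- data.frame(
  feature    = c("<FEAT_A>", "<FEAT_B>"),
  positive   = c(TRUE, FALSE),   # FALSE marks a negatively loading feature (assumed flag)
  value      = c(3.0, 1.0),      # normalized frequency per 100 tokens
  biber_mean = c(2.0, 2.0),
  biber_sd   = c(0.5, 0.5),
  stringsAsFactors = FALSE
)

# z-score: distance from the reference mean in units of the reference sd.
features$z_score <- (features$value - features$biber_mean) / features$biber_sd

# d-score: same magnitude, with the sign flipped for negatively loading features.
features$d_score <- ifelse(features$positive, features$z_score, -features$z_score)

features[, c("feature", "z_score", "d_score")]
```

In this toy case the second feature occurs less often than the reference mean, so its z-score is negative, while its d-score is positive because a low count of a negatively loading feature pushes the dimension in the positive direction.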

Details

This function adds multidimensional analysis <MDA> tags to the text in a data frame. The data frame should contain one column identifying each input and another holding the text, which must already be tagged with _ST tags and flattened into non-tokenized form.

After tagging the text, the function then calculates Dimension scores based on the Biber 1988 standard, and approximates the closest text type as per Biber 1989 text classification.
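A minimal call might look like the sketch below. The sample data frame, its column names and the word_TAG form of the tagged text are assumptions for illustration. Only the dtag_tbl() arguments come from the Usage section. Because the sample texts are far shorter than 400 tokens, TTR is excluded as suggested under exclude. The call is guarded so the snippet still runs when dtagger is not installed.

```r
# Hypothetical input: two short texts, assumed to be tagged already in word_TAG form.
tagged <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c(
    "I_PRP think_VBP it_PRP works_VBZ ._.",
    "The_DT results_NNS were_VBD reported_VBN ._."
  ),
  stringsAsFactors = FALSE
)

# Run the scoring only if the package is available.
if (requireNamespace("dtagger", quietly = TRUE)) {
  scores <- dtagger::dtag_tbl(
    tagged,
    input = "doc_id",    # id column, given by name
    text = "text",       # column with the flattened, tagged text
    exclude = "<TTR>"    # texts are shorter than 400 tokens
  )
  print(scores)
}
```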

References

Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511621024
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-44. https://doi.org/10.1515/ling.1989.27.1.3
Nini, A. (2019). The Multi-Dimensional Analysis Tagger. In Berber Sardinha, T. & Veirano Pinto, M. (eds.), Multi-Dimensional Analysis: Research Methods and Current Issues, 67-94. London; New York: Bloomsbury Academic.