Tag a directory of folders of plain text files with <MDA> tags

Usage

dtag_directory(
  path,
  n = NULL,
  ST = FALSE,
  deflated = TRUE,
  exclude = NULL,
  ...
)

Arguments

path

A character string denoting the folder containing the target folders (at any level).

n

An optional argument denoting the maximum number of text files to be analyzed.

ST

Logical argument denoting whether the text files have _ST tags included already.

deflated

Logical argument. If TRUE (default), in addition to the normal results, the function returns the dimension scores with "deflated" results, which means rare features from Biber's original study (mean freq < 0.1) are removed from the Dimension score calculations.

exclude

A character vector of dimension tags to leave out of the analysis. Each tag must be quoted and wrapped in angle brackets. For example:

(1) If some of the word counts in the texts are below 400, you may want to exclude type token ratios from the analysis with exclude = "<TTR>".

(2) Some of the tags (such as <WZPRES> and <GER>) were manually checked in the original Biber study, but are automatically tagged here. You can exclude some of these tags by naming them in the exclude character vector such as exclude = c("<GER>", "<WZPRES>").

(3) To exclude all of the tags manually checked by Biber, use the argument exclude = "<MANUAL>". This is the same as the argument: exclude = c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>", "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>").

(4) You can combine (1) and (3) with exclude = c("<MANUAL>", "<TTR>").
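As a rough illustration of how the "<MANUAL>" shorthand relates to the full tag list in (3), the base R sketch below expands it by hand. The helper expand_exclude() is hypothetical and is not part of the package; how dtag_directory() resolves the shorthand internally is not documented here.

```r
# Tags that were manually checked in Biber's study, as listed in (3) above.
manual_tags <- c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>",
                 "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")

# Hypothetical helper (an assumption, not a package function):
# replace "<MANUAL>" with the full list of manually checked tags.
expand_exclude <- function(exclude) {
  if ("<MANUAL>" %in% exclude) {
    exclude <- c(setdiff(exclude, "<MANUAL>"), manual_tags)
  }
  unique(exclude)
}

excluded <- expand_exclude(c("<MANUAL>", "<TTR>"))
print(excluded)
```

The result is the type of character vector that could also be passed to exclude directly.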

...

Additional arguments to be passed on.

Value

A list of data frames containing:

corpus_dimension_scores

  • corpus - name of corpus folder

  • corpus_text_type - closest text type for average corpus dimensions

  • most_common_text_type - the mode of the closest text type for the documents within the corpus folder

  • Dimension scores - calculated scores for Dimension1 ~ Dimension6

document_dimension_scores

  • corpus - name of corpus folder

  • doc_id - name of text file

  • Dimension scores - calculated scores for Dimension1 ~ Dimension6

  • closest_text_type - closest matching text type for each doc_id, based on Biber 1989

dimension_tags

  • dimension - Dimension1 ~ Dimension6 from Biber 1988 for each feature

  • feature - the <MDA> tag or AWL or TTR

  • detail - brief description of the feature

  • count - number of times the feature is counted in text

  • value - in the case of an <MDA> tag, the normalised frequency per 100 tokens

  • z-score - value scaled to the biber_mean and biber_sd

  • d-score - same as z-score, but with the sign of negative dimension features reversed

  • biber_mean and biber_sd for each feature, based on Biber 1988
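The relationship between value, z-score and d-score described above can be sketched in base R as follows. The feature names, the numbers and the negative flag column are all invented for illustration; they are not Biber's published values, and the package may store this information differently.

```r
# Toy feature table; the numbers are invented for illustration only
# and are NOT Biber's published means or standard deviations.
features <- data.frame(
  feature    = c("<FEAT_A>", "<FEAT_B>"),
  value      = c(3.0, 1.0),     # normalised frequency per 100 tokens
  biber_mean = c(2.0, 2.0),
  biber_sd   = c(0.5, 1.0),
  negative   = c(FALSE, TRUE)   # assumed flag: feature loads negatively
)

# z-score: value scaled to the reference mean and standard deviation.
features$z_score <- (features$value - features$biber_mean) / features$biber_sd

# d-score: same as z-score, with the sign reversed for negative features.
features$d_score <- ifelse(features$negative,
                           -features$z_score,
                           features$z_score)

print(features)
```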

tokenized_tags

  • corpus - name of corpus folder

  • doc_id - name of text file

  • st - text tokenized on each _ST tag

  • mda - text tokenized on each <MDA> tag

texts

  • corpus - name of corpus folder

  • doc_id - name of text file

  • raw_text - untagged, flattened text for each doc_id

  • tagged_text - flattened text with _ST and <MDA> tags for each doc_id

  • wordcount - number of non-punctuation tokens found in text

Tukey_hsd

  • dimension - the dimension for pairwise comparison

  • contrast - the corpora under pairwise comparison

  • null.value - the expected difference in means after aov (zero)

  • estimate - the difference in means after aov

  • conf.low - the 95% familywise lower confidence level

  • conf.high - the 95% familywise upper confidence level

  • p.value - significance test

Details

The target texts to be tagged should be placed in a directory of folders with $$ prefixed on the folder names. The function will then read in any text files from the target folders, and retrieve the folder names as the "corpus" variable.
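A minimal sketch of the expected layout, built in a temporary directory with base R. The folder and file names are placeholders chosen for this example, and the final call is wrapped in if (FALSE) because it needs the package installed and, unless ST = TRUE, a loaded udpipe model.

```r
# Build a small example directory in a temporary location.
root <- file.path(tempdir(), "mda_example")
corpora <- c("$$corpus_a", "$$corpus_b")   # folder names prefixed with $$

for (corpus in corpora) {
  dir.create(file.path(root, corpus), recursive = TRUE, showWarnings = FALSE)
  writeLines("This is a short placeholder text.",
             file.path(root, corpus, "text_1.txt"))
}

# List the text files that sit inside the $$-prefixed folders.
files <- list.files(root, pattern = "\\.txt$", recursive = TRUE)
print(files)

# The "corpus" variable is retrieved from these folder names.
print(unique(dirname(files)))

if (FALSE) {
  dtag_directory(root)
}
```

Real input texts would of course be much longer than these placeholders.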

If the texts have already been tagged with Stanford _ST tags, choose the option ST = TRUE.

Otherwise, the function add_st_tags() will run over the texts, for which it is necessary to have a udpipe model loaded. See add_st_tags for details.

The function then adds multidimensional analysis <MDA> tags, and calculates Dimension scores based on the Biber 1988 standard. Note that some of the tags from the original study can be excluded from the analysis, with the exclude argument.

If the argument deflated = TRUE, the function also returns Dimension scores calculated without using the low mean frequency features from Biber's original study, following the MAT tagger algorithm (Nini 2019).
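The effect of deflation can be sketched on a toy table. This example assumes that a Dimension score is the sum of the d-scores of its features; the feature names and numbers are invented, and the package's actual calculation may differ in detail.

```r
# Toy values, invented for illustration; not Biber's published figures.
features <- data.frame(
  feature    = c("<FEAT_A>", "<FEAT_B>", "<FEAT_C>"),
  biber_mean = c(2.00, 0.05, 0.50),
  d_score    = c(1.5, 4.0, -0.5)
)

# Normal score: every feature contributes (assumed here to be a simple sum).
normal_score <- sum(features$d_score)

# Deflated score: drop rare features (reference mean frequency < 0.1) first.
common <- features[features$biber_mean >= 0.1, ]
deflated_score <- sum(common$d_score)

print(c(normal = normal_score, deflated = deflated_score))
```

In this toy case the rare feature <FEAT_B> dominates the normal score, which is the situation the deflated results are meant to guard against.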

The function returns a list of tibbles including the tagged texts, individual and corpus-level scores for each dimension of the text and word counts. If the function detects more than one corpus folder (folders prefixed with $$), it will also return the result of post-hoc significance tests. This is a set of confidence intervals on the differences between the means of the dimension scores based on the Studentized range statistic, Tukey's ‘Honest Significant Difference’ method.
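For readers unfamiliar with the post-hoc test, the sketch below runs the same kind of comparison on invented scores using aov() and TukeyHSD() from base R's stats package. It is not the package's own code; the columns diff, lwr, upr and p adj in the base R output appear to correspond to estimate, conf.low, conf.high and p.value in the Tukey_hsd table.

```r
# Toy Dimension1 scores for three corpora; the values are invented.
scores <- data.frame(
  corpus = factor(rep(c("a", "b", "c"), each = 4)),
  Dimension1 = c(1.0, 1.2, 0.8, 1.1,
                 3.0, 3.3, 2.9, 3.1,
                 1.1, 0.9, 1.3, 1.0)
)

# One-way analysis of variance, then Tukey's Honest Significant Difference.
fit <- aov(Dimension1 ~ corpus, data = scores)
hsd <- TukeyHSD(fit, conf.level = 0.95)

# One row per contrast: difference in means, lower and upper bounds
# of the familywise confidence interval, and the adjusted p-value.
print(hsd$corpus)
```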

References

  1. Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511621024

  2. Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-44. https://doi.org/10.1515/ling.1989.27.1.3

  3. Nini, A. (2019). The Multi-Dimensional Analysis Tagger. In Berber Sardinha, T. & Veirano Pinto M. (eds), Multi-Dimensional Analysis: Research Methods and Current Issues, 67-94, London; New York: Bloomsbury Academic.

Examples

if (FALSE) {
  dtag_directory("path_to_directory")
}