This function produces concordance lines from tokenized, tagged text by matching up to two tags.

Usage

conc_by_tag(
  data,
  what = "token",
  tag = "mda",
  match,
  cols = NULL,
  tag2 = NULL,
  match2 = NULL,
  ...
)

Arguments

data

A relational data frame containing the text to concordance. The data frame is expected to have one column of tokens, in tokenized form, at least one column of corresponding tags, and identifying columns such as corpus and doc_id.

what

The name of the column containing the text to concatenate. Default is "token".

tag

The name of the column containing the tags to match. Default is "mda".

match

The tag to match within the tag column. The match can take regex, so you can use anchoring characters (^ and $) for specific searches.
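The effect of anchoring can be seen in a minimal base R sketch that is independent of conc_by_tag itself. The tag values below are sample labels chosen for illustration, and the sketch assumes the pattern is applied as an ordinary regular expression, as the description above suggests.

```r
# Sample tag values, chosen only for this illustration
tags <- c("amod", "nmod", "nmod:poss", "advmod")

# Unanchored: matches every tag that contains "mod"
unanchored <- tags[grepl("mod", tags)]

# Anchored at the start only: matches "nmod" and its subtypes
starts_with <- tags[grepl("^nmod", tags)]

# Anchored at both ends: matches the tag that is exactly "nmod"
exact <- tags[grepl("^nmod$", tags)]

print(unanchored)
print(starts_with)
print(exact)
```

Anchoring both ends is the safer default when tag names share substrings.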

cols

The names of the columns to include in the output. It may be useful to include some extra reference columns (such as doc_id), or other tags for more fine-grained filtering.

tag2

The name of the second column containing the tags to match (optional).

match2

The second tag to match within the tag2 column (optional).

...

Additional arguments passed on to dtagger::quick_conc. For example:

  • separated = TRUE enables sorting search results by adjacent tokens to the left and right.

  • n = 3 limits the search window to 3 tokens either side of the match.

Value

A tibble containing:

  • case - a case number for the match found.

  • left - tokens immediately adjacent (up to n) to the left of the matched node, as defined by the what argument (default is token). If separated = TRUE, the left tokens are separated into left(n):left1.

  • match - the matched search item, as defined by the match argument.

  • right - tokens immediately adjacent (up to n) to the right of the matched node, as defined by the what argument (default is token). If separated = TRUE, the right tokens are separated into right1:right(n).

  • index - the row index of the matched result in the input data frame.

  • other cols - as defined by the tag, tag2 and cols arguments.
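To make the left, match, right and index columns concrete, the following base R sketch builds one concordance row by hand. It is not the package's implementation: concordance_row is a hypothetical helper, the tokens are invented, a plain data frame stands in for the tibble, and joining context tokens with a single space is an assumption.

```r
# Hypothetical helper: build one concordance row around position i,
# taking up to n tokens on either side (clipped at the text boundaries)
concordance_row <- function(tokens, i, n = 2) {
  last <- length(tokens)
  left_idx  <- if (i > 1) max(1, i - n):(i - 1) else integer(0)
  right_idx <- if (i < last) (i + 1):min(last, i + n) else integer(0)
  data.frame(
    case  = 1L,
    left  = paste(tokens[left_idx], collapse = " "),  # assumed separator
    match = tokens[i],
    right = paste(tokens[right_idx], collapse = " "),
    index = i,
    stringsAsFactors = FALSE
  )
}

# Invented example tokens
toks <- c("The", "red", "fox", "is", "quick", "today")

row <- concordance_row(toks, i = 3, n = 2)
print(row)

# At the start of the text there is no left context
first <- concordance_row(toks, i = 1, n = 2)
print(first)
```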

Details

The purpose of this function is to allow fine-grained concordance searches of tagged text. The input should be a data frame with a column for tokens in tokenized form, and separate columns for tags, document and corpus details.
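As a rough picture of the expected input, the sketch below builds a small data frame by hand with one row per token. The column names corpus, doc_id, token, upos and dep_rel follow names used on this page; the values are invented, and real input would normally come from a tagger rather than being typed in.

```r
# Toy input table, one row per token; all values are invented
tokens <- data.frame(
  corpus  = rep("demo", 5),
  doc_id  = rep("doc1", 5),
  token   = c("The", "red", "fox", "is", "quick"),
  upos    = c("DET", "ADJ", "NOUN", "AUX", "ADJ"),
  dep_rel = c("det", "amod", "nsubj", "cop", "root"),
  stringsAsFactors = FALSE
)

# Inspect the structure: one token column, tag columns, identifier columns
str(tokens)
```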

Typically, the function can be used with output from udpipe::udpipe_annotate and dtagger::dtag_tbl, dtagger::dtag_directory or dtagger::add_tag_tbl functions.

The concordancer can take up to two tag inputs, for example matching all upos == "ADJ" tags and dep_rel == "amod" tags, and seeing the resulting keywords in context.
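The two-tag idea can be sketched in base R without calling the package. The example assumes that both conditions must hold on the same row (a logical AND), which is an interpretation of the description above and not a statement about the function's internals. The data are invented.

```r
# Toy token table; all values are invented
tokens <- data.frame(
  doc_id  = rep("doc1", 5),
  token   = c("The", "red", "fox", "is", "quick"),
  upos    = c("DET", "ADJ", "NOUN", "AUX", "ADJ"),
  dep_rel = c("det", "amod", "nsubj", "cop", "root"),
  stringsAsFactors = FALSE
)

# First condition: tag column upos matches exactly "ADJ"
is_adj <- grepl("^ADJ$", tokens$upos)

# Second condition: tag column dep_rel matches exactly "amod"
is_amod <- grepl("^amod$", tokens$dep_rel)

# Assumed combination: keep rows where both conditions hold
hits <- which(is_adj & is_amod)

print(tokens[hits, ])
```

Here "quick" is an adjective but not an adjectival modifier, so only "red" satisfies both conditions.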

Examples