Quick concordancing of pattern matches or index locations

A lightweight concordancing function that returns keywords in context (KWIC) in a tidy format.

Usage

quick_conc(x, index, n = 5, tokenize = FALSE, separated = FALSE)

Arguments

x: a character vector of tokenized strings, or a single string
index: a character vector of regex patterns to match, or a numeric vector of token positions to use as the matches
n: an integer specifying the number of context tokens on either side of the matched node
tokenize: a logical indicating whether to tokenize the text first. If TRUE, a very basic tokenizer splits the string on whitespace and punctuation (but not on word-internal apostrophes, at marks, and hyphens).
separated: a logical indicating whether to return each context token in its own column
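
The tokenizing rule described for tokenize = TRUE can be approximated in base R. The sketch below is only an illustration of that rule, not the package's actual tokenizer: basic_tokenize() is a hypothetical helper, and it assumes punctuation marks are kept as separate tokens, which is what the example output further down suggests.

```r
# Hypothetical helper; a sketch of the documented rule, not the package's code.
basic_tokenize <- function(text) {
  # A token is either
  #   - a run of word characters, optionally joined by word-internal
  #     apostrophes, at marks, or hyphens, or
  #   - a single punctuation character.
  # Whitespace (including newlines) only separates tokens and is dropped.
  pattern <- "[[:alnum:]_]+(?:['@-][[:alnum:]_]+)*|[^[:alnum:]_[:space:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

tokens <- basic_tokenize("The dog barked loudly, alerting the well-known neighbors.")
tokens
```

Under these assumptions, "loudly," yields two tokens ("loudly" and ","), while "well-known" stays a single token.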

Value

A tibble containing:

case - a case number for the match found.
left - tokens immediately adjacent (up to n) to the left of the matched node. If separated = TRUE, the left tokens are separated into left(n):left1.
match - the matched search item, as defined by the index argument.
right - tokens immediately adjacent (up to n) to the right of the matched node. If separated = TRUE, the right tokens are separated into right1:right(n).
token_id - the position of the matched token in the input vector (after tokenization, if tokenize = TRUE).
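
To make the left, match, and right columns concrete, here is a minimal base R sketch of how a fixed-width context window around one token position could be built. context_window() is a hypothetical helper written for illustration. It assumes that positions falling outside the input are filled with NA, which is consistent with the NA values in the example output below, but it is not taken from the package source.

```r
# Hypothetical helper; illustrates the windowing idea only.
context_window <- function(tokens, position, n = 5) {
  # Pad both ends with n NAs so every window has exactly n slots per side.
  padded <- c(rep(NA_character_, n), tokens, rep(NA_character_, n))
  centre <- position + n
  list(
    left  = padded[(centre - n):(centre - 1)],
    match = tokens[position],
    right = padded[(centre + 1):(centre + n)]
  )
}

tokens <- c("The", "cat", "sat", "on", "the", "mat")
w <- context_window(tokens, position = 2, n = 2)
paste(w$left, collapse = " ")   # "NA The"
w$match                         # "cat"
paste(w$right, collapse = " ")  # "sat on"
```

Keeping the window width fixed is what allows the separated = TRUE layout to have the same columns for every case, even when a match sits at the start or end of the text.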

Examples

x <- c("The", "cat", "sat", "on", "the", "mat")
index <- c("cat", "sat")
quick_conc(x, index, n = 2)
#> # A tibble: 2 × 5
#>    case token_id left    match right 
#>   <int>    <int> <chr>   <chr> <chr> 
#> 1     1        2 NA The  cat   sat on
#> 2     2        3 The cat sat   on the
x <- "The dog barked loudly, alerting the neighbors of potential danger.
A nearby park seemed like the perfect spot for the dog and
it quickly made its way there."
quick_conc(x, index = "dog", n = 3, tokenize = TRUE, separated = TRUE)
#> # A tibble: 2 × 9
#>    case token_id left3 left2 left1 match right1 right2 right3 
#>   <int>    <int> <chr> <chr> <chr> <chr> <chr>  <chr>  <chr>  
#> 1     1        2 NA    NA    The   dog   barked loudly ,      
#> 2     2       23 spot  for   the   dog   and    it     quickly
quick_conc(x, index = c(4,8,12), tokenize = TRUE)
#> # A tibble: 3 × 5
#>    case token_id left                              match     right              
#>   <int>    <int> <chr>                             <chr>     <chr>              
#> 1     1        4 NA NA The dog barked              loudly    , alerting the nei…
#> 2     2        8 barked loudly , alerting the      neighbors of potential dange…
#> 3     3       12 the neighbors of potential danger .         A nearby park seem…