strsplit – But Keeping the Delimiter

  • Expert Jakob Gepp
  • Date 20. April 2018
  • Topic Coding, R, Tutorial
  • Format Blog
  • Category Technology

One of the functions I use the most is strsplit. It is quite useful if you want to separate a string by a specific character. Even if you have some complex rules for the split, most of the time you can solve this with a regular expression. However, recently I came across a problem I could not get my head around. I wanted to split the string but also keep the delimiter.

Basic Regular Expressions

Let’s start at the beginning. If you do not know what regular expressions are, I will give you a short introduction. With regular expressions, you can describe patterns in a string and then use them in functions like grep, gsub or strsplit.

As the R (3.4.1) help file for regex states:

A regular expression is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is also fixed = TRUE which can be considered to use a literal regular expression.

If you are looking for a specific pattern in a string – let’s say "3D" – you can just use those characters:

x <- c("3D", "4D", "3a")
grep("3D", x)
[1] 1

If you instead want to match any digit followed by an upper-case letter, you should use a regular expression with character classes:

x <- c("3D", "4D", "3a")
grep("[0-9][A-Z]", x)
[1] 1 2
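
The three modes named in the help text can be compared on a small example. This is only a sketch: the vector y and its contents are made up for illustration and are not part of the original problem.

```r
# y is a made-up example vector (an assumption, not data from this post)
y <- c("abc", "a.c")

# default (extended) regex: "." matches any single character
default_hits <- grepl("a.c", y)

# fixed = TRUE: the pattern is taken literally, so "." only matches a dot
fixed_hits <- grepl("a.c", y, fixed = TRUE)

# perl = TRUE: Perl-like syntax such as the lookahead (?=...) becomes available
perl_hits <- grepl("a(?=\\.)", y, perl = TRUE)

print(default_hits)  # TRUE TRUE
print(fixed_hits)    # FALSE TRUE
print(perl_hits)     # FALSE TRUE
```

The lookahead in the last call only works with perl = TRUE, which is why all strsplit calls with lookarounds in this post set that argument.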

Since regular expressions can get quite complicated really fast, I will stop here and refer you to a cheat sheet for more info. In the cheat sheet, you can also find the part that gave me the trouble: lookarounds.

Lookarounds

Back to my problem. I had a string like c("3D/MON&SUN") and wanted to separate it by / and &.

x <- c("3D/MON&SUN")
strsplit(x, "[/&]", perl = TRUE)
[[1]]
[1] "3D"  "MON" "SUN"

Since I still needed the delimiter, as it contained useful information, I used lookaround regular expressions. A lookaround checks what comes before or after a position without consuming any characters. First up is the lookbehind, which works just fine:

strsplit(x, "(?<=[/&])", perl = TRUE)
[[1]]
[1] "3D/"  "MON&" "SUN"

However, when I used the lookahead, it did not work as I expected:

strsplit(x, "(?=[/&])", perl = TRUE)
[[1]]
[1] "3D"  "/"   "MON" "&"   "SUN"

In my search for a solution, I finally found this post on Stack Overflow, which explained the strange behaviour of strsplit. Well, after reading the post and the help file, it is not strange anymore. It is just what the algorithm said it would do, exactly as it is stated in the help file of strsplit:

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
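
It helps to look at what the lookaround patterns actually match. regexpr reports the position of the first match and its length. The sketch below uses the example string from above; the variable names are only illustrative.

```r
x <- "3D/MON&SUN"

# ordinary character class: matches the "/" itself (one character long)
plain <- regexpr("[/&]", x, perl = TRUE)

# lookahead: matches the position just before "/" without consuming it
ahead <- regexpr("(?=[/&])", x, perl = TRUE)

# lookbehind: matches the position just after "/"
behind <- regexpr("(?<=[/&])", x, perl = TRUE)

# start position and length of each first match
print(c(start = as.integer(plain),  length = attr(plain,  "match.length")))  # 3 1
print(c(start = as.integer(ahead),  length = attr(ahead,  "match.length")))  # 3 0
print(c(start = as.integer(behind), length = attr(behind, "match.length")))  # 4 0
```

Both lookarounds report a match of length 0: they mark a position and leave every character in place.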

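Since lookarounds match a position and therefore have zero length, the "remove the match" step of the algorithm removes nothing. With the lookahead, the delimiter is still sitting at the start of the remaining string after the first split, the pattern matches again right there, and the delimiter ends up as a piece of its own. Luckily, the post also gave a solution that contains some regular expression magic: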

strsplit(x = x, "(?<=.)(?=[/&])", perl = TRUE)
[[1]]
[1] "3D"   "/MON" "&SUN"
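
The loop from the help file can be replayed in a few lines of R to see why the extra (?<=.) helps. This is a rough sketch, not R's actual implementation: simulate_strsplit is a hypothetical helper, and its handling of an empty match at the very start of the remaining string (one character is emitted as a piece) is an assumption inferred from the output shown above.

```r
# Hypothetical helper that replays the documented loop for one string.
simulate_strsplit <- function(x, split) {
  out <- character(0)
  rest <- x
  while (nchar(rest) > 0) {
    # search only in what is left, so a lookbehind cannot see removed text
    m <- regexpr(split, rest, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # no match: the remainder is the last piece
      out <- c(out, rest)
      break
    }
    last <- start + attr(m, "match.length") - 1L  # last position covered
    if (last > 0L) {
      # add the text left of the match, then drop it together with the match
      out <- c(out, substr(rest, 1, start - 1L))
      rest <- substr(rest, last + 1L, nchar(rest))
    } else {
      # ASSUMPTION: an empty match at position 1 emits a single character
      out <- c(out, substr(rest, 1, 1))
      rest <- substr(rest, 2, nchar(rest))
    }
  }
  out
}

x <- "3D/MON&SUN"
print(simulate_strsplit(x, "(?=[/&])"))        # "3D" "/" "MON" "&" "SUN"
print(simulate_strsplit(x, "(?<=.)(?=[/&])"))  # "3D" "/MON" "&SUN"
```

With the plain lookahead, the remainder "/MON&SUN" matches again at its very first position. The additional (?<=.) demands at least one character in front of the split position, so that match at the start is no longer possible and the delimiter stays attached to the next piece.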

So my problem is solved, but I would have to remember this regular expression … uurrghhh!

A New Function: strsplit 2.0

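If I have the chance to write a function which eases my work, I will do it! So I wrote my own strsplit with a new argument type = c("remove", "before", "after"). Basically, I just took the regular expressions from above and put them into an if-condition: "before" uses the combined lookbehind and lookahead, "after" uses the plain lookbehind, and "remove" falls back to base::strsplit. Note that the function below carries the same name as base::strsplit and therefore masks it in your session.
To sum it all up: Regular expressions are a powerful tool and you should try to learn and understand how they work!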

strsplit <- function(x,
                     split,
                     type = c("remove", "before", "after"),
                     perl = FALSE,
                     ...) {
  # validate type; the first value ("remove") is the default
  # and any other input raises an error
  type <- match.arg(type)
  if (type == "remove") {
    # use base::strsplit, the delimiter is dropped
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    # (lookarounds need perl = TRUE, so the perl argument is ignored here)
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE,
                          ...)
  } else {
    # type == "after": split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE,
                          ...)
  }
  return(out)
}
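
A quick check of the three modes on the example string. So that the snippet runs on its own, it uses a condensed stand-in called split_keep (a hypothetical name, not part of any package) that builds the same patterns as the function above.

```r
# split_keep is a hypothetical, condensed stand-in for the function above
split_keep <- function(x, split, type = c("remove", "before", "after")) {
  type <- match.arg(type)
  pattern <- switch(type,
    remove = split,                             # drop the delimiter
    before = paste0("(?<=.)(?=", split, ")"),   # split in front of it
    after  = paste0("(?<=", split, ")")         # split behind it
  )
  base::strsplit(x, pattern, perl = TRUE)
}

x <- "3D/MON&SUN"
res_remove <- split_keep(x, "[/&]")[[1]]
res_before <- split_keep(x, "[/&]", type = "before")[[1]]
res_after  <- split_keep(x, "[/&]", type = "after")[[1]]

print(res_remove)  # "3D" "MON" "SUN"
print(res_before)  # "3D" "/MON" "&SUN"
print(res_after)   # "3D/" "MON&" "SUN"
```

Giving the helper its own name avoids masking base::strsplit, which may be the safer choice in shared code.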
