One of the functions I use the most is strsplit. It is quite useful if you want to separate a string by a specific character. Even if you have some complex rules for the split, most of the time you can solve this with a regular expression. However, recently I came across a problem I could not get my head around. I wanted to split the string but also keep the delimiter.
Basic Regular Expressions
Let’s start at the beginning. If you do not know what regular expressions are, I will give you a short introduction. With regular expressions, you can describe patterns in a string and then use them in functions like grep, gsub or strsplit.
As the R (3.4.1) help file for regex states:
A regular expression is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is also fixed = TRUE which can be considered to use a literal regular expression.
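To make the three modes from the help file concrete, here is a small sketch that splits the same string with the default extended regular expressions, with fixed = TRUE and with perl = TRUE. The example string is made up for illustration.

```r
x <- "a.b.c"

# Default (extended regex): "." matches any character, so every character
# counts as a delimiter and only empty strings are left over.
strsplit(x, ".")[[1]]
# [1] "" "" "" "" ""

# Escaping the dot makes the regex match a literal ".".
strsplit(x, "\\.")[[1]]
# [1] "a" "b" "c"

# fixed = TRUE treats the pattern as a plain string, no escaping needed.
strsplit(x, ".", fixed = TRUE)[[1]]
# [1] "a" "b" "c"

# perl = TRUE switches to Perl-like regex, which is what allows lookarounds
# such as the lookbehind (?<=...) used here.
strsplit(x, "(?<=\\.)", perl = TRUE)[[1]]
# [1] "a." "b." "c"
```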
If you are looking for a specific pattern in a string – let’s say "3D" – you can just use those characters:
x <- c("3D", "4D", "3a")
grep("3D", x)
[1] 1
If you instead want all elements that contain a digit followed by an upper case letter, you can use character classes:
x <- c("3D", "4D", "3a")
grep("[0-9][A-Z]", x)
[1] 1 2
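The same pattern works in the other functions mentioned above. As a small sketch on the same vector: grepl returns a logical vector instead of indices, grep with value = TRUE returns the matching elements themselves, and gsub replaces whatever the pattern matches. The replacement character "#" is just an arbitrary choice for illustration.

```r
x <- c("3D", "4D", "3a")
pattern <- "[0-9][A-Z]"

# Logical vector: does each element contain a digit followed by an upper case letter?
grepl(pattern, x)
# [1]  TRUE  TRUE FALSE

# Return the matching elements instead of their positions.
grep(pattern, x, value = TRUE)
# [1] "3D" "4D"

# Replace every digit with "#".
gsub("[0-9]", "#", x)
# [1] "#D" "#D" "#a"
```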
Since regular expressions can get quite complicated really fast, I will stop here and refer you to a cheat sheet for more info. In the cheat sheet, you can also find the part that gave me the trouble: lookarounds.
Lookarounds
Back to my problem. I had a string like c("3D/MON&SUN") and wanted to separate it by / and &.
x <- c("3D/MON&SUN")
strsplit(x, "[/&]", perl = TRUE)
[[1]]
[1] "3D" "MON" "SUN"
Since I still needed the delimiter as it contained useful information, I used the lookaround regular expressions. First up is the lookbehind which works just fine:
strsplit(x, "(?<=[/&])", perl = TRUE)
[[1]]
[1] "3D/" "MON&" "SUN"
However, when I used the lookahead, it did not work as I expected:
strsplit(x, "(?=[/&])", perl = TRUE)
[[1]]
[1] "3D" "/" "MON" "&" "SUN"
In my search for a solution, I finally found this post on Stack Overflow, which explained the strange behaviour of strsplit. Well, after reading the post and the help file – it is not strange anymore. It is just what the algorithm said it would do – the very same way it is stated in the help file of strsplit:
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
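To see how this loop treats lookarounds, the pseudocode can be traced with a small R sketch. This is not the actual implementation of strsplit (which is written in C). The function name split_sketch is made up, and the rule for a zero-length match at the very start of the remaining string (split off one character) is an assumption inferred from the outputs shown above, not something quoted from the help file.

```r
# Hypothetical re-implementation of the help file's loop, for tracing only.
split_sketch <- function(s, pattern) {
  out <- character(0)
  while (nchar(s) > 0) {
    m <- regexpr(pattern, s, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1) {
      # No match: the rest of the string is the last piece.
      out <- c(out, s)
      break
    }
    if (start == 1 && len == 0) {
      # Assumption: an empty match at the very start splits off one character.
      out <- c(out, substr(s, 1, 1))
      s <- substr(s, 2, nchar(s))
    } else {
      # Add everything left of the match, then drop the match and all to its left.
      out <- c(out, substr(s, 1, start - 1))
      s <- substr(s, start + len, nchar(s))
    }
  }
  out
}

x <- "3D/MON&SUN"
split_sketch(x, "[/&]")        # delimiter is consumed
split_sketch(x, "(?<=[/&])")   # lookbehind: split after the delimiter
split_sketch(x, "(?=[/&])")    # lookahead: delimiter ends up on its own
```

With the plain lookahead, the remaining string starts with the delimiter after the first cut, so the next match is empty and sits at position 1. That is the step where the single character "/" is split off.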
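Since the lookarounds have zero length, they mess up the removing part within the algorithm: the match consumes nothing, so after the piece to the left has been removed, the remaining string starts with the delimiter, the lookahead matches again right at the start, and strsplit splits off a single character. Luckily, the post also gave a solution that contains some regular expression magic: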
strsplit(x = x, "(?<=.)(?=[/&])", perl = TRUE)
[[1]]
[1] "3D" "/MON" "&SUN"
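The extra lookbehind (?<=.) demands at least one character before the split point, so the pattern can no longer match at the very start of the remaining string. So my problem is solved, but I would have to remember this regular expression … uurrghhh!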
A New Function: strsplit 2.0
If I have the chance to write a function which eases my work – I will do it! So I wrote my own strsplit with a new argument type = c("remove", "before", "after"). Basically, I just used the regular expression mentioned above and put it into an if-condition.
To sum it all up: Regular expressions are a powerful tool and you should try to learn and understand how they work!
strsplit <- function(x,
                     split,
                     type = "remove",
                     perl = FALSE,
                     ...) {
  if (type == "remove") {
    # use base::strsplit
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE,
                          ...)
  } else if (type == "after") {
    # split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE,
                          ...)
  } else {
    # wrong type input
    stop("type must be remove, after or before!")
  }
  return(out)
}
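For a quick check of all three types, here is a compact variant of the same idea together with a usage example. It is only a sketch: the name split_keep is hypothetical (chosen so that base::strsplit is not masked), and match.arg plus switch stand in for the if-condition above.

```r
# Hypothetical helper; same three regular expressions as in the function above.
split_keep <- function(x, split, type = c("remove", "before", "after"),
                       perl = FALSE, ...) {
  type <- match.arg(type)  # errors on anything other than the three choices
  switch(type,
    remove = base::strsplit(x, split, perl = perl, ...),
    before = base::strsplit(x, paste0("(?<=.)(?=", split, ")"), perl = TRUE, ...),
    after  = base::strsplit(x, paste0("(?<=", split, ")"), perl = TRUE, ...)
  )
}

x <- "3D/MON&SUN"
split_keep(x, "[/&]")[[1]]
# [1] "3D"  "MON" "SUN"
split_keep(x, "[/&]", type = "before")[[1]]
# [1] "3D"   "/MON" "&SUN"
split_keep(x, "[/&]", type = "after")[[1]]
# [1] "3D/"  "MON&" "SUN"
```

One caveat for the "after" type: the delimiter ends up inside a lookbehind, and PCRE only accepts lookbehinds of fixed length, so a split pattern like "[/&]+" would fail there.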