regex - R: Fast string split on first delimiter occurrence -


I have a file with ~40 million lines that I need to split on the first comma delimiter. The stringr function str_split_fixed, used as follows, does the job, but it is very slow.

    library(data.table)
    library(stringr)

    # Sample data: id is recycled to match the 40,000 letters
    df1 <- data.frame(id = 1:1000,
                      letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
    df1$combCol1 <- paste(df1$id, ',', df1$letter1, sep = '')
    df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')

    # Split each string into two pieces at the first comma
    st1 <- str_split_fixed(df1$combCol2, ',', 2)

Is there a faster way to do this?

Update

The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:

    stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)

Original answer (with updated benchmarks)

If you are comfortable with the "stringr" syntax and don't want to stray too far from it, but you also want a speed boost, try the "stringi" package instead:

    library(stringr)
    library(stringi)

    system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
    #    user  system elapsed
    #    3.25    0.00    3.25

    system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
    #    user  system elapsed
    #    0.04    0.00    0.05

    system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
    #    user  system elapsed
    #    0.01    0.00    0.01

Most "stringr" functions have "stringi" counterparts, but as this example shows, stri_split_fixed returns a list by default, so one extra step (do.call(rbind, ...)) is needed to bind the pieces into a matrix, unless simplify = TRUE is used.
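To make the list-versus-matrix point concrete, here is a minimal sketch on a toy vector. The vector x is made up for illustration and is not part of the original data; the sketch assumes a "stringi" version recent enough to support the simplify argument.

```r
library(stringi)

# Toy input (hypothetical); each string contains several commas
x <- c("1,a,1,a", "2,b,2,b")

# Default: a list with one character vector per input string.
# n = 2 caps the number of pieces, so everything after the
# first comma stays together in the second piece.
as_list <- stri_split_fixed(x, ",", 2)
str(as_list)

# Binding the list elements row-wise gives a character matrix
bound <- do.call(rbind, as_list)

# simplify = TRUE returns the matrix directly, skipping the rbind step
direct <- stri_split_fixed(x, ",", 2, simplify = TRUE)

print(bound)
print(direct)
```

Both bound and direct should be 2 x 2 character matrices holding the text before and after the first comma of each string.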


Here's how it compares with @RichardScriven's suggestion in the comments:

    # stringi, list output bound into a matrix
    fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
    # stringi, matrix returned directly
    fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
    # base R: regexpr() matches only the first comma
    fun2 <- function() {
      do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),
                                invert = TRUE))
    }

    library(microbenchmark)
    microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
    # Unit: milliseconds
    #     expr       min        lq      mean    median        uq       max neval
    #  fun1a()  42.72647  46.35848  59.56948  51.94796  62.92920  98.46330    10
    #  fun1b()  17.55183  18.59337  20.09049  18.84907  22.09419  26.85343    10
    #   fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912    10
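Before comparing timings, it can be worth checking that the stringi and base R approaches actually produce the same split. The following is a small sanity check on a made-up vector x (an assumption for illustration, not the benchmark data), where every string contains at least one comma. Strings without a comma are not covered here and may be handled differently by the two approaches.

```r
library(stringi)

# Toy input (hypothetical); every string has at least one comma
x <- c("1,a,1,a", "2,b,2,b")

# stringi: at most 2 pieces per string, returned as a matrix
res_stringi <- stri_split_fixed(x, ",", 2, simplify = TRUE)

# base R: regexpr() locates only the first comma; invert = TRUE
# makes regmatches() return the text before and after that match
res_base <- do.call(rbind, regmatches(x, regexpr(",", x), invert = TRUE))

# Element-wise comparison of the two character matrices
print(all(res_stringi == res_base))
```

If the check passes on a representative sample of the real data, the benchmark is comparing like with like.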
