Best practices for splitting and joining strings in shell scripts

Introduction

How do you split a comma-separated string, such as a line of CSV, into its individual fields? That is the topic here. It is a question I see often, and I want to summarize the answers because much of the code written for it cannot be called good.

CSV is used as the example, but because the subject is splitting and joining strings, this article does not handle real CSV, in which a field enclosed in double quotes may contain commas and line breaks. That specification is complex, so for such files it is better to use a well-known library in another language. The CSV covered in this article simply separates each field with a comma.

Premise: split one line of data into the positional parameters

I recommend assigning the result of splitting a one-line string to the positional parameters ($1, $2, $3, ...). Positional parameters are POSIX compliant (that is, they do not depend on bash) and are the only thing in POSIX shell that can be used like an array. As a rule, this article splits one line into the positional parameters, but in bash and similar shells you can use ordinary arrays, and in some cases it is enough to keep the fields in plain variables without using the positional parameters at all. Adapt the approach to your situation.

To assign to the positional parameters, use the set command. For example, running set -- a b c gives $1: a, $2: b, $3: c. The -- prevents the first argument from being interpreted as an option to set when it looks like one, such as -a. It is unnecessary in this example but is written explicitly.

Assigning to the positional parameters discards their original values, but because positional parameters are local to each function, doing the work inside a function leaves the caller's values intact. Within POSIX there is only one array-like variable (the positional parameters), but this technique effectively lets you use several.

# In other words:
set -- a b c
foo() {
  set -- 1 2 3 # changes the positional parameters, but only inside foo
}
foo
echo "$@" # => still a b c here

Alternatively, instead of assigning the positional parameters with set, you can call a user-defined shell function and receive the fields as its arguments (its own positional parameters).
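
As a minimal sketch of that alternative, the function below receives the fields as its own $1, $2, ... while the caller's positional parameters stay untouched. The name show_fields and the variables it sets are hypothetical, chosen only for illustration.

```shell
# show_fields is a hypothetical helper, not part of the original article.
# It sees the fields it is called with as its own $1, $2, ...
show_fields() {
  nfields=$#
  first=$1
  last=
  for last in "$@"; do :; done   # after the loop, last holds the final argument
}

set -- keep these        # the caller's positional parameters
show_fields a b c
caller_params="$*"       # the caller's parameters are unchanged
```

Because the function body never calls set, nothing in the caller needs to be saved or restored.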

Using IFS

The easiest way to split is to use IFS. Set IFS to the delimiter and run the set command with the unquoted variable; each field becomes a separate positional parameter.

# In zsh you need to run setopt shwordsplit first
set -f # without this, * and similar characters undergo pathname expansion
while IFS= read -r line; do
  IFS=,
  set -- $line
done
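
The loop above leaves both IFS and the set -f setting changed for the rest of the script. The following is a sketch of splitting a single value and then putting both back. It assumes IFS is currently set (as in a freshly started shell) and that pathname expansion was enabled beforehand; the variable names are illustrative.

```shell
line='a,*,c'

old_ifs=$IFS   # assumption: IFS is set; an unset IFS would not be restored correctly
set -f         # disable pathname expansion so * stays a literal field
IFS=,
set -- $line   # intentionally unquoted so that field splitting happens
IFS=$old_ifs
set +f         # assumption: pathname expansion was enabled before

nfields=$#
second=$2
```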

If the maximum number of fields is fixed, you can also have read assign each field to its own variable. This is probably the fastest way to read from a file or standard input.

while IFS=, read -r f1 f2 f3; do
  echo "f1: $f1, f2: $f2, f3: $f3"
done
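
One detail worth knowing when using this form: if a line has more fields than there are variables, read puts the rest of the line, delimiters included, into the last variable. A small sketch with made-up input:

```shell
# The here-document stands in for a file or standard input.
while IFS=, read -r f1 f2 rest; do
  summary="f1=$f1 f2=$f2 rest=$rest"
done <<'EOF'
a,b,c,d
EOF
```

Here rest receives c,d rather than just c, so an extra catch-all variable is a simple way to notice lines with unexpected extra fields.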

Conversely, to join, use $*. The first character of IFS is used as the delimiter.

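set -- a b c d e # assigns to $1 through $5
IFS=,
line=$*

# As an aside, only the first character of IFS is used for joining, so if you
# want to change IFS temporarily and restore it afterwards, doing it this way
# means you do not need a variable to save the original value
IFS=",$IFS"
line=$*
IFS=${IFS#?} # remove the first character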
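
The prepend-and-strip trick can be wrapped in a small function. This is only a sketch: join_char and the result variable joined are hypothetical names, the separator must be exactly one character, and IFS is assumed to be set when the function is called.

```shell
# join_char is a hypothetical helper: join_char SEP FIELD...
# The result is left in the variable "joined".
join_char() {
  sep=$1
  shift
  IFS="$sep$IFS"   # put the separator first; "$*" uses only the first character
  joined="$*"
  IFS=${IFS#?}     # drop that first character again to restore IFS
}

join_char , a b c d e
```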

Problems

This method has several problems. With IFS, the delimiter is limited to a single character. That is usually enough, but the method cannot be used when the delimiter is longer than one character.

Tab-delimited data (TSV) needs further care. With commas, empty fields are preserved, but with tabs an empty field is treated as if it did not exist. This is because IFS splitting treats a run of consecutive spaces or tabs as a single delimiter. Spaces and tabs at the beginning and end of the line are also ignored. (Ref. 2.6.5 Field Splitting)

line="a,,c"
IFS=,
set -- $line # $1:a, $2:, $3:c

TAB=$(printf '\t')
line="a${TAB}${TAB}c"
IFS=$TAB
set -- $line # $1:a, $2:c

When joining, unlike splitting, a space or tab delimiter behaves the same as a comma.

Splitting and joining with IFS is easy, but take care, because there are cases where it cannot be used.

Splitting with sed + eval

The next method uses sed and eval. After splitting, the fields are again assigned to the positional parameters. With this method the delimiter can be two or more characters, and TSV can be handled as well.
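
The asymmetry can be seen in a short round trip. The sketch below joins three fields, one of them empty, with a tab and then splits the result again; it assumes IFS is set when it starts.

```shell
TAB=$(printf '\t')
old_ifs=$IFS      # assumption: IFS is currently set

set -- a "" c     # three fields, the middle one empty
IFS=$TAB
line="$*"         # joining keeps the empty field: a<TAB><TAB>c
set -f
set -- $line      # splitting again collapses the two tabs into one delimiter
set +f
IFS=$old_ifs

nfields=$#        # the empty field is gone after the round trip
```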

# Comma-separated
sed "s/'/'\\\\''/g; s/^/'/g; s/$/'/g; s/,/' '/g" | while IFS= read -r line; do
  eval "set -- $line"
done

# Adding the single quotes around the line on the shell side makes it shorter
sed "s/'/'\\\\''/g; s/,/' '/g" | while IFS= read -r line; do
  eval "set -- '$line'"
done

# Tab-separated
TAB=$(printf '\t')
sed "s/'/'\\\\''/g; s/$TAB/' '/g" | while IFS= read -r line; do
  eval "set -- '$line'"
done

This sed command converts the string a,b,c into 'a' 'b' 'c' by enclosing each field in single quotes. Prefixing that string with set -- and passing it to eval assigns the fields to the positional parameters. Single quotes inside a field are also escaped. For example, that's right becomes 'that'\''s right'.

Note that the quoting step (running sed) is not performed once for every CSV line that is read. sed runs once over the whole input, and the loop then reads its output. Approaches that call an external command for each line inside the while loop to do the splitting are not recommended for speed reasons: calling an external command inside a loop can make a script several hundred times slower. To keep shell scripts from being slow, never call external commands inside a loop that runs many times.
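
To see the escaping at work, here is a sketch that runs one line containing a single quote through the same sed expression and then evaluates the result. The sample input and variable names are made up for illustration.

```shell
input="that's right,b,c"

# Escape embedded single quotes, then turn each comma into a quote boundary.
quoted=$(printf '%s\n' "$input" | sed "s/'/'\\\\''/g; s/,/' '/g")

# The outer quotes are added here, as in the shorter variant of the loop.
eval "set -- '$quoted'"

first=$1
nfields=$#
```

After the eval, $1 holds the original field with its apostrophe intact and the comma inside the quotes no longer matters to the shell.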

Notes

The caveat of this method is that, because a pipe is used, the while ... done part runs in a subshell in most shells, so variables set inside it cannot be referenced afterwards. This is a nuisance when, for example, you count the processed lines while handling one line at a time. There are several ways to deal with it; here are some examples.

#!/bin/bash
# For bash and similar shells
count=0
while IFS= read -r line; do
  eval "set -- '$line'"
  count=$((count + 1))
done < <(sed "s/'/'\\\\''/g; s/,/' '/g")
echo "$count"

#!/bin/sh
# POSIX compliant
sed "s/'/'\\\\''/g; s/,/' '/g" | {
  count=0
  while IFS= read -r line; do
    eval "set -- '$line'"
    count=$((count + 1))
  done
  echo "$count"
}
# Note: count cannot be referenced here

# Alternatively
read_csv() {
  count=0
  while IFS= read -r line; do
    eval "set -- '$line'"
    count=$((count + 1))
  done
  echo "$count"
}
sed "s/'/'\\\\''/g; s/,/' '/g" | read_csv
# Note: count cannot be referenced here
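
One more POSIX-compatible option, not taken from the examples above, is to capture the sed output in a variable first and feed it to the loop through a here-document. The loop then runs in the current shell, so count remains visible afterwards. This is a sketch with made-up input; it holds the entire output in memory, and command substitution drops trailing empty lines.

```shell
# printf stands in for the real input file.
quoted=$(printf 'a,b\nc,d\n' | sed "s/'/'\\\\''/g; s/,/' '/g")

count=0
while IFS= read -r line; do
  eval "set -- '$line'"
  count=$((count + 1))
done <<EOF
$quoted
EOF
echo "$count"   # count is still available here
```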

Splitting with shell script alone

This is the method I personally most want to introduce. The methods described so far suit reading multiple lines of data from a file or standard input. For a file of some size, sed + eval is good enough, but when you only want to split a value held in a variable, the cost of calling sed becomes relatively large. For that case, here is a way to split the string in the variable using shell script alone.

var="a,b,c"

set --
work="$var,"
while [ "$work" ]; do
  field=${work%%,*} # remove from the end back to the farthest comma = get the first field
  work=${work#*,} # remove from the start up to the nearest comma = drop the first field
  set -- "$@" "$field"
done

As in the previous examples, the fields are assigned to the positional parameters. Personally I find that this version takes too many lines, so I often write it more compactly.

var="a,b,c"

set --
work="$var,"
while [ "$work" ] && field=${work%%,*} && work=${work#*,}; do
  set -- "$@" "$field" 
done

Parameter expansion (${work%%,*} and ${work#*,}) may be hard to read until you are used to it, but because the shell performs it without running an (external) command, it is fast. It is essential knowledge for keeping shell scripts from being slow.

You can also use parameter expansion to parse complex strings (such as a date in ISO 8601 notation, 2020-08-23T12:34:56+09:00). It is appealing to edit a complex string in one shot with sed and regular expressions, but steadily splitting the string with parameter expansion is sometimes easier to understand and faster. Just because it is a shell script does not mean that passing data through standard input/output and processing it with (external) commands is always the right approach.

To hold the delimiter in a variable, do this.
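
As an illustration of that step-by-step style, the sketch below takes the ISO 8601 example apart with parameter expansion only. It assumes exactly this layout, including a numeric UTC offset (a trailing Z is not handled), and the variable names are illustrative.

```shell
ts="2020-08-23T12:34:56+09:00"

ymd=${ts%%T*}          # 2020-08-23      (everything before the T)
rest=${ts#*T}          # 12:34:56+09:00  (everything after the T)
hms=${rest%%[+-]*}     # 12:34:56        (cut at the sign of the offset)
offset=${rest#"$hms"}  # +09:00          (what remains after the time)

year=${ymd%%-*}        # 2020
md=${ymd#*-}           # 08-23
month=${md%%-*}        # 08
day=${md#*-}           # 23
```

Each line removes one piece, so a malformed input tends to show up as an obviously wrong variable rather than a silent regex mismatch.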

TAB=$(printf '\t')
var="a${TAB}b${TAB}c"
sep="$TAB"

set --
work="$var$sep"
while [ "$work" ] && field=${work%%"$sep"*} && work=${work#*"$sep"}; do
  set -- "$@" "$field" 
done

The double quotes around $sep are required. Without them, the characters in the delimiter are treated as pattern characters, so some delimiters (for example, one containing *) will not work correctly.

The advantages over using IFS are that this method can handle delimiters of more than one character and treats spaces and tabs the same way as commas. No set -f is needed to suppress pathname expansion, and no setopt shwordsplit is needed in zsh. Even when processing multiple lines, variables are easy to access because no subshell is required.

The disadvantage is that when there are many delimiters and the string is long, the repeated copying of large strings makes it noticeably slow once the string exceeds somewhere between several KB and several tens of KB. However, because no external command is called, short strings are processed quickly enough.

Another solution is to convert the value into a string that can be passed to eval, as in the sed method. This is faster and simpler, because nothing is split in a loop and the string is copied only a few times, but string replacement by parameter expansion is not POSIX compliant, so it is limited to bash and similar shells.
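
The following sketch shows the quoted form coping with a delimiter that is both longer than one character and made of pattern characters. The sample value and the delimiter ** are chosen only for illustration.

```shell
var='a**b**c'
sep='**'          # two characters, both of them special in shell patterns

set --
work="$var$sep"
# Quoting "$sep" makes the shell match it literally instead of as a wildcard.
while [ "$work" ] && field=${work%%"$sep"*} && work=${work#*"$sep"}; do
  set -- "$@" "$field"
done

nfields=$#
second=$2
```

With the quotes removed, ** would be read as a pattern that matches any string, so the loop could no longer find the real field boundaries.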

var="a,b,c"
work=${var//"'"/"'\\''"}
eval "set -- '${work//,/"' '"}'"

Joining with shell script alone

If the delimiter is a single character, using IFS is easy and fast. Otherwise you must join the fields in a loop. Running an (external) command for each line to join them is slow and should be avoided. There are many variations in how to write it, but here is one example.

unset var
if [ $# -gt 0 ]; then
  var=$1
  shift
  for i in "$@"; do
    var="${var},${i}"
  done
fi

# Alternative
unset var
for i in "$@"; do
  var="${var:-}${var+,}${i}"
done
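
If the join is needed in several places, the loop can be wrapped in a function that takes the delimiter as its first argument. This is a sketch rather than the article's own code: join_by and the result variable joined are hypothetical names.

```shell
# join_by is a hypothetical helper: join_by SEP FIELD...
# The result is left in the variable "joined".
join_by() {
  sep=$1
  shift
  joined=
  for item in "$@"; do
    joined="${joined}${sep}${item}"   # every field gets a leading separator
  done
  joined=${joined#"$sep"}             # remove the one before the first field
}

join_by ' :: ' a b c
```

Adding the separator unconditionally and stripping it once at the end keeps the loop body free of special cases, and an empty first field still produces a leading separator in the result.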

If you want to write to standard output instead of storing the result in a variable, you can also join with printf. However, this is slower in mksh and posh, where printf is not a shell builtin.

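if [ $# -gt 0 ]; then
  printf '%s' "$1"
  shift
  # Print the remaining fields only if there are any;
  # printf with no arguments would still emit one comma
  if [ $# -gt 0 ]; then
    printf ',%s' "$@"
  fi
  echo # final newline
fi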

Author: Koichi Nakashima

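Probably a web application engineer. 🐚 I introduce lesser-known techniques related to shell scripting and update and correct existing information. 🚩 Option parser: https://github.com/ko1nksm/getoptions 🚩 Unit testing framework: https://shellspec.info 🌈 Likes: simple, lean code and refactoring. Dislikes: technical debt and legacy code.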

Updated on August 23, 2020