How to batch convert file encoding in Bash.
Context
I have been using MCVS as my primary IDE for Unity development for a long time, while I mostly use VSCode for other work. It occurred to me today that I should probably give VSCode a try with Unity to see how well they work together.
I switched, but a problem ensued. It turns out that all my .cs files are encoded in GBK (I do have some non-Latin characters in this particular project), and VSCode attempts to open them as UTF-8. Of course, I could easily switch the encoding in VSCode by setting "files.encoding": "gbk", but I do not want to use GBK for ALL projects (this project has non-Latin characters because I am collaborating with a non-English-speaking developer, which is not the case for my other projects). The other option is to re-save each file with UTF-8 encoding one by one, but that is just tedious.
I have to find a way to batch convert my file encoding, and Bash seems to be the way to go.
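Before converting anything, it can help to confirm which files are actually not valid UTF-8. The following is a minimal sketch, assuming Bash and an installed iconv; the function name list_non_utf8 is made up for illustration. It relies on iconv exiting with a non-zero status when asked to read invalid UTF-8.

```bash
#!/usr/bin/env bash
# list_non_utf8 is a hypothetical helper: print every .cs file under a
# directory (default: current directory) that is NOT valid UTF-8.
list_non_utf8() {
    local dir="${1:-.}"
    find "$dir" -type f -iname "*.cs" -print0 |
        while IFS= read -r -d '' file; do
            # Converting UTF-8 to UTF-8 fails on invalid byte sequences.
            if ! iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
                printf '%s\n' "$file"
            fi
        done
}
```

Calling list_non_utf8 Assets would list the candidates under a folder named Assets. Pure ASCII files are valid UTF-8 already, so they are not listed, and a listed file is only known to be "not UTF-8", not proven to be GBK.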
Conversion
After some research I discovered the iconv command. My strategy, then, was to combine iconv with find to convert all my files. Initially, I tried:
>> for file in `find -type f -name "*.cs"`; do
>> echo "$file"
>> mv "$file" "$file.gbkold" && iconv -f GBK -t UTF-8 < "$file.gbkold" > "$file"
>> done
It worked for the most part, but it breaks on files whose paths contain spaces (I noticed this because I echo the file names): the output of the backtick substitution is split on whitespace, so such a path reaches the loop as several separate words. Luckily I had committed before performing this destructive operation, so I easily reverted all changes with git checkout -- . and git clean -f (the latter removes all the created .gbkold files). Then I tried:
>> find . -iname "*.cs" | while IFS= read -r file
>> do
>> echo "$file"
>> mv "$file" "$file.gbkold" && iconv -f GBK -t UTF-8 < "$file.gbkold" > "$file"
>> done
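The difference between the two loops comes down to word splitting, and it can be reproduced safely in a throwaway directory. This is only a sketch: the directory and file names are invented, and it counts loop iterations instead of converting anything.

```bash
#!/usr/bin/env bash
# Reproduce the spaces-in-path problem in a scratch directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/My Scripts"
touch "$demo_dir/My Scripts/Player.cs"

# Unquoted command substitution: the shell splits find's output on
# whitespace, so one path becomes two loop items.
for_count=0
for file in $(find "$demo_dir" -type f -name "*.cs"); do
    for_count=$((for_count + 1))
done

# Line-based reading: each line of find's output stays one item.
# Process substitution keeps the loop in the current shell, so the
# counter is still visible after the loop ends.
read_count=0
while IFS= read -r file; do
    read_count=$((read_count + 1))
done < <(find "$demo_dir" -type f -name "*.cs")

echo "for loop saw $for_count items, while-read loop saw $read_count"
rm -rf "$demo_dir"
```

Assuming the temporary directory's own path contains no whitespace, the for loop sees two items for the single file and the while-read loop sees one.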
The while read loop reads file paths line by line, so spaces in a path are not a problem. The command works, and most files were converted successfully. For one or two files, iconv reported that the conversion failed, so I had to save them as UTF-8 manually (simply by copying everything from the .cs.gbkold file into the actual .cs file and saving with UTF-8 in VSCode. Now you see why I keep an older copy of my files).
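A slightly more defensive variant of the loop above is sketched below. It is not what I ran; it assumes Bash plus a find that supports -print0 (GNU and BSD find both do), and the name convert_tree is hypothetical. Unlike the original loop, it puts the original file back when iconv fails, so no half-written .cs file is left behind.

```bash
#!/usr/bin/env bash
# convert_tree is a hypothetical wrapper around the loop from the post.
# It converts every .cs file under a directory from GBK to UTF-8, keeps
# a .gbkold backup, and restores the original if iconv reports an error.
convert_tree() {
    local dir="${1:-.}" file failed=0
    # NUL-delimited paths survive spaces and even newlines in names.
    while IFS= read -r -d '' file; do
        mv -- "$file" "$file.gbkold" || continue
        if iconv -f GBK -t UTF-8 < "$file.gbkold" > "$file"; then
            echo "converted: $file"
        else
            # Put the untouched original back so it can be fixed by hand.
            mv -- "$file.gbkold" "$file"
            echo "FAILED:    $file" >&2
            failed=$((failed + 1))
        fi
    done < <(find "$dir" -type f -iname "*.cs" -print0)
    # Non-zero exit status if any file could not be converted.
    [ "$failed" -eq 0 ]
}
```

It would be called as convert_tree . from the folder holding the scripts. Like the original loop, it should only be run once per file: UTF-8 bytes can decode as valid (but wrong) GBK, so a second pass may garble text instead of failing.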
I only converted the files I wrote myself, without touching the .cs files in the Plugin folder, which contains some third-party extensions like DoTween.
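One way to leave third-party code alone is to have find skip that folder entirely. The sketch below assumes the folder is literally named Plugins (adjust the pattern to the real name in the project), and the function name list_own_scripts is invented. It only lists the files that would be converted.

```bash
#!/usr/bin/env bash
# List .cs files under a directory, skipping anything inside a folder
# named "Plugins" (an assumed name; change the pattern as needed).
# -prune stops find from descending into matching directories.
list_own_scripts() {
    local dir="${1:-.}"
    find "$dir" -type d -name "Plugins" -prune -o -type f -iname "*.cs" -print
}
```

Checking this list first, before feeding the same find expression into the conversion loop, makes it easy to confirm that nothing from the third-party folder slipped in.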
Clean-up
In principle this should have no effect on the original Unity project. My game still plays as per usual.
Now, after making sure everything works, I move on to delete the temporary files with:
>> find . -type f -name "*.gbkold" -delete
>> find . -type f -name "*.gbkold.meta" -delete # I opened Unity after generating the .gbkold files, so Unity has created .meta files for them
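Note that rm *.gbkold -rf will not work! The shell expands *.gbkold against the current directory only, and -r merely makes rm descend into directories it is given; it does not search subdirectories for matching file names. We want to remove files throughout the tree, not directories.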
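The difference is easy to see in a scratch directory. This is just an illustrative sketch; every file and folder name in it is made up.

```bash
#!/usr/bin/env bash
# Show that a shell glob only matches in the current directory,
# while find walks the whole tree.
scratch=$(mktemp -d)
mkdir -p "$scratch/Scripts/UI"
touch "$scratch/top.cs.gbkold" "$scratch/Scripts/UI/Menu.cs.gbkold"

# The glob is expanded inside $scratch only, so just top.cs.gbkold goes.
( cd "$scratch" && rm -rf -- *.gbkold )
left_after_rm=$(find "$scratch" -type f -name "*.gbkold" | wc -l | tr -d ' ')

# find matches the name at every depth and removes the nested file too.
find "$scratch" -type f -name "*.gbkold" -delete
left_after_find=$(find "$scratch" -type f -name "*.gbkold" | wc -l | tr -d ' ')

echo "after rm: $left_after_rm left, after find -delete: $left_after_find left"
rm -rf "$scratch"
```

Running the find command once without -delete is a cheap way to preview exactly which files would be removed.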
Nice! I end by doing git commit -a -m "Switch to UTF-8".
- Post link: https://reimirno.github.io/2022/03/11/Batch-Encoding-Change-via-Bash/
- Copyright Notice: All articles in this blog are licensed under unless otherwise stated.