Go Quickly - Converting Character Encodings In Golang
At one point or another, every developer gets stuck converting a pile of files from one character encoding to another. Go's native character set is UTF-8, and the core Go libraries don't come with tools for converting character sets. However, one of the Go extension libraries makes this easy.
The package you want is golang.org/x/text, which comes with a variety of tools for working with text. And the one we're most interested in is the encoding set of packages.
Decode to UTF-8, Encode to Something Else
The encoding library defines Decoder and Encoder structs.
- The Decoder takes text in a non-UTF-8 character set and transforms it to UTF-8 text.
- The Encoder takes UTF-8 text and transforms it to another encoding.
Particular encodings are stored in a number of subpackages beneath golang.org/x/text/encoding.
One of the most widely used encodings in North America and Europe is ISO-8859-1 (aka Latin-1), so here we'll show how to decode from that format into UTF-8.
A number of common Western encodings are located in the golang.org/x/text/encoding/charmap package. This includes the ISO-8859 family as well as the Windows-1252 character set.
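package main

import (
	"io"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Open the ISO-8859-1 encoded source file.
	f, err := os.Open("my_isotext.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Create the destination file that will hold the UTF-8 text.
	out, err := os.Create("my_utf8.txt")
	if err != nil {
		log.Fatal(err)
	}

	// Wrap the source reader so bytes are decoded to UTF-8 as they are read.
	r := charmap.ISO8859_1.NewDecoder().Reader(f)
	if _, err := io.Copy(out, r); err != nil {
		log.Fatal(err)
	}
	// Close explicitly so a failed write to disk is reported.
	if err := out.Close(); err != nil {
		log.Fatal(err)
	}
}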
The above opens an ISO-8859-1 source text (my_isotext.txt), creates a destination file (my_utf8.txt), and copies the first to the second. To decode from ISO-8859-1 to UTF-8, we wrap the original file reader (f) with a decoder:
r := charmap.ISO8859_1.NewDecoder().Reader(f)
Decoding happens as we stream data from the source file to the destination.
From the command line, we can see the result:
$ go run main.go
$ file my_isotext.txt
my_isotext.txt: ISO-8859 English text, with CRLF line terminators
$ file my_utf8.txt
my_utf8.txt: UTF-8 Unicode English text, with CRLF line terminators
Decoders and encoders can also work with strings and []byte, making them versatile whatever your encoding needs are.