Go Quickly - Converting Character Encodings In Golang

Mar 9 2016

At one point or another, every developer gets stuck converting a pile of files from one character encoding to another. Go's native character set is UTF-8, and the core Go libraries don't come with tools for converting character sets. However, one of the Go extension libraries makes this easy.

The package you want is golang.org/x/text, which comes with a variety of tools for working with text. The part we're most interested in is the encoding set of packages.

Decode to UTF-8, Encode to Something Else

The encoding library defines Decoder and Encoder structs.

  • A Decoder takes text in a non-UTF-8 character set and transforms it to UTF-8 text.
  • An Encoder takes UTF-8 text and transforms it to another encoding.

Particular encodings are stored in a number of subpackages beneath golang.org/x/text/encoding.

One of the most widely used encodings in North America and Europe is ISO-8859-1 (aka Latin-1). So here we'll show how to decode from that format into UTF-8.

A number of common Western encodings are located in the golang.org/x/text/encoding/charmap package. This includes the ISO-8859 family as well as the Windows-1252 character set.

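package main

import (
    "io"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    // Open the ISO-8859-1 source file.
    f, err := os.Open("my_isotext.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Create the UTF-8 destination file.
    out, err := os.Create("my_utf8.txt")
    if err != nil {
        log.Fatal(err)
    }

    // Wrap the source so that bytes read from it are decoded to UTF-8.
    r := charmap.ISO8859_1.NewDecoder().Reader(f)

    // Stream the decoded text into the destination file.
    if _, err := io.Copy(out, r); err != nil {
        log.Fatal(err)
    }

    // Check the error from Close: a failed close can mean lost output.
    if err := out.Close(); err != nil {
        log.Fatal(err)
    }
}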

The above opens an ISO-8859-1 source text (my_isotext.txt), creates a destination file (my_utf8.txt), and copies the first to the second. To decode from ISO-8859-1 to UTF-8 along the way, we wrap the original file reader (f) with a decoder:

r := charmap.ISO8859_1.NewDecoder().Reader(f)

Decoding happens as we stream data from the source file to the destination.

From the command line, we can see the result:

⇒  go run main.go
⇒  file my_isotext.txt
my_isotext.txt: ISO-8859 English text, with CRLF line terminators
⇒  file my_utf8.txt
my_utf8.txt: UTF-8 Unicode English text, with CRLF line terminators

Decoders and encoders can also work with strings and []byte through their String and Bytes methods, so they are just as useful for in-memory text as they are for streams.


