How To Create Large Files for Testing

Apr 18 2017

Sometimes you need to create a large file for testing. The command line tool dd is an easy way to create large files filled with random data.

I recently found myself needing to test uploading and downloading files of various sizes. So I wanted a quick way to create several files, each of a very specific size. Almost all UNIX-like systems, including Linux and macOS, provide a tool called dd that makes this job easy.

The dd command is a low-level copying tool. Apparently, its original intent was to convert between ASCII and EBCDIC encodings. But most of the time we use it as an efficient way to copy data. Here, we'll take advantage of a couple of built-in UNIX devices to create large files.

For the examples below, I'm running macOS's version of dd. Other versions, such as GNU dd on Linux, format their output a little differently.
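
For instance, GNU dd prints a slightly different transfer summary, and it also accepts a status=progress flag that shows a live progress line during long copies. That flag is a GNU extension and may not be available in macOS's dd, so treat it as a Linux-only variant of the commands below:

$ dd if=/dev/zero of=data.bin count=400 bs=1024 status=progress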

Creating a Large "Empty" File

Let's create a file whose content is just a long series of null characters. This file will have a size, but if we look at its contents, we'll see nothing.

$ dd if=/dev/zero of=data.bin count=400 bs=1024
400+0 records in
400+0 records out
409600 bytes transferred in 0.001991 secs (205722299 bytes/sec)

The arguments are as follows:

  • if is the input source
  • of is the output file
  • count is the number of times to repeat a copy
  • bs is the size of the chunk that is copied on each step (bs stands for block size)

Now we've created a file that is 400k. With dd, the total size is the block size, 1024 bytes (or 1k), multiplied by the count (400): 400 * 1024 = 409600 bytes, matching the dd output above.

$ ls -lah data.bin
-rw-r--r--  1 mbutcher  staff   400K Apr 18 09:11 data.bin
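
The split between bs and count is up to you; any combination that multiplies out to the same total gives a file of the same size. For example, this variation, copying 100 blocks of 4k each, should produce the same 409600-byte file:

$ dd if=/dev/zero of=data.bin count=100 bs=4k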

But if we were to cat the file, it would appear to be empty:

$ cat data.bin

What's happening above is that dd is copying 400k of data off of the special /dev/zero device, which produces a stream of null characters. Essentially we're making a big "empty" file.
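
If you want to see those null characters for yourself, hexdump does a better job than cat. On a 400k file of zeros, the output should collapse to something like this (the * means the previous line repeats):

$ hexdump -C data.bin
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00064000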

Creating a Large File of Random Data

Sometimes we don't want a large empty file, but a large file of random data. For example, while I was testing these large uploads and downloads, I realized that the data was being compressed in transit. And a file full of null characters compresses very efficiently:

$ ls -lah data.bin.gz
-rw-r--r--  1 mbutcher  staff   441B Apr 18 09:11 data.bin.gz
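
For reference, the compressed copy can be produced with a plain gzip run. Something like the following should do it; the -k flag, which keeps the original file around, assumes a reasonably recent gzip:

$ gzip -k data.bin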

So instead, I wanted to fill the file with random data that would not compress particularly well. To do this, use /dev/random instead of /dev/zero. Here's how we create a 5M file of random data with dd:

$ dd if=/dev/random of=data.bin count=5k bs=1024
5120+0 records in
5120+0 records out
5242880 bytes transferred in 0.410073 secs (12785234 bytes/sec)

There are two important details that changed in the example above:

  • We use /dev/random instead of /dev/zero. This fills the file with random (binary) data.
  • We use count=5k instead of count=400. The k suffix means 5 * 1024 = 5120 blocks of 1024 bytes each, for a total of 5120 * 1024 = 5242880 bytes, or 5M.

$ ls -lah data.bin
-rw-r--r--  1 mbutcher  staff   5.0M Apr 18 09:19 data.bin
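
If you want the exact byte count rather than the human-readable size, stat can print it. On macOS (BSD stat) that looks like this; on Linux the GNU equivalent is stat -c %s:

$ stat -f %z data.bin
5242880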

If I were to run cat on this file, my console would display a huge stream of garbled characters, since our file is full of binary data.
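
A safer way to peek at the contents is to pipe just the first few bytes through hexdump. The bytes will be different on every run, so no sample output here:

$ head -c 64 data.bin | hexdump -C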

And now, if I compress the data, I see that it compresses much less efficiently:

$ ls -lah data.bin.gz
-rw-r--r--  1 mbutcher  staff   5.0M Apr 18 09:19 data.bin.gz

The one trade-off with using /dev/random instead of /dev/zero is speed: the random number generator can take a while to produce that much data. Most of the time, though, the difference in speed isn't all that relevant.
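
If you're curious how big that gap is on your machine, timing the two runs side by side makes it obvious. The output filenames below are just placeholders:

$ time dd if=/dev/zero of=zero.bin count=5k bs=1024
$ time dd if=/dev/random of=random.bin count=5k bs=1024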