How To Quickly Generate A Large File On The Command Line (With Linux)

There comes a time in every developer's life when they need a data file for testing purposes and there are none handy. Rather than searching around for a file that fits your needs, the easiest thing to do is simply to generate one. There are a number of reasons why you might want to generate a data file. For example, we recently needed to test the file upload functionality of a little application we were writing at work, using a whole range of files of different sizes (from <1Mb up to >100Mb). Rather than hunt around for files that would fit the bill, it was a lot easier to just generate some. Another reason might be when you need to test some functionality (e.g. an algorithm) to see how it handles very large sets of data. Since you don't normally have files that are 1Gb or more in size just lying around, generating some is probably the way to go.

Fortunately the Linux command line has all the tools we need to quickly and easily generate any kind of data file that we require (I am of course assuming that as a self-respecting developer you're using or at least have access to a Linux system :)). Let us examine some of the options.

Firstly, to get the obvious out of the way: Solaris has a command called mkfile which will allow you to generate a file of a particular size, but we don't have this command on Linux (or at the very least I don't have it on Ubuntu), so I'll leave it at that. If you're on Solaris, feel free to investigate.
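For reference, the Solaris invocation looks something like the following (quoting from memory, so treat it as a sketch rather than a definitive example); the size argument accepts k, m and g suffixes:

mkfile 100m testfile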

When You Don't Care At All About The Contents Of The File

You just want a file of a particular size, and don't really care what's in it or how many lines it contains – use /dev/zero. This is a special file on Linux that provides a null character every time you try to read from it. This means we can use it along with the dd command to quickly generate a file of any size.

dd if=/dev/zero of=file.txt count=1024 bs=1024

This command will create a file of count*bs bytes, which in the above case is 1Mb (1024 * 1024 bytes). The file will not contain any lines, i.e.:

alan@alan-ubuntu-vm:~/tmp$ dd if=/dev/zero of=file.txt count=1024 bs=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.00510162 s, 206 MB/s
alan@alan-ubuntu-vm:~/tmp$ ls -al file.txt
-rw-r--r-- 1 alan alan 1048576 2010-03-21 22:25 file.txt
alan@alan-ubuntu-vm:~/tmp$ wc -l file.txt
0 file.txt

The advantages of this approach are as follows:

  • it is blazingly fast, taking around 1 second to generate a 1Gb file (dd if=/dev/zero of=file.txt count=1024 bs=1048576, where 1048576 bytes = 1Mb)
  • it will create a file of exactly the size that you specified

The disadvantage is that the file will contain only null characters and, as a result, will not appear to contain any lines.
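If you do this often, the count/bs arithmetic is easy to wrap in a tiny script. Here is a rough sketch (the script name and defaults are mine, not part of any standard tool) that takes the desired size in megabytes and the output file name:

#!/bin/bash
# make_zero_file.sh - create a file of N megabytes filled with null bytes (a sketch)
# usage: ./make_zero_file.sh 100 big.txt
size_mb=${1:-1}      # desired size in megabytes, defaults to 1
out=${2:-file.txt}   # output file name, defaults to file.txt
dd if=/dev/zero of="$out" bs=1048576 count="$size_mb"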

When You Don't Care About The Contents But Want Some Lines

You want a file of a particular size, but don't want it to just be full of nulls; other than that you don't really care. This is similar to the case above, but this time we use /dev/urandom. This is another special file in Linux; it is the partner of /dev/random, which serves as a random number generator on a Linux system. I don't want to go into the mechanics of it, but essentially /dev/random will eventually block unless your system has a lot of activity, while /dev/urandom is non-blocking. We don't want blocking when we're creating our files, so we use /dev/urandom (the only real difference is that /dev/urandom is actually less random, but for our purposes it is random enough :)). The command is similar:

dd if=/dev/urandom of=file.txt bs=2048 count=10

This will create a file with bs*count random bytes, in our case 2048*10 = 20Kb. To generate a 100Mb file we would do:

dd if=/dev/urandom of=file.txt bs=1048576 count=100

The file will not contain anything readable, but there will be some newlines in it.

alan@alan-ubuntu-vm:~/tmp$ dd if=/dev/urandom of=file.txt bs=1048576 count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 13.4593 s, 7.8 MB/s
alan@alan-ubuntu-vm:~/tmp$ wc -l file.txt
410102 file.txt

The disadvantages here are that the file does not contain anything readable and that it is quite a bit slower than the /dev/zero method (around 10 seconds for 100Mb). The advantage is that it will contain some lines.
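If the speed of /dev/urandom bothers you but you still want non-null contents, one workaround (my own sketch, with the obvious caveat that the contents repeat) is to read a single random chunk and concatenate it over and over:

# a sketch: 100Mb of random-looking data from a single 1Mb read of /dev/urandom
# the file repeats every 1Mb, which may or may not matter for your testing
dd if=/dev/urandom of=chunk.bin bs=1048576 count=1
for i in {1..100}; do cat chunk.bin; done > file.txt
rm chunk.bin

This gets you close to /dev/zero speeds while still producing newlines and non-null bytes.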

You Want Readable Contents But Don't Care If It Is Duplicated

In this case you want to create a file with a particular number of human-readable lines, but you don't really care if the lines are duplicated and don't need the size of the file to be precise. The best way I have found of doing this is as follows.

  • create a file with two lines in it
  • concatenate the file with itself and output to a different file
  • copy the new file over the original file
  • keep doing this until you get a file of a size you desire

Here are the specifics. Firstly create a file with two lines in it:

cat - > file.txt

This command redirects STDIN to a file, so you will need to enter two lines and then press Ctrl+D.
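If you'd rather not type the lines in interactively, a printf one-liner does the same job (just a convenience, the two lines can be anything you like):

printf 'hello\nworld\n' > file.txt

Then you will need to run the following command: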

for i in {1..n}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done

Where n is an integer. Each pass through the loop doubles the file, so after n iterations your original two lines become 2^(n+1) lines. So to create a file with 16 lines you would do:

for i in {1..3}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done

Here are some more numbers to get you started:

  • n=15 will give you 65536 lines (if the original two lines were 'hello' and 'world' the file will be 384Kb)
  • n=20 will give you 2097152 lines (12Mb file with 'hello' and 'world' as the two starting lines)
  • n=25 will give you 67108864 lines (384Mb file with 'hello' and 'world' as the two starting lines)

Up to n=20 the file is generated almost instantly; after that there will be some noticeable lag, and n=25 takes a few seconds.
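If you would rather target a size than a line count, the same doubling trick can be wrapped in a loop that keeps going until the file is big enough. A rough sketch (the 100Mb threshold is arbitrary, and stat -c %s assumes GNU coreutils):

# keep doubling file.txt until it reaches at least 100Mb (104857600 bytes)
while [ "$(stat -c %s file.txt)" -lt 104857600 ]; do
  cat file.txt file.txt > file2.txt && mv file2.txt file.txt
done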

Here is a handy tip. If you want to quickly empty a file without deleting it, redirect /dev/null into it:

cat /dev/null > file.txt
alan@alan-ubuntu-vm:~/tmp$ ls -ltrh file.txt
-rw-r--r-- 1 alan alan 100M 2010-03-21 22:56 file.txt
alan@alan-ubuntu-vm:~/tmp$ cat /dev/null > file.txt
alan@alan-ubuntu-vm:~/tmp$ ls -ltrh file.txt
-rw-r--r-- 1 alan alan 0 2010-03-21 22:57 file.txt
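Another option, if you have GNU coreutils, is the truncate command, which does the same thing without the redirection trick:

truncate -s 0 file.txt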

You Want Readable Contents And No Duplicate Lines

Similar to above, but duplicate lines are an issue for you. You want a file with a certain number of lines but don't need the size to be precise. In this situation it is a bit of a tall order to do this as a one-liner using pure shell tools (although it is quite possible if you don't mind writing a little script), so we need to turn to one of our beefier friends, Perl or Ruby (whichever one you prefer). I pick Ruby – of course :). The idea is as follows.

  • Linux has a dictionary of words which is located at /usr/share/dict/words
  • we want to randomly pick a number of words from there, join them into a line, then output the line
  • keep doing this until we get the number of lines we were looking for

The command will look like this:

ruby -e 'a=STDIN.readlines;X.times do;b=[];Y.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > file.txt

Where X is the number of lines in the file you want to generate and Y is the number of words in each line. So, to create a file with 100 lines and 4 words in each line you would do:

ruby -e 'a=STDIN.readlines;100.times do;b=[];4.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > file.txt

This is getting a little complex for a one-liner, but once you understand what's going on, it is fairly easy to put together. We basically read the dictionary into an array, randomly select words to form a line (getting rid of newlines while we're at it), output the newly created line, and put a loop around the whole thing – pretty simple.

With this method you're very unlikely to ever get repeated lines (although it is technically possible). It takes about 10 seconds to generate a 100Mb file (around 1 million lines with 12 words per line), which is comparable to some of our other methods. The lines we produce will not make any semantic sense but will be made up of real words and will therefore be readable.
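If you would rather stay in the shell, a rough equivalent (my own sketch, limited by the fact that shuf without repetition can only hand out as many words as the dictionary contains) is to shuffle the dictionary and paste the words together in groups:

# a sketch: 100 lines of 4 random dictionary words each, no word used twice
shuf -n 400 /usr/share/dict/words | paste -d ' ' - - - - > file.txt

For much larger files you would want GNU shuf's -r (repeat) option, which can output more words than the dictionary holds at the cost of allowing repeated words.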

There you go, you're now able to generate any kind of data file you want (large or small) for your testing purposes. If you know of any other funky ways to generate files in Linux, then please share and don't forget to grab my RSS feed if you haven't already. Enjoy!

Image by MGSpiller

  • Evan

    Maybe we need a kernel patch to create a /dev/loremipsum device? :-)

    http://en.wikipedia.org/wiki/Lorem_ipsum

    • http://www.skorks.com Alan Skorkin

      It’s funny but not a bad idea, if there was a way to read a constant stream of lorem ipsum from somewhere, line by line, with a particular number of bytes per line, that would be ideal :). Even better some kind of markov chain device that spits out natural text, this way you can actually generate “real” documents and the applicability will go up significantly :).

      • http://tech.bluesmoon.info/ Philip Tellis
        • http://www.skorks.com Alan Skorkin

          Hi Philip,

          That does look interesting, I wonder how fast it is to create a reasonably large file.

  • http://www.pixelbeat.org/ Pádraig Brady

    http://www.pixelbeat.org/scripts/truncate

    yes "Lorem ipsum" | fmt | head -n10000

    yes "Lorem ipsum
    exercitation ullamco labori" | head -n1000 | cat -n | shuf

    seq -f%030.0f 1000000 | shuf

    split --line-bytes

  • SirPing

    An even quicker way to generate a really big file that contains only null characters is by using the seek parameter, thus generating a sparse file. Here we create a one-terabyte file in a fraction of a second:

    dd if=/dev/zero of=one-tera-byte-file bs=1 count=0 seek=1T

    • http://www.skorks.com Alan Skorkin

      That’s pretty cool and good to know, cheers.

  • Travis Cardwell

    Reading /dev/urandom is slow because it uses the kernel prng and depletes your system entropy. It is not meant to provide a continuous stream of pseudo-random numbers; it is meant to seed other prngs.

    http://linux.die.net/man/4/random
    http://www.google.com/search?q=dev+urandom+slow

    • http://www.skorks.com Alan Skorkin

      Hi Travis,

      True, but it will work in a pinch and is reasonably easy to remember; if not pressed for time I would probably use one of the last two approaches I mentioned.

  • http://jmtd.net/ Jon

    Use the smiley of death (:> some-file) instead of ‘cat /dev/null > some-file’ to truncate it.

    • http://www.skorks.com Alan Skorkin

      Hehe, smiley of death – I like it, thanks for sharing that.

  • Peter

    On Windows, use
    fsutil file createnew "desired_file_name" file_size

    • http://www.skorks.com Alan Skorkin

      Hey Peter, I was hoping someone would share some windows ways to do this, thanks.

  • http://www.dancingbison.com Vasudev Ram

    @ jon and alan:

    Smiley of death is a nice term, hadn’t come across it before.

    In that, the colon is a do-nothing command (that returns true, BTW) and the rest of the line is the redirection. But even that can be made shorter – although by just 1 character :) , just use:

    > file.txt

    Odd though it may seem, that’s a legal shell command.

    Thinking mathematically, one could say that “a zero-length or null command is also a command” :), so the shell allows it …

    - Vasudev

  • http://www.dancingbison.com Vasudev Ram

    And related to the above, a shorter way of writing an infinite while loop in the shell, is, instead of this:

    while true
    do
    # commands
    done

    to use this:

    while :
    do
    # commands
    done

    Works for the same reason as stated in my previous comment above; the : (colon) command is a do-nothing (shell built-in) command that returns true as its exit code, but is faster to run than the “true” command that does the same, but “true” is an actual binary (executable file) on disk, and hence (except for caching) has to be loaded into memory each time through the while loop, leading to a small overhead.

    - Vasudev

  • sdaau

    Hi – thanks for the great post! My snippet: given a string append its reversed version, and keep on outputting this – using Python (bash syntax here):

    python -c 'import sys; a1="abcdefghijklmnopqrstuvwxyz" ; a2=a1[::-1] ; a=a1+a2[1:] ; size=100000 ; for i in range(1,size,len(a)): sys.stdout.write(a)' # > myfile.dat
    

    Cheers!

    • sdaau

      Ups… just to note – the “for” needs to be on its own line, the linebreak seems to be eaten by the pre tag.. Please either add this note to prev post, or remove ‘pre’ tag from previous post – and delete this post – thanks!

    • sdaau

      Sorry for another bump – here is a slightly changed version, which is a true one-liner in Python, by using square brackets and inverting the order of the for and the write command:

      python -c 'import sys; a1="abcdefghijklmnopqrstuvwxyz" ; a2=a1[::-1] ; a=a1+a2[1:] ; size=100000 ; [sys.stdout.write(a) for i in range(1,size,len(a))]' # > myfile.dat

      Cheers!

  • Larry

    I'm surprised nobody caught this. The dd command is not blazingly fast. In fact, it is limited by the IO speed of your disk. The reason why creating a 1GB file took only a second is because it dirtied your cache buffers, which are kept in memory. Assuming that you had 4GB RAM and a normal SATA disk (circa March 2010), a dd of a 50 GB file would have taken you about 15 1/2 minutes, not 50 seconds as might be implied. Had you had only 256MB of RAM in your system, the dd of a 1GB file would probably have taken about 20 seconds.

  • Puneet Madaan

    Though I was seeking some neat interface, rather than popen-ing dd to handle my swapfile in chroot… and hadn't been able to find a neater solution than the one already existing… yet I think you can make the process of file creation even faster, as you really do not need dd to write blocks, it's such a waste of time in the majority of cases..

    if the contents do not matter, then seek is better than a write operation (considering faster operation with less hard-disk noise pollution :p )… anyway, GB-sized files are usually disk images (my assumption, in my case it was a swapfile), which are later formatted using a mk-blabla file-system structure… thus

    dd if=/dev/zero of=somefile bs=1 seek=1G count=0

    Greetingz

    • Artur Linhart

      dd if=/dev/zero of=somefile bs=1 seek=1G count=0

      - this is the best way of all, especially with SSD disks, because no bytes are physically written to the medium at all, so none have to be deleted… really, very nice :-))) Thanks :-)
      Archie.

  • int_ua

    fallocate -l 4G filename

  • Aditya Patil

    This one may not be as fast, but it still works

    yes “some big text” > filename