There comes a time in every developer's life when they need a data file for testing purposes and there are none handy. Rather than searching around for a file that fits your needs, the easiest thing to do is simply to generate one. There are a number of reasons why you might want to generate a data file. For example, recently we needed to test the file upload functionality of a little application we were writing at work, using a whole range of files of different sizes (from <1MB up to >100MB). Rather than hunt around for files that would fit the bill, it was a lot easier to just generate some. Another reason might be when you need to test some functionality (e.g. an algorithm) to see how it would handle very large sets of data. Since you normally don't have files that are 1GB or more in size just lying around, generating some is probably a good way to go.
Fortunately the Linux command line has all the tools we need to quickly and easily generate any kind of data file that we require (I am of course assuming that as a self-respecting developer you're using or at least have access to a Linux system :)). Let us examine some of the options.
Firstly, to get the obvious out of the way: Solaris has a command called mkfile which will allow you to generate a file of a particular size, but we don't have this command on Linux (or at the very least I don't have it on Ubuntu), so I'll leave it at that. If you're on Solaris, feel free to investigate.
When You Don't Care At All About The Contents Of The File
You just want a file of a particular size, and don't really care what's in it or how many lines it contains – use /dev/zero. This is a special file on Linux that provides a null character every time you try to read from it. This means we can use it along with the dd command to quickly generate a file of any size.
dd if=/dev/zero of=file.txt count=1024 bs=1024
This command will create a file of size count*bs bytes, which in the above case will be 1MB. This file will not contain any lines, i.e.:
alan@alan-ubuntu-vm:~/tmp$ dd if=/dev/zero of=file.txt count=1024 bs=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.00510162 s, 206 MB/s
alan@alan-ubuntu-vm:~/tmp$ ls -al file.txt
-rw-r--r-- 1 alan alan 1048576 2010-03-21 22:25 file.txt
alan@alan-ubuntu-vm:~/tmp$ wc -l file.txt
0 file.txt
The advantages of this approach are as follows:
- it is blazingly fast, taking around 1 second to generate a 1GB file (dd if=/dev/zero of=file.txt count=1024 bs=1048576 where 1048576 bytes = 1MB)
- it will create a file of exactly the size that you specified
The disadvantage is the fact that the file will only contain null characters and as a result will not seem to contain any lines.
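If you find yourself doing this a lot, the dd call is easy to wrap in a small function. Here is a minimal sketch, assuming you want to give the size in megabytes. The name make_zero_file and the temporary directory are made up for illustration. It also checks the two claims above: the size is exact, and nothing but null characters ends up in the file.

```shell
# Sketch: create a file of exactly N megabytes filled with null characters.
# make_zero_file is a hypothetical helper name, not a standard command.
make_zero_file() {
    local path="$1" size_mb="$2"
    # bs=1048576 is one megabyte, so count is the size in megabytes
    dd if=/dev/zero of="$path" bs=1048576 count="$size_mb" 2>/dev/null
}

workdir=$(mktemp -d)
make_zero_file "$workdir/zeros.bin" 2

# Size in bytes, which should be 2 * 1048576
size=$(wc -c < "$workdir/zeros.bin")
# Number of bytes left once every null character is deleted
non_null=$(tr -d '\0' < "$workdir/zeros.bin" | wc -c)
echo "size=$size non_null=$non_null"
```

Deleting the nulls with tr and counting what is left is a quick way to confirm that the file really contains nothing else.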
When You Don't Care About The Contents But Want Some Lines
You want a file of a particular size but don't want it to just be full of nulls; other than that you don't really care. This is a similar case to the above, but this time use /dev/urandom. This is another special file in Linux. It is a partner of /dev/random, which serves as a random number generator on a Linux system. I don't want to go into the mechanics of it, but essentially /dev/random will eventually block unless your system has a lot of activity, while /dev/urandom is non-blocking. We don't want blocking when we're creating our files so we use /dev/urandom (the only real difference is that /dev/urandom is, in theory, less random, but for our purposes it is random enough :)). The command is similar:
dd if=/dev/urandom of=file.txt bs=2048 count=10
This will create a file with bs*count random bytes, in our case 2048*10 = 20KB. To generate a 100MB file we would do:
dd if=/dev/urandom of=file.txt bs=1048576 count=100
The file will not contain anything readable, but there will be some newlines in it.
alan@alan-ubuntu-vm:~/tmp$ dd if=/dev/urandom of=file.txt bs=1048576 count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 13.4593 s, 7.8 MB/s
alan@alan-ubuntu-vm:~/tmp$ wc -l file.txt
410102 file.txt
The disadvantages here are the fact that the file does not contain anything readable and the fact that it is quite a bit slower than the /dev/zero method (around 13 seconds for 100MB in the run above). The advantage is the fact that it will contain some lines.
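The number of lines you get is not arbitrary, by the way. Each random byte has a 1 in 256 chance of being a newline, so 104857600 bytes should contain roughly 104857600 / 256 = 409600 of them, which is close to the 410102 that wc reported. Here is a rough sketch of the same check on a 1MB file. The file name and temporary directory are arbitrary, and the actual count will differ a little on every run.

```shell
# Sketch: compare the newline count of a random file with the expected count.
workdir=$(mktemp -d)
file="$workdir/random.bin"

# One megabyte of random bytes
dd if=/dev/urandom of="$file" bs=1048576 count=1 2>/dev/null

size=$(wc -c < "$file")
# A newline is one of 256 possible byte values
expected=$((size / 256))
# wc -l counts newline characters, even in binary data
actual=$(wc -l < "$file")
echo "size=$size expected_newlines=$expected actual_newlines=$actual"
```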
You Want Readable Contents But Don't Care If It Is Duplicated
In this case you want to create a file with a particular number of human-readable lines, but you don't really care if the lines are duplicated and don't need the size of the file to be precise. The best way I have found of doing this is as follows.
- create a file with two lines in it
- concatenate the file with itself and output to a different file
- copy the new file over the original file
- keep doing this until you get a file of a size you desire
Here are the specifics. Firstly create a file with two lines in it:
cat - > file.txt
This command redirects STDIN to a file, so you will need to enter two lines and then press Ctrl+D. Then you will need to run the following command:
for i in {1..n}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done
Where n is an integer (type the actual number in place of n). This will create a file with 2^(n+1) lines in it, by duplicating your original two lines. So to create a file with 16 lines you would do:
for i in {1..3}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done
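One thing to watch out for if you put this in a script: bash performs brace expansion before it expands variables, so {1..$n} will not loop n times. Here is a sketch of the same doubling loop wrapped in a function that uses a plain counter instead. The name double_lines is made up, and printf stands in for typing the two lines by hand.

```shell
# Sketch: double the contents of a file n times.
# double_lines is a hypothetical helper name.
double_lines() {
    local file="$1" n="$2" i=0
    while [ "$i" -lt "$n" ]; do
        # Same trick as above: concatenate the file with itself, then replace it
        cat "$file" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
        i=$((i + 1))
    done
}

workdir=$(mktemp -d)
file="$workdir/file.txt"
printf 'hello\nworld\n' > "$file"

double_lines "$file" 3
line_count=$(wc -l < "$file")
echo "lines=$line_count"
```

With two starting lines and n=3 this gives the same 16 lines as the command above.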
Here are some more numbers to get you started:
- n=15 will give you 65536 lines (if the original two lines were 'hello' and 'world' the file will be 384KB)
- n=20 will give you 2097152 lines (12MB file with 'hello' and 'world' as the two starting lines)
- n=25 will give you 67108864 lines (384MB file with 'hello' and 'world' as the two starting lines)
Up to n=20 the file is generated instantly. After that there will be some noticeable lag; n=25 takes a few seconds.
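If you want to work out the numbers for some other value of n, the arithmetic is simple: every pass doubles the file, so two lines become 2^(n+1) lines, and 'hello' and 'world' each take up 6 bytes once you count the newline. Here is a small sketch of that calculation. The function name doubled_size is made up, and the 6 bytes per line only holds for those two starting lines.

```shell
# Sketch: number of lines and bytes after n doublings of a two-line file.
# Assumes the starting lines are 'hello' and 'world' (5 characters + newline).
doubled_size() {
    local n="$1" bytes_per_line=6
    # 1 << (n + 1) is 2^(n+1)
    local lines=$((1 << (n + 1)))
    echo "$lines $((lines * bytes_per_line))"
}

for n in 15 20 25; do
    echo "n=$n: $(doubled_size "$n") (lines bytes)"
done
```

For n=15 this prints 65536 lines and 393216 bytes, and 393216 / 1024 is the 384KB quoted above.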
Here is a handy tip. If you want to quickly empty a file without deleting it, redirect /dev/null into it:
cat /dev/null > file.txt
alan@alan-ubuntu-vm:~/tmp$ ls -ltrh file.txt
-rw-r--r-- 1 alan alan 100M 2010-03-21 22:56 file.txt
alan@alan-ubuntu-vm:~/tmp$ cat /dev/null > file.txt
alan@alan-ubuntu-vm:~/tmp$ ls -ltrh file.txt
-rw-r--r-- 1 alan alan 0 2010-03-21 22:57 file.txt
You Want Readable Contents And No Duplicate Lines
Similar to above, but duplicate lines are an issue for you. You want a file with a certain number of lines but don't need the size to be precise. In this situation it is a bit of a tall order to do this as a one-liner using pure shell tools (although it is quite possible if you don't mind writing a little script), so we need to turn to one of our beefier friends, Perl or Ruby (whichever one you prefer). I pick Ruby – of course :). The idea is as follows.
- Linux has a dictionary of words which is located at /usr/share/dict/words
- we want to randomly pick a number of words from there to make up into a line then output the line
- keep doing this until we get the number of lines we were looking for
The command will look like this:
ruby -e 'a=STDIN.readlines;X.times do;b=[];Y.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > file.txt
Where X is the number of lines in the file you want to generate and Y is the number of words in each line. So, to create a file with 100 lines and 4 words in each line you would do:
ruby -e 'a=STDIN.readlines;100.times do;b=[];4.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > file.txt
This is getting a little complex for a one-liner, but once you understand what's going on, it is fairly easy to put together. We basically read the dictionary into an array, then randomly select words to form a line (getting rid of newlines while we're at it), output our newly created line, and put a loop around the whole thing - pretty simple.
With this method you're very unlikely to ever get repeated lines (although it is technically possible). It takes about 10 seconds to generate a 100MB file (around 1 million lines with 12 words per line), which is comparable to some of our other methods. The lines we produce will not make any semantic sense but will be made up of real words and will therefore be readable.
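If you don't have Ruby installed, the same idea can be sketched with awk, which should be present on pretty much any Linux system. This is an alternative I am suggesting, not something the one-liner above depends on. The function name random_lines is made up, and the word list is passed in as a parameter so that you can point it at any file with one word per line.

```shell
# Sketch: print a number of lines, each made of randomly chosen words.
# random_lines is a hypothetical helper: random_lines DICT LINES WORDS
random_lines() {
    local dict="$1" lines="$2" words="$3"
    awk -v lines="$lines" -v words="$words" '
        BEGIN { srand() }              # seed from the current time
        { dict[NR] = $0 }              # read every word into an array
        END {
            for (i = 1; i <= lines; i++) {
                line = ""
                for (j = 1; j <= words; j++) {
                    # rand() is in [0, 1), so the index runs from 1 to NR
                    word = dict[int(rand() * NR) + 1]
                    line = (j == 1) ? word : line " " word
                }
                print line
            }
        }
    ' "$dict"
}

# Fall back to a tiny made-up word list if the system dictionary is missing
dict=/usr/share/dict/words
if [ ! -r "$dict" ]; then
    dict=$(mktemp)
    printf '%s\n' alpha bravo charlie delta echo foxtrot > "$dict"
fi

workdir=$(mktemp -d)
random_lines "$dict" 100 4 > "$workdir/file.txt"
line_count=$(wc -l < "$workdir/file.txt")
echo "lines=$line_count"
```

Since srand() seeds from the clock, two runs within the same second can produce identical output, which is worth keeping in mind if you generate several files in a row.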
There you go, you're now able to generate any kind of data file you want (large or small) for your testing purposes. If you know of any other funky ways to generate files in Linux, then please share. Enjoy!
Comments
Maybe we need a kernel patch to create a /dev/loremipsum device? :-)
http://en.wikipedia.org/wiki/Lorem_ipsum
It’s funny but not a bad idea. If there was a way to read a constant stream of lorem ipsum from somewhere, line by line, with a particular number of bytes per line, that would be ideal :). Even better would be some kind of Markov chain device that spits out natural text; this way you can actually generate “real” documents and the applicability will go up significantly :).
Just use nonsense:
http://freshmeat.net/projects/nonsense
Hi Philip,
That does look interesting, I wonder how fast it is to create a reasonably large file.
http://www.pixelbeat.org/scripts/truncate
yes "Lorem ipsum" | fmt | head -n10000
yes "Lorem ipsum
exercitation ullamco labori" | head -n1000 | cat -n | shuf
seq -f%030.0f 1000000 | shuf
split --line-bytes
An even quicker way to generate a really big file that contains only null characters is by using the seek parameter, thus generating a sparse file. Here we create a one-terabyte file in a fraction of a second:
dd if=/dev/zero of=one-tera-byte-file bs=1 count=0 seek=1T
That’s pretty cool and good to know, cheers.
Reading /dev/urandom is slow because it uses the kernel prng and depletes your system entropy. It is not meant to provide a continuous stream of pseudo-random numbers; it is meant to seed other prngs.
http://linux.die.net/man/4/random
http://www.google.com/search?q=dev+urandom+slow
Hi Travis,
True, but it will work in a pinch and is reasonably easy to remember. If not pressed for time, though, I would probably use one of the last two approaches that I mentioned.
Use the smiley of death (:> some-file) instead of 'cat /dev/null > some-file' to truncate it.
Hehe, smiley of death – I like it, thanks for sharing that.
On Windows, use
fsutil file createnew "desired_file_name" file_size
Hey Peter, I was hoping someone would share some windows ways to do this, thanks.
@ jon and alan:
Smiley of death is a nice term, hadn’t come across it before.
In that, the colon is a do-nothing command (that returns true, BTW) and the rest of the line is the redirection. But even that can be made shorter – although by just 1 character :) , just use:
> file.txt
Odd though it may seem, that’s a legal shell command.
Thinking mathematically, one could say that “a zero-length or null command is also a command” :), so the shell allows it …
- Vasudev
And related to the above, a shorter way of writing an infinite while loop in the shell, is, instead of this:
while true
do
# commands
done
to use this:
while :
do
# commands
done
This works for the same reason as stated in my previous comment above: the : (colon) command is a do-nothing shell built-in that returns true as its exit code. It can be faster to run than the "true" command in shells where "true" is an actual binary (executable file) on disk, since (except for caching) the binary has to be loaded into memory each time through the while loop, leading to a small overhead. In bash, "true" is itself a built-in, so there the difference disappears.
- Vasudev