I find that whenever I get a large data file from somewhere (i.e. extract some data from a database, crawl some sites and dump the data in a file) I always need to do just that little bit of extra processing before I can actually use it. This processing is always just non-trivial enough and I do it just uncommonly enough for me to always forget exactly how to go about it. Of course, this is to be expected, if you learn something and want it to stick you have to keep doing it. It’s all part and parcel of how our brain works when it comes to learning new skills, but that doesn’t make it any less annoying.

Back to our data file, for me I find that I almost always need to do 3 things (amongst others) before doing anything else with my file.

  • delete the first line (especially when pulling data out of the database)
  • delete the last line 
  • remove all blank lines

Don’t ask me why but for whatever reason, you always get an extraneous first line and unexpected blank lines (and less often an extraneous last line) no matter how you produce the file :).

Anyways, my tool of choice in the matter is bash – it is just too trivial to use anything else (plus I love the simplicity and power of the shell). So, to make sure I never forget again here is the easiest way of doing all the three things above using sed:

sed 1d input_file | sed '$d' | sed '/^$/d' > output_file 

Update: As Evan pointed out in the comments, it would be more efficient to do the following:

sed -e 1d -e ‘$d’ -e ‘/^$/d’ input_file > output_file

_This way the file doesn’t have to go through multiple pipes.

_

Of course since we’re using bash, there should be numerous ways of doing the above.

You can remove the first line using awk:

awk 'FNR>1'

but I don’t know how to remove the last line using awk. Anyone?

You can use head or tail to get rid of the first and last line:

head --lines=-1 input_file | tail --lines=+2

but not to remove blank lines.

You can use grep to remove blank lines

grep -v "^$" input_file

but it would be silly to try and use it to remove the first and last line (possible though).

If you know of an easier way to do the above three things in a one-liner using bash – do share it.

What are some of the most common (but non-trivial enough) things that you find yourself doing with bash when it comes to pre-processing that large data file?

For more tips and opinions on software development, process and people subscribe to skorks.com today.

Image by rachel_thecat