The Most Common Things You Do To A Large Data File With Bash

I find that whenever I get a large data file from somewhere (e.g. extracting some data from a database, or crawling some sites and dumping the results to a file), I always need to do just that little bit of extra processing before I can actually use it. This processing is always just non-trivial enough, and I do it just infrequently enough, that I always forget exactly how to go about it. Of course, this is to be expected: if you learn something and want it to stick, you have to keep doing it. It's all part and parcel of how our brains work when it comes to learning new skills, but that doesn't make it any less annoying.

Back to our data file. I find that I almost always need to do three things (amongst others) before doing anything else with it:

  • delete the first line (especially when pulling data out of a database)
  • delete the last line 
  • remove all blank lines

Don't ask me why, but no matter how you produce the file you always seem to get an extraneous first line and unexpected blank lines (and, less often, an extraneous last line) :).

Anyway, my tool of choice in the matter is bash – the task is just too trivial to use anything else (plus I love the simplicity and power of the shell). So, to make sure I never forget again, here is the easiest way of doing all three things above using sed:

sed 1d input_file | sed '$d' | sed '/^$/d' > output_file 

Update: As Evan pointed out in the comments, it would be more efficient to do the following:

sed -e 1d -e '$d' -e '/^$/d' input_file > output_file

This way the file doesn't have to go through multiple pipes.
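To see the three expressions at work, here is a quick demonstration on some throwaway data (the file path and contents are placeholders of my own, not from any real dataset):

```shell
# Build a sample file with the three usual problems: an extraneous
# header line, a blank line in the middle, and a trailing line.
printf 'header\nrow1\n\nrow2\ntrailer\n' > /tmp/demo_input

# 1d deletes the first line, $d the last, /^$/d every blank line.
sed -e 1d -e '$d' -e '/^$/d' /tmp/demo_input
# prints:
# row1
# row2
```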

Of course, since we're using bash, there are numerous ways of doing the above.

You can remove the first line using awk:

awk 'FNR>1' input_file

but I don't know how to remove the last line using awk. Anyone?
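One possibility (a sketch of my own, not from the article) is to exploit awk's pattern-action model: hold each line in a variable and print it one step late, so the final line is read but never emitted. The file path here is a placeholder:

```shell
# Sample input: three lines, of which the last should be dropped.
printf 'first\nsecond\nthird\n' > /tmp/demo_input

# From the second line onward, print the *previous* line; the last
# line ends up held in prev and is never printed.
awk 'NR > 1 { print prev } { prev = $0 }' /tmp/demo_input
# prints:
# first
# second
```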

You can use head or tail to get rid of the first and last line:

head --lines=-1 input_file | tail --lines=+2

but not to remove blank lines.

You can use grep to remove blank lines:

grep -v "^$" input_file

but it would be silly to try to use it to remove the first and last lines (though it is possible).

If you know of an easier way to do the above three things in a one-liner using bash – do share it.
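For what it's worth, here is one all-awk candidate (my own sketch, under the assumption that holding lines one step behind is acceptable): the `NR > 2` guard keeps the first line from ever being printed, the held-line trick drops the last line, and the blank-line check skips empty lines. File paths are placeholders:

```shell
# Sample input with all three problems present.
printf 'header\nrow1\n\nrow2\ntrailer\n' > /tmp/demo_input

# NR > 2 ensures the first line (held when NR == 2) is never printed;
# prev !~ /^$/ skips blank held lines; the last line stays held forever.
awk 'NR > 2 && prev !~ /^$/ { print prev } { prev = $0 }' /tmp/demo_input
# prints:
# row1
# row2
```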

What are some of the most common (but non-trivial enough) things that you find yourself doing with bash when it comes to pre-processing that large data file?


Image by rachel_thecat

  • Evan

    Wouldn't "sed -e 1d -e '$d' -e '/^$/d' input_file > output_file" be a bit more efficient? Save that large file having to go through two pipes and two extra processes!

    • Hi Evan,

      It certainly looks like it would be – can you tell that my sed skills are sub-par? :). Thanks.

  • Korny

    I always use ‘head’ and ‘tail’ to trim lines, and ‘grep’ to do things like remove internal blank lines.
    For more complex things I have used ‘sed’, but I always have to look it up again when I need it – which is a bit of a showstopper.
    Instead, I try to re-use the tool I’m most familiar with – ruby. Consider:
    cat my_data | ruby -pe 'next if $_ == "\n"'
    the ‘-p’ means ‘loop and print every line’, and the call to “next” skips out of the loop before the printing. Alternatively, you can use ‘-n’ which loops without printing.
    Or you can just roll your own loop:
    cat my_data | ruby -e '$stdin.each {|l| puts l unless l == "\n"}'
    Or to truncate your first and last lines: (unfortunately loading everything into memory)
    cat my_data | ruby -e '$stdin.to_a[1...-1].each {|l| puts l unless l == "\n"}'
    … though realistically, for such a simple example I’d probably use head, tail, and grep!

    • I didn't even consider that you can pipe stuff to ruby, although I am not sure why I didn't, since it makes perfect sense :). The example you give looks very similar to how you can pipe stuff to perl on the command line, but I never tend to use that since my perl skills are abysmal. But as you say, for simpler things I prefer the simple tools built into the shell, and only take it up a notch (with perl or ruby) for more complex stuff.

  • Peter Cable

    I’m unclear why you even mention bash here. Sed, awk and ruby are interpreters for their respective programming languages and they are doing the actual work here, not bash.

    You could accomplish all these tasks in bash, but I think it would be messy.

    • Hi Peter,

      You’re right they are programming languages in their own right, but you can easily pipe inputs to them and pipe outputs out of them and in that way they are very much like shell tools (i.e. head, tail etc.).

      So while we’re using sed and awk we’re not writing full fledged scripts in them but instead are piping data through them and allowing them to perform little bits of functionality all on the command line, as per the unix philosophy.


  • apt-get install buthead

    • I just love the names that unix and linux people come up with for their utilities :). Apparently "buthead" is a program to copy all but the first N lines of standard input to standard output.

  • lre

    Just stumbled across this post, and have to add

    awk 'NR==1 { next }          # Do not hold first line
    hold !~ /^$/ { print hold }  # If held line is non-empty, print it
    { hold = $0 }                # Hold line (thus, last line is not printed)
    ' inputfile > outputfile

  • Chad

    This has already been mentioned, but the title of the article remains… This article isn't about bash. sed, awk, head, tail, and grep have nothing to do with Bash. The fact that pipes and redirects are used makes it barely about Bash, and just as much about sh or ksh. This article is about unix command line utilities for modifying text files, and should be worded to reflect that. Bash and said utilities are being conflated, which is confusing to readers, and I suspect points to a lack of clarity on the part of the author about what Bash is. The commands in this article would work just as well in Korn shell, POSIX shell, and I believe C shell (although I avoid C shell).

    To do this work with Bash you would create a script file that starts with

    #!/bin/bash

    … and begin coding. But why would you use Bash for this work?… Therefore this article needs to be rewritten to stop talking about Bash.
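To illustrate the "messy" point above, here is a rough sketch (my own, with placeholder file names) of doing all three steps using only shell built-ins: skip the first line, then print each remaining non-blank line one step late so the last line is never emitted.

```shell
# Sample input with a header, a blank line, and a trailer to strip.
printf 'header\ndata1\n\ndata2\nfooter\n' > /tmp/demo_input

first=1
have_prev=0
while IFS= read -r line; do
  # Skip the very first line entirely.
  if [ "$first" -eq 1 ]; then first=0; continue; fi
  # Print the previously held line, but only if it is non-blank.
  if [ "$have_prev" -eq 1 ] && [ -n "$prev" ]; then
    printf '%s\n' "$prev"
  fi
  prev=$line
  have_prev=1
done < /tmp/demo_input
# The last line stays held in $prev and is never printed.
# prints:
# data1
# data2
```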