My web host (Strato) provides me with log files, allowing me to do traffic analysis. Unfortunately, I have to download the files manually from a configuration web interface, which only offers the log files of the last six weeks. This means I have to make sure to download the log files at least once every six weeks. Moreover, the log files on my hard drive are then split into pieces at arbitrary points, depending on when I downloaded them.

What I’d like to do is merge all the individual files into one big log file which I can then split into well-defined parts, e.g. one log file per month, per year, or from one redesign to the next. While I found a few tools for merging interleaved log files from the different servers of a load-balancing cluster, I was unable to find one for my task – so I wrote one myself.

How it works

My initial idea was to use Perl to parse each line of the various partial log files and then produce output sorted by timestamp. However, I wanted the script to be as simple as possible, which led me to consider an alternative solution. In order to merge two files, there has to be some overlap between their time frames and thus between the lines they contain. With that condition in mind, it’s easy to think of the Unix/Linux utilities diff and patch. So I conceived a Bash shell script with those two commands as the workhorses, surrounded by a little control logic. Why Bash? Well, there’s an issue with really weird file names for which Bash provides an easy solution. But more on that later.
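To get a feel for the approach, here’s a toy example with two made-up three-line files that overlap in two lines:

$ printf 'line 1\nline 2\nline 3\n' > part1
$ printf 'line 2\nline 3\nline 4\n' > part2
$ diff part1 part2
1d0
< line 1
3a3
> line 4
$ diff part1 part2 | grep -B 1 '^>' | patch part1
patching file part1
$ cat part1
line 1
line 2
line 3
line 4

The grep step keeps only the lines that are new in the second file (those prefixed with “>”) together with diff’s instruction on where to insert them, so patch appends “line 4” while leaving “line 1” alone. That’s the whole merging trick.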

Like I said, merging requires some overlap to exist between the two files being merged. If there is a “gap” in the input data, the merge fails. Whether to append the files regardless of the gap or to keep several gap-free merged files is then up to the user; the script merely reports what happened and aborts the merging process. The script also terminates with an error message if it encounters a merge conflict, that is, if two partial log files disagree about some lines. Resolving such a conflict likewise falls to the user.
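What such a conflict looks like to diff is worth seeing once. In this made-up example, the two files share the line “common” but disagree about what follows it:

$ printf 'line 1\ncommon\ndiverged A\n' > part1
$ printf 'common\ndiverged B\n' > part2
$ diff part1 part2
1d0
< line 1
3c2
< diverged A
---
> diverged B

A line consisting solely of “---” only appears when diff reports a change, i.e. when the two files contradict each other – or, in the case of a gap, when they have no lines in common at all. That marker is exactly what the script looks for to detect both situations.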

The script takes care of two more things automatically: it uncompresses gzipped log files while keeping the compressed originals, and it moves partial log files to a backup folder after successfully merging them.
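The decompression detail is simple: gunzip -c writes to standard output and never modifies the compressed file. A quick check, with a hypothetical filename and the output of file abbreviated:

$ file -b access_log_20110830-20110924.gz
gzip compressed data, ...
$ gunzip -cd access_log_20110830-20110924.gz > /tmp/unpacked   # original stays put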

The script

The code contains a lot of comments, which is why I’m not going to describe its inner workings in more detail here. Have a look at the code instead:

#!/bin/bash

# This shell script takes a number of partial but overlapping log files
# (e.g. from a web server) and merges them into one consecutive log file.
#
# If the partial log files are compressed using gzip their contents are
# temporarily decompressed for merging while not touching the original files.
#
# Use the three variables below to configure this script according to your
# needs.
# After processing the partial log files successfully, they get moved to the
# directory specified in the variable `backupdir'. If you don't want the files
# to be moved, set `backupdir' to an empty string.

mergefile="mergedlogfile"  # Name of the merged (output) file
partfiles="access_log_*"   # Names (with wildcards) of the partial log files
backupdir="mergedparts"    # Directory to move processed partial log files to

# The partial log files have filenames like "access_log_20110830-20110924"
# where the numbers denote the date interval of log entries contained in each
# file. This filename format allows us to sort the files chronologically by
# sorting their filenames in ascending order. That's what the command below
# does -- it returns one filename after the other in chronological order.
# The way merging works doesn't need the input files to be preordered
# _chronologically_, but gaps between consecutive files have to be avoided.
# Sorting chronologically achieves that and is easy to implement by sorting
# lexically, given that the partial log files have a naming scheme like the
# one mentioned above.
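# As a concrete (made-up) illustration, filenames like these sort
# chronologically by plain lexical order, each file overlapping the next:
#   access_log_20110716-20110830
#   access_log_20110830-20110924
#   access_log_20110918-20111023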
# Note: IFS= and -r keep `read' from trimming whitespace and from
# interpreting backslashes in filenames.
find . -maxdepth 1 -name "$partfiles" -print0 | sort -z \
| while IFS= read -r -d $'\0' partial
do
    echo "About to merge $partial into $mergefile"

    # If partial log file is gzipped, unpack to temporary file
    filetype=$(file -b "$partial")
    is_gzip=$(expr match "$filetype" '^gzip compressed data')
    if [ $is_gzip -gt 0 ]
    then
        unpacked=$(mktemp --tmpdir logmerge.XXXXXX)  # Create temporary file
        gunzip -cd "$partial" > "$unpacked"  # Unpack, keeping original file
    else
        unpacked="$partial"  # Original partial log file is unpacked already
    fi

    # Now do the actual merging
    if [ ! -e "$mergefile" ]
    then
        # The merged (output) file doesn't exist, yet, so instead of doing
        # any merging, the contents from the first partial log file are
        # simply copied to the merged file.
        echo "creating file $mergefile from contents of first input file"
        cp "$unpacked" "$mergefile"
    else
        # The merged (output) file already exists, so let's merge the contents
        # of the current partial log file into it.

        # Before we do that, though, we have to check for merge conflicts.
        # These would otherwise go unnoticed for instance if the following
        # files part1 and part2 were to be merged.
        #   part1:                      part2:
        #     line 1 unique               line 2 different part2
        #     line 2 different part1      line 3 common
        #     line 3 common               line 4 unique
        # Without the explicit merge conflict check this would result in the
        # merged (output) file containing
        #     line 1 unique
        #     line 2 different part1
        #     line 3 common
        #     line 4 unique
        # On the second run of logmerge.sh, however, you would get a somewhat
        # misleading error message when trying to merge in part2.
        # In contrast to this the following files part1 and part2 would produce
        # an error on the first run of logmerge.sh:
        #   part1:                      part2:
        #     line 1 unique               line 2 common
        #     line 2 common               line 3 different part2
        #     line 3 different part1      line 4 unique
        # Either scenario is unacceptable, thus let's check for merge conflicts
        # now. To do that we have `grep' search the `diff' output for lines
        # consisting entirely of the string "---" which indicates lines that
        # differ between the two compared files, i.e. merge conflicts.
        #
        # In addition to merge conflicts this method also exposes gaps in
        # the input data. These have to be handled as well since the `patch'
        # command further down would otherwise return an error. It's better to
        # give the user a clear message on what just happened. Then they can
        # choose the best way to handle the error, which in case of gaps would
        # probably be to create two different merged files.
        diff "$mergefile" "$unpacked" | grep '^---$' > /dev/null
        if [ $? -eq 0 ]
        then
            # Print error message to stderr
            echo "ERROR: There's a merge conflict or a gap between" >&2
            echo "       the previously merged files and" >&2
            echo "       $partial." >&2
            echo "       Aborting merge operation." >&2

            # If it has been created earlier, delete temporary file containing
            # unpacked partial log file
            if [ "$partial" != "$unpacked" ]
            then
                rm "$unpacked"
            fi

            # Abort execution with error code
            exit 1
        fi

        # Now that we've made sure there's no merge conflict or gap we can
        # start the actual merging process. Here's how it's done:
        # a) `diff' compares the two files line by line;
        # b) `grep' extracts only those lines that are in the partial log file
        #    but not yet in the merged file, together with the information
        #    from `diff' on where to insert those lines into the merged file;
        # c) `patch' then applies those changes by adding them to the merged
        #    file.
        diff "$mergefile" "$unpacked" | grep -B 1 "^>" | patch "$mergefile"
    fi

    # If it has been created earlier, delete temporary file containing
    # unpacked partial log file
    if [ "$partial" != "$unpacked" ]
    then
        rm "$unpacked"
    fi

    # The contents of the partial log file are now in the merged file.
    # We don't really need the partial log file anymore, but just in case
    # let's back it up by moving it to the backup directory.
    # Note: The script doesn't require the partial log files to be moved to a
    # different place and thereby out of view for subsequent runs. If there's
    # a partial log file that does not add anything new to the merged file
    # (i.e. it has been merged before) there's simply no merging action taking
    # place.
    if [ -d "$backupdir" -a ! -z "$backupdir" ]
    then
        # Only move to backup directory if it exists and if it's defined, i.e.
        # its name is not an empty string
        mv "$partial" "$backupdir"
    fi
done

You can also download the script.
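For illustration, a run over two hypothetical partial log files (the second one gzipped) might look like this, assuming the backup directory mergedparts already exists:

$ ls
access_log_20110716-20110830  access_log_20110830-20110924.gz  logmerge.sh  mergedparts
$ ./logmerge.sh
About to merge ./access_log_20110716-20110830 into mergedlogfile
creating file mergedlogfile from contents of first input file
About to merge ./access_log_20110830-20110924.gz into mergedlogfile
patching file mergedlogfile
$ ls
logmerge.sh  mergedlogfile  mergedparts

Both partial log files end up merged into mergedlogfile and moved to mergedparts.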

Non-Bash shells

Now for the reason this is a Bash script. The main while loop reads filenames from a pipe which is fed by commands that use NUL separation. This enables the script to work with filenames containing even the wildest characters: by definition, the only characters that are strictly forbidden in filenames are the NUL character and the slash “/”, which separates directories. To handle NUL separation correctly, the read command uses the option -d $'\0' (IFS= and -r additionally keep read from trimming whitespace or interpreting backslashes). This is a Bash extension that doesn’t work in, for example, Ubuntu’s simple shell dash, which is found at /bin/sh. If you want to use the script with a non-Bash shell, replace

find . -maxdepth 1 -name "$partfiles" -print0 | sort -z \
| while IFS= read -r -d $'\0' partial

with

IFS='
'
for partial in $(find . -maxdepth 1 -name "$partfiles" -print | sort)

Make sure to have a newline directly after IFS=' as this seems to be the only portable way to set the input field separator to a newline. And just so you know, the downside to this modification is that the script won’t work anymore with filenames containing newlines.
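To see the difference in practice, here’s a contrived demonstration: a filename containing a newline survives the NUL-separated Bash loop intact, whereas the for-loop variant above would split it into two entries:

$ touch "$(printf 'access_log_a\nb')"
$ find . -maxdepth 1 -name 'access_log_*' -print0 | sort -z \
| while IFS= read -r -d $'\0' f; do printf 'got: [%s]\n' "$f"; done
got: [./access_log_a
b]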