Josh Dick

Rejecting False Binary Files in Git

When text files stay up past their bedtime.

The Problem

I noticed a small but annoying issue with Git that surfaced every once in a while: Git would suddenly start treating text files as if they were binary files. This example output from a relevant Stack Overflow question illustrates the issue:

$ git diff MyFile.txt
diff --git a/MyFile.txt b/MyFile.txt
index d41a4f3..15dcfa2 100644
Binary files a/MyFile.txt and b/MyFile.txt differ

In general, this issue could be viewed as a limitation of Git’s built-in diff capability, since Git apparently is not great at comparing files that use variable-width encoding. At my job, however, none of the source code our team works on is supposed to use variable-width encoding. For our purposes, it’s safe to assume that seeing this issue means that a developer mistakenly added non-ASCII characters to an otherwise-ASCII text file (Git’s limitations notwithstanding).
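
To track down the offending bytes in a suspect file, a quick check along these lines can help. This is only a minimal sketch and not part of the hook: countSuspectBytes is a hypothetical helper name, and it assumes that "plain text" means printable ASCII plus tab, line feed, and carriage return.

```shell
#!/bin/bash

# Hypothetical helper: counts the bytes in file $1 that are not printable ASCII,
# tab, line feed, or carriage return. A result of 0 means the file is plain ASCII text.
countSuspectBytes() {
  # tr -d deletes every "allowed" byte, so whatever remains is suspect.
  # LC_ALL=C makes tr operate on raw bytes rather than multibyte characters.
  # The trailing tr strips the padding that some wc implementations add.
  LC_ALL=C tr -d '\011\012\015\040-\176' < "$1" | wc -c | tr -d ' '
}

# Example: a file containing one two-byte UTF-8 character and one NUL byte.
sample=$(mktemp)
printf 'caf\303\251 \000 ok\n' > "$sample"
countSuspectBytes "$sample"   # prints 3
rm -f "$sample"
```

A nonzero count says the file contains something other than plain ASCII text, which in our codebase is worth a closer look.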

I decided to create a Git update hook that rejects problem commits when developers try to push them. That way, these mistakes are caught automatically as they happen, rather than leaving developers to figure out when and why their diffs suddenly stopped working.

The Code

#!/bin/bash

#
# [Your existing update hook code here...]
#

refname="$1"
oldrev="$2"
newrev="$3"

# --- Reject this commit if it contains source files that Git views as binary files.

# First, make sure the user is creating or updating a ref, not deleting one. ($newrev will be all zeros when deleting.)
zero="0000000000000000000000000000000000000000"
if [ "$newrev" != "$zero" ]; then

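  # Tests whether Git thinks the passed filename ($1) is a binary file.
  # For binary files, "git diff --numstat" prints "-<TAB>-<TAB><path>" instead of line counts.
  isBinary() {
    local binaryPattern diffResult
    binaryPattern=$(printf '%s\t-\t' -)
    diffResult=$(git diff --numstat master "$newrev" -- "$1")
    # The match is the last command, so its status becomes the function's
    # return value: 0 (true) when the binary marker is present.
    [[ "$diffResult" =~ "$binaryPattern" ]]
  }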

  # Get a list of all relevant files to check: filter the names of all changed files in $newrev by extension.
  checkFiles=$(git show --pretty="format:" --name-only $newrev | grep -E '\.(java|groovy|js|xml|css|less|sh)(\..+)?$')

  binaryFilesFound=false
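  while IFS= read -r filename; do
    # Quote $filename so paths containing spaces survive, and call isBinary
    # directly: its exit status is all the test needs.
    if [ -n "$filename" ] && isBinary "$filename"; then
      echo "*** File $filename appears to be a binary file, but should be plain text." >&2
      binaryFilesFound=true
    fi
  done <<< "$checkFiles"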

  if $binaryFilesFound; then
    # Send the explanation to stderr, like the per-file messages above.
    echo "Please remove all non-human-readable characters from the above file(s)." >&2
    echo "For example, to remove NUL characters from a file:" >&2
    echo "  tr < file.ext -d '\000' > file.tmp" >&2
    echo "  mv file.tmp file.ext" >&2
    # Reject the push
    exit 1
  fi

fi

# --- Finished
exit 0

How it Works

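This code lives inside the update hook script (hooks/update) in our central Git repository. It runs on the server whenever someone does a git push, once for each ref being updated, and receives the ref name, the old revision, and the new revision as its three arguments.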
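
The old and new revisions are enough to tell what kind of push is happening, which is why the hook checks $newrev against all zeros before doing anything else. Here is a rough sketch of that idea. classifyUpdate is a hypothetical helper that the hook itself doesn't use, the object names in the example are made up, and the 40-character zero string assumes SHA-1 object names as in the hook.

```shell
#!/bin/bash

# Hypothetical helper: classifies a ref update from the old and new object names
# that Git passes to the update hook as its second and third arguments.
zero="0000000000000000000000000000000000000000"

classifyUpdate() {
  local oldrev="$1" newrev="$2"
  if [ "$newrev" = "$zero" ]; then
    echo "delete"   # the ref is being removed, so there is nothing to inspect
  elif [ "$oldrev" = "$zero" ]; then
    echo "create"   # the ref did not exist before this push
  else
    echo "update"   # an existing ref is moving to a new commit
  fi
}

# Example with made-up object names:
classifyUpdate "$zero" "15dcfa2000000000000000000000000000000000"   # prints create
classifyUpdate "d41a4f3000000000000000000000000000000000" "$zero"   # prints delete
```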

Here’s a summary of what the code does:

  1. Get a list of all files changed in the new revision (the tip commit of the incoming push).
  2. Filter the list of files by their extensions, so that the list only contains files that are supposed to be plain text files.
  3. For each file in the list, diff master against the new revision and make sure that Git doesn’t report the file as binary data.
  4. If any of the files fail this test, reject the push and show the user an explanatory error so they can fix the problem before attempting to push again.
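
Steps 2 and 3 can be tried out without a repository by feeding numstat-style lines to a small function. This is a sketch under assumptions rather than the hook's actual code: listFalseBinaries is a hypothetical name, the input lines are hand-written rather than real Git output, and it parses one stream of numstat lines instead of calling git diff once per file as the hook does.

```shell
#!/bin/bash

# Same extension filter as the hook, as a bash regular expression.
sourcePattern='\.(java|groovy|js|xml|css|less|sh)(\..+)?$'

# Hypothetical helper: reads "git diff --numstat" style lines on stdin and prints
# the paths that are reported as binary and whose names look like source files.
# For binary files, numstat prints "-" in place of the added and deleted line counts.
listFalseBinaries() {
  local added deleted path
  while IFS=$'\t' read -r added deleted path; do
    if [ "$added" = "-" ] && [ "$deleted" = "-" ] && [[ "$path" =~ $sourcePattern ]]; then
      printf '%s\n' "$path"
    fi
  done
}

# Example with hand-written numstat lines:
printf '3\t1\tsrc/Main.java\n-\t-\tsrc/Util.java\n-\t-\tlogo.png\n' | listFalseBinaries
# prints src/Util.java
```

Only src/Util.java is reported: src/Main.java has real line counts, and logo.png is binary but isn't expected to be text in the first place.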

Considerations

Using an update hook isn’t ideal, since an ideal solution would prevent this type of mistake from being made in the first place. However, the update hook does prevent developers from accidentally propagating problematic changes.

If you know of a better way to solve this problem, I’d love to hear about it!

Tags: git
