Non-Unicode Command Line Arguments

27th August 2020

Command line arguments on Unix systems are arbitrary null-terminated C-strings which can contain any byte value except zero. A program which naively expects its arguments to be valid Unicode strings can crash and burn if it finds an invalid byte value instead.

How can we test this behaviour? Deliberately feeding a program invalid Unicode on the command line is a little awkward but we can do it using xargs:

$ echo -e 'foo \xFF bar' | xargs path/to/binary

This feeds the program three arguments, the second argument consisting of a single invalid (for UTF-8) byte with value 255.
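If xargs isn't convenient, the same experiment can be sketched from Python itself, since on Unix subprocess accepts arguments as bytes. This is only a rough sketch: CHILD and run_child are names I've made up for illustration, and it assumes a Unix system where os.fsencode() can recover the original bytes.

```python
import os
import subprocess
import sys

# Child program: print the raw bytes of every argument it was given.
CHILD = b"import os, sys; print([os.fsencode(a) for a in sys.argv[1:]])"


def run_child(raw_args):
    """Run CHILD under the current interpreter with the given bytes arguments."""
    command = [os.fsencode(sys.executable), b"-c", CHILD, *raw_args]
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    # Same three arguments as the xargs pipeline: foo, a lone 0xFF byte, bar.
    print(run_child([b"foo", b"\xff", b"bar"]))
```

On a typical Linux setup I'd expect the child to report the 0xFF byte unchanged as its second argument.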

Example — Pytho💥

Let's try feeding our poisoned argument string to the following Python script which simply attempts to print its own command line arguments:

import sys

for arg in sys.argv:
    print(arg)

Running with Python 3.8 gives me the following output:

$ echo -e 'foo \xFF bar' | xargs python3 args.py
args.py
foo
Traceback (most recent call last):
  File "args.py", line 4, in <module>
    print(arg)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed

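What's really interesting here is that the problem doesn't surface until we hit the print() call on line 4. Python has decoded the invalid byte into a lone surrogate code point, U+DCFF (this is its surrogateescape error handler at work), which is legal inside a str but cannot be encoded as UTF-8 for output. In other words, Python has cheerfully handed us a booby-trapped "string" which explodes when we try to use it!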

(It's also interesting that this behaviour is a regression introduced in Python 3 — earlier versions of Python treated command line arguments as arbitrary byte strings and didn't have this issue.)
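For what it's worth, the original bytes aren't lost in Python 3. Here's a minimal sketch of a more defensive version of the script, assuming a Unix system where os.fsencode() reverses the decoding Python applied to the arguments. The describe() helper and its output format are my own invention, not anything standard.

```python
import os
import sys


def describe(arg: str) -> str:
    """Return a printable description of one command line argument."""
    try:
        # A strict encode fails if the string carries smuggled-in bytes.
        arg.encode("utf-8")
        return arg
    except UnicodeEncodeError:
        # Recover the bytes that were originally passed on the command line.
        raw = os.fsencode(arg)
        return f"<invalid: {raw!r}>"


if __name__ == "__main__":
    for arg in sys.argv:
        print(describe(arg))
```

Fed the poisoned argument string, this version should print a description of the bad byte instead of dying with a traceback.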

Example — Sw�ft

Swift takes a different approach. Unfortunately its solution is to silently replace invalid byte values with the Unicode replacement character, � (U+FFFD).

We can see this by feeding our poisoned argument string to this simple script:

for arg in CommandLine.arguments {
    print(arg)
}

Compiling with Swift 5.2 and running the resulting binary gives me the following output:

$ echo -e 'foo \xFF bar' | xargs .build/debug/args
.build/debug/args
foo
�
bar

This isn't a bad choice for default behaviour, but as far as I can tell it isn't documented anywhere; it just happens.

Takeaway

Lots of programming languages seem to struggle with this issue; it's not just Python and Swift. I think the takeaway is that if a programming language wants to treat command line arguments as "strings" then it needs to make the conversion explicit and explicitly allow for failure.

To take Python as an example, sys.argv could be deprecated and replaced by a sys.args() function with explicit options and exceptions for handling invalid input.
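As a rough sketch of what such a function might look like: everything here is hypothetical, including the args() name, its errors parameter and the InvalidArgumentError exception. None of it exists in Python's standard library. It also assumes that arguments are meant to be UTF-8 and that os.fsencode() can recover the original bytes, which holds on Unix.

```python
import os
import sys
from typing import List, Optional


class InvalidArgumentError(ValueError):
    """Hypothetical exception: an argument is not valid UTF-8."""


def args(errors: str = "strict", argv: Optional[List[str]] = None) -> List[str]:
    """Return the command line arguments, decoded explicitly as UTF-8.

    errors="strict" raises InvalidArgumentError on invalid input, while
    errors="replace" substitutes U+FFFD, much like Swift's default.
    """
    source = sys.argv if argv is None else argv
    result = []
    for index, arg in enumerate(source):
        raw = os.fsencode(arg)  # back to the bytes the OS handed us
        try:
            result.append(raw.decode("utf-8", errors=errors))
        except UnicodeDecodeError as exc:
            raise InvalidArgumentError(
                f"argument {index} is not valid UTF-8: {raw!r}"
            ) from exc
    return result


if __name__ == "__main__":
    try:
        for arg in args():
            print(arg)
    except InvalidArgumentError as exc:
        sys.exit(f"error: {exc}")
```

The point of the design is that the caller has to pick a policy up front, so a bad argument fails at the point of conversion and not at some later print() call.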