Prefer Parsing Over Validating: Let The Compiler Help You Remove Edge Cases

In this blog post, I’ll discuss a technique that I’ve used to create expressive APIs that are more robust with the help of the compiler. I’m going to demonstrate a simple example of this technique in Java but anything you are about to read can be applied to any other statically typed language.

Let’s start by quickly defining “parsing” and “validating”. In my mind, validation is about analyzing input and reporting if it adheres to a certain set of validation rules:

String getEmail() {
    String email = // get it somehow
    if (!isEmail(email)) {
        // handle the error
    }

    return email;
}

Parsing is a beefed up form of validation. Instead of just reporting a true/false answer, it generates a more precise representation of the input for later use. Here is how we could implement the above using a parsing approach:

class Email {
    private final String email;

    public Email(String email) {
        if (email == null) {
            throw new IllegalArgumentException("null email");
        }
        this.email = email;
    }

    public String getDomain() {
        // everything after the last '@'; assumes the string already passed isEmail
        return email.substring(email.lastIndexOf('@') + 1);
    }

    // other member functions
}

// the parsing function
Optional<Email> getEmail() {
    String email = // get it somehow
    
    return isEmail(email) 
        ? Optional.of(new Email(email))
        : Optional.empty();
}

There is definitely more code up front but when a client calls getEmail, they know they either have a valid email or they don’t. In case they don’t, an empty Optional is provided. In case the input was an email, parsing has spit out an Email object that only represents a valid email; we’ve taken in a String which can represent any unstructured data and transformed that data into a more precise form. You can think of parsing and validating as mapping functions that take an input (usually a string) and map it to an output; validation outputs a boolean whereas parsing outputs another data structure. It’s also important to note that parsing has an implicit boolean in its return type; if nothing is returned, we can take that as meaning parsing has failed but if something is returned, we take that as a successful parse along with our shiny new data. We get the best of both worlds! Not to mention that if you read my previous post about booleans, you’ll find out that they carry almost no useful information so we just avoid them where possible.
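One caveat with the Email class above: its constructor is public and only rejects null, so nothing stops someone from writing new Email("definitely not an email"). A possible tightening, sketched below, is to make the constructor private and let a static parse method be the only way in. The names ParsedEmail and parse are my own for this sketch, and the "exactly one @" rule is a deliberately simplistic stand-in for whatever isEmail really checks.

```java
import java.util.Optional;

// Hypothetical variant of Email: the constructor is private, so the only way
// to obtain an instance is through parse(), which runs the check exactly once.
final class ParsedEmail {
    private final String local;
    private final String domain;

    private ParsedEmail(String local, String domain) {
        this.local = local;
        this.domain = domain;
    }

    // Simplistic rule (an assumption for this sketch): exactly one '@' with
    // non-empty text on both sides. Real address grammar is far more involved.
    static Optional<ParsedEmail> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        int at = raw.indexOf('@');
        boolean exactlyOneAt = at > 0
                && at == raw.lastIndexOf('@')
                && at < raw.length() - 1;
        if (!exactlyOneAt) {
            return Optional.empty();
        }
        return Optional.of(new ParsedEmail(raw.substring(0, at), raw.substring(at + 1)));
    }

    // No re-checking needed: the split already happened during parsing.
    String getDomain() {
        return domain;
    }

    @Override
    public String toString() {
        return local + "@" + domain;
    }
}
```

With this shape, ParsedEmail.parse("alice@example.com") yields an Optional holding a value whose getDomain() is "example.com", while ParsedEmail.parse("not-an-email") yields an empty Optional. Holding a ParsedEmail at all is the evidence that the check ran.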

Parsing points out a shortcoming of validation: how do you determine if a certain entity is what it claims to be in the current scope? Imagine you are writing a function f(String email). How do you know if the email you receive really is an email? Just because the parameter is called email does not make its contents a valid email. Do you have to check that yourself in the current scope or can you assume someone up the call stack already did it for you? There’s no good way to know without either referring to documentation or going back to check for yourself. Neither of these options is ideal because:

if you aren’t the owner of f, then you may not have the source code of f to check the implementation.
if you aren’t the owner of f, then the owner of f may not mention anything about the error conditions in the documentation.

Even as the owner of f, relying on documentation isn’t ideal because as Uncle Bob tells us:

Don’t use a comment when you can use a function or a variable.

I’m going to extend that statement and say:

Don’t use a comment when you can use a function, a variable, or a new type.

With this background in mind, let’s dive into how we can practically apply this information.

Let’s start with a simplified, real-world example I’ve run into. Consider a function that can join a list of strings together given another string like this:

/**
* Given strs, join them all using joiner.
*
* For example, invoking join1(List.of("abc", "def"), ",") gives the string "abc,def"
*
* @param strs the list of strings to join. The list is expected to be non-null and non-empty.
* @param joiner the string to join with. joiner is expected to be non-null
* @return a joined string
*/
private String join1(List<String> strs, String joiner) {
    StringBuilder sb = new StringBuilder(strs.get(0));
    for (int i = 1; i < strs.size(); ++i) {
        sb.append(joiner).append(strs.get(i));
    }

    return sb.toString();
}

(As an aside, I know it’s possible to implement join on an empty list but let’s assume we won’t support an empty list for this example.)

join1 works fine for most inputs:

String s1 = join1(List.of("abc", "def"), ","); // OK! s1 == "abc,def"
String s2 = join1(List.of("abc", "def"), ""); // OK! s2 == "abcdef"
String s3 = join1(List.of("abc"), ","); // OK! s3 == "abc"
String s4 = join1(List.of(), ","); // ERROR! ArrayIndexOutOfBoundsException

Wait a minute! Why does the last line throw? Look at the first line of the join1 function; we’re initializing our StringBuilder with the first element of the list. In an empty list, there is no first element so we have no choice but to throw. It’s clear the function is buggy and needs to be updated.

Or does it?

If you pay close attention to the documentation of the function, we clearly state:

The list is expected to be non-null and non-empty.

and

joiner is expected to be non-null

We designed our function by contract: we gave our clients a set of preconditions they need to adhere to and, surprise surprise, the client didn’t pay attention. Technically, we could stop here and just demand that our clients pay attention to the documentation but that’s unsatisfactory to any decent developer. Instead, we want to be good developers who practice defensive programming and follow the principle of least astonishment (POLA), since we know our users might not read our documentation. Let’s try adding in some defensive checks to join1:

/**
* Given strs, join them all using joiner.
*
* For example, invoking join2(List.of("abc", "def"), ",") gives the string "abc,def"
*
* @param strs the list of strings to join. The list is expected to be non-null and non-empty.
* @param joiner the string to join with. joiner is expected to be non-null.
* @return a joined string
*
* @throws IllegalArgumentException if either argument is null or if strs is empty
*/
private String join2(List<String> strs, String joiner) {
    if (strs == null || strs.isEmpty()) {
        throw new IllegalArgumentException("null or empty strs");
    }

    if (joiner == null) {
        throw new IllegalArgumentException("null joiner");
    }

    StringBuilder sb = new StringBuilder(strs.get(0));
    for (int i = 1; i < strs.size(); ++i) {
        sb.append(joiner).append(strs.get(i));
    }

    return sb.toString();
}

Now we explicitly handle the corner cases and we’ve modified our contract to state we’ll throw in case of error conditions. You could argue our function is now more robust but I don’t think we did much to improve the quality or robustness of our code. In fact, I’d argue that join2 is a regression in quality over join1 since:

join2 almost doubled in length with none of the new code doing any real work; it’s just checking for error conditions.
Even worse, our extra code didn’t make our function any more robust to errors; we’ve just changed the exception that will be thrown.

It turns out join2 didn’t solve the real problem with join1: our contract is too open and we’re accepting any list when in reality, our function can’t work with any list. What if there were a way to narrow our contract to only accepting non-empty lists while having the compiler enforce this constraint at compile time?

Consider this implementation instead:

private String join3(NonEmptyList<String> strs, String joiner) {
    StringBuilder sb = new StringBuilder(strs.get(0));
    for (int i = 1; i < strs.size(); ++i) {
        sb.append(joiner).append(strs.get(i));
    }

    return sb.toString();
}

We’ll go over the implementation of NonEmptyList later but for now, assume it’s a list that maintains the invariant that it cannot be empty. With that in hand, our function is now safer, shorter, and every line of code is doing useful work. We’ve also made it clear that our list cannot be empty from the signature of the function itself instead of relying on documentation. Let’s run it against the same inputs as above.

String s1 = join3(NonEmptyList.of("abc", "def"), ","); // OK! s1 == "abc,def"
String s2 = join3(NonEmptyList.of("abc", "def"), ""); // OK! s2 == "abcdef"
String s3 = join3(NonEmptyList.of("abc"), ","); // OK! s3 == "abc"

Looks good!

What if we did this?

String wontCompile = join3(List.of("foo"), ","); // compiler error!

In order to understand why this is a compiler error, it’s time to dive into the implementation of NonEmptyList.

NonEmptyList is a decorator of java.util.List with a trivial constructor:

public class NonEmptyList<T> implements List<T> {

    private final List<T> list;

    public NonEmptyList(List<T> list) {
        if (list == null || list.isEmpty()) {
            throw new IllegalArgumentException("null or empty list");
        }

        this.list = list;
    }
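One thing to watch with the constructor above: it stores the caller's list as-is, so whoever still holds the original reference can empty it after the check has passed. A defensive copy closes that gap. The sketch below uses a cut-down stand-in class (CopiedNonEmptyList is my own name and keeps only size and get) to show the difference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that keeps only the part of NonEmptyList relevant here.
final class CopiedNonEmptyList<T> {
    private final List<T> list;

    CopiedNonEmptyList(List<T> list) {
        if (list == null || list.isEmpty()) {
            throw new IllegalArgumentException("null or empty list");
        }
        // Copy instead of aliasing: later changes to the caller's list
        // can no longer empty this one.
        this.list = new ArrayList<>(list);
    }

    int size() {
        return list.size();
    }

    T get(int i) {
        return list.get(i);
    }
}

class DefensiveCopyDemo {
    public static void main(String[] args) {
        List<String> source = new ArrayList<>(List.of("abc"));
        CopiedNonEmptyList<String> strs = new CopiedNonEmptyList<>(source);

        source.clear(); // mutates the caller's list only

        System.out.println(strs.size()); // prints 1
    }
}
```

The copy costs one allocation per construction, which seems like a fair price for an invariant that cannot be broken from the outside.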

For most of the implementation, it simply delegates to its member list. For example:

@Override
public int size() {
    return list.size();
}

There are some notable exceptions though to ensure the list maintains its invariant:

@Override
public boolean remove(Object o) {
    if (list.size() == 1) {
        throw new IllegalStateException("singleton list");
    }

    return list.remove(o);
}

@Override
public T remove(int i) {
    if (list.size() == 1) {
        throw new IllegalStateException("singleton list");
    }
    return list.remove(i);
}

@Override
public boolean removeAll(Collection<?> collection) {
    throw new UnsupportedOperationException("removeAll() unsupported");
}

@Override
public boolean retainAll(Collection<?> collection) {
    throw new UnsupportedOperationException("retainAll() unsupported");
}

@Override
public void clear() {
    throw new UnsupportedOperationException("clear() unsupported");
}
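The overrides above cover the direct removal methods, but a List can also lose elements through iterator().remove(), listIterator(), subList(...).clear(), or the default removeIf. The article does not show how those are handled, so treat the following as an assumption: if they are simply delegated to the inner list, they bypass the guards. One way to funnel every removal through a single guarded method is to extend java.util.AbstractList, whose iterators and sublists call back into remove(int). GuardedNonEmptyList is a name I made up for this sketch.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical alternative to hand-writing every List method: AbstractList
// builds its iterators, sublists, and bulk operations on get/size/set/add/remove.
final class GuardedNonEmptyList<T> extends AbstractList<T> {
    private final List<T> list;

    GuardedNonEmptyList(List<T> list) {
        if (list == null || list.isEmpty()) {
            throw new IllegalArgumentException("null or empty list");
        }
        this.list = new ArrayList<>(list); // defensive copy
    }

    @Override
    public T get(int i) {
        return list.get(i);
    }

    @Override
    public int size() {
        return list.size();
    }

    @Override
    public T set(int i, T element) {
        return list.set(i, element);
    }

    @Override
    public void add(int i, T element) {
        list.add(i, element);
        modCount++; // keeps AbstractList's iterators fail-fast
    }

    // The single choke point: every inherited removal path ends up here.
    @Override
    public T remove(int i) {
        if (list.size() == 1) {
            throw new IllegalStateException("singleton list");
        }
        T removed = list.remove(i);
        modCount++;
        return removed;
    }

    @Override
    public void clear() {
        throw new UnsupportedOperationException("clear() unsupported");
    }
}

class GuardedNonEmptyListDemo {
    static void removeLastElementViaIterator() {
        List<String> strs = new GuardedNonEmptyList<>(List.of("abc"));
        Iterator<String> it = strs.iterator();
        it.next();
        it.remove(); // throws IllegalStateException("singleton list")
    }
}
```

A trade-off worth knowing: inherited bulk operations such as removeIf or removeAll may remove some elements before the guard fires on the last one, so a failed call can leave the list partially modified. The invariant itself still holds.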

There is also a helper method for constructing a NonEmptyList directly from its elements:

@SafeVarargs
public static <T> NonEmptyList<T> of(T head, T... tail) {
    List<T> list = Stream.concat(Stream.of(head), Arrays.stream(tail))
            .collect(Collectors.toList());

    return new NonEmptyList<>(list);
}

For bonus points, see if you can explain why of takes two arguments (head and tail) instead of just one tail argument to create a NonEmptyList.

By now, it should be clear why join3(List.of("foo"), ",") failed to compile: join3 expects a NonEmptyList, but List.of returns a List. A List is not necessarily a NonEmptyList, so the compiler rejects the call and provides us a semi-helpful explanation as to why:

no instance(s) of type variable(s) E exist so that List<E> conforms to NonEmptyList<String>.
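So how does a caller holding a plain List get across this boundary? The constructor works but throws on bad input. In keeping with the parsing idea from the Email example, a factory that returns an Optional is another option. The sketch below is self-contained, so it uses a pared-down, read-only stand-in called NonEmpty rather than the full NonEmptyList; fromList and joinOrDefault are names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Pared-down, hypothetical stand-in for NonEmptyList; read-only for brevity.
final class NonEmpty<T> {
    private final List<T> items;

    private NonEmpty(List<T> items) {
        this.items = items;
    }

    // The parsing step: a plain List goes in, and either a NonEmpty or an
    // empty Optional comes out. No exception is involved.
    static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
        if (list == null || list.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new NonEmpty<>(new ArrayList<>(list)));
    }

    T get(int i) {
        return items.get(i);
    }

    int size() {
        return items.size();
    }
}

final class JoinAtTheBoundary {
    // Same body as join3, written against the stand-in type.
    static String join(NonEmpty<String> strs, String joiner) {
        StringBuilder sb = new StringBuilder(strs.get(0));
        for (int i = 1; i < strs.size(); ++i) {
            sb.append(joiner).append(strs.get(i));
        }
        return sb.toString();
    }

    // Boundary code: the caller decides once what an empty input means.
    static String joinOrDefault(List<String> raw, String joiner, String fallback) {
        return NonEmpty.fromList(raw)
                .map(strs -> join(strs, joiner))
                .orElse(fallback);
    }
}
```

Here joinOrDefault(List.of("abc", "def"), ",", "<none>") gives "abc,def", and the same call with List.of() gives "<none>". The empty case is handled exactly once, at the edge, and join never has to think about it.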

So let’s review.

We’ve created a new class NonEmptyList that can only be constructed with non-empty, non-null lists.
We’ve used NonEmptyList to make our function more expressive by virtue of the function’s signature.
We’ve used NonEmptyList to make our function shorter and enforce that a list is non-empty at compile time.

There are also additional, secondary benefits that we gain:

We’ve bound the non-empty check to one place in the code: the NonEmptyList constructor. This means that any time a user receives a NonEmptyList, they can be assured the list has at least one element in it and they don’t have to write code or check the documentation to verify. The type is the validation.
Since there will likely be less validation code, your code is (probably) going to run faster while looking a lot cleaner.
We can reuse NonEmptyList across API boundaries to enforce our NonEmptyList constraint.
We also communicate, indirectly, that any function taking a list that is not a NonEmptyList cannot assume that the list is non-empty.

As usual in software development, parsing is not a panacea since validation still has its place. Validations are useful at the outer boundary of a program to verify user input or to make a decision based on user input. Or we may want to validate a value for security reasons. For example, it’s standard practice in C to ensure that the length of a string is at most 1 less than the size of the buffer we’ve prepared for it, which leaves room for the terminating null character. Remember, validations on their own are fine in a localized scope; it’s when they start permeating through API boundaries that they become problematic to keep track of.

Other potential issues include scalability: we’ve written a whole new class to encapsulate a single idea. This isn’t scalable since we can end up with a large assortment of simple classes that are really encapsulating a single constraint. Imagine writing a NonEmptySet, NonEmptyMap, and so on! In my experience though, this is hardly an issue as the amount of time spent writing a trivial decorator is far, far outweighed by the benefits of being clearer with your domain. Also, it is generally not a good idea to pass around raw data structures in your code. We’ve done this here to demonstrate an idea and keep the examples simple, but data structures with invariants to be maintained should be properly encapsulated where possible.

If you wish to try this out in your next project, my advice to you would be to do a little analysis on your project and find one data structure or domain that has a lot of validation checks. Try this technique with that one data structure and turn the validation into parsing. Incorporate the result of that parsing into your project. This is likely going to require you to do a fair amount of rewriting. When you’re done, step back and take a look at what your results are. Has your code become more concise, less repetitive, and more expressive? Let me know!