Do not overuse primitive data types
By Jacobo Tarrío
July 6, 2021

Ask a modern programmer to design a program for you, and they’ll soon give you a diagram full of classes named Customer and Employee and User and UserDAO and ConnectionProvider and ConnectionProviderFactory and AbstractConnectionProviderFactoryImpl.

Ok, ok, you’re right: only Java programmers will get so mired down in this taxonomy hell, but it is true that, nowadays, every programmer knows how to use some basic object orientation principles. Even with non-OOP languages, coders will create and use structs or records or Types or whatever mechanism the language provides to create new data types.

However, even though we programmers in 2021 tend to consider ourselves wiser and more enlightened than our forebears, our programs still contain vestiges of the olden days when Fortran and COBOL and BASIC only gave us a small set of primitive data types. Every time we want to store a primary key for a database record, we put it in a long or string-typed variable. If we want to pass a timestamp to a function, we use an int or long argument. What about a function that returns a UUID? We use a string. A text string that we just got from the user? A text string that is ready to insert in an HTML document? Both are strings.

We use the same data types, over and over again, for things as different from one another as primary keys from several tables, timestamps in seconds and milliseconds, URIs, and UUIDs. This is an extremely common practice that is, also, an extremely common cause of bugs. Who hasn’t ever written some code that tries to look up a record in the wrong table, or that compares seconds and microseconds, or that inserts an unescaped text string in an HTML document?

In this post, I describe three strategies for dealing with such values and avoiding mistakes like the ones above.

Use new types for new entities

Have you heard of the Internet Archive? It’s a non-profit organization that wants to create an “Internet Library” by archiving every webpage in the world, to preserve them so we can look at them in the future.

As you might guess, the Internet Archive does a lot of work with URIs. Every webpage has its own URI and also contains other URIs for all the other webpages it links to. In the Archive, webpages are indexed by their URIs so users can easily access them. In summary, for the Internet Archive, URIs are its currency.

Programmers are always tempted to store URIs in variables of type string. We see a sequence of alphanumeric characters and we think: “ooh, that’s easy: it’s a text string, therefore it’s of type string.” However, URIs aren’t just text strings. URIs have certain rules they must follow at all times: they must have a certain format, there are some characters that are not valid, there are escape sequences, etc.

When we store a URI in a string-typed variable, the system knows nothing about those rules. The variable might contain a malformed URI, or even something that is no URI at all, and nothing would warn us until we try to use it and something breaks. Therefore, we need to write (and always use) code that, every time we assign a new value to the variable, will validate and normalize it. Should we fail to do that every single time, we would expose ourselves to a multitude of possible programming errors. We might not be able to find the page the user is looking for, we might open our page to an XSS attack, or something in between.

To solve this problem most effectively, we need to acknowledge the fact that a URI is not a text string that we can handle using only string-typed variables; we need to use a specific data type that will know and enforce all the rules and format of a URI. This URI data type must provide operations to convert a text string into an object of type URI by applying all the validation and normalization rules, and it must also provide operations to extract the different parts of the URI and get a representation of the URI in the form of a text string.

Most modern programming languages provide a URI type in their standard library (in Java, for example, java.net.URI), so we only need to remember to use it instead of string anywhere that we deal with URIs.

The Internet Archive’s home page has a text box where the user can write a URI to see what that URI contained at a time in the past. The web server receives the web request, which contains the URI in the form of a string; the first thing the web server does is to convert that string into an object of type URI. From this point on, the URI is validated and formatted, so the server can just pass it along to the storage subsystem, which can retrieve the content of the webpage.
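As a rough illustration of that first conversion step, here is a minimal sketch using Java's java.net.URI. The helper name parseUserInput and the rule that only absolute URIs are accepted are assumptions made for this example; they say nothing about how the Internet Archive actually does it.

```java
import java.net.URI;

class UriDemo {
    // Hypothetical helper: turns raw user input into a validated, normalized URI.
    static URI parseUserInput(String raw) {
        // URI.create throws IllegalArgumentException if the text is not a valid URI.
        URI uri = URI.create(raw.trim());
        // Assumption for this sketch: we only accept absolute URIs (with a scheme).
        if (!uri.isAbsolute()) {
            throw new IllegalArgumentException("Not an absolute URI: " + raw);
        }
        // normalize() removes "." and ".." segments from the path.
        return uri.normalize();
    }

    public static void main(String[] args) {
        URI uri = parseUserInput(" http://example.com/a/./b/../c ");
        System.out.println(uri);           // http://example.com/a/c
        System.out.println(uri.getHost()); // example.com
        System.out.println(uri.getPath()); // /a/c
    }
}
```

From this point on, any function that takes a URI parameter instead of a String can rely on the value being well-formed.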


Sometimes we can’t use classes provided by our standard library, so we have to write them ourselves. For example, several years ago, I worked on a project whose database indexed its objects by an identifier in the format HHHHHHHH-V, where each H was a hexadecimal digit and V was a version number with one or more decimal digits.

The original authors of the system hadn’t agreed on a single way to handle those identifiers. In some parts of the program, the identifier was stored in a string variable. In other parts, the identifier was split in two: a string that contained the hexadecimal digits, and an int that contained the version number. Our code often looked a bit like this:

public List<String> findLinkedIds(String id)
        throws IllegalArgumentException, NotFoundException {
    if (!isValid(id)) {
        throw new IllegalArgumentException("id");
    }
    String hexa = extractHexa(id);
    int version = extractVersion(id);
    Record record = lookupRecord(hexa, version);
    List<String> linkedIds = new ArrayList<>();
    for (Record linked : record.getLinkedRecords()) {
        linkedIds.add(linked.getHexa() + "-" + linked.getVersion());
    }
    return linkedIds;
}

public Record lookupRecord(String hexa, int version) {
    Query query = createQuery("SELECT * FROM Records WHERE hexa=? AND version=?",
                              hexa, version);
    return transform(query.execute());
}

As you can see, it was a mess. We spent half of our time validating identifiers and splitting them in two and reassembling them and chasing bugs caused by places where we had forgotten to validate or where we had passed an identifier when we needed just the hexadecimal part or vice versa.

The solution to this problem came after we admitted that an identifier was a new type and not just a text string, and we created a new class for storing and handling identifiers.

public class Identifier {
    private String hexa;
    private int version;

    private Identifier(String hexa, int version) {
        this.hexa = hexa;
        this.version = version;
    }

    public String getHexa() { return hexa; }
    public int getVersion() { return version; }
    public String toString() { return hexa + "-" + version; }

    public static Identifier create(String id) throws IllegalArgumentException {
        // Validate and extract the parts of the identifier
        ...
        String hexa = ...;
        int version = ...;
        return new Identifier(hexa, version);
    }
}
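The validation step is elided above. Purely as a sketch of what it could look like, the version below fills it in with a regular expression. The exact pattern, the lowercasing of the hexadecimal part, and the added equals and hashCode methods are assumptions of this example, not the original project's code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Identifier {
    // Assumed format: eight hexadecimal digits, a hyphen, one or more decimal digits.
    private static final Pattern FORMAT = Pattern.compile("([0-9A-Fa-f]{8})-([0-9]+)");

    private final String hexa;
    private final int version;

    private Identifier(String hexa, int version) {
        this.hexa = hexa;
        this.version = version;
    }

    public static Identifier create(String id) {
        if (id == null) {
            throw new IllegalArgumentException("Identifier is null");
        }
        Matcher m = FORMAT.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed identifier: " + id);
        }
        // Assumption: identifiers are case-insensitive, so we normalize to lowercase.
        String hexa = m.group(1).toLowerCase(Locale.ROOT);
        // parseInt throws NumberFormatException (a kind of IllegalArgumentException)
        // if the version does not fit in an int.
        int version = Integer.parseInt(m.group(2));
        return new Identifier(hexa, version);
    }

    public String getHexa() { return hexa; }
    public int getVersion() { return version; }

    @Override public String toString() { return hexa + "-" + version; }

    // Value semantics, so identifiers can be compared and used as map keys.
    @Override public boolean equals(Object o) {
        if (!(o instanceof Identifier)) return false;
        Identifier other = (Identifier) o;
        return hexa.equals(other.hexa) && version == other.version;
    }

    @Override public int hashCode() { return 31 * hexa.hashCode() + version; }
}

class IdentifierDemo {
    public static void main(String[] args) {
        Identifier id = Identifier.create("0123ABCD-12");
        System.out.println(id);              // 0123abcd-12
        System.out.println(id.getVersion()); // 12
    }
}
```

Because the constructor is private, the only way to obtain an Identifier is through create, so every instance in the program has already been validated.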

And, with this, we could simplify our code, avoid the constant validations, and handle identifiers consistently throughout the program, removing hundreds of lines of code and opportunities for bugs.

public List<Identifier> findLinkedIds(Identifier id)
        throws NotFoundException {
    Record record = lookupRecord(id);
    List<Identifier> linkedIds = new ArrayList<>();
    for (Record linked : record.getLinkedRecords()) {
        linkedIds.add(linked.getIdentifier());
    }
    return linkedIds;
}

public Record lookupRecord(Identifier id) {
    Query query = createQuery("SELECT * FROM Records WHERE hexa=? AND version=?",
                              id.getHexa(), id.getVersion());
    return transform(query.execute());
}

Use different types for different entities

In the world of databases, there are two types of people: those who use INTs for primary keys, and those who use UUIDs. Whatever type of person your DBA is, your code is going to be full of variables, all belonging to the same type, that contain primary keys for several different tables.

long idPost = getLongParam("idPost");
long idComment = getLongParam("idComment");
Comment comment = loadComment(idPost, idComment);
long idUser = comment.userId();
User user = loadUser(idPost);
outputTemplate("comment", user, comment);

This could be an excerpt from the source code of a CMS: a function that handles the response to an HTTP GET operation to display a comment to a post. This function receives a post identifier and a comment identifier, loads the comment and the data of the user who posted the comment, and outputs them as HTML to display them in the browser.

The database uses integer numbers for the primary keys, and the language stores them in variables of type long. There is one key for the post, another for the comment, and yet another one for the user. And, since all three variables have the same type, it’s very easy to mix them up without noticing. In fact, the code above has a bug. How long does it take you to find it?

Even though all primary keys are stored in variables belonging to the same type, they are not interchangeable. If we pass a user id to a function that expects a post id, this function will give us an incorrect result, and we will not catch the slip-up until we run the program and see the error. Or, even worse, until a user sees it and notifies us. Or, worst of all, we never realize and end up with corrupted data.

If those keys are not interchangeable in practice, they shouldn’t be interchangeable in the code either. We can achieve this very easily using different classes for each different table’s primary keys.

IdPost idPost = new IdPost(getLongParam("idPost"));
IdComment idComment = new IdComment(getLongParam("idComment"));
Comment comment = loadComment(idPost, idComment);
IdUser idUser = comment.userId();
User user = loadUser(idPost); // The compiler reports an error here: loadUser expects an IdUser.
outputTemplate("comment", user, comment);

It’s extremely easy to create these classes in most modern programming languages. For example, in Java we can create a base class that incorporates all the functionality and then add one new line of code for each new type.

public abstract class LongId {
    private long id;
    public LongId(long id) { this.id = id; }
    public long getId() { return id; }
    public String toString() { return getClass().getSimpleName() + "=" + id; }
}

public class IdPost extends LongId { public IdPost(long id) { super(id); } }

public class IdComment extends LongId { public IdComment(long id) { super(id); } }

public class IdUser extends LongId { public IdUser(long id) { super(id); } }
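One practical detail when wrapping keys this way: if the ids are going to be compared or used as map keys, the base class also needs equals and hashCode. The following is a minimal sketch of that extension; the choice to treat ids of different subclasses as never equal is an assumption of this example.

```java
import java.util.HashMap;
import java.util.Map;

abstract class LongId {
    private final long id;

    protected LongId(long id) { this.id = id; }

    public long getId() { return id; }

    // Two ids are equal only if they have the same concrete class and the same value,
    // so IdPost 42 never equals IdUser 42.
    @Override public boolean equals(Object o) {
        return o != null && getClass() == o.getClass() && id == ((LongId) o).id;
    }

    @Override public int hashCode() {
        return 31 * getClass().hashCode() + Long.hashCode(id);
    }

    @Override public String toString() {
        return getClass().getSimpleName() + "=" + id;
    }
}

final class IdPost extends LongId { IdPost(long id) { super(id); } }

final class IdUser extends LongId { IdUser(long id) { super(id); } }

class LongIdDemo {
    public static void main(String[] args) {
        Map<IdPost, String> titles = new HashMap<>();
        titles.put(new IdPost(42), "Hello");
        // A fresh IdPost with the same value finds the entry.
        System.out.println(titles.get(new IdPost(42)));            // Hello
        System.out.println(new IdPost(42).equals(new IdUser(42))); // false
    }
}
```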

This technique is also useful in data processing tasks, where we might handle data in several different stages of processing and we don’t want to mix them up. The most common use nowadays is in avoiding XSS in web apps.

Many web apps need to receive a text string from the user, process it in some way, and finally display it in a web page. If you aren’t careful, you will just stick that text string directly in the HTML and create an XSS vulnerability; at a minimum, you need to escape that text string before inserting it into the HTML. Sometimes, web app programmers will lose track of their strings and will escape a string twice, and then users will see weird stuff like “r&eacute;sum&eacute;” instead of “résumé”. Other times, they will use the wrong type of escape, such as a SQL escape, and users will end up seeing “O\'Connell” instead of “O'Connell”.

To avoid this problem, many modern web frameworks don’t let programmers insert a string directly. Instead, they force us to use special types that contain already-escaped text strings. We can create an instance of one of those types from an unescaped string, and there is no way to make a mistake from that point on: wherever we use that class, it contains a string that is already escaped and ready to insert into an HTML page.
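To make the idea concrete, here is a minimal sketch of such a type. The names EscapedHtml and HtmlTemplates are invented for this example and do not correspond to any particular framework. The escaping shown covers only text placed in HTML element content or quoted attribute values; other contexts, such as JavaScript or URLs, need different rules.

```java
final class EscapedHtml {
    private final String html;

    // Private constructor: the only way in is through fromPlainText.
    private EscapedHtml(String html) { this.html = html; }

    // Escapes the five characters that are significant in HTML text and attributes.
    static EscapedHtml fromPlainText(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return new EscapedHtml(sb.toString());
    }

    String asHtml() { return html; }
}

class HtmlTemplates {
    // Accepts only EscapedHtml, so passing a raw String is a compile-time error.
    static String paragraph(EscapedHtml body) {
        return "<p>" + body.asHtml() + "</p>";
    }

    public static void main(String[] args) {
        EscapedHtml name = EscapedHtml.fromPlainText("<b>O'Connell</b> & co");
        System.out.println(paragraph(name));
        // <p>&lt;b&gt;O&#39;Connell&lt;/b&gt; &amp; co</p>
    }
}
```

Since fromPlainText takes a String and not an EscapedHtml, escaping the same text twice by accident also becomes a compile-time error.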

Use a single type for a single entity

It’s very common to use ints and longs to represent time intervals. However, we haven’t agreed on which unit to use. The Unix operating system uses seconds, but the Java and JavaScript programming languages use milliseconds. I’ve worked on systems that used microseconds and nanoseconds. Very often, we need to use different units in different parts of the same program, depending on who wrote the code or which function takes the argument.

The problem is that all those time intervals are represented as a plain long, without any kind of indication of the time unit being used, so it’s very easy to pass a number of milliseconds to a function that expects seconds, or subtract microseconds from nanoseconds, or perform other similar nonsensical operations by mistake.

We might be tempted to try to solve this problem by creating a new type for each unit. A class Seconds would store an interval measured in seconds, a class Milliseconds would store another interval measured in milliseconds, and so forth. In this way, we could not mix different units without the compiler giving us an error.

The problem is that, very frequently, we need to convert between different units. Sometimes we have a function that returns seconds, which we have to pass to another function that expects milliseconds, so we need to perform a conversion. Other times, we have to combine two intervals that could be measured in different units. We might try to solve this by adding conversion and addition and subtraction functions for each pair of units; however, with only four units, that would yield forty-four functions in total: each of the four classes needs three conversions, four additions, and four subtractions. That’s a lot of functions.

public class Seconds {
    public Milliseconds milliseconds() { return ... }
    public Microseconds microseconds() { return ... }
    public Nanoseconds nanoseconds() { return ... }
    public Seconds plus(Seconds other) { return ... }
    public Seconds plus(Milliseconds other) { return ... }
    public Seconds plus(Microseconds other) { return ... }
    public Seconds plus(Nanoseconds other) { return ... }
    public Seconds minus(Seconds other) { return ... }
    public Seconds minus(Milliseconds other) { return ... }
    public Seconds minus(Microseconds other) { return ... }
    public Seconds minus(Nanoseconds other) { return ... }
}
// Do this three more times for Milliseconds, Microseconds, and Nanoseconds

Frankly, this option is not sustainable. The amount of code we need to write as soon as we need to add one more unit or one more operation will quickly become gargantuan, and, moreover, most of it will consist of conversion operations.

The actual solution comes from realizing that the entity we are creating data types for is not the number of seconds, milliseconds, or microseconds. The entity is the time interval, and all those units are only different ways to measure it. We don’t need to create a new type for each time unit; we only need to create a single type for time intervals, which contains all the operations we need to express those intervals in the appropriate units.

For example, we could have a type Interval that contains a variable for the length of the interval in a convenient unit, and that also contains functions to express that interval in seconds, milliseconds, microseconds, and nanoseconds, along with other functions to perform the reverse conversion.

public class Interval {
    private double value;
    private Interval(double value) { this.value = value; }
    // Constructors
    public static Interval fromSeconds(long seconds) { return new Interval(seconds); }
    public static Interval fromMilliseconds(long milliseconds) { return new Interval(milliseconds / 1e3); }
    public static Interval fromMicroseconds(long microseconds) { return new Interval(microseconds / 1e6); }
    public static Interval fromNanoseconds(long nanoseconds) { return new Interval(nanoseconds / 1e9); }
    // Conversions
    public long seconds() { return (long) value; }
    public long milliseconds() { return (long) (value * 1e3); }
    public long microseconds() { return (long) (value * 1e6); }
    public long nanoseconds() { return (long) (value * 1e9); }
    // Operations
    public Interval plus(Interval other) { return new Interval(value + other.value); }
    public Interval minus(Interval other) { return new Interval(value - other.value); }
}

And that’s all the code we need to represent time intervals measured in four different units! Moreover, if we ever needed to add more units, we would only need to add two functions.
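One caveat with storing the value as a double is that a round trip through floating point is not always exact, so a cast back to long can occasionally land one unit short. As a possible variation, and only as a sketch, the same class can store a long count of the smallest unit instead. The choice of nanoseconds as the internal unit and the use of overflow-checked arithmetic are assumptions of this example.

```java
final class Interval {
    // Assumption of this sketch: the interval is stored as a whole number of nanoseconds.
    private final long nanos;

    private Interval(long nanos) { this.nanos = nanos; }

    // Constructors. Math.multiplyExact throws ArithmeticException on overflow.
    static Interval fromSeconds(long s) { return new Interval(Math.multiplyExact(s, 1_000_000_000L)); }
    static Interval fromMilliseconds(long ms) { return new Interval(Math.multiplyExact(ms, 1_000_000L)); }
    static Interval fromMicroseconds(long us) { return new Interval(Math.multiplyExact(us, 1_000L)); }
    static Interval fromNanoseconds(long ns) { return new Interval(ns); }

    // Conversions. Integer division truncates toward zero.
    long seconds() { return nanos / 1_000_000_000L; }
    long milliseconds() { return nanos / 1_000_000L; }
    long microseconds() { return nanos / 1_000L; }
    long nanoseconds() { return nanos; }

    // Operations, also overflow-checked.
    Interval plus(Interval other) { return new Interval(Math.addExact(nanos, other.nanos)); }
    Interval minus(Interval other) { return new Interval(Math.subtractExact(nanos, other.nanos)); }
}

class IntervalDemo {
    public static void main(String[] args) {
        // Mixing units is safe because each value carries its own meaning.
        Interval total = Interval.fromSeconds(2).plus(Interval.fromMilliseconds(500));
        System.out.println(total.milliseconds()); // 2500
        System.out.println(total.seconds());      // 2
    }
}
```

The trade-off is range: a long of nanoseconds covers roughly 292 years in either direction, which is plenty for timeouts but not for every use.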

Now we could use this class to replace, in our code, all those longs expressed in indeterminate units, and remove all the chances for mistakes and bugs.

// Old and busted.
// Is the timeout in seconds? Milliseconds? Fortnights?
Record lookupRecord(Identifier id, long timeout) { ... }

// New hotness.
Record lookupRecord(Identifier id, Interval timeout) { ... }

Many modern programming languages provide a class like Interval in their standard library. For example, Java has the java.time package, which provides the Duration class (along with another class, called Instant, which represents a particular moment in time). The C++ language provides the std::chrono namespace, with its classes duration and time_point. For other languages, there may be a third-party library that provides it. Always use those classes; never use longs for time.
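For instance, a short sketch with Java's java.time classes might look like this. The hasExpired function is a hypothetical helper written for this example; the Duration and Instant calls are the standard ones.

```java
import java.time.Duration;
import java.time.Instant;

class DurationDemo {
    // Hypothetical helper: the timeout's unit is no longer part of the contract.
    static boolean hasExpired(Instant start, Instant now, Duration timeout) {
        return Duration.between(start, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        // Durations built from different units combine without manual conversion.
        Duration timeout = Duration.ofSeconds(90).plus(Duration.ofMillis(500));
        System.out.println(timeout.toMillis()); // 90500

        // Equality is based on length, not on the unit used to build the value.
        System.out.println(Duration.ofMinutes(1).equals(Duration.ofSeconds(60))); // true

        Instant start = Instant.ofEpochSecond(0);
        System.out.println(hasExpired(start, start.plusSeconds(120), timeout)); // true
    }
}
```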

Conclusion

Sometimes, we resort to using primitive data types without thinking, when it might end up causing us terrible headaches. We should stop to think whether that number we are storing in that long is really just a number, or whether what’s in that string is only a text string, and create and use new data types whenever we find that they have some kind of nuance or restriction or invariant.

We humans know we cannot add two cherries and three oranges together. However, if computers only see two longs or two doubles, they will happily add them and divide them. It falls on us, humans who know that different entities belong to different types, to tell the computer about this difference by using different data types for each.

And, finally, we humans know that 60 seconds is the same thing as a minute, or that 5,000 meters are 5 kilometers; however, the computer only sees one variable that says “60” and another that says “1”, or a “5000” value and another “5”. It’s on us to tell the computer that both things are the same.

Next time you think that it would be cool if you could “annotate” or “mark” a number or a text string to treat it specially, try creating and using a new data type. I’m sure that it will make your programs more reliable and easier to read and modify.
