By David Wegman, CTO @ Valor
The next time you strike up a conversation with your friendly neighborhood computer, take note of how long it takes before you get frustrated. Despite the advances in artificial intelligence over the past decades -- and despite the incredible capacity humans have for adaptation -- human-computer interaction is unnatural (from the human perspective, anyway). Every touch point where people provide input to computers, or receive output from computers, is an opportunity for misunderstanding.
Even as our systems are getting smarter all the time, there are some simple steps we can take to eliminate ambiguity. Data architects serve an important role, helping to ensure that information is not lost or garbled in translation. These techniques are essentially an investment. Every minute spent on avoiding problems up front can save much more time down the road when things aren't working properly.
Which came first, 3/7/2017 or 5/4/2017? The answer depends on where you are in the world when asking the question. In the United States, dates are commonly represented as month/day/year, so these dates would usually be interpreted as March 7 and May 4, respectively. In many other countries, dates are represented as day/month/year, so they would be interpreted as July 3 and April 5, respectively.
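To make the ambiguity concrete, here is a minimal sketch in Python that reads the same two strings under both conventions. The variable names are illustrative, and the format strings are simply the month-first and day-first readings described above.

```python
from datetime import datetime

raw_dates = ["3/7/2017", "5/4/2017"]

# Month-first reading (common in the United States): March 7 and May 4.
month_first = [datetime.strptime(s, "%m/%d/%Y").date() for s in raw_dates]

# Day-first reading (common in many other countries): July 3 and April 5.
day_first = [datetime.strptime(s, "%d/%m/%Y").date() for s in raw_dates]

# The same text yields opposite answers to "which came first?".
first_is_earlier_month_first = month_first[0] < month_first[1]
first_is_earlier_day_first = day_first[0] < day_first[1]

print(first_is_earlier_month_first)  # True
print(first_is_earlier_day_first)    # False
```

Neither call raises an error, which is exactly why a wrong guess can go unnoticed.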
This becomes problematic when a data file that includes date information is read by a computer system. Each time the system encounters a field known to be a date, it must decide how to interpret the information. Fortunately, most modern systems allow us to choose the date format used for input and output. However, if the date format is not chosen carefully, it can result in one of the most pernicious types of errors in computer systems: one which does not raise a flag immediately and lies dormant for some time. A date which is incorrectly interpreted can result in a myriad of problems, as was widely publicized at the end of the last century.
Given enough data points, it may eventually become clear whether dates have been written starting with the month or day (e.g. if one of the values is 3/15/2017, the format cannot be day/month/year because 15 cannot refer to the month, so the format is probably month/day/year). This approach is suboptimal because it requires an additional step which is not guaranteed to work properly in all cases: if every day in the file happens to be 12 or lower, the data never settles the question. A better approach is to avoid the problem altogether by taking care when choosing a date format.
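The inference step described above might look something like the following sketch. The function name `infer_order` and its return labels are hypothetical, and the code assumes every value is three slash-separated integers with the year last. Note that it has to return `None` when the data cannot decide, which is the weakness of this approach.

```python
def infer_order(values):
    """Guess whether slash-separated dates are month-first or day-first.

    Returns "month-first", "day-first", or None when the values are
    consistent with both readings (or with neither).
    """
    month_first_possible = True
    day_first_possible = True
    for value in values:
        first, second, _year = (int(part) for part in value.split("/"))
        if first > 12:   # the first number cannot be a month
            month_first_possible = False
        if second > 12:  # the second number cannot be a month
            day_first_possible = False
    if month_first_possible and not day_first_possible:
        return "month-first"
    if day_first_possible and not month_first_possible:
        return "day-first"
    return None


print(infer_order(["3/7/2017", "5/4/2017"]))   # None
print(infer_order(["3/7/2017", "3/15/2017"]))  # month-first
```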
To eliminate ambiguity when working with dates, when possible, use the format YYYY/MM/DD. This represents a four-digit year, followed by a two-digit month, followed by a two-digit day. March 7, 2017 would be represented as 2017/03/07. This format is widely understood, and because no common convention places the day before the month after a leading year, it avoids the ambiguity that can occur when the year appears at the end. The ISO 8601 standard uses the same ordering with hyphens as separators (2017-03-07).
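As a short sketch, Python's standard `datetime` module can write and read this format with the pattern `%Y/%m/%d`. The sample dates are made up for illustration. One convenient side effect, not claimed in the text above but easy to verify, is that year-first, zero-padded dates sort chronologically even when compared as plain strings.

```python
from datetime import date, datetime

dates = [date(2017, 5, 4), date(2017, 3, 7), date(2016, 12, 25)]

# Write each date as year/month/day with zero padding.
formatted = [d.strftime("%Y/%m/%d") for d in dates]
print(formatted)  # ['2017/05/04', '2017/03/07', '2016/12/25']

# Reading the text back needs no guess about month versus day.
parsed = [datetime.strptime(s, "%Y/%m/%d").date() for s in formatted]

# Sorting the strings gives the same order as sorting the dates.
sorted_as_text = sorted(formatted)
sorted_as_dates = [d.strftime("%Y/%m/%d") for d in sorted(dates)]
print(sorted_as_text == sorted_as_dates)  # True
```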
A common method for storing tabular data is the CSV (comma-separated values) format. In a CSV file, each line contains one row of a table. Within each line, a delimiter character appears between adjacent values, demarcating the columns. The delimiter character is usually a comma or a tab.
A problem can arise when one of the values that needs to be stored contains the delimiter character. For example, a person's name may contain a comma (e.g. "Martin Luther King, Jr."). In this situation, a line containing this value will contain an extra delimiter character. Software which treats each occurrence of the delimiter as a new column may be confused by the fact that the number of columns is inconsistent from one line to another.
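The sketch below shows how this looks to software that naively splits each line on the delimiter. The sample rows and column names are invented for illustration.

```python
lines = [
    "name,birth_year",
    "Rosa Parks,1913",
    "Martin Luther King, Jr.,1929",
]

# Naive parsing: treat every comma as a column boundary.
naive_rows = [line.split(",") for line in lines]
column_counts = [len(row) for row in naive_rows]

print(column_counts)   # [2, 2, 3]
print(naive_rows[2])   # ['Martin Luther King', ' Jr.', '1929']
```

The third row gains a column, and the name is split into two separate values.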
One strategy is to choose a delimiter character which does not appear in any of the values. This technique can help minimize problems, but it is not guaranteed to avoid them completely, because data files created in the future may contain the chosen character. A better approach is to wrap values that may contain delimiter characters in double quotes, and to ensure that literal double quote characters inside a value are specifically labeled (or "escaped," in programmer speak). A backslash character is one common escape; many CSV tools, following RFC 4180, instead expect the quote to be doubled (""), so the program writing the file and the program reading it must agree on the convention. With quoting and escaping applied consistently, the data file will be parseable regardless of the data that needs to be stored.
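Here is one way the quoting approach could be sketched with Python's standard `csv` module. The sample rows are invented. By default the module escapes a literal quote by doubling it, so the sketch passes `doublequote=False` and `escapechar="\\"` to get backslash escaping instead, and uses `csv.QUOTE_ALL` to wrap every value in double quotes for simplicity. The reader must be given the same settings as the writer.

```python
import csv
import io

rows = [
    ["name", "remark"],
    ["Martin Luther King, Jr.", 'He said "hello"'],
]

# Shared settings: quote every value, escape literal quotes with a backslash.
settings = {"quoting": csv.QUOTE_ALL, "doublequote": False, "escapechar": "\\"}

# Write the rows to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer, **settings).writerows(rows)
text = buffer.getvalue()
print(text)

# Read the text back using the same settings.
parsed = list(csv.reader(io.StringIO(text), **settings))
print(parsed == rows)  # True
```

The comma inside the name and the quotes inside the remark both survive the round trip, and every row comes back with two columns.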