Why Eliminating Ambiguity in Your Data Matters

By David Wegman, CTO @ Valor

The next time you strike up a conversation with your friendly neighborhood computer, take note of how long it takes before you get frustrated.  Despite the advances in artificial intelligence over the past decades -- and despite the incredible capacity humans have for adaptation -- human-computer interaction is unnatural (from the human perspective, anyway).  Every touch point where people provide input to computers, or receive output from computers, is an opportunity for misunderstanding.

 

Even as our systems are getting smarter all the time, there are some simple steps we can take to eliminate ambiguity.  Data architects serve an important role, helping to ensure that information is not lost or garbled in translation.  These techniques are essentially an investment.  Every minute spent on avoiding problems up front can save much more time down the road when things aren't working properly.

 

Date formats

 

Which came first, 3/7/2017 or 5/4/2017?  The answer depends on where you are in the world when asking the question.  In the United States, dates are commonly represented as month/day/year, so these dates would usually be interpreted as March 7 and May 4, respectively.  In many other countries, dates are represented as day/month/year, so they would be interpreted as July 3 and April 5, respectively.

 

This becomes problematic when a data file, which includes date information, is read by a computer system.  Each time the system encounters a field known to be a date, it must decide how to interpret the information.  Fortunately, most modern systems allow us to choose the date format used for input and output.  However, if the date format is not chosen carefully, it can result in one of the most pernicious types of errors in computer systems: one which does not raise a flag immediately and lies dormant for some time.  A date which is incorrectly interpreted can result in a myriad of problems, as was widely publicized at the end of the last century.

 

Given enough data points, it may eventually become clear whether dates have been written starting with the month or day (e.g. if one of the values is 3/15/2017, the format cannot be day/month/year because 15 cannot refer to the month, so the format is probably month/day/year).  This approach is suboptimal because it requires an additional step which is not guaranteed to work properly in all cases.  A better approach is to avoid the problem altogether by taking care when choosing a date format.
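As a rough illustration of that inference step, here is a minimal Python sketch.  The function name infer_date_order and its return labels are made up for this example, and it assumes every value is written as three numbers separated by slashes with the year last.  Note that it cannot settle the question when every value fits both readings.

```python
def infer_date_order(values):
    """Guess whether slash-separated dates are month-first or day-first.

    Returns "month_first", "day_first", or "ambiguous" when the data
    cannot settle the question (including when values contradict each other).
    """
    month_first_possible = True
    day_first_possible = True
    for value in values:
        first, second, _year = (int(part) for part in value.split("/"))
        # A number above 12 cannot be a month.
        if first > 12:
            month_first_possible = False
        if second > 12:
            day_first_possible = False
    if month_first_possible and not day_first_possible:
        return "month_first"
    if day_first_possible and not month_first_possible:
        return "day_first"
    return "ambiguous"


print(infer_date_order(["3/7/2017", "5/4/2017"]))   # ambiguous
print(infer_date_order(["3/7/2017", "3/15/2017"]))  # month_first
```

The first call shows the weakness of the approach: both dates are valid under either reading, so no amount of code can recover the intended order.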

 

To eliminate ambiguity when working with dates, use the format YYYY/MM/DD whenever possible.  This represents a four-digit year, followed by a two-digit month, followed by a two-digit day.  March 7, 2017 would be represented as 2017/03/07.  This format is widely understood and eliminates the ambiguity that can occur when the year appears at the end.  The ISO 8601 standard uses the same ordering with hyphens instead of slashes (2017-03-07).
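One way to apply this in practice is to state the format explicitly wherever dates are written or read, rather than relying on a system default.  The sketch below uses Python's standard datetime module; the helper names format_date and parse_date and the constant DATE_FORMAT are assumptions made for this example.

```python
from datetime import date, datetime

# Four-digit year, two-digit month, two-digit day.
DATE_FORMAT = "%Y/%m/%d"


def format_date(value):
    """Write a date using the year-first format."""
    return value.strftime(DATE_FORMAT)


def parse_date(text):
    """Read a date that was written in the year-first format."""
    return datetime.strptime(text, DATE_FORMAT).date()


print(format_date(date(2017, 3, 7)))  # 2017/03/07
print(parse_date("2017/03/07") == date(2017, 3, 7))  # True

# A side benefit: plain text sorting puts these strings in date order.
print(sorted(["2017/05/04", "2017/03/07"]))  # ['2017/03/07', '2017/05/04']
```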

 

Field delimiters

 

A common method for storing tabular data is in CSV (comma-separated values) format.  In a CSV file, each line contains one row of a table.  Within each line, a delimiter character appears between adjacent values, demarcating the columns.  The delimiter character is usually a comma or a tab.

 

A problem can arise when one of the values that needs to be stored contains the delimiter character.  For example, a person's name may contain a comma (e.g. "Martin Luther King, Jr.").  In this situation, a line containing this value will contain an extra delimiter character.  Software which treats each occurrence of the delimiter as a new column may be confused by the fact that the number of columns is inconsistent from one line to another.

 

One strategy is to choose a delimiter character which does not appear in any of the values.  This technique can help minimize problems; however, it is not guaranteed to completely avoid them as new data files are created in the future.  A better approach is to wrap values that may contain delimiter characters in double quotes, and to ensure that literal double quote characters inside those values are specifically labeled (or "escaped," in programmer speak).  The most common CSV convention escapes a double quote by doubling it (""), while some tools use a backslash instead; whichever convention you choose, confirm that the programs writing and reading the file agree on it.  This ensures that the data file will be parseable regardless of the data that needs to be stored.
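Most languages ship a CSV library that handles quoting and escaping for you.  The sketch below uses Python's standard csv module, which by default quotes only the values that need it and escapes a literal double quote by doubling it.  The second name and the city column are hypothetical sample data.

```python
import csv
import io

rows = [
    ["name", "city"],
    ["Martin Luther King, Jr.", "Atlanta"],  # value contains the delimiter
    ['Jane "JD" Doe', "Springfield"],        # hypothetical value with quotes
]

# Write the rows; the csv module adds quotes and escapes as needed.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Naively splitting on commas breaks the name into two pieces.
first_data_line = text.splitlines()[1]
print(len(first_data_line.split(",")))  # 3

# A CSV-aware reader recovers the original two columns.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed == rows)  # True
```

The point of the design is that the writer and the reader share one set of rules, so no value can break the file no matter what characters it contains.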

 

 

Units of measure

 

Sally's water meter recorded 350 gallons of water used.  John's recorded 200 cubic feet of water used.  Who used more water?  This is a question with a simple answer: John did, because one cubic foot holds about 7.48 gallons, so his 200 cubic feet comes to roughly 1,500 gallons.  But what if the units were not specified?  If all we know is that Sally used 350 and John used 200, we might decide that Sally used more, but only if we first assume that their meters record water using the same units.  Even if that assumption is correct, if we don't know the units, we won't be able to bill properly for the water or compare the quantity to an amount stored in other systems.

 

Quantitative values (i.e., measurements) should always have units specified.  When preparing a data file, you can provide information about units as a separate field.  For example, a record might store 350 in a usage field and "gallons" in a separate units field.
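To make the idea concrete, here is a minimal Python sketch of a data file that carries a units field with every reading.  The column names (customer, usage, unit), the file layout, and the helper usage_in_gallons are assumptions made for this illustration; the conversion factor is the number of US gallons in one cubic foot.

```python
import csv
import io

# Hypothetical data file: the "unit" column travels with every reading.
DATA = """customer,usage,unit
Sally,350,gallons
John,200,cubic feet
"""

# Conversion factors to a common unit (US gallons).
GALLONS_PER_UNIT = {
    "gallons": 1.0,
    "cubic feet": 7.48052,
}


def usage_in_gallons(row):
    """Convert one reading to gallons; an unknown unit raises KeyError."""
    return float(row["usage"]) * GALLONS_PER_UNIT[row["unit"]]


readings = {
    row["customer"]: usage_in_gallons(row)
    for row in csv.DictReader(io.StringIO(DATA))
}

for customer, gallons in readings.items():
    print(customer, round(gallons), "gallons")

print(max(readings, key=readings.get))  # John
```

Because an unrecognized unit stops the program instead of being silently treated as gallons, a mislabeled record is caught when the file is read rather than when the bill goes out.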

 

Alternatively, units can be provided in documentation which accompanies the data.  One advantage of including units information inline in the data is that anyone with the data will automatically have units information, even if the documentation is not accessible.  Another benefit is that the units can vary from one record to another, as in the earlier example of two different people whose water meters recorded in different units.  However, in some cases it may not be practical to provide units inline, and good documentation can help fill this gap.

 

Keep clear and carry on

 

Data parsing errors are not unusual.  However, with a small investment they can be minimized.  By avoiding common data pitfalls and making the right choices at the outset, you will eliminate unnecessary troubleshooting and set yourself up for success.

Valor Water Analytics Intern Blog: Krishna Rao

Hi, I'm Krishna!

I am an environmental fluid mechanics and hydrology engineer currently pursuing my master's degree at Stanford University. I work at the intersection of data science and water hydraulics to create intelligent statistical models. Apart from my course curriculum, I also pursue research in eco-hydrology remote sensing as a research assistant. Thanks to the long commute to work, I am catching up on my reading. I am currently reading Lab Girl by Hope Jahren.

 

Valor Water Analytics Intern Blog: Jakob Grinvoll

Hi, I’m Jakob!

I’m a 25-year-old student from Norway. I’m currently part of an exchange program at UC Berkeley in coordination with my school in Norway, The Norwegian School of Economics. The program is called Innovation School and is the perfect excuse for spending the summer in San Francisco. I spend one day a week at UC Berkeley and four days here at Valor. So the main part of the program is gaining experience from working in San Francisco, which has been, and is, awesome.

Valor Water Analytics Intern Blog: Alex Pan

Hi, I'm Alex!

I’m going into my third year at UC Berkeley, where I study computer science. I’m originally from Novi, Michigan, where I was born and raised until I moved out for college. I’m currently working full time as a software engineering intern for Valor, which is about a 45-minute commute from my apartment in Berkeley. I’m super excited about working hard, seeing the city, and enjoying the weather.

How Dashboards Help Decision-Makers at Water Utilities

By Renee Jutras, Full Stack Developer

Data has become part of the way we tell stories today. Online articles use maps and graphs to add a splash to their stories because, as they say, “a picture is worth a thousand words”. And it’s true - a well-thought-out data visualization can convey much more information than a description alone, and it lets the viewer draw their own conclusions about the information. The difference between a clear positive trend and a potentially coincidental trend is instantly recognizable on a graph.

Dashboards take graphs even further by adding organization and interactivity. The best dashboard helps you continuously monitor your pain points while giving you the power to explore your data visually as freely as possible.

In order to take water utilities further into the future, better technology is needed. Valor Water Analytics’ dashboards put vital information at the fingertips of the decision-makers at utilities, so that they can start to make actionable decisions based on their data.

Valor Water Analytics Intern Blog: Priya Dhandev

I graduated with an MS in Electrical Engineering from Santa Clara University, CA, in December 2016. I went to the Indian Institute of Technology Jodhpur, India, for my undergraduate program in Electrical Engineering. During the MS program, where I was specializing in VLSI Design & Testing, I discovered my love for programming! To enhance my knowledge and learn the skills required of a software developer, I took several courses on Coursera, Udacity, and Udemy.

Broken Meter Beater

By Steve Birndorf

So, I’ve been thinking about broken meters quite a bit lately (and, when I say “broken meter,” I’m referring to all sorts of different issues--under-registration, non-registration, decay, stuck meters, zero reads, etc.). Every day, as I talk to municipalities and water agencies around California and around the country, broken meters are a topic of universal importance and concern. Everyone’s got ‘em, and everyone is trying to get rid of ‘em. And broken meters don’t just go away when you fix them...they are a recurring problem which comes back year after year after year...

 

Broken meters significantly impact water utilities. They leave revenue uncollected, they impact revenue stability, they make compliance difficult, they result in truck rolls, they impact conservation efforts. The list goes on.

Zero visibility: Issues in Water Use Data Resolution

BY DAVID WEGMAN, CTO, VALOR WATER

In the beginning -- that is, before HD television -- there was standard definition television.  Back then, nobody complained much about the quality of the image.  In reality, the reason why people didn't make a fuss was that they didn't know what they were missing out on.  The same goes for the transition from cassette tapes to CDs and a host of other evolutionary enhancements in audio/visual quality over the years.  Ignorance is bliss.

Doing the Right Thing: Ending Water Cutoffs

By Janani Mohanakrishnan, PhD

Intro

“What keeps you up at night?” This was a question posed to George Hawkins (GM at DC Water) at a recent water conference. He promptly replied, “Figuring out how to keep providing water to my growing population of low-income customers”. Later that week, a colleague mentioned her displeasure at having her water cut off because of a system error. She had paid her active utility service deposit, had no history of nonpayment, and was still cut off without notice. Guilty until proven innocent!