We’ve all encountered database tables that look like this:
ID      Data
-----   --------------------------------------------
00001   TRUE, FALSE, FILE_NOT_FOUND
00002   MALE|FEMALE|TRANS|EUNUCH|OTHER
00003   <?xml version="1.0" encoding="UTF-8"?><item id="1234"><name "Widget"/>...</item>
00004   1234|Fred,Lena,Dana||||||||||||1.3DEp42|
Oh, the joy of figuring out what each field of each row represents. The fun of deciphering the code that writes, reads, and parses each row of data. In a moment, you will look back fondly on that experience as the Good Old Days.
The task of administering elections in the Great White North is handled by the appropriately named agency Elections Canada. As part of their mandate, they provide the results of past elections in granular detail, both as nicely formatted web pages and as downloadable raw files. The latter are meant to be used by researchers for studying how turnout varies across provinces, ages, races, and so on, as well as for arguing about the merits of proportional representation versus single transferable vote, and so forth.
One of the more comprehensive data files is descriptively known as Table-Twelve, and it contains a record for every candidate who ran in the election. Each record contains how many votes they got, the riding (electoral district) in which they competed, their affiliated party, home town, occupation, and hundreds of other details about the candidate. This file has been published for every election since the 38th general election in 2004. Vicki was charged with creating a new parser for this data.
Table-Twelve is a CSV file in the same way that managers describe their new agile process as <details of waterfall here>. While parsing a CSV file is in general no big deal, writing a function to parse this data was far harder than she expected. For one thing, the column titles change from year to year. One might think "Who cares, as long as the data is in the same sequence." One would be wrong. As an example, depending upon the year, the identifier for the electoral district might be in a column named "Electoral District Name", "Electoral District", or "District", and might contain a string representing the district name or a numeric district identifier, either of which may or may not be enclosed in single or double quotes. Just to make it interesting, some of the quoted strings contain commas, and some of the numbers are commafied as well.
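A parser for a file like this can't trust the header; it has to probe for whichever column name this year's file happens to use, then scrub quotes and thousands separators out of each value. A minimal sketch of the idea (the alias set and helper names here are hypothetical — the real files would need a far longer list):

```python
import csv
import io

# Hypothetical set of column titles seen across years for the same field.
DISTRICT_ALIASES = {"electoral district name", "electoral district", "district"}

def find_district_column(header):
    """Return the index of whichever district column this year's file uses."""
    for i, title in enumerate(header):
        if title.strip().lower() in DISTRICT_ALIASES:
            return i
    raise ValueError("no recognizable district column in header")

def clean_value(raw):
    """Strip stray single/double quotes and de-commafy numbers like '1,234'."""
    value = raw.strip().strip('"').strip("'")
    if value.replace(",", "").isdigit():
        return int(value.replace(",", ""))
    return value

# The csv module already copes with commas inside quoted strings;
# clean_value handles the leftover quoting and commafied numbers.
sample = 'District,Votes\n"Winnipeg North, Manitoba","12,345"\n'
rows = list(csv.reader(io.StringIO(sample)))
col = find_district_column(rows[0])
```

This sidesteps the name drift and the quoting, but only for the aliases you've already catalogued; each new election year is another chance for a surprise.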
Further inspection revealed that the columns are not only inconsistently named, but named so as to be completely misleading. There's a column labeled "Majority". If you're thinking that it contains a boolean indicating whether the candidate got a majority, or 50%+1 of the cast votes (i.e., "How many votes do you need for a majority?"), you'd be mistaken. Nor is it even a slight misuse (where it should have been "Plurality"). Instead, it's the delta between the winning candidate and the second-place candidate in that riding. They also helpfully give you the quotient of this delta over the total cast votes as the "Majority Percentage".
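In other words, what the file calls "Majority" is really the margin of victory. A quick sketch of what the column actually contains (function names are my own, not Elections Canada's):

```python
def majority(vote_counts):
    """'Majority' in Table-Twelve parlance: winner minus runner-up."""
    ranked = sorted(vote_counts, reverse=True)
    return ranked[0] - ranked[1]

def majority_percentage(vote_counts):
    """'Majority Percentage': that delta as a share of all votes cast."""
    return 100.0 * majority(vote_counts) / sum(vote_counts)
```

So in a riding with vote counts of 5000, 3000, and 2000, the "Majority" is 2000 and the "Majority Percentage" is 20 — even though the winner took only half the vote.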
Canada has a parliamentary system, so it's also important to know how many candidates of each party won; the party designation is obviously going to be easy to access, right? Or maybe you'd like to sort by surname? Well, it turns out that the party is appended to the field containing the candidate's name, delimited with a single space (and possibly an asterisk if they were the incumbent). But the candidate's name and the party are already each a variable number of words (some have middle names or two surnames) delimited by single spaces. The party name, moreover, must be given in both English and French, separated by a forward slash. Of course, some parties already have a slash in their name! Oh, and if the candidate didn't run as a member of a party, they might be listed as "Independent" or as "No Affiliation"; both are used in any given file.
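About the only way to pull that field apart is to match known party labels from the end of the string, which sidesteps both the spaces in names and the slashes in party names. A sketch under heavy assumptions — the party labels below are illustrative, and a real parser would need the complete list for each election year:

```python
# Illustrative party labels as they might appear (English/French joined
# by a slash); these are NOT a complete or verified list.
KNOWN_PARTIES = [
    "Liberal/Libéral",
    "Conservative/Conservateur",
    "Independent",
    "No Affiliation",
]

def split_candidate(field):
    """Split 'Given Surname * Party/Parti' into (name, party, incumbent).

    Tries the longest labels first so a label that happens to end with
    another label (slashes and all) can't match prematurely.
    """
    for party in sorted(KNOWN_PARTIES, key=len, reverse=True):
        if field.endswith(" " + party):
            rest = field[: -len(party) - 1].rstrip()
            incumbent = rest.endswith("*")       # asterisk marks the incumbent
            name = rest.rstrip("* ").strip()
            return name, party, incumbent
    raise ValueError(f"unrecognized party in: {field!r}")
```

Anything not in the label list fails loudly, which, with this file, is exactly what you want.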
Above and beyond the call of making something difficult to parse, the files are full of accented French text, so the encoding changes from file to file: here ISO-8859-1, there UTF-8, over there a BOM or two.
Don’t get me wrong, I’ve written parsers for this sort of garbage: build a bunch of routines that each handle one trivial piece, compose them into larger logical parsers, and so on, until you can parse every field of an entire row along with all the special cases that spew forth. But at least the files those parsers consumed were consistent from one day to the next.
Vicki is considering pulling out all of her hair, braiding it together and using it to hang the person who designed Table-Twelve.