I recently read a great little story in the New York Times Magazine about data. I believe Jake Halpern is going to release a book on this, according to its Amazon page. From what I can tell, the story is about the data industry. I spent a few years on the outside of it looking in, and in a way I guess I have spent more than a decade in data.

Data trading is an interesting industry, one in which you would assume you would find structure. What I can tell you is that the 'data world' lacks structure in almost every way. While reading the article I was thinking about the structured vs. unstructured data that individuals trade in. For example, let's take a person's identity for a second. Here is an example of unstructured data:

  • Name (Casing is nOt SeNsiTiVe iS It?)
  • Address: How many ways can you write an address? Think about how many developers have created developments with the strangest of addresses.
  • Date of Birth: Pretty standard and straightforward.
  • Phone Number: Something with a ton of entropy, right?
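
To make the point concrete, here is a minimal sketch of the cleanup these fields need before two records can even be compared. The function names and rules are my own assumptions for illustration (US-style ten-digit phone numbers, names compared case-insensitively), not anything an actual agency is known to use:

```python
import re
from typing import Optional


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and ignore casing (hypothetical rule)."""
    return " ".join(raw.split()).casefold()


def normalize_us_phone(raw: str) -> Optional[str]:
    """Reduce a US-style phone number to 10 digits, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the leading country code
    return digits if len(digits) == 10 else None


# Two spellings of the same person now compare equal.
same_person = normalize_name("  jOhN   SMITH ") == normalize_name("John Smith")
```

Even after this, two different people named John Smith still normalize to the same string, which is why a name cannot serve as a key.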

What about structured data?

  • U.S. Social Security Number: globally available 'structured' Universal ID.
  • Health Record: Locally available (specific to the hospital or doctor's office, for example).

Consider, however, the problem described in the article above. When a person (an identity) asks for a loan or takes on an obligation such as a medical bill, it is up to the 'debtee' and the 'debtor' to verify that the debt is 'factual'. But of course this type of data is completely unstructured. There is no structure to the following information:

  • Account Number: Locally significant number to the institution, which is problematic.
  • Name: Same as above, completely unstructured, and it cannot be used as a valid key because it can be duplicated.
  • Address and Phone Numbers: High amount of entropy with individuals able to move and change numbers.
  • Social Security Number: The 'Actual' Key.
  • Amount 'owed': This is only known as of the loan origination.
  • Amount 'collected': Unfortunately this is a tricky one. Collected by whom, and 'settled' when? This is a problem area; let's get back to it.

This unstructured data presents issues for people in, say, the credit reporting agencies. For example, once someone (anyone, really) puts a 'collection notice' in your name, what's actually happening? Well, in a relational database world, two tables are being joined through a third that links them. For example:

Table 1: Debt (Account Number (Key?), Balances of the Account, Identity of people (social?))

Table 2: Identity (Social Number (Key?), Name, Address, etc)

Table 3: Social <-> Account Number
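
As a rough sketch, the three tables above could look like this in an in-memory SQLite database. The table names, column names, and sample rows are all made up for illustration (the SSNs use the 000 prefix, which is never issued):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identity (ssn TEXT PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE debt (account_number TEXT PRIMARY KEY, balance_cents INTEGER);
    -- Table 3: the link between a Social and an account number
    CREATE TABLE ssn_account (
        ssn TEXT REFERENCES identity(ssn),
        account_number TEXT REFERENCES debt(account_number)
    );
""")

# Two different people who happen to share a name.
conn.executemany("INSERT INTO identity VALUES (?, ?, ?)", [
    ("000-00-0001", "John Smith", "1 Main St"),
    ("000-00-0002", "John Smith", "22 Oak Ave"),
])
conn.executemany("INSERT INTO debt VALUES (?, ?)", [
    ("ACCT-17", 125000),
    ("ACCT-42", 9900),
])
conn.executemany("INSERT INTO ssn_account VALUES (?, ?)", [
    ("000-00-0001", "ACCT-17"),
    ("000-00-0002", "ACCT-42"),
])


def debts_for(ssn):
    """Join identity -> link table -> debt for one Social Security Number."""
    return conn.execute("""
        SELECT i.name, d.account_number, d.balance_cents
        FROM identity AS i
        JOIN ssn_account AS link ON link.ssn = i.ssn
        JOIN debt AS d ON d.account_number = link.account_number
        WHERE i.ssn = ?
        ORDER BY d.account_number
    """, (ssn,)).fetchall()
```

Joining on the name instead of the Social would hand both John Smiths each other's debts, and the account number only means something to the institution that issued it.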

This is really what's happening in a spreadsheet or in a database. The problem with this 'solution' shows up in the following scenario:

Debt gets sold (charged off) -> Account gets purchased by someone at auction -> Some of the accounts are settled, which is potentially a grossly inaccurate process -> Remaining accounts are sold again -> Sold again and again -> Do the accounts get recollected?

If the accounts get recollected and go into 'default' again, the accused person has to provide 'proof' that they have already paid the debt. Did they pay the right amount? What if the systems used to collect were already breached? What if the debt had been sold twice? Does the person have to pay twice?

What is a solution here? Debt exchanges. Unfortunately, this probably has to go through a national clearing house, much as ACH payments clear through the Federal Reserve. For example, some database that looks like this:

  • National Debt Loan Number:
  • Identified Lessor:
  • Identified Lessee:
  • SSN?:
  • Loan Origination:
  • Current Factual Balance:
  • Amount Already Collected:
  • Loan Owner:

This would be, for example, a way for debt collectors to buy in a fair marketplace using a national debt loan number. In essence, what we are saying is that one way to take unstructured data and make sense of it is to massage it up front with a bit of 'pre-processing' in order to prevent a large amount of 'post-processing'.
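
Here is a toy sketch of what such a clearing house record might look like in code. Everything in it is hypothetical: the class names, the rule that only the current owner may collect, and the simplification that the balance is just the original amount minus what has been collected (no interest or fees).

```python
from dataclasses import dataclass, field


@dataclass
class DebtRecord:
    loan_number: str        # National Debt Loan Number
    lessor: str             # Identified Lessor
    lessee: str             # Identified Lessee
    ssn: str
    origination_cents: int  # Loan Origination amount
    owner: str              # Loan Owner
    collected_cents: int = 0
    history: list = field(default_factory=list)

    @property
    def balance_cents(self) -> int:
        """Current factual balance under the no-interest simplification."""
        return self.origination_cents - self.collected_cents


class DebtExchange:
    """One central ledger that every sale and payment must pass through."""

    def __init__(self):
        self._records = {}

    def register(self, record: DebtRecord) -> None:
        if record.loan_number in self._records:
            raise ValueError("loan number already registered")
        self._records[record.loan_number] = record

    def sell(self, loan_number: str, seller: str, buyer: str) -> None:
        record = self._records[loan_number]
        if record.owner != seller:
            raise PermissionError("only the current owner can sell")
        record.owner = buyer
        record.history.append(("sold", seller, buyer))

    def collect(self, loan_number: str, collector: str, amount_cents: int) -> int:
        record = self._records[loan_number]
        if record.owner != collector:
            raise PermissionError("only the current owner can collect")
        if amount_cents <= 0 or amount_cents > record.balance_cents:
            raise ValueError("cannot collect more than the remaining balance")
        record.collected_cents += amount_cents
        record.history.append(("collected", collector, amount_cents))
        return record.balance_cents
```

With a single record per loan number, a debt that has been paid off and then resold has a zero balance no matter who holds it, so the second buyer has nothing left to collect and the debtor has a history to point to.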

Am I wrong? Can an argument be made against this? I welcome it!