by Matt Manning
At the recent SIPA Conference, data content industry expert Russell Perkins delved into the current transformation of our industry. He offered some head-turning statistics confirming what many of us already know: data product business models are highly stable and very profitable.
He then explored current trends in the industry. Once merely passive compendiums sitting on the shelf, data products are now complex software products designed to be used, and used heavily.
Russell’s insights come from a unique perspective as the industry’s premier data publishing consultant and are always well worth reading.

posted by Shyamali Ghosh on July 7, 2014
by Kevin Dodds
We manage a lot of data at IEI and, more often than not, I find that the key to cleaning up and improving the data we get lies in where the data came from in the first place – in other words, its “provenance.”
Data “sources” roughly fall into these groups:
- Proprietary structured databases
- Internal customer or prospect data
  - CRM data
  - Circulation files
  - One-off customer purchases
- Public data
  - Government filings
  - Public transaction data (shipping manifests, bills of lading)
  - User-generated content (reviews, rankings)
  - News (press releases, news articles, blog posts)
  - Web information (addresses, bios, products)
A typical data project involves deduping, normalizing, appending missing information, and direct verification via a combination of in-house researchers, trained crowdsourced workers, and software tools. The choice of these tools depends not only on the desired end result of the project (a publishable database, a list clean enough to use for marketing) but also on where the data came from.
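To make the deduping and normalizing steps concrete, here is a minimal sketch in Python. The field names (company_name, phone, source) and the matching rule are hypothetical placeholders for illustration, not the actual tools or schema we use.

```python
# Minimal illustration of normalizing and deduping a batch of records.
# Field names and matching logic are hypothetical, not IEI's real schema.
import re

def normalize(record):
    """Standardize formatting so equivalent records compare equal."""
    name = record.get("company_name", "").strip().lower()
    name = re.sub(r"[.,]", "", name)                      # drop stray punctuation
    name = re.sub(r"\s+", " ", name)                      # collapse whitespace
    phone = re.sub(r"\D", "", record.get("phone", ""))    # keep digits only
    return {**record, "company_name": name, "phone": phone}

def dedupe(records):
    """Keep the first record seen for each (name, phone) key."""
    seen = {}
    for rec in map(normalize, records):
        key = (rec["company_name"], rec["phone"])
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

raw = [
    {"company_name": "Acme, Inc.", "phone": "(215) 555-0100", "source": "CRM"},
    {"company_name": "ACME INC",   "phone": "215-555-0100",   "source": "circulation"},
]
print(dedupe(raw))   # the two Acme entries collapse into one record
```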
Typical red flags in data’s provenance:
- Was the data harvested? (old, miscategorized, unverified)
- Was the data entered by hand? (misspellings, transposed fields, missing required fields)
- If internal data, when was it last used? (old)
- Were multiple sources combined? (mixed formatting conventions, truncation)
Based on these indicators, we define a process to address each of the issues (re-categorizing, fixing spelling errors, targeting key missing fields) in a logical sequence.
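As a rough illustration of that sequencing, the sketch below maps each red flag to the remediation steps it triggers and runs them in a fixed order. The flag names and step functions are hypothetical stand-ins, not a description of our internal tooling.

```python
# Illustrative only: map provenance red flags to cleanup steps and run
# them in a logical order. Step functions are stand-in stubs.

def recategorize(records):
    """Re-assign categories to records harvested from the web."""
    return records

def fix_spelling(records):
    """Correct misspellings introduced by hand entry."""
    return records

def fill_missing_fields(records):
    """Append key fields that are missing or stale."""
    return records

# Hypothetical playbook: which steps each red flag triggers, in order.
PLAYBOOK = {
    "harvested": [recategorize],
    "hand_entered": [fix_spelling, fill_missing_fields],
    "stale_internal": [fill_missing_fields],
}

def run_cleanup(records, flags):
    """Apply each triggered step once, preserving the playbook's order."""
    steps = []
    for flag in flags:
        for step in PLAYBOOK.get(flag, []):
            if step not in steps:
                steps.append(step)
    for step in steps:
        records = step(records)
    return records
```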
The one indispensable step in all data projects is direct verification via primary sources. These sources can include recent government filings, official websites, or direct communication with a person at the company in question. Without this step at the end of the process, there is a significant risk of introducing old or incorrect data into the deliverable. This final verification also adds value as a citation attesting to the data’s accuracy, much like a “sell by” date. Increasingly, data customers expect this piece of metadata as a “certificate of authenticity,” and for good reason: their customers, whether paying subscribers or internal sales teams, all need to know where the data came from, too.
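As one possible shape for that metadata, the snippet below attaches a verification date and source to a delivered record. The field names are hypothetical and shown only to illustrate the idea, not as any standard.

```python
# Hypothetical example of the "certificate of authenticity" metadata a
# delivered record might carry; field names are illustrative only.
from datetime import date

verified_record = {
    "company_name": "Acme Inc",
    "phone": "2155550100",
    "verified_on": date(2014, 6, 27).isoformat(),     # the "sell by" date
    "verified_against": "official company website",   # primary source used
}
```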
posted by Shyamali Ghosh on June 30, 2014