Thursday, 11 October 2012

Do you trust your data?

At Electric Lane we help people set up image management systems. Data handling is top of the list in the early stages, and legacy data can be one of the biggest headaches if you don't tackle it at the right time. If you do, things will go swimmingly!

What I am saying here applies to any data transfer, whether it's associated with images or not. Anyone  transferring data from one system to another should ensure that the data is clean, separated, and granular. And if a rule can be applied to data handling, its should be automated. It's the job of a machine, not a person. (We mortals have plenty to do.). Given that data is driving much of the new internet based business around the world, it pays to get yours in a good state.

The first thing we do is to audit what's there.  Then we set up a set of rules to apply to the data so that it is exported in a fit state, and configured as our customers want it. We invariably find problems with legacy data.

Here are some of the things we find and fix. We use our own script, which can be configured to do just about anything with data (as well as images - that's for another time).

Hidden characters
What you see is not always what you get. Hidden characters are often present, especially in exports from databases. These characters, which include character returns, tabs, and other non printing instructional characters can affect your workflow later on. A carriage return entered in a descriptive field in a database can result in the data being divided into two columns or rows in the export. Horrible!  We clean these out out with an invisible hand so the data is exported as intended.

Unseparated data
Often data which should be separated has been merged in one field, sometimes without separators.  In the art world for example, a field might read  Vincent Van Gogh, 1888, Vase with Sunflowers. There are 3 fields here which need separating in the database, and that's easy if commas are only used as separators. Even if there are no commas we can separate the data if there is a consistent pattern of different types of data. So if all the captions read in this order Vincent Van Gogh 1888 Vase with Sunflowers we could separate the data by applying a rule which places separators before and after the date.

One of the common data formats for delivery is CSV (comma separated values), with commas as separators. This means that other commas in your data can confuse the export. If you have descriptive text with commas, and separators are needed, it's better to export data as tab delimited format, otherwise your sentences will split at every comma.

Formatting issues
There are at least 5 different types of quote mark!  In a plain spreadsheet environment formatting symbols like quote marks, hyphens and spaces will be substituted by characters.  But the substitutions can be different in different environments so the answer is to clean them out before they get to the spreadsheet. (For example you may have seen this appear %20 when you paste a web address into your browser. %20 is a substitute for a character space.)

Diacritics (accents and non-roman alphabet) are not always correctly interpreted by software. They are often not allowed when keywording images, for this very reason. If appropriate, we can export data without diacritics, or we can ensure that it data is correctly exported.

If you are interested, you can see the possible variants here in the ascii table. (Unicode enables all characters to be correctly handled, but data needs handling differently and not all software is unicode compliant.)

Dates can be a horror. We know that dates entered in some earlier versions of some software (including Photoshop) can change when read in other versions.

In a spreadsheet, data in a cell can be string (characters), number, or date format. The spreadsheet will reformat your dates to the preference you have set. (Day-month-year, month-day-year, and so on) You may well have encountered this in Excel, but it applies to all spreadsheets. In XMP (the Adobe data model used for embedding data in images)  a date can be year only (no day or month). This kind of date is treated as a numeral by spreadsheets (just a number) so falls out of order with the date values when being sorted.

We have a way round this when we want to analyse data. We insert a string character so that when dates arrive in the spreadsheet cell they do not reformat. For import back into an image or into another database the appropriate date format needs to be applied.

For historical archives there is an additional problem with circa dates, which cannot be expressed in date format and require an extra text field.


We sometimes find that databases export keywords with non standard separators which are not recognised by spreadsheets. Replacing spaces with comma separators causes problems for compound keywords, like Leather Jacket. We review the data before it gets to the spreadsheet to identify the hidden separators and substitute standard separators so that the keywords read correctly.

You may want to map data from one field into a field of a different name. Along the way, you may want to split or even merge data for various purposes.  Everything is possible so long as the route is clear and no information is lost along the way. People have stopped talking about the all singing all dancing single data structure. There is recognition that legacy systems are here to stay in one form or another; that differents chemas need to be able to talk to each other.

 So these are the things we can sort out for any kind of data.

The problems of data are exercising the minds of a number of people around the world. The European database of art is one example where data from a number of sources is pulled onto one site for a search across all contributing sources. The data structure is based on Dublin Core, which is extensible but not supremely fitted to describe imagery (there are only 5 fields which map directly between IPTC and Dublin Core). So there are some inconsistencies turning up in the data fields displayed online.

But there are problems with these 5 fields too, if Dublin Core is not qualified. For example, IPTC now has two creator fields, one for creator of the photograph, and one for creator of the artwork.What can happen is demonstrated by an image I saw once on the Getty site. The Mona Liisa painting was  displayed online, and the photographer was listed as Leonardo Da Vinci. That was before we had an artwork creator field in IPTC.

Because we find outselves working at the interface between collections management and image DAM systems, we are getting more involved in collections management data structures. We have set up a Cultural Heritage metadata group for people working in this area, and hope to create a common set of fields  for heritage works from which to produce a subset of new fields for IPTC. Clearly we will not be reinventing the wheel, and are working with people involved in VRA Core, Linked Heritage, and others. I have also been involved in the ARROW PLUS project, where standardised querying of data is an important element in the effort trace rightsholders.  More on all of this later.......

See also my 10 rules for image metadata in the CEPIC/IPTC Image Metadata Handbook

If you need help with your data contact .