Thursday 11 October 2012

Do you trust your data?

WE CAN HELP YOU 
At Electric Lane we help people set up image management systems. Data handling is top of the list in the early stages, and legacy data can be one of the biggest headaches if you don't tackle it at the right time. If you do, things will go swimmingly!

What I am saying here applies to any data transfer, whether it's associated with images or not. Anyone transferring data from one system to another should ensure that the data is clean, separated, and granular. And if a rule can be applied to data handling, it should be automated. It's the job of a machine, not a person. (We mortals have plenty to do.) Given that data is driving much of the new internet-based business around the world, it pays to get yours in a good state.

DATA AUDIT
The first thing we do is to audit what's there. Then we draw up a set of rules to apply to the data so that it is exported in a fit state, and configured as our customers want it. We invariably find problems with legacy data.

DATA FIX
Here are some of the things we find and fix. We use our own script, which can be configured to do just about anything with data (as well as images - that's for another time).

Hidden characters
What you see is not always what you get. Hidden characters are often present, especially in exports from databases. These characters, which include carriage returns, tabs, and other non-printing control characters, can affect your workflow later on. A carriage return entered in a descriptive field in a database can result in the data being divided into two columns or rows in the export. Horrible! We clean these out with an invisible hand so the data is exported as intended.
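As an illustration only (this is not the Electric Lane script), a rule for hidden characters might look like the sketch below. The function name and the choice to replace control characters with a space while dropping invisible formatting characters are assumptions made for the example.

```python
import unicodedata


def strip_hidden_characters(value: str) -> str:
    """Remove non-printing characters from a single field value."""
    cleaned = []
    for ch in value:
        category = unicodedata.category(ch)
        if category == "Cc":
            # Control characters: carriage return, line feed, tab and so on.
            # Replace with a space so adjacent words do not run together.
            cleaned.append(" ")
        elif category == "Cf":
            # Invisible formatting characters such as the zero-width space.
            continue
        else:
            cleaned.append(ch)
    # Collapse any runs of whitespace left behind into single spaces.
    return " ".join("".join(cleaned).split())


print(strip_hidden_characters("Vase with\r\nSunflowers\t1888"))
```

Applied to every field before export, a rule like this keeps a stray carriage return from splitting one record across two rows.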

Unseparated data
Often data which should be separated has been merged in one field, sometimes without separators. In the art world for example, a field might read: Vincent Van Gogh, 1888, Vase with Sunflowers. There are 3 fields here which need separating in the database, and that's easy if commas are only used as separators. Even if there are no commas we can separate the data if there is a consistent pattern of different types of data. So if all the captions read in this order, Vincent Van Gogh 1888 Vase with Sunflowers, we could separate the data by applying a rule which places separators before and after the date.
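The date rule above can be sketched with a regular expression. This is an illustration under assumptions: the pattern name, the function name, and the rule that the date is always a four-digit year sitting between artist and title are all made up for the example, and real captions would need checking against the pattern first.

```python
import re

# Assumed pattern: artist, then a four-digit year, then the title,
# separated by commas and/or spaces.
CAPTION_PATTERN = re.compile(
    r"^(?P<artist>.+?)[,\s]+(?P<year>\d{4})[,\s]+(?P<title>.+)$"
)


def split_caption(caption: str):
    """Return (artist, year, title), or None if the caption does not fit."""
    match = CAPTION_PATTERN.match(caption.strip())
    if match is None:
        return None
    return match.group("artist"), match.group("year"), match.group("title")


print(split_caption("Vincent Van Gogh 1888 Vase with Sunflowers"))
print(split_caption("Vincent Van Gogh, 1888, Vase with Sunflowers"))
```

Captions that return None are the ones to review by hand, since they break the assumed pattern.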

Commas
One of the common data formats for delivery is CSV (comma separated values), with commas as separators. This means that other commas in your data can confuse the export. If you have descriptive text with commas, and separators are needed, it's better to export data in tab delimited format; otherwise, unless the text fields are quoted, your sentences will split at every comma.
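A short sketch using Python's standard csv module shows both ways of keeping commas inside a field intact: writing tab delimited output, or letting the writer quote any field that contains the delimiter. The helper names write_rows and read_rows are hypothetical, chosen for the example.

```python
import csv
import io


def write_rows(rows, delimiter=","):
    """Serialise rows to delimited text, quoting fields only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text, delimiter=","):
    """Parse delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = [
    ["title", "description"],
    ["Sunflowers", "Oil on canvas, painted in Arles, 1888"],
]

comma_text = write_rows(rows)            # description field is quoted
tab_text = write_rows(rows, "\t")        # no quoting needed
print(comma_text)
print(tab_text)
```

Either output reads back into two columns. The tab delimited form is the simpler one to hand to software that does not honour CSV quoting.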

Formatting issues
There are at least 5 different types of quote mark! In a plain spreadsheet environment formatting symbols like quote marks, hyphens and spaces may be substituted by other characters. But the substitutions can be different in different environments, so the answer is to clean them out before they get to the spreadsheet. (For example, you may have seen %20 appear when you paste a web address into your browser. %20 is a substitute for a space character.)
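One way to clean these out before the data reaches a spreadsheet is a simple translation table. The sketch below is an assumption about what such a rule could look like; the table covers only a handful of common variants, and the names PUNCTUATION_MAP, normalise_punctuation and decode_percent_escapes are hypothetical.

```python
from urllib.parse import unquote

# Map typographic variants to their plain equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # no-break space
})


def normalise_punctuation(value: str) -> str:
    """Replace curly quotes, long dashes and no-break spaces."""
    return value.translate(PUNCTUATION_MAP)


def decode_percent_escapes(value: str) -> str:
    """Turn URL escapes such as %20 back into the characters they stand for."""
    return unquote(value)


print(normalise_punctuation("\u201cVase\u201d \u2013 Van Gogh\u2019s"))
print(decode_percent_escapes("Vase%20with%20Sunflowers"))
```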

Diacritics
Diacritics (accents and non-roman alphabet) are not always correctly interpreted by software. They are often not allowed when keywording images, for this very reason. If appropriate, we can export data without diacritics, or we can ensure that the data is correctly exported.

If you are interested, you can see the possible variants here in the ASCII table. (Unicode enables all characters to be correctly handled, but data needs handling differently and not all software is Unicode compliant.)
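For the "export without diacritics" option, a common technique is to decompose each character with Unicode normalisation and drop the combining marks. This is a sketch of that general technique, not a description of any particular tool; the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(value: str) -> str:
    """Return the text with accents removed from accented letters."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Château de Fontainebleau"))
print(strip_diacritics("Dvořák"))
```

Letters that are not built from a base letter plus an accent (for example ø) pass through unchanged, so they would need their own substitution rule.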

Dates
Dates can be a horror. We know that dates entered in some earlier versions of some software (including Photoshop) can change when read in other versions.

In a spreadsheet, data in a cell can be string (characters), number, or date format. The spreadsheet will reformat your dates to the preference you have set (day-month-year, month-day-year, and so on). You may well have encountered this in Excel, but it applies to all spreadsheets. In XMP (the Adobe data model used for embedding data in images) a date can be year only (no day or month). This kind of date is treated as a numeral by spreadsheets (just a number) so falls out of order with the date values when being sorted.

We have a way round this when we want to analyse data. We insert a string character so that when dates arrive in the spreadsheet cell they do not reformat. For import back into an image or into another database the appropriate date format needs to be applied.
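The post does not say which character is inserted, so the sketch below uses "#" purely as a placeholder marker. It also assumes dates arrive in the ISO-style forms that XMP allows (YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time). All function names are hypothetical.

```python
def protect_date(value: str, marker: str = "#") -> str:
    """Prefix a date with a text marker so a spreadsheet keeps it as a string."""
    return f"{marker}{value}"


def restore_date(value: str, marker: str = "#") -> str:
    """Remove the marker again before re-import."""
    return value[len(marker):] if value.startswith(marker) else value


def date_sort_key(value: str, marker: str = "#"):
    """Sort key that places year-only and year-month dates among full dates."""
    # Ignore any time component after the "T".
    date_part = restore_date(value, marker).split("T")[0]
    parts = date_part.split("-")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 0
    day = int(parts[2]) if len(parts) > 2 else 0
    return (year, month, day)


dates = [protect_date(d) for d in ["1890-07-29", "1888", "1888-08-15", "1853-03-30"]]
print(sorted(dates, key=date_sort_key))
```

Sorting on the parsed key rather than on the cell value keeps a year-only date such as 1888 in sequence with the full dates around it.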

For historical archives there is an additional problem with circa dates, which cannot be expressed in date format and require an extra text field.
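A circa date can be split into a plain year plus a separate text qualifier, which is one way of providing the extra text field. The sketch assumes the qualifier is written as "circa", "ca." or "c." in front of a three or four digit year; the pattern, function name and dictionary keys are assumptions for illustration.

```python
import re

# Assumed spellings of the qualifier: "circa", "ca", "ca.", "c", "c."
CIRCA_PATTERN = re.compile(
    r"^\s*(?:circa|ca\.?|c\.?)\s*(?P<year>\d{3,4})\s*$",
    re.IGNORECASE,
)


def split_circa(value: str) -> dict:
    """Separate a circa date into a year field and a qualifier field."""
    match = CIRCA_PATTERN.match(value)
    if match:
        return {"year": match.group("year"), "qualifier": "circa"}
    # Not a circa date: pass the value through with an empty qualifier.
    return {"year": value.strip(), "qualifier": ""}


print(split_circa("c. 1888"))
print(split_circa("Circa 1500"))
print(split_circa("1888"))
```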

Keywords

We sometimes find that databases export keywords with non-standard separators which are not recognised by spreadsheets. Replacing spaces with comma separators causes problems for compound keywords, like Leather Jacket. We review the data before it gets to the spreadsheet to identify the hidden separators and substitute standard separators so that the keywords read correctly.
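A sketch of the substitution step: split on a list of known separators, never on spaces, so compound keywords survive. The separator list here (semicolon, pipe, comma, tab, line breaks and the ASCII group, record and unit separators) is an assumption; in practice it would come from reviewing the actual export.

```python
import re

# Separators assumed for the example; spaces are deliberately not included.
SEPARATOR_PATTERN = re.compile(r"[;|,\t\n\r\x1d\x1e\x1f]+")


def normalise_keywords(raw: str, separator: str = ", ") -> str:
    """Rewrite a keyword string using one standard separator."""
    parts = SEPARATOR_PATTERN.split(raw)
    # Tidy the spacing inside each keyword and drop empty entries.
    keywords = [" ".join(part.split()) for part in parts]
    return separator.join(keyword for keyword in keywords if keyword)


print(normalise_keywords("Leather Jacket;Fashion|Street Style\x1dPortrait"))
```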

Mapping
You may want to map data from one field into a field of a different name. Along the way, you may want to split or even merge data for various purposes. Everything is possible so long as the route is clear and no information is lost along the way. People have stopped talking about the all-singing, all-dancing single data structure. There is recognition that legacy systems are here to stay in one form or another, and that different schemas need to be able to talk to each other.
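A field mapping can be as simple as a lookup table, with a check that refuses to drop any field silently. The field names in FIELD_MAP are hypothetical and do not represent any real schema; the point of the sketch is the "no information is lost" check.

```python
# Hypothetical mapping from legacy field names to target field names.
FIELD_MAP = {
    "Artist": "creator",
    "Title": "title",
    "Date": "date_created",
}


def map_record(record: dict, field_map: dict) -> dict:
    """Rename the fields of one record, failing loudly on unmapped fields."""
    unmapped = [field for field in record if field not in field_map]
    if unmapped:
        raise KeyError(f"No mapping defined for fields: {unmapped}")
    return {field_map[source]: value for source, value in record.items()}


legacy = {"Artist": "Vincent Van Gogh", "Title": "Vase with Sunflowers", "Date": "1888"}
print(map_record(legacy, FIELD_MAP))
```

If two source fields were mapped to the same target, this version would keep only the last value, so a merge needs its own explicit rule rather than a shared target name.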

So these are the things we can sort out for any kind of data.

BRINGING IT ALL TOGETHER?
The problems of data are exercising the minds of a number of people around the world. The European database of art is one example where data from a number of sources is pulled onto one site for a search across all contributing sources. The data structure is based on Dublin Core, which is extensible but not supremely fitted to describe imagery (there are only 5 fields which map directly between IPTC and Dublin Core). So there are some inconsistencies turning up in the data fields displayed online.

But there are problems with these 5 fields too, if Dublin Core is not qualified. For example, IPTC now has two creator fields, one for creator of the photograph, and one for creator of the artwork. What can happen is demonstrated by an image I saw once on the Getty site. The Mona Lisa painting was displayed online, and the photographer was listed as Leonardo Da Vinci. That was before we had an artwork creator field in IPTC.

Because we find ourselves working at the interface between collections management and image DAM systems, we are getting more involved in collections management data structures. We have set up a Cultural Heritage metadata group for people working in this area, and hope to create a common set of fields for heritage works from which to produce a subset of new fields for IPTC. Clearly we will not be reinventing the wheel, and are working with people involved in VRA Core, Linked Heritage, and others. I have also been involved in the ARROW PLUS project, where standardised querying of data is an important element in the effort to trace rightsholders. More on all of this later...

See also my 10 rules for image metadata in the CEPIC/IPTC Image Metadata Handbook

If you need help with your data contact sarah@electriclane.co.uk .

Friday 3 August 2012

Copyright Works!

After what must have been a very intense 8 months, 'Copyright Works - Streamlining Copyright for the Digital Age' by Richard Hooper and Ros Lynch has been published (I will call it the Hooper report). Commissioned by the government, this independent report is ambitious in scope, bringing into one document the thoughts and experience of a variety of creative industry sectors, all with their own particular drum to bang.

The main points relating to the picture industry will be listed at the end, and if you are acquainted with all the various strands of thought which have informed the report, go straight there. Here's the background for everyone else....

The big idea at the start was to look into the setting up of a digital content exchange. For many of us, the idea was vague. What exactly is a digital content exchange and whoever would be in charge of setting it up? In the course of discussions it became clear that the picture industry already operates digital content exchanges in the form of image libraries. Now there's a concept we understand - a user goes to a place online, asks for content, finds out what rights they can buy, pays the money, downloads the image. Easy in our industry, we've been doing it for years. So what's the need for change?

Licensing rights can be more complex in other industries like music and audio visual where multiple rights may exist in a work, and several, sometimes overlapping, collecting societies are responsible for handling rights. For the user it's a dog's dinner, and there are attempts to streamline some of those databases into something as near as possible to a one stop shop for people who want to licence content.

The problem in the picture industry is a little different, as the Hooper report recognises. We have user friendly access for people who want to buy images, but there is a real problem with image identification. Images that escape from databases are mostly floating in the websphere without metadata, without an identification label, and with very limited means of finding rightsholders. They easily become orphaned, and metadata is almost routinely stripped from images when they are uploaded to the internet by anything other than the most professional image library software systems. This affects anyone who takes pictures, the amateur as much as the pro. If the concept of copyright is to be retained - and the Hooper report recognises the importance of this - there must be an effective way to label all images so that the information has a chance of sticking to the image.

For the general image user, things are just as frustrating. If you find an image on the internet you want to use - whether for commercial use, for education, for a powerpoint - you will have a hard time locating the rightsholder. You can use image recognition - TinEye and Google's search by image - but what you often get are pages of results showing where images have been used, passed from site to site without a clear licence. How then to find the legal place to licence or use an image?

At the IPTC conference at CEPIC in London this year (see my last blog), we raised these issues. The position I took, in my paper 'Orphan Works and Image Licensing' for the ARROW Plus project for CEPIC, was that to be sure of identifying the source of an image you need a verifiable identifier embedded in the image so that rights information can be resolved (by a url, or by a registry), and some kind of visual icon to tell you that such information exists. Visual recognition can help identify uses of the image, and digital fingerprinting can embed a resolvable identifier into the pixel structure which is harder to remove than data embedded in XMP fields. (see presentation Avoiding Orphans)

Although image libraries have identifiers for images (picture numbers) which are unique to their organisation, what's missing is global uniqueness, which can be delivered by properly accredited registries. Other industries are ahead of us there: books have ISBN numbers and the music and AV industries have their own standard identifiers.

It's often assumed that if there is a standard numbering system, everyone has to change the way they work, renumber their images, change their database. This would cause chaos and confusion and put hard pressed businesses out of business. But there has been general recognition by people working on identifiers and registries that there is no value in making people change the way they work. The value we can add for the future is in linking what people are doing, and translating their efforts so that data becomes interoperable and useful in a wider sphere.

Verifiable globally unique data can be added to images by a registry system. The IPTC schema has already made provision for registry data to sit alongside a picture library's or photographer's own numbering reference.

PLUS has thought all this through. I won't go into how it works here, except to say that like all workflow developments in the future, it will be down to the software people to automate workflows to make it easy for users. The thinking behind the PLUS schema is sound, and anyone who is serious about the future of the imaging business is advised to engage with the ideas they have developed.

Other problems for images? How can they be connected with other media so that people can license across media types? CEPIC is part of a proposed European project, the RDI or Rights Data Integration project, which is being led by the Linked Content Coalition (LCC) and includes partners from various media sectors. (It is expected that the project will be approved by the EU.) The broad idea is to set up a technical framework for allowing different rights schemas to talk to each other. Again, this is not a matter of trying to squeeze all sectors into the same way of working, but rather a way of setting up a mapping process which can be automated, so that systems can speak to each other with a defined vocabulary at the core. It's a little like Esperanto, which was designed to be a bridge between languages, to facilitate communication (and peace) between nations. As with language translation, we know that the data mapping process can never be perfect, just good enough. CEPIC will be a content exchange partner, testing the concept of a European Content Exchange for visual works.

The Hooper Report reflects the fact that those involved have listened carefully to our industry (as well as to all the others) and have not fallen into the trap of treating all sectors in the same way. The fascinating part of it is that the work leading up to the report has actually facilitated cross sector engagement and made people in the creative industries think ahead.

So here are the main points relevant to the image industry:

Metadata stripping
The report recognises the problem of metadata stripping (P15 of the report). This is significant. There is a call for image using organisations to develop a voluntary code of practice committing not to strip metadata from images and not to use images from which metadata has been stripped. Further, it asks that software developers work with the image industry to find a solution enabling images to retain metadata when posted to the web. (This is technically not difficult. The will of the software purchasers is a key element in shifting software into the modern age.) And the report recommends that the government work with our industry as far as practicable to help find a solution to metadata stripping.
We can all be rightly sceptical of voluntary codes of conduct, but this is very good news. A great boost for the IPTC and the EMM (embedded metadata manifesto)!

Social networking sites
The report mentions social networking sites like YouTube and Flickr (P73) and the need for capturing data as content enters these systems. The IPTC has been investigating metadata stripping in social networking sites, and we see this as a significant barrier to copyright protection. The fact that the Hooper report recognises the importance of copyright for pros and non pros alike is very welcome indeed.

Linked Content Coalition (LCC)
The work of the LCC is explicitly supported by the report. This is good for data interoperability, and for the proposed RDI European project in which CEPIC will play a part.

The Digital Copyright Hub
The hub is conceived not only as a DCE (Digital Content Exchange) although it will function as such in some areas. Rather it is a way of linking different exchanges and registries. Bringing together the best of technology already being developed, the Hub will be a not for profit entity, drawing on the experience of the Copyright Clearance Centre in the US, the Linked Content Coalition in the UK, and the Technology Strategy Board's recent work on a Digital Licensing Framework (DLF) - a project in which the National Maritime Museum, Tate Images, V&A and Pearson have played a part.
The Hub, it would seem, could be all things to all people - a linking mechanism for DCEs (read image libraries in our sector), a marketplace for photographers, a place where rights can be queried across media types, an orphan works registry, and part of a diligent search. The problems are more likely to lie in governance and technology creep (technology on the ground not keeping up with opportunities) than in the technology itself.

This is the opportunity that the Hooper report has given the creative industries in the UK - to be at the forefront of linking technologies and user friendly rights exchange hubs. The detail to be grappled with is staggering, but someone had to get outside the media silos and understand the opportunities technology is offering, and that effort could present a turning point for creative industries.

Perhaps we are all a little light headed in the UK at the moment (those of us who haven't escaped abroad). After Danny Boyle planted a flag for creativity and technology at the Olympics opening we all felt a stirring of pride for what we can do. After Wiggins and our various gold medals our feet have perhaps left the ground. But it's hard not to get excited when a government commissioned report backs the out-front ideas we have been playing with over the last few years. It's good to be heard, and now the hard work begins.

Ros Lynch will have her work cut out for her in the first year of spawning the industry funded Digital Hub. There will almost certainly be issues for smaller and poorer industries playing with the big boys. But for now we can be glad that Richard Hooper and Ros Lynch have so intelligently pulled together the ideas and aspirations of so many differing interests and come up with a roadmap for copyright protection and licensing activity which, in its outline, makes sense to so many people.

Sarah Saunders
4 August 2012

Monday 11 June 2012

Find the Rights - IPTC Conference 2012

It was great to have the IPTC Photo Metadata Conference in London this year. Metadata has moved now to a central position in the image licensing industry. Our topic covered two aspects of search - finding the image and finding the rightsholders - both essential for the future of the image business.
The full agenda for the conference is shown on the IPTC web site, and I report here on the sessions I was closely involved in. Overall, the conference was a great success, with excellent speakers and very good feedback from attendees. Image search was a major topic, with sessions chaired by David Riecks including reports on visual search techniques, controlled vocabularies, crowd sourcing, linked data and indexing images for web search.

In the morning session 'Search  - finding the rights', which was moderated by Linda Royles, I gave an overview of ways of finding and protecting image rights.  This encapsulated the thinking in my paper 'Orphan Works and Image Licensing' written for CEPIC as part of the ARROW project. The presentation (not long and very visual) can be downloaded here. The main points I set out were along these lines:

1. Positive identification not orphan grabs
Positive identification is the opposite of the orphan works database concept. It encourages people to use images which have permissions, not those which don't. Methods include embedded metadata, unique identifiers linked to registries, visual search methods, and digital fingerprinting.

Orphan works databases may be needed in the heritage sector to enable institutions to digitise older orphaned works in their collections, but this is not a solution for currently circulating digital files which may have been orphaned because technology and behaviour have not yet caught up with the medium.

2. Labels to help image users
People need easy access to images they can legally license or use. For this, we need identifiable icons on thumbnails and easy click through to source web sites. (see PicScout Image Exchange)

3. Copyright protection for all works
If copyright protection is eroded for images belonging to the general public, that will impact the creative industries as well. The guiding principle should be that permission is required for use of any image or artwork. It's not good enough to just use any old image because there's no information.  Creative Commons licences will become increasingly important, especially for non professional photographers (and remember we are all photographers now!) but the term 'commercial use' needs to be properly defined, so that image licensing opportunities are not subsumed by the term.

4. We need registries and content exchanges
An image with no metadata should not be released from copyright, but we do need to change the way we work to secure digital images for the future. Digital content exchanges and registries will probably be needed to secure the rights of creators and to promote licensing and easy access for users. Identifiers require registration bodies in order to be authenticated,  global identifiers require a global network, and users need easy access via a user interface.

Content exchanges consist of access to identified content, rights information and delivery mechanisms. They are being promoted as a way of finding rightsholders in a cross media environment. In the UK, the Intellectual Property Office (IPO) will be reporting soon on its consultation about a UK Digital Content Exchange (DCE).

Image libraries are in fact already content exchanges. In some ways they are ahead of other industries. Where the image library industry needs development is in the area of unique identifiers, so that images can be properly tracked in the web environment. The PLUS registry answers many of these requirements including that of a network of registries to form a global registry.

Other contributors in morning rights session
Antoinette Graves from the Intellectual Property Office (IPO) outlined the framework for the IPO consultation on copyright, remarking that the major area of concern is the heritage sector, where digitisation projects are held up by orphan works and the resulting legal uncertainties. The consultation is ongoing and results will be published later this year.

Mark Bide from EDItEUR argued that investment in content made by creators is now lagging behind the returns gained in the internet environment. It now seems easier to profit from other people's creativity, and it is important to turn the tide and make investment in content pay for the creator, as it should. The answer to the machine, he argued, is in the machine. Technology and automation can bring about a revolution in licensing while preserving copyright and benefiting content creators in all media sectors.

Nancy Wolff from New York law firm CDAS warned that orphan works legislation is coming. She believes it is critically important for the picture industry to get to grips with issues and solutions now.

Afternoon Session on Orphan Works  
I chaired a break out session which pulled together some of the topics raised in the morning, and allowed for discussion between the people involved.


Offir Gutelzon from PicScout demonstrated how the PicScout registry provides tracking and identification services for image suppliers and help for image users to access licensable images from an internet image search.

Cathy Aron, Executive Director of image library association PACA, described how associations in the US are combining to reach agreement on issues relating to orphan works, in the expectation that orphan works legislation (which failed to enter the statute books in the US last time round) will eventually be passed. Issues like diligent search, the need for registries and the concept of safe harbour for cultural institutions are on the table, and it was agreed that CEPIC and PACA should stay in touch on orphan works related issues.

Paola Mazzucchi from EU project ARROW demonstrated how the system works to link and access data about rightsholders relating to orphaned books. She discussed the difficulties in accessing rights data about images which are not credited, and talked about the feasibility study currently underway at ARROW which will look more closely at the ease of finding image rightsholders.

With legislation pending on both sides of the Atlantic, the urgency for the image business to find orphan works solutions was stressed on all sides - it was recognised that this is a global problem which will need global solutions. An important step would be to find common ground on the essential elements of a diligent search.