Tuesday, 17 September 2013

Heritage data exchange - why we need embedded data

Working with heritage images is hugely rewarding. Not only is it a pleasure to play a part in releasing images of beautiful and interesting artworks and museum objects to the public, but there are also interesting challenges not faced in other sectors. Heritage images need data fields to describe both the object or artwork depicted and the image of that object. The organisations producing and distributing these digital images need a workflow that encompasses public access and commercial uses of the image, and the needs of the very different departments through which an object and its image pass.

Objects are looked after by curatorial departments, and the data about them is stored in collections management systems. They are photographed by freelancers or photography departments and the images are passed to cataloguing departments, for use on the web site, for educational use, and for commercial sales in the picture library. For historical reasons, most workflows haven't yet been joined up, and that's another challenge we face with the data.

Metadata is the driver for images as they pass through the workflow. Without an identifying number, for example, an image cannot be processed using current systems and may as well be thrown into the digital equivalent of a super sewer.

There are two groups of people thinking about how pictures are transferred from one place to another - what we call interoperability. There are the picture professionals who create images and handle them on their way to publication, and there are the cataloguers and library professionals who handle much of the specialist data about the objects depicted in the images.

The way metadata is exchanged is reflected in the make-up of these groups. Library and museum professionals are increasingly looking at schemas like LIDO, which comes from the museums community, is based on XML, and is event based, machine readable, granular, and able to hold a deep level of complexity. LIDO is an exciting development for data exchange, but that information is not accessible to people looking at an image which has escaped from its database environment. It is not embedded in the image.

Understanding XML is something of an IT or library skill. Every XML schema has its own set of rules, and to work with the data, or to make it human readable, you need to use an XSLT transformation file which is a template for handling the XML data. Data expressed in XML is not very accessible to those who are not used to it. 
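
As a rough illustration of what "making XML human readable" involves, here is a minimal Python sketch. It uses the standard-library XML parser and a small table of labels in place of an XSLT file, and the element names in the sample record are invented for the example; they are not taken from LIDO or VRA Core.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these element names are made up for illustration
# and do not come from LIDO, VRA Core or any other real schema.
RECORD_XML = """
<work>
  <title>View of a Harbour</title>
  <agent>Unknown artist</agent>
  <date>circa 1650</date>
  <material>oil on canvas</material>
</work>
"""

# Friendly labels for each element; this table plays the role of the template.
LABELS = {
    "title": "Title",
    "agent": "Artist",
    "date": "Date",
    "material": "Material",
}


def to_readable(xml_text):
    """Turn the XML record into plain 'Label: value' lines."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for child in root:
        label = LABELS.get(child.tag, child.tag)
        value = (child.text or "").strip()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


print(to_readable(RECORD_XML))
```

The point is simply that someone has to write and maintain the mapping from tags to labels before a non-specialist can read the record.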

Below is a user-friendly display of some data in the VRA Core schema and its corresponding XML.

For embedded image metadata to be truly useful it must be readable by commonly used media software. Two things are necessary to make this possible: agreed-upon technical standards for reading and writing the metadata, and agreed-upon schemas for giving it structure and meaning. The technical standards have evolved to the customizable, open-source Adobe XMP standard. Because XMP uses the extensible RDF/XML standard, it allows the use of rich metadata schemas such as IPTC, the most widely used schema for describing the content and intellectual property rights of an image. The reliable portability of XMP means all IPTC fields can be read by Photoshop and Bridge and other media handling software, and the most essential of the fields can be accessed by anyone using their computer's file browser (see Screengrab 1 below).
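
To give a feel for what an XMP packet looks like, here is a small Python sketch that reads two fields from one. The packet is a hand-written example held in a string rather than extracted from a real image file, and the field values are invented. It assumes the usual XMP layout in which the description and creator are stored as Dublin Core properties inside RDF containers.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the packet below.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written example packet; the values are invented for illustration.
XMP_PACKET = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Seq><rdf:li>Example Photographer</rdf:li></rdf:Seq>
      </dc:creator>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Oil painting, inventory no. EX.1234</rdf:li>
        </rdf:Alt>
      </dc:description>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""


def read_xmp(packet):
    """Return the creators and default description found in an XMP packet."""
    root = ET.fromstring(packet.strip())
    creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
    description = root.findtext(
        ".//dc:description/rdf:Alt/rdf:li", default="", namespaces=NS
    )
    return {"creators": creators, "description": description}


print(read_xmp(XMP_PACKET))
```

In a real workflow the packet would be read from the image file by a metadata tool; the sketch only shows that, once extracted, the data is ordinary namespaced XML.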

 Screengrab 1

Data which is embedded is attached to the image file, and that is a distinct advantage in a number of situations. The IPTC Extension schema has fields to describe artworks and objects in the image, and the IPTC is considering adding some additional key fields for general heritage use (e.g. circa date).

For data exchange within and between museums and galleries, Greg Reser (Chair of the VRA Embedded Metadata Group) and I are proposing a richer set of data fields called SCREM to describe the image and the artwork in a more specialist context.

Here are some examples of how embedded metadata can be used.

The Photographer

The photographer of an artwork embeds the inventory number into the image, which is passed on to the curators and cataloguers for further data enrichment. The photographer may also embed their own name, the job number, the name of the artist and other details available at this stage, e.g. the inscription on the back of an artwork. The inventory number is picked up by the cataloguing software, or it can be exported by the photographer into a spreadsheet along with the (hopefully unique) filename. But the inventory number is not separated from the image file whatever happens to the spreadsheet, and the image is uniquely linked to the object.
Data entered in IPTC Core and Extension. SCREM will add useful fields.

The Teacher
The collections management team embeds key information about the artist, the artwork, and the artwork rights in images sent for upload to the museum web site. A teacher looks at images on the museum site and, using right-click, downloads an image onto the desktop. The key data about the artist, the artwork, the museum, the dates, the material and size of the painting are embedded in the image. Much of this data can be viewed in the teacher's file browser (e.g. Windows Explorer) and used for teaching. The teacher may also decide to use a utility which reveals the data in the notes section of PowerPoint for use in lectures (see Screengrab 2).
Data entered in IPTC Core and Extension.

Screengrab 2

...or might want to display the information in online notes for students (Screengrab 3).

Screengrab 3 

Museum Loans
A museum receives images of artworks loaned by another museum for a special exhibition. The images are to be used in the catalogue and on the web site. Sometimes IPTC data is included in the image files. This is useful as it can be imported into the museum DAM system, but IPTC Extension isn't as extensive as the museum may like, and so some fields have to be jammed together into one and then separated after ingest into the DAM. This is not ideal, and these users would love to have a way of receiving a richer set of embedded data in a standard way.
Data entered in IPTC Core and Extension. SCREM will allow for more granular specialist data exchange.
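
To show why jamming fields together is fragile, here is a minimal Python sketch of the "separate after ingest" step. The "Label: value; Label: value" convention and the field names are assumptions made up for the example; in practice each pair of institutions has to agree its own convention.

```python
def split_combined_field(text, sep=";"):
    """Split a 'Label: value; Label: value' string into separate fields.

    The convention is hypothetical. Parts without a colon are skipped,
    and a value that itself contains the separator will be split wrongly,
    which is exactly the weakness of this workaround.
    """
    fields = {}
    for part in text.split(sep):
        if ":" not in part:
            continue
        label, value = part.split(":", 1)
        fields[label.strip()] = value.strip()
    return fields


combined = "Medium: oil on canvas; Dimensions: 40 x 60 cm; Period: Baroque"
print(split_combined_field(combined))
```

Separate, agreed fields would make this kind of ad hoc parsing unnecessary.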

Embedded data can play an important and practical role in data exchange. It can be read by people using commonly available image management software and some fields can be read by anyone with a computer. What's not to like?

We don’t see this as an either/or between embedded and non-embedded methods of data exchange; data management is developing all the time. But people working now are finding uses for heritage data embedded in the image file, and we are proposing a sub-set of existing fields (so as not to reinvent the wheel) which can be carried in the image file and supported by a custom XMP panel, so data entered by one party is read at the receiving end in the same way.

The IPTC expanded set of heritage fields will include those commonly needed by general users for describing and identifying the artwork. The SCREM set of fields will be richer, and will be useful for imaging and cataloguing professionals in heritage organisations. SCREM can be associated with a custom XMP panel which makes the data easily visible in Bridge or Photoshop.

We have issued a survey Expanding Cultural Heritage Metadata to gain feedback on our approach to this. Do take part!

With thanks to Greg Reser, a metadata expert and Chair of the VRA (Visual Resources Association) Embedded Metadata Group. We are working together to improve life for people working with heritage data.

Friday, 16 August 2013

Heritage Metadata at IPTC

Heritage is a fast-growing sector, with millions of images yet to be digitised, catalogued and discovered by the public. In June the IPTC Photometadata Conference 'Image Rights: manage them - or lose them' in Barcelona held an afternoon session, 'Metadata of Cultural Heritage images', where we discussed the role of metadata in the heritage workflow and the need for data embedded in the image file.

I wrote about the need for attribution of heritage digital images in my previous blog 'Museums - How important is Attribution?'. Greg Reser from the University of California San Diego underlined the point again in his presentation at the conference. Greg is a metadata expert and among other things is Chair of the VRA (Visual Resources Association) Embedded Metadata Group. His presentation, downloadable here, showed what happens to the well-documented web image on a heritage site when the image is downloaded by a user. All too often, the image on the user's desktop is devoid of any documentation. Why? Because there is no metadata embedded in the image.

Greg set out to show us how embedded data can help ordinary users do their jobs in the education and heritage sectors. He was keen to emphasise that use of embedded metadata is not restricted to professionals working with Photoshop and Bridge. He showed that key information can be displayed by the file browser on both PC and Mac. He also showed an example of how useful embedded metadata can be to people accessing images, demonstrating a utility developed by the VRA to bring embedded metadata into the notes section of PowerPoint so teachers can access the information as they show the slides to students. The data can be uploaded into a class web gallery as well, for reference by students. The opportunities are endless. Although not all metadata fields are currently displayed in file browsers, selected information including the Dublin Core Description can be displayed.

The next part of the conference detailed our effort to bring more heritage fields to the IPTC schema. The IPTC schema originated in the news industry and became the standard for embedded metadata in the broader image production, distribution and licensing industries. Until IPTC Extension was created in 2007, there were no specific fields for heritage objects or artworks. The need for a separate set of fields was demonstrated by this set of metadata found online some years back.

This is an understandable mismapping if the creator of the photograph and the creator of the artwork are listed in the same field. The Artwork and Object fields in IPTC Extension were created to avoid this kind of semantic muddle, and went some way towards making it possible to transfer data about heritage objects embedded in the image file. Since then, in the course of working with heritage metadata, we have identified gaps in the IPTC schema for heritage fields, and we are now running a project to add new fields at the next update of IPTC.

There are some immediate and obvious gaps. The date field is one. The IPTC Extension Date Created field for artworks is a date-formatted field. This doesn't encompass the uncertainty in heritage dates: circa dates and spans of time. So a text-formatted date field seems an obvious addition. Material (or Technique or Medium) is another, as is Style or Period. Then there is the need to link to other sources of information about an artwork on the web, so URL fields for the artwork are also on the list.
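
To illustrate why a strict date format falls short, here is a small Python sketch that reads a free-text heritage date into an earliest year, a latest year and an "approximate" flag. The accepted spellings ("circa", "ca.", "c.") and the year-range pattern are assumptions chosen for the example, not part of any IPTC proposal.

```python
import re


def parse_heritage_date(text):
    """Return (earliest_year, latest_year, is_approximate), or None.

    Handles a single year or a span of years, optionally prefixed with
    'circa', 'ca', 'ca.' or 'c.'. Anything else is left unparsed.
    """
    cleaned = text.strip().lower()
    approximate = False

    # Strip a leading 'circa' marker if there is one.
    prefix = re.match(r"^(circa|ca\.?|c\.)\s*", cleaned)
    if prefix:
        approximate = True
        cleaned = cleaned[prefix.end():]

    # A single year, or two years joined by a hyphen or en dash.
    years = re.fullmatch(r"(\d{3,4})(?:\s*[-–]\s*(\d{3,4}))?", cleaned)
    if not years:
        return None

    start = int(years.group(1))
    end = int(years.group(2)) if years.group(2) else start
    return (start, end, approximate)


for sample in ["circa 1650", "1650-1660", "c. 1500-1510", "early 17th century"]:
    print(sample, "->", parse_heritage_date(sample))
```

None of these values fits a single calendar date, which is why a text field (perhaps alongside a machine-readable range) is needed.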

We have produced a set of possible candidates for the IPTC schema which we are putting out to consultation in the next few weeks. But how far should we go in adding detailed heritage fields to the IPTC?

Generally speaking, the direction of travel should be towards granular, unambiguous data. The objections raised when I first started looking at IPTC, namely that no-one would use so many fields, are overridden now by the increasing automation of data transfer. It is down to the technical people to create selective user interfaces for appropriate sectors, but semantic precision is all-important for interoperability.

Nevertheless, in looking at the richness of heritage data we realised that the project could become unwieldy for IPTC and some criteria have to be set.

We looked at how embedded data is used in the heritage sector: for transfer of data within and between institutions (Institutional Use), and for display to the public on the desktop and on the web (Public Display). Perhaps the IPTC schema, which is viewable by anyone using the latest Photoshop and Bridge software, is suited to display of data once an image has left the institutional environment, and the VRA schema (and other heritage schemas) can be used for the detailed data used by heritage institutions (and viewable by means of a custom XMP panel), with some fields common to both schemas. See the Metadata Deluxe site for the beta version of the current VRA custom XMP panel.

Here is the list of initial criteria we have set out for adding fields to the IPTC schema:

YES  for IPTC if 
  • about the digital image
  • needed to identify source, creator and copyright of artwork
  • links to more granular data on the web
  • descriptive data useful for end user
  • useful for image retrieval
  • can be mapped from other schemas

NO for IPTC if
  • about movement, condition, exhibition of work, except where needed for accreditation
  • specific to institution holding the work and not needed by outside users
  • detailed provenance (former owners etc)
  • about monetary value or insurance of work

We are looking at creating an overall schema for embedded metadata for heritage, to cover both institutional and public access uses. For now we have called it SCREM for Heritage Media Files (SChema for Rich Embedded Metadata for Heritage Media Files).

If you work with heritage images see the presentation I made to the IPTC conference and give us your feedback on the candidate fields. A link showing candidate fields will be up shortly for your responses.

Friday, 17 May 2013

Museums - how important is attribution?

We've got so used to thinking about metadata as a protection for copyright we sometimes forget how useful it is in other ways. It is a sad fact that the vast majority of images on museum web sites contain no metadata if the images are downloaded. Right-click/download is such an easy way of sourcing images for personal use, for example in education. But without embedded metadata the images float on the desktop without any information about the museum or organisation they came from. How irritating for the user and for the museum providing the resource.
Despite years of campaigning with the IPTC Photometadata Working Group and the Embedded Metadata Manifesto, I see things moving at a snail's pace. Even preview images downloaded from heritage image libraries often contain no metadata.
Meanwhile the orphan works bandwagon rolls on...
Who do we blame for this? Let's put it this way: the only people who can make a real difference are the software developers who create web sites. Generally speaking these people, who have great skills in software and coding, are not very image-aware. It was never in their job description. Now imagery plays such a large role on the web it’s time for a wake-up call. The metadata may add a tiny bit to the image size, but in my view photographers, image libraries and yes, main museum departments, can hardly be said to be doing sensible business if their images are circulating on the web without so much as a single line of attribution.
Who will make the software people play ball with metadata? This means ENABLING METADATA EMBEDDING and NOT STRIPPING METADATA when images are uploaded to the web. People working in a heritage environment need to speak up. If a software developer says ‘Ah, this isn’t on my development list’, perhaps the answer is ‘It should be. Metadata embedding is essential to our business, and we are essential to yours.’ Stand up for yourselves!
Interest in metadata workflow is growing, with collections management and image library departments starting to talk to each other. Attribution will be one of the topics at the heritage session at the IPTC Conference at CEPIC in Barcelona in June. Topics we will discuss are:
  • Embedded metadata and attribution of museum images
  • IPTC Fields and the Cultural Heritage Sector (what new fields are needed?)
If you are interested in these things get in touch...


Tuesday, 26 March 2013

Machine readable rights - a new world of frictionless licensing?

There was a distinct buzz as representatives from the news industry, image libraries, publishing, software, photography and the legal profession gathered in Amsterdam for 'Machine Readable Rights and the News Industry: Opportunities, Standards and Challenges', which formed part of the IPTC Spring Conference. The pot at the end of the rainbow is 'frictionless licensing', a vision of licensing transactions on a seamless automated workflow between content owner, publisher and end user.

Although the IPTC is rooted in the news industry, its scope is broader, and the issues discussed at this conference were widely seen as critical for all sectors of the creative world, and touch on issues of copyright, business viability and the future of all content licensing across media sectors and boundaries.

As an organisation representing the image licensing industry CEPIC is working to secure future business for image creators and image libraries. CEPIC’s participation in the European RDI (Rights Data Interchange) project was described by Sylvie Fodor. She outlined the need for unique identifiers for images and for methods of reuniting images with their rightsholders in an environment where the proliferation of so-called orphan works is threatening the very business of image licensing. The RDI will provide a test bed for CEPIC as a registry allocating unique identifiers for images, and as an exchange, where users are directed to licensing outlets for images. Partners in the project are Getty Images/PicScout, AGE/THP, Album and PLUS.

Eugene Mopsik from the US photographers' association ASMP also called for unique identifiers. It is easier to right-click and download than to license an image, he said, and called for a professional attitude on the part of image providers so that rights data is supplied along with the images. It is no longer good enough, he said, to wait for misuse and then litigate. Supporting the use of the PLUS standards, he reminded the conference that a PicScout survey found 80% of web uses were unauthorised. The logic here is: make it easier for users to do the right thing and license the image. Mopsik supports the use of the PLUS registry as part of the solution for orphan works, and indicated that linking photographers direct to a PLUS registry is a 'no brainer' technically. The hold-ups are more political and administrative than technical.

Licensing deals between press agencies are a complex and time-consuming business, said George Galt, Associate General Counsel for Associated Press. He highlighted the fact that lawyer time could be radically reduced if there were standards in rights expression, citing an example where a three-year contract between agencies was finally signed six months after it was due to expire. There is a long list of rights expressions ripe for standardising: exclusivity, prohibitions, retained rights, payment terms and copyright are just a few. Getting the key terms standardised would go a long way to improving workflow, he indicated.

Making money from content is the key driver for standards and automation. Andrew Moger, Executive Director of News Media Coalition and formerly picture editor at the Times newspaper, said that as far as he could see everyone has worked out how to make money out of photographs, except photographers. He talked about conflicts over image rights in the sports world, citing the recent lockout of Getty Images from the Australian cricket tour of India, and the consequent ban on distributing images, in a protest supported by all agencies including those allowed at the event. The issue of exposure by the press is taken for granted, he said. Imagery is valuable to businesses of all kinds, and ways must be found to retain that value for the content producers.

Thomas Hoppner, from law firm Olswang, raised another issue which should concern all media sectors - the power of aggregators like Google to control and skew the market. The Google site has become a place where people read the news online, and content is gathered from news sites, placing the search engine giant in direct competition with news providers. Content is supposedly protected against copyright infringement by the Robots Exclusion Protocol, but in practice if the protocol is used to block search engine access to material on a news site, overall findability is severely compromised, making the site more or less invisible on the internet. Such is the power of Google. Negotiations are underway, but the balance of power is not equal, and there is a danger that the content creators and providers will lose out to the aggregators.
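
For readers unfamiliar with the Robots Exclusion Protocol, the following Python sketch shows how a crawler that honours it decides what it may fetch, using the standard library's robots.txt parser. The robots.txt content, the crawler names and the news.example URLs are all invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: one named crawler is shut out completely,
# everyone else is only kept out of the subscribers' area.
ROBOTS_TXT = """
User-agent: ExampleNewsBot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a well-behaved crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleNewsBot", "https://news.example/story.html"))
print(allowed("OtherBot", "https://news.example/story.html"))
print(allowed("OtherBot", "https://news.example/subscribers/story.html"))
```

The protocol offers little more than "allow" or "block" per crawler and path, which is why blocking an aggregator tends to mean disappearing from its search results as well.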

The Newspaper Licensing Agency, NLA, works on behalf of major newspaper groups to collect revenues from schools, cutting agencies, and publisher feeds. Faisal Shahabuddin, Head of Product and Service Management at the agency, pointed out that progress in the area of machine readable rights will depend on the bottom line. Will developments increase revenue or reduce costs, he asked, leaving the answer open.

Returning to the issue of copyright, Michael Steidl, Managing Director of IPTC, described the recent IPTC study of metadata retention in images placed on social networking sites. Millions of images are uploaded to social networking and other sites daily, and the study showed that Flickr, Twitter and Facebook currently remove key copyright data embedded in images by photographers. Google+ and Tumblr do better. The results can be found at the Embedded Metadata Manifesto site, which was set up by IPTC to campaign on retaining metadata embedded in image files. This issue of removing embedded data gets to the heart of copyright protection. The technology is there, but software developers and suppliers need to be more aware of the need to retain embedded data and, critically, to implement the technology to enable that. In a world where software development is based on user need, the IPTC wants to make both software companies and their clients aware of the need for metadata to be retained. The words 'no one is asking for it' should not be allowed to pass their lips.

John McHugh picked up the copyright issue from the point of view of a working journalist. Photographing in Afghanistan and other dangerous locations, he is acutely aware of the emotional and financial investment he makes in the images he produces. Dissemination of photography is exploding, he said, but revenues are not following, and attribution for images is thin on the ground. The IPTC metadata offers copyright protection, but when metadata is stripped the photographer is left high and dry. Sharing and stealing is far too easy, he said, and photographers cannot simply opt out of the internet. His solution is to take a step back and make it possible to watermark images direct from a mobile phone. Marksta is a new company he set up to create an app which creates custom watermarks visible on the image, and allows photographers to enter IPTC metadata direct from their mobile phones. It’s not only professional photographers who are affected by losing their accreditation and rights. Marksta is also driven by soccer mums wanting to retain control of images of their children on the internet. The broader point here is that if copyright disintegrates in the social and non-professional sphere, hanging onto it in the professional sphere will be nigh impossible if an anti-copyright, use-what-you-want-where-you-want culture wins out.

So what are the standards that can be applied to expressions of rights and how can they be used? Graham Bell, Chief Data Architect at Editeur, has been working in the field for many years. Editeur manages the ISBN standards for books and has developed a number of other standards in the publishing industry. He distinguished between machine readable rights expressions, and those which are machine actionable, the basis for automation. Editeur has developed the ONIX standards, which operate between publisher and retailer. So far there are few implementations, and this is a criticism often levelled against standards bodies - why is no-one using this standard? The answer, as those working in the standards field know well, is that it is a long game. If standards take time to develop, implementation often takes longer. The real world of commerce eventually comes to recognise the advantages of using standards, and it seemed from this conference that the time for machine readable rights is on the horizon. The people and companies who develop standards always have to work ahead of the curve.

The PLUS Coalition has produced a set of standards developed for use in the image industry. Jeff Sedlik and Ray Gauss presented the work of the organisation, which recognised the need back in 2004. The international non-profit PLUS Coalition created a matrix of rights expressions for use in the stock image industry, and a registry for images, creators, licensors and licensees, to enable rights data to be held online. This addresses the critical issue of where data is held. Data embedded in images is important, but there is wide agreement now that important data needs to be stored in the cloud. The principle is for embedded data to point to an online database or registry where it can be changed dynamically and so is always up to date. There are a number of important issues about who holds the data, how accessible it is, and who has access to what data. Both PLUS and Editeur have been engaging with these issues for a number of years, and the registries reflect that. PLUS has developed rights expressions to cover parties, industries, scenarios, licensing modes, time frames and much more. Adoption, as always, depends critically on user demand and software development. It's important at this point to note that none of the standards bodies foresees a future where there is no human interaction at all. It's more a case of let’s automate and standardise where it saves time and money and where it makes sense.
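
The principle of embedded data pointing to a registry can be sketched in a few lines of Python. Everything here is hypothetical: the identifier format, the field names and the in-memory dictionary standing in for an online registry are invented for the example and do not describe the PLUS registry or any real service.

```python
# In-memory stand-in for an online registry, keyed by image identifier.
# A real registry would be queried over the network.
REGISTRY = {
    "IMG-0001": {
        "licensor": "Example Picture Library",
        "licence_status": "rights managed",
    },
}


def resolve_rights(embedded, registry=REGISTRY):
    """Combine embedded metadata with the current registry record.

    The embedded identifier is used to look up the registry entry.
    Registry values win where both exist, because they can be updated
    after the file was distributed. If the identifier is unknown, the
    embedded data is returned unchanged.
    """
    record = registry.get(embedded.get("image_id"))
    merged = dict(embedded)
    if record is not None:
        merged.update(record)
    return merged


# Metadata as it might travel inside the file (out of date licensor).
embedded = {"image_id": "IMG-0001", "licensor": "Old Agency Name"}
print(resolve_rights(embedded))
```

The embedded identifier is what makes the lookup possible, which is why stripping metadata from files undermines registries as well.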

IPTC has been hard at work creating a rights schema for the news industry, RightsML, using the Open Digital Rights Language (ODRL). Stuart Myles, Director of Schema Standards at Associated Press and RightsML Lead at IPTC, presented the principles behind RightsML - that it is publishing specific, supports today's restrictions, and builds for the future. The news industry, he said, needs increased automation to cope with the fast-moving digital environment and sophisticated publishing relationships.

Creative Commons grasped a key aspect of licensing and rights years ago when it set up simplified rights expressions and icons for images on the internet. The user of a digital image needs to be alerted to the rights situation, and there needs to be a presumption that there are rights associated with the image. The fact that the Creative Commons baseline is that images will be shared is not at issue here; it is more important that it is clear that rights belong to the creator or copyright holder, and that licences are actively granted, not just taken for granted.

The use of Creative Commons licences has increased enormously; in 2007 there were 29.5 million works under CC licence and in 2013 there were 252.6 million. The expressions in CC REL (CC Rights Expression Language) code are legal (readable by lawyers), human understandable, and machine readable. The reason many commercial image licensing bodies do not use Creative Commons lies in the failure of the codes to sufficiently define what commercial and non-commercial uses entail. This can lead to users thinking, for example, that use of an image in a charity annual report is a non-commercial use. Anyone involved in licensing images will recoil at this understanding, and work in this area would be a prerequisite for acceptance of this schema by the professional licensing industry. This is however an important step, to bridge the gap between proponents of 'free' and those engaged in licensing activity. A definition of boundaries will be essential to happy coexistence.

Andrew Farrow, Project Director of the Linked Content Coalition (LCC), outlined the ideas behind the RDI project as an exemplary implementation of LCC. LCC aims to provide a technical infrastructure to enable interchange of rights expressions between different existing media schemas so that rights can be read and understood across all media in an automated way. The RDI project, with its 16 partners from different media sectors, starts in May 2013 and its objective is to demonstrate the range of dataflows across the media supply chain and show that the data can be interoperable.

Andreas Gebhard from Getty Images chaired the first session on the need for machine readable rights. Getty, as owner of PicScout, has long understood the need for copyright protection, and for mechanisms to link users to licensors. ImageIRC from PicScout, used with ImageExchange, allows images registered with PicScout to be located and tracked on the internet using fingerprinting technology, and links users with the means to license them.

Zhanfeng Yue from Beijing Copyright Bank Tech Co Ltd presented another form of copyright protection, the Copyright Stamp, which creates a visible and resolvable icon in the image, so easing the path for those who want to use images, and protecting the copyright of creators.

Generally, the mix of embedded machine readable data, fingerprinting technology and visual search mechanisms forms the bedrock of copyright protection in the web environment. At a time when copyright and rewards for content providers are under severe pressure, discussions at the conference showed a real desire to use available technologies to support content creators into the future.

Presentations from the conference can be found at the IPTC web site.

Watch out for the IPTC Photometadata Conference at CEPIC in Barcelona in June this year. Many of the topics raised here will be discussed further.