Friday, 21 May 2010

Metadata and the Future of News

IPTC Business Meets Technology day, Paris, April 2010

The IPTC evolved to represent the needs of the news business, and much of the Spring Conference this year revolved around the adoption of the IPTC's G2 standards for the news industry. The IPTC's Photometadata Working Group, which focuses on visual content, has responded to the needs of the photographic industry to make image search more focussed. The Controlled Vocabulary project was presented at the conference for the first time.

G2, which provides an XML-based standard for the exchange of news items, is well supported by large agencies, despite the fact that some of their customers are working with old systems and are not ready to change.

The news industry is looking to a future where news gathering is networked and syndicated, where news items are linked across media, and where outlets for news include mobile phones, social networking sites, and more personalised news delivery.

New business models involve aggregating content across publishers in the same way agencies aggregate content from their suppliers. Planning is needed to fund inter-organisational processes. The suppliers within a network can then focus on their core business.

The networking and syndicating of news require interoperable metadata. There is a growing realisation that some sources of data, for example on events or on personalities and entities, are best shared. In the new world, it is argued, it is pointless for everyone to gather the same data. This is where linked data comes in.

Fran Alexander from the BBC highlighted the need for different sets of metadata at different stages of the workflow and in different departments of a large organisation like the BBC. Rather than insisting on standardisation across the organisation, she believes that mapping techniques produce better results, with standards arising as departments increasingly work together.
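
As a rough illustration of the mapping approach, the sketch below translates a record from one department's field names to another's without forcing either to change. The field names, the sample record and the `translate` helper are all invented for this example and do not come from the BBC.

```python
# Hypothetical field names for two departments (not real BBC schemas).
NEWSROOM_TO_ARCHIVE = {
    "headline": "title",
    "byline": "creator",
    "slug": "identifier",
}


def translate(record, mapping):
    """Rename the fields of `record` using `mapping`.

    Returns a pair: the translated record, and any fields that have
    no mapping yet and so need attention from the two departments.
    """
    translated, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            translated[mapping[field]] = value
        else:
            unmapped[field] = value
    return translated, unmapped


# Invented sample record from the newsroom side.
newsroom_record = {
    "headline": "Metadata and the Future of News",
    "byline": "A. Reporter",
    "desk": "technology",
}

archive_record, leftovers = translate(newsroom_record, NEWSROOM_TO_ARCHIVE)
```

Fields that end up in `leftovers` are the places where the departments have not yet agreed a correspondence, which is where a shared standard might eventually arise.
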

The Linked Data community, which has emerged from the ideas around the Semantic Web, promotes the idea of sources of data scattered throughout the web, each identified by a unique web address called a URI. These web addresses can link to other URIs to expand the available information on a subject or entity.

The principle is that by linking data sets on the web, the requirement to reproduce the same work of data gathering is reduced, and more data becomes available by a process of organic growth.

The URIs are simply web addresses for lists of information about an entity (a person, an organisation, a building, anything that has a name and is unique). That information could include details of date of birth, height, weight, schooling, job history, or any facts which can (or could theoretically) be checked.
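
The idea of URIs pointing to facts, and to other URIs, can be sketched in a few lines of Python. This is a toy model only: the `example.org` addresses, the entities and the `follow_links` helper are invented for illustration, and a real linked data source would be fetched over the web rather than held in a dictionary.

```python
# Toy "web" of linked data: each URI maps to facts about one entity.
# All URIs and facts below are invented for illustration.
DATA = {
    "http://example.org/id/person/jane-doe": {
        "name": "Jane Doe",
        "employer": "http://example.org/id/org/acme-news",
    },
    "http://example.org/id/org/acme-news": {
        "name": "Acme News",
        "based_in": "http://example.org/id/place/paris",
    },
    "http://example.org/id/place/paris": {
        "name": "Paris",
    },
}


def is_uri(value):
    """Treat any string starting with http:// as a link to another entity."""
    return isinstance(value, str) and value.startswith("http://")


def follow_links(start, data, max_hops=2):
    """Collect the facts reachable from `start` in at most `max_hops` links."""
    seen = {}
    frontier = [start]
    for _ in range(max_hops + 1):
        next_frontier = []
        for uri in frontier:
            if uri in seen or uri not in data:
                continue
            seen[uri] = data[uri]
            # Any property whose value is itself a URI is a link to follow.
            next_frontier.extend(v for v in data[uri].values() if is_uri(v))
        frontier = next_frontier
    return seen


# Start from a person and follow one link: we also learn about the employer.
facts = follow_links("http://example.org/id/person/jane-doe", DATA, max_hops=1)
```

Raising `max_hops` pulls in more distant entities, which is the "organic growth" of available information that linking is meant to provide.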

Someone searching for information about Barack Obama, for example, might find information on his first job after college, which may reveal other facts about him that are of interest.

UK company Talis demonstrated its own platform, which is designed to build linked data. The Talis architecture is used by Government departments and other clients to pull together and link sources of information scattered around the country.

Talis suggested that the IPTC could be an ideal linking hub, holding trusted data which can be linked to other URIs and other hubs on the web.

The Okkam Project is a European-funded project which has already assigned 7.5 million unique identities for entities and has created 16 applications for their use. Okkam is running a pilot project with news agency ANSA, which allows the agency to create an enriched newsfeed using NewsML standards that can hold links to other sources of information. Okkam works on the principle of a 14th-century philosopher who said we should not 'multiply entities beyond necessity'.

The Okkam system is designed to be an open, neutral system, which will be run by the Okkam Trust once the EU project is over. The Trust would be funded by money from commercial applications developed by the project, and is keen for stakeholders like the IPTC to join the board of the Trust.

Okkam aims to fix reference names for entities, but recognises it is not the only company to do so. The aim is to create permanent identifiers, which is where it differs from projects like Open Calais, which is about entity extraction.

Underlying all this are considerations about the trustworthiness of data sets. The web site www.sameas.org finds entities which are supposedly the same, but not all the results are seen as sound. How rigorous the test for sameness needs to be will depend on its use.
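
One way to picture a test for sameness whose rigour depends on use is to attach a confidence score to each "same as" claim and merge identifiers only when the score clears a threshold chosen by the application. The sketch below does this with a small union-find structure. The URIs, the confidence values and the `merge_identities` helper are all hypothetical; nothing here reflects how www.sameas.org or Okkam actually work.

```python
# Hypothetical "same as" claims: (uri_a, uri_b, confidence from 0 to 1).
CLAIMS = [
    ("http://example.org/a/paris", "http://example.net/b/paris-france", 0.95),
    ("http://example.net/b/paris-france", "http://example.com/c/paris", 0.90),
    ("http://example.com/c/paris", "http://example.org/a/paris-texas", 0.40),
]


def merge_identities(claims, threshold):
    """Group URIs linked by claims whose confidence is >= threshold."""
    parent = {}

    def find(uri):
        # Register unseen URIs as their own group, then walk to the root.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for uri_a, uri_b, confidence in claims:
        root_a, root_b = find(uri_a), find(uri_b)
        if confidence >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


# A strict application keeps the doubtful match separate...
strict = merge_identities(CLAIMS, threshold=0.8)
# ...while a permissive one merges everything into a single entity.
loose = merge_identities(CLAIMS, threshold=0.3)
```

With the strict threshold the weakly supported claim is ignored and two entities remain; with the loose threshold all four identifiers collapse into one, which shows how an unsound claim can contaminate a whole group.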

The IPTC Photometadata Working Group's Controlled Vocabulary project was presented by Sarah Saunders from Electric Lane. She discussed the ambiguities inherent in keyword searches for images and demonstrated the use of keyword predicates, a new feature of the proposed controlled vocabulary which helps to reduce those ambiguities. An example is the word orange, which can be used as a descriptive word or as an object word. The word Paris can be used to describe the location of an image, a person (Paris Hilton) or a view of Paris.

The new IPTC keyword predicates will separate the differing uses of a word, and avoid unwanted search results. The draft vocabulary and the ideas behind it will be presented at the Photometadata Conference in Dublin in June.
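
A minimal sketch of how predicates could narrow a search is given below. The predicate names (`depicts`, `colour`, `location`, `person`), the image identifiers and the `search` helper are invented for illustration and are not taken from the IPTC draft vocabulary.

```python
from collections import namedtuple

# A keyword qualified by a predicate that says how the word applies to
# the image. Predicate names are assumptions, not the IPTC draft's terms.
Keyword = namedtuple("Keyword", ["predicate", "term"])

# Invented sample catalogue.
IMAGES = {
    "img-001": {Keyword("depicts", "orange"), Keyword("location", "paris")},
    "img-002": {Keyword("colour", "orange"), Keyword("depicts", "sunset")},
    "img-003": {Keyword("person", "paris")},
    "img-004": {Keyword("depicts", "paris")},
}


def search(images, term, predicate=None):
    """Return ids of images tagged with `term`, optionally under one predicate."""
    term = term.lower()
    return sorted(
        image_id
        for image_id, keywords in images.items()
        if any(
            k.term == term and (predicate is None or k.predicate == predicate)
            for k in keywords
        )
    )


# A plain keyword search mixes every sense of the word.
all_paris = search(IMAGES, "Paris")
# Adding a predicate keeps only images taken in Paris.
shot_in_paris = search(IMAGES, "Paris", predicate="location")
# Likewise, orange as a colour is separated from orange as an object.
orange_coloured = search(IMAGES, "orange", predicate="colour")
```

Here the unqualified search for Paris returns three images, while the search qualified by `location` returns only the one actually taken there.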

More information from the Spring Conference is available in the IPTC Mirror.
Sarah Saunders
April 2010
