Monday, 25 October 2010

The known and the unknown - keywording for visibility

Why is everyone talking about keywording? People in the image industry are scratching their heads about ways to keyword their images. Now that the web is buzzing with visual material, words need to be deployed intelligently to ensure images can be found by the people who need them.

Technology offers other ways of finding images, you may say, which don't require so much human input. Visual recognition techniques do offer clever ways to look for images, but computers can only learn from the way humans keyword the images in the first place, and they are not very clever at understanding abstract concepts. How do you explain to a computer all the different ways of expressing the idea of freedom, for example? Can love only be expressed by the shape of a heart, or a smile between two people? Human interpretation is still needed, and computers are still taking baby steps at recognising 'things in the picture' like trees and tables, never mind the more abstract and subtle signifiers found in visual material. So what we are looking at, for some time to come, is human tagging of images to make them findable.

The problem anyone keywording images faces is this. Language is a wonderful, expressive tool for communication, but there are many ways to say the same thing, and words often have more than one meaning. The word I use to tag my image may not be the one used by the person searching for it. They may use the plural where I used the singular. They may use a different spelling, or a different version of a language, such as American or UK English. And then there are the requirements of multi-language searches.

Any good tagging system needs to scoop up all the variations of a word so that whatever word the searcher uses, they will find their way to the image. Words need to be uniquely defined, so you can tell the difference, for example, between orange the colour, and orange, the fruit. There may be broader terms than the one you first thought of which may be useful, so your image of a train should also appear under a search for transport.

The way to achieve consistency, and to scoop up all the appropriate terms, is to create a controlled vocabulary. With a set of preferred terms and their synonyms, the vocabulary is usually structured in a hierarchical way to include broader and narrower terms. Vocabularies for use with images have been informed by work done on text search in the library sector, but they have developed further to include concepts specific to visual material. One of the big advantages is that properly controlled vocabularies can be translated - just once - so that searches can be made in different languages.
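
To make the idea concrete, here is a minimal sketch in Python of how a controlled vocabulary might be held in software. It is an illustration only: the concept IDs, the terms and the function names are all invented for this example, and a real vocabulary would be far larger and held in a database rather than in a dictionary.

```python
# A hypothetical mini-vocabulary. Each concept has an ID, a preferred term,
# synonyms (plurals and spelling variants included) and an optional broader concept.
VOCABULARY = {
    "orange_colour": {"preferred": "orange (colour)",
                      "synonyms": ["orange", "orange color"],
                      "broader": "colour"},
    "orange_fruit": {"preferred": "orange (fruit)",
                     "synonyms": ["orange", "oranges"],
                     "broader": "fruit"},
    "train": {"preferred": "train",
              "synonyms": ["trains", "railway train"],
              "broader": "transport"},
    "transport": {"preferred": "transport",
                  "synonyms": ["transportation"],
                  "broader": None},
    "colour": {"preferred": "colour", "synonyms": ["color"], "broader": None},
    "fruit": {"preferred": "fruit", "synonyms": ["fruits"], "broader": None},
}


def lookup(word):
    """Return the IDs of every concept a search word could refer to."""
    word = word.strip().lower()
    return sorted(
        concept_id
        for concept_id, concept in VOCABULARY.items()
        if word == concept["preferred"].lower()
        or word in (synonym.lower() for synonym in concept["synonyms"])
    )


def expand(concept_id):
    """Return the concept followed by all of its broader concepts."""
    chain = []
    while concept_id is not None:
        chain.append(concept_id)
        concept_id = VOCABULARY[concept_id]["broader"]
    return chain


def matches(search_word, image_tags):
    """True if the search word leads to a tagged concept or one of its broader concepts."""
    wanted = set(lookup(search_word))
    return any(wanted & set(expand(tag)) for tag in image_tags)
```

With this in place, an image tagged only with the concept "train" is found by a search for "trains" or "transportation", while the bare word "orange" is flagged as ambiguous because lookup returns both the colour and the fruit.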

Can a single vocabulary describe the entire world, the universe, and everything in it? Yes, if it has top level terms broad enough to cover everything, a logical structure, and sufficient depth to reach down to a granular level.

How does it help in practice? The vocabulary is embedded in the software at both the keywording and the search stage, creating automatic links between words and effectively automating much of the keywording effort. With a well designed controlled vocabulary (CV) and good software, the keyworder can concentrate on interpreting the image for the user audience. That's the part the machine can't do.

People in the stock image industry have been working on this for decades, and have come up with some pretty good systems for keywording, led by teams in large agencies like Getty and Corbis. Now it's time for everyone else to sign up for productive and accurate keywording, learning, where possible, from experience already gained in the industry on keywording and customer behaviour. The benefits will be felt not only by smaller picture agencies and photographers, but also in the wider world. Imagery is playing an ever greater part in company DAM systems, where the level and quality of retrieval makes sense of investment in this area. A picture may be worth 1000 words, but without words, a picture may be lost forever.

At Electric Lane we have been increasingly involved in creating vocabularies for image collections. We are also working with the standards body IPTC on a project to create a standard vocabulary to help collections of all sizes raise their keywording standards and make their data more interoperable.

We are offering a one-day course, Keywording, on December 7 in London, run by Electric Lane Associate Liisa Kaakinen, a stock image industry keywording and controlled vocabulary expert. The course covers professional keywording techniques and the vocabularies that lie behind them, applied to still and moving images. For those wondering what to do about keywording, this session provides an essential step to understanding the process, the gains, the resources needed, and how to maximise productivity.

For further enquiries about course content, tel 020 7607 1415.

See also
Is Language a moving target
Multilingual Keywording
IPTC Mirror on IPTC Controlled Vocabulary Initiative
Google is not Perfect, Fran Alexander

Wednesday, 6 October 2010

The Semantic Web, Linked Data, and Knowledge Organisation

It's a journey of possibilities, the semantic web, and it's one we're all engaged in, in one form or another. I've been eyeing up the topic for some time, and the third UK ISKO Conference gave me the push I needed to look deeper into what the semantic web and linked data hold for the future.
Last time I was at the ISKO conference, in 2008, many of us there were baffled by the possibilities of linked data. If all this data was to be linked on the web, who would put it out there? Who would put the money up to produce the data? This year, answers to some of those questions emerged. It would appear that linked data is at the point of lift-off, and it's already being used in ways we can now understand.

I start by looking at some of the ideas behind linked data, and then follow the presentations at the conference, all of which clarified some aspect of the subject and gave us a glimpse of how it can benefit 'the rest of us', the users.

The Significance of Linked Data

Linked Data is part of the Semantic Web, and for those wondering exactly what that is, here's a quick explanation. Semantic Web is the term used to describe a Web environment in which the meaning (or semantics) of information is made explicit and therefore machine readable. Our brains can handle very complex information. When we say we want an apple, we can work out from the circumstances that we want the eating type of apple, not the company Apple or the Big Apple, or an Apple computer. Machines need much more explicit instructions to contextualise the exact meaning of a word. We know that by building up a series of simple instructions, starting with the basic 0 or 1 choice, computers can perform very complex tasks. The Semantic Web, a term coined by Tim Berners-Lee, the creator of the World Wide Web, describes an environment in which information can be accessed and processed automatically in an intelligent way. Linked data makes sense of the Semantic Web by providing a framework for a network of related information.

We are accustomed to the idea of HTML documents located at URLs (Uniform Resource Locators) and linked together by hyperlinks. In the same way, smaller bits of information can be assigned URIs (Uniform Resource Identifiers), and these bits of data can be linked using RDF technologies. So we can take the idea of 'Cat' and assign it a URI. We can give another URI to a particular cat, say Dick Whittington's cat. Then we can link the two to increase the amount of information available.
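
As a rough sketch of what that looks like, here is the cat example in a few lines of Python. Every URI below is made up for illustration (example.org is a domain reserved for examples), and the relation names are my own inventions, not terms from any published vocabulary.

```python
# All URIs here are hypothetical and exist only for this illustration.
CAT = "http://example.org/concept/cat"
WHITTINGTONS_CAT = "http://example.org/individual/dick-whittingtons-cat"
DICK_WHITTINGTON = "http://example.org/individual/dick-whittington"
IS_A = "http://example.org/relation/is-a"
OWNED_BY = "http://example.org/relation/owned-by"

# Each link joins one identified thing to another through a named relation.
links = [
    (WHITTINGTONS_CAT, IS_A, CAT),
    (WHITTINGTONS_CAT, OWNED_BY, DICK_WHITTINGTON),
]


def describe(uri):
    """Collect everything the links say about one URI."""
    return {relation: target for source, relation, target in links if source == uri}
```

Calling describe on the URI for Dick Whittington's cat returns both facts at once: that it is a cat, and who owns it. The more links other people publish about the same URIs, the more a single lookup can tell you.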

To link data in the public sphere, it has to be freely available on the web. The idea of releasing stuff for free is becoming more accepted, with even parts of the creative industries starting to look for new ways to make money in an environment where the market price of media increasingly parts company from the cost of creating it.

Services may come to be the new currency. If you share data wisely, you can attract value to your company and its services, increase the traffic to your web site and gain an element of trust which is working capital of a sort. By releasing data for free you can signpost things you can place a pound sign next to.

The linked data community is part of the open source community and while many of us working in media have been struggling with the recession and changes to our industries, there has been a quiet shifting of the scenery in the background.

The information revolution is being powered by new ideas and by the growth in mobile technology. People are already buying and using apps to entertain, inform, and to find their way around the world. A steady stream of up to date information plays an essential role. The processing of that data into useful formats and apps will spawn the businesses of the future.

Share the data and the apps will follow
The keynote speech at the conference was given by Professor Nigel Shadbolt of the University of Southampton, Director of the Web Science Trust and the Web Foundation. Together with the inventor of the World Wide Web, Tim Berners-Lee, he was a key figure behind a UK Government project set up in 2009 to make official data available on the internet for anyone to re-use. The thrust of his presentation was that once data is out on the Web, other people can do things with it, and this opens up opportunities of benefit both to individual citizens and to the businesses who create new services using the data. Publish the data, said Shadbolt, and the applications will flow.

The British Government has more than 4,000 datasets, and the aim is to make much of this data public. Already on the site there are apps using data previously locked away in government departments. One example is the data on cycle accident location. Once that data is public, applications can be created to help cyclists avoid the accident blackspots. By sharing data, Professor Shadbolt said, you can bring eyeballs and brainpower to a problem.

The public can do little with endless sheets of raw data, but data can be transformed into useful applications, and that's the basis for a new kind of business. In a world of iPhone apps and mobile media, that information, once it's made accessible, can be just what you need when you're looking for a bus stop, running for a train, or finding the nearest dentist.

More data is now being made available. Postcode data was once copyrighted but is now freely available. From January 2010 every local council has to publish all spending over £500. There is a new appetite for open data. Public data provides ways of holding public services to account, and Professor Shadbolt's view is that it should be published quickly, and it should be linked.

Government departments can profit internally from linked data as well, he said, gaining better access to their own data and making better sense of related data from other departments.

He sees linked data leading to more accountability, more localism, more arguments. But how do we know we can trust the data and the interpretation of it? Shadbolt thinks there will be a flight to quality. Perhaps there will be a sifting process similar to that on the internet where some data sets are more highly regarded than others.

A language for linking
Antoine Isaac is scientific coordinator for Europeana and researcher at the Vrije Universiteit Amsterdam. He works with Semantic Web technology in the cultural heritage environment, focussing on interoperability of collections and their vocabularies. He was involved in the design of SKOS (Simple Knowledge Organisation System) the language designed to represent structured vocabularies (thesauri, taxonomies and classification schemes) so they can be used in the Semantic Web environment.

Isaac gave an overview of SKOS, which is expressed in RDF and enables linking between Knowledge Organisation Systems (KOS). SKOS, he said, presents a way of expressing structured information so that connections can be made between different knowledge systems like vocabularies and taxonomies. OWL is the language for expressing ontologies (which are generally held to be more complex, broader taxonomies with more formal relationships between terms). SKOS is designed, as its name implies, to be easier to use and less resource-hungry than OWL.

RDF (Resource Description Framework) is a way of making statements about resources (particularly on the web). These statements are in the form: Subject (John); Predicate (has the age); Object (20 years). This is what is called a triple.
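
For anyone who likes to see these things in code, here is a minimal sketch of triples in plain Python, with no RDF library involved. The first triple is the John example above; the other two are invented so that there is something to search, and the match function is my own helper, not part of RDF.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The first statement is the example from the text; the rest are invented.
triples = [
    Triple("John", "has the age", "20 years"),
    Triple("John", "lives in", "London"),
    Triple("Mary", "has the age", "20 years"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None means 'anything'."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]
```

Asking for match(subject="John") returns everything stated about John, while match(predicate="has the age", obj="20 years") finds everyone who is 20. The same three-part shape serves for both storing and questioning the data, which is much of the appeal.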

SKOS is a form of glue which allows different classification systems to be linked in the Semantic Web environment. The benefits are the re-use and sharing of information and the linking of concepts.
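
Here is one way to picture the glue, as a small Python sketch. The two collections, their concept IDs and the pairings between them are all invented for this illustration. The mapping is in the spirit of the SKOS exactMatch property, but this is ordinary Python data, not real SKOS.

```python
# Two hypothetical collections, each with its own vocabulary and labels.
COLLECTION_A = {"a:001": "Railway trains", "a:002": "Felines"}
COLLECTION_B = {"b:train": "Trains", "b:cat": "Cats"}

# Invented mapping links saying that two concepts mean the same thing.
EXACT_MATCH = [("a:001", "b:train"), ("a:002", "b:cat")]


def equivalents(concept_id):
    """Return the concepts declared equivalent, following links in both directions."""
    found = set()
    for left, right in EXACT_MATCH:
        if left == concept_id:
            found.add(right)
        if right == concept_id:
            found.add(left)
    return sorted(found)


def label_in_other_collection(concept_id):
    """Translate a concept into the labels used by the other collection."""
    labels = {**COLLECTION_A, **COLLECTION_B}
    return [labels[other] for other in equivalents(concept_id)]
```

A search that starts with "Felines" in the first collection can now reach material filed under "Cats" in the second, without either collection having to change its own terms.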

Isaac described the steps to take: put your data on the web; make it available as structured data; use open standard formats (XML, RDF); use URIs to locate the data; link it to other data.

Evangelising Linked Data
Richard Wallis has been with technology company Talis for 20 years and calls himself a Technology Evangelist. Talis started in the library sphere based at the University of Birmingham 40 years ago and is now one of the leading Semantic Web Technology companies.

Talis offers training and applications of linked data for a variety of commercial and governmental bodies. Linked data is being used by Walmart, Tesco, The Library of Congress, the BBC, The Ministry of Defence and many others. Talis helped the BBC to link its own data to other sources of data on the Web. The BBC Wildlife site, for example, links to background information from DBpedia, the linked form of Wikipedia.

One of the opportunities in using linked data is to create mashups, which are ways of combining data from a number of sources to create a new service or resource. Talis has a good example of mashups on its web site, where Guardian data about politicians was linked to BBC data about programming. The result was a timeline showing the exposure of individual politicians in the Guardian and on the BBC on different dates. This could be extended to other sources to create a very rich view of politicians' media exposure, demonstrating the opportunities for presenting and interpreting data once it's out there.
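
A mashup of this kind boils down to joining records from different sources on a shared identifier. The Python sketch below shows the principle only: the records, dates and politician URIs are invented, and it is not how Talis built its timeline.

```python
from collections import defaultdict

# Invented sample records. The shared key is a hypothetical politician URI.
POLITICIAN_A = "http://example.org/politician/a"
POLITICIAN_B = "http://example.org/politician/b"

guardian_articles = [
    {"politician": POLITICIAN_A, "date": "2010-09-01"},
    {"politician": POLITICIAN_A, "date": "2010-09-01"},
    {"politician": POLITICIAN_B, "date": "2010-09-02"},
]
bbc_programmes = [
    {"politician": POLITICIAN_A, "date": "2010-09-01"},
    {"politician": POLITICIAN_A, "date": "2010-09-03"},
]


def exposure_timeline(politician):
    """Count mentions per date in each source for one politician."""
    timeline = defaultdict(lambda: {"guardian": 0, "bbc": 0})
    sources = (("guardian", guardian_articles), ("bbc", bbc_programmes))
    for source_name, records in sources:
        for record in records:
            if record["politician"] == politician:
                timeline[record["date"]][source_name] += 1
    return dict(sorted(timeline.items()))
```

The join only works because both sources use the same identifier for the same person. That is exactly what shared URIs provide, and adding a third source is just one more entry in the list of sources.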

Connecting Communities
Steve Dale calls himself a community and collaboration ecologist. He blends technology solutions with an understanding of how people can be encouraged to organise and collaborate creatively in a sustainable knowledge ecosystem. He led the project to create a community of practice platform for the UK local government sector.

Data is everywhere, he said, and people are faced with the question of where to go for information, which networks to join, in an environment where there is almost no connectivity, and data is hidden behind applications.

Local government is awash with data, but is it being used to its full extent? How, for example, do you compare performance with other areas? What is the relationship between national indicators? Dale charted the Knowledge Hub which should create the links to provide the answers. Integral parts of this Hub are data mark up and search facilities, data integration and aggregation, forms based data entry for benchmark comparisons, public datasets, mashups, and apps. Data attracts value from contributions by other data producers as well as technical and user communities.

It's a big change, he said, and it's about open architecture, open source, open everything. The issue of reliability was raised: how do you know if the interpretation of the data is correct or whether it is misleading, given that data is collected in different ways by different local authorities? He agreed that this will be one of the pain points going forward. Maybe the fact that the public is testing the data through available apps will do something to increase public awareness of the dangers in data interpretation. As Dale said, perhaps tomorrow will be a statistician's world.

Finding partners in trade
Martin Hepp is professor of general management and e-Business at Universität der Bundeswehr in Munich. He looked at the costs in GDP terms of keeping commercial markets alive, citing a 1920 list of just over 5000 types of goods, compared to the current market environment where it's easy to count 30 types of bread alone. Finding partners to trade with is what drives business, he said, and the search space expands constantly. To find a specific item is the aim of the game, and the internet makes searches easier, but we still spend a lot of time every day looking for things. The world wide web, he said, is currently like a giant shredder, ingesting structured data and spewing it out as unstructured text, destroying or shredding information we had at the outset. What we need is to retain data structure, link data elements by meaning and reduce the look-up effort required.

Hepp says the quality of vocabularies will define how easily data can be re-used. Taxonomists everywhere can rejoice. Not only is their calling not going out of style, it is becoming ever more relevant.

For the last 8 - 10 years, Hepp has been working on a web ontology for e-commerce called GoodRelations. Data levels need to be sufficient, he said, for rule based transformation. For example, the product needs to be distinguished from the offer, the store from the business entity, the product from the product model. If you are buying a car, the product data may be registration date, condition, mileage etc. A different set of information can be extracted from standard model data. Major businesses have seen the opportunities and have implemented the vocabulary, which has the immediate effect of raising their Google rating.
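
The distinctions Hepp draws can be sketched as separate data types. The Python below is my own illustration of the idea, using the car example. The class and field names are invented for this sketch and are not the terms defined in the GoodRelations vocabulary itself, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductModel:
    """Facts shared by every car of this model (hypothetical fields)."""
    name: str
    doors: int


@dataclass(frozen=True)
class Product:
    """One actual car, with facts that belong only to it."""
    model: ProductModel
    registration_date: str
    condition: str
    mileage: int


@dataclass(frozen=True)
class BusinessEntity:
    """The legal body doing the selling."""
    legal_name: str


@dataclass(frozen=True)
class Store:
    """A physical place, which is not the same thing as the business that runs it."""
    operated_by: BusinessEntity
    town: str


@dataclass(frozen=True)
class Offer:
    """The proposal to sell one product at a price, kept apart from the product."""
    product: Product
    seller: BusinessEntity
    available_at: Store
    price_gbp: int


# Invented sample data.
model = ProductModel(name="Example Hatchback", doors=5)
car = Product(model=model, registration_date="2008-03-01", condition="used", mileage=24000)
dealer = BusinessEntity(legal_name="Example Motors Ltd")
showroom = Store(operated_by=dealer, town="London")
offer = Offer(product=car, seller=dealer, available_at=showroom, price_gbp=5500)
```

Because the model is a separate object, the number of doors is stated once and can be read from any car of that model, while mileage and condition stay with the individual vehicle. The price belongs to the offer, so the same car can be offered again at a different price without touching the product data.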

Dublin Core and Semantic Relativism
Andy Powell is Research Programme Director at Eduserv and has been active in the Dublin Core initiative since 1996, where he is now a member of the advisory board. He is active in standards activities relating to RDF, Dublin Core and other digital library projects.

He reviewed the history of Dublin Core, which started with 15 original metadata elements to describe web resources. Now there are around 60 properties and classes in a well curated vocabulary. Dublin Core started out labelling with HTML meta tags, which were later ignored because of spam. It started with broad semantics, and 15 'fuzzy' buckets for data which were a hangover from library catalogue cards, and it was a record-centric model.

The challenge then was to change to the idea of strings of data, using RDF to express relationships. The open world view needs to be sold in, and there's the problem of URIs. Should a URI be a locator of a web page or an ID? For those of us struggling to understand the question, Powell introduced us to a useful Batman blog by Chris Gutteridge at Southampton University on the subject of modelling the world. When creating vocabularies we face the question of how to describe the entire world in unambiguous terms. This is amplified when you look at ontologies on the web and the use of URIs to identify bits of data. See the blog for a tour into madness and back.

The conclusion? As ever, it points to the fact that we can only find approximate answers to questions. The web gives us the ability to approach things from a number of different angles to gain information. There's no point obsessing about absolute accuracy, as you are in danger of descending into a Kafkaesque universe from which there is no point of exit. Enough said.

Ordnance Survey hits the streets
John Goodwin, who works in the research department at Ordnance Survey (OS), was tasked with looking into the Semantic Web and its implications for OS. He started by constructing ontologies in the ontology language OWL to describe OS data, and is now responsible for linked data published by OS.

A number of natural hubs are starting to form in the linked data universe, and geographical terms are among them. (See this 2010 linked data diagram and compare it to this one from 2008.)

OS is at the forefront of the geospatial linked data web, providing a trusted source for developers. Goodwin highlighted some of the problems and confusions with geo data. An area such as Hampshire can be represented as an administration area, or a county as used in the vernacular. Boundary lines can change for administration purposes while general usage of a geo term remains the same. There are issues around overlapping boundaries, touching boundaries and partial overlaps. It's hard to uniquely define an area, as anyone involved in geo term taxonomies will tell you.

The result of work done at the OS is that now you can not only find a map of anywhere in the UK on the OS site, but also, those interested in creating other applications can access OS linked data to work with.

Managing thesauri without knowing anything about Semantic Web
Andreas Blumauer is founder of Semantic Web Company, which focuses on technology consulting, the media economy and metadata management. He described the stage we desire, where the technical people have done their job, they give us a user interface, and the user has no need to look under the bonnet.

Blumauer gave us a quote from Dr Chris Welty: 'It is not semantic which is new, it is the web that is new'.

He sees SKOS as having just the right level of complexity, and the ability to introduce Web 2.0 mechanisms to the web of data. Web 2.0 refers to changes in the way developers write applications for the web, enabling users to share, collaborate, and create content.

Blumauer characterised SKOS as a hand adding and retrieving linked data to and from the Cloud. The way his company's thesaurus management software PoolParty adds and retrieves data from the linked data web environment is shown in this PoolParty demo. The advantages to business are reduced costs of content management, increased automation of data handling, better search engine optimisation and access to new services and mashups (new applications made using data from a variety of sources).

Everything is a thing!
Bernard Vatant is an expert in data modelling, migration and interoperability of vocabularies. He has worked with the Publications Office of the European Union (EUROVOC vocabulary) and the French National Library on the evolution and integration of RAMEAU for the Europeana project.

Vatant said that everything can be represented as a sign. People, products, devices, places, and concepts from vocabularies can be represented and connected. Everything, in other words, is a thing. He spoke about the semiotic triangle of meaning. He highlighted the differences between terms, concepts, and things and said they should all be first class citizens of the Semantic Web. His presentation contains rich thoughts on semantics and their use, and can be found here.

Go and play
There is something for all of us here. I hope you will all go out and play, look at some of the links on web sites like the BBC, or look out for apps using the data that's been made available.

It will all soon be second nature, so we won't have to ask what SKOS is, or what a URI represents. Remember the Web when we thought it was just a way of skiving off to surf?


ISKO Blogspot The conference encapsulated for SKOS by Fran Alexander
Government apps list
Talis Nodalities blog with musings on linked data and Pavlova and many other things
The Basics of Linked Data Tim Berners-Lee - from the horse's mouth
Jeni's musings Why linked Data for Government?