Clustering Data: Generating Organization from the Ground Up

My trip to the 2008 Information Architecture Summit (IA Summit) down in Miami has me thinking a lot about helping people find information. In this post I am going to examine clustering data.

Flickr Tag Clusters
Tag clusters are not new on Flickr – they were announced way back in August of 2005. The best way to understand tag clusters is to look at a few. Some of my favorites are the water clusters (shown in the image above). From this page you can view the reflection/nature/green cluster, the sky/lake/river cluster, the blue/beach/sun cluster or the sea/sand/waves cluster.

So what is going on here? Basically Flickr is analyzing groupings of tags assigned to Flickr images and identifying common clusters of tags. In our water example above – they found four different sets of tags that occurred together and distinctly apart from other sets of tags. The proof is in the pudding – the groupings make sense. They get at very subtle differences even though the mass of data being analyzed is from many different individuals with many different perspectives.

Tag clusters are very powerful and quite different from tag clouds. Tag clouds, by their nature, are a blunt instrument. They only show you the most popular tags. Take a look at the tag cloud for the Library of Congress photostream on Flickr. I do learn something from this. I get a sense of the broad brush topics, time periods and locations. But if you look at the full list of Library of Congress Flickr tags you see what a small percentage the top 150 really are (and yes.. that page does takes a while to load). Who else is now itching to ask Flickr to generate clusters within the LOC tag set?

Steve.Museum
Another example of cultural heritage images being tagged is the Steve Museum Art Museum Social Tagging Project which lets individuals tag objects from museums via Steve Tagger. It resembles the Library of Congress on Flickr project in that it includes existing metadata with each image and permits users to add any tags they deem appropriate. I think it would be fascinating to contrast the traffic of image taggers on Steve.Museum vs Flickr for a common set of images. Is it better to build a custom interface that users must seek out but where you have complete control over the user experience and collected data? Or is it better to put images in the already existing path of users familiar with tagging images? I have no answers of course. All I know is I wish I could see the tag clusters one could generate off the Steve.Museum tag database. Perhaps someday we will!

Del.icio.us Tags
Del.icio.us, a web service for storing and tagging your bookmarks online, supports what they call ‘related tags’ and ‘tag bundles’. If you view the page for the tag ‘archives’ – you will see to the far right a list of related tags like those shown in the image here. What is interesting is that if I look at my own personal tag page for archives I see a much longer list of related tags (big surprise that I have a lot of links tagged archives!) and I am given the option of selecting additional tags to filter my list of links via a combination of tags.

Del.icio.us’s ‘tag bundles’ let me create my own named groupings of tags – but I must assemble these groups manually rather than have them generated or suggested. On the plus side, Del.icio.us is very open about publishing its data via APIs and therefore supporting third party tools. I think my favorite off that list for now has to be MySQLicious which mirrors your del.icio.us bookmarks into a MySQL database. Once those tags are in a database, all you need are the right queries to generate the clusters I want to see.

Clusty: Clustered Search Results
An example of what this might look like for search results can be seen via the search engine Clusty.com from the folks over at Vivisimo. For example – try a search on the term archives. This is one of those search terms for which general web searching is usually just infuriating. Clusty starts us with the same top 2 results as a search for archives on Google does, but it also gives us a list of clusters on the left sidebar. You can click on any of those clusters to filter the search results.

Those groups don’t look good to you? Click the ‘remix’ link in the upper right hand corner of the cluster list and you get a new list of clusters. In a blog post titled Introducing Clustering 2.0 Vivisimo CEO Raul Valdes-Perez explains what happens when you click remix:

With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.

I played for a while.. clicking remix over and over. It was as if it was slicing and dicing the facets for me – picking new common threads to highlight. I liked that I wasn’t stuck with what someone else thought was the right way to group things. It gave me the control to explore other groupings.

Ontology is Overrated
Clay Shirky’s talk Ontology is Overrated: Categories, Links and Tags from the spring of 2005 ties a lot of these ideas together in a way that makes a lot of sense to me. I highly recommend you go read it through – but I am going to give away the conclusion here:

It’s all dependent on human context. This is what we’re starting to see with del.icio.us, with Flickr, with systems that are allowing for and aggregating tags. The signal benefit of these systems is that they don’t recreate the structured, hierarchical categorization so often forced onto us by our physical systems. Instead, we’re dealing with a significant break — by letting users tag URLs and then aggregating those tags, we’re going to be able to build alternate organizational systems, systems that, like the Web itself, do a better job of letting individuals create value for one another, often without realizing it.

I currently spend my days working with controlled vocabularies for websites, so please don’t think I am suggesting we throw it all away. And yes, you do need a lot of information to reach the critical mass needed to support the generation of useful clusters. But there is something here that can have a real and positive impact on users of cultural heritage materials actually finding and exploring information. We can’t know how everyone will approach our records. We can’t know what aspects of them they will find interesting.

There Is No Box
Archivists already know that much of the value of records is in the picture they paint as a group. A group of records share a context and gives the individual records meaning. Librarians and catalogers have long lived in a world of shelves. A book must be assigned a single physical location. Much has been made (both in the Clay Shirky talk and elsewhere) that on the web there is no shelf.

What if we take the analogy a step further and say that for an online archives there is no box? Of course, just as with books, we still need our metadata telling us who created this record originally (and when and why and which record comes before it and after it) – but picture a world where a single record can be virtually grouped many times over. Computer programs are only going to get better at generating clusters, be they of user assigned tags or search results or other metdata. From where I sit, the opportunity for leveraging clustering to do interesting things with archival records seems very high indeed.

Clustering Data: Generating Organization from the Ground Up

2 Comments