In an example of Twitter serendipity, @silverasm’s (Aditi Muralidharan) tweet pointed me to @historying’s blog post about Topic Modeling. In this post, Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst’s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary.
I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, ArchivesZ, aims to provide ways of understanding aggregated archival description data, either from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects.
Topic Modeling with MALLET
To get a sense of what MALLET generates, see the excerpt below from Blevins’s post:
With some tinkering, MALLET generated a list of thirty topics comprised of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:
- MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
- CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
- DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
- GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
He goes on to explain that “MALLET also allows us to track those topics across the text.” What if, instead of text mining a diary, we pumped the descriptions of every archival collection from a single institution into MALLET? Of course we would need a good list of stop words, including such common terms as archives, history, sources and records. But I wonder how the topics MALLET suggests would compare to the official subjects associated with each collection. Could this give us a broad overview of the topics covered by a specific repository and give us a new way to build paths to the collections based on topic?
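As a rough sketch of what that preparation step might look like, assuming each collection description is available as an identifier plus free text: the sample records, file names and helper functions below are all invented for illustration. The idea is to write the descriptions one per line in the name, label, text layout that MALLET's import-file command reads by default, next to a file of extra stop words.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data: collection identifier -> description text.
# Identifiers must not contain whitespace, since MALLET splits on it.
descriptions = {
    "coll-001": "Records of a county historical society, including\nmeeting minutes and letters.",
    "coll-002": "Family papers with diaries, farm accounts and garden notes.",
}

# Domain-specific stop words suggested in the post.
extra_stopwords = ["archives", "history", "sources", "records"]


def to_mallet_line(doc_id, text, label="none"):
    """Return one tab-separated line: name, label, text (whitespace collapsed)."""
    flat = " ".join(text.split())
    return f"{doc_id}\t{label}\t{flat}"


def write_mallet_inputs(descriptions, stopwords, out_dir):
    """Write the one-document-per-line file and the extra stop word file."""
    out_dir = Path(out_dir)
    docs_path = out_dir / "descriptions.tsv"
    stop_path = out_dir / "extra-stopwords.txt"
    lines = [to_mallet_line(doc_id, text) for doc_id, text in sorted(descriptions.items())]
    docs_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    stop_path.write_text("\n".join(stopwords) + "\n", encoding="utf-8")
    return docs_path, stop_path


out_dir = tempfile.mkdtemp()
docs_path, stop_path = write_mallet_inputs(descriptions, extra_stopwords, out_dir)
print(docs_path.read_text(encoding="utf-8"))
```

The two files could then be handed to MALLET's import-file command (which has options for removing stop words and supplying extra ones) and on to train-topics with a chosen number of topics. The exact options are worth checking against the MALLET documentation for the version in use.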
Auto-Classification Using Castanet
Text miner Aditi Muralidharan also posted recently on this theme in “Castanet: automatically generating a browsing structure for a collection”, where she explains:
Castanet automatically carves a sub-structure from the hierarchical concept dictionary, WordNet (http://wordnet.princeton.edu), and matches items in the collection to one or many appropriate places within that hierarchy. Then, after some automated trimming and flattening, the result is a hierarchical browsing system.
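To make the carve, trim and flatten idea a little more concrete, here is a toy sketch. It is not Castanet's code, and the hypernym paths are hand-written stand-ins rather than real WordNet lookups. It only illustrates merging each term's path into one tree and then collapsing any node that has a single child.

```python
# Hand-written hypernym paths (most general term first) standing in for
# lookups against WordNet; they are illustrative, not real WordNet output.
hypernym_paths = {
    "horse trading": ["act", "group action", "commerce", "horse trading"],
    "auction": ["act", "group action", "commerce", "auction"],
    "parade": ["act", "group action", "procession", "parade"],
}


def build_tree(paths):
    """Merge all paths into one nested dict keyed by term."""
    root = {}
    for path in paths.values():
        node = root
        for term in path:
            node = node.setdefault(term, {})
    return root


def flatten(tree):
    """Drop every node that has exactly one child, promoting that child."""
    result = {}
    for term, children in tree.items():
        children = flatten(children)
        if len(children) == 1:
            result.update(children)  # skip this level, keep the child
        else:
            result[term] = children
    return result


browse = flatten(build_tree(hypernym_paths))
print(browse)
```

In this toy data the top-level "act" node and the single-child "procession" node disappear, leaving a shallower hierarchy to browse. The real system's trimming rules are more involved than this one rule.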
I have heard of Castanet before via the Flamenco Search Interface Project. Apparently Muralidharan did a project using Castanet last summer to create a category system for Flickr Commons images based on the images’ tags, which is then rendered using a Flamenco interface. I include a partial screen-shot below to give you a taste of what navigating the images feels like a few levels down in the hierarchy. I love the classification of ‘Group Action’ then filtered by a sub-classification of ‘Commerce’. The first images shown are of ‘horse trading’ – with additional headings and images beneath them as well as additional filter options on the left.
What if we pulled all the English language archival descriptions from around the world as our original data set? If we used this data for topic modeling, our subject clusters would be cross-institutional. Maybe we could map the locally assigned subjects to the topics the model generates for each collection and get a sort of automated crosswalk for finding related collections. If we used the locally assigned subjects from the archival descriptions for Castanet-style auto-classification, maybe we could generate a way to hierarchically browse collections topically.
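One simple way the crosswalk idea might be prototyped, assuming we already had a dominant topic for each collection from a topic model and a list of locally assigned subjects from each description: the identifiers, topic numbers and subjects below are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical inputs. In practice the topic ids would come from a topic
# model's per-document output and the subjects from each archival description.
dominant_topic = {
    "inst-a/001": 3,
    "inst-a/002": 7,
    "inst-b/101": 3,
    "inst-b/102": 3,
}
local_subjects = {
    "inst-a/001": ["Midwives", "Medicine"],
    "inst-a/002": ["Gardening"],
    "inst-b/101": ["Obstetrics", "Medicine"],
    "inst-b/102": ["Midwives"],
}


def build_crosswalk(dominant_topic, local_subjects):
    """Count how often each local subject appears on collections of each topic."""
    crosswalk = defaultdict(Counter)
    for coll_id, topic in dominant_topic.items():
        for subject in local_subjects.get(coll_id, []):
            crosswalk[topic][subject] += 1
    return crosswalk


def related_collections(coll_id, dominant_topic):
    """Return other collections sharing the same dominant topic, sorted by id."""
    topic = dominant_topic[coll_id]
    return sorted(c for c, t in dominant_topic.items() if t == topic and c != coll_id)


crosswalk = build_crosswalk(dominant_topic, local_subjects)
for topic in sorted(crosswalk):
    print(topic, crosswalk[topic].most_common())
print(related_collections("inst-a/001", dominant_topic))
```

Here "Midwives" at one institution and "Obstetrics" at another end up under the same topic, which is the kind of link across differing local vocabularies the crosswalk is meant to surface. Using only the single dominant topic is a simplification, since a topic model assigns each document a mix of topics.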
Both MALLET and Flamenco are open source (I am not sure of the status of Castanet) and, as I discovered working on ArchivesZ, many institutions will share their archival description data for a good cause. So – is this a good cause? I need to tease these ideas out a bit more, but what do you all think of it at first blush? Feasible? Interesting? Worthwhile experiments?