
Category: information visualization

SXSW Interactive: Data and Revelations

I am typing on a laptop in the Samsung blogger lounge at SXSW. Given this easy opportunity to blog, I wanted to share the overarching theme of my experience so far (three days in) at SXSW Interactive. Data. It is all about data. APIs exposing data. People visualizing data. Using data to make business and policy decisions. Graphing data to keep track of web site and application performance. Privacy of data. Crowdsourcing data. Data about social media behavior. And on and on!

It has been a common thread I have traced from session to session, conversation to conversation. I expect someone with less of a database and metadata fixation might see something else as the overall meme, but I have a purse full of cards pointing me to new data sources and a notebook full of URLs to track down later to defend my view.

I keep catching myself giving mini-lessons on archives and preservation of electronic records like some sort of envoy from another universe. While I feel like a strong overall tech person at an archives conference, I feel like a data and visualization person here. This morning two of my sessions were in the same hotel that hosted SAA in Austin, and it was strange to be there with such a different group of people. I have managed to connect with an assortment of digital humanities folks. Someone even managed to find space for and plan an informal event for tomorrow night: Innovating and Developing with Libraries, Archives, and Museums.

My list of tech to learn (HTML5, NoSQL) and projects to contemplate and move forward (mostly ideas for visualizations using all the data everyone is sharing) is getting longer by the hour. It has been a process to figure out how to get the most I can out of SXSW. It is definitely more a space for inspiration than for deep diving into specifics. Letting go of the instinct that I am supposed to ‘learn new skills’ at a conference is fabulous!

Creative Funding for Text-Mining and Visualization Project

The Hip-Hop Word Count project on Kickstarter.com caught my eye because it seems to be a really interesting new model for funding a digital humanities project. You can watch the video below – but the core of the project is a database of metadata from 40,000 rap songs from 1979 to the present, including stats about each song (word count, syllables, education level, etc.), individual words, artist location and date. This information aims to become a public online almanac fueled by visualizations.
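As an aside, per-song stats of the kind they describe (word count, syllables, reading level) are easy to approximate. Here is a rough sketch in Python – entirely my own illustration, not the project's actual method; the syllable heuristic and the use of the Flesch-Kincaid grade formula are assumptions:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels,
    # dropping a common silent trailing 'e'.
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def song_stats(lyrics: str) -> dict:
    """Compute word count, syllable count, and an approximate
    Flesch-Kincaid grade level for one song's lyrics. Each non-empty
    line is treated as a 'sentence', since lyrics rarely use periods."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    words = re.findall(r"[a-zA-Z']+", lyrics)
    syllables = sum(count_syllables(w) for w in words)
    grade = (0.39 * len(words) / max(1, len(lines))
             + 11.8 * syllables / max(1, len(words)) - 15.59)
    return {"word_count": len(words),
            "syllable_count": syllables,
            "grade_level": round(grade, 1)}
```

Real lyric analysis would need a proper syllable dictionary, but even this crude version gives plausible relative numbers across a large corpus.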

I am a backer of this project, and you can be too. As of the original writing of this post, they are 47% funded with twenty-eight days left before their deadline. For those of you not familiar with Kickstarter, people can post creative projects and provide rewards for their funders. The funding only goes through if the project reaches its goal within the time limit – otherwise nothing happens, a model they call ‘all-or-nothing funding’.

What will the money be spent on?

  • 45% for PHP programmers who have been coding the custom web interface
  • 35% for interface designers
  • 10% for data acquisition & data clean up
  • 10% for hosting bills

They aim for a five-month timeline to move from their existing functional prototype to something viable to release to the public.

I am also intrigued by ways that the work on this project might be leveraged in the future to support similar text-mining projects that tie in location and date. How about doing the same thing with Civil War letters? How about mining the lyrics of Broadway musicals?

If this all sounds interesting, take a look at the video below and read more on the Hip-Hop Word Count Kickstarter home page. If half the people who follow my RSS feed pitch in $10, this project would be funded. Take a look and consider pitching in. If this project doesn’t speak to you – take a look around Kickstarter for something else you might want to support.

Google Tackles Magazine Archives

As has been reported around the web today, Google is now digitizing and adding magazines to Google Book Search. This follows on the heels of the recent Google Life Photo archive announcement.

I took a look around to see what I could see. I was intrigued by the fact that I couldn’t see a list of all the magazines in their collection. So I went after the information the hard way: I kept reloading the Google Book Search home page until I didn’t see any new titles displayed in their highlighted magazine section. This is what I came up with, roughly grouped by topic.

Science and technology:

Lifestyle and city themed:

African American:

  • Ebony Jr!: May 1973 through October 1985
  • Jet: November 1961 through October 2008
  • Black Digest: Named ‘Negro Digest’ from November 1961 through April 1970, then Black Digest from May 1970 through April 1976.

Health, nutrition and organic:

  • Women’s Health and Men’s Health: January 2006 through present. I found it very amusing to be able to scan the covers of all the issues so easily – true for all of these magazines of course, but funny to see cover after cover of almost identically clad men and women exercising.
  • Prevention: January 2006 through the present
  • Better Nutrition: January 1999 through December 2004
  • Organic Gardening: November 2005 to the present
  • Vegetarian Times: March 1981 through November 2004

Sports and the outdoors:

They of course promise more magazines on the way, so if you are reading this long after mid-December 2008, I would assume there are more magazines and more issues available now. I hope that they make it easier to browse just magazines. Once they have a broader array of titles – how neat would it be to build a virtual news stand for a specific week in history? It shouldn’t be hard – they have all the metadata and cover images they need.
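The ‘virtual news stand’ idea really is mostly a grouping problem. Assuming you had a list of (title, publication date, cover URL) records – the data shape and sample issues below are invented for illustration – a sketch might look like this:

```python
from collections import defaultdict
from datetime import date

def newsstand(issues, year, week):
    """Group magazine issues by ISO week, then return the 'stand'
    for one week in history: every issue published that week.
    `issues` is a list of (title, publication_date, cover_url)."""
    by_week = defaultdict(list)
    for title, pub_date, cover in issues:
        iso = pub_date.isocalendar()  # (ISO year, ISO week, weekday)
        by_week[(iso[0], iso[1])].append((title, pub_date, cover))
    return by_week[(year, week)]

# Hypothetical sample metadata (titles real, dates/URLs invented).
issues = [
    ("Jet", date(1968, 4, 11), "jet-1968-04-11.jpg"),
    ("Popular Science", date(1968, 4, 9), "ps-1968-04.jpg"),
    ("Jet", date(1968, 4, 18), "jet-1968-04-18.jpg"),
]
stand = newsstand(issues, 1968, 15)  # the week of April 8-14, 1968
```

With cover images attached to each record, rendering that list as a rack of covers for any chosen week would be straightforward.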

I love being able to read the magazines – advertising and all. They display the covers in batches by decade or five-year period, depending on the number of issues. I also like the Google map provided on each magazine’s ‘about’ page that shows ‘Places mentioned in this magazine’ and links you directly to the article that mentions the location marked on the map.

I think it is interesting that Google went with more of a PDF single scrolling model rather than an interface that mimics turning pages. In many issues (maybe all?) they have hot-linked the table of contents so that you can scroll down to that section instantly. You can also search within the magazine, though from my short experiments it seems that only the articles are text indexed and the advertisements are not.

Google’s current model for search is to return results for magazines mixed in with books in Google Book Search results – but they do let you limit your results to only magazines from their Advanced Search page within Google Book Search. See these results for a quick search on sunscreen in magazines.

Overall I mark this as a really nice step forward in access to old magazines. As with many visualizations, seeing the about page for any of these magazines made me ask myself new questions.  It will be interesting to see how many magazines sign on to be included and how the interface evolves.

To read more about Google’s foray into magazine digitization and search take a look at:

For a really nice analysis of the information that Google provides on the magazine pages see Search Engine Land’s Google Book Search Puts Magazines Online.

NEH Digital Humanities Startup Grant News: Visualizing Archival Collections


As of August 22nd, 2008 it was official. There is even a blog post over on the NEH Office of Digital Humanities updates page to prove it. The University of Maryland was granted a Level I NEH Digital Humanities Startup Grant to fund work on the ‘Visualizing Archival Collections’ project. The official one-liner is that the project will support “The development of visualization tools for assessing information contained in electronic archival finding aids created with Encoded Archival Description (EAD)”. Why did I wait so long to announce this on the blog? I wanted to have something fun to announce at the end of my SAA presentation out in San Francisco!

The project director is Dr. Jennifer Golbeck. I also have the support of University of Maryland’s Jennie Levine, Dr. Bruce Ambacher, and Dr. Doug Oard. This amazing set of collaborators should help me stay on the right track and make sure I keep the sometimes competing issues relating to archives, information retrieval and interface design in balance.

I will be collecting EAD encoded finding aids over the next few months. My goal is to gather a broad sample of English language finding aids from a wide range of institutions and work on the script that extracts this data into a database. Once we have the data extracted I get to look at what we have, do some data cleanup and start thinking about what sorts of visualizations might work with our real world data. During the spring term we will design and build a 2nd generation prototype of ArchivesZ.
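The extraction script itself isn’t described here, but a minimal sketch of pulling fields out of an EAD finding aid into database-ready rows might look like the following. The XML fragment and field choices are my own invention; real EAD 2002 files are far larger, messier, and often namespaced:

```python
import xml.etree.ElementTree as ET

# A tiny, simplified EAD 2002-style fragment, purely for illustration.
EAD_SAMPLE = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Jane Doe Papers</unittitle>
      <unitdate normal="1901/1942">1901-1942</unitdate>
      <physdesc><extent>12.5 linear feet</extent></physdesc>
    </did>
    <controlaccess>
      <subject source="lcsh">Women authors, American</subject>
      <subject source="lcsh">Publishers and publishing</subject>
    </controlaccess>
  </archdesc>
</ead>
"""

def extract_finding_aid(xml_text):
    """Pull a few collection-level fields from an EAD finding aid
    into a flat dict, ready to load into a database table."""
    root = ET.fromstring(xml_text)
    did = root.find(".//archdesc/did")
    return {
        "title": did.findtext("unittitle"),
        "dates": did.find("unitdate").get("normal"),
        "extent": did.findtext("physdesc/extent"),
        "subjects": [s.text for s in root.findall(".//controlaccess/subject")],
    }

record = extract_finding_aid(EAD_SAMPLE)
```

The interesting (and painful) part in practice is exactly the data cleanup mentioned above: normalized dates, extents, and controlled subject vocabularies vary wildly between institutions.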

Want your data to be part of this? If you would like to contribute EAD finding aids in XML format to the project, please send me the following information:

  1. Archives Name
  2. Archives Parent Institution (if applicable)
  3. Archives Location
  4. Contact at Archives for questions about the finding aids (name, email and phone number)
  5. Estimate of # of finding aids being offered
  6. Controlled Vocabulary or Thesaurus used for Subject values (as many as are used)
  7. Method of finding aid delivery (sending me a zip file? pointing me at a directory online? some other way?)
  8. Do I have your permission to post a discussion of the data issues I may find in your finding aids here on Spellbound Blog? (Please see the OSU Archives post as an example of the types of issues I discuss)

You can either put this into the form on my Contact Page or send email directly to jeanne AT spellboundblog dot com.

Thank you to everyone for their enthusiasm about the ArchivesZ project. It is very exciting to have the opportunity to take all these shiny ideas to the next level.

Dipity: Easy Hosted Timelines

I discovered Dipity via the Reuters article An open-source timeline of the virtual world. The article discusses the creation of a Virtual Worlds Timeline on the Dipity website. Dipity lets anyone create an account and start building timelines. In the case of the Virtual Worlds Timeline, the creator chose to permit others to collaborate on the timeline. Dipity also provides four ways of viewing any timeline: a classic left-to-right scrolling view, a flipbook, a list and a map.

I chose to experiment by creating a timeline for Spellbound Blog. Dipity made this very easy – I just selected WordPress and provided my blog’s URL. This was supposed to grab my 20 most recent posts – but it seems to have taken 10 instead. I tried to provide a username/password so that Dipity could pull ‘more’ of my posts (they didn’t say how many – maybe all of them?). I couldn’t get it to work as of this writing – but if I figure it out you will see many more than 10 posts.

I particularly like the way they use the images I include in my posts in the various views. I also appreciate that you can read the full posts in-place without leaving the timeline interface. I assume this is because I publish my full articles to my RSS feed. It was also interesting to note that posts that mentioned a specific location put a marker on a map – both within the single post ‘event’ as well as the full map view.

Dipity also supports the streamlined addition of many other sources such as Flickr, Picasa, YouTube, Vimeo, Blogger, Tumblr, Pandora, Twitter and any RSS feed. They have also created some neat mashups. TimeTube uses your supplied phrase to query YouTube and generates a timeline based on the video creation dates. Tickr lets you generate an interactive timeline based on a keyword or user search of Flickr.

Why should archivists care? I always perk up anytime a new web service appears that makes it easy to present time- and location-sensitive information. I wrote a while ago about MIT’s SIMILE project and I like their Timeline software, but in some ways hosted services like Dipity throw the net wider. I particularly appreciate the opportunity for virtual collaboration that Dipity provides. Imagine if every online archives exhibit included a Dipity timeline. Dipity provides embed code for all the timelines. This means that it should be easy to both feature the timeline within an online exhibit and use the timeline as a way to attract a broader audience to your website.

There has been discussion in the past about creating custom Google Maps to show off archival records in a new and different way. During THATCamp there was a lot of enthusiasm for timelines and maps as two of the most accessible types of visualizations. Anchoring information in time and/or location gives people a predictable way to approach new material.

Most of my initial thoughts about how archives could use Dipity related to individual collections and exhibits – but what if an archive created one of these timelines and added an entry for every one of their collections? The map could be used if individual collections were from a single location. The timeline could let users see at a glance which time periods were the focus of collections within that archives. A link could be provided in each entry pointing to the online finding aid for each collection or record group.

Dipity is still working out the kinks in some of their services, but if this sounds at all interesting I encourage you to go take a look at a few fun examples:

And finally I have embedded the Internet Memes timeline below to give you a feel of what this looks like. Try clicking on any of the events that include a little film icon at the bottom edge and see how you can view the video right in place:

Image Credit:  I found and ‘borrowed’ the Dipity image above from Dipity’s About page.

THATCamp 2008: Day 1 Dork Short Lightning Talks

During lunch on the first day of THATCamp, people volunteered to give lightning talks they called ‘Dork Shorts’. As we ate our lunch, a steady stream of folks paraded up to the podium and gave an elevator-pitch-length demo. These are the projects about which I managed to type URLs and some other info into my laptop. If you are looking for examples of inspirational and innovative work at the intersection of technology and the humanities – these are a great place to start!

Have more links to projects I missed including? Please add them in the comments below.

Image credit: Lightning by thenss (Christopher Cacho) via flickr

THATCamp 2008: Text Mining and the Persian Carpet Effect

I attended a THATCamp session on Text Mining. There were between 15 and 20 people in attendance. I have done my best to attribute ideas to their originators wherever possible – but please forgive the fact that I did not catch the names of everyone who was part of this session.

What Is Text Mining?

Text mining is an umbrella phrase that covers many different techniques and types of tools.

The CHNM NEH-funded text mining initiative defined text mining as needing to support these three research functions:

  • Locating or finding: improving on search
  • Extraction: once you find a set of interesting documents, how do you extract information in new (and hopefully faster) ways? How do you pull data from unstructured bulk into structured sets?
  • Analysis: support analyzing the data, discovery of patterns, answering questions
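As a toy illustration of the ‘extraction’ function (my own sketch, not something presented in the session), pulling year-and-sentence pairs out of raw prose is one simple way to turn unstructured text into a structured set:

```python
import re

def extract_events(text):
    """Pull (year, sentence) pairs out of raw prose -- a tiny example
    of extracting structured records from unstructured text."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Match four-digit years from 1500-2099.
        for year in re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", sentence):
            events.append({"year": int(year), "text": sentence.strip()})
    return events

sample = ("The mill opened in 1847. It burned down in 1902 "
          "and was rebuilt the next year.")
events = extract_events(sample)
```

Records like these can then feed directly into the ‘analysis’ step – plotted on a timeline, aggregated by decade, and so on.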

The group discussed that there were both macro and micro aspects to text mining. Sometimes you are trying to explore a collection. Sometimes you are trying to examine a single document in great detail. Still other situations call for using text mining to generate automated classification of content using established vocabularies. Different kinds of tools will be important during different phases of research.

Projects, Tools, Examples & Cool Ideas

Andrea Eastman-Mullins, from Alexander Street Press, mentioned the University of Chicago’s ARTFL Project and these two tools:

  • PhiloLogic: An XML/SGML based full-text search, retrieval and analysis tool
  • PhiloMine: an extension being developed for PhiloLogic to provide support for “a variety of machine learning, text mining, and document clustering tasks”.

Dan Cohen directed us to his post about Mapping What Americans Did on September 11 and to Twistori which text mines Twitter.

Other Projects & Examples:

Some neat ideas that were mentioned for ways text mining could be used (lots of other great ideas were discussed – these are the two that made it into my notes):

  • Train a tool with collections of content from individual time periods, then use the tool to help identify the originating time period of new documents. The same setup could also be used to identify shifts in patterns in text by comparing large data sets from specific date ranges.
  • If you have a tool that has learned how to classify certain types of content well… then watch for when it breaks – this can give you interesting trails to things to investigate.
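The first idea above can be sketched with a tiny multinomial naive Bayes classifier. This is my own illustration, not a tool from the session, and the training snippets are invented:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class PeriodClassifier:
    """Multinomial naive Bayes over word counts: train on documents
    labeled with a time period, then guess the period of new text."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # period -> word counts
        self.totals = Counter()              # period -> total words

    def train(self, period, text):
        words = tokenize(text)
        self.counts[period].update(words)
        self.totals[period] += len(words)

    def classify(self, text):
        vocab = {w for c in self.counts.values() for w in c}
        best, best_score = None, -math.inf
        for period in self.counts:
            score = 0.0
            for w in tokenize(text):
                # Laplace smoothing so unseen words don't zero out a period.
                p = (self.counts[period][w] + 1) / (self.totals[period] + len(vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = period, score
        return best

clf = PeriodClassifier()
clf.train("1860s", "the regiment marched to the telegraph office")
clf.train("1990s", "the modem dialed the bulletin board system")
guess = clf.classify("a telegraph message for the regiment")
```

The second idea – watching for when a well-trained classifier breaks – falls out of the same setup: documents where the score gap between periods collapses are exactly the interesting trails to investigate.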

Barriers to Text Mining

All of the following were touched upon as being barriers or challenges to text mining:

  • access to raw text in gated collections (i.e., collections which require payment for access to resources), such as JSTOR, Project MUSE and others.
  • tools that are too difficult for non-programmers to use
  • questions relating to the validity of text mining as a technique for drawing legitimate conclusions

Next Steps

These ideas were put forward as important for moving the field of text mining in the humanities forward:

  • develop and share best practices for use when cultural heritage institutions make digitization and transcription deals with corporate entities
  • create frameworks that enable individuals to reproduce the work of others and provide transparency into the assumptions behind the research
  • create tools and techniques that smooth the path from digitization to transcription
  • develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers

My thoughts
During the session I drew a parallel to archaeology, where one can glean information from the air that cannot be seen on the ground. I discovered this phenomenon has a name:

“Archaeologists call it the Persian carpet effect. Imagine you’re a mouse running across an elaborately decorated rug. The ground would merely be a blur of shapes and colors. You could spend your life going back and forth, studying an inch at a time, and never see the patterns. Like a mouse on a carpet, an archaeologist painstakingly excavating a site might easily miss the whole for the parts.” from Airborne Archaeology, Smithsonian magazine, December 2005 (emphasis mine)

While I don’t see any coffee table books in the near future of text mining (such as The Past from Above: Aerial Photographs of Archaeological Sites), I do think that this idea captures the promise that we have before us in the form of the text mining tools. Everyone in our session seemed to agree that these tools will empower people to do things that no individual could have done in a lifetime by hand. The digital world is producing terabytes of text. We will need text mining tools just to find our way in this blizzard of content. It is all well and good to know that each snowflake is unique – but tell that to the 21st century historian soon to be buried under the weight of blogs, tweets, wikis and all other manner of web content.

Image credit: Drift of Harrachov Mine by alarch via flickr

As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.