THATCamp2008 | Spellbound Blog

THATCamp Reflections

February 26, 2020

My path to the inaugural THATCamp started at the Society of American Archivists’ 2006 annual meeting in DC. I was a local grad student presenting my first poster, Communicating Context in Online Collections, and handing out home-printed cards for my blog. When I ran out, I just wrote the URL on scraps of paper. I found my way to session 510: Archives Seminar: Possibilities and Problems of Digital History and Digital Collections, featuring Dan Cohen and Roy Rosenzweig, described in the SAA program as follows:

The co-authors of Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web lead a discussion of their book and, in particular, the possibilities of digital history and of collecting the past online. The discussion includes reflections on the September 11 Digital Archive and the new Hurricane Digital Memory Bank, which collects stories, images, and other digital material related to hurricanes Katrina, Rita, and Wilma.

The full audio recording (one hour and twenty-four minutes) is available online if you want to dive down that particular rabbit hole.

2006 was early in the “archives blogging” landscape. It was the era of finding and following like-minded colleagues. RSS and feed readers! People had conversations in the comments. 2006 was the year I launched my blog. My post about Dan & Roy’s session was only the 9th post on my site. I was employed full time doing Oracle database work at Discovery and working towards my MLS in the University of Maryland’s CLIS (now iSchool) program part-time. So I added Dan’s blog to the list of the blogs I read. When Dan invited people to come to THATCamp in January of 2008 and I realized it was local – I signed up. You can see my nametag in the “stack of badges” photo above. For a taste of my experiences that day, take a look at my 2008 THATCamp blog posts.

In 2008, the opportunity to sit in a room of people who were interested in the overlap of technology and humanities was exciting. As a part-time graduate student (and wife and mother of a 6-year-old), I spent almost no time on campus. I did most of my thinking about archives and technology at home late at night in the glow of my computer screen. There was not a lot of emphasis on the digital in my MLS program at UMD. I had to find that outside the classroom.

The connections I made at that first THATCamp extend to today. As mentioned elsewhere, I was part of the group who put together the first regional THATCamp in Austin as a one-evening side-event for the Society of American Archivists Annual Meeting in 2009. I swear that Ben Brumfield and I were just going to meet for dinner while I was in Austin, where he lives, for SAA. Somehow that turned into “Why not throw a THATCamp?” How great to have no idea of the scale of what we were taking on! Ben did an amazing job of documenting what we learned and offering tips for future organizers, including giving yourself more time to plan, reaching out to as diverse a group as possible, and planning an event that lasts longer than four hours. All that said, it was a glorious and crazy evening. I still have my t-shirt. While our discussions might have been more archives-skewed than at most THATCamps, the evening also gave lots of archivists a taste of what THATCamp and un-conferences were like. Looking through the posts on the THATCamp Austin website, there was clearly an appetite for the event. We could easily have had enough topics to discuss to fill a weekend – but only had time for two one-hour session slots, plus a speed round of “dork shorts” lightning talks.

I know I went to other THATCamps along the way. I graduated with my MLS in 2009. I started an actual day-job as an archivist in July of 2011 at the World Bank. Suddenly I got paid to think about archives all day – and I didn’t need my blog in the way I used to. I started writing more fiction and attending conferences dedicated to digital preservation. Somewhere in there, I went to the 2012 THATCamp Games at UMD.

THATCamps brought together enthusiastic people from so many different types of digital and humanities practice — all with their own perspectives and their own problems to solve. We don’t get many opportunities to cross-pollinate among those from academia and the public and private sectors. Those early conversations were my first steps towards ideas about how archivists might collaborate with professionals from other communities on digital challenges and innovations. In fact, I can see threads stretching from the very first THATCamp all the way to my Partners for Preservation book project.

Thanks, THATCamp community.

This post is cross-posted as part of the 2020 THATCamp retrospective.

THATCamp 2008: Day 1 Dork Short Lightning Talks

June 14, 2008 2 Comments

During lunch on the first day of THATCamp, people volunteered to give lightning talks they called ‘Dork Shorts’. As we ate our lunch, a steady stream of folks paraded up to the podium and gave elevator-pitch-length demos. These are the projects for which I managed to type URLs and some other info into my laptop. If you are looking for examples of inspirational and innovative work at the intersection of technology and the humanities – these are a great place to start!

World Digital Library (Library of Congress)
PicLens + Firefox + any search results page from the New York Public Library Digital Gallery = a 3D experience of ALL the photos at one time. PicLens uses the RSS feed to retrieve the full set of images along with their captions and will work with any RSS feed of images – such as RSS image feeds from Flickr or SmugMug.
HistoryWired (National Museum of American History): A new spin on a treemap visualization built on top of museum metadata. One box is displayed per item and the box size is based on popularity. The rest of its innovations are just easier to experience than describe.
The Object of History (National Museum of American History + CHNM)
Omeka (CHNM)
Eminent Domain (NYPL Online Exhibition): built on Omeka
American Social History Online (Digital Library Federation): Zotero-enabled. They are on the hunt for more MODS records. Built on Ruby on Rails (RoR) and will be put out as open source software within a couple of months.
Typographia (David Rieder, NC State University)
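
For readers curious about how a tool can pull images and captions out of an RSS feed, as PicLens is described as doing above, here is a minimal sketch in Python. The feed is made up for illustration, and treating each item's `title` as the caption and its `enclosure` as the image is an assumption about how such a feed might be laid out. This is not PicLens's actual code.

```python
import xml.etree.ElementTree as ET

# A made-up feed for illustration; real image feeds vary in structure.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example image feed</title>
    <item>
      <title>Street scene</title>
      <enclosure url="http://example.org/images/1.jpg" type="image/jpeg" length="0"/>
    </item>
    <item>
      <title>Harbor view</title>
      <enclosure url="http://example.org/images/2.jpg" type="image/jpeg" length="0"/>
    </item>
    <item>
      <title>Text-only entry</title>
    </item>
  </channel>
</rss>"""


def images_with_captions(feed_xml):
    """Return (image URL, caption) pairs for every item with an image enclosure."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # item carries no attached file at all
        if not enclosure.get("type", "").startswith("image/"):
            continue  # attached file is not an image
        caption = item.findtext("title", default="")
        pairs.append((enclosure.get("url"), caption))
    return pairs


if __name__ == "__main__":
    for url, caption in images_with_captions(SAMPLE_FEED):
        print(caption, "->", url)
```

The text-only item is skipped, so only the two image entries come back, each paired with its caption.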

Have more links to projects I missed including? Please add them in the comments below.

Image credit: Lightning by thenss (Christopher Cacho) via flickr

THATCamp 2008: Crowdsourced Transcription and Collaborative Annotation

June 5, 2008 13 Comments

The THATCamp session officially titled ‘Crowdsourcing’ on the schedule was actually aimed at discussing the intersection of crowdsourced transcription and collaborative annotation. The group was small, just six of us, and Ben Brumfield got us going with an overview of transcription software and projects:

The FamilySearch Indexing Project is an LDS church project put out by the FamilySearch Labs. Their goals: “Volunteers extract family history information from digital images of historical documents to create searchable indexes that assist everyone in finding their ancestors.”
The Manuscript Transcription Assistant is based at Worcester Polytechnic Institute (WPI) and is described as “a tool to assist transcribers in creating transcriptions, and incorporate meta-data about each image and transcription that can then be used to search through an electronic library of transcriptions”. I found mention in the FAQ of the desire to create a community so that “transcribers will be able to collaborate their work by rating the quality of other user’s transcriptions. By ranking the transcriptions, specific versions of transcriptions will emerge as an authority for that manuscript.” Unfortunately, a lot of the links on that site are broken and my attempt to register gave me an error. It is not clear to me that this project is actually still active.
Soldier Studies is a website dedicated to posting transcriptions of Civil War letters and diaries. This is not a tool for transcribing, but it is clearly a repository aimed specifically at transcriptions (see their Mission Statement for more information).
Oh No Robot is a comics transcription and search tool. It provides a page to find comics needing transcription and a great page to explain how transcription works on their site.

After examining what was out there, Ben concluded that what he wanted didn’t exist – so he started to build it himself. He gave us a demo of his “very beta” software. His goal is to build a web-based tool to support collaborative manuscript transcription and annotation by individuals without a strong technical background. In its current (and private beta) state the software supports transcription, an innovative approach to linking individual words or phrases to collection-defined subjects, and some basic community tools to let his virtual team discuss transcription issues. Ben is working hard on the software – if you are interested in his project, definitely get in touch with him.

Travis Brown showed us his creation: eComma. eComma aims to “enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online”. He showed us how users could tag or add comments on individual words or phrases of a loaded text. Take a look at the eComma page for Sonnet 18 by William Shakespeare. The words highlighted in blue are those which are tagged or have comments associated with them. If you highlight ‘the eye of heaven’ in line 5 you will see that it is tagged as a metaphor. Travis reported that he will have 2 other programmers working on eComma with him this summer and has his eye on improving some interface issues and adding a few more features.

We also talked about ways to display transcription. Elena Razlogova guided us over to the DoHistory website. There she showed us the Magic Lens interface. This interface displays the transcription of a handwritten diary page via a lens-style overlay that you can move with your mouse. This reminded me of the Gilder Lehrman Battle Lines: Letters from America’s Wars interface that I found when doing research for my Communicating Context in Online Collections poster. If you haven’t seen it before – go examine the page showing the transcription of (turn down your speaker if a reader’s voice will disturb those around you) Nathanael Greene’s letter to Catherine Greene dated July 17, 1778.

While on the DoHistory site I also found the Try Your Hand At Transcribing page. This page shows the challenge of transcribing handwritten documents by giving you the chance to try it yourself and then lets you check your transcription with the click of a button.
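
The idea of checking a transcription attempt against a reference can be sketched in a few lines of Python using the standard library's `difflib`. This is only an illustration of the general technique, not how the DoHistory page works, and the sample sentences below are invented rather than taken from any diary.

```python
import difflib


def check_transcription(attempt, reference):
    """Compare two transcriptions word by word.

    Returns a similarity ratio between 0 and 1 and a list of
    (kind, attempt_words, reference_words) differences.
    """
    attempt_words = attempt.split()
    reference_words = reference.split()
    matcher = difflib.SequenceMatcher(a=attempt_words, b=reference_words)
    differences = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(attempt_words[a1:a2]), " ".join(reference_words[b1:b2]))
            )
    return matcher.ratio(), differences


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    reference = "Clear and pleasant today"
    attempt = "Clear and plesant today"
    ratio, differences = check_transcription(attempt, reference)
    print(ratio)        # 0.75
    print(differences)  # [('replace', 'plesant', 'pleasant')]
```

Comparing by word rather than by character keeps the report readable: each difference points at the word a volunteer would need to look at again.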

We talked a bit about the technology behind eComma (forgive me Travis for not having enough details in my notes to explain your current architecture here) and the challenges inherent in wanting to annotate overlapping sets of words. Though he isn’t using it in the current implementation of eComma, Travis mentioned the Layered Markup Annotation Language (LMNL) which the tutorial page explains as:

…LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.

I can definitely see how LMNL might be an interesting framework for building transcription and annotation software.
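
To make the overlap problem concrete, here is a rough sketch in Python of annotations stored as character ranges kept separate from the text. It is not LMNL syntax and not eComma's architecture; the `Range` class, the layer names and the "personification" tag are all hypothetical. Only the "metaphor" tag on ‘the eye of heaven’ comes from the eComma example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int  # index of the first character covered
    end: int    # index one past the last character covered
    layer: str  # who or what produced the annotation (hypothetical names)
    name: str   # the tag itself


def annotate(text, phrase, layer, name):
    """Create a Range covering the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Range(start, start + len(phrase), layer, name)


def crossing_pairs(ranges):
    """Return pairs of ranges that overlap without one containing the other.

    These are the cases a strictly nested markup tree cannot represent directly.
    """
    pairs = []
    for i, first in enumerate(ranges):
        for second in ranges[i + 1:]:
            if (first.start < second.start < first.end < second.end
                    or second.start < first.start < second.end < first.end):
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    line = "Sometime too hot the eye of heaven shines,"
    ranges = [
        annotate(line, "the eye of heaven", "reader_a", "metaphor"),
        annotate(line, "heaven shines", "reader_b", "personification"),
    ]
    for first, second in crossing_pairs(ranges):
        print(line[first.start:first.end], "|", line[second.start:second.end])
```

Because each range only records positions, two readers can mark phrases that partly overlap, here sharing the word "heaven", without either annotation having to be split.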

Krissy O’Hare brought up the challenges of transcribing audio and video that she has faced working on oral history projects at Concordia University. This led to Travis (I think?) mentioning the Texas German Dialect Project (TGDP) and the CMU Sphinx Group Speech Recognition Engine. TGDP has an online archive of recorded interviews along with their transcriptions and translations. CMU Sphinx’s introduction explains that their software tools are targeted at expert users wanting to build speech-using applications.

This was a great session. The small group gave everyone a chance to contribute and take over the keyboard in order to show off their favorite sites. It was immediately after the Text Mining session, so our minds were already full of all the great things one could do with text once it is transcribed.

I am excited to watch the evolution of group transcription and annotation software. If you know of other transcription or annotation tools or projects – please post them to the comments.

Image credit: Free pencils by zone41 via flickr

As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

THATCamp 2008: Text Mining and the Persian Carpet Effect

June 1, 2008 5 Comments

I attended a THATCamp session on Text Mining. There were between 15 and 20 people in attendance. I have done my best to attribute ideas to their originators wherever possible – but please forgive the fact that I did not catch the names of everyone who was part of this session.

What Is Text Mining?

Text mining is an umbrella phrase that covers many different techniques and types of tools.

The CHNM NEH-funded text mining initiative defined text mining as needing to support these three research functions:

Locating or finding: improving on search
Extraction: once you find a set of interesting documents, how do you extract information in new (and hopefully faster) ways? How do you pull data from unstructured bulk into structured sets?
Analysis: support analyzing the data, discovery of patterns, answering questions
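
As a toy illustration of how these three functions fit together, here is a minimal Python sketch. The documents are invented, and year-spotting with a regular expression stands in for far more capable extraction tools. It is my own simplification, not anything from the CHNM initiative.

```python
import re
from collections import Counter

# Invented documents, for illustration only.
DOCUMENTS = {
    "letter_01": "Arrived in Boston in 1861. The harbor was crowded.",
    "letter_02": "We left the farm in 1859 and reached Boston by 1861.",
    "diary_01": "Rain all day. Nothing to report from the farm.",
}

# Four-digit years from 1000 to 2099; a deliberately crude pattern.
YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")


def locate(documents, term):
    """Locating: ids of documents that mention term (case-insensitive)."""
    term = term.lower()
    return sorted(doc_id for doc_id, text in documents.items() if term in text.lower())


def extract_years(text):
    """Extraction: pull structured values (years) out of unstructured text."""
    return YEAR.findall(text)


def analyze(documents, doc_ids):
    """Analysis: count how often each year appears across the chosen documents."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(extract_years(documents[doc_id]))
    return counts


if __name__ == "__main__":
    hits = locate(DOCUMENTS, "boston")
    print(hits)                      # ['letter_01', 'letter_02']
    print(analyze(DOCUMENTS, hits))  # Counter({'1861': 2, '1859': 1})
```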

The group discussed that there were both macro and micro aspects to text mining. Sometimes you are trying to explore a collection. Sometimes you are trying to examine a single document in great detail. Still other situations call for using text mining to generate automated classification of content using established vocabularies. Different kinds of tools will be important during different phases of research.

Projects, Tools, Examples & Cool Ideas

Andrea Eastman-Mullins, from Alexander Street Press, mentioned the University of Chicago’s ARTFL Project and these two tools:

PhiloLogic: An XML/SGML based full-text search, retrieval and analysis tool
PhiloMine: an extension being developed for PhiloLogic to provide support for “a variety of machine learning, text mining, and document clustering tasks”.

Dan Cohen directed us to his post about Mapping What Americans Did on September 11 and to Twistori which text mines Twitter.

Other Projects & Examples:

MONK project (Metadata Offer New Knowledge)
Open Content Alliance (OCA)
Library of Congress Chronicling America – newspaper pages from 1897-1910
Tanya Clement’s project “Using Digital Tools to Not-Read Gertrude Stein’s The Making of Americans” at University of Maryland, College Park
Two other University of Maryland, College Park projects that were not mentioned during the session, but may be of interest are FeatureLens and BasketLens
Google Docs now includes Flesch-Kincaid Readability Tests and Automated Readability Index in the same window in which it shows you your Word Count
Spam filters – such as Bayesian spam filtering, which uses text mining to identify spam e-mails
Clustering – see my post on this: Clustering Data: Generating Organization from the Ground Up and also take a look at Clusty.com and their ‘remix clusters’ option.
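
The readability scores mentioned in the list above are simple formulas over counts of characters, words, sentences and syllables. Here is a small Python sketch using the commonly published constants for each formula. It takes the counts as inputs rather than trying to count syllables itself, and it makes no claim about how Google Docs computes its numbers.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher scores mean easier reading."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters, words, sentences):
    """Grade-level estimate based on characters instead of syllables."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


if __name__ == "__main__":
    # Hypothetical counts for a 100-word passage.
    words, sentences, syllables, characters = 100, 5, 150, 450
    print(round(flesch_reading_ease(words, sentences, syllables), 3))         # 59.635
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))        # 9.91
    print(round(automated_readability_index(characters, words, sentences), 3))  # 9.765
```

Counting syllables reliably is the hard part in practice, which is one reason the Automated Readability Index relies on character counts instead.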

Some neat ideas that were mentioned for ways text mining could be used (lots of other great ideas were discussed – these are the two that made it into my notes):

Train a tool with collections of content from individual time periods, then use the tool to assist in identification of originating time period for new documents. Also could use this same setup to identify shifts in patterns in text by comparing large data sets from specific date ranges
If you have a tool that has learned how to classify certain types of content well… then watch for when it breaks – this can give you interesting trails to things to investigate.
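
The first idea, training on text from known periods and then guessing the period of a new document, can be sketched with a small naive Bayes classifier in Python. Naive Bayes is my choice for illustration; nobody in the session named a specific algorithm. The `PeriodClassifier` class and the training sentences are invented, and a real project would need far more text per period.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class PeriodClassifier:
    """Multinomial naive Bayes with add-one smoothing (hypothetical helper)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # period -> word -> count
        self.doc_counts = Counter()              # period -> training documents seen
        self.vocabulary = set()

    def train(self, period, text):
        words = tokenize(text)
        self.word_counts[period].update(words)
        self.doc_counts[period] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score per period (higher is more likely)."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for period, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[period] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[period] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    clf = PeriodClassifier()
    # Invented training sentences, for illustration only.
    clf.train("1860s", "the telegraph brought news of the regiment and the railroad")
    clf.train("1860s", "our regiment marched to the railroad depot by carriage")
    clf.train("2000s", "posted a blog entry and sent an email about the website")
    clf.train("2000s", "the website and the blog were updated after the email")
    print(clf.classify("news of the regiment arrived by telegraph"))
    print(clf.classify("she updated the blog and sent an email"))
```

For the second idea, one rough signal that a trained tool is "breaking" might be a small gap between the top two values returned by `scores`, which would flag documents worth a closer look.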

Barriers to Text Mining

All of the following were touched upon as being barriers or challenges to text mining:

access to raw text in gated collections (i.e., collections which require payment to permit access to resources) such as JSTOR, Project MUSE and others
tools that are too difficult for non-programmers to use
questions relating to the validity of text mining as a technique for drawing legitimate conclusions

Next Steps

The following ideas were put forward as important for advancing the field of text mining in the humanities:

develop and share best practices for use when cultural heritage institutions make digitization and transcription deals with corporate entities
create frameworks that enable individuals to reproduce the work of others and provide transparency into the assumptions behind the research
create tools and techniques that smooth the path from digitization to transcription
develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers

My thoughts
During the session I drew a parallel to archaeology, where looking at a site from the air can reveal information that cannot be seen from the ground. I discovered that this has a name:

“Archaeologists call it the Persian carpet effect. Imagine you’re a mouse running across an elaborately decorated rug. The ground would merely be a blur of shapes and colors. You could spend your life going back and forth, studying an inch at a time, and never see the patterns. Like a mouse on a carpet, an archaeologist painstakingly excavating a site might easily miss the whole for the parts.” from Airborne Archaeology, Smithsonian magazine, December 2005 (emphasis mine)

While I don’t see any coffee table books in the near future of text mining (such as The Past from Above: Aerial Photographs of Archaeological Sites), I do think that this idea captures the promise that we have before us in the form of text mining tools. Everyone in our session seemed to agree that these tools will empower people to do things that no individual could have done in a lifetime by hand. The digital world is producing terabytes of text. We will need text mining tools just to find our way in this blizzard of content. It is all well and good to know that each snowflake is unique – but tell that to the 21st-century historian soon to be buried under the weight of blogs, tweets, wikis and all other manner of web content.

Image credit: Drift of Harrachov Mine by alarch via flickr
