I attended a THATCamp session on Text Mining. There were between 15 and 20 people in attendance. I have done my best to attribute ideas to their originators wherever possible – but please forgive the fact that I did not catch the names of everyone who was part of this session.
What Is Text Mining?
Text mining is an umbrella phrase that covers many different techniques and types of tools.
The CHNM NEH-funded text mining initiative defined text mining tools as needing to support three research functions:
- Locating or finding: improving on search
- Extraction: once you have found a set of interesting documents, how do you extract information in new (and hopefully faster) ways? How do you pull data from unstructured bulk into structured sets?
- Analysis: support analyzing the data, discovery of patterns, answering questions
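As a toy illustration of the extraction step, even a few lines of pattern matching can turn unstructured prose into a structured set. The sample text and pattern below are my own invention, not something from the session:

```python
import re

# A toy extraction pass: pull four-digit years (1800s-1900s) out of
# unstructured prose into a structured, sorted list. The sample text
# is invented purely for illustration.
text = ("The society was founded in 1897, reorganized in 1905, "
        "and published its final bulletin in 1910.")

years = sorted(re.findall(r"\b1[89]\d{2}\b", text))
print(years)  # ['1897', '1905', '1910']
```

Real extraction work (names, places, events) is much harder than this, but the shape is the same: unstructured text in, structured records out.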
The group discussed both macro and micro aspects of text mining. Sometimes you are trying to explore a collection; sometimes you are examining a single document in great detail. Still other situations call for using text mining to automatically classify content against established vocabularies. Different kinds of tools will be important during different phases of research.
Projects, Tools, Examples & Cool Ideas
- PhiloLogic: An XML/SGML based full-text search, retrieval and analysis tool
- PhiloMine: an extension being developed for PhiloLogic to provide support for “a variety of machine learning, text mining, and document clustering tasks”.
Other Projects & Examples:
- MONK project (Metadata Offer New Knowledge)
- Open Content Alliance (OCA)
- Library of Congress Chronicling America – newspaper pages from 1897-1910
- Tanya Clement’s project “Using Digital Tools to Not-Read Gertrude Stein’s The Making of Americans” at University of Maryland, College Park
- Two other University of Maryland, College Park projects that were not mentioned during the session but may be of interest: FeatureLens and BasketLens
- Google Docs now includes Flesch-Kincaid Readability Tests and Automated Readability Index in the same window in which it shows you your Word Count
- Spam filters – such as Bayesian spam filtering using text mining to identify spam e-mails
- Clustering – see my post on this: Clustering Data: Generating Organization from the Ground Up and also take a look at Clusty.com and their ‘remix clusters’ option.
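The readability scores mentioned above are just simple formulas over word, sentence, and syllable counts. Here is a rough sketch of the Flesch Reading Ease score; the syllable counter is a crude vowel-group heuristic of my own, so the numbers are approximate and will not match what Google Docs reports:

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Syllables are estimated by counting vowel groups, a rough heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

score = flesch_reading_ease("Text mining helps. It finds patterns.")
print(round(score, 1))
```

Higher scores mean easier reading; short, plain sentences like the sample land near the top of the scale.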
Some neat ideas mentioned for ways text mining could be used (many other great ideas were discussed – these are the two that made it into my notes):
- Train a tool on collections of content from individual time periods, then use it to help identify the originating time period of new documents. The same setup could also be used to identify shifts in textual patterns by comparing large data sets from specific date ranges
- If you have a tool that has learned to classify certain types of content well, watch for when it breaks – the failures can give you interesting trails to investigate.
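The "train on each period, then classify new documents" idea maps directly onto a standard naive Bayes text classifier, the same technique behind Bayesian spam filters. Below is a bare-bones sketch with tiny, made-up "period" corpora; real work would need large period-specific collections and more careful smoothing and feature selection:

```python
import math
from collections import Counter

def train(docs_by_label):
    """Count word frequencies per label; these counts are the whole model."""
    return {label: Counter(w for doc in docs for w in doc.lower().split())
            for label, docs in docs_by_label.items()}

def classify(model, text):
    """Pick the label whose word distribution best explains the text
    (log-likelihood with add-one smoothing; uniform prior assumed)."""
    words = text.lower().split()
    vocab = {w for counts in model.values() for w in counts}
    best, best_score = None, float("-inf")
    for label, counts in model.items():
        total = sum(counts.values()) + len(vocab)
        score = sum(math.log((counts[w] + 1) / total) for w in words)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented mini-corpora purely for illustration.
model = train({
    "1890s": ["the telegraph office wired the dispatch",
              "steamship arrivals were printed daily"],
    "2000s": ["the blog post was shared on the web",
              "users posted comments on the site"],
})
print(classify(model, "a dispatch arrived by telegraph"))  # prints "1890s"
```

The "watch for when it breaks" idea falls out naturally here: documents the model scores poorly under every label, or assigns to a surprising period, are exactly the ones worth a closer look.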
Barriers to Text Mining
All of the following were touched upon as being barriers or challenges to text mining:
- access to raw text in gated collections (i.e., collections that require payment for access) such as JSTOR, Project MUSE and others.
- tools that are too difficult for non-programmers to use
- questions relating to the validity of text mining as a technique for drawing legitimate conclusions
These ideas were put forward as important for moving the field of text mining in the humanities forward:
- develop and share best practices for use when cultural heritage institutions make digitization and transcription deals with corporate entities
- create frameworks that enable individuals to reproduce the work of others and provide transparency into the assumptions behind the research
- create tools and techniques that smooth the path from digitization to transcription
- develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers
During the session I drew a parallel to archaeology, where one can glean information from the air that cannot be seen on the ground. I discovered this phenomenon has a name:
“Archaeologists call it the Persian carpet effect. Imagine you’re a mouse running across an elaborately decorated rug. The ground would merely be a blur of shapes and colors. You could spend your life going back and forth, studying an inch at a time, and never see the patterns. Like a mouse on a carpet, an archaeologist painstakingly excavating a site might easily miss the whole for the parts.” from Airborne Archaeology, Smithsonian magazine, December 2005 (emphasis mine)
While I don’t see any coffee table books in the near future of text mining (such as The Past from Above: Aerial Photographs of Archaeological Sites), I do think this idea captures the promise that text mining tools hold for us. Everyone in our session seemed to agree that these tools will empower people to do things that no individual could have done in a lifetime by hand. The digital world is producing terabytes of text. We will need text mining tools just to find our way in this blizzard of content. It is all well and good to know that each snowflake is unique – but tell that to the 21st century historian soon to be buried under the weight of blogs, tweets, wikis and all other manner of web content.
As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.