
Category: information visualization

SXSW Interactive: Data and Revelations

I am typing on a laptop in the Samsung blogger lounge at SXSW. Given this easy opportunity to blog, I wanted to share the overarching theme of my experience so far (three days in) at SXSW Interactive. Data. It is all about data. APIs exposing data. People visualizing data. Using data to make business and policy decisions. Graphing data to keep track of web site and application performance. Privacy of data. Crowdsourcing data. Data about social media behavior. And on and on!

It has been a common thread I have traced from session to session, conversation to conversation. I expect someone with less of a database and metadata fixation might see something else as the overall meme, but I have a purse full of cards pointing me to new data sources and a notebook full of URLs to track down later to defend my view.

I keep catching myself giving mini-lessons on archives and preservation of electronic records like some sort of envoy from another universe. While I feel like a strong overall tech person at an archives conference, I feel like a data and visualization person here. This morning two of my sessions were in the same hotel that hosted SAA in Austin, and it was strange to be there with such a different group of people. I have managed to connect with an assortment of digital humanities folks. Someone even managed to find space for and plan an informal event for tomorrow night: Innovating and Developing with Libraries, Archives, and Museums.

My list of tech to learn (HTML5, NoSQL) and projects to contemplate and move forward (mostly ideas for visualizations using all the data everyone is sharing) is getting longer by the hour. It has been a process to figure out how to get the most I can out of SXSW. It is definitely more a space for inspiration than for deep diving into specifics. Letting go of the instinct that I am supposed to ‘learn new skills’ at a conference is fabulous!

Creative Funding for Text-Mining and Visualization Project

The Hip-Hop Word Count project on Kickstarter.com caught my eye because it seems to be a really interesting new model for funding a digital humanities project. You can watch the video below – but the core of the project is a database of metadata from 40,000 rap songs from 1979 to the present, including stats about each song (word count, syllables, education level, etc.), individual words, artist location and date. This information aims to become a public online almanac fueled by visualizations.
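As an aside, per-song stats of the kind they describe (word count, syllables, reading level) are easy to approximate. Here is a rough sketch in Python – entirely my own illustration, not the project's actual method; the syllable heuristic and the use of the Flesch-Kincaid grade formula are assumptions:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels,
    # dropping a common silent trailing 'e'.
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def song_stats(lyrics: str) -> dict:
    """Compute word count, syllable count, and an approximate
    Flesch-Kincaid grade level for one song's lyrics. Each non-empty
    line is treated as a 'sentence', since lyrics rarely use periods."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    words = re.findall(r"[a-zA-Z']+", lyrics)
    syllables = sum(count_syllables(w) for w in words)
    grade = (0.39 * len(words) / max(1, len(lines))
             + 11.8 * syllables / max(1, len(words)) - 15.59)
    return {"word_count": len(words),
            "syllable_count": syllables,
            "grade_level": round(grade, 1)}
```

Real lyric analysis would need a proper syllable dictionary, but even this crude version gives plausible relative numbers across a large corpus.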

I am a backer of this project, and you can be too. As of the original writing of this post, they are 47% funded with twenty-eight days left before their deadline. For those of you not familiar with Kickstarter, people can post creative projects and provide rewards for their funders. The funding only goes through if the project reaches its goal within the time limit – otherwise nothing happens, a model they call ‘all-or-nothing funding’.

What will the money be spent on?

  • 45% for PHP programmers who have been coding the custom web interface
  • 35% for interface designers
  • 10% for data acquisition & data clean up
  • 10% for hosting bills

They aim for a five-month timeline to move from their existing functional prototype to something viable to release to the public.

I am also intrigued by ways that the work on this project might be leveraged in the future to support similar text-mining projects that tie in location and date. How about doing the same thing with Civil War letters? How about mining the lyrics of Broadway musicals?

If this all sounds interesting, take a look at the video below and read more on the Hip-Hop Word Count Kickstarter home page. If half the people who follow my RSS feed pitch in $10, this project would be funded. Take a look and consider pitching in. If this project doesn’t speak to you – take a look around Kickstarter for something else you might want to support.

Google Tackles Magazine Archives

As has been reported around the web today, Google is now digitizing and adding magazines to Google Book Search. This follows on the heels of the recent Google Life Photo archive announcement.

I took a look around to see what I could see. I was intrigued by the fact that I couldn’t see a list of all the magazines in their collection. So I went after the information the hard way: I kept reloading the Google Book Search home page until I didn’t see any new titles displayed in their highlighted magazine section. This is what I came up with, roughly grouped by topic.

Science and technology:

Lifestyle and city themed:

African American:

  • Ebony Jr!: May 1973 through October 1985
  • Jet: November 1961 through October 2008
  • Black Digest: Named ‘Negro Digest’ from November 1961 through April 1970, then Black Digest from May 1970 through April 1976.

Health, nutrition and organic:

  • Women’s Health and Men’s Health: January 2006 through present. I found it very amusing to be able to scan the covers of all the issues so easily – true for all of these magazines of course, but funny to see cover after cover of almost identically clad men and women exercising.
  • Prevention: January 2006 through the present
  • Better Nutrition: January 1999 through December 2004
  • Organic Gardening: November 2005 to the present
  • Vegetarian Times: March 1981 through November 2004

Sports and the outdoors:

They of course promise more magazines on the way, so if you are reading this long after mid-December 2008, I would assume there are more magazines and more issues available now. I hope that they make it easier to browse just magazines. Once they have a broader array of titles – how neat would it be to build a virtual news stand for a specific week in history? It shouldn’t be hard – they have all the metadata and cover images they need.
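The ‘virtual news stand’ idea really is mostly a grouping problem. Assuming you had a list of (title, publication date, cover URL) records – the data shape and sample issues below are invented for illustration – a sketch might look like this:

```python
from collections import defaultdict
from datetime import date

def newsstand(issues, year, week):
    """Group magazine issues by ISO week, then return the 'stand'
    for one week in history: every issue published that week.
    `issues` is a list of (title, publication_date, cover_url)."""
    by_week = defaultdict(list)
    for title, pub_date, cover in issues:
        iso = pub_date.isocalendar()  # (ISO year, ISO week, weekday)
        by_week[(iso[0], iso[1])].append((title, pub_date, cover))
    return by_week[(year, week)]

# Hypothetical sample metadata (titles real, dates/URLs invented).
issues = [
    ("Jet", date(1968, 4, 11), "jet-1968-04-11.jpg"),
    ("Popular Science", date(1968, 4, 9), "ps-1968-04.jpg"),
    ("Jet", date(1968, 4, 18), "jet-1968-04-18.jpg"),
]
stand = newsstand(issues, 1968, 15)  # the week of April 8-14, 1968
```

With cover images attached to each record, rendering that list as a rack of covers for any chosen week would be straightforward.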

I love being able to read the magazines – advertising and all. They display the covers in batches by decade or five-year period, depending on the number of issues. I also like the Google map provided on each magazine’s ‘about’ page that shows ‘Places mentioned in this magazine’ and links you directly to the article that mentions the location marked on the map.

I think it is interesting that Google went with more of a PDF single scrolling model rather than an interface that mimics turning pages. In many issues (maybe all?) they have hot-linked the table of contents so that you can scroll down to that section instantly. You can also search within the magazine, though from my short experiments it seems that only the articles are text indexed and the advertisements are not.

Google’s current model for search is to return results for magazines mixed in with books in Google Book Search results – but they do let you limit your results to only magazines from their Advanced Search page within Google Book Search. See these results for a quick search on sunscreen in magazines.

Overall I mark this as a really nice step forward in access to old magazines. As with many visualizations, seeing the about page for any of these magazines made me ask myself new questions.  It will be interesting to see how many magazines sign on to be included and how the interface evolves.

To read more about Google’s foray into magazine digitization and search take a look at:

For a really nice analysis of the information that Google provides on the magazine pages see Search Engine Land’s Google Book Search Puts Magazines Online.

NEH Digital Humanities Startup Grant News: Visualizing Archival Collections


As of August 22nd, 2008 it was official. There is even a blog post over on the NEH Office of Digital Humanities updates page to prove it. The University of Maryland was granted a Level I NEH Digital Humanities Startup Grant to fund work on the ‘Visualizing Archival Collections’ project. The official one-liner is that the project will support “The development of visualization tools for assessing information contained in electronic archival finding aids created with Encoded Archival Description (EAD)”. Why did I wait so long to announce this on the blog? I wanted to have something fun to announce at the end of my SAA presentation out in San Francisco!

The project director is Dr. Jennifer Golbeck. I also have the support of University of Maryland’s Jennie Levine, Dr. Bruce Ambacher, and Dr. Doug Oard. This amazing set of collaborators should help me stay on the right track and make sure I keep the sometimes competing issues relating to archives, information retrieval and interface design in balance.

I will be collecting EAD encoded finding aids over the next few months. My goal is to gather a broad sample of English language finding aids from a wide range of institutions and work on the script that extracts this data into a database. Once we have the data extracted I get to look at what we have, do some data cleanup and start thinking about what sorts of visualizations might work with our real world data. During the spring term we will design and build a 2nd generation prototype of ArchivesZ.
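The extraction script itself isn’t described here, but a minimal sketch of pulling fields out of an EAD finding aid into database-ready rows might look like the following. The XML fragment and field choices are my own invention; real EAD 2002 files are far larger, messier, and often namespaced:

```python
import xml.etree.ElementTree as ET

# A tiny, simplified EAD 2002-style fragment, purely for illustration.
EAD_SAMPLE = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Jane Doe Papers</unittitle>
      <unitdate normal="1901/1942">1901-1942</unitdate>
      <physdesc><extent>12.5 linear feet</extent></physdesc>
    </did>
    <controlaccess>
      <subject source="lcsh">Women authors, American</subject>
      <subject source="lcsh">Publishers and publishing</subject>
    </controlaccess>
  </archdesc>
</ead>
"""

def extract_finding_aid(xml_text):
    """Pull a few collection-level fields from an EAD finding aid
    into a flat dict, ready to load into a database table."""
    root = ET.fromstring(xml_text)
    did = root.find(".//archdesc/did")
    return {
        "title": did.findtext("unittitle"),
        "dates": did.find("unitdate").get("normal"),
        "extent": did.findtext("physdesc/extent"),
        "subjects": [s.text for s in root.findall(".//controlaccess/subject")],
    }

record = extract_finding_aid(EAD_SAMPLE)
```

The interesting (and painful) part in practice is exactly the data cleanup mentioned above: normalized dates, extents, and controlled subject vocabularies vary wildly between institutions.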

Want your data to be part of this? If you would like to contribute EAD finding aids in XML format to the project, please send me the following information:

  1. Archives Name
  2. Archives Parent Institution (if applicable)
  3. Archives Location
  4. Contact at Archives for questions about the finding aids (name, email and phone number)
  5. Estimate of # of finding aids being offered
  6. Controlled Vocabulary or Thesaurus used for Subject values (as many as are used)
  7. Method of finding aid delivery (sending me a zip file? pointing me at a directory online? some other way?)
  8. Do I have your permission to post a discussion of the data issues I may find in your finding aids here on Spellbound Blog? (Please see the OSU Archives post as an example of the types of issues I discuss)

You can either put this into the form on my Contact Page or send email directly to jeanne AT spellboundblog dot com.

Thank you to everyone for their enthusiasm about the ArchivesZ project. It is very exciting to have the opportunity to take all these shiny ideas to the next level.

Dipity: Easy Hosted Timelines

I discovered Dipity via the Reuters article An open-source timeline of the virtual world. The article discusses the creation of a Virtual Worlds Timeline on the Dipity website. Dipity lets anyone create an account and start building timelines. In the case of the Virtual Worlds Timeline, the creator chose to permit others to collaborate on the timeline. Dipity also provides four ways of viewing any timeline: a classic left-to-right scrolling view, a flipbook, a list and a map.

I chose to experiment by creating a timeline for Spellbound Blog. Dipity made this very easy – I just selected WordPress and provided my blog’s URL. This was supposed to grab my 20 most recent posts – but it seems to have taken 10 instead. I tried to provide a username/password so that Dipity could pull ‘more’ of my posts (they didn’t say how many – maybe all of them?). I couldn’t get it to work as of this writing – but if I figure it out you will see many more than 10 posts.

I particularly like the way they use the images I include in my posts in the various views. I also appreciate that you can read the full posts in-place without leaving the timeline interface. I assume this is because I publish my full articles to my RSS feed. It was also interesting to note that posts that mentioned a specific location put a marker on a map – both within the single post ‘event’ as well as the full map view.

Dipity also supports the streamlined addition of many other sources such as Flickr, Picasa, YouTube, Vimeo, Blogger, Tumblr, Pandora, Twitter and any RSS feed. They have also created some neat mashups. TimeTube uses your supplied phrase to query YouTube and generates a timeline based on the video creation dates. Tickr lets you generate an interactive timeline based on a keyword or user search of Flickr.

Why should archivists care? I always perk up anytime a new web service appears that makes it easy to present time- and location-sensitive information. I wrote a while ago about MIT’s SIMILE project and I like their Timeline software, but in some ways hosted services like Dipity throw the net wider. I particularly appreciate the opportunity for virtual collaboration that Dipity provides. Imagine if every online archives exhibit included a Dipity timeline. Dipity provides embed code for all the timelines. This means that it should be easy to both feature the timeline within an online exhibit and use the timeline as a way to attract a broader audience to your website.

There has been discussion in the past about creating custom Google Maps to show off archival records in a new and different way. During THATCamp there was a lot of enthusiasm for timelines and maps as two of the most accessible types of visualizations. Anchoring information in time and/or location gives people a predictable way to approach new material.

Most of my initial thoughts about how archives could use Dipity related to individual collections and exhibits – but what if an archive created one of these timelines and added an entry for every one of their collections? The map could be used if individual collections were from a single location. The timeline could let users see at a glance which time periods were the focus of collections within that archives. A link could be provided in each entry pointing to the online finding aid for each collection or record group.

Dipity is still working out the kinks in some of their services, but if this sounds at all interesting I encourage you to go take a look at a few fun examples:

And finally I have embedded the Internet Memes timeline below to give you a feel of what this looks like. Try clicking on any of the events that include a little film icon at the bottom edge and see how you can view the video right in place:

Image Credit:  I found and ‘borrowed’ the Dipity image above from Dipity’s About page.

THATCamp 2008: Day 1 Dork Short Lightning Talks

During lunch on the first day of THATCamp, people volunteered to give lightning talks they called ‘Dork Shorts’. As we ate our lunch, a steady stream of folks paraded up to the podium and gave an elevator-pitch-length demo. These are the projects about which I managed to type URLs and some other info into my laptop. If you are looking for examples of inspirational and innovative work at the intersection of technology and the humanities – these are a great place to start!

Have more links to projects I missed including? Please add them in the comments below.

Image credit: Lightning by thenss (Christopher Cacho) via flickr

THATCamp 2008: Text Mining and the Persian Carpet Effect

I attended a THATCamp session on Text Mining. There were between 15 and 20 people in attendance. I have done my best to attribute ideas to their originators wherever possible – but please forgive the fact that I did not catch the names of everyone who was part of this session.

What Is Text Mining?

Text mining is an umbrella phrase that covers many different techniques and types of tools.

The CHNM NEH-funded text mining initiative defined text mining as needing to support these three research functions:

  • Locating or finding: improving on search
  • Extraction: once you find a set of interesting documents, how do you extract information in new (and hopefully faster) ways? How do you pull data from unstructured bulk into structured sets?
  • Analysis: support analyzing the data, discovery of patterns, answering questions
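As a toy illustration of the ‘extraction’ function (my own sketch, not something presented in the session), pulling year-and-sentence pairs out of raw prose is one simple way to turn unstructured text into a structured set:

```python
import re

def extract_events(text):
    """Pull (year, sentence) pairs out of raw prose -- a tiny example
    of extracting structured records from unstructured text."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Match four-digit years from 1500-2099.
        for year in re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", sentence):
            events.append({"year": int(year), "text": sentence.strip()})
    return events

sample = ("The mill opened in 1847. It burned down in 1902 "
          "and was rebuilt the next year.")
events = extract_events(sample)
```

Records like these can then feed directly into the ‘analysis’ step – plotted on a timeline, aggregated by decade, and so on.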

The group discussed that there were both macro and micro aspects to text mining. Sometimes you are trying to explore a collection. Sometimes you are trying to examine a single document in great detail. Still other situations call for using text mining to generate automated classification of content using established vocabularies. Different kinds of tools will be important during different phases of research.

Projects, Tools, Examples & Cool Ideas

Andrea Eastman-Mullins, from Alexander Street Press, mentioned the University of Chicago’s ARTFL Project and these two tools:

  • PhiloLogic: An XML/SGML based full-text search, retrieval and analysis tool
  • PhiloMine: an extension being developed for PhiloLogic to provide support for “a variety of machine learning, text mining, and document clustering tasks”.

Dan Cohen directed us to his post about Mapping What Americans Did on September 11 and to Twistori which text mines Twitter.

Other Projects & Examples:

Some neat ideas that were mentioned for ways text mining could be used (lots of other great ideas were discussed – these are the two that made it into my notes):

  • Train a tool with collections of content from individual time periods, then use the tool to help identify the originating time period of new documents. The same setup could also be used to identify shifts in patterns in text by comparing large data sets from specific date ranges.
  • If you have a tool that has learned how to classify certain types of content well… then watch for when it breaks – this can give you interesting trails to things to investigate.
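The first idea above can be sketched with a tiny multinomial naive Bayes classifier. This is my own illustration, not a tool from the session, and the training snippets are invented:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class PeriodClassifier:
    """Multinomial naive Bayes over word counts: train on documents
    labeled with a time period, then guess the period of new text."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # period -> word counts
        self.totals = Counter()              # period -> total words

    def train(self, period, text):
        words = tokenize(text)
        self.counts[period].update(words)
        self.totals[period] += len(words)

    def classify(self, text):
        vocab = {w for c in self.counts.values() for w in c}
        best, best_score = None, -math.inf
        for period in self.counts:
            score = 0.0
            for w in tokenize(text):
                # Laplace smoothing so unseen words don't zero out a period.
                p = (self.counts[period][w] + 1) / (self.totals[period] + len(vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = period, score
        return best

clf = PeriodClassifier()
clf.train("1860s", "the regiment marched to the telegraph office")
clf.train("1990s", "the modem dialed the bulletin board system")
guess = clf.classify("a telegraph message for the regiment")
```

The second idea – watching for when a well-trained classifier breaks – falls out of the same setup: documents where the score gap between periods collapses are exactly the interesting trails to investigate.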

Barriers to Text Mining

All of the following were touched upon as being barriers or challenges to text mining:

  • access to raw text in gated collections (i.e., collections which require payment for access to resources), such as JSTOR, Project MUSE and others.
  • tools that are too difficult for non-programmers to use
  • questions relating to the validity of text mining as a technique for drawing legitimate conclusions

Next Steps

These ideas were put forward as important for moving the field of text mining in the humanities forward:

  • develop and share best practices for use when cultural heritage institutions make digitization and transcription deals with corporate entities
  • create frameworks that enable individuals to reproduce the work of others and provide transparency into the assumptions behind the research
  • create tools and techniques that smooth the path from digitization to transcription
  • develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers

My thoughts
During the session I drew a parallel to archaeology, where one can glean information from the air that cannot be seen on the ground. I discovered this phenomenon has a name:

“Archaeologists call it the Persian carpet effect. Imagine you’re a mouse running across an elaborately decorated rug. The ground would merely be a blur of shapes and colors. You could spend your life going back and forth, studying an inch at a time, and never see the patterns. Like a mouse on a carpet, an archaeologist painstakingly excavating a site might easily miss the whole for the parts.” from Airborne Archaeology, Smithsonian magazine, December 2005 (emphasis mine)

While I don’t see any coffee table books in the near future of text mining (such as The Past from Above: Aerial Photographs of Archaeological Sites), I do think that this idea captures the promise that we have before us in the form of the text mining tools. Everyone in our session seemed to agree that these tools will empower people to do things that no individual could have done in a lifetime by hand. The digital world is producing terabytes of text. We will need text mining tools just to find our way in this blizzard of content. It is all well and good to know that each snowflake is unique – but tell that to the 21st century historian soon to be buried under the weight of blogs, tweets, wikis and all other manner of web content.

Image credit: Drift of Harrachov Mine by alarch via flickr

As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.