

Clustering Data: Generating Organization from the Ground Up

[Image: Flickr water tag clusters]

My trip to the 2008 Information Architecture Summit (IA Summit) down in Miami has me thinking a lot about helping people find information. In this post I am going to examine clustering data.

Flickr Tag Clusters
Tag clusters are not new on Flickr – they were announced way back in August of 2005. The best way to understand tag clusters is to look at a few. Some of my favorites are the water clusters (shown in the image above). From this page you can view the reflection/nature/green cluster, the sky/lake/river cluster, the blue/beach/sun cluster or the sea/sand/waves cluster.

So what is going on here? Basically Flickr is analyzing groupings of tags assigned to Flickr images and identifying common clusters of tags. In our water example above – they found four different sets of tags that occurred together and distinctly apart from other sets of tags. The proof is in the pudding – the groupings make sense. They get at very subtle differences even though the mass of data being analyzed is from many different individuals with many different perspectives.
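Flickr has never published the details of its clustering algorithm, but the general idea can be sketched in a few lines of Python: count which tags show up together on the same photos, keep only the strong pairings, and treat each connected group of tags as a cluster. The photo data below is invented for illustration only.

```python
# A minimal sketch (not Flickr's actual algorithm) of clustering the tags
# that co-occur with a seed tag, using only the standard library.
from collections import Counter
from itertools import combinations

# Made-up photos, each represented only by its set of tags.
photos = [
    {"water", "reflection", "nature", "green"},
    {"water", "reflection", "green", "trees"},
    {"water", "sky", "lake", "river"},
    {"water", "sky", "lake", "clouds"},
    {"water", "blue", "beach", "sun"},
    {"water", "blue", "sun", "ocean"},
    {"water", "sea", "sand", "waves"},
    {"water", "sea", "waves", "surf"},
]

SEED = "water"
MIN_COOCCURRENCE = 2  # ignore weak tag pairings

# Count how often each pair of non-seed tags appears together.
pair_counts = Counter()
for tags in photos:
    pair_counts.update(combinations(sorted(tags - {SEED}), 2))

# Build an adjacency list from the strong pairs, then pull out the
# connected components -- each component is one "cluster" of tags.
graph = {}
for (a, b), n in pair_counts.items():
    if n >= MIN_COOCCURRENCE:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

seen, clusters = set(), []
for start in graph:
    if start in seen:
        continue
    stack, component = [start], set()
    while stack:
        tag = stack.pop()
        if tag in component:
            continue
        component.add(tag)
        stack.extend(graph[tag] - component)
    seen |= component
    clusters.append(sorted(component))

# On the toy data this prints four small groups that roughly mirror the
# four Flickr water clusters described above.
for cluster in clusters:
    print(cluster)
```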

Tag clusters are very powerful and quite different from tag clouds. Tag clouds, by their nature, are a blunt instrument. They only show you the most popular tags. Take a look at the tag cloud for the Library of Congress photostream on Flickr. I do learn something from this. I get a sense of the broad brush topics, time periods and locations. But if you look at the full list of Library of Congress Flickr tags you see what a small percentage the top 150 really are (and yes.. that page does take a while to load). Who else is now itching to ask Flickr to generate clusters within the LOC tag set?
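For contrast, a tag cloud takes almost no analysis at all – it is essentially a frequency count with everything below the top N thrown away, which is exactly why the long tail disappears. A toy sketch (the tag list is made up):

```python
from collections import Counter

# Hypothetical tag list -- the real LOC photostream has thousands of distinct tags.
all_tags = ["1910s", "bain", "baseball", "1910s", "glassnegative", "bain", "1910s"]

# A tag cloud is just the most-used tags; everything below the cutoff falls off the page.
cloud = Counter(all_tags).most_common(150)
print(cloud)  # [('1910s', 3), ('bain', 2), ('baseball', 1), ('glassnegative', 1)]
```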

Steve.Museum
Another example of cultural heritage images being tagged is the Steve Museum Art Museum Social Tagging Project which lets individuals tag objects from museums via Steve Tagger. It resembles the Library of Congress on Flickr project in that it includes existing metadata with each image and permits users to add any tags they deem appropriate. I think it would be fascinating to contrast the traffic of image taggers on Steve.Museum vs Flickr for a common set of images. Is it better to build a custom interface that users must seek out but where you have complete control over the user experience and collected data? Or is it better to put images in the already existing path of users familiar with tagging images? I have no answers of course. All I know is I wish I could see the tag clusters one could generate off the Steve.Museum tag database. Perhaps someday we will!

Del.icio.us Tags
[Image: del.icio.us related tags]

Del.icio.us, a web service for storing and tagging your bookmarks online, supports what they call ‘related tags’ and ‘tag bundles’. If you view the page for the tag ‘archives’ – you will see to the far right a list of related tags like those shown in the image here. What is interesting is that if I look at my own personal tag page for archives I see a much longer list of related tags (big surprise that I have a lot of links tagged archives!) and I am given the option of selecting additional tags to filter my list of links via a combination of tags.

Del.icio.us’s ‘tag bundles’ let me create my own named groupings of tags – but I must assemble these groups manually rather than have them generated or suggested. On the plus side, Del.icio.us is very open about publishing its data via APIs and therefore supporting third party tools. I think my favorite off that list for now has to be MySQLicious which mirrors your del.icio.us bookmarks into a MySQL database. Once those tags are in a database, all I need are the right queries to generate the clusters I want to see.
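To sketch what I mean – assuming a simplified table of (url, tag) rows rather than whatever schema MySQLicious actually creates, and using sqlite3 here so the example runs anywhere – a single self-join surfaces related tags, and the same join grouped by pairs of tags is the raw material for generating clusters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookmark_tags (url TEXT, tag TEXT);
    INSERT INTO bookmark_tags VALUES
        ('http://example.org/ead',  'archives'),
        ('http://example.org/ead',  'ead'),
        ('http://example.org/ead',  'xml'),
        ('http://example.org/nara', 'archives'),
        ('http://example.org/nara', 'government'),
        ('http://example.org/nara', 'digitization'),
        ('http://example.org/loc',  'archives'),
        ('http://example.org/loc',  'digitization');
""")

# Which tags co-occur most often with 'archives'?
related = conn.execute("""
    SELECT b.tag, COUNT(*) AS together
    FROM bookmark_tags a
    JOIN bookmark_tags b ON a.url = b.url AND b.tag <> a.tag
    WHERE a.tag = 'archives'
    GROUP BY b.tag
    ORDER BY together DESC
""").fetchall()

print(related)  # [('digitization', 2), ('ead', 1), ...]
```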

Clusty: Clustered Search Results
[Image: Clusty clusters screen shot]

An example of what this might look like for search results can be seen via the search engine Clusty.com from the folks over at Vivisimo. For example – try a search on the term archives. This is one of those search terms for which general web searching is usually just infuriating. Clusty starts us with the same top 2 results as a search for archives on Google does, but it also gives us a list of clusters on the left sidebar. You can click on any of those clusters to filter the search results.

Those groups don’t look good to you? Click the ‘remix’ link in the upper right hand corner of the cluster list and you get a new list of clusters. In a blog post titled Introducing Clustering 2.0 Vivisimo CEO Raul Valdes-Perez explains what happens when you click remix:

With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.

I played for a while.. clicking remix over and over. It was as if it was slicing and dicing the facets for me – picking new common threads to highlight. I liked that I wasn’t stuck with what someone else thought was the right way to group things. It gave me the control to explore other groupings.
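Vivisimo’s clustering engine is proprietary, so the sketch below is only a toy illustration of the remix idea itself: derive topic labels from the result set, then derive them again while ignoring the labels the user has already seen. The result titles are invented.

```python
from collections import Counter

results = [
    "National Archives and Records Administration",
    "Internet Archive Wayback Machine",
    "Web archive of historical newspapers",
    "State archives genealogy records",
    "Email archive search tools",
    "Genealogy records in county archives",
]

def top_topics(docs, already_seen, how_many=3):
    """Pick the most common longer words as crude topic labels."""
    words = Counter()
    for doc in docs:
        for word in doc.lower().split():
            if len(word) > 4 and word not in already_seen and word != "archives":
                words[word] += 1
    return [word for word, _ in words.most_common(how_many)]

first_pass = top_topics(results, already_seen=set())
remix = top_topics(results, already_seen=set(first_pass))  # the "remix" click
print(first_pass, remix)  # the second list surfaces topics that missed the first cut
```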

Ontology is Overrated
Clay Shirky’s talk Ontology is Overrated: Categories, Links and Tags from the spring of 2005 ties a lot of these ideas together in a way that makes a lot of sense to me. I highly recommend you go read it through – but I am going to give away the conclusion here:

It’s all dependent on human context. This is what we’re starting to see with del.icio.us, with Flickr, with systems that are allowing for and aggregating tags. The signal benefit of these systems is that they don’t recreate the structured, hierarchical categorization so often forced onto us by our physical systems. Instead, we’re dealing with a significant break — by letting users tag URLs and then aggregating those tags, we’re going to be able to build alternate organizational systems, systems that, like the Web itself, do a better job of letting individuals create value for one another, often without realizing it.

I currently spend my days working with controlled vocabularies for websites, so please don’t think I am suggesting we throw it all away. And yes, you do need a lot of information to reach the critical mass needed to support the generation of useful clusters. But there is something here that can have a real and positive impact on users of cultural heritage materials actually finding and exploring information. We can’t know how everyone will approach our records. We can’t know what aspects of them they will find interesting.

There Is No Box
Archivists already know that much of the value of records is in the picture they paint as a group. A group of records shares a context and gives the individual records meaning. Librarians and catalogers have long lived in a world of shelves. A book must be assigned a single physical location. Much has been made (both in the Clay Shirky talk and elsewhere) of the fact that on the web there is no shelf.

What if we take the analogy a step further and say that for an online archives there is no box? Of course, just as with books, we still need our metadata telling us who created this record originally (and when and why and which record comes before it and after it) – but picture a world where a single record can be virtually grouped many times over. Computer programs are only going to get better at generating clusters, be they of user assigned tags or search results or other metadata. From where I sit, the opportunity for leveraging clustering to do interesting things with archival records seems very high indeed.

Of Pirates, Treasure Chests and Keys: Improving Access to Digitized Materials

Dan Cohen posted yesterday about what he calls The Pirate Problem. Basically the Pirate Problem can be summed up as “there are ways of acting and thinking that we can’t understand or anticipate.” Why is that a ‘Pirate Problem’? Because a pirate pub opened near his home and rather than folding shortly thereafter due to lack of interest from the ‘very serious professionals’ who populate DC suburbs – the pub was a rousing success due to the pirate aficionados who came out of the woodwork to sing sea shanties and drink grog. This surprising turn of events highlighted for him the fact that there are many ways of acting and thinking (some people even know all the words to sea shanties without needing sheet music).

Dan recently delivered the keynote speech at a workshop at the University of North Carolina at Chapel Hill. The workshop brought together dozens of historians to talk about how the 16 million archival documents of the Southern Historical Collection (SHC) should be put online. He devoted his keynote “to prodding the attendees into recognizing that the future of archives and research might not be like the past” and goes on in his post to explain:

The most memorable response from the audience was from an award-winning historian I know from my graduate school years, who said that during my talk she felt like “a crab being lowered into the warm water of the pot.” Behind the humor was the difficult fact that I was saying that her way of approaching an archive and understanding the past was about to be replaced by techniques that were new, unknown, and slightly scary.

This resistance to thinking in new ways about digital archives and research was reflected in the pre-workshop survey of historians. Extremely tellingly, the historians surveyed wanted the online version of the SHC to be simply a digital reproduction of the physical SHC.

Much of the stress of Dan’s post is on fear of new techniques of analysis. The choppy waters of text mining and pattern recognition threaten to wash away traditional methods of actually reading individual pages and “most historians just want to do their research the way they’ve always done it, by taking one letter out of the box at a time”.

I certainly like the idea of new technologically based ways of analyzing large sets of cultural heritage materials, but I also believe that reading individual letters will always be important. The trick is finding the right letter!

And of course – we still need the context. It isn’t as if, when we digitize major collections like the SHC, we are going to scan and OCR each page without regard to which box it came out of. We can’t slice and dice archival records and manuscripts into their component parts to feed into text analysis with no way back to the originals.

I like to imagine the combination of all the new technology (be it digitization, cross collection searching, text mining or pattern recognition) as creating keys to different treasure chests. Humanities scholars are treasure hunters. Some will find their gems through careful reading of individual passages. Others will discover patterns spread across materials now co-existing virtually that before digitization would have been widely separated by space and time. Both methods will benefit from the digitization of materials and the creation of innovative search and text analysis tools. Both still require an understanding of a material’s origin. The importance of context isn’t going anywhere – we still need to know which box the letter came from (and in a perfect world, which page came before and which came after). I want scholars to still be able to read one page from the box – I just want them to be able to do it from home in the middle of the night if they are so inclined, with their travel budget no worse for wear.

Dan ties his post together by pointing out that:

… in Chapel Hill I was the pirate with the strange garb and ways of behaving, and this is a good lesson for all boosters of digital methods within the humanities. We need to recognize that the digital humanities represent a scary, rule-breaking, swashbuckling movement for many historians and other scholars.

In my opinion, the core message should be that we just found more locked treasure chests – and for those who are interested, we have some new keys that just might open those locks. I enjoyed the Pirate metaphor (obviously) and I appreciate that there are real issues here relating to strong discomfort with the fast changing landscape of technology, but I have to believe that if we do something that prevents historians from being able to read one letter at a time we are abandoning the treasure chests that are already open for the new ones for which we haven’t yet found the right keys. I am greedy. I want all the treasure!

Image credit: key to anything by Stoker Studios via flickr

Using WWI Draft Registration Cards for Research: NARA Records Provide Crucial Data

[Image: NARA World War I photograph, 1918 (ARC Identifier: 285374)]

In the HealthDay article Having Lots of Kids Helps Dads Live to 100, a recent study was described that examined what increased the chances of a man living past 100.

A young, trim farmer with four or more children: According to a new study, that’s the ideal profile for American men hoping to reach 100 years of age. The research, based largely on data from World War I draft cards, suggests that keeping off excess weight in youth, farming and fathering a large number of offspring all help men live past a century.

The article mentions that this research was “spurred by the fact that a treasure trove of information about 20th-century American males has now been put online”. The study was based out of the University of Chicago’s Center on Aging. The paper, New Findings on Human Longevity Predictors, includes the following reference:

Banks, R. (2000). World War I Civilian Draft Registrations. [database on-line]. Provo, UT, Ancestry.com.

With an account on Ancestry.com, you too could examine the online database of World War I Draft Registration Cards. This Ancestry.com page notes the source of the original data as:

United States, Selective Service System. World War I Selective Service System Draft Registration Cards, 1917-1918. Washington, D.C.: National Archives and Records Administration. M1509, 4,582 rolls

NARA’s page for the World War I Selective Service System Draft Registration Cards, M1509 includes similar background information to what can be found on the Ancestry.com page, but of course – no access to the actual records.

It is frustrating to see a study based on archival records making the news without it being clear to the reader that archival records were the source for the research. As I discussed at length in my post Epidemiological Research and Archival Records: Source of Records Used for Research Fails to Make the News, I feel that it is very important to take every opportunity to help the general public understand how archival records are supporting research that impacts our understanding of the world around us. I appreciate that partnering with 3rd parties to get government records digitized is often the only option – but I want people to be clear about why those records still exist in the first place.

Photo Credit: US. National Archives, World War I Photographs, 1918. Army photographs. Battle of St. Mihiel-American Engineers returning from the front; tank going over the top; group photo of the 129th Machine gun Battalion, 35th Division before leaving for the front; views of headquarters of the 89th Division next to destroyed bridge; Company E, 314th Engineers, 89th Division, and making rolling barbed wire entanglements. NAIL Control Number: NRE-75-HAS(PHO)-65

SAA2007: Archives and E-Commerce, Three Case Studies (Session 404)

Diane Kaplan, of Yale University Library’s Manuscripts and Archives unit, started off Session 404 (officially titled Exploring the Headwaters of the Revenue Stream) by thanking everyone for showing up for the last session of the day. This was a one hour session that examined ways to generate new funds through e-commerce. Three different e-commerce case studies were presented, followed by a short question and answer period.

University of Wyoming’s American Heritage Center

Mark Shelstad‘s presentation, “Show Me the Money: Or: How Do We Pay for This?”, detailed the approach taken by the University of Wyoming‘s American Heritage Center (AHC) to find alternate revenue streams. After completing a digitization project in the fall of 2004, the AHC had to figure out how to continue their project after their original grant money ran out.

Since they didn’t have a lot of in-house resources, they chose Zazzle.com for their effort to profit from their existing high resolution images. They can earn up to 17% from the sales through a combination of affiliate sales and profits from the sale of products featuring American Heritage Center images.

They had a lot of good reasons for choosing Zazzle.com. Zazzle.com already had an existing ‘special collections’ area, meaning that their images would have a better chance of being found by those interested in their offerings (for example – take a look at the Library of Congress Vintage Photos store). Zazzle.com also did not require an exclusive license to the images. The American Heritage Center Zazzle on-line store opened in 2005.

Currently they are making about $30 a month in royalties from 200 images. Mark pointed out that everyone needs to keep in mind that the major photo provider, Corbis, has yet to turn a profit in online photo sales. He also mentioned a website called Cogteeth.com that lets you click on any image and use those images on t-shirts, mugs.. etc.

Near the end of his talk, Mark shared an amazing idea to create a non-profit that would be a joint organization for featuring and selling products using archival images. I love it! It is easy to see that many archives are small and don’t have the infrastructure to create and run their own e-commerce websites. At the same time, general sites that let anyone set up a store to sell items with custom images on them threaten to lose the special nature of historical images in the shuffle. Even the special collections section of Zazzle lumps the American Heritage Center and the Library of Congress collections with Disney and Star Wars. I would love to see this idea grow!

Minnesota Historical Society

Kathryn Otto of the Minnesota Historical Society (MHS) spoke next. She first gave an overview of traditional services provided by MHS for a fee, such as photocopies, reader-printer copies, microfilm sales, media sales, inter-library loan fees, classes and photograph sales. MHS also earned income via standard use fees and research services.

The first e-commerce initiative at MHS was the sale of Minnesota State Death Certificates from 1904 to 2001. Made available via the Minnesota Death Certificate Index, it provides the same data as Ancestry.com, but with a better search interface. They have had users tell them that they couldn’t find something on Ancestry.com – but that they were able to find what they needed on the MHS site.

To their existing Visual Resources Database, MHS also added a buy button for most images. Extra steps were added into the standard buy process to deal with the addition of a use fee depending on how the purchaser claims the image will ultimately be used. One approach that did not work for them was to offer expensively printed pre-selected images. The historical society sells classes online and can handle member vs non-member rates. The Veterans Graves Registration Index is a tiny database that was created by reusing the interface used for the death certificates.

The Birth Certificate Index provides “single, non-certified copies of individual birth certificates reproduced from the originals” via the website.. while “[o]fficial, certified copies of these birth certificates are available through the Minnesota Department of Health.” The MHS site provides much faster and easier service than the Department of Health as can be seen from this page detailing how to order a non-certified copy of a birth record from the DOH – which requires printing, filling out and either faxing or snail mailing a form.

Features to keep in mind as you branch into e-commerce:

  • Statistics – Consider the types of statistics you want. Their system just gave them info about orders – not how much they made.
  • Sales tax – Figure out how it is handled
  • Postage/Handling fees – Look at the details! The MHS Library-Archives was stuck with the Museum Store’s postage rates because the e-commerce system could not handle different fees for different types of objects.
  • Can’t afford credit card fees? Consider PayPal.
  • Advertise what you are selling on your own website.

Godfrey Memorial Library, Middletown, CT

The final panelist was Richard Black, Director of the Godfrey Memorial Library in Middletown, Connecticut. The Godfrey is a small, non-profit, genealogical research library with approximately 120,000 genealogical items. They currently have 5 full time staff and 60 volunteers.

About 3 years ago they had exhausted all of their endowment money and faced the strong possibility of closing the doors. They were down to one full time librarian and a few volunteers and were dependent mostly on donations and some minor income from other sources/services.

They had only a few options open to them:

  • find more money from other sources
  • merge with another library
  • close the doors
  • sell some of the content
  • others??

The first approach to raise funds was to create a subscription website. The Godfrey acquired Heritage Quest census records and added other databases as resources allowed. Subscriptions were sold for $35 a year. The board thought they might be lucky to get 100 subscriptions.. but they actually got approximately 14,000!

Now the portal provides access to sites for which a premium has been paid (so that subscribers don’t have to pay), sites that are available free on the Internet (but made easier to find) and sites unique to Godfrey, including digitized material in the library and other material that has been made available to them. They just added 95,000 Jewish grave-sites – brought to them by a local rabbi. Another recent addition was a set of transcriptions of a grave-site made as an Eagle Scout project. They also negotiated to have their books digitized for them for free. The company performing the digitization will pay a royalty to Godfrey as the books are used.

The costs to acquire data for the portal include $60,000 a year for access to premium sites, the cost to digitize and transcribe unique content (there are opportunities to partner and reduce costs) and the cost to acquire patrons. The efforts of the Godfrey staff and volunteers are ‘free’ – but cost time.

The Godfrey subsequently lost access to the Heritage Quest material. This was like taking the anchor store out of the corner of a mall. It forced them to diversify their revenue streams and watch for new opportunities.

Current revenue source distribution:

  • online portal 45%
  • annual appeal 10%
  • patron requests 5%
  • contract services 35% (OCLC analytical cataloging that they do)
  • misc 5%

The endowment funds have been restored and the Godfrey’s staff is now growing again.

Questions

Question: Did you meet resistance in your institutions?
Answer: No.. Minnesota said they had such success that the 2 questions they hear now are A) What do we put online next? B) How long can they protect their income from the rest of the institution?

Question: (From someone from a NJ archives) Is there a way to do e-commerce with government records and not have the money ‘stolen’ from them?
Answer: Minnesota – The Department of Health was happy for the death and birth certificate business to go away. They do worry about the future when they might try to make a marriage index – because that territory is already ‘owned’ by a group that wants to keep that income.

Question: When you charge for use fees – are there people who don’t pay them?
Answer: Minnesota: Probably – no way to really know.
Mark (American Heritage Center): Our images are public domain – they can do what they like with them.

Question: Do you brand your images?
Answer: Mark: Yes.. a logo and URL goes with the images.

My Thoughts

I was particularly impressed by how much information was conveyed in the course of the 1 hour session. My personal highlights were:

  • As I mentioned above, I want Mark’s idea for a non-profit to sell co-located products based on archival images to gain support and momentum.
  • I was pleased by the point that the MHS makes money from their Minnesota Death Certificate Index partly due to their improved and powerful search interface. The data is available elsewhere – but they made it easier to find information, so they will become the destination of choice for that information.
  • The Godfrey’s story is inspirational. In an age when we hear more and more often about archives and libraries being forced to cut back services due to funding shortfalls, it is great to hear about a small archives that pulled itself back from the brink of disaster by brave experimentation.

These three case studies gave a great glimpse of some of the ways that archives can get on the e-commerce bandwagon. There is no magic here – just the willingness to dig in, figure out what can be done and try it. That said – there is definitely lots of room to learn from others’ successes and mistakes. The more real world success and failure stories archives share with the archival community about how to ‘do’ e-commerce, the easier it will be for each subsequent project to be a success.

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Blog Action Day: A Look At Earth Day as Archived Online

In honor of this year’s Blog Action Day theme of discussing the environment, I decided to see what records the Internet had available about the history of Earth Day.

I started by simply Googling Earth Day. In a new browser window I opened the Internet Archive’s Wayback Machine. These were to be my two main avenues for unearthing the way that Earth Day was represented on the internet over the years.

Wikipedia’s first version of an Earth Day page was created on December 16th, 2002. This is the current Earth Day page as of the creation of this post – last updated about a week ago.

The current home page for the Earthday Network appears identical to the most recent version stored in the Wayback Machine, dated June 29, 2007 – until you notice that the featured headline on the link to http://www.earthdaynetwork.tv is different.

The site that claims to be ‘The Official Site of International Earth Day’ is EarthSite.org. The oldest version from the Wayback Machine is from December of 1996. This version shows a web visitor counter perpetually set to 1,671. Earth Day ten years ago was scheduled for March 20th, 1997. If you scroll down a bit on the What’s New page you can read the 1997 State of the World Message By John McConnell (attributed as the founder of Earth Day).

The U.S. Government portal for Earth Day was first archived in the Internet Archive on April 6, 2003. The site, EarthDay.gov, hasn’t changed much in the past 4 years. The EPA has an Earth Day page of its own, which was first archived in early 1999. There is no clear way to know whether that means the EPA’s Earth Day page is actually older or if it was just found earlier by the Internet Archive’s ambitious web crawlers.

Envirolink.org, with the tagline “The Online Environmental Community”, was first archived back in 1996. As you can see on the Wayback Machine page for Envirolink.org, it has a fairly full ten years’ worth of web page archiving.

Next I wanted to explore what the world of government records might produce on the subject. A quick stop over at Footnote.com to search for “Earth Day” didn’t yield a terribly promising list of results (no surprise there – most of their records date to before the time period we are looking for). Next I tried searching in the Archival Research Catalog (ARC) over on the U.S. National Archives website. I got 15 hits – all fairly interesting looking… but none of them linked to digitized content. A search in the Access to Archival Databases (AAD) system found 2 hits – one to some sort of contract between the EPA and a Fairfax, Virginia company named EARTH DAY XXV from 1995 and the other a State Department telegram including this passage:

THIS NATION IS COMMITTED TO STRIVING FOR AN ENVIRONMENT THAT NOT ONLY SUSTAINS LIFE, BUT ALSO ENRICHES THE LIVES OF PEOPLE EVERYWHERE – – HARMONIZING THE WORKS OF MAN AND NATURE. THIS COMMITMENT HAS RECENTLY BEEN REINFORCED BY MY PROCLAMATION, PURSUANT TO A JOINT RESOLUTION OF THE CONGRESS, DESIGNATING MARCH 21, 1975 AS EARTH DAY, AND ASKING THAT SPECIAL ATTENTION BE GIVEN TO EDUCATIONAL EFFORTS DIRECTED TOWARD PROTECTING AND ENHANCING OUR LIFE-GIVING ENVIRONMENT.

I also thought to check the Government Printing Office’s (GPO) website for the Public Papers of the Presidents of the United States. Currently it only permits searching back through 1991 online – but my search for “Earth Day” did bring back 50 speeches, proclamations and other writings by the various presidents.

Frustrated by the total scattering of documents without any big picture, I headed back to Google – this time to search the Google News Archive for articles including “Earth Day” published before 1990. The timeline display showed me articles mostly from TIME, the Washington Post and the New York Times – some of which indicated I would need to pay in order to read them.

Back again to do one more regular Google search – this time for earth day archive. This yielded an assortment of hits – and just above the fold I found my favorite snapshot of Earth Day history. The TIME Earth Day Archive Collection is a selection of the best covers, quotes and articles about Earth Day – from February 2, 1970 to the present. This is the gold mine for getting perspective on Earth Day as it has been perceived and celebrated in the United States. The covers are brilliant! If I had started this post early enough, I would have requested permission to include some here.

With the passionate title Fighting to Save the Earth from Man, the first article in the TIME Earth Day Collection begins by quoting then President Nixon’s first State of the Union Address:

The great question of the seventies is, shall we surrender to our surroundings, or shall we make our peace with nature and begin to make reparations for the damage we have done to our air, to our land, and to our water?

Fast forward to the recent awarding of the 2007 Nobel Peace Prize to the Intergovernmental Panel on Climate Change (IPCC) and Al Gore, and I have to imagine that the answer to that question asked so long ago – were we ready to make peace with nature? – was ‘Not Yet’.

Overall, this was an interesting experiment. The hunt for ‘old’ (such as it is in the fast moving world of the Internet) data about a topic online is a strange and frustrating experience. Even with the Wayback Machine, I often found myself with only part of the picture. Often the pages I tried to view were missing images or other key elements. Sometimes I found a link to something tantalizing, only to realize that the target page was not archived (or was so broken as to be of no use). The search through government records and old newspaper stories did produce some interesting results – but again seemed to fail to produce any sense of the big picture of Earth Day over the years.

The TIME Collection about Earth Day was assembled by humans and arranged nicely for examination by those interested in the subject. It is properly named a ‘collection’ (in the archival sense) because it is not the pure output of activities surrounding Earth Day, but rather a selected snapshot of related articles and images that share a common topic. That said, it is my fervent hope that websites such as these appear more and more. I suspect that the lure of attracting more readers to their websites with existing content will only encourage more content creators with a long history to join in the fun. If others do it as well as TIME seems to have in this case, it will be a win/win situation for everyone.

Visualizing Archival Collections

As I mentioned earlier, I am taking an Information Visualization class this term. For our final class project I managed to inspire two other classmates to join me in creating a visualization tool based on the structured data found in the XML version of EAD finding aids.

We started with the XML of the EAD finding aids from University of Maryland’s ArchivesUM and the Library of Congress Finding Aids. My teammates have written a parser that extracts various things from the XML such as title, collection size, inclusive dates and subjects. Our goal is to create an innovative way to improve the exploration and understanding of archival collections using an interactive visualization.

Our main targets right now are to use a combination of subjects, years and collection size to give users a better impression of the quantity of archival materials that fit various search criteria. I am a bit obsessed with using the collection size as a metric for helping users understand the quantity of materials. If you do a search for a book in a library’s catalog – getting 20 hits usually means that you are considering 20 books. If you consider archival collections – 20 hits could mean 20 linear feet (20 collections each of which is 1 linear foot in size) or it could mean 2000 linear feet (20 collections each of which is 100 linear feet in size). Understanding this difference is something that visualization can help us with. Rather than communicating only the number of results – the visualization will communicate the total size of the collections assigned each of the various subjects.
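As a much-simplified sketch of that aggregation step – assuming stripped-down EAD with no namespaces and tidy extent statements, which real ArchivesUM and Library of Congress finding aids certainly are not – the idea is just to parse each finding aid and sum linear feet per subject:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# A deliberately minimal, made-up finding aid fragment.
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <unittitle>Jane Doe Papers</unittitle>
      <physdesc><extent>12.5 linear feet</extent></physdesc>
    </did>
    <controlaccess>
      <subject>Maryland</subject>
      <subject>Agriculture</subject>
    </controlaccess>
  </archdesc>
</ead>
"""

def extent_in_linear_feet(text):
    """Pull the first number out of an extent statement like '12.5 linear feet'."""
    match = re.search(r"\d+(\.\d+)?", text or "")
    return float(match.group()) if match else 0.0

def feet_per_subject(ead_documents):
    """Total collection size (in linear feet) for each assigned subject."""
    totals = defaultdict(float)
    for xml_text in ead_documents:
        root = ET.fromstring(xml_text)
        extent = extent_in_linear_feet(root.findtext(".//physdesc/extent"))
        for subject in root.findall(".//controlaccess/subject"):
            totals[subject.text] += extent
    return dict(totals)

print(feet_per_subject([SAMPLE_EAD]))  # {'Maryland': 12.5, 'Agriculture': 12.5}
```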

I have uploaded two preliminary screen mockups (one here and the second here) trying to get at my ideas for how this might work.

Not reflected in the mock-ups is what could happen when a user clicks on the ‘related subject’ bars. Depending on where they click – one of two things could happen. If they click on the ‘related subject’ bar WITHIN the boundaries of the selected subject (in the case above, that would mean within the ‘Maryland’ box), then the search would filter further to only show those collections that have both the ‘Maryland’ subject and the newly added subject. The ‘related subjects’ list and displayed year distribution would change accordingly as well. If, instead, the user clicks on a ‘related subject’ bar OUTSIDE the boundary of the selected subject — then that subject would become the new (and only) selected subject and the displayed collections, related subjects and years would change accordingly.
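In code, the two click behaviors boil down to simple set logic. The collection-to-subject assignments below are hypothetical:

```python
# Hypothetical collections mapped to their assigned subjects.
collections = {
    "Tobacco Farming Records": {"Maryland", "Agriculture"},
    "Chesapeake Bay Photographs": {"Maryland", "Photographs"},
    "Midwest Grange Papers": {"Agriculture"},
}

def refine(selected, clicked):
    """Click INSIDE the selected subject's box: require both subjects."""
    return {title for title, subjects in collections.items()
            if selected in subjects and clicked in subjects}

def switch(clicked):
    """Click OUTSIDE the box: the clicked subject becomes the only filter."""
    return {title for title, subjects in collections.items() if clicked in subjects}

print(refine("Maryland", "Agriculture"))  # {'Tobacco Farming Records'}
print(switch("Agriculture"))              # both agriculture collections
```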

So that is what we have so far. If you want to keep an eye on our progress, our team has a page up on our class wiki about this project. I have a ton of ideas of other things I would love to add to this (my favorite being a map of the world with indications of where the largest amount of archival materials can be found based on a keyword or subject search) – but we have to keep our feet on the ground long enough to actually build something for our class project. This is probably a good thing. Smaller goals make for a greater chance of success.

Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of four numbers between 0 and 255 and look something like 192.39.228.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).
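Google has not said exactly how the redaction will work, but the effect described in the AP article amounts to something like zeroing out the final octet, so that a log entry points at a block of 256 addresses rather than a single machine. A minimal sketch:

```python
# Illustration only -- not Google's actual redaction process.
def redact_ip(ip_address):
    """Zero the last octet so the address identifies a /24 block, not one machine."""
    first_three = ip_address.split(".")[:3]
    return ".".join(first_three + ["0"])  # e.g. 192.39.228.17 -> 192.39.228.0

# A made-up log entry: the query survives, but only the 256-address block remains.
log_line = {"ip": "192.39.228.17", "query": "civil war pension records"}
log_line["ip"] = redact_ip(log_line["ip"])
print(log_line)
```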

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is.. or whether it goes far enough (plenty of others are doing that, see articles at EFF and Wired’s 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist’s tenets is to never do anything to a record that you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer). Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP change means)?

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, which would not have gotten into trouble for shredding records if it had done so according to its own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.