With a tag-line like ‘Stop Spam, Read Books’ – how can you not love reCAPTCHA? You might have already read about it on Boing Boing, NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.
Haven’t heard about reCAPTCHA yet? OK – have you ever filled out an online form that made you look at an image and type the letters or numbers that you see? These ‘verify you are a human’ challenges are used everywhere, from online concert ticket sites that don’t want scalpers to get too many of the tickets to blogs that are trying to prevent spam. What reCAPTCHA has done is harness this user effort to assist in the transcription of hard-to-OCR text from digitized books in the Internet Archive. Their website has a great explanation of what they are doing – and they include this great graphic below to show why human intervention is needed.
reCAPTCHA shows two words for each challenge – one that it knows the transcription of and a second that needs human verification. Slowly but surely all the words OCR doesn’t understand get transcribed and made available for indexing and search.
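The paired-word mechanism described above can be sketched in a few lines of code. This is purely illustrative – the function names, the vote threshold and the matching rule are my inventions, not reCAPTCHA’s actual implementation:

```python
# A toy sketch of the reCAPTCHA idea: each challenge pairs a "control" word
# (transcription already known) with an "unknown" word that OCR failed on.
# If the user gets the control word right, their guess for the unknown word
# is recorded as a vote; once enough votes agree, the word is considered
# transcribed. All names and thresholds here are hypothetical.
from collections import Counter

VOTES_NEEDED = 3  # hypothetical agreement threshold

def record_answer(control_word, control_answer, unknown_id, unknown_answer, votes):
    """Accept the challenge only if the control word matches; tally the vote."""
    if control_answer.strip().lower() != control_word.lower():
        return False  # human verification failed; discard both answers
    votes.setdefault(unknown_id, Counter())[unknown_answer.strip().lower()] += 1
    return True

def transcription(unknown_id, votes):
    """Return the agreed transcription once enough votes concur, else None."""
    tally = votes.get(unknown_id)
    if tally:
        word, count = tally.most_common(1)[0]
        if count >= VOTES_NEEDED:
            return word
    return None
```

The nice property of this scheme is that the spam-blocking step and the transcription step are the same user action – no extra effort is asked of anyone.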
I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.
Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.
Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg
Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).
With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).
Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.
If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?
I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it bears repeating: if you think preserving records originating from standardized, off-the-shelf software packages is hard, then please consider that really understanding the meaning of all the data (and the business rules surrounding its creation) in custom-built software systems is harder still by a factor of 10 (or 100).
It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.
My fear that in many cases we won’t know we are missing the pieces needed to decipher the data until many years down the road leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first-hand experience working with large custom-built databases. Often (back in my life as a software consultant) I would be assigned to fix or add on to a program I had not written myself. That often felt like trying to crawl into someone else’s brain.
Imagine being told you must finish a 20 page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18 page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic chosen by the original author by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the original author was trying to say. You find half-finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but by then it is too late to start over; you just have to get it done.
So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.
Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.
Abby Adams, Assistant Access and Outreach Archivist of the Richard B. Russell Library for Political Research and Studies, University of Georgia, and I are putting together a proposal for a session at SAA 2007 in Chicago. She and I found each other via my poster at SAA 2006: Communicating Context in Online Collections. We have been pondering many of the same questions related to the effective communication of context and original order in online digitized collections.
Our proposal is for a traditional 3 presentation panel with the title “Preserving Context and Original Order in a Digital World”. All we need now is a 3rd presenter, the endorsement of an SAA section or roundtable and (of course) the approval of the session selection committee. (And some plane tickets!)
This is the current version of our description for the proposal (mostly composed by Abby) :
Now that digitization projects have become more common in archival repositories, users and archivists alike have uncovered problems when it comes to understanding the context of online materials. However, there are various ways to provide more contextual information, thus enhancing the use of digital archives. But archivists must confront the obstacles surrounding this task by developing best practices and incorporating new software into their digitization projects. In order to simplify the problem, we should return to our traditional archival principles and draw connections to collection arrangement and description in a digital environment. Join three archivists to explore how to improve on “analog” techniques in the communication of context. When done right, the digitization of a collection will not only retain all the same opportunities for communicating context that we are familiar with, it may revolutionize the way that archivists and users interact with and understand our records.
The short take on what we want to cover in our session’s presentations is:
- What should archivists be doing to avoid losing context and original order information in the transition from analog records to digitized records?
- What can digitization give us the ability to do that we couldn’t do in the analog world?
- What tools and standards are out there today to help archivists do both of the above? What information should archivists be capturing to permit them to take advantage of the opportunities to communicate context and original order that these tools and standards offer?
Abby’s part of the session, titled “Where’s the Context? Enhancing Access to Digital Archives”, will examine the need for preserving context and original order when digitizing archival materials – focusing on how it enhances online use and access to archives. How can new systems retain the existing ability to communicate context and original order when moving from “analog” to “digital”?
My portion, “Communicating Context: The Power of Digital Interfaces”, will discuss what archivists can do in the digital world they cannot do (or at least not easily) with analog records to communicate context and original order. I will focus on various innovative methods to do this including the use of GIS, hot-linking for ease of navigation, the ability to ‘collect’ digital surrogates for examination and more. I plan to include a combination of exciting new interfaces doing great things alongside new ideas of what could be done. Keep your fingers crossed for us that there is internet access in the session rooms in Chicago.
We have a vision of a third speaker whose talk would consider what the leading standards and software tools are permitting people to do today. How can archivists leverage the existing and evolving standards (EAD, EAC, TEI and other DTDs) to capture and communicate context and original order in the digital world? In addition, it would provide a high level review of common software packages (Archon, Archivists’ Toolkit, ContentDM, and others) and how they address original order and context. Finally, we have a notion of a checklist of what to capture when digitizing to take advantage of what these tools and standards can provide for you.
Are you our mystery 3rd panelist that we are having so much trouble finding? The first clue is that you have already mapped out five PowerPoint slides in your head and started scribbling a rough draft of the “Archivists’ Digitization Checklist for Preserving Context” on a scrap of paper near your computer.
Maybe you know someone who would be a great person to pitch this to? Or you have advice for us concerning who to pass our proposal along to in the great hunt for that elusive session endorsement?
The deadline looms large (October 9)! Please contact us either via email (jeanne AT spellboundblog DOT com and adamsabi AT uga DOT edu) or in the comments of this post.
In her post Predictions over on hangingtogether.org, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.
And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist’s perspective (also having just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.
The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.
Being the optimist I am, I of course imagine this service as providing easy-to-use software with the flexibility for building custom DTDs for metadata and security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web based solution that supports everything needed by the archival community. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.
I think the biggest difference between this idea and the wikipedias of the world is that there would be some step required for an institution to ‘join’ such that they could use this service. The service wouldn’t control the content (in fact would need to be super careful about security and the like considering all the issues related to privacy and copyright) – rather it would provide the tools to support the work of others. While I know that some institutions would not be willing to let ‘control’ of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.
There would still be a place for the Archons and the Archivists’ Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists’ interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collection and provide access easily without being forced to invent a new wheel along the way.
If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.
Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-IT.org. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”
So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”
Dan Cohen posted recently about the soon-to-be-available, open-source Firefox plugin for research support named Zotero. Looking at the quick start guide, I immediately spotted the icon to “add a new collection folder”. As the “archivist-in-training” that I am, my reaction now to the word “collection” is different than it would have been a year ago. Though I strongly suspect it will not be the case (at least not in the first released version), I immediately was daydreaming of browsing a digitized collection online – clicking the “add a new collection folder” icon – and ending up with a copy of the entire collection of records for examination and comparison later.
Of course this would be most useful for the historian digging through and analyzing archival records if Zotero were able to pull down metadata beyond that of a standard citation and retain any hierarchical information or relationships among the records.
Dead Reckoning‘s post on Zotero mentions RDFa. I don’t know anything about RDFa beyond what I have read in the last few hours, so it is not clear to me how complicated the metadata can be – perhaps it can support a full digital-object XML record of some kind. So maybe the trick isn’t so much getting Zotero to do things it wasn’t designed to do – but rather the slow migration of sites to the software packages and standards listed there.
I don’t want anyone to think that I am not excited about Zotero and all the neat things it is likely to do. I suspect I will rapidly become a frequent Zotero user verging on a zealot – but it is fun to daydream. I think it is most fun to daydream now, before I start using it and get lost in all the great stuff it CAN do. I definitely will post more after I get a chance to take it for a spin in early October.
Late in the afternoon on Thursday, August 3rd, I attended the Research Libraries Group Roundtable at SAA 2006. It was an opportunity for RLG to share information with the archival community about their latest products and services. This session included presentations on the Internet Archive, Archive-It and the Web Archives Workbench.
After some brief business related to the SAA 2007 program committee and the rapid election of Brian Stevens from NYU Archives as the new chair of the group, Anne Van Camp spoke about the period of transition as RLG merges with OCLC. In the interest of the blending of cultures – she told a bar joke (as all OCLC meetings apparently begin). She explained that RLG products and services will be integrated into the OCLC product line. RLG programs will continue as RLG becomes the research arm for the joined interest areas of libraries, archives and museums. This has not existed before and they believe it will be a great chance to explore things in ways that RLG hasn’t had the opportunity to do in the past.
The initiatives on their agenda:
- archival gateways: recently convened two meetings – the first to see if there is a way to be interactive with international archive databases, and the second to bring regional archives together to see how they can work together.
- web archiving: they started looking at it from a service point of view, but there are also community issues that have to be worked out around web archiving. They are looking at big problems that will need community involvement – issues like metadata and selection.
- standards: continuing to support EAD, pursuing rigorous agenda regarding EAC
- OCLC has a whole group of people who work on registries (where you put information about organizations). RLG has talked about building a registry of US archives on top of ArchiveGrid.
In her introduction, Merrilee (a frequent poster on hangingtogether.org) highlighted that there are lots of questions about the intellectual side of web archiving (vs. the technical challenges), such as:
- what to archive?
- what metadata and description is appropriate for it?
- what would end users of web archives need? How would they use a web archive?
- what about collaborative collection development? It is expensive to archive the web – how does an institution say “I am archiving this corner of the web – this deep – this often”? This information should be publicly available for others doing research and others archiving the web.
She pointed out that RLG is happy about their work with Internet Archive – they are doing work to make the technical side easier but they understand that there is a lot for the archival community to sort out.
Next up was Kristine Hanna of the Internet Archive giving her presentation ‘Archiving and Preserving the Web’. The Internet Archive has been working with RLG this year and they need information from the users in the RLG community. They are looking into how they are going to work with OCLC and have applied for an NDIIP grant.
The Internet Archive (IA), founded by Brewster Kahle in 1996, is built on open source principles and dedicated to Open Source software.
What do they collect in the archive? Over 2 billion pages a month in 21 languages. It is free to use and is the largest archive on the web, including 55 billion pages from 55 million sites and supporting 60,000 unique users per day.
Why try to collect it all? They don’t feel comfortable making the choices about appraisal. And at-risk websites and collections are disappearing all the time. The average lifespan of a web page is 100 days. They did a case study of crawling websites associated with the Nigerian election – 6 months after the election, 70% of the crawled sites were gone, but they live on in the archive.
How do they collect? They use these components and tools:
- Heritrix – web crawler
- Wayback Machine – access tool for rendering and viewing archived files
- Nutch – search engine
- ARC file – archival file format used for preservation
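The core idea that ties these tools together is one timestamped record per fetch: every capture is stored with its URL and date, which is why playback tools like the Wayback Machine can show a page “as it was”. Here is a much-simplified sketch of that idea – this is not the real ARC specification, just an illustration of the record structure:

```python
# Simplified sketch of the "one timestamped record per fetch" idea behind
# archival web-crawl files. Each record is a header line (URL, capture
# timestamp, content type, payload length) followed by the raw payload.
# The real ARC format is more elaborate; this is only illustrative.
from datetime import datetime, timezone

def make_record(url, content_type, payload: bytes, fetched_at=None):
    """Bundle one fetch into a header line plus payload."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    stamp = fetched_at.strftime("%Y%m%d%H%M%S")  # e.g. 20060803211500
    header = f"{url} {stamp} {content_type} {len(payload)}\n"
    return header.encode("ascii") + payload

def read_record(blob: bytes):
    """Split a record back into (url, timestamp, content_type, payload)."""
    header, _, payload = blob.partition(b"\n")
    url, stamp, ctype, length = header.decode("ascii").split(" ")
    assert len(payload) == int(length)  # sanity check against truncation
    return url, stamp, ctype, payload
```

Because the timestamp travels with every record, the same URL can be stored many times over the years and replayed from any capture date.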
How do they preserve it? They keep multiple copies at different digital repositories (CA, Alexandria (Egypt), France, Amsterdam) using over 1300 server machines.
IA also does targeted archiving for partners. Institutions that want to create specific online collections or curated domain crawls can work with IA. These archives start at 100+ million documents and are based on crawls run by IA crawl engineers. The Library of Congress has arranged for an assortment of targeted archives including archives of US National Elections 2000, September 11 and the War in Iraq (not accessible yet – marked March 2003 – Ongoing). Australia arranged for archiving of the entire .au domain. Also see Purpose, Pragmatism and Perspective – Preserving Australian Web Resources at the National Library of Australia by Paul Koerbin of the National Library of Australia and published in February of 2006.
What’s Next for Internet Archive?
- collaboration and partnerships
- OCA – Open Content Alliance
- multiple copies around the world
Next, Dan Avery of IA gave a 9 minute version of his 35 minute presentation on Archive-It. Archive-It is a web based annual subscription service provided by IA to permit the capture of up to 10 million pages. Kristine gave some examples of those using Archive-It during her presentation:
- Indiana University – web sites
- North Carolina State Archives – government agencies, occupational licensing boards and commissions
- Library of Virginia – Jamestown 2007 commemoration and Governor Mark Warner’s last year in office. When Mark Warner was listed by the New York Times as a possible presidential candidate, this archive got lots of hits. (This brings up interesting questions of watching content that is being purposefully preserved to get an idea of what some expect for the future. Don’t be surprised by a post on this idea all by itself later. Need to think about it some more!)
He highlighted the different elements and techniques used in Archive-It: crawling, web user interface, storage, playback, text indexing and integration.

Crawling:
- open source Java
- archival quality (they preserve exactly what they get back from the server)

Playback:
- lets you surf the web as it was
- in Archive-It, each customer has their own Wayback Machine
- not open source yet – that is a work in progress

The user interface is a web application:
- collects all the info they need to do the crawling the customer requests
- schedule (monthly, daily, weekly, quarterly, etc.)
- seed URLs (the starting point for archive web crawls)

Text indexing:
- an extension of Nutch, which is built on Lucene
- full text search plus link analysis
- can search by date instead of relevance – useful for individual archives
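To make the “schedule plus seed URLs” configuration concrete, here is a hypothetical sketch of the kind of crawl-job record such a web interface might collect. The field names and the scheduling rule are my guesses for illustration, not Archive-It’s actual data model:

```python
# Hypothetical sketch of a crawl-job configuration: a unique collection
# identifier, a list of seed URLs (starting points for the crawl), and a
# schedule that determines when the next crawl runs. Illustrative only.
from dataclasses import dataclass, field
from datetime import date, timedelta

# Approximate intervals in days for each schedule option.
INTERVALS = {"daily": 1, "weekly": 7, "monthly": 30, "quarterly": 91}

@dataclass
class CrawlJob:
    collection_id: str                          # unique collection identifier
    seeds: list = field(default_factory=list)   # starting points for crawls
    schedule: str = "monthly"

    def next_run(self, last_run: date) -> date:
        """Compute the next crawl date from the schedule."""
        return last_run + timedelta(days=INTERVALS[self.schedule])
```

The appeal of this design for the archivist is that everything technical (the crawler, storage, playback) hides behind a handful of curatorial decisions: what to start from, and how often.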
While there are public collections in Archive-It, logging in gives you access to your own collections: it shows the total documents archived (and more), lets you check your list of active collections and set up a new collection (including a unique collection identifier). He showed some screen shots of the interface and examples (this was the first time there wasn’t a network available for his presentation – he was amused that the paranoia that forced him to always bring screen captures finally paid off!).
It was interesting seeing this presentation back to back with the general Internet Archive overview. There is a lot of overlap in tools and approaches between them – but Archive-It definitely has its own unique requirements. It puts the tools for managing large-scale web crawling in the hands of archivists (or, more likely, information managers of some sort) – rather than the technical staff of IA.
The final presentation of the roundtable was by Judy Cobb – a Product Manager from OCLC. She gave an overview of the Web Archives Workbench. (I hunted for a good link to this – but the best I came up with was an acknowledgments document and the login page.) The inspiration for the creation of the Workbench was the challenge of collecting from the web. The Internet is a big place. It is hard to define the scope of what to archive.
Workbench is a discovery tool that will permit its users to investigate which domains should be included when crawling a website for archiving. For example, you can tell it not to crawl Adobe.com just because there is a link to it to let people download Acrobat.
Workbench will let you set metadata for your collection based on the domains you said were in scope. It will then let you appraise and rank the entities/domains being harvested, leaving you with a list of organizations or entities in scope, ranked by importance. Next it will build a site map of what is going to be crawled, define parts of the map as series, and put the harvested content and related metadata into a repository. Other configuration options permit setting how frequently you harvest various series, choosing to only get new content, and requesting notification if the sitemap changes.
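The scoping decision at the heart of this – given the domains a curator marked in scope, should a discovered link be crawled? – is simple to sketch. This is a hypothetical illustration of the concept, not Workbench’s actual logic:

```python
# Sketch of an in-scope test for discovered links during a crawl: a URL is
# crawled only if its host is one of (or a subdomain of) the domains the
# curator marked in scope. Hypothetical illustration only.
from urllib.parse import urlparse

def in_scope(url: str, scoped_domains: set) -> bool:
    """True if the URL's host matches or is a subdomain of a scoped domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in scoped_domains)
```

This is exactly the Adobe.com example above: the link exists, but the curator’s scope list keeps the crawler from following it.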
Workbench is currently in beta and still under development. The 3rd phase will add support for Richard Pearce-Moses’s Arizona Model for Web Preservation and Access. The focus of the Arizona Model is curation, not technology. It strives to find a solution somewhere between manual harvesting and bulk harvesting that is based on standard archival theories. Workbench will be open source and is funded by LOC.
I wasn’t sure what to expect from the roundtable – but I was VERY glad that I attended. The group was very enthusiastic – cramming in everything they could manage to share with those in the room. The Internet Archive, Archive-It and the Web Archives Workbench represent the front of the pack of software tools intended to support archiving the web. It was easy to see that if the Workbench is integrated in with Archive-It, that it should permit archivists to start paying more attention to the identification of what should be archived rather than figuring out how to do the actual archiving.
The second part of SAA2006 session 305 (Extended Archival Description: Context and Specificity for Digital Objects) was a presentation titled “Understanding from Context: Pairing EAD and Digital Repository Description”. Delivered by Ann Handler and Jennifer O’Brian of the University of Maryland’s ArchivesUM project, this talk explained the general approach being used to tackle the challenges of managing item-level description of digitized images at a large, diverse institution.
They described the tension between collection-level descriptions and the inheritance of attributes by items. With more images being digitized all the time, storing the images on a network drive made them hard to find or inventory. Different departments at the University of Maryland described images at different levels and using different software. While one department had only folder-level collection descriptions, another set of 3,000 photos was described at the item level but without adherence to any standard.
A flat system of description was not the answer because it provided no way to include hierarchical description. Their answer was to combine archival description via finding aids with well structured item level description.
As the ArchivesUM project defines them, an item can be part of more than one collection and could be part of a collection that represents the source of the item as well as an online exhibition. The team also wanted to create relationships between the existing repository of finding aids and new digital objects. In order to accomplish everything they required (and in contrast with the Archives of American Art’s choice of ColdFusion), ArchivesUM selected FEDORA – an open source digital object repository – for their development. They created a transitional database using SQL Server to track and store the metadata of newly scanned digital objects so as not to lose anything while the custom FEDORA system was being developed. The digital object repository uses a rich descriptive standard based on Dublin Core and the Visual Resources Association (VRA) Core.
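To give a feel for what item-level description in this style looks like, here is a sketch that serializes a handful of Dublin Core elements to XML. The sample field values are invented, and the real ArchivesUM schema (which also draws on VRA Core) is certainly far richer than this:

```python
# Sketch of a minimal item-level record using Dublin Core elements.
# The field names are real DC elements; the record structure and sample
# values are illustrative only, not the ArchivesUM schema.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(fields: dict) -> str:
    """Serialize a flat dict of Dublin Core elements to an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields.items():
        el = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

xml = dc_record({"title": "Campus aerial view, 1954",
                 "creator": "Unknown photographer"})
```

Well-structured records like this are what make it possible for one item to live in several collections at once: the description travels with the object rather than being buried in a single finding aid.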
In order to connect to other kinds of objects from non-archival collections and other sources, they simply add them to the finding aid. To present the finding aid they needed new style sheets. Their improved layout helps users understand the hierarchy of the collection and how to find the individual record they are looking for, lets them move up and down through the tree easily to explore, and helps users understand the context of the record they are viewing.
One of the questions that was asked at the end of the session related to the possible drawbacks of linking just one or two digitized documents to a finding aid. The concern was that it might mislead the user into believing that these few documents were the only documents in the collection held by the archives. The response was strong, asserting that the opposite was true – that having the link into the finding aid gave users the context they desperately needed to understand the few digitized records and point them in the right direction for finding the rest of the offline collection. This is especially important in the Google universe. Users may end up viewing a single digitized record from an online collection and, without a link back to the finding aid for the collection, not understand the meaning of the record in question.
The ArchivesUM team is currently setting the stage for migration to the live system. I look forward to exploring the actual user interface after it goes live.
There were lots of interesting ideas in the talks given by Dan Cohen and Roy Rosenzweig during their SAA session Archives Seminar: Possibilities and Problems of Digital History and Digital Collections (session 510).
Two big ideas were discussed: the first about historians and their relationship to internet archiving and the second about using the internet to create collections around significant events. These are not the same thing.
In his article Scarcity or Abundance? Preserving the Past in a Digital Era, Roy talks extensively about the dual challenges of losing information as it disappears from the net before being archived and the future challenge to historians faced with a nearly complete historical record. This assumes we get the internet archiving thing right in the first place. It assumes those in power let the multitude of voices be heard. It assumes corporately sponsored sites providing free services for posting content survive, are archived and do the right thing when it comes to preventing censorship.
The Who Built America CD-ROM, released in 1993 and bundled with Apple computers for K-12 educational use, covered the history of America from 1876 to 1914. It came under fire in the Wall Street Journal for including discussions of homosexuality, birth control and abortion. Fast forward to now, when schools use filtering software to prevent ‘inappropriate’ material from being viewed by students – in much the same way that Google China filters search results. He shared with us the contrast of the search results from Google Images for ‘Tiananmen square’ vs the search results from Google Images China for ‘Tiananmen square’. Something so simple makes you appreciate the freedoms we often forget here in the US.
It makes me look again at the DOPA (Deleting Online Predators Act) legislation recently passed by the House of Representatives. In the ALA’s analysis of DOPA, they point out all the basics as to why DOPA is a rotten idea. Cool Cat Teacher Blog has a great point by point analysis of What’s Wrong with DOPA. There are many more rants about this all over the net – and I don’t feel the need to add my voice to that throng – but I can’t get it out of my head that DOPA’s being signed into law would be a huge step BACK for freedom of speech and learning and internet innovation in the USA. How crazy is it that at the same time that we are fighting to get enough funding for our archivists, librarians and teachers – we should also have to fight initiatives such as this that would not only make their jobs harder but also siphon away some of those precious resources in order to enforce DOPA?
In the category of good things for historians and educators is the great progress of open source projects of all sorts. When I say Open Source I don’t just mean software – but also the collection and communication of knowledge and experience in many forms. Wikipedia and YouTube are not just fun experiments – but sources of real information. I can only imagine the sorts of insights a researcher might glean from the specific clips of TV shows selected and arranged as music videos by TV show fans (to see what I am talking about, take a look at some of the videos returned from a search on gilmore girls music video – or the name of your favorite pop TV characters). I would even venture to say that YouTube has found a way to provide a method of responding to TV, perhaps starting down a path away from TV as the ultimate passive one way experience.
Roy talked about ‘Open Sources’ being the ultimate goal – and gave a final plug to fight to increase budgets of institutions that are funding important projects.
Dan’s part of the session addressed that second big idea I listed – using the internet to document major events. He presented an overview of the work of ECHO: Exploring and Collecting History Online. ECHO had been in existence for a year at the time of 9/11 and used 9/11 as a test case for their research to that point. The Hurricane Digital Memory Bank is another project launched by ECHO to document stories of Katrina, Rita and Wilma.
He told us the story behind the creation of the 9/11 digital archive – how they decided they had to do something quickly to collect the experiences of people surrounding the events of September 11th, 2001. They weren’t quite sure what they were doing – if they were making the best choices – but they just went for it. They keep everything. There was no ‘appraisal’ phase to creating this ‘digital archive’. He actually made a point a few minutes into his talk to say he would stop using the word archive, and use the term collection instead, in the interest of not having tomatoes thrown at him by his archivist audience.
The lack of appraisal prompted a question at the end of the session: where does that leave archivists who believe that appraisal is part of the foundation of archival practice? The answer was that we have the space – so why not keep it all? Dan gave an example of a colleague who had written extensively based on research into World War II rumors found in the Library of Congress. These easily could have been discarded as not important – but you never know how information you keep can be used later. He told a story about how they noticed that some people are using the 9/11 digital archive as a place to research teen slang because it has such a deep collection of teen narratives submitted to be part of the archive.
This reminded me of a story that Prof. Bruce Ambacher told us during his Archival Principles, Practices and Programs course at UMD. During the design phase for the new National Archives building in College Park, MD, the Electronic Records division was approached to find out how much room they needed for future records. Their answer was none. They believed that digital storage density was improving faster than the rate at which new records were coming into the archive. One of the driving forces behind the strong arguments for the need for appraisal in US archives was born out of the sheer bulk of records that could not possibly be kept. While I know that I am oversimplifying the arguments for and against appraisal (Jenkinson vs Schellenberg, etc) – at the same time it is interesting to take a fresh look at this in the light of removing the challenges of storage.
Dan also addressed some interesting questions about the needs of ‘digital scholarship’. They got zip codes from 60% of the submissions for the 9/11 archive – they hope to increase the accuracy and completeness of GIS information in the hurricane archive by using Google Maps’ new feature to permit pinpointing latitude and longitude based on an address or intersection. He showed us some interesting analysis made possible by pulling slices of data out of the 9/11 archive and placing it as layers on a Google Map. In the world of mashups, one can see this as an interesting and exciting new avenue for research. I will update this post with links to his promised details to come on his website about how to do this sort of analysis with Google Maps. There will soon be a researcher’s interface of some kind available at the 9/11 archive (I believe in sync with the 5 year anniversary of September 11).
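Dan didn’t show his code, but the kind of slicing he described – grouping submissions by zip code so each count can become a layer or marker on a map – can be sketched in a few lines. The sample submissions below are invented for illustration:

```python
# Hedged sketch (not the actual 9/11 archive code): counting submissions
# per zip code, the raw material for a map layer. Sample data is invented.
from collections import Counter

# invented sample submissions as (zip_code, text) pairs;
# None stands in for the ~40% of submissions that lacked a zip code
submissions = [
    ("10013", "I was downtown that morning..."),
    ("10013", "We watched from the roof..."),
    ("11201", "Across the river in Brooklyn..."),
    (None,    "No zip code supplied"),
]

by_zip = Counter(z for z, _ in submissions if z is not None)

# each (zip, count) pair could then be geocoded and drawn on a map
for zip_code, count in by_zip.most_common():
    print(zip_code, count)
```

From here, each zip code would be geocoded to a latitude/longitude and rendered as a map overlay – which is exactly what an address-to-coordinates feature like the one he mentioned makes easy.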
Near the end of the session a woman took a moment to thank them for taking the initiative to create the 9/11 archive. She pointed out that much of what is in archives across the US today is the result of individuals choosing to save and collect things they believed to be important. The woman who had originally asked about the place of appraisal in a ‘keep everything digital world’ was clapping and nodding and saying ‘she’s right!’ as the full room applauded.
So – keep it all. Snatch it up before it disappears (there were fun stats like the fact that most blogs remain active for 3 months, most email addresses last about 2 years and inactive Yahoo Groups are deleted after 6 months). There is likely a place for ‘curatorial views’ of the information created by those who evaluate the contents of the archive – but why assume that something isn’t important? I would imagine that as computers become faster and programming becomes smarter – if we keep as much as we can now, we can perhaps automate the sorting later with expert systems that follow very detailed rules for creating more organized views of the information for researchers.
This panel had so many interesting themes that crossed over into other panels throughout the conference. The Maine archivist talking about ‘stopping the bleeding’ of digital data loss in his talk about the Maine GeoArchives. The panel on blogging (that I will write more about in a future post). The RLG Roundtable with presentations from people over at the Internet Archive and their talks about archiving everything (which ALSO deserves its own future post).
I feel guilty for not managing to touch on everything they spoke about – it really was one of the best sessions I attended at the conference. I think that having voices from outside the archival profession represented is both a good reality check and great for the cross-pollination of ideas. Roy and Dan have recently published a book titled Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web – definitely on my ‘to be read’ list.
‘X’ Marks the Spot was a fantastic first session for me at the SAA conference. I have had a fascination with GIS (Geographic Information Systems) for a long time. I love the layers of information. I love the fact that you can represent information in a way that often makes you realize new things just from seeing it on a map.
Since my write-up of each panelist is fairly long, I will put each in a separate post.
Helen Wong Smith, from the Kamehameha Schools, started off the panel discussing her work on the Land Legacy Database in her presentation titled “Wahi Kupuna: Digitized Cultural Resources Database with GIS Access”.
Kamehameha Schools (KS) was founded by the will of Princess Bernice Pauahi Bishop. With approximately 360,000 acres, KS is the largest private landowner in the state of Hawaii. With over $7 billion in assets, the K-12 schools subsidize a significant portion of the cost to educate every student (parents pay only 10% of the cost).
KS generates income from residential, commercial and resort leases. In addition to generating income – a lot of the land has a strong cultural connection. Helen was charged with empowering the land management staff to apply 5 values every time there is any type of land transaction: Economic, Educational, Cultural, Environmental and Community. They realized that they had to know about the lands they own. For example, if they take a parcel back from a long lease and they are going to re-lease it, they need to know about the land. Does it have archaeological sites? Is it a special place to the Hawaiian people?
Requirements for the GIS enabled system:
- Find the information
- Keep it all in one place
- Ability to export and import from other standards-based databases (MARC, Dublin Core, Open Archives Initiative)
- Some information is private – not to be shared with public
- GIS info
- Digitize all text and images
- Identify by Tax map keys (TMK)
- Identify by ‘traditional place name’
- Identify by ‘common names’ – surfer invented names (her favorite examples are ‘suicides’ and ‘leftovers’)
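The three-identifier requirement above boils down to letting the same parcel be found by TMK, traditional place name, or common name. A minimal sketch of that kind of multi-key lookup might look like this (the TMK and names below are invented for illustration, not real KS data):

```python
# Hypothetical sketch: one parcel findable by any of its three kinds
# of identifier. All identifiers and records here are invented.
parcels = [
    {
        "tmk": "3-7-4-001-002",            # invented tax map key
        "traditional": "Keauhou Example",  # invented traditional place name
        "common": ["suicides"],            # surfer-invented common name
    },
]

# build one index mapping every kind of identifier to its parcel record
index = {}
for p in parcels:
    index[p["tmk"]] = p
    index[p["traditional"].lower()] = p
    for name in p["common"]:
        index[name.lower()] = p

def find_parcel(key):
    """Look up a parcel by TMK, traditional name, or common name."""
    return index.get(key.lower())
```

A real system would of course handle name collisions and one name mapping to many parcels, but the core idea – several vocabularies pointing into one set of records – is the same.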
The final system would enforce the following security:
- Lowest – material from public repositories, e.g. the Hawaii State Archives
- Medium – material for which we’ve acquired the usage rights for limited use
- Highest – leases and archaeological reports
Currently the Land Legacy Database is only available within the firewall – but eventually the lowest level of security will be made public.
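The plan to eventually expose only the lowest security tier amounts to a simple filter over the records. A hedged sketch of that rule, with invented level names and sample records:

```python
# Hedged sketch of the three-tier security rule; tier values and
# records are invented for illustration.
LOW, MEDIUM, HIGH = 1, 2, 3

records = [
    {"title": "State Archives map (public repository)", "level": LOW},
    {"title": "Photograph with limited usage rights",   "level": MEDIUM},
    {"title": "Lease agreement",                        "level": HIGH},
]

def visible_to_public(records):
    """Keep only records at the lowest sensitivity tier."""
    return [r for r in records if r["level"] == LOW]

public_view = visible_to_public(records)
print([r["title"] for r in public_view])
```

Inside the firewall the land managers would see all three tiers; the public site would be handed only the output of a filter like this one.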
They already had a web GIS portal and needed this new system to hook up to the Web GIS as well, and needed to collect and disseminate data, images, audio/visual clips and references in all formats. In addition, the land managers needed an easy way to access information from the field, such as lease agreements or archaeological reports (are there native burials? where are they, and who were they?).
Helen selected Greenstone – open source software (from New Zealand) for the following reasons:
- open source
- multilingual (handles glottals and other spelling issues in the Hawaiian language)
- GNU General Public License
- software for building and distributing digital library collections
- new ways of organizing information
- can publish to the internet and CD-ROM
- many ways of access, including by search, titles and genres
- support for audio and video clips (example – the Felix E. Grant Collection)
The project started with 60,000 TIFF images (viewable as JPEGs) – pre-scanned and indexed by another person. Each of these ‘claim documents’ includes a testimony and a register. It is crucial to reproduce the original primary resources to prevent confusion, such as can occur between place names and personal names.
Helen showed an example from another Greenstone database of newspaper articles published in a new Hawaiian journal. It was displayed in 3 columns, one each for:
- the original Hawaiian-language newspaper text as published
- the text including the diacriticals
- the English translation
OCR would be a major challenge with these documents – so it isn’t being used.
Helen worked with programmers in New Zealand to do the customizations needed (such as GIS integration) after losing the services of the IT department. She has been told that she made more progress working with the folks from New Zealand than she would have with IT!
The screen shots were fun – they showed examples of how the Land Legacy Database data uses GIS to display layers on maps of Hawaii including outlines of TMKs or areas with ‘traditional names’. One can access the Land Legacy Database by clicking on a location on the map and selecting Land Legacy Database to get to records.
The Land Legacy Database was envisioned as a tool to compile diverse resources regarding the Schools’ lands to support decision making, e.g. regarding the location and possible destruction of cultural sites. Its evolution includes:
- inclusion of internal and external records including reports conducted for and by the Schools in the past 121 years
- a platform providing access to staff, faculty and students across the islands
- sharing server space with the Education Division
Helen is only supposed to spend 20% of her time on this project! Her progress is amazing.