

Archival Transcriptions: for the public, by the public

There is a recent thread on the archives listserv about transcription – specifically for small projects or those with little financial support. In at least one case there is no easy OCR answer because of the state of the digitized microfilm records.
One of the suggestions was to rely on human effort to read the documents aloud – either into a program that would transcribe the speech, or to another person who would do the typing. It made me wonder what it would look like to build a place online where people could volunteer their transcription time. In cases where the records are already digitized and viewable, this seems like an interesting approach.

Something like this already exists for the genealogy world over at the USGenWeb Archives Project. They have a long list of projects listed here. Though the interface is a bit confusing, the spirit of the effort is clear – many hands make light work. Precious genealogical resources can be digitized, transcribed and added to this archive to support the research of many, by anyone – anywhere in the world.

Of course, in the case of transcribing archival records there are challenges to be overcome. How do you validate what is transcribed? How do you provide guidance and training for people working from anywhere in the world? If I have figured out that a particular shape is a capital S in a specific set of documents, that can help me (or an OCR program) as I progress through the documents – but if I only see one page from a series, I will have to puzzle through that page without the support of my past experience. Perhaps that would encourage people to keep helping with a specific set of records? Maybe you give people a few sample pages with validated transcriptions to practice with? And many records won't be that hard to read – easy for a human eye but still a challenge for an OCR program.
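
To make the practice-page idea a bit more concrete, here is a minimal sketch (the page names, reference text and pass threshold are all made up) of scoring a volunteer's practice transcription against a validated one:

```python
import difflib

# Hypothetical validated reference transcriptions for a few practice pages.
REFERENCE_PAGES = {
    "series-12/page-001": "Know all men by these presents that I, Samuel ...",
    "series-12/page-002": "Received of the county clerk the sum of ...",
}

def score_practice_transcription(page_id: str, submitted_text: str) -> float:
    """Return a 0..1 similarity score between a volunteer's practice
    transcription and the validated reference for that page."""
    reference = REFERENCE_PAGES[page_id]
    return difflib.SequenceMatcher(None, reference.lower(), submitted_text.lower()).ratio()

if __name__ == "__main__":
    score = score_practice_transcription(
        "series-12/page-001",
        "Know all men by these presents that I Samuel ...",
    )
    # An arbitrary threshold; a real project would tune this per collection.
    print("pass" if score > 0.9 else "needs review", round(score, 3))
```

A real project would probably want per-collection thresholds and a human reviewer in the loop for borderline scores.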

The optimist in me hopes that it could be a tempting task for those who want to volunteer but don't have time to come in during the normal working day. Transcribing digitized records can be done in the middle of the night in your pajamas from anywhere in the world. Talk about increasing your pool of possible volunteers! I would think that it could even be an interesting project for high school and college students – a chance to work with primary sources. With careful design, I can even imagine providing an option to select from a preordained set of subjects or tags (or, in a folksonomy-friendly environment, the option to add any tags that the transcriber deems appropriate) – though that may be another topic worthy of its own exploration independent of transcription.

The initial investment for a project like this would go into building a framework to support a distributed group of volunteers. You would need an easy way to serve up a record or group of records to a volunteer and prevent duplication of effort – but this is an old problem with good solutions from the configuration management world of software development and other collaborative work environments.
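
For what it's worth, here is a rough sketch of that check-out bookkeeping, assuming nothing more than a single SQLite table with an invented layout – not any particular tool's schema:

```python
import sqlite3

# A page is either available, checked out to a volunteer, or done.
conn = sqlite3.connect("transcription_queue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        page_id   TEXT PRIMARY KEY,
        image_url TEXT NOT NULL,
        status    TEXT NOT NULL DEFAULT 'available',  -- available / checked_out / done
        volunteer TEXT
    )
""")

def check_out_next_page(volunteer: str):
    """Hand the next available page to a volunteer and mark it as taken."""
    with conn:  # wraps the read-then-update in a transaction
        row = conn.execute(
            "SELECT page_id, image_url FROM pages WHERE status = 'available' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE pages SET status = 'checked_out', volunteer = ? WHERE page_id = ?",
            (volunteer, row[0]),
        )
        return row

def submit_transcription(page_id: str, text: str):
    """Mark a page as done; in practice the text would go to a review queue."""
    with conn:
        conn.execute("UPDATE pages SET status = 'done' WHERE page_id = ?", (page_id,))
```

The real work, of course, is in the review workflow and the volunteer-facing interface, not in this bit of plumbing.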

It makes a nice picture in my mind – a slow, but steady, team effort to transcribe collections like the Colorado River Bed Case (2,125 pages of digitized microfilm at the University of Utah’s J. Willard Marriott Library) – mostly done from people’s homes on their personal computers in the middle of the night. A central website for managing digitized archival transcriptions could give the research community the ability to vote on the next collection that warrants attention. Admit it – you would type a page or two yourself, wouldn’t you?

Records Speaking to the Present: Voices Not Silenced

When I composed my main essay for my application to University of Maryland's MLS program, I wrote about why I was drawn to their Archives Program. I told them I revel in hearing the voices of the past speak through records such as those at EllisIsland.org. I love the power that records can wield – especially when they can be accessed digitally from anywhere in the world. It is this sort of power that let me see the ship manifests and the names of the boats on which my grandparents came to this country (such as The Finland).

All this came rushing back to me while reading the September 18th article 2 siblings reunited after being separated in Holocaust. The grandsons of a Holocaust survivor looked up their grandmother in Yad Vashem's central database of Shoah Victims' Names – and found an entry stating that she had died during the Holocaust. One thing led to another – and two siblings who thought they had lost each other 65 years earlier were reunited.

The fact that access to records can bring people together across time speaks to me at a very primal level. So now you know – I am a romantic and an optimist (okay, if you have been reading my blog already, this shouldn't come as any surprise). I want to believe that people who were separated long ago can be reunited – either through words or in person. This isn't the first story like this – a quick search in Google News turned up others, such as this Holocaust reunion story from 2003.

This led me to do more research into how archival records are being used to find people lost during the Holocaust.

The Red Cross Holocaust Tracing Center has researched 28,000 individuals – and found over 1,000 of them alive since 1990. The FAQ on their website states that they believe there to be over 280,000 Holocaust survivors and family members in the United States alone and that they believe their work may continue for many years. As much as I love the idea of finding a way to provide access to digitized records – it is easy to see why the Tracing Center isn’t going away anytime soon. First of all – consider their main data sources – lots of private information that likely does NOT belong someplace where it can be read by just anyone:

While the American Red Cross has been providing tracing for victims of WWII and the Nazi regime since 1939, impetus for the creation of the center occurred in 1989 with the release of files on 130,000 people detained for forced labor and 46 death books containing 74,000 names from Auschwitz. Microfilm copies released to the International Committee of the Red Cross (ICRC) by the Soviet Union provided the single largest source of information since the end of WWII.

The staff of the center have also forged strong ties with the ICRC’s International Tracing Service in Arolsen, Germany – and get rapid turnaround times for their queries as a result. They have access to many organizations, archives and museums around the world in their hunt for evidence of what happened to individuals. They use all the records they can find to discover the answers to the questions they are asked – to be the detectives that families need to discover what happened to their loved ones. To answer the questions that have never been answered.

The archive of the USC Shoah Foundation Institute for Visual History and Education consists of 52,000 testimonies of survivors and other witnesses to the Holocaust, collected in 56 countries and 32 languages from 1994 through 2000. These video testimonies document experiences before, during and after the Holocaust. It is the sort of firsthand documentation that simply could not have existed without the vision and efforts of many. They say on their FAQ page:

Now that this unmatched archive has been amassed, the Shoah Foundation is engaged in a new and equally urgent mission: to overcome prejudice, intolerance, and bigotry – and the suffering they cause – through the educational use of the Foundation’s visual history testimonies… Currently, the Foundation is committed to making these videotaped testimonies accessible to the public as an international educational resource. Simultaneously, an intensive program of cataloguing and indexing the testimonies is underway. This process will eventually enable researchers and the general public to access information about specific people, places, and experiences mentioned in the testimonies in much the same way as an index permits a reader to find specific information in a book.

The testimonies also serve as a basis for a series of educational materials such as interactive web exhibits, documentary films, and classroom videos developed by the Shoah Foundation.

I guess I am not sure where I am going with this – other than to point out a dramatic array of archives that are touching the lives of people right now. Consider this post a fan letter to all the amazing people who have shepherded these collections (and in some cases their digital counterparts) into the twenty-first century, where they will continue to help people hear the voices of their ancestors.

I have more ideas brewing on how these records compare and contrast with those about the survivors and those who were lost to 9/11, the Asian Tsunami and Katrina. How do these types of records compare with the Asian Tsunami Web Archive or the Hurricane Digital Memory Bank? Where will the grandchildren of those who lost their homes to Katrina go in 30 years to find out what street the family home used to be on? Who will bear witness to the people lost to the tsunami in Asia? Lots to think about.

My New Daydream: A Hosting Service for Digitized Collections

In her post Predictions over on hangingtogether.org, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.

And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist's perspective (having also just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.

The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.

Being the optimist I am, I of course imagine this service providing easy-to-use software with the flexibility to build custom DTDs for metadata and the security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web-based solution that supports everything the archival community needs. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.

I think the biggest difference between this idea and the Wikipedias of the world is that an institution would have to take some step to 'join' before it could use the service. The service wouldn't control the content (in fact it would need to be super careful about security and the like, considering all the issues related to privacy and copyright) – rather, it would provide the tools to support the work of others. While I know that some institutions would not be willing to let 'control' of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.

There would still be a place for the Archons and Archivists' Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists' interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collections and provide access easily, without being forced to reinvent the wheel along the way.

If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.
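
As a toy illustration of what that might look like under the hood (the record identifiers, titles and coordinates here are invented), a simple bounding-box query over geotagged records:

```python
from dataclasses import dataclass

@dataclass
class Record:
    record_id: str
    title: str
    lat: float
    lon: float

# Invented sample records, each tied to a point on the map.
RECORDS = [
    Record("crb-0001", "Colorado River survey field notes", 36.10, -112.10),
    Record("crb-0002", "Hearing transcript, river bed claims", 36.06, -112.14),
    Record("xyz-0940", "Unrelated deed, elsewhere entirely", 40.75, -73.99),
]

def records_near(lat: float, lon: float, box: float = 0.25):
    """Return records whose coordinates fall inside a simple box around a point."""
    return [
        r for r in RECORDS
        if abs(r.lat - lat) <= box and abs(r.lon - lon) <= box
    ]

if __name__ == "__main__":
    for r in records_near(36.1, -112.1):
        print(r.record_id, r.title)
```

A real implementation would use a spatial index rather than a linear scan, but the point is the same: geography becomes just another access point to the records.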

Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-IT.org. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”

So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”

Google Newspaper Archives

I was intrigued by the news that Google had launched a News Archive search interface. For my first search, I searched on "Banjo Dancing" (a one-man show that spent most of the 1980s in Arena Stage's Old Vat Room). It was tantalizing to see articles from "way back when" appear. The 'timeline' format was a very useful way to move quickly through the articles and focus a search.

Many newspapers that provide online access to their archives charge a per-article fee for viewing the full article. You are not charged when you click on the link – and you do get a chance to view some sort of short abstract before paying. The advanced search permits you to limit your results by cost – so you can search only for articles that are free or that cost below a specific amount. By modifying my original search to include only free articles, I found three results: one from 1979, one from 2002 and one that did not yield anything.

So what does this mean for archives? In their FAQ, Google states: "If you have a historical archive that you think would be a good fit in News archive search, we would love to hear from you." Take a moment and think about that – archives with digitized news content can raise their hands and ask to be included. Google has suddenly put the tools for increasing access in the hands of everyone. The university that has digitized its newspapers can suddenly be put on the same level as the New York Times and the Washington Post. There currently does not seem to be a fixed list showing "these are the news sources included in the Google news archive" – but I hope they add one.

In their usual fashion, Google has increased the chance of serendipitous discovery of information – and because everything in the news archive comes from a vetted source, the quality and reliability of what you find should be well above that of a standard web search.

Session 510: Digital History and Digital Collections (aka, a fan letter for Roy and Dan)

There were lots of interesting ideas in the talks given by Dan Cohen and Roy Rosenzweig during their SAA session Archives Seminar: Possibilities and Problems of Digital History and Digital Collections (session 510).

Two big ideas were discussed: the first about historians and their relationship to internet archiving and the second about using the internet to create collections around significant events. These are not the same thing.

In his article Scarcity or Abundance? Preserving the Past in a Digital Era, Roy talks extensively about two challenges: losing information as it disappears from the net before it can be archived, and the future challenge to historians faced with a nearly complete historical record. The latter assumes we get the internet archiving thing right in the first place. It assumes those in power let the multitude of voices be heard. It assumes the corporately sponsored sites providing free services for posting content survive, are archived and do the right thing when it comes to preventing censorship.

The Who Built America CD-ROM, released in 1993 and bundled with Apple computers for K-12 educational use, covered the history of America from 1876 to 1914. It came under fire in the Wall Street Journal for including discussions of homosexuality, birth control and abortion. Fast forward to now, when schools use filtering software to prevent 'inappropriate' material from being viewed by students – in much the same way that Google China filters search results. He shared with us the contrast between the Google Images search results for 'Tiananmen Square' and the Google Images China results for the same query. Something so simple makes you appreciate the freedoms we often forget here in the US.

It makes me look again at the DOPA (Deleting Online Predators Act) legislation recently passed by the House of Representatives. In the ALA's analysis of DOPA, they point out all the basics as to why DOPA is a rotten idea. Cool Cat Teacher Blog has a great point-by-point analysis of What's Wrong with DOPA. There are many more rants about this all over the net – and I don't feel the need to add my voice to that throng – but I can't get it out of my head that DOPA's being signed into law would be a huge step BACK for freedom of speech, learning and internet innovation in the USA. How crazy is it that at the same time we are fighting to get enough funding for our archivists, librarians and teachers, we also have to fight initiatives like this – ones that would not only make their jobs harder but also siphon away some of those precious resources in order to enforce DOPA?

In the category of good things for historians and educators is the great progress of open-source projects of all sorts. When I say Open Source I don't just mean software – but also the collection and communication of knowledge and experience in many forms. Wikipedia and YouTube are not just fun experiments – they are sources of real information. I can only imagine the sorts of insights a researcher might glean from the specific clips of TV shows selected and arranged as music videos by fans (to see what I am talking about, take a look at some of the videos returned from a search on gilmore girls music video – or the name of your favorite pop TV characters). I would even venture to say that YouTube has found a way to provide a method of responding to TV, perhaps starting down a path away from TV as the ultimate passive one-way experience.

Roy talked about ‘Open Sources’ being the ultimate goal – and gave a final plug to fight to increase budgets of institutions that are funding important projects.

Dan’s part of the session addressed that second big idea I listed – using the internet to document major events. He presented an overview of the work of ECHO: Exploring and Collecting History Online. ECHO had been in existence for a year at the time of 9/11 and used 9/11 as a test case for their research to that point. The Hurricane Digital Memory Bank is another project launched by ECHO to document stories of Katrina, Rita and Wilma.

He told us the story behind the creation of the 9/11 digital archive – how they decided they had to do something quickly to collect the experiences of people surrounding the events of September 11th, 2001. They weren’t quite sure what they were doing – if they were making the best choices – but they just went for it. They keep everything. There was no ‘appraisal’ phase to creating this ‘digital archive’. He actually made a point a few minutes into his talk to say he would stop using the word archive, and use the term collection instead, in the interest of not having tomatoes thrown at him by his archivist audience.

The lack of appraisal brought a question at the end of the session: where does that leave archivists who believe that appraisal is part of the foundation of archival practice? The answer was that we have the space – so why not keep it all? Dan gave an example of a colleague who had written extensively based on research into World War II rumors found in the Library of Congress. These easily could have been discarded as unimportant – but you never know how information you keep may be used later. He told a story about how they noticed that some people are using the 9/11 digital archive as a place to research teen slang, because it contains such a deep collection of teen narratives submitted to the archive.

This reminded me of a story that Prof. Bruce Ambacher told us during his Archival Principles, Practices and Programs course at UMD. During the design phase for the new National Archives building in College Park, MD, the Electronic Records division was asked how much room they needed for future records. Their answer was none. They believed that the space required to store digital data was shrinking faster than the volume of new records coming into the archive was growing. One of the driving forces behind the strong arguments for appraisal in US archives was the sheer bulk of records that could not possibly all be kept. While I know that I am oversimplifying the arguments for and against appraisal (Jenkinson vs. Schellenberg, etc.), it is interesting to take a fresh look at this in light of removing the challenges of storage.

Dan also addressed some interesting questions about the needs of 'digital scholarship'. They got zip codes from 60% of the submissions to the 9/11 archive – and they hope to improve the accuracy and completeness of GIS information in the hurricane archive by using Google Maps' new feature that pinpoints latitude and longitude based on an address or intersection. He showed us some interesting analysis made possible by pulling slices of data out of the 9/11 archive and placing them as layers on a Google Map. In the world of mashups, one can see this as an interesting and exciting new avenue for research. I will update this post with links to the details he has promised to post on his website about how to do this sort of analysis with Google Maps. There will soon be a researchers' interface of some kind available at the 9/11 archive (I believe in sync with the five-year anniversary of September 11).
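
It is easy to imagine something along these lines for the zip code data – a sketch (with invented zip centroids and sample submissions, not the archive's actual data) that counts submissions per zip code and writes a small KML file that Google Maps or Google Earth could display as a layer:

```python
from collections import Counter

# Invented zip -> (latitude, longitude) centroids; a real project would use a
# proper geocoding lookup for every zip code it sees.
ZIP_CENTROIDS = {
    "10013": (40.720, -74.005),
    "11201": (40.694, -73.990),
}

# Invented sample submissions standing in for archive entries.
submissions = [
    {"id": 1, "zip": "10013"},
    {"id": 2, "zip": "10013"},
    {"id": 3, "zip": "11201"},
]

counts = Counter(s["zip"] for s in submissions if s["zip"] in ZIP_CENTROIDS)

placemarks = []
for z, n in counts.items():
    lat, lon = ZIP_CENTROIDS[z]
    placemarks.append(
        f"<Placemark><name>{z}: {n} submissions</name>"
        f"<Point><coordinates>{lon},{lat},0</coordinates></Point></Placemark>"
    )

kml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    + "".join(placemarks)
    + "</Document></kml>"
)

with open("submissions_by_zip.kml", "w") as f:
    f.write(kml)
```
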
Near the end of the session a woman took a moment to thank them for taking the initiative to create the 9/11 archive. She pointed out that much of what is in archives across the US today is the result of individuals choosing to save and collect things they believed to be important. The woman who had originally asked about the place of appraisal in a ‘keep everything digital world’ was clapping and nodding and saying ‘she’s right!’ as the full room applauded.

So – keep it all. Snatch it up before it disappears (there were fun stats, like the fact that most blogs remain active for 3 months, most email addresses last about 2 years and inactive Yahoo Groups are deleted after 6 months). There is likely a place for 'curatorial views' of the information, created by those who evaluate the contents of the archive – but why assume that something isn't important? I would imagine that as computers become faster and programming becomes smarter – if we keep as much as we can now, we can perhaps automate the sorting later with expert systems that follow very detailed rules for creating more organized views of the information for researchers.
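Purely as a sketch of what those 'very detailed rules' might look like (the rules, field names and sample items are all invented), a tiny rule list that sorts kept-everything items into named views:

```python
# Each rule pairs a view name with a predicate over an item's metadata.
RULES = [
    ("firsthand accounts", lambda item: item["kind"] == "narrative"),
    ("images",             lambda item: item["kind"] == "photo"),
    ("teen slang study",   lambda item: item["kind"] == "narrative"
                                        and item.get("age", 99) < 20),
]

def build_views(items):
    """Apply every rule to every item; an item may land in several views."""
    views = {name: [] for name, _ in RULES}
    for item in items:
        for name, matches in RULES:
            if matches(item):
                views[name].append(item["id"])
    return views

sample = [
    {"id": "a1", "kind": "narrative", "age": 17},
    {"id": "b2", "kind": "photo"},
]
print(build_views(sample))
```

The interesting part would be letting curators write and refine the rules themselves, rather than hard-coding them as I have here.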

This panel had so many interesting themes that crossed over into other panels throughout the conference. The Maine archivist talking about 'stopping the bleeding' of digital data loss in his talk about the Maine GeoArchives. The panel on blogging (which I will write more about in a future post). The RLG Roundtable, with presentations from people over at the Internet Archive and their talks about archiving everything (which also deserves its own future post).

I feel guilty for not managing to touch on everything they spoke about – it really was one of the best sessions I attended at the conference. I think that having voices from outside the archival profession represented is both a good reality check and great for the cross-pollination of ideas. Roy and Dan have recently published a book titled Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web – definitely on my 'to be read' list.