
Year: 2012

Election Eve: Fighting for the Right to Vote

In less than six hours, the polls in Maryland will open for the 2012 general election. Here on ‘election eve’ in the United States of America, I wanted to share some records of those who fought to gain the right to vote for all throughout the USA. Some of these you may have seen before – but I did my best to find images, audio, and video that may not have crossed your path. Why do we have these? In most cases it is because an archive kept them.

Of course I couldn’t do this post without including some of the great images out there of suffragists, but I bet you didn’t know that they had Suffrage Straw Rides.

Or perhaps Suffrage Dancers?

Here we see a group from the Suffrage Hike to Albany, NY in 1914.

Fast forward to the 1960s and the tone shifts. In this excerpt from a telegram sent to President Kennedy in 1961, civil rights activist James Farmer reports on an attack on a bus of Freedom Riders:

We also find images like this one of the leaders of the 1963 Civil Rights March on Washington, DC:

In Alabama from 1964 to 1965, a complicated voter registration process was in place to discourage registration of African-American voters. If you click through you can see a sample of one of these multi-page voter registration forms. In a different glimpse of what voter suppression looked like, listen to Theresa Burroughs tell her daughter Toni Love about registering to vote in this StoryCorps recording:

Finally, you can watch Lyndon B. Johnson’s remarks on the signing of the Voting Rights Act on August 6th, 1965.

These records just scratch the surface, but at least they give you a taste of the hard work by so many that has gone into gaining the right to vote for all in the United States. If you are a registered voter in the USA, please honor this hard work by exercising your right to vote at the polls Tuesday!

Harnessing The Power of We: Transcription, Acquisition and Tagging

In honor of the Blog Action Day for 2012 and their theme of ‘The Power of We’, I would like to highlight a number of successful crowdsourced projects focused on the transcription, acquisition and tagging of archival materials. Nothing I can think of embodies ‘the power of we’ more clearly than the work being done by many hands from across the Internet.

Transcription

  • Old Weather Records: “Old Weather volunteers explore, mark, and transcribe historic ship’s logs from the 19th and early 20th centuries. We need your help because this task is impossible for computers, due to diverse and idiosyncratic handwriting that only human beings can read and understand effectively. By participating in Old Weather you’ll be helping advance research in multiple fields. Data about past weather and sea-ice conditions are vital for climate scientists, while historians value knowing about the course of a voyage and the events that transpired. Since many of these logs haven’t been examined since they were originally filled in by a mariner long ago you might even discover something surprising.”
  • From The Page: “FromThePage is free software that allows volunteers to transcribe handwritten documents on-line.” A number of different projects are using this software including: The San Diego Museum of Natural History’s project to transcribe the field notes of herpetologist Laurence M. Klauber and Southwestern University’s project to transcribe the Mexican War Diary of Zenas Matthews.
  • National Archives Transcription: as part of the National Archives Citizen Archivist program, individuals have the opportunity to transcribe a variety of records. As described on the transcription home page: “letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files”.

Acquisition:

  • Archive Team: The ArchiveTeam describes itself as “a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage.” Here is an example of the information gathered, shared and collaborated on by the ArchiveTeam focused on saving content from Friendster. The rescued data is (whenever possible) uploaded to the Internet Archive:

    Springing into action, Archive Team began mirroring Friendster accounts, downloading all relevant data and archiving it, focusing on the first 2-3 years of Friendster’s existence (for historical purposes and study) as well as samples scattered throughout the site’s history – in all, roughly 20 million of the 112 million accounts of Friendster were mirrored before the site rebooted.

Tagging:

  • National Archives Tagging: another part of the Citizen Archivist project encourages tagging of a variety of records, including images of the Titanic, architectural drawings of lighthouses and the Petition Against the Annexation of Hawaii from 1898.
  • Flickr Commons: throughout the Flickr Commons, archives and other cultural heritage institutions encourage tagging of images.

These are just a taste of the crowdsourced efforts currently being experimented with across the internet. Did I miss your favorite? Please add it below!

UNESCO/UBC Vancouver Declaration

In honor of the 2012 Day of Digital Archives, I am posting a link to the UNESCO/UBC Vancouver Declaration. This is the product of the recent Memory of the World in the Digital Age conference and they are looking for feedback on this declaration by October 19th, 2012 (see the link on the conference page for sending in feedback).

To give you a better sense of the aim of this conference, here are the ‘conference goals’ from the programme:

The safeguard of digital documents is a fundamental issue that touches everyone, yet most people are unaware of the risk of loss or the magnitude of resources needed for long-term protection. This Conference will provide a platform to showcase major initiatives in the area while scaling up awareness of issues in order to find solutions at a global level. Ensuring digital continuity of content requires a range of legal, technological, social, financial, political and other obstacles to be overcome.

The declaration itself is only four pages long and includes recommendations to UNESCO, member states and industry. If you are concerned with digital preservation and/or digitization, please take a few minutes to read through it and send in your feedback by October 19th.

CURATEcamp Processing 2012

CURATEcamp Processing 2012 was held the day after the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA) sponsored Digital Preservation annual meeting.

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote down a title and short description on a paper and taped it to the wall. These were then reviewed, rearranged on the schedule and combined where appropriate until we had our full final schedule. More than half the sessions on the schedule have links through to notes from the session. There were four session slots, plus a noon lunch slot of lightning talks.

Session I: At Risk Records in 3rd Party Systems. This was the session I had proposed, combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created – one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion including these assorted points:

  • Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
  •  The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak for preserving every season’s programs since 1842.
  • There are consequences to not having and following a ‘file plan’. Part of people’s jobs have to be to follow the rules.
  • What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
  • “Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
  • When should archivists be in the process? At the start – before content is created, before systems are created?
  • Keep the original data AND keep updated data. Document everything, data sources, processes applied.

Session II: Automating Review for Restrictions? This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that the challenge of reviewing records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘Easy’ category was for well defined content that follows rules we could imagine teaching a computer to identify — things like US social security numbers, passport numbers or credit card numbers. The ‘Hard’ category was for restrictions that involve more human judgement. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required – and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so – drop a comment below please!
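To make the ‘easy’ category concrete, here is a tiny sketch of what one of those modules might look like. The session didn’t get into implementation details, so the patterns, names and sample text below are purely my own illustration and would need serious tuning (and real false-positive testing) before anyone relied on them:

```python
import re

# Candidate patterns for "easy" restrictions; illustrative only and
# deliberately loose -- real collections would need much more careful rules.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def screen_text(text):
    """Return a list of (restriction_type, matched_text) hits for one document."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits

if __name__ == "__main__":
    sample = "Contact: 123-45-6789, card 4111 1111 1111 1111"
    for kind, value in screen_text(sample):
        print(kind, value)
```

The appeal of small modules like this is exactly what the group described: each one can be reviewed, blessed and reused independently, then stacked together to match whatever set of restrictions a given series requires.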

Lunchtime Lightning Talks: At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a means for Preservation. In this session we discussed the question posed in the session proposal: “How can we leverage virtualization for large-scale, robust preservation?” Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important it is to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization. This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts
The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki) – but it is my impression that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and definitely solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.

MARAC Spring 2012: Preservation of Digital Materials (Session S1)


The official title for this session is “Preservation and Conservation of Captured and Born Digital Materials” and it was divided into three presentations with introduction and question moderation by Jordon Steele, University Archivist at Johns Hopkins University.

Digital Curation, Understanding the lifecycle of born digital items

Isaiah Beard, Digital Data Curator at Rutgers, started out with the question ‘What Is Digital Curation?’. He showed a great Dilbert cartoon on digital media curation and a set of six photos showing different perspectives on what digital curation really is (a la the ‘what I really do’ meme – here is one for librarians).

“The curation, preservation, maintenance, collection and archiving of digital assets.” — Digital Curation Centre.

What does a Digital Curator do?

Acquire digital assets:

  • digitized analog sources
  • assets that were born digital, no physical analog exists

Certify content integrity:

  • workflow and standards and best practices
  • train staff on handling of the assets
  • perform quality assurance

Certify trustworthiness of the architecture:

  • vet codecs and container/file formats – must make sure that we are comfortable with the technology, hardware and formats
  • active role in the storage decisions
  • technical metadata, audit trails and chain of custody

Digital assets are much easier to destroy than physical objects. Physical objects can be stored, left behind, forgotten and ‘rediscovered’; digital objects are more fragile, and just one keystroke or application error can destroy them. Casual collectors typically delete what they don’t want with no sense of a need to retain the content. People need to be made aware that the content might be important long term.

Digital assets are dependent on file formats and hardware/software platforms. More and more people are capturing content on mobile devices and uploading it to the web. We need to be aware of the underlying structure. File formats are proliferating over time. Sound files come in 27 common file formats and 90 common codecs. Moving image files come in 58 common containers/codecs and carry audio tracks in those same 27 file formats/90 common codecs.
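As a hedged aside (no specific tooling was mentioned in the talk), one practical way to take stock of what formats and containers a collection actually holds is to run something like the python-magic library (a wrapper around libmagic, which must be installed separately) across an incoming directory. The directory name below is made up:

```python
import os
import magic  # python-magic; requires the libmagic system library

def inventory_formats(root):
    """Walk a directory tree and count files by detected MIME type."""
    counts = {}
    detector = magic.Magic(mime=True)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mime = detector.from_file(os.path.join(dirpath, name))
            counts[mime] = counts.get(mime, 0) + 1
    return counts

if __name__ == "__main__":
    for mime, count in sorted(inventory_formats("incoming_assets").items()):
        print(f"{count:6d}  {mime}")
```

A preservation-focused identification tool such as DROID or JHOVE would give richer, registry-backed answers; this is just a quick way to get a rough census before deciding what needs closer attention.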

Digital assets are vulnerable to format obsolescence — examples include WordPerfect (1979), Lotus 1-2-3 (1978) and dBase (1978). We need to find ways to migrate from the old format to something researchers can use.

Physical format obsolescence is a danger — examples include tapes, floppy disk, zip disk, IBM demi-disk and video floppy. There is a threat of a ‘digital dark age’. The cloud is replacing some of this pain – but replacing it with a different challenge. People don’t have a sense of where their content is in the physical world.

Research data is the bleeding edge. Datasets come in lots of different flavors. Lots of new and special file formats relate specifically to scientific data gathering and reporting… a long list including things like GRIB (for meteorological data), SUR (MRI data), DWG (for CAD data), SPSS (for statistical data from the social sciences) and on and on. You need to become a specialist, project by project, in how to manage the research data to keep it viable.

There are ways to mitigate the challenges through predictable use cases and rigid standards. Most standard file types are known quantities. There is a built-in familiarity.

File format support: Isaiah showed a grid with one axis Open vs Closed and the other Free vs Proprietary. Expensive proprietary software that does the job so well that it is the best practice and assumed format for use can be a challenge – but it is hard to shift people from using these types of solutions.

Digital Curation Lifecycle

  • Objects are evaluated, preserved, maintained, verified and re-evaluated
  • iterative – the cycle doesn’t end with doing it just once
  • Good exercise for both known and unknown formats

The diagram from the slide shows layers; it looks like a diagram of the geologic layers of the earth.

Steps:

  • data is the center of the universe
  • plan, describe, evaluate, learn meanings.
  • ingest, preserve, curate
  • continually iterate

Controlled chaos! Evaluate the collection and the needs of the digital assets. Use preservation grade tools to originate assets. Take stock of the software, systems and recording apparatus. Describe it in the technical metadata so we know how it originated. We need to pick our battles and use de facto industry standards. Sometimes those standards drive us to choices we wouldn’t pick on our own. Example: Final Cut Pro, even though it is Mac and proprietary.

Establish a format guide and handling procedures. Evaluate the veracity and longevity of the data format. Document and share our findings. Help others keep from needing to reinvent the wheel.

Determine method of access: How are users expected to access and view these digital items? What software/hardware is required? View online with a plug-in? Third-party software?

Primary guideline: do no harm to the digital assets.

  • preservation masters, derivatives as needed
  • content modification must be done with extreme care
  • any changes must be traceable, auditable, reversible.

Prepare for the inevitable: more format migrations. Re-assess the formats and migrate to new formats when the old ones are obsolete. Maintain accessibility while ensuring data integrity.

At Rutgers they have the RUcore Community Repository which is open source, and based on FEDORA. It is dedicated to the digital preservation of multiple digital asset types and contains 26,238 digital assets (as of April 2012). Includes audio, video, still images, documents and research data. Mix of digital surrogates and born digital assets.

Publicly available digital object standards are available for all traditional asset types. They define baseline quality requirements for ‘preservation grade’ files and are periodically reviewed and revised as tech evolves. See Rutgers’ Page2Pixel Digital Curation standards.

They use a team approach as they need to triage new asset types. Do analysis and assessment. Apply holistic data models and the preservation lifecycle and continue to publish and share what they have done. Openness is paramount and key to the entire effort.

More resources:

The Archivist’s Dilemma: Access to collections in the digital era

Next, Tim Pyatt from Penn State spoke about ‘The Archivist’s Dilemma’ — starting with examples of how things are being done at Penn State, but then moving on to show examples of other work being done.

There are lots of different ways of putting content online. Penn State’s digital collections are published online via CONTENTdm, Flickr, social media and Penn State IR tools. The University Faculty Senate puts things up on their own; some material is in the Internet Archive; some lives on a custom built platform. You need to think about how the researcher is going to approach this content.

With analog collections that have portions digitized, they describe both but include a link to the digital collection. These link through to a description of the digital collection, and then to CONTENTdm for the collection itself.

Examples from Penn State:

  • A Google search for College of Agricultural Science Publications leads users to a complementary/competing site with no link back to the catalog or any descriptive/contextual information.
  • Next, we were shown the finding aid for the William W. Scranton Papers from Penn State. They also have images up on Flickr as ‘William W. Scranton Papers’. Flickr provides easy access, but acts as another content silo. It is crucial to have metadata in the header of the file to help people find their way back to the originating source. Google Analytics showed them that content is seen 8x more often in Flickr than in CONTENTdm.
  • The Judy Chicago Art Education Collection is a hybrid collection. The finding aid has a link to the curriculum site. There is a separate site for the Judy Chicago Art Education Collection more focused on providing access to her education materials.
  • The University Curriculum Archive is a hybrid collection with a combination of digitized old course proposals, while the past 5 years of curriculum have been born digital. They worked with IT to build a database to commingle the digitized & born digital files. It was custom built and not integrated into other systems – but at least everything is in one place.

Examples of what is being done at other institutions:

Penn State is loading up a Hydra repository for their next wave!

Born-Digital @UVa: Born Digital Material in Special Collections

Gretchen Gueguen, UVA

Presentation slides available for download.

AIMS (An Inter-Institutional Model for Stewardship) born digital collections: a 2 year project to create a framework for the stewardship of born-digital archival records in collecting repositories. Funded by Andrew W. Mellon Foundation with partners: UVA, Stanford, University of Hull, and Yale. A white paper on AIMS was published in January 2012.

Parts of the framework: collection development, accessioning, arrangement & description, discovery & access are all covered in the whitepaper – including outcomes, decision points and tasks. The framework can be used to develop an institutionally specific workflow. Gretchen showed an example objective ‘transfer records and gain administrative control’ and walked through outcome, decision points and tasks.

Back at UVA, their post-AIMS strategizing is focusing on collection development and accessioning.

In the future, they need to work on Agreements: copyright, access & ownership policies and procedures. People don’t have the copyright for a lot of the content that they are trying to donate. This makes it harder, especially when you are trying to put content online. You need to define exactly what is being donated. With born digital content, content can be donated multiple places. Which one is the institution of record? Are multiple teams working on the same content in a redundant effort?

Need to create a feasibility evaluation to determine systematically if something is worth collecting. It should include:

  • file formats
  • hardware/software needs
  • scope
  • normalization/migration needed?
  • private/sensitive information
  • third-party/copyrighted information?
  • physical needs for transfer (network, storage space, etc.)

If you decide it is feasible to collect, how do you accomplish the transfer with uncorrupted data, support files (like fonts, software, databases) and ‘enhanced curation’? You may need a ‘write blocker’ to make sure you don’t change the content just by accessing the disk. You may want to document how the user interacted with their computer and software. Digital material is very interactive – you need to have an understanding of how the user interacted with it. Might include screen shots.

Next she showed their accessioning workflow (a small sketch of the checksum and de-duplication steps follows the list):

  • take the files
  • create a disk image – bit for bit copy – makes the preservation master
  • move that from the donor’s computer to their secure network with all the good digital curation stuff
  • extract technical metadata
  • remove duplicates
  • may not take stuff with PII
  • triage if more processing is necessary
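As promised above, here is a minimal sketch of the fixity and duplicate-detection steps. This is not UVA’s actual workflow (they use forensic imaging tools for this); the directory name is hypothetical, and a real accession would record these checksums in metadata rather than just printing them:

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 fixity value without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    """Group files under root by checksum; any group larger than one is a duplicate set."""
    by_hash = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash.setdefault(sha256_of(path), []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    for checksum, paths in find_duplicates("extracted_disk_image").items():
        print(checksum[:12], paths)
```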

Be ready for surprises – lots of things that don’t fit the process:

  • 8″ floppy disk
  • badly damaged CD
  • disk no longer functions – afraid to throw away in case of miracle
  • hard drive from 1999
  • mini disks

These get no special notation in the accessioning records.

Priorities with this challenging material:

  • get the data off aging media
  • put it someplace safe and findable
  • inventory
  • triage
  • transfer

Forensic Workstation:

  • FRED = Forensic Recovery of Evidence Device: a built-in ultra bay write blocker with USB, FireWire, SATA, SCSI and IDE connections and Molex for power; an external 5.25″ floppy drive, CD/DVD/Blu-ray drive, microcard reader, LTO tape drive and external 3.5″ drive, plus an external hard drive for additional storage.
  • toolbox
  • big screen

FRED’s FTK software gives you an overview of what is there, recognizes thousands of file formats, finds deleted data and duplicates, and can identify PII. It is very useful for description and for selecting what to accession – but it costs a lot and requires an annual license.

BitCurator is creating an open source alternative. From their website: “The BitCurator Project is an effort to build, test, and analyze systems and software for incorporating digital forensics methods into the workflows of a variety of collecting institutions.”

Archivematica:

  • creates a PREMIS record (a preservation metadata standard) documenting what activities were performed – see the sketch after this list
  • creates derivative records – migration!!
  • yields a preservation master + access copies to be provided in the reading room
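Since PREMIS came up, here is a very rough sketch of what recording a migration event can look like. This is simplified and not schema-validated, it is not what Archivematica itself emits, and the event detail text is made up:

```python
import datetime
import uuid
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)

def premis_event(event_type, detail):
    """Build a minimal, simplified PREMIS event element for an action such as 'migration'."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    identifier = ET.SubElement(event, f"{{{PREMIS_NS}}}eventIdentifier")
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}eventIdentifierType").text = "UUID"
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}eventIdentifierValue").text = str(uuid.uuid4())
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventType").text = event_type
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventDateTime").text = datetime.datetime.now().isoformat()
    detail_info = ET.SubElement(event, f"{{{PREMIS_NS}}}eventDetailInformation")
    ET.SubElement(detail_info, f"{{{PREMIS_NS}}}eventDetail").text = detail
    return event

if __name__ == "__main__":
    element = premis_event("migration", "Normalized word processing document to PDF/A")
    print(ET.tostring(element, encoding="unicode"))
```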

They are hoping for a Hypatia-like system in the future.

Final words: Embrace your inner nerd! Experiment – you have nothing to lose. If you do nothing you will lose the records anyway.

Questions and Answers

QUESTION: How do you convince your administration that this needs to be a priority?

ANSWER:

Isaiah: Find examples of other institutions that are doing this. Show them that our history is at risk moving forward. A digital dark age is coming if we don’t do something now. It is really important that we show people “this is what we need to preserve”

Tim: Figure out who your local partners are. Who else has a vested interest in this content? IT at Penn State was happy that they didn’t need to keep everything – happy that there is an appraisal process and that the archives is preserving content so it doesn’t need to be kept by everyone. I am one of the authors of an upcoming Association of Research Libraries SPEC Kit on managing electronic records, due at the end of the summer.

Gretchen: Numbers are really useful. Sometimes you don’t think about it, but it is a good practice to count the size of what you created. How much time would it take to recreate it if you lost it? How many people have used the content? Get some usage stats. Who is your rival and what are their statistics?

Jordon: Point to others who you want to keep up with

QUESTION: would the panelists like to share experiences with preserving dynamic digital objects like databases?

ANSWER:

Isaiah: We don’t want to embarrass people. We get so many different formats. It is a trial and error thing. You need to say gently that there is a better way to do this. Sad example: DVDs burned from tapes in 2004 that arrived in 2007. The DVDs were not verified and were stored badly, in a hot warehouse. They opened the boxes and found unreadable, delaminating DVDs.

Tim: From my Duke days, we had a number of faculty data sets in proprietary formats. We would do checksums on them, wrap them up and put them in the repository. They are there, but who knows if anyone will be able to read them later. Same as with paper – preserve them now on good acid-free paper.

Gretchen: My 19-year-old student held up a Zip disk and said “Due to my extreme youth I don’t know what this is!” (And now you know why there is a photo of a Zip disk at the top of this post – your reward for reading all the way to the end!)

Image Credit: ‘100MB Zip Disc for Iomega Zip, Fujifilm/IBM-branded’ taken by Shizhao

As is the case with all my session summaries from MARAC, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Digitization Quality vs Quantity: An Exercise in Fortune Telling

The quality vs quantity dilemma is high in the minds of those planning major digitization projects. Do you spend your time and energy creating the highest quality images of your archival records? Or do you focus on digitizing the largest quantity you can manage? Choosing one over the other has felt a bit like an exercise in fortune telling to me over the past few months, so I thought I would work through at least a few of the moving parts of this issue here.

The two ends of the spectrum are traditionally described as follows:

  • digitize at very high quality to ensure that you need not re-digitize later, create a high quality master copy from which all possible derivatives can be created later
  • digitize at the minimum quality required for your current needs, the theory being that this will increase the quantity of records you can digitize

This all sounds well and good on the surface, but it is not nearly as black and white a question as it appears. It is not the case that one can simply choose one over the other. I suppose that choosing ‘perfect quality’ (whatever that means) probably drives the most obvious of the digitization choices. Highest resolution. 100% accurate transcription. 100% quality control.

It is the rare digitization project that has the luxury of time and money required to aim for such a definition of perfect. At what point would you stop noticing any improvement, while just increasing the time it takes to capture the image and the disk space required to store it? 600 DPI? 1200 DPI? Scanners and cameras keep increasing the dots per inch and the megapixels they can capture. Disk space keeps getting cheaper. Even at the top of the ‘perfect image’ spectrum you have to reach a point of saying ‘good enough’.
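To put some rough numbers on that trade-off, here is a back-of-the-envelope calculation for an uncompressed 24-bit color scan of a letter-size page. Real master files are usually compressed (and your mileage will vary), so treat these as upper bounds:

```python
def uncompressed_mb(width_in, height_in, dpi, bytes_per_pixel=3):
    """Rough uncompressed size in megabytes for a 24-bit color scan."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bytes_per_pixel / (1024 ** 2)

# Compare per-page storage at several common scanning resolutions.
for dpi in (300, 400, 600, 1200):
    size = uncompressed_mb(8.5, 11, dpi)
    print(f"{dpi:5d} dpi  ~{size:7.1f} MB per page")
```

Doubling the resolution quadruples the pixel count, so the jump from ‘good enough’ to ‘perfect’ is never free, even with disk prices falling.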

When you consider the choices one might make short of perfect, you start to get into a gray area in which the following questions start to crop up:

  • How will lower quality images impact OCR accuracy?
  • Is one measure of lower quality simply a lower level of quality assurance (QA) to reduce the cost and increase the throughput?
  • How will expectations of available image resolution evolve over the next five years? What may seem ‘good enough’ now, may seem grainy and sad in a few years.
  • What do we add to the images to improve access? Transcription? TEI? Tagging? Translation?
  • How bad is it if you need to re-digitize something that is needed at a higher resolution on demand? How often will that actually be needed?
  • Will storing in JPEG2000 (rather than TIFF) save enough money from reduced disk space to make it worth the risk of a lossy format? Or is ‘visually lossless’ good enough?

Even the question of OCR accuracy is not so simple. In the D-Lib Magazine article Measuring Mass Text Digitization Quality and Usefulness (July/August 2009 issue), the authors list multiple types of accuracy which may be measured (a rough sketch of computing two of these follows the list):

  • Character accuracy
  • Word accuracy
  • Significant word accuracy
  • Significant words with capital letter start accuracy (i.e. proper nouns)
  • Number group accuracy
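As a rough illustration (and only that), two of those measures could be approximated against a hand-corrected ground truth with a naive sequence comparison. Serious evaluations use proper alignment tooling, and the sample strings below are invented:

```python
import difflib

def character_accuracy(truth, ocr):
    """Ratio of matching characters between ground truth and OCR output."""
    matcher = difflib.SequenceMatcher(None, truth, ocr)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(truth), 1)

def word_accuracy(truth, ocr):
    """Fraction of ground-truth words that survive, in order, in the OCR output."""
    truth_words, ocr_words = truth.split(), ocr.split()
    matcher = difflib.SequenceMatcher(None, truth_words, ocr_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(truth_words), 1)

if __name__ == "__main__":
    truth = "Significant words like Washington matter most to researchers"
    ocr = "Signifcant words like Washinqton matter most to researchers"
    print(f"character accuracy: {character_accuracy(truth, ocr):.2%}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.2%}")
```

The gap between the two numbers is the point of the article: a page can score well on characters while still dropping exactly the significant words a researcher would search for.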

So many things to consider!

The primary goal of the digitization project I am focused on is to increase access to materials for those unable to travel to our repository. As I work with my colleagues to navigate the choices, I find myself floating towards the side of ‘good enough’ across the board. Even deciding that this blog post is done has taken longer than I meant it to. I publish it tonight with the hope of putting a line in the sand and moving forward with the conversation. For me, it all comes back to what you are trying to accomplish.

I would love to hear about how others are weighing all these choices. How often have long term digitization programs shifted their digitization standards? What aspects of your goals are most dramatically impacting your priorities on the quality vs quantity scale?

Image Credit: Our lovely fortune teller is an image from the George Eastman House collection in the Flickr Commons, taken by Nickolas Muray in 1940 for use by McCall’s Magazine. [UPDATED 1/6/2019: Image no longer on Flickr, but is available in the Eastman Museum online collection.]

Digitization Program Site Visit: Archives of American Art

The image of Alexander Calder above shows him in his studio, circa 1950. It is from a folder titled Photographs: Calder at Work, 1927-1956, undated, part of Alexander Calder’s Papers held by the Smithsonian Archives of American Art and available online through the efforts of their digitization project. I love that this image captures him in his creative space – you get to see the happy chaos from which Calder drew his often sleek and sparse sculptures.

Back in October, I had the opportunity to visit with staff of the digitization program for the Smithsonian Archives of American Art along with a group of my colleagues from the World Bank. This is a report on that site visit. It is my hope that these details can help others planning digitization projects – much as it is informing our own internal planning.

Date of Visit: October 18, 2011

Destination: Smithsonian Archives of American Art

Smithsonian Archives of American Art Hosts:

Summary:  This visit was two hours in length and consisted of a combination of presentation, discussion and site tour to meet staff and examine equipment.

Background: The Smithsonian Archives of American Art (AAA) digitization program was first funded by a grant from the Terra Foundation for American Art in 2005, recently extended through 2016. This funding supports both staff and research.

Their digitization project replaced their existing microfilm program and focuses on digitizing complete collections. Digitization focuses on in-house collections (in contrast with collections captured on microfilm from other institutions across the USA as part of the earlier microfilm program).

Over the course of the past 6 years, they have scanned over 110 collections – a total of 1,000 linear feet – out of an available total of 13,000 linear feet from 4,500 collections. They keep a prioritized list of what they want digitized.

The Smithsonian DAM (digital asset management system) had to be adjusted to handle the hierarchy of EAD and the digitized assets. Master files are stored in the Smithsonian DAM. Files stored in intermediate storage areas are only for processing and evaluation and are disposed of after they have been ingested into the DAM.

Current staffing is two and a half archivists and two digital imaging specialists. One digital imaging specialist focuses on scanning full collections, while the other focuses on on-demand single items.

The website is built in ColdFusion and pulls content from a SQL database. Currently they have no way to post media files (audio, oral histories, video) on the external web interface.

They do not delineate separate items within folders. When feedback comes in from end users about individual items, this information is usually incorporated into the scope note for the collection, or the folder title of the folder containing the item. Full size images in both the image gallery and the full collections are watermarked.
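They didn’t describe the watermarking tooling itself, so purely as an illustration of the idea (the filenames and overlay text below are invented), a Pillow-based watermark might be applied like this:

```python
from PIL import Image, ImageDraw  # Pillow

def watermark(source_path, output_path, text="Archives of American Art"):
    """Stamp semi-transparent text across the center of a scanned image."""
    base = Image.open(source_path).convert("RGBA")
    overlay = Image.new("RGBA", base.size, (255, 255, 255, 0))
    draw = ImageDraw.Draw(overlay)
    # anchor="mm" centers the text on the given point (requires Pillow 8.0+)
    draw.text((base.width // 2, base.height // 2), text,
              fill=(255, 255, 255, 110), anchor="mm")
    Image.alpha_composite(base, overlay).convert("RGB").save(output_path, quality=90)

if __name__ == "__main__":
    watermark("box01_folder02_0001.tif", "box01_folder02_0001_access.jpg")
```

The important design point is that the watermark is applied only to the derivative posted to the web; the preservation master stays untouched in the DAM.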

They track the processing stats and status of their projects.

Standard Procedures:

Full Collection Digitization:

  • Their current digitization workflow is based on their microfilm process. The workflow is managed via an internal web-based management system. Every task required for the process is listed, then crossed off and annotated with the staff and date the action was performed.
  • Collections earmarked for digitization are thoroughly described by a processing archivist.
  • Finding aids are encoded in EAD and created in XML using NoteTab Pro software.
  • MARC records are created when the finding aid is complete. The summary information from the MARC record is used to create the summary of the collection published on the website.
  • Box numbers and folder numbers are assigned and associated with a finding aid. The number of the box and folder are all a scanning technician needs.
  • A ‘scanning information worksheet’ provides room for notes from the archivist to the scanning technician. It provides the opportunity to indicate which documents should not be scanned. Possible reasons for this are duplicate documents or those containing personally identifiable information (PII).
  • A directory structure is generated by a script based on the finding aid, creating a directory folder for each physical folder which exists for the collection (a rough sketch of this idea follows the list). Images are saved directly into this directory structure. The disk space to hold these images is centrally managed by the Smithsonian and automatically backed up.
  • All scanning is done in 600 dpi color, according to their internal guidelines. They frequently have internal projects which demand high resolution images for use in publication.
  • After scanning is complete, the processing archivist does the post scanning review before the images are pushed into the DAM for web publication.
  • Their policy is to post everything from a digitized collection, but they do support a take-down policy.
  • A recent improvement came in January 2010, when they relaunched the site to list all of their collections together, both digitized and non-digitized.
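They didn’t share the directory-generation script itself, so here is a minimal sketch of the idea. It assumes a simplified EAD layout in which each component’s <did> carries box and folder container elements, ignores namespace details beyond stripping them, and uses invented file and directory names:

```python
import os
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def make_scan_directories(ead_path, destination):
    """Create one directory per box/folder pair found in a (simplified) EAD finding aid."""
    tree = ET.parse(ead_path)
    for component in tree.iter():
        if local_name(component.tag) not in ("c", "c01", "c02", "c03"):
            continue
        # Containers live in the component's own <did>, not in nested components.
        did = next((child for child in component if local_name(child.tag) == "did"), None)
        if did is None:
            continue
        containers = {}
        for element in did.iter():
            if local_name(element.tag) == "container":
                containers[(element.get("type") or "").lower()] = (element.text or "").strip()
        if containers.get("box") and containers.get("folder"):
            directory = os.path.join(
                destination,
                f"box_{containers['box'].zfill(3)}",
                f"folder_{containers['folder'].zfill(3)}",
            )
            os.makedirs(directory, exist_ok=True)

if __name__ == "__main__":
    make_scan_directories("collection_finding_aid.xml", "scanning_workspace")
```

Generating the directories straight from the finding aid is what lets a scanning technician work from nothing more than a box number and folder number, as described above.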

On Demand Digitization:

  • Patrons may request the digitization of individual items.
  • These requests are evaluated by archivists to determine if it is appropriate to digitize the entire folder (or even box) to which the item belongs.
  • Requests are logged in a paper log.
  • Item level scanning ties back to an item level record with an item ID. An ‘Online Removal Notice’ is used to create the item level stub.
  • An item level cataloger describes the content after it is scanned.
  • Unless there is an explicit copyright or donor restriction, the item is put online in the Image Gallery (which currently has 12,000 documents).
  • Access to images is provided by keyword searching.
  • Individual images are linked back to the archival description for the collection from which they came.

Improvements/Changes they wish for:

  • They currently have no flexibility to make changes in the database nimbly. It is a tedious process to change the display and each change requires a programmer.
  • They would like to consider a move to open source software or to use a central repository – though they have concerns about what other sacrifices this would require.
  • Show related collections, list connected names (currently the only options for discovery are an A-Z list of creators or keyword search).
  • Ability to connect to guides and other exhibits.

References:

Image Credit: Alexander Calder papers, Archives of American Art, Smithsonian Institution.