MARAC | Spellbound Blog

MARAC Spring 2012: Preservation of Digital Materials (Session S1)

May 3, 2012

The official title for this session is “Preservation and Conservation of Captured and Born Digital Materials” and it was divided into three presentations with introduction and question moderation by Jordon Steele, University Archivist at Johns Hopkins University.

Digital Curation, Understanding the lifecycle of born digital items

Isaiah Beard, Digital Data Curator from Rutgers, started out with the question ‘What Is Digital Curation?’. He showed a great Dilbert cartoon on digital media curation and the set of six photos showing all different perspectives on what digital curation really is (a la the ‘what I really do’ meme – here is one for librarians).

“The curation, preservation, maintenance, collection and archiving of digital assets.” — Digital Curation Centre.

What does a Digital Curator do?

Acquire digital assets:

digitized analog sources
assets that were born digital, no physical analog exists

Certify content integrity:

workflow and standards and best practices
train staff on handling of the assets
perform quality assurance

Certify trustworthiness of the architecture:

vet codecs and container/file formats – must make sure that we are comfortable with the technology, hardware and formats
active role in the storage decisions
technical metadata, audit trails and chain of custody

Digital assets are much easier to destroy than physical objects. Physical objects can be stored, left behind, forgotten and ‘rediscovered’; digital objects are more fragile, and a single keystroke or application error can destroy them. Casual collectors typically delete what they don’t want with no sense of a need to retain the content. People need to be made aware that the content might be important long term.

Digital assets are dependent on file formats and hardware/software platforms. More and more people are capturing content on mobile devices and uploading it to the web. We need to be aware of the underlying structure. File formats are proliferating and growing over time. Sound files come in 27 common file formats and 90 common codecs. Moving image files come in 58 common containers/codecs and come with audio tracks in the 27 file formats/90 common codecs.

Digital assets are vulnerable to format obsolescence. Examples include WordPerfect (1979), Lotus 1-2-3 (1978) and dBase (1978). We need to find ways to migrate from the old format to something researchers can use.

Physical format obsolescence is a danger — examples include tapes, floppy disk, zip disk, IBM demi-disk and video floppy. There is a threat of a ‘digital dark age’. The cloud is replacing some of this pain – but replacing it with a different challenge. People don’t have a sense of where their content is in the physical world.

Research data is the bleeding edge. Datasets come in lots of different flavors. There are lots of new and special file formats relating specifically to scientific data gathering and reporting… a long list including things like GRIB (for meteorological data), SUR (MRI data), DWG (for CAD data), SPSS (for statistical data from the social sciences) and on and on. You need to become a specialist in each new project on how to manage the research data to keep it viable.

There are ways to mitigate the challenges through predictable use cases and rigid standards. Most standard file types are known quantities. There is a built-in familiarity.

File format support: Isaiah showed a grid with one axis Open vs Closed and the other Free vs Proprietary. Expensive proprietary software that does the job so well that it is the best practice and assumed format for use can be a challenge – but it is hard to shift people from using these types of solutions.

Digital Curation Lifecycle

Objects are evaluated, preserved, maintained, verified and re-evaluated
iterative – the cycle doesn’t end with doing it just once
Good exercise for both known and unknown formats

The diagram from the slide shows layers – looks like a diagram of the geologic layers of the earth.

Steps:

data is the center of the universe
plan, describe, evaluate, learn meanings.
ingest, preserve, curate
continually iterate

Controlled chaos! Evaluate the collection and the needs of the digital assets. Use preservation-grade tools to originate assets. Take stock of the software, systems and recording apparatus. Describe these in the technical metadata so we know how each asset originated. We need to pick our battles and use de facto industry standards. Sometimes those standards drive us to choices we wouldn’t pick on our own. Example: Final Cut Pro, even though it is Mac-only and proprietary.

Establish a format guide and handling procedures. Evaluate the veracity and longevity of the data format. Document and share our findings. Help others keep from needing to reinvent the wheel.

Determine method of access: How are users expected to access and view these digital items? Software/hardware required? View online – plug-in required? third party software?

Primary guidelines: Do no harm to the digital assets.

preservation masters, derivatives as needed
content modification must be done with extreme care
any changes must be traceable, audit-able, reversible.

Prepare for the inevitable: more format migrations. Re-assess the formats and migrate to new formats when the old ones become obsolete. Maintain accessibility while ensuring data integrity.

At Rutgers they have the RUcore Community Repository which is open source, and based on FEDORA. It is dedicated to the digital preservation of multiple digital asset types and contains 26,238 digital assets (as of April 2012). Includes audio, video, still images, documents and research data. Mix of digital surrogates and born digital assets.

Publicly available digital object standards are available for all traditional asset types. They define baseline quality requirements for ‘preservation grade’ files and are periodically reviewed and revised as tech evolves. See Rutgers’ Page2Pixel Digital Curation standards.

They use a team approach as they need to triage new asset types. Do analysis and assessment. Apply holistic data models and the preservation lifecycle and continue to publish and share what they have done. Openness is paramount and key to the entire effort.

The Archivist’s Dilemma: Access to collections in the digital era

Next, Tim Pyatt from Penn State spoke about ‘The Archivist’s Dilemma’ — starting with examples of how things are being done at Penn State, but then moving on to show examples of other work being done.

There are lots of different ways of putting content online. Penn State’s digital collections are published online via CONTENTdm, Flickr, social media, Penn State IR tools, the Internet Archive and a custom-built platform. The University Faculty Senate put things up on their own. We need to think about how the researcher is going to approach this content.

With analog collections that have portions digitized, they describe both, but then include a link to the digital collection. These link through to a description of the digital collection, and then on to CONTENTdm for the collection itself.

Examples from Penn State:

A Google search for College of Agricultural Science Publications leads users to a complementary/competing site with no link back to the catalog nor any descriptive/contextual information.
Next, we were shown the finding aid for William W. Scranton Papers from Penn State. They also have images up on Flickr (‘William W. Scranton Papers’). Flickr provides easy access, but acts as another content silo. It is crucial to have metadata in the header of the file to help people find their way back to the originating source. Google Analytics showed them that content is seen 8x more often in Flickr than in CONTENTdm.
The Judy Chicago Art Education Collection is a hybrid collection. The finding aid has a link to the curriculum site. There is a separate site for the Judy Chicago Art Education Collection more focused on providing access to her education materials.
The University Curriculum Archive is a hybrid collection with a combination of digitized old course proposals, while the past 5 years of curriculum have been born digital. They worked with IT to build a database to commingle the digitized & born digital files. It was custom built and not integrated into other systems – but at least everything is in one place.

Examples of what is being done at other institutions:

Duke University Libraries (the ‘good’ example): Construction of Duke University finding aid. Drill down into discovery of the digitized content, good linkages back to the analog collection description.
UNC (the ‘better’ example): George Washington Jones Papers finding aid integrates the digital content with the finding aid. Folders linked in the body of the finding aid. Collapsed the silos.
DuraSpace (‘best’): AIMS project example at Stanford: Stephen Jay Gould papers finding aid – lets you drill down to digital content and view the content of a specific floppy.

Penn State is loading up a Hydra repository for their next wave!

Born-Digital @UVa: Born Digital Material in Special Collections

Gretchen Gueguen, UVA

Presentation slides available for download.

AIMS (An Inter-Institutional Model for Stewardship) born digital collections: a 2 year project to create a framework for the stewardship of born-digital archival records in collecting repositories. Funded by Andrew W. Mellon Foundation with partners: UVA, Stanford, University of Hull, and Yale. A white paper on AIMS was published in January 2012.

Parts of the framework: collection development, accessioning, arrangement & description, discovery & access are all covered in the whitepaper – including outcomes, decision points and tasks. The framework can be used to develop an institutionally specific workflow. Gretchen showed an example objective ‘transfer records and gain administrative control’ and walked through outcome, decision points and tasks.

Back at UVA, their post-AIMS strategizing is focusing on collection development and accessioning.

In the future, they need to work on Agreements: copyright, access & ownership policies and procedures. People don’t have the copyright for a lot of the content that they are trying to donate. This makes it harder, especially when you are trying to put content online. You need to define exactly what is being donated. With born digital content, content can be donated multiple places. Which one is the institution of record? Are multiple teams working on the same content in a redundant effort?

Need to create a feasibility evaluation to determine systematically if something is worth collecting. Should include:

file formats
hardware/software needs
scope
normalization/migration needed?
private/sensitive information
third-party/copyrighted information?
physical needs for transfer (network, storage space, etc.)

If you decide it is feasible to collect, how do you accomplish the transfer with uncorrupted data, support files (like fonts, software, databases) and ‘enhanced curation’? You may need a ‘write blocker’ to make sure you don’t change the content just by accessing the disk. You may want to document how the user interacted with their computer and software. Digital material is very interactive – you need to have an understanding of how the user interacted with it. Might include screen shots.

Next she showed their accessioning workflow:

take the files
create a disk image – bit for bit copy – makes the preservation master
move that from the donor’s computer to their secure network with all the good digital curation stuff
extract technical metadata
remove duplicates
may not take stuff with PPI
triage if more processing is necessary
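
The ‘remove duplicates’ step in a workflow like this is commonly done by comparing checksums rather than file names. Here is a minimal sketch of that idea, not the UVA workflow itself. The function names sha256_of and find_duplicates are hypothetical, and it assumes the files have already been copied from the disk image into a working directory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash; keep only groups with 2+ files."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Two files land in the same group only when their bytes are identical, so renamed copies are caught while files that merely share a name are not.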

Be ready for surprises – lots of things that don’t fit the process:

8″ floppy disk
badly damaged CD
disk no longer functions – afraid to throw away in case of miracle
hard drive from 1999
mini disks

These have no special notation taken of them in the accessioning.

Priorities with this challenging material:

get the data off aging media
put it someplace safe and findable
inventory
triage
transfer

Forensic Workstation:

FRED = Forensic Recovery of Evidence Device – built-in UltraBay write blocker with USB, FireWire, SATA, SCSI, IDE and Molex for power; external 5.25″ floppy drive, CD/DVD/Blu-ray, microcard reader, LTO tape drive, external 3.5″ drive + external hard drive for additional storage.
toolbox
big screen

FRED’s FTK software shows you an overview of what is there, recognizes 1,000s of file formats, recovers deleted data, finds duplicates, and can identify PPI. It is very useful for description and for selecting what to accession – but it costs a lot and requires an annual license.

BitCurator is making an open source version. From their website: “The BitCurator Project is an effort to build, test, and analyze systems and software for incorporating digital forensics methods into the workflows of a variety of collecting institutions.”

Archivematica:

creates PREMIS record recording what activities are done – preservation metadata standard
creates derivative records – migration!!
yields a preservation master + access copies to be provided in the reading room

Hoping for a Hypatia-like thing in the future

Final words: Embrace your inner nerd! Experiment – you have nothing to lose. If you do nothing you will lose the records anyway.

Questions and Answers

QUESTION: How do you convince your administration that this needs to be a priority?

ANSWER:

Isaiah: Find examples of other institutions that are doing this. Show them that our history is at risk moving forward. A digital dark age is coming if we don’t do something now. It is really important that we show people “this is what we need to preserve.”

Tim: Figure out who your local partners are. Who else has a vested interest in this content? IT was happy at Penn State that they didn’t need to keep everything – happy that there is an appraisal process, and that they are preserving content so it doesn’t need to be kept by everyone. I am one of the authors of the upcoming report on born digital records, due at the end of the summer: Association of Research Libraries – Managing Electronic Records – Spec Kit

Gretchen: Numbers are really useful. Sometimes you don’t think about it, but it is a good practice to count the size of what you created. How much time would it take to recreate it if you lost it. How many people have used the content? Get some usage stats. Who is your rival and what are their statistics?

Jordon: Point to others who you want to keep up with

QUESTION: Would the panelists like to share experiences with preserving dynamic digital objects like databases?

ANSWER:

Isaiah: We don’t want to embarrass people. We get so many different formats. It is a trial and error thing. You need to say gently that there is a better way to do this. Sad example – DVDs were burned from tapes in 2004 and we got them in 2007. The DVDs were not verified. They were not stored well – stored in a hot warehouse. We opened the boxes and found unreadable DVDs – delaminating.

Tim: From my Duke days, we had a number of faculty data sets in proprietary formats. We would do checksums on them, wrap them up and put them in the repository. They are there, but who knows if anyone will be able to read them later. Same as with paper – preserve them now in good acid-free papers.
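
To make the ‘do checksums on them’ step concrete, here is a minimal sketch of a fixity check: record a digest for every file at deposit time, then re-compute later and report anything that changed or went missing. The function names build_manifest and verify_manifest are hypothetical, and this is not a description of any particular repository’s tooling. For brevity it reads each file fully into memory, which would not suit very large data sets.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose digest no longer matches."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

An empty list from verify_manifest means the bits are unchanged. As Tim points out, it says nothing about whether anyone can still open the format.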

Gretchen: My 19 yo student held up a zip disk and said “Due to my extreme youth I don’t know what this is!” (And now you know why there is a photo of a zip disk at the top of this post – your reward for reading all the way to the end!)

Image Credit: ‘100MB Zip Disc for Iomega Zip, Fujifilm/IBM-branded‘ taken by Shizhao

As is the case with all my session summaries from MARAC, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Gridworks: Super Data Cleanup and Exploration Tool

May 29, 2010

In my presentation at the Spring 2010 Mid-Atlantic Regional Archives Conference (MARAC), Whirlwind Tour of Visualization-Land, I showed some screenshots of a tool called Gridworks. At the time, Gridworks was not available to the general public. The good news is that earlier this month Gridworks 1.0 was officially released and you can get Gridworks right now.

For those of you who didn’t see my presentation, Gridworks is a tool you run locally on your computer via a web browser. It permits you to load ‘grid-shaped data’ for examination, filtering and data cleanup. That makes it sound so much less exciting than it is. The best way to get a sense of what you can do is to watch the Gridworks Videos.

What sort of data do I think there is in archives to be pumped into Gridworks? How about collection descriptive data and electronic record datasets? Since all the data is kept locally, you don’t need to worry about uploading your data to some anonymous server in order to work with it. It all stays safely on your local computer the whole time.

A quick list of things that Gridworks can do:

Cluster data to find values that are almost the same so you can normalize your data (for example – NYC vs N.Y.C.)
Create instant faceted browsing based on any column in your data
Provide scatterplots of the values from any two numeric columns as well as a way to spot the most interesting combinations across many possible columns
Reconciliation and validation of values based on data from within Freebase.com
Pull data from Freebase.com based on a matched column – such as the population of a country, if you have a column in your dataset with country specified
Splitting data within a cell based on a specified delimiter
Application of regular expressions and other simple code to data to create new columns
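
To give a feel for how ‘almost the same’ values can be found, here is a minimal sketch of key-based clustering: normalize each value to a key and group the values that share one. It illustrates the general technique only and is not Gridworks’ own code. The function names fingerprint and cluster are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: lowercase, drop ASCII punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct values sharing a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}
```

With this keying, ‘NYC’ and ‘N.Y.C.’ both reduce to ‘nyc’, and ‘Smith, John’ and ‘John Smith’ both reduce to ‘john smith’, so each pair would be offered for merging. A person still has to decide which spelling wins.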

This list just scratches the surface, but it should give you a decent idea of the power of Gridworks. Even if the only feature you ever use is the one which lets you cluster and update your data to remove the ‘almost the same’ values, Gridworks can save you hours of painstaking data cleanup.

Why is data cleanup exciting? Because once you have nice clean data with all the attributes that are useful to have for your data set – then you can start playing with the data in visualization tools! So go watch some Gridworks Videos, get Gridworks for yourself and start playing with data. It is free and it makes working with data fun!

MARAC Spring 2010: Hurray for Archival Metadata (Session S2)

May 7, 2010

The official title for this session is “Discovery Tools for Archival Collections: Getting the Most Out of Your Metadata” and it was divided into two presentations with introduction and question moderation by Jaime L. Margalotti, senior assistant librarian in Special Collections at the University of Delaware.

Introduction to Metadata Standards

Michael Bolam, metadata librarian for digital production, is in charge of all the metadata for all the collections at the Digital Research Library at the University of Pittsburgh. He is not an archivist – but does know where the archives is at Pitt! He has put lots of archival material online through digitization and assignment of metadata.

The best definition he has found of metadata, good for all audiences: “Metadata consists of statements we make about resources to help us find, identify, use, manage, evaluate and preserve them” Marty Kurth – Head of Metadata Services, Cornell University Libraries

Reviewed examples of metadata for images, text documents and archival collections. There is also data related to the business of scanning and making content available – administrative/behind the scene. Standards let you take your data and use it for other purposes.

Overview of alphabet soup of metadata standards:

MARC: bibliographic information in machine-readable form (a MAchine-Readable Cataloging record).
Dublin Core: the goal of Dublin Core was to create a core set of metadata fields that could be used across platforms, across various disciplines.
MARCXML: schema for representing MARC in XML. Makes it easy to convert to and from MARC without losing any data. May have more data than you need. MARCXML is not very ‘human readable’. You need to recall all the code numbers for the different data elements. Can be exported from Archivists’ Toolkit.
MODS: Metadata Object Description Schema – sort of a ‘MARCXML light’. Tries to be a step between MARCXML (robust & complicated) and Dublin Core (really simple). May result in compacting multiple MARCXML fields into single MODS fields. May lose some of the granularity of the data. The tags ARE human readable. The tag is the word ‘author’ – not a number. Also can be exported from Archivists’ Toolkit.
ONIX: ONline Information eXchange – standard used by the book publishing industry. XML-based standard for making available intellectual property in published form, both physical & digital. Data created by the publisher. They use different ways of representing authors, keywords, etc. in comparison to LoC and library cataloging.
METS: Metadata Encoding & Transmission Standard. XML standard wrapper for describing divergent types of content within a digital library. Books, images, collections, etc. keep their metadata in different formats – METS lets you bring them together.
OAI-PMH: Not a metadata standard – but rather a protocol for sharing metadata. Gives us a way to pull baseline information about a digital object out of a database and put it out somewhere where it can be harvested and used.
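
For a sense of what the OAI-PMH protocol looks like in practice, here is a minimal sketch. A harvester issues an HTTP request with a verb such as ListRecords and a metadataPrefix such as oai_dc, and gets back XML that wraps Dublin Core records. The base URL and the sample response below are made up for illustration, and the function names list_records_url and titles_from_response are hypothetical. Nothing here contacts a real server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Pull every dc:title out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", NS)]


# Made-up response, trimmed to the parts the parser uses.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Made-up photograph title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(titles_from_response(SAMPLE))
```

A real harvester would also need to follow resumption tokens for large result sets, which this sketch leaves out.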

Examples of projects built on shared metadata:

Worldcat.org: Has everything that is shared with OCLC. They do expose their records to Google and Yahoo harvesting.
OAIster: Searches a harvested data set – it is not going live out on the web. The OAIster records are also available in Worldcat. Example: search for Pittsburgh City Photographer (that is a provider of data). Most digitization software will generate an OAIster harvestable version. In his example we see that address and location get compressed into Notes. This is because there is not always a place in Dublin Core that maps to the level of detail you collect at your local institution. http://www.oclc.org/us/en/oaister/default.htm – has the info about contributing your content for crawling.
Archive Grid: The goal is to pull in finding aids from many sources. It is a service – requires some sort of subscription and payment to see the data. Uses Lucene for searching. The content in Archive Grid is now available in Worldcat. To participate – see http://www.oclc.org/us/en/archivegrid/default.htm

Google and Yahoo do index OAIster and WorldCat, so that is one path to being found in search engines.
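
The OAIster example above, where address and location end up compressed into Notes, is what a crosswalk to a simpler standard tends to produce. Here is a minimal sketch of that effect. The local field names in CROSSWALK and the function name to_dublin_core are hypothetical, and mapping the leftovers to dc:description is one assumed choice, not a rule from any published crosswalk.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical local field names mapped to simple Dublin Core elements.
CROSSWALK = {"title": "title", "photographer": "creator", "date_taken": "date"}


def to_dublin_core(record):
    """Build a Dublin Core XML element; unmapped fields collapse into dc:description."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    leftovers = []
    for field, value in record.items():
        if field in CROSSWALK:
            ET.SubElement(root, f"{{{DC_NS}}}{CROSSWALK[field]}").text = value
        else:
            # No matching element: the detail survives only as free text.
            leftovers.append(f"{field}: {value}")
    if leftovers:
        ET.SubElement(root, f"{{{DC_NS}}}description").text = "; ".join(leftovers)
    return root
```

Once address and location are folded into a single description string, a harvester can no longer search or facet on them separately. That is the loss of granularity in concrete form.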

MARC Records for Archival Materials in WorldCat Local

Jennifer MacDonald from the University of Delaware presented a cataloger’s perspective of a WorldCat Local environment. She is a “concerned enthusiast” with regard to metadata. The University of Delaware was the first institution to buy WorldCat Local. She ended up on the WorldCat Local Special collections and Archives Task Force. The task force made their final report in 2008 and got a response from OCLC in 2009. They did get some immediate changes based on their feedback – like moving the 520 “summary” data element higher in the display. For some problems the task force identified, such as Archival Materials that were not being identified properly (Internet Resource is the type for all OAI records), it is hard to tell if the issue has been fixed.

She showed some screenshots from WorldCat Local to show what data elements are there and how they are organized. In the FirstSearch screenshot (only available at the school), Notes and General Info holds a mishmash of content from various data elements consolidated into single fields. The task force asked for the “Browse” feature but apparently this feature is dead. They got no response from OCLC to this request in their report.

If you use the University of Delaware instance of WorldCat Local to search for walter penn shipley and drill down to the detail record display for the Walter Penn Shipley Papers you will see what was shown during the session. This display is customizable at the institution level in WorldCat Local. Some data is shown. You see lots of Web 2.0 options to add your own data, but the display is missing some of the data from the original MARC record. The full MARC record is indexed for keyword search, but since some of it is not displayed, users may not be able to determine why a record was returned.

Fields missing from the WorldCat Local display:

351 – Organization and Arrangement of Materials
545 – biographical note
506 – restrictions on access
540 – Use of materials – with link to an askspec page: http://www.lib.udel.edu/cgi-bin/askspec.cgi
524 – preferred citation form – and this is where the manuscript number is
655 – some of the parts of the genre terms are missing
656 – occupation

OCLC says that they have not included all this because people don’t want this displayed. Given that the local organization is already deciding what to show, the task force would prefer the option to display all data elements. Due to this missing data, Jennifer prefers the FirstSearch interface – but this option is not available at all institutions. You should take advantage of the Web 2.0 features. Archivists can create an account on WorldCat Local and add data elements.

Questions and Answers

QUESTION: You talk about having the metadata in a format that is accessible to harvesting. What I have is a bunch of CDs with images on them that have a folder and descriptor structure. Is there a metadata harvester that can go in and pull that metadata out? A New York Stock Exchange photographer sent these.

ANSWER (Michael): So the metadata you are looking to extract is the filename and descriptors? You could have someone write a little script and extract what you need. I would hand it to the guy I work with because he writes Perl. If you then made that available via your website, people could find it. To get it into a database – it is just a small script.
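
As a rough idea of what such a ‘little script’ might look like, here is a minimal sketch in Python rather than Perl. It walks a folder tree, treats the folder names as descriptors and writes one CSV row per file. The column names and the function names inventory and write_inventory are hypothetical, and it assumes the CD contents have been copied to a local directory.

```python
import csv
import os


def inventory(root):
    """Yield one row per file: folder names as descriptors plus the file name."""
    for folder, _subdirs, files in os.walk(root):
        relative = os.path.relpath(folder, root)
        descriptors = [] if relative == "." else relative.split(os.sep)
        for name in sorted(files):
            yield {
                "path": os.path.join(relative, name) if descriptors else name,
                "descriptors": " / ".join(descriptors),
                "filename": name,
            }


def write_inventory(root, csv_path):
    """Write the inventory of root to a CSV file and return the number of rows."""
    rows = list(inventory(root))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "descriptors", "filename"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The resulting CSV can be loaded into a spreadsheet or database, which covers the ‘get it into a database’ part of the answer.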

QUESTION: Are there any specifically useful webinars/seminars for becoming familiar with these formats for skillbuilding?

ANSWER (Michael): Tons on the web. The LoC websites are very useful. You may have heard the term ‘crosswalking’ – that is where you take one format and turn it into another. Looking at the crosswalks can make it much easier to understand how a format you understand maps to one you are trying to learn about. Shareable Metadata – metadata for you and me. Not online yet – but someone in the audience said the plan is to post the materials. There have been a couple of books and ALA publications. Most of the ones I know of are about 10 years old. Jaime: SAA has a good workshop series.

QUESTION: One of the first things you said was to take data out of EAD and you didn’t go into detail in that. Were you talking about DAO tagged items?

ANSWER (Michael): I was just talking about reusing data in a new environment. For example, we just started digitizing manuscripts and each item is becoming an individual digital object. The only metadata we have is in the EAD finding aid – so we are using that data to make descriptive data about the digital objects. We are going to create a MODS or METS record for every digital object. Jaime: We use EAD to make MODS records. She has been manually extracting EAD data as Dublin Core data for ContentDM.
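
Here is a minimal sketch of what reusing EAD data for item-level description could look like: read the finding aid, find each item-level component, and emit Dublin Core-style fields that inherit the collection title and identifier. The sample finding aid is made up, the function name items_as_dublin_core is hypothetical, and the choice of ‘source’ for the collection reference is an assumption. It also assumes an EAD file without XML namespaces, so a namespaced file would need the paths adjusted.

```python
import xml.etree.ElementTree as ET

# Made-up finding aid, trimmed to the elements the function reads.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitid>MSS 000</unitid>
    </did>
    <dsc>
      <c level="item">
        <did>
          <unittitle>Letter to an example correspondent</unittitle>
          <unitdate>1901</unitdate>
        </did>
      </c>
    </dsc>
  </archdesc>
</ead>"""


def items_as_dublin_core(ead_text):
    """Return one dict of Dublin Core-style fields per item-level component."""
    root = ET.fromstring(ead_text)
    collection = root.findtext("archdesc/did/unittitle", default="")
    call_number = root.findtext("archdesc/did/unitid", default="")
    records = []
    for component in root.iterfind(".//c[@level='item']"):
        records.append({
            "title": component.findtext("did/unittitle", default=""),
            "date": component.findtext("did/unitdate", default=""),
            "source": f"{collection} ({call_number})",
        })
    return records


print(items_as_dublin_core(SAMPLE_EAD))
```

Each dict could then be loaded into a system like CONTENTdm or expanded into a fuller MODS record.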

My QUESTION: What format does OAIster want?

ANSWER (Michael): OAIster is just harvesting Dublin Core. You can share MODS and other metadata types and you may find other aggregators that are expecting their users to work in a more detailed environment. You may publish more data elements for other harvesters as well – but OAIster will only pull the Dublin Core data elements.

QUESTION: We are working on a digitization project to digitize local historical societies, museums and libraries. Might the catalogers be able to deal with MODS or will the loss of granularity be a problem?

ANSWER (Michael): I am not a MODS expert. MARC is very granular. Maybe look at the MARCXML – MODS crosswalk?

QUESTION: At the University of Delaware, do you have any other systems?

ANSWER (Jennifer): When we first got WorldCat Local you had to know the URL to get to the library. That changed fast! The patrons couldn’t find anything. Jaime: In WorldCat Local you cannot scope the search to specific sub-collections.

QUESTION: Thank you Jennifer for your remarks. Is there a problem with catalogers trying to ‘sneak’ data elements into other places – are standards in danger?

ANSWER (Jennifer): I would hope we wouldn’t move 524 data into a 500 field just to get it displayed. There is some danger of losing the granularity by pushing everything to Dublin Core. I don’t know how real that danger is at this point.

QUESTION: A political question for Jennifer: Who has the clout to push for changes with OCLC?

ANSWER (Jennifer): I think encouraging users to give feedback is important. We were told that users don’t want that: “we have proven that users don’t want that”. Users need to make comments about their challenges in dealing with the interface. FROM AUDIENCE: The strongest is to say that you are looking at Sky River. FROM AUDIENCE: Make your data more discoverable outside the catalog world – internal websites and Google. Jaime: We are working hard to make MARC records to push access to our collections. The push is to make the data available in as many locations as possible.

QUESTION: Are these all different levels of subscriptions? Are they trying to push people to buy more subscriptions?

ANSWER (Jennifer): There is a sense that WorldCat Local is pushed at local public libraries. Yes – WorldCat Local is something they have to pay for. Michael: With Archive Grid you are going a step further – EVERYTHING in the finding aid is indexed. Every search I did in there returned thousands of records. Then I filtered by institution – and it never loaded. FROM AUDIENCE: I think they are revamping Archive Grid – but I don’t know how far they are in the process. Michael: I love the detail – you don’t have to dig through other data to find something useful. Depending on the institution – and how they are allowing their data to be harvested – you may see less information. Jaime: You have to actively work with OCLC to get Archive Grid to pick up your data.

QUESTION: We are tinkering with users adding tags – are you having any success with people adding tags?

ANSWER (Jaime): No – it isn’t something we have dealt with. WorldCat Local does let you add stuff like that.

QUESTION: Will OCLC provide that UGC (user generated content) back to the institution?

ANSWER: We wouldn’t know.

QUESTION: Have they provided access to the user studies?

ANSWER: Yes – but it is based on watching individuals use the tools.

Image Credit: Statue representing Research by Henry Hering from image of the interior of the Field Museum of Natural History.
