Category: access

Chapter 8: Preparing and Releasing Official Statistical Data by Professor Natalie Shlomo

Black and white photo of a woman using a keypunch to tabulate the United States Census, circa 1940.

Chapter 8 of Partners for Preservation is ‘Preparing and Releasing Official Statistical Data’ by Professor Natalie Shlomo. This is the first chapter of Part III: Data and Programming. I knew early in the planning for the book that I wanted a chapter that talked about privacy and data.

During my graduate program, in March of 2007, Google announced changes to their log retention policies. I was fascinated by the implications for privacy. At the end of my reflections on Google’s proposed changes, I concluded with:

“The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seem destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.”

While developing my chapter list for the book, I followed my curiosity about how the field of statistics preserves privacy and how these approaches might be applied to historical data preserved by archives. Fields of research that rely on the use of statistics and surveys have developed many techniques for balancing the desire for useful data with the expectations of confidentiality held by those who participate in surveys and censuses. This chapter taught me that “statistical disclosure limitation”, or SDL, aims to prevent the disclosure of sensitive information about individuals.

This short excerpt gives a great overview of the chapter:

“With technological advancements and the increasing push by governments for open data, new forms of data dissemination are currently being explored by statistical agencies. This has changed the landscape of how disclosure risks are defined and typically involves more use of perturbative methods of SDL. In addition, the statistical community has begun to assess whether aspects of differential privacy which focus on the perturbation of outputs may provide solutions for SDL. This has led to collaborations with computer scientists.”
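As a rough illustration of what perturbing outputs can look like, here is a minimal sketch that adds Laplace noise to a table of counts, the textbook mechanism associated with differential privacy. It is not taken from the chapter: the function names, the example table, and the choice of epsilon are all my own assumptions, and a real statistical release would need a carefully vetted implementation rather than this toy.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_counts(counts, epsilon, seed=None):
    """Return a copy of `counts` with Laplace noise added to every cell.

    Assumption: adding or removing one person changes a single cell by
    at most 1, so the noise scale is 1 / epsilon. Smaller epsilon means
    more noise and stronger protection.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {cell: count + laplace_noise(scale, rng)
            for cell, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical table: number of respondents per occupation in one town.
    table = {"teacher": 120, "farmer": 45, "lighthouse keeper": 1}
    for cell, value in perturb_counts(table, epsilon=0.5, seed=1).items():
        print(f"{cell}: {value:.1f}")
```

The cell with a single lighthouse keeper is the kind of value that could identify a person, and the noise makes it impossible to tell from the published number whether the true count was 0, 1, or 3. The cost is that every published number is a little wrong, which is the trade-off between useful data and confidentiality.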

Almost eighty years ago, the woman in the photo above used a keypunch to tabulate the US Census. The amount of detailed hands-on labor required to gather that data boggles the mind in comparison to the born-digital data collection techniques now possible. The 1940 census was released in 2012 and is available online for free through a National Archives website. As archives face the onslaught of born-digital data tied to individuals, the techniques used by statisticians will need to become a familiar tool for archivists seeking to increase access to data while respecting the privacy of those who might be identified through unfettered access to the data. This chapter serves as a solid introduction to SDL, as well as a look forward to new ideas in the field. It also ties back to topics in Chapter 2: Curbing The Online Assimilation Of Personal Information and Chapter 5: The Internet Of Things.


Natalie Shlomo (BSc, Mathematics and Statistics, Hebrew University; MA, Statistics, Hebrew University; PhD, Statistics, Hebrew University) is Professor of Social Statistics at the School of Social Sciences, University of Manchester. Her areas of interest are in survey methods, survey design and estimation, record linkage, statistical disclosure control, statistical data editing and imputation, non-response analysis and adjustments, adaptive survey designs and small area estimation. She is the UK principal investigator for several collaborative grants from the 7th Framework Programme and H2020 of the European Union all involving research in improving survey methods and dissemination. She is also principal investigator for the Leverhulme Trust International Network Grant on Bayesian Adaptive Survey Designs. She is an elected member of the International Statistical Institute and a fellow of the Royal Statistical Society. She is an elected council member and Vice-President of the International Statistical Institute. She is associate editor of several journals, including International Statistical Review and Journal of the Royal Statistical Society, Series A. She serves as a member of several national and international advisory boards.

Image source:  A woman using a keypunch to tabulate the United States Census, circa 1940. National Archives Identifier (NAID) 513295

Chapter 2: Curbing the Online Assimilation of Personal Information by Paulan Korenhof

The second chapter in Partners for Preservation is ‘Curbing the Online Assimilation of Personal Information’ by Paulan Korenhof. Given the amount of attention being focused on the right to be forgotten and the EU General Data Protection Regulation (GDPR), I felt it was essential to include a chapter that addressed these topics. Walking the fine line between providing access to archival records and respecting the privacy of those whose personal information is included in the records has long been an archival challenge.

In this chapter, Korenhof documents the history of the right to be forgotten and the benefits and challenges of GDPR as it is currently being implemented. She also explores the impact of the broad and virtually instantaneous access to content online that the Internet has facilitated.

This quote from the chapter highlights a major issue with making so much content available online, especially content that is being digitized or surfaced from previously offline data sources:

“With global accessibility and the convergence of different contextual knowledge realms, the separating power of space is nullified and the contextual demarcations that we are used to expecting in our informational interactions are missing.”

As the second chapter in Part 1: Memory, Privacy, and Transparency, it continues to pull these ideas together. In addition to providing a solid grounding in the right to be forgotten and GDPR, it should guide the reader to explore the unintended consequences of the mad rush to put everything online and the dramatic impact that search engines (and their human-coded algorithms) have on what is seen.

I hope this chapter triggers more contemplation of these issues by archivists within the big picture of the Internet. Often we are so focused on improving access to content online that these questions about the broader impact are not considered.


Paulan Korenhof

Paulan Korenhof is in the final stages of her PhD-research at the Tilburg Institute for Law, Technology, and Society (TILT). Her research is focused on the manner in which the Web affects the relation between users and personal information, and the question to what degree the Right to Be Forgotten is a fit solution to address these issues. With a background in philosophy, law, and art, she investigates this relation from an applied phenomenological and critical theory perspective. Occasionally she co-operates in projects with Hacklabs and gives privacy awareness workshops to diverse audiences. Recently she started working at the Amsterdam University of Applied Sciences (HVA) as a researcher on Legal Technology.


Image credit: Flickr Commons: British Library: Image taken from page 5 of ‘Forget-Me-Nots. [In verse.]’

The CODATA Mission: Preserving Scientific Data for the Future

The North Jetty near the Mouth of the Columbia River 05/1973

This session was part of The Memory of the World in the Digital Age: Digitization and Preservation conference and aimed to describe the initiatives of the Data at Risk Task Group (DARTG), part of the Committee on Data for Science and Technology (CODATA), a body of the International Council for Science.

The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf-life. DARTG is seeking out sources of such data worldwide, knowing that many are irreplaceable for research into the long-term trends that occur in the natural world.

Organizing Data Rescue

The first speaker was Elizabeth Griffin from Canada’s Dominion Astrophysical Observatory. She spoke of two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. (PDF of session slides)

The “memory of the world” is vast and extends back for aeons of time, but only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It is the analog data which are considered to be “at risk” and which are the task group’s immediate concern.

Many pre-digital records have never made it into a digital form.  Even some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot be located at all easily.

How can such “data at risk” be recovered and made useable?  The design of an efficient rescue package needs to be based upon the big picture, so a website has been set up to create an inventory where anyone can report data-at-risk. The Data-at-Risk Inventory (built on Omeka) is front-ended by a simple form that asks for specific but fairly obvious information about the datasets, such as field (context), type, amount or volume, age, condition, and ownership. After a few years DARTG should have some better idea as to the actual amounts and distribution of different types of historic analog data.

Help and support are needed to advertise the Inventory.  A proposal is being made to link data-rescue teams from many scientific fields into an international federation, which would be launched at a major international workshop.  This would give a permanent and visible platform to the rescue of valuable and irreplaceable data.

The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records. There will be many benefits, often cross-disciplinary, sometimes unexpected, and perhaps surprising. Some will have economic pay-offs, as in the case of some uncovered pre-digital records concerning the mountain streams that feed the reservoirs of Cape Town, South Africa. The mountain slopes had been deforested a number of years ago and replanted with “economically more appealing” species of tree. In their basement, hydrologists found stacks of papers containing 73 years of stream-flow measurements. They digitized all the measurements, analyzed the statistics, and discovered that the new but non-native trees used more water. The finding clearly held significant importance for the management of Cape Town’s reservoirs. For further information about the stream-flow project see Jonkershoek – preserving 73 years of catchment monitoring data by Victoria Goodall & Nicky Allsopp.

DARTG is building a bibliography of research papers which, like the Jonkershoek one, describe projects that have depended partly or completely on the ability to access data that were not born-digital.  Any assistance in extending that bibliography would be greatly appreciated.

Several members of DARTG are themselves engaged in scientific pursuits that seek long-term data.  The following talks describe three such projects.

Data Rescue to Increase Length of the Record

The second speaker, Patrick Caldwell from the US National Oceanographic Data Center (NODC), spoke on rescue of tide gauge data. (PDF of full paper)

He started with an overview of water level measurement, explaining how an analog trace (a line on a paper record drawn by a float with a timer) is produced. Tide gauges include a geodetic survey benchmark to make sure that the land isn’t moving. The University of Hawaii maintains a network of gauges internationally. Back in the 1800s, they were keeping track of the tides and sea level for shipping. You never know what the application may turn into – they collected for tides, but in the 1980s they started to see patterns. They used tide gauge measurements to discover El Niño!

As you increase the length of the record, the trustworthiness of the data improves. Within sea level variations, there are some changes that are on the level of decades. To take that shift out, they need 60 years to track sea level trends. They are working to extend the length of the record.

The UNESCO Joint Technical Commission for Oceanography & Marine Meteorology has the Global Sea Level Observing System (GLOSS).

GLOSS has a series of Data Centers:

  • Permanent Service for Mean Sea Level (monthly)
  • Joint archive for sea level (hourly)
  • British Oceanographic Data Centre (high frequency)

The biggest holding starts in the 1940s. They want to increase the number of longer records. A student in France documented where he found records as he hunted for the data he needed. Oregon students documented records available at NARA.

Global Oceanographic Data Archaeology and Rescue (GODAR) and the World Ocean Database Project

The Historic Data Rescue Questionnaire created in November 2011 resulted in 18 replies from 14 countries documenting tide gauge sites with non-digital data that could be rescued. They are particularly interested in the records that are 60 years or more in length.

Future Plans: Move away from identifying what is out there to tackling the rescue itself. This needs funding. They will continue to search repositories for data-at-risk and continue collaborating with GLOSS/DARTG to refresh the online inventory. They will also collaborate with other programs (Atmospheric Circulation Reconstructions over the Earth (ACRE) meeting 11-2012). Eventually they will move to Phase II = recovery!

The third speaker, Stephen Del Greco from the US NOAA National Climatic Data Center (NCDC), spoke about environmental data through time and extending the climate record. (PDF of full paper) The NCDC is a weather archive with headquarters in Asheville, NC. It fulfills much of the nation’s climate data requirements. Their data comes from many different sources. They provide safe storage of over 5,600 terabytes of climate data (= 6.5 billion Kindle books). How will they handle the upcoming explosion of data on the way? They need to both handle new content coming in AND provide increased access to larger amounts of data being downloaded over time. The 2011 number was a data download of 1,250 terabytes for the year. They expect that download number to increase 10-fold over the next few years.

The Climate Database Modernization Program spent more than a decade rescuing data. It was well funded, and millions of records were rescued with a budget of roughly 20 million a year. The goal is to preserve and make major climate and environmental data available via the World Wide Web. Over 14 terabytes of climate data are now digitized. 54 million weather and environmental images are online. Hundreds of millions of records are digitized and now online. The biggest challenge was getting the surface observation data digitized. NCDC digital data for hourly surface observations generally stretch back to around 1948. Some historical marine observations go back to the spice trade records.

For international efforts they bring their imaging equipment to other countries where records were at risk. 150,000 records imaged under the Climate Database Modernization Program (CDMP).

Now they are moving from public funding to citizen-fueled projects via crowdsourcing such as the Zooniverse Program. Old Weather is a Zooniverse Project which uses crowdsourcing to digitize and analyze climate data. For example, the transcription done by volunteers helps scientists model Earth’s climate using wartime ship logs. The site includes methods to validate efforts from citizens. They have had almost 700,000 volunteers.

Long-term Archive Tasks:

  • Rescuing Satellite Data: raw images in lots of different film formats. All this is at risk. Need to get it all optically imaged. Looking at a ‘citizen alliance’ to do this work.
  • Climate Data Records: Global Essential Climate Variables (ECVs) with Heritage Records. Lots of potential records for rescue.
  • Rescued data helps people building proxy data sets: NOAA Paleoclimatology. ‘Paleoclimate proxies’ – things like boreholes, tree rings, lake levels, pollen, ice cores and more. For example – getting temperature and carbon dioxide from ice cores. These can go back 800,000 years!

We have extended the climate record through international collaboration. For example, the Australian Bureau of Meteorology provided daily temperature records for more than 1,500 additional stations. This meant a more than 10-fold increase in previous historical climate daily data holdings from that country.

Born Digital Maps

The final presentation, delivered by D. R. Fraser Taylor and Tracey Lauriault from Carleton University’s Geomatics and Cartographic Research Centre in Canada, discussed the map as a fundamental source of memory of the world. The full set of presentation slides is available online on SlideShare. (PDF of full paper)

We are now moving into born digital maps. For example, the Canadian Geographic Information System (CGIS) was created in the 1960s and was the world’s first GIS. Maps are ubiquitous in the 21st century. All kinds of organizations are creating their own maps and mash-ups. Community based NGOs, citizen science, academic and private sector are all creating maps.

We are losing born digital maps almost faster than we are creating them. We have lost 90% of the born digital maps. Above all there is an attitude that preservation is not intrinsically important. No-one thought about the need to preserve the map – everyone thought someone else would do it. There was a complete lack of thought related to the preservation of these maps.

The Canada Land Inventory (CLI) was one of the first and largest born digital map efforts in the world. It mapped 2.6 million square kilometers of Canada. It was lost in the 1980s. No-one took responsibility for archiving. Those who thought about it believed backup equaled archiving. A group of volunteers rescued the process over time – salvaged from boxes of tapes and paper in the mid-1990s. It was caught just in time and took a huge effort. 80% has been saved and is now online. This was rescued because it was high profile. What about the low-profile data sets? Who will rescue them? No-one.

The 1986 BBC Domesday Project was created in celebration of 900 years after William the Conqueror’s original Domesday Book. It was obsolete by the 1990s. A huge amount of social and economic information was collected for this project. In order to rescue it they needed an Acorn computer and needed to be able to read the optical disks. The platform was emulated in 2002-2003. It cost 600,000 British pounds to reverse engineer and put online in 2004. New discs made in 2003 at the UK Archive.

It is easier to get Ptolemy’s maps from the 15th century than it is to get a map 10 years old.

The Inuit Siku (sea ice) Atlas, an example of a Cybercartographic atlas, was produced in cooperation with Inuit communities. Arguing that the memory of what is happening in the north lies in the minds of the elders, they are capturing the information and putting it out in multi-media/multi-sensory map form. The process is controlled by the community themselves. They provide the software and hardware. They created a graphic tied to the Inuit terms for different types of sea ice. In some cases they record the audio of an elder talking about a place. The narrative of the route becomes part of the atlas. There is no right or wrong answer. There are many versions and different points of view. All are based on the same set of facts – but they come from different angles. The atlases capture them all.

The Gwich’in Place Name Atlas is building the idea of long-term preservation into the application from the start.

The Cybercartographic Atlas of the Lake Huron Treaty Relationship Process is taking data from surveyors’ diaries from the 1850s.

There are lots of Government of Canada geospatial data preservation initiatives, but in most cases there is a lot of rhetoric and not so much action. There have been many consultations, studies, reports and initiatives since 2002, but the reality is that apart from the Open Government Consultations (TBS), not very much has translated into action. Even in the case where there is legislation, lots of things look good on paper but don’t get implemented.

There are Library and Archives Guidelines working to support digital preservation of geospatial data. The InterPARES 2 (IP2) Geospatial Case Studies tackle a number of GIS examples, including the Cybercartographic Atlas of Antarctica. See the presentation slides online for more specific examples.

In general, preservation as an afterthought rarely results in full recovery of born digital maps. It is very important to look at open source and interoperable open specifications. Proactive archiving is an important interim strategy.

Geospatial data are fundamental sources of our memory of the world. They help us understand our geo-narratives (stories tied to location), counter colonial mappings, are the result of scientific endeavors, represent multiple worldviews and they inform decisions. We need to overcome the challenges to ensure their preservation.


QUESTION: When I look at the work you are doing with recovering Inuit data from people, you recover data and republish it. Who will preserve both the raw data and the new digital publication? What does it mean to try and really preserve this moving forward? Are we really preserving and archiving it?

ANSWER: No we are not. We haven’t been able to find an archive in Canada that can ingest our content. We will manage it ourselves as best we can. Our preservation strategy is temporary and holding, not permanent as it should be. We can’t find an archive to take the data. We are hopeful that we are moving towards finding a place to keep and preserve it. There is some hope on the horizon that we may move in the right directions in the Canadian context.

Luciana: I wanted to attest that we have all the data from InterPARES II. It is published in the final. I am jealously guarding my two servers that I maintain with money out of my own pocket.

QUESTION: Is it possible to have another approach to keep data where it is created, rather than a centralized approach?

ANSWER: We are providing servers to our clients in the north. Keeping copies of the database in the community where they are created. Keeping multiple copies in multiple places.

QUESTION: You mention surveys being sent out and few responses coming back. When you know there is data at risk – there may be governments that have records at risk that they are shy to reveal to the public? How do we get around that secrecy?

ANSWER: (IEDRO representative) We offer our help, rather than a request to get their data.

As is the case with all my session summaries, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Image Credit: NARA Flickr Commons image “The North Jetty near the Mouth of the Columbia River 05/1973”

Updated 2/20/2013 based on presenter feedback.

Digitization Quality vs Quantity: An Exercise in Fortune Telling

The quality vs quantity dilemma is high in the minds of those planning major digitization projects. Do you spend your time and energy creating the highest quality images of your archival records? Or do you focus on digitizing the largest quantity you can manage? Choosing one over the other has felt a bit like an exercise in fortune telling to me over the past few months, so I thought I would work through at least a few of the moving parts of this issue here.

The two ends of the spectrum are traditionally described as follows:

  • digitize at very high quality to ensure that you need not re-digitize later, create a high quality master copy from which all possible derivatives can be created later
  • digitize at the minimum quality required for your current needs, the theory being that this will increase the quantity of records you can digitize

This sounds very well and good on the surface, but this is not nearly as black and white a question as it appears. It is not the case that one can simply choose one over the other. I suppose that choosing ‘perfect quality’ (whatever that means) probably drives the most obvious of the digitization choices. Highest resolution. 100% accurate transcription. 100% quality control.

It is the rare digitization project that has the luxury of time and money required to aim for such a definition of perfect. At what point would you stop noticing any improvement, while just increasing the time it takes to capture the image and the disk space required to store it? 600 DPI? 1200 DPI? Scanners and cameras keep increasing the dots per inch and the megapixels they can capture. Disk space keeps getting cheaper. Even at the top of the ‘perfect image’ spectrum you have to reach a point of saying ‘good enough’.
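To put rough numbers on the disk space side of that question, here is a back-of-the-envelope sketch of uncompressed image size at different resolutions. The page size (US letter) and 24-bit color depth are my own assumptions for illustration; real master files will vary with format, compression and bit depth.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Estimate the raw (uncompressed) size of a scan in bytes.

    Pixel dimensions are the physical dimensions in inches multiplied
    by the scanning resolution; each pixel takes bits_per_pixel bits.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed example: a US letter page (8.5 x 11 inches) in 24-bit color.
    for dpi in (300, 600, 1200):
        size_mb = uncompressed_size_bytes(8.5, 11, dpi) / 1_000_000
        print(f"{dpi:>5} DPI: about {size_mb:,.0f} MB per page")
```

Because resolution applies in both dimensions, doubling the DPI quadruples the storage needed per page, which is why the jump from 600 to 1200 DPI is a much bigger commitment than it first sounds.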

When you consider the choices one might make short of perfect, you start to get into a gray area in which the following questions start to crop up:

  • How will lower image quality impact OCR accuracy?
  • Is one measure of lower quality simply a lower level of quality assurance (QA) to reduce the cost and increase the throughput?
  • How will expectations of available image resolution evolve over the next five years? What may seem ‘good enough’ now, may seem grainy and sad in a few years.
  • What do we add to the images to improve access? Transcription? TEI? Tagging? Translation?
  • How bad is it if you need to re-digitize something that is needed at a higher resolution on demand? How often will that actually be needed?
  • Will storing in JPEG2000 (rather than TIFF) save enough money from reduced disk space to make it worth the risk of a lossy format? Or is ‘visually lossless’ good enough?

Even the question of OCR accuracy is not so simple. In D-Lib Magazine’s article from the July/August 2009 issue titled Measuring Mass Text Digitization Quality and Usefulness the authors list multiple types of accuracy which may be measured:

  • Character accuracy
  • Word accuracy
  • Significant word accuracy
  • Significant words with capital letter start accuracy (i.e. proper nouns)
  • Number group accuracy
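To see why the choice of measure matters, here is a minimal sketch of the first two measures in the list above, computed with edit distance against a hand-corrected transcription. The formulas are a common convention and my own assumption, not necessarily the exact definitions used in the D-Lib article, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(truth, ocr):
    """1 minus (edit distance / length of the ground truth), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def character_accuracy(truth, ocr):
    return accuracy(truth, ocr)


def word_accuracy(truth, ocr):
    # Words are split on whitespace; punctuation handling is left out.
    return accuracy(truth.split(), ocr.split())


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the qu1ck brown fox"   # one misread character
    print(f"character accuracy: {character_accuracy(truth, ocr):.1%}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.1%}")
```

A single misread character leaves the character accuracy of this short sample near 95% but drops word accuracy to 75%, and for keyword searching it is the word-level number that tells you whether a patron will find the document.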

So many things to consider!

The primary goal of the digitization project I am focused on is to increase access to materials for those unable to travel to our repository. As I work with my colleagues to navigate the choices, I find myself floating towards the side of ‘good enough’ across the board. Even the process of deciding this blog post is done has taken longer than I meant it to. I publish it tonight with the hope of drawing a line in the sand and moving forward with the conversation. For me, it all comes back to what you are trying to accomplish.

I would love to hear about how others are weighing all these choices. How often have long term digitization programs shifted their digitization standards? What aspects of your goals are most dramatically impacting your priorities on the quality vs quantity scale?

Image Credit: Our lovely fortune teller is an image from the George Eastman House collection in the Flickr Commons, taken by Nickolas Muray in 1940 for use by McCall’s Magazine. [UPDATED 1/6/2019: Image no longer on Flickr, but is available in the Eastman Museum online collection.]

Digitization Program Site Visit: Archives of American Art

The image of Alexander Calder above shows him in his studio, circa 1950. It is from a folder titled Photographs: Calder at Work, 1927-1956, undated, part of Alexander Calder’s Papers held by the Smithsonian Archives of American Art and available online through the efforts of their digitization project. I love that this image captures him in his creative space – you get to see the happy chaos from which Calder drew his often sleek and sparse sculptures.

Back in October, I had the opportunity to visit with staff of the digitization program for the Smithsonian Archives of American Art along with a group of my colleagues from the World Bank. This is a report on that site visit. It is my hope that these details can help others planning digitization projects – much as it is informing our own internal planning.

Date of Visit: October 18, 2011

Destination: Smithsonian Archives of American Art

Smithsonian Archives of American Art Hosts:

Summary:  This visit was two hours in length and consisted of a combination of presentation, discussion and site tour to meet staff and examine equipment.

Background: The Smithsonian’s Archives of American Art (AAA) program was first funded by a grant from the Terra Foundation for American Art in 2005, recently extended through 2016. This funding supports both staff and research.

Their digitization project replaced their existing microfilm program and focuses on digitizing complete collections. Digitization focused on in-house collections (in contrast with collections captured on microfilm from other institutions across the USA as part of their microfilm program).

Over the course of the past 6 years, they have scanned over 110 collections – a total of 1,000 linear feet – out of an available total of 13,000 linear feet from 4,500 collections. They keep a prioritized list of what they want digitized.

The Smithsonian DAM (digital asset management system) had to be adjusted to handle the hierarchy of EAD and the digitized assets. Master files are stored in the Smithsonian DAM. Files stored in intermediate storage areas are only for processing and evaluation and are disposed of after they have been ingested into the DAM.

Current staffing is two and a half archivists and two digital imaging specialists. One digital imaging specialist focuses on scanning full collections, while the other focuses on on-demand single items.

The website is built in ColdFusion and pulls content from a SQL database. Currently they have no way to post media files (audio, oral histories, video) on the external web interface.

They do not delineate separate items within folders. When feedback comes in from end users about individual items, this information is usually incorporated into the scope note for the collection, or the folder title of the folder containing the item. Full size images in both the image gallery and the full collections are watermarked.

They track the processing stats and status of their projects.

Standard Procedures:

Full Collection Digitization:

  • Their current digitization workflow is based on their microfilm process. The workflow is managed via an internal web-based management system. Every task required for the process is listed, then crossed off and annotated with the staff and date the action was performed.
  • Collections earmarked for digitization are thoroughly described by a processing archivist.
  • Finding aids are encoded in EAD and created in XML using NoteTab Pro software.
  • MARC records are created when the finding aid is complete. The summary information from the MARC record is used to create the summary of the collection published on the website.
  • Box numbers and folder numbers are assigned and associated with a finding aid. The number of the box and folder are all a scanning technician needs.
  • A ‘scanning information worksheet’ provides room for notes from the archivist to the scanning technician. It provides the opportunity to indicate which documents should not be scanned. Possible reasons for this are duplicate documents or those containing personally identifiable information (PII).
  • A directory structure is generated by a script based on the finding aid, creating a directory folder for each physical folder which exists for the collection. Images are saved directly into this directory structure. The disk space to hold these images is centrally managed by the Smithsonian and automatically backed up.
  • All scanning is done in 600 dpi color, according to their internal guidelines. They frequently have internal projects which demand high resolution images for use in publication.
  • After scanning is complete, the processing archivist does the post scanning review before the images are pushed into the DAM for web publication.
  • Their policy is to post everything from a digitized collection, but they do support a take-down policy.
  • A recent improvement was made in January 2010, when they relaunched the site to show all of their collections, both digitized and non-digitized, in a single list.
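
The step above in which a script builds the directory structure from the finding aid can be sketched roughly as follows. This is only an illustration and not the script the Archives of American Art actually runs: the sample EAD fragment, the `box_NNN/folder_NNN` naming scheme, and the function names `folder_paths` and `create_directories` are all assumptions made for the example.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, heavily simplified EAD fragment (not from a real finding aid).
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <dsc>
      <c01 level="series">
        <c02 level="file">
          <did>
            <container type="Box">1</container>
            <container type="Folder">1</container>
            <unittitle>Correspondence, 1920-1925</unittitle>
          </did>
        </c02>
        <c02 level="file">
          <did>
            <container type="Box">1</container>
            <container type="Folder">2</container>
            <unittitle>Correspondence, 1926-1930</unittitle>
          </did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def folder_paths(ead_xml: str) -> list[str]:
    """Return one relative path per box/folder pair found in the finding aid."""
    root = ET.fromstring(ead_xml)
    paths = []
    for did in root.iter("did"):
        # Map container type (lowercased) to its number, e.g. {"box": "1", "folder": "2"}.
        containers = {
            (c.get("type") or "").lower(): (c.text or "").strip()
            for c in did.findall("container")
        }
        box, folder = containers.get("box"), containers.get("folder")
        if box and folder:
            # Assumed naming scheme: zero-padded so directories sort in shelf order.
            path = f"box_{box.zfill(3)}/folder_{folder.zfill(3)}"
            if path not in paths:
                paths.append(path)
    return paths


def create_directories(ead_xml: str, scan_root: Path) -> list[Path]:
    """Create one directory per physical folder under scan_root."""
    created = []
    for rel in folder_paths(ead_xml):
        target = scan_root / rel
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for directory in create_directories(SAMPLE_EAD, Path(tmp)):
            print(directory.relative_to(tmp))
```

Deriving the directories from the finding aid means the scanning technician only has to match the box and folder number on the physical folder to a directory that already exists, which fits the workflow described in the list above.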

On Demand Digitization:

  • Patrons may request the digitization of individual items.
  • These requests are evaluated by archivists to determine if it is appropriate to digitize the entire folder (or even box) to which the item belongs.
  • Requests are logged in a paper log.
  • Item level scanning ties back to an item level record with an item ID. There is an ‘Online Removal Notice’ to create an item level stub.
  • An item level cataloger describes the content after it is scanned.
  • Unless there is an explicit copyright or donor restriction, the item is put online in the Image Gallery (which currently has 12,000 documents).
  • Access to images is provided by keyword searching.
  • Individual images are linked back to the archival description for the collection from which they came.

Improvements/Changes they wish for:

  • They currently have no flexibility to make changes in the database nimbly. It is a tedious process to change the display and each change requires a programmer.
  • They would like to consider a move to open source software or to use a central repository – though they have concerns about what other sacrifices this would require.
  • Show related collections, list connected names (currently the only options for discovery are an A-Z list of creators or keyword search).
  • Ability to connect to guides and other exhibits.


Image Credit: Alexander Calder papers, Archives of American Art, Smithsonian Institution.

Digitization Program Site Visit: University of Maryland

I recently had the opportunity to visit with staff of the University of Maryland, College Park’s Digital Collections digitization program along with a group of my colleagues from the World Bank. This is a report on that site visit. It is my hope that these details can help others planning digitization projects – much as it is informing our own internal planning.

Date of Visit: October 13, 2011

Destination: University of Maryland, Digital Collections

University of Maryland Hosts:

Summary: This visit was two hours in length and consisted of a one-hour presentation and Q&A session with Jennie Levine Knies, Manager of Digital Collections, followed by a one-hour tour and Q&A session with Alexandra Carter, Digital Imaging Librarian.

Background: The Digital Collections of the University of Maryland was launched in 2006 using Fedora Commons. It is distinct from the ‘Digital Repository at the University of Maryland’, aka DRUM, which is built on DSpace. DRUM contains faculty-deposited documents, a library-managed collection of UMD theses and dissertations, and collections of technical reports. The Digital Collections project focuses on digitization of photographs, postcards, manuscripts & correspondence – mostly based on patron demand. In addition, materials are selected for digitization based on the need for thematic collections to support events, such as their recent Civil War exhibition.

After a period of full funding, a drop in funding has prevented any additional changes to the Fedora system.

Another project at UMD involves digitization of Japanese children’s books (Gordon W. Prange Collection) and currently uses “in house outsourcing”. In this scenario, contractors bring all their equipment and staff on site to perform the digitization process.

Standard Procedures:

  • Requests must be made using a combination of the ‘Digital Request Cover Sheet’ and ‘Digital Surrogate Request Sheet’. These sheets are then reviewed for completeness by the curator under whose jurisdiction the collection falls. Space on the request forms is provided so that the curator may add additional notes to aid in the digitization process. The curator decides if it is worth digitizing an entire folder when only specific item(s) are requested. Standard policy is to aim for a two-week turnaround for digitization based on patron request.
  • The digital request is given a code name for easy reference. They choose these names alphabetically.
  • Staff are assigned to digitize materials. This work is often done by student workers using one of three Epson 10000 XL flatbed scanners. There is also a Zeutschel OS 12000 overhead scanner available for materials which cannot be handled by the flatbed scanners.
  • Alexandra reviews all scans for quality.
  • Metadata is reviewed by another individual.
  • When both the metadata & image quality have been reviewed, materials are published online.

Improvements/Changes they wish for:

  • Easier way to create a web ‘home’ for collections. Currently many do not have a main page, and creating one requires the involvement of the IT department.
  • Option for users to save images being viewed
  • Option to upload content to their website in PDF format
  • Way to associate transcriptions with individual pages
  • More granularity for workflow: currently the only status they have to indicate that a folder or item is ready for review is ‘Pending’. Since there are multiple quality control activities that must be performed by different staff, they must keep manual lists to track which phases of QA are complete for which digitized content.
  • Reduce data entry.
  • Support for description at both the folder and item level at the same time. Currently description is only permitted either at the folder level OR at the item level.
  • Enable search and sorting by date added to system. This data is captured, but not exposed.
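
To make the workflow-granularity wish above concrete, here is a minimal sketch of per-item tracking for the two review steps described in the standard procedures (image quality review and metadata review). It is not part of the University of Maryland's Fedora system; the `QAPhase` and `DigitizedItem` names, the sample identifier, and the status strings are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class QAPhase(Enum):
    """Review steps that must be finished before publication (assumed set)."""
    IMAGE_QUALITY = "image quality review"
    METADATA = "metadata review"


@dataclass
class DigitizedItem:
    identifier: str
    completed: set = field(default_factory=set)

    def complete(self, phase: QAPhase) -> None:
        """Record that one review phase has been finished."""
        self.completed.add(phase)

    def pending_phases(self) -> list:
        """List the phases still outstanding, in definition order."""
        return [phase for phase in QAPhase if phase not in self.completed]

    def status(self) -> str:
        """Replace a single 'Pending' flag with a per-phase status."""
        pending = self.pending_phases()
        if not pending:
            return "Ready to publish"
        return "Pending: " + ", ".join(phase.value for phase in pending)


if __name__ == "__main__":
    item = DigitizedItem("example-item-001")  # hypothetical identifier
    print(item.status())
    item.complete(QAPhase.IMAGE_QUALITY)
    print(item.status())
    item.complete(QAPhase.METADATA)
    print(item.status())
```

A status that names the outstanding phases would remove the need for the manual lists mentioned above, since each reviewer could filter for the items still waiting on their own step.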

Lessons Learned:

  • Should have adopted an existing metadata standard rather than creating their own.
  • People do not use the ‘browse terms’, so do not spend a lot of time working on them.


Image Credit: Women students in a green house during a Horticulture class at the University of Maryland, 1925. University Archives, Special Collections, University of Maryland Libraries

SXSW Panel Proposal – Archival Records Online: Context is King

I have a panel up for evaluation on the SXSW Interactive Panel Picker titled Archival Records Online: Context is King. The evaluation process for SXSW panels is based on a combination of staff choice, advisory board recommendations and public votes. As you can see from the pie chart shown here (thank you SXSW website for the great graphic), public votes account for 30% of the selection decision. That is where you come in. Voting is open through 11:59 pm Central Daylight Time on Friday, September 2. To vote in favor of my panel, all you need to do is create a free account over on SXSW Panel Picker and then find Archival Records Online: Context is King and give it a big thumbs up.

If my panel is selected, I intend to use the session to address all of the following questions:

  1. What are the special design requirements of archival records?
  2. What are the biggest challenges to publishing archival records online?
  3. How can archivists, designers and developers collaborate to build successful web sites?
  4. Why is metadata important?
  5. How can search engine optimization (SEO) inform the design process?

All of this ties into what I have been pondering, writing about and researching for the past few years related to getting archival records online. So many people are doing such amazing work in this space. I want to show off the best of the best and give attendees some takeaways to help them build websites that make it easy to see the context of anything they find in their search.

While archival records have a very particular dependence on the effective communication of context – I also think that this is a lesson that can improve interface design across the board. These are issues that UI and IA folks are always going to be worrying about. SXSW is such a great opportunity for cross pollination. Conferences outside the normal archives, records management and library conference circuit give us a chance to bring fresh eyes and attention to the work being done in our corner of the world.

If you like the idea of this session, please take a few minutes to go sign up at the SXSW Panel Picker and give Archival Records Online: Context is King a thumbs up. You don’t need to be planning to attend in order to cast your vote, though after you start reading through all the great panel ideas you might change your mind!