access | Spellbound Blog

Chapter 8: Preparing and Releasing Official Statistical Data by Professor Natalie Shlomo

January 26, 2019

Chapter 8 of Partners for Preservation is ‘Preparing and Releasing Official Statistical Data’ by Professor Natalie Shlomo. This is the first chapter of Part III: Data and Programming. I knew early in the planning for the book that I wanted a chapter that talked about privacy and data.

During my graduate program, in March of 2007, Google announced changes to their log retention policies. I was fascinated by the implications for privacy. At the end of my reflections on Google’s proposed changes, I concluded with:

“The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seem destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.”

While developing my chapter list for the book, I followed my curiosity about how the field of statistics preserves privacy and how these approaches might be applied to historical data preserved by archives. Fields of research that rely on the use of statistics and surveys have developed many techniques for balancing the desire for useful data with the expectations of confidentiality by those who participate in surveys and censuses. This chapter taught me that “statistical disclosure limitation”, or SDL, aims to prevent the disclosure of sensitive information about individuals.

This short excerpt gives a great overview of the chapter:

“With technological advancements and the increasing push by governments for open data, new forms of data dissemination are currently being explored by statistical agencies. This has changed the landscape of how disclosure risks are defined and typically involves more use of perturbative methods of SDL. In addition, the statistical community has begun to assess whether aspects of differential privacy which focus on the perturbation of outputs may provide solutions for SDL. This has led to collaborations with computer scientists”
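To make "perturbation of outputs" a little more concrete, here is a minimal sketch of one well-known approach from the differential privacy literature: adding Laplace noise to a count before releasing it. This is my own illustration and is not taken from the chapter. The function name, the epsilon value and the count of 100 are all assumptions made up for the example.

```python
import random


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon."""
    rate = epsilon / sensitivity
    # The difference of two independent exponential draws with the same
    # rate follows a Laplace distribution centred on zero.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


# Hypothetical example: release the same count many times to see that the
# noise hides the exact value in any one release but averages out overall.
rng = random.Random(0)
released = [noisy_count(100, epsilon=0.5, rng=rng) for _ in range(20000)]
average_release = sum(released) / len(released)
```

A smaller epsilon means more noise and stronger protection, at the cost of less precise published figures. That trade-off between useful data and confidentiality is the same balance described above.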

Almost eighty years ago, the woman in the photo above used a keypunch to tabulate the US Census. The amount of detailed, hands-on labor required to gather that data boggles the mind in comparison to the born-digital data collection techniques now possible. The 1940 census was released in 2012 and is available online for free through a National Archives website. As archives face the onslaught of born-digital data tied to individuals, the techniques used by statisticians will need to become a familiar tool for archivists seeking to increase access to data while respecting the privacy of those who might be identified through unfettered access to the data. This chapter serves as a solid introduction to SDL, as well as a look forward to new ideas in the field. It also ties back to topics in Chapter 2: Curbing The Online Assimilation Of Personal Information and Chapter 5: The Internet Of Things.

Bio:

Natalie Shlomo (BSc, Mathematics and Statistics, Hebrew University; MA, Statistics, Hebrew University; PhD, Statistics, Hebrew University) is Professor of Social Statistics at the School of Social Sciences, University of Manchester. Her areas of interest are in survey methods, survey design and estimation, record linkage, statistical disclosure control, statistical data editing and imputation, non-response analysis and adjustments, adaptive survey designs and small area estimation. She is the UK principal investigator for several collaborative grants from the 7th Framework Programme and H2020 of the European Union all involving research in improving survey methods and dissemination. She is also principal investigator for the Leverhulme Trust International Network Grant on Bayesian Adaptive Survey Designs. She is an elected member of the International Statistical Institute and a fellow of the Royal Statistical Society. She is an elected council member and Vice-President of the International Statistical Institute. She is associate editor of several journals, including International Statistical Review and Journal of the Royal Statistical Society, Series A. She serves as a member of several national and international advisory boards.

Image source: A woman using a keypunch to tabulate the United States Census, circa 1940. National Archives Identifier (NAID) 513295 https://commons.wikimedia.org/wiki/File:Card_puncher_-_NARA_-_513295.jpg

Chapter 2: Curbing the Online Assimilation of Personal Information by Paulan Korenhof

December 17, 2018

The second chapter in Partners for Preservation is ‘Curbing the Online Assimilation of Personal Information’ by Paulan Korenhof. Given the amount of attention being focused on the right to be forgotten and the EU General Data Protection Regulation (GDPR), I felt it was essential to include a chapter that addressed these topics. Walking the fine line between providing access to archival records and respecting the privacy of those whose personal information is included in the records has long been an archival challenge.

In this chapter, Korenhof documents the history of the right to be forgotten and the benefits and challenges of GDPR as it is currently being implemented. She also explores the impact of the broad and virtually instantaneous access to content online that the Internet has facilitated.

This quote from the chapter highlights a major issue with making so much content available online, especially content that is being digitized or surfaced from previously offline data sources:

“With global accessibility and the convergence of different contextual knowledge realms, the separating power of space is nullified and the contextual demarcations that we are used to expecting in our informational interactions are missing.”

As the second chapter in Part 1: Memory, Privacy, and Transparency, it continues to pull these ideas together. In addition to providing a solid grounding in the right to be forgotten and GDPR, it should guide the reader to explore the unintended consequences of the mad rush to put everything online and the dramatic impact that search engines (and their human-coded algorithms) have on what is seen.

I hope this chapter triggers more contemplation of these issues by archivists within the big picture of the Internet. Often we are so focused on improving access to content online that these questions about the broader impact are not considered.

Bio

Paulan Korenhof is in the final stages of her PhD-research at the Tilburg Institute for Law, Technology, and Society (TILT). Her research is focused on the manner in which the Web affects the relation between users and personal information, and the question to what degree the Right to Be Forgotten is a fit solution to address these issues. With a background in philosophy, law, and art, she investigates this relation from an applied phenomenological and critical theory perspective. Occasionally she co-operates in projects with Hacklabs and gives privacy awareness workshops to diverse audiences. Recently she started working at the Amsterdam University of Applied Sciences (HVA) as a researcher on Legal Technology.

Image credit: Flickr Commons: British Library: Image taken from page 5 of ‘Forget-Me-Nots. [In verse.]’: https://www.flickr.com/photos/britishlibrary/11301997276/

Caffè Lena History Project’s Searchable Database

May 13, 2014

Caffè Lena opened in Saratoga Springs, NY in May of 1960. Since then, the coffee house has kept its doors open, featuring predominantly performances by folk musicians. Often the performers were at the start of their careers. The café has featured now-familiar songwriters including Bob Dylan, Arlo Guthrie, Ani DiFranco, and Kate and Anna McGarrigle – to name just a few. After the death of the founder, Lena Spencer, in 1989, Caffè Lena was converted to a non-profit institution.

The Caffè Lena History Project has launched an online searchable database for the complete Caffè Lena collection. The processing of this collection was made possible with support from The Andrew W. Mellon Foundation, administered through the Council on Library and Information Resources’ Cataloging Hidden Special Collections and Archives Project. The digitization of the material was made possible through generous funding from the EMC Corporation.

This collection is physically in many places, but the Omeka-based website serves as a centralized index for browsing and discovering the rich set of papers, audio recordings, photographs and ephemera documenting the history of this performance space. The database was architected by Monte Evans, who managed to bring together all the large and disparate data sets and organize them within Omeka. This database is the third part of the overall history project, which also produced a three-CD set of audio recordings and a lavish book documenting the history of the coffee house through stories and many previously unseen photos.

Over the course of more than 10 years, this project has been a labor of love by Jocelyn Arem to hunt for long lost caches of materials – in attics and garages and archives across the country. Then her efforts moved on to digitization and promotion of the amazing materials she found. She was supported by many people (in addition to the funders) who helped move the work forward — including dedicated community members, volunteers, the Caffè Lena Board of Directors, and friends around the country who shared their expertise and support.

The video above is the ‘trailer’ for the project and does a great job introducing both Caffè Lena and the history project through interviews, photographs and audio clips.

The Database

The contents of the website are divided into three sections — recordings, ephemera and photographs. While it is possible to search across the full array of contents, there are additional options for filtering by tags and sorting within each sub-section of the database.

Of all the materials gathered during this project, the audio recordings were the most elusive. In many cases it required detective work to track down recordings that were remembered by some and forgotten by others. Recordings were donated from as far away as Ontario and Ohio, and digitized by the GRAMMY Award-winning Magic Shop Studio in NYC. Jocelyn Arem shared with me one of her favorite stories of serendipity during the search for recordings – that of the ZBS Radio Series tapes. A former ZBS producer in Panama, Robert Durand, contacted them because he heard about the project online. He connected them with an engineer at ZBS in New York. As a result they received the edited Caffè Lena ZBS collection. However, they always wondered where the unedited tapes had ended up. A few months later, Jocelyn followed a trail to another engineer who had left a note at the Caffè years earlier saying he had tapes to donate. Along with former Caffè Lena board member (and audio tape donor) Dick Kavanaugh, she drove to the mountains of upstate NY to retrieve the tapes from this new donor and lo and behold … there was the unedited Caffè Lena tape collection!

In the Recordings section of the site, you can find descriptive information about 514 recordings now held by the Library of Congress’s American Folklife Center. The browse interface lets you filter by using a drop-down list of artist names. Occasionally the entries include short audio samples you can listen to online, such as this one for Kate and Anna McGarrigle recorded during the 20th Anniversary Music Festival described on the flyer shown here. One thing that surprised me is that sometimes an image will lead to a multi-page PDF. In the Recordings section this was common, and the PDFs of the tape transfer reports include detailed notes about the recording.

The Ephemera section of the site features 32 boxes of materials from five separate collections, listed below.

the Lena Spencer Papers, Performer Files and Jan Nargi Collections — all held by the Saratoga Springs History Museum,
the Board of Directors Collection held by the American Folklife Center
the Lively Lucys Coffeehouse Collection held by Skidmore College

Each collection can be browsed individually, with its own dedicated options for filtering by tags. Here it was a bit more obvious that the image you saw was likely just the first page of a multi-page folder digitized and presented as a single PDF.

Finally, the Photographs section includes over 6,000 black and white photographs made by Joe Alper at Caffè Lena between 1960 and 1967. These images were catalogued and digitized by Edward Elbers and are held by the Joe Alper Photo Collection LLC. This section lets you filter by ‘artist’. For example, there are 296 photos of Bob Dylan.

Tagging and Controlled Vocabularies

One of the recurring challenges for those tagging content from multiple sources is the different versions of terms that mean the same thing. Controlled vocabularies are hard to enforce across different collections held by different organizations or individuals. In the case of the Caffè Lena History Project, I noticed that there were different values used for tagging across the materials. For example, the values used to tag and populate the list of recording artists’ names exactly match the names listed on the tape transfer reports. In some cases the same artist may be listed in multiple ways. For example, there are entries for The McGarrigles, Kate McGarrigle and Anna McGarrigle. Another example can be found in the tagging for materials related to Bob Dylan. Bob Dylan, bob, Dylan and Dylan Bob are all tags that give you different subsets of the search results that just searching on Dylan provides.

These types of issues are often compounded when materials come from so many different sources and through many different hands along the way. That said, the combination of artist lists, tags and search functions makes it easy to discover materials related to your favorite folk music artists. Just keep in mind that looking for multiple versions of a performer’s name might help you find more materials.
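For anyone wrestling with the same problem in their own collection, here is a minimal sketch of one common workaround: keep a small lookup table that maps each observed tag variant to a single preferred form, then group items under the preferred form. This is not how the Caffè Lena database works. The mapping, the function names and the item identifiers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup table: lower-cased variant -> preferred form.
PREFERRED = {
    "bob dylan": "Bob Dylan",
    "dylan bob": "Bob Dylan",
    "dylan": "Bob Dylan",
    "the mcgarrigles": "Kate and Anna McGarrigle",
}


def canonical_tag(tag):
    """Return the preferred form of a tag, or the trimmed tag if unknown."""
    key = " ".join(tag.lower().split())  # ignore case and stray spaces
    return PREFERRED.get(key, tag.strip())


def group_by_canonical(tagged_items):
    """Group (item_id, tag) pairs under each tag's preferred form."""
    groups = defaultdict(set)
    for item_id, tag in tagged_items:
        groups[canonical_tag(tag)].add(item_id)
    return dict(groups)


# Made-up identifiers, just to show the variants collapsing together.
items = [("photo-1", "Bob Dylan"), ("photo-2", "Dylan Bob"), ("tape-3", " dylan ")]
grouped = group_by_canonical(items)
```

The lookup table still has to be built and maintained by a person who knows the material, which is exactly the hard part when collections are held by different organizations.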

Other Virtual Archives

The approach of creating a single website to unify materials that are not co-located but do all relate to a single unifying theme reminded me of two earlier projects: The Publishers’ Bindings Online (PBO) and the Greene & Greene Virtual Archive.

The Publishers’ Bindings Online project now features over 13,000 images online of over 5,000 book bindings from 1815 to 1930. These books are held in libraries at multiple institutions and the project’s success at (and challenges with) using a single unified vocabulary for tagging was discussed in detail during a session at SAA in 2007: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building. I used the subject vocabulary to find a book cover featuring polar bears.

The Greene & Greene Virtual Archive presents materials of the southern California design firm Greene & Greene. Active from 1894-1922, they are associated with the architecture and craftsmanship of the American Arts and Crafts Movement. The Greene & Greene website presents images and metadata of a selected set of 4000 items held by four different repositories. They also provide links to the full descriptions of materials held by each of the repositories. The search functionality on the website is geared towards exploration of individual architectural projects, but also permits advanced search by topic, repository, location, document type and date. Unlike PBO and Caffè Lena, this site doesn’t expose the tagging that lets the results be returned by these groupings.

My Personal Connection

Finally, I would like to share my personal connection to this project. Jocelyn contacted me over a year ago while in the final stages of working on the book to ask if my father had taken a particular photo of Loudon Wainwright III. My father was his manager at the start of his career and did take many photos of him, though not the one in question.

After the launch of the site, I was curious to see what might be in the database related to Loudon and perhaps my father. I found this great photo in the Ephemera file for Wainwright Loudon. My father is the gentleman on the left with the mustache!

More about Caffè Lena

The New York Times published a great article back in 2013 that talks all about the history project and the two products which preceded the online database. In September 2013, the history project created a three-CD box set featuring the best of the historical audio material: Live at Caffe Lena: Music From America’s Legendary Coffeehouse, 1967-2013.

The book was released in October 2013. Soon to be available via a second printing direct from the publisher, you can still find copies of the first printing of Caffe Lena: Inside America’s Legendary Folk Music Coffeehouse on Amazon.com.

There is a traveling exhibit that can be brought to your local venue with tons of details available online. You can also subscribe to an online newsletter. If you have a project for which you would like to use any of the materials (audio included) – there is a form for making licensing requests.

Of course, one of the best sources of information about Caffè Lena and its founder is the set of materials featured on the history project website.

Finally, do you have materials to donate to the archive? You can contact the Caffè Lena History Project team directly via this online form.

Image Credits: Courtesy of the Caffè Lena Collection and the Saratoga Springs History Museum

The CODATA Mission: Preserving Scientific Data for the Future

February 18, 2013

This session was part of The Memory of the World in the Digital Age: Digitization and Preservation conference and aimed to describe the initiatives of the Data at Risk Task Group (DARTG), part of the Committee on Data for Science and Technology (CODATA), a body of the International Council for Science.

The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf life. DARTG is seeking out sources of such data worldwide, knowing that many are irreplaceable for research into the long-term trends that occur in the natural world.

Organizing Data Rescue

The first speaker was Elizabeth Griffin from Canada’s Dominion Astrophysical Observatory. She spoke of two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. (PDF of session slides)

The “memory of the world” is vast and extends back for aeons of time, but only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It is the analog data which are considered to be “at risk” and which are the task group’s immediate concern.

Many pre-digital records have never made it into a digital form. Even some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot be located at all easily.

How can such “data at risk” be recovered and made useable? The design of an efficient rescue package needs to be based upon the big picture, so a website has been set up to create an inventory where anyone can report data-at-risk. The Data-at-Risk Inventory (built on Omeka) is front-ended by a simple form that asks for specific but fairly obvious information about the datasets, such as field (context), type, amount or volume, age, condition, and ownership. After a few years DARTG should have some better idea as to the actual amounts and distribution of different types of historic analog data.
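As a rough illustration of what one inventory entry might capture, here is a minimal sketch of a record holding the kinds of information the form asks for. The class name, attribute names and sample values are my own assumptions and do not reflect the actual Data-at-Risk Inventory or its Omeka configuration.

```python
from dataclasses import dataclass


@dataclass
class AtRiskDataset:
    """One hypothetical inventory entry describing a dataset at risk."""
    discipline: str   # field or context, e.g. a branch of science
    data_type: str    # e.g. paper charts, photographic plates, tapes
    volume: str       # amount, in whatever unit the reporter knows
    age: str          # approximate age or date range of the records
    condition: str    # physical state of the records
    ownership: str    # who holds or owns the records

    def missing_fields(self):
        """List the attributes the reporter left blank."""
        return [name for name, value in vars(self).items() if not value]


# Made-up entry with two blanks, to show how gaps could be flagged.
entry = AtRiskDataset(
    discipline="hydrology",
    data_type="paper charts",
    volume="",
    age="several decades",
    condition="boxed, unsorted",
    ownership="",
)
blanks = entry.missing_fields()
```

Flagging blank fields rather than rejecting the entry fits the spirit of the inventory as described: a partial report of data at risk is still more useful than no report at all.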

Help and support are needed to advertise the Inventory. A proposal is being made to link data-rescue teams from many scientific fields into an international federation, which would be launched at a major international workshop. This would give a permanent and visible platform to the rescue of valuable and irreplaceable data.

The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records. There will be many benefits, often cross-disciplinary, sometimes unexpected, and perhaps surprising. Some will have economic pay-offs, as in the case of some uncovered pre-digital records concerning the mountain streams that feed the reservoirs of Cape Town, South Africa. The mountain slopes had been deforested a number of years ago and replanted with “economically more appealing” species of tree. In their basement, hydrologists found stacks of papers containing 73 years of stream-flow measurements. They digitized all the measurements, analyzed the statistics, and discovered that the new but non-native trees used more water. The finding clearly held significant importance for the management of Cape Town’s reservoirs. For further information about the stream-flow project see Jonkershoek – preserving 73 years of catchment monitoring data by Victoria Goodall & Nicky Allsopp.

DARTG is building a bibliography of research papers which, like the Jonkershoek one, describe projects that have depended partly or completely on the ability to access data that were not born-digital. Any assistance in extending that bibliography would be greatly appreciated.

Several members of DARTG are themselves engaged in scientific pursuits that seek long-term data. The following talks describe three such projects.

Data Rescue to Increase Length of the Record

The second speaker, Patrick Caldwell from the US National Oceanographic Data Center (NODC), spoke on rescue of tide gauge data. (PDF of full paper)

He started with an overview of water level measurement, explaining how an analog trace (a line on a paper-style record generated by a float with a timer) is generated. Tide gauges include a geodetic survey benchmark to make sure that the land isn’t moving. The University of Hawaii maintains a network of gauges internationally. Back in the 1800s, they were keeping track of the tides and sea level for shipping. You never know what the application may turn into – they collected for tides, but in the 1980s they started to see patterns. They used tide gauge measurements to discover El Niño!

As you increase the length of the record, the trustworthiness of the data improves. Within sea level variations, there are some changes that are on the level of decades. To take that shift out, they need 60 years to track sea level trends. They are working to extend the length of the record.
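To see why record length matters, here is a small sketch using entirely synthetic data (not tide gauge measurements from the talk). It builds a series with a steady rise plus a 20-year oscillation, then fits a straight line to a 15-year window and a 60-year window. The trend, amplitude and period values are assumptions chosen only to make the effect visible.

```python
import math

TRUE_TREND = 2.0   # assumed steady rise, in mm per year
AMPLITUDE = 50.0   # assumed size of the decadal swing, in mm
PERIOD = 20.0      # assumed length of the decadal cycle, in years


def synthetic_sea_level(years):
    """Annual values: a straight-line rise plus a slow oscillation."""
    return [
        TRUE_TREND * t + AMPLITUDE * math.sin(2 * math.pi * t / PERIOD)
        for t in range(years)
    ]


def linear_trend(values):
    """Ordinary least-squares slope of values against year index."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var


# How far each fitted trend lands from the trend we built in.
short_error = abs(linear_trend(synthetic_sea_level(15)) - TRUE_TREND)
long_error = abs(linear_trend(synthetic_sea_level(60)) - TRUE_TREND)
```

In this toy setup the short window mistakes part of the oscillation for a trend, while the longer window averages most of it away. Real sea level records are messier, but the same reasoning explains the interest in records of 60 years or more.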

The UNESCO Joint Technical Commission for Oceanography & Marine Meteorology has the Global Sea Level Observing System (GLOSS).

GLOSS has a series of Data Centers:

Permanent Service for Mean Sea Level (monthly)
Joint Archive for Sea Level (hourly)
British Oceanographic Data Centre (high frequency)

The biggest holdings start in the 1940s. They want to increase the number of longer records. A student in France documented where he found records as he hunted for the data he needed. Oregon students documented records available at NARA.

Global Oceanographic Data Archaeology and Rescue (GODAR) and the World Ocean Database Project

The Historic Data Rescue Questionnaire created in November 2011 resulted in 18 replies from 14 countries documenting tide gauge sites with non-digital data that could be rescued. They are particularly interested in the records that are 60 years or more in length.

Future Plans: Move away from identifying what is out there to tackling the rescue aspect. This needs funding. They will continue to search repositories for data-at-risk and continue collaboration with GLOSS/DARTG to freshen on-line inventory. Collaborate with other programs (Atmospheric Circulation Reconstructions over the Earth (ACRE) meeting 11-2012). Eventually move to Phase II = recovery!

The third speaker, Stephen Del Greco from the US NOAA National Climatic Data Center (NCDC), spoke about environmental data through time and extending the climate record. (PDF of full paper) The NCDC is a weather archive with headquarters in Asheville, NC. It fulfills much of the nation’s climate data requirements. Their data comes from many different sources. They provide safe storage of over 5,600 terabytes of climate data (= 6.5 billion Kindle books). How will they handle the upcoming explosion of data on the way? They need to both handle new content coming in AND provide increased access to larger amounts of data being downloaded over time. The 2011 number = data download of 1,250 terabytes for the year. They expect that download number to increase 10-fold over the next few years.

The Climate Database Modernization Program went on for more than a decade rescuing data. It was well funded, and millions of records were rescued with a budget of roughly 20 million a year. The goal is to preserve and make major climate and environmental data available via the World Wide Web. Over 14 terabytes of climate data are now digitized. 54 million weather and environmental images are online. Hundreds of millions of records are digitized and now online. The biggest challenge was getting the surface observation data digitized. NCDC digital data for hourly surface observations generally stretch back to around 1948. Some historical marine observations go back to the spice trade records.

For international efforts they bring their imaging equipment to other countries where records were at risk. 150,000 records imaged under the Climate Database Modernization Program (CDMP).

Now they are moving from public funding to citizen-fueled projects via crowdsourcing such as the Zooniverse Program. Old Weather is a Zooniverse Project which uses crowdsourcing to digitize and analyze climate data. For example, the transcription done by volunteers helps scientists model Earth’s climate using wartime ship logs. The site includes methods to validate efforts from citizens. They have had almost 700,000 volunteers.

Long-term Archive Tasks:

Rescuing Satellite Data: raw images in lots of different film formats. All this is at risk. Need to get it all optically imaged. Looking at a ‘citizen alliance’ to do this work.
Climate Data Records: Global Essential Climate Variables (ECVs) with Heritage Records. Lots of potential records for rescue.
Rescued data helps people building proxy data sets: NOAA Paleoclimatology. ‘Paleoclimate proxies’ – things like boreholes, tree rings, lake levels, pollen, ice cores and more. For example – getting temperature and carbon dioxide from ice cores. These can go back 800,000 years!

We have extended the climate record through international collaboration. For example, the Australian Bureau of Meteorology provided daily temperature records for more than 1,500 additional stations. This meant a more than 10-fold increase in previous historical climate daily data holdings from that country.

Born Digital Maps

The final presentation discussed the map as a fundamental source of memory of the world, delivered by D. R. Fraser Taylor and Tracey Lauriault from Carleton University’s Geomatics and Cartographic Research Center in Canada. The full set of presentation slides is available online on SlideShare. (PDF of full paper)

We are now moving into born digital maps. For example, the Canadian Geographic Information System (CGIS) was created in the 1960s and was the world’s first GIS. Maps are ubiquitous in the 21st century. All kinds of organizations are creating their own maps and mash-ups. Community-based NGOs, citizen science, academic and private sector are all creating maps.

We are losing born digital maps almost faster than we are creating them. We have lost 90% of the born digital maps. Above all there is an attitude that preservation is not intrinsically important. No-one thought about the need to preserve the map – everyone thought someone else would do it. There was a complete lack of thought related to the preservation of these maps.

The Canada Land Inventory (CLI) was one of the first and largest born digital map efforts in the world. It mapped 2.6 million square kilometers of Canada. Lost in the 1980s. No-one took responsibility for archiving. Those who thought about it believed backup equaled archiving. A group of volunteers rescued the process over time – salvaged from boxes of tapes and paper in the mid-1990s. It was caught just in time and took a huge effort. 80% has been saved and is now online. This was rescued because it was high profile. What about the low-profile data sets? Who will rescue them? No-one.

The 1986 BBC Domesday Project was created in celebration of 900 years after William the Conqueror’s original Domesday Book. It was obsolete by the 1990s. A huge amount of social and economic information was collected for this project. In order to rescue it they needed an Acorn computer and needed to be able to read the optical disks. The platform was emulated in 2002-2003. It cost 600,000 British pounds to reverse engineer and put online in 2004. New discs made in 2003 at the UK Archive.

It is easier to get Ptolemy’s maps from the 15th century than it is to get a map 10 years old.

The Inuit Siku (sea ice) Atlas, an example of a Cybercartographic atlas, was produced in cooperation with Inuit communities. Arguing that the memory of what is happening in the north lies in the minds of the elders, they are capturing the information and putting it out in multi-media/multi-sensory map form. The process is controlled by the community themselves. They provide the software and hardware. They created a graphic tied to the Inuit terms for different types of sea ice. In some cases they record the audio of an elder talking about a place. The narrative of the route becomes part of the atlas. There is no right or wrong answer. There are many versions and different points of view. All are based on the same set of facts – but they come from different angles. The atlases capture them all.

The Gwich’in Place Name Atlas is building the idea of long-term preservation into the application from the start.

The Cybercartographic Atlas of the Lake Huron Treaty Relationship Process is taking data from surveyors’ diaries from the 1850s.

There are lots of Government of Canada geospatial data preservation initiatives, but in most cases there is a lot of rhetoric and not so much action. There have been many consultations, studies, reports and initiatives since 2002, but the reality is that apart from the Open Government Consultations (TBS), not very much has translated into action. Even in the case where there is legislation, lots of things look good on paper but don’t get implemented.

There are Library and Archives Guidelines working to support digital preservation of geospatial data. The InterPARES 2 (IP2) Geospatial Case Studies tackle a number of GIS examples, including the Cybercartographic Atlas of Antarctica. See the presentation slides online for more specific examples.

In general, preservation as an afterthought rarely results in full recovery of born digital maps. It is very important to look at open source and interoperable open specifications. Proactive archiving is an important interim strategy.

Geospatial data are fundamental sources of our memory of the world. They help us understand our geo-narratives (stories tied to location), counter colonial mappings, are the result of scientific endeavors, represent multiple worldviews and they inform decisions. We need to overcome the challenges to ensure their preservation.

Q&A:

QUESTION: When I look at the work you are doing with recovering Inuit data from people. You recover data and republish it – who will preserve both the raw data and the new digital publication? What does it mean to try and really preserve this moving forward? Are we really preserving and archiving it?

ANSWER: No we are not. We haven’t been able to find an archive in Canada that can ingest our content. We will manage it ourselves as best we can. Our preservation strategy is temporary and holding, not permanent as it should be. We can’t find an archive to take the data. We are hopeful that we are moving towards finding a place to keep and preserve it. There is some hope on the horizon that we may move in the right directions in the Canadian context.

Luciana: I wanted to attest that we have all the data from InterPARES II. It is published in the final. I am jealously guarding my two servers that I maintain with money out of my own pocket.

QUESTION: Is it possible to have another approach to keep data where it is created, rather than a centralized approach?

ANSWER: We are providing servers to our clients in the north. Keeping copies of the database in the community where they are created. Keeping multiple copies in multiple places.

QUESTION: You mention surveys being sent out and few responses coming back. When you know there is data at risk – there may be governments that have records at risk that they are shy to reveal to the public? How do we get around that secrecy?

ANSWER: (IEDRO representative) We offer our help, rather than a request to get their data.

As is the case with all my session summaries, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Image Credit: NARA Flickr Commons image “The North Jetty near the Mouth of the Columbia River 05/1973”

Updated 2/20/2013 based on presenter feedback.

Grateful Dead Archive Online: First Impressions

July 9, 2012

The Grateful Dead Archive Online threw open its virtual doors in late June, 2012. This project has gotten a lot of attention from both the archives community and the Grateful Dead community. I got a message from my husband shortly after it went online directing me to the envelope shown above from the fan art section of the site. This was the envelope I helped decorate for our mail order ticket request sent back in January of 1992. The theory was that if you made your envelope beautiful, it was more likely to get pulled out of the pile of orders vying for a limited number of tickets. It worked for us this time – we plan to upload images of the tickets we received from that order (yes, we still have them!).

A little digging shows that the site is built on the open source Omeka platform. The prominent milestones timeline was built using the Neatline suite of Omeka plugins. The Omeka platform gives the site creator a lot of flexibility in what data is used to manage the collection.

The amount of metadata that the GDAO staff have populated on their 45,000 digitized items is quite impressive. They have tied the materials into the logical structure dictated by the Grateful Dead’s concerts. You can search for items related to a specific venue or a specific show by zooming in to locations on a map of the world to pick out individual venues where the Dead played. A wealth of media from the Internet Archive is tied into the site so that it is easy to find using the standard search mechanisms and cross linking based on metadata. The artists section features both photographers and poster artists. Two exhibits are in place for the site launch – one on Europe ’72 and the other on the Posters of the Grateful Dead Archive.

The resolution of the scanned fan art is amazing. Take a look at how far I could zoom in to the bluebell I drew way back when. I wonder what their default scanning resolution was?

The site also invites the Grateful Dead community to contribute items to the collection. They have a wish list for content to flesh out gaps they see in what they have.

These are the types available for selection when contributing content:

Audio
Image
Video
Your Story
Poster
Ticket
Laminate
Backstage Pass
Article

For each of the contributions, the user is asked for a mandatory Title, and optional Description, Date of Show, Venue Name and Venue Location. The contribution page also prompts the user to enter their name, e-mail, copyright and license.

For license, the user is given three options and encouraged to select one of the broader Creative Commons licenses rather than the more restrictive default license only granting rights to the University of California.

I am contributing this work and irrevocably grant a non-exclusive, perpetual, royalty-free, worldwide license for this work to the University of California Regents to display, distribute, reproduce, perform, or create derivatives works based upon it.
I am contributing this item under a Creative Commons Attribution (CC BY) License. Others are free to share, remix, or make commercial use of the work as long as they credit me.
I am contributing this item under a Creative Commons Attribution-NonCommercial (CC BY-NC) License. Others are free to share or remix the work noncommercially, as long as they credit me.

For the ‘Your Story’ option, the user is also prompted with the following:

How did you become a Deadhead?
What is your favorite Dead show, and why?
What is your favorite Dead song, and why?
What is your favorite aspect of the Dead scene?
What, if anything, do you think is important about the Dead, and about the Dead phenomenon?

I really loved that they provide a phone number which you can call and leave up to a three minute message which they will then transcribe for you. This looks to be an example (the first according to the comment) of a ‘Your Story’ submission.

The Advanced Search page gives us a full list of formats:

Album Cover
Article
Backstage Pass
Envelope
Fan Art
Fan Tape
Fanzine
Image
Laminate
Newsletter
Notebook
Poster
Program
Story
T-Shirt
Ticket
Sound
Video
Website

The search results let you filter by item type, creator name, venue, year and subject. I wish I could see a full list of the subjects. They seem to be a mix of named individuals associated with the band, events and song titles.

The Grateful Dead Archive Online is a great example of what can be done with good planning and the staff necessary to follow through with the vision. I appreciated the thought that clearly went into the copyright and license issues – both for content being contributed as well as content owned by the archive. I also see evidence of efforts to build a sense of community. The ‘Your Story’ contribution form specifically mentions that contributors should consider carefully what they share and how it might reflect on others. Each item offers the option to post comments as well as to add tags. It will be interesting to see how these communal aspects of the site grow over time. Many archives have to work to build community – but the Grateful Dead fan community has a long and strong history.

Finally – as I mentioned above, the item level description is impressive. I was amazed to note that the envelope shown above was linked to the two shows the request was for, the creation date was tied to the post mark, the extent was the envelope’s measurements and the citation included the name of the creator from the return address. And yes – we know that they misspelled Smyth as Smith. We have already let them know about the typo and received a prompt response with a promise to fix the spelling.

Digitization Quality vs Quantity: An Exercise in Fortune Telling

March 31, 2012

The quality vs quantity dilemma is high in the minds of those planning major digitization projects. Do you spend your time and energy creating the highest quality images of your archival records? Or do you focus on digitizing the largest quantity you can manage? Choosing one over the other has felt a bit like an exercise in fortune telling to me over the past few months, so I thought I would work through at least a few of the moving parts of this issue here.

The two ends of the spectrum are traditionally described as follows:

digitize at very high quality to ensure that you need not re-digitize later, create a high quality master copy from which all possible derivatives can be created later
digitize at the minimum quality required for your current needs, the theory being that this will increase the quantity of digitized records you can digitize

This sounds very well and good on the surface, but this is not nearly as black and white a question as it appears. It is not the case that one can simply choose one over the other. I suppose that choosing ‘perfect quality’ (whatever that means) probably drives the most obvious of the digitization choices. Highest resolution. 100% accurate transcription. 100% quality control.

It is the rare digitization project that has the luxury of time and money required to aim for such a definition of perfect. At what point would you stop noticing any improvement, while just increasing the time it takes to capture the image and the disk space required to store it? 600 DPI? 1200 DPI? Scanners and cameras keep increasing the dots per inch and the megapixels they can capture. Disk space keeps getting cheaper. Even at the top of the ‘perfect image’ spectrum you have to reach a point of saying ‘good enough’.
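To put rough numbers on the disk space side of that trade-off, here is a minimal back-of-the-envelope sketch. It assumes an uncompressed 24-bit color scan of a US letter size page (8.5 x 11 inches); the function name and the page size are my own choices for illustration, and real master files will differ once compression, bit depth and file format overhead are taken into account.

```python
def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Estimate the uncompressed size of a scan in bytes.

    Assumes 3 bytes per pixel (24-bit color) unless told otherwise.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Compare a few resolutions for a hypothetical letter size page.
for dpi in (300, 600, 1200):
    size_mb = uncompressed_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} DPI: about {size_mb:.0f} MB per page")
```

The point to notice is that storage grows with the square of the resolution: doubling the DPI quadruples the uncompressed size of every page.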

When you consider the choices one might make short of perfect, you start to get into a gray area in which the following questions start to crop up:

How will lower quality images impact OCR accuracy?
Is one measure of lower quality simply a lower level of quality assurance (QA) to reduce the cost and increase the throughput?
How will expectations of available image resolution evolve over the next five years? What may seem ‘good enough’ now, may seem grainy and sad in a few years.
What do we add to the images to improve access? Transcription? TEI? Tagging? Translation?
How bad is it if you need to re-digitize something that is needed at a higher resolution on demand? How often will that actually be needed?
Will storing in JPEG2000 (rather than TIFF) save enough money from reduced disk space to make it worth the risk of a lossy format? Or is ‘visually lossless’ good enough?

Even the question of OCR accuracy is not so simple. In D-Lib Magazine’s article from the July/August 2009 issue titled Measuring Mass Text Digitization Quality and Usefulness the authors list multiple types of accuracy which may be measured:

Character accuracy
Word accuracy
Significant word accuracy
Significant words with capital letter start accuracy (i.e. proper nouns)
Number group accuracy
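To make the difference between the first two measures concrete, here is a minimal sketch of one common way to compute them: count the edits needed to turn the OCR output into a trusted reference transcription, then divide by the length of the reference. This is an assumption on my part about how such scores are typically calculated, not necessarily the exact method used in the D-Lib article, and the function names and sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Share of the reference that survived, floored at zero."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1 - edit_distance(ref, hyp) / len(ref))


def character_accuracy(reference, ocr_text):
    """Accuracy measured one character at a time."""
    return accuracy(reference, ocr_text)


def word_accuracy(reference, ocr_text):
    """Accuracy measured one whitespace-separated word at a time."""
    return accuracy(reference.split(), ocr_text.split())


reference = "sea ice atlas"
ocr_text = "sea 1ce atlas"   # a single misread character
print(f"character accuracy: {character_accuracy(reference, ocr_text):.3f}")
print(f"word accuracy:      {word_accuracy(reference, ocr_text):.3f}")
```

A single misread character barely dents the character score but wipes out a whole word in the word score, which is why the same scan can look excellent by one measure and mediocre by another.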

So many things to consider!

The primary goal of the digitization project I am focused on is to increase access to materials for those unable to travel to our repository. As I work with my colleagues to navigate the choices, I find myself floating towards the side of ‘good enough’ across the board. Even the process of deciding this blog post is done has taken longer than I meant it to. I publish it tonight with the hope to draw a line in the sand and move forward with the conversation. For me, it all comes back to what you are trying to accomplish.

I would love to hear about how others are weighing all these choices. How often have long term digitization programs shifted their digitization standards? What aspects of your goals are most dramatically impacting your priorities on the quality vs quantity scale?

Image Credit: Our lovely fortune teller is an image from the George Eastman House collection in the Flickr Commons, taken by Nickolas Muray in 1940 for use by McCall’s Magazine. [UPDATED 1/6/2019: Image no longer on Flickr, but is available in the Eastman Museum online collection.]

Digitization Program Site Visit: Archives of American Art

February 3, 2012

The image of Alexander Calder above shows him in his studio, circa 1950. It is from a folder titled Photographs: Calder at Work, 1927-1956, undated, part of Alexander Calder’s Papers held by the Smithsonian Archives of American Art and available online through the efforts of their digitization project. I love that this image captures him in his creative space – you get to see the happy chaos from which Calder drew his often sleek and sparse sculptures.

Back in October, I had the opportunity to visit with staff of the digitization program for the Smithsonian Archives of American Art along with a group of my colleagues from the World Bank. This is a report on that site visit. It is my hope that these details can help others planning digitization projects – much as it is informing our own internal planning.

Date of Visit: October 18, 2011

Destination: Smithsonian Archives of American Art

Smithsonian Archives of American Art Hosts:

Karen Weiss
Barbara Aikens
Many additional staff members

Summary: This visit was two hours in length and consisted of a combination of presentation, discussion and site tour to meet staff and examine equipment.

Background: The Smithsonian’s Archives of American Art (AAA) program was first funded by a grant from the Terra Foundation for American Art in 2005, recently extended through 2016. This funding supports both staff and research.

Their digitization project replaced their existing microfilm program and focuses on digitizing complete collections. Digitization focused on in-house collections (in contrast with collections captured on microfilm from other institutions across the USA as part of their microfilm program).

Over the course of the past 6 years, they have scanned over 110 collections – a total of 1,000 linear feet – out of an available total of 13,000 linear feet from 4,500 collections. They keep a prioritized list of what they want digitized.

The Smithsonian DAM (digital asset management system) had to be adjusted to handle the hierarchy of EAD and the digitized assets. Master files are stored in the Smithsonian DAM. Files stored in intermediate storage areas are only for processing and evaluation and are disposed of after they have been ingested into the DAM.

Current staffing is two and a half archivists and two digital imaging specialists. One digital imaging specialist focuses on scanning full collections, while the other focuses on on-demand single items.

The website is built in ColdFusion and pulls content from a SQL database. Currently they have no way to post media files (audio, oral histories, video) on the external web interface.

They do not delineate separate items within folders. When feedback comes in from end users about individual items, this information is usually incorporated into the scope note for the collection, or the folder title of the folder containing the item. Full size images in both the image gallery and the full collections are watermarked.

They track the processing stats and status of their projects.

Standard Procedures:

Full Collection Digitization:

Their current digitization workflow is based on their microfilm process. The workflow is managed via an internal web-based management system. Every task required for the process is listed, then crossed off and annotated with the staff and date the action was performed.
Collections earmarked for digitization are thoroughly described by a processing archivist.
Finding aids are encoded in EAD and created in XML using NoteTab Pro software.
MARC records are created when the finding aid is complete. The summary information from the MARC record is used to create the summary of the collection published on the website.
Box numbers and folder numbers are assigned and associated with a finding aid. The number of the box and folder are all a scanning technician needs.
A ‘scanning information worksheet’ provides room for notes from the archivist to the scanning technician. It provides the opportunity to indicate which documents should not be scanned. Possible reasons for this are duplicate documents or those containing personally identifiable information (PII).
A directory structure is generated by a script based on the finding aid, creating a directory folder for each physical folder which exists for the collection. Images are saved directly into this directory structure. The disk space to hold these images is centrally managed by the Smithsonian and automatically backed up.
All scanning is done in 600dpi color, according to their internal guidelines. They frequently have internal projects which demand high resolution images for use in publication.
After scanning is complete, the processing archivist does the post scanning review before the images are pushed into the DAM for web publication.
Their policy is to post everything from a digitized collection, but they do support a take-down policy.
A recent improvement was made in January, 2010. At that time they relaunched the site to include all of their collections co-located on the same list, both digitized and non-digitized.
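I did not see the script that generates the directory structure, so the following is only a sketch of how such a step might work, not the Archives of American Art’s actual code. It assumes an EAD 2002 finding aid without an XML namespace in which each component’s did element carries container elements typed ‘box’ and ‘folder’; the function names and the box_001/folder_001 naming pattern are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def folder_paths(ead_xml):
    """Return one relative path per component that lists both a box and a folder."""
    root = ET.fromstring(ead_xml)
    paths = []
    for did in root.iter("did"):
        # Map container type (e.g. 'box', 'folder') to its text value.
        containers = {c.get("type", "").lower(): (c.text or "").strip()
                      for c in did.findall("container")}
        box, folder = containers.get("box"), containers.get("folder")
        if box and folder:
            # Zero-pad so directories sort in shelf order (hypothetical convention).
            paths.append(Path(f"box_{box.zfill(3)}") / f"folder_{folder.zfill(3)}")
    return paths


def create_directories(ead_xml, base_dir):
    """Create a directory under base_dir for every physical folder in the finding aid."""
    for rel in folder_paths(ead_xml):
        (Path(base_dir) / rel).mkdir(parents=True, exist_ok=True)


# A tiny, invented finding aid fragment for demonstration.
SAMPLE = """<ead><archdesc><dsc>
  <c01><did><container type="box">1</container><container type="folder">1</container>
    <unittitle>Correspondence</unittitle></did></c01>
  <c01><did><container type="box">1</container><container type="folder">2</container>
    <unittitle>Photographs</unittitle></did></c01>
</dsc></archdesc></ead>"""

for path in folder_paths(SAMPLE):
    print(path.as_posix())
```

Calling create_directories(SAMPLE, "scans") would then build the matching folders on disk, ready for the scanning technician to save images straight into them.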

On Demand Digitization:

Patrons may request the digitization of individual items.
These requests are evaluated by archivists to determine if it is appropriate to digitize the entire folder (or even box) to which the item belongs.
Requests are logged in a paper log.
Item level scanning ties back to an item level record with an item ID. There is an ‘Online Removal Notice’ to create an item level stub.
An item level cataloger describes the content after it is scanned.
Unless there is an explicit copyright or donor restriction, the item is put online in the Image Gallery (which currently has 12,000 documents).
Access to images is provided by keyword searching.
Individual images are linked back to the archival description for the collection from which they came.

Improvements/Changes they wish for:

They currently have no flexibility to make changes in the database nimbly. It is a tedious process to change the display and each change requires a programmer.
They would like to consider a move to open source software or to use a central repository – though they have concerns about what other sacrifices this would require.
Show related collections, list connected names (currently the only options for discovery are an A-Z list of creators or keyword search).
Ability to connect to guides and other exhibits.

References:

Image Gallery
Main Website
Digitization Project
Technical Documentation – shares internal procedures and guidelines
OCLC rapid capture paper
Scanning equipment

Image Credit: Alexander Calder papers, Archives of American Art, Smithsonian Institution.

Digitization Program Site Visit: University of Maryland

December 12, 2011

I recently had the opportunity to visit with staff of the University of Maryland, College Park’s Digital Collections digitization program along with a group of my colleagues from the World Bank. This is a report on that site visit. It is my hope that these details can help others planning digitization projects – much as it is informing our own internal planning.

Date of Visit: October 13, 2011

Destination: University of Maryland, Digital Collections

University of Maryland Hosts:

Summary: This visit was two hours in length and consisted of a one hour presentation and Q&A session with Jennie Levine Knies, Manager of Digital Collections followed by a one hour tour and Q&A session with Alexandra Carter, Digital Imaging Librarian.

Background: The Digital Collections of the University of Maryland was launched in 2006 using Fedora Commons. It is distinct from the ‘Digital Repository at the University of Maryland’, aka DRUM, which is built on DSpace. DRUM contains faculty-deposited documents, a library-managed collection of UMD theses and dissertations, and collections of technical reports. The Digital Collections project focuses on digitization of photographs, postcards, manuscripts & correspondence – mostly based on patron demand. In addition, materials are selected for digitization based on the need for thematic collections to support events, such as their recent civil war exhibition.

After a period of full funding, there has been a fall off in funding which has prevented any additional changes to the Fedora system.

Another project at UMD involves digitization of Japanese children’s books (Gordon W. Prange Collection) and currently uses “in house outsourcing”. In this scenario, contractors bring all their equipment and staff on site to perform the digitization process.

Standard Procedures:

Requests must be made using a combination of the ‘Digital Request Cover Sheet’ and ‘Digital Surrogate Request Sheet’. These sheets are then reviewed for completeness by the curator under whose jurisdiction the collection falls. Space on the request forms is provided so that the curator may add additional notes to aid in the digitization process. They decide if it is worth digitizing an entire folder when only specific item(s) are requested. Standard policy is to aim for two week turnaround for digitization based on patron request.
The digital request is given a code name for easy reference. They choose these names alphabetically.
Staff are assigned to digitize materials. This work is often done by student workers using one of three Epson 10000 XL flatbed scanners. There is also a Zeutschel OS 12000 overhead scanner available for materials which cannot be handled by the flatbed scanners.
Alexandra reviews all scans for quality.
Metadata is reviewed by another individual.
When both the metadata & image quality have been reviewed, materials are published online.

Improvements/Changes they wish for:

Easier way to create a web ‘home’ for collections, currently many do not have a main page and creating one requires the involvement of the IT department.
Option for users to save images being viewed
Option to upload content to their website in PDF format
Way to associate transcriptions with individual pages
More granularity for workflow: currently the only status they have to indicate that a folder or item is ready for review is ‘Pending’. Since there are multiple quality control activities that must be performed by different staff, currently they must make manual lists to track what phases of QA are complete for which digitized content.
Reduce data entry.
Support for description at both the folder and item level at the same time. Currently description is only permitted either at the folder level OR at the item level.
Enable search and sorting by date added to system. This data is captured, but not exposed.

Lessons Learned:

Should have adopted an existing metadata standard rather than creating their own.
People do not use the ‘browse terms’ – do not spend a lot of time working on this

Resources:

Digital Content Guidelines: Selection Criteria for Digital Objects

Image Credit: Women students in a green house during a Horticulture class at the University of Maryland, 1925. University Archives, Special Collections, University of Maryland Libraries

SXSW Panel Proposal – Archival Records Online: Context is King

August 31, 2011

I have a panel up for evaluation on the SXSW Interactive Panel Picker titled Archival Records Online: Context is King. The evaluation process for SXSW panels is based on a combination of staff choice, advisory board recommendations and public votes. As you can see from the pie chart shown here (thank you SXSW website for the great graphic), 30% of the selection criteria is based on public votes. That is where you come in. Voting is open through 11:59 pm Central Daylight Time on Friday, September 2. To vote in favor of my panel, all you need to do is create a free account over on SXSW Panel Picker and then find Archival Records Online: Context is King and give it a big thumbs up.

If my panel is selected, I intend this session to give me the chance to review all of the following:

What are the special design requirements of archival records?
What are the biggest challenges to publishing archival records online?
How can archivists, designers and developers collaborate to build successful web sites?
Why is metadata important?
How can search engine optimization (SEO) inform the design process?

All of this ties into what I have been pondering, writing about and researching for the past few years related to getting archival records online. So many people are doing such amazing work in this space. I want to show off the best of the best and give attendees some takeaways to help them build websites that make it easy to see the context of anything they find in their search.

While archival records have a very particular dependence on the effective communication of context – I also think that this is a lesson that can improve interface design across the board. These are issues that UI and IA folks are always going to be worrying about. SXSW is such a great opportunity for cross pollination. Conferences outside the normal archives, records management and library conference circuit give us a chance to bring fresh eyes and attention to the work being done in our corner of the world.

If you like the idea of this session, please take a few minutes to go sign up at the SXSW Panel Picker and give Archival Records Online: Context is King a thumbs up. You don’t need to be planning to attend in order to cast your vote, though after you start reading through all the great panel ideas you might change your mind!

Support EAD Tagging Research

December 6, 2010

In case you haven’t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions’ finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers.

Is part of your job to encode finding aids in EAD? Then please ask if you can send a dozen of them to the researchers on this project!

Seeking EAD records from repositories that have implemented EAD

Standards have been entering the archival lexicon at a fast pace to ensure data reliability, enable data aggregation, and manage data over the long term. However, we have not yet examined the use of these standards across the archival community. As we move into the next phase of standards-creation, a broad look at current implementations will help to inform the next generation of these standards. To do this, Kathy Wisser (Simmons College) and Jackie Dean (UNC Chapel Hill) are conducting research on EAD tag usage in the encoding community.

This project is intended to inform the TS-EAD revision process of the standard, and results will be disseminated through traditional publication avenues.

We are seeking a sample of encoded finding aids from institutions that have implemented EAD. If you are willing to participate in this project, please submit via electronic mail 12 to 15 finding aids to eadtagresearch@gmail.com by December 15, 2010.

The goal of the project is to identify encoding behavior and not to evaluate the quality of the encoding or the content of the finding aid. We will be noting the presence and absence of elements and attributes and the way that elements are used within the context of an EAD instance.

All results will be anonymized; no institution-specific information will be linked to the results. Institutions willing to participate will be acknowledged.

In order to obtain an accurate account of the use of the standard, we are looking for EAD instances from as many institutions as possible. We hope you will consider contributing to this effort.

If you have any questions about the project, please contact:

Kathy Wisser (Simmons College – wisser@simmons.edu)

Jackie Dean (UNC Chapel Hill – jdean@email.unc.edu)
