
Category: born digital records

Born Digital Records are those records which are created in a digital format. These records may never have a physical or analog expression.

Chapter 7: Historical Building Information Model (BIM)+: Sharing, Preserving and Reusing Architectural Design Data by Dr. JuHyun Lee and Dr. Ning Gu

Chapter 7 of Partners for Preservation is ‘Historical Building Information Model (BIM)+: Sharing, Preserving and Reusing Architectural Design Data’ by Dr. JuHyun Lee and Dr. Ning Gu. The final chapter in Part II: The physical world: objects, art, and architecture, this chapter addresses the challenges of digital records created to represent physical structures. I picked the image above because I love the contrast between the type of house plans you could order from a catalog a century ago and the way design plans exist today.

This chapter was another of my “must haves” from my initial brainstorm of ideas for the book. I attended a session on ‘Preserving Born-Digital Records Of The Design Community’ at the 2007 annual SAA meeting. It was a compelling discussion, with representatives from multiple fields: archivists working to preserve born-digital designs, and people building tools and setting standards. There were lots of questions from the audience – many of which I managed to capture in my notes that became a detailed blog post on the session itself. It was exciting to be in the room with so many enthusiastic experts in overlapping fields. They were there to talk about what might work long term.

This chapter takes you forward to see how BIM has evolved – and how historical BIM+ might serve multiple communities. This passage gives a good overview of the chapter:

“…the chapter first briefly introduces the challenges the design and building industry have faced in sharing, preserving and reusing architectural design data before the emergence and adoption of BIM, and discusses BIM as a solution for these challenges. It then reviews the current state of BIM technologies and subsequently presents the concept of historical BIM+ (HBIM+), which aims to share, preserve and reuse historical building information. HBIM+ is based on a new framework that combines the theoretical foundation of HBIM with emerging ontologies and technologies in the field including geographic information systems (GIS), mobile computing and cloud computing to create, manage and exchange historical building data and their associated values more effectively.”

I hope you find the ideas shared in this chapter as intriguing as I do. I see lots of opportunities for archivists to collaborate with those focused on architecture and design, especially in the case of historical buildings and the proposed vision for HBIM+.

Bios:

Ning Gu is Professor of Architecture in the School of Art, Architecture and Design at the University of South Australia. With an academic background from both Australia and China, Professor Gu has made his most significant contributions to research in design computing and cognition, including topics such as computational design analysis, design cognition, design communication and collaboration, generative design systems, and Building Information Modelling. The outcomes of his research have been documented in over 170 peer-reviewed publications. Professor Gu’s research has been supported by prestigious Australian research funding schemes from the Australian Research Council, the Office for Learning and Teaching, and the Cooperative Research Centre for Construction Innovation. He has guest edited/chaired major international journals/conferences in the field. He was a Visiting Scholar at MIT, Columbia University and Technische Universiteit Eindhoven.

JuHyun Lee is an adjunct senior lecturer at the University of Newcastle (UoN). Dr. Lee has made a significant contribution to architectural and design research in three main areas: design cognition (design and language), planning and design analysis, and design computing. As an expert in the field of architectural and design computing, Dr. Lee was invited to become a visiting academic at the UoN in 2011. Dr. Lee has developed innovative computational applications for pervasive computing and context awareness in built environments. The research has been published in Computers in Industry, Advanced Engineering Informatics, and the Journal of Intelligent and Robotic Systems. His international contributions include serving as associate editor for a special edition of Architectural Science Review, as a reviewer for many international journals and conferences, and as an international reviewer for national grants.

Image Source: Image from page 717 of ‘Easy steps in architecture and architectural drawing’ by Hodgson, Frederick Thomas, 1915. https://archive.org/details/easystepsinarch00hodg/page/n717

Chapter 6: Accurate Digital Colour Reproduction on Displays: from Hardware Design to Software Features by Dr. Abhijit Sarkar

The sixth chapter in Partners for Preservation is “Accurate Digital Colour Reproduction on Displays: from Hardware Design to Software Features” by Dr. Abhijit Sarkar. As the second chapter in Part II: The physical world: objects, art, and architecture, this chapter continues to walk the edge between the physical and digital worlds.

My mother was an artist. I spent a fair amount of time as a child by her side in museums in New York City. As my own creativity has led me to photography and graphic design, I have become more and more interested in color and how it can change (or not change) across the digital barrier and across digital platforms. Add in the ongoing challenges to archival preservation of born-digital visual records and the ever-increasing efforts to digitize archival materials, and this was a key chapter I was anxious to include.

One of my favorite passages from this chapter:

If you are involved in digital content creation or digitisation of existing artwork, the single most important advice I can give you is to start by capturing and preserving as much information as possible, and allow redundant information to be discarded later as and when needed. It is a lot more difficult to synthesise missing colour fidelity information than to discard information that is not needed.

This chapter, perhaps more than any other in the book, can stand alone as a reference. It is a solid introduction to color management and representation, including both information about basic color theory and important aspects of the technology choices that govern what we see when we look at a digital image on a particular piece of hardware.

On my computer screen, the colors of the image I selected for the top of this blog post please me. How different might the 24 x 30-inch original screenprint on canvas mounted on paperboard, created fifty years ago in 1969 and now held by the Smithsonian American Art Museum, look to me in person? How different might it look on each device on which people read this blog post? I hope that this type of curiosity will lure you into developing an understanding of the impacts that the choices explored in this chapter can have on how the records in your care will be viewed in the future.

Bio: 

Abhijit Sarkar specializes in the area of color science and imaging. Since his early college days, Abhijit wanted to do something different from what all his friends were doing or planning to do. That mission took him through a tortuous path of earning an undergraduate degree in electrical engineering in India, two MS degrees from Penn State and RIT on lighting and color, and a PhD in France on applied computing. His doctoral thesis was mostly focused on the fundamental understanding of how individuals perceive colors differently and devising a novel method of personalized color processing for displays in order to embrace individual differences.

Because of his interdisciplinary background encompassing science, engineering and art, Abhijit regards cross-discipline collaborations like Partners for Preservation as extremely valuable in transcending the boundaries of myriad specialized domains and fields, and thereby developing a much broader understanding of the capabilities and limitations of technology.

Abhijit is currently part of the display design team at Microsoft Surface, focused on developing new display features that enhance users’ color experience. He has authored a number of conference and journal papers on color imaging and was a contributing author for the Encyclopedia of Color Science and Technology.

Image source: Bullet Proof, from the portfolio Series I by artist Gene Davis, Smithsonian American Art Museum, Bequest of Florence Coulson Davis

Chapter 5: The Internet of Things: the risks and impacts of ubiquitous computing by Éireann Leverett

Chapter 5 of Partners for Preservation is ‘The Internet of Things: the risks and impacts of ubiquitous computing’ by Éireann Leverett. This is one of the chapters that evolved a bit from my original idea – shifting from being primarily about proprietary hardware to focusing on the Internet of Things (IoT) and the cascade of social and technical fallout that needs to be considered.

Leverett gives this most basic definition of IoT in his chapter:

At its core, the Internet of Things is ‘ubiquitous computing’, tiny computers everywhere – outdoors, at work in the countryside, at use in the city, floating on the sea, or in the sky – for all kinds of real world purposes.

In 2012, I attended a session at The Memory of the World in the Digital Age: Digitization and Preservation conference on the preservation of scientific data. I was particularly taken with The Global Sea Level Observing System (GLOSS) — almost 300 tide gauge stations around the world making up a web of sea level observation sensors. The UNESCO Intergovernmental Oceanographic Commission (IOC) established this network, but cannot add to or maintain it themselves. The success of GLOSS “depends on the voluntary participation of countries and national bodies”. It is a great example of what a network of sensors deployed en masse by multiple parties can do – especially when trying to achieve more than a single individual or organization can on its own.

Much of IoT is not implemented for the greater good, but rather to further commercial aims.  This chapter gives a good overview of the basics of IoT and considers a broad array of issues related to it including privacy, proprietary technology, and big data. It is also the perfect chapter to begin Part II: The physical world: objects, art, and architecture – shifting to a topic in which the physical world outside of the computer demands consideration.

Bio:

Éireann Leverett

Éireann Leverett once found 10,000 vulnerable industrial systems on the internet.

He then worked with Computer Emergency Response Teams around the world for cyber risk reduction.

He likes teaching the basics and learning the obscure.

He continually studies computer science, cryptography, networks, information theory, economics, and magic history.

He is a regular speaker at computer security conferences such as FIRST, BlackHat, Defcon, Brucon, Hack.lu, RSA, and CCC; and also at insurance and risk conferences such as Society of Information Risk Analysts, Onshore Energy Conference, International Association of Engineering Insurers, International Risk Governance Council, and the Reinsurance Association of America. He has been featured by the BBC, The Washington Post, The Chicago Tribune, The Register, The Christian Science Monitor, Popular Mechanics, and Wired magazine.

He is a former penetration tester from IOActive, and was part of a multidisciplinary team that built the first cyber risk models for insurance with Cambridge University Centre for Risk Studies and RMS.

Image credit: Zan Zig performing with rabbit and roses, including hat trick and levitation, Strobridge Litho. Co., c1899.

NOTE: I chose the magician in the image above for two reasons:

  1. because IoT can seem like magic
  2. because the author of this chapter is a fan of magic and magic history

Chapter 1: Inheritance of Digital Media by Dr. Edina Harbinja

You're Dead, Your Data Isn't: What Happens Now?
The first chapter in Partners for Preservation is ‘Inheritance of Digital Media’, written by Dr. Edina Harbinja. This topic was one of the first I was sure I wanted to include in the book. Back in 2011, I attended an SXSW session titled Digital Death. The discussion was wide-ranging and attracted people of many backgrounds including lawyers, librarians, archivists, and social media professionals. I still love the illustration above, created live during the session.

The topic of personal digital archiving has since gained traction, inspiring events and the creation of resources. There are now multiple books addressing the subject. The Library of Congress created a kit to help people host personal digital archiving events. In April 2018 a Personal Digital Archiving (PDA) conference was held in Houston, TX. You can watch the presentations from PDA2017, hosted by Stanford University Libraries. PDA2016 was held at the University of Michigan Library and PDA2015 was hosted by NYU. In fact, the Internet Archive has an entire collection of videos and presentation materials from various PDAs dating back to 2010.

I wanted the chapter on digital inheritance to address topics at the forefront of current thinking. Dr. Edina Harbinja delivered exactly what I was looking for and more. As the first chapter in Part 1: Memory, Privacy, and Transparency, it sets the stage for many of the common threads I saw in this section of the book.

Here is one of my favorite sentences from the chapter:

“Many digital assets include a large amount of personal data (e.g. e-mails, social media content) and their legal treatment cannot be looked at holistically if one does not consider privacy laws and their lack of application post-mortem.”

This quote gets at the heart of the chapter and provides a great example of the intertwining elements of memory and privacy. What do you think will happen to all of your “digital stuff”? Do you have an expectation that your privacy will be respected? Do you assume that your loved ones will have access to your digital records? To what degree are laws and policies keeping up (or not keeping up) with these questions? As an archivist, how might all this impact your ability to access, extract, and preserve digital records?

Look to chapter one of Partners for Preservation to explore these ideas.

Bio

Dr. Edina Harbinja

Dr. Edina Harbinja is a senior lecturer in media/privacy law at Aston University, Birmingham, UK. Her principal areas of research and teaching are related to the legal issues surrounding the Internet and emerging technologies. In her research, Edina explores the application of property, contract law, intellectual property, and privacy online. Edina is a pioneer and a recognized expert in post-mortem privacy, i.e. privacy of the deceased individuals. Her research has a policy and multidisciplinary focus and aims to explore different options of regulation of online behaviors and phenomena. She has been a visiting scholar and an invited speaker to universities and conferences in the USA, Latin America, and Europe, and has undertaken consultancy for the Fundamental Rights Agency. Her research has been cited by legislators, courts, and policymakers in the US, Australia, and Europe as well. Find her on Twitter at @EdinaRl.

Countdown to Partners for Preservation

Yes. I know. My last blog post was way back in May of 2014. I suspect some of you have assumed this blog was defunct.

When I first launched Spellbound Blog as a graduate student in July of 2006, I needed an outlet and a way to connect to like-minded people pondering the intersection of archives and technology. Since July 2011, I have been doing archival work full time. I work with amazing archivists. I think about archival puzzles all day long. Unsurprisingly, this reduced my drive to also research and write about archival topics in the evenings and on weekends.

Looking at the dates, I also see that after I took an amazing short story writing class, taught by Mary Robinette Kowal in May of 2013, I only wrote one more blog post before setting Spellbound Blog aside for a while in favor of fiction and other creative side-projects in my time outside of work.

Since mid-2014, I have been busy with many things – including (but certainly not limited to):

I’m back to tell you all about the book.

In mid-April of 2016, I received an email from a commissioning editor in the employ of UK-based Facet Publishing (initially described to me as the publishing arm of CILIP, the UK’s equivalent to ALA). That email was the beginning of a great adventure, which will soon culminate in the publication of Partners for Preservation by Facet (and its distribution in the US by ALA). The book, edited by me and including an introduction by Nancy McGovern, features ten chapters by representatives of non-archives professions. Each chapter discusses challenges with and victories over digital problems that share common threads with issues facing those working to preserve digital records.

Over the next few weeks, I will introduce you to each of the book’s contributing authors and highlight a few of my favorite tidbits from the book. This process was very different from writing blog posts and being able to share them immediately. After working for so long in isolation it is exciting to finally be able to share the results with everyone.

PS: I also suspect that finally posting again may throw open the floodgates to some longer essays on topics that I’ve been thinking about over the past years.

PPS: If you are interested in following my more creative pursuits, I also have a separate mailing list for that.

The CODATA Mission: Preserving Scientific Data for the Future

This session was part of The Memory of the World in the Digital Age: Digitization and Preservation conference and aimed to describe the initiatives of the Data at Risk Task Group (DARTG), part of the Committee on Data for Science and Technology (CODATA), a body of the International Council for Science.

The goal is to preserve scientific data that are in danger of loss because they are not in modern electronic formats or have a particularly short shelf-life. DARTG is seeking out sources of such data worldwide, knowing that many are irreplaceable for research into the long-term trends that occur in the natural world.

Organizing Data Rescue

The first speaker was Elizabeth Griffin from Canada’s Dominion Astrophysical Observatory. She spoke of two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. (PDF of session slides)

The “memory of the world” is vast and extends back for aeons of time, but only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It is the analog data which are considered to be “at risk” and which are the task group’s immediate concern.

Many pre-digital records have never made it into a digital form. Even some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot be easily located, if at all.

How can such “data at risk” be recovered and made useable?  The design of an efficient rescue package needs to be based upon the big picture, so a website has been set up to create an inventory where anyone can report data-at-risk. The Data-at-Risk Inventory (built on Omeka) is front-ended by a simple form that asks for specific but fairly obvious information about the datasets, such as field (context), type, amount or volume, age, condition, and ownership. After a few years DARTG should have some better idea as to the actual amounts and distribution of different types of historic analog data.
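
The fields on that form map naturally onto a simple record structure. Here is a minimal sketch of one inventory entry as a Python dataclass; the field names are my own illustration, not the actual schema of the Omeka-based form:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DataAtRiskRecord:
    """One hypothetical entry in a data-at-risk inventory.

    Field names are illustrative; the real DARTG/Omeka form may use
    different labels and controlled vocabularies.
    """
    scientific_field: str   # e.g. "hydrology" or "astronomy"
    data_type: str          # e.g. "strip charts", "photographic plates"
    amount: str             # free-text volume, e.g. "40 boxes"
    age: str                # e.g. "1930s-1960s"
    condition: str          # e.g. "fragile, water-damaged"
    ownership: str          # custodian or rights holder
    notes: Optional[str] = None

record = DataAtRiskRecord(
    scientific_field="hydrology",
    data_type="stream-flow strip charts",
    amount="73 years of measurements",
    age="1930s-2000s",
    condition="paper, stored in a basement",
    ownership="regional research station",
)
print(asdict(record))
```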

Help and support are needed to advertise the Inventory.  A proposal is being made to link data-rescue teams from many scientific fields into an international federation, which would be launched at a major international workshop.  This would give a permanent and visible platform to the rescue of valuable and irreplaceable data.

The overarching goal is to build a research knowledge base that offers a complementary combination of past, present and future records.  There will be many benefits, often cross-disciplinary, sometimes unexpected, and perhaps surprising.  Some will have economic pay-offs, as in the case of some uncovered pre-digital records concerning the mountain streams that feed the reservoirs of Cape Town, South Africa.  The mountain slopes had been deforested a number of years ago and replanted with “economically more appealing” species of tree.  In their basement hydrologists found stacks of papers containing 73 years of stream-flow measurements.  They digitized all the measurements, analyzed the statistics, and discovered that the new but non-native trees used more water.  The finding clearly held significant importance for the management of Cape Town’s reservoirs.  For further information about the stream-flow project see Jonkershoek – preserving 73 years of catchment monitoring data by Victoria Goodall & Nicky Allsopp.

DARTG is building a bibliography of research papers which, like the Jonkershoek one, describe projects that have depended partly or completely on the ability to access data that were not born-digital.  Any assistance in extending that bibliography would be greatly appreciated.

Several members of DARTG are themselves engaged in scientific pursuits that seek long-term data.  The following talks describe three such projects.

Data Rescue to Increase Length of the Record

The second speaker, Patrick Caldwell from the US National Oceanographic Data Center (NODC), spoke on rescue of tide gauge data. (PDF of full paper)

He started with an overview of water level measurement, explaining how an analog trace (a line drawn on a paper record by a float-driven mechanism with a timer) is generated. Tide gauges include geodetic survey benchmarks to make sure that the land isn’t moving. The University of Hawaii maintains a network of gauges internationally. Back in the 1800s, they were keeping track of the tides and sea level for shipping. You never know what the application may turn into – they collected for tides, but in the 1980s they started to see patterns. They used tide gauge measurements to discover El Niño!

As you increase the length of the record, the trustworthiness of the data improves. Within sea level variations, there are some changes that play out over decades. To take that shift out, they need 60 years of data to track sea level trends. They are working to extend the length of the record.

The UNESCO Joint Technical Commission for Oceanography & Marine Meteorology maintains the Global Sea Level Observing System (GLOSS).

GLOSS has a series of Data Centers:

  • Permanent Service for Mean Sea Level (monthly)
  • Joint archive for sea level (hourly)
  • British Oceanographic Data Centre (high frequency)

The biggest holdings start in the 1940s. They want to increase the number of longer records. A student in France documented where he found records as he hunted for the data he needed. Oregon students documented records available at NARA.

Global Oceanographic Data Archaeology and Rescue (GODAR) and the World Ocean Database Project

The Historic Data Rescue Questionnaire created in November 2011 resulted in 18 replies from 14 countries documenting tide gauge sites with non-digital data that could be rescued. They are particularly interested in the records that are 60 years or more in length.

Future Plans: Move away from identifying what is out there to tackling the rescue aspect. This needs funding. They will continue to search repositories for data-at-risk and continue collaboration with GLOSS/DARTG to refresh the online inventory. They will also collaborate with other programs (e.g. the Atmospheric Circulation Reconstructions over the Earth (ACRE) meeting in November 2012). Eventually they will move to Phase II = recovery!

The third speaker, Stephen Del Greco from the US NOAA National Climatic Data Center (NCDC), spoke about environmental data through time and extending the climate record. (PDF of full paper) The NCDC is a weather archive with headquarters in Asheville, NC. It fulfills much of the nation’s climate data requirements. Their data comes from many different sources. They provide safe storage of over 5,600 terabytes of climate data (the equivalent of roughly 6.5 billion Kindle books). How will they handle the upcoming explosion of data on the way? They need to both handle new content coming in AND provide increased access to larger amounts of data being downloaded over time. In 2011, data downloads totalled 1,250 terabytes for the year. They expect that download number to increase 10-fold over the next few years.

The Climate Database Modernization Program rescued data over more than a decade. It was well funded, and millions of records were rescued with a budget of roughly $20 million a year. The goal is to preserve and make major climate and environmental data available via the World Wide Web. Over 14 terabytes of climate data are now digitized. 54 million weather and environmental images are online. Hundreds of millions of records are digitized and now online. The biggest challenge was getting the surface observation data digitized. NCDC digital data for hourly surface observations generally stretch back to around 1948. Some historical marine observations go back to the spice trade records.

For international efforts they bring their imaging equipment to other countries where records were at risk. 150,000 records imaged under the Climate Database Modernization Program (CDMP).

Now they are moving from publicly funded projects to citizen-fueled projects via crowdsourcing, such as the Zooniverse program. Old Weather is a Zooniverse project which uses crowdsourcing to digitize and analyze climate data. For example, the transcription done by volunteers helps scientists model Earth’s climate using wartime ship logs. The site includes methods to validate the efforts of citizens. They have had almost 700,000 volunteers.
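
The post doesn’t spell out how Old Weather actually validates volunteer work, but the general idea behind cross-checking multiple transcriptions of the same logbook entry can be sketched as a simple consensus rule (purely illustrative, not the project’s real algorithm):

```python
from collections import Counter
from typing import Optional

def validate_transcriptions(values: list[str], min_agreement: int = 3) -> Optional[str]:
    """Accept a transcribed value only if enough independent volunteers agree.

    Illustrative consensus rule; a production system would also track
    per-volunteer reliability and handle near-matches.
    """
    if not values:
        return None
    # Normalise light formatting differences before comparing.
    normalised = [v.strip().lower() for v in values]
    value, count = Counter(normalised).most_common(1)[0]
    return value if count >= min_agreement else None

# Five volunteers transcribe the same logbook temperature reading.
readings = ["54.0", "54.0", "54.O", "54.0", "54.0"]
print(validate_transcriptions(readings))  # "54.0" (four of five agree)
```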

Long-term Archive Tasks:

  • Rescuing Satellite Data: raw images in lots of different film formats. All this is at risk. Need to get it all optically imaged. Looking at a ‘citizen alliance’ to do this work.
  • Climate Data Records: Global Essential Climate Variables (ECVs) with Heritage Records. Lots of potential records for rescue.
  • Rescued data helps people building proxy data sets: NOAA Paleoclimatology. ‘Paleoclimate proxies’ – things like boreholes, tree rings, lake levels, pollen, ice cores and more. For example – getting temperature and carbon dioxide from ice cores. These can go back 800,000 years!

We have extended the climate record through international collaboration. For example, the Australian Bureau of Meteorology provided daily temperature records for more than 1,500 additional stations. This meant a more than 10-fold increase in previous historical climate daily data holdings from that country.

Born Digital Maps

The final presentation, delivered by D. R. Fraser Taylor and Tracey Lauriault from Carleton University’s Geomatics and Cartographic Research Centre in Canada, discussed the map as a fundamental source of memory of the world. The full set of presentation slides is available online on SlideShare. (PDF of full paper)

We are now moving into born digital maps. For example, the Canadian Geographic Information System (CGIS) was created in the 1960s and was the world’s first GIS. Maps are ubiquitous in the 21st century. All kinds of organizations are creating their own maps and mash-ups. Community-based NGOs, citizen science, academia and the private sector are all creating maps.

We are losing born digital maps almost faster than we are creating them. We have lost 90% of born digital maps. Above all, there is an attitude that preservation is not intrinsically important. No-one thought about the need to preserve the maps – everyone thought someone else would do it. There was a complete lack of thought related to the preservation of these maps.

The Canada Land Inventory (CLI) was one of the first and largest born digital map efforts in the world. It mapped 2.6 million square kilometers of Canada. It was lost in the 1980s; no-one took responsibility for archiving, and those who thought about it believed backup equaled archiving. A group of volunteers rescued the data over time – salvaged from boxes of tapes and paper in the mid-1990s. It was caught just in time and took a huge effort. 80% has been saved and is now online. This was rescued because it was high profile. What about the low-profile data sets? Who will rescue them? No-one.

The 1986 BBC Domesday Project was created in celebration of 900 years after William the Conqueror’s original Domesday Book. It was obsolete by the 1990s. A huge amount of social and economic information was collected for this project. In order to rescue it, they needed an Acorn computer and needed to be able to read the optical discs. The platform was emulated in 2002-2003. It cost 600,000 British pounds to reverse engineer and put online in 2004. New discs were made in 2003 at the UK Archive.

It is easier to get Ptolemy’s maps from the 15th century than it is to get a map that is 10 years old.

The Inuit Siku (sea ice) Atlas, an example of a Cybercartographic atlas, was produced in cooperation with Inuit communities. Arguing that the memory of what is happening in the north lies in the minds of the elders, they are capturing the information and putting it out in multi-media/multi-sensory map form. The process is controlled by the community themselves. They provide the software and hardware. They created a graphic tied to the Inuit terms for different types of sea ice. In some cases they record the audio of an elder talking about a place. The narrative of the route becomes part of the atlas. There is no right or wrong answer. There are many versions and different points of view. All are based on the same set of facts – but they come from different angles. The atlases capture them all.

The Gwich’in Place Name Atlas builds the idea of long-term preservation into the application from the start.

The Cybercartographic Atlas of the Lake Huron Treaty Relationship Process is taking data from surveyors’ diaries from the 1850s.

There are lots of Government of Canada geospatial data preservation initiatives, but in most cases there is a lot of rhetoric and not much action. There have been many consultations, studies, reports and initiatives since 2002, but the reality is that apart from the Open Government Consultations (TBS), not very much has translated into action. Even where there is legislation, lots of things look good on paper but don’t get implemented.

There are library and archives guidelines working to support the digital preservation of geospatial data. The InterPARES 2 (IP2) geospatial case studies tackle a number of GIS examples, including the Cybercartographic Atlas of Antarctica. See the presentation slides online for more specific examples.

In general, preservation as an afterthought rarely results in full recovery of born digital maps. It is very important to look at open source and interoperable open specifications. Proactive archiving is an important interim strategy.

Geospatial data are fundamental sources of our memory of the world. They help us understand our geo-narratives (stories tied to location), counter colonial mappings, are the result of scientific endeavors, represent multiple worldviews and they inform decisions. We need to overcome the challenges to ensure their preservation.

Q&A:

QUESTION: When I look at the work you are doing with recovering Inuit data from people. You recover data and republish it – who will preserve both the raw data and the new digital publication? What does it mean to try and really preserve this moving forward? Are we really preserving and archiving it?

ANSWER: No we are not. We haven’t been able to find an archive in Canada that can ingest our content. We will manage it ourselves as best we can. Our preservation strategy is temporary and holding, not permanent as it should be. We can’t find an archive to take the data. We are hopeful that we are moving towards finding a place to keep and preserve it. There is some hope on the horizon that we may move in the right directions in the Canadian context.

Luciana: I wanted to attest that we have all the data from InterPARES II. It is published in the final report. I am jealously guarding my two servers that I maintain with money out of my own pocket.

QUESTION: Is it possible to have another approach to keep data where it is created, rather than a centralized approach?

ANSWER: We are providing servers to our clients in the north. Keeping copies of the database in the community where they are created. Keeping multiple copies in multiple places.

QUESTION: You mention surveys being sent out and few responses coming back. When you know there is data at risk – there may be governments that have records at risk but are shy about revealing that to the public. How do we get around that secrecy?

ANSWER: (IEDRO representative) We offer our help, rather than a request to get their data.

As is the case with all my session summaries, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Image Credit: NARA Flickr Commons image “The North Jetty near the Mouth of the Columbia River 05/1973”

Updated 2/20/2013 based on presenter feedback.

UNESCO/UBC Vancouver Declaration

In honor of the 2012 Day of Digital Archives, I am posting a link to the UNESCO/UBC Vancouver Declaration. This is the product of the recent Memory of the World in the Digital Age conference, and they are looking for feedback on this declaration by October 19th, 2012 (see the link on the conference page for sending in feedback).

To give you a better sense of the aim of this conference, here are the ‘conference goals’ from the programme:

The safeguard of digital documents is a fundamental issue that touches everyone, yet most people are unaware of the risk of loss or the magnitude of resources needed for long-term protection. This Conference will provide a platform to showcase major initiatives in the area while scaling up awareness of issues in order to find solutions at a global level. Ensuring digital continuity of content requires a range of legal, technological, social, financial, political and other obstacles to be overcome.

The declaration itself is only four pages long and includes recommendations to UNESCO, member states and industry. If you are concerned with digital preservation and/or digitization, please take a few minutes to read through it and send in your feedback by October 19th.

CURATEcamp Processing 2012

CURATEcamp Processing 2012 was held the day after the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA) sponsored Digital Preservation annual meeting.

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote down a title and short description on a paper and taped it to the wall. These were then reviewed, rearranged on the schedule and combined where appropriate until we had our full final schedule. More than half the sessions on the schedule have links through to notes from the session. There were four session slots, plus a noon lunch slot of lightning talks.

Session I: At Risk Records in 3rd Party Systems. This was the session I had proposed, combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created – one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion including these assorted points:

  • Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
  •  The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak for preserving every season’s programs since 1842.
  • There are consequences to not having and following a ‘file plan’. Part of people’s jobs has to be to follow the rules.
  • What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
  • “Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
  • When should archivists be in the process? At the start – before content is created, before systems are created?
  • Keep the original data AND keep updated data. Document everything, data sources, processes applied.

Session II: Automating Review for Restrictions? This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that the challenge of reviewing records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘Easy’ category was for well defined content that follows rules we could imagine teaching a computer to identify — things like US social security numbers, passport numbers or credit card numbers. The ‘Hard’ category was for restrictions that involved more human judgement. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required – and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so – drop a comment below please!
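
To make the ‘easy’ category concrete, here is a rough sketch of what one of those rule-based review modules might look like. The rule names and patterns are my own illustration, not an existing community-blessed tool; real modules would need extra validation logic (such as Luhn checks for card numbers) to cut down false positives:

```python
import re
from dataclasses import dataclass

@dataclass
class Hit:
    rule: str
    start: int
    end: int
    snippet: str

# Each "module" is just a named regex in this sketch.
RULES = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan(text: str, rules=RULES) -> list[Hit]:
    """Run every enabled rule over the text and report candidate restrictions."""
    hits = []
    for name, pattern in rules.items():
        for m in pattern.finditer(text):
            hits.append(Hit(name, m.start(), m.end(), m.group()))
    return hits

sample = "Contact jane@example.org; SSN on file: 123-45-6789."
for hit in scan(sample):
    print(f"{hit.rule}: {hit.snippet!r} at offset {hit.start}")
```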

Lunchtime Lightning Talks: At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a Means for Preservation. In this session we discussed the question posed in the session proposal: “How can we leverage virtualization for large-scale, robust preservation?”. Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important it is to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization. This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts
The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki) – but it is my impression that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and definitely solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.

MARAC Spring 2012: Preservation of Digital Materials (Session S1)


The official title for this session is “Preservation and Conservation of Captured and Born Digital Materials” and it was divided into three presentations with introduction and question moderation by Jordon Steele, University Archivist at Johns Hopkins University.

Digital Curation, Understanding the lifecycle of born digital items

Isaiah Beard, Digital Data Curator from Rutgers, started out with the question ‘What Is Digital Curation?’. He showed a great Dilbert cartoon on digital media curation and the set of six photos showing all different perspectives on what digital curation really is (a la the ‘what I really do’ meme – here is one for librarians).

“The curation, preservation, maintenance, collection and archiving of digital assets.” — Digital Curation Centre.

What does a Digital Curator do?

Acquire digital assets:

  • digitized analog sources
  • assets that were born digital, no physical analog exists

Certify content integrity:

  • workflow and standards and best practices
  • train staff on handling of the assets
  • perform quality assurance

Certify trustworthiness of the architecture:

  • vet codecs and container/file formats – must make sure that we are comfortable with the technology, hardware and formats
  • active role in the storage decisions
  • technical metadata, audit trails and chain of custody

Digital assets are much easier to destroy than physical objects. In contrast with physical objects which can be stored, left behind, forgotten and ‘rediscovered’, digital objects are more fragile and easier to destroy. Just one keystroke or application error can destroy digital materials. Casual collectors typically delete what they don’t want with no sense of a need to retain the content. People need to be made aware that the content might be important long term.

Digital assets are dependent on file formats and hardware/software platforms. More and more people are capturing content on mobile devices and uploading it to the web. We need to be aware of the underlying structure. File formats are proliferating and growing over time. Sound files come in 27 common file formats and 90 common codecs. Moving image files come in 58 common containers/codecs and come with audio tracks in the 27 file formats/90 common codecs.
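
As a small illustration of what “being aware of the underlying structure” can mean in practice, many common formats announce themselves in their first few bytes, independent of file extension. This is a toy sketch; real identification tools such as DROID or Siegfried rely on much richer signature registries:

```python
# Magic-byte signatures for a handful of common formats.
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container (also DOCX/XLSX/EPUB...)"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
]

def identify(path: str) -> str:
    """Guess a file's format from its leading bytes, ignoring its extension."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown (needs a specialist or a fuller signature registry)"

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, "->", identify(path))
```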

Digital assets are vulnerable to format obsolescence — examples include WordPerfect (1979), Lotus 1-2-3 (1978) and dBase (1978). We need to find ways to migrate from the old format to something researchers can use.

Physical format obsolescence is a danger — examples include tapes, floppy disk, zip disk, IBM demi-disk and video floppy. There is a threat of a ‘digital dark age’. The cloud is replacing some of this pain – but replacing it with a different challenge. People don’t have a sense of where their content is in the physical world.

Research data is the bleeding edge. Datasets come in lots of different flavors. There are lots of new and special file formats relating specifically to scientific data gathering and reporting… a long list including things like GRIB (for meteorological data), SUR (MRI data), DWG (for CAD data), SPSS (for statistical data from the social sciences) and on and on. You need to become a specialist in each new project on how to manage the research data to keep it viable.

There are ways to mitigate the challenges through predictable use cases and rigid standards. Most standard file types are known quantities. There is a built-in familiarity.

File format support: Isaiah showed a grid with one axis Open vs Closed and the other Free vs Proprietary. Expensive proprietary software that does the job so well that it is the best practice and assumed format for use can be a challenge – but it is hard to shift people from using these types of solutions.

Digital Curation Lifecycle

  • Objects are evaluated, preserved, maintained, verified and re-evaluated
  • iterative – the cycle doesn’t end with doing it just once
  • Good exercise for both known and unknown formats

The diagram from the slide shows layers – looks like a diagram of the geologic layers of the earth.

Steps:

  • data is the center of the universe
  • plan, describe, evaluate, learn meanings.
  • ingest, preserve, curate
  • continually iterate

Controlled chaos! Evaluate the collection and the needs of the digital assets. Use preservation grade tools to originate assets. Take stock of the software, systems and recording apparatus. Describe this in the technical metadata so we know how the asset originated. We need to pick our battles and need to use de facto industry standards. Sometimes those standards drive us to choices we wouldn’t pick on our own. Example – Final Cut Pro – even though it is Mac-only and proprietary.

Establish a format guide and handling procedures. Evaluate the veracity and longevity of the data format. Document and share our findings. Help others keep from needing to reinvent the wheel.

Determine method of access: How are users expected to access and view these digital items? Software/hardware required? View online – plug-in required? third party software?

Primary guidelines: Do no harm to the digital assets.

  • preservation masters, derivatives as needed
  • content modification must be done with extreme care
  • any changes must be traceable, auditable, reversible.

Prepare for the inevitable: more format migrations. Re-assess the formats and migrate to new formats when the old ones become obsolete. Maintain accessibility while ensuring data integrity.
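
Here is a toy sketch of the “preservation master plus derivative” principle, assuming the Pillow imaging library is available for the conversion itself: the master file is never modified, and each migration leaves a traceable, auditable record of before-and-after checksums.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

from PIL import Image  # assumes the Pillow library is installed

def sha256(path: Path) -> str:
    """Checksum used to prove the master is untouched and the derivative is intact."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def migrate_tiff_to_png(master: Path, out_dir: Path) -> dict:
    """Create an access derivative in a current format; never modify the preservation master."""
    derivative = out_dir / (master.stem + ".png")
    Image.open(master).save(derivative, format="PNG")
    # Record enough detail that the action is traceable and reversible in principle.
    return {
        "master": str(master),
        "master_sha256": sha256(master),
        "derivative": str(derivative),
        "derivative_sha256": sha256(derivative),
        "action": "format migration TIFF -> PNG",
        "performed": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage:
# event = migrate_tiff_to_png(Path("masters/scan0001.tif"), Path("access"))
# print(event)
```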

At Rutgers they have the RUcore Community Repository which is open source, and based on FEDORA. It is dedicated to the digital preservation of multiple digital asset types and contains 26,238 digital assets (as of April 2012). Includes audio, video, still images, documents and research data. Mix of digital surrogates and born digital assets.

Publicly available digital object standards are available for all traditional asset types. They define baseline quality requirements for ‘preservation grade’ files. These are periodically reviewed and revised as tech evolves. See Rutgers’ Page2Pixel Digital Curation standards.

They use a team approach as they need to triage new asset types. Do analysis and assessment. Apply holistic data models and the preservation lifecycle and continue to publish and share what they have done. Openness is paramount and key to the entire effort.

More resources:

The Archivist’s Dilemma: Access to collections in the digital era

Next, Tim Pyatt from Penn State spoke about ‘The Archivist’s Dilemma’ — starting with examples of how things are being done at Penn State, but then moving on to show examples of other work being done.

There are lots of different ways of putting content online. Penn State’s digital collections are published online via CONTENTdm, Flickr, social media and Penn State IR tools. The University Faculty Senate puts things up on its own; other material lives in the Internet Archive or on a custom-built platform. You need to think about how the researcher is going to approach this content.

For analog collections that have portions digitized, they describe both, and the finding aid includes a link to the digital collection. These link through to a description of the digital collection, and then to CONTENTdm for the collection itself.

Examples from Penn State:

  • A Google search for College of Agricultural Science Publications leads users to a complementary/competing site with no link back to the catalog and no descriptive/contextual information.
  • Next, we were shown the finding aid for the William W. Scranton Papers from Penn State. They also have images up on Flickr as ‘William W. Scranton Papers’. Flickr provides easy access, but acts as another content silo. It is crucial to have metadata in the header of the file to help people find their way back to the originating source. Google Analytics showed them that content is seen eight times more often in Flickr than in CONTENTdm.
  • The Judy Chicago Art Education Collection is a hybrid collection. The finding aid has a link to the curriculum site. There is a separate site for the Judy Chicago Art Education Collection, more focused on providing access to her education materials.
  • The University Curriculum Archive is a hybrid collection with a combination of digitized old course proposals, while the past 5 years of curriculum have been born digital. They worked with IT to build a database to commingle the digitized & born digital files. It was custom built and not integrated into other systems – but at least everything is in one place.

Examples of what is being done at other institutions:

Penn State is loading up a Hydra repository for their next wave!

Born-Digital @UVa: Born Digital Material in Special Collections

Gretchen Gueguen, UVA

Presentation slides available for download.

AIMS (An Inter-Institutional Model for Stewardship) born digital collections: a two-year project to create a framework for the stewardship of born-digital archival records in collecting repositories. It was funded by the Andrew W. Mellon Foundation with partners UVA, Stanford, the University of Hull, and Yale. A white paper on AIMS was published in January 2012.

The parts of the framework – collection development, accessioning, arrangement & description, discovery & access – are all covered in the white paper, including outcomes, decision points and tasks. The framework can be used to develop an institutionally specific workflow. Gretchen showed an example objective, ‘transfer records and gain administrative control’, and walked through its outcome, decision points and tasks.

Back at UVA, their post-AIMS strategizing is focusing on collection development and accessioning.

In the future, they need to work on agreements: copyright, access and ownership policies and procedures. People don’t have the copyright for a lot of the content that they are trying to donate. This makes things harder, especially when you are trying to put content online. You need to define exactly what is being donated. With born digital content, material can be donated to multiple places. Which one is the institution of record? Are multiple teams working on the same content in a redundant effort?

They need to create a feasibility evaluation to determine systematically if something is worth collecting. It should include:

  • file formats
  • hardware/software needs
  • scope
  • normalization/migration needed?
  • private/sensitive information
  • third-party/copyrighted information?
  • physical needs for transfer (network, storage space, etc.)

If you decide it is feasible to collect, how do you accomplish the transfer with uncorrupted data, support files (like fonts, software, databases) and ‘enhanced curation’? You may need a ‘write blocker’ to make sure you don’t change the content just by accessing the disk. You may want to document how the user interacted with their computer and software. Digital material is very interactive – you need to have an understanding of how the user interacted with it. Might include screen shots.

Next she showed their accessioning workflow (a rough sketch of the deduplication and technical-metadata steps follows the list):

  • take the files
  • create a disk image – bit for bit copy – makes the preservation master
  • move that from the donor’s computer to their secure network with all the good digital curation stuff
  • extract technical metadata
  • remove duplicates
  • may not take stuff with PII
  • triage if more processing is necessary
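
Two of those steps – extracting basic technical metadata and spotting exact duplicates – can be sketched with nothing but the Python standard library. This is a rough illustration of the idea, not UVA’s actual tooling:

```python
import hashlib
import mimetypes
from collections import defaultdict
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def technical_metadata(path: Path) -> dict:
    """Minimal technical metadata: size, modification time, guessed MIME type, checksum."""
    stat = path.stat()
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "path": str(path),
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
        "mime_type": mime or "unknown",
        "sha256": sha256(path),
    }

def find_duplicates(root: Path) -> dict[str, list[str]]:
    """Group files by checksum; any group with more than one member is an exact duplicate set."""
    by_hash = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_hash[sha256(path)].append(str(path))
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}

# Hypothetical usage against an extracted disk image:
# for digest, paths in find_duplicates(Path("disk_image_contents")).items():
#     print(digest[:12], paths)
```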

Be ready for surprises – lots of things that don’t fit the process:

  • 8″ floppy disk
  • badly damaged CD
  • disk no longer functions – afraid to throw away in case of miracle
  • hard drive from 1999
  • mini disks

No special notation is made of these during accessioning.

Priorities with this challenging material:

  • get the data off aging media
  • put it someplace safe and findable
  • inventory
  • triage
  • transfer

Forensic Workstation:

  • FRED = Forensic Recovery of Evidence Device – a built-in UltraBay write blocker with USB, FireWire, SATA, SCSI, IDE and Molex power connections; an external 5.25″ floppy drive, CD/DVD/Blu-ray drive, microcard reader, LTO tape drive and external 3.5″ drive, plus an external hard drive for additional storage.
  • toolbox
  • big screen

FRED’s FTK software shows you an overview of what is there, recognizes thousands of file formats, recovers deleted data, finds duplicates, and can identify PII. It is very useful for description and for selecting what to accession – but it costs a lot and requires an annual license.

BitCurator is making an open source version. From their website: “The BitCurator Project is an effort to build, test, and analyze systems and software for incorporating digital forensics methods into the workflows of a variety of collecting institutions.”

Archivematica (a rough sketch of the kind of event information a PREMIS record captures follows the list):

  • creates a PREMIS record (using the preservation metadata standard) recording what activities were done
  • creates derivative records – migration!!
  • yields a preservation master + access copies to be provided in the reading room
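
PREMIS is an XML preservation metadata standard maintained by the Library of Congress, and Archivematica generates the real records. The sketch below only mimics the kind of information a PREMIS event carries (event type, date/time, outcome, linked object), as a plain Python dict rather than the actual schema:

```python
import uuid
from datetime import datetime, timezone

def premis_like_event(event_type: str, object_id: str, outcome: str, detail: str = "") -> dict:
    """Capture the core semantic units of a PREMIS event as a plain dict.

    Illustrative only: a real PREMIS record is XML conforming to the
    Library of Congress PREMIS schema.
    """
    return {
        "event_identifier": str(uuid.uuid4()),
        "event_type": event_type,            # e.g. "ingestion", "normalization", "fixity check"
        "event_date_time": datetime.now(timezone.utc).isoformat(),
        "event_outcome": outcome,            # e.g. "success", "failure"
        "event_detail": detail,
        "linking_object_identifier": object_id,
    }

# Hypothetical identifiers, purely for illustration.
event = premis_like_event(
    event_type="normalization",
    object_id="accession-2012-004/file0001.wpd",
    outcome="success",
    detail="derivative created for reading-room access",
)
print(event)
```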

They are hoping for a Hypatia-like solution in the future.

Final words: Embrace your inner nerd! Experiment – you have nothing to lose. If you do nothing you will lose the records anyway.

Questions and Answers

QUESTION: How do you convince your administration that this needs to be a priority?

ANSWER:

Isaiah: Find examples of other institutions that are doing this. Show them that our history is at risk moving forward. A digital dark age is coming if we don’t do something now. It is really important that we show people “this is what we need to preserve”

Tim: Figure out who your local partners are. Who else has a vested interest in this content? IT at Penn State was happy that they didn’t need to keep everything – happy that there is an appraisal process, and that the archives are preserving content so it doesn’t need to be kept by everyone. I am one of the authors of an upcoming Association of Research Libraries SPEC Kit report on managing electronic records, due at the end of the summer.

Gretchen: Numbers are really useful. Sometimes you don’t think about it, but it is a good practice to count the size of what you have created. How much time would it take to recreate it if you lost it? How many people have used the content? Get some usage stats. Who is your rival and what are their statistics?

Jordon: Point to others who you want to keep up with

QUESTION: would the panelists like to share experiences with preserving dynamic digital objects like databases?

ANSWER:

Isaiah: We don’t want to embarrass people. We get so many different formats. It is a trial and error thing. You need to say gently that there is a better way to do this. A sad example – DVDs burned from tapes in 2004 that we received in 2007. The DVDs were not verified. They were not stored well – they sat in a hot warehouse. We opened the boxes and found unreadable, delaminating DVDs.

Tim: From my Duke days, we had a number of faculty data sets in proprietary formats. We would do checksums on them, wrap them up and put them in the repository. They are there, but who knows if anyone will be able to read them later. It is the same as with paper – preserve it now on good acid-free paper.

Gretchen: My 19 yo student held up a zip disk and said “Due to my extreme youth I don’t know what this is!” (And now you know why there is a photo of a zip disk at the top of this post – your reward for reading all the way to the end!)

Image Credit: ‘100MB Zip Disc for Iomega Zip, Fujifilm/IBM-branded‘ taken by Shizhao

As is the case with all my session summaries from MARAC, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Day of Digital Archives

To be honest, today was a half day of digital archives, due to personal plans taking me away from computers this afternoon. In light of that, my post is more accurately my ‘week of digital archives’.

The highlight of my digital archives week was the discovery of the Digital Curation Exchange. I promptly joined and began to explore their ‘space for all things digital curation’. This led me to a fabulous list of resources, including a set of syllabi for courses related to digital curation. Each link brought me to an extensive reading list, some with full slide decks from weekly in-class presentations. My ‘to read’ list has gotten much longer – but in a good way!

On other days recently I have found myself involved in all of the following:

  • review of metadata standards for digital objects
  • creation of internal guidelines and requirements documents
  • networking with those at other institutions to help coordinate site visits of other digitization projects
  • records management planning and reviews
  • learning about the OCR software available to our organization
  • contemplation of the web archiving efforts of organizations and governments around the world
  • reviewing my organization’s social media policies
  • listening to the audio of online training available from PLANETS (Preservation and Long-term Access through NETworked Services)
  • contemplation of the new Journal of Digital Media Management and their recent call for articles

My new favorite quote related to digital preservation comes from What we reckon about keeping digital archives: High level principles guiding State Records’ approach from the State Records folks in New South Wales Australia, which reads:

We will keep the Robert De Niro principle in mind when adopting any software or hardware solutions: “You want to be makin moves on the street, have no attachments, allow nothing to be in your life that you cannot walk out on in 30 seconds flat if you spot the heat around the corner” (Heat, 1995)

In other words, our digital archives technology will be designed to be sustainable given our limited resources so it will be flexible and scalable to allow us to utilise the most appropriate tools at a given time to carry out actions such as creation of preservation or access copies or monitoring of repository contents, but replace these tools with new ones easily and with minimal cost and with minimal impact.

I like that this speaks to the fact that no plan can perfectly accommodate the changes in technology coming down the line. Being nimble and assuming that change will be the only constant are key to ensuring access to our digital assets in the future.