future-proofing | Spellbound Blog

Chapter 9: Sharing Research Data, Data Standards and Improving Opportunities for Creating Visualisations by Dr. Vetria Byrd

January 27, 2019 1 Comment

Chapter 9 of Partners for Preservation is ‘Sharing Research Data, Data Standards and Improving Opportunities for Creating Visualisations’ by Dr. Vetria Byrd. This is the second chapter of Part III: Data and Programming. I originally had envisioned a chapter focused on the ways that standardization, controlled vocabularies, and consistent documentation could increase the re-use of data. All these things help people, separated by either space or time, to understand and leverage the work of others. Scientific communities around the world have led a lot of this work. The work of archivists to preserve data in a meaningful way is made easier by it.

Luckily for all of us, my hunt for contributing authors brought Dr. Vetria Byrd to this project. Her professional focus on visualization led me to approach the topic of sharing data and data standards from a different direction.

I am a very visual thinker. I truly believe that with a large enough whiteboard, you could plan (or explain) anything. Those who are familiar with my research back in graduate school may recall my visualization project, ArchivesZ, focused on visualizing archival descriptive information. So, when I realized that this chapter could talk about both data standards and visualization, I was sold.

The introduction to the chapter explains:

“This chapter looks at the collaborative nature of sharing the underlying data that propels the system, rather than focusing on systems and services. It provides an overview of the visualisation process, and discusses the challenge of sharing research data and ways data standards can increase opportunities for creating and sharing visualisations, while also increasing visualisation capacity building among researchers and scientists.”

A single visualization is often reliant on multiple sets of data that have been analyzed, linked, and summarized over multiple iterations to generate the final product. Take the xkcd webcomic featured at the top of this post. The citation within the webcomic itself reads “based on map data from US Drought Monitor/NOAA/Richard Tinker”. Digging a bit deeper, I found my way to the US Drought Monitor website which provides easy access to data and maps. You can learn more about the data included on the site and read the details about how they calculate the drought classifications.

I was able to quickly generate this chart, showing California Drought data over time. While it certainly makes it clear that there has been a dramatic increase of drought over time, it does not communicate the same information as the maps in the webcomic above.

I think this is a great example of how different ways of visualizing information can fundamentally change our understanding of something. Documentation of and transparency in sharing data is key. It gives the tools to a broader audience of creative individuals who can then increase the visibility of the original work and build upon it.

Bio:
Dr. Vetria Byrd is an Assistant Professor of Computer Graphics Technology and Director of the Byrd Data Visualization Lab in the Polytechnic Institute at Purdue University’s main campus in West Lafayette, Indiana. Dr. Byrd is introducing and integrating visualization capacity building into the undergraduate data visualization curriculum. She is the founder of the Broadening Participation in Visualization (BPViz) Workshop. She served as a steering committee member on the Midwest Big Data Hub (2016-2018). She has taught data visualization courses on national and international platforms as an invited lecturer of the International High Performance Computing Summer School (IHPCSS). Her visualization webinars on Blue Waters, a petascale supercomputer at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, introduce data visualization to audiences around the world. As described in her invited plenary talk featured on HPC Wire Dr. Byrd utilizes data visualization as a catalyst for communication, a conduit collaboration and as a platform to broaden participation of underrepresented groups in data visualization. Dr. Byrd’s research interests include data visualization, data analytics, data integration, visualizing heterogeneous data and the science of learning and incorporating data visualization at the curriculum level and everyday practice.

Image source: xkcd comic: California. https://www.xkcd.com/1410/

PS: If you haven’t yet discovered xkcd (self-described as “A webcomic of romance, sarcasm, math, and language.”) – you are in for a treat!

Chapter 7: Historical Building Information Model (BIM)+: Sharing, Preserving and Reusing Architectural Design Data by Dr. JuHyun Lee and Dr. Ning Gu

January 22, 2019

Chapter 7 of Partners for Preservation is ‘Historical Building Information Model (BIM)+: Sharing, Preserving and Reusing Architectural Design Data’ by Dr. JuHyun Lee and Dr. Ning Gu. The final chapter in Part II: The physical world: objects, art, and architecture, this chapter addresses the challenges of digital records created to represent physical structures. I picked the image above because I love the contrast between the type of house plans you could order from a catalog a century ago and the way design plans exist today.

This chapter was another of my “must haves” from my initial brainstorm of ideas for the book. I attended a session on ‘Preserving Born-Digital Records Of The Design Community’ at the 2007 annual SAA meeting. It was a compelling discussion, with representatives from multiple fields. Archivists working to preserve born-digital designs. People working on building tools and setting standards. There were lots of questions from the audience – many of which I managed to capture in my notes that became a detailed blog post on the session itself. It was exciting to be in the room with so many enthusiastic experts in overlapping fields. They were there to talk about what might work long term.

This chapter takes you forward to see how BIM has evolved – and how historical BIM+ might serve multiple communities. This passage gives a good overview of the chapter:

“…the chapter first briefly introduces the challenges the design and building industry have faced in sharing, preserving and reusing architectural design data before the emergence and adoption of BIM, and discusses BIM as a solution for these challenges. It then reviews the current state of BIM technologies and subsequently presents the concept of historical BIM+ (HBIM+), which aims to share, preserve and reuse historical building information. HBIM+ is based on a new framework that combines the theoretical foundation of HBIM with emerging ontologies and technologies in the field including geographic information systems (GIS), mobile computing and cloud computing to create, manage and exchange historical building data and their associated values more effectively.”

I hope you find the ideas shared in this chapter as intriguing as I do. I see lots of opportunities for archivists to collaborate with those focused on architecture and design, especially in the case of historical buildings and the proposed vision for HBIM+.

Bios:

Ning Gu is Professor of Architecture in the School of Art, Architecture and Design at the University of South Australia. Having an academic background from both Australia and China, Professor Ning Gu’s most significant contributions have been made towards research in design computing and cognition, including topics such as computational design analysis, design cognition, design communication and collaboration, generative design systems, and Building Information Modelling. The outcomes of his research have been documented in over 170 peer-reviewed publications. Professor Gu’s research has been supported by prestigious Australian research funding schemes from Australian Research Council, Office for Learning and Teaching, and Cooperative Research Centre for Construction Innovation. He has guest edited/chaired major international journals/conferences in the field. He was Visiting Scholar at MIT, Columbia University and Technische Universiteit Eindhoven.

JuHyun Lee is an adjunct senior lecturer, at the University of Newcastle (UoN). Dr. Lee has made a significant contribution towards architectural and design research in three main areas: design cognition (design and language), planning and design analysis, and design computing. As an expert in the field of architectural and design computing, Dr. Lee was invited to become a visiting academic at the UoN in 2011. Dr. Lee has developed innovative computational applications for pervasive computing and context awareness in the building environments. The research has been published in Computers in Industry, Advanced Engineering Informatics, Journal of Intelligent and Robotic Systems. His international contribution has been recognised as: Associate editor for a special edition of Architectural Science Review; Reviewer for many international journals and conferences; International reviewer for national grants.

Image Source: Image from page 717 of ‘Easy steps in architecture and architectural drawing’ by Hodgson, Frederick Thomas, 1915. https://archive.org/details/easystepsinarch00hodg/page/n717

Chapter 6: Accurate Digital Colour Reproduction on Displays: from Hardware Design to Software Features by Dr. Abhijit Sarkar

January 16, 2019

The sixth chapter in Partners for Preservation is “Accurate Digital Colour Reproduction on Displays: from Hardware Design to Software Features” by Dr. Abhijit Sarkar. As the second chapter in Part II: The physical world: objects, art, and architecture, this chapter continues to walk the edge between the physical and digital worlds.

My mother was an artist. I spent a fair amount of time as a child by her side in museums in New York City. As my own creativity has led me to photography and graphic design, I have become more and more interested in color and how it can change (or not change) across the digital barrier and across digital platforms. Add in the ongoing challenges to archival preservation of born-digital visual records and the ever-increasing efforts to digitize archival materials, and this was a key chapter I was anxious to include.

One of my favorite passages from this chapter:

If you are involved in digital content creation or digitisation of existing artwork, the single most important advice I can give you is to start by capturing and preserving as much information as possible, and allow redundant information to be discarded later as and when needed. It is a lot more difficult to synthesise missing colour fidelity information than to discard information that is not needed.

This chapter, perhaps more than any other in the book, can stand alone as a reference. It is a solid introduction to color management and representation, including both information about basic color theory and important aspects of the technology choices that govern what we see when we look at a digital image on a particular piece of hardware.

On my computer screen, the colors of the image I selected for the top of this blog post please me. How different might the 24 x 30-inch original screenprint on canvas mounted on paperboard, created fifty years ago in 1969 and now held by the Smithsonian American Art Museum, look to me in person? How different might it look on each device on which people read this blog post? I hope that this type of curiosity will lure you develop an understanding of the impacts that the choices explored in this chapter can have on how the records in your care will be viewed in the future.

Bio:

Abhijit Sarkar specializes in the area of color science and imaging. Since his early college days, Abhijit wanted to do something different from what all his friends were doing or planning to do. That mission took him through a tortuous path of earning an undergraduate degree in electrical engineering in India, two MS degrees from Penn State and RIT on lighting and color, and a PhD in France on applied computing. His doctoral thesis was mostly focused on the fundamental understanding of how individuals perceive colors differently and devising a novel method of personalized color processing for displays in order to embrace individual differences.

Because of his interdisciplinary background encompassing science, engineering and art, Abhijit regards cross-discipline collaborations like the Partners for Preservation extremely valuable in transcending the boundaries of myriads of specialized domains and fields; thereby developing a much broader understanding of capabilities and limitations of technology.

Abhijit is currently part of the display design team at Microsoft Surface, focused on developing new display features that enhance users’ color experience. He has authored a number of conference and journal papers on color imaging and was a contributing author for the Encyclopedia of Color Science and Technology.

Image source: Bullet Proof, from the portfolio Series I by artist Gene Davis, Smithsonian American Art Museum, Bequest of Florence Coulson Davis

Chapter 4: Link Rot, Reference Rot and the Thorny Problems of Legal Citation by Ellie Margolis

December 29, 2018

The fourth chapter in Partners for Preservation is ‘Link Rot, Reference Rot and the Thorny Problems of Legal Citation’ by Ellie Margolis. Links that no longer work and pages that have been updated since they were referenced are an issue that everyone online has struggled with. In this chapter, Margolis gives us insight into why these challenges are particularly pernicious for those working in the legal sphere.

This passage touches on the heart of the problem.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court’s decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses
its authority.

While working on this blog post, I found a WordPress Plugin called Broken Link Checker. It does exactly what you expect – scans through all your blog posts to check for broken URLs. In my 201 published blog posts (consisting of just shy of 150,000 words), I have 3002 unique URLs. The plugin checked them all and found 766 broken links! Interestingly, the plugin updates the styling of all broken links to show them with strikethroughs – see the strikethrough in the link text of the last link in the image below:

For each of the broken URLs it finds, you can click on “Edit Link”. You then have the option of updating it manually or using a suggested link to a Wayback Machine archived page – assuming it can find one.

It is no secret that link rot is a widespread issue. Back in 2013, the Internet Archive announced an initiative to fix broken links on the Internet – including the creation of the Broken Link Checker plugin I found. Three years later, on the Wikipedia blog, they announced that over a million broken outbound links on English Wikipedia had been fixed. Fast forward to October of 2018 and an Internet Archive blog post announced that More than 9 million broken links on Wikipedia are now rescued.

I particularly love this example because it combines proactive work and repair work. This quote from the 2018 blog post explains the approach:

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with.

There are no silver bullets here – just the need for consistent attention to the problem. The examples of issues being faced by the law community, and their various approaches to prevent or work around them, can only help us all move forward toward a more stable web of internet links.

Bio:
Ellie Margolis is a Professor of Law at Temple University, Beasley School of law, where she teaches Legal Research and Writing, Appellate Advocacy, and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her scholarship is widely cited in legal writing textbooks, law review articles, and appellate briefs.

Image credit: Image from page 235 of “American spiders and their spinningwork. A natural history of the orbweaving spiders of the United States, with special regard to their industry and habits” (1889)

Countdown to Partners for Preservation

December 4, 2018

Yes. I know. My last blog post was way back in May of 2014. I suspect some of you have assumed this blog was defunct.

When I first launched Spellbound Blog as a graduate student in July of 2006, I needed an outlet and a way to connect to like-minded people pondering the intersection of archives and technology. Since July 2011, I have been doing archival work full time. I work with amazing archivists. I think about archival puzzles all day long. Unsurprisingly, this reduced my drive to also research and write about archival topics in the evenings and on weekends.

Looking at the dates, I also see that after I took an amazing short story writing class, taught by Mary Robinette Kowal in May of 2013, I only wrote one more blog post before setting Spellbound Blog aside for a while in favor of fiction and other creative side-projects in my time outside of work.

Since mid-2014, I have been busy with many things – including (but certainly not limited to):

my day job as an archivist at the World Bank Group Archives
Tweeting @spellboundblog
writing and publishing short stories
taking photos (and using them to create a calendar and other fun things)
designing stickers
and creating a book about digital preservation and collaboration (the image at the top probably already gave it away)

I’m back to tell you all about the book.

In mid-April of 2016, I received an email from a commissioning editor in the employ of UK-based Facet Publishing (initially described to me as the publishing arm of CILIP, the UK’s equivalent to ALA). That email was the beginning of a great adventure, which will soon culminate in the publication of Partners for Preservation by Facet (and its distribution in the US by ALA). The book, edited by me and including an introduction by Nancy McGovern, features ten chapters by representatives of non-archives professions. Each chapter discusses challenges with and victories over digital problems that share common threads with issues facing those working to preserve digital records.

Over the next few weeks, I will introduce you to each of the book’s contributing authors and highlight a few of my favorite tidbits from the book. This process was very different from writing blog posts and being able to share them immediately. After working for so long in isolation it is exciting to finally be able to share the results with everyone.

PS: I also suspect, that finally posting again may throw open the floodgates to some longer essays on topics that I’ve been thinking about over the past years.

PPS: If you are interested in following my more creative pursuits, I also have a separate mailing list for that.

The CODATA Mission: Preserving Scientific Data for the Future

February 18, 2013 1 Comment

This session was part of The Memory of the World in the Digital Age: Digitization and Preservation conference and aimed to describe the initiatives of the Data at Risk Task Group (DARTG), part of the Committee on Data for Science and Technology (CODATA), a body of the International Council for Science.

The goal is to preserve scientific data that is in danger of loss because they are not in modern electronic formats, or have particularly short shelf-life. DARTG is seeking out sources of such data worldwide, knowing that many are irreplaceable for research into the long-term trends that occur in the natural world.

Organizing Data Rescue

The first speaker was Elizabeth Griffin from Canada’s Dominion Astrophysical Observatory. She spoke of two forms of knowledge that we are concerned with here: the memory of the world and the forgettery of the world. (PDF of session slides)

The “memory of the world” is vast and extends back for aeons of time, but only the digital, or recently digitized, data can be recalled readily and made immediately accessible for research in the digital formats that research needs. The “forgettery of the world” is the analog records, ones that have been set aside for whatever reason, or put away for a long time and have become almost forgotten. It the analog data which are considered to be “at risk” and which are the task group’s immediate concern.

Many pre-digital records have never made it into a digital form. Even some of the early digital data are insufficiently described, or the format is out of date and unreadable, or the records cannot be located at all easily.

How can such “data at risk” be recovered and made useable? The design of an efficient rescue package needs to be based upon the big picture, so a website has been set up to create an inventory where anyone can report data-at-risk. The Data-at-Risk Inventory (built on Omeka) is front-ended by a simple form that asks for specific but fairly obvious information about the datasets, such as field (context), type, amount or volume, age, condition, and ownership. After a few years DARTG should have some better idea as to the actual amounts and distribution of different types of historic analog data.

Help and support are needed to advertise the Inventory. A proposal is being made to link data-rescue teams from many scientific fields into an international federation, which would be launched at a major international workshop. This would give a permanent and visible platform to the rescue of valuable and irreplaceable data.

The overarching goal is to build a research knowledge base that offers a complimentary combination of past, present and future records. There will be many benefits, often cross-disciplinary, sometimes unexpected, and perhaps surprising. Some will have economic pay-offs, as in the case of some uncovered pre-digital records concerning the mountain streams that feed the reservoirs of Cape Town, South Africa. The mountain slopes had been deforested a number of years ago and replanted with “economically more appealing” species of tree. In their basement hydrologists found stacks of papers containing 73 years of stream-flow measurements. They digitized all the measurements, analyzed the statistics, and discovered that the new but non-native trees used more water. The finding clearly held significant importance for the management of Cape Town’s reservoirs. For further information about the stream-flow project see Jonkershoek – preserving 73 years of catchment monitoring data by Victoria Goodall & Nicky Allsopp.

DARTG is building a bibliography of research papers which, like the Jonkershoek one, describe projects that have depended partly or completely on the ability to access data that were not born-digital. Any assistance in extending that bibliography would be greatly appreciated.

Several members of DARTG are themselves engaged in scientific pursuits that seek long-term data. The following talks describe three such projects.

Data Rescue to Increase Length of the Record

The second speaker, Patrick Caldwell from the US National Oceanographic Data Center (NODC), spoke on rescue of tide gauge data. (PDF of full paper)

He started with an overview of water level measurement, explaining how an analog trace (a line on a paper style record generated by a float w/a timer) is generated. Tide gauges include geodetic survey benchmark to make sure that the land isn’t moving. The University of Hawaii maintains a network of gauges internationally. Back in the 1800s, they were keeping track of the tides and sea level for shipping. You never know what the application may turn into – they collected for tides, but in the 1980s they started to see patterns. They used tide gauge measurements to discover El Niño!

As you increase the length of the record, the trustworthiness of the data improves. Within sea level variations, there are some changes that are on the level of decades. To take that shift out, they need 60 years to track sea level trends. They are working to extend the length of the record.

The UNESCO Joint Technical Commission for Oceanography & Marine Meteorology has Global Sea Level Observing System (GLOSS)

GLOSS has a series of Data Centers:

Permanent Service for Mean Sea Level (monthly)
Joint archive for sea level (hourly)
British Oceanographic Data center (high frequency)

The biggest holding starts at 1940s. They want to increase the number of longer records. A student in France documented where he found records as he hunted for the data he needed. Oregon students documented records available at NARA.

Global Oceanographic Data Archaeology and Rescue (GODAR) and the World Ocean Database Project

The Historic Data Rescue Questionnaire created in November 2011 resulted in 18 replies from 14 countries documenting tide gauge sites with non-digital data that could be rescued. They are particularly interested in the records that are 60 years or more in length.

Future Plans: Move away from identifying what is out there to tackling the rescue aspect. This needs funding. They will continue to search repositories for data-at-risk and continue collaboration with GLOSS/DARTG to freshen on-line inventory. Collaborate with other programs (Atmospheric Circulation Reconstructions over the Earth (ACRE) meeting 11-2012). Eventually move to Phase II = recovery!

The third speaker, Stephen Del Greco from the US NOAA National Climatic Data Center (NCDC), spoke about environmental data through time and extending the climate record. (PDF of full paper) The NCDC is a weather archive with headquarters in Asheville, NC. It fulfills much of the nation’s climate data requirements. Their data comes from many different sources. Safe storage of over 5,600 terabytes of climate data (= 6.5 billion kindle books). How will they handle the upcoming explosion of data on the way? Need to both handle new content coming in AND provide increased access to larger amounts of data being downloaded over time. 2011 number = data download of 1,250 terabytes for the year. They expect that download number to increase 10 fold over the next few years.

The climate database modernization program went on over more than a decade rescuing data. It was well funded and millions of records were rescued with a budget of roughly 20 Million a year. The goal is to preserve and make major climate and environmental data available via the World Wide Web. Over 14 terabytes of climate data are now digitized. 54 million weather and environmental images are online. Hundreds of millions of records are digitized and now online. The biggest challenge was getting the surface observation data digitized. NCDC digital data for hourly surface observations generally stretch back to around 1948. Some historical marine observations go back to the spice trade records.

For international efforts they bring their imaging equipment to other countries where records were at risk. 150,000 records imaged under the Climate Database Modernization Program (CDMP).

Now they are moving from public funding to citizen-fueled projects via crowdsourcing such as the Zooniverse Program. Old Weather is a Zooniverse Project which uses crowdsourcing to digitize and analyze climate data. For example, the transcription done by volunteers help scientists model Earth’s climate using wartime ship logs. The site includes methods to validate efforts from citizens. They have had almost 700,000 volunteers.

Long-term Archive Tasks:

Rescuing Satellite Data: raw images in lots of different film formats. All this is at risk. Need to get it all optically imaged. Looking at a ‘citizen alliance’ to do this work.
Climate Data Records: Global Essential Climate Variables (ECVs) with Heritage Records. Lots of potential records for rescue.
Rescued data helps people building proxy data sets: NOAA Paleoclimatology. ‘Paleoclimate proxies’ – things like boreholes, tree rings, lake levels, pollen, ice cores and more. For example – getting temperate and carbon dioxide from ice cores. These can go back 800,000 years!

We have extended the climate record through international collaboration. For example, the Australian Bureau of Meteorology provided daily temperature records for more than 1,500 additional stations. This meant a more than 10-fold increase in previous historical climate daily data holdings from that country.

Born Digital Maps

The final presentation discussed the map as a fundamental source of memory of the world, delivered by D. R. Fraser Taylor and Tracey Lauriault from Carleton University’s Geomatics and Cartographic Research Center in Canada. The full set of presentation slides are available online on SlideShare. (PDF of full paper)

We are now moving into born digital maps. For example, the Canadian Geographic Information System (CGIS) was created in the 1960s and was the worlds 1st GIS. Maps are ubiquitous in the 21st century. All kinds of organizations are creating their own maps and mash-ups. Community based NGOs, citizen science, academic and private sector are all creating maps.

We are loosing born digital maps almost faster than we are creating them. We have lost 90% of the born digital maps. Above all there is an attitude that preservation is not intrinsically important. No-one thought about the need to preserve the map – everyone thought someone else would do it. There was a complete lack of thought related to the preservation of these maps.

The Canada Land Inventory (CLI) was one of the first and largest born digital map efforts in the world. Mapped 2.6 million square kilometers of Canada. Lost in the 1980s. No-one took responsibility for archiving. Those who thought about it believed backup equaled archiving. A group of volunteers rescued the process over time – salvaged from boxes of tapes and paper in mid-1990s. It was caught just in time and took a huge effort. 80% has been saved and is now it is online. This was rescued because it was high profile. What about the low-profile data sets? Who will rescue them? No-one.

The 1986 BBC Doomsday Book was created in celebration of 900 years after William the Conqueror’s original Domesday Book. It was obsolete by the 1990s. A huge amount of social and economic information was collected for this project. In order to rescue it they needed an acorn computer and needed to be able to read the optical disks. The platform was emulated in 2002-2003. It cost 600,000 british pounds to reverse engineer and put online in 2004. New discs made in 2003 at the UK Archive.

It is easier to get Ptolomy’s maps from 15th century than it is to get a map 10 years old.

The Inuit Siku (sea ice) Atlas, an example of a Cybercartographic atlas, was produced in cooperation with Inuit communities. Arguing that the memory of what is happening in the north lies in the minds of the elders, they are capturing the information and putting it out in multi-media/multi-sensory map form. The process is controlled by the community themselves. They provide the software and hardware. They created a graphic tied to the Inuit terms for different types of sea ice. In some cases they record the audio of an elder talking about a place. The narrative of the route becomes part of the atlas. There is no right or wrong answer. There are many versions and different points of view. All are based on the same set of facts – but they come from different angles. The atlases capture them all.

The Gwich’in Place Name Atlas is building in the idea of long term preservation into the application from the start

The Cybercartographic Atlas of the Lake Huron Treaty Relationship Process is taking data from surveyors diaries from the 1850s.

There are lots of government of Canada geospatial data preservation intitatives, but in most cases there is a lot of retoric, but not so much action. There have been many consultations, studies, reports and initiatives since 2002, but the reality is that apart from the Open Government Consultations (TBS), not very much as translated into action. Even in the case where there is legislation, lots of things look good on paper but don’t get implemented.

There are Library and Archives Guidelines working to support digital preservation of geospatial data. The InterPares 2 (IP2) Geospatial Case Studies tackle a number of GIS examples, including the Cybercartographic Atlas of Antartica. See the presentation slides online for more specific examples.

In general, preservation as an afterthought rarely results in full recovery of born digital maps. It is very important to look at open source and interoperable open specifications. Proactive archiving is an important interim strategy.

Geospatial data are fundamental sources of our memory of the world. They help us understand our geo-narratives (stories tied to location), counter colonial mappings, are the result of scientific endeavors, represent multiple worldviews and they inform decisions. We need to overcome the challenges to ensure their preservation.

Q&A:

QUESTION: When I look at the work you are doing with recovering Inuit data from people. You recover data and republish it – who will preserve both the raw data and the new digital publication? What does it mean to try and really preserve this moving forward? Are we really preserving and archiving it?

ANSWER: No we are not. We haven’t been able to find an archive in Canada that can ingest our content. We will manage it ourselves as best we can. Our preservation strategy is temporary and holding, not permanent as it should be. We can’t find an archive to take the data. We are hopeful that we are moving towards finding a place to keep and preserve it. There is some hope on the horizon that we may move in the right directions in the Canadian context.

Luciana: I wanted to attest that we have all the data from InterPARES II. It is published in the final. I am jealously guarding my two servers that I maintain with money out of my own pocket.

QUESTION: Is it possible to have another approach to keep data where it is created, rather than a centralized approach?

ANSWER: We are providing servers to our clients in the north. Keeping copies of the database in the community where they are created. Keeping multiple copies in multiple places.

QUESTION: You mention surveys being sent out and few responses coming back. When you know there is data at risk – there may be governments that have records at risk that they are shy to reveal to the public? How do we get around that secrecy?

ANSWER: (IEDRO representative) We offer our help, rather than a request to get their data.

As is the case with all my session summaries, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Image Credit: NARA Flickr Commons image “The North Jetty near the Mouth of the Columbia River 05/1973”

Updated 2/20/2013 based on presenter feedback.

CURATEcamp Processing 2012

August 5, 2012 2 Comments

CURATEcamp Processing 2012 was held the day after the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA) sponsored Digital Preservation annual meeting.

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote down a title and short description on a paper and taped it to the wall. These were then reviewed, rearranged on the schedule and combined where appropriate until we had our full final schedule. More than half the sessions on the schedule have links through to notes from the session. There were four session slots, plus a noon lunch slot of lightening talks.

Session I: At Risk Records in 3rd Party Systems This was the session I had proposed combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created – one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion including these assorted points:

Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak for preserving every season’s programs since 1842.
There are consequences to not having and following a ‘file plan’. Part of people’s jobs have to be to follow the rules.
What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
“Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
When should archivists be in the process? At the start – before content is created, before systems are created?
Keep the original data AND keep updated data. Document everything, data sources, processes applied.

Session II: Automating Review for Restrictions? This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that that challenge of review of records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘Easy’ category was for well defined content that follow rules that we could imagine teaching a computer to identity — things like US social security numbers, passport numbers or credit card numbers. The ‘Hard’ category was for restrictions that had more human judgement involved. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required – and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so – drop a comment below please!

Lunchtime Lightning Talks: At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a means for Preservation In this session we discussed the question posed in the session proposal “How can we leverage virtualization for large-scale, robust preservation?”. ~~I am not sure if any notes were generated for this session.~~ Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important is it to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts
The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many there were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki) – but it is my impression that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and definitely solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.

Day of Digital Archives

October 6, 2011

To be honest, today was a half day of digital archives, due to personal plans taking me away from computers this afternoon. In light of that, my post is more accurately my ‘week of digital archives’.

The highlight of my digital archives week was the discovery of the Digital Curation Exchange. I promptly joined and began to explore their ‘space for all things ‘digital curation’ ‘. This led me to a fabulous list of resources, including a set of syllabi for courses related to digital curation. Each link brought me to an extensive reading list, some with full slide decks related to weekly in classroom presentations. My ‘to read’ list has gotten much longer – but in a good way!

On other days recently I have found myself involved in all of the following:

review of metadata standards for digital objects
creation of internal guidelines and requirements documents
networking with those at other institutions to help coordinate site visits of other digitization projects
records management planning and reviews
learning about the OCR software available to our organization
contemplation of the web archiving efforts of organizations and governments around the world
reviewing my organization’s social media policies
listening to the audio of online training available from PLANETS (Preservation and Long-term Access through NETworked Services)
contemplation of the new Journal of Digital Media Management and their recent call for articles

My new favorite quote related to digital preservation comes from What we reckon about keeping digital archives: High level principles guiding State Records’ approach from the State Records folks in New South Wales Australia, which reads:

We will keep the Robert De Niro principle in mind when adopting any software or hardware solutions: “You want to be makin moves on the street, have no attachments, allow nothing to be in your life that you cannot walk out on in 30 seconds flat if you spot the heat around the corner” (Heat, 1995)

In other words, our digital archives technology will be designed to be sustainable given our limited resources so it will be flexible and scalable to allow us to utilise the most appropriate tools at a given time to carry out actions such as creation of preservation or access copies or monitoring of repository contents, but replace these tools with new ones easily and with minimal cost and with minimal impact.

I like that this speaks to the fact that no plan can perfectly accommodate the changes in technology coming down the line. Being nimble and assuming that change will be the only constant are key to ensuring access to our digital assets in the future.

Rescuing 5.25″ Floppy Disks from Oblivion

July 25, 2011 19 Comments

This post is a careful log of how I rescued data trapped on 5 1/4″ floppy disks, some dating back to 1984 (including those pictured here). While I have tried to make this detailed enough to help anyone who needs to try this, you will likely have more success if you are comfortable installing and configuring hardware and software.

I will break this down into a number of phases:

Phase 1: Hardware
Phase 2: Pull the data off the disk
Phase 3: Extract the files from the disk image
Phase 4: Migrate or Emulate

Phase 1: Hardware

Before you do anything else, you actually need a 5.25″ floppy drive of some kind connected to your computer. I was lucky – a friend had a floppy drive for us to work with. If you aren’t that lucky, you can generally find them on eBay for around $25 (sometimes less). A friend had been helping me by trying to connect the drive to my existing PC – but we could never get the communications working properly. Finally I found Device Side Data’s 5.25″ Floppy Drive Controller which they sell online for $55. What you are purchasing will connect your 5.25 Floppy Drive to a USB 2.0 or USB 1.1 port. It comes with drivers for connection to Windows, Mac and Linux systems.

If you don’t want to mess around with installing the disk drive into our computer, you can also purchase an external drive enclosure and a tabletop power supply. Remember, you still need the USB controller too.

Update: I just found a fantastic step-by-step guide to the hardware installation of Device Side’s drive controller from the Maryland Institute for Technology in the Humanities (MITH), including tons of photographs, which should help you get the hardware install portion done right.

Phase 2: Pull the data off the disk

The next step, once you have everything installed, is to extract the bits (all those ones and zeroes) off those floppies. I found that creating a new folder for each disk I was extracting made things easier. In each folder I store the disk image, a copy of the extracted original files and a folder named ‘converted’ in which to store migrated versions of the files.

Device Side provides software they call ‘Disk Image and Browse’. You can see an assortment of screenshots of this software on their website, but this is what I see after putting a floppy in my drive and launching USB Floppy -> Disk Image and Browse:

You will need to select the ‘Disk Type’ and indicate the destination in which to create your disk image. Make sure you create the destination directory before you click on the ‘Capture Disk File Image’ button. This is what it may look like in progress:

Fair warning that this won’t always work. At least the developers of the software that comes with Device Side Data’s controller had a sense of humor. This is what I saw when one of my disk reads didn’t work 100%:

If you are pressed for time and have many disks to work your way through, you can stop here and repeat this step for all the disks you have on hand.

Phase 3: Extract the files from the disk image

Now that you have a disk image of your floppy, how do you interact with it? For this step I used a free tool called Virtual Floppy Drive. After I got this installed properly, when my disk image appeared, it was tied to this program. Double clicking on the Floppy Image icon opens the floppy in a view like the one shown below:

It looks like any other removable disk drive. Now you can copy any or all of the files to anywhere you like.

Phase 4: Migrate or Emulate

The last step is finding a way to open your files. Your choice for this phase will depend on the file formats of the files you have rescued. My files were almost all WordStar word processing documents. I found a list of tools for converting WordStar files to other formats.

The best one I found was HABit version 3.

It converts Wordstar files into text or html and even keeps the spacing reasonably well if you choose that option. If you are interested in the content more than the layout, then not retaining spacing will be the better choice because it will not put artificial spaces in the middle of sentences to preserve indentation. In a perfect world I think I would capture it both with layout and without.

Summary

So my rhythm of working with the floppies after I had all the hardware and software installed was as follows:

create a new folder for each disk, with an empty ‘converted’ folder within it
insert floppy into the drive
run DeviceSide’s Disk Image and Browse software (found on my PC running Windows under Start -> Programs -> USB Flopy)
paste the full path of the destination folder
name the disk image
click ‘Capture Disk Image’
double click on the disk image and view the files via vfd (virtual floppy drive)
copy all files into the folder for that disk
convert files to a stable format (I was going from WordStar to ASCII text) and save the files in the ‘converted’ folder

These are the detailed instructions I tried to find when I started my own data rescue project. I hope this helps you rescue files currently trapped on 5 1/4″ floppies. Please let me know if you have any questions about what I have posted here.

Update: Another great source of information is Archive Team’s wiki page on Rescuing Floppy Disks.

SXSWi: You’re Dead, Your Data Isn’t: What Happens Now?

March 31, 2011 3 Comments

This five person panel at SXSW Interactive 2011 tackled a broad range of issues related to what happens to our online presence, assets, creations and identity after our death.

Presenters:

Adele McAlear author of Death and Digital Legacy
Dazza Greenwood author of Civics.com
Evan Carroll and John Romano, co-authors of Your Digital Afterlife
Jesse Davis cofounder of Entrustet

There was a lot to take in here. You can listen to the full audio of the session or watch a recording of the session’s live stream (the first few minutes of the stream lacks audio).

A quick and easy place to start is this lovely little video created as part of the promotion of Your Digital Afterlife – it gives a nice quick overview of the topic:

Also take a look at the Visual Map that was drawn by Ryan Robinson during the session – it is amazing! Rather than attempt to recap the entire session, I am going to just highlight the bits that most caught my attention:

Laws, Policies and Planning
Currently individuals are left reading the fine print and hunting for service specific policies regarding access to digital content after the death of the original account holder. Oklahoma recently passed a law that permits estate executors to access the online accounts of the recently deceased – the first and only state in the US to have such a law. It was pointed out during the session that in all other states, leaving your passwords to your loved ones is you asking them to impersonate you after your death.

Facebook has an online form to report a deceased person’s account – but little indication of what this action will do to the account. Google’s policy for accessing a deceased person’s email requires six steps, including mailing paper documents to Mountain View, CA.

There is a working group forming to create model terms of service – you can add your name to the list of those interested in joining at the bottom of this page.

What Does Ownership Mean?
What is the status of an individual email or digital photo? Is it private property? I don’t recall who mentioned it – but I love the notion of a tribe or family unit owning digital content. It makes sense to me that the digital model parallel the real world. When my family buys a new music CD, our family owns it – not the individual who happened to go to the store that day. It makes sense that an MP3 purchased by any member of my family would belong to our family. I want to be able to buy a Kindle for my family and know that my son can inherit my collection of e-books the same way he can inherit the books on my bookcase.

Remembering Those Who Have Passed
How does the web change the way we mourn and memorialize people? Many have now had the experience of learning of the passing of a loved one online – the process of sorting through loss in the virtual town square of Facebook. How does our identity transform after we are gone? Who is entitled to tag us in a photo?

My family suffered a tragic loss in 2009 and my reaction was to create a website dedicated to preserving memories of my cousin. At the Casey Feldman Memories site, her friends and family can contribute memories about her. As the site evolved, we also added a section to preserve her writing (she was a journalism student) – I kept imagining the day when we realized that we could no longer access her published articles online. I built the site using Omeka and I know that we have control over all the stories and photos and articles stored within the database.

It will be interesting to watch as services such as Chronicle of Life spring up claiming to help you “Save your memories FOREVER!”. They carefully explain why they are a trustworthy digital repository and why they backup their claims with a money-back guarantee.

For as little as $10, you can preserve your life story or daily journal forever: It allows you to store 1,000 pages of text, enough for your complete autobiography. For the same amount, you could also preserve less text, but up to 10 of your most important photos. – Chronicle of Life Pricing

Privacy
There are also some interesting questions about privacy and the rights of those who have passed to keep their secrets. Facebook currently deletes some parts of a profile when it converts it to a ‘memorial’ profile. They state that this is for the privacy of the original account holder. If users are ultimately given more power over the disposition of their social web presence – should these same choices be respected by archivists? Or would these choices need to be respected the way any other private information is guarded until some distant time after which it would then be made available?

Conculsion
Thanks again to all the presenters – this really was one of the best sessions for me at SXSWi! I loved that it got a whole different community of people thinking about digital preservation from a personal point of view. You may also want to read about Digital Death Day – one coming up in May 2011 in the San Francisco Bay Area and another in September 2011 in the Netherlands.

Image credit: Excerpt from Ryan Robinson’s Visual Map created live during the SXSW session.

Category: future-proofing