
Category: open source

Chapter 10: Open Source, Version Control and Software Sustainability by Ildikó Vancsa


Chapter 10 of Partners for Preservation is ‘Open Source, Version Control and Software Sustainability’ by Ildikó Vancsa. The third chapter of Part III: Data and Programming, and the final chapter of the book, it shifts the lens on programming to the elements of communication and coordination required to sustain open source software projects.

When the Pacific Telegraph Route (shown above) was finished in 1861, it connected the new state of California to the East Coast and put the Pony Express out of business. During its first week of operation, it cost a dollar a word. Almost 110 years later, in 1969, the first digital transmission was sent over ARPANET (the precursor to the Internet).

Vancsa explains early in the chapter:

We cannot really discuss open source without mentioning the effort that people need to put into communicating with each other. Members of a community must be able to follow and track back the information that has been exchanged, no matter what avenue of communication is used.

I love envisioning the long evolution from the telegraph crossing the continent to the Internet stretching around the world. With each leap forward in technology and communication, we have made it easier to collaborate across space and time. Archives, at their heart, are dedicated to this kind of collaboration. Our two fields can learn from and support one another in so many ways.

Bio:

Ildikó Vancsa started her journey with virtualization during her university years and has remained connected to the technology in different ways ever since. She started her career at a small research and development company in Budapest, where she focused on areas like system management, business process modeling and optimization. Ildikó got involved with OpenStack when she started to work on the cloud project at Ericsson in 2013. She was a member of the Ceilometer and Aodh project core teams. She is now working for the OpenStack Foundation, where she drives network functions virtualization (NFV)-related feature development activities in projects like Nova and Cinder. Beyond code and documentation contributions, she is also very passionate about on-boarding and training activities.

Image source: Route of the first transcontinental telegraph, 1862.
https://commons.wikimedia.org/wiki/File:Pacific_Telegraph_Route_-_map,_1862.jpg

Chapter 5: The Internet of Things: the risks and impacts of ubiquitous computing by Éireann Leverett

Chapter 5 of Partners for Preservation is ‘The Internet of Things: the risks and impacts of ubiquitous computing’ by Éireann Leverett. This is one of the chapters that evolved a bit from my original idea – shifting from being primarily about proprietary hardware to focusing on the Internet of Things (IoT) and the cascade of social and technical fallout that needs to be considered.

Leverett gives this most basic definition of IoT in his chapter:

At its core, the Internet of Things is ‘ubiquitous computing’, tiny computers everywhere – outdoors, at work in the countryside, at use in the city, floating on the sea, or in the sky – for all kinds of real world purposes.

In 2013, I attended a session at The Memory of the World in the Digital Age: Digitization and Preservation conference on the preservation of scientific data. I was particularly taken with The Global Sea Level Observing System (GLOSS) — almost 300 tide gauge stations around the world making up a web of sea level observation sensors. The UNESCO Intergovernmental Oceanographic Commission (IOC) established this network, but cannot add to or maintain it themselves. The success of GLOSS “depends on the voluntary participation of countries and national bodies”. It is a great example of what a network of sensors deployed en masse by multiple parties can do – especially when trying to achieve more than a single individual or organization can on its own.

Much of IoT is not implemented for the greater good, but rather to further commercial aims. This chapter gives a good overview of the basics of IoT and considers a broad array of related issues, including privacy, proprietary technology, and big data. It is also the perfect chapter to begin Part II: The physical world: objects, art, and architecture – shifting to a topic in which the physical world outside the computer demands consideration.

Bio:

Éireann Leverett

Éireann Leverett once found 10,000 vulnerable industrial systems on the internet.

He then worked with Computer Emergency Response Teams around the world for cyber risk reduction.

He likes teaching the basics and learning the obscure.

He continually studies computer science, cryptography, networks, information theory, economics, and magic history.

He is a regular speaker at computer security conferences such as FIRST, BlackHat, Defcon, Brucon, Hack.lu, RSA, and CCC; and also at insurance and risk conferences such as Society of Information Risk Analysts, Onshore Energy Conference, International Association of Engineering Insurers, International Risk Governance Council, and the Reinsurance Association of America. He has been featured by the BBC, The Washington Post, The Chicago Tribune, The Register, The Christian Science Monitor, Popular Mechanics, and Wired magazine.

He is a former penetration tester from IOActive, and was part of a multidisciplinary team that built the first cyber risk models for insurance with Cambridge University Centre for Risk Studies and RMS.

Image credit: Zan Zig performing with rabbit and roses, including hat trick and levitation, Strobridge Litho. Co., c1899.

NOTE: I chose the magician in the image above for two reasons:

  1. because IoT can seem like magic
  2. because the author of this chapter is a fan of magic and magic history

Chapter 4: Link Rot, Reference Rot and the Thorny Problems of Legal Citation by Ellie Margolis

The fourth chapter in Partners for Preservation is ‘Link Rot, Reference Rot and the Thorny Problems of Legal Citation’ by Ellie Margolis. Links that no longer work and pages that have been updated since they were referenced are an issue that everyone online has struggled with. In this chapter, Margolis gives us insight into why these challenges are particularly pernicious for those working in the legal sphere.

This passage touches on the heart of the problem.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court’s decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses its authority.

While working on this blog post, I found a WordPress Plugin called Broken Link Checker. It does exactly what you expect – scans through all your blog posts to check for broken URLs. In my 201 published blog posts (consisting of just shy of 150,000 words), I have 3002 unique URLs. The plugin checked them all and found 766 broken links! Interestingly, the plugin updates the styling of all broken links to show them with strikethroughs – see the strikethrough in the link text of the last link in the image below:

For each of the broken URLs it finds, you can click on “Edit Link”. You then have the option of updating it manually or using a suggested link to a Wayback Machine archived page – assuming it can find one.
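Under the hood, a tool like this is doing something fairly simple. Below is a minimal sketch of the same idea in Python – not the plugin’s actual code (the helper names are mine, and it assumes the third-party requests library), just a checker that flags failing URLs and asks the Wayback Machine’s public availability API whether an archived copy exists:

    # A minimal sketch of a link checker, not the Broken Link Checker plugin itself.
    # The helper names are hypothetical; the endpoint is the Wayback Machine's
    # public availability API.
    import requests

    WAYBACK_API = "https://archive.org/wayback/available"

    def is_broken(url):
        """Treat network errors and 4xx/5xx responses as broken links."""
        try:
            response = requests.head(url, allow_redirects=True, timeout=10)
            return response.status_code >= 400
        except requests.RequestException:
            return True

    def suggest_archive(url):
        """Return the closest Wayback Machine snapshot for a URL, if one exists."""
        data = requests.get(WAYBACK_API, params={"url": url}, timeout=10).json()
        snapshot = data.get("archived_snapshots", {}).get("closest")
        return snapshot["url"] if snapshot and snapshot.get("available") else None

    def check_links(urls):
        """Print each broken URL along with a suggested archived replacement."""
        for url in urls:
            if is_broken(url):
                print(f"BROKEN: {url} -> try: {suggest_archive(url)}")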

It is no secret that link rot is a widespread issue. Back in 2013, the Internet Archive announced an initiative to fix broken links on the Internet – including the creation of the Broken Link Checker plugin I found. Three years later, on the Wikipedia blog, they announced that over a million broken outbound links on English Wikipedia had been fixed. Fast forward to October of 2018, and an Internet Archive blog post announced that ‘More than 9 million broken links on Wikipedia are now rescued’.

I particularly love this example because it combines proactive work and repair work. This quote from the 2018 blog post explains the approach:

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with.
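The proactive half of that approach is just as easy to picture. The sketch below (again Python, and again only an illustration – IABot itself is a far more sophisticated tool) submits newly added reference URLs to the Wayback Machine’s ‘Save Page Now’ endpoint so an archived copy exists before the live page has a chance to rot; the function names are mine:

    # A rough illustration of proactive archiving; this is not IABot's actual code
    # and the function names are hypothetical. Assumes the requests library.
    import requests

    SAVE_PAGE_NOW = "https://web.archive.org/save/"

    def archive_now(url):
        """Ask the Wayback Machine to capture a fresh snapshot of a URL."""
        try:
            return requests.get(SAVE_PAGE_NOW + url, timeout=60).ok
        except requests.RequestException:
            return False

    def archive_new_references(new_urls):
        """Archive every newly cited URL; return the ones that failed, for retry."""
        return [url for url in new_urls if not archive_now(url)]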

There are no silver bullets here – just the need for consistent attention to the problem. The examples of issues being faced by the law community, and their various approaches to prevent or work around them, can only help us all move forward toward a more stable web of internet links.

Ellie Margolis

Bio:
Ellie Margolis is a Professor of Law at Temple University, Beasley School of Law, where she teaches Legal Research and Writing, Appellate Advocacy, and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her scholarship is widely cited in legal writing textbooks, law review articles, and appellate briefs.

Image credit: Image from page 235 of “American spiders and their spinningwork. A natural history of the orbweaving spiders of the United States, with special regard to their industry and habits” (1889)

ArchivesZ Needs You!

I got a kind email today asking “Whither ArchivesZ?”. My reply was: “it is sleeping” (projects do need their rest) and “I just started a new job” (I am now a Metadata and Taxonomy Consultant at The World Bank) and “I need to find enthusiastic people to help me”. That final point brings me to this post.

I find myself in the odd position of having finished my Master’s Degree and not wanting to sign on for the long haul of a PhD. So I have a big project that was born in academia, initially as a joint class project and more recently as independent research with a grant-funded programmer, but I am no longer in academia.

What happens to projects like ArchivesZ? Is there an evolutionary path towards it being a collaborative project among dispersed enthusiastic individuals? Or am I more likely to succeed by recruiting current graduate students at my former (and still nearby) institution? I have discussed this one-on-one with a number of individuals, but I haven’t thrown open the gates for those who follow me here online.

For those of you who have been waiting patiently, the ArchivesZ version 2 prototype is available online. I can’t promise it will stay online for long – it is definitely brittle for reasons I haven’t totally identified. A few things to be aware of:

  • when you load the main page, you should see tags listed at the bottom – if you don’t see any, drop me an email via my contact form and I will try to get Tomcat and Solr back up. If you have a small screen, you may need to view your browser full screen to get to all the parts of the UI.
  • I know there are lots of bugs of various sizes. Some paths through the app work – some don’t. Some screens are just placeholders. Feel free to poke around and try things – you can’t break it for anyone else!

I think there are a few key challenges to building what I would think of as the first ‘full’ version of ArchivesZ – listed here in no particular order:

  • In the process of creating version 2, I was too ambitious. The current version of ArchivesZ has lots of issues – some usability problems, some bugs (see the prototype above!)
  • Wherever a collaborative ArchivesZ workspace ends up living, it will need large data sets. I did a lot of work on data from eleven institutions in the spring of 2009, so there is a lot of data available – but it is still a challenge.
  • A lot of my future ideas for ArchivesZ are trapped in my head. The good news is that I am honestly open to others’ ideas for where to take it in the future.
  • How do we build a community around the creation of ArchivesZ?

I still feel that there is a lot to be gained by building a centralized visualization tool/service through which researchers and archivists could explore and discover archival materials. I even think there is promise to a freestanding tool that supports exploration of materials within a single institution. I can’t build it alone. This is a good thing – it will be much better in the end with the input, energy and knowledge of others. I am good at ideas and good at playing the devil’s advocate. I have lots of strength on the data side of things, and visualization has been a passion of mine for years. I need smart people with new ideas, strong tech skills (or a desire to learn) and people who can figure out how to organize the herd of cats I hope to recruit.

So – what can you do to help ArchivesZ? Do you have mad ActionScript 3 skills? Do you want to dig into the scary little Ruby script that populates the database? Maybe you prefer to organize and coordinate? Have you always wanted to figure out how a project like this could grow from a happy (or awkward?) prototype into a real service that people depend on?

Do you have a vision for how to tackle this as a project? Open source? Grant funded? Something else clever?

Know any graduate students looking for good research topics? There are juicy bits here for those interested in data, classification, visualization and cross-repository search.

I will be at SAA in DC in August chairing a panel on search engine optimization of archival websites. If there is even just one of you out there who is interested, I would cheerfully organize an ArchivesZ summit of some sort in which I could show folks the good, bad and ugly of the prototype as it stands. Let me know in the comments below.

Won’t be at SAA but want to help? Chime in here too. I am happy to set up some shared desktop tours of whatever you would like to see.

PS: Yes, I do have all the version 2 code – and what is online at the Google Code ArchivesZ page is not up to date. Updating the ArchivesZ website and uploading the current code is on my to do list!

THATCamp 2008: Day 1 Dork Short Lightning Talks

During lunch on the first day of THATCamp, people volunteered to give lightning talks they called ‘Dork Shorts’. As we ate our lunch, a steady stream of folks paraded up to the podium and gave an elevator-pitch-length demo. These are the projects for which I managed to type URLs and some other info into my laptop. If you are looking for examples of inspirational and innovative work at the intersection of technology and the humanities – these are a great place to start!

Have more links to projects I missed? Please add them in the comments below.

Image credit: Lightning by thenss (Christopher Cacho) via flickr

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community, I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
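To make the modular idea a bit more concrete, here is a conceptual sketch in Python (Dioscuri itself is written in Java, and these class names are mine, not its real architecture): each hardware component is a separate module, and a ‘machine’ is just a particular combination of modules.

    # A conceptual sketch of module-based emulation, loosely inspired by the
    # description above; the classes are hypothetical and do not reflect
    # Dioscuri's actual design.
    from abc import ABC, abstractmethod

    class HardwareModule(ABC):
        """One emulated component: CPU, memory, video card, and so on."""

        @abstractmethod
        def reset(self):
            ...

    class Memory(HardwareModule):
        def __init__(self, size):
            self.data = bytearray(size)

        def reset(self):
            self.data = bytearray(len(self.data))

    class CPU(HardwareModule):
        def __init__(self, memory):
            self.memory = memory
            self.program_counter = 0

        def reset(self):
            self.program_counter = 0

    class Emulator:
        """A hardware configuration is just a particular collection of modules."""

        def __init__(self, modules):
            self.modules = modules

        def reset(self):
            for module in self.modules:
                module.reset()

    # Different vintage machines become different module combinations,
    # not entirely different programs.
    memory = Memory(64 * 1024)
    old_pc = Emulator([memory, CPU(memory)])
    old_pc.reset()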

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University, titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation, I found the following simple answer to the question ‘Why not just migrate?’:

  • Loss of information — e.g. word edits
  • Loss of fidelity — e.g. WordPerfect to Word conversion isn’t very good
  • Loss of authenticity — users of a migrated document need access to the original to verify authenticity
  • Not always possible — closed proprietary formats
  • Not always feasible — costs may be too high
  • Emulation may be necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state that their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well-thought-out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.
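A toy example may help make that layering clear. In the sketch below (Python, purely illustrative – this is not Olonys, and the instruction set is invented for the example), programs are written against a tiny virtual instruction set, so only the interpreter has to be ported to new hardware, not every program that runs on top of it:

    # A toy stack-machine interpreter illustrating the 'universal machine' idea;
    # the opcodes are invented for this example and have nothing to do with Olonys.
    def run(program, stack=None):
        """Interpret a list of (opcode, argument) pairs on a simple stack machine."""
        stack = [] if stack is None else stack
        for opcode, arg in program:
            if opcode == "PUSH":
                stack.append(arg)
            elif opcode == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif opcode == "PRINT":
                print(stack[-1])
            else:
                raise ValueError(f"unknown opcode: {opcode}")
        return stack

    # The same program runs unchanged wherever the interpreter runs.
    run([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PRINT", None)])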

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records, and I kept finding that the records of the research provided didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I am to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.