
Category: at risk records

At Risk Records are archival records that are in danger of being lost forever, usually due to physical damage or (in the case of electronic records) loss of the ability to access the originals.

Why Preserve? To Connect!

In honor of 2025’s World Digital Preservation Day (WDPD), I am finally taking a leap back into posting here. My last post was in February of 2020 – and while I can see a half-dozen partially written posts lurking behind the scenes, none of them were ever finished “enough” to actually post.

So… Happy World Digital Preservation Day! I just spent the last 4 days attending iPRES 2025 virtually. I was in Maryland while most of the attendees were in person on the other side of the planet in New Zealand. Luckily, I’m a night owl, so attending sessions from 3pm – 10:30pm my time was just fine with me.

The conference closed last night (still Wednesday for me) – but now I’ve caught up to Thursday, November 6th and have the time to reflect on this year’s WDPD theme of “Why Preserve?”. Please keep in mind that the contents of this post, along with everything here on Spellbound Blog, reflect only my thoughts as an individual.

First, some context about me. I love stories and I love connection of all kinds – connections among people, connections between the past and all our possible futures, and connections that build community. Somewhere at their intersection is where I see the role for preservation. Without our digital records (preserved in such a way that they retain their context, can be trusted to be authentic, and can be interacted with in a meaningful way) we will lose stories of the past and all the evidence they contain. We will lose many kinds of connection.

Many communities have decided that this reason for preserving means that time, energy, and funding should be allocated toward this goal. One of iPRES 2025’s themes was Tūhono (Connect). This thread ran through keynotes, posters, bake-off demonstrations, and presentations/panels of all kinds. And for me – the theme of Tūhono elegantly ties into my understanding of “Why Preserve?”.

We preserve to connect. To connect the past to the future. To connect with both our professional digital preservation community and with those whose records are being preserved. Digging into my copious notes from the last few days, here are a few tidbits from iPRES 2025 that kept the focus on connection.

  • Late Sunday my time, I attended a workshop on Archival Resource Keys (ARKs). The ARK Alliance is a community that supports the ARK infrastructure. ARKs and the ARK Alliance are all about connection. ARKs are being used by libraries, archives, museums, government agencies, and more. From their website “ARKs are open, mainstream, non-paywalled, decentralized persistent identifiers that you can start creating in under 48 hours.” Want to connect your stuff to anyone who wants to refer to it in a durable way? ARKs can help.
  • Tuesday paper session 2 included a paper on “A Collaborative Framework for Migrations”, talking about digital preservation in Finland. The presenters highlighted that collaboration was key to success. Cultural institutions are experts in semantics and understanding while the digital preservation service is responsible for bit-level preservation, but you need both to ensure logical preservation. Without that collaboration, you can’t ensure the future usability of the information.
  • Wednesday’s keynote on “Encountering Collapse: Power, Community, and the Future of Open Infrastructure” was delivered by Rosalyn Metz, Chief Technology Officer for Libraries and Museum at Emory University. There were so many compelling elements to this talk, but I’ll share the one that spoke to me most strongly of connection. Community is the backbone of open infrastructure: “The resilience of infrastructure depends on the relationships that sustain it. Communities, not technologies, make infrastructure possible.”
  • I spent pretty much all day Wednesday in the Bake-Offs, in which people demo tech tools and solutions. To my eye, it was a fantastic parade of people sharing. So many opportunities for speakers to literally demonstrate their expertise. I always love seeing what other folks are working on, especially open source projects that might be just the thing someone needs to move their own project forward. It’s like speed dating for future collaboration.
  • I saw many posters and lightning talks – but one that jumps out as fitting this theme was presented by Amy Pienta, Research Professor at ICPSR at the University of Michigan. She spoke about the role of data stewards in safeguarding public data. DataLumos is a great example of a community coming together to ensure crucial resources are preserved. I’m glad that they exist, doing the work — and perhaps serving as inspiration for others to work on whatever challenges they find.
  • The closing keynote address from Peter-Lucas Jones, CEO of Te Hiku Media, was tied directly to the conference theme of Connect. In order to understand traditional data, you must understand the importance of indigenous language. The efforts of Te Hiku Media include multiple ways of leveraging technology to both preserve the Māori language and give back to the community keeping the language alive (a few examples: teaching computers te reo Māori, creating a synthetic voice that can run on assistive devices and speaks te reo Māori, live bilingual captioning). He also emphasized that it was important to “empower communities to lead the change they need” – and that data licensing is key to ensuring that what they create can only be used for purposes in sync with the community’s wishes.
  • The last session I attended was Panel 7: “Working with ICT in Digital Preservation”. My connection thread from this panel discussion was the need for all of us to support one another as we navigate the many challenges of building the technical environments we need to preserve at-risk records. Yes, we do need to plug in old tech bought off eBay to see if it will work (and hope it won’t catch fire!). Yes, we need to leverage other teams’ success and use it as a “hey, it worked for them” kind of argument to help us work around institutional rules that are keen on standardization. And yes – we need to connect with as many parts of our organizations as possible to explain what digital preservation work is, how we do it, and why it is important.

This list is far from exhaustive, but I hope it gives you a taste of why the strongest thread for me from iPRES 2025 was connection. And why that is also my answer to “Why Preserve?”. To Connect.

PS: I’d like to thank the Web Hypertext Application Technology Working Group (WHATWG), who apparently created this fantastically useful named character reference list of all the character names that HTML recognizes, which let me accurately publish two of the words in this post (Tūhono and Māori) via the WordPress HTML text editor. If you are curious, the answer to making the characters ū and ā display is preceding the strings umacr; and amacr; with an &. Yes, I needed the help of a community to share my ideas on connection.
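If you want to double-check a named character reference outside of WordPress, the same WHATWG list is baked into Python’s standard library. A tiny sketch (the two strings below are just the references I needed for this post):

```python
import html

# WHATWG named character references: &umacr; -> ū (U+016B), &amacr; -> ā (U+0101)
for encoded in ("T&umacr;hono", "M&amacr;ori"):
    print(encoded, "->", html.unescape(encoded))
```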

Chapter 10: Open Source, Version Control and Software Sustainability by Ildikó Vancsa


Chapter 10 of Partners for Preservation is ‘Open Source, Version Control and Software Sustainability’ by Ildikó Vancsa. The third chapter of Part III: Data and Programming, and the final chapter of the book, it shifts the lens on programming to the elements of communication and coordination that are required to sustain open source software projects.

When the Pacific Telegraph Route (shown above) was finished in 1861, it connected the new state of California to the East Coast. It put the Pony Express out of business. The first week it was in operation, it cost a dollar a word. Almost 110 years later, in 1969, came the first digital transmission over ARPANET (the precursor to the Internet).

Vancsa explains early in the chapter:

We cannot really discuss open source without mentioning the effort that people need to put into communicating with each other. Members of a community must be able to follow and track back the information that has been exchanged, no matter what avenue of communication is used.

I love envisioning the long evolution from the telegraph crossing the continent to the Internet stretching around the world. With each leap forward in technology and communication, we have made it easier to collaborate across space and time. Archives, at their heart, are dedicated to this kind of collaboration. Our two fields can learn from and support one another in so many ways.

Bio:

Ildikó Vancsa started her journey with virtualization during her university years and has stayed connected to this technology in different ways ever since. She started her career at a small research and development company in Budapest, where she focused on areas like system management, business process modeling and optimization. Ildikó got involved with OpenStack when she started to work on the cloud project at Ericsson in 2013. She was a member of the Ceilometer and Aodh project core teams. She now works for the OpenStack Foundation, where she drives network functions virtualization (NFV) related feature development activities in projects like Nova and Cinder. Beyond code and documentation contributions, she is also very passionate about onboarding and training activities.

Image source: Route of the first transcontinental telegraph, 1862.
https://commons.wikimedia.org/wiki/File:Pacific_Telegraph_Route_-_map,_1862.jpg

Chapter 4: Link Rot, Reference Rot and the Thorny Problems of Legal Citation by Ellie Margolis

The fourth chapter in Partners for Preservation is ‘Link Rot, Reference Rot and the Thorny Problems of Legal Citation’ by Ellie Margolis. Links that no longer work and pages that have been updated since they were referenced are an issue that everyone online has struggled with. In this chapter, Margolis gives us insight into why these challenges are particularly pernicious for those working in the legal sphere.

This passage touches on the heart of the problem.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court’s decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses its authority.

While working on this blog post, I found a WordPress Plugin called Broken Link Checker. It does exactly what you expect – scans through all your blog posts to check for broken URLs. In my 201 published blog posts (consisting of just shy of 150,000 words), I have 3002 unique URLs. The plugin checked them all and found 766 broken links! Interestingly, the plugin updates the styling of all broken links to show them with strikethroughs – see the strikethrough in the link text of the last link in the image below:

For each of the broken URLs it finds, you can click on “Edit Link”. You then have the option of updating it manually or using a suggested link to a Wayback Machine archived page – assuming it can find one.

It is no secret that link rot is a widespread issue. Back in 2013, the Internet Archive announced an initiative to fix broken links on the Internet – including the creation of the Broken Link Checker plugin I found. Three years later, a post on the Wikipedia blog announced that over a million broken outbound links on English Wikipedia had been fixed. Fast forward to October of 2018, and an Internet Archive blog post announced that “More than 9 million broken links on Wikipedia are now rescued”.

I particularly love this example because it combines proactive work and repair work. This quote from the 2018 blog post explains the approach:

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with.
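This is not IABot’s actual code, but the basic repair loop it describes is easy to sketch: check an outbound link, and if it appears dead, ask the Wayback Machine’s public availability API whether it holds a snapshot to point to instead. A minimal illustration, assuming the Python requests library and deliberately skimping on error handling:

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def find_replacement(url: str) -> str | None:
    """If url looks dead, return the closest Wayback Machine snapshot URL (or None)."""
    try:
        alive = requests.head(url, allow_redirects=True, timeout=10).status_code < 400
    except requests.RequestException:
        alive = False
    if alive:
        return None  # link still works; nothing to repair
    resp = requests.get(WAYBACK_API, params={"url": url}, timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url")  # None if no archived copy exists

# Example (hypothetical URL):
# print(find_replacement("http://example.com/some/vanished/page"))
```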

There are no silver bullets here – just the need for consistent attention to the problem. The examples of issues being faced by the law community, and their various approaches to prevent or work around them, can only help us all move forward toward a more stable web of internet links.

Ellie Margolis

Bio:
Ellie Margolis is a Professor of Law at Temple University, Beasley School of Law, where she teaches Legal Research and Writing, Appellate Advocacy, and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her scholarship is widely cited in legal writing textbooks, law review articles, and appellate briefs.

Image credit: Image from page 235 of “American spiders and their spinningwork. A natural history of the orbweaving spiders of the United States, with special regard to their industry and habits” (1889)

UNESCO/UBC Vancouver Declaration

In honor of the 2012 Day of Digital Archives, I am posting a link to the UNESCO/UBC Vancouver Declaration. This is the product of the recent Memory of the World in the Digital Age conference, and they are looking for feedback on this declaration by October 19th, 2012 (see the link on the conference page for sending in feedback).

To give you a better sense of the aim of this conference, here are the ‘conference goals’ from the programme:

The safeguard of digital documents is a fundamental issue that touches everyone, yet most people are unaware of the risk of loss or the magnitude of resources needed for long-term protection. This Conference will provide a platform to showcase major initiatives in the area while scaling up awareness of issues in order to find solutions at a global level. Ensuring digital continuity of content requires a range of legal, technological, social, financial, political and other obstacles to be overcome.

The declaration itself is only four pages long and includes recommendations to UNESCO, member states and industry. If you are concerned with digital preservation and/or digitization, please take a few minutes to read through it and send in your feedback by October 19th.

CURATEcamp Processing 2012

CURATEcamp Processing 2012 was held the day after the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA) sponsored Digital Preservation annual meeting.

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote down a title and short description on a piece of paper and taped it to the wall. These were then reviewed, rearranged on the schedule and combined where appropriate until we had our full final schedule. More than half the sessions on the schedule have links through to notes from the session. There were four session slots, plus a noon lunch slot of lightning talks.

Session I: At Risk Records in 3rd Party Systems This was the session I had proposed combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created – one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion including these assorted points:

  • Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
  •  The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak for preserving every season’s programs since 1842.
  • There are consequences to not having and following a ‘file plan’. Part of people’s jobs has to be to follow the rules.
  • What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
  • “Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
  • When should archivists be in the process? At the start – before content is created, before systems are created?
  • Keep the original data AND keep updated data. Document everything, data sources, processes applied.

Session II: Automating Review for Restrictions? This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that the challenge of reviewing records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘Easy’ category was for well-defined content that follows rules we could imagine teaching a computer to identify — things like US social security numbers, passport numbers or credit card numbers. The ‘Hard’ category was for restrictions that require more human judgement. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required – and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so – drop a comment below please!
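To make the ‘easy’ category concrete, here is the kind of rule-based module the group was imagining, sketched in Python: regular expressions for US social security numbers and candidate credit card numbers, with a Luhn checksum to cut down on false positives. A real, legally defensible module would need far more rules (and review) than this illustration:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # loose match; the Luhn check filters noise

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum used by payment card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def flag_restrictions(text: str) -> list[tuple[str, str]]:
    """Return (rule, match) pairs for likely-restricted strings found in text."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if 13 <= len(digits) <= 16 and luhn_ok(digits):
            hits.append(("credit_card", m.group()))
    return hits

# Example:
# flag_restrictions("SSN 123-45-6789 and card 4111 1111 1111 1111 appear here.")
```

Modules like this could then be chained together to cover whatever set of restrictions a given series of records requires.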

Lunchtime Lightning Talks: At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a means for Preservation In this session we discussed the question posed in the session proposal: “How can we leverage virtualization for large-scale, robust preservation?”. Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important it is to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts
The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki) – but my impression is that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and definitely solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.

Rescuing 5.25″ Floppy Disks from Oblivion

This post is a careful log of how I rescued data trapped on 5 1/4″ floppy disks, some dating back to 1984 (including those pictured here). While I have tried to make this detailed enough to help anyone who needs to try this, you will likely have more success if you are comfortable installing and configuring hardware and software.

I will break this down into a number of phases:

  • Phase 1: Hardware
  • Phase 2: Pull the data off the disk
  • Phase 3: Extract the files from the disk image
  • Phase 4: Migrate or Emulate

Phase 1: Hardware

Before you do anything else, you actually need a 5.25″ floppy drive of some kind connected to your computer. I was lucky – a friend had a floppy drive for us to work with. If you aren’t that lucky, you can generally find them on eBay for around $25 (sometimes less). A friend had been helping me by trying to connect the drive to my existing PC – but we could never get the communication working properly. Finally I found Device Side Data’s 5.25″ Floppy Drive Controller, which they sell online for $55. What you are purchasing will connect your 5.25″ floppy drive to a USB 2.0 or USB 1.1 port. It comes with drivers for connection to Windows, Mac and Linux systems.

If you don’t want to mess around with installing the disk drive into your computer, you can also purchase an external drive enclosure and a tabletop power supply. Remember, you still need the USB controller too.

Update: I just found a fantastic step-by-step guide to the hardware installation of Device Side’s drive controller from the Maryland Institute for Technology in the Humanities (MITH), including tons of photographs, which should help you get the hardware install portion done right.

Phase 2: Pull the data off the disk

The next step, once you have everything installed, is to extract the bits (all those ones and zeroes) off those floppies. I found that creating a new folder for each disk I was extracting made things easier. In each folder I store the disk image, a copy of the extracted original files and a folder named ‘converted’ in which to store migrated versions of the files.
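Since I repeated this folder setup for every disk, it is the kind of thing a short script can handle. A sketch of the layout described above (the folder names are just my own conventions):

```python
import pathlib

def make_disk_folders(base: pathlib.Path, disk_label: str) -> pathlib.Path:
    """Create one folder per disk, with an empty 'converted' subfolder inside it."""
    disk_dir = base / disk_label                      # will hold the disk image + original files
    (disk_dir / "converted").mkdir(parents=True, exist_ok=True)
    return disk_dir

# Example: make_disk_folders(pathlib.Path("rescued_floppies"), "disk_1984_03")
```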

Device Side provides software they call ‘Disk Image and Browse’. You can see an assortment of screenshots of this software on their website, but this is what I see after putting a floppy in my drive and launching USB Floppy -> Disk Image and Browse:

You will need to select the ‘Disk Type’ and indicate the destination in which to create your disk image. Make sure you create the destination directory before you click on the ‘Capture Disk File Image’ button. This is what it may look like in progress:

Fair warning that this won’t always work. At least the developers of the software that comes with Device Side Data’s controller had a sense of humor. This is what I saw when one of my disk reads didn’t work 100%:

If you are pressed for time and have many disks to work your way through, you can stop here and repeat this step for all the disks you have on hand.

Phase 3: Extract the files from the disk image

Now that you have a disk image of your floppy, how do you interact with it? For this step I used a free tool called Virtual Floppy Drive. After I got it installed properly, my disk image files were associated with this program. Double clicking on the Floppy Image icon opens the floppy in a view like the one shown below:

It looks like any other removable disk drive. Now you can copy any or all of the files to anywhere you like.

Phase 4: Migrate or Emulate

The last step is finding a way to open your files. Your choice for this phase will depend on the file formats of the files you have rescued. My files were almost all WordStar word processing documents. I found a list of tools for converting WordStar files to other formats.

The best one I found was HABit version 3.

It converts WordStar files into text or HTML and even keeps the spacing reasonably well if you choose that option. If you are interested in the content more than the layout, then not retaining spacing is the better choice because it will not put artificial spaces in the middle of sentences to preserve indentation. In a perfect world I think I would capture it both with layout and without.
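HABit did the real conversion work for me, but it is worth knowing the classic quick-and-dirty fallback: WordStar document files set the high bit on some characters as formatting markers, so clearing bit 7 and discarding the remaining control characters recovers a rough, layout-free approximation of the text. A sketch (the filename is just an example):

```python
def wordstar_to_text(raw: bytes) -> str:
    """Rough recovery of WordStar text: clear the formatting high bit, keep printable ASCII."""
    cleaned = bytes(b & 0x7F for b in raw)   # strip WordStar's high-bit formatting markers
    keep = {0x09, 0x0A, 0x0D}                # tab, line feed, carriage return
    return "".join(chr(b) for b in cleaned if b in keep or 0x20 <= b < 0x7F)

# Example:
# text = wordstar_to_text(open("LETTER.WS", "rb").read())
```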

Summary

So my rhythm of working with the floppies after I had all the hardware and software installed was as follows:

  • create a new folder for each disk, with an empty ‘converted’ folder within it
  • insert floppy into the drive
  • run DeviceSide’s Disk Image and Browse software (found on my PC running Windows under Start -> Programs -> USB Floppy)
  • paste the full path of the destination folder
  • name the disk image
  • click ‘Capture Disk File Image’
  • double click on the disk image and view the files via vfd (virtual floppy drive)
  • copy all files into the folder for that disk
  • convert files to a stable format (I was going from WordStar to ASCII text) and save the files in the ‘converted’ folder

These are the detailed instructions I tried to find when I started my own data rescue project. I hope this helps you rescue files currently trapped on 5 1/4″ floppies. Please let me know if you have any questions about what I have posted here.

Update: Another great source of information is Archive Team’s wiki page on Rescuing Floppy Disks.

SXSWi: You’re Dead, Your Data Isn’t: What Happens Now?

This five person panel at SXSW Interactive 2011 tackled a broad range of issues related to what happens to our online presence, assets, creations and identity after our death.

Presenters:

There was a lot to take in here. You can listen to the full audio of the session or watch a recording of the session’s live stream (the first few minutes of the stream lacks audio).

A quick and easy place to start is this lovely little video created as part of the promotion of Your Digital Afterlife – it gives a nice quick overview of the topic:

Also take a look at the Visual Map that was drawn by Ryan Robinson during the session – it is amazing! Rather than attempt to recap the entire session, I am going to just highlight the bits that most caught my attention:

Laws, Policies and Planning
Currently individuals are left reading the fine print and hunting for service specific policies regarding access to digital content after the death of the original account holder. Oklahoma recently passed a law that permits estate executors to access the online accounts of the recently deceased – the first and only state in the US to have such a law. It was pointed out during the session that in all other states, leaving your passwords to your loved ones is you asking them to impersonate you after your death.

Facebook has an online form to report a deceased person’s account – but little indication of what this action will do to the account. Google’s policy for accessing a deceased person’s email requires six steps, including mailing paper documents to Mountain View, CA.

There is a working group forming to create model terms of service – you can add your name to the list of those interested in joining at the bottom of this page.

What Does Ownership Mean?
What is the status of an individual email or digital photo? Is it private property? I don’t recall who mentioned it – but I love the notion of a tribe or family unit owning digital content. It makes sense to me that the digital model parallel the real world. When my family buys a new music CD, our family owns it – not the individual who happened to go to the store that day. It makes sense that an MP3 purchased by any member of my family would belong to our family. I want to be able to buy a Kindle for my family and know that my son can inherit my collection of e-books the same way he can inherit the books on my bookcase.

Remembering Those Who Have Passed
How does the web change the way we mourn and memorialize people? Many have now had the experience of learning of the passing of a loved one online – the process of sorting through loss in the virtual town square of Facebook. How does our identity transform after we are gone? Who is entitled to tag us in a photo?

My family suffered a tragic loss in 2009 and my reaction was to create a website dedicated to preserving memories of my cousin. At the Casey Feldman Memories site, her friends and family can contribute memories about her. As the site evolved, we also added a section to preserve her writing (she was a journalism student) – I kept imagining the day when we realized that we could no longer access her published articles online. I built the site using Omeka and I know that we have control over all the stories and photos and articles stored within the database.

It will be interesting to watch as services such as Chronicle of Life spring up claiming to help you “Save your memories FOREVER!”. They carefully explain why they are a trustworthy digital repository and back up their claims with a money-back guarantee.

For as little as $10, you can preserve your life story or daily journal forever: It allows you to store 1,000 pages of text, enough for your complete autobiography. For the same amount, you could also preserve less text, but up to 10 of your most important photos. – Chronicle of Life Pricing

Privacy
There are also some interesting questions about privacy and the rights of those who have passed to keep their secrets. Facebook currently deletes some parts of a profile when it converts it to a ‘memorial’ profile. They state that this is for the privacy of the original account holder. If users are ultimately given more power over the disposition of their social web presence – should these same choices be respected by archivists? Or would these choices need to be respected the way any other private information is guarded until some distant time after which it would then be made available?

Conclusion
Thanks again to all the presenters – this really was one of the best sessions for me at SXSWi! I loved that it got a whole different community of people thinking about digital preservation from a personal point of view. You may also want to read about Digital Death Day – one coming up in May 2011 in the San Francisco Bay Area and another in September 2011 in the Netherlands.

Image credit: Excerpt from Ryan Robinson’s Visual Map created live during the SXSW session.

Leveraging Google Reader’s Page Change Tracking for Web Page Preservation

The Official Google Reader Blog recently announced a new feature that will let users watch any page for updates. The way this works is that you add individual URLs to your Google Reader account. Just as with regular RSS feeds, when an update is detected – a new entry is added to that subscription.

My thinking is that this could be a really useful tool for archivists charged with preserving websites that change gradually over time, especially those fairly static sites that change infrequently with little or no notice of upcoming changes. If a web page was archived and then added to a dedicated Google Reader account, the archivist could scan their list of watched pages daily or weekly. Changes could then trigger the creation of a fresh snapshot of the site.
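Google Reader does the watching in my scenario, but the underlying idea is simple enough to sketch: keep a hash of each watched page and, when the hash changes, trigger a fresh capture. In the sketch below (assuming the Python requests library), the capture step hypothetically asks the Wayback Machine’s Save Page Now endpoint for a snapshot – the watched URL and state file are made up for illustration:

```python
import hashlib
import json
import pathlib
import requests

STATE_FILE = pathlib.Path("watched_pages.json")  # hypothetical local record of page hashes

def changed_pages(urls: list[str]) -> list[str]:
    """Return the URLs whose content hash differs from the last run, updating the state file."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    changed = []
    for url in urls:
        digest = hashlib.sha256(requests.get(url, timeout=30).content).hexdigest()
        if state.get(url) != digest:
            changed.append(url)
            state[url] = digest
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return changed

def snapshot(url: str) -> None:
    """Ask the Wayback Machine's Save Page Now endpoint to capture the page."""
    requests.get("https://web.archive.org/save/" + url, timeout=120)

for url in changed_pages(["https://example.org/county/records-schedule"]):
    snapshot(url)
```

Note that hashing the raw page will also flag cosmetic changes (rotating ads, timestamps), so in practice you would want to hash only the portion of the page you actually care about.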

I will admit that there have been services out there for a while that do something similar to what Google has just rolled out. I personally have used Dapper.net to take a standard web page and generate an RSS feed based on updates to the page (sound familiar?). One Dapper.net feed that I created and follow is for the news archive page for the International Red Cross and can be found here. What is funny is that now they actually have an official RSS feed for their news that includes exactly what my Dapper.net feed harvested off their news archive page – but when I built that Dapper feed there was no other way for me to watch for those news updates.

There are lots of different tools out there that aim to archive websites. Archive-It is a subscription based service run by Internet Archive that targets institutions and will archive sites on demand or on a regular schedule. Internet Archive also has an open source crawler called Heritrix for those who are comfortable dealing with the code. Other institutions are building their own software to tackle this too. Harvard University has their own Web Archive Collection Service (WAX). The LiWA (Living Web Archives) Project is based in Germany and aims to “extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.” One could even use something as simple as PDFmyURL.com – an online service that turns any URL into a PDF (be sure to play with the advanced options to make sure you get a wide enough snapshot). I know there are many more possibilities – these just scratch the surface.

What I like about my idea is that it isn’t meant to replace these services but rather work in tandem with them. The Internet Archive does an amazing job crawling and archiving many web pages – but they can’t archive everything and their crawl frequency may not match up with real world updates to a website. This approach certainly wouldn’t scale well for huge websites for which you would need to watch for changes on many pages. I am picturing this technique as being useful for small organizations or individuals who just need to make sure that a county government website makeover or a community organization’s website update doesn’t get lost in the shuffle. I like the idea of finding clever ways to leverage free services and tools to support those who want to protect a particular niche of websites from being lost.

Image Credit: The RSS themed image above is by Matt Forsythe.

Blog Action Day 2009: IEDRO and Climate Change

In honor of Blog Action Day 2009’s theme of Climate Change, I am revisiting the subject of a post I wrote back in the summer of 2007: the International Environmental Data Rescue Organization (IEDRO). This non-profit’s goal is to rescue and digitize at-risk weather and climate data from around the world. In the past two years, IEDRO has been hard at work. Their website has gotten a great face-lift, but even more exciting to see is how much progress they have made!

  • Weather balloon observations received from Lilongwe, Malawi (Africa) from 1968-1991: all the red on these charts represents data rescued by IEDRO — an increase from only 30% of the data available to over 90%.
  • Data rescue statistics from around the world

They do this work for many reasons – to improve understanding of weather patterns to prevent starvation and the spread of disease, to ensure that structures are built to properly withstand likely extremes of weather in the future and to help understand climate change. Since the theme for the day is climate change, I thought I would include a few excerpts from their detailed page on climate change:

“IEDRO’s mandate is to gather as much historic environmental data as possible and provide for its digitization so that researchers, educators and operational professionals can use those data to study climate change and global warming. We believe, as do most scientists, that the greater the amount of data available for study, the greater the accuracy of the final result.

If we do not fully understand the causes of climate change through a lack of detailed historic data evaluation, there is no opportunity for us to understand how humankind can either assist our environment to return to “normal” or at least mitigate its effects. Data is needed from every part of the globe to determine the extent of climate change on regional and local levels as well as globally. Without these data, we continue to guess at its causes in the dark and hope that adverse climate change will simply not happen.”

So, what does this data rescue look like? Take a quick tour through their process – from organizing papers and photographing each page to transcribing all the data and finally uploading it to NOAA’s central database. These data rescue efforts span the globe and take the dedicated effort of many volunteers along the way. If you would like to volunteer to help, take a look at the IEDRO listings on VolunteerMatch.