
Book Review: Dreaming in Code (a book about why software is hard)

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg

Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).

With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).

Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.

If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?

I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it bears repeating: if you think preserving records originating from standardized packages of off-the-shelf software is hard, then please consider that really understanding the meaning of all the data (and the business rules surrounding its creation) in custom built software systems is harder still by a factor of 10 (or 100).

It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.

My fear that in many cases we won’t realize we are missing pieces we need to decipher the data until many years down the road leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first hand experience working with large custom built databases. Often (back in my life as a software consultant) I would be assigned to fix or extend a program I had not written myself. That often felt like trying to crawl into someone else’s brain.

Imagine being told you must finish a 20 page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18 page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic the original author chose by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the original author was trying to say. You find half-finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but by then it is too late to start over; you just have to get it done.

So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.

Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.

Redacting Data – A T-Shirt and Other Thoughts

ThinkGeek.com has created a funny t-shirt – the Magic Numbers shirt – with the word redacted on it.

In case you missed it, there was a whole lot of furor early this month when someone posted an Advanced Access Content System (AACS) decryption key online. The key consists of 16 hexadecimal numbers that can be used to decrypt and copy any Blu-Ray or HD-DVD movie. Of course, it turns out to not be so simple – and I will direct you to a series of very detailed posts over at Freedom to Tinker if you want to understand the finer points of what the no longer secret key can and cannot do. The CyberSpeak column over at USA Today has a nice summary of the big picture and more details about what happened after the key was posted.

What amused me about this t-shirt (and prompted me to post about it here) is that it points out an interesting challenge of redacting data. How do you ensure that the data you leave behind doesn’t support deduction of the missing data? This is something I have thought about a great deal when designing web based software and worrying about security. It is not something I had spent much time thinking about related to archives and the protection of privacy. The joke from the shirt of course is that removing just the secret info but leaving everything else doesn’t do the job. This is a simplified case – let me give you an example that might make this more relevant.

Let’s say that you have records from a business in your archives, and one of the series consists of personnel records. If you redacted those records to remove people’s names, SSNs and other private data, but left the records in their original order so that researchers could examine them for other information – would that be enough to protect the privacy of the business’s employees?

What if somewhere else in the collection you had the employee directory that listed names and phone extensions? No problem there – right? Ah, but what if you assumed that the personnel records were in alphabetical order and then used the phone directory as a partial key to figure out which personnel records were for which people?

This is definitely a hypothetical scenario, but it gets the idea across: archivists need to take in the big picture to ensure the right level of privacy protection.
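For the programmatically inclined, here is a minimal sketch of that deduction risk. Everything in it – the names, the record fields, the assumption of alphabetical order – is invented for illustration:

```python
# Redacted personnel records: names and SSNs removed, but original
# (alphabetical) order preserved.
redacted_records = [
    {"hire_year": 1951, "salary": 3200},
    {"hire_year": 1948, "salary": 4100},
    {"hire_year": 1955, "salary": 2900},
]

# Employee directory found elsewhere in the collection.
directory = ["Cole, Susan", "Adams, Mary", "Baker, John"]

# If a researcher guesses that the records are alphabetical by surname,
# sorting the directory yields a key that maps position -> person,
# effectively undoing the redaction.
for name, record in zip(sorted(directory), redacted_records):
    print(f"{name}: hired {record['hire_year']}, salary ${record['salary']}")
```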

Besides, what archivist (or archivist in training) could resist a t-shirt with the word redacted on it?

RSS and Mainstream News Outlets

Recently posted on the FP Passport blog, The truth about RSS gives an overview of the results of a recent RSS study that looks at the RSS feeds produced by 19 major news outlets. The complete study (and its results) can be found here: International News and Problems with the News Media’s RSS Feeds.

If you are interested in my part in all this, read the Study Methodology section (which describes my role down under the heading ‘How the Research Team Operated’) and the What is RSS? page (which I authored, and describes both the basics of RSS as well as some other web based tools we used in the study – YahooPipes and Google Docs).

Why should you care about RSS? RSS feeds are becoming more common on archives websites, and RSS should be treated as just another tool in the outreach toolbox for making sure that your archives maintains or improves its visibility online. To get an idea of how feeds are being used, consider the example of the UK National Archives. They currently publish three RSS feeds (a short sketch of reading a feed programmatically follows the list):

  • Latest news – get the latest news and events for The National Archives.
  • New document releases – highlights of new document releases from The National Archives.
  • Podcasts – listen to talks, lectures and other events presented by The National Archives.
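On the consuming side, a feed like these can be read in a few lines of Python with the feedparser library. This is just a sketch – the feed URL below is a placeholder, not The National Archives’ actual feed address:

```python
# pip install feedparser
import feedparser

# Hypothetical feed URL - substitute the real address of a feed you follow.
feed = feedparser.parse("https://www.example.org/archives/news.rss")

print(feed.feed.title)            # the feed's own title
for entry in feed.entries[:5]:    # the five most recent items
    print(entry.title, "-", entry.link)
```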

The results of the RSS study I link to above shed light on the kinds of choices that are made by content providers who publish feeds – and on the expectations of those who use them. If you don’t know what RSS is – this is a great intro. If you use and love (or hate) RSS already – I would love to know your thoughts on the study’s conclusions.

Epidemiological Research and Archival Records: Source of Records Used for Research Fails to Make the News

[Image: Typist wearing mask, New York City, October 16, 1918 (NARA record 165-WW-269B-16)]

In early April, Reuters ran an article that was picked up by Yahoo News titled Closing Schools reduced flu deaths in 1918. I was immediately convinced that archival records must have supported this research – even though no mention of that was included in the article. The article did tell me that it was Dr. Richard Hatchett of the National Institute of Allergy and Infectious Diseases (NIAID) who led the research.

I sent him an email asking about where the data for his research came from. Did the NIH have a set of data from long ago? Here is an excerpt from his kind reply:

Unfortunately, nobody kept track of data like this and you can see the great lengths we went to to track it down. Many of the people we thank in our acknowledgment at the end of the paper tracked down and provided information in local or municipal archives. For Baltimore, I came up and spent an entire day in the library going through old newspapers on microfilm. Some of the information had been gathered by previous historians in works on the epidemic in individual cities (Omaha — an unpublished Master’s thesis — and Newark are examples). Gathering the information was extremely arduous and probably one of the reasons no one had looked at this systematically before. Fortunately, several major newspapers (the NYTimes, Boston Globe, Washington Post, Atlanta Journal-Constitution, etc.) now have online archives going back at least until 1918 that facilitated our search.

Please let me know if you have any other questions. We were amateurs and pulling the information together took a lot longer than we would ever have imagined.

He also sent me a document titled “Supporting Information Methods”. This turned out to be 37 pages of detailed references found to support their research. They were hunting for three types of information: first reported flu cases, amplifying events (such as Liberty Loan parades) and interventions (such as quarantines, school closings and bans on public gatherings).

Many of the resources cited are newspapers (see The Baltimore Sun’s 1918 flu pandemic timeline for examples of what can be found in newspapers), but I was more intrigued by the wide range of non-newspaper records used to support this research. A few examples:

  • Chicago (First reported case): Robertson JD. Report and handbook of the Department of Health of the City of Chicago for the years 1911 to 1918 inclusive. Chicago, 1919.
  • Cleveland (School closings): The City Record of the Cleveland City Council, October 21, 1918, File No. 47932, citing promulgation of health regulations by Acting Commissioner of Health H.L. Rockwood.
  • New Orleans (Ban on public gatherings): Parish of Orleans and City of New Orleans. Report of the Board of Health, 1919, p. 131.
  • Seattle (Emergency Declaration): Ordinance No. 38799 of the Seattle City Council, signed by Mayor Hanson October 9, 1918.

The journal article referenced in the Reuters story, Public health interventions and epidemic intensity during the 1918 influenza pandemic, was published in the Proceedings of the National Academy of Sciences (PNAS) and is available online.

The good news here is that the acknowledgment that Dr. Hatchett mentions in his email includes this passage:

The analysis presented here would not have been possible without the contributions of a large number of public health and medical professionals, historians, librarians, journalists, and private citizens […followed by a long list of individuals].

The bad news is that the use of archival records is not mentioned in the news story.

We frequently hear about how little money there is at most archives. Cutbacks in funding are the norm. Every few weeks we hear of archives forced to cut their hours, staff or projects. Public understanding of the important ways that archival records are used can only help to reverse this trend.

Maybe we need a bumper sticker to hand out to new researchers. Something catchy and a little pushy – something that says “Tell the world how valuable our records are!” – only shorter.

  • If You Use Archival Records – Go On The Record
  • Put Primary Sources in the Spotlight
  • Archivists for Footnotes: Keep the paper trail alive
  • Archives Remember: Don’t Forget Them

I don’t love any of these – anyone else feeling wittier and willing to share?

(For more images of the 1918 Influenza Epidemic, visit the National Museum of Health and Medicine’s Otis Historical Archives’ Images from the 1918 Influenza Epidemic.)

Digital Archiving Articles – netConnect Spring 2007

Thanks to Jessamyn West’s blog post, I found my way to a series of articles in the Spring 2007 edition of netConnect:

“Saving Digital History” is the longest of the three and is a nice survey of many of the issues found at the intersection of archiving, born digital records and the wild world of the web. I especially love the extensive Link List at the end of the articles — there are lots of interesting related resources. This is the sort of list of links I wish were available with ALL articles online!

I can see the evolution of some of the ideas she and her co-speakers touched on in their session at SAA 2006: Everyone’s Doing It: What Blogs Mean for Archivists in the 21st Century. I hope we continue to see more of these sorts of panels and articles. There is a lot to think about related to these issues – and there are no easy answers to the many hard questions.

Update: Here is a link to Jessamyn’s presentation from the SAA session mentioned above: Capturing Collaborative Information News, Blogs, Librarians, and You.

Copyright Law: Archives, Digital Materials and Section 108

I just found my way today to Copysense (obviously I don’t have enough feeds to read as it is!). Their current clippings post highlighted part of the following quote as their Quote of the Week.

[Image: Marybeth Peters (from http://www.copyright.gov/about.html)]

“[L]egislative changes to the copyright law are needed. First, we need to amend the law to give the Library of Congress additional flexibility to acquire the digital version of a work that best meets the Library’s future needs, even if that edition has not been made available to the public. Second, section 108 of the law, which provides limited exceptions for libraries and archives, does not adequately address many of the issues unique to digital media—not from the perspective of copyright owners; not from the perspective of libraries and archives.” – Marybeth Peters, Register of Copyrights, March 20, 2007

Marybeth Peters was speaking to the Subcommittee on Legislative Branch of the Committee on Appropriations about the Future of Digital Libraries.

Copysense makes some great points about the quote:

Two things strike us as interesting about Ms. Peters’ quote. First, she makes the quote while The Section 108 Study Group continues to work through some very thorny issues related to the statutes application in the digital age […] Second, while Peters’ quote articulates what most information professionals involved in copyright think is obvious, her comments suggest that only recently is she acknowledging the effect of copyright law on this nation’s de facto national library. […] [S]omehow it seems that Ms. Peters is just now beginning to realize that as the Library of Congress gets involved in the digitization and digital work so many other libraries already are involved in, that august institution also may be hamstrung by copyright.

I did my best to read through Section 108 of the Copyright Law – subtitled “Limitations on exclusive rights: Reproduction by libraries and archives”. I found it hard to get my head around … definitely stiff going. There are 9 different subsections (‘a’ through ‘i’), each with its own numbered exceptions or requirements. Anxious to get a grasp on what this all really means, I found LLRX.com and their Library Digitization Projects and Copyright page. This was definitely an easier read and helped me get further in my understanding of the current rules.

Next I explored the website for the Section 108 Study Group, which is hard at work figuring out what a good new version of Section 108 would look like. I particularly like the overview on the About page. They have a 32 page document titled Overview of the Libraries and Archives Exception in the Copyright Act: Background, History, and Meaning for those of you who want the whole nine yards on what has gotten us to where we are today with Section 108.

For a taste of current opinions – go to the Public Comments page which provides links to all the written responses submitted to the Notice of public roundtable with request for comments. There are clear representatives from many sides of the issue. I spotted responses from SAA, ALA and ARL as well as from MPAA, AAP and RIAA. All told there are 35 responses (and no, I didn’t read them all). I was more interested in all the different groups and individuals that took the time to write and send comments (and a lot of time at that – considering the complicated nature of the original request for comments and the length of the comments themselves). I was also intrigued to see the wide array of job titles of the authors. These are leaders and policy makers (and their lawyers) making sure their organizations’ opinions are included in this discussion.

Next stop – the Public Roundtables page with its links to transcripts from the roundtables – including the most recent one held January 31, 2007. Thanks to the magic of Victoria’s Transcription Services, the full transcripts of the roundtables are online. No, I haven’t read all of these either. I did skim through a bit to get a taste of the discussions – and there is some great stuff here. Lots of people who really care about the issues are carefully and respectfully exploring the nitty-gritty details to try and reach good compromises. This is definitely on my ‘bookmark to read later’ list.

Karen Coyle has a nice post over on Coyle’s InFormation that includes all sorts of excerpts from the transcripts. It gives you a good flavor of what some of these conversations are like – so many people in the same room with such different frames of reference.

This is not easy stuff. There is no simple answer. It will be interesting to see what shape the next version of Section 108 takes with so many people with very different priorities pulling in so many directions.

The good news is that there are people with the patience and dedication to carefully gather feedback, hold roundtables and create recommendations. Hurrah for the hard working members of the Section 108 Study Group – all 19 of them!

Visualizing Archival Collections

As I mentioned earlier, I am taking an Information Visualization class this term. For our final class project I managed to inspire two other classmates to join me in creating a visualization tool based on the structured data found in the XML version of EAD finding aids.

We started with the XML of the EAD finding aids from University of Maryland’s ArchivesUM and the Library of Congress Finding Aids. My teammates have written a parser that extracts various things from the XML such as title, collection size, inclusive dates and subjects. Our goal is to create an innovative way to improve the exploration and understanding of archival collections using an interactive visualization.
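For a sense of what that extraction involves, here is a rough sketch (not our actual parser) of pulling a few fields out of an EAD finding aid with Python’s standard library. Real ArchivesUM and Library of Congress files vary quite a bit, so the element paths below are simplified assumptions and namespace handling is ignored:

```python
import xml.etree.ElementTree as ET

tree = ET.parse("finding_aid.xml")  # hypothetical file name
root = tree.getroot()

def text_of(path):
    """Return the stripped text of the first element at path, if any."""
    el = root.find(path)
    return el.text.strip() if el is not None and el.text else None

title = text_of(".//archdesc/did/unittitle")
extent = text_of(".//archdesc/did/physdesc/extent")  # e.g. "12 linear feet"
dates = text_of(".//archdesc/did/unitdate")          # inclusive dates
subjects = [s.text for s in root.findall(".//controlaccess/subject") if s.text]

print(title, extent, dates, subjects)
```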

Our main targets right now are to use a combination of subjects, years and collection size to give users a better impression of the quantity of archival materials that fit various search criteria. I am a bit obsessed with using collection size as a metric for helping users understand the quantity of materials. If you do a search for a book in a library’s catalog, getting 20 hits usually means that you are considering 20 books. If you consider archival collections, 20 hits could mean 20 linear feet (20 collections, each 1 linear foot in size) or it could mean 2000 linear feet (20 collections, each 100 linear feet in size). Understanding this difference is something that visualization can help us with. Rather than communicating only the number of results, the visualization will communicate the total size of the collections assigned each of the various subjects.
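In code, the idea is as simple as summing extent rather than counting hits. A toy example, with invented collection sizes and subjects of the shape the parsing sketch above might produce:

```python
from collections import defaultdict

# Invented sample data.
collections = [
    {"title": "Smith Family Papers", "linear_feet": 1.0, "subjects": ["Maryland", "Agriculture"]},
    {"title": "Jones Co. Records", "linear_feet": 100.0, "subjects": ["Maryland"]},
]

# Sum total extent per subject instead of counting result rows.
feet_by_subject = defaultdict(float)
for c in collections:
    for subject in c["subjects"]:
        feet_by_subject[subject] += c["linear_feet"]

for subject, feet in sorted(feet_by_subject.items()):
    print(f"{subject}: {feet} linear feet")  # e.g. Maryland: 101.0 linear feet
```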

I have uploaded two preliminary screen mockups (one here and the second here) that try to get at my ideas for how this might work.

Not reflected in the mock-ups is what could happen when a user clicks on the ‘related subject’ bars. Depending on where they click, one of two things could happen. If they click on a ‘related subject’ bar WITHIN the boundaries of the selected subject (in the case above, that would mean within the ‘Maryland’ box), then the search would filter further to show only those collections that have both the ‘Maryland’ subject and the newly clicked subject. The ‘related subjects’ list and displayed year distribution would change accordingly as well. If, instead, the user clicks on a ‘related subject’ bar OUTSIDE the boundary of the selected subject, then that subject would become the new (and only) selected subject and the displayed collections, related subjects and years would change accordingly. (A toy model of the two behaviors follows.)
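Here is that toy model, using the same invented data shapes as above (none of these function names come from our actual project code):

```python
def on_subject_click(selected, clicked, inside_selected_box):
    """Return the new set of selected subjects after a click."""
    if inside_selected_box:
        return selected | {clicked}  # narrow: require all selected subjects
    return {clicked}                 # replace: clicked subject becomes the sole selection

def matching(collections, selected):
    """Collections tagged with every currently selected subject."""
    return [c for c in collections if selected <= set(c["subjects"])]

# Clicking 'Agriculture' inside the 'Maryland' box narrows the filter.
selected = on_subject_click({"Maryland"}, "Agriculture", inside_selected_box=True)
print(selected)  # {'Maryland', 'Agriculture'}
```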

So that is what we have so far. If you want to keep an eye on our progress, our team has a page up on our class wiki about this project. I have a ton of ideas for other things I would love to add to this (my favorite being a map of the world with indications of where the largest amount of archival materials can be found based on a keyword or subject search), but we have to keep our feet on the ground long enough to actually build something for our class project. This is probably a good thing. Smaller goals make for a greater chance of success.

Getting Your Toes Wet: Basic Principles of Design for the New Web

Ellyssa Kroski of InfoTangle has created a great overview of current trends in website and application design in her post Information Design for the New Web. If you are going to Computers in Libraries, you can see her present the ideas she discusses in her post in a session of the same name on Monday April 16.

She highlights 3 core principles with clear explanations and great examples:

  • Keep it Simple
  • Make it Social
  • Offer Alternate Navigation

As archives continue to dive into the deep end of the internet pool, more and more archivists will find themselves participating in discussions about website design choices. Understanding basic principles like those discussed in Kroski’s post will go a long way toward making archivists feel more comfortable contributing to these sorts of discussions.

Don’t think that things like this should be left to the IT department or only the ‘techie archivists’ on your staff (if you have any). You all have a lot to contribute. You know your collections. You know the importance of the archival principles of provenance, original order and context. There are lots of aspects of archival materials that traditional web designers might not consider important – things that you know are very important if people are to understand your archives’ materials while browsing from the comfort of their homes.

So dip your toes in. Learn some buzz words, look at some fun websites and get comfortable with some innovative ideas. The water is just fine.

Ideas for SAA2007: Web Awards, Wikis and Blogs

Online since late March of this year, the new ArchivesNext blog is wasting no time in generating great ideas. First of all, I love the idea of awards for the best archives websites. How about ‘Best Archives Blog’, ‘Best Online Exhibit’ and ‘Best Archives Website’? It seems like barely a week goes by on the Archives and Archivists listserv without the announcement of a new archives website or online exhibition. I think an entire blog could be created just showing off the best archives websites. I would love to see those making the greatest online contributions to the profession honored at the annual conference.

Another great ArchivesNext idea is a wiki for SAA2007 in Chicago. I was amazed at the conference last summer to see the table where you could buy audio recordings of the presentations. I live so much in the tech/geek world that I had assumed that of course SAA would have someone recording the sessions so they could be posted online. I assumed that there would be a handy place for presenters to upload their handouts and slides. A wiki would be a great way to support this sort of knowledge sharing. People come from all over the world for just a few days together at conferences like this, and many more can’t make the trip. I think having something beyond a listserv that lets groups of like-minded individuals build a collection of resources around the topics discussed at the conference would go a long way toward building more of an online archival community.

What about blogging the conference? Last year Foldering.com suggested we all use SAA2006 to tag our conference blog posts. Technorati shows 25 posts with that tag (and yes, a lot of those posts are mine). One major stumbling block was a lack of wireless in the hotel where the convention was held. Another was a combination of lack of interest and lack of coordination. Too few people were mobilized in time to plan coverage of the panels.

We could leverage a conference wiki to coordinate more effectively than we did last year. Simple signup sheets could help us ensure coverage of the panels and roundtables. I think it would be interesting to see if those who cannot attend the conference might express preferences about which talks should definitely be covered. If there are wiki pages for each panel and roundtable, those pages could eventually include links to the blog posts of bloggers covering those talks.

Blogging last August at SAA2006 was interesting for me. I had never attempted to blog at a conference (Spellboundblog was less than 1 month old last August). I took 37 pages of notes on my laptop. Yes, there was a lot of white space, but it was still 37 pages long. I found that I couldn’t bring myself to post in the informal ‘stream of consciousness’ style that I have often seen in ‘live blogging’ posts. I wanted to include links. I wanted to include my thoughts about each speaker I listened to. I wanted to draw connections among all the different panels I attended. I wanted someone who hadn’t been there to be able to really understand the ideas presented from reading my posts. That took time. I ended up with 10 posts about specific panels and roundtables and another 2 about general conference ideas and impressions. Then I gave up. I got to the point where I felt burdened by the pages I had not transcribed. I had gotten far enough away from the conference that I didn’t always understand my own notes. I had new things I wanted to talk about, so I set aside my notes and moved on.

I hope we get more folks interested in blogging the conference this year. Feel free to email me directly at jeanne AT spellboundblog.com if you would like to be kept in the loop for any blogging coordination (though I will certainly post whatever final plan we come up with here).