Menu Close

Category: access

Google, Privacy, Records Managment and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy . Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of sets of numbers and look something like 192.39.288.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is.. or if it doesn’t go far enough (plenty of others are doing that, see articles at EFF and Wired: 27B Stroke 6 ). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist tenants is to never do anything to the record you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer)? Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean).

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we loose a lot by if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would loosing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Anderson who would not have gotten into trouble for shredding their records if they had done so according to their own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often request of this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seem destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

The Archives and Archivists Listserv: hoping for a stay of execution

There has been a lot of discussion (both on the Archives & Archivists (A&A) Listserv and in blog posts) about the SAA‘s recent decision to not preserve the A&A listserv posts from 1996 through 2006 when they are removed from the listserv’s old hosting location at Miami University of Ohio.

Most of the outcry against this decision has fallen into two camps:

  • Those who don’t understand how the SAA task force assigned to appraise the listserv archives could decide it does not have informational value – lots of discussion about how the listserv reflects the move of archivists into the digital age as well as it’s usefulness for students
  • Those who just wish it wouldn’t go away because they still use it to find old posts. Some mentioned that there are scholarly papers that reference posts in the listserv archives as their primary sources.

I added this suggestion on the listserv:

I would have thought that the Archives Listserv would be the ideal test case for developing a set of best practices for archiving an organization’s web based listserv or bboard.

Perhaps a graduate student looking for something to work on as an independent project could take this on? Even if they only got permission for working with posts from 2001 onward [post 2001 those who posted had to agree to ‘terms of participation’ that reduce issues with copyright and ownership] – I suspect it would still be worthwhile.

I have always found that you can’t understand all the issues related to a technical project (like the preservation of a listserv) until you have a real life case to work on. Even if SAA doesn’t think we need to keep the data forever – here is the perfect set of data for archivists to experiment with. Any final set of best practices would be meant for archivists to use in the future – and would be all the easier to comprehend if they dealt with a listserv that many of them are already familiar with.

Another question: couldn’t the listserv posts still be considered ‘active records’? Many current listserv posters claim they still access the old list’s archives on a regular basis. I would be curious what the traffic for the site is. That is one nice side effect of this being on a website – it makes the usage of records quantifiable.

There are similar issues in the analog world when records people still want to use loose their physical home and are disposed of but, as others have also pointed out, digital media is getting cheaper and smaller by the day. We are not talking about paying rent on a huge wharehouse or a space that needs serious temperature and humidity control.

I was glad to see Rick Prelinger’s response on the current listerv that simply reads:

The Internet Archive is looking into this issue.

I had already checked when I posted my response to the listerv yesterday – having found my way to the A&A old listserv page in the Wayback Machine. For now all that is there is the list of links to each week’s worth of postings – nothing beyond that has been pulled in.

I have my fingers crossed that enough of the right people have become aware of the situation to pull the listserv back from the brink of the digital abyss.

Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to address issues related to the freedom of information and struggling to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They have many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.

I spent most of the 1990s designing and building custom databases, many for federal government agencies. There are an almost inconceivable number of person hours that go into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in it – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder if I set up a wiki for people to populate with this information if people would share what they already know.

I would like to imagine a future world in which all this stuff is online and you can login and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of decrypting the meaning of the records.

Archivists and Journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

Spring 2007:Access and Information Visualization

I don’t often post explicitly about my experiences as a graduate student – but I want to let everyone know about the focus of my studies for the next four months. I am taking two courses that I hope will complement one another. One course is on Archival Access (description, MARC, DACS, EAD and theory). The other is on Information Visualization over in the Computer Science department.

My original hope was that in my big Information Visualization final project I might get the opportunity to work with some aspect of archives and/or digital records. I want to understand how to improve access and understanding of the rich resources in the structured digital records repositories in archives around the world. What has already happened just one week into the term is that I find myself cycling through multiple points of view as I do my readings.

How can we support interaction with archival records by taking advantage of the latest information visualization techniques and tools? We can make it easier to understand what records are in a repository – both analog and digital records. I have been imagining interactive visual representations of archives collections, time periods, areas of interest and so forth. When you visit an archives’ website – it can often be so hard to get your head around the materials they offer. I suspect that this is often the case even when you are standing in the same building as the collections. In my course on appraisal last term we talked a lot about examining the collections that were already present on the path to creating a collecting policy. I am optimistic about ways that visualizing this information could improve everyone’s understanding of what an archives contains, for archivists and researchers alike.

Once I get myself to stop those daydreams… I move on to the next set of daydreams. What about the products of these visual analytics tools? How do we captured interactive visualizations in archives? This seems like a greater challenge than the average static digital record (as if there really is such an animal as an ‘average’ digital record). I can see a future in which major government and business decisions are made based on the interpretation of such interactive data models, graphs and charts. Instead of needing just the ‘records’ – don’t we need a way to recreate the experience that the original user had when interacting with the records?

This (unsurprisingly) takes me back to the struggle of how to define exactly what a record is in the digital world. Is the record a still image of a final visualization? Can this actually capture the full impact of an interactive and possible 3D visualization? With information visualization being such a rich and dynamic field I feel that there is a good chance that the race to create new methods and tools will zoom far ahead of plans to preserve its products.

I think some of my class readings will take extra effort (and extra time) as my mind cycles through these ideas. I think that a lot of this will come out in my posts over the next four months. And I still have strong hopes for rallying a team in my InfoViz class to work on an archives related project.

Book Review: Past Time, Past Place: GIS for History

Past Time, Past Place: GIS for History consists mainly of 11 case studies of geographic information systems being applied to the study of history. It includes a nice sprinkling of full color maps and images and a 20 page glossary of GIS terms. Each case study includes a list of articles and other resources for further reading.

The book begins with an introduction by the editor, Anne Kelly Knowles. This chapter explains the basics of using GIS to study history, as well as giving an overview of how the book is organized.

The meat of the book are the case studies covering the following topics:

I suspect that different audiences will take very different ideas away from this book. I was for looking for information about GIS and historical records (this is another book found during my mad hunt for information on the appraisal and preservation of GIS records) and found a bit of related information to add to my research. I think this book will be of interest to those who fall in any of the following categories:

  • Archivists curious about how GIS might enhance access to and understanding of the records under their care
  • Historians interested in understanding how GIS can be used to approach historical research in new ways
  • History buffs who love reading a good story (complete with pictures)
  • Map aficionados curious about new and different kinds of information that can be portrayed with GIS

I especially loved the maps and other images. I am a bit particular when it comes to the quality of graphics – but this book comes through with bright colors and clear images. The unusual square book format (measuring 9″x9″) gave those who arranged the layout lots of room to work – and they took full advantage of the space.

No matter if you plan to read the case studies for the history being brought to life or are looking for “how-tos” as you tackle your own GIS-History project – this book deserves some attention.

Footnote.com and US National Archives records

Thanks to Digitization 101‘s recent post “Footnote launches and announces partnership with National Archives” I was made aware of the big news about the digitization of the US National Archives’ records. Footnote.com has gone live with the first of apparently many planned installments of digitized NARA records. My first instinct was one of suspicion. In the shadow of recent historian alarm about the Smithsonian/Showtime deal, I think its valid to be concerned about new agreements between government agencies and private companies.

That said, I am feeling much more positive based on the passage below from the the January 10th National Archives Press Release about the agreement with Footnote (emphasis mine):

This non-exclusive agreement, beginning with the sizeable collection of materials currently on microfilm,will enable researchers and the general public to access millions of newly-digitized images of the National Archives historic records on a subscription basis from the Footnote web site. By February 6, the digitized materials will also be available at no charge in National Archives research rooms in Washington D.C. and regional facilities across the country. After an interval of five years, all images digitized through this agreement will be available at no charge through the National Archives web site .

This sounds like a win-win situation. NARA gets millions of records digitized (4.5 million and counting according to the press release). These records will be highlighed on the Footnote web site. They will have the advantages of Footnote’s search and browse interfaces (which I plan to do an in depth review of in the next week).

When signing up for my free account – I actually read through the entire Footnote Terms of Service including this passage (within the section labeled ‘Our Intellectual Property Rights’ – again, emphasis mine):

Content on the Website is provided to you AS IS for your information and personal use only as permitted through the functionality of the Website and may not be used, copied, reproduced, distributed, transmitted, broadcast, displayed, sold, licensed, or otherwise exploited for any other purposes whatsoever without the prior written consent of the respective owners . Footnote.com reserves all rights not expressly granted in and to the Website and the Content. You agree not to engage in the use, copying, or distribution of any of the Content other than expressly permitted herein, including any use, copying, or distribution of User Submissions of third parties obtained through the Website for any commercial purposes. If you download or print a copy of the Content for personal use, you must retain all copyright and other proprietary notices contained therein.

These terms certainly are no different from that under which most archives operate – but it did give me a moment of wondering how many extra hoops one would need to jump through if you wanted to use any of the NARA records found in Footnote for a major project like a book. A quick experiment with the Pennsylvania Archives (which are available for free with registration) did not show me any copyright information or notices related to rights. I downloaded an image to see what ‘copyright and other proprietary notices’ I might find and found none.

In his post “The Flawed Agreement between the National Archives and Footnote, Inc.“, Dan Cohen expresses his views of the agreement. I had been curious about what percentage of the records being digitized were out of copyright – Dan says they all are. If all of the records are out of copyright – exactly what rights are Footnote.com reserving (in the passage from the terms of service shown above)? I also agree with him in his frustration about the age restriction in place for using Footnote.com (you have to be over 18).

My final opinion about the agreement itself will depend on answers to a few more questions:

1) Were any of the records recently made available on Footnote.com already digitized and available via the archives.gov website?

2) What percentage of the records that were digitized by Footnote would have been digitized by NARA without this agreement?

3) What roadblocks will truly be set in place for those interested in using records found on Footnote.com?

4) What interface will be available to those accessing the records for free in “National Archives research rooms in Washington D.C. and regional facilities across the country” (from the press release above)? Will it be the Footnote.com website interface or via NARA’s own Archival Research Catalog (ARC) or Access to Archival Databases (AAD)?

If the records that Footnote has digitized and made available on Footnote.com would not otherwise have been digitized over the course of the next five years (a big if) then I think this is an interesting solution. Even the full $100 fee for a year subscription is much more reasonable than many other research databases out there (and certainly cheaper than even a single night hotel room within striking distance of National Archives II).

As I mentioned above, I plan to post a review of the Footnote.com search and browse interfaces in the next week. The Footnote.com support folks have given me permission to include screen shots – so if this topic is of interest to you, keep an eye out for it.

OBR: Optical Braille Recognition

In the interest of talking about new topics – I opened my little moleskine notebook and found a note to myself wondering if it is possible to scan Braille with the equivalent of OCR.

Enter Optical Braille Recognition or OBR. Created by a company called Neovision, this software will permit anyone with a scanner and a Windows platform computer to ‘read’ Braille documents.

Why was this in my notebook? I was thinking about unusual records that must be out in the world and wondering about how to improve access to the information within them. So if there are Braille records out there – how does the sighted person who can’t read Braille get at that information? Here is an answer. Not only does the OBR permit reading of Braille documents – but it would permit recreation of these same documents in Braille from any computer that has the right technology.

Reading through the Wikipedia Braille entry, I learned a few things that would throw a monkey wrench into some of this. For example – “because the six-dot Braille cell only offers 64 possible combinations, many Braille characters have different meanings based on their context”. The page on Braille code lists links to an assortment of different Braille codes which translate the different combinations of dots into different characters depending on the language of the text. On top of the different Braille codes used to translate Braille into specific letters or characters – there is another layer to Braille transcription. Grade 2 Braille uses a specific set of contractions and shorthand – and is used for official publications and things like menus, while Grade 3 Braille is used in the creation of personal letters.

It all goes back to context (of course!). If you have a set of Braille documents with no information on them giving you details of what sort of documents they are – you have a document that is effectively written in code. Is it music written in Braille Music notation? Is it a document in Hiranga using the Japanese Code? Is this a personal letter using Grade 3 Braille shorthand? You get the idea.

I suspect that one might even want to include a copy of both the Braille Code and the Braille transcription rules that go with a set of documents as a key to their translation in the future. If there are frequently used records – they could perhaps include the transcription (both literal transcription and a ‘translation’ of all the used Braille contractions) to improve access of analog records.

In a quick search for collections including braille manuscripts it should come as no surprise that the Helen Keller Archives does have “braille correspondence”. I also came across the finding aids for the Harvard Law School Examinations in Braille (1950-1985) and The Donald G. Morgan Papers (the papers of a blind professor at Mount Holyoke College).

I wonder how many other collections have Braille records or manuscripts. Has anyone reading this ever seen or processed a collection including Braille records?

DMCA Exemption Added That Supports Archivists

The Digital Millennium Copyright Act, aka DMCA (which made it illegal to create or distribute technology which can get around copyright protection technology) has six new classes of exemptions added today.

From the very long named Rulemaking on Exemptions from Prohibition on Circumvention of Technological Measures that Control Access to Copyrighted Works out of the U.S. Copyright Office (part of the Library of Congress) comes the addition of the following class of work that will not be “subject to the prohibition against circumventing access controls”:

Computer programs and video games distributed in formats that have become obsolete and that require the original media or hardware as a condition of access, when circumvention is accomplished for the purpose of preservation or archival reproduction of published digital works by a library or archive. A format shall be considered obsolete if the machine or system necessary to render perceptible a work stored in that format is no longer manufactured or is no longer reasonably available in the commercial marketplace.

This remain valid from November 27, 2006 through October 27, 2009. Hmm.. three years? So what happens if this expires and doesn’t get extended (though one would imagine by then either we will have a better answer to this sort of problem OR the problem will be even worse than it is now)? When you look at the fact that places like NARA have fabulous mission statements for their Electronic Records Archives with phrases like “for the life of the republic” in them – three years sounds pretty paltry.

That said, how interesting to have archivists highlighted as benefactors of new legal rules. So now it will be legal (or at least not punishable under the DMCA) to create and share  programs to access records created by obsolete software. I don’t know enough about the world of copyright and obsolete software to be clear on how much this REALLY changes what places like NARA’s ERA and other archives pondering the electronic records problem are doing, but clearly this exemption can only validate a lot of work that needs to be done.