Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of four numbers, each between 0 and 255, and look something like 192.39.28.3. To see how well an IP address can reveal the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).
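To make the "set of 256 computers" concrete, here is a minimal sketch of what erasing part of an IP address could look like. It assumes the erased part is the final number of the address, which is consistent with the figure of 256 but is not confirmed by the announcement; the function name `redact_last_octet` and the sample address are hypothetical.

```python
import ipaddress


def redact_last_octet(ip: str) -> str:
    """Return the IPv4 address with its final number set to zero.

    Assumption: the "part" that gets erased is the last of the four
    numbers. This illustrates the idea only; it is not Google's
    actual procedure.
    """
    # A /24 network keeps the first three numbers and drops the fourth.
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


# 192.0.2.x is an address range reserved for documentation examples.
print(redact_last_octet("192.0.2.57"))

# Number of distinct addresses that share one redacted value.
print(ipaddress.ip_network("192.0.2.0/24").num_addresses)
```

Every address from 192.0.2.0 through 192.0.2.255 ends up as the same redacted value, which is where the 256 comes from.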

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is, or whether it goes far enough (plenty of others are doing that, see articles at EFF and Wired: 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist’s tenets is never to do anything to a record that you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer). Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean)?
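The grouping question can be sketched with a toy example. It assumes the IP address is the only field tying searches together, which the announcement as described here does not confirm, and that redaction zeroes the final number of the address. The log entries, the `redact` helper and the `group_queries` function are all hypothetical.

```python
from collections import defaultdict
import ipaddress


def redact(ip: str) -> str:
    """Zero the final number of an IPv4 address (assumed redaction rule)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False).network_address)


def group_queries(records, key):
    """Group search strings by whatever key(ip) returns."""
    groups = defaultdict(list)
    for ip, query in records:
        groups[key(ip)].append(query)
    return dict(groups)


# Made-up log entries: (IP address, search request).
log = [
    ("192.0.2.10", "archives jobs"),
    ("192.0.2.10", "finding aids"),
    ("192.0.2.77", "cheap flights"),
    ("198.51.100.5", "records retention"),
]

by_computer = group_queries(log, key=lambda ip: ip)  # before redaction
by_block = group_queries(log, key=redact)            # after redaction

print(len(by_computer), "groups before redaction")
print(len(by_block), "groups after redaction")
print(by_block["192.0.2.0"])
```

Before redaction the two searches from 192.0.2.10 can be read as one computer's trail. Afterwards they are mixed with the search from 192.0.2.77, so a researcher sees what the block of addresses searched for but cannot separate one computer's searches from another's.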

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, which would not have gotten into trouble for shredding its records if it had done so according to its own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well-implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

6 Comments

Lucian
March 20, 2007 at 2:55 pm

Nowadays, you can find a physical location from only an IP address with websites like http://www.ipgp.net, so hiding IP addresses is a good thing.
Fletch
March 20, 2007 at 11:02 pm

There will be a sacrifice of information with this policy. In this case, it will likely be impossible to use IP address records to identify which websites a specific user visited, but we should still be able to use the aggregate data to identify local trends and activities. (Of course, you could always go to the individual’s hard drive to recover their browser history. If the drive survived. And was readable.)

Think of it like libraries purging individuals’ circulation records. You won’t be able to use the library’s records to prove that John Doe checked out The Koran, but you could check the book’s circulation record to see that it was checked out 5 times in 2006 or you could go through John Doe’s personal papers to see if he discusses his nightly readings.

As with many aspects of records management, it is a question of privacy vs. full information capture. Of course if there wasn’t a fear about that information being used improperly F.I.C. wouldn’t be an issue…
Jeanne Post author
March 21, 2007 at 10:08 am

A friend directed me to the work of Dr. Latanya Sweeney. I was fascinated to read the abstract of Trail Re-identification: Learning Who You are From Where You Have Been. It appears that with the right algorithms, software can be created to reconnect supposedly ‘unidentified data’. I will likely post more on this soon as I read and assimilate the information on Dr. Sweeney’s site.
Pingback:Considering Historians, Archivists and Born Digital Records - SpellboundBlog.com - ponderings of an archives student
evidence eliminator
May 16, 2007 at 12:45 pm

Dr. Sweeney’s research shows that with the right algorithm and computational power, anonymity is just a theoretical concept.
Pingback:The OPLIN 4cast » Blog Archive » OPLIN 4cast #47