Digitization Quality vs Quantity: An Exercise in Fortune Telling

March 31, 2012

The quality vs quantity dilemma is high in the minds of those planning major digitization projects. Do you spend your time and energy creating the highest quality images of your archival records? Or do you focus on digitizing the largest quantity you can manage? Choosing one over the other has felt a bit like an exercise in fortune telling to me over the past few months, so I thought I would work through at least a few of the moving parts of this issue here.

The two ends of the spectrum are traditionally described as follows:

digitize at very high quality to ensure that you need not re-digitize later, creating a high quality master copy from which all possible derivatives can be made
digitize at the minimum quality required for your current needs, the theory being that this will increase the quantity of records you can digitize

This sounds all well and good on the surface, but it is not nearly as black and white a question as it appears. It is not the case that one can simply choose one over the other. I suppose that choosing ‘perfect quality’ (whatever that means) probably drives the most obvious of the digitization choices. Highest resolution. 100% accurate transcription. 100% quality control.

It is the rare digitization project that has the luxury of time and money required to aim for such a definition of perfect. At what point would you stop noticing any improvement, while just increasing the time it takes to capture the image and the disk space required to store it? 600 DPI? 1200 DPI? Scanners and cameras keep increasing the dots per inch and the megapixels they can capture. Disk space keeps getting cheaper. Even at the top of the ‘perfect image’ spectrum you have to reach a point of saying ‘good enough’.
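
As a rough illustration of how quickly disk space grows with resolution, here is a minimal sketch. It assumes a letter-size page (8.5 x 11 inches), 24-bit color and no compression at all; those figures and the function name are illustrative assumptions, not settings from any particular project. Real master files in TIFF or JPEG2000 will differ depending on compression and bit depth.

```python
def uncompressed_megabytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate uncompressed size, in MB (10**6 bytes), of one scanned page.

    bytes_per_pixel=3 assumes 24-bit color (an assumption for this sketch).
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bytes_per_pixel / 1_000_000


# Compare the resolutions mentioned above for a letter-size page.
for dpi in (300, 600, 1200):
    size = uncompressed_megabytes(8.5, 11, dpi)
    print(f"{dpi} DPI: about {size:.0f} MB per page")
```

Because the pixel count grows with the square of the resolution, each doubling of DPI roughly quadruples the raw size of every page, which is why the ‘good enough’ cutoff matters so much at scale.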

When you consider the choices one might make short of perfect, you start to get into a gray area in which the following questions start to crop up:

How will lower quality images impact OCR accuracy?
Is one measure of lower quality simply a lower level of quality assurance (QA) to reduce the cost and increase the throughput?
How will expectations of available image resolution evolve over the next five years? What may seem ‘good enough’ now may seem grainy and sad in a few years.
What do we add to the images to improve access? Transcription? TEI? Tagging? Translation?
How bad is it if you need to re-digitize something that is needed at a higher resolution on demand? How often will that actually be needed?
Will storing in JPEG2000 (rather than TIFF) save enough money from reduced disk space to make it worth the risk of a lossy format? Or is ‘visually lossless’ good enough?

Even the question of OCR accuracy is not so simple. In D-Lib Magazine’s article from the July/August 2009 issue, titled Measuring Mass Text Digitization Quality and Usefulness, the authors list multiple types of accuracy which may be measured:

Character accuracy
Word accuracy
Significant word accuracy
Significant words with capital letter start accuracy (i.e. proper nouns)
Number group accuracy
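
To see why the first two measures can diverge, here is a minimal sketch. It is not the method used in the D-Lib article: it uses the matching blocks from Python's standard difflib module as a simple stand-in for a proper alignment, and the sample strings and function names are made up for illustration.

```python
from difflib import SequenceMatcher


def accuracy(reference, ocr_output):
    """Share of reference items that are matched, in order, in the OCR output."""
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def character_accuracy(reference, ocr_output):
    # Compare the two strings character by character.
    return accuracy(reference, ocr_output)


def word_accuracy(reference, ocr_output):
    # Compare whitespace-separated words; one bad character spoils a whole word.
    return accuracy(reference.split(), ocr_output.split())


# Hypothetical ground truth and OCR output with two misread characters.
truth = "Minutes of the Board, 14 March 1902"
ocr = "Minutes of the Boaid, 14 March 19O2"

print(f"character accuracy: {character_accuracy(truth, ocr):.2f}")
print(f"word accuracy:      {word_accuracy(truth, ocr):.2f}")
```

Two wrong characters out of 35 leave the character score high, but they damage two of the seven words, and those happen to be a name and a year, which are exactly the things a researcher is likely to search for.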

So many things to consider!

The primary goal of the digitization project I am focused on is to increase access to materials for those unable to travel to our repository. As I work with my colleagues to navigate the choices, I find myself floating towards the side of ‘good enough’ across the board. Even the process of deciding this blog post is done has taken longer than I meant it to. I publish it tonight in the hope of drawing a line in the sand and moving forward with the conversation. For me, it all comes back to what you are trying to accomplish.

I would love to hear about how others are weighing all these choices. How often have long term digitization programs shifted their digitization standards? What aspects of your goals are most dramatically impacting your priorities on the quality vs quantity scale?

Image Credit: Our lovely fortune teller is an image from the George Eastman House collection in the Flickr Commons, taken by Nickolas Muray in 1940 for use by McCall’s Magazine. [UPDATED 1/6/2019: Image no longer on Flickr, but is available in the Eastman Museum online collection.]

Posted in access, digitization

5 Comments

Simon Tanner
April 1, 2012 at 3:45 pm

Enjoyed the blog posting. Digitisation as a process involving imaging of text does come with a set of dependencies, the foremost being:
– it depends upon the physical nature of the originals
– it depends upon the technologies that can be applied
– it depends on the information goals and types of information content to be extracted and provided to the end user.

In relation to the question of choosing resolution I have a simple dictum. There is no exact number that works in all cases. But, if you select your resolution to capture the smallest significant detail then this is usually a great guide to selecting the best resolution. For basic printed text then greater than 300 dpi is going to be fine for many purposes where the smallest significant detail is punctuation. But for ancient manuscripts then a much higher resolution would be essential as the scholar wants to see the pores of the parchment as the smallest significant detail. Use the smallest significant detail dictum and you’ll usually get to the right place.
Jeanne Post author
April 1, 2012 at 9:58 pm

Thanks so much for such a sensible rule of thumb! We have already chosen color over black and white for the same reason – it is important to us to keep that level of detail because the color of the paper or the ink communicates extra information which we want our researchers to have. I will add the ‘significant detail’ dictum to our process for selecting (and then defending) the appropriate dpi.
Gina Strack
April 2, 2012 at 2:29 am

As a government archives, we benefit from a lot of similar material: text, text, text. As a result, our standard has become 400 ppi for paper and 200 ppi for microfilm (though I believe the true resolution is 200 multiplied by the reduction factor, such as 14, which gives a higher number appropriate for film). We’ve stayed with these standards steadily since about 2006.

One of our partners, FamilySearch, actually measures it so that the smallest line (like the stem on a single letter) on the page is 3 pixels wide–and handily has the software to check for that.

Generally, I use BCR-formerly-Western States best practices, which call for 5000 pixels on the longest side. Letter size paper = 5000 / 11 in. = ~450 ppi. I agree that handwritten or valuable manuscripts may deserve more pixels to show important detail.

One of the major issues we’ve dealt with is that storage costs dramatically went up, right during the tight budgets of the recession. At times, it’s been difficult to argue for keeping high-resolution masters when we have both original paper and preservation microfilm in our collection. I stubbornly hung on and now we may hopefully get our IT needs met, including born-digital preservation, at last.
Merrilee Proffitt
April 2, 2012 at 7:31 pm

Jeanne, it’s a little old now, but you have seen Shifting Gears, haven’t you? I think the arguments in this still stand up as to WHY you should err on the side of “good enough”. I like to think of this as MPLP for digitization, where the amount of effort depends on what’s warranted. As Simon says, there are times you will want to do more, but lots of times less is okay.

http://www.oclc.org/research/publications/library/2007/2007-02.pdf
Jeanne Post author
April 3, 2012 at 1:58 pm

@Gina: Congrats on hanging tough. Fingers crossed that you are on a path to getting the IT support you need to preserve your digital content. Thanks for sharing the best practices you follow!

@Merrilee: Yes. I have seen Shifting Gears. It definitely gives support for the ‘good enough’ model. The trick is deciding where to draw that line. Also – amazing how ‘old’ 2007 seems now, but I think the lessons in there still definitely apply. Even with the speed of improvement of the technology, the actual needs for digitization projects haven’t changed as fast.
