The two ends of the spectrum are traditionally described as follows:
- digitize at very high quality so that you need not re-digitize later, creating a high-quality master copy from which all possible derivatives can be made
- digitize at the minimum quality required for your current needs, the theory being that this will increase the number of records you can digitize
This sounds all well and good on the surface, but the question is not nearly as black and white as it appears. One cannot simply choose one over the other. I suppose that choosing ‘perfect quality’ (whatever that means) probably drives the most obvious of the digitization choices. Highest resolution. 100% accurate transcription. 100% quality control.
It is the rare digitization project that has the luxury of the time and money required to aim for such a definition of perfect. At what point would you stop noticing any improvement while still increasing the time it takes to capture the image and the disk space required to store it? 600 DPI? 1200 DPI? Scanners and cameras keep increasing the dots per inch and the megapixels they can capture. Disk space keeps getting cheaper. Even at the top of the ‘perfect image’ spectrum you have to reach a point of saying ‘good enough’.
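To put rough numbers on the resolution question, here is a minimal sketch of how the raw size of a single scan grows with DPI. The page size (8.5 x 11 inches), the 24-bit color depth, and the function name are my own assumptions for illustration; real files will usually be smaller once compression is applied.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Raw (uncompressed) size in bytes of one scanned page.

    Assumes a flat page of width_in x height_in inches captured at the
    same DPI in both directions; ignores file headers and compression.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical example: one letter-size page at three common resolutions.
for dpi in (300, 600, 1200):
    size = uncompressed_scan_bytes(8.5, 11, dpi)
    print(f"{dpi:>5} DPI: {size / 1_000_000:,.1f} MB")
```

The point of the arithmetic is that pixel count scales with the square of the resolution: each doubling of DPI roughly quadruples the raw data per page, so the cost of chasing ‘perfect’ climbs much faster than the DPI number itself suggests.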
When you consider the choices one might make short of perfect, you start to get into a gray area in which the following questions start to crop up:
- How will lower-quality images impact OCR accuracy?
- Is one measure of lower quality simply a lower level of quality assurance (QA) to reduce the cost and increase the throughput?
- How will expectations of available image resolution evolve over the next five years? What seems ‘good enough’ now may seem grainy and sad in a few years.
- What do we add to the images to improve access? Transcription? TEI? Tagging? Translation?
- How bad is it if you need to re-digitize something that is needed at a higher resolution on demand? How often will that actually be needed?
- Will storing in JPEG2000 (rather than TIFF) save enough money from reduced disk space to make it worth the risk of a lossy format? Or is ‘visually lossless’ good enough?
Even the question of OCR accuracy is not so simple. In the article ‘Measuring Mass Text Digitization Quality and Usefulness’, from the July/August 2009 issue of D-Lib Magazine, the authors list multiple types of accuracy that may be measured:
- Character accuracy
- Word accuracy
- Significant word accuracy
- Significant words with capital letter start accuracy (i.e. proper nouns)
- Number group accuracy
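To see why these measures can diverge, here is a small sketch comparing the first two, character accuracy and word accuracy, on the same OCR output. It is a simplification of my own rather than the method used in the article: it uses Python's standard `difflib.SequenceMatcher` to count how many items of a hand-checked reference text are matched, in order, in the OCR text. The function names and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def accuracy(reference, ocr_output):
    """Fraction of reference items that are matched, in order, in the OCR output.

    Works on any sequence: a string (characters) or a list of words.
    """
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def character_accuracy(truth, ocr):
    return accuracy(truth, ocr)


def word_accuracy(truth, ocr):
    # Naive whitespace tokenization; punctuation stays attached to words.
    return accuracy(truth.split(), ocr.split())


# Hypothetical sample: two misread characters in a four-word line.
truth = "The quick brown fox"
ocr = "The qu1ck brown f0x"
print(f"character accuracy: {character_accuracy(truth, ocr):.2f}")
print(f"word accuracy:      {word_accuracy(truth, ocr):.2f}")
```

A couple of misread characters barely dent the character score, yet each one spoils an entire word, and it is whole words that a full-text search depends on. Which number matters depends on what you want the text to do.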
So many things to consider!
The primary goal of the digitization project I am focused on is to increase access to materials for those unable to travel to our repository. As I work with my colleagues to navigate the choices, I find myself floating towards the side of ‘good enough’ across the board. Even the process of deciding this blog post is done has taken longer than I meant it to. I publish it tonight in the hope of drawing a line in the sand and moving forward with the conversation. For me, it all comes back to what you are trying to accomplish.
I would love to hear how others are weighing all these choices. How often have long-term digitization programs shifted their digitization standards? Which aspects of your goals most dramatically affect your priorities on the quality vs quantity scale?
Image Credit: Our lovely fortune teller is an image from the George Eastman House collection in the Flickr Commons, taken by Nickolas Muray in 1940 for use by McCall’s Magazine. [UPDATED 1/6/2019: Image no longer on Flickr, but is available in the Eastman Museum online collection.]