6 Historic Newspaper OCR Accuracy
6.1 Introduction
OCR, or ‘Optical Character Recognition’, is a set of methods for turning the text in digitised images into machine-readable text.
This chapter still needs to be written. Feel free to get in touch if you think you are the person to do it!
Things to cover:
- OCR assessment of newspapers
- Tools available for OCR (e.g. Tesseract)
- The impact on downstream tasks
- Links to external papers and projects which have been assessing and improving OCR.
6.2 What is OCR quality like in 19th-century newspapers?
This is a difficult question to answer, because it varies so much between projects, formats and dates. The truth is, nobody really knows what it’s like, because finding out would require large sets of very accurate, manually transcribed newspapers to compare with the OCR text. Subjectively, we can probably make a few generalisations:
- OCR quality improves as the software improves, but not particularly quickly, because much of it is dependent on the physical form of the source.
- Digitising from print gives much better results than digitising from microfilm, but print can still be bad.
- Standard text is recognised much better than non-standard text, such as unusual fonts or sizes.
- Advertisements seem to have particularly bad OCR: they are generally not set in regular blocks of text, which OCR software finds difficult, and they often used non-standard characters or fonts to stand out.
- The effect of time is not clear: type probably got better, but it also got smaller, and pages gained more columns.
- Problems with the physical page have a huge effect: rips, tears, foxing, dark patches and so forth. Many errors are therefore not caused by the microfilm, digital image or software, and may not be fixable.
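Where a manual transcription does exist, the comparison against the OCR text described earlier is often summarised as a character error rate (CER): the edit distance between the two texts divided by the length of the transcription. The following is a minimal sketch of that idea in Python, not a description of how any particular project measures accuracy. The function names and the sample strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the manual transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr) / len(reference)


# Invented example: a manual transcription and a plausible OCR reading of it.
reference = "The train arrived at noon"
ocr_output = "Tbe tram arrived at n0on"
print(round(character_error_rate(reference, ocr_output), 3))  # 0.16
```

A CER of 0.16 here means roughly one character in six needs correcting. Note that the measure is only as good as the transcription it is compared against, which is exactly the resource that is usually missing for newspapers.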
What does this all mean? OCR errors introduce bias, probably in systematic rather than random ways, and that has implications for our work. If a collection is digitised from a mix of print and microfilm, for example, we might get very different results for the print portion, and the difference might easily be misattributed to a significant historical finding.
Further reading:
- John Evershed and Kent Fitch, ‘Correcting Noisy OCR: Context Beats Confusion,’ ACM International Conference Proceeding Series, 2014 <https://doi.org/10.1145/2595188.2595200>.
- Why You (A Humanist) Should Care About Optical Character Recognition