Sunday, March 6, 2016

Thou shalt not trust Xerox scanners and photocopiers / PDF compression

Abstract:
Xerox photocopiers/scanners use an unreliable compression algorithm that mangles numbers and symbols in documents. The company released a patch, but even with the fix available, the damage has been done.

Xerox, a Fortune 500 company, was founded in 1906 in Rochester, NY, USA. It directly employs more than 140,000 people [1], and there are thousands of companies selling and maintaining Xerox products around the world.

Xerox machines are so widely available that many people use the brand as a verb, saying they "xeroxed a document" the same way some would say they "googled someone".

If there is one person to thank for discovering this problem and helping to fix it, it is certainly David Kriesel, a computer scientist from the University of Bonn.

On his blog and in the talk he gave at FrOSCon [2], David explains what he went through with the company to get the problem fixed, and how it affects countless people and companies.

The problem lies in the use of JBIG2, a compression algorithm designed to reduce the size of typeset documents (i.e. basically everything that is not handwritten).
JBIG2 looks at the document and extracts its symbols, as if you drew rectangles around all letters and digits. These rectangles are then compared, and if two of them look the same, the system decides that they depict the same symbol. When compressing, it stores only one version of each symbol and reuses it wherever that symbol appears.
You would think that is smart, right? Well, the problem is how "similar" symbols are defined. If the bar is set too low and/or the quality is bad and/or the resolution is low, then 6's tend to look like 8's or even B's, as you can see here:

All images in this document, source: D. Kriesel [2]
This is an extract of some figures from a cash register, where the right column is sorted in ascending order. The problem was then easy to spot. But most of the time, such errors are hard to notice.
Sure, small mistakes related to a cash register are not that bad, right? Well, now imagine these were not dollars but milligrams of drugs a hospital patient needs to take, blood test results, or data used by the military or on oil rigs (as reported by the vice-president of Xerox). Suddenly, this becomes very serious.
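
To make the matching step more concrete, here is a toy sketch in Bash. It is not the actual JBIG2 algorithm, nor anything Xerox ships: the 5x7 bitmaps, the hamming and same_symbol helpers and the threshold of 4 pixels are all made up for illustration. It only shows how a tolerance that is too generous lets a 6 be replaced by an 8.

```shell
#!/usr/bin/env bash
# Toy illustration only: hypothetical 5x7 glyphs, written row by row
# (1 = black pixel, 0 = white pixel). Real JBIG2 encoders are far more complex.
six="01110""10000""10000""11110""10001""10001""01110"
eight="01110""10001""10001""01110""10001""10001""01110"
one="00100""01100""00100""00100""00100""00100""01110"

# Count the pixels that differ between two bitmaps of equal length.
hamming() {
    local a=$1 b=$2 d=0 i
    for ((i = 0; i < ${#a}; i++)); do
        if [[ ${a:i:1} != "${b:i:1}" ]]; then
            d=$((d + 1))
        fi
    done
    echo "$d"
}

# Assumed tolerance: glyphs differing by fewer than 4 pixels count as "the same".
threshold=4

same_symbol() {
    [ "$(hamming "$1" "$2")" -lt "$threshold" ]
}

if same_symbol "$six" "$eight"; then
    echo "6 and 8 are merged: every 6 may be drawn as an 8"
fi
if ! same_symbol "$six" "$one"; then
    echo "6 and 1 stay separate"
fi
```

With these made-up glyphs, the 6 and the 8 differ by only three pixels, so a blurry or low-resolution scan can easily push them under the tolerance.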

The first time David saw the problem was on a scan of construction plans, where one room was reported as having the same size as a smaller one next to it. Here is the original blueprint:


Here's a scan of this document by a Xerox photocopier/scanner:

The areas of the rooms ("Zimmer") have all been changed to 14.13 square meters.

After much discussion between Xerox and David, and this issue having made newspapers all around the world, Xerox finally made a software patch available. So you would think, problem solved, let's continue with our lives. Unfortunately it ain't that simple.

The 2000s saw companies going paperless, which involved heavy use of scanning. And I am not talking about your usual slow flatbed scanner at home. I am talking about heavy-duty scanners processing one page per second, on both sides, 10 hours a day. The software bug was introduced in 2006 and fixed in 2014. Not only have most machines not been updated, but we are also left with millions of documents, scanned over more than 8 years, that have no legal value (even if they seem to contain no character mangling) and that we cannot trust.

The reason why machines have not been updated? First, Xerox doesn't have a list of end users: it relies on a decentralized network of partners to install and maintain its machines, so it can't contact its users directly. Second, it is not in the interest of these partner companies to visit all their customers and update the software for free, supposing they even know about the issue. Finally, these companies might not even know how to update the software.

The JBIG2 algorithm is so weak that it is prohibited by Swiss and German authorities for documents scanned by their offices.

How to tell if a PDF was compressed using JBIG2?

On Linux or OS X:
strings MyDocument.pdf | grep /Filter

Example output:
/Filter /FlateDecode
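
A PDF usually contains many streams, each with its own filter, so the command above can print many lines. As a convenience, here is a small sketch that counts how often each filter name appears in a file. The function name pdf_filters is made up for this post, and the pattern assumes that filter names are written in plain text and end in "Decode". It matches the name itself rather than the /Filter keyword, so it also catches filters listed in an array.

```shell
# Count the "...Decode" filter names found in one PDF file.
# LC_ALL=C and -a make grep treat the binary file as plain bytes;
# -o prints only the matching part of each line.
pdf_filters() {
    LC_ALL=C grep -a -o '/[A-Za-z0-9][A-Za-z0-9]*Decode' "$1" | sort | uniq -c
}

# Usage: pdf_filters MyDocument.pdf
```

If /JBIG2Decode shows up in the output, at least one stream in the file appears to use JBIG2.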

Supposing you have a bunch of PDF documents you want to check, here is a way to print the names of those that use JBIG2 compression:

for i in *.pdf; do
    # Quoting "$i" keeps file names with spaces intact;
    # grep -q prints nothing and only sets the exit status.
    if strings "$i" | grep /Filter | grep -q JBIG2; then
        echo "$i"
    fi
done
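
If the documents are spread over several directories, a variant based on find may be handier. This is only a sketch: list_jbig2_pdfs is a made-up name, and the function looks for the filter name JBIG2Decode anywhere in the file rather than only on /Filter lines.

```shell
# Print every *.pdf under a directory (default: the current one) that
# mentions the JBIG2Decode filter. grep -l prints only the names of
# matching files instead of the matching lines.
list_jbig2_pdfs() {
    LC_ALL=C find "${1:-.}" -type f -name '*.pdf' \
        -exec grep -l 'JBIG2Decode' {} +
}

# Usage: list_jbig2_pdfs ~/Documents
```

Note that -name '*.pdf' is case-sensitive, so files ending in .PDF would need their own pattern.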

If you use Windows, use Cygwin, Babun or find a way to open the PDF as a text file, and then look for /Filter (note the leading slash).

Many documents will report using something known as the DCTDecode filter. This indicates JPEG compression. See table below.

What compression algorithms should we use then?

Supposing you have a choice of compression algorithm (which is not the case if you use a standalone scanner that produces PDF files out of the box), JBIG2 is certainly a big no-no.

Here are a few common compression algorithms [3,4]:
  • JPEG, obviously. Especially suited for images, but getting old and definitely lossy
    (a lossless version exists, but it is virtually never implemented or used).
  • JPEG2000, a more modern alternative to JPEG, lossy.
  • Flate (a.k.a. "deflate"), uses Huffman coding and Lempel-Ziv (LZ77) compression. It can be used on images, but is excellent if the document is not a scan, lossless.
  • CCITT Group 4, for monochrome documents and images, used extensively for fax machines and known as G4, lossless.
  • RLE (Run Length Encoding), best for images containing large areas of white and black, lossless.
There is no algorithm better than all the others. It depends on the type of PDF document, and most of the time PDF documents use several of them at the same time. Text and symbols are usually deflated, and pictures usually keep their original JPEG compression.

Flate/JPEG2000 is a good combination for anything with colors or shades of gray, while Flate/G4 is good for monochrome images.

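For your information, here is the list of all the supported algorithms (a.k.a. "filters" in Adobe PDF terminology), as per [5] p. 67:
  • ASCIIHexDecode: data encoded as ASCII hexadecimal
  • ASCII85Decode: data encoded as ASCII base-85
  • LZWDecode: LZW compression
  • FlateDecode: zlib/deflate compression
  • RunLengthDecode: byte-oriented run-length encoding
  • CCITTFaxDecode: CCITT fax encoding (Group 3 or Group 4)
  • JBIG2Decode: JBIG2 compression
  • DCTDecode: JPEG compression
  • JPXDecode: JPEG2000 compression
  • Crypt: decryption of encrypted data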



[1] http://www.xerox.com/annual-report-2014/index.html
[2] http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
[3] http://www.prepressure.com/pdf/basics/compression
[4] http://www.verypdf.com/pdfinfoeditor/compression.htm
[5] PDF Reference, 6th edition, Adobe Systems Inc. https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
