the-technoholik

Sunday, March 6, 2016

Thou shalt not trust Xerox scanners and photocopiers / PDF compression

Abstract:
Xerox photocopiers/scanners use an unreliable compression algorithm that mangles numbers and symbols in documents. A patch was released by the company but even if fixed, the damage has been done.

Xerox, a Fortune 500 company, was founded in 1906 in Rochester, NY, USA. It directly employs more than 140,000 people [1], and there are thousands of companies selling and maintaining Xerox products around the world.

Xerox machines are so widely available that many people use the brand name as a verb, saying they "xeroxed a document" the same way some would say they "googled someone".

If there is one person to thank for discovering this problem and helping to fix it, it's certainly David Kriesel, a computer scientist from the University of Bonn.

On his blog and in the talk he gave at FrOSCon [2], David explains what he went through with the company to get the problem fixed, and how it affects countless people and companies.

The problem lies in the use of JBIG2, a compression algorithm designed to reduce the size of typeset documents (i.e. basically everything that is not handwritten).
What JBIG2 does is look at the document and extract its symbols, as if you drew rectangles around all letters and digits. These rectangles are then compared, and if two of them look the same, the system decides they depict the same symbol. When compressing, it stores only one version of each symbol and reuses it everywhere that symbol appears.
You would think that is smart, right? Well, the problem is how "similar" symbols are defined. If the bar is set too low and/or the quality is bad and/or the resolution is low, then 6's tend to look like 8's or even B's, as you can see here:

All images in this document, source: D. Kriesel [2]

This is an extract of some figures from a cash register, where the right column is sorted in ascending order. The problem was then easy to spot. But most of the time, such errors are hard to notice.
Sure, small mistakes related to a cash register are not that bad, right? Well, now imagine these were not dollars but milligrams of drugs a hospital patient needs to take, blood test results or data used by the military or oil rigs (as has been reported by the vice-president of Xerox). Suddenly, this becomes very serious.
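To make the matching step concrete, here is a toy sketch in shell. It is not JBIG2 itself: the 3x5 glyph bitmaps, the glyph_distance and same_symbol helpers and the pixel-count threshold are all made up for illustration. It only shows how a similarity bar that is too lenient lets a 6 be replaced by an 8.

```shell
# Toy illustration only: real JBIG2 encoders use far more elaborate matching.
# A glyph is a string of '#' (ink) and '.' (paper); '|' separates pixel rows.

# Print the number of positions at which two equally long glyphs differ.
glyph_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

# Succeed if the glyphs differ by at most $3 pixels, i.e. if a lossy
# encoder with that (hypothetical) threshold would store only one of them.
same_symbol() {
    [ "$(glyph_distance "$1" "$2")" -le "$3" ]
}

six='###|#..|###|#.#|###'    # hypothetical low-resolution "6"
eight='###|#.#|###|#.#|###'  # hypothetical low-resolution "8"
```

With a threshold of 0, same_symbol "$six" "$eight" 0 fails and both glyphs are kept. With a threshold of 1 it succeeds, and every 6 on the page could end up drawn as an 8.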

The first time David saw the problem was on a scan of construction plans, where one room was reported as having the same size as a smaller one next to it. Here is the original blueprint:

Here's a scan of this document by a Xerox photocopier/scanner:

The areas of the rooms ("Zimmer") have all been changed to 14.13 square meters.

After much discussion between Xerox and David, and this issue having made newspapers all around the world, Xerox finally made a software patch available. So you would think, problem solved, let's continue with our lives. Unfortunately it ain't that simple.

The 2000s saw companies going paperless, which involved heavy use of scanning. And I am not talking about your usual slow flatbed scanner at home. I am talking about heavy-duty scanners processing one page per second on both sides, 10 hours a day. The software bug was introduced in 2006 and fixed in 2014. Not only have most machines not been updated, but we are left with millions of documents, scanned over more than 8 years, that have no legal value (even if they seem to contain no character mangling) and that we cannot trust.

Why have the machines not been updated? First, Xerox doesn't have a list of end users, as it relies on a decentralized network of partners that install and maintain machines on its behalf, so Xerox can't contact its users directly. Second, it is not in the interest of these companies to visit all their customers for free and update the software, supposing they even know about the issue. Finally, these companies might not even know how to update the software.

The JBIG2 algorithm is so weak that Swiss and German authorities prohibit it for documents scanned by their offices.

How to tell if a PDF was compressed using JBIG2?

On Linux or OS X:
strings MyDocument.pdf | grep /Filter

Example output:
/Filter /FlateDecode

Supposing you have a bunch of PDF documents you want to check, here is a way to print the name of those that use JBIG2 compression:

for i in *.pdf; do
    # Quote "$i" so that filenames containing spaces are handled correctly
    if strings "$i" | grep /Filter | grep -q JBIG2; then
        echo "$i"
    fi
done

If you use Windows, use Cygwin, Babun or find a way to open the PDF as a text file, and then look for /Filter (note the leading slash).

Many documents will report using something known as the DCTDecode filter. This indicates JPEG compression. See table below.
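If you want an overview of everything a file uses rather than a yes/no answer for JBIG2, a small helper along these lines can tally the filter names. It is only a sketch: pdf_filters is a name I made up, it assumes a grep that supports -a and -o (GNU and BSD grep do), and like the commands above it is a plain text search, not a real PDF parser.

```shell
# Print each "...Decode" filter name found in a PDF, followed by how many
# times it occurs. This is a plain text search, not a real PDF parser.
pdf_filters() {
    LC_ALL=C grep -a -o '/[A-Za-z0-9]*Decode' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running pdf_filters MyDocument.pdf prints one line per filter, for example a line starting with /FlateDecode or /DCTDecode. A line starting with /JBIG2Decode is the one to worry about.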

What compression algorithms should we use then?

If you have a choice of compression algorithm (which is not the case with a standalone scanner that produces PDF files out of the box), JBIG2 is certainly a big no-no.

Here are a few common compression algorithms [3,4]:

JPEG, obviously. Especially suited for photographic images, but getting old and definitely lossy (a lossless version exists, but it is virtually never implemented or used).
JPEG2000, a more modern alternative to JPEG, lossy or lossless.
Flate (a.k.a. "deflate"), combines Huffman coding and Lempel-Ziv (LZ77) compression. It can be used on images, but is excellent if the document is not a scan. Lossless.
CCITT Group 4, for monochrome documents and images, used extensively by fax machines and known as G4, lossless.
RLE (Run Length Encoding), best for images containing large areas of white and black, lossless.

No algorithm is better than all the others. It depends on the type of PDF document, and most of the time PDF documents use several of them at the same time. Text and symbols are usually deflated, and pictures are usually JPEG.

Flate/JPEG2000 is a good combination for anything with colors or shades of gray, while Flate/G4 is good for monochrome images.

For your information, here is the list of all the supported algorithms (a.k.a. "filters" in Adobe PDF terminology), as per [5] p. 67:

[1] http://www.xerox.com/annual-report-2014/index.html
[2] http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
[3] http://www.prepressure.com/pdf/basics/compression
[4] http://www.verypdf.com/pdfinfoeditor/compression.htm
[5] PDF Reference, 6th edition, Adobe Systems Inc. https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Wednesday, February 24, 2016

Count lines in source files

Following up on the last article, here's how to count the total number of lines in your source files:

find **/src/* -regex '.*\.\(ts\|java\|less\)$' -print0 | xargs -0 wc -l | tail -1

ZSH, which I'm using, supports **. Bash supports it too from version 4, but only after enabling it with shopt -s globstar. ** matches any number of directory levels, so it is roughly equivalent to */*, */*/*, etc.
After finding these filenames, the find command prints them to standard output, but instead of separating them with a newline character, it uses NUL or \0 (the null character).

xargs then reads this list. We tell it that the filenames are separated by NUL (with the option -0), and then give it the command to run on the files, wc -l. wc stands for word count, but it's also capable of counting characters and lines. When we only want lines, we can use the -l argument (l as in lucky). When given several files, wc prints a total as its last line. As we are not interested in how many lines each of the files contains, we "select" only the last line of the output with tail, because that last line contains the total. (With a very large number of files, xargs may run wc more than once, in which case the last line is only the total of the last batch.)

Here is an alternative way to do the same thing:

find **/src/* -regex '.*\.\(ts\|java\|less\)$' -exec wc -l {} \; | awk '{print $1;}' | paste -s -d+ | bc

In this case we use find's built-in feature to execute a command on each file it finds, with the -exec argument. It is described in the manual of find. Basically, everything up to the semicolon is the command to run. The semicolon needs to be escaped, like all characters that would otherwise be interpreted by the shell instead of being passed to the command. {} is replaced by the filename (the path is relative to the current directory).

wc prints the number of lines, a space, and then the filename. We use awk to print the first "word" it finds (by default, awk supposes "words" are separated by one or more spaces). I am not going to explain awk syntax, but it is a very useful tool. In fact we could have used it to do the rest of the job.

paste is a handy tool too, yet most people have never heard of it and try to replicate its features with complicated shell scripts... In this example it is going to transform this:

1
2
3

into this:

1+2+3

bc, which I believe stands for basic calculator, is then used to compute the sum.
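As mentioned, awk could have done the rest of the job on its own. Here is a minimal sketch of that idea; the function name sum_first_column is mine, and it simply adds up the first field of every line it reads, which replaces the awk, paste and bc stages in one go.

```shell
# Add up the first whitespace-separated field of every input line.
# "sum + 0" makes awk print 0 instead of an empty line when there is no input.
sum_first_column() {
    awk '{ sum += $1 } END { print sum + 0 }'
}
```

It would be used as the last stage of the pipeline, for example: find . -name '*.java' -exec wc -l {} \; | sum_first_column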

Now, there are tools that define what "lines of code" are and count them with more context than plain text alone. But let's just ignore blank lines:

find **/src/* -regex '.*\.\(ts\|java\|less\)$' -printf "sed '/^$/d' %p | wc -l\n" | sh | awk '{print $1;}' | paste -s -d+ | bc

sed, with these arguments, deletes empty lines (the regex describes "a line that begins and ends with nothing in between").
Unfortunately find does not support pipes ( | ) in the -exec argument, so we use a little trick.
That trick is to print commands to standard output, and then to interpret them with the shell, as if these commands were part of a shell script.
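Another way around the missing pipe support, sketched here under my own assumptions, is to let -exec start a small shell for each file and put the pipe inside it. The function name count_nonblank_lines is made up, and I select files with -name tests instead of the regex so that the sketch does not depend on GNU find. Unlike the printed-commands trick, this also copes with filenames containing spaces.

```shell
# Count the non-empty lines of all .ts, .java and .less files under $1
# (default: the current directory) and print the grand total.
count_nonblank_lines() {
    find "${1:-.}" -type f \( -name '*.ts' -o -name '*.java' -o -name '*.less' \) \
        -exec sh -c 'sed "/^\$/d" "$1" | wc -l' sh {} \; |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

The inner shell receives the filename as $1, so the pipe between sed and wc is interpreted there rather than by find.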

Find files by extension with the terminal

Today I was trying to find source files in a development project. These files have the extensions .java, .ts and .less.
I realized I never needed to do such a search before, and here's how I did it:

find . -regex '.*\.\(ts\|java\|less\)$'

Let me know if you find something better.
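One candidate for "something better", offered as a sketch rather than the definitive answer, is to avoid regular expressions altogether and combine several -name tests with -o ("or"). The function name find_sources is my own. These options are part of standard find, so this should behave the same on Linux and OS X.

```shell
# List all .java, .ts and .less files under $1 (default: current directory).
# The escaped parentheses group the -name tests so that -type f applies to all.
find_sources() {
    find "${1:-.}" -type f \( -name '*.java' -o -name '*.ts' -o -name '*.less' \)
}
```

The patterns are quoted so that the shell passes them to find untouched instead of expanding them itself.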

Sunday, October 11, 2015

Populating a PostgreSQL table / database

The other day I had to populate a table with hundreds of millions of records.
Here's how I managed to do it quite rapidly:

Remove all the constraints and indexes from the table.
Prepare your data in a CSV file that you put on the DB server.
Open a console on the DB server: psql -d myDb
And here comes the star of the show, the COPY statement:
COPY myTable (theFirstColumn, theSecondColumn) FROM '/path/to/some/file.csv' WITH CSV;

Append QUOTE AS '"' DELIMITER AS ',' ESCAPE AS '\' if required.
The column list is not required if the columns in your file are in the same order as in the table, but I would recommend against omitting it.
Add the constraints and indexes back.

If you really have to use INSERT statements, at least put them in one big transaction.

Note that with the constraints removed, nothing is checked during the load, not even NOT NULL!
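The whole procedure can be sketched as a small shell helper that prints the SQL, which you can then review and pipe into psql. Everything here beyond the COPY statement is an assumption for illustration: the function name bulk_load_sql, the single index it drops and rebuilds, and the index naming scheme. Adapt it to the constraints and indexes your table really has.

```shell
# Print the SQL for a bulk load: drop an index, COPY the CSV, rebuild the index.
# $1 = table, $2 = comma-separated column list of the CSV,
# $3 = absolute path of the CSV file on the DB server, $4 = column to index.
# The index name <table>_<column>_idx is hypothetical.
bulk_load_sql() {
    cat <<EOF
DROP INDEX IF EXISTS ${1}_${4}_idx;
COPY $1 ($2) FROM '$3' WITH CSV;
CREATE INDEX ${1}_${4}_idx ON $1 ($4);
EOF
}
```

For example: bulk_load_sql myTable 'theFirstColumn, theSecondColumn' /path/to/some/file.csv theFirstColumn | psql -v ON_ERROR_STOP=1 -d myDb. The ON_ERROR_STOP variable makes psql stop at the first failing statement instead of carrying on.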

Wednesday, October 7, 2015

OS X 10.11 El Capitan hangs when plugging a USB device in VMware Workstation 12.0

The other day I tried to "hot plug" my iPad into OS X running as a guest, which consistently resulted in the guest OS freezing / hanging with no other solution than rebooting it.

I haven't found a real solution for this, nor do I really care to investigate the problem. My workaround is to connect it before OS X starts, and this way it's properly recognized.

A quick example about geofencing (sorry, "region monitoring") on iOS with Swift is coming up!

Monday, March 16, 2015

One more hour of sleep every night

Cheerful Sunset
(no copyright information)

It's 7 pm and the sky has this nice orange - red color.
But my computer screen does not. It's "shouting" a blue-ish tone at me. But not for long!

Everybody knows that watching screens (TV, computer or handheld devices) before going to bed is bad for your sleep. What you might not know is that if you do use screens at night, you'll sleep more and better if the screen is yellow-red instead of blue. Yes, blue! I won't start the blue/black vs. gold/white dress war again, but it's true even if you don't realize it: there is no such thing as white in nature. Color analysis is one of those very complex tasks your brain "computes", yet the result is subjective because your brain is not wired the same way as mine, so at some point you might see a gold dress where I see a blue dress. So believe me, if you are looking at your screen right now, it's blue. If you think car headlights are white, then put one car with regular headlights and one with Xenon lamps next to each other. You will probably say one is yellow-ish and the other is blue-ish.
But that's the beauty of it. If your screen goes very slowly and smoothly from blue to yellow - red, you won't notice it.

So, about that one hour of sleep. There was a study I cannot find anymore, where they had people lying in bed doing nothing, and they measured how fast these people would fall asleep. Then they would repeat the experiment with a more yellow / red light. They found people would fall asleep faster with the second setup.

Redshift on Windows

I haven't tried it, but there is an experimental version for Windows. It does seem like f.lux is older but maybe it works better, I don't know. Tell me if you find out.

Installing and configuring Redshift on Kubuntu (or any Linux distribution really)

In this tutorial I'll assume you live in western Switzerland, and that you use Kubuntu.
The procedure is similar on other systems. I am also assuming you are not changing timezones all the time. (But there is a solution for you if you find yourself in this situation, look at the documentation.)

Install Redshift:
sudo apt-get install redshift
Create and open the configuration file with your favorite editor:
nano ~/.config/redshift.conf
Paste this (Ctrl + Shift + V in Konsole) :
[redshift]
transition=1
location-provider=manual
adjustment-method=randr
[manual]
lat=46.7
lon=7.1
Change the last two lines with your latitude and longitude (yes, go ahead, click on this link). You can keep all digits if you want to. If your latitude reads South (Australia, New Zealand) or your longitude reads West (North America), use negative values where appropriate.

If you can see "GNU nano" at the top left of the console window, press Ctrl+O then Ctrl+X when you are done to close the editor and save the file (or the other way around).
Start "redshift" from the Terminal to check if it works. Your screen should go a bit yellow in a matter of seconds if the sun is not up. Otherwise try to mess with the latitude and longitude or your computer clock. There should be no output on the console.
All well? Time to start Redshift automatically. Open the KDE menu and type "autostart". Select the entry that appears. Click "Add Program..." then type "redshift" (without quotes). Don't select anything just type "redshift" and click OK. Click OK again to close the window.
Log out and in again. Your screen should be slightly yellow. Is it? Congratulations. You just bought yourself one hour of sleep each night.
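If you set up several machines, the configuration step can be scripted. The helper below is only a sketch: the name write_redshift_conf is mine, it writes exactly the settings shown above with your own coordinates, and it overwrites any existing file at the target path, so keep a copy if you already have one.

```shell
# Write a Redshift configuration like the one above for a given position.
# $1 = latitude, $2 = longitude (negative for South / West),
# $3 = output file (defaults to ~/.config/redshift.conf). Overwrites the file.
write_redshift_conf() {
    conf=${3:-$HOME/.config/redshift.conf}
    mkdir -p "$(dirname "$conf")"
    cat > "$conf" <<EOF
[redshift]
transition=1
location-provider=manual
adjustment-method=randr
[manual]
lat=$1
lon=$2
EOF
}
```

For western Switzerland that would be write_redshift_conf 46.7 7.1. To try a position without touching the file at all, redshift also accepts it on the command line as -l LAT:LON.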

Now I suggest you install a similar app like Twilight on your phone.

Thursday, March 5, 2015

Control Netflix from your phone

UPDATE: Teclado Flix has been discontinued.

Check out my new Android application Teclado Flix.

It's a remote control for people watching Netflix on their PC.
TVs from 2005 to 2014 feature HDMI inputs but cannot be connected to the Internet or run apps like Smart TVs can.
So what can Netflix fans with "old" TVs do? Just plug your laptop into the TV and watch Netflix.
But it would be silly to get up to pause and resume a video, so I made a remote control app to do that.

A computer program is required. I made a very tiny one that runs on any operating system.