Sunday, May 1, 2016

Cool rules for Nagios

I really love Nagios.
For those who don't know what it is, Nagios is an open source monitoring solution, whose architecture, configuration and customization are really simple once you grasp its main principles.
One of these principles is to define commands (which usually wrap Nagios plugins) that you can then re-use and parameterize as you wish.

Here are a couple of definitions I find useful.
(You might need to download some nagios plugins from the online repository.)

Check domain name is resolving to your public IP address

define command {
  command_name           check_dynamic_dns
  command_line           $USER1$/check_dns -H $ARG1$ -s resolver1.opendns.com  -a $(dig +short myip.opendns.com @resolver1.opendns.com)
}
define service {
  use                    generic-service
  host_name              andromeda
  service_description    DNS resolution andromeda.ddns.net
  check_command          check_dynamic_dns!andromeda.ddns.net
}

This is mostly useful if you don't manage the domain yourself. In this case we use a Dynamic DNS service. (Yes my machines are named after planets, moons, stars, constellations, galaxies, asteroids, ... How do you name yours?)

Remember you can override the default check interval and other monitoring parameters of the host per service. For instance, I set the interval to 10 minutes for most checks, except for the antivirus scan (check_clamscan) which runs much less often than that.

Also, most checks will be retried a 2nd, 3rd and 4th time if the result was not OK, since we could have been "unlucky" with the measurement. This is especially true for pings, temperatures and things like that. The antivirus test overrides max_check_attempts, check_interval and retry_interval: we cannot get "unlucky" with an antivirus scan, so there is no point retrying it, as it is not going to magically get better the second time.
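As a hedged illustration (the interval values here are made up, not the ones I actually use; with the default interval_length they are in minutes), overriding these parameters per service could look like this:

define service {
  use                    generic-service
  host_name              andromeda
  service_description    Antivirus scan
  check_command          check_clamscan
  check_interval         1440  ; run the scan once a day
  retry_interval         1440  ; no point retrying sooner
  max_check_attempts     1     ; a failed scan is not bad luck, alert immediately
}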

Check HDD temperatures

define command {
  command_name           check_hdd_temps
  command_line           sudo $USER1$/check_lm_sensors --sanitize --drives --high sdaTemp=41,51 --high sdbTemp=41,51 --high sdcTemp=41,51 --low sdaTemp=20,15 --low sdbTemp=20,15 --low sdcTemp=20,15
}
define service {
  use                    generic-service
  host_name              andromeda
  service_description    HDD Temperatures
  check_command          check_hdd_temps
}


This uses the drives' built-in temperature sensors. Note that check_lm_sensors can also be used to monitor other temperatures (CPU, motherboard, chassis, etc.).
I chose to gather everything into one command because the disks are all part of the same RAID array, but you could have one service per drive. 
Explanation of the command:
For /dev/sda, /dev/sdb and /dev/sdc, the status will be WARNING when the temperature is 41 degrees Celsius or above, and CRITICAL when it reaches 51 degrees. Likewise, when the temperature drops below 20 degrees there will be a warning, and the situation is considered critical below 15 degrees.

Besides HDD temperature, I suggest you have a look at check_smart. I'll be happy to document this check if someone is interested.


Monday, March 28, 2016

npm flaws

npm, the Node.js package manager, is the most widely used solution to download and manage dependencies in JavaScript-based applications, yet it has many flaws, which I explain below:


  1. The repository is run by a single company, which has full power to decide what should or should not be there, and it cannot be improved by the community.
  2. Packages are not reviewed, and they can run scripts with the user's privileges.
    This is very different from APT on Debian/Ubuntu, where packages spend a good amount of time in an alpha stage ("unstable" in Debian terminology), then beta ("testing"), before being released as final versions ("stable").
  3. npm lets you specify dependencies as a minimal version, e.g. "package xyz at version 1.2.3 or greater". Sooner or later this will break your web app between two builds for no apparent reason, a problem that is very hard to track down and that forces the most cautious users to use npm shrinkwrap (see the sketch after this list).

    This is not the fault of npm for offering this feature but of the package publishers. A majority of them use it, probably without realizing how dangerous it is, and that includes major Node packages such as webpack.
  4. Packages can be removed from the repository by their author or by the npm company without notice. This caused many builds to fail last week, when the author of the left-pad package chose to remove it from the repository, due to a conflict between his kik package and the social network of the same name, whose representative claimed the name was a trademark (which is complete BS in my opinion).
    The npm company sided with the social network, a dictatorial decision that the kik package's author, a believer in free software, refused to accept. I am pretty sure that problem would have been avoided if npm were run by a community instead of a company.
  5. As far as I know, packages are not signed and namespaces are not protected. Hence malicious packages are likely to appear and spread.
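Regarding point 3, here is a minimal sketch of how you would freeze your dependency tree with shrinkwrap (the package name and versions are made up for illustration):

# A dependency declared in package.json as "xyz": "^1.2.3" accepts any 1.x release >= 1.2.3,
# so two builds of the same code can end up with different versions of xyz.
# To freeze the exact versions of the whole dependency tree:
npm shrinkwrap
# This writes npm-shrinkwrap.json; commit it so that every build installs the exact same tree.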

Saturday, March 26, 2016

Set up your linux system to send emails with a Gmail account

Many programs rely on the local mail server that UNIX systems come with to send emails, and you probably don't want to put your Google password in the config files of every application on your web server. For these reasons, or any other, you are going to want that local mail server to relay through an Internet SMTP server.
In this example we are going to use Gmail, but a similar configuration works for any other SMTP server, although you might have to look at the manual if that server uses SSL instead of TLS (with TLS, the connection starts in cleartext and the STARTTLS command then switches it to an encrypted channel). You should read the manual anyway.
Note that you might need to enable IMAP in Gmail settings (I know IMAP has nothing to do with that, but...)

First, login as root and install postfix on your system:

sudo -i
apt-get install postfix

Then add the following lines to /etc/postfix/main.cf :

relayhost = [smtp.gmail.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_CAfile = /etc/postfix/cacert.pem
smtp_use_tls = yes
smtp_tls_security_level = may

This tells Postfix it needs to relay all outgoing emails to Gmail, using the mentioned host and port, and using TLS.
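The configuration above points smtp_tls_CAfile at /etc/postfix/cacert.pem, which doesn't exist yet. A minimal way to create it, assuming a Debian/Ubuntu system with the ca-certificates package installed, is to copy the system CA bundle:

cat /etc/ssl/certs/ca-certificates.crt > /etc/postfix/cacert.pem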

The authentication credentials must be entered into /etc/postfix/sasl_passwd :

[smtp.gmail.com]:587    theUser@theDomain.tld:ThePassword

(Re)build the hash table:

postmap hash:/etc/postfix/sasl_passwd

This should create the file /etc/postfix/sasl_passwd.db

Make sure these files are readable and writable by root only, and owned by root:

chmod 600 /etc/postfix/sasl_passwd*
chown root /etc/postfix/sasl_passwd*


Restart postfix:

sudo service postfix restart

Try to send an email:

apt-get install mailutils
mail -s "The Email Subject" an-address@on-the-internet.com

Type your email at this point. Press Ctrl-D when you are done. When it asks for "Cc:", press Enter.
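If you prefer a non-interactive test (from a script, for instance), you can pipe the body in instead:

echo "This is the body" | mail -s "The Email Subject" an-address@on-the-internet.com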

Watch the system log file:

tail -f /var/log/syslog

In case it failed, do not try to send the email again with the mail command.
Postfix should have automatically deferred the message to send it later.

You can look at the queue with
postqueue -p

(command not found at this point? Try with sudo...)

You can flush the queue, which actually means trying to send the messages again, with
postqueue -f

If for some reason you want to delete all deferred messages in the queue:
postsuper -d ALL deferred

If your system complains about not being able to reach Gmail on an IPv6 address, run Postfix in IPv4-only mode by adding the following line to /etc/postfix/main.cf and restart Postfix:
inet_protocols = ipv4

I hope that helped.

Checking if UDP ports are up and basics of port scanning

Nmap and UDP scanning

nmap is one hell of a tool. It is a very powerful swiss army knife for anything related to network exploration and security / port scanning.

It's relatively easy to test TCP-based services (try "telnet www.blogger.com 80", and press Ctrl-D to quit). But what about UDP?

UDP is connectionless, so we can't rely on something like TCP's 3-way handshake. All that is happening here is that our machine (the client) wants to see if it can receive a datagram from a server.
Usually, the server will answer a corresponding request from the client, in the form of another UDP datagram.

Now we can either take a shortcut by sending empty datagrams, risking that the server won't answer back, or do it by the book and send the datagram the server would be expecting on a given port. In other words, if we know the protocol in advance, we had better comply with that protocol.

And this is what nmap is going to do for us, i.e. send protocol-specific UDP payloads for well-known ports and empty datagrams otherwise (unless told otherwise on the command line).


Testing one UDP port

I just read an article about Switzerland transitioning from "winter time" to "summer time" tonight, or more specifically from CET to CEST (Central European Summer Time). There were a few words about the NTP (Network Time Protocol) server at ntp.metas.ch.
You can't test whether that service really exists by typing the address in your browser, since that would use HTTP (even though I was expecting them to have at least some webpage there, but anyway...).
We are going to send a UDP datagram on port 123, the default for NTP.

sudo nmap -p 123 -sU -P0 ntp.metas.ch

Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-26 11:24 CET
Nmap scan report for ntp.metas.ch (162.23.41.10)
Host is up (0.033s latency).
rDNS record for 162.23.41.10: metasntp13.admin.ch
PORT    STATE SERVICE
123/udp open  ntp

Nmap done: 1 IP address (1 host up) scanned in 1.10 seconds

First, let's look at the command.
sudo is required because nmap will be sending raw packets. Even when it is not strictly required, it is recommended to use nmap with a privileged user as the scans will be able to use more techniques and avoid relying on workarounds.
-p 123 limits the scan to port 123. Unfortunately UDP scanning is slow and, according to the manual, it will take 18 hours to scan all 65,536 ports, so you need to be precise.
-sU stands for "UDP scan".
-P0 activates IP protocol ping, a host-discovery method where Nmap sends raw IP packets with various protocol numbers and considers the host up if it gets any kind of reply (in older Nmap releases -P0 meant "skip host discovery", which is now spelled -Pn). You might want to try the command without that option.
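Since full UDP scans are so slow, one compromise is to probe only a handful of likely ports in a single run; a sketch (the port list here is arbitrary):

sudo nmap -sU -P0 -p 53,123,161 ntp.metas.ch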


How does that work? (when the port is open)

Let's ask Wireshark what the previous command really did:


Frame 11: The client (my computer) first needs to know the IP address of ntp.metas.ch, so it sends a DNS address request to my local DNS caching server.
Frame 12: The DNS server replies: "Hey, ntp.metas.ch is actually an alias for metasntp13.admin.ch, whose address is 162.23.41.10". This might be due to load-balancing.
Frame 13: Nmap got the address 162.23.41.10 for ntp.metas.ch from the operating system. Nmap doesn't know that the hostname it asked about is actually an alias, and because Nmap is curious, it performs a reverse DNS lookup (rDNS), i.e. it looks for the hostname belonging to an IP address. This step is completely useless; it's only for our information.
Frame 14: The answer to the reverse DNS lookup.
Frame 15: Now that Nmap knows the IP address, it can start its work. It knows that UDP port 123 is used for NTP, so it sends the appropriate payload, which Wireshark shows like this:


It doesn't really matter what is in there, but you can see it is a valid NTP message.

Frame 16 is the response from the NTP server:




How does that work? (when the port is closed, 1st variant)

Let's try again with port 12345, on which the server is not listening. 

sudo nmap -p 12345 -sU -P0 ntp.metas.ch

Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-26 12:18 CET
Nmap scan report for ntp.metas.ch (162.23.41.10)
Host is up.
rDNS record for 162.23.41.10: metasntp13.admin.ch
PORT      STATE         SERVICE
12345/udp open|filtered unknown

Nmap done: 1 IP address (1 host up) scanned in 2.09 seconds

Notice how Nmap reports the port as open|filtered. This means it could not definitely say whether the port was open or closed: either the port is open but the service did not respond to our (empty, possibly meaningless) probe, or a packet filter silently dropped the probe before it reached the service.

This is the complete exchange (nothing more happened after that):


Frames 1 to 4 are the same as before. In many cases you will not see these frames, as clients usually have a local DNS cache, so they don't need to perform a DNS lookup every time.

Frame 5 looks like this:


You can see the client tried to contact the server on port 12345 (which Wireshark knows is usually used by some application/protocol named italk). The length is 8 bytes, which is the size of the UDP header: the source port takes 2 bytes (enough for 65,536 values), same for the destination port, the length field itself is 2 bytes, and there are 2 more bytes for the checksum. So if the header size is equal to the size of the whole UDP datagram, there is no payload.

Frame 6 is exactly the same as the previous one, except for the source port.
There is no other response from the server. What does that tell us?
The server did not send an "ICMP unreachable error" packet, which prevents us from definitely determining whether the port is closed or filtered (the distinction depends on the ICMP error code, but it doesn't matter now). That is why Nmap says the port might be open.
It is not just saying "open" either, but "open|filtered", because there was no response.


How does that work? (when the port is closed, 2nd variant)

You might be luckier and get a definitive answer telling you with certainty that the port is closed.

Let's try another request:

sudo nmap -p 86 -sU -P0 172.16.123.34

Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-26 12:40 CET
Nmap scan report for 172.16.123.34
Host is up (0.0033s latency).
PORT   STATE  SERVICE
86/udp closed mfcobol

Nmap done: 1 IP address (1 host up) scanned in 0.59 seconds

If you followed the previous explanation, you should be able to figure out what happened for Nmap to conclude that the port was closed, instead of "open", "open|filtered" or just "filtered".

You don't? Let's look at it then:


It looks pretty standard, starting with the DNS request (which this time was unsuccessful) and a UDP datagram from the client. But the response (frame #6) from the server is different:


Look at that, it's an ICMP message! It says the port is unreachable. It could not be more clear. As a result, Nmap shows the port as closed.


Why don't we get "port closed" all the time?

The network stack on the server on my local network is probably not much different from the one in the previous example. The server behind ntp.metas.ch probably also sent a similar packet, but the firewall in front of it blocks ICMP packets. For instance, it is not even possible to ping that server ("ping" works by sending an ICMP message and getting one in return).
Or the folks operating the firewall might just block these two types of ICMP messages (echo/ping and port unreachable) to hinder network discovery and port scanning. This, along with blocking TCP resets, might be advisable to prevent these kinds of attacks.
This is why many Internet servers don't send back diagnostic messages telling whether a host or service is up. The client usually concludes that the service is down after sending one or more requests that don't get an answer within a reasonable amount of time (a "timeout").
Unfortunately, in the case of UDP, no answer doesn't necessarily mean that there is no service at a particular host and port, as the server might simply be ignoring the request.

Friday, March 18, 2016

Remove git branches locally after they have been removed from the remote repository

In the workflow we have at work, which is pretty common actually, a new branch is created for a bug fix, a new feature, etc.
This then allows you to create a "pull request", that is, to ask for the contents of the branch to be merged back into the master branch. After this has been done, the original branch is removed.

However, if you happen to have that branch in your local Git repository, it won't be deleted there. After a few months you end up with a lot of clutter, that is, branches you will never use again.

Here is a script that deletes local branches that don't exist on the remote repository.
Make sure to understand it will also delete branches that you never actually pushed to that remote.

#!/bin/bash

REMOTE=$(git remote) # usually origin

# List remote branches, strip the "<remote>/" prefix and leading spaces, ignore the HEAD pointer
CMD_REMOTE_BRANCHES=$(git branch -r | sed "s/${REMOTE}\///g" | sed -r 's/^\s+//' | grep -E -v '^HEAD')

deleteIfNotOnRemote() {
    while read localBranch
    do
        found=false
        for remoteBranch in $CMD_REMOTE_BRANCHES; do
            if [ $remoteBranch = $localBranch ]; then
                found=true
                break
            fi
        done
        [ $found = false ] && git branch -d $localBranch 
    done
}

# List local branches, strip leading spaces and the "* " marking the current branch, then delete the stale ones
git branch | sed -r 's/^\s+//' | sed -r 's/^\* //' | deleteIfNotOnRemote
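Note that the comparison is only as fresh as your last fetch, so before running the script it is probably a good idea to update and prune your remote-tracking branches:

git fetch --prune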

Wednesday, March 16, 2016

Port Forwarding with iptables (NAT)

I am running a server at home, and it does all kinds of things. It is placed between the modem and a switch where workstations and wireless access points plug into. This is because the server also acts as a firewall.

If you also want to do this, you probably won't have a bare modem, but an all-in-one box acting as a modem, router, DHCP server, DNS proxy, WiFi access point, simple firewall, etc.
As your internal network won't be accessible from the modem, you will need to run a DHCP server yourself (or delegate that to your access point if you only have wireless clients). You also can't use the box as a WiFi AP anymore, so you'll need a dedicated AP, which is better anyway. As for the DNS proxy, you don't strictly need one, but you can use Bind for that: it will cache entries for you and, best thing in the world, act as an internal DNS server so you can refer to your machines with meaningful names (such as printer.myhome.lan).
You will need to figure out how to make the box pass all traffic to your firewall. It is okay to leave its security features on (dropping malformed IP packets, etc.).

By the way, I said "DNS proxy" so that you understand the idea, but what usually happens is known as DNS faking. The firewall/server rewrites packets addressed to itself on the DNS port (UDP 53) so that they go to the IP address of the actual DNS server. By the end of this article, you should be able to figure out how to do that. The principle is exactly the same as NAT, except that it is used in the other direction: we let the LAN clients think the service they are talking to is on their network, while we actually forward the packets to another host. The only difference is that in this case the clients could actually have been routed to the DNS server directly, but that doesn't matter.

There are pros and cons to putting a home server acting as a firewall between the modem and the rest of the network, but in any case, if you are operating a network seriously, there is no way you let computers on the network use UPnP to open ports so that Internet traffic comes straight to them. That would be a security hole.
However, there are times when you actually want to do that. For instance, you might want to access a service on a Raspberry Pi, or speed up BitTorrent downloads, which involves reaching that machine from the Internet on a predefined port. You could even have port 1000 on the Internet side mapped to port 2345 on the actual machine. I have done that, but I would recommend against it if you can avoid it, simply because it is only going to confuse you. That said, if you have several services listening on the HTTP port and you want to access them all from the Internet, you don't have any other solution than this or a VPN (but with a VPN you aren't really accessing them "from the Internet" per se...).

Let's assume a workstation with the address 192.168.1.100 wants to be reachable from the Internet on TCP port 6881.

It is also safe to assume that you are currently forwarding traffic (see net.ipv4.ip_forward in sysctl, and the appropriate rule in the FORWARD chain) from the LAN-facing interface (here eth1) to the Internet-facing interface (here eth0), otherwise your LAN could not access the Internet.
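For reference, a minimal sketch of that prerequisite, with the same interface names as in this post (run as root):

# let the kernel route packets between interfaces (add to /etc/sysctl.conf to make it permanent)
sysctl -w net.ipv4.ip_forward=1
# let new connections from the LAN (eth1) out to the Internet (eth0)
iptables -A FORWARD -i eth1 -o eth0 -m conntrack --ctstate NEW -j ACCEPT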
And if you are doing this, you should also be masquerading already (see below).
Here's the way to do it:

# iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

What you shouldn't be doing is allowing all traffic to cross your firewall from the outside to the inside. Even if you did, if the outside network is the Internet, that would not work, since Internet routers don't route LAN addresses, technically called "private" addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). However we are going to route some well-chosen traffic in that direction.

Note that all commands should be run with sudo or as root. If you run the commands as a regular UNIX user, you are going to encounter something such as

-bash: iptables: command not found                                  

Just prepend sudo to the command, and you will be fine.

First off, let's check what is currently in the filter table (-t filter is actually the default, you can omit it):

# iptables -vnL -t filter

Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  85M   44G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
[ other rules ]

Chain FORWARD (policy DROP 324 packets, 17404 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 165M  132G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
1076K  106M ACCEPT     all  --  eth1   eth0    192.168.1.0/24      0.0.0.0/0            ctstate NEW
[ other rules ]

Chain OUTPUT (policy DROP 527K packets, 279M bytes)
 pkts bytes target     prot opt in     out     source               destination         
[ other rules ]

Let's look at the chain policies and the rules in the FORWARD chain. Your firewall is useless if the default policy is to accept everything. Instead of blacklisting (which is what you end up doing if a chain's policy is ACCEPT), we are going to whitelist some traffic flows (the policy is set to DROP).

The second rule in the FORWARD chain (eth1 to eth0, ctstate NEW) shows that machines in the LAN can open connections to the Internet. The source parameter is only required if you somehow happen to have several networks on the LAN side and you want to allow only one of them.

Now we will be allowing traffic in the other direction, on TCP port 6881, which will eventually reach 192.168.1.100:

# iptables -A FORWARD -d 192.168.1.100 -i eth0 -o eth1 -p tcp --dport 6881 -m conntrack --ctstate NEW -j ACCEPT

and this is how the filter table is going to look after that:

# iptables -vnL

Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  85M   44G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
[ other rules ]

Chain FORWARD (policy DROP 324 packets, 17404 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 165M  132G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
  633 37579 ACCEPT     tcp  --  eth0   eth1    0.0.0.0/0            192.168.1.100       ctstate NEW tcp dpt:6881
[ other rules ]

Chain OUTPUT (policy DROP 527K packets, 279M bytes)
 pkts bytes target     prot opt in     out     source               destination         
[ other rules ]


You can leave out the "-m conntrack --ctstate NEW" part. If you look at my INPUT and FORWARD chains, you will see that the first rule accepts any packet, whatever it is, as long as it belongs to or is related to a connection that was already permitted. For that scheme to make sense, the rule for port 6881 should only match new connections. What this achieves is blocking TCP packets coming out of nowhere: if someone wants to transmit packets over a TCP connection, they have to establish that connection first.

You might be curious and ask "Aren't the packets coming from the Internet destined to the public IP address? They will never match the rule!"
I understand the concern. The reason this works is that by the time the FORWARD chain is evaluated, the PREROUTING chain has already been processed, so the IP packets have already been rewritten to point to their final destination (or to the next gateway, which might be doing NAT itself).

In the previous paragraph, I said the packets are "rewritten". This rewriting is Network Address Translation (NAT): the firewall manipulates the addresses in IP packets. The flavor used on the outgoing interface (POSTROUTING) is called "masquerading", while what we are about to set up on incoming packets is destination NAT (DNAT).

Now let's enable that PREROUTING rule:

# iptables -t nat -A PREROUTING -p tcp --dport 6881 -j DNAT --to 192.168.1.100:6881

Let's see what the NAT table looks like:

# iptables -vnL -t nat

Chain PREROUTING (policy ACCEPT 45817 packets, 6266K bytes)
 pkts bytes target     prot opt in     out     source               destination     
 1845  109K DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:6881 to:192.168.1.100:6881

Chain INPUT (policy ACCEPT 23486 packets, 2920K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 25162 packets, 4378K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 3217 packets, 208K bytes)
 pkts bytes target     prot opt in     out     source               destination         
2189K  274M MASQUERADE  all  --  *      eth0    0.0.0.0/0            0.0.0.0/0 

Voilà! You should be up and running now!
You can test it with CanYouSeeMe.org

If it doesn't work, first make sure you can access the service from your LAN, then check that your tables look like my examples, that you didn't mix the interfaces up, that your modem lets everything through, and so on.
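One last caveat: rules added with the iptables command live in memory only and are lost at reboot. A sketch of one way to persist them, assuming a Debian/Ubuntu system with the iptables-persistent package installed:

iptables-save > /etc/iptables/rules.v4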

Sunday, March 6, 2016

Lambdas in Java

Of the features introduced with Java 8, lambdas are without a doubt the most popular and the first that comes to mind. Let's see how they work in practice.

Syntax

We are going to write a lambda function (also known simply as "lambda") that concatenates two strings.

Here are the equivalent forms, from the most verbose to the most concise.

The equivalent anonymous class:

new Concatenator() {
  public String concatenate(String a, String b) {
    return a + b;
  }
}

You can actually use this in place of the lambda; all of these forms behave the same from the caller's point of view. IntelliJ will tell you that you can simplify this expression using a lambda.

(String a, String b) -> { return a + b; }

The long and most descriptive lambda version. Don't forget the semi-colon.

(a, b) -> { return a + b; }

The parameter types can be inferred by the compiler.

(a, b) -> a + b

A lambda is usually there to return a value, so if you can write your method as a single return statement, you can simplify it like this. You will need curly braces and "return" if more statements are needed.

(a) -> a.toLowerCase()

You can simplify further this way if only one parameter is required (this one-argument example obviously no longer matches Concatenator).

a -> a.toLowerCase()

And even further: with a single parameter, the parentheses are optional.

String::toLowerCase

It doesn't get shorter than this. Isn't that prettier and more concise than writing the equivalent anonymous class?

Writing a method that takes a lambda as parameter

Now, let's say you want users of your method to provide a lambda for you to use.
As shown above, lambdas in Java are essentially syntactic sugar over a single-method interface. You only need to think in terms of some interface having a method you can call.

First define that interface:
Interface for lambdas
@FunctionalInterface
public interface Concatenator {
    String concatenate(String a, String b);
}

@FunctionalInterface is optional; it documents the intent and makes the compiler check that the interface has exactly one abstract method.

And here is the method that uses a lambda as parameter:
Method that uses a lambda
private void methodUsingLambda(Concatenator concatenator) {
    System.out.println(concatenator.concatenate("James ", "Bond"));
}

That's it!

Functional interfaces

It would be silly to write our own interfaces all the time. There are a bunch of them in the package java.util.function that we can use, such as:

Interface            Parameter(s)    Return value
Supplier<R>          -               R
BooleanSupplier      -               boolean
IntSupplier          -               int
UnaryOperator<T>     T               T
Predicate<P>         P               boolean
BiPredicate<P,T>     P, T            boolean
Consumer<P>          P               -
BiConsumer<P,T>      P, T            -
Function<P,R>        P               R
BiFunction<P,T,R>    P, T            R
Runnable             -               -

Note that Runnable is in the java.lang package. It is also annotated as a functional interface. I guess they already had it there, so they didn't make a new version in java.util.function.

Thou shalt not trust Xerox scanners and photocopiers / PDF compression

Abstract:
Xerox photocopiers/scanners use an unreliable compression algorithm that mangles numbers and symbols in documents. A patch was released by the company but even if fixed, the damage has been done.

Xerox, a Fortune 500 company, was founded in 1906 in Rochester, NY, USA. It directly employs more than 140,000 people [1], and there are thousands of companies selling and maintaining Xerox products around the world.

Xerox machines are so widely available that many people use the brand as a verb, saying they "xeroxed a document" the same way some would say they "googled someone".

If there is one person to thank for discovering this problem and helping to get it fixed, it's certainly David Kriesel, a computer scientist from the University of Bonn.

On his blog and in the talk he gave at FrOSCon [2], David explains what he went through with the company to get the problem fixed, and how it affects an uncountable number of people and companies.

The problem lies in the use of JBIG2, a compression algorithm designed to reduce the size of typeset documents (i.e. basically everything that is not handwritten).
What JBIG2 does is look at the document and extract its symbols, as if you drew rectangles around all letters and digits. These rectangles are then compared, and if two of them look alike, the system decides they depict the same symbol. When compressing, it then replaces all occurrences of that symbol with a single version.
You would think that is smart, right? Well, the problem is how "similar" is defined: if the bar is set too low, and/or the quality is bad, and/or the resolution is low, then 6's tend to look like 8's or even B's, as you can see here:

All images in this document, source: D. Kriesel [2]
This is an extract of some figures from a cash register, where the right column is sorted in ascending order. The problem was then easy to spot. But most of the time, such errors are hard to notice.
Sure, small mistakes on a cash register receipt are not that bad, right? Well, now imagine these were not dollars but milligrams of a drug a hospital patient needs to take, blood test results, or data used by the military or on oil rigs (cases reported by a Xerox vice-president). Suddenly, this becomes very serious.

The first time David saw the problem was on a scan of construction plans, where one room was reported as having the same size as a smaller one next to it. Here is the original blueprint:


Here's a scan of this document by a Xerox photocopier/scanner:

The areas of the rooms ("Zimmer") have all been changed to 14.13 square meters.

After much discussion between Xerox and David, and this issue having made newspapers all around the world, Xerox finally made a software patch available. So you would think, problem solved, let's continue with our lives. Unfortunately it ain't that simple.

The 2000s saw companies going paperless, which involved heavy use of scanning. And I am not talking about your usual slow flatbed scanner at home; I am talking about heavy-duty scanners processing one page per second, on both sides, 10 hours a day. The software bug was introduced in 2006 and fixed in 2014. Not only have most machines not been updated, but we are left with millions of documents, scanned over more than 8 years, that have no legal value and that we cannot trust (even if they seem to contain no character mangling).

Why haven't the machines been updated? First, Xerox doesn't have a list of end users: it relies on a decentralized network of partners that install and maintain the machines, so Xerox can't contact its users directly. Second, it is not in these partners' interest to visit all their customers for free to update the software, supposing they even know about the issue. Finally, these companies might not even know how to update the software.

The JBIG2 algorithm is considered so unreliable that Swiss and German authorities prohibit it for documents scanned by their offices.

How to tell if a PDF was compressed using JBIG2?

On Linux or OS X:
strings MyDocument.pdf | grep /Filter

Example output:
/Filter /FlateDecode

Supposing you have a bunch of PDF documents you want to check, here is a way to print the name of those that use JBIG2 compression:

for i in *.pdf; do
    if strings "$i" | grep /Filter | grep -q JBIG2; then
        echo "$i"
    fi
done

If you use Windows, use Cygwin, Babun or find a way to open the PDF as a text file, and then look for /Filter (note the leading slash).

Many documents will report using something known as the DCTDecode filter. This indicates JPEG compression. See table below.

What compression algorithms should we use then?

Supposing you have a choice of compression algorithm (which is not the case if you are using a standalone scanner that produces PDF files out of the box), JBIG2 is certainly a big no-no.

Here are a few common compression algorithms [3,4]:
  • JPEG, obviously. Especially suited for images, but getting old and definitely lossy
    (a lossless version exists, but it is virtually never implemented or used)
  • JPEG2000, more modern alternative to JPEG, lossy.
  • Flate (a.k.a. "deflate"), uses Huffman coding and Lempel-Ziv (LZ77) compression, can be used on images, but is excellent if the document is not a scan, lossless.
  • CCITT Group 4, for monochrome documents and images, used extensively for fax machines and known as G4, lossless.
  • RLE (Run Length Encoding) best for images containing large areas of white and black, lossless
There is no single algorithm better than all the others. It depends on the type of PDF document, and most of the time PDF documents use several of them at the same time: text and symbols are usually deflated, and the embedded pictures were originally JPEG.

Flate/JPEG2000 is a good combination for anything with colors or shades of gray, while Flate/G4 is good for monochrome images.

For your information, here is the list of all the supported algorithms (a.k.a. "filters" in Adobe PDF terminology), as per [5] p. 67:



[1] http://www.xerox.com/annual-report-2014/index.html
[2] http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
[3] http://www.prepressure.com/pdf/basics/compression
[4] http://www.verypdf.com/pdfinfoeditor/compression.htm
[5] PDF Reference, 6th edition, Adobe Systems Inc. https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Wednesday, February 24, 2016

Count lines in source files

Following up on the last article, here's how to count the total number of lines in your source files:

find **/src/* -regex ".*\.\(ts\|java\|less\)" -print0 | xargs -0 wc -l | tail -1

ZSH, which I'm using, supports **. Bash can do it too if you enable the globstar option (shopt -s globstar). ** is equivalent to */*, */*/*, etc.
What this find command does after finding these filenames is print them to standard output, but instead of separating them with a newline character, it uses NUL, or \0 (the null character).

xargs then reads this. We tell it too that the filenames are separated by NUL (with the -0 option), and then we give it the command to execute on the files, wc -l. wc stands for word count, but it is also capable of counting characters and lines; when we only want lines, we use the -l argument (l as in lucky). As we are not interested in seeing how many lines each individual file contains, we "select" only the last line of the output with tail, because that last line contains the total.

Here is an alternative way to do the same thing:

find **/src/* -regex ".*\.\(ts\|java\|less\)" -exec wc -l {} \; | awk '{print $1;}' | paste -s -d+ | bc

In this case we use find's built-in feature to execute a command on each file it finds, with the -exec argument. It is described in the find manual. Basically, everything up to the semi-colon is the command to run; the semi-colon (like any character that would otherwise be interpreted by the shell instead of being passed to the command) needs to be escaped. {} is replaced by the filename (the path is relative to the current directory).


wc prints the number of lines, a space, and then the filename. We use awk to print the first "word" it finds (by default, awk supposes "words" are separated by one or more spaces). I am not going to explain awk syntax, but it is a very useful tool. In fact we could have used it to do the rest of the job.

paste is a handy tool too, yet most people have never heard of it and try to replicate its features with complicated shell scripts... In this example it is going to transform this:

1
2
3

into this:

1+2+3

bc, which I believe stands for "basic calculator", will be used to compute the sum.


Now, there are tools that define what "lines of code" are and count them with more context than just raw text. But here, let's just ignore blank lines:

find **/src/* -regex ".*\.\(ts\|java\|less\)" -printf "sed '/^$/d' %p | wc -l\n" | sh | awk '{print $1;}' | paste -s -d+ | bc

sed, with these arguments, deletes empty lines (the regex describes "a line that begins and ends with nothing in between").
Unfortunately find does not support pipes ( | ) in the -exec argument, so we use a little trick.
That trick is to print commands to standard output, and then to interpret them with the shell, as if these commands were part of a shell script.

Find files by extension with the terminal

Today I was trying to find source files in a development project. These files have the extensions .java, .ts and .less.
I realized I never needed to do such a search before, and here's how I did it:

find . -regex ".*\.\(ts\|java\|less\)"
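If you find -regex hard to remember, here is an equivalent sketch using -name patterns (as far as I can tell it matches the same files):

find . \( -name '*.ts' -o -name '*.java' -o -name '*.less' \)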

Let me know if you find something better.