Breaking Aljazeera’s CAPTCHA

I was on Aljazeera Arabic's website the other day and, as I was voting on a poll, was presented with the following screen:

The CAPTCHA in the screen above immediately caught my attention. The distortions in it seemed very simple: the text was not warped in any way, and there was no overlap between the characters.

The following is a URL for one of the CAPTCHAs:

http://www.aljazeera.net/Sitevote/SiteServices/Contrlos/SecureCAPTCHA/GenerateImage.aspx?Code=EANmyyXghpajFhOX6rCRKQ==&Length=4

Opening the URL above and refreshing the page a few times gives the following CAPTCHAs:







The dashed grey lines are randomized, while the letters in the CAPTCHAs above are static. The letters are encoded in the Code parameter in the URL. Notice that each character has two forms: a straight form and another that is slightly rotated.

Aljazeera's CAPTCHA can easily be broken by doing the following:

  1. Removing the dashed grey lines
  2. Finding the characters in the image
  3. Extracting the characters from the image
  4. Classifying each character

I'll be using Octave/Matlab for the above tasks and will be explaining my algorithm using the following CAPTCHA as an example.


I'll first begin by loading the CAPTCHA in Octave/Matlab using the line of code below. I have already saved the CAPTCHA image in a file called captcha.jpeg.

N = imread('captcha.jpeg');

N is a matrix where each element in the matrix corresponds to the color of the pixel at the corresponding location in the image. For example, the matrix below is a submatrix of N corresponding to the letter 'J' in the image. If you look closely, you can almost see the letter 'J' in the matrix.

255  249  249  255  255  255  255  255  255  255  255  255 
189  255  246    7  251  255  249  255  255  246  249  255 
170   76   11    0  253  255  255  250  245  255  255  245 
254   24   59   60  255  233  255  255  255  255  230  255 
246    1    9   96  170  255  238  255  247  255  246  255 
255    2    0    9   60  251  240  255  255  255  255  250 
238  255    2    0    0  206  154  251  241  245  255  254 
255  253    5    0    9   13  193  128  246  255  253  255 
255  243    0    4    0    0    9  241  255  247  255  255 
245  251  255    0    0    0   10    0  255  255  246  255 
255  255  233   21    0    7    0    2    0  255  245  255 
255  246  255  231   11    6    0    5   11    1  245  243 
255  253  252  255  248    0   15    0    0    0  255  255 
255  255  240  243  255  254  250   11    8    0    0  248 
238  251  255  255  255  250  255  241    0    0   10  255 
255  252  255  238  244  255  255  240  255   18    0  255 
255  246  255  253  255  242  251  255  244    0   12    0   
255  251  255  249  254  241  255  253  255    0   17    0   
255  255  244  255  255  246  255  250  246    6    0    9   
255  243  255  249  255  253  255  243    8   11   12  254 
251  250  255  249  245  233  255   10    1    0    0  255 
246  255  249  251    8   15    0    0    0    0  255  243 
246  255    0   11    0    5    1    0  255  255  237  255 
254    0    5    4   13    0  244  255  250  255  255  248 
255  255  255  247  249  255  255  244  255  247  255  255

A value of 255 corresponds to a white pixel, while 0 corresponds to a black pixel. Values strictly between 0 and 255 correspond to shades of grey.

1. Removing the dashed grey lines

The grey lines serve as a distortion to make the image harder for a computer to read. We can remove all shades of grey using the following line of code:

N = N > 100;

The line of code above sets every element brighter than 100 to one and every darker element to zero. The matrix becomes a representation of a binary image where a value of one corresponds to white and a value of zero corresponds to black. The following image is what we see after executing the line of code above. Notice how all shades of grey have been removed, leaving only the black letters.
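For readers without Octave handy, the same thresholding step can be sketched in Python with NumPy. The 5x5 array below is a made-up miniature "image" standing in for the real CAPTCHA; actual code would load captcha.jpeg with an image library first.

```python
import numpy as np

# Made-up miniature image: 0 = black letter pixels,
# values around 128-140 = grey distortion lines,
# 250-255 = white background.
N = np.array([
    [255, 250, 128, 255, 255],
    [255,   0, 130, 255, 255],
    [255,   0, 135, 250, 255],
    [255,   0,   5, 250, 255],
    [255, 255, 140, 255, 255],
])

# Everything brighter than the threshold becomes white (1),
# everything darker becomes black (0). The grey lines are
# wiped out along with the noise; only the letters survive.
binary = (N > 100).astype(int)
print(binary)
```

Note how the grey values (128-140) end up on the white side of the threshold, which is exactly why the dashed lines vanish.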

2. Finding the characters in the image

I can take advantage of the fact that the characters are well-separated with whitespace in order to identify each character in the image.

First, the columns containing black pixels are identified using the code below.

  % find location of dark colours (rows and columns)
  [r, c] = find(N==0);

  % find out the columns that contain black pixels
  c = unique(c);

c is a vector that now contains the column numbers, in sorted order, that contain black pixels.

Second, cluster the columns that are close together. Each cluster of columns is then considered a character. Here, I am declaring two columns to be "close together" if they are no more than three pixels apart. The line of code below does exactly that.

  edges = [0; find(diff(c) > 3); length(c)];

find(diff(c) > 3) returns the indices where consecutive black columns are more than three pixels apart (i.e. the cluster boundaries), while 0 and length(c) mark the outer boundaries.
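The same clustering-and-slicing idea can be sketched in Python with NumPy. The binary image below is a toy example with two fabricated "characters" of black pixels, not the real CAPTCHA:

```python
import numpy as np

# Toy binary image: 1 = white, 0 = black. Two "characters"
# occupy columns 1-2 and 7-8, separated by more than three
# white columns.
N = np.ones((5, 10), dtype=int)
N[1:4, 1:3] = 0
N[1:4, 7:9] = 0

rows, cols = np.nonzero(N == 0)   # like find(N == 0)
c = np.unique(cols)               # sorted columns containing black pixels

# Cluster boundaries: positions where the gap between
# consecutive black columns exceeds three pixels.
gaps = np.nonzero(np.diff(c) > 3)[0]
edges = np.concatenate(([0], gaps + 1, [len(c)]))

# Slice each cluster of columns out of the image.
chars = [N[:, c[edges[i]]:c[edges[i + 1] - 1] + 1]
         for i in range(len(edges) - 1)]
print([ch.shape for ch in chars])  # → [(5, 2), (5, 2)]
```

The index arithmetic differs slightly from the Octave version because NumPy is zero-based and slices exclude their upper bound, but the technique is the same.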

3. Extracting the characters from the image

The submatrix corresponding to each cluster in the image corresponds to a character. Running the following loop extracts each of the four characters from the image.

  for i = 1:4
    char = N(:, c(edges(i) + 1):c(edges(i + 1)));
    % Classify the character here...
  end

The following are the extracted characters:

4. Classifying each character

Now that each character's image is separate, all that remains is to identify which character is in the image. As I mentioned earlier, each of the 26 characters has two forms, a straight form and a rotated form. This brings the total number of characters to identify to 52.

Straight form of a character:

Rotated form of a character:

Classifying a character is slightly less straightforward than the previous sections since there are some minor differences in the shape. Take the two characters below as an example; they are the same character, but there are some slight differences in each image.

So how can they be identified as the same? A very naive approach to solving this problem that seems to work very well is the following:

  • For each of the 52 characters, compute the number of white pixels in each column.

The computations from the task above form our dataset. This can be calculated by executing the following command for each character:

sum(char_image);

Where char_image above is the matrix of the character whose column-wise white-pixel counts we want. The command works because each white pixel in the matrix has the value one while the other elements (i.e. black pixels) are zero, so summing each column counts its white pixels.
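In Python terms, the column sum is just sum over axis 0. The 4x3 binary "character" below is made up for illustration:

```python
import numpy as np

# Hypothetical 4x3 binary character image (1 = white, 0 = black).
char_image = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
])

# Column-wise white-pixel counts: the feature vector compared
# against the dataset. Octave's sum() sums columns by default;
# in NumPy that is axis=0.
features = char_image.sum(axis=0)
print(features)  # → [3 1 4]
```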

  • Given an image of a character, compute the number of white pixels in its columns and compare it to that of the 52 characters in the dataset. The character in the dataset with the smallest difference in white pixels is the most likely match.

  % The 52 characters in our dataset
  chars = ['L' 'I' ... characters of the dataset ... ];

  % Sum of white pixels in each column for each of 
  % the characters above.
  m{1} = [40 28 24 21 28 33 38 39 40 40 40 39 38 39 40];
  m{2} = [...]
  ... rest of the data set ...

  % Compute the number of white pixels in each columns for the 
  % image we're trying to classify.
  char_image = sum(char_image);

  % Assign a score for how different the character we're trying
  % to classify is from a character in the dataset. The lower the
  % score, the more likely the characters match.
  score = 100;
  char = '-';

  % Compare each character in our dataset to the character
  % we're trying to classify
  for i = 1:length(m)
    candidate_char = m{i};

    % Only compare characters that are the same width
    if length(candidate_char) == length(char_image)
      candidate_score = max(abs(char_image-candidate_char));
      if candidate_score < score  % Did we find a better match?
        score = candidate_score;
        char = chars(i);
      end
    end
  end

  % char has the solution at this point
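The matching loop translates almost line for line into Python. The dataset below is a toy stand-in with made-up feature vectors, not the real 52-character dataset:

```python
import numpy as np

# Toy dataset: per-column white-pixel counts for a few known
# characters (fabricated numbers for illustration only).
chars = ['L', 'J', 'X']
m = [np.array([40, 28, 24, 21, 28]),
     np.array([38, 12, 30, 35, 36]),
     np.array([20, 25, 30])]        # a narrower character

# Feature vector of the character we're trying to classify.
char_image = np.array([39, 13, 29, 35, 36])

best_score, best_char = 100, '-'
for candidate, label in zip(m, chars):
    # Only compare characters of the same width.
    if len(candidate) == len(char_image):
        # Worst per-column disagreement, as in the Octave code.
        score = np.max(np.abs(char_image - candidate))
        if score < best_score:      # did we find a better match?
            best_score, best_char = score, label

print(best_char)  # → J
```

Using the maximum per-column difference (rather than, say, the sum) means a single badly mismatched column is enough to reject a candidate, which suits clean, static glyphs like these.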

I tested the algorithm on roughly 100 CAPTCHAs and the success rate was 100%.

For completeness, I used the bash script below to fetch CAPTCHAs directly from Aljazeera's website.

VOTING_URL='http://www.aljazeera.net/Portal/KServices/supportPages/vote/SecureVote.aspx'
CAPTCHA_CODE=$(wget -qO- "$VOTING_URL" | grep '?Code=' | sed "s/.*?Code=\(.*\)&.*/\1/")

echo "Captcha code: $CAPTCHA_CODE"

CAPTCHA_URL="http://www.aljazeera.net/Portal/KServices/Controles/SecureCAPTCHA/GenerateImage.aspx?Code=$CAPTCHA_CODE&Length=4"
CAPTCHA_FILE=$(echo $CAPTCHA_CODE | sed "s,/,_,g")

wget "$CAPTCHA_URL" -O "$CAPTCHA_FILE.jpeg"

If you're interested, you can download the entire dataset and source code of the algorithm here. You will need either Octave or Matlab to run the scripts.

It's really a shame that a well-known news network like Aljazeera would use such a silly security measure. Anyway, happy new year!
