Using OCR To Create Captcha Bypasser

Posted by expertslogin on May 24th, 2013

How to use a free Optical Character Recognition (OCR) tool to create a simple and efficient Captcha Bypasser.

WHAT IS AN OCR?

An OCR (Optical Character Recognition) program converts images of text (for example pixmap or PCX files) into text characters. Our goal is to use an OCR program as the back end for our simple Captcha Bypasser. GOCR, GNU Optical Character Recognition, is a free and open-source solution that suits us. Note that GOCR can give unexpected results with non-Latin alphabets. Install it with apt-get on Debian-based systems or with yum on Red Hat-based systems:
sudo apt-get install gocr
yum install gocr
 
IMAGE FILES

To generate satisfactory output, we need image processing to prepare the images (e.g. intensify colors, remove undesired lines and dots) and so facilitate character recognition. We will work with simple captchas that need image processing only to change the colorspace: if the image is colored, the script gives GOCR a grayscale copy, which is preferable.
Our images will contain ordinary uppercase characters from the Latin alphabet, in regular and/or bold weight (not italic), set in a traditional font (e.g. Arial, Times New Roman or Verdana).


 
An example of colored captcha.



 
An example of a captcha already in grayscale

We will use ImageMagick to process the images. You can use it to handle complex images and generate the best possible input, but our Captcha Bypasser will only create a grayscale copy.

sudo apt-get install imagemagick
yum install imagemagick


To create a grayscale copy of the image we will create a function:

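# ---- Applying grayscale
grayscale ()
{
     # Keep the original path and point $img at a new temporary copy
     source=$img
     id=$(date +%N)
     img="$temp_dir"/img_$id.jpg
     # Convert to grayscale and reduce noise
     convert "$source" -type Grayscale -despeckle -enhance "$img"
     convert "$img" +level-colors black, "$img"
}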
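Whether an image needs the grayscale copy can also be checked automatically. The following is only a sketch: the helper name is_grayscale is hypothetical, and it assumes that ImageMagick's identify reports a colorspace name containing "Gray" for grayscale files, which may not hold for a color-mode image that merely looks gray.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): succeed if identify reports a gray colorspace.
is_grayscale ()
{
     local colorspace
     # %[colorspace] prints the colorspace name of the image
     colorspace=$(identify -format '%[colorspace]' "$1") || return 2
     case $colorspace in
         *Gray*) return 0 ;;
         *)      return 1 ;;
     esac
}
```

It could be used as: is_grayscale "$img" || grayscale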

THE PROCESS

We will create a temporary folder to save the grayscale copy of the image while running the script:
# ---- Creating temporary directory
making_env ()
{
     dd=$(date +%N)
     temp_dir=decaptcha_temp_$dd
     mkdir "$temp_dir"
}
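As an alternative sketch, the directory can be created with mktemp and removed automatically when the script ends. The function name making_env_safe and the use of $TMPDIR are assumptions for illustration; they are not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical variant of making_env): unique directory plus cleanup.
making_env_safe ()
{
     # mktemp -d creates a uniquely named directory and prints its path
     temp_dir=$(mktemp -d "${TMPDIR:-/tmp}/decaptcha_temp_XXXXXX") || return 1
     # Remove the directory and its contents when the script exits
     trap 'rm -rf "$temp_dir"' EXIT
}
```

Call it once near the top of the script, before grayscale, so that $temp_dir is set.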

You can use GOCR as below:
gocr [OPTION] [-i] pnm-file
I recommend reading the manual page to understand the options and improve your shell scripts. The options we will use here are:
gocr -l 70 -C "A-Z" -i "$img"
-l level
Set grey level to level (0<160<=255, default: 0 for autodetect), darker pixels belong to characters; brighter pixels are interpreted as background of the input image.
-C string
Only recognize characters from string, this is a filter function in cases where the interest is only to a part of the character alphabet, you can use 0-9 or a-z to specify ranges, use - to detect the minus sign.
-i file
Read input from file (or stdin if file is a single dash).

If the text in the image uses several gray levels, it can be hard to discern every character with a single threshold. We will therefore use three gray levels: the standard one (autodetect), 70 and 85. You can use more, or even all of them.
gocr -C "A-Z" -i "$img"        # standard level: 0
gocr -l 70 -C "A-Z" -i "$img"
gocr -l 85 -C "A-Z" -i "$img"

I settled on 70 and 85 after testing many levels and checking the results, but the script should allow these levels to be passed as arguments when needed (see the complete code).
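To compare many levels at once while testing, a loop like the following sketch may help. The function name recognize_levels is hypothetical, and it assumes that passing -l 0 behaves like the default autodetect, as the manual page excerpt above suggests.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): run gocr once per grey level and print
# "level<TAB>result" so that the outputs can be compared by eye.
recognize_levels ()
{
     local image=$1 level
     shift
     for level in "$@"
     do
         printf '%s\t%s\n' "$level" "$(gocr -l "$level" -C "A-Z" -i "$image")"
     done
}
```

For example: recognize_levels "$img" 0 60 70 85 100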
By default, GOCR prints an underscore "_" for each unrecognized character. We will store the results in variables and compare them position by position: if a character in one result is "_", it is replaced by the character at the same position in the other result, and so on.

# ---- "Decaptching"
dcap ()
{
     # Recognize once with the autodetected level and once with level $number
     recog1=$(gocr -C "A-Z" -i "$img")
     recog2=$(gocr -l "$number" -C "A-Z" -i "$img")

     for (( i=0; i<${#recog1}; i++ ))
     do
         # Prefer the fixed-level result; on "_" fall back to the autodetected one
         if [ "${recog2:$i:1}" = "_" ]
         then
             cdecp="$cdecp${recog1:$i:1}"
         else
             cdecp="$cdecp${recog2:$i:1}"
         fi
     done
}
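The merging step can also be tried on its own, without calling gocr. The sketch below uses a hypothetical helper, merge_results, which is not part of the original script; unlike dcap it also walks past the end of the shorter string, which is a design choice of this example.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): merge two recognition results per position.
# A "_" or a missing character in the first string is filled from the second.
merge_results ()
{
     local primary=$1 fallback=$2 merged="" i p f
     local len=${#primary}
     # Walk over the longer of the two strings
     (( ${#fallback} > len )) && len=${#fallback}
     for (( i=0; i<len; i++ ))
     do
         p=${primary:i:1}
         f=${fallback:i:1}
         if [ -z "$p" ] || [ "$p" = "_" ]
         then
             # Keep "_" when the fallback has nothing for this position
             merged+=${f:-_}
         else
             merged+=$p
         fi
     done
     printf '%s\n' "$merged"
}

merge_results "A_CD" "ABXD"
```

Here the second character is unrecognized in the first string, so it is taken from the second one; the other positions keep the first string's characters.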

We will call our functions grayscale and/or dcap based on command-line arguments:
decaptcha INPUT [OPTIONS]

These options are:

-c colored
     Required if the image is colored.
-l level
     Changes the standard grayscale levels.
     Must be followed by at least one and at most two numbers.
# ---- Parsing arguments
img="$1"
shift
level1=70
level2=85

while [ $# -gt 0 ]
do
     case $1 in
         -c)
             # Colored image: work on a grayscale copy
             # (grayscale expects $temp_dir to have been set by making_env)
             grayscale
             shift
         ;;
         -l)
             shift
             # One or two numeric levels may follow -l
             case $1 in
                 [0-9]*) level1="$1"; shift ;;
             esac
             case $1 in
                 [0-9]*) level2="$1"; shift ;;
             esac
         ;;
         *)
             echo "Unknown option: $1" >&2
             exit 1
         ;;
     esac
done

# dcap reads $img and $number and appends its result to $cdecp
number=$level1
dcap
f1=$cdecp; cdecp=""
number=$level2
dcap
f2=$cdecp
After that we will compare the results again to find the correct one:

#######################
# CHECKING...
#######################
for ((i=0; i<${#f1}; i++))
do
     if [ "${f1:$i:1}" == "${f2:$i:1}" ]
     then
         string="$string""${f1:$i:1}"
     elif [ "${f1:$i:1}" != "${f2:$i:1}" ]
     then
         case ${f1:$i:1} in
                 _)
                     string="$string""${f2:$i:1}"
                 ;;
                 *)
                     string="$string""${f1:$i:1}"
                 ;;
         esac
     fi
done
echo "CAPTCHA: $string"

exit 0
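A caller may want to know whether any character is still unresolved. As a sketch, the final echo could be wrapped in a small function that returns a non-zero status when the result is empty or still contains "_". The name report_result is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print the result and signal incomplete reads.
report_result ()
{
     local result=$1
     case $result in
         ""|*_*)
             # Empty, or at least one character was not recognized
             echo "CAPTCHA (incomplete): $result" >&2
             return 1
         ;;
         *)
             echo "CAPTCHA: $result"
         ;;
     esac
}
```

The script could then end with: report_result "$string" || exit 1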

Below is the complete script, which I created under the GPL license:
Some examples:
decaptcha captchas/1captcha.jpg



decaptcha captchas/2captcha.jpg -c



About the Author:

This article was written by Bobbin Zachariah, who is also associated with ExpertsLogIn. He is a passionate lover of Linux and other open-source tools, and started his career in IT and Linux in early 2000. He loves travelling, technology writing, blogging and music, and enjoys the company of friends and family.
