The Robustness of Google CAPTCHAs
Ahmad S El Ahmad, Jeff Yan, Mohamad Tayara
School of Computer Science
Newcastle University, UK
{Ahmad.Salah-El-Ahmad, Jeff.Yan, Mohomad.Tayara}@ncl.ac.uk
May 8, 2011
ABSTRACT
We report a novel attack on two CAPTCHAs that have been
widely deployed on the Internet, one being Google's home design
and the other acquired by Google (i.e. reCAPTCHA). With a
minor change, our attack program also works well on the latest
ReCAPTCHA version, which uses a new defence mechanism that
was unknown to us when we designed our attack. This suggests
that our attack works in a fundamental level. Our attack appears
to be applicable to a whole family of text CAPTCHAs that build
on top of the popular segmentation-resistant mechanism of
"crowding character together" for security. Next, we propose a
novel framework that guides the application of our well-tested
security engineering methodology for evaluating CAPTCHA
robustness, and we propose a new general principle for
CAPTCHA design.
Categories and Subject Descriptors
D.4.6 Security and Protection, H.1.2 User/Machine Systems.
General Terms
Security, Human factors
Keywords
CAPTCHA, robustness, segmentation attack
1. INTRODUCTION
A CAPTCHA (Completely Automated Public Turing Test to Tell
Computers and Humans Apart), also known as a human
interaction proof, is a program that generates and grades tests that
are human solvable but intended to be beyond the capabilities of
current computer programs [1]. CAPTCHA makes use of a hard,
open problem in AI, and is now a standard technology to defend
against undesirable or malicious computer bot programs. The
most widely deployed CAPTCHAs are text-based schemes, which
require a user to recognize distorted characters, a task that state-
of-the-art AI programs supposedly cannot do.
CAPTCHAs’ robustness is the strength of their resistance to AI
programs written to automatically solve CAPTCHA tests. It is
well known that the robustness of a text CAPTCHA should rely
on the difficulty of finding where each character is rather than
what it is. The rationale is the following. Computers perform
better than humans in recognizing individual characters, even
under severe distortion [6]. However, locating individual
characters in the right order (i.e. segmentation) is in general a
computationally expensive and combinatorially hard problem for
computers. Therefore, a text CAPTCHA should be segmentation-
resistant; if a scheme is vulnerable to a segmentation attack, then
it is effectively broken. A commonly accepted goal for
CAPTCHA design is that automated attacks should not be more
than 0.01% successful but that the human success rate should be
at least 90%.
Various segmentation-resistant mechanisms have been proposed
(some representative ones are shown in Figure 1), but many of
them were broken, including those developed by major companies
such as Microsoft, Yahoo and Megaupload [3, 10]. A
segmentation-resistant mechanism known as “crowding character
together” or CCT, initially adopted by Google to let adjacent
characters touch or overlap with each other at a random
intersection (see Figure 1 for illustrations), was shown to be more
resistant to known attacks [3]. This mechanism has been used by
Google until now, and it has gained wide popularity and been in
use by a large number of different text CAPTCHAs.
Figure 1. CAPTCHAs designed by Microsoft, Yahoo,
Megauplad and Google (clock wise), each using a different
segmentation-resistant mechanism.
This paper first answers an open problem that has intrigued the
CAPTCHA community for years: Is the CCT mechanism
vulnerable to novel attacks? Our analysis of a Google’s
CAPTCHA, as currently deployed by Gmail, Blogger.com and
BlogSpot, suggests that a simple but novel attack can break its
CCT mechanism. This is to date the most effective attack against
this scheme (For convenience, we will refer to this CAPTCHA as
the Google scheme in this paper).
Our attack is also applicable to ReCAPTCHA, a scheme that is
widely used by millions of users of Facebook, Twitter and other
Internet services on a daily basis (used by over 100,000 websites),
and that was recently acquired by Google. Our attack was
implemented for a ReCAPTCHA version that was active in May
2010. To our surprise, our attack, without any significant change,
also works well on the latest ReCAPTCHA version, which uses a
new defence mechanism that was unknown to us when we
designed our attack. This suggests that our attack works in a
fundamental level.
Our attack exploits shape patterns found in individual characters
(such as the “loop” shape in characters “a” and “p”, and the cross
shape in characters ‘f’ and ‘t’), as well as connection patterns of
adjacent characters. The key insight is that these features and
patterns largely stay invariant under various distortions and
transformations applied by the CAPTCHA generators, and can be
exploited to segment connected characters.
Given the CCT mechanism’s popularity, the impact of our attack
is beyond the Google scheme and ReCAPTCHA. We urge the
designers of other captcha schemes that rely on the CCT
mechanism to carefully check how vulnerable their designs are to
our attack or its variants.
In the recent years, we have been working to establish a novel
security engineering approach to the evaluation of CAPTCHA
robustness through a series of work, including [3, 4, 7, 10] and
this paper. In contrast to a parallel approach developed primarily
in the computer vision, document analysis and pattern recognition
communities, where sophisticated objection recognition
algorithms are often the design goal, our approach applies
adversarial thinking skills searching for and exploiting
vulnerabilities hidden in CAPTCHAs.
In this paper, we will summarise our approach, and for the first
time provide a detailed framework that classifies CAPTCHA
vulnerabilities into major categories. This framework provides
key insights on security vulnerabilities that text CAPTCHAs
should account for. It can be used to examine a CAPTCHA’s
robustness, as well as to guide the design of next generation
CAPTCHAs. We will also summarise general principles for the
design of robust text CAPTCHAs. We not only revisit established
principles, but also propose a new one.
The rest of this paper is organized as follows. Section 2 discusses
related work. Section 3 presents our attack on the Google scheme,
and Section 4 shows that a variant of our attack works on
ReCAPTCHA. Section 5 discusses further applicability of our
attack and its defence. Section 6 introduces our framework for
understanding Captcha vulnerabilities, and Section 7 focuses on
general principles for CAPTCHA design.
2. RELATED WORK
CAPTCHA has been an active field in the research communities.
Due to space limit, we only review the literature that is most
relevant to this paper. Kurt’s thesis [18] provided a good review
of CAPTCHA research.
Chellapillas’ et al [5] attacked a number of early CAPTCHAs
using machine learning algorithms, and they achieved 4.89%
success on an early version of Google’s CAPTCHA (around year
2004. In 2007, we developed a technique that exploited gaps
(whit e space) between characters in Google’s CAPTCHA at the
time, and our attack achieved 12% success [3]. It was reported in
[14] that some spammers succeeded in breaking the Google
CAPTCHA using two compromised zombie hosts, with each host
using a variation of their attack. This attack claimed a success rate
of 20%, yet no technical details have been revealed. An attack on
ReCAPTCHA using a combination of image processing and OCR
techniques was reported with a success rate of 17.5% [13]. At
DEFCON’18, Houck presented another attack on ReCAPTCHA;
using both character segmentation and character template
matching technique, his attack achieved 10% success on an early
version of ReCAPTCHA and 31% success on a more recent
version [8].
Other notable attacks on (image recognition) CAPTCHAs include
[3, 12].
3. A SEGMENTATION ATTACK on the
GOOGLE SCHEME
In this section, we first review key features of the Google scheme,
and then present a novel segmentation attack.
Google have tweaked their CAPTCHA from time to time in order
to improve its robustness. Figure 2(a) shows a version of the
Google scheme that is not user-friendly. For example, it is
difficult for users to tell between “d” and “cl”, “ri” and “n”, “rn”
and “m”, and “w” and “vv”. We did not choose to work on this
version of Google CAPTCHA, because even human cannot
decode them properly with a high accuracy. For such captchas,
users’ complaint will soon become a major issue or even push
them offline, just as we have seen it again and again. Figure 2(b)
shows another version of Google CAPTCHA, which still adopts
the CCT mechanism but is easier for humans to solve than the
other version, and therefore has been deployed most of the time in
our experience. We have chosen to develop our attack based on
the second version.
(a)
(b)
Figure 2. Google CAPTCHA. (a) a user-unfriendly version;
(b) a usable version.
We studied a hundred samples that was randomly collected from
the Internet1, and observed the following features.
• Each challenge uses only two colours: Red, Green, or Blue
for the challenge text, and White for the background.
• CCT is the main segmentation-resistant mechanism so that
characters connect or touch with one another horizontally.
• Global warping is applied to add randomized distortion.
• The thickness of characters varies much; even different
portions of the same character differ in thickness with each
other. This is a powerful feature for defending against
potential segmentation attacks.
• Random text strings are used; a text string’s length varies
between 5 and 8 characters; only lower-case letters are used.
• Multiple font typefaces and styles (such as bold, italics and
regular) are used.
The key insight behind our attack is that although the Google
CAPTCHA uses different font typefaces and styles, and uses
heavy distortions, shape patterns in some characters remain
invariant, e.g. the “loop” in characters “a” and “p”, and the cross
shape in ‘f’ and ‘t’. Our main idea is to first detect distinctive
shape patterns, identify their associated characters, and then cut
out these characters. It turns out that once these detectable
1 All samples we used for this paper were randomly collected
from Google email registration page and Blogger.com, where
the CAPTCHA was used whenever one attempted to create an
account or post a comment.
characters are cut out, most often a whole challenge text is also
properly segmented already.
Our attack includes the following sequential steps:
• Pre-processing – A set of standard techniques is applied to
prepare each challenge image.
• Pattern based detection of characters – A set of algorithms is
used to locate characters with their detectable shape patterns.
• Character segmentation – A set of rules and heuristics is used
to segment detected characters.
3.1 Preprocessing
We first perform image up-sampling, which enlarges the image,
increases its pixel details, and thus smoothes the embedded text.
Then, we convert the up-sampled image into a black-and-white
image. This binarising process is done via the standard
thresholding method: all the pixels with a color value above a
heuristically predetermined threshold is converted to black and
those bellow it to white. Next, we apply Zhangs’ algorithm [16] to
thin the image. Thinning is the process of identifying a binary
image’s skeleton. Figure 3 shows the output of our preprocessing
on a sample image.
The main advantages of using thinning in our attack are:
1)Thinning normalizes the character thickness, which varied very
much in a random way, to a uniformed one-pixel thickness. This
greatly simplifies our attack, as it does not have to account for
different character thicknesses. 2) It is much faster to process a
thinned image than an un-thinned one, as the former contains far
fewer pixels that matter. Clearly thinning does not segment
characters – connected characters stay connected after thinning.
Figure 3. After preprocessing (the image is not in its actual
size due to space concerns; the original image is in Figure
2(b))
3.2 Pattern Based Characters Detection
In this step, we aim to detect categories of characters in a thinned
image using shape patterns found in their generic shape. We
identified four shape patterns and each belongs to a category of
characters as follows.
• Dot shape: “i” and “j”.
• Loop shape: “a”, “b”, “d”, “e”, “g”, “o”, “p”, “q”.
• Cross shape: “t” and “f”.
• S shape or “S Vertical Histogram”: Characters that contain
three vertically juxtaposed lines, the character “s”
Detecting Characters that contains a Dot Shape. We first use
“Color Filling” segmentation on the foreground components (on
the thinned version of an image). A foreground component
contains a single character, a group of connected characters, or a
part of a character, and has a black color. “Color Filling”
Segmentation or CFS is effectively like using a distinct color to
flood each component, and it could be used to segment against
any color, so we call it “Color Filling” segmentation, more about
CFS in [3] . In Figure 4, each foreground component is segmented
by CFS and is identified by its unique color.
Figure 4. Segmentation of connecting component using CFS.
Each component is indentified by a different color.
Characters such as “i” and “j” consist of a dot and a body part
underneath the dot. We detect both parts as follows:
a) To detect the dot part: We use its relatively small pixel
count (i.e., pixel count is the number of pixels in an
component).
b) To detect the body part: we apply a series of steps as
follows:
First, using the position of the dot, we locate all foreground
components underneath it. In Figure 5 (a), only the component
“spi” is positioned underneath the dot.
Second, we use a modified version of CFS (flood up and down
only) to ignore all parts of the components with a horizontal
orientation. Figure 5 (b) shows the remaining components having
a vertical orientation.
Third, for each of the remaining components, we calculate the
equation of a line representing its orientation (components far
from the dot are ignored). In this case, only two lines are plotted
as shown in figure 5 (c).
Finally, the component with the line path closest to the dot
component is considered as the body part of either “i” or “j”.
Figure 5 (d) shows the only remaining component; in this case the
body part of “i”.
(a) (b)
(c) (d)
Figure 5. Detecting characters with a dot shape: (a) a
connected component located under the dot .(b) components
with a vertical orientation. (c) Plotting the line equation for
the remaining thinned components underneath the dot. (c)
The body part of “i”.
Detecting Characters with a Loop Component. We adopted a
loop detection method that we used in [3], and which involves
two steps as follows:
First, CFS is applied to the background color (i.e., white) of the
image. As shown in Figure 6 (a), background components are
now segmented and identified by different colors. Second, the
above step return two types of loop components, “character
loops” and “connection loops”. Connection loops are created as a
result of the crowding of characters together. Characteristics of
connection loops are: 1) a relatively small pixel count, or 2) a
relatively large pixel count if vertically overlapping and in close
proximity with character loop(s). We developed heuristics based
on the pixel count and the relative position of loops to detect and
remove connection loops. Figure 6 (b) shows a different example
containing a “connection loop” before and after removal.
(a)
(b)
Figure 6. Detection of characters with a loop shape. (a) CFS
on the background color is used for loop detection. (b) An
example of a connection loop before and after removal.
Detecting Characters with a Cross. A unique characteristic of a
cross shape is having four sides; upper, lower, left and right sides.
We observed that drawing an imaginary box around the cross
shape must intersect with the box once from each side, with each
intersection representing one of the cross shape four sides.
We detect the cross as follows. a) We traverse the image using
the imaginary box, and if each of the four sides of the box
intersects with one and only one foreground colored pixel, then
the box position is labeled as a possible cross shape component.
After that, we shift the box position and continue searching for
other cross shapes, until the entire image is traversed. b) We filter
through all the possible cross shapes, and we keep only those
satisfying these conditions. First, the position of the cross shape is
in the upper side of its foreground component. Second, all the
foreground pixels covered by the box area are connected with
each other (we used CFS to verify this condition), this condition
is needed as all the pixels in a valid cross shape are connected
with each other Finally, the cross shape must not overlap
vertically with a loop shape. In Figure 7, the red box indicates the
imaginary box and thus the location of a cross shape.
Figure 7. Detecting characters with a cross shape (the
detected cross shapes are highlighted by a red box).
S Vertical Histogram. The unique shape characteristic of the
character “s” is that it contains three vertically overlapping
strokes in its shape. We detect it as follows. First, we map the
image against a vertical histogram that represents the total number
of foreground pixels in each column. Then we ignore all parts of
the histogram that intersects with other character shapes (this is
done to insure no false detection of characters having three
vertically overlapping strokes, such as “a” or “e”). Second, we
search the histogram for consecutive occurrence of columns with
the value of three or more pixels in each column; we call such
occurrence of columns as the “s-span”. Finally, if an “s-span” has
a width larger that 25 pixels (a threshold for the character “s”
minimum width; i.e., the component under analysis has a width
large enough to contain an “s”), then we use the s-span’s left-most
and right-most columns as a reference to draw a bounding box
around the characters “s”. Figure 8 shows the histogram
(magnified by a factor of 4) and the “s” character bounding box.
Figure 8. S Vertical Histogram. Identification of the
character “s” location, as highlighted by a bounding box.
3.3 Segmentation
In this step, we cut out characters that have a shape pattern
detected in the previous step. We use the examples used in the
previous section to show how we separate ‘i’ from ‘sp’, and how
to split ‘sp’, ‘ut’ and ‘ws’ – four examples illustrate how to
segment a character with a dot, a character with a loop, a
character with a cross, and a character with “s” shape,
respectively.
We first convert the detected shape pattern’s color to white (i.e.
the image background color). This effectively hides the detected
shape, and breaks connected characters into separate components,
as shown in Figure 9. Note: for a character with dot, the detected
shape includes the dot, and the vertical part of the body.
All visible components in Figure 9 can be classified into two
types. The first type belongs to only one character, and we call
them private components. For exam