Fast and Memory-Efficient Regular Expression
Matching for Deep Packet Inspection
Fang Yu
UC Berkeley
fyu@eecs.berkeley.edu
Zhifeng Chen
Google Inc.
zhifengc@google.com
Yanlei Diao
University of Massachusetts,
Amherst
yanlei@cs.umass.edu
T. V. Lakshman
Bell Laboratories, Lucent Technologies
lakshman@research.bell-labs.com
Randy H. Katz
UC Berkeley
randy@eecs.berkeley.edu
ABSTRACT
Packet content scanning at high speed has become ex-
tremely important due to its applications in network secu-
rity, network monitoring, HTTP load balancing, etc. In
content scanning, the packet payload is compared against a
set of patterns specified as regular expressions. In this pa-
per, we first show that memory requirements using tradi-
tional methods are prohibitively high for many patterns
used in packet scanning applications. We then propose
regular expression rewrite techniques that can effectively
reduce memory usage. Further, we develop a grouping
scheme that can strategically compile a set of regular ex-
pressions into several engines, resulting in remarkable im-
provement of regular expression matching speed without
much increase in memory usage. We implement a new
DFA-based packet scanner using the above techniques. Our
experimental results using real-world traffic and patterns
show that our implementation achieves a factor of 12 to 42
performance improvement over a commonly used DFA-
based scanner. Compared to the state-of-the-art NFA-based implementation, our DFA-based packet scanner achieves a 50 to 700 times speedup.
Categories and Subject Descriptors
C.2.0 [Computer Communication Networks]: General –
Security and protection (e.g., firewalls)
General Terms
Algorithms, Design, Security.
Keywords
Regular expressions, DFA, intrusion detection, deep packet in-
spection.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
ANCS'06, December 3–5, 2006, San Jose, California, USA.
Copyright 2006 ACM 1-59593-580-0/06/0012...$5.00.
1. INTRODUCTION
Packet content scanning (also known as Layer-7 filtering or pay-
load scanning) is crucial to network security and network moni-
toring applications. In these applications, the payload of packets
in a traffic stream is matched against a given set of patterns to
identify specific classes of applications, viruses, protocol defini-
tions, etc.
Currently, regular expressions are replacing explicit string
patterns as the pattern matching language of choice in packet
scanning applications. Their widespread use is due to their ex-
pressive power and flexibility for describing useful patterns. For
example, in the Linux Application Protocol Classifier (L7-filter)
[1], all protocol identifiers are expressed as regular expressions.
Similarly, the Snort [2] intrusion detection system has evolved
from no regular expressions in its ruleset in April 2003 to 1131
out of 4867 rules using regular expressions as of February 2006.
Another intrusion detection system, Bro [3], also uses regular
expressions as its pattern language.
As regular expressions gain widespread adoption for packet
content scanning, it is imperative that regular expression matching
over the packet payload keep up with the line-speed packet header
processing. Unfortunately, this requirement cannot be met in
many existing payload scanning implementations. For example,
when all 70 protocol filters are enabled in the Linux L7-filter [1], we found that the system throughput drops to less than 10 Mbps, well below current LAN speeds. Moreover, over 90% of
the CPU time is spent in regular expression matching, leaving
little time for other intrusion detection or monitoring functions.
On the other hand, although many schemes for fast string matching [4-11] have been developed recently for intrusion detection systems, they focus on explicit string patterns only and cannot be easily extended to fast regular expression matching.
The inefficiency in regular expression matching is largely due
to the fact that the current solutions are not optimized for the fol-
lowing three unique complex features of regular expressions used
in network packet scanning applications.
• First, many such patterns use multiple wildcard metachar-
acters (e.g., ‘.’, ‘*’). For example, the pattern for identifying
the Internet radio protocol, “membername.*session.*player”,
has two wildcard fragments “.*”. Some patterns even contain
over ten such wildcard fragments. As regular expressions are
converted into state machines for pattern matching, large
numbers of wildcards can cause the corresponding Determi-
nistic Finite Automaton (DFA) to grow exponentially.
• Second, a majority of the wildcards are used with length
restrictions (‘?’, ‘+’). As we shall show later in the paper,
such length restrictions can increase the resource needs for
expression matching.
• Third, groups of characters are also commonly used: for
example, the pattern for matching the ftp protocol,
“^220[\x09-\x0d -~]*ftp”, contains a class (inside the brack-
ets) that includes all the printing characters and space charac-
ters. The class of characters may intersect with other classes
or wildcards. Such interaction can result in a highly complex
state machine.
To the best of our knowledge, there has not been any detailed
study of optimizations for these kinds of regular expressions as
they are so specific to network packet scanning applications. In
this paper, we address this gap by analyzing these regular expres-
sions and developing memory-efficient DFA-based solutions for
high speed processing. Specifically, we make the following con-
tributions:
• We analyze the computational and storage cost of building
individual DFAs for matching regular expressions, and iden-
tify the structural characteristics of the regular expressions in
networking applications that lead to exponential growth of
DFAs, as presented in Section 3.2.
• Based on the above analysis, we propose two rewrite rules
for specific regular expressions in Section 3.3. The rewritten
rules can dramatically reduce the size of resulting DFAs,
making them small enough to fit in memory. We prove that
the patterns after rewriting are equivalent to the original ones
for detecting non-overlapping patterns. While we do not claim to handle all possible cases of dramatic DFA growth (in fact, the worst case cannot be improved), our rewrite rules
do cover those patterns present in common payload scanning
rulesets like Snort and Bro, thus making fast DFA-based pat-
tern matching feasible for today’s payload scanning applica-
tions.
• We further develop techniques to intelligently combine mul-
tiple DFAs into a small number of groups to improve the
matching speed in Section 4, while avoiding the exponential
growth in the number of states in memory.
We demonstrate the effectiveness of our rewriting and group-
ing solutions through a detailed performance analysis using real-
world payload scanning pattern sets. As the results show, our
DFA-based implementation can increase the regular expression
matching speed on the order of 50 to 700 times over the NFA-
based implementation used in the Linux L7-filter and Snort sys-
tem. It also achieves a 12 to 42 times speedup over a commonly used DFA-based parser. The pattern matching speed can reach
gigabit rates for certain pattern sets. This is significant for im-
plementing fast regular expression matching of the packet payload
using network processors or general-purpose processors, as the
ability to classify traffic more quickly and efficiently enables many new
technologies like real-time worm detection, content lookup in
overlay networks, fine-grained load balancing, etc.
2. PROBLEM STATEMENT
In this section, we first discuss regular expressions used in packet
payload scanning applications, then present the possible solutions
for regular expression matching, and finally define the specific
problem that we address in this paper.
2.1 Regular Expression Patterns
A regular expression describes a set of strings without enumerat-
ing them explicitly. Table 1 lists the common features of regular
expression patterns used in packet payload scanning. For exam-
ple, consider a regular expression from the Linux L7-filter [1] for
detecting Yahoo traffic:
“^(ymsg|ypns|yhoo).?.?.?.?.?.?.?[lwt].*\xc0\x80”. This pattern matches any packet payload that starts with ymsg, ypns, or yhoo, followed by seven or fewer arbitrary characters, then a letter l, w, or t, then some arbitrary characters, and finally the two characters with hexadecimal values c0 and 80.
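As a quick sanity check, this pattern behaves as described when run with a conventional backtracking engine; the sample payload below is our own fabrication:

```python
import re

# The L7-filter Yahoo pattern, compiled as a bytes regex so that the
# \xc0\x80 trailer is matched as raw byte values.
yahoo = re.compile(rb"^(ymsg|ypns|yhoo).?.?.?.?.?.?.?[lwt].*\xc0\x80")

payload = b"ymsg123w some yahoo messenger data \xc0\x80"
print(yahoo.search(payload) is not None)   # True: ymsg, then 'w' within 7 chars
print(yahoo.search(b"http get") is None)   # True: unrelated payload, no match
```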
Table 2 compares the regular expressions used in two net-
working applications, Snort and the Linux L7-filter, against those
used in emerging Extensible Markup Language (XML) filtering
applications [12, 13] where regular expressions are matched over
text documents encoded in XML. We notice three main differ-
ences: (1) While both types of applications use wildcards (‘.’, ‘?’,
‘+’, ‘*’), the patterns for packet scanning applications contain
larger numbers of them in each pattern; (2) classes of characters
(“[]”) are used only in packet scanning applications; (3) a high
percentage of patterns in packet payload scanning applications
have length restrictions on some of the classes or wildcards, while
such length restrictions usually do not occur in XML filtering.
This shows that, compared to XML filtering applications, network packet scanning applications face additional challenges. These challenges lead to a significant increase in the complexity of regular expression matching, as we shall show later in this paper.
Table 1. Features of Regular Expressions

Syntax  Meaning                              Example
^       Pattern to be matched at the         ^AB means the input starts with AB. A pattern
        start of the input                   without '^', e.g., AB, can be matched anywhere
                                             in the input.
|       OR relationship                      A|B denotes A or B.
.       A single-character wildcard
?       A quantifier denoting zero or one    A? denotes A or an empty string.
*       A quantifier denoting zero or more   A* means an arbitrary number of As.
{}      Repetition                           A{100} denotes 100 As.
[ ]     A class of characters                [lwt] denotes a letter l, w, or t.
[^ ]    Anything but                         [^\n] denotes any character except \n.
Table 2. Comparison of regular expressions in networking applications against those in XML filtering

                                                   Snort    L7-filter   XML filtering
# of regular expressions analyzed                  1555     70          1,000-100,000
% of patterns starting with "^"                    74.4%    72.8%       ≥80%
% of patterns with wildcards ".", "+", "?", "*"    74.9%    75.7%       50%-100%
Average # of wildcards per pattern                 4.65     7.03        1-2
% of patterns with class "[ ]"                     31.6%    52.8%       0
Average # of classes per pattern                   7.97     4.78        0
% of patterns with length restrictions on
classes or wildcards                               56.3%    21.4%       ≈0
2.2 Solution Space for Regular Expression
Matching
Finite automata are a natural formalism for regular expressions.
There are two main categories: Deterministic Finite Automaton
(DFA) and Nondeterministic Finite Automaton (NFA). In this
section, we survey existing solutions using these automata.
A DFA consists of a finite set of input symbols, denoted as ∑, a finite set of states, and a transition function δ [14]. In networking applications, ∑ contains the 2^8 = 256 symbols of the extended ASCII code. Among the states, there is a single start state q0 and a
set of accepting states. The transition function δ takes a state and
an input symbol as arguments and returns a state. A key feature of
DFA is that at any time there is only one active state in the DFA.
An NFA works similarly to a DFA except that the δ function
maps from a state and a symbol to a set of new states. Therefore,
multiple states can be active simultaneously in an NFA.
A theoretical worst-case study [14] shows that a single regular expression of length n can be expressed as an NFA with O(n) states. When the NFA is converted into a DFA, it may generate O(∑^n) states. The processing complexity for each character in the input is O(1) in a DFA, but is O(n^2) for an NFA when all n states are active at the same time.
To handle m regular expressions, two choices are possible:
processing them individually in m automata, or compiling them
into a single automaton. The former is used in Snort [2] and Linux
L7-filter [1]. The latter is proposed in recent studies [12, 13] so
that the single composite NFA can support shared matching of
common prefixes of those expressions. Despite the demonstrated
performance gains over using m separate NFAs, in practice this
approach experiences large numbers of active states, and has the same worst-case complexity as the sum of m separate NFAs.
Therefore, this approach on a serial processor can be slow, as
given any input character, each active state must be serially exam-
ined to obtain new states.
In DFA-based systems, compiling m regular expressions into a composite DFA provides a guaranteed performance benefit over running m individual DFAs. Specifically, a composite DFA reduces the processing cost from O(m) (O(1) per automaton) to O(1), i.e., a single lookup obtains the next state for any given character. However, the number of states in the composite automaton grows to O(∑^nm) in the theoretical worst case. In fact,
we will show in Section 4 that typical patterns in packet payload
scanning applications indeed interact with each other and can
cause the creation of an exponential number of states in the com-
posite DFA.
Table 3. Worst-case comparisons of DFA and NFA

        One regular expression of length n    m regular expressions compiled together
        Processing       Storage              Processing       Storage
        complexity       cost                 complexity       cost
NFA     O(n^2)           O(n)                 O(n^2 m)         O(nm)
DFA     O(1)             O(∑^n)               O(1)             O(∑^nm)
There is a middle ground between DFA and NFA called the lazy DFA. Lazy DFAs are designed to reduce the memory consumption of conventional DFAs [12, 15]: a lazy DFA keeps in memory the subset of the DFA that matches the most common strings; for un-
common strings, it extends the subset from the corresponding
NFA at runtime. As such, a lazy DFA is usually much smaller
than the corresponding fully-compiled DFA and provides good
performance for common input strings. The Bro intrusion detection system [3] adopts this approach. However, malicious senders can
easily construct packets that keep the system busy and slow down
the matching process.
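The lazy-DFA idea can be sketched in a few lines: DFA states are sets of NFA states, and each transition is computed from the NFA and cached only when traffic actually exercises it. The toy NFA below (for the pattern ab*) and the class itself are our own illustration, not Bro's implementation:

```python
class LazyDFA:
    """Minimal lazy-DFA sketch: DFA states are frozensets of NFA states,
    and each transition is expanded from the NFA and cached the first
    time an input actually exercises it."""

    def __init__(self, nfa_step, start):
        self.nfa_step = nfa_step      # (frozenset, char) -> frozenset
        self.start = frozenset(start)
        self.cache = {}               # (dfa_state, char) -> dfa_state

    def step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:     # expand from the NFA at runtime
            self.cache[key] = self.nfa_step(state, ch)
        return self.cache[key]

    def accepts(self, text, accepting):
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return bool(state & accepting)

# Toy NFA for the pattern ab*: state 0 -a-> 1, state 1 -b-> 1.
def toy_step(state, ch):
    out = set()
    for q in state:
        if q == 0 and ch == "a":
            out.add(1)
        elif q == 1 and ch == "b":
            out.add(1)
    return frozenset(out)

dfa = LazyDFA(toy_step, {0})
print(dfa.accepts("abb", {1}))  # True
print(len(dfa.cache))           # 2: only the transitions the input took
```

The attack mentioned above corresponds to feeding inputs that keep forcing cache misses, so the system repeatedly pays the NFA-expansion cost.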
Field Programmable Gate Arrays (FPGAs) provide a high
degree of parallelism and thus can be used to speed up the regular
expression matching process. There are existing FPGA solutions
that build circuits based on DFAs [16] or NFAs [17-19]. Recently, Brodie et al. presented a new architecture for matching regular expressions using FPGAs [28]. In this architecture, the finite state machine consumes multiple bytes at a time to achieve very high throughput. These approaches are promising if the extra FPGA
hardware can be embedded in the packet processors. FPGAs,
however, are not available in many applications; in such situa-
tions, a network processor or general-purpose CPU-based imple-
mentation may be more desirable.
2.3 Problem statement
In this paper, we seek a fast and memory-efficient solution to
regular expression matching for packet payload scanning. We
define the scope of the problem as follows:
• We consider DFA-based approaches in this paper, as NFA-
based approaches are inefficient on serial processors or proc-
essors with limited parallelism (e.g., multi-core CPUs in
comparison to FPGAs). Our goal is to achieve O(1) compu-
tation cost for each incoming character, which cannot be ac-
complished by any existing DFA-based solutions due to their
excessive memory usage. Thus, the focus of the study is to
reduce memory overhead of DFA while approaching the op-
timal processing speed of O(1) per character.
• We focus on general-purpose processor-based architectures
and explore the limits of regular expression matching in this
environment. Wherever appropriate, we leverage the trend of
multi-core processors that are becoming prevalent in those
architectures. Nevertheless, our results can be used in FPGA-
based and ASIC-based approaches as well [20].
It is worth noting that there are two sources of memory usage in DFAs: states and transitions. The number of transitions is linear in the number of states, because each state has at most 2^8 = 256 (one per extended-ASCII character) outgoing links to next states. Therefore, we consider the number of states (in the minimized DFA) as the primary factor determining memory usage in the rest of the paper. Recently, Kumar et al. extended our work and proposed algorithms for efficient link compression [20].
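Since each state carries up to 256 outgoing transitions, the memory footprint of an uncompressed transition table follows directly from the state count. A back-of-the-envelope sketch (the 4-byte next-state entry is our assumption, not a figure from the paper):

```python
def dfa_table_bytes(num_states, alphabet_size=256, entry_bytes=4):
    """Size of a flat, uncompressed DFA transition table: one next-state
    entry per (state, input symbol) pair."""
    return num_states * alphabet_size * entry_bytes

# e.g. a 10,000-state DFA already needs about 10 MB before any compression
print(dfa_table_bytes(10_000))  # 10240000 bytes, roughly 9.8 MiB
```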
3. MATCHING OF INDIVIDUAL
PATTERNS
In this section, we present our solution to matching individual
regular expression patterns. The main technical challenge is to
create DFAs that can fit in memory, thus making a fast DFA-
based approach feasible. We first define a few concepts key to
DFA construction in the context of packet payload scanning in
Section 3.1. We then analyze the size of DFAs for typical payload
scanning patterns in Section 3.2. Although theoretical analyses
[12, 14] have shown that DFAs are subject to exponential blow-
up, here, we identify specific structures that can lead to exponen-
tial growth of DFAs. Based on the insights from this analysis, in
Section 3.3, we propose pattern rewrite techniques that explore
the possibility of trading off exhaustive pattern matching (which
real-world applications often allow) for memory efficiency. Fi-
nally, we offer guidelines to pattern writers on how to write pat-
terns amenable to efficient implementation in Section 3.4.
3.1 Design Considerations
Although regular expressions and automata theory can be directly applied to packet payload scanning, there is a notable difference between the two settings. Most existing studies of regular expressions focus on a specific type of evaluation: checking whether a fixed-length string belongs to the language that a regular expression defines. More specifically, a fixed-length string is said to be in the language of a regular expression if the string is matched from start to end by a DFA corresponding to that regular expression. In
contrast, in packet payload scanning, a regular expression pattern
can be matched by the entire input or specific substrings of the
input. Without a priori knowledge of the starting and ending posi-
tions of those substrings, DFAs created for recognizing all sub-
string matches can be highly complex.
For a better understanding, we next present a few concepts
pertaining to the completeness of matching results and the DFA
execution model for substring matching.
Completeness of matching results
Given a regular expression pattern and an input string, a complete
set of results contains all substrings of the input that the pattern
can possibly match. For example, given a pattern ab* and an input
abbb, three possible matches can be reported, ab, abb, and abbb.
We call this style of matching Exhaustive Matching. It is formally
defined as below:
Exhaustive Matching: Consider the matching process M as a
function from a pattern P and a string S to a power set of S,
such that, M(P, S) = {substring S' of S| S' is accepted by the
DFA of P}.
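A naive brute-force realization of M(P, S) makes the definition concrete (this is illustrative only, not the matching algorithm developed in this paper):

```python
import re

def exhaustive_match(pattern, s):
    """Brute-force M(P, S): the set of all substrings of S accepted by
    the pattern P. Quadratic in |S| per anchor -- shown only to make
    the definition of exhaustive matching concrete."""
    p = re.compile(pattern)
    return {s[i:j]
            for i in range(len(s))
            for j in range(i + 1, len(s) + 1)
            if p.fullmatch(s[i:j])}

print(sorted(exhaustive_match("ab*", "abbb")))
# ['a', 'ab', 'abb', 'abbb'] -- the lone 'a' also qualifies, since b* may be empty
```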
In practice, it is expensive and often unnecessary to report all
matching substrings, as most applications can be satisfied by a