Regular expressions are one of the more cryptic tools developers use in everyday work. Sure, just about anyone understands that ^\d\d$ matches a line containing exactly two digits, but you may need a cheat sheet to figure out a regex like \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b.
If you’re like me, you typically know that you need a regex to match something, and then you go about writing a regex for that pattern. After that, you either build a unit test or use a website to test your regex against some sample data. This is great for positive verification, but how can you be sure that bad data isn’t matched as well?
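To make the "bad data" worry concrete, here is a minimal sketch that uses only .NET's own System.Text.RegularExpressions (not Rex). It runs the simple two-digit regex against a few inputs, including two that most people would not expect to pass. The class and method names are made up for this illustration, and the stricter pattern is just one possible tightening, not a recommendation from the Rex authors.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoDigitCheck
{
    // The two-digit pattern from the opening paragraph, used as-is.
    public static bool IsLooseMatch(string input) =>
        Regex.IsMatch(input, @"^\d\d$");

    // A stricter variant (an assumption of this sketch): [0-9] accepts only
    // ASCII digits, and \z matches only at the very end of the input.
    public static bool IsStrictMatch(string input) =>
        Regex.IsMatch(input, @"^[0-9][0-9]\z");

    public static void Main()
    {
        var samples = new (string Label, string Input)[]
        {
            ("two digits", "42"),
            ("digit and letter", "4a"),
            ("three digits", "123"),
            // In .NET, $ also matches just before a final newline.
            ("two digits plus newline", "42\n"),
            // In .NET, \d matches any Unicode decimal digit by default.
            ("Arabic-Indic digits", "\u0661\u0662"),
        };

        foreach (var (label, input) in samples)
        {
            Console.WriteLine(
                $"{label}: loose={IsLooseMatch(input)}, strict={IsStrictMatch(input)}");
        }
    }
}
```

The last two samples pass the loose pattern and fail the strict one, which is exactly the kind of surprise that positive tests alone never surface.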
Microsoft Research has produced a tool that explores a regex to generate matching values, much like Pex explores a program to generate relevant test inputs. This tool is called Rex (Regular Expression Exploration), and it solves the problem mentioned above. With Rex, you can explore valid values for a given regex and analyze them for potential problems.
I downloaded the 1.0 release of this command-line tool and gave it a shot. It was pretty easy to get started: just type rex.exe <regex> and it will give you a value. If you want more values, add /k:<number> (I have no idea what k stands for; it’s the only option without a full name). You can pass in more than one regex by using a file, and with the /intersect option you can instruct Rex that every generated value must match all of the regexes. Use /? to see more options.
After trying out the regex from the beginning of this article, I discovered that Rex does have a few limitations. Here are the regex constructs it does not support: the anchors \G, \b, and \B; named groups; lookahead; lookbehind; as-few-times-as-possible (lazy) quantifiers; backreferences; conditional alternation; and substitution.
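The IP address regex from the opening paragraph uses \b, which is on that list. One way to sidestep the limitation is to swap the word boundaries for ^ and $. Below is a rough sketch of that rework, checked only with .NET's own Regex class; I have not verified that Rex accepts the reworked pattern, and the class and member names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Ipv4Pattern
{
    // One octet, copied from the regex in the opening paragraph.
    private const string Octet = @"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

    // The original form, delimited by \b word boundaries.
    public const string WithWordBoundaries =
        @"\b(?:" + Octet + @"\.){3}" + Octet + @"\b";

    // The reworked form: ^ and $ replace \b, so it describes whole inputs only.
    public const string Anchored =
        "^(?:" + Octet + @"\.){3}" + Octet + "$";

    public static bool IsWholeAddress(string input) =>
        Regex.IsMatch(input, Anchored);

    public static void Main()
    {
        string[] samples = { "192.168.0.1", "256.1.1.1", "1.2.3.4.5", "ip=10.0.0.7;" };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0}: word boundaries={1}, anchored={2}",
                sample,
                Regex.IsMatch(sample, WithWordBoundaries),
                Regex.IsMatch(sample, Anchored));
        }
    }
}
```

The trade-off is that the anchored version no longer finds an address embedded in longer text, so it is a substitute for validation, not for searching.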
I’m not satisfied with a command-line tool. Luckily, Rex is a .NET application. I created a new Console application and added a reference to Rex.exe. I then added a using directive for the Rex namespace and began typing out some code. Here’s what I came up with.
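// Date pattern found online: month, separator, day, separator, four-digit year.
string regex = @"^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$";
// Ask Rex for five matching values, using the ASCII character encoding.
RexSettings settings = new RexSettings(regex) { k = 5, encoding = CharacterEncoding.ASCII };
var results = RexEngine.GenerateMembers(settings);
foreach (var result in results)
{
    Console.WriteLine(result);
}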
It turns out that you can create an instance of RexEngine and call GenerateMembers on it, but since Rex was meant to be a command-line tool, I decided to use the RexSettings class and pass it to the static version of the GenerateMembers method. I obtained that particular regex from a site claiming that it matched dates. Well, it does, but not consistently: nothing forces the two separators to be the same. As you can see from Rex’s output, it would be better to come up with a different regex, enforce consistency on the separator, or make it match one of several regexes.
12/09-2085
12-31-2089
09 31.2098
12/15-1989
03 31 1989
I vote for consistency on the separator.
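Here is a rough sketch of what enforcing that consistency could look like, again using only .NET's Regex class. The first variant captures the first separator and demands it again with a backreference. Since backreferences are on Rex's unsupported list, the second variant spells out one alternative per separator instead; I have not run either through Rex, and the class and member names are my own invention.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DateSeparatorCheck
{
    // Building blocks copied from the date regex above.
    private const string Month = @"(0[1-9]|1[012])";
    private const string Day = @"(0[1-9]|[12][0-9]|3[01])";
    private const string Year = @"(19|20)\d\d";

    // Variant 1: group 2 captures the first separator, \2 requires it again.
    public static readonly string WithBackreference =
        "^" + Month + @"([- /.])" + Day + @"\2" + Year + "$";

    // Variant 2: one alternative per separator, with no backreference.
    public static readonly string WithAlternation =
        "^(?:" + string.Join("|",
            new[] { "-", " ", "/", "." }.Select(sep =>
                Month + Regex.Escape(sep) + Day + Regex.Escape(sep) + Year)) + ")$";

    public static bool HasConsistentSeparator(string input) =>
        Regex.IsMatch(input, WithBackreference);

    public static void Main()
    {
        // The five values Rex generated for the original regex.
        string[] rexOutput =
        {
            "12/09-2085", "12-31-2089", "09 31.2098", "12/15-1989", "03 31 1989",
        };

        foreach (string value in rexOutput)
        {
            Console.WriteLine("{0}: backreference={1}, alternation={2}",
                value,
                Regex.IsMatch(value, WithBackreference),
                Regex.IsMatch(value, WithAlternation));
        }
    }
}
```

Of the five generated values, only 12-31-2089 and 03 31 1989 survive both variants. Neither variant rejects impossible dates such as September 31, so there is still more tightening to do.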
This is a very useful tool Microsoft Research has invented. My only concern is that unscrupulous people will be using it as well. Then again, they’ve probably had these kinds of generators all along.