22 October 2010


Regular Expression Madness

I just stumbled upon this great blog post about some uncommon uses of regular expressions. RapidMiner also makes a lot of use of those beasts, especially for defining filters, so I thought this post might be interesting to you.

Both examples are taken from the book The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file. This produced 16 words ranging from abstemious to majestious.
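To try the first expression yourself, here is a minimal Python sketch. It is not the book's original command line; the function name `vowels_once_in_order` and the small sample list are made up for illustration, and you would normally feed it the lines of a real dictionary file instead.

```python
import re

# The expression from the post: runs of non-vowels separated by
# a, e, i, o, u in that order, so each vowel appears exactly once.
ALL_VOWELS_IN_ORDER = re.compile(
    r"^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$"
)


def vowels_once_in_order(words):
    """Keep only the words that match the expression (hypothetical helper)."""
    return [w for w in words if ALL_VOWELS_IN_ORDER.match(w)]


# Tiny hand-picked sample; a real run would read a word list from disk.
sample = ["abstemious", "facetious", "education", "sequoia", "regular"]
print(vowels_once_in_order(sample))  # ['abstemious', 'facetious']
```

Note that the expression is case-sensitive, so a capitalized entry would need to be lowercased first.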

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

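then uses it to filter a dictionary file as before, except there is an additional filter stage: the expression alone also matches shorter words, so a second filter is needed to keep only words of at least six letters.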

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.
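The second filter can be sketched in Python as well. This is an illustration under assumptions, not the book's pipeline: the helper name `increasing_letters`, the `min_length` parameter, and the sample words are hypothetical, and the length check stands in for the additional filter stage.

```python
import re
import string

# Build a?b?c?...z? programmatically; it is the same expression as in the post.
# Each letter is optional and may appear at most once, in alphabetical order.
MONOTONIC = re.compile("^" + "".join(c + "?" for c in string.ascii_lowercase) + "$")


def increasing_letters(words, min_length=6):
    """Keep words whose letters strictly increase and that are long enough."""
    return [w for w in words if len(w) >= min_length and MONOTONIC.match(w)]


# Tiny hand-picked sample; a real run would read a word list from disk.
sample = ["almost", "chintz", "bijoux", "biopsy", "abbey", "banana", "ghost"]
print(increasing_letters(sample))  # ['almost', 'chintz', 'bijoux', 'biopsy']
```

Here "ghost" passes the expression but is dropped by the length check, while "abbey" fails the expression itself because each letter may appear only once.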

For an explanation of those expressions, please refer to the original blog post. And have fun creating similar expressions for your next example filter 😉
