Wednesday, May 14, 2008

Canonical Strings, or, why I like Python

I needed a quick and easy function to map strings into a canonical form. In this case, punctuation, upper/lower case, and word order are not important. For example, "!$%!@$!@!This!?! is... a test" should compare equal to "a test this is". Less than a minute later, I was good to go with:
import re

re_punctuation = re.compile(
    r"[`~!@#\$%\^&\*\(\)\-_\+={\[}\]\\|;:\'\",<\.>/\?]")

def GetCanonical(input):
    canonical = re_punctuation.sub(" ", input.lower()).split()
    canonical.sort()
    return ' '.join(canonical)

GetCanonical("This is a test") == GetCanonical("a test this is")  # True
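Here is a minimal sketch of where a canonical form like this tends to be handy: as a dictionary key for grouping strings that should count as duplicates. It is an illustration, not the code from the post. The names `get_canonical` and `group_by_canonical` are hypothetical, and it assumes `string.punctuation` can stand in for the hand-typed character class (it covers the same ASCII punctuation characters).

```python
import re
import string
from collections import defaultdict

# Assumption: string.punctuation replaces the hand-typed character class.
# re.escape() makes every character safe to place inside the [...] set.
re_punctuation = re.compile("[" + re.escape(string.punctuation) + "]")

def get_canonical(text):
    """Lowercase, turn punctuation into spaces, then sort the words."""
    words = re_punctuation.sub(" ", text.lower()).split()
    return " ".join(sorted(words))

def group_by_canonical(strings):
    """Hypothetical helper: bucket strings that share a canonical form."""
    groups = defaultdict(list)
    for s in strings:
        groups[get_canonical(s)].append(s)
    return dict(groups)

groups = group_by_canonical([
    "!$%!@$!@!This!?! is... a test",
    "a test this is",
    "something else",
])
```

With these three inputs, the first two strings land in the same bucket under the key "a is test this", and "something else" gets a bucket of its own.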
