Unicode whitespaces in Python
So I had this pull request that I opened forever ago that is supposed to do some smart checks for whitespace in a Unicode aware manner. Unfortunately this project doesn't use optimistic merging1 as a strategy for changes so the pull requests can remain open for a long time, in this particular case I couldn't really remember what was preventing this moving forward until I revisited the code today.
The main idea is that you sometimes want to strip out whitespaces from text, with ASCII text this is really easy but the difficulty of course is that the text input is in Unicode.
The reason this isn't an absurdly simple fix is that, as of early 2020, there appears to be no in built whitespace character list in Python that has Unicode spaces. The
string.whitespace list really doesn't cut it:
persephone (feature-segmentation)$ python3 Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import string >>> string.whitespace ' \t\n\r\x0b\x0c'
As you can see this is only the ASCII whitespace characters. To get more coverage I made this list of unicode space characters:
UNICODE_WHITESPACE_CHARACTERS = [ "\u0009", # character tabulation "\u000a", # line feed "\u000b", # line tabulation "\u000c", # form feed "\u000d", # carriage return "\u0020", # space "\u0085", # next line "\u00a0", # no-break space "\u1680", # ogham space mark "\u2000", # en quad "\u2001", # em quad "\u2002", # en space "\u2003", # em space "\u2004", # three-per-em space "\u2005", # four-per-em space "\u2006", # six-per-em space "\u2007", # figure space "\u2008", # punctuation space "\u2009", # thin space "\u200A", # hair space "\u2028", # line separator "\u2029", # paragraph separator "\u202f", # narrow no-break space "\u205f", # medium mathematical space "\u3000", # ideographic space ]
Using this should cover more of the Unicode whitespace characters. This post is mostly a note to my future self, but you may find something useful here too