Unicode whitespaces in Python

A long time ago I opened a pull request that is supposed to do some smart, Unicode-aware checks for whitespace. Unfortunately the project doesn't use optimistic merging¹ as its strategy for changes, so pull requests can stay open for a long time. In this particular case I couldn't really remember what was preventing it from moving forward until I revisited the code today.

The main idea is that you sometimes want to strip whitespace out of text. With ASCII text this is really easy, but the difficulty, of course, is that the input text is Unicode. The reason this isn't an absurdly simple fix is that, as of early 2020, there appears to be no built-in whitespace character list in Python that includes the Unicode spaces. The string.whitespace list really doesn't cut it:

persephone (feature-segmentation)$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.whitespace
' \t\n\r\x0b\x0c'

As you can see, these are only the ASCII whitespace characters. To get more coverage I made this list of Unicode space characters:

UNICODE_WHITESPACE_CHARACTERS = [
    "\u0009", # character tabulation
    "\u000a", # line feed
    "\u000b", # line tabulation
    "\u000c", # form feed
    "\u000d", # carriage return
    "\u0020", # space
    "\u0085", # next line
    "\u00a0", # no-break space
    "\u1680", # ogham space mark
    "\u2000", # en quad
    "\u2001", # em quad
    "\u2002", # en space
    "\u2003", # em space
    "\u2004", # three-per-em space
    "\u2005", # four-per-em space
    "\u2006", # six-per-em space
    "\u2007", # figure space
    "\u2008", # punctuation space
    "\u2009", # thin space
    "\u200A", # hair space
    "\u2028", # line separator
    "\u2029", # paragraph separator
    "\u202f", # narrow no-break space
    "\u205f", # medium mathematical space
    "\u3000", # ideographic space
]
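As a rough sketch of how a list like this might be put to use, here are two small helpers built on it. The function names and the module-level helper constants are hypothetical and not taken from the pull request; only the code points come from the list above, repeated in compact form so the snippet runs on its own.

```python
# Same 25 code points as the list above, written compactly so this runs standalone.
UNICODE_WHITESPACE_CHARACTERS = [
    "\u0009", "\u000a", "\u000b", "\u000c", "\u000d", "\u0020", "\u0085",
    "\u00a0", "\u1680", "\u2000", "\u2001", "\u2002", "\u2003", "\u2004",
    "\u2005", "\u2006", "\u2007", "\u2008", "\u2009", "\u200a", "\u2028",
    "\u2029", "\u202f", "\u205f", "\u3000",
]

# Hypothetical helpers: a single string for str.strip and a deletion table
# for str.translate (mapping a code point to None removes that character).
_WHITESPACE = "".join(UNICODE_WHITESPACE_CHARACTERS)
_REMOVE_TABLE = {ord(char): None for char in UNICODE_WHITESPACE_CHARACTERS}


def strip_unicode_whitespace(text: str) -> str:
    """Trim the listed whitespace characters from both ends of text."""
    return text.strip(_WHITESPACE)


def remove_unicode_whitespace(text: str) -> str:
    """Delete every listed whitespace character anywhere in text."""
    return text.translate(_REMOVE_TABLE)


sample = "\u2003 hello\u00a0world \u3000"
print(repr(strip_unicode_whitespace(sample)))   # 'hello\xa0world'
print(repr(remove_unicode_whitespace(sample)))  # 'helloworld'
```

Keeping the list as plain data means the same constant can feed str.strip, str.translate, or a simple membership test, depending on what the calling code needs.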

Using this list should cover more of the Unicode whitespace characters. This post is mostly a note to my future self, but you may find something useful here too.

¹ As time goes on I increasingly think optimistic merging is the better approach for open source projects. There's a great blog post by Pieter Hintjens that gives a good summary of why optimistic merging works better.

Published: Mon 20 April 2020

By Janis Lesinskis

In Software-engineering

Tags: Python strings unicode whitespace persephone