Friday, 13 September 2013

Regex Matching Unicode Characters acting oddly with different strings

OK, I am doing a Unicode regex match on some strings.
These are the strings in question:
\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2
I am using this regex to parse out the titles surrounded by Unicode quotes:
regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)
Calling regex.findall() on each string returns
['u2018Mama\\u2019']
and
['u2018Glee\\u2019', 'u2018Arrow\\u2019']
This brings up questions that I couldn't figure out. First, why isn't it
returning \u2018? Where is the initial \?
Second, why is only one title matched in the first string when both are
matched in the second? I can't see what is different. Finally, I replaced
\u2018 and \u2019 with ' and then used this regex:
re.compile("'[^']*'")
It matches both titles in both strings. What is the difference here? What
am I missing in the Unicode regex?
Thank you in advance.
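
A possible explanation, sketched under two assumptions that the question does not confirm: the strings hold the literal six-character text \u2018 and \u2019 (for example, undecoded JSON) rather than real curly quotes, and the code runs on a Python 2 era re module, which reads the unknown escape \u as a plain "u". Under those assumptions the pattern never asks for the backslash, and [^(?!\u2018$)] is a negated character class rather than a lookahead, so it rejects any of the characters ( ? ! u 2 0 1 8 $ ). "Mummy" contains a "u", which would stop the match early, while '[^']*' only excludes the quote itself. The names as_parsed and quoted_title below are hypothetical.

```python
import re

# Assumption: each headline contains the literal text \u2018 / \u2019,
# not the actual curly-quote characters.
headlines = [
    r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director",
    r"\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2",
]

# Assumption: with \u read as a plain "u", the original pattern behaves
# like this one. The class rejects ( ? ! u 2 0 1 8 $ ) one character at a
# time, so the "u" in "Mummy" ends the match before u2019 is reached.
as_parsed = re.compile(r"u2018[^(?!u2018$)]*u2019")

# Hypothetical fix: \\ matches one literal backslash, and (.*?) lazily
# captures everything up to the next closing escape.
quoted_title = re.compile(r"\\u2018(.*?)\\u2019")

observed = [as_parsed.findall(h) for h in headlines]
titles = [quoted_title.findall(h) for h in headlines]

print(observed)
print(titles)  # [['Mummy', 'Mama'], ['Glee', 'Arrow']]
```

If the strings instead contain the real quote characters, a pattern such as u"\u2018([^\u2019]*)\u2019" applied to decoded text would be the closer equivalent of the '[^']*' version.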
