Boin: Regex Matching Unicode Characters acting oddly with different strings

Friday, 13 September 2013

Regex Matching Unicode Characters acting oddly with different strings

I am running a Unicode regex match on some strings.
These are the strings in question.
\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2
I am using this regex to extract the titles surrounded by Unicode quotes.
regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)
Calling regex.findall() on each string returns
['u2018Mama\\u2019']
and
['u2018Glee\\u2019', 'u2018Arrow\\u2019']
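For reference, the two results above can be reproduced with the sketch
below. It rests on two assumptions that the post does not confirm: that the
input holds the literal text \u2018 (a backslash followed by u2018) rather
than the character U+2018, and that the regex engine reads \u as a plain
"u", as the re module did before Python 3.3. The names first, second and
legacy_equivalent are only illustrative.

```python
import re

# Assumption: the input holds the six-character text \u2018,
# not the single character U+2018.
first = r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"
second = (r"\u2018Glee\u2019 Star Grant Gustin to Play The Flash "
          r"in \u2018Arrow\u2019 Season 2")

# Assumption: the engine treats \u as a plain "u" (re before Python 3.3),
# so the pattern from the question is written with those backslashes dropped.
# Inside [...] the characters ( ? ! $ ) are literals, not lookahead syntax,
# so the class rejects any of: ( ? ! u 2 0 1 8 $ )
legacy_equivalent = re.compile(r"u2018[^(?!u2018$)]*u2019")

first_matches = legacy_equivalent.findall(first)
second_matches = legacy_equivalent.findall(second)

print(first_matches)   # ['u2018Mama\\u2019']
print(second_matches)  # ['u2018Glee\\u2019', 'u2018Arrow\\u2019']
```

Under those assumptions, "Mummy" is the only title containing a character
that the bracket expression excludes (the letter u).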
This raises two questions that I couldn't figure out. First, why doesn't
the result include the leading backslash of \u2018?
Second, why does the first string yield only one match while the second
yields two? I can't see what is different between them. Finally, I
replaced \u2018 and \u2019 with ' and used this regex:
re.compile("'[^']*'")
It matches both titles in both strings. What is the difference here? What
am I missing in the Unicode regex?
Thank you in advance.
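
A minimal sketch of one way the quoted titles might be extracted, assuming
Python 3.3 or later. Since the post does not say whether the strings hold
real curly-quote characters or the escaped text \u2018 and \u2019, both
cases are shown. CURLY_TITLE and ESCAPED_TITLE are hypothetical names, and
the patterns are a suggestion, not the original regex.

```python
import re

# Case 1 (assumed): the text holds real U+2018 / U+2019 characters.
# The class excludes both quote characters, so a match cannot run past a quote.
CURLY_TITLE = re.compile("\u2018([^\u2018\u2019]*)\u2019")

# Case 2 (assumed): the text holds the literal characters \u2018 and \u2019.
# In a raw pattern, \\ matches exactly one backslash.
ESCAPED_TITLE = re.compile(r"\\u2018(.*?)\\u2019")

real = "\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"
escaped = r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"

real_titles = CURLY_TITLE.findall(real)
escaped_titles = ESCAPED_TITLE.findall(escaped)

print(real_titles)     # ['Mummy', 'Mama']
print(escaped_titles)  # ['Mummy', 'Mama']
```

Both patterns use a capture group, so findall() returns only the title text
without the surrounding quotes.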
