Boundaries don't work properly

Hi,

I am collecting tweets in Brazilian Portuguese and I need to define word boundaries in the following way: "\b<word>\b". However, the boundary does not work properly when the given word is part of a longer word that contains special characters like "ç" or "ã". For example, if I search for tweets that contain the Portuguese word "liga" (language.tag == "pt" and twitter.text regex_partial "\bliga\b"), some of the returned tweets contain "liga" only as part of the word "ligação".
I would like to know whether I am using the boundaries correctly, or whether there is some way to avoid the unwanted tweets, such as those with the word "ligação".

Thanks for your help.

willyanabilhoa


2 comments


This seems to be an issue with the RE2 Regex engine itself. Testing this regular expression outside of DataSift gives me the same results as you are seeing here.
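
For anyone who wants to reproduce this outside DataSift, here is a minimal Python sketch. RE2's syntax documentation defines \b as an ASCII word boundary, so "ç" counts as a non-word character and a boundary is found between "liga" and "ção". The sketch does not run RE2 itself: it uses Python's re.ASCII flag to imitate that behaviour, and the sample tweet is made up for illustration.

```python
import re

# re.ASCII restricts \w and \b to ASCII, imitating RE2's documented behaviour.
ascii_boundary = re.compile(r"\bliga\b", re.ASCII)
# Python's default Unicode-aware \b, shown only for contrast.
unicode_boundary = re.compile(r"\bliga\b")

# Made-up sample text; "liga" appears only inside "ligação".
tweet = "preciso fazer uma ligação agora"

# With ASCII boundaries, "ç" is a non-word character, so \b matches after "liga".
ascii_hit = ascii_boundary.search(tweet) is not None
# With Unicode boundaries, "ç" is a word character, so there is no match.
unicode_hit = unicode_boundary.search(tweet) is not None

print(ascii_hit, unicode_hit)
```

The first result corresponds to the unwanted match on "ligação" described in the question.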

If you are filtering for the word "liga", you can simply use the contains or contains_any operators to achieve this. 

 twitter.text contains "liga"

This CSDL will not match the word "ligação".

Jason D. 0 votes

Hi,

Thank you for the comment.

I have used the contains and contains_any operators before. The idea of using regular expressions is to reduce the variations, or combinations, of a word to a single search term given by the regular expression. These variations can be the presence or absence of spaces between the words, or of special characters like "ç", "ã", ...
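
One possible way to keep a single regular expression while avoiding the "ligação" match is to replace "\b" with an explicit list of the characters that count as part of a word. The sketch below is in Python and is only an illustration: bounded and WORD_CHARS are hypothetical names, the character ranges cover only ASCII plus the Latin-1 accented letters, and whether regex_partial accepts the same pattern is an assumption that would need testing in DataSift.

```python
import re

# Characters treated as part of a word: ASCII word characters plus the
# Latin-1 accented letters (the ranges skip the × and ÷ signs).
WORD_CHARS = r"0-9A-Za-z_À-ÖØ-öø-ÿ"


def bounded(word_pattern):
    # Hypothetical helper: wrap a pattern in explicit boundaries instead of \b.
    # The match must be preceded and followed by the start/end of the text
    # or by a character that is not in WORD_CHARS.
    return re.compile(
        rf"(?:^|[^{WORD_CHARS}]){word_pattern}(?:$|[^{WORD_CHARS}])",
        re.ASCII,  # keep Python's own classes ASCII-only, as in RE2
    )


liga = bounded("liga")
matches_word = liga.search("a liga começa hoje") is not None
matches_inside = liga.search("fiz uma ligação") is not None

# One pattern covering the accented and unaccented spellings.
variants = bounded("liga[cç][aã]o")
matches_plain = variants.search("a ligacao caiu") is not None
matches_accented = variants.search("a ligação caiu") is not None

print(matches_word, matches_inside, matches_plain, matches_accented)
```

RE2 also documents Unicode classes such as \pL, which might allow a shorter boundary like [^\pL\pN_]. Python's re module does not support them, so that variant is not shown here.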

willyanabilhoa 0 votes