Contains operator treats underscores as whitespace

Hi,

When you use the following CSDL:

interaction.raw_content contains "@burn"

the filter picks up not just mentions of the @burn Twitter account, but also mentions of accounts whose names start with "burn" followed by an underscore, such as @burn_in_hell or @burn_under_sun. The same happens with the other contains operators (contains_any, contains_word, contains_phrase). I think this is faulty behavior: these filters clearly treat underscores as whitespace, which is not ideal.
The only way I could actually get around this was by writing a regex like this:

interaction.raw_content regex_partial "@burn\s"
This will only pick up messages mentioning the @burn account.
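
One caveat with the regex above: because it requires a whitespace character after "@burn", it may miss mentions at the very end of a message or mentions followed by punctuation. The sketch below illustrates this using Python's re module as a stand-in for the platform's regex engine (an assumption; the platform's engine may behave differently). The stricter pattern and the helper names are hypothetical and only meant for illustration.

```python
import re

# Assumption: Python's `re` stands in for the platform's regex engine.
ORIGINAL = re.compile(r"@burn\s")

# Hypothetical variant: after "@burn", accept any character that cannot be
# part of a Twitter username (letters, digits, underscore), or end of text.
STRICTER = re.compile(r"@burn([^A-Za-z0-9_]|$)")


def matches_original(text):
    """True if the regex from the post finds a match in the text."""
    return ORIGINAL.search(text) is not None


def matches_stricter(text):
    """True if the hypothetical stricter regex finds a match in the text."""
    return STRICTER.search(text) is not None


samples = [
    "Hi @burn how are you",   # plain mention
    "Hi @burn_in_hell",       # a different account
    "Great show by @burn",    # mention at the end of the text
    "Thanks @burn, see you",  # mention followed by a comma
]

for text in samples:
    print(text, "|", matches_original(text), "|", matches_stricter(text))
```

In this sketch the original pattern matches only the first sample, while the stricter pattern matches every sample except the @burn_in_hell one.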

Could you please look into this? Ideally, the contains operators would differentiate between underscore and whitespace properly (so I could use contains_any to batch many of these kinds of filters together as well).

Regards,

Gabor

olton

Official comment

Underscore characters are treated as punctuation, not whitespace. We treat the phrase "Hi @burn_x" the same way we would treat "Hi @burn." or "Hi @burn,", where the @mention is followed by a period or comma. A more detailed description of this can be found in our Tokenization documentation.
If you are looking to match mentions of @usernames, you should look at using the interaction.mentions or twitter.mentions target, with the IN operator.

The following CSDL will only match interactions containing mentions of the @burn account:

interaction.mentions in "burn"
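
To illustrate why the two approaches behave differently, here is a toy Python model. It is not the platform's actual tokenizer: the tokenize, contains_token, and mentions functions are hypothetical, and the sketch assumes that "@" stays attached to the token that follows it while underscores split tokens like any other punctuation.

```python
import re


def tokenize(text):
    """Split text into tokens, treating underscores as punctuation.

    Assumption: a leading '@' stays attached to its token, so "@burn"
    is kept whole, but '_' ends the token just like '.' or ','.
    """
    return re.findall(r"@?[A-Za-z0-9]+", text)


def contains_token(text, keyword):
    """Toy model of a token-based contains match (case-insensitive)."""
    return keyword.lower() in [token.lower() for token in tokenize(text)]


def mentions(text):
    """Extract whole usernames (letters, digits, underscores) after '@'."""
    return [name.lower() for name in re.findall(r"@([A-Za-z0-9_]+)", text)]


for text in ["Hi @burn_x", "Hi @burn.", "Hi @burn"]:
    print(text, "| tokens:", tokenize(text),
          "| contains '@burn':", contains_token(text, "@burn"),
          "| mentions 'burn':", "burn" in mentions(text))
```

Under these assumptions, "Hi @burn_x" and "Hi @burn." both produce an "@burn" token, so a token-based match cannot tell them apart, whereas comparing against whole usernames can.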
Jason D.

Thanks for the answer. I was aware I could use the interaction.mentions target for this purpose, but I was trying to rely on interaction.raw_content because I thought that target would let me filter on both content and mentions if I defined the filter appropriately. And it seems to work that way; the only false positives are caused by this tokenisation of underscores as punctuation. Given that many social media usernames can include underscores, I think it would not be unreasonable to change the behaviour of certain operators in that regard. At least contains_phrase could treat underscores as alphanumeric (given that the purpose of the operator is an exact phrase match, is it not?).

olton

The interaction.raw_content target was implemented to allow users to filter on the content of an interaction without it first being processed and having entities such as @mentions stripped out.
Unfortunately, changes to how punctuation is tokenized would affect all our users, so we try to avoid making them where possible.
As a workaround, you could define a filter like the following to ensure that no strings matching "burn_" are matched:

interaction.raw_content any "burn" AND
NOT interaction.raw_content any "burn_"
Jason D.