tokenization in contains_any

Hi,

Is it possible to disable tokenization for the contains_any operator? If not, what is the best way (in terms of correctness and DPU cost) to achieve exact term matches against the twitter.text target?

Example:
We need a filter that performs exact matches against twitter.text for multiple terms, e.g.
twitter.text contains_any "foo&bar, foo bar"

and we would like this filter to match:
"some text foo&bar some more"
"some text foo bar"
but NOT:
"some text & some more text"
"some foo some bar"

The documentation on tokenization gives an example of using the substr operator to get around these false positives:
twitter.text contains "50%"
and twitter.text substr "50%"

However, this would give quite high DPU costs for us, as it requires two operators per term, and we have lots of terms to match against.
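
Applying that documented pattern across our example terms, the filter would look something like this (just a sketch):

(twitter.text contains "foo&bar" and twitter.text substr "foo&bar")
or (twitter.text contains "foo bar" and twitter.text substr "foo bar")

With lots of terms, that doubles the number of operators, which is where the cost concern comes from.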

Many thanks, Ben

ben

A couple of things: there is no way to disable tokenization through the operator. Also, neither of your last two examples would be pulled in with the filter as you wrote it.

Tokenization doesn't tokenize the values you provide. Tokenization is how our platform parses the text flowing through the stream of data (the Twitter firehose, in your case). It generates tokens from that text and then compares them to the values you have provided.
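
To illustrate against your examples (the token lists are a rough sketch, not the tokenizer's exact output):

"some text foo&bar some more"  ->  some, text, foo, bar, some, more
"some text foo bar"            ->  some, text, foo, bar
"some text & some more text"   ->  some, text, some, more, text
"some foo some bar"            ->  some, foo, some, bar

Your filter's values are compared against those tokens, which is why the first two tweets are pulled in and the last two are not.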

jbreucop

Thanks. So just to clarify: to match a word that gets tokenized in the Twitter text, say "H&M", the only way to do so is to use contains_any in conjunction with the substr operator?

ben

twitter.text contains_any "H&M" will match H&M, H&M!!!!, H & M, etc. Tokenization doesn't mean that punctuation breaks up the words; it means the text is presented as multiple tokens to be compared against the values you've provided.

The reason the documentation uses both contains and substr is to rule out a space occurring between 50 and %, which the tokenized contains_any allows; "50%" itself still matches when contains_any is used. The strategy is to protect against possible noise. Tokenization is, 90% of the time, much more helpful in getting a wider variety of relevant posts.
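
If you want the same belt-and-braces behaviour for your H&M example, the analogous filter would be (a sketch following the documented 50% pattern):

twitter.text contains "H&M"
and twitter.text substr "H&M"

The tokenized contains keeps the recall, while the substr check rules out variants like "H & M" where the characters don't appear together in the raw text.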

Example: say you were searching for "amoeba" and someone tweeted "I_found_amoeba". That tweet would be returned because one of the tokens in that tweet, amoeba, matched the value you were looking for.
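
In CSDL terms, that's just (a sketch):

twitter.text contains "amoeba"

The tweet is pulled in because tokenization splits "I_found_amoeba" into separate tokens, one of which is amoeba.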

jbreucop