How do I filter (out) emoji characters?

Emoji characters are popular images/ideograms that appear in place of text in social media. Many of them lie outside Unicode's Basic Multilingual Plane, so in UTF-16 they are represented by a surrogate pair rather than a single 16-bit value, and the CSDL engine does not tokenize them as punctuation. As a result, interactions whose content includes emoji characters can be filtered incorrectly.
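To make the surrogate pair representation concrete, here is a minimal Python sketch. Python is used only for illustration and is not part of CSDL. The helper names utf16_code_units and is_surrogate are hypothetical. The sketch shows that the emoji U+1F602 occupies two 16-bit code units in UTF-16, both inside the surrogate range U+D800 to U+DFFF.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


units = utf16_code_units("help\U0001F602")
print([f"{u:04X}" for u in units])
# → ['0068', '0065', '006C', '0070', 'D83D', 'DE02']
print([is_surrogate(u) for u in units])
# → [False, False, False, False, True, True]
```

The last two code units, D83D and DE02, together form the surrogate pair for the single emoji character.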

For example, an interaction with "help\uD83D\uDE02" as part of its content will not be matched by CSDL like the following, because the emoji character immediately follows the word "help":

interaction.content contains "help"

Several approaches can match content regardless of emoji characters. The easiest is to use either the substr or wildcard operator:

interaction.content substr "help"

This will match the example interaction, but it will also produce false positives, such as interactions with "helpful" as content. The wildcard operator behaves similarly.
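The trade-off between token matching and substring matching can be sketched with a toy model in Python. This is not how the CSDL engine is implemented: the tokenizer below simply assumes that only whitespace and ASCII punctuation separate tokens, and the names toy_contains and toy_substr are hypothetical.

```python
import re
import string

# Toy assumption: tokens are runs of characters that are neither whitespace
# nor ASCII punctuation. The real CSDL tokenization rules may differ.
_TOKEN = re.compile("[^\\s" + re.escape(string.punctuation) + "]+")


def toy_contains(content, keyword):
    """Token-based match: the keyword must equal a whole token."""
    return keyword in _TOKEN.findall(content)


def toy_substr(content, keyword):
    """Plain substring match, ignoring token boundaries."""
    return keyword in content


for text in ["help!", "help\U0001F602", "helpful"]:
    print(repr(text), toy_contains(text, "help"), toy_substr(text, "help"))
```

In this model the token match succeeds only for "help!", because the emoji stays glued to the word and forms a different token. The substring match succeeds for all three strings, including the unwanted "helpful".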

A better way to identify emoji characters is to use one of the regex operators and take advantage of the surrogate pair property:

interaction.content regex_partial "\\p{Cs}"

This will match any interaction containing an emoji character that is encoded as a surrogate pair, by looking for the Unicode general category Cs (surrogate).
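What a surrogate test does and does not catch can be approximated in Python. This is a sketch under the assumption that the match runs over UTF-16 code units; the function name has_surrogate_pair is hypothetical. Python strings hold whole code points, so the check looks for any code point above U+FFFF, which is exactly the set that needs a surrogate pair in UTF-16.

```python
def has_surrogate_pair(text):
    """True if any character in text needs a surrogate pair in UTF-16."""
    # Code points above U+FFFF cannot fit in one 16-bit code unit.
    return any(ord(ch) > 0xFFFF for ch in text)


print(has_surrogate_pair("help\U0001F602"))  # emoji U+1F602
# → True
print(has_surrogate_pair("help"))            # plain ASCII
# → False
print(has_surrogate_pair("smile \u263A"))    # U+263A is inside the BMP
# → False
print(has_surrogate_pair("\U00020000"))      # CJK ideograph, not an emoji
# → True
```

The last two cases show the limits of the approach: symbols inside the Basic Multilingual Plane are not detected, and non-emoji characters outside it are.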

From here, we can explicitly target interactions where an emoji character is attached directly to a word, for example in order to exclude them (at the risk of introducing false negatives), with something like the following:

interaction.content regex_partial "\\p{L}\\p{Cs}"

Or we can explicitly filter for our keywords (here "water", "ice" or "cloud") when they are immediately followed by an emoji character:

interaction.content regex_partial "(water|ice|cloud)\\p{Cs}"
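To try these two patterns outside the platform, they can be approximated with Python's re module. This is a sketch, not the CSDL engine's behavior: Python's re has no \p{...} classes, so the surrogate test is replaced by a range of code points above U+FFFF, and the letter test by an approximation of \p{L}. The constant names are hypothetical.

```python
import re

# Any code point above U+FFFF, i.e. one that needs a surrogate pair in UTF-16.
SUPPLEMENTARY = "[\U00010000-\U0010FFFF]"

# Approximates "\\p{L}\\p{Cs}": a word character that is neither a digit
# nor an underscore, directly followed by a supplementary code point.
LETTER_THEN_EMOJI = re.compile(r"[^\W\d_]" + SUPPLEMENTARY)

# Approximates "(water|ice|cloud)\\p{Cs}".
KEYWORD_THEN_EMOJI = re.compile("(water|ice|cloud)" + SUPPLEMENTARY)

samples = ["help\U0001F602", "help \U0001F602", "water\U0001F4A7", "ice cream"]
for text in samples:
    print(
        repr(text),
        bool(LETTER_THEN_EMOJI.search(text)),
        bool(KEYWORD_THEN_EMOJI.search(text)),
    )
```

Because the keyword pattern is not anchored to a word boundary, a string such as "justice" followed by an emoji would also match through the "ice" alternative, so a boundary may need to be added in practice.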