I have always been interested in languages. One of my hobbies is asking people I meet how to say something in their language. It’s almost always “How are you?” but my favorite came from the Mayan people, who taught me “Ko’ox hana,” which means “Let’s go eat!”
At Two Hat Security we always like to take on hard challenges. For the last two years, we have been cracking the code of how to filter English. One of the tricks we learnt was to ignore that it’s English. For instance, if a 12-year-old kid is trying to get a swear word through, do we really think they’ll use a noun and a verb? So we joke with our Natural Language Processing team that it’s really the study of unnatural language. Early on we decided to treat everything as Unicode and focus on the risk of each character in the context of the characters around it.
I was shocked when I threw Arabic Twitter posts at the filter and it popped out the phone numbers without any effort. Not bad considering Arabic is written in the opposite direction 😉 I was even happier when a client asked us to find Finnish usernames and all I had to do was add the accent mapping and some keywords, and it worked.
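For the curious, here is a minimal Python sketch of what an accent mapping plus a keyword check could look like. It is an illustration rather than our production filter: `fold_accents`, `matched_keywords`, and the keyword list are hypothetical names, and it assumes that stripping Unicode combining marks is an acceptable stand-in for a hand-built mapping.

```python
import unicodedata

# Hypothetical keyword list, for illustration only (already accent-free).
KEYWORDS = {"hyva", "paiva"}


def fold_accents(text):
    """Casefold the text and drop combining marks, so 'ä' compares equal to 'a'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matched_keywords(username):
    """Return the set of keywords that appear in the accent-folded username."""
    folded = fold_accents(username)
    return {kw for kw in KEYWORDS if kw in folded}


if __name__ == "__main__":
    print(fold_accents("Hyvää Päivää"))
    print(sorted(matched_keywords("HyväPäivä_99")))
```

Folding accents before matching means the keyword list only needs one spelling of each word, which is roughly why adding a new language can be a small change.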
My new challenge is to onboard 40 languages all at once and use some really cool algorithms to quickly find the highest-risk items. It’s kind of cheating, since we added unit tests for these languages a long time ago and have been running some of them ever since. To make it fun I’m going to start from scratch and use the Twitter mini-firehose to evaluate it. Keep in touch and let me know if you want to participate.
Photo credit: Symphoney Symphoney/Flickr