Regex on non-Latin alphabets

Calvin · December 15, 2023, 1:37pm

Hi there guys,

Hoping someone can help me out with the regex regex(., '^[a-zA-Z\s]+$') for capturing a name.

This condition works perfectly for Latin characters, but it doesn't work for non-Latin alphabets (e.g. the Amharic alphabet).

Any help would be greatly appreciated!

@erobinson
@Mazz
@Norman_Hooper
@Ethan_Soergel
@Simon_Kelly

Mazz · December 18, 2023, 9:06am

A Google search for an Arabic regex returns this:
[\u0621-\u064A], so maybe something similar would work?

Mazz · December 18, 2023, 9:07am

Norman_Hooper · December 18, 2023, 4:39pm

@Mazz a regex tester is a great idea!

@Calvin if \s works (matches any whitespace character), then \w should match a "word" character in any alphabet (Latin, Amharic, Cyrillic, etc.). Be aware that digits and underscore are also included as word characters. So [\w -]+ would match both "Charles III" and "Карл 3-й".

regex(., '^[\w\s]+$') should be the equivalent of what you've got, although I'd tweak that to regex(., '^[\w -]+$') to match just spaces instead of all whitespace, and to include hyphens for hyphenated names.
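A quick way to see what \w actually lets through is to try the pattern outside the form. This is only a sketch in Python, whose `re` module treats \w as Unicode-aware for text strings; whether the form's regex engine behaves the same way is an assumption worth checking, since some engines limit \w to ASCII. The `is_valid_name` helper is a made-up name for illustration.

```python
import re

# Pattern suggested above: word characters, spaces, and hyphens only.
NAME_PATTERN = re.compile(r"^[\w -]+$")


def is_valid_name(value: str) -> bool:
    """Return True if the whole value is word characters, spaces, or hyphens."""
    return NAME_PATTERN.fullmatch(value) is not None


# Latin, Cyrillic, and Amharic samples, plus one with an apostrophe.
for sample in ["Charles III", "Карл 3-й", "አበበ", "Anne-Marie", "O'Brien", "Agent 47"]:
    print(repr(sample), is_valid_name(sample))
```

Note that "Agent 47" passes because digits count as word characters, and "O'Brien" fails because the apostrophe is not in the character class.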

Calvin · December 20, 2023, 8:03am

Hi @Mazz and @Norman_Hooper, thank you so much for your great responses.
@Norman_Hooper, the only thing with regex(., '^[\w -]+$') is that it allows integer values and we wanted to avoid that, otherwise that would have worked perfectly.

You guys inspired me to dig a little deeper, and I was able to find a list of Unicode characters for the Ethiopian languages, which include Amharic.

Will need to do some more testing but this seems to work how we need it to:
regex(., '^[\u1200-\u137Fa-zA-Z\s]+$')
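For the "more testing" part, here is a minimal sketch in Python that runs the pattern above against a few sample names. One thing it shows: the range U+1200 to U+137F is the whole Ethiopic block, which also contains Ethiopic punctuation (U+1360 to U+1368) and numerals (U+1369 to U+137C), so a character like ፩ (Ethiopic digit one) is accepted. The narrower pattern ending at U+135F is a hypothetical variant, not something tested in the form, and the `matches` helper is a made-up name.

```python
import re

# Pattern from the post: Ethiopic block, ASCII letters, whitespace.
ETHIOPIC_OR_LATIN = re.compile(r"^[\u1200-\u137Fa-zA-Z\s]+$")

# Hypothetical narrower variant: stops at U+135F, before the Ethiopic
# punctuation (U+1360-U+1368) and numerals (U+1369-U+137C).
ETHIOPIC_LETTERS_OR_LATIN = re.compile(r"^[\u1200-\u135Fa-zA-Z\s]+$")


def matches(pattern: re.Pattern, value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


# Amharic name, Latin name, Latin with an ASCII digit, Cyrillic, Ethiopic digit one.
for sample in ["አበበ በቂላ", "Calvin", "Abebe 3", "Карл", "፩"]:
    print(
        repr(sample),
        matches(ETHIOPIC_OR_LATIN, sample),
        matches(ETHIOPIC_LETTERS_OR_LATIN, sample),
    )
```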

Mazz · December 20, 2023, 8:21am

I think there are situations where the character set is NOT Unicode? Or am I just parched for coffee?

Calvin · December 20, 2023, 8:59am

I'm not sure I follow what you mean, @Mazz? Lol, maybe I'm the one parched for coffee!

Mazz · December 20, 2023, 9:32am

I think that the \u escape tells the regex the Unicode code point of the characters you want to allow.

I think there are niche situations where your device's character set is NOT Unicode, but I doubt you will have to worry about it.

One thing this makes me think about is having a different validation condition for each selected language. You can do it in a roundabout way by picking up the lang-code label, though.

This is just me falling down the rabbit hole.
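The per-language validation idea could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `NAME_PATTERNS` table, the language codes, and the `is_valid_name` helper are made-up names, and the ranges are simply the ones mentioned in this thread. How the language code would actually be picked up inside the form is not shown.

```python
import re

# Hypothetical table mapping a language code to the name pattern for it.
NAME_PATTERNS = {
    "en": re.compile(r"[a-zA-Z\s]+"),        # Latin letters and whitespace
    "am": re.compile(r"[\u1200-\u137F\s]+"),  # Ethiopic block and whitespace
    "ar": re.compile(r"[\u0621-\u064A\s]+"),  # Arabic range from the search result
}


def is_valid_name(value: str, lang_code: str) -> bool:
    """Validate a name against the pattern registered for lang_code."""
    pattern = NAME_PATTERNS.get(lang_code)
    if pattern is None:
        raise ValueError(f"No name pattern registered for language {lang_code!r}")
    return pattern.fullmatch(value) is not None


print(is_valid_name("Calvin", "en"))
print(is_valid_name("አበበ", "am"))
print(is_valid_name("Calvin", "am"))
```

Raising an error for an unknown language code is one design choice; falling back to a permissive default pattern would be another.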