  1. #1

    PHP + preg + international chars problem

    Hello everyone.

    I'm on a VDS plan and I'm having problems with PHP, specifically with the preg* regular expression functions on strings that contain non-English characters.

    Although things work on my local machine, and would work on the server if I used ereg* instead, I need to use preg*, which doesn't work there.

    Here is a small example:

    echo mb_detect_encoding($word)." 1- ".$word."<br/>";
    $word = preg_replace("/\W/", "", $word);
    echo mb_detect_encoding($word)." 2- ".$word."<br/>";
    This code on my machine produces:

    UTF-8 1- ααα.
    UTF-8 2- ααα
    Which is perfectly fine. However, on my VDS account it produces:

    UTF-8 1- ααα.
    ASCII 2-
    Could anyone please suggest anything? Thanks for reading
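
When the same script behaves differently on two machines, it can help to compare how each PHP build's PCRE library was compiled and which character-type locale is active. The following is only a sketch: `pcre_unicode_support` is a hypothetical helper name, the file is assumed to be saved as UTF-8, and the idea that the locale plays a role here is an assumption, not something the thread confirms.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether this build's PCRE
// accepts UTF-8 mode (/u) and Unicode properties such as \p{L}.
function pcre_unicode_support(): array
{
    return [
        // preg_match() returns false (and warns) when /u is not supported.
        'utf8'       => @preg_match('/./u', 'α') === 1,
        // \p{L} requires PCRE to be built with Unicode property support.
        'properties' => @preg_match('/^\p{L}$/u', 'α') === 1,
    ];
}

var_dump(pcre_unicode_support());

// Passing "0" only reads the current setting. Without /u, the LC_CTYPE
// locale can influence which bytes \w and \W treat as word characters.
echo setlocale(LC_CTYPE, '0'), "\n";
```

Running this on both the local machine and the VDS and comparing the output would show whether the two environments differ in either respect.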

  2. #2
    I doubt preg_replace will work on multibyte strings, since PHP internally treats every string as a sequence of single bytes.

    For example, if a Chinese character takes two bytes (let's label them No. 1 and 2), then two characters take four (No. 1 and 2 for the first character, No. 3 and 4 for the second). The problem arises when the 3rd and 4th bytes have to be treated as one character, which a byte-oriented regular expression does not do. This may be what happens when the pattern uses a word-character class.
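
The byte-versus-character point can be checked directly with the Greek string from the first post. This is a minimal sketch, assuming the file is saved as UTF-8, the mbstring extension is loaded, and PCRE has Unicode property support; what the byte-wise call returns depends on the active locale, so no result is claimed for it.

```php
<?php
// Assumes a UTF-8 source file and the mbstring extension.
$word = "ααα.";

echo strlen($word), "\n";             // bytes: 3 letters x 2 bytes + 1 dot = 7
echo mb_strlen($word, 'UTF-8'), "\n"; // characters: 4
echo bin2hex('α'), "\n";              // ceb1, the two UTF-8 bytes of one letter

// Without /u the pattern is applied byte by byte. In the "C" locale bytes
// >= 0x80 are not word characters, so \W may strip every Greek byte.
$bytewise = preg_replace('/\W/', '', $word);

// With /u the subject is decoded as UTF-8; \P{L} means "not a letter",
// so only the dot is removed.
$unicode = preg_replace('/\P{L}/u', '', $word);
echo $unicode, "\n";                  // ααα
```

If `$bytewise` comes back empty, that would match the "ASCII 2-" line reported on the VDS, since an empty string contains no non-ASCII bytes for mb_detect_encoding to find.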

  3. #3
    Still, everything works perfectly well on my PC. How can that be explained?

  4. #4
    I had a few more encodings installed. Now for this bit:

    PHP Code:
    $word = "ααα ββ";
    echo mb_detect_encoding($word)." 1) $word<br/>";
    $word = preg_replace("/(.*)\b(.)(.*)/u", "$1__$2__$3", $word);
    echo mb_detect_encoding($word)." 2) $word<br/>";

    $word = "aaa bb";
    echo mb_detect_encoding($word)." 1) $word<br/>";
    $word = preg_replace("/(.*)\b(.)(.*)/", "$1__$2__$3", $word);
    echo mb_detect_encoding($word)." 2) $word<br/>";
    I get:

    UTF-8 1) ααα ββ
    UTF-8 2) ααα ββ
    ASCII 1) aaa bb
    ASCII 2) aaa __b__b
    So somehow the regex is not applied to the Greek characters...

    I've been trying to figure this out for about a week already. Could someone please help?
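
One way to take `\b` out of the equation is to spell the word boundary out with Unicode properties, so the result does not depend on how a given PCRE build treats `\b` and `\w` in UTF-8 mode. The following is a sketch under stated assumptions, not a confirmed fix for this server: `mark_last_boundary` is a hypothetical helper name, the input is assumed to be valid UTF-8, and PCRE must support `/u` and `\p{...}`.

```php
<?php
// Hypothetical helper (not part of PHP). Assumes valid UTF-8 input and a
// PCRE build with Unicode property support.
function mark_last_boundary(string $word): ?string
{
    $w = '[\p{L}\p{N}_]'; // a "word" character: letter, digit or underscore
    // Equivalent of \b: a position where exactly one neighbour is a word char.
    $b = "(?:(?<!$w)(?=$w)|(?<=$w)(?!$w))";

    $result = preg_replace("/(.*)$b(.)(.*)/u", '$1__$2__$3', $word);
    if ($result === null) {
        // preg_replace() returns null on failure, e.g. malformed UTF-8.
        trigger_error('PCRE error code ' . preg_last_error(), E_USER_WARNING);
    }
    return $result;
}

echo mark_last_boundary("ααα ββ"), "\n"; // ααα __β__β
echo mark_last_boundary("aaa bb"), "\n"; // aaa __b__b
```

The replacement string is single-quoted so that `$1`, `$2` and `$3` are clearly group references rather than PHP variables. Checking for a null return is worthwhile when `/u` is involved, because a subject that is not valid UTF-8 makes the call fail instead of returning the input unchanged.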

