Web Hosting Talk







PHP - Unicode Conversion


jonathanbull
06-10-2009, 09:39 PM
Hi,

I've been struggling with this for what seems like forever now. What I'm trying to do is create a script that converts any form input into the UTF-16 equivalent.

This guy (http://rishida.net/scripts/uniview/conversion.php) is doing exactly what I'm after, but I can't for the life of me figure out how it's done. For example, if you enter "abc", the output is "0061 0062 0063". If you enter something in, say, Chinese, it can figure that out too.

I've tried iconv, mbstring and countless other things but none seem to give the correct results. Any ideas would be really appreciated!


Thanks in advance guys. :)

BMG_Servers
06-11-2009, 12:21 AM
I poked around looking for an answer for this too and I didn't end up seeing what was needed in the iconv or mbstring functions. My suggestion at this point would be to consider rolling your own. The site you linked does their conversion in JavaScript, which looked like it'd be pretty easy to port over to PHP. Maybe worth a try?

mwatkins
06-11-2009, 01:33 AM
Almost an aside, but why UTF-16? Do you have a specific need for this? utf-8 covers a good deal of ground and has some real plusses over -16.

aniketh
06-11-2009, 03:19 AM
Does the iconv php extension help? (tried to post link to manual but not enough posts)

jonathanbull
06-11-2009, 10:54 AM
mwatkins wrote: "Almost an aside, but why UTF-16? Do you have a specific need for this? utf-8 covers a good deal of ground and has some real plusses over -16."
The data is going to be sent through an SMS gateway. Unfortunately, because the messages will often contain scripts such as Arabic and Chinese, the gateway requires them to be submitted in UTF-16 HEX form.

Believe me, I wouldn't have chosen it! :)

mwatkins
06-11-2009, 04:34 PM
I'm going to try to help but only half-way. To be honest, lousy support for Unicode data in PHP is one of the major reasons I left PHP behind many years ago.

The example string I'll keep referring to is:

Ç'është Unicode?, in Albanian; 유니코드에 대해? in Korean

Whatever encoding you receive it in from your web app (i.e. a form), you'll need to convert it to UTF-16. The mb functions *ought* to do this; if you are having problems with them, perhaps you aren't getting the data back in the encoding you think you are. Be sure of that before going off elsewhere.

Once you are sure you know what encoding the byte string you have obtained from the user is in, then you can easily use the mb functions to convert to another.

http://us3.php.net/manual/en/function.mb-convert-encoding.php

For a Python equivalent (close as I can since Python has native Unicode support) see the function and its matching output "other_encodings" below.

Unicode handling can be vexing, even at the simplest level. I know I initially struggled with this some years ago. It really helps to get a firm grip on what the data we are looking at actually is. Some encodings look like ASCII (e.g. UTF-8 encoded strings look exactly like plain ASCII when they contain no non-ASCII characters), and this can lead us down weird rat holes.
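A minimal Python 3 sketch of that "looks like ASCII" trap, using only built-in str/bytes methods. The variable names are just for illustration:

```python
# ASCII-only text produces identical bytes in ASCII and in UTF-8,
# so looking at such bytes tells you nothing about the encoding.
plain = "Unicode"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")
print(same_bytes)  # True

# A non-ASCII character takes more than one byte in UTF-8 ...
accented = "\u00eb"  # LATIN SMALL LETTER E WITH DIAERESIS
utf8_bytes = accented.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xab'

# ... and misreading those bytes as ISO-8859-1 yields two wrong characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # Ã«
```

The last line is the classic symptom of UTF-8 data being treated as a single-byte encoding somewhere along the way.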

Ideally an application should do its internal operations on native Unicode strings (assuming your language has such a data type), converting to a specific encoding only for output. To do this you need to decode the input data (which typically will be encoded as UTF-8 or one of the Windows encodings, and is best thought of as a "binary string") into a Unicode object or string. To decode you need to know what the source encoding is[1] to begin with, and that isn't always simple to determine.

When it is time to issue output (to a file, or an HTTP response) you encode to the character set you wish to use.
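The decode / work / encode cycle described above can be sketched in a few lines of Python 3. The function name process and its parameters are hypothetical, and the choice of "utf-16-be" as the target is only an assumption for the example:

```python
def process(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode at the edge, work on str, encode again only for output."""
    text = raw.decode(source_encoding)    # bytes -> Unicode string
    text = text.strip()                   # all real work happens on str
    return text.encode(target_encoding)   # Unicode string -> bytes


# Simulated form input that arrived as UTF-8 bytes.
form_bytes = " Ç'është ".encode("utf-8")
out = process(form_bytes, "utf-8", "utf-16-be")
print(out.hex())  # 00c7002700eb00730068007400eb
```

If source_encoding is wrong, decode() either raises UnicodeDecodeError or silently produces the wrong characters, which is why knowing the input encoding comes first.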

Some web frameworks do all this for you more or less automatically, which is really, really, nice.

[1]You can help force the input encoding by declaring an appropriate character set in your <form> declaration as well as in the page's content type headers and/or using meta tags in the <head>. For more information on forcing character encoding see:

http://validator.w3.org/docs/help.html#faq-charset
http://www.w3.org/International/O-charset

[2] Some background info on Unicode (not all of it is specific to Python - the Introduction to Unicode is worth a read if some of this is new to you)
http://docs.python.org/3.0/howto/unicode.html

[3]
http://us2.php.net/manual/en/intro.unicode.php
Unicode Support. Warning
This extension is still in development and it isn't available to public yet.


"""
Python 3 unicode handling example

There is a pretty major change in Python 3 which simplifies further how Unicode
is handled. The bottom line in Python >= 3.0: all "strings" are Unicode
strings; the default encoding (for input and output) is UTF-8.

Until one gets used to working with Unicode it can be confusing - often the
term "Unicode" gets used when 'encoding' is proper and vice versa.

For example, the string assigned to "data" below contains characters - we can't
tell just by looking at them what encoding has been used to represent them
visually on output (i.e. to our terminal, within our code editor, to a web page
or file).

In my editor, and in Python >= 3, my default character encoding is UTF-8. Thus I can
include non-ascii strings in my code provided that they can be represented as
UTF-8. Most everything I work with can; I can also change the default character
encoding for an individual Python module (file).

Remember: In the assignment to "data" below, the actual information is stored internally in
Unicode format. It may be represented in my editor/in the Python source as
UTF-8, but internally it is Unicode. The two are not the same thing.

Keep repeating to yourself - the data is stored and manipulated internally as a
Unicode object (string), not as data with a specific encoding (bytes)
"""
import unicodedata
import urllib.parse
import locale

# The code line below displays a UTF-8 encoded (because that is my default
# character encoding for both Python and the editor I use) string being assigned to "data".

data = "Ç'është Unicode?, in Albanian; 유니코드에 대해? in Korean"

# Remember: "data" now contains a native unicode object, not the UTF-8
# representation we see in front of us.

# For output as an HTTP response or to a file I would have to encode it with a
# suitable encoding again:

encoded = data.encode('utf-8')

# Many web frameworks automate a great deal of the Unicode in and out
# conversions, a feature which is really, really, thoughtful.

# Here are a few functions exercising the data
language, output_encoding = locale.getdefaultlocale()

def about():
    print("This OS/account defaults to [%s] as language, [%s] as "
          "encoding" % (language, output_encoding))

def raw():
    print("Encoded for this terminal:", data.encode(output_encoding))
    for item in [data, encoded]:
        print("[%s] %s" % (item.__class__.__name__,
                           item))

def types():
    # remember, the native encoding for Python source code > Python 2 is UTF-8
    # Internally the data is stored and manipulated within Python as Unicode objects
    print("data is type", type(data)) # str
    print("encoded is type", type(encoded)) # bytes

def other_encodings():
    print("Output in various encodings")
    for encoding in ['utf-8', 'utf-16', 'ascii']:
        print("[%s]: %s" % (
            encoding,
            data.encode(encoding,
                        'replace' if encoding=='ascii' else 'strict')))

def entities():
    # as XML entities, where necessary and possible
    print("As XML entities, where possible:")
    for encoding in ['utf-8', 'ascii']:
        print("[%s]: %s" % (encoding,
                            data.encode(encoding, 'xmlcharrefreplace')))

def table():
    """
    We'll take our "data" and output it in different formats
    """
    print('%3s %4s %5s %4s %9s %9s %2s %s' % (
        '#',
        'hex',
        'dec',
        'u-notn',
        'XML ent',
        'URL enc',
        'Cat',
        'Full Unicode Name'))
    print('%s' % ('-' * 75))
    for i, c in enumerate(data):
        xmlent = c.encode('ascii', 'xmlcharrefreplace')
        urlent = urllib.parse.quote(c)
        cbytes = [x for x in c.encode('utf-8')]
        print('%3s' % i, # pos in string
              '%04X' % ord(c), # Unicode ordinal in hex
              '%05d' % ord(c), # Unicode ordinal in decimal
              r'\u%04x' % ord(c), # Javascript + Python escape notation
              '%9s' % xmlent.decode('ascii'), # XML entity encoding
              '%9s' % urlent, # URL (percent) encoding
              '%3s' % unicodedata.category(c), # Unicode general category
              '%04s' % cbytes, # UTF-8 bytes as a list of ints
              unicodedata.name(c))

if __name__ == '__main__':

    for fn in [about, raw, types, other_encodings, entities, table]:
        print(fn.__name__.title())
        fn()
        print('\n%s' % ('-' * 75))


$ python3.1 uni.py
about
This OS/account defaults to [en_US] as language, [UTF8] as encoding
--------------------

raw
Encoded for this terminal: b"\xc3\x87'\xc3\xabsht\xc3\xab Unicode?, in Albanian; \xec\x9c\xa0\xeb\x8b\x88\xec\xbd\x94\xeb\x93\x9c\xec\x97\x90 \xeb\x8c\x80\xed\x95\xb4? in Korean"
[str] Ç'është Unicode?, in Albanian; 유니코드에 대해? in Korean
[bytes] b"\xc3\x87'\xc3\xabsht\xc3\xab Unicode?, in Albanian; \xec\x9c\xa0\xeb\x8b\x88\xec\xbd\x94\xeb\x93\x9c\xec\x97\x90 \xeb\x8c\x80\xed\x95\xb4? in Korean"
--------------------

types
data is type <class 'str'>
encoded is type <class 'bytes'>
--------------------

other_encodings
Output in various encodings
[utf-8]: b"\xc3\x87'\xc3\xabsht\xc3\xab Unicode?, in Albanian; \xec\x9c\xa0\xeb\x8b\x88\xec\xbd\x94\xeb\x93\x9c\xec\x97\x90 \xeb\x8c\x80\xed\x95\xb4? in Korean"
[utf-16]: b"\xff\xfe\xc7\x00'\x00\xeb\x00s\x00h\x00t\x00\xeb\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e\x00?\x00,\x00 \x00i\x00n\x00 \x00A\x00l\x00b\x00a\x00n\x00i\x00a\x00n\x00;\x00 \x00 \xc7\xc8\xb2T\xcf\xdc\xb4\xd0\xc5 \x00\x00\xb3t\xd5?\x00 \x00i\x00n\x00 \x00K\x00o\x00r\x00e\x00a\x00n\x00"
[ascii]: b"?'?sht? Unicode?, in Albanian; ????? ??? in Korean"
--------------------

entities
As XML entities, where possible:
[utf-8]: b"\xc3\x87'\xc3\xabsht\xc3\xab Unicode?, in Albanian; \xec\x9c\xa0\xeb\x8b\x88\xec\xbd\x94\xeb\x93\x9c\xec\x97\x90 \xeb\x8c\x80\xed\x95\xb4? in Korean"
[ascii]: b"Ç'është Unicode?, in Albanian; 유니코드에 대해? in Korean"
--------------------

table
# hex dec u-notn XML ent URL enc Cat Full Unicode Name
---------------------------------------------------------------------------
0 00C7 00199 \u00c7 Ç %C3%87 Lu [195, 135] LATIN CAPITAL LETTER C WITH CEDILLA
1 0027 00039 \u0027 ' %27 Po [39] APOSTROPHE
2 00EB 00235 \u00eb ë %C3%AB Ll [195, 171] LATIN SMALL LETTER E WITH DIAERESIS
3 0073 00115 \u0073 s s Ll [115] LATIN SMALL LETTER S
4 0068 00104 \u0068 h h Ll [104] LATIN SMALL LETTER H
5 0074 00116 \u0074 t t Ll [116] LATIN SMALL LETTER T
6 00EB 00235 \u00eb ë %C3%AB Ll [195, 171] LATIN SMALL LETTER E WITH DIAERESIS
7 0020 00032 \u0020 %20 Zs [32] SPACE
8 0055 00085 \u0055 U U Lu [85] LATIN CAPITAL LETTER U
9 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
10 0069 00105 \u0069 i i Ll [105] LATIN SMALL LETTER I
11 0063 00099 \u0063 c c Ll [99] LATIN SMALL LETTER C
12 006F 00111 \u006f o o Ll [111] LATIN SMALL LETTER O
13 0064 00100 \u0064 d d Ll [100] LATIN SMALL LETTER D
14 0065 00101 \u0065 e e Ll [101] LATIN SMALL LETTER E
15 003F 00063 \u003f ? %3F Po [63] QUESTION MARK
16 002C 00044 \u002c , %2C Po [44] COMMA
17 0020 00032 \u0020 %20 Zs [32] SPACE
18 0069 00105 \u0069 i i Ll [105] LATIN SMALL LETTER I
19 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
20 0020 00032 \u0020 %20 Zs [32] SPACE
21 0041 00065 \u0041 A A Lu [65] LATIN CAPITAL LETTER A
22 006C 00108 \u006c l l Ll [108] LATIN SMALL LETTER L
23 0062 00098 \u0062 b b Ll [98] LATIN SMALL LETTER B
24 0061 00097 \u0061 a a Ll [97] LATIN SMALL LETTER A
25 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
26 0069 00105 \u0069 i i Ll [105] LATIN SMALL LETTER I
27 0061 00097 \u0061 a a Ll [97] LATIN SMALL LETTER A
28 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
29 003B 00059 \u003b ; %3B Po [59] SEMICOLON
30 0020 00032 \u0020 %20 Zs [32] SPACE
31 C720 50976 \uc720 유 %EC%9C%A0 Lo [236, 156, 160] HANGUL SYLLABLE YU
32 B2C8 45768 \ub2c8 니 %EB%8B%88 Lo [235, 139, 136] HANGUL SYLLABLE NI
33 CF54 53076 \ucf54 코 %EC%BD%94 Lo [236, 189, 148] HANGUL SYLLABLE KO
34 B4DC 46300 \ub4dc 드 %EB%93%9C Lo [235, 147, 156] HANGUL SYLLABLE DEU
35 C5D0 50640 \uc5d0 에 %EC%97%90 Lo [236, 151, 144] HANGUL SYLLABLE E
36 0020 00032 \u0020 %20 Zs [32] SPACE
37 B300 45824 \ub300 대 %EB%8C%80 Lo [235, 140, 128] HANGUL SYLLABLE DAE
38 D574 54644 \ud574 해 %ED%95%B4 Lo [237, 149, 180] HANGUL SYLLABLE HAE
39 003F 00063 \u003f ? %3F Po [63] QUESTION MARK
40 0020 00032 \u0020 %20 Zs [32] SPACE
41 0069 00105 \u0069 i i Ll [105] LATIN SMALL LETTER I
42 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
43 0020 00032 \u0020 %20 Zs [32] SPACE
44 004B 00075 \u004b K K Lu [75] LATIN CAPITAL LETTER K
45 006F 00111 \u006f o o Ll [111] LATIN SMALL LETTER O
46 0072 00114 \u0072 r r Ll [114] LATIN SMALL LETTER R
47 0065 00101 \u0065 e e Ll [101] LATIN SMALL LETTER E
48 0061 00097 \u0061 a a Ll [97] LATIN SMALL LETTER A
49 006E 00110 \u006e n n Ll [110] LATIN SMALL LETTER N
--------------------

mwatkins
06-11-2009, 04:40 PM
Sadly some of the characters, notably the XML character entities, are being interpreted by the "code" block rather than output as-is. You'll have to take my word for it, or install Python 3+ and run the script yourself.

Another article worth reading:

http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

mwatkins
06-11-2009, 05:20 PM
Another bump in the road you may be hitting is the form encoding. I've seen different browser behaviour in the past and this might be introducing another variable into the mix.

Why not try to collect your data as UTF-8 - meaning set the page headers, the form accept-charset attribute, etc. all to UTF-8 - and then in your communication with the gateway send the mb_convert_encoding()'d string as UTF-16?

<form method="POST" accept-charset="UTF-8">
  <input name="submitted" type="hidden" value="1" />
  <input name="name" />
  <input type="submit" value="do it" />
</form>

jonathanbull
06-11-2009, 08:47 PM
Firstly, thank you all for your replies - I never expected such a helpful response! It truly baffles me that there seems to be no reliable method of conversion for this in PHP whereas other languages like Java can do it in a couple of lines.

Anyway, back to the issue. I've made sure that the page is defined as UTF-8 in the head tags and made sure that the form input is coming in as UTF-8 as well.

Just to reiterate, it's the HEX UTF-16 values that I'm after. I've found bin2hex($str) which works fine for getting the UTF-8 HEX values, just not the UTF-16 ones. I also tried mb_convert_encoding($str, "UTF-16", "UTF-8") but the output always comes out as something crazy like this (http://www.jonathanbull.co.uk/unicode.php) - a long way away from any HEX values!

Any ideas guys? :)

mwatkins
06-12-2009, 11:31 AM
Unicode handling in PHP is one of the things I loathe about the language and it is in fact one of the main reasons I left (or never returned to) PHP many years ago. The sheer inconsistency of the language is another reason. Having to recompile the interpreter to include basic functionality that a development language should offer to web developers is yet another reason. Performance of the execution model is yet another reason...

Sorry, I don't do that very often. Back on track - I think I can help you understand why your bin2hex($str) isn't working out with UTF-16 values - it is almost certainly because of the BOM, the Byte Order Mark, which you'll find at the start of *most* strings and files composed of UTF-16 encoded data. The BOM indicates whether the encoded data is in big- or little-endian byte order (a choice that traditionally followed the CPU it was produced on).

EF BB BF > UTF-8
FE FF > UTF-16/UCS-2, big endian
FF FE > UTF-16/UCS-2, little endian
FF FE 00 00 > UTF-32/UCS-4, little endian
00 00 FE FF > UTF-32/UCS-4, big endian
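For checking what a byte string starts with, here is a rough Python 3 sketch that matches the marks listed above against the constants in the standard codecs module. The helper name sniff_bom is made up for this example, and a BOM is only a hint: data without one is not proven to be any particular encoding.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


print(sniff_bom(b"\xff\xfeC\x00"))  # ('utf-16-le', b'C\x00')
print(sniff_bom(b"\xfe\xff\x00C"))  # ('utf-16-be', b'\x00C')
print(sniff_bom(b"Cheers"))         # (None, b'Cheers')
```

One known ambiguity: UTF-16 little-endian data whose first character is U+0000 also starts with FF FE 00 00, so this sketch would report it as UTF-32.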

To complicate matters further some languages have support for UCS-2 and UCS-4; UCS defines a universal character set, and UCS-2 and UCS-4 are its fixed-width 2-byte and 4-byte encodings. If you are writing decoding schemes from scratch because your language doesn't have a working implementation then you need to take that into account, too. Sigh.

Back to BOM:

A BOM isn't *typically* found in UTF-8 strings, nor even a certainty in UTF-8 encoded files, although you will see them frequently on Windows-platform originated files.

Editorial comment: I believe UTF-8 was a neat invention designed by practical people. I'm not sure I can say the same about UTF-16.

As UTF-8 encoding is not dependent on the processor, there is no need for a *byte order* marker. However the terminology has grown to cover "marker" of any sort, in a Unicode sense, and *occasionally* you will run into one with UTF-8 data. If found, it simply is there as a marker to denote the encoding type.

mb_convert_encoding should be doing all of this detection for you, passing back only the encoded data. But it doesn't seem to be and perhaps this bug is why it may not be:

http://bugs.php.net/bug.php?id=44014

Another thought: Have you checked the default internal coding of PHP? If it isn't UTF-8 (chances are it still isn't) that could easily be causing you problems with non ASCII data.

Perhaps you should verify that the input data is indeed UTF-8.

http://www.php.net/manual/en/function.mb-check-encoding.php

If it isn't, try putting this somewhere early in your code or ideally in a much simpler test code module:

mb_internal_encoding('UTF-8');

That'll change it from what I bet is the default, iso-8859-1. And if indeed that was the default, you were not likely converting UTF-8 data in the first place. Maybe. Who knows with PHP.
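If you want to double check the "is this really UTF-8?" question outside PHP, a strict decode in Python 3 does the job. This is only an analogue of the check, not how mb_check_encoding works internally, and the function name is_valid_utf8 is invented for the example:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("ë".encode("utf-8")))       # True
print(is_valid_utf8("ë".encode("iso-8859-1")))  # False: a lone 0xEB byte
print(is_valid_utf8(b"plain ascii"))            # True: ASCII is valid UTF-8
```

Note the third case: pure ASCII passes as UTF-8 as well, so test with at least one non-ASCII character before drawing conclusions.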

Unicode should not be that hard. In the future virtually all programming languages will have Unicode strings at their core. I'm surprised PHP still seems far away from this. For an example of just how easy it should be: in Python 3 every "string" is a Unicode object and every byte string can be decoded into a Unicode string.

$ python3.0
>>> "Cheers".encode('utf-16')
b'\xff\xfeC\x00h\x00e\x00e\x00r\x00s\x00'
>>> b'\xff\xfeC\x00h\x00e\x00e\x00r\x00s\x00'.decode('utf-16')
'Cheers'
>>> b'\xff\xfeC\x00h\x00e\x00e\x00r\x00s\x00'.decode('utf-16').encode('utf-8')
b'Cheers'

# Want to see the BOM?

>>> import codecs
>>> codecs.BOM_UTF16
b'\xff\xfe'
>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'

So bottom line: you need to strip away the BOM bytes **unless** the service you are communicating with expects them (and they very well may).
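To have known-good values to compare your PHP output against, something like the following Python 3 sketch produces the "0061 0062 0063" style of output from the first post. Big-endian order and no BOM by default are assumptions here, since the thread doesn't say what the gateway expects; the function name utf16_hex and its parameters are made up for the example.

```python
import codecs


def utf16_hex(text: str, with_bom: bool = False, sep: str = " ") -> str:
    """Hex dump of text as UTF-16 big-endian, one group per 16-bit code unit."""
    payload = text.encode("utf-16-be")   # explicit byte order, so no BOM is added
    if with_bom:
        payload = codecs.BOM_UTF16_BE + payload
    digits = payload.hex().upper()
    units = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return sep.join(units)


print(utf16_hex("abc"))                 # 0061 0062 0063
print(utf16_hex("유니"))                # C720 B2C8
print(utf16_hex("abc", with_bom=True))  # FEFF 0061 0062 0063
print(utf16_hex("abc", sep=""))         # 006100620063
```

Using the plain "utf-16" codec instead would prepend a BOM and use the machine's native byte order, which is where the leading \xff\xfe in the session above comes from.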

Python might be helpful to you to debug issues - you can double check answers you get from PHP. I've been showing you Python 3.x in this thread, as it has an even cleaner approach to Unicode than it has had for the last decade. If you can tolerate installing Python 3.x I would recommend it for this purpose alone; if not, your system (if it is Unix) probably already has Python 2.5 or 2.6 installed and both will do but the Unicode handling is different in some subtle ways. I'm happy to provide some quick tips on that if you need it.

If you are running on a Unix machine and do elect to put Python 3.1 on, you can avoid overwriting things by downloading and compiling... easy enough:

./configure
make
make altinstall

Note "altinstall" - this will keep Python 3 from overwriting any links made to your system's default Python in /usr/bin/python or /usr/local/bin/python. Even a regular install will never overwrite a different minor version - you can always have multiple Pythons, e.g. 2.4 / 2.5 / 2.6 / 3.0 / 3.1, installed side by side.

Back to PHP: Have you tried avoiding all web input and simply taking a known Unicode character not in the ASCII character set (use one of the escape methods to create it), converting that to UTF-16 and viewing it in the browser? Force your browser to UTF-16 if you must. Keep your tests really, really simple, with as few inputs as possible, until you work out what is failing. I do understand how maddening this is... what should be dirt simple is taking hours of your time. Been there, done that.