Any Unicode-savvy php (or perl) programmers around?
Monday, 27 September 2010 09:48 pm
I'm having a devil of a time with some php to do unicode processing and display.
Do you know how to turn the Unicode representation of "Ğ" into its HTML entity (&#286;)? The character is from ISO-8859-9 (Latin-5, the Turkish set), and it's one of a few I'm having trouble converting, because they're not in Latin-1. And heaven forfend we actually want to use those other characters.
You'd *think* (or at least I would think) htmlentities() would do the job, but htmlentities($string, ENT_QUOTES, 'UTF-8') leaves the character alone: it remains the Unicode string. get_html_translation_table(HTML_ENTITIES) suggests htmlentities() only has about 100 mappings, which is a disappointment.
I've browsed all sorts of PHP and perl docs, as well as straight references for ISO-8859-*, including some which say "here are the HTML mappings for a number of UTF-8 characters" - but I haven't found an anywhere-near-useful set of UTF-8 to HTML entities.
This seems like a bug.
Halp?
[Edit to add: I found this, which is perl to convert entities to LaTeX, and maybe I need to hack that up to produce a simple array myself?... Hm.]
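A small check of the behaviour described above, as a sketch rather than a diagnosis. It assumes PHP 5.4 or later (for the ENT_HTML401 flag) and asks the HTML 4.01 translation table directly. The likely explanation is that HTML 4.01 simply defines no named entity for Ğ, so htmlentities() has nothing to map it to and passes it through.

```php
<?php
// Assumption: PHP 5.4+, where ENT_HTML401 is available.
$flags = ENT_QUOTES | ENT_HTML401;
$table = get_html_translation_table(HTML_ENTITIES, $flags, 'UTF-8');

$eAcute = "\xc3\xa9"; // é, U+00E9, has a named entity in HTML 4.01
$gBreve = "\xc4\x9e"; // Ğ, U+011E, has none in HTML 4.01

// Is each character a key in the translation table?
var_dump(isset($table[$eAcute]));
var_dump(isset($table[$gBreve]));

// htmlentities() converts the first and passes the second through unchanged.
$convertedE = htmlentities($eAcute, $flags, 'UTF-8');
$convertedG = htmlentities($gBreve, $flags, 'UTF-8');
echo $convertedE, "\n";
var_dump($convertedG === $gBreve);
```

So the pass-through looks like a limit of the entity table rather than a bug in the function itself.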
no subject
Date: Tuesday, 28 September 2010 01:57 am (UTC)
Are you saying you have "Ğ" and you want "Ğ"? In HTML, isn't that the same thing? Or you have the UTF-8 representation "\xc4\x9f" and you want "Ğ"? [The latter's just some bitfiddling.] I don't think you should ever need to generate the string "Ğ".
no subject
Date: Tuesday, 28 September 2010 12:37 pm (UTC)
Yeah. I can hardly believe that in '08 we fit everybody's data into ISO-8859-1. ...Or shoe-horned it in, more likely.
> Or you have the UTF-8 representation "\xc4\x9f" and you want "Ğ"? [The latter's just some bitfiddling.]
Yes, that is what I want. I will look at wikipedia again, and see how my comprehension is this time, with this slightly more specific pointer. :)
[edit & to &]
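The bit-fiddling mentioned above might look roughly like this in plain PHP, with no mbstring needed. This is a sketch under assumptions: the function names are hypothetical, the input is taken to be valid UTF-8, and PCRE is assumed to be built with UTF-8 support (the /u modifier). Note that "\xc4\x9f" is the lowercase ğ (U+011F), while the uppercase Ğ (U+011E) is "\xc4\x9e".

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character to its code point.
function utf8_char_to_codepoint(string $char): int
{
    $b = array_values(unpack('C*', $char)); // bytes as integers, 0-indexed
    switch (count($b)) {
        case 1: // 0xxxxxxx
            return $b[0];
        case 2: // 110xxxxx 10xxxxxx
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3: // 1110xxxx 10xxxxxx 10xxxxxx
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: replace every non-ASCII character with &#NNN;.
function utf8_to_numeric_entities(string $text): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8_char_to_codepoint($m[0]) . ';';
        },
        $text
    );
    // preg_replace_callback() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $text;
}

echo utf8_to_numeric_entities("\xc4\x9e \xc4\x9f"), "\n"; // &#286; &#287;
```

This only ever emits numeric entities. Named ones such as &eacute; would still need a lookup table on top.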
no subject
Date: Tuesday, 28 September 2010 11:00 pm (UTC)
no subject
Date: Tuesday, 28 September 2010 04:56 am (UTC)
PHP: Look into using MB -- http://php.net/manual/en/book.mbstring.php
no subject
Date: Tuesday, 28 September 2010 04:59 am (UTC)
PHP's MB (mbstring) is a non-default extension, so it may require a recompile of PHP with --enable-mbstring, but then allows you to do this: $out = mb_convert_encoding( $out, 'HTML-ENTITIES', 'UTF-8' );
no subject
Date: Tuesday, 28 September 2010 09:26 pm (UTC)
did the trick. Exactly. Thanks...
no subject
Date: Tuesday, 28 September 2010 12:49 pm (UTC)
Missing right curly or square bracket at (eval 1) line 2, at end of line
syntax error at (eval 1) line 2, at EOF while trying to turn range: "[some guy's name]" \into code: sub {$ at [...]lib/HTML/Entities.pm line 457, <> line 68.
It was a one-line test, something like (from memory):
perl -n -MHTML::Entities -CIOE -e 'print encode_entities("[name]")'
:P
Based on
And I'll look at the PHP, thanks for the tip. Though I'm not likely to put a recompiled PHP into production for this project.
Le Sigh.
no subject
Date: Tuesday, 28 September 2010 09:17 pm (UTC)
no subject
Date: Tuesday, 28 September 2010 09:34 pm (UTC)
What I ended up doing was
mb_convert_encoding($element,'HTML-ENTITIES','UTF-8');
which fit the bill exactly (it uses named entities where available, and numeric entities where names aren't available).
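For anyone landing here on a newer PHP: as far as I know, the 'HTML-ENTITIES' pseudo-encoding was deprecated in PHP 8.2, so the call above may emit a deprecation notice there. A numeric-only alternative, sketched below, uses mb_encode_numericentity(). It assumes the mbstring extension is loaded; the wrapper name and the sample string are only illustrative. Unlike the call above, it never produces named entities.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Hypothetical wrapper: encode every non-ASCII character as a decimal entity.
function to_numeric_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// Illustrative input containing ğ (U+011F).
echo to_numeric_entities("Beyo\xc4\x9flu"), "\n"; // Beyo&#287;lu
```

ASCII characters, including & and <, are left alone by this map, so escaping HTML specials is still a separate step, for example with htmlspecialchars().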
And hey! Your profile builds cleanly. And is republished with a ğ where it belongs.
no subject
Date: Tuesday, 28 September 2010 12:50 pm (UTC)
no subject
Date: Tuesday, 28 September 2010 05:42 pm (UTC)