da | Any Unicode-savvy php (or perl) programmers around?

I'm having a devil of a time with some php to do unicode processing and display.

Do you know how to turn the Unicode representation of "Ğ" into its HTML entity (&#286;)? The character is from ISO-8859-9 (Latin-5, the Turkish set), and it's one of a few I'm having trouble converting, because they're not in Latin-1. And heaven forfend we actually want to use those other characters.

You'd *think* (or at least I would think) htmlentities() would do the job, but htmlentities($string, ENT_QUOTES, 'UTF-8') leaves the character as the same Unicode string. get_html_translation_table(HTML_ENTITIES) suggests htmlentities() only has about 100 mappings, which is a disappointment.
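A minimal sketch of the check described above, assuming PHP 7 or later with the default (HTML 4.01) document type. The variable names are only illustrative, and the size of the table depends on the PHP version, so no count is claimed here.

```php
<?php
// "Ğ" is U+011E; written as raw UTF-8 bytes so the file's own encoding does not matter.
$s = "\xC4\x9E";

// With the default document type there is no named entity for this character,
// so htmlentities() hands it back untouched.
$plain = htmlentities($s, ENT_QUOTES, 'UTF-8');

// Look at the table htmlentities() draws on and see whether the character is in it.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
$known = array_key_exists($s, $table);

echo $plain === $s ? "unchanged\n" : "converted\n";
echo count($table), " mappings; G-breve present: ", var_export($known, true), "\n";
```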

I've browsed all sorts of PHP and perl docs, as well as straight references for ISO-8859-*, including some which say "here are the HTML mappings for a number of UTF-8 characters" - but I haven't found an anywhere-near-useful set of UTF-8 to HTML entity mappings.

This seems like a bug.

Halp?

[Edit to add: I found this, which is perl to convert entities to LaTeX, and maybe I need to hack that up to produce a simple array myself?... Hm.]

From:

da-lj.livejournal.com

So close... I tried HTML::Entities, but the simple test I ran gave me this fun:

Missing right curly or square bracket at (eval 1) line 2, at end of line syntax error at (eval 1) line 2, at EOF while trying to turn range: "[some guy's name]" into code: sub {$ at [...]lib/HTML/Entities.pm line 457, <> line 68.

It was a one-line test, something like (from memory):

perl -n -MHTML::Entities -CIOE -e 'print encode_entities("[name]")'

:P

Based on cypherpunk's suggestion that producing the #-sequence is just bit fiddling, maybe I can find it in the source of HTML::Entities and rip that part out.

And I'll look at the PHP, thanks for the tip. Though I'm not likely to put a recompiled PHP into production for this project.

Le Sigh.

cypherpunk95.livejournal.com

http://en.wikipedia.org/wiki/UTF-8 shows how to convert "encoded bytes" into a number in the "Unicode range". The number after the &# is just that number expressed in decimal.
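The bit fiddling might look roughly like this in PHP. This is only a sketch: utf8_code_point() and numeric_entity() are made-up helper names, and the code assumes it is handed exactly one well-formed UTF-8 character (no validation), on PHP 7 or later.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character into its code point.
// Assumes $bytes holds a single, well-formed UTF-8 sequence.
function utf8_code_point(string $bytes): int
{
    $lead = ord($bytes[0]);
    if ($lead < 0x80) {                  // 0xxxxxxx: plain ASCII
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {       // 110xxxxx: two-byte sequence
        $cp = $lead & 0x1F;
        $extra = 1;
    } elseif (($lead & 0xF0) === 0xE0) { // 1110xxxx: three-byte sequence
        $cp = $lead & 0x0F;
        $extra = 2;
    } else {                             // 11110xxx: four-byte sequence
        $cp = $lead & 0x07;
        $extra = 3;
    }
    for ($i = 1; $i <= $extra; $i++) {
        // Each continuation byte (10xxxxxx) contributes six more bits.
        $cp = ($cp << 6) | (ord($bytes[$i]) & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: the decimal code point wrapped as a numeric entity.
function numeric_entity(string $bytes): string
{
    return '&#' . utf8_code_point($bytes) . ';';
}

echo numeric_entity("\xC4\x9E"), "\n"; // bytes C4 9E are U+011E, so this prints &#286;
```

For C4 9E: the lead byte keeps its low five bits (00100), the continuation byte keeps its low six (011110), and 00100 011110 in binary is 286.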

Ah, I see that now.

What I ended up doing was

mb_convert_encoding($element,'HTML-ENTITIES','UTF-8');

which fit the bill exactly (it uses named entities where available, and numeric entities where names aren't available).
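For anyone on a newer PHP: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so here is a sketch of a different route, mb_encode_numericentity(), which is not the call I used above. It needs the mbstring extension and only ever produces numeric entities, never named ones. The wrapper name to_numeric_entities() is made up for the example.

```php
<?php
// Hypothetical wrapper name; mb_encode_numericentity() is the real mbstring call.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything above ASCII is converted; ASCII itself is left alone.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo to_numeric_entities("a\xC4\x9Fb"), "\n"; // ğ (U+011F) between two ASCII letters
```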

And hey! Your profile builds cleanly. And is republished with a ğ where it belongs.

Mambo Taxi

Daniel Allen's Journal


December 2024
