[personal profile] da
I'm having a devil of a time with some PHP that does Unicode processing and display.

Do you know how to turn the Unicode representation of "Ğ" into its HTML entity (&#286;)? The character is in ISO-8859-9 (Latin-5, the Turkish set), and it's one of a few I'm having trouble converting, because they're not in Latin-1. And heaven forfend we actually want to use those other characters.

You'd *think* (or at least I would think) htmlentities() would do the job, but no: after htmlentities($string, ENT_QUOTES, 'UTF-8') the character is still the raw Unicode character, not an entity. get_html_translation_table(HTML_ENTITIES) suggests htmlentities() only has about 100 mappings, which is a disappointment.

I've browsed all sorts of PHP and Perl docs, as well as straight references for ISO-8859-*, including some which say "here are the HTML mappings for a number of UTF-8 characters", but I haven't found an anywhere-near-useful set of UTF-8 to HTML entity mappings.

This seems like a bug.

Halp?

[Edit to add: I found this, which is Perl to convert entities to LaTeX; maybe I need to hack that up to produce a simple array myself? Hm.]

Date: Tuesday, 28 September 2010 12:49 pm (UTC)
From: [identity profile] da-lj.livejournal.com
So close... I tried HTML::Entities, but the simple test I ran gave me this fun:

Missing right curly or square bracket at (eval 1) line 2, at end of line syntax error at (eval 1) line 2, at EOF while trying to turn range: "[some guy's name]" into code: sub {$ at [...]lib/HTML/Entities.pm line 457, <> line 68.

It was a one-line test, something like (from memory):

perl -n -MHTML::Entities -CIOE -e 'print encode_entities("[name]")'

:P

Based on [livejournal.com profile] cypherpunk's suggestion that producing the #-sequence is just bit fiddling, maybe I can find it in the source of HTML::Entities and rip that part out.

And I'll look at the PHP, thanks for the tip. Though I'm not likely to put a recompiled PHP into production for this project.

Le Sigh.

Date: Tuesday, 28 September 2010 09:17 pm (UTC)
From: [identity profile] cypherpunk95.livejournal.com
http://en.wikipedia.org/wiki/UTF-8 shows how to convert "encoded bytes" into a number in the "Unicode range". The number after the &# is just that number expressed in decimal.
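
For anyone following along, that bit fiddling might look roughly like this in PHP. This is only a sketch: the function names utf8_decode_at() and utf8_to_numeric_entities() are made up for illustration, and it assumes the input is well-formed UTF-8 (no checks for truncated or overlong sequences). It also leaves ASCII alone, so characters like < and & would still need htmlspecialchars() or similar.

```php
<?php
// Hypothetical helper (not a PHP built-in): decode the UTF-8 sequence that
// starts at byte offset $i and return array(code point, bytes consumed).
// Assumes well-formed UTF-8; performs no validation.
function utf8_decode_at($s, $i)
{
    $b = ord($s[$i]);
    if ($b < 0x80) {                   // 0xxxxxxx: plain ASCII, 1 byte
        return array($b, 1);
    }
    if (($b & 0xE0) === 0xC0) {        // 110xxxxx: lead byte of 2
        $cp = $b & 0x1F;
        $len = 2;
    } elseif (($b & 0xF0) === 0xE0) {  // 1110xxxx: lead byte of 3
        $cp = $b & 0x0F;
        $len = 3;
    } else {                           // 11110xxx: lead byte of 4
        $cp = $b & 0x07;
        $len = 4;
    }
    // Each continuation byte (10xxxxxx) contributes its low 6 bits.
    for ($k = 1; $k < $len; $k++) {
        $cp = ($cp << 6) | (ord($s[$i + $k]) & 0x3F);
    }
    return array($cp, $len);
}

// Hypothetical helper: replace every non-ASCII character with &#N;,
// where N is the code point written in decimal.
function utf8_to_numeric_entities($s)
{
    $out = '';
    $n = strlen($s);
    for ($i = 0; $i < $n; ) {
        list($cp, $len) = utf8_decode_at($s, $i);
        $out .= ($cp < 0x80) ? $s[$i] : '&#' . $cp . ';';
        $i += $len;
    }
    return $out;
}

// "Ğ" is U+011E, which UTF-8 encodes as the two bytes C4 9E.
echo utf8_to_numeric_entities("\xC4\x9E"), "\n"; // &#286;
```

Working it by hand for Ğ: 0xC4 & 0x1F is 4, and 0x9E & 0x3F is 30, so the code point is 4 * 64 + 30 = 286.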

Date: Tuesday, 28 September 2010 09:34 pm (UTC)
From: [identity profile] da-lj.livejournal.com
Ah, I see that now.

What I ended up doing was

mb_convert_encoding($element,'HTML-ENTITIES','UTF-8');

which fit the bill exactly (it uses named entities where available, and numeric entities where names aren't available).
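
A note for anyone finding this later: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. If named entities aren't needed, one possible substitute is mb_encode_numericentity(), sketched below. The sample string and the choice of convert map (everything from U+0080 up) are my own assumptions, and unlike the call above this only ever produces numeric entities.

```php
<?php
// Requires the mbstring extension.
// Example data: "Yağmur", with ğ (U+011F) written out as its UTF-8 bytes.
$element = "Ya\xC4\x9Fmur";

// Convert map: array(first code point, last code point, offset, mask).
// This one covers every code point above ASCII and leaves the value unchanged.
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);

echo mb_encode_numericentity($element, $map, 'UTF-8'), "\n"; // Ya&#287;mur
```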

And hey! Your profile builds cleanly. And is republished with a &#287; where it belongs.
