PHP
Last updated: Fri, 10 Oct 2008

utf8_decode

(PHP 4, PHP 5)

utf8_decode: Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1

Description

string utf8_decode ( string $data )

This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1.

Parameters

data

A UTF-8 encoded string.

Return Values

Returns the ISO-8859-1 translation of data.
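
As a rough illustration of the behaviour described above (a sketch with made-up sample strings, not an example taken from the manual page): characters that exist in ISO-8859-1 shrink to one byte each, while a character outside that range comes back as a question mark.

```php
<?php
// "café" encoded as UTF-8: the é takes two bytes (0xC3 0xA9).
$utf8   = "caf\xC3\xA9";
$latin1 = utf8_decode($utf8);

echo strlen($utf8), "\n";     // 5
echo strlen($latin1), "\n";   // 4
echo bin2hex($latin1), "\n";  // 636166e9

// The euro sign (U+20AC) has no ISO-8859-1 equivalent.
echo utf8_decode("\xE2\x82\xAC"), "\n"; // ?
```

Because the substitution is silent, it may be worth comparing lengths or searching for "?" when lossless conversion matters.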

See Also



User Contributed Notes
utf8_decode
Blackbit
12-Aug-2008 01:40
Squirrelmail contains a nice function in the sources to convert unicode to entities:

<?php
function charset_decode_utf_8 ($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
        return $string;

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
    "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
    $string);

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
    "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
    $string);

    return $string;
}
?>
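
The function above relies on ereg() and the /e modifier, both of which were removed in PHP 7. A minimal sketch of the same idea using preg_replace_callback() follows. It assumes PHP 5.3 or later for the closure, the name utf8_to_numeric_entities is hypothetical and not part of SquirrelMail, and it also covers 4-byte sequences. It does not validate its input, so overlong or otherwise malformed sequences are converted as they are.

```php
<?php
// Hypothetical helper name; not part of SquirrelMail.
function utf8_to_numeric_entities($string) {
    return preg_replace_callback(
        '/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/',
        function ($m) {
            $bytes = array_map('ord', str_split($m[0]));
            // Strip the length prefix from the lead byte:
            // 110xxxxx, 1110xxxx or 11110xxx.
            $masks = array(2 => 0x1F, 3 => 0x0F, 4 => 0x07);
            $cp = $bytes[0] & $masks[count($bytes)];
            // Each continuation byte (10xxxxxx) contributes six more bits.
            for ($i = 1; $i < count($bytes); $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $string
    );
}

echo utf8_to_numeric_entities("a \xC4\x8D"), "\n"; // a &#269;
```

Plain ASCII passes through untouched, because the pattern only matches multi-byte lead bytes.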
punchivan at gmail dot com
24-Jun-2008 11:13
Hey! The bug is not in the function 'utf8_decode'. The bug is in the function 'mb_detect_encoding'. If you pass a word that ends with a special character, like 'accentué', you get a wrong result (UTF-8), but if another character follows it, as in 'accentuée', you get the right one. So you should always append an ISO-8859-1 character to your string for this check. My advice is to use a blank space.
I've tried it and it works!

function ISO_convert($array)
{
    $array_temp = array();
    
    foreach($array as $name => $value)
    {
        if(is_array($value))
          $array_temp[(mb_detect_encoding($name." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($name) : $name )] = ISO_convert($value);
        else
          $array_temp[(mb_detect_encoding($name." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($name) : $name )] = (mb_detect_encoding($value." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($value) : $value );
    }

    return $array_temp;
}
webmaster at lapstore dot de
19-Jun-2008 05:11
Warning!
This function poses a possible security risk when you convert escaped strings (see addslashes() and related functions).
It reacts badly to broken multibyte sequences. In UTF-8, follow-up bytes ALWAYS have the binary pattern 10xxxxxx, but utf8_decode does not handle this the way you would expect: if you pass a start byte (110xxxxx, 1110xxxx, 11110xxx, or even an invalid one like 11111100) followed by one or more single-byte characters (0xxxxxxx), the start byte is replaced by '?' (0x3F) and up to three of the following characters disappear, even though they are single-byte characters (0xxxxxxx). So if you escape a string with a typical escape character such as the backslash, you would expect your escaping to survive a call to utf8_decode because the escape character is in the supposedly safe ASCII range 0-127, but that is NOT the case!
Try things like utf8_decode("test: ü\\\"123456") to check it out.
To avoid problems, make sure that string escaping is always the last step of data manipulation when you depend on leak-proof escaping.
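
One way to guard against the broken sequences described above is to reject malformed input before decoding. The sketch below is an assumption-laden illustration, not part of the note: the helper names are hypothetical, and it relies on preg_match() returning false when the /u modifier is given a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function looks_like_valid_utf8($s) {
    // With the /u modifier PCRE checks the subject first; on malformed
    // input preg_match() returns false instead of 1.
    return preg_match('//u', $s) === 1;
}

// Hypothetical wrapper: decode only when the input is well-formed.
function safe_utf8_decode($s) {
    return looks_like_valid_utf8($s) ? utf8_decode($s) : false;
}

// A lone start byte followed by an escaped quote is refused outright.
var_dump(safe_utf8_decode("test: \xC3\\\"123456")); // bool(false)
```

Returning false leaves the decision about malformed input (reject, log, or repair) to the caller instead of silently dropping characters.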
juantxito at example dot com
28-May-2008 10:26
To decode the values and the keys, I think this would work:

function utf8_array_decode($input){
    $return = array();

    foreach ($input as $key => $val) {
        $k = utf8_decode($key);
        $return[$k] = utf8_decode($val);
    }
    return $return;
}
phpnet at freshsite dot de
09-May-2008 10:08
I didn't find an utf8_array_decode. This one only decodes the values, not the keys.

        function utf8_array_decode($input){
            foreach ($input as $key => $val) {
                $input[$key] = utf8_decode($val);
            }
            return $input;           
        }
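
The array helpers in these notes handle flat arrays of strings. A possible recursive variant, which also decodes keys and leaves non-string values alone, might look like the following. The function name is hypothetical and the sketch assumes that every string in the array really is UTF-8.

```php
<?php
// Hypothetical helper; decodes string keys and string values at every depth.
function utf8_array_decode_recursive($input) {
    $out = array();
    foreach ($input as $key => $val) {
        // Integer keys are kept as they are.
        $k = is_string($key) ? utf8_decode($key) : $key;
        if (is_array($val)) {
            $out[$k] = utf8_array_decode_recursive($val);
        } elseif (is_string($val)) {
            $out[$k] = utf8_decode($val);
        } else {
            $out[$k] = $val; // ints, bools, null and objects are left untouched
        }
    }
    return $out;
}
```

Note that two different UTF-8 keys can collapse into the same ISO-8859-1 key (for example when both contain characters that become "?"), in which case the later entry wins.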
haugas at gmail dot com
08-May-2008 11:11
If you don't know exactly how many times your string has been encoded, you can use this function:

<?php

function _utf8_decode($string)
{
  $tmp = $string;
  $count = 0;
  while (mb_detect_encoding($tmp) == "UTF-8")
  {
    $tmp = utf8_decode($tmp);
    $count++;
  }

  for ($i = 0; $i < $count - 1; $i++)
  {
    $string = utf8_decode($string);
  }
  return $string;
}

?>
lukasz dot mlodzik at gmail dot com
05-Mar-2008 01:46
Update to MARC13's function utf2iso().
I'm using it to handle AJAX POST calls.
Despite using
http.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded; charset=utf-8');
it still encodes Polish letters using UTF-16.

This is only for Polish letters:

<?php
function utf16_2_utf8 ($nowytekst) {
    $nowytekst = str_replace('%u0104','Ą',$nowytekst);    //Ą
    $nowytekst = str_replace('%u0106','Ć',$nowytekst);    //Ć
    $nowytekst = str_replace('%u0118','Ę',$nowytekst);    //Ę
    $nowytekst = str_replace('%u0141','Ł',$nowytekst);    //Ł
    $nowytekst = str_replace('%u0143','Ń',$nowytekst);    //Ń
    $nowytekst = str_replace('%u00D3','Ó',$nowytekst);    //Ó
    $nowytekst = str_replace('%u015A','Ś',$nowytekst);    //Ś
    $nowytekst = str_replace('%u0179','Ź',$nowytekst);    //Ź
    $nowytekst = str_replace('%u017B','Ż',$nowytekst);    //Ż

    $nowytekst = str_replace('%u0105','ą',$nowytekst);    //ą
    $nowytekst = str_replace('%u0107','ć',$nowytekst);    //ć
    $nowytekst = str_replace('%u0119','ę',$nowytekst);    //ę
    $nowytekst = str_replace('%u0142','ł',$nowytekst);    //ł
    $nowytekst = str_replace('%u0144','ń',$nowytekst);    //ń
    $nowytekst = str_replace('%u00F3','ó',$nowytekst);    //ó
    $nowytekst = str_replace('%u015B','ś',$nowytekst);    //ś
    $nowytekst = str_replace('%u017A','ź',$nowytekst);    //ź
    $nowytekst = str_replace('%u017C','ż',$nowytekst);    //ż

    return ($nowytekst);
}
?>

Everything goes smoothly, except that it doesn't convert '%u00D3' to 'Ó' or '%u00F3' to 'ó'. I have no idea what to do about that.

Remember! File must be saved in UTF-8 coding.
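
A table of literal characters only works when the source file is saved in the right encoding. As a hedged alternative, the %uXXXX escapes can be decoded arithmetically. The sketch below uses hypothetical function names, assumes PHP 5.3 or later for the closure, and handles only code points in the Basic Multilingual Plane (no surrogate pairs). As a side note, JavaScript's escape() writes code points below 256 as %XX and only higher ones as %uXXXX, which may be why the Ó and ó replacements above never match.

```php
<?php
// Hypothetical helper: encode one code point (up to U+FFFF) as UTF-8.
function codepoint_to_utf8($cp) {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace every %uXXXX escape with its UTF-8 bytes.
function decode_percent_u($text) {
    return preg_replace_callback(
        '/%u([0-9A-Fa-f]{4})/',
        function ($m) {
            return codepoint_to_utf8(hexdec($m[1]));
        },
        $text
    );
}

echo bin2hex(decode_percent_u('%u0141')), "\n"; // c581
```

This version does not depend on the encoding of the PHP file, since no non-ASCII literals appear in it.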
tacchete at gmail dot com
13-Dec-2007 01:36
Known problem with the Byte Order Mark (BOM) and header() in the pages of a site.

It shows up, for example, when sending headers, or when producing dynamic output in an encoding other than UTF-8 by means of XSLT (<xsl:output encoding="windows-1251"/>).

To clean all BOM characters from the text of the page:

1. remove the BOM from the main file;
2. write an output-buffer callback function:

<?php
header('content-type: text/html; charset=utf-8');
ob_start('ob');
function ob($buffer)
{
    return str_replace("\xef\xbb\xbf", '', $buffer);
}
?>

this strips the BOM from the output of the included files;
3. stop worrying about BOMs in included files;
4. enjoy.
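
The callback above removes the byte sequence EF BB BF wherever it occurs in the output. If only a BOM at the very start of a string should go, for example when reading a single file, a narrower sketch could look like this (the function name is hypothetical):

```php
<?php
// Hypothetical helper: drop a UTF-8 BOM only when it is the first three bytes.
function strip_leading_bom($text) {
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        // The cast covers older PHP versions where substr() may return false.
        return (string) substr($text, 3);
    }
    return $text;
}

echo strip_leading_bom("\xEF\xBB\xBFhello"), "\n"; // hello
```

This leaves any later occurrence of the same three bytes in place, which is the main difference from the str_replace() approach.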
ludvig dot ericson at gmail dot com
15-Jul-2007 07:52
A better way to convert would be to use iconv, see http://www.php.net/iconv -- example:

<?php
$myUnicodeString = "Åäö";
echo iconv("UTF-8", "ISO-8859-1", $myUnicodeString);
?>

The above echoes the given variable in ISO-8859-1 encoding; you may replace the target with whatever encoding you prefer.

Another solution to the issue of misdisplayed glyphs is to simply send the document as UTF-8, and of course send UTF-8 data:

<?php
# Replace text/html with whatever MIME-type you prefer.
header("Content-Type: text/html; charset=utf-8");
?>
MARC13
07-Jul-2007 04:50
I wrote this function to convert data from an AJAX call before inserting it into my database.
It converts UTF-8 from XMLHttpRequest() to ISO-8859-2, which I use in a LATIN2 MySQL database.

<?php
function utf2iso($tekst)
{
    $nowytekst = str_replace("%u0104","\xA1",$tekst);        //Ą
    $nowytekst = str_replace("%u0106","\xC6",$nowytekst);    //Ć
    $nowytekst = str_replace("%u0118","\xCA",$nowytekst);    //Ę
    $nowytekst = str_replace("%u0141","\xA3",$nowytekst);    //Ł
    $nowytekst = str_replace("%u0143","\xD1",$nowytekst);    //Ń
    $nowytekst = str_replace("%u00D3","\xD3",$nowytekst);    //Ó
    $nowytekst = str_replace("%u015A","\xA6",$nowytekst);    //Ś
    $nowytekst = str_replace("%u0179","\xAC",$nowytekst);    //Ź
    $nowytekst = str_replace("%u017B","\xAF",$nowytekst);    //Ż

    $nowytekst = str_replace("%u0105","\xB1",$nowytekst);    //ą
    $nowytekst = str_replace("%u0107","\xE6",$nowytekst);    //ć
    $nowytekst = str_replace("%u0119","\xEA",$nowytekst);    //ę
    $nowytekst = str_replace("%u0142","\xB3",$nowytekst);    //ł
    $nowytekst = str_replace("%u0144","\xF1",$nowytekst);    //ń
    $nowytekst = str_replace("%u00F3","\xF3",$nowytekst);    //ó
    $nowytekst = str_replace("%u015B","\xB6",$nowytekst);    //ś
    $nowytekst = str_replace("%u017A","\xBC",$nowytekst);    //ź
    $nowytekst = str_replace("%u017C","\xBF",$nowytekst);    //ż

    return ($nowytekst);
}
?>

In my case the code file that deals with AJAX calls must also be saved in UTF-8 encoding.
visus at portsonline dot net
23-Jun-2007 06:08
The following code helped me with mixed (UTF8+ISO-8859-1(x)) encodings. In this case, I have template files made and maintained by designers who do not care about encoding, and MySQL data in utf8_binary_ci encoded tables.

<?php

class Helper
{
    function strSplit($text, $split = 1)
    {
        if (!is_string($text)) return false;
        if (!is_numeric($split) || $split < 1) return false;

        $len = strlen($text);
        $array = array();
        $i = 0;

        while ($i < $len)
        {
            $key = NULL;

            for ($j = 0; $j < $split; $j += 1)
            {
                $key .= $text[$i];
                $i += 1;
            }

            $array[] = $key;
        }

        return $array;
    }

    function UTF8ToHTML($str)
    {
        $search = array();
        $search[] = "/([\\xC0-\\xF7]{1,1}[\\x80-\\xBF]+)/e";
        $search[] = "/&#228;/";
        $search[] = "/&#246;/";
        $search[] = "/&#252;/";
        $search[] = "/&#196;/";
        $search[] = "/&#214;/";
        $search[] = "/&#220;/";
        $search[] = "/&#223;/";

        $replace = array();
        $replace[] = 'Helper::_UTF8ToHTML("\\1")';
        $replace[] = "ä";
        $replace[] = "ö";
        $replace[] = "ü";
        $replace[] = "Ä";
        $replace[] = "Ö";
        $replace[] = "Ü";
        $replace[] = "ß";

        $str = preg_replace($search, $replace, $str);

        return $str;
    }

    function _UTF8ToHTML($str)
    {
        $ret = 0;

        foreach ((Helper::strSplit(strrev(chr((ord($str[0]) % 252 % 248 % 240 % 224 % 192) + 128).substr($str, 1)))) as $k => $v)
            $ret += (ord($v) % 128) * pow(64, $k);
        return "&#".$ret.";";
    }
}

// Usage example:

$tpl = file_get_contents("template.tpl");
/* ... */
$row = mysql_fetch_assoc($result);

print(Helper::UTF8ToHTML(str_replace("{VAR}", $row['var'], $tpl)));

?>
luka8088 at gmail dot com
22-Jun-2007 04:03
simple UTF-8 to HTML conversion:

function utf8_to_html ($data)
    {
    return preg_replace("/([\\xC0-\\xF7]{1,1}[\\x80-\\xBF]+)/e", '_utf8_to_html("\\1")', $data);
    }

function _utf8_to_html ($data)
    {
    $ret = 0;
    foreach((str_split(strrev(chr((ord($data{0}) % 252 % 248 % 240 % 224 % 192) + 128) . substr($data, 1)))) as $k => $v)
        $ret += (ord($v) % 128) * pow(64, $k);
    return "&#$ret;";
    }

Example:
echo utf8_to_html("a b č ć ž こ に ち わ ()[]{}!#$?*");

Output:
a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?*
Sadi
13-Jun-2007 12:38
Once again about Polish letters. If you use fananf's solution, make sure that the PHP file is encoded in cp1250, or else it won't work. It's quite obvious, but I spent some time before I finally figured that out, so I thought I'd post it here.
alexlevin at kvadro dot net
21-May-2007 05:20
If you are running Gentoo Linux and encounter problems with some PHP4 applications saying:
Call to undefined function: utf8_decode()
try re-emerging PHP4 with the 'expat' flag enabled.
ahmed dot adaileh at gmail dot com
08-Mar-2007 12:32
I searched a lot everywhere to find a suitable function which converts my UTF8 characters to the windows-1250 charset for Polish language, but couldn't find anything :(

Following is a function which does that:

function show_polish ($text) {
 $text = str_replace("\xC4\x84", '&#260;', $text); //Ą
 $text = str_replace("\xC4\x86", '&#262;', $text); //Ć
 $text = str_replace("\xC4\x98", '&#280;', $text); //Ę
 $text = str_replace("\xC5\x81", '&#321;', $text); //Ł
 $text = str_replace("\xC5\x83", '&#323;', $text); //Ń
 $text = str_replace("\xC3\x93", '&#211;', $text); //Ó
 $text = str_replace("\xC5\x9A", '&#346;', $text); //Ś
 $text = str_replace("\xC5\xB9", '&#377;', $text); //Ź
 $text = str_replace("\xC5\xBB", '&#379;', $text); //Ż
 $text = str_replace("\xC4\x85", '&#261;', $text); //ą
 $text = str_replace("\xC4\x87", '&#263;', $text); //ć
 $text = str_replace("\xC4\x99", '&#281;', $text); //ę
 $text = str_replace("\xC5\x82", '&#322;', $text); //ł
 $text = str_replace("\xC5\x84", '&#324;', $text); //ń
 $text = str_replace("\xC3\xB3", '&#243;', $text); //ó
 $text = str_replace("\xC5\x9B", '&#347;', $text); //ś
 $text = str_replace("\xC5\xBA", '&#378;', $text); //ź
 $text = str_replace("\xC5\xBC", '&#380;', $text); //ż
 
return $text;
}

You can refer to http://hermes.umcs.lublin.pl/~awmarcz/awm/info/pl-codes.htm
if you want to use HTML hex. code rather than HTML dec. code which I used in my function.
fananf at nerdshack dot com
05-Mar-2007 04:22
Comment on AJGOR's reply from 28-Dec-2006 02:38:

You used "ż" twice instead of "ź".

The correct code should be:

ISO version:

function utf82iso88592($text) {
 $text = str_replace("\xC4\x85", '±', $text);
 $text = str_replace("\xC4\x84", 'ˇ', $text);
 $text = str_replace("\xC4\x87", 'ć', $text);
 $text = str_replace("\xC4\x86", 'Ć', $text);
 $text = str_replace("\xC4\x99", 'ę', $text);
 $text = str_replace("\xC4\x98", 'Ę', $text);
 $text = str_replace("\xC5\x82", 'ł', $text);
 $text = str_replace("\xC5\x81", 'Ł', $text);
 $text = str_replace("\xC3\xB3", 'ó', $text);
 $text = str_replace("\xC3\x93", 'Ó', $text);
 $text = str_replace("\xC5\x9B", '¶', $text);
 $text = str_replace("\xC5\x9A", '¦', $text);
 $text = str_replace("\xC5\xBC", 'ż', $text);
 $text = str_replace("\xC5\xBB", 'Ż', $text);
 $text = str_replace("\xC5\xBA", 'Ľ', $text);
 $text = str_replace("\xC5\xB9", '¬', $text);
 $text = str_replace("\xc5\x84", 'ń', $text);
 $text = str_replace("\xc5\x83", 'Ń', $text);

return $text;
}

CP version:

function utf82iso88592($text) {
 $text = str_replace("\xC4\x85", 'ą', $text);
 $text = str_replace("\xC4\x84", 'Ą', $text);
 $text = str_replace("\xC4\x87", 'ć', $text);
 $text = str_replace("\xC4\x86", 'Ć', $text);
 $text = str_replace("\xC4\x99", 'ę', $text);
 $text = str_replace("\xC4\x98", 'Ę', $text);
 $text = str_replace("\xC5\x82", 'ł', $text);
 $text = str_replace("\xC5\x81", 'Ł', $text);
 $text = str_replace("\xC3\xB3", 'ó', $text);
 $text = str_replace("\xC3\x93", 'Ó', $text);
 $text = str_replace("\xC5\x9B", 'ś', $text);
 $text = str_replace("\xC5\x9A", 'Ś', $text);
 $text = str_replace("\xC5\xBC", 'ż', $text);
 $text = str_replace("\xC5\xBB", 'Ż', $text);
 $text = str_replace("\xC5\xBA", 'ź', $text);
 $text = str_replace("\xC5\xB9", 'Ź', $text);
 $text = str_replace("\xc5\x84", 'ń', $text);
 $text = str_replace("\xc5\x83", 'Ń', $text);

return $text;
}
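
Both versions above depend on the PHP file itself being saved in the target encoding. A sketch that avoids this by spelling the ISO-8859-2 target bytes as hex escapes is shown below. The function name is hypothetical, and the byte values follow the ISO-8859-2 code table (they agree with the values in MARC13's utf2iso() note).

```php
<?php
// Hypothetical name; maps the 18 Polish letters from UTF-8 to ISO-8859-2 bytes.
function utf8_to_latin2_polish($text) {
    static $map = array(
        "\xC4\x85" => "\xB1", "\xC4\x84" => "\xA1", // ą Ą
        "\xC4\x87" => "\xE6", "\xC4\x86" => "\xC6", // ć Ć
        "\xC4\x99" => "\xEA", "\xC4\x98" => "\xCA", // ę Ę
        "\xC5\x82" => "\xB3", "\xC5\x81" => "\xA3", // ł Ł
        "\xC5\x84" => "\xF1", "\xC5\x83" => "\xD1", // ń Ń
        "\xC3\xB3" => "\xF3", "\xC3\x93" => "\xD3", // ó Ó
        "\xC5\x9B" => "\xB6", "\xC5\x9A" => "\xA6", // ś Ś
        "\xC5\xBA" => "\xBC", "\xC5\xB9" => "\xAC", // ź Ź
        "\xC5\xBC" => "\xBF", "\xC5\xBB" => "\xAF", // ż Ż
    );
    // strtr() with an array replaces all keys in a single pass.
    return strtr($text, $map);
}

echo bin2hex(utf8_to_latin2_polish("\xC5\xBC\xC3\xB3\xC5\x82w")), "\n"; // bff3b377
```

Using one strtr() call instead of eighteen str_replace() calls also guarantees that the output of one replacement is never fed into another.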
sam
06-Feb-2007 05:20
In addition to yannikh's note, to convert a hex utf8 string

<?php

echo utf8_decode("\x61\xc3\xb6\x61");
// works as expected

$abc = "61c3b661";
$newstr = "";
$l = strlen($abc);
for ($i = 0; $i < $l; $i += 2) {
    $newstr .= "\x" . $abc[$i] . $abc[$i+1];
}
echo utf8_decode($newstr);
// or varieties  of "\x": "\\x" etc does NOT output what you want

echo utf8_decode(pack('H*', $abc));
// this outputs the correct string, like the first line.

?>
Ajgor
28-Dec-2006 02:38
Small upgrade for Polish decoding:

function utf82iso88592($text) {
 $text = str_replace("\xC4\x85", 'ą', $text);
 $text = str_replace("\xC4\x84", 'Ą', $text);
 $text = str_replace("\xC4\x87", 'ć', $text);
 $text = str_replace("\xC4\x86", 'Ć', $text);
 $text = str_replace("\xC4\x99", 'ę', $text);
 $text = str_replace("\xC4\x98", 'Ę', $text);
 $text = str_replace("\xC5\x82", 'ł', $text);
 $text = str_replace("\xC5\x81", 'Ł', $text);
 $text = str_replace("\xC3\xB3", 'ó', $text);
 $text = str_replace("\xC3\x93", 'Ó', $text);
 $text = str_replace("\xC5\x9B", 'ś', $text);
 $text = str_replace("\xC5\x9A", 'Ś', $text);
 $text = str_replace("\xC5\xBC", 'ż', $text);
 $text = str_replace("\xC5\xBB", 'Ż', $text);
 $text = str_replace("\xC5\xBA", 'ż', $text);
 $text = str_replace("\xC5\xB9", 'Ż', $text);
 $text = str_replace("\xc5\x84", 'ń', $text);
 $text = str_replace("\xc5\x83", 'Ń', $text);

return $text;
} // utf82iso88592
paul.hayes at entropedia.co.uk
27-Oct-2006 04:10
I noticed that the UTF-8 to HTML functions below only handle 2-byte codes. I wanted 3-byte support (sorry, I haven't done 4, 5 or 6). I also noticed that the concatenation of the character codes did have the hex prefix 0x and so failed with the large 2-byte codes.

<?php
function utf2html (&$str) {
    $ret = "";
    $max = strlen($str);
    $last = 0; // keeps the index of the last regular character
    for ($i = 0; $i < $max; $i++) {
        $c = $str[$i];
        $c1 = ord($c);
        if ($c1 >> 5 == 6) { // 110x xxxx, 110 prefix for 2 bytes unicode
            $ret .= substr($str, $last, $i - $last); // append all the regular characters we've passed
            $c1 &= 31; // remove the 3 bit two bytes prefix
            $c2 = ord($str[++$i]); // the next byte
            $c2 &= 63; // remove the 2 bit trailing byte prefix
            $c2 |= (($c1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
            $c1 >>= 2; // c1 shifts 2 to the right
            $ret .= "&#" . ($c1 * 0x100 + $c2) . ";"; // this is the fastest string concatenation
            $last = $i + 1;
        }
        elseif ($c1 >> 4 == 14) { // 1110 xxxx, 1110 prefix for 3 bytes unicode
            $ret .= substr($str, $last, $i - $last); // append all the regular characters we've passed
            $c2 = ord($str[++$i]); // the next byte
            $c3 = ord($str[++$i]); // the third byte
            $c1 &= 15; // remove the 4 bit three bytes prefix
            $c2 &= 63; // remove the 2 bit trailing byte prefix
            $c3 &= 63; // remove the 2 bit trailing byte prefix
            $c3 |= (($c2 & 3) << 6); // last 2 bits of c2 become first 2 of c3
            $c2 >>= 2; // c2 shifts 2 to the right
            $c2 |= (($c1 & 15) << 4); // last 4 bits of c1 become first 4 of c2
            $c1 >>= 4; // c1 shifts 4 to the right
            $ret .= '&#' . (($c1 * 0x10000) + ($c2 * 0x100) + $c3) . ';'; // this is the fastest string concatenation
            $last = $i + 1;
        }
    }
    $str = $ret . substr($str, $last, $i); // append the last batch of regular characters
}
?>
tobias at code-x dot de
20-Oct-2006 02:13
Converting a numeric HTML entity such as &#301; to UTF-8:

<?php
function uft8html2utf8( $s ) {
    if ( !function_exists('uft8html2utf8_callback') ) {
        function uft8html2utf8_callback($t) {
            $dec = $t[1];
            if ($dec < 128) {
                $utf = chr($dec);
            } else if ($dec < 2048) {
                $utf = chr(192 + (($dec - ($dec % 64)) / 64));
                $utf .= chr(128 + ($dec % 64));
            } else {
                $utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
                $utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
                $utf .= chr(128 + ($dec % 64));
            }
            return $utf;
        }
    }
    return preg_replace_callback('|&#([0-9]{1,});|', 'uft8html2utf8_callback', $s );
}
echo uft8html2utf8('test: &#301;');
?>
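
On PHP 5 and later the built-in html_entity_decode() can probably do the same job when given 'UTF-8' as the target charset. The following is a sketch for comparison, not a drop-in replacement: unlike the function above it also decodes named entities such as &amp;amp;.

```php
<?php
// ENT_QUOTES also decodes &#039; and &quot;; 'UTF-8' selects the output encoding.
$utf8 = html_entity_decode('test: &#301;', ENT_QUOTES, 'UTF-8');

// U+012D is encoded in UTF-8 as the two bytes C4 AD.
echo bin2hex($utf8), "\n"; // 746573743a20c4ad
```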
e dot panzyk at panzyk dot net
04-Oct-2006 11:38
enhanced UTF8-Decoder

After noticing that utf8_decode converts some French characters to "?", I ended up with this function.

The trailing space is needed when a string ends with a converted character (a buggy PHP function appends a hex 00 character at the end).

function utf8dec ( $s_String )
    {
    $s_String = html_entity_decode(htmlentities($s_String." ", ENT_COMPAT, 'UTF-8'));
    return substr($s_String, 0, strlen($s_String)-1);
    }

Hope it helps ... cost me a lot of time ...
michael at calwell-computing dot co dot uk
04-Oct-2006 10:15
I found that trying to put Javascript strings into a pre MySQL 4.0 database was creating problems with strange chars in the database. Closer inspection revealed that utf8 is the default character set for Javascript, which cannot be handled by the db. This function was invaluable.
ethaizone [AT] hotmail [DOT] com
02-Aug-2006 07:11
I use this function to convert UTF-8 to Thai (ISO-8859-11).
It is derived from the iso8859_11toUTF8 function [Suttichai Mesaard-www.ceforce.com] on the utf8_encode page.

It is useful for translating a string from mod_rewrite into a real URL.
I use it to build SEO URLs in Thai.

function UTF8toiso8859_11($string) {
 
     if ( ! ereg("[\241-\377]", $string) )
         return $string;
 
     $UTF8 = array(
"\xe0\xb8\x81" => "\xa1",
"\xe0\xb8\x82" => "\xa2",
"\xe0\xb8\x83" => "\xa3",
"\xe0\xb8\x84" => "\xa4",
"\xe0\xb8\x85" => "\xa5",
"\xe0\xb8\x86" => "\xa6",
"\xe0\xb8\x87" => "\xa7",
"\xe0\xb8\x88" => "\xa8",
"\xe0\xb8\x89" => "\xa9",
"\xe0\xb8\x8a" => "\xaa",
"\xe0\xb8\x8b" => "\xab",
"\xe0\xb8\x8c" => "\xac",
"\xe0\xb8\x8d" => "\xad",
"\xe0\xb8\x8e" => "\xae",
"\xe0\xb8\x8f" => "\xaf",
"\xe0\xb8\x90" => "\xb0",
"\xe0\xb8\x91" => "\xb1",
"\xe0\xb8\x92" => "\xb2",
"\xe0\xb8\x93" => "\xb3",
"\xe0\xb8\x94" => "\xb4",
"\xe0\xb8\x95" => "\xb5",
"\xe0\xb8\x96" => "\xb6",
"\xe0\xb8\x97" => "\xb7",
"\xe0\xb8\x98" => "\xb8",
"\xe0\xb8\x99" => "\xb9",
"\xe0\xb8\x9a" => "\xba",
"\xe0\xb8\x9b" => "\xbb",
"\xe0\xb8\x9c" => "\xbc",
"\xe0\xb8\x9d" => "\xbd",
"\xe0\xb8\x9e" => "\xbe",
"\xe0\xb8\x9f" => "\xbf",
"\xe0\xb8\xa0" => "\xc0",
"\xe0\xb8\xa1" => "\xc1",
"\xe0\xb8\xa2" => "\xc2",
"\xe0\xb8\xa3" => "\xc3",
"\xe0\xb8\xa4" => "\xc4",
"\xe0\xb8\xa5" => "\xc5",
"\xe0\xb8\xa6" => "\xc6",
"\xe0\xb8\xa7" => "\xc7",
"\xe0\xb8\xa8" => "\xc8",
"\xe0\xb8\xa9" => "\xc9",
"\xe0\xb8\xaa" => "\xca",
"\xe0\xb8\xab" => "\xcb",
"\xe0\xb8\xac" => "\xcc",
"\xe0\xb8\xad" => "\xcd",
"\xe0\xb8\xae" => "\xce",
"\xe0\xb8\xaf" => "\xcf",
"\xe0\xb8\xb0" => "\xd0",
"\xe0\xb8\xb1" => "\xd1",
"\xe0\xb8\xb2" => "\xd2",
"\xe0\xb8\xb3" => "\xd3",
"\xe0\xb8\xb4" => "\xd4",
"\xe0\xb8\xb5" => "\xd5",
"\xe0\xb8\xb6" => "\xd6",
"\xe0\xb8\xb7" => "\xd7",
"\xe0\xb8\xb8" => "\xd8",
"\xe0\xb8\xb9" => "\xd9",
"\xe0\xb8\xba" => "\xda",
"\xe0\xb8\xbf" => "\xdf",
"\xe0\xb9\x80" => "\xe0",
"\xe0\xb9\x81" => "\xe1",
"\xe0\xb9\x82" => "\xe2",
"\xe0\xb9\x83" => "\xe3",
"\xe0\xb9\x84" => "\xe4",
"\xe0\xb9\x85" => "\xe5",
"\xe0\xb9\x86" => "\xe6",
"\xe0\xb9\x87" => "\xe7",
"\xe0\xb9\x88" => "\xe8",
"\xe0\xb9\x89" => "\xe9",
"\xe0\xb9\x8a" => "\xea",
"\xe0\xb9\x8b" => "\xeb",
"\xe0\xb9\x8c" => "\xec",
"\xe0\xb9\x8d" => "\xed",
"\xe0\xb9\x8e" => "\xee",
"\xe0\xb9\x8f" => "\xef",
"\xe0\xb9\x90" => "\xf0",
"\xe0\xb9\x91" => "\xf1",
"\xe0\xb9\x92" => "\xf2",
"\xe0\xb9\x93" => "\xf3",
"\xe0\xb9\x94" => "\xf4",
"\xe0\xb9\x95" => "\xf5",
"\xe0\xb9\x96" => "\xf6",
"\xe0\xb9\x97" => "\xf7",
"\xe0\xb9\x98" => "\xf8",
"\xe0\xb9\x99" => "\xf9",
"\xe0\xb9\x9a" => "\xfa",
"\xe0\xb9\x9b" => "\xfb",
 );
 
     $string=strtr($string,$UTF8);
     return $string;
 }

Jo, EThaiZone.Com
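
The lookup table above follows a regular pattern: every Thai code point U+0E01 to U+0E3A and U+0E3F to U+0E5B lands on the ISO-8859-11 byte obtained by subtracting 0x0E00 and adding 0xA0. Under that assumption the table can be replaced by arithmetic. The sketch below uses a hypothetical function name and assumes PHP 5.3 or later for the closure.

```php
<?php
// Hypothetical alternative to the lookup table above.
function utf8_to_iso8859_11($string) {
    return preg_replace_callback(
        '/\xE0([\xB8\xB9])([\x80-\xBF])/',
        function ($m) {
            // Rebuild the code point from the two continuation bytes
            // (the lead byte 0xE0 contributes only zero bits).
            $cp = ((ord($m[1]) & 0x3F) << 6) | (ord($m[2]) & 0x3F);
            $assigned = ($cp >= 0x0E01 && $cp <= 0x0E3A)
                     || ($cp >= 0x0E3F && $cp <= 0x0E5B);
            // Positions without an ISO-8859-11 character are left as they are.
            return $assigned ? chr($cp - 0x0E00 + 0xA0) : $m[0];
        },
        $string
    );
}

echo bin2hex(utf8_to_iso8859_11("\xE0\xB8\x81")), "\n"; // a1
```

ASCII and any non-Thai multi-byte sequences pass through unchanged, as they do with the strtr() table.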
2ge at NO2geSPAM dot us
26-Jan-2006 10:00
Hello all,

I like to use COOL (nice) URIs, example: http://example.com/try-something
I'm using UTF8 as input, so I had to write a function UTF8toASCII to get nice URIs. Here is what I came up with:

<?php
function urlize($url) {
    $search = array('/[^a-z0-9]/', '/--+/', '/^-+/', '/-+$/' );
    $replace = array( '-', '-', '', '');
    return preg_replace($search, $replace, utf2ascii($url));
}

function utf2ascii($string) {
    $iso88591  = "\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7";
    $iso88591 .= "\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF";
    $iso88591 .= "\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7";
    $iso88591 .= "\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF";
    $ascii = "aaaaaaaceeeeiiiidnooooooouuuuyyy";
    return strtr(mb_strtolower(utf8_decode($string), 'ISO-8859-1'),$iso88591,$ascii);
}

echo urlize("Fucking ml");

?>

I hope this helps someone.
peter dot mescalchin at geemail dot com
27-Dec-2005 06:43
Adding to the notes below, I have a few more MS Word characters that need replacing. I found this was required when "fixing" some phpMyAdmin export scripts from a live server where MS Word characters were all through the content, before importing them back into my local MySQL database.

The code I wrote for this process also does a strpos() for any extra "\xe2\x80" strings, which are the tell-tale sign of any funny characters I want removed.

Here are my updated arrays:

<?php
$badchr = array(
"\xe2\x80\xa6", // ellipsis
"\xe2\x80\x93", // long dash
"\xe2\x80\x94", // long dash
"\xe2\x80\x98", // single quote opening
"\xe2\x80\x99", // single quote closing
"\xe2\x80\x9c", // double quote opening
"\xe2\x80\x9d", // double quote closing
"\xe2\x80\xa2" // dot used for bullet points
);

$goodchr = array(
'...',
'-',
'-',
'\'',
'\'',
'"',
'"',
'*'
);
?>
php-net at ---NOSPAM---lc dot yi dot org
08-Dec-2005 09:04
I've just created this code snippet to improve the user-customizable emails sent by one of my websites.

The goal was to use UTF-8 (Unicode) so that non-english users have all the Unicode benefits, BUT also make life seamless for English (or specifically, English MS-Outlook users).  The niggle: Outlook prior to 2003 (?)  does not properly detect unicode emails.  When "smart quotes" from MS Word were pasted into a rich text area and saved in Unicode, then sent by email to an Outlook user, more often than not, these characters were wrongly rendered as "greek".

So, the following code snippet replaces a few strategic characters with HTML entities which Outlook XP (and possibly earlier) will render as expected.  [Code based on bits of code from previous posts on this and the htmlentities page]
<?php
    $badwordchars = array(
        "\xe2\x80\x98", // left single quote
        "\xe2\x80\x99", // right single quote
        "\xe2\x80\x9c", // left double quote
        "\xe2\x80\x9d", // right double quote
        "\xe2\x80\x94", // em dash
        "\xe2\x80\xa6" // ellipsis
    );
    $fixedwordchars = array(
        "&#8216;",
        "&#8217;",
        '&#8220;',
        '&#8221;',
        '&mdash;',
        '&#8230;'
    );
    $html = str_replace($badwordchars, $fixedwordchars, $html);
?>
yannikh at gmeil dot com
08-Dec-2005 08:34
I had to tackle a very interesting problem:

I wanted to replace every \xXX in a text with the letter it stands for. Unfortunately the XX values were ASCII and not UTF-8. I solved my problem this way:
<?php preg_replace ('/\\\\x([0-9a-fA-F]{2})/e', "pack('H*',utf8_decode('\\1'))",$v); ?>
thierry.bo # netcourrier point com
01-Oct-2005 09:53
To complete my previous test, here is the summary with two more checks added:

- if ($string == utf8_decode($string))
- if ($string == iconv('UTF-8', 'UTF-8', $string))

201 lines are valid UTF-8 strings using the phpnote regexp
203 lines are valid UTF-8 strings using the j.dittmer regexp
200 lines are valid UTF-8 strings using the fhoech regexp
239 lines are valid UTF-8 strings using mb_detect_encoding
203 lines are valid UTF-8 strings using utf8_decode
224 lines are valid UTF-8 strings using iconv

If we trust the file used for this test, there is no need for a regexp: use XML::utf8_decode() to test your strings. You get the same detection rate as with the three regexps tested, and the XML Parser extension is almost always available, unlike the iconv and Multibyte String functions.
thierry.bo # netcourrier point com
30-Sep-2005 09:38
In response to fhoech (22-Sep-2005 11:55), I just tried a simultaneous test with the file UTF-8-test.txt using your regexp, 'j dot dittmer' (20-Sep-2005 06:30) regexp (message #56962), `php-note-2005` (17-Feb-2005 08:57) regexp in his message on `mb-detect-encoding` page (http://us3.php.net/manual/en/function.mb-detect-encoding.php#50087) who is using a regexp from the W3C (http://w3.org/International/questions/qa-forms-utf-8.html), and PHP mb_detect_encoding function.

Here is a summary of the results:

201 lines are valid UTF-8 strings using the phpnote regexp
203 lines are valid UTF-8 strings using the j.dittmer regexp
200 lines are valid UTF-8 strings using the fhoech regexp
239 lines are valid UTF-8 strings using mb_detect_encoding

Here are the lines with differences (left to right, phpnote, j.dittmer and fhoech) :

Line #70 : NOT UTF8|IS UTF8!|IS UTF8! :2.1.1 1 byte (U-00000000): ""
Line #79 : NOT UTF8|IS UTF8!|IS UTF8! :2.2.1 1 byte (U-0000007F): ""
Line #81 : IS UTF8!|IS UTF8!|NOT UTF8 :2.2.3 3 bytes (U-0000FFFF): "&#65535;" |
Line #267 : IS UTF8!|IS UTF8!|NOT UTF8 :5.3.1 U+FFFE = ef bf be = "&#65534;" |
Line #268 : IS UTF8!|IS UTF8!|NOT UTF8 :5.3.2 U+FFFF = ef bf bf = "&#65535;" |

Interestingly, you said that your regexp corrected the j.dittmer regexp, which failed on section 5.3, but in my test I get the opposite result?!

I ran this test on windows XP with PHP 4.3.11dev. Maybe these differences come from operating system, or PHP version.

For mb_detect_encoding I used the command :

mb_detect_encoding($line, 'UTF-8, ISO-8859-1, ASCII');
fhoech
22-Sep-2005 07:55
Sorry, I had a typo in my last comment. Corrected regexp:

^([\\x00-\\x7f]|
[\\xc2-\\xdf][\\x80-\\xbf]|
\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|
[\\xe1-\\xec][\\x80-\\xbf]{2}|
\\xed[\\x80-\\x9f][\\x80-\\xbf]|
\\xef[\\x80-\\xbf][\\x80-\\xbd]|
\\xee[\\x80-\\xbf]{2}|
\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|
[\\xf1-\\xf3][\\x80-\\xbf]{3}|
\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$
fhoech
22-Sep-2005 07:12
JF Sebastian's regex is almost perfect as far as I'm concerned. I found one error (it failed section 5.3 "Other illegal code positions" from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt) which I corrected as follows:

^([\\x00-\\x7f]|
[\\xc2-\\xdf][\\x80-\\xbf]|
\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|
[\\xe1-\\xec][\\x80-\\xbf]{2}|
\\xed[\\x80-\\x9f][\\x80-\\xbf]|
\\xef[\\x80-\\xbf][\\x80-\\xbc]|
\\xee[\\x80-\\xbf]{2}|
\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|
[\\xf1-\\xf3][\\x80-\\xbf]{3}|
\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$

(Again, concatenate to one single line to make it work)
j dot dittmer at portrix dot net
20-Sep-2005 02:30
The regex in the last comment has some typos. This one is
syntactically valid, though I don't know whether it is correct.
You have to concatenate the expression into one long line.

^(
[\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
[\xe0][\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
[\xed][\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
[\xf0][\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
[\xf4][\x80-\x8f][\x80-\xbf]{2}
)*$
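
To try the expression above from PHP, the alternatives can be joined with string concatenation and passed to preg_match(). The wrapper below is a sketch under a few assumptions: the function name is hypothetical, the capturing group is replaced by a non-capturing one, and the D modifier is added so that `$` only matches at the very end of the subject. On very long inputs preg_match() may hit PCRE's backtracking limits and return false, so treat this as a check for short strings.

```php
<?php
// Hypothetical wrapper; the alternatives are copied from the note above.
function is_wellformed_utf8($s) {
    $pattern = '/^(?:'
        . '[\x00-\x7f]|'
        . '[\xc2-\xdf][\x80-\xbf]|'
        . '[\xe0][\xa0-\xbf][\x80-\xbf]|'
        . '[\xe1-\xec][\x80-\xbf]{2}|'
        . '[\xed][\x80-\x9f][\x80-\xbf]|'
        . '[\xee-\xef][\x80-\xbf]{2}|'
        . '[\xf0][\x90-\xbf][\x80-\xbf]{2}|'
        . '[\xf1-\xf3][\x80-\xbf]{3}|'
        . '[\xf4][\x80-\x8f][\x80-\xbf]{2}'
        . ')*$/D';
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match($pattern, $s) === 1;
}

var_dump(is_wellformed_utf8("a\xC3\xB6"));  // bool(true)
var_dump(is_wellformed_utf8("\xC0\xAF"));   // bool(false), overlong encoding of "/"
```

Single-quoted PHP strings pass `\x7f` and friends to PCRE untouched, which is why no doubled backslashes are needed here.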
chris at mrwsp dot com
09-Aug-2005 12:54
A small improvement.

JF Sebastian's regex for UTF-8 is not quite correct.  Because code points could otherwise be coded in more than one way using UTF-8, the Standard stipulates that the shortest possible representation for a character should be used.  So some 'duplicate' combinations his regex accepts are not valid UTF-8.  Additionally, his regex accepts characters beyond the valid Unicode code space.

The regex should be:

^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
[\xe0][\xa0-\xbf][\x80-xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
[\xed][\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
[\xf0][\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
[\xf4][\x80-\x8f][\x80-\xbf]{2}*$)
miracle at balkansys dot com
14-Jun-2005 11:24
The best multi-language library I have found is part of a CMS, TYPO3:

http://www.typo3.org

It can convert from and to any charset, and it does so by one of three methods: mbstring, iconv, or plain PHP scripts. It uses only one of these techniques, the fastest one available.

That was the only way I could make MySQL 3.23 store letters in almost any language while keeping my website in UTF-8 only.
Denjs
11-May-2005 03:09
I had some problems encoding and decoding Russian cp1251 strings to and from UTF-8, the way a browser does in a URL.
(My Apache runs under Windows and some local files have Russian names, so I need to create correct URLs to them.)

The problem was resolved with the utf8 class created by Alexandar Minkovsky.

Download here:
http://www.phpclasses.org/browse/package/1974.html

The UTF8 class can convert text between UTF-8 and other encodings. It is published under the "BSD License".
Vladimir Stwora, vlad4321 at fastmail dot fm
09-May-2005 12:41
If you want to convert utf-8 to ascii, you can use the following procedure:

<?php
function utf2ascii($string) {
    // convert to a single-byte charset first, so strtr() can map byte for byte
    $string = iconv('utf-8', 'windows-1250', $string);
    $win   = '...'; // I was unable to paste the whole set of characters here, but you get the point
    $ascii = 'zyaieuu...';
    $string = strtr($string, $win, $ascii);
    return $string;
}
?>
This works on the assumption that you know the language of the text (and thus the charset) you are converting from. In the above example I am converting from an Eastern European language, so I know I can safely use windows-1250 as an intermediate charset. You will have to adjust the charset to your language.

Please remember that you have to save this file separately, and it must be encoded in the same charset that you pass to iconv. Otherwise it will not work.
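
A variation on the same idea, as a rough sketch: strtr() also accepts an array of replacement pairs, so the UTF-8 byte sequences can be mapped directly, without the iconv step and without depending on how the source file is saved. The function name utf2ascii_map and the seven accented letters below are assumptions chosen for illustration (they are not the character set elided above); extend the table for your own language.

```php
<?php
// Sketch only: map a few UTF-8 encoded letters to plain ASCII.
// The keys are written as byte escapes, so the file encoding does not matter.
function utf2ascii_map($string) {
    $map = array(
        "\xC5\xBE" => 'z', // U+017E, z with caron
        "\xC3\xBD" => 'y', // U+00FD, y with acute
        "\xC3\xA1" => 'a', // U+00E1, a with acute
        "\xC3\xAD" => 'i', // U+00ED, i with acute
        "\xC3\xA9" => 'e', // U+00E9, e with acute
        "\xC3\xBA" => 'u', // U+00FA, u with acute
        "\xC5\xAF" => 'u', // U+016F, u with ring above
    );
    // With an array argument strtr() replaces whole substrings, longest match first.
    return strtr($string, $map);
}
```

Characters that are not in the table are left untouched, so the result is only pure ASCII if the table covers every non-ASCII character in the input.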
JF Sebastian
30-Mar-2005 05:09
The following Perl regular expression tests if a string is well-formed Unicode UTF-8 (Broken up after each | since long lines are not permitted here. Please join as a single line, no spaces, before use.):

^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
\xe0[\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
\xed[\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
\xf0[\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
\xf4[\x80-\x8f][\x80-\xbf]{2})*$

NOTE: This strictly follows the Unicode standard 4.0, as described in chapter 3.9, table 3-6, "Well-formed UTF-8 byte sequences" ( http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 ).

ISO-10646, a super-set of Unicode, uses UTF-8 (there called "UCS", see http://www.unicode.org/faq/utf_bom.html#1 ) in a relaxed variant that supports a 31-bit space encoded into up to six bytes instead of Unicode's 21 bits in up to four bytes. To check for ISO-10646 UTF-8, use the following Perl regular expression (again, broken up, see above):

^([\x00-\x7f]|
[\xc0-\xdf][\x80-\xbf]|
[\xe0-\xef][\x80-\xbf]{2}|
[\xf0-\xf7][\x80-\xbf]{3}|
[\xf8-\xfb][\x80-\xbf]{4}|
[\xfc-\xfd][\x80-\xbf]{5})*$
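
To make the six-byte scheme concrete, here is a small sketch that returns how many bytes the relaxed ISO-10646 form needs for a given code point. The function name utf8_sequence_length is made up for this note. Unicode proper stops at U+10FFFF, so it never needs more than four bytes.

```php
<?php
// Number of bytes the relaxed (ISO-10646) UTF-8 scheme uses for a code point.
// Payload bits available per sequence length: 7, 11, 16, 21, 26, 31.
function utf8_sequence_length($codepoint) {
    if ($codepoint < 0 || $codepoint > 0x7FFFFFFF) {
        return false; // outside the 31-bit space
    }
    if ($codepoint < 0x80)      return 1;
    if ($codepoint < 0x800)     return 2;
    if ($codepoint < 0x10000)   return 3;
    if ($codepoint < 0x200000)  return 4;
    if ($codepoint < 0x4000000) return 5;
    return 6;
}
```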

The following function may be used with above expressions for a quick UTF-8 test, e.g. to distinguish ISO-8859-1-data from UTF-8-data if submitted from a <form accept-charset="utf-8,iso-8859-1" method=..>.

function is_utf8($string) {
    return (preg_match('/[insert regular expression here]/', $string) === 1);
}
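
For illustration, here is one way the pieces might be put together. This is a sketch, not a tested library routine: the name is_utf8_strict is made up, the alternatives follow the first (Unicode 4.0) expression above, and implode() joins them so that nothing has to be pasted as one long line.

```php
<?php
// Sketch: build the strict well-formedness pattern from its alternatives.
// Single-quoted strings keep the backslashes, so PCRE sees \x00, \x7f, etc.
function is_utf8_strict($string) {
    $alternatives = array(
        '[\x00-\x7f]',                   // 1 byte, ASCII
        '[\xc2-\xdf][\x80-\xbf]',        // 2 bytes, no overlong C0/C1 lead bytes
        '\xe0[\xa0-\xbf][\x80-\xbf]',    // 3 bytes, no overlong forms
        '[\xe1-\xec][\x80-\xbf]{2}',
        '\xed[\x80-\x9f][\x80-\xbf]',    // 3 bytes, excluding surrogates
        '[\xee-\xef][\x80-\xbf]{2}',
        '\xf0[\x90-\xbf][\x80-\xbf]{2}', // 4 bytes, no overlong forms
        '[\xf1-\xf3][\x80-\xbf]{3}',
        '\xf4[\x80-\x8f][\x80-\xbf]{2}', // 4 bytes, up to U+10FFFF
    );
    // No /u modifier: the pattern is matched byte by byte.
    $pattern = '/^(?:' . implode('|', $alternatives) . ')*$/D';
    return preg_match($pattern, $string) === 1;
}
```

On very long strings a repeated group like this can run into PCRE's backtracking or recursion limits; preg_match() then returns false and the function reports false as well. Where the mbstring extension is available, mb_check_encoding($string, 'UTF-8') is an alternative worth considering.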
ivanmaz(remove) at mech dot math dot msu dot su
16-Mar-2005 12:50
Here is my variant of a UTF-8 to Cyrillic Win-1251 converter. It replaces all characters except Latin and Russian ones with &#...; entities:

function utf2win1251 ($s)
{
 $out = "";

 for ($i=0; $i<strlen($s); $i++)
 {
  $c1 = substr ($s, $i, 1);
  $byte1 = ord ($c1);
  if ($byte1>>5 == 6) // 110x xxxx, 110 prefix for 2 bytes unicode
  {
   $i++;
   $c2 = substr ($s, $i, 1);
   $byte2 = ord ($c2);
   $byte1 &= 31; // remove the 3 bit two bytes prefix
   $byte2 &= 63; // remove the 2 bit trailing byte prefix
   $byte2 |= (($byte1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
   $byte1 >>= 2; // c1 shifts 2 to the right

   $word = ($byte1<<8) + $byte2;
   if ($word==1025) $out .= chr(168);                    // Ё (U+0401)
   elseif ($word==1105) $out .= chr(184);                // ё (U+0451)
   elseif ($word>=0x0410 && $word<=0x044F) $out .= chr($word-848); // А-Я, а-я
   else
   { 
     $a = dechex($byte1);
     $a = str_pad($a, 2, "0", STR_PAD_LEFT);
     $b = dechex($byte2);
     $b = str_pad($b, 2, "0", STR_PAD_LEFT);
     $out .= "&#x".$a.$b.";";
   }
  }
  else
  {
   $out .= $c1;
  }
 }

 return $out;
}

The function is based on 2 other functions posted below.
I hope it will help those who convert UTF8-encoded text to Win-1251 to use it safely on Russian web pages (works fine in all browsers).
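
As a worked example of the bit arithmetic above (a sketch; the helper name two_byte_codepoint is hypothetical and not part of the function): the Cyrillic capital A is stored in UTF-8 as the bytes D0 90. Stripping the 110 and 10 prefixes and joining the remaining bits gives the code point, and subtracting 848 gives the Win-1251 byte.

```php
<?php
// Hypothetical helper: code point of one two-byte UTF-8 sequence.
function two_byte_codepoint($pair) {
    $lead  = ord($pair[0]) & 0x1F; // drop the 110 prefix, keep 5 bits
    $trail = ord($pair[1]) & 0x3F; // drop the 10 prefix, keep 6 bits
    return ($lead << 6) | $trail;
}

$cp = two_byte_codepoint("\xD0\x90"); // CYRILLIC CAPITAL LETTER A
printf("U+%04X -> cp1251 byte 0x%02X\n", $cp, $cp - 848);
// prints: U+0410 -> cp1251 byte 0xC0
```

The helper does not check that its input really is a two-byte sequence; like the function above, it assumes well-formed input.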
marcelo at maccoy dot com dot br
14-Feb-2005 07:24
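function decode_utf8($str){
    # erase null signs in string
    $str = preg_replace("/^.{10,13}q\?/i", "", $str);
    # pattern: one =XX hex escape
    $pat = "/=([0-9A-F]{2})/";
    # decode each escape with a callback; eval() would execute input that contains a quote
    $str = preg_replace_callback($pat, 'decode_utf8_hex', $str);
    # return
    return $str;
}

# callback for decode_utf8(): turns a matched =XX escape into the byte it encodes
function decode_utf8_hex($m){
    return chr(hexdec($m[1]));
}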

Note:
It could be written more compactly, but I did not manage that in this first code submission.
husamb at maksimum dot net
27-Jan-2005 10:16
Hi, I collected some of the scripts on this page and wrote a new, customizable script. You can easily switch the ISO charset to convert to. The mapping definitions are on the unicode.org page at http://www.unicode.org/Public/MAPPINGS/ISO8859/.

<?php
# GLOBAL VARIABLES
$url = "http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-9.TXT";
//$url = "8859-9.txt";
$iso2utf = array();
$utf2iso = array();

# UNICODE MAPPING TABLE PARSING
function create_map($url){
    global $iso2utf, $utf2iso;

    $fl = @(file($url)) OR (die("cannot open file : $url\n"));
    for ($i=0; $i<count($fl); $i++){
        // skip comment lines and blank lines
        if($fl[$i][0] != '#' && trim($fl[$i])){
            // the fields are tab-separated; explode() is enough, no regex needed
            list($iso, $uni, $s, $desc) = explode("\t",$fl[$i]);
            $iso2utf[$iso] = $uni;
            $utf2iso[$uni] = $iso;
        }
    }
}

# FINDING UNICODE LETTER'S DECIMAL ASCII VALUE
function uniord($c){
    $ud = 0;
    // 1 byte (ASCII): the code point is the byte value
    if (ord($c[0])>=0 && ord($c[0])<=127)   $ud = ord($c[0]);
    // 2 to 6 bytes: strip the prefix of each byte and weight it by its position
    if (ord($c[0])>=192 && ord($c[0])<=223) $ud = (ord($c[0])-192)*64 + (ord($c[1])-128);
    if (ord($c[0])>=224 && ord($c[0])<=239) $ud = (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if (ord($c[0])>=240 && ord($c[0])<=247) $ud = (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    if (ord($c[0])>=248 && ord($c[0])<=251) $ud = (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if (ord($c[0])>=252 && ord($c[0])<=253) $ud = (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if (ord($c[0])>=254 && ord($c[0])<=255) $ud = false; //error
    return $ud;
}

# PARSING UNICODE STRING
function utf2iso($source) {
    global $utf2iso;

    $pos = 0;
    $len = strlen ($source);
    $encodedString = '';

    while ($pos < $len) {
        $is_ascii = false;
        $asciiPos = ord (substr ($source, $pos, 1));
        if(($asciiPos >= 240) && ($asciiPos <= 255)) {
            // 4 chars representing one unicode character
            $thisLetter = substr ($source, $pos, 4);
            $thisLetterOrd = uniord($thisLetter);
            $pos += 4;
        }
        else if(($asciiPos >= 224) && ($asciiPos <= 239)) {
            // 3 chars representing one unicode character
            $thisLetter = substr ($source, $pos, 3);
            $thisLetterOrd = uniord($thisLetter);
            $pos += 3;
        }
        else if(($asciiPos >= 192) && ($asciiPos <= 223)) {
            // 2 chars representing one unicode character
            $thisLetter = substr ($source, $pos, 2);
            $thisLetterOrd = uniord($thisLetter);
            $pos += 2;
        }
        else{
            // 1 char (lower ascii)
            $thisLetter = substr ($source, $pos, 1);
     &nbs