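I. The Origin of Unicode
1. As we know, a computer only understands strings of 0s and 1s. Reading raw binary is tiring for humans, so for convenience values are usually written in octal, decimal, or hexadecimal. These notations are all equivalent. Text, images, audio, and video are likewise not something a computer understands directly; to represent them, they must be converted into numbers according to agreed rules.
For example, the earliest such scheme was the ASCII character set (American Standard Code for Information Interchange), which uses 7 bits per character, for a total of 128 characters. The usual basic unit is the byte (8 bits). When a byte held an ASCII character, its first bit was always 0 and the remaining seven bits carried the actual content. Later, IBM extended this scheme to use all 8 bits, giving 256 characters: when the first bit is 0 the byte still means one of the original common characters, and when it is 1 it means one of the supplementary characters.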
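2. The English letters plus punctuation and similar symbols number fewer than 256, so one byte is enough for them. Other scripts need far more: Chinese alone has well over ten thousand characters. So a variety of other character sets appeared. Exchanging data between different character sets then became a problem: one set might use a certain number for the character A, while another set uses a different number for the same character.
ASCII could no longer meet the needs of globalization and of interoperability between languages. So the Unicode Consortium and ISO each set out to define a single standard in which every character corresponds to exactly one number. ISO named its standard UCS (Universal Character Set), where UCS-2 roughly corresponds to UTF-16 and UCS-4 to UTF-32; the Unicode Consortium simply called its standard Unicode.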
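The points in item 1 about number bases and the 7-bit ASCII layout can be illustrated with a small sketch. The helper names `is_ascii_byte` and `print_bases` are hypothetical and chosen only for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 when the byte's top bit is 0,
   i.e. when the value lies in the 7-bit ASCII range 0..127. */
int is_ascii_byte(unsigned char b)
{
    return (b & 0x80) == 0;
}

/* Prints one value in octal, decimal and hexadecimal.
   All three are the same number written in different bases. */
void print_bases(unsigned char b)
{
    printf("octal %o, decimal %u, hex %X\n",
           (unsigned)b, (unsigned)b, (unsigned)b);
}
```

Calling `print_bases('A')` prints `octal 101, decimal 65, hex 41`. `is_ascii_byte('A')` returns 1, while `is_ascii_byte(0xE6)` returns 0 because the top bit is set.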
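II. The Origin of UTF-8 and UTF-16
1. The first version of Unicode involved two steps. The first was to define a specification that assigns every character a unique number: Unicode used the numbers 0 to 65535 (2^16 values in total), with 0 to 127 representing exactly the same 128 characters as ASCII. The second was to decide how the number for each character (0 to 65535) is turned into bits and stored in the computer. Storage raises the question of how many bytes each character occupies, and different answers give different storage schemes. These schemes are the UTFs (Unicode Transformation Formats): UTF-8 and UTF-16.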
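III. The Difference Between UTF-8 and UTF-16
1. UTF-16: in this first version, the number for every character is stored in two bytes. For text that is all English letters, where one byte per character would be enough, this is somewhat wasteful.
2. UTF-8: the space used to store a character's number is variable; a character may take one, two, or three bytes.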
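As a rough sketch of the difference, the two functions below return how many bytes each encoding needs for a single code point in the first-version range 0 to 65535. The function names are hypothetical; the thresholds follow from the 7, 11, and 16 payload bits that one-, two-, and three-byte UTF-8 sequences can carry.

```c
#include <stdint.h>

/* Hypothetical helper: bytes UTF-8 needs for a code point in 0..65535. */
int utf8_len_bmp(uint16_t cp)
{
    if (cp < 0x80)
        return 1;   /* 0..127: identical to ASCII */
    if (cp < 0x800)
        return 2;   /* 128..2047 */
    return 3;       /* 2048..65535 */
}

/* Hypothetical helper: in the 16-bit-only scheme described here,
   UTF-16 always uses one two-byte unit, whatever the code point. */
int utf16_len_bmp(uint16_t cp)
{
    (void)cp;       /* the value does not affect the size */
    return 2;
}
```

For example, `utf8_len_bmp('A')` is 1 and `utf8_len_bmp(0x6C49)` is 3, while `utf16_len_bmp` returns 2 for both.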
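IV. Strengths and Weaknesses of UTF-8 and UTF-16
1. For text that is entirely English, or mixed with other scripts but overwhelmingly English, UTF-8 saves a lot of space compared with UTF-16.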
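2. For text that is entirely or mostly Chinese or similar characters, UTF-16 saves space: two bytes per character instead of three. There is also the question of fault tolerance. A UTF-8 decoder relies on the marker bits at the start of each byte, so a byte corrupted in transit garbles the character it belongs to; but because lead bytes and continuation bytes can always be told apart, the decoder can resynchronize at the next lead byte and the damage stays local. UTF-16 has no such markers: a corrupted byte damages only one character, but if a byte is lost or inserted, every two-byte unit after it is misaligned.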
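The space trade-off can be checked with a small sketch. It counts characters in a UTF-8 string by skipping continuation bytes (those of the form 10xxxxxx) and derives the UTF-16 size from that count. The function names are hypothetical, and the UTF-16 figure assumes every character fits in one 16-bit unit.

```c
#include <stddef.h>
#include <string.h>

/* Counts characters in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Bytes the text occupies in UTF-8: simply the string length. */
size_t utf8_size(const char *s)
{
    return strlen(s);
}

/* Bytes the same text would occupy in UTF-16, assuming every
   character fits in a single 16-bit unit. */
size_t utf16_size_bmp(const char *s)
{
    return 2 * utf8_char_count(s);
}
```

For `"Hello"` the sizes are 5 bytes in UTF-8 and 10 in UTF-16. For the three-character string `"\xE7\xA8\x8B\xE5\xBA\x8F\xE5\x91\x98"` (the UTF-8 bytes of 程序员) they are 9 bytes in UTF-8 and 6 in UTF-16.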
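V. A Worked Unicode Example
1. For example, the Chinese character "汉" has the Unicode value 6C49 (in hexadecimal; in decimal it is 27721).
2. "汉" in UTF-16: this is simple, just 01101100 01001001 (16 bits, two bytes). A program that knows the data is UTF-16 parses each pair of bytes as one unit.
3. "汉" in UTF-8: this is more involved, because the program reads one byte at a time and uses the marker bits at the start of each byte to decide whether one, two, or three bytes form a unit. The rules are:
0xxxxxxx: a byte in this form, starting with 0, is a unit by itself, exactly as in ASCII;
110xxxxx 10xxxxxx: in this form, two bytes are treated as one unit;
1110xxxx 10xxxxxx 10xxxxxx: in this form, three bytes are treated as one unit.
4. Because UTF-16 needs no marker bits, two bytes can represent 2^16 = 65536 characters.
5. UTF-8 carries extra marker bits, so one byte can represent only 2^7 = 128 characters, two bytes only 2^11 = 2048 characters, and three bytes 2^16 = 65536 characters.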
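6. Since the code for "汉", 27721, is greater than 2048, two bytes are not enough, so the form 1110xxxx 10xxxxxx 10xxxxxx is used. The binary digits of 27721 (0110 1100 0100 1001) fill the x positions in order, most significant bit first, giving 11100110 10110001 10001001, that is, the bytes E6 B1 89.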
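7. Byte order is a separate matter. It arises when a multi-byte unit, such as a 16-bit UTF-16 unit, is written out as individual bytes, and this is where the terms Big-Endian and Little-Endian come from. Big-Endian stores the high-order byte first (6C 49 for "汉"); Little-Endian stores the low-order byte first (49 6C). UTF-8 is defined as a sequence of single bytes and has no such variants.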
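The worked example can be reproduced with a minimal sketch. The functions below are hypothetical helpers, separate from the converter listed later: one classifies a lead byte by the three patterns given above, one packs a code point into the three-byte form, and two write a 16-bit UTF-16 unit in each byte order.

```c
#include <stdint.h>

/* Length of the UTF-8 sequence that starts with `lead`, using only the
   three patterns listed above. Returns 0 for any other byte, such as a
   continuation byte (10xxxxxx). */
int utf8_seq_len(uint8_t lead)
{
    if ((lead & 0x80) == 0x00)
        return 1;                   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0)
        return 2;                   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                   /* 1110xxxx */
    return 0;
}

/* Encodes a code point in 0x0800..0xFFFF as 1110xxxx 10xxxxxx 10xxxxxx,
   filling the x positions most significant bit first. */
void utf8_encode_3(uint16_t cp, uint8_t out[3])
{
    out[0] = (uint8_t)(0xE0 | (cp >> 12));          /* top 4 bits    */
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));  /* middle 6 bits */
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));         /* low 6 bits    */
}

/* Writes one 16-bit UTF-16 unit with the high-order byte first. */
void utf16_store_be(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Writes one 16-bit UTF-16 unit with the low-order byte first. */
void utf16_store_le(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}
```

With `cp = 0x6C49`, `utf8_encode_3` produces E6 B1 89, `utf16_store_be` produces 6C 49, and `utf16_store_le` produces 49 6C. `utf8_seq_len(0xE6)` returns 3, which is how a decoder knows to read two more bytes.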
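VI. The Second Version of Unicode
The first version's 65536 code values are not that many. They are plenty for commonly used characters, but not once many rarer ones are added. So in 1996 a second version extended the range beyond 16 bits (up to 10FFFF in today's standard), which gives us UTF-8, UTF-16, and UTF-32. The principles are exactly the same as before. UTF-32 stores every character in 32 bits, that is, 4 bytes, while UTF-8 and UTF-16 vary with the character: UTF-8 uses anywhere from 1 to 4 bytes per character (the original design allowed sequences of up to 6 bytes), and UTF-16 uses either two or four bytes.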
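The four-byte UTF-16 case works by splitting a code point above FFFF into two 16-bit units, a surrogate pair. The sketch below shows the arithmetic; the function name `utf16_encode` is hypothetical, and the constants 0x10000, 10, and 0x3FF play the same roles as `halfBase`, `halfShift`, and `halfMask` in the listing that follows.

```c
#include <stdint.h>

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
   written (1 or 2), or 0 if the value is not a valid code point. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range or surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: low 10 bits  */
    return 2;
}
```

For example, the code point 1F600 becomes the pair D83D DE00 (four bytes), while 6C49 is still stored as the single unit 6C49 (two bytes).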
VII. Code
utf.c
/*
 * ************************************************************************
 *       Filename:  utf.c
 *    Description:
 *        Version:  1.0
 *        Created:  2016-10-21 09:50:05
 *       Revision:  none
 *       Compiler:  gcc
 *         Author:  YOUR NAME (),
 *        Company:
 * ************************************************************************/
#include <stdio.h>
#include <string.h>
#include "utf.h"

/* Checks that the `length` bytes at `source` form one legal UTF-8 sequence. */
static boolean isLegalUTF8(const UTF8 *source, int length)
{
    UTF8 a;
    const UTF8 *srcptr = NULL;

    if (NULL == source) {
        printf("ERR, isLegalUTF8: source=%p\n", (const void *)source);
        return FALSE;
    }
    srcptr = source + length;
    switch (length) {
    default:
        printf("ERR, isLegalUTF8 1: length=%d\n", length);
        return FALSE;
    /* Everything else falls through when "TRUE"... */
    case 4:
        if ((a = (*--srcptr)) < 0x80 || a > 0xBF) {
            printf("ERR, isLegalUTF8 2: length=%d, a=%x\n", length, a);
            return FALSE;
        }
    case 3:
        if ((a = (*--srcptr)) < 0x80 || a > 0xBF) {
            printf("ERR, isLegalUTF8 3: length=%d, a=%x\n", length, a);
            return FALSE;
        }
    case 2:
        if ((a = (*--srcptr)) > 0xBF) {
            printf("ERR, isLegalUTF8 4: length=%d, a=%x\n", length, a);
            return FALSE;
        }
        switch (*source) {
        /* no fall-through in this inner switch */
        case 0xE0:
            if (a < 0xA0) {
                printf("ERR, isLegalUTF8 1: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xED:
            if (a > 0x9F) {
                printf("ERR, isLegalUTF8 2: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xF0:
            if (a < 0x90) {
                printf("ERR, isLegalUTF8 3: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        case 0xF4:
            if (a > 0x8F) {
                printf("ERR, isLegalUTF8 4: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
            break;
        default:
            if (a < 0x80) {
                printf("ERR, isLegalUTF8 5: source=%x, a=%x\n", *source, a);
                return FALSE;
            }
        }
    case 1:
        if (*source >= 0x80 && *source < 0xC2) {
            printf("ERR, isLegalUTF8: source=%x\n", *source);
            return FALSE;
        }
    }
    if (*source > 0xF4)
        return FALSE;
    return TRUE;
}

/* Converts a NUL-terminated UTF-8 string to UTF-16; outLen is the size of the target buffer in bytes. */
ConversionResult Utf8_To_Utf16(const UTF8 *sourceStart, UTF16 *targetStart, size_t outLen, ConversionFlags flags)
{
    ConversionResult result = conversionOK;
    const UTF8 *source = sourceStart;
    UTF16 *target = targetStart;
    UTF16 *targetEnd = targetStart + outLen / 2;
    const UTF8 *sourceEnd = NULL;

    if ((NULL == source) || (NULL == targetStart)) {
        printf("ERR, Utf8_To_Utf16: source=%p, targetStart=%p\n", (const void *)source, (void *)targetStart);
        return conversionFailed;
    }
    sourceEnd = strlen((const char *)sourceStart) + sourceStart;

    while (*source) {
        UTF32 ch = 0;
        unsigned short extraBytesToRead = trailingBytesForUTF8[*source];

        if (source + extraBytesToRead >= sourceEnd) {
            printf("ERR, Utf8_To_Utf16----sourceExhausted: source=%p, extraBytesToRead=%d, sourceEnd=%p\n",
                   (const void *)source, extraBytesToRead, (const void *)sourceEnd);
            result = sourceExhausted;
            break;
        }
        /* Do this check whether lenient or strict */
        if (!isLegalUTF8(source, extraBytesToRead + 1)) {
            printf("ERR, Utf8_To_Utf16----isLegalUTF8 return FALSE: source=%p, extraBytesToRead=%d\n",
                   (const void *)source, extraBytesToRead);
            result = sourceIllegal;
            break;
        }
        /* The cases all fall through. */
        switch (extraBytesToRead) {
        case 5: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
        case 4: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
        case 3: ch += *source++; ch <<= 6;
        case 2: ch += *source++; ch <<= 6;
        case 1: ch += *source++; ch <<= 6;
        case 0: ch += *source++;
        }
        ch -= offsetsFromUTF8[extraBytesToRead];

        if (target >= targetEnd) {
            source -= (extraBytesToRead + 1); /* Back up source pointer! */
            printf("ERR, Utf8_To_Utf16----target >= targetEnd: source=%p, extraBytesToRead=%d\n",
                   (const void *)source, extraBytesToRead);
            result = targetExhausted;
            break;
        }
        if (ch <= UNI_MAX_BMP) { /* Target is a character <= 0xFFFF */
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
                if (flags == strictConversion) {
                    source -= (extraBytesToRead + 1); /* return to the illegal value itself */
                    printf("ERR, Utf8_To_Utf16----ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END: source=%p, extraBytesToRead=%d\n",
                           (const void *)source, extraBytesToRead);
                    result = sourceIllegal;
                    break;
                } else {
                    *target++ = UNI_REPLACEMENT_CHAR;
                }
            } else {
                *target++ = (UTF16)ch; /* normal case */
            }
        } else if (ch > UNI_MAX_UTF16) {
            if (flags == strictConversion) {
                result = sourceIllegal;
                source -= (extraBytesToRead + 1); /* return to the start */
                printf("ERR, Utf8_To_Utf16----ch > UNI_MAX_UTF16: source=%p, extraBytesToRead=%d\n",
                       (const void *)source, extraBytesToRead);
                break; /* Bail out; shouldn't continue */
            } else {
                *target++ = UNI_REPLACEMENT_CHAR;
            }
        } else {
            /* target is a character in range 0xFFFF - 0x10FFFF. */
            if (target + 1 >= targetEnd) {
                source -= (extraBytesToRead + 1); /* Back up source pointer! */
                printf("ERR, Utf8_To_Utf16----target + 1 >= targetEnd: source=%p, extraBytesToRead=%d\n",
                       (const void *)source, extraBytesToRead);
                result = targetExhausted;
                break;
            }
            ch -= halfBase;
            *target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
            *target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
        }
    }
    return result;
}

/* Converts a 0-terminated UTF-16 string to UTF-8; outLen is the size of the target buffer in bytes. */
int Utf16_To_Utf8(const UTF16 *sourceStart, UTF8 *targetStart, size_t outLen, ConversionFlags flags)
{
    int result = 0;
    const UTF16 *source = sourceStart;
    UTF8 *target = targetStart;
    UTF8 *targetEnd = targetStart + outLen;

    if ((NULL == source) || (NULL == targetStart)) {
        printf("ERR, Utf16_To_Utf8: source=%p, targetStart=%p\n", (const void *)source, (void *)targetStart);
        return conversionFailed;
    }

    while (*source) {
        UTF32 ch;
        unsigned short bytesToWrite = 0;
        const UTF32 byteMask = 0xBF;
        const UTF32 byteMark = 0x80;
        const UTF16 *oldSource = source; /* In case we have to back up because of target overflow. */

        ch = *source++;
        /* If we have a surrogate pair, convert to UTF32 first. */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
            /* If the 16 bits following the high surrogate are in the source buffer... */
            if (*source) {
                UTF32 ch2 = *source;
                /* If it's a low surrogate, convert to UTF32. */
                if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
                    ch = ((ch - UNI_SUR_HIGH_START) << halfShift) + (ch2 - UNI_SUR_LOW_START) + halfBase;
                    ++source;
                } else if (flags == strictConversion) { /* it's an unpaired high surrogate */
                    --source; /* return to the illegal value itself */
                    result = sourceIllegal;
                    break;
                }
            } else { /* We don't have the 16 bits following the high surrogate. */
                --source; /* return to the high surrogate */
                result = sourceExhausted;
                break;
            }
        } else if (flags == strictConversion) {
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END) {
                --source; /* return to the illegal value itself */
                result = sourceIllegal;
                break;
            }
        }
        /* Figure out how many bytes the result will require */
        if (ch < (UTF32)0x80) {
            bytesToWrite = 1;
        } else if (ch < (UTF32)0x800) {
            bytesToWrite = 2;
        } else if (ch < (UTF32)0x10000) {
            bytesToWrite = 3;
        } else if (ch < (UTF32)0x110000) {
            bytesToWrite = 4;
        } else {
            bytesToWrite = 3;
            ch = UNI_REPLACEMENT_CHAR;
        }

        target += bytesToWrite;
        if (target > targetEnd) {
            source = oldSource; /* Back up source pointer! */
            target -= bytesToWrite;
            result = targetExhausted;
            break;
        }
        switch (bytesToWrite) { /* note: everything falls through. */
        case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
        case 1: *--target = (UTF8)(ch | firstByteMark[bytesToWrite]);
        }
        target += bytesToWrite;
    }
    return result;
}

int main(int argc, char *argv[])
{
    int i = 0;
    UTF8 buf8[256] = "";
    UTF16 buf16[256] = {0};

    (void)argc;
    (void)argv;

    /* The source file must be saved as UTF-8 for this literal to hold UTF-8 bytes. */
    strcpy((char *)buf8, "程序员");
    Utf8_To_Utf16(buf8, buf16, sizeof(buf16), strictConversion);
    printf("\nUTF-8 => UTF-16 = ");
    while (buf16[i]) {
        printf("%#x ", buf16[i]);
        i++;
    }

    memset(buf8, 0, sizeof(buf8));
    memset(buf16, 0, sizeof(buf16));
    buf16[0] = 0x7a0b;
    buf16[1] = 0x5e8f;
    buf16[2] = 0x5458;
    Utf16_To_Utf8(buf16, buf8, sizeof(buf8), strictConversion);
    printf("\nUTF-16 => UTF-8 = %s\n\n", (const char *)buf8);
    return 0;
}
utf.h
/*
 * ************************************************************************
 *       Filename:  utf.h
 *    Description:
 *        Version:  1.0
 *        Created:  2016-10-21 09:50:47
 *       Revision:  none
 *       Compiler:  gcc
 *         Author:  YOUR NAME (),
 *        Company:
 * ************************************************************************/
#ifndef __UTF_H__
#define __UTF_H__

#define FALSE 0
#define TRUE  1

#define halfShift 10

#define UNI_SUR_HIGH_START  (UTF32)0xD800
#define UNI_SUR_HIGH_END    (UTF32)0xDBFF
#define UNI_SUR_LOW_START   (UTF32)0xDC00
#define UNI_SUR_LOW_END     (UTF32)0xDFFF

/* Some fundamental constants */
#define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
#define UNI_MAX_BMP          (UTF32)0x0000FFFF
#define UNI_MAX_UTF16        (UTF32)0x0010FFFF
#define UNI_MAX_UTF32        (UTF32)0x7FFFFFFF
#define UNI_MAX_LEGAL_UTF32  (UTF32)0x0010FFFF

typedef unsigned char  boolean;
typedef unsigned int   CharType;
typedef unsigned char  UTF8;
typedef unsigned short UTF16;
typedef unsigned int   UTF32;

static const UTF32 halfMask = 0x3FFUL;
static const UTF32 halfBase = 0x0010000UL;

/* Marker bits for the first byte of a UTF-8 sequence, indexed by sequence length. */
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

/* Values subtracted after accumulation to remove the marker bits, indexed by trailing byte count. */
static const UTF32 offsetsFromUTF8[6] = {
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

/* Number of trailing bytes that follow each possible first byte. */
static const char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

typedef enum {
    strictConversion = 0,
    lenientConversion
} ConversionFlags;

typedef enum {
    conversionOK,     /* conversion successful */
    sourceExhausted,  /* partial character in source, but hit end */
    targetExhausted,  /* insuff. room in target for conversion */
    sourceIllegal,    /* source sequence is illegal/malformed */
    conversionFailed
} ConversionResult;

#endif
Running the program gives the following result: