In the previous part (you can find it here) I discussed the character sets Oracle supports and the length limits. In this part I’ll show how it is actually stored and discuss some more topics that are relevant to using different languages with Oracle.
Seeing the Data
In order to see how Oracle saves the data, I’ll use two built-in functions: dump and vsize.
The dump function shows us the actual bytes that represent our data, vsize shows us how much space this data takes.
We will continue with the Greek example. So this is a script to create a table and insert data:
create table lang(c varchar2(100)); truncate table lang; insert into lang values ('α'); insert into lang values ('β'); insert into lang values ('γ'); insert into lang values ('αβγ.ABC');I executed this on two different databases, one is using AL32UTF8 and the other is using EL8ISO8859P7 (the Greek character set). Now we will query the data using the vsize and dump functions and see what happens:
This is the output from the Greek database:C                    DUMP(C)                                              VSIZE(C) -------------------- -------------------------------------------------- ---------- α                    Typ=1 Len=1: 225                                            1 β                    Typ=1 Len=1: 226                                            1 γ                    Typ=1 Len=1: 227                                            1 αβγ.ABC              Typ=1 Len=7: 225,226,227,46,65,66,67                        7As you can see, each character takes a single byte (α is represented by 225, β is 226, γ is 227, period is 46, and A, B and C are 65, 66 and 67 respectively). Note the last string which takes 7 bytes.
Now, this is the output from the AL32UTF8 database:C                    DUMP(C)                                              VSIZE(C) -------------------- -------------------------------------------------- ---------- α                    Typ=1 Len=2: 206,177                                        2 β                    Typ=1 Len=2: 206,178                                        2 γ                    Typ=1 Len=2: 206,179                                        2 αβγ.ABC              Typ=1 Len=10: 206,177,206,178,206,179,46,65,66,67          10See the difference? Every Greek letter is now represented by two bytes (206 and 177 for α, 206 and 178 for β and 206 and 179 for γ), while the period and the English letters are still represented by single bytes (the same as before, 46 for period and 65,66 and 67 for A, B and C). This is why the last string takes 10 bytes (3 double byte characters and 4 single byte characters).
Client and Server Character Sets
This is where things might get a little bit crazy. The database character set is set when the database is created and to change it is a big hassle (I might write a post about that in the future). The client character set is configured at the client level (the NLS_LANG environment variable) or the Windows registry. Usually, the client character set and the database character set will be the same. However, in some cases it should be different. Today, most applications and operating systems support Unicode, so things are really easy, but if the application doesn’t, we will have to handle that. Let’s take a case where the database is Unicode while the application doesn’t support it. When querying data from the table above, Oracle will send 206 and 177 for α, while the application expects 225. That is obviously not going to work. In this case we will need to change the client character set to EL8ISO8859P7. The Oracle client will notice the character set difference and will convert the multi byte characters to the correct single byte ones. Note that the client is Greek now, so if, for some reason, there is a non Greek multi byte character in the data, Oracle will not know how to convert it, and this is when we get the upside down question mark.
In the past (many years ago, when I just started with DBA stuff), we used Windows 95 and Windows NT (3.5 and 4) with Oracle 7.3 and 8.0. For some reason, we configured the database with the wrong character set (the European one, WE8ISO8859P1 instead of the Hebrew one, IW8ISO8859P8), but everything worked just fine as long as we were consistent with the mistake (as Oracle didn’t realize we use Hebrew, it simply inserted the data as is, thinking it’s European languages and no conversion was made). When we started working with Unicode application, the problems started. The application got single byte data in WEISO8859P1 character set, so it wanted to convert it to Unicode to print it on the screen. Obviously, the conversion wasn’t a successful one and we got mainly Gibberish. Once we moved the databases to the correct IW8ISO8859P8 character set, the conversion was successful and everything worked.Summary
This summarize the couple of posts about character sets in the database. As I wrote in the first post, when I set with American (or English) people and talked about this stuff, they weren’t aware of all these issues, as they are using English which is supported in any character set (and usually represented in the same way, so there are no conversion problems). But for non-English countries, this sometimes can be a lot of pain. It is very important to choose the correct character set, that way we can store what we need and can perform conversion if needed. Hope you enjoyed the reading.