I have an Oracle database which stores the content of documents originating from all over the world, in different languages. The documents are stored in a table with a BLOB column.
The Oracle Globalization Development Kit (GDK) can detect character sets.
The GDK is included with Oracle, but it is not installed in the database by default. To load the .jar files into the database, find the jlib directory in the Oracle home and run this operating system command:
loadjava -u USER_NAME@SID orai18n.jar orai18n-collation.jar orai18n-lcsd.jar orai18n-mapping.jar orai18n-net.jar orai18n-servlet.jar orai18n-tools.jar orai18n-translation.jar orai18n-utility.jar
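To confirm that the classes loaded, one rough check is to count them in the data dictionary. (Loaded Java class names may be stored in a shortened form; dbms_java.longname can expand them.)
--Sanity check: the GDK classes should show up as valid JAVA CLASS objects.
select status, count(*)
from user_objects
where object_type = 'JAVA CLASS'
group by status;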
Some extra Java privileges are needed, even if your user has DBA. Run this command and then re-connect:
exec dbms_java.grant_permission( 'YOUR_USER_NAME', 'SYS:java.lang.RuntimePermission', 'getClassLoader', '' );
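To verify the grant took effect, you can query the Java policy view. This is only a sketch and assumes your user can see USER_JAVA_POLICY:
--The getClassLoader permission should be listed as enabled.
select kind, grantee, name, enabled
from user_java_policy
where name like '%getClassLoader%';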
Create a Java class to do the detection. Below is a very simple example that returns the best guess for a string:
create or replace and compile java source named "Character_Set_Detector"
as
import oracle.i18n.lcsd.*;
import java.sql.*;
import java.io.IOException;
public class Character_Set_Detector
{
    //Return the name of the Oracle character set that best fits the BLOB contents.
    public static String detect(Blob some_blob) throws SQLException, IOException
    {
        //Feed the raw bytes to the GDK character set detector.
        LCSDetector detector = new LCSDetector();
        detector.detect(some_blob.getBinaryStream());
        //Return the highest-ranked character set from the results.
        LCSDResultSet detector_results = detector.getResult();
        return detector_results.getORACharacterSet();
    }
}
/
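If the Java source does not compile cleanly, the errors can be pulled from USER_ERRORS (note the quoted, case-sensitive name):
select line, position, text
from user_errors
where name = 'Character_Set_Detector';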
Wrap the Java class in a PL/SQL function:
create or replace function detect_character_set(some_blob blob)
return varchar2
as language java
name 'Character_Set_Detector.detect(java.sql.Blob) return java.lang.String';
/
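At this point the function can be called directly in SQL, for example against a hypothetical table named documents with a BLOB column named content:
--Table and column names are made up for illustration; substitute your own.
select id, detect_character_set(content) character_set
from documents;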
I simulated different character sets by translating a string into different languages, saving the text in different encodings with a text editor, opening the files with a hex editor, and converting the hex into a BLOB:
--UTF8
--The quick brown fox jumps over the lazy dog
select 1 id, detect_character_set(hextoraw('54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f67')) character_set from dual union all
--Western European (ISO-8859-1)
--El zorro marrón rápido salta sobre el perro perezoso
select 2 id, detect_character_set(hextoraw('456c207a6f72726f206d617272f36e2072e17069646f2073616c746120736f62726520656c20706572726f20706572657a6f736f')) from dual union all
--Chinese Simplified (GBK)
--敏捷的棕色狐狸跳过懒狗
select 3 id, detect_character_set(hextoraw('c3f4bdddb5c4d7d8c9abbafcc0eaccf8b9fdc0c1b9b7')) from dual union all
--Western European (Windows-1252)
--Der schnelle braune Fuchs springt über den faulen Hund
select 4 id, detect_character_set(hextoraw('446572207363686e656c6c6520627261756e6520467563687320737072696e677420fc6265722064656e206661756c656e2048756e64')) from dual union all
--Cyrillic (KOI8-R)
--Быстрая коричневая лиса прыгает через ленивую собаку
select 5 id, detect_character_set(hextoraw('e2d9d3d4d2c1d120cbcfd2c9decec5d7c1d120ccc9d3c120d0d2d9c7c1c5d420dec5d2c5da20ccc5cec9d7d5c020d3cfc2c1cbd5')) from dual;
ID CHARACTER_SET
-- -------------
1 US7ASCII
2 WE8ISO8859P1
3 ZHS16CGB231280
4 WE8ISO8859P1
5 CL8KOI8R
That trivial example works well, but I don't know how well it will work with real-world files. There are a lot of features in the GDK; the above code is only a simple starting point. With only minor changes, the code can also detect languages, as demonstrated in my answer here.