How to determine the language (English, Chinese…) of a given string in Oracle?

再見小時候 2021-02-06 00:46

How can I determine the language (English, Chinese...) of a given string (a table column value) in Oracle (multi-language environment)?

4 Answers
  • 2021-02-06 01:25

    It should be possible to use a library like Language Detection for Java and tie it into your PL/SQL.

    It will probably be more efficient to use SQL to do naive Bayesian filtering and use language profiles derived e.g. from Wikipedia (they are neatly packed here).

    These are just pointers, not a full solution as requested for the bounty, but should help bounty-seekers.
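
    The naive Bayesian idea above can be sketched as follows. This is only an illustration, written in Java rather than SQL, and it assumes character trigrams as features with add-one smoothing. The class NaiveBayesLanguageGuesser and its methods are hypothetical names, and the one-sentence training samples stand in for real language profiles, which would be built from a much larger corpus.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a toy naive Bayes classifier over character trigrams.
final class NaiveBayesLanguageGuesser {
    // language -> (trigram -> number of occurrences in the training text)
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();
    // language -> total number of trigrams seen in the training text
    private final Map<String, Integer> totals = new HashMap<>();

    void train(String language, String sampleText) {
        Map<String, Integer> counts = profiles.computeIfAbsent(language, k -> new HashMap<>());
        String text = sampleText.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
        }
    }

    // Returns the language with the highest log-likelihood, or null if the
    // input is shorter than one trigram or nothing has been trained.
    String guess(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        if (text.length() < 3) {
            return null;
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            Map<String, Integer> counts = entry.getValue();
            double total = totals.getOrDefault(entry.getKey(), 0);
            double vocabulary = counts.size() + 1; // one extra slot for unseen trigrams
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                int count = counts.getOrDefault(text.substring(i, i + 3), 0);
                // Add-one smoothing keeps unseen trigrams from zeroing the product.
                score += Math.log((count + 1) / (total + vocabulary));
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesLanguageGuesser guesser = new NaiveBayesLanguageGuesser();
        guesser.train("ENGLISH", "The quick brown fox jumps over the lazy dog");
        guesser.train("SPANISH", "El zorro salta sobre el perro perezoso");
        System.out.println(guesser.guess("the lazy dog"));
        System.out.println(guesser.guess("el perro salta"));
    }
}
```

    With profiles this small the result is only meaningful for text that resembles the training sentences; the same scoring could in principle be expressed in SQL as a join against a trigram table followed by a sum of logarithms.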

  • 2021-02-06 01:33

    A possible solution could be:

    1) maintain some dictionary.txt files in the languages you are expecting

    2) when parsing the input string in question, use something like a Scanner to read each word and search for it in the most expected dictionary, until the number of matches or misses (maybe a certain percentage) lets you decide whether or not the string is in that language.

    3) Check the next most likely dictionary, etc, until you find the answer, or cannot determine it.

    For example, have englishDict.txt, spanishDict.txt, and frenchDict.txt, and maybe check if the 1st 100 words exist in the englishDict.txt first, and if you find a reasonable number (say, 70 out of 100), you can reasonably assume it is in English; otherwise, check the next file. Or, you could also read from each Dictionary, and select the result with the most matches.

    Alternatively, you could search for commonly used words first, such as articles, pronouns and common verbs. I have a feeling that no matter the solution, you're going to have to perform some number of searches and comparisons to find the answer.
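
    The steps above could look roughly like this in Java. It is a minimal sketch under a few assumptions: the dictionaries are small in-memory word sets instead of files such as englishDict.txt, the match threshold is a ratio (0.7 mirrors the "70 out of 100" example), and DictionaryLanguageGuesser is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: checks dictionaries in the order they were added.
final class DictionaryLanguageGuesser {
    private static final int MAX_WORDS = 100; // only inspect the first 100 words
    // LinkedHashMap keeps insertion order, so add the most expected language first.
    private final Map<String, Set<String>> dictionaries = new LinkedHashMap<>();
    private final double threshold;

    DictionaryLanguageGuesser(double threshold) {
        this.threshold = threshold;
    }

    void addDictionary(String language, String... words) {
        dictionaries.put(language, new HashSet<>(Arrays.asList(words)));
    }

    // Returns the first language whose dictionary matches enough words,
    // or null if no dictionary reaches the threshold.
    String guess(String input) {
        String[] words = input.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
            int checked = 0;
            int matches = 0;
            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // split can yield a leading empty token
                }
                if (checked == MAX_WORDS) {
                    break;
                }
                checked++;
                if (entry.getValue().contains(word)) {
                    matches++;
                }
            }
            if (checked > 0 && (double) matches / checked >= threshold) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DictionaryLanguageGuesser guesser = new DictionaryLanguageGuesser(0.7);
        guesser.addDictionary("ENGLISH", "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog");
        guesser.addDictionary("SPANISH", "el", "zorro", "salta", "sobre", "perro", "perezoso");
        System.out.println(guesser.guess("The quick brown fox")); // ENGLISH
        System.out.println(guesser.guess("El perro salta"));      // SPANISH
    }
}
```

    Returning the first dictionary that passes the threshold follows steps 2 and 3; picking the dictionary with the most matches instead would only require tracking the best ratio across the loop.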

  • 2021-02-06 01:38

    Do you mean the language as "what language this word belongs to" or as in "character encoding"?

    For the first case I think there are only heuristics, and I am not sure whether Oracle Database ships any. Oracle Ultra Search has a statistical language recognizer.

    In the second case: the encoding is always the system encoding of the database, but you should not actually rely on it, as the data gets converted to your local representation when you retrieve it (depending on your client framework and driver, of course).
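
    As an example of a heuristic for the first case, the dominant Unicode script of a string can be counted with the standard Character.UnicodeScript class. This is only a rough sketch: ScriptGuesser is a hypothetical name, and the result identifies a writing system, not a language, so it can separate Chinese from English text but cannot tell English from German or Spanish.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: reports which Unicode script most letters belong to.
final class ScriptGuesser {
    // Returns the most frequent script among the letters of the text,
    // or null if the text contains no letters at all.
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(cp -> Character.isLetter(cp)) // ignore digits, spaces, punctuation
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Non-ASCII samples are written as escapes to avoid source-encoding issues.
        System.out.println(dominantScript("The quick brown fox")); // LATIN
        System.out.println(dominantScript("\u4e2d\u6587"));        // HAN (two Chinese characters)
    }
}
```

    A script check like this can also serve as a cheap first pass before a dictionary or statistical method is applied to strings that share a script.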

  • 2021-02-06 01:48

    The Oracle Globalization Development Kit (GDK) can detect languages.

    The GDK is included with Oracle but it is not installed in the database by default. To load the .jar files into the database, find the jlib directory in the Oracle home and run this operating system command from it:

    loadjava -u USER_NAME@SID orai18n.jar orai18n-collation.jar orai18n-lcsd.jar orai18n-mapping.jar orai18n-net.jar orai18n-servlet.jar orai18n-tools.jar orai18n-translation.jar orai18n-utility.jar
    

    Some extra Java privileges are needed, even if your user has the DBA role. Run this command and then reconnect:

    exec dbms_java.grant_permission( 'YOUR_USER_NAME', 'SYS:java.lang.RuntimePermission', 'getClassLoader', '' );
    

    Create a Java class to do the detection. Below is a very simple example that returns the best guess for a string:

    create or replace and compile java source named "Language_Detector"
    as
    import oracle.i18n.lcsd.*;
    public class Language_Detector
    {
        // Returns the GDK's best guess for the language of the given string,
        // as an Oracle language name.
        public static String detect(String some_string)
        {
            LCSDetector detector = new LCSDetector();
            // Feed the text to the detector, then read back its result.
            detector.detect(some_string);
            LCSDResultSet detector_results = detector.getResult();
            return detector_results.getORALanguage();
        }
    }
    /
    

    Wrap the Java class in a PL/SQL function:

    create or replace function detect_language(some_string varchar2)
    return varchar2
    as language java
    name 'Language_Detector.detect(java.lang.String) return java.lang.String';
    /
    

    Create a sample table:

    create table unknown_language(id number, text varchar2(4000));
    
    insert into unknown_language
    select 1, 'The quick brown fox jumps over the lazy dog' from dual union all
    select 2, 'El zorro marrón rápido salta sobre el perro perezoso' from dual union all
    select 3, '敏捷的棕色狐狸跳过懒狗' from dual union all
    select 4, 'Der schnelle braune Fuchs springt über den faulen Hund' from dual union all
    select 5, 'Быстрая коричневая лиса прыгает через ленивую собаку' from dual;
    

    Now the function is available in simple SELECT statements. In this trivial example the language detection works perfectly.

    select id, detect_language(text) language
    from unknown_language
    order by id;
    
    ID  LANGUAGE
    --  --------
    1   ENGLISH
    2   SPANISH
    3   SIMPLIFIED CHINESE
    4   GERMAN
    5   RUSSIAN
    