Call R from JAVA to get Chi-squared statistic and p-value

滥情空心 2020-12-19 23:49

I have two 4*4 matrices in JAVA, where one matrix holds observed counts and the other expected counts.

I need an automated way to calculate the p-value from the chi

6 answers
  • 2020-12-20 00:20

    1) Outputting my matrices from JAVA into .csv files

    Use any CSV library; I would recommend http://opencsv.sourceforge.net/
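    For a small numeric matrix, the CSV step can also be sketched with the standard library alone. This is a minimal illustration, not opencsv's API: the class and method names (MatrixCsv, toCsv, writeCsv) are made up for the example, and it assumes a rectangular matrix of finite values written without a header row.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.StringJoiner;
import java.util.UUID;

// Hypothetical helper, not part of any library.
final class MatrixCsv {
    // Renders a matrix as header-less CSV text, one matrix row per line.
    static String toCsv(double[][] matrix) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : matrix) {
            StringJoiner line = new StringJoiner(",");
            for (double v : row) {
                line.add(Double.toString(v));
            }
            sb.append(line.toString()).append('\n');
        }
        return sb.toString();
    }

    // Writes the matrix to a uniquely named .csv file in dir and returns its path.
    static Path writeCsv(double[][] matrix, Path dir) throws IOException {
        String name = UUID.randomUUID().toString().replace("-", "") + ".csv";
        Path file = dir.resolve(name);
        Files.write(file, toCsv(matrix).getBytes(StandardCharsets.UTF_8));
        return file;
    }
}
```

    Because no header row is written, the R side would need read.csv(path, header = FALSE) so that the first matrix row is not consumed as column names.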

    2) Uploading the .csv files into R 3) Calling chisq.test on the .csv files in R

    Steps 2 and 3 are pretty much the same. You are better off creating a parameterized script to run in R.

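    args <- commandArgs(trailingOnly = TRUE)
    obs <- as.matrix(read.csv(args[1], header = FALSE))
    exp <- as.matrix(read.csv(args[2], header = FALSE))
    # chisq.test(obs, exp) ignores exp when obs is a matrix, so pass the expected counts as p.
    result <- chisq.test(as.vector(obs), p = as.vector(exp), rescale.p = TRUE)
    cat(result$p.value, "\n")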
    

    So you can run

    Rscript your_script.r path_to_csv1 path_to_csv2
    

    and use unique names for the csv files for example:

    UUID.randomUUID().toString().replace("-","")
    

    And then you use

    Runtime.getRuntime().exec(command, environments, dataDir);
    

    4) Returning the outputted p-value back into Java? If you invoke R through Runtime.getRuntime().exec(), the only way to get the result back is to read R's output from the returned Process.
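    Launching the script and reading the p-value back could look roughly like the sketch below. It uses ProcessBuilder instead of Runtime.exec() because it makes stream handling a little easier. The class and method names are hypothetical, and it assumes that Rscript is on the PATH and that the script prints the bare p-value as its last line of standard output, for example with cat(result$p.value, "\n").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
final class RscriptRunner {
    // Builds the command line; the script name and CSV paths are placeholders.
    static List<String> buildCommand(String script, String obsCsv, String expCsv) {
        return Arrays.asList("Rscript", script, obsCsv, expCsv);
    }

    // Parses the last non-blank line of the script's standard output as the p-value.
    static double parsePValue(String output) {
        String[] lines = output.trim().split("\\R");
        return Double.parseDouble(lines[lines.length - 1].trim());
    }

    // Runs the command and returns everything it wrote to standard output.
    // Standard error is passed through to the JVM's stderr so that R warnings
    // neither mix with the result nor block the child process.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Rscript exited with code " + exit);
        }
        return out.toString();
    }
}
```

    A call would then be along the lines of RscriptRunner.parsePValue(RscriptRunner.run(RscriptRunner.buildCommand("your_script.r", path1, path2))).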

    I would also recommend taking a look at Apache's statistics library and at "How to calculate PValue from ChiSquare". Maybe you can live without R at all :)

  • 2020-12-20 00:25

    Check out JRI.

    Description from their site:

    JRI is a Java/R Interface, which allows to run R inside Java applications as a single thread. Basically it loads R dynamic library into Java and provides a Java API to R functionality. It supports both simple calls to R functions and a full running REPL.

  • 2020-12-20 00:27

    There are (at least) two ways of going about this.


    Command Line & Scripts

    You can execute Rscripts from the command line with Rscript.exe. E.g. in your script you would have:

    # Parse arguments.
    # ...
    # ...
    
    chisq.test(obs, exp)
    

    Rather than creating CSVs in Java and having R read them, you should be able to pass them straight to R. I don't see the need to create CSVs and pass data that way, UNLESS your matrices are quite big. There are limitations on the size of command line arguments you can pass (varies across operating system I think).

    You can pass arguments to an R script and parse them with the commandArgs() function or with various packages (e.g. optparse or getopt). See this thread for more information.

    There are several ways of calling and reading from the command line in Java. I don't know enough about it to give you advice but a bit of googling will give you a result. Calling a script from the command line is done like this:

    Rscript my_script.R
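    On the Java side, passing a small matrix straight on the command line could be sketched as follows. The argument layout (row count, column count, then the values in row-major order) is an assumption made for this example, as are the class and method names; the R script has to parse whatever layout you choose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class MatrixArgs {
    // Builds: Rscript <script> <nrow> <ncol> <values in row-major order...>
    // Assumes a non-empty rectangular matrix.
    static List<String> toCommand(String script, double[][] matrix) {
        List<String> command = new ArrayList<>();
        command.add("Rscript");
        command.add(script);
        command.add(Integer.toString(matrix.length));
        command.add(Integer.toString(matrix[0].length));
        for (double[] row : matrix) {
            for (double v : row) {
                command.add(Double.toString(v));
            }
        }
        return command;
    }
}
```

    With that layout, the script could rebuild the matrix with a <- as.numeric(commandArgs(trailingOnly = TRUE)) followed by matrix(a[-(1:2)], nrow = a[1], ncol = a[2], byrow = TRUE). The resulting list can be handed to a ProcessBuilder.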
    

    JRI

    JRI lets you talk to R straight from Java. Here's an example of how you would pass a double array to R and have R sum it (this is Java now):

    import org.rosuda.JRI.REXP;
    import org.rosuda.JRI.Rengine;

    // Start R session.
    Rengine re = new Rengine(new String[] {"--vanilla"}, false, null);

    // Check if the session is working.
    if (!re.waitForR()) {
        return;
    }

    // Copy a Java array into the R variable x, then evaluate sum(x) in R.
    re.assign("x", new double[] {1.5, 2.5, 3.5});
    REXP result = re.eval("(sum(x))");
    System.out.println(result.asDouble());
    re.end();
    

    The function assign() here is the same as doing this in R:

    x <- c(1.5, 2.5, 3.5)
    

    You should be able to work out how to extend this to work with a matrix.
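    One way to extend this to a matrix, sketched under the assumption that you keep using assign() with a flat double array: flatten the Java matrix column by column (the order R's matrix() fills in by default) and reshape it inside R. The helper class below is hypothetical and contains no JRI calls itself.

```java
// Hypothetical helper, not part of JRI.
final class JriMatrix {
    // Flattens a rectangular matrix column by column, which is the order
    // R's matrix() uses by default when filling a matrix from a vector.
    static double[] columnMajor(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[] flat = new double[rows * cols];
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                flat[c * rows + r] = m[r][c];
            }
        }
        return flat;
    }

    // R expression that reshapes an already assigned vector into a matrix.
    static String reshapeExpr(String var, int rows, int cols) {
        return var + " <- matrix(" + var + ", nrow = " + rows + ", ncol = " + cols + ")";
    }
}
```

    With the engine from the example above, this would be used as re.assign("obs", JriMatrix.columnMajor(observed)); followed by re.eval(JriMatrix.reshapeExpr("obs", 4, 4)); after which an expression such as chisq.test(obs)$p.value can be evaluated and read with asDouble().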


    I think JRI is quite difficult at the beginning. So if you want to get this done quickly the command line option is probably best. I would say the JRI approach is less messy once you get it set up though. And if you have situations where you have a lot of back and forth between R and Java it is definitely better than calling multiple scripts.

    1. Link to JRI.
    2. Recommended Eclipse plugin to set up JRI.
  • 2020-12-20 00:39

    I recommend simply using a Java library that performs the chi-square test for you. There are plenty of them:

    • Apache commons math: http://commons.apache.org/proper/commons-math/
    • JSC: http://www.jsc.nildram.co.uk/
    • JDistlib: http://jdistlib.sourceforge.net/

    This is not a complete list, just what I found in five minutes of searching.
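    To show what such a library computes, here is a dependency-free sketch. It is not the API of any of the libraries listed: the class is hypothetical, the incomplete gamma function is a textbook series and continued-fraction implementation, and the degrees of freedom are left to the caller because they depend on how the expected counts were obtained. In practice a tested library is the safer choice.

```java
// Hypothetical, hand-rolled helper; prefer a tested library in production.
final class ChiSquare {
    // Pearson statistic: sum over all cells of (observed - expected)^2 / expected.
    static double statistic(double[][] observed, double[][] expected) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            for (int j = 0; j < observed[i].length; j++) {
                double diff = observed[i][j] - expected[i][j];
                sum += diff * diff / expected[i][j];
            }
        }
        return sum;
    }

    // Upper-tail p-value of a chi-square distribution with df degrees of freedom.
    static double pValue(double statistic, int df) {
        return upperGammaQ(df / 2.0, statistic / 2.0);
    }

    // Lanczos approximation of ln(Gamma(x)) for x > 0.
    static double logGamma(double x) {
        double[] cof = {76.18009172947146, -86.50532032941677, 24.01409824083091,
                        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5};
        double y = x;
        double tmp = x + 5.5;
        tmp -= (x + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (double c : cof) {
            ser += c / ++y;
        }
        return -tmp + Math.log(2.5066282746310005 * ser / x);
    }

    // Regularized upper incomplete gamma function Q(a, x).
    static double upperGammaQ(double a, double x) {
        if (x <= 0.0) {
            return 1.0;
        }
        double prefix = Math.exp(-x + a * Math.log(x) - logGamma(a));
        if (x < a + 1.0) {
            // Series expansion of the lower function P(a, x); Q = 1 - P.
            double ap = a;
            double term = 1.0 / a;
            double sum = term;
            for (int n = 0; n < 1000; n++) {
                ap += 1.0;
                term *= x / ap;
                sum += term;
                if (Math.abs(term) < Math.abs(sum) * 1e-15) {
                    break;
                }
            }
            return 1.0 - sum * prefix;
        }
        // Continued fraction for Q(a, x), evaluated with the modified Lentz method.
        double tiny = 1e-300;
        double b = x + 1.0 - a;
        double c = 1.0 / tiny;
        double d = 1.0 / b;
        double h = d;
        for (int i = 1; i < 1000; i++) {
            double an = -i * (i - a);
            b += 2.0;
            d = an * d + b;
            if (Math.abs(d) < tiny) d = tiny;
            c = b + an / c;
            if (Math.abs(c) < tiny) c = tiny;
            d = 1.0 / d;
            double delta = d * c;
            h *= delta;
            if (Math.abs(delta - 1.0) < 1e-15) {
                break;
            }
        }
        return prefix * h;
    }
}
```

    For two 4*4 matrices this would be called as ChiSquare.pValue(ChiSquare.statistic(observed, expected), df), with df chosen by the caller, for example 9 for a test of independence on a 4*4 table.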

  • 2020-12-20 00:41

    Rserve is another way to get your data from Java to R and back. It is a server which takes R scripts as string inputs. You can use some string parsing and conversion in Java to convert the matrices into strings that can be input into R.

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;
    
    
    public class RtestScript {

        private String emailTestScript = "open <- c('O', 'O', 'N', 'N', 'O', 'O', 'N', 'N', 'N', 'O', " +
                " 'O', 'N', 'N', 'O', 'O', 'N', 'N', 'N', 'O');" +
                "testgroup <- c('A', 'A', 'A','A','A','A','A','A','A','A', 'B'," +
                "'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B');" +
                "emailTest <- data.frame(open, testgroup);" +
                "emailTable<- table(emailTest$open, emailTest$testgroup);" +
                "emailResults<- prop.test(emailTable, correct=FALSE);" +
                "print(emailResults$p.value);";

        public void executeRscript() {
            try {
                //Make sure to type in library(Rserve); Rserve() in Rstudio before running this
                RConnection testConnection = new RConnection();

                REXP testExpression = testConnection.eval(emailTestScript);
                System.out.println("P value: " + testExpression.asString());
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }
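    The string conversion mentioned above could be sketched like this for a numeric matrix. The helper is hypothetical and contains no Rserve calls; it assumes a rectangular matrix of finite values, since Java prints infinities as "Infinity", which R does not understand.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Rserve.
final class RMatrixLiteral {
    // Renders a matrix as R source text, filled row by row,
    // e.g. matrix(c(1.0,2.0,3.0,4.0), nrow = 2, byrow = TRUE).
    static String toR(double[][] m) {
        StringJoiner values = new StringJoiner(",");
        for (double[] row : m) {
            for (double v : row) {
                values.add(Double.toString(v));
            }
        }
        return "matrix(c(" + values + "), nrow = " + m.length + ", byrow = TRUE)";
    }
}
```

    The result can be spliced into the script string sent over the connection, for example testConnection.eval("chisq.test(" + RMatrixLiteral.toR(observed) + ")$p.value").asDouble().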
    

    Here is some more information on Rserve. Incidentally, this is also how Tableau communicates with R through its R connection.

    https://cran.r-project.org/web/packages/Rserve/index.html

  • 2020-12-20 00:43

    RCaller 2.2 can do what you want. Suppose the frequency matrix is given as in your question. The resulting p-value and degrees of freedom can be calculated and returned using the code below:

    double[][] data = new double[][]{
        {197.136, 124.32, 63.492, 59.052},
        {124.32, 78.4, 40.04, 37.24},
        {63.492, 40.04, 20.449, 19.019},
        {59.052, 37.24, 19.019, 17.689}
    };
    RCaller caller = new RCaller();
    Globals.detect_current_rscript();
    caller.setRscriptExecutable(Globals.Rscript_current);
    RCode code = new RCode();

    code.addDoubleMatrix("mydata", data);
    code.addRCode("result <- chisq.test(mydata)");
    code.addRCode("mylist <- list(pval = result$p.value, df=result$parameter)");

    caller.setRCode(code);
    caller.runAndReturnResult("mylist");

    double pvalue = caller.getParser().getAsDoubleArray("pval")[0];
    double df = caller.getParser().getAsDoubleArray("df")[0];
    System.out.println("Pvalue is : "+pvalue);
    System.out.println("Df is : "+df);
    

    The output is:

    Pvalue is : 1.0
    Df is : 9.0
    

    You can find the technical details here.
