Greenplum batch operations: deletes and updates are fastest when done inside the database

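This is a write-up of how we resolved a large number of duplicate rows in a big production Greenplum (GP) database: 1. create a temporary table and back up the duplicate rows into it; 2. use the backup table as the query condition for deleting from the production table.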

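I have recently been working with Greenplum, a distributed database with an MPP architecture. It has its strengths and its rough edges, but overall it feels solid. Under the hood it is still PostgreSQL: my version, GP 4.2, is built on PostgreSQL 8.2, while the newer GP 6.0 is based on PostgreSQL 9.4 and performs considerably better.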

First, the pitfalls and data problems I ran into, covering deletes, updates, partitioned tables and a few related concepts.

The driver jar is placed under the project's lib directory and declared as a system-scoped dependency:

    <dependency>
        <groupId>com.fbcds</groupId>
        <artifactId>fbcds</artifactId>
        <version>1.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/src/main/resources/lib/greenplum.jar</systemPath>
    </dependency>

Connecting to the database (the class also contains other wrapper methods for queries, not shown here):

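    // Requires: import com.alibaba.druid.pool.DruidDataSource;
    // Eager initialization: the pool is created once, when the class is loaded
    private static DruidDataSource dataSource = null;

    static {
        dataSource = new DruidDataSource();
        dataSource.setDriverClassName("com.pivotal.jdbc.GreenplumDriver");
        dataSource.setUsername("gptest");
        dataSource.setPassword("123456");
        dataSource.setUrl("jdbc:pivotal:greenplum://192.168.0.1:8088;DatabaseName=npp_db");

        // Pool sizing: initial, minimum idle and maximum active connections
        dataSource.setInitialSize(10);
        dataSource.setMinIdle(10);
        dataSource.setMaxActive(50);
        // Interval between idle-connection eviction runs, in milliseconds
        dataSource.setTimeBetweenEvictionRunsMillis(300000);
        // Query used to check that a pooled connection is still alive
        dataSource.setValidationQuery("SELECT 'x'");
        // Minimum time a connection stays idle in the pool before it may be evicted, in milliseconds
        dataSource.setMinEvictableIdleTimeMillis(300000);
    }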

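The batch code below exists mainly to clean up production data. There were a lot of duplicates, and partitioned tables are extremely painful to work on, so the first step is to create a temporary table that stores the rows to be deleted.

Then a correlated operation inside the database does the rest: look up the IDs in the temporary table and delete the matching rows from the production table. Like the insert, this runs entirely inside the database, which is much faster. That is the whole idea: delete inside the database instead of fetching rows into the program and processing them there.

How you frame the problem matters a lot: the real question is which processing approach will be fastest.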
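As a minimal sketch of the in-database delete described above, the helper below builds a `DELETE ... USING` statement that joins the production table to the backup table, so no rows travel through the application. The class name `DeleteUsingSketch`, the method `buildDeleteUsing` and the choice of join columns are assumptions made for illustration; they are not taken from the original program, and the table names are assumed to come from trusted configuration rather than user input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original code base.
class DeleteUsingSketch {

    // Builds: DELETE FROM <target> AS t USING <backup> AS b WHERE t."k1" = b."k1" AND ...
    // Identifiers are concatenated as-is, so they must come from trusted configuration.
    static String buildDeleteUsing(String target, String backup, List<String> keyColumns) {
        if (keyColumns.isEmpty()) {
            throw new IllegalArgumentException("at least one key column is required");
        }
        String join = keyColumns.stream()
                .map(c -> "t.\"" + c + "\" = b.\"" + c + "\"")
                .collect(Collectors.joining(" AND "));
        return "DELETE FROM " + target + " AS t USING " + backup + " AS b WHERE " + join;
    }

    public static void main(String[] args) {
        // Assumed key columns; adjust to whatever identifies a row in the real tables.
        System.out.println(buildDeleteUsing(
                "da_his.tb_dev_five_data_1",
                "da_his.test_five_data_his_1",
                Arrays.asList("snmp_index", "snmp", "id")));
    }
}
```

The generated statement can be passed to any JDBC `executeUpdate` call. Whether it is fast enough on a heavily partitioned table depends on the data, so it is worth trying on a small time range first.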

    String tablename = Snmp_Table;
    // For every ("timestamp", snmp_index) group that occurs more than once in the time range,
    // copy one representative row (the smallest id) into the backup table.
    // sl holds additional filter conditions built elsewhere in the program (not shown).
    String sql_info =
            " insert into da_his.test_five_data_his_1 select min(\"id\") \"id\",min(\"timestamp\") as \"timestamp\",max(snmp_index) "
                    + "as snmp_index,max(snmp) as snmp,max(\"in\") as \"in\",max(\"out\") as \"out\"  from " + tablename
                    + " where \"timestamp\">=? " + " and \"timestamp\"<=? and  (" + sl.toString()
                    + ") and \"in\"<'1250000000000' and  \"out\"<'1250000000000' group by "
                    + "\"timestamp\" ,snmp_index  having count(\"timestamp\")>1 ";
    // Earlier approach, kept for reference: fetch the rows into the program instead
    //List<Map<String, Object>> list_info = (List<Map<String, Object>>) Gp_tools.query(sql_info, new Object[] {obj[0], obj[1]},
    //       new AllObjectMapper());
    Gp_tools.updateTemp(sql_info, new Object[] {obj[0], obj[1]}, true);

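    // Executes a parameterized INSERT/UPDATE/DELETE and reports whether it succeeded.
    // isGenerateKey is kept from the original signature but is not used here.
    public static boolean updateTemp(String sql, Object[] obj, boolean isGenerateKey) {
        Connection conn = null;
        boolean bFlag = true;
        try {
            conn = JdbcUtil.getConnection();
            // try-with-resources closes the statement even if execution fails
            try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
                // Bind every parameter in order (JDBC parameter indexes start at 1)
                for (int i = 0; i < obj.length; i++) {
                    pstmt.setObject(i + 1, obj[i]);
                }
                pstmt.executeUpdate();
            }
            // An explicit commit is only valid when auto-commit is off
            if (!conn.getAutoCommit()) {
                conn.commit();
            }
        } catch (SQLException ex) {
            ex.printStackTrace();
            bFlag = false; // report the failure instead of always returning true
        } finally {
            try {
                JdbcUtil.releaseConnection(conn);
            } catch (SQLException ex) {
                ex.printStackTrace();
            }
        }
        return bFlag;
    }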

Another way to batch-delete, which also turned out to be slow:

delete from da_his.tb_dev_five_data_1 as devfive using (values ('877|54',277,1444413654),('877|54',277,1444425009),('877|54',277,1444413674)) as tmp(snmp_index,snmp,"id") where 
 devfive.snmp_index=tmp.snmp_index and devfive.snmp=tmp.snmp and devfive.id=tmp.id ;

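Batch updates work on the same principle, but with billions of rows split across several hundred partitions, this kind of statement can bring the system down in no time: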

 update test set info=tmp.info from (values (1,'new1'),(2,'new2'),(6,'new6')) as tmp (id,info) where test.id=tmp.id;

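Duplicates can also be bulk-deleted the way the official documentation explains, shown below; the columns in the final GROUP BY must be the ones that are supposed to be unique. Note that on Greenplum a ctid is only unique within a single segment, so on a distributed table it would probably have to be combined with gp_segment_id: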

DELETE FROM test
WHERE ctid NOT IN (
SELECT min(ctid)
FROM test
GROUP BY x);
(where 'x' is the unique column list)

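The final solution: query the rows to delete from the backup table (the copy of the temporary table), and have the program assemble SQL like the following: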

delete from da_his.tb_dev_five_data_1  where snmp=472 and "id" in(5357929218,5357927657,5357929285,5357928550,5357929262);
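One way the program might assemble statements like the one above is sketched here: IDs read from the backup table are split into fixed-size chunks and each chunk becomes one `delete ... in(...)` statement. The class `InListDeleteSketch`, its method names and the chunk size are assumptions for illustration only; the original program's own concatenation code is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original code base.
class InListDeleteSketch {

    // Splits ids into consecutive chunks of at most chunkSize elements.
    static List<List<Long>> chunk(List<Long> ids, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<Long>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            chunks.add(new ArrayList<>(ids.subList(i, Math.min(i + chunkSize, ids.size()))));
        }
        return chunks;
    }

    // Builds one DELETE for a single snmp value and a non-empty list of ids.
    // The ids are Long values, so joining them into the SQL text cannot inject anything.
    static String buildDelete(String table, int snmp, List<Long> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("ids must not be empty");
        }
        String inList = ids.stream().map(String::valueOf).collect(Collectors.joining(","));
        return "delete from " + table + " where snmp=" + snmp + " and \"id\" in(" + inList + ")";
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(
                5357929218L, 5357927657L, 5357929285L, 5357928550L, 5357929262L);
        // Chunk size 2 is only for demonstration; a real run would use a much larger value.
        for (List<Long> part : chunk(ids, 2)) {
            System.out.println(buildDelete("da_his.tb_dev_five_data_1", 472, part));
        }
    }
}
```

Keeping each statement limited to one `snmp` value and a bounded ID list keeps individual deletes short, which also makes them easy to distribute across worker threads.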

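Run with multiple threads, this is fast. It may also have helped that few people use the system at night, but the deletes finished very quickly.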
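The multi-threaded run mentioned above could look roughly like the sketch below, which submits each prepared DELETE statement to a fixed-size thread pool and sums the reported row counts. The class `ParallelDeleteSketch`, the `runAll` method and the pluggable `executor` function are assumptions for illustration; in a real run the function would call something like `Statement.executeUpdate` on a pooled connection, while here a stub stands in so the example runs without a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of the original code base.
class ParallelDeleteSketch {

    // Runs every statement on a fixed-size pool and returns the sum of the reported row counts.
    static int runAll(List<String> statements, int threads, ToIntFunction<String> executor)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String sql : statements) {
                tasks.add(() -> executor.applyAsInt(sql));
            }
            int total = 0;
            // invokeAll blocks until all tasks have finished
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get(); // rethrows a failure from the worker thread
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> statements = Arrays.asList(
                "delete from da_his.tb_dev_five_data_1 where snmp=472 and \"id\" in(5357929218,5357927657)",
                "delete from da_his.tb_dev_five_data_1 where snmp=472 and \"id\" in(5357929285,5357928550)");
        // Stub executor: prints the statement and pretends one row was deleted.
        int total = runAll(statements, 2, sql -> {
            System.out.println(sql);
            return 1;
        });
        System.out.println("rows reported: " + total);
    }
}
```

The thread count should stay well below the connection pool's maximum (50 in the configuration shown earlier), since each worker holds a connection while its statement runs.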

Source: CSDN

Author: 大树168

Link: https://blog.csdn.net/limingcai168/article/details/103719320
