Question
I have a spout which on each tick goes to a Postgres database and reads one additional row. The spout code looks as follows:
import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

class RawDataLevelSpout extends BaseRichSpout implements Serializable {

    private int counter;
    SpoutOutputCollector collector;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("col1", "col2"));
    }

    @Override
    public void open(Map map, TopologyContext context, SpoutOutputCollector spoutOutputCollector) {
        collector = spoutOutputCollector;
    }

    private Connection initializeDatabaseConnection() {
        try {
            Class.forName("org.postgresql.Driver");
            return DriverManager.getConnection(DATABASE_URI, "root", "root");
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public void close() {
    }

    @Override
    public void nextTuple() {
        // Opens a fresh connection on every tick and reads the single row at OFFSET counter.
        List<String> values = new ArrayList<>();
        PreparedStatement statement = null;
        try {
            Connection connection = initializeDatabaseConnection();
            statement = connection.prepareStatement("SELECT * FROM table1 ORDER BY col1 LIMIT 1 OFFSET ?");
            statement.setInt(1, counter++);
            ResultSet resultSet = statement.executeQuery();
            resultSet.next();
            ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
            int totalColumns = resultSetMetaData.getColumnCount();
            for (int i = 1; i <= totalColumns; i++) {
                values.add(resultSet.getString(i));
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        collector.emit(new Values(values.stream().toArray(String[]::new)));
    }
}
What is the standard way to approach connection pooling in spouts in Apache Storm? Furthermore, is it possible to somehow synchronize the counter variable across multiple running instances within the cluster topology?
Answer 1:
Regarding connection pooling: you could pool connections via a static variable if you wanted, but since you aren't guaranteed to have all spout instances running in the same JVM, I don't think there's much point.
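For what it's worth, the "static variable" idea amounts to a per-JVM holder that all spout instances in one worker share. A minimal sketch, assuming the real pool (e.g. a `javax.sql.DataSource` from a library like HikariCP) is stubbed behind a `Supplier`; nothing here is Storm API, and `SharedPoolHolder` is a hypothetical name:

```java
import java.util.function.Supplier;

// Per-JVM holder: all spout instances running in the same worker JVM
// share the single resource created lazily on first access.
class SharedPoolHolder<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedPoolHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance;
                if (result == null) {
                    // factory runs exactly once per JVM, however many spout
                    // instances call get()
                    instance = result = factory.get();
                }
            }
        }
        return result;
    }
}
```

In a spout you would keep such a holder in a `static` field and call `get()` from `open()`. Spout instances scheduled on other workers would still build their own pool, which is exactly why the answer says there is little point.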
No, there is no way to synchronize the counter. The spout instances may be running on different JVMs, and you don't want them all blocking while the spouts agree on what the counter value is. I don't think your spout implementation makes sense, though. If you wanted to read just one row at a time, why not run a single spout instance instead of trying to synchronize multiple spouts?
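If you did run multiple instances, the usual alternative to a shared counter is to partition the offset space statically: each task learns its index and the task count from the `TopologyContext` (via `getThisTaskIndex()` and `getComponentTasks(...)`) and only reads the offsets in its own stripe, so no coordination is needed. A sketch of just the offset arithmetic, with a hypothetical helper class:

```java
// Each of numTasks spout instances walks a disjoint stripe of row offsets:
// task 0 reads 0, numTasks, 2*numTasks, ...; task 1 reads 1, numTasks+1, ...
class StripedOffsets {
    private final int taskIndex; // e.g. from context.getThisTaskIndex()
    private final int numTasks;  // e.g. from context.getComponentTasks(id).size()
    private int localCounter;

    StripedOffsets(int taskIndex, int numTasks) {
        this.taskIndex = taskIndex;
        this.numTasks = numTasks;
    }

    // Offset to bind into "... LIMIT 1 OFFSET ?" on the next tick.
    int nextOffset() {
        return taskIndex + numTasks * localCounter++;
    }
}
```

Each task keeps only its own `localCounter`, so no cross-JVM synchronization is required; the trade-off is that rows are consumed per stripe, not in global order.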
You seem to be trying to use your relational database as a queue system, which is probably a bad fit. Consider e.g. Kafka instead. I think you should be able to use either one of https://www.confluent.io/product/connectors/ or http://debezium.io/ to stream data from your Postgres to Kafka.
Source: https://stackoverflow.com/questions/49086488/apache-storm-accessing-database-from-spout-connection-pooling