data-integration

Casting date in Talend Data Integration

Submitted by ﹥>﹥吖頭↗ on 2019-12-22 13:56:21
Question: In a data flow from one table to another, I would like to cast a date. The date leaves the source table as a string in this format: "2009-01-05 00:00:00:000 + 01:00". I tried to convert this to a date using a tConvertType, but apparently that is not allowed. My second option is to cast this string to a date using a formula in a tMap component. So far I have tried these formulas: - TalendDate.formatDate("yyyy-MM-dd",row3.rafw_dz_begi); - TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",row3.rafw

Unable to connect to HDFS using PDI step

Submitted by 自古美人都是妖i on 2019-12-21 21:22:39
Question: I have successfully configured Hadoop 2.4 in an Ubuntu 14.04 VM from a Windows 8 system. The Hadoop installation is working absolutely fine, and I am also able to view the NameNode from my Windows browser (screenshot attached in the original post). So my host name is ubuntu and the HDFS port is 9000 (correct me if I am wrong). core-site.xml: <property> <name>fs.defaultFS</name> <value>hdfs://ubuntu:9000</value> </property> The issue occurs while connecting to HDFS from my Pentaho Data Integration tool (screenshot attached in the original post).
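Independent of PDI, a quick way to confirm that the NameNode is reachable from the client side is to list the HDFS root with the Hadoop client API. A minimal sketch, assuming Hadoop client jars matching the 2.4 cluster are on the classpath and that the host name "ubuntu" resolves from the machine running the check:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must match fs.defaultFS from core-site.xml; "ubuntu" must resolve from the client machine too.
        conf.set("fs.defaultFS", "hdfs://ubuntu:9000");
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

If this listing fails from the Windows side, the problem is name resolution or the port rather than PDI itself.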

Missing plugins found while loading a transformation on Kettle

Submitted by 回眸只為那壹抹淺笑 on 2019-12-17 21:36:52
Question: I receive this error whenever I run my extraction from the command line, not in the Spoon UI:
Missing plugins found while loading a transformation
Step : MongoDbInput
    at org.pentaho.di.job.entries.trans.JobEntryTrans.getTransMeta(JobEntryTrans.java:1200)
    at org.pentaho.di.job.entries.trans.JobEntryTrans.execute(JobEntryTrans.java:643)
    at org.pentaho.di.job.Job.execute(Job.java:714)
    at org.pentaho.di.job.Job.execute(Job.java:856)
    ... 4 more
Caused by: org.pentaho.di.core.exception
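When a transformation is started outside Spoon, the MongoDB step is only found if Kettle's plugin folders are on the scan path of the runtime doing the loading. The sketch below shows that idea from the Java API; the install path, the .ktr name, and the use of KETTLE_PLUGIN_BASE_FOLDERS are assumptions to adapt to your PDI installation, not a confirmed fix.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunWithPlugins {
    public static void main(String[] args) throws Exception {
        // Hypothetical plugin location; point this at the data-integration/plugins folder of your install.
        System.setProperty("KETTLE_PLUGIN_BASE_FOLDERS", "/opt/pentaho/data-integration/plugins");
        KettleEnvironment.init();                          // registers core steps and scans plugin folders
        TransMeta meta = new TransMeta("extraction.ktr");  // transformation that uses MongoDbInput
        Trans trans = new Trans(meta);
        trans.execute(null);
        trans.waitUntilFinished();
    }
}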

Pentaho Hadoop File Input

Submitted by  ̄綄美尐妖づ on 2019-12-13 01:24:46
Question: I'm trying to retrieve data from a standalone Hadoop HDFS (version 2.7.2, with default properties) using Pentaho Kettle (version 6.0.1.0-386). Pentaho and Hadoop are not on the same machine, but I have access from one to the other. I created a new "Hadoop File Input" step with the following properties (Environment, File/Folder, Wildcard, Required, Include subfolders): url-to-file, N, N. The url-to-file is built like: ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}, e.g. hdfs://hadoop:
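To check the URL components outside Kettle, the same file can be opened with the Hadoop client API; note that a plain hdfs:// URI does not carry a password, so only the user part of the URL matters for simple (non-Kerberos) authentication. Host, port, user, and path below are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        // Connect as a specific HDFS user; replace host, port, and user with your own values.
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.10:9000"), new Configuration(), "hadoop");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/hadoop/input/data.csv"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}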

Count the number of rows for each file along with the file name in Talend

Submitted by [亡魂溺海] on 2019-12-12 06:36:59
Question: I have built a job that reads the data from a file and, based on the unique values of a particular column, splits the data set into many files. I am able to achieve the requirement with the job below (shown in the original post). Now, from this job that splits the output into multiple files, I want to add a sub-job that would give me two columns: in the first column, the names of the files created by my main job, and in the second column, the count of the number of rows in each created output file
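In Talend this is typically a tFileList iterating over the generated files with a tFileRowCount per file; the equivalent logic in plain Java is just listing the output folder and counting lines per file. A minimal sketch, with the output folder name as an assumption:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CountRowsPerFile {
    public static void main(String[] args) throws IOException {
        Path outDir = Paths.get("/tmp/split_output");       // hypothetical folder written by the main job
        try (DirectoryStream<Path> files = Files.newDirectoryStream(outDir, "*.csv")) {
            for (Path file : files) {
                long rows;
                try (Stream<String> lines = Files.lines(file)) {
                    rows = lines.count();                   // subtract 1 here if each file carries a header row
                }
                // Two output columns: file name and its row count.
                System.out.println(file.getFileName() + ";" + rows);
            }
        }
    }
}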

Apache Nifi/Cassandra - how to load CSV into Cassandra table

Submitted by 痞子三分冷 on 2019-12-10 09:33:46
Question: I have various CSV files arriving several times per day, storing time-series data from sensors that are part of sensor stations. Each CSV is named after the sensor station and sensor id it comes from, for instance "station1_sensor2.csv". At the moment, the data is stored like this:
> cat station1_sensor2.csv
2016-05-04 03:02:01.001000+0000;0;
2016-05-04 03:02:01.002000+0000;0.1234;
2016-05-04 03:02:01.003000+0000;0.2345;
I have created a Cassandra table to store them and to be
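Whatever tool performs the load, the per-file work reduces to two parsing steps: pull the station and sensor ids out of the file name, and split each "timestamp;value;" line into fields. A small standalone sketch using the example file and line from above:

public class SensorCsvParser {
    public static void main(String[] args) {
        String fileName = "station1_sensor2.csv";
        String[] ids = fileName.replace(".csv", "").split("_");
        String station = ids[0];                          // "station1"
        String sensor = ids[1];                           // "sensor2"

        String line = "2016-05-04 03:02:01.002000+0000;0.1234;";
        String[] fields = line.split(";");
        String timestamp = fields[0];
        double value = Double.parseDouble(fields[1]);

        // These four values are what a per-row insert into the Cassandra table would need.
        System.out.printf("station=%s sensor=%s ts=%s value=%s%n", station, sensor, timestamp, value);
    }
}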

Designing a component both producer and consumer in Kafka

Submitted by 有些话、适合烂在心里 on 2019-12-07 18:19:45
Question: I am using Kafka and Zookeeper as the main components of my data pipeline, which processes thousands of requests each second. I am using Samza as the real-time data processing tool for the small transformations I need to make on the data. My problem is that one of my consumers (let's say ConsumerA) consumes several topics from Kafka and processes them, basically creating a summary of the topics it digests. I further want to push this data to Kafka as a separate topic, but that
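On the Kafka side there is nothing unusual about one component holding both a consumer and a producer; the loop concern only materialises if the component writes to a topic it also reads. A hedged sketch with the plain Kafka clients API (kafka-clients 2.x; topic names, group id, and the summary logic are placeholders):

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SummaryBridge {
    public static void main(String[] args) {
        Properties props = new Properties();               // shared config for brevity; split it in real code
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "consumer-a");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Arrays.asList("topicA", "topicB"));       // input topics
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String summary = summarize(record.value());          // ConsumerA's aggregation
                    // Writing to a topic that is NOT subscribed to above avoids a feedback loop.
                    producer.send(new ProducerRecord<>("summary-topic", record.key(), summary));
                }
            }
        }
    }

    private static String summarize(String value) {
        return value.length() + " bytes";                                // placeholder summary logic
    }
}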

Casting date in Talend Data Integration

Submitted by 回眸只為那壹抹淺笑 on 2019-12-06 07:21:59
In a data flow from one table to another, I would like to cast a date. The date leaves the source table as a string in this format: "2009-01-05 00:00:00:000 + 01:00". I tried to convert this to a date using a tConvertType, but apparently that is not allowed. My second option is to cast this string to a date using a formula in a tMap component. So far I have tried these formulas:
- TalendDate.formatDate("yyyy-MM-dd",row3.rafw_dz_begi);
- TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",row3.rafw_dz_begi);
- return TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",row3.rafw_dz_begi);
None of these worked
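One likely reason the formulas above do not work is direction: TalendDate.formatDate turns a Date into a String, whereas going from a String to a Date needs a parse (TalendDate.parseDate in the tMap expression). The standalone sketch below shows the parse in plain Java; the pattern and the normalisation of the " + 01:00" offset are assumptions based on the sample value.

import java.text.SimpleDateFormat;
import java.util.Date;

public class ParseSourceDate {
    public static void main(String[] args) throws Exception {
        String raw = "2009-01-05 00:00:00:000 + 01:00";
        // The offset carries extra spaces, so normalise it to "+01:00" before parsing (assumption).
        String normalized = raw.replace(" + ", "+").replace(" - ", "-");
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSSXXX"); // XXX = offset such as +01:00
        Date parsed = in.parse(normalized);
        System.out.println(parsed);
    }
}

In the tMap, the same idea would be an expression along the lines of TalendDate.parseDate("yyyy-MM-dd HH:mm:ss:SSSXXX", row3.rafw_dz_begi.replace(" + ", "+").replace(" - ", "-")), again an untested sketch rather than a confirmed answer.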

Designing a component both producer and consumer in Kafka

Submitted by 丶灬走出姿态 on 2019-12-06 01:32:47
I am using Kafka and Zookeeper as the main components of my data pipeline, which processes thousands of requests each second. I am using Samza as the real-time data processing tool for the small transformations I need to make on the data. My problem is that one of my consumers (let's say ConsumerA) consumes several topics from Kafka and processes them, basically creating a summary of the topics it digests. I further want to push this data to Kafka as a separate topic, but that forms a loop between Kafka and my component. This is what bothers me: is this a desirable architecture in Kafka?
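Since the pipeline already uses Samza, the same consume-summarise-produce shape can live in a single StreamTask: the task reads the input topics configured for the job and sends its summary to a different Kafka topic, which keeps the data flow a DAG rather than a loop. A hedged sketch using Samza's low-level task API (system and topic names, and the summary logic, are assumptions):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class SummaryTask implements StreamTask {
    // Output goes to a topic that is not listed in task.inputs, so no feedback loop forms.
    private static final SystemStream SUMMARY = new SystemStream("kafka", "consumerA-summary");

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        String summary = "len=" + message.length();        // placeholder for the real aggregation
        collector.send(new OutgoingMessageEnvelope(SUMMARY, summary));
    }
}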

Apache Nifi/Cassandra - how to load CSV into Cassandra table

Submitted by 坚强是说给别人听的谎言 on 2019-12-05 14:44:29
I have various CSV files arriving several times per day, storing time-series data from sensors that are part of sensor stations. Each CSV is named after the sensor station and sensor id it comes from, for instance "station1_sensor2.csv". At the moment, the data is stored like this:
> cat station1_sensor2.csv
2016-05-04 03:02:01.001000+0000;0;
2016-05-04 03:02:01.002000+0000;0.1234;
2016-05-04 03:02:01.003000+0000;0.2345;
I have created a Cassandra table to store them and to be able to query them for various identified tasks. The Cassandra table looks like this:
cqlsh> CREATE
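Because the CREATE statement is cut off in this excerpt, the schema below is only an assumed layout (one partition per station/sensor pair, clustered by timestamp). The sketch uses the DataStax Java driver (3.x API) to create the table and insert one parsed row; adapt the keyspace, table, and column names to the real schema.

import java.text.SimpleDateFormat;
import java.util.Date;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class LoadSensorCsvRow {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS sensors.measurements ("
                    + "station text, sensor text, ts timestamp, value double, "
                    + "PRIMARY KEY ((station, sensor), ts))");

            PreparedStatement insert = session.prepare(
                    "INSERT INTO sensors.measurements (station, sensor, ts, value) VALUES (?, ?, ?, ?)");

            // One row from station1_sensor2.csv: "2016-05-04 03:02:01.002000+0000;0.1234;"
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSZ");
            Date ts = fmt.parse("2016-05-04 03:02:01.002+0000");       // microseconds truncated to millis
            session.execute(insert.bind("station1", "sensor2", ts, 0.1234));
        }
    }
}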