Tachyon on Dataproc Master Replication Error

Submitted by 試著忘記壹切 on 2019-12-24 17:17:03

Question


I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed.

I get a replication error when writing to Tachyon from Spark. Is there a way to specify that no replication is needed?

15/10/17 08:45:21 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/tachyon/workers/1445071000001/3/8 could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3110)

The portion of the log I printed is just a warning, but a Spark error follows immediately.

I checked the Tachyon config docs, and found something that might be causing this:

tachyon.underfs.hdfs.impl   "org.apache.hadoop.hdfs.DistributedFileSystem"

Given that this is all on a Dataproc master node, with Hadoop preinstalled and HDFS working with Spark, I would think that this is a problem solvable from within Tachyon.
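For context, this is roughly where the under-filesystem is configured in a Tachyon 0.x install. The snippet below is a sketch based on the config docs the question cites; the namenode address is a hypothetical placeholder, and the exact file layout may differ between Tachyon versions:

```sh
# conf/tachyon-env.sh (Tachyon 0.x) -- a sketch, not a verified Dataproc layout.
# Point the under-filesystem at the cluster's HDFS namenode (placeholder host/port).
export TACHYON_UNDERFS_ADDRESS=hdfs://localhost:8020

# Individual properties, such as the one quoted above, can be passed as JVM options:
export TACHYON_JAVA_OPTS="
  -Dtachyon.underfs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
"
```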


Answer 1:


You can adjust the default replication by manually setting dfs.replication in /etc/hadoop/conf/hdfs-site.xml to a value other than Dataproc's default of 2. Setting it on the master alone should at least cover driver calls and hadoop fs calls, and it appears to propagate correctly into hadoop distcp calls as well. So you most likely don't need to set it on every worker, as long as the workers pick up their FileSystem configuration from job-scoped settings.
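Concretely, the change described above might look like this in /etc/hadoop/conf/hdfs-site.xml (a sketch; dfs.replication is a client-side default, so new writes should pick it up without restarting HDFS daemons):

```xml
<!-- /etc/hadoop/conf/hdfs-site.xml on the Dataproc master -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- Dataproc's default is 2; 1 means a single copy in total. -->
    <value>1</value>
  </property>
</configuration>
```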

Note that a replication factor of 1 already means a single copy of the data in total, not "one replica in addition to the main copy", so replication can't go lower than 1. The minimum replication is controlled by dfs.namenode.replication.min in the same hdfs-site.xml; it is referenced in BlockManager.java.




Answer 2:


This being a replication issue, one would naturally look at the status of worker nodes.

Turns out they were down for another reason. After fixing that, this error disappeared.
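On a live cluster, `hdfs dfsadmin -report` on the master is the usual way to see whether any datanodes are actually running (the error above reports "There are 0 datanode(s) running"). Once the cluster is healthy, you can also confirm what replication factor a config file specifies. The snippet below is self-contained for illustration, using a hypothetical sample file under /tmp; on a Dataproc master you would point the grep at /etc/hadoop/conf/hdfs-site.xml instead:

```shell
# Create a minimal sample config so this check is self-contained;
# on a real cluster, skip this step and read /etc/hadoop/conf/hdfs-site.xml.
cat > /tmp/hdfs-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Print the configured replication factor (the line after the property name).
grep -A1 '<name>dfs.replication</name>' /tmp/hdfs-site-sample.xml \
  | grep -o '<value>[0-9]*</value>'
```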

What I would like to know, and will accept as an answer, is how to change the replication factor manually.



Source: https://stackoverflow.com/questions/33192125/tachyon-on-dataproc-master-replication-error
