Question
The Problem
My code works locally, but since upgrading to TensorFlow 2.1 I no longer get any evaluation data or exports from my TensorFlow estimator when I submit online training jobs. Here's the bulk of my code:
import tensorflow as tf

def build_estimator(model_dir, config):
    # Binary linear classifier trained with the FTRL optimizer.
    return tf.estimator.LinearClassifier(
        feature_columns=feature_columns,
        n_classes=2,
        optimizer=tf.keras.optimizers.Ftrl(
            learning_rate=args.learning_rate,
            l1_regularization_strength=args.l1_strength
        ),
        model_dir=model_dir,
        config=config
    )

run_config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                    save_summary_steps=100)
...
estimator = build_estimator(model_dir=args.job_dir, config=run_config)
...
def serving_input_fn():
    # Each feature arrives as a "||"-delimited string.
    inputs = {
        'feature1': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature2': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature3': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        ...
    }
    # Split every raw string on "||" and convert the result to a sparse tensor.
    split_features = {}
    for feature in inputs:
        split_features[feature] = tf.strings.split(inputs[feature], "||").to_sparse()
    return tf.estimator.export.ServingInputReceiver(features=split_features, receiver_tensors=inputs)

# The exporter runs as part of evaluation and writes to <job-dir>/export/predict.
exporter_cls = tf.estimator.LatestExporter('predict', serving_input_fn)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_eval_fn(args.test_dir),
    exporters=[exporter_cls],
    start_delay_secs=10,
    throttle_secs=0)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
If I run this with the local gcloud command it works fine, and I get my /eval and /export folders:
gcloud ai-platform local train \
    --package-path trainer \
    --module-name trainer.task \
    -- \
    --train-dir $TRAIN_DATA \
    --test-dir $TEST_DATA \
    --training-steps $TRAINING_STEPS \
    --job-dir $OUTPUT
But when I run it in the cloud, I do not get my /eval or /export folders. This only started happening after upgrading to 2.1; previously everything worked fine in 1.14.
gcloud ai-platform jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --staging-bucket gs://$STAGING_BUCKET_NAME \
    --runtime-version 2.1 \
    --python-version 3.7 \
    --package-path trainer/ \
    --module-name trainer.task \
    --region $REGION \
    --config config.yaml \
    -- \
    --train-dir $TRAIN_DATA \
    --test-dir $TEST_DATA \
What I've tried
Instead of relying on the EvalSpec to export my model, I also tried calling export_saved_model on the estimator directly. While this works both locally and online, I'd like to continue using the EvalSpec with train_and_evaluate if possible, because I can pass in different exporters such as BestExporter, LatestExporter, etc.
My main question is...
Am I incorrectly exporting my model in TensorFlow 2.1, or is this a bug that is happening on the platform with the new version?
Answer 1:
Found the answer. Based on the documentation about the TF_CONFIG environment variable:
master is a deprecated task type in TensorFlow. master represented a task that performed a similar role as chief but also acted as an evaluator in some configurations. TensorFlow 2 does not support TF_CONFIG environment variables that contain a master task.
Previously we were using TF 1.x, where the master task could also act as the evaluator. For TF 2.x training jobs, master is deprecated and the primary task is chief, which does not act as an evaluator by default. Because the exporters in an EvalSpec only run as part of evaluation, no evaluator means no /eval and no /export output. To get evaluation data, we needed to update our config.yaml to explicitly allocate an evaluator.
https://cloud.google.com/ai-platform/training/docs/distributed-training-details#tf-config-format
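To see which role each replica has, it can help to log the contents of TF_CONFIG at the start of the trainer. Below is a minimal sketch using only the standard library. The helper name describe_tf_config is hypothetical (it is not part of TensorFlow or the gcloud tooling), and it assumes TF_CONFIG follows the usual JSON layout with "cluster" and "task" keys.

```python
import json
import os


def describe_tf_config(raw=None):
    """Summarize TF_CONFIG for the current replica (hypothetical helper)."""
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "")
    if not raw:
        # No TF_CONFIG set: treat this as a single-process run.
        return {"task_type": None, "task_index": None,
                "cluster_roles": [], "has_master": False}

    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    roles = sorted(cluster)
    task_type = task.get("type")
    return {
        "task_type": task_type,        # e.g. "chief", "worker", "evaluator"
        "task_index": task.get("index"),
        "cluster_roles": roles,        # task types listed under "cluster"
        # A "master" entry is the deprecated TF 1.x style task type.
        "has_master": "master" in roles or task_type == "master",
    }


if __name__ == "__main__":
    print(describe_tf_config())
```

If no replica ever reports a task_type of "evaluator", that is consistent with the missing /eval and /export folders described in the question.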
We updated our config.yaml with evaluatorType and evaluatorCount:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 1
  evaluatorType: standard_gpu
  evaluatorCount: 1
and it worked!
Source: https://stackoverflow.com/questions/62337037/ai-platform-no-eval-folder-or-export-folder-in-outputs-when-running-tensorflow