ai-platform: No eval folder or export folder in outputs when running TensorFlow 2.1 training job using Estimators

Question

The Problem

My code works locally, but since upgrading to TensorFlow 2.1 I no longer get any evaluation data or exports from my TensorFlow estimator when I submit training jobs to AI Platform. Here's the bulk of my code:

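import tensorflow as tf

def build_estimator(model_dir, config):
    # feature_columns and args are defined elsewhere in the trainer module (not shown).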
    return tf.estimator.LinearClassifier(
        feature_columns=feature_columns,
        n_classes=2,
        optimizer=tf.keras.optimizers.Ftrl(
            learning_rate=args.learning_rate,
            l1_regularization_strength=args.l1_strength
        ),
        model_dir=model_dir,
        config=config
    )

run_config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                    save_summary_steps=100)  
...

estimator = build_estimator(model_dir=args.job_dir, config=run_config)

...

def serving_input_fn():
    inputs = {
        'feature1': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature2': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        'feature3': tf.compat.v1.placeholder(shape=None, dtype=tf.string),
        ...
    }

    split_features = {}

    for feature in inputs:
        split_features[feature] = tf.strings.split(inputs[feature], "||").to_sparse()

    return tf.estimator.export.ServingInputReceiver(features=split_features, receiver_tensors=inputs)

exporter_cls = tf.estimator.LatestExporter('predict', serving_input_fn)

eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_eval_fn(args.test_dir),
    exporters=[exporter_cls],
    start_delay_secs=10,
    throttle_secs=0)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

If I run this with the local gcloud command it works fine, and I get my /eval and /export folders:

gcloud ai-platform local train \
--package-path trainer \
--module-name trainer.task \
-- \
--train-dir $TRAIN_DATA \
--test-dir $TEST_DATA \
--training-steps $TRAINING_STEPS \
--job-dir $OUTPUT

But when I run it in the cloud, I do not get my /eval and /export folders. This only started happening after upgrading to 2.1; previously everything worked fine in 1.14.

gcloud ai-platform jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--staging-bucket gs://$STAGING_BUCKET_NAME \
--runtime-version 2.1 \
--python-version 3.7 \
--package-path trainer/ \
--module-name trainer.task \
--region $REGION \
--config config.yaml \
-- \
--train-dir $TRAIN_DATA \
--test-dir $TEST_DATA

What I've tried

Instead of relying on the EvalSpec to export my model, I also tried calling estimator.export_saved_model directly. While this works both locally and online, I'd like to continue using the EvalSpec with train_and_evaluate if possible, because I can pass in different exporters such as BestExporter, LatestExporter, etc.

My main question is...

Am I incorrectly exporting my model in TensorFlow 2.1, or is this a bug that is happening on the platform with the new version?

Answer 1:

Found the answer...

According to the AI Platform documentation on the TF_CONFIG environment variable:

master is a deprecated task type in TensorFlow. master represented a task that performed a similar role as chief but also acted as an evaluator in some configurations. TensorFlow 2 does not support TF_CONFIG environment variables that contain a master task.

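Previously we were on TF 1.x, where the first replica ran as master and could also act as the evaluator. For TF 2.x training jobs, master is deprecated and that replica runs as chief instead, but chief does not act as an evaluator by default. Since the EvalSpec exporters run as part of evaluation, having no evaluator meant no /eval folder and no /export folder either. To get evaluation data, we needed to update our config.yaml to explicitly allocate an evaluator.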

https://cloud.google.com/ai-platform/training/docs/distributed-training-details#tf-config-format
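To confirm which role each replica was actually given, one option is to log the TF_CONFIG contents at the start of the trainer. The following is a minimal sketch, not part of the original code: describe_tf_config is a hypothetical helper name, and it only assumes that TF_CONFIG is a JSON string with cluster and task keys, as described in the documentation linked above.

```python
import json
import os


def describe_tf_config(environ=None):
    """Summarize the TF_CONFIG seen by the current process.

    describe_tf_config is a hypothetical helper, not a TensorFlow API.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("TF_CONFIG")
    if not raw:
        # No TF_CONFIG set: typically a local, single-process run.
        return {"task_type": None, "task_index": None,
                "cluster_roles": [], "uses_master": False}

    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "task_type": task.get("type"),
        "task_index": task.get("index"),
        "cluster_roles": sorted(cluster.keys()),
        # A "master" entry is the deprecated TF 1.x style role.
        "uses_master": "master" in cluster,
    }


if __name__ == "__main__":
    print(describe_tf_config())
```

Calling this near the top of trainer.task and printing the result would show, in each replica's logs, whether that replica is running as chief, worker, or evaluator. If no replica ever reports evaluator as its task type, that would be consistent with the missing /eval and /export folders.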

We updated our config.yaml with evaluatorType and evaluatorCount:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 1
  evaluatorType: standard_gpu
  evaluatorCount: 1

and it worked!!!
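As a guard against submitting a job without an evaluator again, a small pre-submit check could look like the sketch below. It assumes the config.yaml contents have already been loaded into a dict (for example with a YAML parser); evaluator_is_allocated is a hypothetical helper name, not an AI Platform API.

```python
def evaluator_is_allocated(config):
    """Return True if trainingInput requests at least one evaluator replica."""
    training_input = config.get("trainingInput", {})
    evaluator_type = training_input.get("evaluatorType")
    # evaluatorCount may be parsed as an int or a string depending on the loader.
    evaluator_count = int(training_input.get("evaluatorCount") or 0)
    return bool(evaluator_type) and evaluator_count >= 1


# Mirrors the config.yaml above, written as a plain dict.
config = {
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "masterType": "standard_gpu",
        "workerType": "standard_gpu",
        "workerCount": 1,
        "evaluatorType": "standard_gpu",
        "evaluatorCount": 1,
    }
}

if __name__ == "__main__":
    print(evaluator_is_allocated(config))
```

A check like this only validates the config file; it does not verify that the platform actually started the evaluator replica.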

Source: https://stackoverflow.com/questions/62337037/ai-platform-no-eval-folder-or-export-folder-in-outputs-when-running-tensorflow

Tags

tensorflow

google-cloud-ml