Tensorflow not predicting accurate enough results

问题

I have some fundamental questions about the algorithms I picked in my Tensorflow project. I fed in around 1 million sets of training data and still couldn't get the accurate enough prediction results.

The code I am using is based on an old Tensorflow example (https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/examples/tutorials/estimators/abalone.py). The goal of this example is to predict the age of an abalone based on the training features provided.

My purpose is very similar. The only difference is that I have more labels(6) compared to my features(4). Since the predictions after training are way off, the concern of the feasibility of this project is starting to raise some concerns.

I am pretty new to Machine Learning and Tensorflow so I am not very sure if I have picked the proper methods for this project. I'd like to know if there are some ways to improve my current code to possibly improve the accuracy of the predictions, like more layers, different optimization methods, etc.

Here is the code:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import numpy as np

import pandas as pd
import tensorflow as tf

LEARNING_RATE = 0.001


def model_fn(features, labels, mode, params):
  """Model function for Estimator."""

  first_hidden_layer = tf.layers.dense(features["x"], 10, activation=tf.nn.relu)

  # Connect the second hidden layer to first hidden layer with relu
  second_hidden_layer = tf.layers.dense(
      first_hidden_layer, 10, activation=tf.nn.relu)

  # Connect the output layer to second hidden layer (no activation fn)
  output_layer = tf.layers.dense(second_hidden_layer, 6)

  # Reshape output layer to 1-dim Tensor to return predictions
  predictions = tf.reshape(output_layer, [-1,6])

  # Provide an estimator spec for `ModeKeys.PREDICT`.
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions={"ages": predictions})

  # Calculate loss using mean squared error
  loss = tf.losses.mean_squared_error(labels, predictions)

  optimizer = tf.train.GradientDescentOptimizer(
      learning_rate=params["learning_rate"])
  train_op = optimizer.minimize(
      loss=loss, global_step=tf.train.get_global_step())

  # Calculate root mean squared error as additional eval metric
  eval_metric_ops = {
      "rmse": tf.metrics.root_mean_squared_error(
          tf.cast(labels, tf.float64), predictions)
  }

  # Provide an estimator spec for `ModeKeys.EVAL` and `ModeKeys.TRAIN` modes.
  return tf.estimator.EstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=train_op,
      eval_metric_ops=eval_metric_ops)


def main(unused_argv):

  train_file = "training_data_mc1000.csv"
  test_file = "test_data_mc1000.csv"

  train_features_interim = pd.read_csv(train_file, usecols=['vgs', 'vbs', 'vds', 'current'])
  train_features_numpy = np.asarray(train_features_interim, dtype=np.float64)

  train_labels_interim = pd.read_csv(train_file, usecols=['plo_tox', 'plo_dxl', 'plo_dxw', 'parl1', 'parl2', 'random_fn'])
  train_labels_numpy = np.asarray(train_labels_interim, dtype=np.float64)

  # Set model params
  model_params = {"learning_rate": LEARNING_RATE}

  # Instantiate Estimator
  nn = tf.estimator.Estimator(model_fn=model_fn, params=model_params)

  train_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": train_features_numpy},
      y=train_labels_numpy,
      num_epochs=None,
      shuffle=True)

  # Train
  nn.train(input_fn=train_input_fn, max_steps=1048576)

  test_features_interim = pd.read_csv(test_file, usecols=['vgs', 'vbs', 'vds', 'current'])
  test_features_numpy = np.asarray(test_features_interim, dtype=np.float64)

  test_labels_interim = pd.read_csv(test_file, usecols=['plo_tox', 'plo_dxl', 'plo_dxw', 'parl1', 'parl2', 'random_fn'])
  test_labels_numpy = np.asarray(test_labels_interim, dtype=np.float64)

  # Score accuracy
  test_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": test_features_numpy},
      y=test_labels_numpy,
      num_epochs=1,
      shuffle=False)

  ev = nn.evaluate(input_fn=test_input_fn)
  print("Loss: %s" % ev["loss"])
  print("Root Mean Squared Error: %s" % ev["rmse"])

  prediction_file = "Tensorflow_prediction_data.csv"

  predict_features_interim = pd.read_csv(prediction_file, usecols=['vgs', 'vbs', 'vds', 'current'])
  predict_features_numpy = np.asarray(predict_features_interim, dtype=np.float64)

  # Print out predictions
  predict_input_fn = tf.estimator.inputs.numpy_input_fn(
      x= {"x": predict_features_numpy},
      num_epochs=1,
      shuffle=False)
  predictions = nn.predict(input_fn=predict_input_fn)
  for i, p in enumerate(predictions):
    print("Prediction %s: %s" % (i + 1, p["ages"]))


if __name__ == '__main__':
  tf.logging.set_verbosity(tf.logging.INFO)
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  parser.add_argument(
      "--train_data", type=str, default="", help="Path to the training data.")
  parser.add_argument(
      "--test_data", type=str, default="", help="Path to the test data.")
  parser.add_argument(
      "--predict_data",
      type=str,
      default="",
      help="Path to the prediction data.")
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

And portion of the training and testing data looks like

The last four columns are the features and the first six columns are the labels. Again, you can see that I am having more labels than features. My goal is to train a model that when I feed in new sets of features, it can predict accurate enough labels.

The following part is added for clarification of my data sets. Thanks for the first ones that commented my question that reminds me to put this up as well.

The relation between my features and labels are : every 30(vgs)X10(vbs)X10(vds) corresponds to 1 set of labels. Basically it is like a 3-D array with the first three features acting like coordinates and the last feature(current) as the value that is stored within each cells. That's why the labels from the portions I showed are all the same.

Another question now is that I am expecting the loss should be getting smaller and smaller as the training progresses, but it is not. I think this should be another reason why the output is not accurate cuz the minimizing loss part isn't working. I don't really know why, though.

Thanks for taking time looking at this and I'd love to have a discussion down below.

回答1:

From what I can see in your code, you are not normalizing your features. Try normalizing them for example to have mean zero and std=1. Since your features are in a completely different range, this normalization might help.

It would also be helpful to see other labels. The ones in the provided picture are all the same.

来源：https://stackoverflow.com/questions/50380844/tensorflow-not-predicting-accurate-enough-results

标签

python

tensorflow

machine-learning

tensorflow-datasets