
Dataset preparation for training

Since we do not have a separate test set, I would like to hold out some randomly selected samples for testing. One more thing to note is that the features and labels come in two separate files. Therefore, we will perform the necessary preprocessing and then join them, so that our preprocessed data has the features and labels together.

The remaining samples will then be used for training. Finally, we will save the training and test sets as separate CSV files to be used later on. First, let's load the samples and look at some statistics. We use Spark's read() method, but specify the necessary options and format too:

Dataset<Row> data = spark.read()
        .option("maxColumns", 25000) // the file is very wide; raise the CSV parser's column limit
        .format("com.databricks.spark.csv")
        .option("header", "true") // use the first line of the file as the header
        .option("inferSchema", "true") // automatically infer data types
        .load("TCGA-PANCAN-HiSeq-801x20531/data.csv"); // set your path accordingly
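As a side note, com.databricks.spark.csv comes from the external spark-csv package, which was required on Spark 1.x. On Spark 2.x and later, CSV support is built in, so the same read can also be written with the native source; here is a minimal equivalent sketch under that assumption:

// equivalent read on Spark 2.x+, where the CSV source is built in
Dataset<Row> data = spark.read()
        .option("maxColumns", 25000)
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("TCGA-PANCAN-HiSeq-801x20531/data.csv");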

Then we print some related statistics, such as the number of features and the number of samples:

int numFeatures = data.columns().length;
long numSamples = data.count();
System.out.println("Number of features: " + numFeatures);
System.out.println("Number of samples: " + numSamples);
>>>
Number of features: 20532
Number of samples: 801

Therefore, there are 801 samples from 801 distinct patients, and the dataset is very high-dimensional, with 20,532 columns. In addition, as we saw in Figure 2, the id column represents only the patient's anonymized ID, so it has no predictive value. However, since the features and labels come in two separate files, we will keep it for now as the key for joining them, and drop it right after the join:

Dataset<Row> numericDF = data; // 20,531 gene-expression features, plus the id column kept as the join key
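As a quick sanity check, we can confirm the column count:

// expect 20532 columns: 20531 gene-expression features plus id
System.out.println("Columns including id: " + numericDF.columns().length);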

Then we load the labels using Spark's read() method, again specifying the necessary options and format:

Dataset<Row> labels = spark.read()
        .format("com.databricks.spark.csv")
        .option("header", "true") // use the first line of the file as the header
        .option("inferSchema", "true") // automatically infer data types
        .load("TCGA-PANCAN-HiSeq-801x20531/labels.csv");
labels.show(10);

We have already seen what the labels DataFrame looks like: it holds the id column and the Class column. The Class column is categorical (one of the five tumor types: BRCA, KIRC, COAD, LUAD, and PRAD), and, as mentioned earlier, DL4J does not support predicting categorical labels. Therefore, we have to convert it to a numeric (integer, to be more specific) format; for that, I would use StringIndexer() from Spark. We also hold on to the id column, because we will need it in a moment to join the labels with the features.

First, we create a StringIndexer() and apply the indexing operation to the Class column, naming the output column label. Additionally, we skip null/invalid entries:

StringIndexer indexer = new StringIndexer()
        .setInputCol("Class")
        .setOutputCol("label")
        .setHandleInvalid("skip"); // skip null/invalid values

Then we perform the indexing operation by calling the fit() and transform() operations as follows:

// the col() function requires: import static org.apache.spark.sql.functions.col;
Dataset<Row> indexedDF = indexer.fit(labels)
        .transform(labels)
        .select(col("id"), col("label").cast(DataTypes.IntegerType)); // cast the label to integer; keep id as the join key

Now let's take a look at the indexed DataFrame (note that StringIndexer assigns indices by label frequency, with the most frequent class getting index 0):

indexedDF.show();
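If you later need to map the numeric indices back to the original tumor-type names (for example, when reporting predictions), Spark's IndexToString transformer can invert the mapping. Here is a minimal sketch:

// IndexToString and StringIndexerModel live in org.apache.spark.ml.feature
StringIndexerModel indexerModel = indexer.fit(labels);
IndexToString converter = new IndexToString()
        .setInputCol("label")
        .setOutputCol("ClassName")
        .setLabels(indexerModel.labels());
converter.transform(indexedDF).show(5); // shows the original class name next to each index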

Fantastic! Now all of our columns (features and labels alike) are numeric, so we can join the features and labels into a single DataFrame. For that, we can use Spark's join() method; note that a join needs a key (a key-less join would produce a Cartesian product), which is exactly why we kept the id column around. We join on id and drop it immediately afterwards:

Dataset<Row> combinedDF = numericDF.join(indexedDF, "id").drop("id"); // inner join on id, then drop the key
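To make sure the join neither dropped nor duplicated any rows, we can run a quick check:

System.out.println("Rows after join: " + combinedDF.count()); // expect 801
System.out.println("Columns after join: " + combinedDF.columns().length); // expect 20532: 20531 features + label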

Now we can generate both the training and test sets by randomly splitting combinedDF, as follows:

Dataset<Row>[] splits = combinedDF.randomSplit(new double[] {0.7, 0.3}); // 70% for training, 30% for testing
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
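Note that randomSplit() draws a fresh random split on every run. If you need a reproducible split, Spark provides an overload that also takes a seed; a minimal sketch (the seed value is arbitrary):

// reproducible 70/30 split; 12345L is an arbitrary seed
Dataset<Row>[] seededSplits = combinedDF.randomSplit(new double[] {0.7, 0.3}, 12345L);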

Now let's see the count of samples in each set:

System.out.println(trainingData.count());// number of samples in training set
System.out.println(testData.count());// number of samples in test set
>>>
561
240

Thus, our training set has 561 samples and the test set has 240 samples (since the split is random, these counts will vary slightly from run to run unless a seed is fixed). Finally, we save these two sets as separate CSV files to be used later on:

trainingData.coalesce(1).write()
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("delimiter", ",")
        .save("data/TCGA_train.csv"); // writes a directory of this name containing a single part file

testData.coalesce(1).write()
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("delimiter", ",")
        .save("data/TCGA_test.csv"); // coalesce(1) ensures one part file per set
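One caveat: save() does not produce a single CSV file; each path above becomes a directory containing a Spark-generated part file. When we later feed the data to DL4J, we can point DataVec's CSVRecordReader at that part file. The following is a minimal sketch of how the training set could be loaded (the part-file name below is a placeholder, as Spark generates the actual name):

import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class TCGATrainLoader {
    public static void main(String[] args) throws Exception {
        int labelIndex = 20531; // the label is the last column of our saved CSV
        int numClasses = 5;     // five tumor types in this dataset
        int batchSize = 128;

        // placeholder path: Spark generates the actual part-file name
        File trainFile = new File("data/TCGA_train.csv/part-00000.csv");

        RecordReader reader = new CSVRecordReader(); // defaults: skip no lines, comma delimiter
        reader.initialize(new FileSplit(trainFile));

        DataSetIterator trainIter =
                new RecordReaderDataSetIterator(reader, batchSize, labelIndex, numClasses);
        System.out.println(trainIter.next()); // inspect the first mini-batch
    }
}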

Now that we have the training and test sets, we can train the network with the training set and evaluate the model with the test set. Considering the high dimensionality, I would rather try a more capable network such as an LSTM, which is an improved variant of an RNN. At this point, some contextual information about LSTMs will be helpful for grasping the idea.