Data Format for Training, Validation & Test Data for Sequence Chunking

I’m building a sequence chunking deep learning program with TensorFlow, and I’m currently stuck on the data format.

Before I talk about the datasets, let me talk about the program first. The program I’m using is based on “Neural Models for Sequence Chunking” by Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. The model I want to use is model 2 (Encoder-Decoder Framework).

The deep learning program works like this:

  1. The program receives text as input
  2. The program segments the text into phrases (chunks)
  3. The program labels each phrase

So if the input is “But it could be much worse”, the output is like this:

But - O

it - NP

could be - VP

much worse - ADJP
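
To make the target concrete, here is a minimal sketch (not taken from the paper) that holds the output above as a list of (phrase, label) pairs and expands it into per-token IOB-style tags, a common convention for chunking data. The function name `chunks_to_iob` and the choice to leave `O` without a B-/I- prefix are my own assumptions.

```python
from typing import List, Tuple

# One chunk is a (phrase, label) pair, e.g. ("could be", "VP").
Chunk = Tuple[str, str]


def chunks_to_iob(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Expand labeled phrases into one (token, tag) pair per word.

    The first word of a chunk gets "B-<label>", later words get
    "I-<label>". Words labeled "O" keep the plain tag "O".
    """
    tagged = []
    for phrase, label in chunks:
        for i, token in enumerate(phrase.split()):
            if label == "O":
                tag = "O"
            else:
                tag = ("B-" if i == 0 else "I-") + label
            tagged.append((token, tag))
    return tagged


example = [("But", "O"), ("it", "NP"), ("could be", "VP"), ("much worse", "ADJP")]
for token, tag in chunks_to_iob(example):
    print(token, tag)
```

The per-token view keeps the chunk boundaries (a new `B-` tag starts a new phrase), so the phrase-level output can be rebuilt from it.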

Simple, yes?

Now on to the question. I can’t decide what data format to use for this program. The data has to include the sentences, but it also has to include the correct output for each one.

What data format should I use so that the program can read it?

These are the ideas I have so far, but I’m not sure about either of them.

Chunk       | Phrase
But         | O
it          | NP
could be    | VP
much worse  | ADJP

Sentence                    | Output
But it could be much worse  | O NP VP ADJP
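
For comparison, here is a rough sketch of how each of the two layouts above (one row per chunk, one row per sentence) could be read back in. It assumes the columns are separated by a tab character and that labels are separated by spaces; those delimiters and the function names are hypothetical choices of mine, not part of the original data.

```python
from typing import List, Tuple


def parse_chunk_rows(text: str) -> List[Tuple[str, str]]:
    """Row-per-chunk layout: each non-empty line is 'phrase<TAB>label'."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        phrase, label = line.split("\t")
        rows.append((phrase, label))
    return rows


def parse_sentence_row(line: str) -> Tuple[List[str], List[str]]:
    """Row-per-sentence layout: 'sentence<TAB>labels', both space-separated."""
    sentence, labels = line.split("\t")
    return sentence.split(), labels.split()


# The same example in both layouts (tab-separated by assumption).
per_chunk = "But\tO\nit\tNP\ncould be\tVP\nmuch worse\tADJP\n"
per_sentence = "But it could be much worse\tO NP VP ADJP"

chunks = parse_chunk_rows(per_chunk)
tokens, labels = parse_sentence_row(per_sentence)

# Joining the phrases of the per-chunk layout rebuilds the sentence.
assert " ".join(phrase for phrase, _ in chunks) == " ".join(tokens)

# The per-sentence layout has more words than labels.
print(len(tokens), "tokens,", len(labels), "labels")
```

In this example the row-per-sentence layout has six words but only four labels, so on its own it does not say which words belong to which chunk. The row-per-chunk layout keeps that boundary information, at the cost of needing some marker (for example a blank line) between sentences.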