I'm working my way into the chatbot topic right now. I have already done some projects with Rasa NLU and ChatterBot.
Now I want to take the next step and create a chatbot with word2vec or seq2vec, building my own corpus and training it with either a Reddit or a Wikipedia corpus.
Unfortunately, I can't find good readings or tutorials on the internet. My goal is to create my own corpus (an FAQ corpus and general information about my university).
Does anyone have some good readings on this topic? And more importantly, what's the best way to build the corpus?
Can I simply put all my answers in a CSV?
Do I need a question (column A) and answer (column B) layout in the CSV?
Can I put all the information as continuous text in a text file?
Is it better to do it the same way as with Rasa NLU, with integer and then possible answers?
Thanks a lot for all your answers.
For an FAQ bot you can start with a bag of words instead of word2vec. word2vec is better suited to finding similar words or doing vector arithmetic such as king - man + woman ≈ queen.
You can use CSV, TSV, or JSON; it's up to you how you read it. Use the answers as classes (labels) and the words in the questions as a bag of words. For every question, build a one-hot encoded vector over the vocabulary and train a simple DNN on it. That will work. For better results, preprocess your input with stemming or lemmatization.
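To make that concrete, here is a rough sketch using only the Python standard library. The FAQ rows, the column names `question` and `answer`, and all function names are made up for illustration. To keep it dependency-free, a single-layer softmax classifier stands in for the "simple DNN"; a real setup would more likely use a framework such as Keras or PyTorch with one or two hidden layers, and a stemmer or lemmatizer inside `tokenize`.

```python
import csv
import io
import math

# Hypothetical FAQ data: question in column A, answer in column B.
FAQ_CSV = """question,answer
When does the library open?,The library opens at 8 am.
What are the library opening hours?,The library opens at 8 am.
How do I enroll in a course?,Enroll through the student portal.
Where can I register for courses?,Enroll through the student portal.
"""


def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    # A stemmer or lemmatizer could be applied to each token here.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def load_faq(csv_text):
    # Read the rows, then derive the label set (answers) and the vocabulary.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    answers = sorted({r["answer"] for r in rows})
    vocab = sorted({w for r in rows for w in tokenize(r["question"])})
    return rows, answers, vocab


def encode(text, vocab):
    # Binary bag-of-words vector: 1.0 if the vocabulary word occurs in the text.
    words = set(tokenize(text))
    return [1.0 if w in words else 0.0 for w in vocab]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(rows, answers, vocab, epochs=200, lr=0.5):
    # One weight row per answer class, trained with plain gradient descent
    # on the cross-entropy loss (no hidden layer, no bias term).
    weights = [[0.0] * len(vocab) for _ in answers]
    for _ in range(epochs):
        for r in rows:
            x = encode(r["question"], vocab)
            target = answers.index(r["answer"])
            probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
            for k, row in enumerate(weights):
                grad = probs[k] - (1.0 if k == target else 0.0)
                for j, xi in enumerate(x):
                    row[j] -= lr * grad * xi
    return weights


def answer(question, weights, answers, vocab):
    # Pick the answer class with the highest score for this question.
    x = encode(question, vocab)
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return answers[scores.index(max(scores))]


rows, answers, vocab = load_faq(FAQ_CSV)
weights = train(rows, answers, vocab)
print(answer("When is the library open?", weights, answers, vocab))
```

Several differently worded questions can point to the same answer, which is what lets the classifier generalize a little. Words outside the vocabulary are simply ignored, so a question with no known words gets an arbitrary answer; you would probably want a confidence threshold and a fallback reply for that case.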
Btw: word2vec itself is trained with either continuous bag of words (CBOW) or skip-gram, so if you really want to use word2vec you can use the gensim library. But for an FAQ I would go the DNN + bag-of-words way.
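In case the difference between the two word2vec training modes is unclear, this small sketch shows which training examples each one builds from a sentence. The function names and the sample sentence are made up, and it is a simplification: real implementations also subsample frequent words and vary the window size. In gensim, the mode is chosen with the `sg` parameter of `Word2Vec` (0 for CBOW, 1 for skip-gram).

```python
def skipgram_pairs(tokens, window=1):
    # Skip-gram: each center word is used to predict every word in its window,
    # giving one (center, context_word) pair per neighbour.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    # CBOW: the whole window (as a bag of words) is used to predict the center
    # word, giving one (context_words, center) pair per position.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = ["where", "is", "the", "library"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

Either way the model learns word vectors from neighbouring words in running text, which is why word2vec wants a large continuous-text corpus (Wikipedia, Reddit) rather than a question/answer table.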