Dataset preparation

#3
by goldendase - opened

This is an excellent model. I really like what it's done to the prose. Could you share your approach to preparing a dataset of exclusively narrative works? Did you annotate the works with writing style hints or turn them into conversations or something else? I've tried a similar approach in the past but have had mixed results, at best.

I did nothing but train the model on plain, unformatted text. Specifically: tokenize each work as one chunk, ensure BOS is at the beginning and EOS is at the end, concatenate them all together, and then slice into non-overlapping 8k sequences. Those 8k token sequences are then the training examples. I didn't even do anything to try to make the break points at any natural boundary. In fact, after this slicing most sequences don't even start with BOS any more. And some sequences straddle the boundary between 2 works. This might not be the ideal way to do it, but it seemed to work for me.
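For anyone who wants to reproduce this, here is a minimal sketch of that chunking scheme, assuming a Hugging Face tokenizer. The model name, file paths, and handling of the final partial slice are illustrative, not taken from the actual training code.

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; substitute whatever base model you are finetuning.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
SEQ_LEN = 8192

all_ids = []
for path in ["work1.txt", "work2.txt"]:  # each file is one complete work
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Tokenize the whole work as a single chunk, with BOS at the start and
    # EOS at the end, then append it to one long token stream.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    all_ids.extend([tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id])

# Slice the concatenated stream into non-overlapping SEQ_LEN training examples,
# ignoring work boundaries; the trailing partial slice is simply dropped here.
examples = [
    all_ids[i : i + SEQ_LEN]
    for i in range(0, len(all_ids) - SEQ_LEN + 1, SEQ_LEN)
]
```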

@tdrussell
Haa... that makes sense. 70b+ models already have a lot of data 'inside', so it looks like it's enough to just guide them with fairly simple but good training data.
I've tried several Llama 3 70b finetunes for creative writing / interactive fiction lately, and they all feel the same. This one isn't without problems either, but it's the closest thing to Midnight-Miku 70b 1.5 in prose quality and creativity so far. And because this finetune isn't overtrained on toxic stuff, it can still think well. For example, (OOC: #something) works very well here.

So, great job! And I think it would be great to merge it with some RP-heavy finetune, so it can handle more than just a novel format.
