Quickstart

Step 1: Preprocess the data

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

We will be working with some example data in the data/ folder.

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

Validation files are required and are used to evaluate the convergence of the training. They usually contain no more than 5000 sentences.
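If a corpus does not come with a validation split, one way to create it is to hold out a small random sample of sentence pairs. The following is a minimal Python sketch, not part of OpenNMT: the function name hold_out, the fixed seed, and the cap at half of the corpus are assumptions made for illustration.

```python
import random

def hold_out(src_lines, tgt_lines, valid_size=5000, seed=0):
    """Split aligned sentence pairs into (train, valid) lists of pairs."""
    pairs = list(zip(src_lines, tgt_lines))
    # Shuffle whole pairs so source and target stay aligned.
    random.Random(seed).shuffle(pairs)
    # Assumption for this sketch: never hold out more than half of the corpus.
    valid_size = min(valid_size, len(pairs) // 2)
    return pairs[valid_size:], pairs[:valid_size]

# Hypothetical toy corpus of 10 aligned sentence pairs.
src = ["s%d" % i for i in range(10)]
tgt = ["t%d" % i for i in range(10)]
train, valid = hold_out(src, tgt, valid_size=3)
print(len(train), len(valid))  # 7 3
```

Each list can then be written out as a pair of src/tgt text files, one sentence per line.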

$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .
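Because source and target files are aligned line by line, a quick sanity check before preprocessing can save time. The sketch below is a hypothetical helper in Python, not an OpenNMT tool: it reports a line-count mismatch and flags empty lines, which usually point to a data problem.

```python
from itertools import zip_longest

def check_parallel(src_lines, tgt_lines):
    """Return a list of (line_number, problem) for two parallel corpora."""
    problems = []
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            # One side ran out of lines before the other.
            problems.append((n, "line count mismatch"))
            break
        if not src.split() or not tgt.split():
            problems.append((n, "empty sentence"))
    return problems

# In-memory example; the target sentences here are made up.
src = ["It is not acceptable .", "Two soldiers came up to me ."]
tgt = ["hypothetical target one .", "hypothetical target two ."]
print(check_parallel(src, tgt))  # []
```

For real files, open file handles can be passed instead of lists, since the function only iterates over lines.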

After running the preprocessing, the following files are generated:

  • demo.src.dict: Dictionary of source vocabulary to index mappings.
  • demo.tgt.dict: Dictionary of target vocabulary to index mappings.
  • demo-train.t7: Serialized Torch file containing the vocabulary, training, and validation data.

The *.dict files are needed to check or reuse the vocabularies. These files are simple human-readable dictionaries.

$ head -n 11 data/demo.src.dict
<blank> 1
<unk> 2
<s> 3
</s> 4
It 5
is 6
not 7
acceptable 8
that 9
, 10
with 11

Internally the system never touches the words themselves, but uses these indices.
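To make the word-to-index mapping concrete, here is a minimal Python sketch that parses lines in the format shown in the excerpt and converts a sentence to indices. It is an illustration of the file format only, not OpenNMT's internal code; it assumes each line holds exactly a token and an index separated by one space, and that unknown words map to the <unk> entry.

```python
def load_dict(lines):
    """Parse 'token index' lines, as in the demo.src.dict excerpt, into a dict."""
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last space so the index is always the final field.
        token, index = line.rsplit(" ", 1)
        vocab[token] = int(index)
    return vocab

def to_indices(sentence, vocab, unk="<unk>"):
    """Map space-separated tokens to indices, falling back to the <unk> entry."""
    return [vocab.get(tok, vocab[unk]) for tok in sentence.split()]

# The entries shown in the excerpt above.
dict_lines = ["<blank> 1", "<unk> 2", "<s> 3", "</s> 4", "It 5", "is 6",
              "not 7", "acceptable 8", "that 9", ", 10", "with 11"]
vocab = load_dict(dict_lines)
print(to_indices("It is not acceptable that , without", vocab))
# [5, 6, 7, 8, 9, 10, 2]
```

The word "without" is not among the listed entries, so it falls back to index 2.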

Note

If the corpus is not tokenized, you can use OpenNMT's tokenizer.

Step 2: Train the model

th train.lua -data data/demo-train.t7 -save_model demo-model

The main train command is quite simple. Minimally it takes a data file and a save file. This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. You can also add -gpuid 1 to use (say) GPU 1.

Step 3: Translate

th translate.lua -model demo-model_epochX_PPL.t7 -src data/src-test.txt -output pred.txt

Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into pred.txt.
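For readers unfamiliar with beam search, the idea can be sketched in a few lines of Python. This is a generic toy version, not the implementation in translate.lua: the function beam_search, the TOY probability table, and the toy_model function are all hypothetical, and real decoders add details such as length normalization.

```python
import math

def beam_search(step_fn, start, end, beam_size=2, max_len=5):
    """Keep the beam_size best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # step_fn returns {next_token: probability} for a partial sequence.
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]

# Hypothetical next-token probabilities, keyed on the last token only.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_model(seq):
    return TOY[seq[-1]]

print(beam_search(toy_model, "<s>", "</s>", beam_size=1))
# ['<s>', 'a', 'x', '</s>']
print(beam_search(toy_model, "<s>", "</s>", beam_size=2))
# ['<s>', 'b', 'z', '</s>']
```

With a beam of 1 the search is greedy and commits to "a" (total probability 0.3), while a beam of 2 keeps "b" alive long enough to find the more probable sequence (0.36).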

Note

The predictions will not be good, because the demo dataset is very small. Try running on some larger datasets! For example, you can download millions of parallel sentences for translation or summarization.