
Why does this character encoding exception occur during training? #174

samhuang1991 opened this issue Jan 4, 2025 · 0 comments

samhuang1991 commented Jan 4, 2025

Here is my log:

(eole) PS E:\AI\NLP\EOLE> eole train -config en_zh.yaml
[2025-01-04 12:38:31,154 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-01-04 12:38:31,154 INFO] Missing transforms field for valid data, set to default: [].
[2025-01-04 12:38:31,154 INFO] Parsed 2 corpora from -data.
[2025-01-04 12:38:31,156 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-01-04 12:38:31,762 INFO] Transforms applied: []
[2025-01-04 12:38:31,770 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the\t308938\r', 'to\t163517\r', 'of\t163299\r', 'and\t146616\r', 'in\t106330\r', 'a\t102887\r']
[2025-01-04 12:38:31,770 INFO] The decoder start token is: <s>
[2025-01-04 12:38:31,770 INFO] bos_token token is: <s> id: [2]
[2025-01-04 12:38:31,772 INFO] eos_token token is: </s> id: [3]
[2025-01-04 12:38:31,772 INFO] pad_token token is: <blank> id: [1]
[2025-01-04 12:38:31,773 INFO] unk_token token is: <unk> id: [0]
[2025-01-04 12:38:31,774 INFO] Building model...
[2025-01-04 12:38:32,084 INFO] EncoderDecoderModel(
  (encoder): TransformerEncoder(
    (transformer_layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (self_attn): SelfMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (dropout): Dropout(p=0.3, inplace=False)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): MLP(
          (gate_up_proj): Linear(in_features=512, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=512, bias=False)
          (dropout_1): Dropout(p=0.3, inplace=False)
          (dropout_2): Dropout(p=0.3, inplace=False)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (transformer_layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (self_attn): SelfMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (dropout): Dropout(p=0.3, inplace=False)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): MLP(
          (gate_up_proj): Linear(in_features=512, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=512, bias=False)
          (dropout_1): Dropout(p=0.3, inplace=False)
          (dropout_2): Dropout(p=0.3, inplace=False)
        )
        (precontext_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (context_attn): ContextMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (src_emb): Embeddings(
    (embeddings): Embedding(32760, 512, padding_idx=1)
    (dropout): Dropout(p=0.3, inplace=False)
    (pe): PositionalEncoding()
  )
  (tgt_emb): Embeddings(
    (embeddings): Embedding(32768, 512, padding_idx=1)
    (dropout): Dropout(p=0.3, inplace=False)
    (pe): PositionalEncoding()
  )
  (generator): Linear(in_features=512, out_features=32768, bias=True)
)
[2025-01-04 12:38:32,085 INFO] embeddings: 33550336
[2025-01-04 12:38:32,085 INFO] encoder: 6296576
[2025-01-04 12:38:32,085 INFO] decoder: 8395776
[2025-01-04 12:38:32,086 INFO] generator: 16809984
[2025-01-04 12:38:32,086 INFO] other: 0
[2025-01-04 12:38:32,086 INFO] * number of parameters: 65052672
[2025-01-04 12:38:32,086 INFO] Trainable parameters = {'torch.float32': 65052672}
[2025-01-04 12:38:32,087 INFO] Non trainable parameters = {}
[2025-01-04 12:38:32,087 INFO]  * src vocab size = 32760
[2025-01-04 12:38:32,087 INFO]  * tgt vocab size = 32768
[2025-01-04 12:38:32,089 INFO] Starting training on GPU: [0]
[2025-01-04 12:38:32,089 INFO] Start training loop and validate every 500 steps...
[2025-01-04 12:38:32,089 INFO] Scoring with: None
[2025-01-04 12:38:34,975 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2025-01-04 12:38:37,847 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2025-01-04 12:38:38,808 INFO] Step 50/ 1000; acc: 60.0; ppl: 4215.71; xent: 8.35; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1454/ 179/62; 10824/1330 tok/s;      7 sec;
[2025-01-04 12:38:39,481 INFO] Step 100/ 1000; acc: 67.4; ppl: 66.66; xent: 4.20; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1330/ 166/62; 98699/12342 tok/s;      7 sec;
[2025-01-04 12:38:40,144 INFO] Step 150/ 1000; acc: 62.3; ppl: 84.85; xent: 4.44; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1375/ 174/63; 103715/13117 tok/s;      8 sec;
[2025-01-04 12:38:40,828 INFO] Step 200/ 1000; acc: 66.9; ppl: 50.40; xent: 3.92; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1474/ 182/62; 107841/13329 tok/s;      9 sec;
[2025-01-04 12:38:41,492 INFO] Step 250/ 1000; acc: 66.9; ppl: 23.07; xent: 3.14; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1331/ 171/63; 100208/12857 tok/s;      9 sec;
[2025-01-04 12:38:42,161 INFO] Step 300/ 1000; acc: 67.3; ppl: 19.81; xent: 2.99; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1305/ 170/63; 97668/12754 tok/s;     10 sec;
[2025-01-04 12:38:42,821 INFO] Step 350/ 1000; acc: 65.0; ppl: 21.15; xent: 3.05; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1406/ 178/62; 106431/13447 tok/s;     11 sec;
[2025-01-04 12:38:43,473 INFO] Step 400/ 1000; acc: 64.1; ppl: 22.80; xent: 3.13; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1331/ 174/63; 102137/13355 tok/s;     11 sec;
[2025-01-04 12:38:44,144 INFO] Step 450/ 1000; acc: 63.1; ppl: 17.22; xent: 2.85; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1440/ 182/62; 107283/13569 tok/s;     12 sec;
[2025-01-04 12:38:44,832 INFO] Step 500/ 1000; acc: 64.6; ppl: 30.48; xent: 3.42; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1385/ 177/62; 100644/12860 tok/s;     13 sec;
[2025-01-04 12:39:00,069 INFO] valid stats calculation
                           took: 15.235411167144775 s.
[2025-01-04 12:39:00,070 INFO] Train perplexity: 52.1136
[2025-01-04 12:39:00,070 INFO] Train accuracy: 64.7299
[2025-01-04 12:39:00,071 INFO] Sentences processed: 31232
[2025-01-04 12:39:00,071 INFO] Average bsz: 1383/ 175/62
[2025-01-04 12:39:00,071 INFO] Validation perplexity: 213.562
[2025-01-04 12:39:00,071 INFO] Validation accuracy: 71.4428
[2025-01-04 12:39:00,072 INFO] Saving optimizer and weights to step_500, and symlink to en-zh/run/model
[2025-01-04 12:39:00,281 INFO] Saving transforms artifacts, if any, to en-zh/run/model\step_500
[2025-01-04 12:39:00,282 INFO] Saving config and vocab to en-zh/run/model
Traceback (most recent call last):
  File "\\?\C:\ProgramData\anaconda3\envs\eole\Scripts\eole-script.py", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "e:\ai\nlp\eole\eole\eole\bin\main.py", line 39, in main
    bin_cls.run(args)
  File "e:\ai\nlp\eole\eole\eole\bin\run\train.py", line 70, in run
    train(config)
  File "e:\ai\nlp\eole\eole\eole\bin\run\train.py", line 57, in train
    train_process(config, device_id=0)
  File "e:\ai\nlp\eole\eole\eole\train_single.py", line 244, in main
    trainer.train(
  File "e:\ai\nlp\eole\eole\eole\trainer.py", line 375, in train
    self.model_saver.save(step, moving_average=self.moving_average)
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 359, in save
    self._save(step)
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 336, in _save
    self._save_vocab()
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 280, in _save_vocab
    json.dump(vocab_data, f, indent=2, ensure_ascii=False)
  File "C:\ProgramData\anaconda3\envs\eole\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
UnicodeEncodeError: 'gbk' codec can't encode character '\u011f' in position 12: illegal multibyte sequence
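
From the traceback, the failure happens in `json.dump(vocab_data, f, indent=2, ensure_ascii=False)` in `model_saver.py`: with `ensure_ascii=False`, the non-ASCII character U+011F ("ğ") is written to the file as-is, and on Windows a file opened without an explicit encoding falls back to the locale codec (gbk on a Chinese-locale system), which cannot represent that character. Below is a minimal sketch of the mechanism, assuming the vocab file is opened without `encoding="utf-8"`; the file name is only for illustration, not eole's actual code.

```python
# Minimal sketch of the failure mode (hypothetical file name, not eole's code).
# json.dump(..., ensure_ascii=False) emits raw non-ASCII characters, so the
# underlying file object must use an encoding that can represent them.
import json

vocab_data = {"tokens": ["the", "\u011f"]}  # U+011F (ğ) is not representable in gbk

# On a Chinese-locale Windows machine, open() without encoding= defaults to gbk;
# forcing gbk here reproduces the same error on any platform.
try:
    with open("vocab.json", "w", encoding="gbk") as f:
        json.dump(vocab_data, f, indent=2, ensure_ascii=False)
except UnicodeEncodeError as e:
    print(e)  # 'gbk' codec can't encode character '\u011f' ...

# Opening the file with an explicit UTF-8 encoding writes the same data fine.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab_data, f, indent=2, ensure_ascii=False)
```

If that is indeed the cause, setting the environment variable `PYTHONUTF8=1` before running `eole train` (Python's UTF-8 mode) should make `open()` default to UTF-8 on Windows and avoid the error without code changes.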