
Why does this character encoding exception occur during training? #174

samhuang1991 opened this issue Jan 4, 2025 · 0 comments

samhuang1991 commented Jan 4, 2025

Here is my log:

(eole) PS E:\AI\NLP\EOLE> eole train -config en_zh.yaml
[2025-01-04 12:38:31,154 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-01-04 12:38:31,154 INFO] Missing transforms field for valid data, set to default: [].
[2025-01-04 12:38:31,154 INFO] Parsed 2 corpora from -data.
[2025-01-04 12:38:31,156 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-01-04 12:38:31,762 INFO] Transforms applied: []
[2025-01-04 12:38:31,770 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the\t308938\r', 'to\t163517\r', 'of\t163299\r', 'and\t146616\r', 'in\t106330\r', 'a\t102887\r']
[2025-01-04 12:38:31,770 INFO] The decoder start token is: <s>
[2025-01-04 12:38:31,770 INFO] bos_token token is: <s> id: [2]
[2025-01-04 12:38:31,772 INFO] eos_token token is: </s> id: [3]
[2025-01-04 12:38:31,772 INFO] pad_token token is: <blank> id: [1]
[2025-01-04 12:38:31,773 INFO] unk_token token is: <unk> id: [0]
[2025-01-04 12:38:31,774 INFO] Building model...
[2025-01-04 12:38:32,084 INFO] EncoderDecoderModel(
  (encoder): TransformerEncoder(
    (transformer_layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (self_attn): SelfMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (dropout): Dropout(p=0.3, inplace=False)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): MLP(
          (gate_up_proj): Linear(in_features=512, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=512, bias=False)
          (dropout_1): Dropout(p=0.3, inplace=False)
          (dropout_2): Dropout(p=0.3, inplace=False)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (transformer_layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (self_attn): SelfMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (dropout): Dropout(p=0.3, inplace=False)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): MLP(
          (gate_up_proj): Linear(in_features=512, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=512, bias=False)
          (dropout_1): Dropout(p=0.3, inplace=False)
          (dropout_2): Dropout(p=0.3, inplace=False)
        )
        (precontext_layernorm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (context_attn): ContextMHA(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (src_emb): Embeddings(
    (embeddings): Embedding(32760, 512, padding_idx=1)
    (dropout): Dropout(p=0.3, inplace=False)
    (pe): PositionalEncoding()
  )
  (tgt_emb): Embeddings(
    (embeddings): Embedding(32768, 512, padding_idx=1)
    (dropout): Dropout(p=0.3, inplace=False)
    (pe): PositionalEncoding()
  )
  (generator): Linear(in_features=512, out_features=32768, bias=True)
)
[2025-01-04 12:38:32,085 INFO] embeddings: 33550336
[2025-01-04 12:38:32,085 INFO] encoder: 6296576
[2025-01-04 12:38:32,085 INFO] decoder: 8395776
[2025-01-04 12:38:32,086 INFO] generator: 16809984
[2025-01-04 12:38:32,086 INFO] other: 0
[2025-01-04 12:38:32,086 INFO] * number of parameters: 65052672
[2025-01-04 12:38:32,086 INFO] Trainable parameters = {'torch.float32': 65052672}
[2025-01-04 12:38:32,087 INFO] Non trainable parameters = {}
[2025-01-04 12:38:32,087 INFO]  * src vocab size = 32760
[2025-01-04 12:38:32,087 INFO]  * tgt vocab size = 32768
[2025-01-04 12:38:32,089 INFO] Starting training on GPU: [0]
[2025-01-04 12:38:32,089 INFO] Start training loop and validate every 500 steps...
[2025-01-04 12:38:32,089 INFO] Scoring with: None
[2025-01-04 12:38:34,975 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2025-01-04 12:38:37,847 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2025-01-04 12:38:38,808 INFO] Step 50/ 1000; acc: 60.0; ppl: 4215.71; xent: 8.35; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1454/ 179/62; 10824/1330 tok/s;      7 sec;
[2025-01-04 12:38:39,481 INFO] Step 100/ 1000; acc: 67.4; ppl: 66.66; xent: 4.20; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1330/ 166/62; 98699/12342 tok/s;      7 sec;
[2025-01-04 12:38:40,144 INFO] Step 150/ 1000; acc: 62.3; ppl: 84.85; xent: 4.44; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1375/ 174/63; 103715/13117 tok/s;      8 sec;
[2025-01-04 12:38:40,828 INFO] Step 200/ 1000; acc: 66.9; ppl: 50.40; xent: 3.92; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1474/ 182/62; 107841/13329 tok/s;      9 sec;
[2025-01-04 12:38:41,492 INFO] Step 250/ 1000; acc: 66.9; ppl: 23.07; xent: 3.14; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1331/ 171/63; 100208/12857 tok/s;      9 sec;
[2025-01-04 12:38:42,161 INFO] Step 300/ 1000; acc: 67.3; ppl: 19.81; xent: 2.99; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1305/ 170/63; 97668/12754 tok/s;     10 sec;
[2025-01-04 12:38:42,821 INFO] Step 350/ 1000; acc: 65.0; ppl: 21.15; xent: 3.05; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1406/ 178/62; 106431/13447 tok/s;     11 sec;
[2025-01-04 12:38:43,473 INFO] Step 400/ 1000; acc: 64.1; ppl: 22.80; xent: 3.13; aux: 0.000; lr: 1.00e+00; sents:    3152; bsz: 1331/ 174/63; 102137/13355 tok/s;     11 sec;
[2025-01-04 12:38:44,144 INFO] Step 450/ 1000; acc: 63.1; ppl: 17.22; xent: 2.85; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1440/ 182/62; 107283/13569 tok/s;     12 sec;
[2025-01-04 12:38:44,832 INFO] Step 500/ 1000; acc: 64.6; ppl: 30.48; xent: 3.42; aux: 0.000; lr: 1.00e+00; sents:    3104; bsz: 1385/ 177/62; 100644/12860 tok/s;     13 sec;
[2025-01-04 12:39:00,069 INFO] valid stats calculation
                           took: 15.235411167144775 s.
[2025-01-04 12:39:00,070 INFO] Train perplexity: 52.1136
[2025-01-04 12:39:00,070 INFO] Train accuracy: 64.7299
[2025-01-04 12:39:00,071 INFO] Sentences processed: 31232
[2025-01-04 12:39:00,071 INFO] Average bsz: 1383/ 175/62
[2025-01-04 12:39:00,071 INFO] Validation perplexity: 213.562
[2025-01-04 12:39:00,071 INFO] Validation accuracy: 71.4428
[2025-01-04 12:39:00,072 INFO] Saving optimizer and weights to step_500, and symlink to en-zh/run/model
[2025-01-04 12:39:00,281 INFO] Saving transforms artifacts, if any, to en-zh/run/model\step_500
[2025-01-04 12:39:00,282 INFO] Saving config and vocab to en-zh/run/model
Traceback (most recent call last):
  File "\\?\C:\ProgramData\anaconda3\envs\eole\Scripts\eole-script.py", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "e:\ai\nlp\eole\eole\eole\bin\main.py", line 39, in main
    bin_cls.run(args)
  File "e:\ai\nlp\eole\eole\eole\bin\run\train.py", line 70, in run
    train(config)
  File "e:\ai\nlp\eole\eole\eole\bin\run\train.py", line 57, in train
    train_process(config, device_id=0)
  File "e:\ai\nlp\eole\eole\eole\train_single.py", line 244, in main
    trainer.train(
  File "e:\ai\nlp\eole\eole\eole\trainer.py", line 375, in train
    self.model_saver.save(step, moving_average=self.moving_average)
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 359, in save
    self._save(step)
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 336, in _save
    self._save_vocab()
  File "e:\ai\nlp\eole\eole\eole\models\model_saver.py", line 280, in _save_vocab
    json.dump(vocab_data, f, indent=2, ensure_ascii=False)
  File "C:\ProgramData\anaconda3\envs\eole\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
UnicodeEncodeError: 'gbk' codec can't encode character '\u011f' in position 12: illegal multibyte sequence
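
From the traceback, the failure happens in `json.dump(vocab_data, f, indent=2, ensure_ascii=False)` in `model_saver.py`: with `ensure_ascii=False`, the non-ASCII character U+011F ("ğ") is written to the file as-is, and on Windows a file opened without an explicit encoding falls back to the locale codec (gbk on a Chinese-locale system), which cannot represent that character. Below is a minimal sketch of the mechanism, assuming the vocab file is opened without `encoding="utf-8"`; the file name is only for illustration, not eole's actual code.

```python
# Minimal sketch of the failure mode (hypothetical file name, not eole's code).
# json.dump(..., ensure_ascii=False) emits raw non-ASCII characters, so the
# underlying file object must use an encoding that can represent them.
import json

vocab_data = {"tokens": ["the", "\u011f"]}  # U+011F (ğ) is not representable in gbk

# On a Chinese-locale Windows machine, open() without encoding= defaults to gbk;
# forcing gbk here reproduces the same error on any platform.
try:
    with open("vocab.json", "w", encoding="gbk") as f:
        json.dump(vocab_data, f, indent=2, ensure_ascii=False)
except UnicodeEncodeError as e:
    print(e)  # 'gbk' codec can't encode character '\u011f' ...

# Opening the file with an explicit UTF-8 encoding writes the same data fine.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab_data, f, indent=2, ensure_ascii=False)
```

If that is indeed the cause, setting the environment variable `PYTHONUTF8=1` before running `eole train` (Python's UTF-8 mode) should make `open()` default to UTF-8 on Windows and avoid the error without code changes.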