🔗 https://github.com/IvanaXu/iDeepRec/tree/main/pro/DeepRec/tianchi/DLRM#stand-alone-training #62

Open
IvanaXu opened this issue Nov 4, 2022 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

@IvanaXu
Owner

IvanaXu commented Nov 4, 2022

docker pull alideeprec/deeprec-release-modelzoo:latest
docker run -it alideeprec/deeprec-release-modelzoo:latest /bin/bash
cd /root/modelzoo/dlrm

python train.py

# Memory acceleration with jemalloc.
# The required ENV `MALLOC_CONF` is already set in the code.
LD_PRELOAD=../libjemalloc.so.2.5.1 python train.py
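
The two training commands above differ only in the `LD_PRELOAD` environment variable. A minimal Python sketch of launching either variant from one script, assuming it is run from `/root/modelzoo/dlrm` inside the container; the helper names `build_env` and `run_training` are hypothetical and not part of DeepRec:

```python
import os
import subprocess
import sys

# Library path copied from the command above; relative to /root/modelzoo/dlrm.
JEMALLOC_LIB = "../libjemalloc.so.2.5.1"


def build_env(preload=None, base=None):
    """Return a copy of the environment, optionally with LD_PRELOAD set."""
    env = dict(os.environ if base is None else base)
    if preload:
        env["LD_PRELOAD"] = preload
    return env


def run_training(preload=None, script="train.py"):
    """Run the training script as a child process and return its exit code."""
    result = subprocess.run([sys.executable, script], env=build_env(preload))
    return result.returncode


# Example calls (not executed here):
#   run_training()              # plain run
#   run_training(JEMALLOC_LIB)  # run with jemalloc preloaded
```

Copying the environment rather than mutating `os.environ` keeps the preload from leaking into the launcher process itself.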
@IvanaXu added the "help wanted" label Nov 4, 2022
@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:global_step/sec: 142.617
INFO:tensorflow:loss = 0.5084822, steps = 15501 (0.701 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 137.715
INFO:tensorflow:loss = 0.42516267, steps = 15601 (0.726 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DLRM_1667521630/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DLRM_1667521630/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7830795049667358
AUC = 0.7807814478874207
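
To compare runs without reading the logs by eye, the final metrics can be pulled out of the text. A rough sketch, assuming the log format shown above (`ACC = ...`, `AUC = ...`, and a `Restoring parameters from ./result/model_<NAME>_<timestamp>/...` line); the function name `parse_eval` is hypothetical:

```python
import re

# Patterns matched against the log lines shown in this issue.
METRIC_RE = re.compile(r"^(ACC|AUC) = ([0-9.]+)\s*$")
CKPT_RE = re.compile(
    r"Restoring parameters from \./result/model_(\w+?)_\d+/model\.ckpt"
)


def parse_eval(log_text):
    """Collect the model name and final ACC/AUC from a training log."""
    summary = {}
    for line in log_text.splitlines():
        line = line.strip()
        ckpt = CKPT_RE.search(line)
        if ckpt:
            summary["model"] = ckpt.group(1)
        metric = METRIC_RE.match(line)
        if metric:
            summary[metric.group(1)] = float(metric.group(2))
    return summary


sample = """\
INFO:tensorflow:Restoring parameters from ./result/model_DLRM_1667521630/model.ckpt-15625
Evaluation complate:[3907/3907]
ACC = 0.7830795049667358
AUC = 0.7807814478874207
"""
result = parse_eval(sample)
print(result)
```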

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:global_step/sec: 148.565
INFO:tensorflow:loss = 0.4433312, steps = 15501 (0.673 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 144.373
INFO:tensorflow:loss = 0.41600972, steps = 15601 (0.693 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DLRM_1667521882/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DLRM_1667521882/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7839229702949524
AUC = 0.7821462154388428
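
The two DLRM logs above each report `global_step/sec` twice near the end of training. A small sketch for comparing the average rate of two runs, assuming only those logged samples are available; the names `mean_rate`, `run_a`, and `run_b` are hypothetical, and the issue does not state which run used jemalloc:

```python
import re
from statistics import mean

RATE_RE = re.compile(r"global_step/sec: ([0-9.]+)")


def mean_rate(log_text):
    """Average every global_step/sec value found in the log text."""
    rates = [float(value) for value in RATE_RE.findall(log_text)]
    if not rates:
        raise ValueError("no global_step/sec lines found")
    return mean(rates)


# Rate lines copied from the two DLRM logs above (model_DLRM_1667521630
# and model_DLRM_1667521882).
run_a = (
    "INFO:tensorflow:global_step/sec: 142.617\n"
    "INFO:tensorflow:global_step/sec: 137.715\n"
)
run_b = (
    "INFO:tensorflow:global_step/sec: 148.565\n"
    "INFO:tensorflow:global_step/sec: 144.373\n"
)

ratio = mean_rate(run_b) / mean_rate(run_a)
print(f"run_b / run_a throughput ratio: {ratio:.3f}")
```

Two samples from the tail of each run are too few to draw a firm conclusion; averaging over the full log would give a steadier number.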

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_DIEN_1667522105/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
INFO:tensorflow:loss = 1.1077905, steps = 1
INFO:tensorflow:global_step/sec: 1.02738
INFO:tensorflow:loss = 0.95943296, steps = 101 (97.337 sec)
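
The `global_step/sec` figure is roughly the number of steps between two log lines divided by the elapsed time shown in parentheses. A quick check against the DIEN lines above, assuming the elapsed time covers steps 1 through 101; the function name `steps_per_sec` is hypothetical:

```python
def steps_per_sec(step_start, step_end, elapsed_sec):
    """Approximate training throughput between two logged steps."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return (step_end - step_start) / elapsed_sec


# Values from the log above: steps 1 -> 101 took 97.337 sec.
rate = steps_per_sec(1, 101, 97.337)
print(f"{rate:.4f} steps/sec")  # the log itself reports 1.02738
```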

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:loss = 0.5911528, steps = 1801 (10.451 sec)
INFO:tensorflow:global_step/sec: 9.69216
INFO:tensorflow:loss = 0.5946928, steps = 1901 (10.318 sec)
INFO:tensorflow:global_step/sec: 9.74381
INFO:tensorflow:loss = 0.55903524, steps = 2001 (10.263 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 9.72112
INFO:tensorflow:loss = 0.58782303, steps = 2101 (10.287 sec)
INFO:tensorflow:Saving checkpoints for 2122 into ./result/model_DIN_1667522357/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DIN_1667522357/model.ckpt-2122
2022-11-04 00:43:12.791780: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-11-04 00:43:12.791838: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-11-04 00:43:12.791854: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-11-04 00:43:12.791910: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/attention_layer/Select_grad/zeros_like
2022-11-04 00:43:12.791930: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/Select_1
2022-11-04 00:43:12.791956: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/tuple/control_dependency_1
2022-11-04 00:43:12.792539: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/zeros_like
2022-11-04 00:43:12.792583: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select
2022-11-04 00:43:12.792602: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/tuple/control_dependency
2022-11-04 00:43:12.793145: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/input_layer/UID_embedding/UID_embedding_weights][new_name:fused_op_1_select_then_scalar]
2022-11-04 00:43:12.793766: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select][new_name:fused_op_1_select_else_scalar]
2022-11-04 00:43:12.794364: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select][new_name:fused_op_1_select_else_scalar_in_grad]
2022-11-04 00:43:12.794402: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select][new_name:fused_op_2_select_else_scalar_in_grad]
2022-11-04 00:43:12.794434: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/attention_layer/Select_grad/Select][new_name:fused_op_3_select_else_scalar_in_grad]
2022-11-04 00:43:12.795014: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1][new_name:fused_op_1_select_then_scalar_in_grad]
2022-11-04 00:43:12.795058: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select_1][new_name:fused_op_2_select_then_scalar_in_grad]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[100/237]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[200/237]
Evaluation complate:[237/237]
ACC = 0.6883002519607544
AUC = 0.7631368041038513

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:global_step/sec: 74.8738
INFO:tensorflow:loss = 0.16757868, steps = 15501 (1.336 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 71.7669
INFO:tensorflow:loss = 0.1547018, steps = 15601 (1.393 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DeepFM_1667522871/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DeepFM_1667522871/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7833489775657654
AUC = 0.777617335319519

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:loss = 0.08075554, steps = 97501 (1.134 sec)
INFO:tensorflow:global_step/sec: 89.501
INFO:tensorflow:loss = 0.11768236, steps = 97601 (1.117 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Saving checkpoints for 97657 into ./result/model_MMOE_1667523186/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_MMOE_1667523186/model.ckpt-97657
2022-11-04 01:11:50.191893: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-11-04 01:11:50.192006: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-11-04 01:11:50.192036: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-11-04 01:11:50.195364: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select][new_name:fused_op_1_select_else_scalar]
2022-11-04 01:11:50.196469: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select][new_name:fused_op_1_select_else_scalar_in_grad]
2022-11-04 01:11:50.196501: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select][new_name:fused_op_2_select_else_scalar_in_grad]
2022-11-04 01:11:50.197561: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1][new_name:fused_op_1_select_then_scalar_in_grad]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complete:[20/20]
ACC = 0.9731500148773193
AUC = 0.7530704736709595

@IvanaXu
Owner Author

IvanaXu commented Nov 4, 2022

INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1667524447/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
INFO:tensorflow:loss = 0.71554154, steps = 1
2022-11-04 01:14:27.671448: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled
INFO:tensorflow:global_step/sec: 5.86205
INFO:tensorflow:loss = 0.52297497, steps = 101 (17.060 sec)
INFO:tensorflow:global_step/sec: 5.81553
INFO:tensorflow:loss = 0.47905684, steps = 201 (17.195 sec)
INFO:tensorflow:global_step/sec: 5.83783
INFO:tensorflow:loss = 0.5278871, steps = 301 (17.130 sec)
INFO:tensorflow:global_step/sec: 5.99602

@IvanaXu changed the title from "https://github.com/IvanaXu/iDeepRec/tree/main/pro/DeepRec/tianchi/DLRM#stand-alone-training" to "🔗 https://github.com/IvanaXu/iDeepRec/tree/main/pro/DeepRec/tianchi/DLRM#stand-alone-training" Nov 4, 2022
@IvanaXu added the "documentation" label and removed the "help wanted" label Nov 4, 2022
@IvanaXu moved this from Todo to ✨TRY in iDeepRec Nov 4, 2022