Problems using Seata concurrently from multiple threads #6311
Comments
What version are you using? |
1.7.0 |
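Please show me the structure of your branch table. |
https://github.com/apache/incubator-seata/blob/1.7.0/script/server/db/mysql.sql It is the same as the official structure. |
Please screenshot or export your actual table structure; a claim alone is not evidence. |
Can you show me your branch table structure? |
If your server is 1.7 and the precision of gmt_create is 6, there is no dirty-write problem caused by a wrong rollback order after re-entering the same resource. |
I followed this and used version 2.0 of the Seata server and SDK, with gmt_create precision also set to 6. After modifying the same row twice in a row and throwing an exception, the rollback hit this problem as well. |
Check the upgrade guide on the official website yourself. |
Is it possible that the rollback order is correct, but the data was modified by another thread just as the rollback was about to run, i.e. the other threads did not stop in time and modified the data? |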
20:17:40.946 INFO --- [tyServerNIOWorker_1_10_16] [rocessor.server.RegTmProcessor] [ onRegTmMessage] [] : TM register success,message:RegisterTMRequest{version='1.7.0', applicationId='dubbo-demo-app', transactionServiceGroup='my_test_tx_group', extraData='ak=null
20:17:44.046 ERROR --- [nPool.commonPool-worker-3] [ption.AbstractExceptionHandler] [eptionHandleTemplate] [10.167.51.1:8091:3468256129648835463] : Catch TransactionException while do RPC, request: BranchRegisterRequest{xid='10.167.51.1:8091:3468256129648835463', branchType=AT, resourceId='jdbc:mysql://localhost:3306/seata', lockKey='account_tbl:6', applicationData='null'}
20:17:44.048 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : receive msg[merged]: BranchRegisterRequest{xid='10.167.51.1:8091:3468256129648835463', branchType=AT, resourceId='jdbc:mysql://localhost:3306/seata', lockKey='account_tbl:6', applicationData='null'}, clientIp: 127.0.0.1, vgroup: my_test_tx_group |
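The row order_tbl:276 was registered only once, by branch 3468256129648835479, and there is no other branch. Since it still cannot be rolled back, no branch re-entered the same row. Try XA mode with the same example: if the rollback failure no longer occurs, then something else is modifying order_tbl:276 without the @GlobalTransactional annotation. |
I switched to XA. The logs changed a little, but the rollback succeeded. |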
20:17:44.159 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : receive msg[single]: BranchRollbackResponse{xid='10.167.51.1:8091:3468256129648835463', branchId=3468256129648835504, branchStatus=PhaseTwo_Rollbacked, resultCode=Success, msg='null'}, clientIp: 127.0.0.1, vgroup: my_test_tx_group |
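Look up the branch IDs associated with the undo log in the server log yourself and you will see what the rollback order was. |
The server reports no rollback failure and says the rollback succeeded, but the RM side reports that the rollback failed. What follows is from a retry, so it is a different transaction.
[INFO ] 2024-01-31 21:24:23,600 method:org.apache.dubbo.registry.multicast.MulticastRegistry.receive(MulticastRegistry.java:196) |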
1. The timestamp column in the TC database has microsecond precision (6 fractional digits) |
I retried once more, and this time the server does log a rollback failure.
21:26:05.443 INFO --- [ RetryRollbacking_1_1] [server.coordinator.DefaultCore] [ doGlobalRollback] [10.167.51.1:8091:3153004174125383883] : Rollback global transaction successfully, xid = 10.167.51.1:8091:3153004174125383883.
21:26:21.479 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : result msg[merged]: BranchRegisterResponse{branchId=3153004174125384144, resultCode=Success, msg='null'}, clientIp: 127.0.0.1, vgroup: my_test_tx_group
21:26:21.575 ERROR --- [nPool.commonPool-worker-1] [ption.AbstractExceptionHandler] [eptionHandleTemplate] [10.167.51.1:8091:3153004174125384106] : Catch TransactionException while do RPC, request: BranchRegisterRequest{xid='10.167.51.1:8091:3153004174125384106', branchType=AT, resourceId='jdbc:mysql://localhost:3306/seata', lockKey='stock_tbl:1', applicationData='null'}
21:26:21.579 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : result msg[merged]: BranchRegisterResponse{branchId=0, resultCode=Failed, msg='TransactionException[Could not register branch into global session xid = 10.167.51.1:8091:3153004174125384106 status = Rollbacking while expecting Begin]'}, clientIp: 127.0.0.1, vgroup: my_test_tx_group |
The server side will always log a rollback failure in this case.
|
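Troubleshooting this way is too inefficient. How about adding me on DingTalk so I can share my desktop with you? If you are busy, tomorrow works too. |
Your data was clearly dirty-written by a transaction outside the global transaction. XA does not show the problem because XA prevents dirty writes even without @GlobalTransactional: it works at the database level and holds the local lock until phase two. |
The branch transaction with branch ID 3153004174125384132 was the first branch rolled back for that primary key. It hit a dirty write and failed to roll back, the global transaction was marked as failed, and the remaining branches were not rolled back. Analyze the logs yourself and you will see it. Also, when filing an issue, follow the template and include the version, logs, and stack trace; then troubleshooting would not be this inefficient. |
OK, thanks for the guidance. I will look into it tomorrow and get back to you. |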
I cannot easily share from my company computer; the code is just the part below. I think this is definitely a problem caused by multithreading, so what you said is possible.
@Override
@GlobalTransactional(timeoutMills = 300000, name = "dubbo-demo-tx", rollbackFor = Exception.class)
public void purchase(String userId, String commodityCode, int orderCount) {
    LOGGER.info("purchase begin ... xid: " + RootContext.getXID());
    ExecutorService executorService = Executors.newFixedThreadPool(3);
    List<Future<?>> submits = new ArrayList<>();
    String xid = RootContext.getXID();
    for (int i = 0; i < 8; i++) {
        int finalI = i;
        Callable<Boolean> r = () -> {
            RootContext.bind(xid);
            stockService.deduct(commodityCode, 1);
            // just test batch update
            //stockService.batchDeduct(commodityCode, orderCount);
            orderService.create(userId, commodityCode, 1);
            //throw new RuntimeException("random exception mock!");
            if (finalI == 3) {
                throw new RuntimeException("random exception mock!");
            }
            return true;
        };
        Future<?> submit = executorService.submit(r);
        submits.add(submit);
    }
    try {
        for (Future<?> submit : submits) {
            submit.get();
        }
    } catch (Exception e) {
        e.printStackTrace();
        throw new RuntimeException();
    } finally {
        executorService.shutdown();
    }
} |
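Images on GitHub are hard to open; it is best to post the log text. As for why some registrations fail and some succeed: the threads run in parallel, so the requests that registered successfully probably arrived slightly earlier and the ones that failed slightly later. Check the timestamps in the log. |
The data from the image just now: |
Please check the timestamps of 3153004174125384144 and 3153004174125384132 in your database. Judging by the snowflake IDs, 3153004174125384132 is definitely the earlier one, yet 3153004174125384144 never triggered a rollback. I think I roughly know the cause. |
3153004174125384132: 2024-01-31 13:26:21.375072 |
After 3153004174125384132 registered, the client decided to roll back. The server was about to roll back the transaction and read from the database that 3153004174125384132 was the last branch. At the same moment, 3153004174125384144 was registering as a branch, and it succeeded because the global transaction 3153004174125384106 had not yet been changed to Rollbacking. The global transaction 3153004174125384106 then rolled back with 3153004174125384132 loaded as the last branch, when in fact 3153004174125384144 was the last one. |
This looks like a thread-safety problem on the server triggered by multithreading. In file and raft modes the server locks the GlobalSession, so actions such as adding a branch transaction are serialized. In db and redis modes, to reduce network I/O and disk overhead, the GlobalSession is not locked. So if a rollback decision and a branch registration happen at the same time, the rollback order can go wrong and the rollback is aborted. |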
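The suspected race can be modeled without any Seata code. The sketch below is a deliberately simplified, single-threaded illustration: `Branch`, `rollback`, and the "A"/"B"/"C" row values are hypothetical stand-ins, not Seata classes or the server's actual logic. It only shows why a branch list loaded before a late registration makes the newest-first rollback trip over what looks like a dirty write.

```java
import java.util.ArrayList;
import java.util.List;

class RollbackSnapshotRace {

    /** Hypothetical, simplified stand-in for one AT branch and its undo images. */
    static final class Branch {
        final long branchId;
        final String beforeImage;
        final String afterImage;

        Branch(long branchId, String beforeImage, String afterImage) {
            this.branchId = branchId;
            this.beforeImage = beforeImage;
            this.afterImage = afterImage;
        }
    }

    /**
     * Rolls branches back newest-first. A branch may restore its before image
     * only if the row still equals its after image; anything else is treated
     * as a dirty write and aborts the rollback.
     */
    static String rollback(List<Branch> branches, String currentValue) {
        String value = currentValue;
        for (int i = branches.size() - 1; i >= 0; i--) {
            Branch b = branches.get(i);
            if (!b.afterImage.equals(value)) {
                throw new IllegalStateException("dirty write detected at branch " + b.branchId);
            }
            value = b.beforeImage;
        }
        return value;
    }

    public static void main(String[] args) {
        List<Branch> branchTable = new ArrayList<>();
        branchTable.add(new Branch(3153004174125384132L, "A", "B"));

        // The rollback path loads its branch list at this point...
        List<Branch> snapshot = new ArrayList<>(branchTable);

        // ...while a late branch still registers, because the global
        // status has not yet been switched to Rollbacking.
        branchTable.add(new Branch(3153004174125384144L, "B", "C"));

        try {
            rollback(snapshot, "C");
        } catch (IllegalStateException e) {
            System.out.println("stale snapshot: " + e.getMessage());
        }
        System.out.println("full list restores: " + rollback(branchTable, "C"));
    }
}
```

With the stale snapshot, the first branch expects to find "B" but sees "C" and aborts; with the complete list, the newer branch is undone first and the row returns to "A".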
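Change the status of global transaction 3153004174125384106 to 5: on which row of the branch table? All rows? Or the row whose branch_id is 3153004174125384132, or the one with 3153004174125384144? |
So this is a Seata bug? Modifying data concurrently from multiple threads should be a normal business scenario, right? |
3153004174125384106 is the global transaction, i.e. the xid 10.167.51.1:8091:3153004174125384106 |
More or less. If you verify that the rollback then succeeds, it is basically confirmed. For now it is only a guess derived from your logs and description. |
branch_table now has 14 rows. Should I change the status of every row to 5? I have changed them, but how do I trigger the rollback? Nothing is rolling back automatically. |
Are both the server and the client running? Do not modify branch_table; change the status of 10.167.51.1:8091:3153004174125384106 in global_table. |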
22:08:57.383 INFO --- [ RetryRollbacking_1_1] [.core.rpc.netty.ChannelManager] [ getChannel] [10.167.51.1:8091:3153004174125384106] : Choose [id: 0x5d196cb5, L:/127.0.0.1:8091 - R:/127.0.0.1:53152] on the same IP[127.0.0.1] as alternative of dubbo-demo-stock-service:127.0.0.1:50305 |
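I had just shut them down. Now both are running and the status is changed, but there is an error and the status went back to 12. |
What is the error? Can you post the log? In theory, when the rollback runs again, 3153004174125384144 should be rolled back first according to the timestamps, whereas before it was 3153004174125384132. What exception does the client show now? |
The server error is above; here is the client exception: Application is keep running ... |
The rollback order is correct now, but the data was still dirty-written. Compare the undo log with the current content to see what differs. As I recall, the client prints the columns that do not match; why is that missing here? |
I made a mistake in the steps just now, so I redid everything from scratch. After changing the status, the rollback succeeded and the data is correct. All Seata-related tables are empty now, so it is completely fine. |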
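This is the log after changing the status; the rollback succeeded:
22:20:12.217 INFO --- [ batchLoggerPrint_1_1] [ocessor.server.BatchLogHandler] [ run] [] : receive msg[single]: BranchRollbackResponse{xid='10.167.51.1:8091:3153004174125384203', branchId=3153004174125384233, branchStatus=PhaseTwo_RollbackFailed_Unretryable, resultCode=Success, msg='null'}, clientIp: 127.0.0.1, vgroup: my_test_tx_group |
Did the rollback fail after the status change above because too much time had passed and rollback was no longer allowed? I saw a BranchRollbackFailed_Unretriable field. |
Try it a few more times. If it holds, that confirms the bug: rollback running concurrently with branch registration makes the rollback sequence miss a branch, so the rollback cannot succeed. |
No. Starting in a later version, around 1.6, dirty-written data is considered to require manual intervention, because retrying the rollback forever is pointless. So the result was changed to BranchRollbackFailed_Unretriable, which marks the branch as impossible to roll back and puts the global transaction into a failed state awaiting manual intervention. What you just did counts as manual intervention. |
I tried 5 times, and every time the rollback succeeded after changing the status! |
Is there a temporary workaround we can use in production for now? Otherwise this is a latent risk in our live system. |
Man, I cannot keep up. I have been swamped lately, and working the weekend was all because of this Seata problem. I will reply tomorrow. |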
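The simplest fix is to eliminate the resource re-entry scenario: make the multiple modifications within the same local transaction and the problem goes away. |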
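One way to read this advice, sketched under assumptions: rather than having eight threads each deduct 1 from the same stock row (eight branches re-entering one resource), sum the requests per row and write each row once. `StockService` here is a hypothetical stand-in for the demo's service interface, and `mergeByCommodity` and `deductOncePerRow` are illustrative helper names, not Seata API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeRowUpdates {

    /** Hypothetical stand-in for the demo's stock service; not a Seata type. */
    interface StockService {
        void deduct(String commodityCode, int count);
    }

    /** Sums the requested single-unit deductions per commodity code. */
    static Map<String, Integer> mergeByCommodity(List<String> requestedCodes) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String code : requestedCodes) {
            totals.merge(code, 1, Integer::sum);
        }
        return totals;
    }

    /** Issues exactly one deduct call per distinct row, so no row is re-entered. */
    static void deductOncePerRow(StockService stock, List<String> requestedCodes) {
        mergeByCommodity(requestedCodes)
                .forEach((code, count) -> stock.deduct(code, count));
    }

    public static void main(String[] args) {
        List<String> requests = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            requests.add("commodity-1"); // placeholder code, eight units of the same row
        }
        deductOncePerRow(
                (code, count) -> System.out.println("deduct " + code + " x" + count),
                requests);
    }
}
```

Whether merging is acceptable depends on the business logic; the point is only that one write per row inside one local transaction leaves a single undo record for that row, so rollback order no longer matters for it.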
Besides watching out for resource re-entry, note that the Future.get calls must wait for all asynchronous threads to finish instead of deciding on the first reported error; then the out-of-order second-phase rollback basically cannot occur. In your example, Future.get decides as soon as one exception is raised and does not wait for the other branches to register, so this problem is very likely to occur! |
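A minimal sketch of that suggestion using only `java.util.concurrent`: `ExecutorService.invokeAll` returns only after every task has completed, so the first failure is rethrown only once no task is still running. The class and method names (`AwaitAllTasks`, `runAllThenFail`) are illustrative, and the task body is a placeholder for the `RootContext.bind(xid)` and service calls in the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitAllTasks {

    /**
     * Runs every task, waits for ALL of them to finish, and only then
     * rethrows the first failure (if any).
     */
    static <T> List<T> runAllThenFail(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        // invokeAll blocks until each task has completed, normally or exceptionally.
        List<Future<T>> futures = pool.invokeAll(tasks);
        List<T> results = new ArrayList<>();
        ExecutionException firstFailure = null;
        for (Future<T> future : futures) {
            try {
                results.add(future.get());
            } catch (ExecutionException e) {
                if (firstFailure == null) {
                    firstFailure = e;
                }
            }
        }
        if (firstFailure != null) {
            throw firstFailure;
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        AtomicInteger finished = new AtomicInteger();
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int n = i;
            tasks.add(() -> {
                try {
                    // Placeholder for RootContext.bind(xid) plus the service calls.
                    if (n == 3) {
                        throw new RuntimeException("random exception mock!");
                    }
                    return true;
                } finally {
                    finished.incrementAndGet();
                }
            });
        }
        try {
            runAllThenFail(pool, tasks);
        } catch (ExecutionException e) {
            // All eight tasks have finished by the time the failure surfaces.
            System.out.println("finished before rethrow: " + finished.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

In the `purchase` method this would mean the exception that triggers the global rollback leaves the method only after every worker thread has registered its branches, which is the condition the comment above asks for.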
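The child threads call RootContext.bind(xid) to bind the main thread's xid, so that the whole operation rolls back on error. But we ran into a problem:
The first thread modifies a column, changing the value from A to B. The second thread modifies the same column (it can acquire the global lock because it shares the xid), changing the value from B to C. When a business exception triggers the rollback, the first thread's rollback fails because it detects dirty data. The global lock is then never released, the row stays locked, and the business flow cannot continue.
A question for the maintainers: for this scenario, apart from avoiding concurrent modification of the same data at the business level, is there a good solution? Does Seata have a mechanism to prevent this? Or are we using Seata the wrong way?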