Replication losing data when diskchunk_flush_write_timeout enabled #3061
Comments
I cannot reproduce the issue. I switched to the branch
The last line of the output looks like
I'm not sure what to expect there or what I should investigate, as I see no daemon logs on the dev box. Then I set it and got errors
|
Can you provide the FULL log that you see? As far as I can see, something went wrong even when we created the cluster, so nothing else worked afterwards. To understand it, I need to see the full log. I tried to use the latest test kit and run the test; for me, the cluster is created with no issues, but it's still reproducible. |
Here is a full log with
|
Looks like it used a very old daemon that had problems with cluster creation and was unable to create a cluster, returning a null error. Try to update the docker image and run it again; that should help:
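(For reference, a minimal sketch of what refreshing the image could look like; the image name and tag are assumptions here, so use whatever the test kit actually references.)

```bash
# Pull a fresh image so the test does not run against a stale daemon build
docker pull manticoresearch/manticore:latest
# Confirm the image is actually newer than the one used in the failing run
docker image inspect --format '{{.Created}}' manticoresearch/manticore:latest
```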
I'm using the February version of the daemon; in the log I can see some old version of the daemon from January. |
Here's the full log with --logreplication enabled from 2 nodes when the issue happens. Please take a look at it and let me know if you see anything or not. |
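(For context, a minimal sketch of how the replication log can be enabled when starting the daemons by hand; the config paths are placeholders, and in the CLT setup the flag is passed by the test kit.)

```bash
# Start each node with verbose replication logging enabled
# (the config paths below are placeholders for this sketch)
searchd --config /etc/manticoresearch/node1.conf --logreplication
searchd --config /etc/manticoresearch/node2.conf --logreplication
```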
It could be better to enable
but it is not clear why Buddy failed waiting for the command: which command did Buddy fail to wait for, and how much time passed from the start of the command until the timeout happened in Buddy? |
We are looking at a case where data has disappeared from a clustered table (in some cases only one row is gone), which is why it's rather hard to reproduce. What we are looking for here is the magical disappearance of the row with key = 'master'. If we look at the logs, we see that there are inserts and updates for key = 'master', yet after we get the waiting timeout on the Buddy side after creating the sharded table, we can see that this row is no longer in the table. This is the issue with diskchunk enabled. Here are the full logs with all the info that will hopefully help to understand why this row is getting removed from the replicated table. |
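(To make the check concrete, a hedged sketch of the verification run against each node; the table name, SQL ports and exact query are illustrative rather than the ones used by the sharding internals.)

```bash
# Query both nodes for the bookkeeping row: before the Buddy timeout it exists on
# both of them, afterwards it is gone from both
mysql -h0 -P9306 -e "SELECT * FROM tbl WHERE key = 'master';"   # node 1
mysql -h0 -P9307 -e "SELECT * FROM tbl WHERE key = 'master';"   # node 2
```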
I'll try to reproduce. |
@donhardman I couldn't reproduce the issue via github here https://github.com/manticoresoftware/manticoresearch/actions/runs/13647861205 after 10 attempts. The workflow used is located here https://github.com/manticoresoftware/manticoresearch/blob/08d0023a23c5f3b8d21cffcc9f80cecfebdd3501/.github/workflows/mre.yml |
It was not reproducible due to us setting |
I've reproduced the issue without CLT in https://github.com/manticoresoftware/manticoresearch/blob/test/test-drop-sharded-table/.github/workflows/mre.yml with extra logging. @tomatolog please investigate the failure here https://github.com/manticoresoftware/manticoresearch/actions/runs/13660719391/job/38190978513 The failure is:
Below that you'll find:
|
I also confirm adding:
fixes the issue. |
I could also reproduce it via Docker on dev2 after tens of attempts, and I could reproduce it w/o Docker on GitHub (i.e. in a clean runner) too: https://github.com/manticoresoftware/manticoresearch/actions/runs/13661200658 . |
I was able to reproduce the issue on perf3. It seems that the key to triggering it is faster disk chunk flushing, like when saving to an SSD. Here’s how you can reproduce it on perf3:
|
Maybe a sketch of the fixes made by AI would be helpful for understanding possible solutions in a clustered environment. We can look into it to see whether it fits or not (but we probably also need to fix some code): #3166 |
The issue can also be reproduced with:
Notice, |
The issue seems to be that all nodes have the same server_id and UUID seed; if I set server_id to a unique number, there is no such error anymore.
node2
That causes auto-id to generate the same sequences on both nodes. node2 issued
while the original document was just replicated from node1; that is why that document got replaced on both nodes. |
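(A hedged sketch of that collision with illustrative cluster/table/column names: both daemons derive their auto-id sequence from the same server_id-based UUID seed, so node 2's own write lands on the id node 1 has already used.)

```bash
# Node 1 inserts its row without an explicit id; the id is auto-generated from
# the server_id-based UUID seed
mysql -h0 -P9306 -e "INSERT INTO c:tbl (key, value) VALUES ('master', 'x');"

# Node 2 has the same server_id (same seed), so its next auto-id is the same number;
# per the analysis above, its write ends up replacing the row replicated from node 1
mysql -h0 -P9307 -e "INSERT INTO c:tbl (key, value) VALUES ('shard_0', 'y');"
```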
We need to make sure that the nodes in the shards have unique server_id values prior to using these nodes. |
Or we could fail the cluster join if server_id is explicitly set to the same value, or if server_id is auto-initialized from the MAC address and is the same as on all the other nodes in the cluster. |
Good idea! |
Not quite sure why INSERT INTO does not fail in this case as it usually does. If I issue it at both nodes of the cluster,
I always get one node succeeding and the other node returning an error.
I.e. Galera properly checks for conflicts and does not allow a document that conflicts with the running transaction. Maybe it is a Galera bug that was fixed in more recent versions. |
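(For comparison, a sketch of the conflicting-write experiment with an explicit id and illustrative names: fired at both nodes at roughly the same time, one insert should be applied and the other rejected, since Galera certifies the two transactions against each other.)

```bash
# The same explicit-id insert issued on both nodes nearly simultaneously
mysql -h0 -P9306 -e "INSERT INTO c:tbl (id, key) VALUES (42, 'master');" &
mysql -h0 -P9307 -e "INSERT INTO c:tbl (id, key) VALUES (42, 'master');" &
wait
# Expected: one node succeeds, the other returns a conflict/duplicate-id error
```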
The issue that caused this case is #3186. |
Fixed at a3e3c4b: the default server_id now uses the MAC address along with the PID file path, so that multiple daemons started on the same node get different server_id values. Also added a check of server_id in the JOIN CLUSTER statement to make sure all nodes in the cluster have a unique server_id. You could also set
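(For reference, a sketch of pinning server_id by hand in each node's config; the values are arbitrary, they only have to differ between the nodes.)

```
# node1.conf — only the relevant directive is shown
searchd {
    server_id = 1
}

# node2.conf
searchd {
    server_id = 2
}
```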
I can't reproduce the issue anymore neither in https://github.com/manticoresoftware/manticoresearch/actions/runs/13661200658 nor with |
Closing as done since it's not reproducible anymore. |
Reopening to complete the checklist (tests, changelog). |
The checklist is complete. Closing. A test will be implemented as part of #3186. |
Bug Description:
There is an issue with the sharding logic. After investigation, we found that when creating a cluster with 2 nodes and configuring 3 shards on it, in most cases we encounter a "Waiting timeout exceed" error by default. Further investigation revealed that the issue is not related to buddy allocation but rather to the "diskchunk_flush_write_timeout" setting. When we set "diskchunk_flush_write_timeout = -1" in the configuration, everything works perfectly without issues. However, when we leave it unset or set it to "diskchunk_flush_write_timeout = 1", the problem persists. After a deep analysis of the logs, we discovered that when concurrent insert and update operations occur on the same table across different nodes, this setting frequently causes some keys to be lost, which prevents sharding from working properly.
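(For clarity, this is the kind of fragment toggled in the searchd section of the test config: -1 disables the timed disk chunk flushing and makes the problem disappear, while leaving it unset or setting it to 1 reproduces it.)

```
searchd {
    # workaround: disable timed flushing of disk chunks
    diskchunk_flush_write_timeout = -1
}
```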
We should use CLT to reproduce it, because without it I was unable to reproduce the problem due to hitting another issue: #3048
We should get the test from branch
test/test-drop-sharded-table
. Here's how to run it:
We should not see "waiting timeout exceeded". When we update the config with diskchunk_flush_write_timeout disabled, everything works fine.
We can modify the config
test/clt-tests/base/searchd-with-flexible-ports.conf
to set or unset it. Currently diskchunk_flush_write_timeout is disabled!
Manticore Search Version:
Latest dev version
Operating System Version:
Ubuntu
Have you tried the latest development version?
None
Internal Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.