Comments
lu...@gmail.com <lu...@gmail.com> #2
After two days of research, we have found the following:
- Different threads are not seeing each other's changes, failing to detect the ref-updates that happened on the filesystem
- This is caused by the DynamicRefDbRepository, which causes the two threads to hold two different refDatabase instances
See the relevant code at:

public class DynamicRefDbRepository extends FileRepository {
  ...
  @Override
  public RefDatabase getRefDatabase() {
    return refDatabaseSupplier.apply(path, super.getRefDatabase());
  }
}
Basically, even if the first thread has updated the refDB, the second thread is stuck on a different SnapshottingRefDirectory instance and will never see the updated refs unless an I/O exception or other error occurs.
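To make the split view concrete, here is a minimal, self-contained sketch (hypothetical stand-in classes and maps, not the actual Gerrit/JGit code) of how a per-thread snapshot behaves: Thread-2 keeps reading the snapshot it took at S0 even after the shared ref-db has moved on to S1.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class PerThreadSnapshotDemo {
  // The shared ref-db seen by all threads (stand-in for the on-disk refs)
  static final Map<String, String> refDb =
      new ConcurrentHashMap<>(Map.of("refs/heads/master", "S0"));

  // Per-thread snapshot, analogous to the per-request SnapshottingRefDirectory
  static final ThreadLocal<Map<String, String>> snapshot =
      ThreadLocal.withInitial(() -> Map.copyOf(refDb));

  public static void main(String[] args) throws Exception {
    CountDownLatch snapshotTaken = new CountDownLatch(1);
    CountDownLatch refUpdated = new CountDownLatch(1);

    Thread thread2 = new Thread(() -> {
      String before = snapshot.get().get("refs/heads/master"); // snapshot taken at S0
      snapshotTaken.countDown();
      try {
        refUpdated.await();
      } catch (InterruptedException e) {
        return;
      }
      String after = snapshot.get().get("refs/heads/master"); // still S0: the update is invisible
      System.out.println("Thread-2 sees " + before + " then " + after
          + ", while the shared ref-db has " + refDb.get("refs/heads/master"));
    });
    thread2.start();

    snapshotTaken.await();
    refDb.put("refs/heads/master", "S1"); // thread-1 updates master S0 => S1
    refUpdated.countDown();
    thread2.join();
  }
}

Running it prints that Thread-2 still sees S0 while the shared ref-db is already at S1, which is exactly the split-brain condition described above.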
The change that introduced this thread isolation is:
lu...@gmail.com <lu...@gmail.com> #3
The workaround is to disable the behaviour introduced by Change 360646 by setting: core.usePerRequestRefCache = false
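For reference, a sketch of how the workaround would look in gerrit.config, assuming the usual [core] section placement for core.* settings:

[core]
  usePerRequestRefCache = false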
lu...@gmail.com <lu...@gmail.com> #4
Documented the risk of core.usePerRequestRefCache in the Gerrit v3.8 release notes, and moved the change to the breaking-changes section, as it represents a radically different way of serving incoming requests compared to how it worked before.
ap...@google.com <ap...@google.com> #5
Branch: stable-3.8
commit 77aeeec6b57c78d5545d1517f0b63708ae0cac0d
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Tue Nov 07 19:44:48 2023
Disable the thread-local refs caching
The thread-local refs caching opens the risk of a split-brain
situation within the same JVM, because two threads are not
able to see the refs modifications made elsewhere in Gerrit.
The thread-local caching of the refDatabase was introduced in
Gerrit v3.8.0 with [1] and enabled by default. Although the caching
improves read performance, it can also cause data inconsistency
between threads.
[1]
Bug:
Change-Id: I967cc56f0cc080d832d6623700f30efdf6468380
M setup_local_env/configs/gerrit.config
lu...@gmail.com <lu...@gmail.com> #6
See below a more detailed example of what is happening:
Observed behaviour:
===================
Starting point: master branch: S0
Thread-1: Try#1 - (master: S0) … S0 => S1 … Success !
Thread-2: Try#1 - (master: S0) … S0 => S2 … Fail ! (because of JGit filesystem locking)
Try#2 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#3 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#4 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#5 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
…
Try#15 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Expected behaviour (core.usePerRequestRefCache = false)
==================
Starting point: master: S0
Thread-1: Try#1 - (master: S0) … S0 => S1 … Success !
Thread-2: Try#1 - (master: S0) … S0 => S2 … Fail ! (because of JGit filesystem locking)
Try#2 - (master: S1) Fail without retry ! (because the group already exists)
ma...@gmail.com <ma...@gmail.com> #10
Sounds like this caching should not be used on primary servers. Can it safely be used on replica servers ?
lu...@gmail.com <lu...@gmail.com> #11
I believe caching is absolutely safe on replicas or primaries without massive parallelism.
I will amend the release notes accordingly.
lu...@gmail.com <lu...@gmail.com> #12
Reporting @Nasser's feedback here:
I admit, we didn't test GC from within Gerrit, so there could be something bad there we didn't catch, but assuming other configuration is sane, that would definitely be a bug. I completely agree that disabling at least for your setup is a good idea until that's understood. I'll try to get some time to debug why GC is doing something bad, but at a quick glance, we should probably have SnapshottingRefDirectory override pack() and call invalidateSnapshot() before calling the super method.
(Luca) I did not see this change on the Gerrit stable-3.5 branch, are you guys running this code in production?
(Nasser) Yes, we backported this to our 3.5 branch that we have for feature backports that we don't think are appropriate for upstream stable branches. We've been running it there since Aug 19th and we had it in our stress testing since March without issues.
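To make the pack()/invalidateSnapshot() suggestion above concrete, here is a schematic sketch; the RefDirectoryLike/SnapshottingRefDirectoryLike types and the pack() signature are hypothetical stand-ins rather than the actual JGit declarations, and only the idea (invalidate the thread's snapshot before delegating to the super method) comes from the feedback above.

import java.io.IOException;
import java.util.List;

// Hypothetical stand-in for the base ref directory
class RefDirectoryLike {
  void pack(List<String> refsToPack) throws IOException {
    // ...move the given loose refs into packed-refs...
  }
}

// Hypothetical stand-in for the snapshotting wrapper
class SnapshottingRefDirectoryLike extends RefDirectoryLike {
  // Drop this thread's cached snapshot so the next read re-scans the refs
  void invalidateSnapshot() {
    // ...invalidate the in-memory snapshot held by this instance...
  }

  @Override
  void pack(List<String> refsToPack) throws IOException {
    invalidateSnapshot();   // refresh first, so GC does not pack refs from a stale snapshot
    super.pack(refsToPack); // then run the normal packing logic
  }
}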
lu...@gmail.com <lu...@gmail.com> #13
Reporting @Nasser's feedback here:
I looked at the JGit side more and I was wrong, I don't see any issues with the current code. What JGit configs do you have for your setup? Did you repro this on NFS or a local disk?
lu...@gmail.com <lu...@gmail.com> #14
The corruption happens with:
- Plain-vanilla Gerrit
- Local filesystem
- No JGit configs
It is a corruption happening during concurrency, so you need to leave it running for a few minutes. As I mentioned, after disabling the snapshotting refdb caching on the thread-local storage, the problem does not reproduce anymore.
na...@linaro.org <na...@linaro.org> #15
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
lu...@gmail.com <lu...@gmail.com> #16
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
On multi-site setups, you need:
- global-refdb libModule
- one of the global-refdb implementations (on GerritHub we use zookeeper-refdb)
What happens isn't a bug IMHO, but just the consequence of concurrency: two threads have an in-memory copy of the ref-db; once the refs are updated by one thread, the other has an in-memory snapshot in its thread-local storage and therefore is in split-brain.
If you do not have the thread-local caching, the two threads share the same repository and the same in-memory RefDirectory instance, hence they are not in split-brain.
ap...@google.com <ap...@google.com> #17
Branch: master
commit d1b83778666008a990a049172fb292d1c0b4518b
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Tue Nov 07 20:36:24 2023
Highlight known-issues on Gerrit v3.8
Gerrit v3.8 has been shipped with two known critical issues:
- Change 360646 caused a radical change in the way Gerrit threads
handle the updates of the ref-database, leading to potential
corruption in multi-site setups and concurrent GCs
-
certain browsers that have been working fine until v3.7.x
Introduce the "Known issues" section, visually close to the
"Breaking changes" to warn the Gerrit administrator about the potential
breakages that he may encounter when migrating to Gerrit v3.8.x.
The Gerrit administrator can therefore make the necessary adjustments to
the environment and client browsers for preventing the known issues to
impact the users after the upgrade.
Bug:
Bug:
Change-Id: I83d7dc34563e381a8b6b7bb82ec820063cc0e16b
M pages/site/releases/
na...@linaro.org <na...@linaro.org> #18
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
I've tried for a few hours now and I can't get this to reproduce. I'm running stable-3.8 tip (3.8.2-72-ge579b38e51), I created a new test site using init --dev and accepted all the defaults, I have no plugins installed, I'm using a local disk, and the only jgit config is the one added by init (receive.autogc = false). I created a non-admin user, a new 'test-repo' project, and I granted Registered Users permission to Push to refs/heads/* in that repo. After running the commands you have above a few times I ended up with ~7300 commits on the master branch, all of which pushed successfully, and GC ran 475 times, also all successfully (at least according to HTTP response code and stdout, but see below). git fsck is still happy:
$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (790314/790314), done.
Checking connectivity: 81918, done.
dangling commit 9458e4e0a6d2bf5994b72fe1659c55cceae9eab9
Luca, am I doing something wrong?
I did find 61 StackOverflowError exceptions in the error_log from GC, all looking like:
[2023-11-15T15:26:29.802-08:00] [HTTP-984] WARN org.eclipse.jetty.server.HttpChannel : /a/projects/test-repo/gc
java.lang.StackOverflowError
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
and when I ran gc with progress output requested, it cuts off:
$ curl -n -X 'POST' --data-binary '{"show_progress":true}' --header "Content-Type: application/json" 'http://localhost:8080/a/projects/test-repo/gc'
collecting garbage for "test-repo":
Pack refs: 100% (2/2)
Counting objects: 81911
Finding sources: 100% (81911/81911)
Getting sizes: 100% (27327/27327)
Writing objects: 67% (54471/81911)
I'm guessing JGit doesn't like how many pack files have accumulated between the gc runs:
$ ls -l git/test-repo.git/objects/pack/ | grep '\.pack$' | wc -l
27620
but I don't see how that would be related to the per-thread ref cache. It still gives a StackOverflowError if I disable the cache and restart Gerrit.
lu...@gmail.com <lu...@gmail.com> #19
See the associated
na...@linaro.org <na...@linaro.org> #20
The
ap...@google.com <ap...@google.com> #21
Project: homepage
Branch: master
Author: Luca Milanesio <
Link:
Add references to
Add references to Issue 309098227 on v3.9, v3.10 and v3.11
Issue 309098227 was detected in v3.8 but never addressed,
therefore it is also impacting v3.9, v3.10 and v3.11.
Differently from all the other issues, the consequences of not
implementing the workaround can be severe, including repository
corruption, therefore it is worth mentioning it in the
"Known Issues" section of the release notes.
Change-Id: I9c9a920ccd51d48354e106d0bc70f3ef92be88eb
Files:
- M pages/site/releases/3.10.md
- M pages/site/releases/3.11.md
- M pages/site/releases/3.8.md
- M pages/site/releases/3.9.md
Hash: ca578bcbbeee3e7edc50c2eca8a23f3cc1c3fae7
Date: Tue Dec 24 10:43:50 2024
Description
What steps will reproduce the problem?
(Gerrit version: 3.8.2-5-g2cbf98e3e4)
Client-1 runs:
Client-2 runs:
What is the expected output?
Gerrit retries automatically (thanks to the retryHelper) and the concurrent operation eventually completes successfully or fails at the 2nd attempt. The underlying repository remains healthy.
What do you see instead?
Gerrit retries automatically and the failing thread retries continuously until the maximum number of retries is reached. In the worst-case scenarios (e.g. pushes with concurrent GCs) both operations fail and the underlying repository is corrupted.
Please provide any additional information below.
The Example 1 scenario happened on GerritHub.io on the 4th of Nov at 02:57:56. It looks like the second thread that was retrying was not able to see the ref-update that happened on the filesystem or on a different thread. The ref-update triggered was:
The retryer retried 15 times over 90s, always failing.
The REST API eventually failed with an HTTP status 503:
Example 2 was simulated locally and should be easy to reproduce with a vanilla default Gerrit setup.