Status Update
Comments
lu...@gmail.com <lu...@gmail.com> #2
lu...@gmail.com <lu...@gmail.com> #3
lu...@gmail.com <lu...@gmail.com> #4
ap...@google.com <ap...@google.com> #5
lu...@gmail.com <lu...@gmail.com> #6
lu...@gmail.com <lu...@gmail.com> #7
[Deleted User] <[Deleted User]> #8
[Deleted User] <[Deleted User]> #9
ma...@gmail.com <ma...@gmail.com> #10
Sounds like this caching should not be used on primary servers. Can it safely be used on replica servers?
lu...@gmail.com <lu...@gmail.com> #11
I believe caching is absolutely safe on replicas, or on primaries without massive parallelism.
I will amend the release notes accordingly.
lu...@gmail.com <lu...@gmail.com> #12
Reporting @Nasser's feedback here:
I admit, we didn't test GC from within Gerrit, so there could be something bad there we didn't catch, but assuming other configuration is sane, that would definitely be a bug. I completely agree that disabling at least for your setup is a good idea until that's understood. I'll try to get some time to debug why GC is doing something bad, but at a quick glance, we should probably have SnapshottingRefDirectory override pack() and call invalidateSnapshot() before calling the super method.
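To illustrate the pattern described above (invalidate the cached snapshot before packing refs), here is a minimal self-contained sketch; the class and method bodies below are a hypothetical stand-in for what SnapshottingRefDirectory would do, not the actual JGit API:
import java.util.HashMap;
import java.util.Map;
// Hypothetical model of a snapshotting ref directory: reads are served from a
// cached snapshot, so any operation that rewrites refs on disk (such as pack)
// must drop the snapshot first, otherwise later reads return stale values.
class SnapshottingRefDir {
  private final Map<String, String> onDisk = new HashMap<>();
  private Map<String, String> snapshot;
  String exactRef(String name) {
    if (snapshot == null) {
      snapshot = Map.copyOf(onDisk); // take the snapshot lazily on first read
    }
    return snapshot.get(name);
  }
  void invalidateSnapshot() {
    snapshot = null; // the next read re-scans the underlying storage
  }
  void pack(Iterable<String> refsToPack) {
    invalidateSnapshot(); // the suggestion above: invalidate before packing
    // ... the real packing logic (super.pack() in JGit) would run here ...
  }
}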
(Luca) I did not see this change on the Gerrit stable-3.5 branch; are you guys running this code in production?
Yes, we backported this to our 3.5 branch that we keep for feature backports that we don't think are appropriate for upstream stable branches. We've been running it there since Aug 19th, and we had it in our stress testing since March without issues.
lu...@gmail.com <lu...@gmail.com> #13
Reporting @Nasser's feedback here:
I looked at the JGit side more and I was wrong, I don't see any issues with the current code. What JGit configs do you have for your setup? Did you repro this on NFS or a local disk?
lu...@gmail.com <lu...@gmail.com> #14
The corruption happens with:
- Plain-vanilla Gerrit
- Local filesystem
- No JGit configs
It is a corruption that happens under concurrency, so you need to leave it running for a few minutes. As I mentioned, after disabling the snapshotting refdb caching in the thread-local storage, the problem does not reproduce anymore.
na...@linaro.org <na...@linaro.org> #15
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
lu...@gmail.com <lu...@gmail.com> #16
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
On multi-site setups, you need:
- global-refdb libModule
- one of the global-refdb implementations (on GerritHub we use zookeeper-refdb)
What happens isn't a bug IMHO, but just a consequence of concurrency: two threads each have an in-memory copy of the ref-db; once the refs are updated by one thread, the other still has an in-memory snapshot in its thread-local storage and is therefore in split-brain.
If you do not have the thread-local caching, the two threads share the same repository and the same in-memory RefDirectory instance, hence they are not in split-brain.
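To make the split-brain concrete, here is a small self-contained sketch (plain Java with hypothetical names, not Gerrit/JGit code) of how a per-thread cached snapshot diverges as soon as another thread updates a ref:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
// Each thread lazily caches its own snapshot of the ref-db in thread-local
// storage and never refreshes it, so a reader keeps seeing the old value
// after another thread has updated the ref: the two threads are in split-brain.
public class SplitBrainDemo {
  static final Map<String, String> refDb = new ConcurrentHashMap<>();
  static final ThreadLocal<Map<String, String>> snapshot =
      ThreadLocal.withInitial(() -> Map.copyOf(refDb));
  public static void main(String[] args) throws Exception {
    refDb.put("refs/heads/master", "aaaa");
    CountDownLatch snapshotTaken = new CountDownLatch(1);
    CountDownLatch refUpdated = new CountDownLatch(1);
    Thread reader = new Thread(() -> {
      // First read takes the thread-local snapshot: sees "aaaa".
      System.out.println("before update: " + snapshot.get().get("refs/heads/master"));
      snapshotTaken.countDown();
      try {
        refUpdated.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Still "aaaa": the snapshot was never invalidated, so this thread is
      // blind to the update made by the other thread (split-brain).
      System.out.println("after update:  " + snapshot.get().get("refs/heads/master"));
    });
    reader.start();
    snapshotTaken.await();
    refDb.put("refs/heads/master", "bbbb"); // the "other thread" updates the ref
    refUpdated.countDown();
    reader.join();
  }
}
Without the thread-local snapshot, both threads would read through the same shared view and see "bbbb".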
ap...@google.com <ap...@google.com> #17
Branch: master
commit d1b83778666008a990a049172fb292d1c0b4518b
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Tue Nov 07 20:36:24 2023
Highlight known-issues on Gerrit v3.8
Gerrit v3.8 has been shipped with two known critical issues:
- Change 360646 caused a radical change in the way Gerrit threads
handle the updates of the ref-database, leading to potential
corruption in multi-site setups and with concurrent GCs
- certain browsers that have been working fine until v3.7.x
Introduce the "Known issues" section, visually close to the
"Breaking changes" to warn the Gerrit administrator about the potential
breakages that he may encounter when migrating to Gerrit v3.8.x.
The Gerrit administrator can therefore make the necessary adjustments to
the environment and client browsers for preventing the known issues to
impact the users after the upgrade.
Bug:
Bug:
Change-Id: I83d7dc34563e381a8b6b7bb82ec820063cc0e16b
M pages/site/releases/
na...@linaro.org <na...@linaro.org> #18
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
I've tried for a few hours now and I can't get this to reproduce. I'm running stable-3.8 tip (3.8.2-72-ge579b38e51), I created a new test site using init --dev and accepted all the defaults, I have no plugins installed, I'm using a local disk, and the only jgit config is the one added by init, receive.autogc = false. I created a non-admin user, a new 'test-repo' project, and I granted Registered Users permission to Push to refs/heads/* in that repo. After running the commands you have above a few times I ended up with ~7300 commits on the master branch, all of which pushed successfully, and GC ran 475 times, also all successfully (at least according to HTTP response code and stdout, but see below). git fsck is still happy:
$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (790314/790314), done.
Checking connectivity: 81918, done.
dangling commit 9458e4e0a6d2bf5994b72fe1659c55cceae9eab9
Luca, am I doing something wrong?
I did find 61 StackOverflowError exceptions in the error_log from GC, all looking like:
[2023-11-15T15:26:29.802-08:00] [HTTP-984] WARN org.eclipse.jetty.server.HttpChannel : /a/projects/test-repo/gc
java.lang.StackOverflowError
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
and when I ran gc with progress output requested, it cuts off:
$ curl -n -X 'POST' --data-binary '{"show_progress":true}' --header "Content-Type: application/json" 'http://localhost:8080/a/projects/test-repo/gc'
collecting garbage for "test-repo":
Pack refs: 100% (2/2)
Counting objects: 81911
Finding sources: 100% (81911/81911)
Getting sizes: 100% (27327/27327)
Writing objects: 67% (54471/81911)
I'm guessing JGit doesn't like how many pack files have accumulated between the gc runs:
$ ls -l git/test-repo.git/objects/pack/ | grep '\.pack$' | wc -l
27620
but I don't see how that would be related to the per-thread ref cache. It still gives a StackOverflowError if I disable the cache and restart Gerrit.
lu...@gmail.com <lu...@gmail.com> #19
See the associated
na...@linaro.org <na...@linaro.org> #20
The
Description
What steps will reproduce the problem?
(Gerrit version: 3.8.2-5-g2cbf98e3e4)
Client-1 runs:
Client-2 runs:
What is the expected output?
Gerrit retries automatically (thanks to the retryHelper) and the concurrent operation eventually completes successfully or fails at the 2nd attempt. The underlying repository remains healthy.
What do you see instead?
Gerrit retries automatically and the failing thread will retry continuously until the maximum number of retries is reached. In the worst-case scenario (e.g. pushes with concurrent GCs), both operations fail and the underlying repository is corrupted.
Please provide any additional information below.
The Example 1 scenario happened on GerritHub.io on the 4th of Nov at 02:57:56. It looks like the second thread that was retrying was not able to see the ref-update that happened on the filesystem or on a different thread. The ref-update triggered was:
The retryHelper retried 15 times over 90s, always failing.
The REST-API eventually failed with an HTTP status 503:
The Example 2 was simulated locally and should be easy to reproduce with a vanilla default Gerrit setup.