Comments
lu...@gmail.com <lu...@gmail.com> #2
After two days of research, we have found the following:
- Different threads are not seeing each other's changes, failing to detect the ref-updates that happened on the filesystem
- This is caused by the DynamicRefDbRepository, which causes the two threads to hold two different refDatabase instances
See the relevant code at:

public class DynamicRefDbRepository extends FileRepository {
  ...
  @Override
  public RefDatabase getRefDatabase() {
    return refDatabaseSupplier.apply(path, super.getRefDatabase());
  }
}
Basically, even if the first thread has updated the refDB, the second thread is stuck on a different SnapshottingRefDirectory instance and will never see the updated refs unless an I/O exception or other error occurs.
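To make the split view concrete, here is a minimal, self-contained sketch (hypothetical stand-in classes and maps, not the actual Gerrit/JGit code) of how a per-thread snapshot behaves: Thread-2 keeps reading the snapshot it took at S0 even after the shared ref-db has moved on to S1.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class PerThreadSnapshotDemo {
  // The shared ref-db seen by all threads (stand-in for the on-disk refs)
  static final Map<String, String> refDb =
      new ConcurrentHashMap<>(Map.of("refs/heads/master", "S0"));

  // Per-thread snapshot, analogous to the per-request SnapshottingRefDirectory
  static final ThreadLocal<Map<String, String>> snapshot =
      ThreadLocal.withInitial(() -> Map.copyOf(refDb));

  public static void main(String[] args) throws Exception {
    CountDownLatch snapshotTaken = new CountDownLatch(1);
    CountDownLatch refUpdated = new CountDownLatch(1);

    Thread thread2 = new Thread(() -> {
      String before = snapshot.get().get("refs/heads/master"); // snapshot taken at S0
      snapshotTaken.countDown();
      try {
        refUpdated.await();
      } catch (InterruptedException e) {
        return;
      }
      String after = snapshot.get().get("refs/heads/master"); // still S0: the update is invisible
      System.out.println("Thread-2 sees " + before + " then " + after
          + ", while the shared ref-db has " + refDb.get("refs/heads/master"));
    });
    thread2.start();

    snapshotTaken.await();
    refDb.put("refs/heads/master", "S1"); // thread-1 updates master S0 => S1
    refUpdated.countDown();
    thread2.join();
  }
}

Running it prints that Thread-2 still sees S0 while the shared ref-db is already at S1, which is exactly the split-brain condition described above.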
The change that introduced this thread isolation is:
lu...@gmail.com <lu...@gmail.com> #3
The workaround is to disable the behaviour introduced by Change 360646 by setting: core.usePerRequestRefCache = false
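For reference, a sketch of how the workaround would look in gerrit.config, assuming the usual [core] section placement for core.* settings:

[core]
  usePerRequestRefCache = false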
lu...@gmail.com <lu...@gmail.com> #4
Documented the risk of core.usePerRequestRefCache in the Gerrit v3.8 release notes, and moved the change to the breaking-changes section, as it represents a radically different way of serving incoming requests compared to how it worked before.
ap...@google.com <ap...@google.com> #5
Branch: stable-3.8
commit 77aeeec6b57c78d5545d1517f0b63708ae0cac0d
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Tue Nov 07 19:44:48 2023
Disable the thread-local refs caching
The thread-local refs caching opens the risk of a split-brain
situation within the same JVM, because two threads are not
able to see the refs modifications made elsewhere in Gerrit.
The thread-local caching of the refDatabase was introduced in
Gerrit v3.8.0 with [1] and enabled by default. Although the caching
improves read performance, it can also cause data inconsistency
between threads.
[1]
Bug:
Change-Id: I967cc56f0cc080d832d6623700f30efdf6468380
M setup_local_env/configs/gerrit.config
lu...@gmail.com <lu...@gmail.com> #6
See below a more detailed example of what is happening:
Observed behaviour:
===================
Starting point: master branch: S0
Thread-1: Try#1 - (master: S0) … S0 => S1 … Success !
Thread-2: Try#1 - (master: S0) … S0 => S2 … Fail ! (because of JGit filesystem locking)
Try#2 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#3 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#4 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Try#5 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
…
Try#15 - (master: S0) … S0 => S2 … Fail ! (because the Global ref-db is out-of-sync)
Expected behaviour (core.usePerRequestRefCache = false)
==================
Starting point: master: S0
Thread-1: Try#1 - (master: S0) … S0 => S1 … Success !
Thread-2: Try#1 - (master: S0) … S0 => S2 … Fail ! (because of JGit filesystem locking)
Try#2 - (master: S1) Fail without retry ! (because the group already exists)
ma...@gmail.com <ma...@gmail.com> #10
Sounds like this caching should not be used on primary servers. Can it safely be used on replica servers ?
lu...@gmail.com <lu...@gmail.com> #11
I believe caching is absolutely safe on replicas or primaries without massive parallelism.
I will amend the release notes accordingly.
lu...@gmail.com <lu...@gmail.com> #12
Reporting @Nasser's feedback here:
I admit, we didn't test GC from within Gerrit, so there could be something bad there we didn't catch, but assuming other configuration is sane, that would definitely be a bug. I completely agree that disabling at least for your setup is a good idea until that's understood. I'll try to get some time to debug why GC is doing something bad, but at a quick glance, we should probably have SnapshottingRefDirectory override pack() and call invalidateSnapshot() before calling the super method.
(Luca) I did not see this change on the Gerrit stable-3.5 branch, are you guys running this code in production?
(Nasser) Yes, we backported this to our 3.5 branch that we have for feature backports that we don't think are appropriate for upstream stable branches. We've been running it there since Aug 19th and we had it in our stress testing since March without issues.
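To make the pack()/invalidateSnapshot() suggestion above concrete, here is a schematic sketch; the RefDirectoryLike/SnapshottingRefDirectoryLike types and the pack() signature are hypothetical stand-ins rather than the actual JGit declarations, and only the idea (invalidate the thread's snapshot before delegating to the super method) comes from the feedback above.

import java.io.IOException;
import java.util.List;

// Hypothetical stand-in for the base ref directory
class RefDirectoryLike {
  void pack(List<String> refsToPack) throws IOException {
    // ...move the given loose refs into packed-refs...
  }
}

// Hypothetical stand-in for the snapshotting wrapper
class SnapshottingRefDirectoryLike extends RefDirectoryLike {
  // Drop this thread's cached snapshot so the next read re-scans the refs
  void invalidateSnapshot() {
    // ...invalidate the in-memory snapshot held by this instance...
  }

  @Override
  void pack(List<String> refsToPack) throws IOException {
    invalidateSnapshot();   // refresh first, so GC does not pack refs from a stale snapshot
    super.pack(refsToPack); // then run the normal packing logic
  }
}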
lu...@gmail.com <lu...@gmail.com> #13
Reporting @Nasser's feedback here:
I looked at the JGit side more and I was wrong, I don't see any issues with the current code. What JGit configs do you have for your setup? Did you repro this on NFS or a local disk?
lu...@gmail.com <lu...@gmail.com> #14
The corruption happens with:
- Plain-vanilla Gerrit
- Local filesystem
- No JGit configs
It is a corruption happening during concurrency, so you need to leave it running for a few minutes. As I mentioned, after disabling the snapshotting refdb caching on the thread-local storage, the problem does not reproduce anymore.
na...@linaro.org <na...@linaro.org> #15
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
lu...@gmail.com <lu...@gmail.com> #16
For the case you had on GerritHub.io, what all modules/plugins are installed? It might help me understand where the bug is in the other case.
On multi-site setups, you need:
- global-refdb libModule
- one of the global-refdb implementations (on GerritHub we use zookeeper-refdb)
What happens isn't a bug IMHO, but just the consequence of concurrency: two threads have an in-memory copy of the ref-db; once the refs are updated by one thread, the other has an in-memory snapshot in its thread-local storage and therefore is in split-brain.
If you do not have the thread-local caching, the two threads share the same repository and the same in-memory RefDirectory instance, hence they are not in split-brain.
ap...@google.com <ap...@google.com> #17
Branch: master
commit d1b83778666008a990a049172fb292d1c0b4518b
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Tue Nov 07 20:36:24 2023
Highlight known-issues on Gerrit v3.8
Gerrit v3.8 has been shipped with two known critical issues:
- Change 360646 caused a radical change in the way Gerrit threads
handle the updates of the ref-database, leading to potential
corruption in multi-site setups and concurrent GCs
-
certain browsers that have been working fine until v3.7.x
Introduce the "Known issues" section, visually close to the
"Breaking changes" to warn the Gerrit administrator about the potential
breakages that he may encounter when migrating to Gerrit v3.8.x.
The Gerrit administrator can therefore make the necessary adjustments to
the environment and client browsers for preventing the known issues to
impact the users after the upgrade.
Bug:
Bug:
Change-Id: I83d7dc34563e381a8b6b7bb82ec820063cc0e16b
M pages/site/releases/
na...@linaro.org <na...@linaro.org> #18
Thanks for the details and repro instructions. I'll check out the simple setup GC case today and see if I can understand what's happening.
I've tried for a few hours now and I can't get this to reproduce. I'm running stable-3.8 tip (3.8.2-72-ge579b38e51), I created a new test site using init --dev and accepted all the defaults, I have no plugins installed, I'm using a local disk, and the only jgit config is the one added by init (receive.autogc = false). I created a non-admin user, a new 'test-repo' project, and I granted Registered Users permission to Push to refs/heads/* in that repo. After running the commands you have above a few times I ended up with ~7300 commits on the master branch, all of which pushed successfully, and GC ran 475 times, also all successfully (at least according to HTTP response code and stdout, but see below). git fsck is still happy:
$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (790314/790314), done.
Checking connectivity: 81918, done.
dangling commit 9458e4e0a6d2bf5994b72fe1659c55cceae9eab9
Luca, am I doing something wrong?
I did find 61 StackOverflowError exceptions in the error_log from GC, all looking like:
[2023-11-15T15:26:29.802-08:00] [HTTP-984] WARN org.eclipse.jetty.server.HttpChannel : /a/projects/test-repo/gc
java.lang.StackOverflowError
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1806)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writeBase(PackWriter.java:1855)
and when I ran gc with progress output requested, it cuts off:
$ curl -n -X 'POST' --data-binary '{"show_progress":true}' --header "Content-Type: application/json" 'http://localhost:8080/a/projects/test-repo/gc'
collecting garbage for "test-repo":
Pack refs: 100% (2/2)
Counting objects: 81911
Finding sources: 100% (81911/81911)
Getting sizes: 100% (27327/27327)
Writing objects: 67% (54471/81911)
I'm guessing JGit doesn't like how many pack files have accumulated between the gc runs:
$ ls -l git/test-repo.git/objects/pack/ | grep '\.pack$' | wc -l
27620
but I don't see how that would be related to the per-thread ref cache. It still gives a StackOverflowError if I disable the cache and restart Gerrit.
lu...@gmail.com <lu...@gmail.com> #19
See the associated
na...@linaro.org <na...@linaro.org> #20
The
ap...@google.com <ap...@google.com> #21
Project: homepage
Branch: master
Author: Luca Milanesio <
Link:
Add references to
Add references to Issue 309098227 on v3.9, v3.10 and v3.11
Issue 309098227 was detected in v3.8 but never addressed,
therefore it is also impacting v3.9, v3.10 and v3.11.
Differently from all the other issues, the consequences of not
implementing the workaround can be severe, including repository
corruption, therefore it is worth mentioning it in the
"Known Issues" section of the release notes.
Change-Id: I9c9a920ccd51d48354e106d0bc70f3ef92be88eb
Files:
- M pages/site/releases/3.10.md
- M pages/site/releases/3.11.md
- M pages/site/releases/3.8.md
- M pages/site/releases/3.9.md
Hash: ca578bcbbeee3e7edc50c2eca8a23f3cc1c3fae7
Date: Tue Dec 24 10:43:50 2024
Description
What steps will reproduce the problem?
(Gerrit version: 3.8.2-5-g2cbf98e3e4)
Client-1 runs:
Client-2 runs:
What is the expected output?
Gerrit retries automatically (thanks to the retryHelper) and the concurrent operation eventually completes successfully or fails at the 2nd attempt. The underlying repository remains healthy.
What do you see instead?
Gerrit retries automatically and the failing thread retries continuously until the maximum number of retries is reached. In the worst-case scenarios (e.g. pushes with concurrent GCs) both operations fail and the underlying repository is corrupted.
Please provide any additional information below.
The Example 1 scenario happened on GerritHub.io on the 4th of Nov at 02:57:56. It looks like the second thread that was retrying was not able to see the ref-update that happened on the filesystem or on a different thread. The ref-update triggered was:
The retryer retried 15 times over 90s, always failing.
The REST API eventually failed with an HTTP status 503:
Example 2 was simulated locally and should be easy to reproduce with a vanilla default Gerrit setup.