Status Update
Comments
lu...@gmail.com <lu...@gmail.com> #2
lu...@gmail.com <lu...@gmail.com> #3
hi...@google.com <hi...@google.com> #4
lu...@gmail.com <lu...@gmail.com> #5
hi...@google.com <hi...@google.com> #6
ge...@gmail.com <ge...@gmail.com> #7
lu...@gmail.com <lu...@gmail.com> #8
hi...@google.com <hi...@google.com> #9
SGTM, sure happy to split the work. I'll upstream the repo cache, but it might take some time because I am about to go on a 3wk leave :-)
> The same API in v3.3 compared to v3.1 is 2,600 times slower
Thank you, Luca! This was the evidence I was looking for :-)
lu...@gmail.com <lu...@gmail.com> #11
lu...@gmail.com <lu...@gmail.com> #12
/a/changes/NNNN/detail?o=CURRENT_REVISION&o=COMMIT_FOOTERS
Where NNNN is a numeric change number, randomly chosen across all the changes.
The tests were made on v3.3.6 (the release where I saw first the issue) on a box with:
48 cores of Intel Xeon Gold 5118 CPU @2.30GHz
512GB of RAM
Repositories are on NFS mounted with the following options:
rw,relatime,vers=3,rsize=65536,wsize=65536,namelen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=tcp,local_lock=none
The test case is 40 concurrent users calling the REST-API for 2 minutes. I am using the 'wrk' utility executed from a different host on the same network, so that Gerrit and the client do not use the same CPU and they are effectively going over a real network.
Example:
wrk -c 40 -d 120s -t 40 <url>
The network latency is around 100 micro-seconds.
v3.3.6
*******
trustFolderStat = true
================
Latency: avg=796ms, stdev=418ms, max=2s
Throughput: avg=1.19tps, stdev=1.64tps, max=10tps
trustFolderStat = false
================
Latency: avg=816ms, stdev=480ms, max=2s
Throughput: avg=1.11tps, stdev=1.84tps, max=10tps
v3.3.6 (with the fix 316785, cherry-picked to stable-3.3)
**************************************************************
trustFolderStat = true
================
Latency: avg=278ms, stdev=344ms, max=1.9s
Throughput: avg=7.58tps, stdev=6.2tps, max=40tps
trustFolderStat = false
================
Latency: avg=227ms, stdev=282ms, max=2s
Throughput: avg=9.16tps, stdev=5.72tps, max=40tps
The *most alarming* thing is that the v3.3.6 (without the fix) during the 2 mins of test occupied most of the system CPU, peaking to system load of 30.
With the fix 316785, the peak system load reaches a maximum of 5.
--- * ---
In a nutshell, from an E2E's perspective, there is:
a) 8x slower TPS
b) 6x high system load
lu...@gmail.com <lu...@gmail.com> #13
lu...@gmail.com <lu...@gmail.com> #14
> accountDetailsCache.get(
> CachedAccountDetails.Key.create(id, userRef.getObjectId()))
> Open a repository and fetching its refs is an expensive operation, in particular at:
> FileKey loc = FileKey.lenient(path.resolve(name.get()).toFile(), FS.DETECTED);
> The 'lenient' function makes a series of checks if a directory is actually a repository or not: all of those checks are expensive.
lu...@gmail.com <lu...@gmail.com> #15
na...@codeaurora.org <na...@codeaurora.org> #16
lu...@gmail.com <lu...@gmail.com> #17
I don't have Gerrit v3.1 on that system, unfortunately :-(
na...@codeaurora.org <na...@codeaurora.org> #18
lu...@gmail.com <lu...@gmail.com> #19
v3.1.16: 0.000192 ms
v3.3.6: 0.5ms (on SSD) without concurrency
As mentioned before, on the *beefy* system with NFS I don't have v3.1 so I can't run the E2E ones.
lu...@gmail.com <lu...@gmail.com> #20
na...@codeaurora.org <na...@codeaurora.org> #21
lu...@gmail.com <lu...@gmail.com> #22
The REST-API /a/changes/NNNN/detail is just one of the use of that API.
/** Caches important (but small) account state to avoid database hits. */
public interface AccountCache {
}
P.S. AccountCache is used through the entire Gerrit code-base: I believe there must be also other consequences of this 2600x slower performance regression.
ha...@google.com <ha...@google.com> #23
>> Open a repository and fetching its refs is an expensive operation, in particular at:
>> FileKey loc = FileKey.lenient(path.resolve(name.get()).toFile(), FS.DETECTED);
wouldn't it be better to put a benchmark into JGit and try to optimize there? There is a directory there,
./org.eclipse.jgit.benchmarks/src/org/eclipse/jgit/benchmarks
alternatively, you could have Gerrit's RepositoryManager cache the Project.NameKey => FileKey association.
lu...@gmail.com <lu...@gmail.com> #24
lu...@gmail.com <lu...@gmail.com> #25
However, I believe the problem still exists: a Java API that is called AccountCache should do caching and not going straight to JGit (and thus the filesystem) to look for the account info.
ha...@google.com <ha...@google.com> #26
earlier you said
>> Open a repository and fetching its refs is an expensive operation, in particular at:
>> FileKey loc = FileKey.lenient(path.resolve(name.get()).toFile(), FS.DETECTED);
suggesting that this was the dominant part.
Could you do some more comprehensive testing? Where is the 80% of the time spent? Can you find bottlenecks in JGit?
I appreciate that you want things to go faster, but using the ref database has the big advantage that there are virtually no correctness concerns/race conditions. I think we should first understand if JGit is actually as fast as possible before embarking on Gerrit specific solutions.
lu...@gmail.com <lu...@gmail.com> #27
Sure, that can be part of the JGit project though. By keeping JGit performance as a constant, Gerrit v3.2 is 2600x times slower than Gerrit v3.1 on the AccountCache though.
> I appreciate that you want things to go faster
That isn't my concern: I would like them to go as slow as in Gerrit v3.1, and not 2600x times slower as in Gerrit v3.2.
> using the ref database has the big advantage that there are virtually no correctness concerns/race conditions
I agree: I would love to have ref-database that is able to lookup ref names to SHA1s in nano-seconds. That require a systematic performance tuning on the JGit project, and we should address that in that code-base, not in Gerrit.
> JGit is actually as fast as possible before embarking on Gerrit specific solutions
This bug is about a regression in Gerrit, not JGit: that's why I raised it here and not on the JGit project.
From an E2E's perspective, that causes the change details REST-API call to be 4x times slower and to use 6x high CPU load.
ha...@google.com <ha...@google.com> #28
I'm not saying it is wrong to raise the bug here. I am saying that the solution to the problem may lie in improvements to JGit, and that it would be helpful to have data about what JGit is actually spending its time on doing.
If you don't want to collect this data, we can try to make time for it.
lu...@gmail.com <lu...@gmail.com> #29
Let me rephrase what I said as "I agree: I would love to have ref-database that is able to lookup ref names to SHA1s in nano-seconds. That require a systematic performance tuning on the JGit project, and we should address that in that code-base, not in Gerrit."
Rephrasing: "Yes, I am more than happy to raise this on the JGit project and work with the JGit team in profiling, assessing, refactoring and making JGit ref lookups as quick as an in-memory cache and looking them up in nano-seconds" :-)
BUT, still, that isn't something we should NOT do in the Gerrit project and issue-tracker: that belongs to the JGit project.
The discussion should continue in JGit, not in this ticket.
ha...@google.com <ha...@google.com> #30
>The discussion should continue in JGit, not in this ticket.
can you open a bug against JGit to keep track of this, then ?
lu...@gmail.com <lu...@gmail.com> #31
In the meantime, we should look how to fix this in Gerrit v5.2, unless we manage to get JGit refs lookups executed in nano-seconds on JGit's stable-5.9, which it would make me happy :-)
lu...@gmail.com <lu...@gmail.com> #32
I don't believe there isn't any bug and it actually works as designed.
The problem in Gerrit remains though: v3.2 (NOT v5.2, apologies) is 2600x times slower for the AccountCache API than v3.1.
lu...@gmail.com <lu...@gmail.com> #33
lu...@gmail.com <lu...@gmail.com> #34
Hence, relying on JGit refs caching in AccountCacheImpl was a design decision that was based on incorrect assumptions: keeping the code more similar to what it was in Gerrit v3.1 (using Guava/Caffeine caches) would have be best.
@Han-Wen @Patrick do you have other ideas / hints on other possible solutions for Gerrit v3.2?
ha...@google.com <ha...@google.com> #35
Can you specify the precise boundary conditions you are after? It's a bit hard to tell.
* If you're putting the repo on shared NFS, the implication to me is that there are other writers. Are there?
* Can you say why you need nanosecond performance? I think you care about end-to-end performance (latency, CPU load) of a single REST API call. Can you give a ballpark number of what it is today for you, and where you need it to be? It's great if we can have something that returns in 10ns, but I doubt any of your clients can tell the difference between 10ns and 10us.
Can
lu...@gmail.com <lu...@gmail.com> #36
> Can you specify the precise boundary conditions you are after? It's a bit hard to tell.
The API account cache is for accessing an in-memory cache of the accounts data, without having to fetch that from the underlying storage. That's the purpose of caching: allowing a shared in-memory copy that can be accessed in memory in nano-seconds speed.
We have lots of caches in Gerrit for this reason: this is just one of them.
> * If you're putting the repo on shared NFS, the implication to me is that there are other writers. Are there?
All cache consistency is already managed by the HA plugin in case of NFS, not just for AccountCache but also for all other caches.
If you don't use the HA plugin, then you can still set the cache size to zero and you'll read from the underlying storage for every call.
> * Can you say why you need nanosecond performance? I think you care about end-to-end performance (latency, CPU load) of a single REST API call.
AccountCache is a Java API, used by Gerrit in many places and also by plugins. From a Java perspective is E2E.
The implications of having a cache that does not cache can be seen in many REST-API, one of them is mentioned in the ticket.
The implication in the REST-API mentioned is a performance degradation of 4x times.
> Can you give a ballpark number of what it is today for you, and where you need it to be?
I've already provided the exact numbers, let me copy them for you here again:
v3.1: 0.000192 ms
v3.3: 0.5ms (on SSD) without concurrency
I am looking at recovering the initial performance figures of 0.2 micro-seconds for this API.
> It's great if we can have something that returns in 10ns, but I doubt any of your clients can tell the difference between 10ns and 10us.
It all depends on many times the API is called and how busy is the server.
For a small-sized setup and a medium-sized repo, I agree with you: a end-user won't notice any difference.
For a large-sized setup and a large-sized repo, you may bring the entire service down for overloading of v3.1 compared to v3.2/v3.3.
ge...@gmail.com <ge...@gmail.com> #37
TLDR; cache (change [2]) offers `2-4` times better performance for account-heavy operations using the same amount of resources and we are talking about really simple setup where Gerrit runs on local SSD and NFS and trustFolderStat is not even involved.
[1]
[2]
ge...@gmail.com <ge...@gmail.com> #38
TLDR; Gerrit 3.1.16 performs account-heavy operations 2 (4 when trustFolderStat = false) times faster than vanilla Gerrit 3.2.12.
Gerrit 3.2.12 with `RFC: Cache All-Users object ids by ref` performs account-heavy operations 2-4 (7-8 when trustFolderStat = false) times faster than vanilla Gerrit 3.2.12.
[1]
ma...@gmail.com <ma...@gmail.com> #39
I also adapted the benchmark Luca implemented
and ran it on my laptop, here the results on a MacBook 2.9 GHz Quad-Core Intel Core i7, SSD, filesystem APFS
using doGc=true and trustFolderStat=true:
Benchmark (doGc) (numBranches) (trustFolderStat) Mode Cnt Score Error Units
GetRefsBenchmark.testGetExactRef true 100 true avgt 2 80,186 us/op
GetRefsBenchmark.testGetExactRef true 2500 true avgt 2 85,664 us/op
GetRefsBenchmark.testGetExactRef true 10000 true avgt 2 93,523 us/op
GetRefsBenchmark.testGetExactRefCached true 100 true avgt 2 0,040 us/op
GetRefsBenchmark.testGetExactRefCached true 2500 true avgt 2 0,043 us/op
GetRefsBenchmark.testGetExactRefCached true 10000 true avgt 2 0,041 us/op
GetRefsBenchmark.testGetRefsByPrefix true 100 true avgt 2 43,639 us/op
GetRefsBenchmark.testGetRefsByPrefix true 2500 true avgt 2 50,517 us/op
GetRefsBenchmark.testGetRefsByPrefix true 10000 true avgt 2 52,297 us/op
GetRefsBenchmark.testGetRefsByPrefixCached true 100 true avgt 2 0,120 us/op
GetRefsBenchmark.testGetRefsByPrefixCached true 2500 true avgt 2 0,139 us/op
GetRefsBenchmark.testGetRefsByPrefixCached true 10000 true avgt 2 0,234 us/op
This means in the tested range of 100-10k refs with the cache
- getExactRef is faster by a factor of 2000
- getRefsByPrefix is faster by a factor of 200-380
Updating the cache is implemented in a simple brute-force way:
the cache listens on RefChangedEvents and reloads all refs on each event from the RefDatabase wrapped by the cache.
lu...@gmail.com <lu...@gmail.com> #40
The JGit-level fix has been vetoed for stable-3.2 I believe: do we have other options?
lu...@gmail.com <lu...@gmail.com> #41
hi...@google.com <hi...@google.com> #42
gi...@appspot.gserviceaccount.com <gi...@appspot.gserviceaccount.com> #43
commit a02af610c8ec879c79259a5d4b7a185a22311cd1
Author: Jacek Centkowski <geminica.programs@gmail.com>
Date: Mon Dec 20 06:50:28 2021
Detect DelegateRepository in GarbageCollection operation
When GarbageCollectCommand performs GC it expects that instance of the
repository that is passed to it is either FileRepository or
DfsRepository and throws UnsupportedOperationException otherwise.
Detect if DelegateRepository is passed as parameter and make sure that
check is performed on a delegated repository instead.
Bug:
Change-Id: I8a862ac852f5f98c09662a41234d40c26e944804
[add]
[modify]
[modify]
gi...@appspot.gserviceaccount.com <gi...@appspot.gserviceaccount.com> #44
commit db31bce0473f76106f2369a9baf06ef4efa6f58b
Author: Jacek Centkowski <geminica.programs@gmail.com>
Date: Mon Dec 20 06:50:28 2021
Detect DelegateRepository in GarbageCollection operation
When GarbageCollectCommand performs GC it expects that instance of the
repository that is passed to it is either FileRepository or
DfsRepository and throws UnsupportedOperationException otherwise.
Detect if DelegateRepository is passed as parameter and make sure that
check is performed on a delegated repository instead.
Bug:
Release-Notes: skip
Change-Id: I8a862ac852f5f98c09662a41234d40c26e944804
(cherry picked from commit a02af610c8ec879c79259a5d4b7a185a22311cd1)
[add]
[modify]
[modify]
ma...@gmail.com <ma...@gmail.com> #45
Tests are now passing :-)
I still need to rework the benchmarks
- the implementation provides a write-through cache which is updated incrementally when RefUpdate, BatchRefUpdate, RefRename occur.
- the cache is enabled if core.refCache=true and extensions.refStorage is unset or set to "refdir"
- the cache is caching all refs of the repos configured to use it
- the cache doesn't recognize if refs are updated behind JGit's back e.g. by another process
- FileRepository exposes the cache so that an application using JGit can trigger a reload of the cache if necessary
- tests show that the cache consumes 200-600 bytes per cached ref
ge...@gmail.com <ge...@gmail.com> #46
na...@linaro.org <na...@linaro.org> #47
I believe this issue is addressed in Gerrit v3.8+ with the new core.usePerRequestRefCache
setting added in
lu...@gmail.com <lu...@gmail.com> #48
I believe this issue is addressed in Gerrit v3.8+ with the new core.usePerRequestRefCache setting added in 360646: Add a new core.usePerRequestRefCache setting since that adds a cache like the one Patrick initially described (caches open repositories for the duration of a request). While that does have some open issues (
https://issues.gerritcodereview.com/issues/309098227 andhttps://issues.gerritcodereview.com/issues/324314565 ), hopefully those are resolved by recent fixes/features in JGit.
I saw your JGit fixes, good stuff. Fingers crossed, the issue is resolved and we could start caching the refs on a per-thread basis. However, it would still require to fetch the ref from disk for a cache lookup, which would cause latency still ... but a lot less than before for sure !
Description
*************************************************************************
*** !!!! THIS BUG TRACKER IS FOR GERRIT CODE REVIEW !!!!
*** Do not submit bugs for chrome/android and issues with your company's
*** Gerrit setup here. Those issues belong in different issue trackers.
*************************************************************************
What steps will reproduce the problem?
1. Get the change details 10s of times (e.g. /a/changes/NNNN/detail)
What is the expected output?
For every account involved in the change NNNN, the All-Users.git repository is accessed only once, if has not been modified.
What do you see instead?
The repository All-Users.git is opened every single time a change details is requested.
Please provide any additional information below.
The culprit of the problem is in the way AccountCacheImpl is implemented.
The cache contains the objectId of the account in the cache key, which is smart for automatically handling the eviction but does not resolve the problem of avoiding the access to the underlying repository.
accountDetailsCache.get(
CachedAccountDetails.Key.create(id, userRef.getObjectId()))
Open a repository and fetching its refs is an expensive operation, in particular at:
The 'lenient' function makes a series of checks if a directory is actually a repository or not: all of those checks are expensive.