Comments
lu...@gmail.com <lu...@gmail.com> #2
This is impacting the following versions: v2.16.x (x >= 22), v3.0.x (x >= 11), v3.1.x (x >= 10), v3.2.x (x >= 10), v3.3.*, v3.4.*, v3.5.*, v3.6.*, v3.7.*, v3.8.*, v3.9.*
lu...@gmail.com <lu...@gmail.com> #3
The issue was caused by StorageException. The way the LuceneChangeIndex implements the replace operation is:
@Override
public void replace(ChangeData cd) {
  Term id = LuceneChangeIndex.idTerm(cd);
  // toDocument is essentially static and doesn't depend on the specific
  // sub-index, so just pick one.
  Document doc = openIndex.toDocument(cd);
  try {
    if (cd.change().isNew()) {
      Futures.allAsList(closedIndex.delete(id), openIndex.replace(id, doc)).get();
    } else {
      Futures.allAsList(openIndex.delete(id), closedIndex.replace(id, doc)).get();
    }
  } catch (ExecutionException | InterruptedException e) {
    throw new StorageException(e);
  }
}
The openIndex.toDocument(cd) call may either succeed or fail. If it fails, the delete(id) is not executed and the reindexing fails; the change index stays in its previous state. If it succeeds, the document is first removed and then added again.
However, the try/catch (StorageException e) introduced in Schema.buildFields() swallows the whole exception and just logs the first failure. As a result, it returns a completely blank document, which does not even contain its primary key or anything else.
Before
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  return fields.values().stream()
      .map(f -> fieldValues(obj, f, skipFields))
      .filter(Objects::nonNull)
      .collect(toImmutableList());
}
After
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  try {
    return fields.values().stream()
        .map(f -> fieldValues(obj, f, skipFields))
        .filter(Objects::nonNull)
        .collect(toImmutableList());
  } catch (StorageException e) {
    return ImmutableList.of();
  }
}
There are multiple issues with the introduced catch clause:
- It aborts the full evaluation of the field values: some of the fields may have been valid anyway (including the Document key), but the catch aborts the whole computation.
- It also clears out fields that had already been computed and had a value.
- It does not propagate the exception upstream and therefore does not block the Lucene replace operation, causing the permanent loss of the change document in the index.
na...@linaro.org <na...@linaro.org> #4
I agree with the analysis that the issues were introduced in Change 270450. It's not yet obvious to me how we fix it.
Busy Gerrit servers have a number of concurrent operations on the repository that cause frequent race conditions, for example Git GC, pruning, repacking, and removing loose refs.
I think these operations all have retries, so I wouldn't expect them to completely fail to build a field and then end up in the StorageException case. Is that a normal thing on your servers?
It does not propagate the exception upstream and therefore does not block the Lucene replace operation, causing the permanent loss of the change document in the index.
Is it worse to have no change doc in the index vs have a change doc with inaccurate/incomplete data in the index? They both seem bad to me. Maybe it's worth adding a safety check to replace() that confirms toDocument() returned something valid-looking?
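A hypothetical sketch of such a safety check, for illustration only (the fix that was eventually merged addressed Schema.buildFields() instead, see #7; using id.field() as the primary-key field name here is an assumption):
@Override
public void replace(ChangeData cd) {
  Term id = LuceneChangeIndex.idTerm(cd);
  Document doc = openIndex.toDocument(cd);
  // Hypothetical guard: refuse to touch the index if the document lost
  // its primary key, instead of deleting and re-adding a blank entry.
  if (doc.getField(id.field()) == null) {
    throw new StorageException("Blank document for change " + cd.getId());
  }
  // ... continue with the delete/replace futures as in #3 ...
}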
lu...@gmail.com <lu...@gmail.com> #5
I agree with the analysis that the issues were introduced in Change 270450. It's not yet obvious to me how we fix it.
I will attempt a mitigation in a few minutes and will test using the instructions to reproduce the issue.
I think these operations all have retries
Sure; however, the permanent removal of the document from the index won't allow any retry: the change becomes invisible and no retry would recover it. StorageException happens on busy servers hundreds of times per day. Busy means hundreds of thousands of ref updates per day, so a few hundred failures out of hundreds of thousands of updates is still on the order of 0.1%. However, the consequences are permanent, so even if the percentage is low, it is still a daily struggle.
Is it worse to have no change doc in the index vs have a change doc with inaccurate/incomplete data in the index?
Well, rather than choosing the lesser of two evils, I would like to get the good choice: I'd like either the indexing to be accurate or the indexing to fail ... but having a blank document and permanently removing the record isn't a good solution IMHO.
Maybe it's worth adding a safety check to replace() that confirms toDocument() returned something valid-looking?
Something like that, yes. Let me draft a fix and send it for review in a few minutes.
ap...@google.com <ap...@google.com> #7
Branch: stable-3.2
commit 081a9914a3b75e33d94f8d176882c044879895b0
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Thu Nov 30 23:23:35 2023
Make the indexing operation fail upon StorageException(s)
Change 270450 caused the blanking of the Lucene document
upon reindexing if any field caused a StorageException.
Whilst the overall intention was good, the implementation
caused the Lucene index replace operation to continue with
a Document without any fields instead of making the whole
operation fail.
StorageExceptions are thrown when the underlying storage,
be it a filesystem or anything else, returns errors.
Whether the failure is permanent or temporary (e.g. concurrent
GCs, repacking or pruning may cause sporadic StorageExceptions),
returning a blank Lucene document was incorrect because, instead
of failing the operation, it caused the change entry to be
completely removed from the index.
Let the StorageException fail the indexing operation, so that
existing entries may continue to exist, allowing the caller
to retry the operation.
The previous implementation, returning an empty Document, did
not allow any retry, because once the change entry was removed
from the index, it could not be discovered and reindexed anymore,
for example by a `gerrit index changes <changenum>`.
Tested manually, applying a random StorageException thrown
during the fetching of the ChangeField extensions:
public static Set<String> getExtensions(ChangeData cd) {
  if (new Random().nextBoolean()) {
    throw new StorageException("Simulated storage exception");
  }
  return extensions(cd).collect(toSet());
}
Before this change, every time the indexing of a change threw a
StorageException, the change disappeared from the index.
Eventually, all changes would be eliminated, and only an off-line
reindex or the reindexing of all changes in the project would
allow recovering them.
After this change, some of the indexing operations succeed and
others fail. However, retrying the reindexing operation allows
all of them to be reindexed.
Even if the above test case looks strange at first sight,
it simulates a real-life scenario where a low percentage
of indexing operations (0.1%) may fail because of sporadic
StorageExceptions. Before this change, some index entries
went missing on a daily basis (5 to 10 changes per day),
whilst after this change all indexing operations can be
retried and do not result in any index entry loss.
Bug:
Release-Notes: Fail the change reindex operation upon StorageException(s)
Change-Id: Ia121f47f7a68c290849a22dea657804743a26b0d
M java/com/google/gerrit/index/Schema.java
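The diff itself is not quoted in this thread; judging from the file touched and the commit message, the fix presumably lets buildFields() propagate the StorageException again, roughly the "Before" shape from #3. A sketch under that assumption, not the verbatim patch:
// Likely shape of the fix in Schema.buildFields(), inferred from the
// commit message above; not the verbatim merged patch.
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  // No catch clause: a StorageException thrown by fieldValues() propagates
  // to the caller, failing the Lucene replace operation instead of
  // silently producing a blank Document.
  return fields.values().stream()
      .map(f -> fieldValues(obj, f, skipFields))
      .filter(Objects::nonNull)
      .collect(toImmutableList());
}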
Description
Context
Busy Gerrit servers have a number of concurrent operations on the repository that cause frequent race conditions, for example Git GC, pruning, repacking, and removing loose refs.
How to reproduce the issue
What steps will reproduce the problem?
Introduce one random StorageException in only one of the Change fields, for example by adding it in ChangeField.java (see the snippet below). Then reindex one change, say number 123, multiple times, using the gerrit index changes 123 command.
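The snippet itself did not survive in this thread; presumably it is the same random-failure hook quoted in the commit message in #7:
// Simulated sporadic storage failure in one Change field (from #7).
public static Set<String> getExtensions(ChangeData cd) {
  if (new Random().nextBoolean()) {
    throw new StorageException("Simulated storage exception");
  }
  return extensions(cd).collect(toSet());
}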
What is the expected output?
Some of the index changes commands fail and others succeed.
What do you see instead?
After the first failure, the change index entry is completely lost and the change cannot be indexed anymore. It becomes effectively invisible to Gerrit.
See for example a simple test execution below:
In the logs I see:
Please provide any additional information below.
The loss of a change entry in the Lucene index is very serious business, because Gerrit relies on the index for many operations, including:
Storage exceptions may happen in JGit whenever a race condition occurs, for instance during the packing of loose refs. See an example below:
The temporary failure of one REST API because of a transient StorageException is not a critical issue. The permanent failure of a change index entry is a critical issue and can have serious consequences. What is worse is that it is non-recoverable: only an off-line reindex or the reindexing of all changes in a project can re-create the missing entry.