Comments
lu...@gmail.com <lu...@gmail.com> #2
This is impacting the following versions: v2.16.x (x >= 22), v3.0.x (x >= 11), v3.1.x (x >= 10), v3.2.x (x >= 10), v3.3.*, v3.4.*, v3.5.*, v3.6.*, v3.7.*, v3.8.*, v3.9.*
lu...@gmail.com <lu...@gmail.com> #3
The issue was caused by StorageException. The way the LuceneChangeIndex implements the replace operation is:
@Override
public void replace(ChangeData cd) {
  Term id = LuceneChangeIndex.idTerm(cd);
  // toDocument is essentially static and doesn't depend on the specific
  // sub-index, so just pick one.
  Document doc = openIndex.toDocument(cd);
  try {
    if (cd.change().isNew()) {
      Futures.allAsList(closedIndex.delete(id), openIndex.replace(id, doc)).get();
    } else {
      Futures.allAsList(openIndex.delete(id), closedIndex.replace(id, doc)).get();
    }
  } catch (ExecutionException | InterruptedException e) {
    throw new StorageException(e);
  }
}
The openIndex.toDocument(cd) call may either succeed or fail. If it fails, the delete(id) is not executed and the reindexing fails; the change index stays in its previous state. If it succeeds, the document is first removed and then added again.
However, the try/catch (StorageException e) introduced in Schema.buildFields() swallows the whole exception and just logs the first failure. As a result, it returns a completely blank document, which does not even contain its primary key or anything else.
Before
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  return fields.values().stream()
      .map(f -> fieldValues(obj, f, skipFields))
      .filter(Objects::nonNull)
      .collect(toImmutableList());
}
After
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  try {
    return fields.values().stream()
        .map(f -> fieldValues(obj, f, skipFields))
        .filter(Objects::nonNull)
        .collect(toImmutableList());
  } catch (StorageException e) {
    return ImmutableList.of();
  }
}
There are multiple issues with the introduced catch clause:
- It aborts the full evaluation of the field values: some of the fields may have been valid anyway (including the Document key), but the catch aborts the whole computation.
- It also clears out fields that had already been computed and had a value.
- It does not propagate the exception upstream and therefore does not block the Lucene replace operation, causing the permanent loss of the change document in the index.
na...@linaro.org <na...@linaro.org> #4
I agree with the analysis that the issues were introduced in Change 270450. It's not yet obvious to me how we fix it.
Busy Gerrit servers have a number of concurrent operations on the repository that cause frequent race conditions, for example Git GC, pruning, repacking, and removing loose refs.
I think these operations all have retries, so I wouldn't expect them to completely fail to build a field and then end up in the StorageException case. Is that a normal thing on your servers?
It does not propagate the exception upstream and therefore does not block the Lucene replace operation, causing the permanent loss of the change document in the index.
Is it worse to have no change doc in the index vs have a change doc with inaccurate/incomplete data in the index? They both seem bad to me. Maybe it's worth adding a safety check to replace() that confirms toDocument() returned something valid-looking?
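A hypothetical sketch of such a safety check, for illustration only (the fix that was eventually merged addressed Schema.buildFields() instead, see #7; using id.field() as the primary-key field name here is an assumption):
@Override
public void replace(ChangeData cd) {
  Term id = LuceneChangeIndex.idTerm(cd);
  Document doc = openIndex.toDocument(cd);
  // Hypothetical guard: refuse to touch the index if the document lost
  // its primary key, instead of deleting and re-adding a blank entry.
  if (doc.getField(id.field()) == null) {
    throw new StorageException("Blank document for change " + cd.getId());
  }
  // ... continue with the delete/replace futures as in #3 ...
}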
lu...@gmail.com <lu...@gmail.com> #5
I agree with the analysis that the issues were introduced in Change 270450. It's not yet obvious to me how we fix it.
I will attempt a mitigation in a few minutes and will test using the instructions to reproduce the issue.
I think these operations all have retries
Sure; however, the permanent removal of the document from the index won't allow any retry: the change becomes invisible and no retry would recover it. StorageException happens on busy servers hundreds of times per day. Busy means hundreds of thousands of ref updates per day, so a few hundred failures out of hundreds of thousands of updates is still on the order of 0.1%. However, the consequences are permanent, so even if the percentage is low, it is still a daily struggle.
Is it worse to have no change doc in the index vs have a change doc with inaccurate/incomplete data in the index?
Well, rather than choosing the lesser of two evils, I would like to get the good choice: I'd like either the indexing to be accurate or the indexing to fail ... but having a blank document and permanently removing the record isn't a good solution IMHO.
Maybe it's worth adding a safety check to replace() that confirms toDocument() returned something valid-looking?
Something like that, yes. Let me draft a fix and send it for review in a few minutes.
ap...@google.com <ap...@google.com> #7
Branch: stable-3.2
commit 081a9914a3b75e33d94f8d176882c044879895b0
Author: Luca Milanesio <luca.milanesio@gmail.com>
Date: Thu Nov 30 23:23:35 2023
Make the indexing operation fail upon StorageException(s)
Change 270450 caused the blanking of the Lucene document
upon reindexing if any field caused a StorageException.
Whilst the overall intention was good, the implementation
caused the Lucene index replace operation to continue with
a Document without any fields instead of making the whole
operation fail.
StorageExceptions are thrown when the underlying storage,
be it a filesystem or anything else, returns errors.
Whether the failure is permanent or temporary (e.g. concurrent
GCs, repacking or pruning may cause sporadic StorageExceptions),
returning a blank Lucene document was incorrect because, instead
of failing the operation, it caused the change entry to be
completely removed from the index.
Let the StorageException fail the indexing operation, so that
existing entries may continue to exist, allowing the caller
to retry the operation.
The previous implementation, returning an empty Document, did
not allow any retry, because once the change entry was removed
from the index, it could not be discovered and reindexed anymore,
for example by a `gerrit index changes <changenum>`.
Tested manually, applying a random StorageException thrown
during the fetching of the ChangeField extensions:
public static Set<String> getExtensions(ChangeData cd) {
  if (new Random().nextBoolean()) {
    throw new StorageException("Simulated storage exception");
  }
  return extensions(cd).collect(toSet());
}
Before this change, every time the indexing of a change threw a
StorageException, the change disappeared from the index.
Eventually, all changes would be eliminated, and only an off-line
reindex or the reindexing of all changes in the project would
allow recovering them.
After this change, some of the indexing operations succeed and
others fail. However, retrying the reindexing operation allows
all of them to be reindexed.
Even if the above test case looks strange at first sight,
it simulates a real-life scenario where a low percentage
of indexing operations (0.1%) may fail because of sporadic
StorageExceptions. Before this change, some index entries
went missing on a daily basis (5 to 10 changes per day),
whilst after this change all indexing operations can be
retried and do not result in any index entry loss.
Bug:
Release-Notes: Fail the change reindex operation upon StorageException(s)
Change-Id: Ia121f47f7a68c290849a22dea657804743a26b0d
M java/com/google/gerrit/index/Schema.java
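The diff itself is not quoted in this thread; judging from the file touched and the commit message, the fix presumably lets buildFields() propagate the StorageException again, roughly the "Before" shape from #3. A sketch under that assumption, not the verbatim patch:
// Likely shape of the fix in Schema.buildFields(), inferred from the
// commit message above; not the verbatim merged patch.
public final Iterable<Values<T>> buildFields(T obj, ImmutableSet<String> skipFields) {
  // No catch clause: a StorageException thrown by fieldValues() propagates
  // to the caller, failing the Lucene replace operation instead of
  // silently producing a blank Document.
  return fields.values().stream()
      .map(f -> fieldValues(obj, f, skipFields))
      .filter(Objects::nonNull)
      .collect(toImmutableList());
}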
Description
Context
Busy Gerrit servers have a number of concurrent operations on the repository that cause frequent race conditions, for example Git GC, pruning, repacking, and removing loose refs.
How to reproduce the issue
What steps will reproduce the problem?
Introduce one random StorageException in only one of the Change fields, for example by adding it in ChangeField.java (see the snippet below). Then reindex one change, say number 123, multiple times, using the gerrit index changes 123 command.
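The snippet itself did not survive in this thread; presumably it is the same random-failure hook quoted in the commit message in #7:
// Simulated sporadic storage failure in one Change field (from #7).
public static Set<String> getExtensions(ChangeData cd) {
  if (new Random().nextBoolean()) {
    throw new StorageException("Simulated storage exception");
  }
  return extensions(cd).collect(toSet());
}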
What is the expected output?
Some of the index changes commands fail and others succeed.
What do you see instead?
After the first failure, the change index entry is completely lost and the change cannot be indexed anymore. It becomes effectively invisible to Gerrit.
See for example a simple test execution below:
In the logs I see:
Please provide any additional information below.
The loss of a change entry in the Lucene index is very serious business, because Gerrit relies on the index for many operations, including:
Storage exceptions may happen in JGit whenever a race condition occurs, for instance during the packing of loose refs. See an example below:
The temporary failure of one REST API because of a transient StorageException is not a critical issue. The permanent failure of a change index entry is a critical issue and can have serious consequences. What is worse is that it is non-recoverable: only an off-line reindex or the reindexing of all changes in a project can re-create the missing entry.