HBASE-29774 incremental backup fails on empty WAL #7762

Open
thomassngdata wants to merge 1 commit into apache:master from thomassngdata:HBASE-29774_incremental_backup_fails_empty_wal

@thomassngdata

Contributor

Add handling for WALHeaderEOFException in WALInputFormat with retries similar to other caught exceptions, but suppress the exception once the retries are exhausted, because empty WAL files should be skipped.

IncrementalTableBackupClient.convertWALsToHFiles fails with WALHeaderEOFException ("EOF while reading PB WAL magic") when processing an empty WAL file, because the WALInputFormat reader, unlike the WALEntryStream reader, does not handle WALHeaderEOFException.
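
For readers following along, the retry-then-suppress behaviour described above can be sketched roughly as follows. This is a standalone illustration and not the code in this PR: `EmptyWalHeaderException`, `ReaderOpener`, and `RetryThenSkip.openOrSkip` are hypothetical names standing in for WALHeaderEOFException, the WALFactory.createStreamReader call, and the open logic in WALInputFormat. The attempt count and sleep interval are placeholder parameters.

```java
import java.io.IOException;
import java.util.Optional;

// Hypothetical stand-in for HBase's WALHeaderEOFException, so the sketch compiles without HBase.
class EmptyWalHeaderException extends IOException {
  EmptyWalHeaderException(String message) {
    super(message);
  }
}

// Hypothetical abstraction over "open a reader for one WAL file".
interface ReaderOpener<R> {
  R open() throws IOException;
}

class RetryThenSkip {
  /**
   * Tries to open a reader up to maxAttempts times. If every attempt fails with
   * EmptyWalHeaderException, the exception is suppressed and an empty Optional is
   * returned so the caller can skip the file. Any other IOException propagates.
   */
  static <R> Optional<R> openOrSkip(ReaderOpener<R> opener, int maxAttempts, long sleepMillis)
      throws IOException, InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return Optional.of(opener.open());
      } catch (EmptyWalHeaderException e) {
        if (attempt == maxAttempts) {
          // Retries exhausted: treat the WAL as empty instead of failing the job.
          return Optional.empty();
        }
        // Wait before the next attempt, in case the header is still being written.
        Thread.sleep(sleepMillis);
      }
    }
    // Only reached when maxAttempts < 1, i.e. no attempt was made.
    return Optional.empty();
  }
}
```

In this sketch an empty Optional signals "nothing to read from this file", which keeps the skip decision explicit at the call site.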

@DieterDePaepe left a comment

Because Thomas (my colleague) is currently unavailable, I've implemented my own review remarks in a new PR; see #8327.

int attempt = 0;
Exception ee = null;
WALStreamReader reader = null;
boolean supressException = false;

Typo: supressException should be suppressException.

reader =
WALFactory.createStreamReader(path.getFileSystem(conf), path, conf, startPosition);
return reader;
} catch (WALHeaderEOFException wheofe) {

I've looked into the potential causes for this exception. The only legitimate case in which it can occur during normal operation is with the legacy WAL writer, where there is a brief window between opening the file and flushing the first entry. All other cases are caused by crashing Region Servers.

I prefer skipping the retry because:

  • it causes the test to run for over a minute.
  • in case the empty WAL file originated from a crash, a retry will not solve anything, and just cause a long delay.
  • given the only legitimate case for the empty WAL (the legacy WAL writer) simply caused a crash before, it's not a regression to just skip the WAL file instead.
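
The no-retry alternative argued for above could look roughly like this. It is a sketch under stated assumptions, not the change in this PR or in #8327: `SkipEmptyWalSketch`, `EmptyHeaderException`, `Opener`, and `openAll` are hypothetical names, with `EmptyHeaderException` standing in for WALHeaderEOFException. Paths are plain strings to keep the example free of HBase dependencies.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SkipEmptyWalSketch {
  // Hypothetical stand-in for WALHeaderEOFException.
  static class EmptyHeaderException extends IOException {
    EmptyHeaderException(String message) {
      super(message);
    }
  }

  // Hypothetical abstraction over "open a reader for the WAL at this path".
  interface Opener<R> {
    R open(String path) throws IOException;
  }

  /**
   * Opens each path exactly once. A path whose header cannot be read is recorded
   * in skipped and passed over immediately: no retry and no sleep. Any other
   * IOException still propagates to the caller.
   */
  static <R> List<R> openAll(List<String> paths, Opener<R> opener, List<String> skipped)
      throws IOException {
    List<R> readers = new ArrayList<>();
    for (String path : paths) {
      try {
        readers.add(opener.open(path));
      } catch (EmptyHeaderException e) {
        // Empty WAL: retrying would only add delay, so skip it right away.
        skipped.add(path);
      }
    }
    return readers;
  }
}
```

Recording the skipped paths, rather than silently dropping them, leaves room for the caller to log which WAL files were ignored.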
