MDEV-39092 Copy Aria data and logs as part of backup #4971
mariadb-andrzejjarzabek wants to merge 42 commits into MariaDB:MDEV-14992
The InnoDB write-ahead log file in the old innodb_log_archive=OFF format is named ib_logfile0, pre-allocated to innodb_log_file_size and written as a ring buffer. This is good for write performance and space management, but unsuitable for arbitrary point-in-time recovery or for facilitating efficient incremental backup.

innodb_log_archive=ON: A new format where InnoDB will create and preallocate files ib_%016x.log, instead of writing a circular file ib_logfile0. Each file will be pre-allocated to innodb_log_file_size (between 4M and 4G; we impose a stricter upper limit of 4 GiB for innodb_log_archive=ON). Once a log fills up, we will create and pre-allocate another log file, to which log records will be written. Upon the completion of the first log checkpoint in a recently created log file, the old log file will be marked read-only, signaling that there will be no further writes to that file, and that the file may safely be moved to long-term storage.

The file name includes the log sequence number (LSN) at file offset 12288 (log_t::START_OFFSET).

Limiting the file size to 4 GiB allows us to identify each checkpoint by storing a 32-bit big-endian offset into the optional FILE_MODIFY and the mandatory FILE_CHECKPOINT records, between 12288 and the end of the file.

The innodb_encrypt_log format is identified by storing the encryption information at the start of the log file. The first 32-bit value will be 1, which is an invalid checkpoint offset. Each innodb_log_archive=ON log must use the same encryption parameters. Changing innodb_encrypt_log or related parameters is only possible by setting innodb_log_archive=OFF and restarting the server, which will permanently lose the history of the archived log.

The maximum number of log checkpoints that the innodb_log_archive=ON file header can represent is limited to 12288/4=3072 when using innodb_encrypt_log=OFF.
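The file naming and checkpoint header layout described here can be illustrated with a small sketch. The constants come from the commit message; every helper name below is hypothetical, and the half-open validity range and the slot saturation rule are assumptions based on the description, not the actual InnoDB code.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Values quoted in the commit message; the names are illustrative only.
constexpr uint64_t START_OFFSET = 12288;                // log_t::START_OFFSET
constexpr unsigned CHECKPOINT_SLOTS = START_OFFSET / 4; // 3072 (innodb_encrypt_log=OFF)

// Hypothetical helper: "ib_%016x.log", named after the LSN at START_OFFSET.
std::string archive_log_name(uint64_t first_lsn)
{
  char buf[32];
  const int len = std::snprintf(buf, sizeof buf, "ib_%016" PRIx64 ".log",
                                first_lsn);
  assert(len > 0 && std::size_t(len) < sizeof buf);
  return std::string(buf, std::size_t(len));
}

// A checkpoint is identified by a 32-bit big-endian file offset.
void store_be32(unsigned char *dst, uint32_t v)
{
  dst[0] = static_cast<unsigned char>(v >> 24);
  dst[1] = static_cast<unsigned char>(v >> 16);
  dst[2] = static_cast<unsigned char>(v >> 8);
  dst[3] = static_cast<unsigned char>(v);
}

uint32_t load_be32(const unsigned char *src)
{
  return uint32_t(src[0]) << 24 | uint32_t(src[1]) << 16 |
         uint32_t(src[2]) << 8 | uint32_t(src[3]);
}

// Assumption: valid offsets lie in [START_OFFSET, file_size). The value 1
// (the encryption marker) is therefore never a valid checkpoint offset.
bool is_valid_checkpoint_offset(uint32_t offset, uint64_t file_size)
{
  return offset >= START_OFFSET && offset < file_size;
}

// Assumption: once all slots are used, later checkpoints reuse the last slot.
unsigned checkpoint_slot(unsigned nth_checkpoint)
{
  return nth_checkpoint < CHECKPOINT_SLOTS ? nth_checkpoint
                                           : CHECKPOINT_SLOTS - 1;
}
```

For example, a log file whose first LSN is 12288 (0x3000) would be named ib_0000000000003000.log under this scheme.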
If we run out of slots in a log file, each subsequently completed checkpoint in that log file will overwrite the last slot in the checkpoint header, until we switch to the next log.

innodb_log_recovery_start: The checkpoint LSN to start recovery from. This will be useful when recovering from an archived log. This is useful for restoring an incremental backup (applying InnoDB log files that were copied since the previous restore).

innodb_log_recovery_target: The requested LSN to end recovery at. When this is set, all persistent InnoDB tables will be read-only, and no writes to the log are allowed. The intended purpose of this setting is to prepare an incremental backup, as well as to allow data retrieval as of a particular logical point of time. Setting innodb_log_recovery_target>0 is much like setting innodb_read_only=ON, with the exception that the data files may be written to by crash recovery, and locking reads will conflict with any incomplete transactions as necessary, and all transaction isolation levels will work normally (not hard-wired to READ UNCOMMITTED).

srv_read_only_mode: When this is set (innodb_read_only=ON), also recv_sys.rpo (innodb_log_recovery_target) will be set to the current LSN. This ensures that it will suffice to check only one of these variables when blocking writes to persistent tables.

The status variable innodb_lsn_archived will reflect the LSN since when a complete InnoDB log archive is available. Its initial value will be that of the new parameter innodb_log_archive_start. If that variable is 0 (the default), the innodb_lsn_archived will be recovered from the available log files. If innodb_log_archive=OFF, innodb_lsn_archived will be adjusted to the latest checkpoint every time a log checkpoint is executed. If innodb_log_archive=ON, the value should not change.

SET GLOBAL innodb_log_archive=!@@GLOBAL.innodb_log_archive will take effect as soon as possible, possibly after a log checkpoint has been completed.
The log file will be renamed between ib_logfile0 and ib_%016x.log as appropriate.

When innodb_log_archive=ON, the setting SET GLOBAL innodb_log_file_size will affect subsequently created log files when the file that is being currently written is running out. If we are switching log files exactly at the same time, then a somewhat misleading error message "innodb_log_file_size change is already in progress" will be issued.

no_checkpoint_prepare.inc: A new file, to prepare for subsequent inclusion of no_checkpoint_end.inc. We will invoke the server to parse the log and to determine the latest checkpoint.

All --suite=encryption tests that use innodb_encrypt_log will be skipped for innodb_log_archive=ON, because enabling or disabling encryption on the log is not possible without temporarily setting innodb_log_archive=OFF and restarting the server. The idea is to add the following arguments to an invocation of mysql-test/mtr:

    --mysqld=--loose-innodb-log-archive \
    --mysqld=--loose-innodb-log-recovery-start=12288 \
    --mysqld=--loose-innodb-log-file-mmap=OFF \
    --skip-test=mariabackup

Alternatively, specify --mysqld=--loose-innodb-log-file-mmap=ON to cover both code paths.

The mariabackup test suite must be skipped when using the innodb_log_archive=ON format, because mariadb-backup will only support the old ib_logfile0 format (innodb_log_archive=OFF).

A number of tests would fail when the parameter innodb_log_recovery_start=12288 is present, which is forcing recovery to start from the beginning of the history (the database creation). The affected tests have been adjusted with explicit --innodb-log-recovery-start=0 to override that:
(0) Some injected corruption may be "healed" by replaying the log from the beginning. Some tests expect an empty buffer pool after a restart, with no page I/O due to crash recovery.
(1) Any test that sets innodb_read_only=ON would fail with an error message that the setting prevents crash recovery, unless innodb_log_recovery_start=0.
(2) Any test that changes innodb_undo_tablespaces would fail in crash recovery, because crash recovery assumes that the undo tablespace ID that is available from the undo* files corresponds with the start of the log. This is an unfortunate design bug which we cannot fix easily.

log_sys.first_lsn: The start of the current log file, to be consulted in log_t::write_checkpoint() when renaming files.
log_sys.archived_lsn: New field: The value of innodb_lsn_archived.
log_sys.end_lsn: New field: The log_sys.get_lsn() when the latest checkpoint was initiated. That is, the start LSN of a possibly empty sequence of FILE_MODIFY records followed by FILE_CHECKPOINT.
log_sys.resize_target: The value of innodb_log_file_size that will be used for creating the next archive log file once the current file (of log_sys.file_size) fills up.
log_sys.archive: New field: The value of innodb_log_archive.
log_sys.next_checkpoint_no: Widen to uint16_t. There may be up to 12288/4=3072 checkpoints in the header.
log_sys.log: If innodb_log_archive=ON, this file handle will be kept open also in the PMEM code path.
log_sys.resize_log: If innodb_log_archive=ON, we may have two log files open both during normal operation and when parsing the log. This will store the other handle (old or new file).
log_sys.resize_buf: In the memory-mapped code path, this will point to the file resize_log when innodb_log_archive=ON.
recv_sys.log_archive: All innodb_log_archive=ON files that will be considered in recovery.
recv_sys.was_archive: A flag indicating that an innodb_log_archive=ON file is in innodb_log_archive=OFF format.
log_sys.is_pmem, log_t::is_mmap_writeable(): A new predicate. If is_mmap_writeable(), we assert and guarantee buf_size == capacity().
log_t::archive_new_write(): Create and allocate a new log file, and write the outstanding data to both the current and the new file, or only to the new file, until write_checkpoint() completes the first checkpoint in the new file.
log_t::archived_mmap_switch_prepare(): Create and memory-map a new log file, and update file_size to resize_target. Remember the file handle of the current log in resize_log, so that write_checkpoint() will be able to make it read-only.
log_t::archived_mmap_switch_complete(): Switch to the buffer that was created in archived_mmap_switch_prepare().
log_t::write_checkpoint(): Allow an old checkpoint to be completed in the old log file even after a new one has been created. If we are writing the first checkpoint in a new log file, we will mark the old log file read-only. We will also update log_sys.first_lsn unless it was already updated in the ARCHIVED_MMAP code path. In that code path, there is the special case where log_sys.resize_buf == nullptr and log_sys.checkpoint_buf points to log_sys.resize_log (the old log file that is about to be made read-only). In this case, log_sys.first_lsn will already point to the start of the current log_sys.log, even though the switch has not been fully completed yet.
log_t::header_rewrite(my_bool): Rewrite the log file header before or after renaming the log file, and write a message about the change, so that there will be a chance to recover in case the server is being killed during this operation. The recovery of the last ib_%016x.log also tolerates the ib_logfile0 format.
log_t::set_archive(my_bool,THD): Implement SET GLOBAL innodb_log_archive. An error will be returned if non-archived SET GLOBAL innodb_log_file_size (log file resizing) is in progress. Wait for checkpoint if necessary. The current log file will be renamed to either ib_logfile0 or ib_%016x.log, as appropriate.
log_t::archive_rename(): Rename an archived log to ib_logfile0 on recovery in case there had been a crash during set_archive().
log_t::archive_set_size(): A new function, to ensure that log_sys.resize_target is set on startup.
log_checkpoint_low(): Do not prevent a checkpoint at the start of a file.
We want the first innodb_log_archive=ON file to start with a checkpoint.
log_t::create(lsn_t): Initialize last_checkpoint_lsn. Initialize the log header as specified by log_sys.archive (innodb_log_archive).
log_write_buf(): Add the parameter max_length, the file wrap limit.
log_write_up_to(), mtr_t::commit_log_release<bool mmap=true>(): If we are switching log files, invoke buf_flush_ahead(lsn, true) to ensure that a log checkpoint will be completed in the new file.
mtr_t::finish_writer(): Specialize for innodb_log_archive=ON.
mtr_t::commit_file(): Ensure that log archive rotation will complete.
log_t::append_prepare<log_t::ARCHIVED_MMAP>(): Special case.
log_t::get_path(): Get the name of the current log file.
log_t::get_circular_path(size_t): Get the path name of a circular file. Replaces get_log_file_path().
log_t::get_archive_path(lsn_t): Return a name of an archived log file.
log_t::get_next_archive_path(): Return the name of the next archived log.
log_t::append_archive_name(): Append the archive log file name to a path string.
mtr_t::finish_writer(): Invoke log_close() only if innodb_log_archive=OFF. With innodb_log_archive=ON, we only force log checkpoints after creating a new archive file, to ensure that the first checkpoint will be written as soon as possible.
log_t::checkpoint_margin(): Replaces log_checkpoint_margin(). If a new archived log file has been created, wait for the first checkpoint in that file.
srv_log_rebuild_if_needed(): Never rebuild if innodb_log_archive=ON. The setting innodb_log_file_size will affect the creation of subsequent log files. The parameter innodb_encrypt_log cannot be changed while the log is in the innodb_log_archive=ON format.
log_t::attach(), log_mmap(): Add the parameter log_access, to distinguish memory-mapped or read-only access.
log_t::attach(): When disabling innodb_log_file_mmap, read checkpoint_buf from the last innodb_log_archive=ON file.
log_t::clear_mmap(): Clear the tail of the checkpoint buffer if is_mmap_writeable().
log_t::set_recovered(): Invoke clear_mmap(), and restore the log buffer to the correct position.
recv_sys_t::apply(): Let log_t::clear_mmap() enable log writes.
recv_sys_t::find_checkpoint(): Find and remember the checkpoint position in the last file when innodb_log_recovery_start points to an older file. When innodb_log_file_mmap=OFF, restore log_sys.checkpoint_buf from the latest log file. If the last archive log file is actually in innodb_log_archive=OFF format despite being named ib_%016x.log, try to recover it in that format. If the circular ib_logfile0 is missing, determine the oldest archived log file with contiguous LSN. If innodb_log_archive=ON, refuse to start if ib_logfile0 exists. Open non-last archived log files in read-only mode.
recv_sys_t::find_checkpoint_archived(): Validate each checkpoint in the current file header, and by default aim to recover from the last valid one. Terminate the search if the last validated checkpoint spanned two files. If innodb_log_recovery_start has been specified, attempt to validate it even if there is no such information stored in the checkpoint header.
log_parse_file(): Do not invoke fil_name_process() during recv_sys_t::find_checkpoint_archived(), when we tolerate FILE_MODIFY records while looking for a FILE_CHECKPOINT record.
recv_scan_log(): Invoke log_t::archived_switch_recovery() upon reaching the end of the current archived log file.
log_t::archived_switch_recovery_prepare(): Make use of recv_sys.log_archive and open all but the last file read-only.
log_t::archived_switch_recovery(): Switch files in the pread() code path.
log_t::archived_mmap_switch_recovery_complete(): Switch files in the memory-mapped code path.
recv_warp: A pointer wrapper for memory-mapped parsing that spans two archive log files.
recv_sys_t::parse_mmap(): Use recv_warp for innodb_log_archive=ON.
recv_sys_t::parse(): Tweak some logic for innodb_log_archive=ON.
log_t::set_recovered_checkpoint(): Set the checkpoint on recovery. Updates also the end_lsn.
log_t::set_recovered_lsn(): Also update flush_lock and write_lock, to ensure that log_write_up_to() will be a no-op.
log_t::persist(): Even if the flushed_to_disk_lsn does not change, we may want to reset the write_lsn_offset.
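The step in recv_sys_t::find_checkpoint() that determines the oldest archived log file with contiguous LSN could be sketched as follows. This is only an illustration, not the actual recovery code. It assumes that a file of size S named after first LSN L covers the LSN range [L, L + S - 12288), so the next file must start exactly at L + S - 12288; the type alias and function name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

constexpr uint64_t START_OFFSET = 12288; // log_t::START_OFFSET

// Hypothetical model of the archive directory: first LSN -> file size in bytes.
using log_files = std::map<uint64_t, uint64_t>;

// Walk backwards from the newest file while each predecessor ends exactly
// where its successor begins; return the first LSN of the oldest such file.
uint64_t oldest_contiguous_lsn(const log_files &files)
{
  assert(!files.empty());
  auto it = std::prev(files.end());
  uint64_t oldest = it->first;
  while (it != files.begin())
  {
    auto prev = std::prev(it);
    assert(prev->second > START_OFFSET);
    // Assumed rule: payload bytes = file size minus the header area.
    if (prev->first + (prev->second - START_OFFSET) != it->first)
      break; // gap in the LSN sequence: older files are unusable
    oldest = prev->first;
    it = prev;
  }
  return oldest;
}
```

With three 4 MiB files starting at LSN 12288, 4194304 and 8376320, the whole chain is contiguous and recovery could start from 12288; a file that does not line up with its successor cuts the chain at that point.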
This fixes up 076a99e (MDEV-37949). In the memory-mapped log writing code path we were wrongly assuming that innodb_log_file_size cannot exceed 4GiB. This assumption only holds for innodb_log_archive=ON.
log_t::archive_create(bool): Create and allocate an archive log.
log_t::write_checkpoint(): Try to preallocate the next archive file if needed, with the goal that when we need the file it will already be ready for use.
FIXME: Adjust crash recovery so that it will tolerate the extra empty files. We are missing calls to log_sys.unstash_archive_file() when the last file is unusable for recovery (filled with zeroes).
log_t::archive_create(): Tolerate a larger than zero-sized file.
log_t::set_recovered_lsn(): Invoke unstash_archive_file() in case there was a garbage (pre-allocated) file at the end which was not parsed at all.
log_file_is_zero(): Check if a log file starts with NUL bytes (is a preallocated file).
recv_sys_t::find_checkpoint(): Open the last non-preallocated log file in read/write mode.
recv_sys_t::archive_map: Make the elements const.
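A check in the spirit of log_file_is_zero() might look like the sketch below. The function name, the use of std::FILE and the 512-byte probe length are assumptions made for illustration; the commit message only states that the check looks for NUL bytes at the start of the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Returns true if the first check_len bytes (or the whole file, if shorter)
// are all NUL, which would indicate a preallocated but never written log.
// An empty or unreadable file is reported as "not preallocated".
bool log_starts_with_zeros(std::FILE *f, std::size_t check_len = 512)
{
  assert(f != nullptr && check_len > 0);
  if (std::fseek(f, 0, SEEK_SET) != 0)
    return false;
  std::vector<unsigned char> buf(check_len);
  const std::size_t n = std::fread(buf.data(), 1, check_len, f);
  if (n == 0)
    return false;
  return std::all_of(buf.begin(),
                     buf.begin() + static_cast<std::ptrdiff_t>(n),
                     [](unsigned char c) { return c == 0; });
}
```

Recovery could use such a predicate to skip a trailing preallocated file and open the last file that contains real log records in read/write mode.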
This introduces a basic driver Sql_cmd_backup, storage engine interfaces, and basic copying of InnoDB data files. On Windows, we pass a target directory name; elsewhere, we pass a target directory handle.

fil_space_t::write_or_backup: Keep track of in-flight page writes and pending backup operation. We must not allow them concurrently, because that could lead to torn pages in the backup.
fil_space_t::backup_end: The first page number that is not being backed up (by default 0, to indicate that no backup is in progress).
log_t::backup: Whether BACKUP SERVER is in progress. The purpose of this is to make BACKUP SERVER prevent the concurrent execution of SET GLOBAL innodb_log_archive=OFF or SET GLOBAL innodb_log_file_size when innodb_log_archive=OFF.
log_sys.archived_checkpoint: Keep track of the earliest available checkpoint, corresponding to log_sys.archived_lsn. This reflects SET GLOBAL innodb_log_recovery_start (which is settable now), for incremental backup.
buf_flush_list_space(): Check for concurrent backup before writing each page. This is inefficient, but this function may be invoked from multiple threads concurrently, and it cannot be changed easily, especially for fil_crypt_thread().

TODO: Implement finer-grained locking around copying page ranges.
TODO: Implement other storage engine interfaces.
TODO: Implement the necessary locking around backup_end.
TODO: Fix the space.get_create_lsn() < checkpoint logic.
TODO: Duplicate the last log file at the end.

innodb_backup_checkpoint(): Invoked when log checkpoint is switching to a new file.
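The mutual exclusion that fil_space_t::write_or_backup is described as providing (page writes and a backup copy must not overlap) can be pictured with a single atomic word. This is a hypothetical sketch of the idea, not the data structure used in the patch: the class name, the flag layout and the try-style interface are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One word holds a count of in-flight page writes plus a "backup" flag bit.
class Write_or_backup_gate
{
  static constexpr uint32_t BACKUP = 1U << 31;
  std::atomic<uint32_t> word{0};

public:
  // A page write may start only while no backup owns the gate.
  bool try_start_write() noexcept
  {
    uint32_t w = word.load(std::memory_order_relaxed);
    do
    {
      if (w & BACKUP)
        return false;
    } while (!word.compare_exchange_weak(w, w + 1, std::memory_order_acquire,
                                         std::memory_order_relaxed));
    return true;
  }

  void end_write() noexcept
  {
    const uint32_t prev = word.fetch_sub(1, std::memory_order_release);
    assert((prev & ~BACKUP) > 0); // there must have been a write in flight
    (void) prev;
  }

  // A backup may start only when no page write is in flight.
  bool try_start_backup() noexcept
  {
    uint32_t expected = 0;
    return word.compare_exchange_strong(expected, BACKUP,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  void end_backup() noexcept
  {
    word.fetch_and(~BACKUP, std::memory_order_release);
  }
};
```

A caller that fails to enter would have to wait or retry; how the real code waits (and whether it tracks page ranges via backup_end) is outside the scope of this sketch.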
This is an initial simple implementation which copies all the Aria files in the "end" phase of the backup. Nothing protects the copy from concurrent DDL or DML. Copying only works on macOS for now; the code is intended to be refactored to use a common file copy method across engines and the SQL layer.
Force-pushed: dbc91ba to 4ef94be
Enable backup for non-Apple systems. Copy non-Aria-specific files *.frm and db.opt as part of Aria backup.
Force-pushed: 4ef94be to c143ff2
    namespace backup
    {

    using target_dir_t= IF_WIN(const char*,int);
This violates MDEV-25861. We should not use the reserved POSIX _t suffix in any new type declarations.
Do we have a suggested naming convention for types/typedefs? Target_dir? If just target_dir, do we have a convention for naming a variable of that type representing the target directory (which in current code is called "target_dir")?
    #pragma once

    #include <cstdint>
Why would we include this header here? For uintptr_t perhaps? But we also use IF_WIN() and off_t without any declaration.
Yes, for uintptr_t. IF_WIN comes from my_global.h, which the source file includes. I notice it's common not to include header dependencies explicitly if they're already included by my_global.h. For off_t: agreed, I'll add sys/types.h.
    using target_dir_t= IF_WIN(const char*,int);

    inline void* to_void_ptr(target_dir_t tgt) noexcept
    {
      return IF_WIN(const_cast<char*>, reinterpret_cast<void*>)(tgt);
    }

    inline target_dir_t to_target_dir(void* ptr) noexcept
    {
      return IF_WIN(static_cast<const char*>(ptr),
                    int(reinterpret_cast<uintptr_t>(ptr)));
    }

    #ifndef _WIN32
As far as I understand, this header is not at all usable on Windows, because Microsoft Windows does not provide primitives for copying data between file descriptors, other than TransmitFile(), which we would put into use in MDEV-38362.
I don’t see the value of including this header on Microsoft Windows (until we make it cover file streaming) or defining such conversion functions. On Windows, we will need a different way of copying files, based on file names. I don’t think we should try to shoehorn it into a common interface.
First, the header defines target_dir_t (to be renamed), which de-duplicates all the IF_WIN(const char*,int). to_void_ptr and to_target_dir are used in one header only and could be defined there, but they are linked to the definition of the target dir type, so my thinking is that it is best to keep them together (I'm not adamant on this; I can move them to sql_backup.cc).
More generally, this header is meant for common BACKUP SERVER code that is shared between the SQL layer and plugins. In the current implementation, both engine-agnostic files like db.opt and *.frm and Aria-specific files are copied by the Aria plugin, but that will change. At that point it will be useful to have a function that copies a file in its entirety from the data directory to the backup target and works on all platforms. The intention is for this header to be where functions like that are declared. Similarly, the code that traverses the data directory and its subdirectories may at that point have to be extracted for reuse between the SQL and plugin layers.
      file_logger.c my_dlerror.c crc32/crc32c.cc
      my_timezone.cc my_compr_int.cc my_thread_name.cc
    - my_virtual_mem.c)
    + my_virtual_mem.c my_backup.cc)
The functions are about copying files, not backup. The file name is misleading.
It's intended for "implementation of common functions needed by BACKUP SERVER" (or definitions of functions/statics declared in my_backup.h). Incidentally, it contains just the copy function now, but other shared functions are likely to end up in there. As an example, a cross-platform (including Windows) function to "ensure a subdirectory exists in target directory (create if needed)" will probably end up in there, using the specific way that BACKUP SERVER code defines a "target directory" (which may not be shared by non-backup-related code that defines the concept of a "target directory" for its own purposes).
But when we consider misleading names, one existing problem in this codebase is that the term "backup" (and therefore e.g. source file names containing that word) is used in several different meanings: supporting backup via external tools (BACKUP STAGE etc.), copying files to the cloud using the S3 storage engine, and now BACKUP SERVER. This leads to less-than-fortunate situations where the sql source directory has unrelated backup.* and sql_backup.* files, and the Aria plugin gets ma_backup.h and ma_backup.cc in addition to the existing but unrelated ma_backup.c, which defines functions in include/aria_backup.h, etc. I thought about using inserver_backup as the naming convention for everything specifically related to BACKUP SERVER; it seemed a bit long, but maybe that's not a problem? Or maybe use a shortened version of that (ins_backup?).
    # endif
      DBUG_ASSERT(ret <= 0);
      return int(ret);
    #endif
    }
    #endif
    }
    \ No newline at end of file
The preprocessor directives are indented inconsistently, and the last line is missing a terminator.
    - int (*backup_start)(THD *thd, IF_WIN(const char*,int) target);
    + int (*backup_start)(THD *thd, backup::target_dir_t target);
The name change is misleading. In MDEV-38362, this interface is subject to revision. It is then conceivable that the int argument would be the handle of an output stream rather than of a target directory.
    #include "my_backup.h"

    using namespace backup;
Can we define the file-copying API in C? Most of mysys is in C.
    /* TODO: .frm files are not Aria-specific, they are copied here as a stop-gap */
    const std::vector<std::string> Aria_backup::data_exts {".MAD"s, ".MAI", ".frm"s};
Why is this not an array of const char*?
The reason is that it's useful to know the length of the suffix when checking whether a string ends with it. Obviously we could either call strlen every time or pre-calculate the length, but this seemed like a reasonable compromise between performance and legibility. Note that in any mainstream implementation of the standard library, this particular case will not allocate dynamically, because these suffixes fall under the short string optimization. But it does make sense to make it a static member, which I will do.
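One way to reconcile both positions (no heap use, yet the suffix length is known without strlen at run time) would be a constexpr array of std::string_view. This is only a sketch of that alternative, assuming C++17; the names data_exts and has_data_ext are illustrative and not part of the patch.

```cpp
#include <array>
#include <string_view>

// Suffix lengths are computed at compile time; nothing is heap-allocated.
constexpr std::array<std::string_view, 3> data_exts{".MAD", ".MAI", ".frm"};

// Returns true if name ends with one of the known data file suffixes.
constexpr bool has_data_ext(std::string_view name) noexcept
{
  for (std::string_view ext : data_exts)
    if (name.size() >= ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0)
      return true;
  return false;
}

static_assert(has_data_ext("t1.MAI"), "suffix match is usable at compile time");
```

The comparison is case-sensitive, matching how the suffixes are written in the quoted code.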
    int copy_file(const std::string &path) const noexcept
    {
      std::string src_path= std::string(maria_data_root) + "/" + path;
    #ifndef _WIN32
      int ret_val = 0;
      int src_fd = open(src_path.c_str(), O_RDONLY);
Outside Windows, could we please retain a handle to open(maria_data_root, O_DIRECTORY) and invoke openat(src_dir, path, O_RDONLY) to open the file? That would reduce the amount of memory heap operations.
I’d avoid std::string whenever possible. We already have too many issues around heap memory fragmentation. Even on Windows we could use NtCreateFile() to simulate openat(2). I will give it a try, because it would allow us to pass target directory handles rather than names across all platforms.
Will a string allocation which is done once per backup per file and then very quickly freed noticeably impact heap fragmentation or performance in general?
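For reference, the openat(2) approach suggested above needs no path concatenation at all. The following POSIX-only sketch opens both files relative to directory handles and copies with a plain read/write loop; the loop is a stand-in, not the platform-specific copy primitive used in the patch, and the function names, the O_EXCL policy and the 0660 mode are assumptions.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Copy everything from src to dst, retrying on EINTR and on short writes.
// Returns 0 on success, -1 on error (errno is set by the failing call).
int copy_fd(int src, int dst) noexcept
{
  char buf[65536];
  for (;;)
  {
    const ssize_t n = read(src, buf, sizeof buf);
    if (n == 0)
      return 0;
    if (n < 0)
    {
      if (errno == EINTR)
        continue;
      return -1;
    }
    for (ssize_t done = 0; done < n;)
    {
      const ssize_t w = write(dst, buf + done, size_t(n - done));
      if (w < 0)
      {
        if (errno == EINTR)
          continue;
        return -1;
      }
      done += w;
    }
  }
}

// Open both files relative to directory handles, as with
// openat(src_dir, path, O_RDONLY), and copy the contents.
int copy_file_at(int src_dir, const char *src_name,
                 int dst_dir, const char *dst_name) noexcept
{
  assert(src_name && dst_name);
  const int src = openat(src_dir, src_name, O_RDONLY | O_CLOEXEC);
  if (src < 0)
    return -1;
  // Assumption: the backup target is fresh, so an existing file is an error.
  const int dst = openat(dst_dir, dst_name,
                         O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0660);
  if (dst < 0)
  {
    close(src);
    return -1;
  }
  int ret = copy_fd(src, dst);
  close(src);
  if (close(dst))
    ret = -1;
  return ret;
}
```

In a backup, src_dir would be a retained handle from open(maria_data_root, O_DIRECTORY) and both names would normally be the same relative path.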
    int perform_backup() noexcept
    {
      if (scan_datadir())
        return 1;
      if (copy_databases())
        return 1;
      if (copy_control_file())
        return 1;
      if (copy_logs())
        return 1;
      return 0;
    }
This is fundamentally incompatible with the handlerton::backup_step API. Even if you are for now copying everything in handlerton::backup_end, the internal API should be kept as compatible with handlerton::backup_step as possible: copying one file at a time. At least the copy_databases() step must be refactored so that the higher-level API is invoking something that copies one file at a time. Possibly, the copying of log files should be interleaved with that, like innodb_backup_step is doing.
This is a stop-gap to get something working. The eventual final implementation will require reworking the handlerton API and Sql_cmd_backup to take into account that different subsets of files may be copied under different levels of metadata lock, and that "start" and potentially even "end" may need to be split into different phases with different levels of metadata lock. Given that the stage API is written that way to support multi-threaded copy, which isn't implemented at this time, I propose to merge the change as proposed (I will fix some of the other problems you mentioned) and iteratively improve from there.
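To make the "one file at a time" shape discussed in this thread concrete, here is a hypothetical sketch of a work queue whose step() copies a single file per call. The class name and the return convention (1 = more work, 0 = finished, -1 = error) are assumptions for illustration; the actual handlerton::backup_step signature is not shown in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical queue of files to copy, filled by a directory scan.
class File_copy_queue
{
public:
  using copy_fn = std::function<int(const std::string &)>;

  File_copy_queue(std::vector<std::string> files, copy_fn copy)
    : files_(std::move(files)), copy_(std::move(copy))
  {
    assert(copy_);
  }

  // Copy at most one file per call, so a driver can interleave other work
  // (for example log copying) or hand steps to several threads later.
  int step()
  {
    if (next_ == files_.size())
      return 0;
    if (copy_(files_[next_]))
      return -1;
    ++next_;
    return next_ < files_.size() ? 1 : 0;
  }

  // The current all-at-once behaviour becomes a simple loop over step().
  int run_all()
  {
    int r;
    while ((r = step()) > 0) {}
    return r;
  }

private:
  std::vector<std::string> files_;
  copy_fn copy_;
  std::size_t next_ = 0;
};
```

With this shape, a stop-gap that copies everything in the "end" phase and a later step-driven implementation could share the same per-file copy routine.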
Incorporate changes from code review.
Force-pushed: efbc62b to c77278b