Release Notes
6.0.18
Fixes
- Backup metadata could falsely indicate that a backup is not usable. (PR #1007)
- Blobstore request failures could cause backup expire and delete operations to skip some files. (PR #1007)
- Blobstore request failures could cause restore to fail to apply some files. (PR #1007)
- Storage servers with large amounts of data would pause for a short period of time after rebooting. (PR #1001)
- The client library could leak memory when a thread died. (PR #1011)
Features
- Added the ability to specify backup versions as a number of days before the latest log version. (PR #1007)
6.0.17
Fixes
- Existing backups did not make progress after the cluster was upgraded to 6.0.16. (PR #962)
6.0.16
Fixes
- Blobstore REST client attempted to create buckets that already existed. (PR #923)
- DNS would fail if IPv6 responses were received. (PR #945)
- Backup expiration would occasionally fail due to an incorrect assert. (PR #926)
6.0.15
Features
- Added support for asynchronous replication to a remote DC with processes in a single cluster. This improves on the asynchronous replication offered by fdbdr because servers can fetch data from the remote DC if all replicas have been lost in one DC.
- Added support for synchronous replication of the transaction log to a remote DC. This remote DC does not need to contain any storage servers, meaning far fewer servers are needed in this remote DC.
- The TLS plugin is now statically linked into the client and server binaries and no longer requires a separate library. (Issue #436)
- TLS peer verification now supports verifying on Subject Alternative Name. (Issue #514)
- TLS peer verification now supports suffix matching by field. (Issue #515)
- TLS certificates are automatically reloaded after being updated. [6.0.5] (Issue #505)
- Added the fileconfigure command to fdbcli, which configures a database from a JSON document (a sketch of such a document follows this list). [6.0.10] (PR #713)
- Backup-to-blobstore now accepts a “bucket” URL parameter for setting the bucket name where backup data will be read/written. [6.0.15] (PR #914)
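As a sketch of the fileconfigure workflow mentioned above, the snippet below builds a minimal two-region JSON document of the kind the replication features in this section rely on, then writes it to disk. The datacenter ids (dc1, dc2), the regions.json file name, and the exact field layout are illustrative assumptions rather than a definitive schema.

    import json

    # Hypothetical two-region layout: "dc1" is the preferred primary
    # (higher priority) and "dc2" is the remote DC used by the replication
    # features above. Field names and ids here are illustrative assumptions.
    config = {
        "regions": [
            {"datacenters": [{"id": "dc1", "priority": 1}]},
            {"datacenters": [{"id": "dc2", "priority": 0}]},
        ]
    }

    with open("regions.json", "w") as f:
        json.dump(config, f, indent=2)

    # The document would then be applied from fdbcli with:
    #   fdbcli> fileconfigure regions.json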
Performance
- Transaction logs do not copy mutations from previous generations of transaction logs. (PR #339)
- Load balancing temporarily avoids communicating with storage servers that have fallen behind.
- Avoid assigning storage servers responsibility for keys they do not have.
- Clients optimistically assume the first leader reply from a coordinator is correct. (PR #425)
- Network connections are now closed after no interface needs the connection. [6.0.1] (Issue #375)
- Significantly improved the CPU efficiency of copying mutations to transaction logs during recovery. [6.0.2] (PR #595)
- Significantly improved the CPU efficiency of generating status on the cluster controller. [6.0.11] (PR #758)
- Reduced CPU cost of truncating files that are being cached. [6.0.12] (PR #816)
- Significantly reduced master recovery times for clusters with large amounts of data. [6.0.14] (PR #836)
- Reduced read and commit latencies for clusters which are processing transactions larger than 1MB. [6.0.14] (PR #851)
- Significantly reduced recovery times when executing rollbacks on the memory storage engine. [6.0.14] (PR #821)
- Clients update their key location cache much more efficiently after storage server reboots. [6.0.15] (PR #892)
- Tuned multiple resolver configurations to do a better job balancing work between each resolver. [6.0.15] (PR #911)
Fixes
- Not all endpoint failures were reported to the failure monitor.
- Watches registered on a lagging storage server would take a long time to trigger.
- The cluster controller would not start a new generation until it recovered its files from disk.
- Under heavy write load, storage servers would occasionally pause for ~100ms. [6.0.2] (PR #597)
- Storage servers were not given time to rejoin the cluster before being marked as failed. [6.0.2] (PR #592)
- Incorrect accounting of incompatible connections led to occasional assertion failures. [6.0.3] (PR #616)
- A client could fail to connect to a cluster when the cluster was upgraded to a version compatible with the client. This affected upgrades that were using the multi-version client to maintain compatibility with both versions of the cluster. [6.0.4] (PR #637)
- A large number of concurrent read attempts could bring the database down after a cluster reboot. [6.0.4] (PR #650)
- Automatic suppression of trace events which occur too frequently was happening before trace events were suppressed by other mechanisms. [6.0.4] (PR #656)
- After a recovery, the rate at which transaction logs made mutations durable to disk was around 5 times slower than normal. [6.0.5] (PR #666)
- Clusters configured to use TLS could get stuck spending all of their CPU opening new connections. [6.0.5] (PR #666)
- A mismatched TLS certificate and key set could cause the server to crash. [6.0.5] (PR #689)
- Sometimes a minority of coordinators would fail to converge after a new leader was elected. [6.0.6] (PR #700)
- Calling status too many times in a 5 second interval caused the cluster controller to pause for a few seconds. [6.0.7] (PR #711)
- TLS certificate reloading could cause TLS connections to drop until process restart. [6.0.9] (PR #717)
- Watches polled the server much more frequently than intended. [6.0.10] (PR #728)
- Backup and DR didn’t allow setting certain knobs. [6.0.10] (Issue #715)
- The failure monitor will become much less reactive after multiple successive failed recoveries. [6.0.10] (PR #739)
- Data distribution did not limit the number of source servers for a shard. [6.0.10] (PR #739)
- The cluster controller did not do locality aware reads when measuring status latencies. [6.0.12] (PR #801)
- Storage recruitment would spin too quickly when the storage server responded with an error. [6.0.12] (PR #801)
- Restoring a backup to the exact version at which a snapshot ends did not apply mutations done at the final version. [6.0.12] (PR #787)
- Excluding a process that was both the cluster controller and something else would cause two recoveries instead of one. [6.0.12] (PR #784)
- Configuring from three_datacenter to three_datacenter_fallback would cause a lot of unnecessary data movement. [6.0.12] (PR #782)
- Very rarely, backup snapshots would stop making progress. [6.0.14] (PR #837)
- Sometimes data distribution calculated the size of a shard incorrectly. [6.0.15] (PR #892)
- Changing the storage engine configuration would not affect which storage engine was used by the transaction logs. [6.0.15] (PR #892)
- On exit, fdbmonitor will only kill its child processes instead of its process group when run without the daemonize option. [6.0.15] (PR #826)
- HTTP client used by backup-to-blobstore now correctly treats response header field names as case insensitive. [6.0.15] (PR #904)
- Blobstore REST client was not following the S3 API in several ways (bucket name, date, and response formats). [6.0.15] (PR #914)
- Data distribution could queue shard movements for restoring replication at a low priority. [6.0.15] (PR #907)
Fixes only impacting 6.0.0+
- A cluster configured with usable_regions=2 did not limit the rate at which it could copy data from the primary DC to the remote DC. This caused poor performance when recovering from a DC outage. [6.0.5] (PR #673)
- Configuring usable_regions=2 on a cluster with a large amount of data caused commits to pause for a few seconds. [6.0.5] (PR #687)
- On clusters configured with usable_regions=2, status reported no replicas remaining when the primary DC was still healthy. [6.0.5] (PR #687)
- Clients could crash when passing in TLS options. [6.0.5] (PR #649)
- Databases with more than 10TB of data would pause for a few seconds after recovery. [6.0.6] (PR #705)
- Configuring from usable_regions=2 to usable_regions=1 on a cluster with a large number of processes would prevent data distribution from completing. [6.0.12] (PR #721) (PR #739) (PR #780)
- Fixed a variety of problems with force_recovery_with_data_loss. [6.0.12] (PR #801)
- The transaction logs would leak memory when serving peek requests to log routers. [6.0.12] (PR #801)
- The transaction logs were doing a lot of unnecessary disk writes. [6.0.12] (PR #784)
- The master will recover the transaction state store from local transaction logs if possible. [6.0.12] (PR #801)
- A bug in status collection led to various workload metrics being missing and the cluster reporting unhealthy. [6.0.13] (PR #834)
- Data distribution did not stop tracking certain unhealthy teams, leading to incorrect status reporting. [6.0.15] (PR #892)
- Fixed a variety of problems related to changing between different region configurations. [6.0.15] (PR #892) (PR #907)
- fdbcli protects against configuration changes which could cause irreversible damage to a cluster. [6.0.15] (PR #892) (PR #907)
- Significantly reduced both client and server memory usage in clusters with large amounts of data and usable_regions=2. [6.0.15] (PR #892)
Status
- The replication factor in status JSON is stored under redundancy_mode instead of redundancy.factor. (PR #492)
- The metric data_version_lag has been replaced by data_lag.versions and data_lag.seconds (see the sketch after this list). (PR #521)
- Additional metrics for the number of watches and mutation count have been added and are exposed through status. (PR #521)
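A quick way to see the renamed fields is to read the machine-readable status document from a client. The sketch below uses the Python binding and the \xff\xff/status/json special key; the JSON paths shown for redundancy_mode and data_lag are assumptions about where these fields appear, so treat them as a starting point rather than a reference.

    import json
    import fdb

    fdb.api_version(600)
    db = fdb.open()

    # The machine-readable status document is exposed through a special,
    # read-only key.
    status = json.loads(db[b'\xff\xff/status/json'].decode('utf-8'))

    # Replication factor is now reported as redundancy_mode (assumed path).
    print(status['cluster']['configuration']['redundancy_mode'])

    # data_version_lag was split into data_lag.versions and data_lag.seconds;
    # the per-storage-role path below is also an assumed location.
    for process in status['cluster']['processes'].values():
        for role in process.get('roles', []):
            if role.get('role') == 'storage' and 'data_lag' in role:
                print(role['data_lag']['versions'], role['data_lag']['seconds'])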
Bindings
- API version updated to 600. See the API version upgrade guide for upgrade details.
- Several cases where functions in Go might previously have caused a panic now return a non-nil error. (PR #532)
- C API calls made on the network thread could be reordered with calls made from other threads. [6.0.2] (Issue #518)
- The TLS_PLUGIN option is now a no-op and has been deprecated. [6.0.10] (PR #710)
- Java: the Versionstamp::getUserVersion() method did not handle user versions greater than 0x00FF due to operator precedence errors. [6.0.11] (Issue #761)
- Python: bindings didn’t work with Python 3.7 because of the new async keyword. [6.0.13] (Issue #830)
- Go: PrefixRange didn’t correctly return an error if it failed to generate the range. [6.0.15] (PR #878)
- Go: Added Tuple layer support for uint, uint64, and *big.Int integers up to 255 bytes. Integer values will be decoded into the first of int64, uint64, or *big.Int in which they fit (see the sketch after this list). [6.0.15] (PR #915)
- Ruby: Added Tuple layer support for integers up to 255 bytes. [6.0.15] (PR #915)
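As a minimal sketch of adopting the 600 API from a binding, the Python snippet below selects the new API version, opens the database, and packs a tuple containing an integer wider than 64 bits, the kind of value the Go and Ruby tuple layers gained support for above. The key and value are arbitrary examples, and it is assumed here that the Python tuple layer at this version already encodes integers of that width, as the Go and Ruby notes imply.

    import fdb

    # All 6.0 binding behavior described above sits behind API version 600,
    # which must be selected before opening a database.
    fdb.api_version(600)

    import fdb.tuple  # tuple layer; imported after selecting the API version

    db = fdb.open()

    # Arbitrary example key/value: the integer below is wider than 64 bits,
    # the kind of value the Go and Ruby tuple layers learned to encode in
    # 6.0.15 (assuming the Python tuple layer already handles it).
    key = fdb.tuple.pack((u'example', 2 ** 70))
    db[key] = fdb.tuple.pack((u'value',))
    print(fdb.tuple.unpack(db[key]))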
Other Changes
- Does not support upgrades from any version older than 5.0.
- Normalized the capitalization of trace event names and attributes. (PR #455)
- Various stateless processes now have a higher affinity for running on processes with unset process class, which may result in those roles changing location upon upgrade. See Version-specific notes on upgrading for details. (PR #526)
- Increased the memory requirements of the transaction log by 400MB. [6.0.5] (PR #673)