Release Notes

6.0.15

Features

  • Added support for asynchronous replication to a remote DC with processes in a single cluster. This improves on the asynchronous replication offered by fdbdr because servers can fetch data from the remote DC if all replicas have been lost in one DC.
  • Added support for synchronous replication of the transaction log to a remote DC. This remote DC does not need to contain any storage servers, so far fewer servers are needed there.
  • The TLS plugin is now statically linked into the client and server binaries and no longer requires a separate library. (Issue #436)
  • TLS peer verification now supports verifying on the Subject Alternative Name. (Issue #514)
  • TLS peer verification now supports suffix matching by field. (Issue #515)
  • TLS certificates are automatically reloaded after being updated. [6.0.5] (Issue #505)
  • Added the fileconfigure command to fdbcli, which configures a database from a JSON document. [6.0.10] (PR #713)
  • Backup-to-blobstore now accepts a “bucket” URL parameter for setting the bucket name where backup data will be read/written. [6.0.15] (PR #914)
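The new bucket parameter rides in the backup URL's query string. A minimal sketch of what such a URL might look like, parsed with Python's standard library; the host, credentials, and path below are hypothetical, and only the `bucket` parameter name comes from the release note:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical backup-to-blobstore URL; "bucket" selects the target bucket.
url = "blobstore://mykey:mysecret@blob.example.com:443/backup_dir?bucket=fdb-backups"

parsed = urlparse(url)
params = parse_qs(parsed.query)
print(params["bucket"][0])  # fdb-backups
```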

Performance

  • Transaction logs do not copy mutations from previous generations of transaction logs. (PR #339)
  • Load balancing temporarily avoids communicating with storage servers that have fallen behind.
  • Avoid assigning storage servers responsibility for keys they do not have.
  • Clients optimistically assume the first leader reply from a coordinator is correct. (PR #425)
  • Network connections are now closed after no interface needs the connection. [6.0.1] (Issue #375)
  • Significantly improved the CPU efficiency of copying mutations to the transaction logs during recovery. [6.0.2] (PR #595)
  • Significantly improved the CPU efficiency of generating status on the cluster controller. [6.0.11] (PR #758)
  • Reduced CPU cost of truncating files that are being cached. [6.0.12] (PR #816)
  • Significantly reduced master recovery times for clusters with large amounts of data. [6.0.14] (PR #836)
  • Reduced read and commit latencies for clusters which are processing transactions larger than 1MB. [6.0.14] (PR #851)
  • Significantly reduced recovery times when executing rollbacks on the memory storage engine. [6.0.14] (PR #821)
  • Clients update their key location cache much more efficiently after storage server reboots. [6.0.15] (PR #892)
  • Tuned multiple-resolver configurations to better balance work across the resolvers. [6.0.15] (PR #911)

Fixes

  • Not all endpoint failures were reported to the failure monitor.
  • Watches registered on a lagging storage server would take a long time to trigger.
  • The cluster controller would not start a new generation until it recovered its files from disk.
  • Under heavy write load, storage servers would occasionally pause for ~100ms. [6.0.2] (PR #597)
  • Storage servers were not given time to rejoin the cluster before being marked as failed. [6.0.2] (PR #592)
  • Incorrect accounting of incompatible connections led to occasional assertion failures. [6.0.3] (PR #616)
  • A client could fail to connect to a cluster when the cluster was upgraded to a version compatible with the client. This affected upgrades that were using the multi-version client to maintain compatibility with both versions of the cluster. [6.0.4] (PR #637)
  • A large number of concurrent read attempts could bring the database down after a cluster reboot. [6.0.4] (PR #650)
  • Automatic suppression of trace events which occur too frequently was happening before trace events were suppressed by other mechanisms. [6.0.4] (PR #656)
  • After a recovery, the rate at which transaction logs made mutations durable to disk was around 5 times slower than normal. [6.0.5] (PR #666)
  • Clusters configured to use TLS could get stuck spending all of their CPU opening new connections. [6.0.5] (PR #666)
  • A mismatched TLS certificate and key set could cause the server to crash. [6.0.5] (PR #689)
  • Sometimes a minority of coordinators would fail to converge after a new leader was elected. [6.0.6] (PR #700)
  • Calling status too many times in a 5 second interval caused the cluster controller to pause for a few seconds. [6.0.7] (PR #711)
  • TLS certificate reloading could cause TLS connections to drop until process restart. [6.0.9] (PR #717)
  • Watches polled the server much more frequently than intended. [6.0.10] (PR #728)
  • Backup and DR didn’t allow setting certain knobs. [6.0.10] (Issue #715)
  • The failure monitor will become much less reactive after multiple successive failed recoveries. [6.0.10] (PR #739)
  • Data distribution did not limit the number of source servers for a shard. [6.0.10] (PR #739)
  • The cluster controller did not do locality aware reads when measuring status latencies. [6.0.12] (PR #801)
  • Storage recruitment would spin too quickly when the storage server responded with an error. [6.0.12] (PR #801)
  • Restoring a backup to the exact version at which a snapshot ends did not apply mutations done at the final version. [6.0.12] (PR #787)
  • Excluding a process that was both the cluster controller and something else would cause two recoveries instead of one. [6.0.12] (PR #784)
  • Configuring from three_datacenter to three_datacenter_fallback would cause a lot of unnecessary data movement. [6.0.12] (PR #782)
  • Very rarely, backup snapshots would stop making progress. [6.0.14] (PR #837)
  • Sometimes data distribution calculated the size of a shard incorrectly. [6.0.15] (PR #892)
  • Changing the storage engine configuration would not affect which storage engine was used by the transaction logs. [6.0.15] (PR #892)
  • On exit, fdbmonitor now kills only its child processes, rather than its entire process group, when run without the daemonize option. [6.0.15] (PR #826)
  • HTTP client used by backup-to-blobstore now correctly treats response header field names as case insensitive. [6.0.15] (PR #904)
  • Blobstore REST client was not following the S3 API in several ways (bucket name, date, and response formats). [6.0.15] (PR #914)
  • Data distribution could queue shard movements for restoring replication at a low priority. [6.0.15] (PR #907)
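The header-handling fix above reflects a general rule: HTTP header field names are case-insensitive, so lookups must normalize case before comparing. A minimal sketch of the technique (not FoundationDB's actual client code; the header names are illustrative):

```python
def find_header(headers, name):
    """Return the first value whose field name matches case-insensitively."""
    wanted = name.lower()
    for field, value in headers:
        if field.lower() == wanted:
            return value
    return None

response_headers = [("Content-Length", "42"), ("x-amz-request-id", "abc123")]
print(find_header(response_headers, "content-length"))  # 42
```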

Fixes only impacting 6.0.0+

  • A cluster configured with usable_regions=2 did not limit the rate at which it could copy data from the primary DC to the remote DC. This caused poor performance when recovering from a DC outage. [6.0.5] (PR #673)
  • Configuring usable_regions=2 on a cluster with a large amount of data caused commits to pause for a few seconds. [6.0.5] (PR #687)
  • On clusters configured with usable_regions=2, status reported no replicas remaining when the primary DC was still healthy. [6.0.5] (PR #687)
  • Clients could crash when passing in TLS options. [6.0.5] (PR #649)
  • Databases with more than 10TB of data would pause for a few seconds after recovery. [6.0.6] (PR #705)
  • Configuring from usable_regions=2 to usable_regions=1 on a cluster with a large number of processes would prevent data distribution from completing. [6.0.12] (PR #721) (PR #739) (PR #780)
  • Fixed a variety of problems with force_recovery_with_data_loss. [6.0.12] (PR #801)
  • The transaction logs would leak memory when serving peek requests to log routers. [6.0.12] (PR #801)
  • The transaction logs were doing a lot of unnecessary disk writes. [6.0.12] (PR #784)
  • The master will recover the transaction state store from local transaction logs if possible. [6.0.12] (PR #801)
  • A bug in status collection led to various workload metrics being missing and the cluster reporting unhealthy. [6.0.13] (PR #834)
  • Data distribution did not stop tracking certain unhealthy teams, leading to incorrect status reporting. [6.0.15] (PR #892)
  • Fixed a variety of problems related to changing between different region configurations. [6.0.15] (PR #892) (PR #907)
  • fdbcli protects against configuration changes which could cause irreversible damage to a cluster. [6.0.15] (PR #892) (PR #907)
  • Significantly reduced both client and server memory usage in clusters with large amounts of data and usable_regions=2. [6.0.15] (PR #892)

Status

  • The replication factor in status JSON is stored under redundancy_mode instead of redundancy.factor. (PR #492)
  • The metric data_version_lag has been replaced by data_lag.versions and data_lag.seconds. (PR #521)
  • Additional metrics for the number of watches and mutation count have been added and are exposed through status. (PR #521)
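Tools that parse status JSON across this upgrade can read the new field and fall back to the old one. A hedged sketch, assuming simplified dictionary shapes rather than the full status document:

```python
def replication_mode(configuration):
    """Read the 6.0 redundancy_mode field, falling back to pre-6.0 redundancy.factor."""
    if "redundancy_mode" in configuration:
        return configuration["redundancy_mode"]
    return configuration.get("redundancy", {}).get("factor")

print(replication_mode({"redundancy_mode": "triple"}))         # triple
print(replication_mode({"redundancy": {"factor": "double"}}))  # double
```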

Bindings

  • API version updated to 600. There are no changes since API version 520.
  • Several cases where functions in go might previously cause a panic now return a non-nil error. (PR #532)
  • C API calls made on the network thread could be reordered with calls made from other threads. [6.0.2] (Issue #518)
  • The TLS_PLUGIN option is now a no-op and has been deprecated. [6.0.10] (PR #710)
  • Java: the Versionstamp::getUserVersion() method did not handle user versions greater than 0x00FF due to operator precedence errors. [6.0.11] (Issue #761)
  • Python: bindings didn’t work with Python 3.7 because of the new `async` keyword. [6.0.13] (Issue #830)
  • Go: `PrefixRange` didn’t correctly return an error if it failed to generate the range. [6.0.15] (PR #878)
  • Go: Add Tuple layer support for `uint`, `uint64`, and `*big.Int` integers up to 255 bytes. Integer values will be decoded into the first of `int64`, `uint64`, or `*big.Int` in which they fit. [6.0.15] (PR #915)
  • Ruby: Add Tuple layer support for integers up to 255 bytes. [6.0.15] (PR #915)
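The Versionstamp fix above is a classic operator-precedence trap: the shift binds tighter than the bitwise AND, so the mask is applied to the wrong operand. A Python sketch of the same class of bug (not the Java bindings' actual code; the function names are hypothetical):

```python
def user_version_buggy(high, low):
    # Precedence trap: parsed as high & (0xFF << 8) == high & 0xFF00, which
    # zeroes any high byte that fits in 8 bits -- the "greater than 0x00FF" failure.
    return high & 0xFF << 8 | low & 0xFF

def user_version_fixed(high, low):
    return ((high & 0xFF) << 8) | (low & 0xFF)

print(user_version_fixed(0x01, 0x02))  # 258
print(user_version_buggy(0x01, 0x02))  # 2 -- the high byte is lost
```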

Other Changes

  • Upgrades from any version older than 5.0 are not supported.
  • Normalized the capitalization of trace event names and attributes. (PR #455)
  • Increased the memory requirements of the transaction log by 400MB. [6.0.5] (PR #673)