On the impact of Bloom filters on Storj's system architecture
A single, well-considered choice in one component of a distributed system can significantly impact its efficiency and behavior. At Storj, a decentralized object storage platform, this principle is demonstrated through the use of Bloom filters for garbage collection. This post is based on my notes and observations as a small-scale contributor and active community member. I found this case study interesting not just for the trade-offs it involves, as is typical in system architecture, but also because it demonstrates the necessity of intentional design.
The Problem of Object Deletion
In Storj’s distributed object storage system, each object is split into multiple pieces using Reed-Solomon redundancy codes and distributed to storage nodes worldwide.1 A typical 10TB storage node currently stores around 50 million individual pieces.2 Customers can request object deletions at any time, and in practice objects are deleted in significant volumes around the clock, necessitating careful design to manage the resulting workload.
In Storj’s system, a satellite is the component responsible for managing which objects are stored on which nodes, and hence also for handling deletions. A naïve approach might involve the satellite sending deletion requests to storage nodes for each piece of a deleted object. This design would come with several issues:
- Server-Side Copies: Storj allows customers to create multiple ‘thin’ copies of the same object.3 A customer may request the deletion of one copy while retaining others. The satellite must ensure that a piece is only deleted when the last reference to it is gone. Tracking this requires either a per-object counter or an index of objects, both of which have drawbacks.4 The counter introduces a contention point, as it must be updated each time an object is copied or deleted. Both approaches also require updates in the hot paths of operations like copying and deletion, and increase the satellite’s master database storage requirements.
- Communication Failures: Deletion requests to storage nodes may fail due to network issues or node downtime. If a deletion request fails and is not retried, pieces remain on disk indefinitely, wasting space. A deletion queue would help, but could become clogged when nodes experience connectivity issues or longer downtimes.5
Instead, Storj opted for a less common solution: periodically sending information about live pieces using a lossy data structure called a Bloom filter.
Bloom Filters: A Probabilistic Solution to Garbage Collection
A Bloom filter is a probabilistic data structure used to test membership in a set. It allows for a compact representation of large datasets with a trade-off: it can produce false positives but guarantees no false negatives. This property makes it well-suited for Storj’s garbage collection needs.
Here’s how Bloom filters work in Storj’s context:
- Generation: At regular intervals, a satellite generates a Bloom filter for every storage node. This filter encodes the identifiers of the pieces that the node should retain. Instead of listing 50 million pieces explicitly (which would require about 1.6 gigabytes of 32-byte piece identifiers), the Bloom filter condenses this information into a structure of approximately 30 megabytes.
- Distribution: The Bloom filter is sent to the corresponding storage node, which uses it to determine which pieces to keep. The storage node flags any piece not matching the filter as a candidate for deletion.
- Application: Because Bloom filters guarantee no false negatives, any piece that should be kept will remain on the node. And because they allow false positives, some pieces that should be deleted will not be flagged, and so will remain after the filter is applied.
To understand Bloom filters, consider this analogy: imagine a storage node storing English words (e.g., “yodeling,” “fatigue,” “standing”). A Bloom filter might tell the node to keep words that use only certain letters, such as “abcdeghiklmnoprstuvy”:
- “Yodeling” matches and should be kept.
- “Fatigue,” containing the letter ‘f,’ is not in the filter and can be deleted.
- However, a word that should be deleted, like “youngster,” might still match the filter (a false positive) and remain.
Instead of using an alphabet of 26 letters, a real Bloom filter—like the one used by Storj—employs cryptographic hash functions to select from an ‘alphabet’ of millions of unique ‘letters’. Storj adjusts the alphabet size and the number of letters to generate compact filters.6 The current false positive rate is set at 10%.7
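To make this concrete, here is a minimal Go sketch of a Bloom filter using the standard sizing formulas. It is an illustration only, not Storj's implementation; the hashing scheme and the type names are simplified assumptions. Plugging in the numbers from above, 50 million pieces at a 10% false-positive rate, yields roughly the 30 megabytes mentioned earlier.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// bloomFilter is a minimal illustration: m bits of state addressed by k hash
// functions. Storj's real implementation differs in details such as hashing
// and serialization.
type bloomFilter struct {
	bits []byte // the filter's bit array
	m    uint64 // number of bits (the size of the 'alphabet')
	k    uint32 // number of hash functions ('letters' per piece)
}

// optimalParams returns the standard Bloom filter sizing for n elements at
// false-positive rate p: m = -n*ln(p)/ln(2)^2 and k = (m/n)*ln(2).
func optimalParams(n uint64, p float64) (m uint64, k uint32) {
	m = uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	k = uint32(math.Round(float64(m) / float64(n) * math.Ln2))
	return m, k
}

func newBloomFilter(n uint64, p float64) *bloomFilter {
	m, k := optimalParams(n, p)
	return &bloomFilter{bits: make([]byte, (m+7)/8), m: m, k: k}
}

// positions derives k bit positions from a piece ID by hashing the ID
// together with the index of the hash function.
func (f *bloomFilter) positions(pieceID []byte) []uint64 {
	out := make([]uint64, f.k)
	for i := uint32(0); i < f.k; i++ {
		h := fnv.New64a()
		binary.Write(h, binary.BigEndian, i)
		h.Write(pieceID)
		out[i] = h.Sum64() % f.m
	}
	return out
}

// Add marks a piece as one the node should keep.
func (f *bloomFilter) Add(pieceID []byte) {
	for _, pos := range f.positions(pieceID) {
		f.bits[pos/8] |= 1 << (pos % 8)
	}
}

// MightContain reports whether a piece may be in the filter. False positives
// are possible; false negatives are not.
func (f *bloomFilter) MightContain(pieceID []byte) bool {
	for _, pos := range f.positions(pieceID) {
		if f.bits[pos/8]&(1<<(pos%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Sizing check: 50 million pieces at a 10% false-positive rate.
	m, k := optimalParams(50_000_000, 0.1)
	fmt.Printf("m = %d bits (~%.0f MB), k = %d hash functions\n",
		m, float64(m)/8/1e6, k)
	// Prints roughly 240 million bits, i.e. about 30 MB, with k = 3.
}
```

The same figures fall out of the Bloom filter calculator mentioned in the footnotes, since it evaluates the same formulas.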
Architectural Consequences
Let’s analyze the trade-offs of this approach.
Reduced satellite storage and I/O
The satellite does not have to track deleted pieces. The moment a customer requests removal of an object, all the satellite needs to do is remove the corresponding rows from its metadata tables. This matters because the satellite, as the central place of metadata management, is perceived as a potential bottleneck.
Reduced communication overhead
Using Bloom filters significantly reduces the communication burden between satellites and storage nodes, cutting both the number of messages and the amount of data transferred by orders of magnitude. Again, while storage nodes collectively are seen as having effectively unbounded bandwidth, satellites are not, so managing satellite bandwidth is important.
State-based communication
This design reflects state-based synchronization rather than the more usual event-driven (change-based) communication. The satellite communicates the expected state (the set of pieces a node should store) rather than issuing explicit delete commands. This form of synchronization reduces network-level complexity because it eliminates the retry mechanisms otherwise needed to handle missed updates when nodes experience temporary connectivity issues or downtime.
That said, this communication is not instantaneous. The need to cross the network to communicate the expected state, the full metadata scan required to construct filters on the satellite side, and the time needed to apply filters to a large amount of data on the storage node side all mean that filters are stale—potentially by days—by the time storage nodes finish processing them.8 As such, storage nodes must accept effectively stale filters. To prevent accidental deletion of recently uploaded pieces, each filter is annotated with a timestamp of the last piece considered during its generation, and storage nodes only delete pieces uploaded before that timestamp. This in turn requires storage nodes to maintain timestamped records of piece uploads to compare against the filter’s generation time.
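Continuing the earlier Go sketch (it reuses the hypothetical bloomFilter type and additionally needs the standard time package), the retain step on a storage node could look roughly like this; the pieceInfo type and the cutoff handling are assumptions for illustration, not Storj's actual code.

```go
// pieceInfo is a hypothetical record a storage node keeps for each piece.
type pieceInfo struct {
	id       []byte
	uploaded time.Time
}

// retain applies a Bloom filter to the node's local pieces. Pieces uploaded
// after the filter was generated are never touched, because the satellite
// could not have included them when building the filter.
func retain(local []pieceInfo, filter *bloomFilter, filterCreated time.Time) (trash []pieceInfo) {
	for _, piece := range local {
		if piece.uploaded.After(filterCreated) {
			continue // too new for this filter; keep unconditionally
		}
		if filter.MightContain(piece.id) {
			continue // kept, possibly as a false positive
		}
		trash = append(trash, piece) // candidate for the trash directory
	}
	return trash
}
```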
Another consequence is that a node receiving an empty or malformed filter due to a bug could unintentionally delete many or all of its pieces on the strength of a single message. To guard against this, nodes do not delete any data immediately. Instead, pieces are moved into a separate trash directory and only removed after a week. Storj has built a mechanism to collect statistics on piece downloads that hit the trash directory, as well as an endpoint to restore all data from the trash back into regular storage. This choice trades disk space for the opportunity to recover from bugs in the filter generation code.9
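The trash mechanism could be sketched along these lines; the flat per-piece file layout and the per-day trash directories are assumptions for illustration, not the storage node's actual on-disk format.

```go
// moveToTrash renames a piece file into a per-day trash directory instead of
// deleting it, so that the effect of a buggy filter can be undone within the
// retention window.
func moveToTrash(storageDir, trashDir, pieceID string) error {
	day := time.Now().UTC().Format("2006-01-02")
	dir := filepath.Join(trashDir, day)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.Rename(filepath.Join(storageDir, pieceID), filepath.Join(dir, pieceID))
}

// emptyTrash permanently removes trash directories older than the retention
// period (Storj keeps trashed pieces for about a week).
func emptyTrash(trashDir string, retention time.Duration) error {
	entries, err := os.ReadDir(trashDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		day, err := time.Parse("2006-01-02", e.Name())
		if err != nil {
			continue // not a day directory created by moveToTrash
		}
		if time.Since(day) > retention {
			if err := os.RemoveAll(filepath.Join(trashDir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```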
Probabilistic cleanup
A key trade-off with Bloom filters is that they are probabilistic. Because false positives are allowed, some obsolete pieces will remain and occupy storage longer than necessary. Currently around 10% of pieces intended for deletion survive each filter application. This results in higher storage requirements at the storage node level.
An additional consequence is that system-level disk space accounting is imprecise. While Storj satellites can track the disk space used by live pieces, storage nodes will at any given time also hold some garbage. To reduce the impact of this problem, nodes explicitly report to satellites how much disk space is free for new uploads and, when the satellite's disk space statistics are stale, can outright deny uploads. This is one of two reasons why satellites offer more storage nodes for an upload than strictly required by the Reed-Solomon parameters (the other being performance factors not discussed in this post).
The frequent nature of the cleanup process mitigates this issue. Currently, Bloom filters are generated and sent approximately once a week, and Storj can send them more often when necessary. If an obsolete piece is not collected during one cycle, it is highly likely to be collected in the next.
Furthermore, even if some pieces of a deleted object are unnecessarily preserved during a filter run, far too few survive for the object to be reconstructed from its redundancy-coded pieces. This renders such objects effectively deleted.
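A back-of-the-envelope calculation shows why. Assuming illustrative Reed-Solomon parameters where 29 of 80 pieces are needed to reconstruct an object (not necessarily Storj's production values), and assuming each garbage piece independently survives a filter pass with the 10% false-positive probability, the chance that enough pieces survive is negligible:

```go
package main

import (
	"fmt"
	"math"
)

// logChoose returns ln(C(n, k)) using the log-gamma function.
func logChoose(n, k int) float64 {
	a, _ := math.Lgamma(float64(n) + 1)
	b, _ := math.Lgamma(float64(k) + 1)
	c, _ := math.Lgamma(float64(n-k) + 1)
	return a - b - c
}

// tailProb returns P(X >= k) for X ~ Binomial(n, p): the probability that at
// least k of n pieces survive a filter pass when each survives independently
// with probability p.
func tailProb(n, k int, p float64) float64 {
	total := 0.0
	for i := k; i <= n; i++ {
		total += math.Exp(logChoose(n, i) + float64(i)*math.Log(p) + float64(n-i)*math.Log(1-p))
	}
	return total
}

func main() {
	// Illustrative Reed-Solomon parameters: 29 pieces needed out of 80 stored.
	fmt.Printf("P(object still reconstructible) ≈ %.1e\n", tailProb(80, 29, 0.1))
	// Prints a number on the order of 1e-10: effectively impossible.
}
```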
Reduced latency
Customers experience better performance during server-side copies and deletions because the state-based approach defers object-lifetime tracking to an asynchronous batch job. In contrast, an event-driven approach would put explicit object-lifetime management, through counters or indices, in the hot path of these operations, increasing latency.
Also, as a consequence of accepting stale filters, the process of generating filters does not require direct access to the master database. Instead, it can rely on a stale but consistent copy of the database. This avoids long-running transactions on the satellite’s master database, ensuring that latency-sensitive, customer-visible operations are not impacted by filter generation.
Conclusions
For me, the most interesting aspect of this case study is not the Bloom filters themselves but how a single choice has reshaped a large part of the distributed system. While the core decision revolves around a data structure used in a single type of message exchanged between satellites and storage nodes, its impact extends to storage consumption, system reliability, and even customer latency. The key takeaway for anyone studying distributed systems is the importance of intentional design. This choice was made possible by system designers who thoroughly understood its implications across layers and components, accepting the associated trade-offs.
- Storj: A Decentralized Cloud Storage Network Framework (PDF)
- Storj tolerates a full month of downtime for a node, with gradual impact on the node’s reputation. See the post by Alexey, December 15th, 2024.
- A good way to get some intuition for the trade-off is to play with Thomas Hurst’s Bloom filter calculator. Consider n the number of pieces actually expected to be stored by a node, p the probability of false positives (Storj’s default is 0.1), m the size of the ‘alphabet’ (and also the effective size of the Bloom filter), and k the number of ‘letters’ in a word.
- I originally posted this exposition in a forum thread in 2021, with a somewhat longer narrative.
- Just generating a set of filters may take over a day, as posted by elek on August 23, 2024.
- This safeguard has already been exercised, e.g. as posted by elek on May 6, 2024.