Docs for pruning and some internal renaming #4505

Merged · 7 commits · Mar 29, 2023
4 changes: 4 additions & 0 deletions NEWS.md
@@ -2,6 +2,10 @@

## Unreleased

+ - the behavior for `graphman prune` has changed: running just `graphman
+   prune` will mark the subgraph for ongoing pruning in addition to
+   performing an initial pruning. To avoid ongoing pruning, use `graphman
+   prune --once` ([docs](./docs/implementation/pruning.md))
- the materialized views in the `info` schema (`table_sizes`, `subgraph_sizes`, and `chain_sizes`) that provide information about the size of various database objects are now automatically refreshed every 6 hours. [#4461](https://github.com/graphprotocol/graph-node/pull/4461)

### Fixes
16 changes: 8 additions & 8 deletions docs/environment-variables.md
@@ -227,14 +227,14 @@ those.
1.1 means that the subgraph will be pruned every time it contains 10%
more history (in blocks) than its history limit. The default value is 1.2
and the value must be at least 1.01
- - `GRAPH_STORE_HISTORY_COPY_THRESHOLD`,
-   `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: when pruning, prune by copying the
-   entities we will keep to new tables if we estimate that we will remove
-   more than a factor of `COPY_THRESHOLD` of the deployment's history. If we
-   estimate to remove a factor between `COPY_THRESHOLD` and
-   `DELETE_THRESHOLD`, prune by deleting from the existing tables of the
+ - `GRAPH_STORE_HISTORY_REBUILD_THRESHOLD`,
+   `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: when pruning, prune by copying
+   the entities we will keep to new tables if we estimate that we will
+   remove more than a factor of `REBUILD_THRESHOLD` of the deployment's
+   history. If we estimate to remove a factor between `REBUILD_THRESHOLD`
+   and `DELETE_THRESHOLD`, prune by deleting from the existing tables of the
  deployment. If we estimate to remove less than `DELETE_THRESHOLD`
  entities, do not change the table. Both settings are floats, and default
- to 0.5 for the `COPY_THRESHOLD` and 0.05 for the `DELETE_THRESHOLD`; they
- must be between 0 and 1, and `COPY_THRESHOLD` must be bigger than
+ to 0.5 for the `REBUILD_THRESHOLD` and 0.05 for the `DELETE_THRESHOLD`;
+ they must be between 0 and 1, and `REBUILD_THRESHOLD` must be bigger than
  `DELETE_THRESHOLD`.
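
For illustration only (this is not graph-node's code), a minimal Rust sketch of how the two thresholds partition the estimated fraction of a table's data that pruning would remove, using the defaults of 0.5 and 0.05:

```rust
/// Minimal sketch, not graph-node's implementation: maps the estimated
/// fraction of a table that pruning would remove to the action the docs
/// above describe, using the default thresholds.
fn pick_action(removed_fraction: f64) -> &'static str {
    const REBUILD_THRESHOLD: f64 = 0.5; // GRAPH_STORE_HISTORY_REBUILD_THRESHOLD
    const DELETE_THRESHOLD: f64 = 0.05; // GRAPH_STORE_HISTORY_DELETE_THRESHOLD
    if removed_fraction > REBUILD_THRESHOLD {
        "rebuild: copy the entities we keep into new tables"
    } else if removed_fraction > DELETE_THRESHOLD {
        "delete: remove old versions from the existing tables"
    } else {
        "skip: leave the table unchanged"
    }
}
```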
1 change: 1 addition & 0 deletions docs/implementation/README.md
@@ -9,3 +9,4 @@ the code should go into comments.
* [Time-travel Queries](./time-travel.md)
* [SQL Query Generation](./sql-query-generation.md)
* [Adding support for a new chain](./add-chain.md)
+ * [Pruning](./pruning.md)
99 changes: 99 additions & 0 deletions docs/implementation/pruning.md
@@ -0,0 +1,99 @@
## Pruning deployments

Subgraphs, by default, store a full version history for entities, allowing
consumers to query the subgraph as of any historical block. Pruning is an
operation that deletes entity versions older than a certain block from a
deployment, so that it is no longer possible to query the deployment as of
prior blocks. In GraphQL, those are only queries with a constraint
`block: { number: <n> }`, or a similar constraint by block hash, where `n`
is before the block to which the deployment is pruned. Queries run at a
block height greater than that are not affected by pruning, and there is
no difference between running such queries against an unpruned and a
pruned deployment.

Because pruning reduces the amount of data in a deployment, it reduces the
amount of storage needed for that deployment, and is beneficial for both
query performance and indexing speed. Compared to the default of keeping
all history, pruning can often shrink a deployment's data dramatically and
speed up queries considerably. See [caveats](#caveats) below for the
downsides.

The block `b` to which a deployment is pruned is controlled by
`history_blocks`, the number of blocks of history to retain; `b` is
calculated internally from `history_blocks` and the latest block of the
deployment when the
prune operation is performed. When pruning finishes, it updates the
`earliest_block` for the deployment. The `earliest_block` can be retrieved
through the `index-node` status API, and `graph-node` will return an error
for any query that tries to time-travel to a point before
`earliest_block`. The value of `history_blocks` must be greater than
`ETHEREUM_REORG_THRESHOLD` to make sure that reverts can never conflict
with pruning.
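
A sketch of the arithmetic (identifiers here are illustrative, not
graph-node internals):

```rust
/// Illustrative sketch; the actual calculation happens inside graph-node.
fn prune_target(
    latest_block: i32,
    history_blocks: i32,
    reorg_threshold: i32,
) -> Option<i32> {
    // `history_blocks` must exceed the reorg threshold so that a revert
    // can never reach blocks that pruning has already removed.
    if history_blocks <= reorg_threshold {
        return None;
    }
    // Everything before this block becomes unavailable to queries.
    Some(latest_block - history_blocks)
}
```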

Pruning is started by running `graphman prune`. That command will perform
an initial prune of the deployment and set the subgraph's `history_blocks`
setting, which is used to periodically check whether the deployment has
accumulated more history than that. Whenever the deployment does contain
more history than `history_blocks`, it is automatically repruned. If
ongoing pruning is not desired, pass the `--once` flag to `graphman
prune`. Ongoing pruning can be turned off by setting `history_blocks` to a
very large value with the `--history` flag.

> **Contributor:** Is that initial prune now async (i.e., it doesn't block
> indexing)?
>
> **Author:** Good point, added a paragraph for that. It blocks indexing
> with the rebuild strategy while it copies nonfinal entities. I also
> added another paragraph explaining what log output to look for.

> **Contributor:** To check my understanding: if you pruned once with,
> say, 10,000 blocks (setting `history_blocks` to 10,000) and want to turn
> off pruning, you might call `graphman prune --history 1000000000`, so 1B
> blocks, which is effectively no pruning?
>
> **Author:** Yes, that's exactly what I meant here.

Repruning is performed whenever the deployment has more than
`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history. The
environment variable `GRAPH_STORE_HISTORY_SLACK_FACTOR` therefore controls
how often repruning is performed: with
`GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5` and `history_blocks` set to 10,000,
a reprune will happen every 5,000 blocks. After the initial pruning, a
reprune therefore happens every `history_blocks *
(GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)` blocks. This factor should be set
high enough that repruning occurs relatively infrequently and does not
cause too much database work.
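
A sketch of the trigger implied by this (again with illustrative names):

```rust
/// Illustrative: reprune once accumulated history exceeds the slack bound.
fn reprune_due(history: u64, history_blocks: u64, slack_factor: f64) -> bool {
    (history as f64) > (history_blocks as f64) * slack_factor
}

// With history_blocks = 10_000 and slack_factor = 1.5, repruning kicks in
// once history exceeds 15_000 blocks, i.e. every 5_000 blocks after a
// prune brings the deployment back down to 10_000 blocks of history.
```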

Pruning uses two different strategies to remove unneeded data: rebuilding
tables and deleting old entity versions. Deleting old entity versions is
straightforward: this strategy deletes rows from the underlying tables.
Rebuilding tables copies the data that should be kept from the existing
tables into new tables and then replaces the existing tables with these
much smaller tables. Which strategy to use is determined for each table
individually, governed by the settings for
`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and
`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`, both numbers between 0 and 1: if
we estimate that we will remove more than a fraction of
`REBUILD_THRESHOLD` of the table, the table will be rebuilt. If we
estimate that we will remove a fraction between `REBUILD_THRESHOLD` and
`DELETE_THRESHOLD` of the table, unneeded entity versions will be
deleted. If we estimate to remove less than `DELETE_THRESHOLD`, the table
is not changed at all. With both strategies, operations are broken into
batches that should each take `GRAPH_STORE_BATCH_TARGET_DURATION` seconds
to avoid causing very long-running transactions.

> **Contributor:** Are these thresholds 0-1 (i.e., 0.5 is 50%)? Or 0-100?
>
> **Author:** Yes, it's between 0 and 1, added that to the text.
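
A simplified restatement in Rust of how this decision is made from the
statistics; it mirrors `PruneRequest::strategy` in the diff to
`graph/src/components/store/mod.rs` further down, with abridged names:

```rust
enum PruningStrategy {
    Rebuild,
    Delete,
}

/// Simplified restatement of `PruneRequest::strategy` (see the diff
/// below); parameter names are abridged for the sketch.
fn strategy(
    history_pct: f64,      // fraction of blocks that pruning removes
    version_ratio: f64,    // entities-to-versions ratio from Postgres stats
    rebuild_threshold: f64,
    delete_threshold: f64,
) -> Option<PruningStrategy> {
    // Tables where most rows are historical versions have a low
    // version_ratio, so pruning stands to remove more of the table.
    let removal_ratio = history_pct * (1.0 - version_ratio);
    if removal_ratio >= rebuild_threshold {
        Some(PruningStrategy::Rebuild)
    } else if removal_ratio >= delete_threshold {
        Some(PruningStrategy::Delete)
    } else {
        None
    }
}
```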

Pruning, in most cases, runs in parallel with indexing and does not block
it. When the rebuild strategy is used, pruning does block indexing while it
copies non-final entities from the existing table to the new table.

The initial prune started by `graphman prune` prints a progress report on
the console. For the ongoing prune runs that are periodically performed,
the following information is logged: a message `Start pruning historical
entities` which includes the earliest and latest block, a message `Analyzed
N tables`, and a message `Finished pruning entities` with details about how
much was deleted or copied and how long that took. Pruning analyzes tables
when that seems necessary, since its estimates of how much of a table is
likely not needed are based on Postgres statistics.

### Caveats

Pruning is a user-visible operation and does affect some of the things that
can be done with a deployment:

* because it removes history, it restricts how far back time-travel
  queries can be performed. This will only be an issue for entities that
  keep lifetime statistics about some object (e.g., a token) and are used
  to produce time series: after pruning, it is only possible to produce a
  time series that goes back no more than `history_blocks`. Pruning is
  very beneficial, though, for entities that keep daily or similar
  statistics about some object, as it removes data that is not needed once
  the time period is over, and does not affect how far back time series
  based on these objects can be retrieved.
* it restricts how far back a graft can be performed. Because pruning
  removes history, it becomes impossible to graft more than
  `history_blocks` before the current deployment head.

> **Contributor:** Maybe worth linking to the time-travel docs page?
>
> **Author:** Just looked at the time-travel doc, and it's super low-level
> about how rows in the db are manipulated. Seems we're missing more of a
> user-level explanation of it.
43 changes: 22 additions & 21 deletions graph/src/components/store/mod.rs
@@ -1208,7 +1208,7 @@ pub enum PrunePhase {
impl PrunePhase {
pub fn strategy(&self) -> PruningStrategy {
match self {
- PrunePhase::CopyFinal | PrunePhase::CopyNonfinal => PruningStrategy::Copy,
+ PrunePhase::CopyFinal | PrunePhase::CopyNonfinal => PruningStrategy::Rebuild,
PrunePhase::Delete => PruningStrategy::Delete,
}
}
@@ -1247,9 +1247,9 @@ pub trait PruneReporter: Send + 'static {
/// Select how pruning should be done
#[derive(Clone, Copy, Debug, Display, PartialEq)]
pub enum PruningStrategy {
- /// Copy the data we want to keep to new tables and swap them out for
- /// the existing tables
- Copy,
+ /// Rebuild by copying the data we want to keep to new tables and swap
+ /// them out for the existing tables
+ Rebuild,
/// Delete unneeded data from the existing tables
Delete,
}
@@ -1270,12 +1270,12 @@ pub struct PruneRequest {
pub final_block: BlockNumber,
/// The latest block, i.e., the subgraph head
pub latest_block: BlockNumber,
- /// Use the copy strategy when removing more than this fraction of
- /// history. Initialized from `ENV_VARS.store.copy_threshold`, but can
- /// be modified after construction
- pub copy_threshold: f64,
+ /// Use the rebuild strategy when removing more than this fraction of
+ /// history. Initialized from `ENV_VARS.store.rebuild_threshold`, but
+ /// can be modified after construction
+ pub rebuild_threshold: f64,
/// Use the delete strategy when removing more than this fraction of
- /// history but less than `copy_threshold`. Initialized from
+ /// history but less than `rebuild_threshold`. Initialized from
/// `ENV_VARS.store.delete_threshold`, but can be modified after
/// construction
pub delete_threshold: f64,
@@ -1293,11 +1293,11 @@ impl PruneRequest {
first_block: BlockNumber,
latest_block: BlockNumber,
) -> Result<Self, StoreError> {
- let copy_threshold = ENV_VARS.store.copy_threshold;
+ let rebuild_threshold = ENV_VARS.store.rebuild_threshold;
let delete_threshold = ENV_VARS.store.delete_threshold;
- if copy_threshold < 0.0 || copy_threshold > 1.0 {
+ if rebuild_threshold < 0.0 || rebuild_threshold > 1.0 {
return Err(constraint_violation!(
- "the copy threshold must be between 0 and 1 but is {copy_threshold}"
+ "the rebuild threshold must be between 0 and 1 but is {rebuild_threshold}"
));
}
if delete_threshold < 0.0 || delete_threshold > 1.0 {
@@ -1331,19 +1331,20 @@ impl PruneRequest {
earliest_block,
final_block,
latest_block,
- copy_threshold,
+ rebuild_threshold,
delete_threshold,
})
}

/// Determine what strategy to use for pruning
///
- /// We are pruning `history_pct` of the blocks from a table that has a ratio
- /// of `version_ratio` entities to versions. If we are removing more than
- /// `copy_threshold` percent of the versions, we prune by copying, and if we
- /// are removing more than `delete_threshold` percent of the versions, we
- /// prune by deleting. If we would remove less than `delete_threshold`
- /// percent of the versions, we don't prune.
+ /// We are pruning `history_pct` of the blocks from a table that has a
+ /// ratio of `version_ratio` entities to versions. If we are removing
+ /// more than `rebuild_threshold` percent of the versions, we prune by
+ /// rebuilding, and if we are removing more than `delete_threshold`
+ /// percent of the versions, we prune by deleting. If we would remove
+ /// less than `delete_threshold` percent of the versions, we don't
+ /// prune.
pub fn strategy(&self, stats: &VersionStats) -> Option<PruningStrategy> {
// If the deployment doesn't have enough history to cover the reorg
// threshold, do not prune
Expand All @@ -1356,8 +1357,8 @@ impl PruneRequest {
// that `history_pct` will tell us how much of that data pruning
// will remove.
let removal_ratio = self.history_pct(stats) * (1.0 - stats.ratio);
- if removal_ratio >= self.copy_threshold {
- Some(PruningStrategy::Copy)
+ if removal_ratio >= self.rebuild_threshold {
+ Some(PruningStrategy::Rebuild)
} else if removal_ratio >= self.delete_threshold {
Some(PruningStrategy::Delete)
} else {
14 changes: 7 additions & 7 deletions graph/src/env/store.rs
@@ -85,11 +85,11 @@ pub struct EnvVarsStore {
pub batch_target_duration: Duration,

/// Prune tables where we will remove at least this fraction of entity
- /// versions by copying. Set by `GRAPH_STORE_HISTORY_COPY_THRESHOLD`.
- /// The default is 0.5
- pub copy_threshold: f64,
+ /// versions by rebuilding the table. Set by
+ /// `GRAPH_STORE_HISTORY_REBUILD_THRESHOLD`. The default is 0.5
+ pub rebuild_threshold: f64,
/// Prune tables where we will remove at least this fraction of entity
- /// versions, but fewer than `copy_threshold`, by deleting. Set by
+ /// versions, but fewer than `rebuild_threshold`, by deleting. Set by
/// `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`. The default is 0.05
pub delete_threshold: f64,
/// How much history a subgraph with limited history can accumulate
@@ -134,7 +134,7 @@ impl From<InnerStore> for EnvVarsStore {
connection_idle_timeout: Duration::from_secs(x.connection_idle_timeout_in_secs),
write_queue_size: x.write_queue_size,
batch_target_duration: Duration::from_secs(x.batch_target_duration_in_secs),
- copy_threshold: x.copy_threshold.0,
+ rebuild_threshold: x.rebuild_threshold.0,
delete_threshold: x.delete_threshold.0,
history_slack_factor: x.history_slack_factor.0,
}
@@ -180,8 +180,8 @@ pub struct InnerStore {
write_queue_size: usize,
#[envconfig(from = "GRAPH_STORE_BATCH_TARGET_DURATION", default = "180")]
batch_target_duration_in_secs: u64,
#[envconfig(from = "GRAPH_STORE_HISTORY_COPY_THRESHOLD", default = "0.5")]
copy_threshold: ZeroToOneF64,
#[envconfig(from = "GRAPH_STORE_HISTORY_REBUILD_THRESHOLD", default = "0.5")]
rebuild_threshold: ZeroToOneF64,
#[envconfig(from = "GRAPH_STORE_HISTORY_DELETE_THRESHOLD", default = "0.05")]
delete_threshold: ZeroToOneF64,
#[envconfig(from = "GRAPH_STORE_HISTORY_SLACK_FACTOR", default = "1.2")]
12 changes: 6 additions & 6 deletions node/src/bin/manager.rs
@@ -253,12 +253,12 @@ pub enum Command {
Prune {
/// The deployment to prune (see `help info`)
deployment: DeploymentSearch,
- /// Prune by copying when removing more than this fraction of
- /// history. Defaults to GRAPH_STORE_HISTORY_COPY_THRESHOLD
+ /// Prune by rebuilding tables when removing more than this fraction
+ /// of history. Defaults to GRAPH_STORE_HISTORY_REBUILD_THRESHOLD
#[clap(long, short)]
- copy_threshold: Option<f64>,
+ rebuild_threshold: Option<f64>,
/// Prune by deleting when removing more than this fraction of
- /// history but less than copy_threshold. Defaults to
+ /// history but less than rebuild_threshold. Defaults to
/// GRAPH_STORE_HISTORY_DELETE_THRESHOLD
#[clap(long, short)]
delete_threshold: Option<f64>,
@@ -1390,7 +1390,7 @@ async fn main() -> anyhow::Result<()> {
Prune {
deployment,
history,
- copy_threshold,
+ rebuild_threshold,
delete_threshold,
once,
} => {
@@ -1400,7 +1400,7 @@
primary_pool,
deployment,
history,
- copy_threshold,
+ rebuild_threshold,
delete_threshold,
once,
)
6 changes: 3 additions & 3 deletions node/src/manager/commands/prune.rs
@@ -161,7 +161,7 @@ pub async fn run(
primary_pool: ConnectionPool,
search: DeploymentSearch,
history: usize,
- copy_threshold: Option<f64>,
+ rebuild_threshold: Option<f64>,
delete_threshold: Option<f64>,
once: bool,
) -> Result<(), anyhow::Error> {
@@ -198,8 +198,8 @@
status.earliest_block_number,
latest,
)?;
- if let Some(copy_threshold) = copy_threshold {
- req.copy_threshold = copy_threshold;
+ if let Some(rebuild_threshold) = rebuild_threshold {
+ req.rebuild_threshold = rebuild_threshold;
}
if let Some(delete_threshold) = delete_threshold {
req.delete_threshold = delete_threshold;
8 changes: 4 additions & 4 deletions store/postgres/src/deployment_store.rs
@@ -1290,10 +1290,10 @@ impl DeploymentStore {
site: Arc<Site>,
req: PruneRequest,
) -> Result<(), StoreError> {
- let logger = logger.cheap_clone();
- retry::forever_async(&logger, "prune", move || {
+ let logger2 = logger.cheap_clone();
+ retry::forever_async(&logger2, "prune", move || {
let store = store.cheap_clone();
- let reporter = OngoingPruneReporter::new(store.logger.cheap_clone());
+ let reporter = OngoingPruneReporter::new(logger.cheap_clone());
let site = site.cheap_clone();
async move { store.prune(reporter, site, req).await.map(|_| ()) }
})
@@ -1969,7 +1969,7 @@ impl PruneReporter for OngoingPruneReporter {

fn prune_batch(&mut self, _table: &str, rows: usize, phase: PrunePhase, _finished: bool) {
match phase.strategy() {
- PruningStrategy::Copy => self.rows_copied += rows,
+ PruningStrategy::Rebuild => self.rows_copied += rows,
PruningStrategy::Delete => self.rows_deleted += rows,
}
}