From 9c5c156f87f38d646ba6106b222ae44c7f843d31 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 7 Mar 2014 12:24:36 +0530
Subject: [PATCH 01/13] Added a few lines about the motivation for building
 secondary index. Text formatted to 80-column width.
---
 secondary/docs/design/overview.md | 66 +++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 21 deletions(-)

diff --git a/secondary/docs/design/overview.md b/secondary/docs/design/overview.md
index 4ac117354..3684322b0 100644
--- a/secondary/docs/design/overview.md
+++ b/secondary/docs/design/overview.md
@@ -1,39 +1,60 @@
-##Secondary Index Design Document
+## Secondary Index Design Document

###Overview

-This document describes the High Level Design for Secondary Indexes. It also describes the deployement options supported.
+Secondary indexes provide the ability to query the couchbase key-value data
+store. In couchbase the primary data store is a key-value store, mainly backed
+by main-memory and distributed across dozens of nodes. If you are new to
+couchbase or to couchbase's secondary indexing solution, you may first want to
+learn more about the [terminology](markdown/terminology.md) that we discuss
+here.
+
+This document describes the High Level Design for Secondary Indexes. It also
+describes the deployment options supported.

###Components

- __Projector__
-  The projector is responsible for mapping mutations to a set of key version. The projector can reside within the master KV node in which the mutation is generated or it can reside in separate node. The projector receives mutations from ep-engine through UPR protocol. The projector sends the evaluated results to router. [Details.](markdown/projector.md)
+The projector is responsible for mapping mutations to a set of key versions.
+The projector can reside within the master KV node in which the mutation is
+generated, or it can reside in a separate node. The projector receives
+mutations from ep-engine through the UPR protocol. The projector sends the
+evaluated results to the router. [Details.](markdown/projector.md)

- __Router__
-  The router is responsible for sending key version to the index nodes. It relies on the index distribution/partitioning topology to determine the indexer which should receive the key version. The router resides in the same node as the projector. [Details.](markdown/router.md)
-
+The router is responsible for sending key versions to the index nodes. It
+relies on the index distribution/partitioning topology to determine the
+indexer which should receive the key version. The router resides in the same
+node as the projector. [Details.](markdown/router.md)
+
- __Index Manager__
-  The index manager is responsible for receiving requests for indexing operations (creation, deletion, maintenance, scan/lookup). The Index Manager is located in the index node, which can be different from KV node. [Details.](markdown/index_manager.md)
-
+The index manager is responsible for receiving requests for indexing
+operations (creation, deletion, maintenance, scan/lookup). The Index Manager
+is located in the index node, which can be different from the KV node.
+[Details.](markdown/index_manager.md)
+
- __Indexer__
-  The indexer processes the key versions received from router and provides persistence support for the index. Indexer also provides the interface for query client to run index Scans and does scatter/gather for queries. The indexer would reside in index node.
[Details.](markdown/indexer.md)
-
-- __Query Catalog__
+The indexer processes the key versions received from the router and provides
+persistence support for the index. The indexer also provides the interface for
+query clients to run index scans and does scatter/gather for queries. The
+indexer would reside in the index node. [Details.](markdown/indexer.md)

-  This component provides catalog implementation for the Query Server. This component resides in the same node Query Server is running and allows Query Server to perform Index DDL (Create, Drop) and Index Scan/Stats operations.
+- __Query Catalog__
+This component provides the catalog implementation for the Query Server. This
+component resides in the same node where the Query Server is running and
+allows the Query Server to perform Index DDL (Create, Drop) and Index
+Scan/Stats operations.

###System Diagram

- [KV-Index System Diagram](markdown/system.md)
- Query-Index System Diagram
-
+
###Execution Flow

* [Mutation Execution Flow](markdown/mutation.md)

@@ -52,40 +73,43 @@ This document describes the High Level Design for Secondary Indexes. It also des
 - [Deployment Options](markdown/deployment.md)

###Partition Management
-* Milestone1 will have Key-based partitioning support.
+* Milestone-1 will have Key-based partitioning support.
* [John's Doc for Partitioning](https://docs.google.com/document/d/1eF3rJ63iv1awnfLkAQLmVmILBdgD4Vzc0IsCpTxmXgY/edit)

###Communication Protocols
* Projector and Ep-Engine Protocol
-  * [UPR protocol](https://github.com/couchbaselabs/cbupr/blob/master/index.md) will be used to talk to Ep-engine in KV.
-
+  * [UPR protocol](https://github.com/couchbaselabs/cbupr/blob/master/index.md) will be used to talk to Ep-engine in KV.
+
* Router and Indexer Protocol
* Query and Indexer Protocol
  * [Existing REST Based Protocol](https://docs.google.com/document/d/1j9D4ryOi1d5CNY5EkoRuU_fc5Q3i_QwIs3zU9uObbJY/edit)

###Storage Management
-
-* Persistent Snapshot
+
+* Persistent Snapshot

###Cluster Management
+
* Master Election
* Communication with ns_server

-###Metadata Management
+### Meta-data Management
+
* Metadata Replication
* Metadata Recovery

###Recovery
-
-* [Recovery Document](https://docs.google.com/document/d/1rNJSVs80TtvY0gpoebsBwzhqWRBJnieSuLTnxuDzUTQ/edit)
+
+* [Recovery Document](https://docs.google.com/document/d/1rNJSVs80TtvY0gpoebsBwzhqWRBJnieSuLTnxuDzUTQ/edit)

###Replication
+
* Replication Strategy
* Failure Recovery

###Rebalance
+
* Rebalance Strategy
* Failure Recovery

-###Terminology
-
-* [Terminology Document](markdown/terminology.md)
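Editorial aside: the projector/router hand-off described in the overview above
can be made concrete with a small sketch. Everything below is illustrative -
`KeyVersion`, `route` and the hash scheme are assumptions for exposition, not
the project's actual types or API.

```go
package main

import "fmt"

// KeyVersion is an illustrative stand-in for the projector's output: the
// secondary key evaluated from a document mutation, tagged with enough
// metadata for a downstream indexer to order and de-duplicate it.
type KeyVersion struct {
	Bucket  string // bucket in which the mutation originated
	Vbucket uint16 // vbucket of the mutated document
	Seqno   uint64 // mutation sequence number from UPR
	DocID   string // primary key of the mutated document
	SecKey  string // evaluated secondary key (expression result)
	IndexID uint64 // index this key version belongs to
}

// route picks a partition for a key version; a real router would consult the
// index distribution/partitioning topology, here a toy hash stands in.
func route(kv KeyVersion, numPartitions int) int {
	h := 0
	for _, c := range kv.SecKey {
		h = 31*h + int(c)
	}
	if h < 0 {
		h = -h
	}
	return h % numPartitions
}

func main() {
	kv := KeyVersion{Bucket: "default", Vbucket: 7, Seqno: 42,
		DocID: "user::123", SecKey: "smith", IndexID: 1}
	fmt.Printf("key version routed to partition %d\n", route(kv, 8))
}
```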
From 7da73031e485a3a5cffd6141312d8726526de825 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 7 Mar 2014 12:25:40 +0530
Subject: [PATCH 02/13] Notes for developers. Right now it contains a
 documentation section that developers can use as guidelines to author
 secondary index documentation.
---
 secondary/docs/development.md | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 secondary/docs/development.md

diff --git a/secondary/docs/development.md b/secondary/docs/development.md
new file mode 100644
index 000000000..0137b523d
--- /dev/null
+++ b/secondary/docs/development.md
@@ -0,0 +1,27 @@
+##Documentation
+
+- All documentation is done through text markup, preferably with markdown
+  syntax.
+
+- We encourage using diagrams wherever it is relevant or necessary. A few of
+  us use yEd to draw diagrams and save them as `graphml` files. You can add
+  references to these images from within your markdown documents.
+
+- To enable spell checking in vim use this option
+  > :setlocal spell spelllang=en_us
+
+  To learn more about spell-checker shortcuts:
+  > :help spell
+
+- It is convenient to use chrome plugins like `Markdown Preview` to visually
+  check document formatting before committing your changes to git.
+  After installing it from chrome-extensions, enable "Allow access to file URLs"
+  in Extensions (menu > Tools > Extensions). Then open the file in chrome to
+  see the visual effects of markdown.
+
+- If the markdown file is stored with the `.md` file extension, it could get
+  interpreted as a Modula-2 file. To enable correct syntax highlighting, refer
+  to your editor documentation.
+
+- **Please try to use 80-column width for text documents**. Although it is not
+  a strict rule, let there be a valid reason for lines exceeding 80 columns.
From 16e4d713231c79daa45f5e0fe929eadc4e86d7cd Mon Sep 17 00:00:00 2001
From: prataprc
Date: Tue, 11 Mar 2014 10:51:27 +0530
Subject: [PATCH 03/13] IndexManager writeup.
---
 .../docs/design/markdown/index_manager.md | 109 +++++++++++++++++-
 1 file changed, 108 insertions(+), 1 deletion(-)

diff --git a/secondary/docs/design/markdown/index_manager.md b/secondary/docs/design/markdown/index_manager.md
index f157f599c..341a2311d 100644
--- a/secondary/docs/design/markdown/index_manager.md
+++ b/secondary/docs/design/markdown/index_manager.md
@@ -1 +1,108 @@
-##Index Manager
+## IndexManager
+
+IndexManager is the component that co-ordinates other components - like
+projector, query, indexer and local-indexer nodes - during bootstrap,
+rollback, reconnection, local-indexer-restart etc.
+
+Since the rest of the system depends on IndexManager for both normal operation
+and for failure recovery, restart etc., it can end up becoming a single point
+of failure within the system. To avoid this, multiple instances of
+IndexManager will be running on different nodes, where one of them will be
+elected as master and other will act as replica to the current master.
+
+**Master election**: we expect nsServer to elect a new master during system
+bootstrap, and whenever there is a master crash. More specifically, we expect
+nsServer to,
+
+  * prepare an active list of IndexManagers
+  * pick one of the instances as master
+  * post the new master's connectionAddress to all active instances of
+    IndexManager
+  * monitor the active list of IndexManagers
+  * handle Join/Leave requests from IndexManagers either to join or leave the
+    active list
+
+Each instance of IndexManager will be modeled as a state machine backed by a
+data structure, called StateContext, that contains meta-data about the
+secondary index system and meta-data for normal operation of the IndexManager
+cluster. An instance of IndexManager can be in one of the following states.
+
+  * bootstrap
+  * master
+  * replica
+  * rollback
+
+[TODO: Since the system is still evolving we expect changes to the design of
+ IndexManager]
+
+### StateContext
+
+  * contains a CAS field that will be incremented for every mutation (insert,
+    delete, update) that happens to the StateContext.
+  * several APIs will be exposed by IndexManager to CREATE, READ, UPDATE and
+    DELETE the StateContext or portions of the StateContext.
+  * false-positive cases are possible in the present definition. It can
+    happen when a client posts an update to the StateContext which gets
+    persisted only in the master and not in any other replica, and
+    subsequently the master fails, thereby leading to a situation where the
+    client assumes an update that is not present in the system.
+  * false-negative cases are possible in the present definition. It can happen
+    when a client posts an update to the StateContext which gets persisted on
+    master/replica, but the master node fails before replying success to the
+    client, thereby leading to a situation where the client will re-post the
+    same update to the StateContext.
+
+From now on we will refer to each instance of IndexManager as a node.
+
+### bootstrap
+  * all nodes, when they start afresh, will be in bootstrap State.
+  * from bootstrap State, a node can either move to master State or replica
+    State.
+  * sometimes moving to master State can be transient, that is, if the system
+    was previously in rollback state and the rollback context was replicated
+    to other nodes, then the new master shall immediately switch to rollback
+    State and continue co-ordinating system rollback activity.
+  * while a node is in bootstrap State it will wait for the new master's
+    connectionAddress to be posted via its API.
+    * if the new master is itself, detected by comparing connectionAddress,
+      it shall move to master State.
+    * otherwise the node will move to replica State.
+
+### master
+  * when a node moves to master State, its local StateContext is restored from
+    persistent storage and used as the system-wide StateContext.
+  * StateContext can be modified only by the master node. It is expected that
+    other components within the system should somehow learn the current master
+    (maybe from nsServer) and use the master's API to modify StateContext.
+  * upon receiving a new update to StateContext
+    * master will fetch the active list of replicas from nsServer
+    * master will update its local StateContext and persist the StateContext
+      on durable media, then it will post an update to each of its replicas.
+    * if one of the replicas does not respond or responds back with failure,
+      master shall notify nsServer regarding the same
+    * master responds back to the original client that initiated the
+      update-request.
+  * in case a master crashes, it shall start again from bootstrap State.
+
+  **[false-positives and false-negatives are possible in the above scenario]**
+
+### replica
+  * when a node moves to replica State, it will first fetch the latest
+    StateContext from the current master and persist it as the local
+    StateContext.
+  * if the replica's StateContext is newer than the master's StateContext
+    (detectable by comparing the CAS value), then the latest mutations on the
+    replica will be lost.
+  * a replica can receive updates to StateContext only from the master; if it
+    receives them from other components in the system, it will respond with an
+    error.
+  * upon receiving a new update to StateContext from the master, the replica
+    will update its StateContext and persist the StateContext on durable
+    media.
+  * in case a replica is unable to communicate with the master and comes back
+    alive while the rest of the system has moved ahead, it shall go through
+    bootstrap state,
+    * due to heartbeat failures
+    * by detecting the CAS value in subsequent updates from the master
+
+### rollback
+  * only the master node can move to rollback mode.
+
+  TBD
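Editorial aside: the master-side update flow described in the patch above
(bump the CAS, persist locally, push to replicas, notify nsServer about
failures) is easy to misread, so here is a minimal Go sketch of it. The
`StateContext` shown, the `Replica` interface and the `applyUpdate`/`Push`
names are illustrative assumptions, not the project's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// StateContext carries the CAS that is bumped on every mutation, as described
// above; all other fields are elided for brevity.
type StateContext struct {
	CAS  uint64
	Meta map[string]string
}

// Replica abstracts a replica endpoint; Push is a hypothetical call.
type Replica interface {
	Push(sc StateContext) error
}

// applyUpdate mimics the master-side flow: bump the CAS, persist locally
// (stubbed out), then push to every replica and flag the ones that fail so
// that nsServer can be notified.
func applyUpdate(sc *StateContext, key, value string, replicas []Replica) {
	sc.CAS++
	sc.Meta[key] = value
	// persist(sc) would write sc to durable media here.
	for i, r := range replicas {
		if err := r.Push(*sc); err != nil {
			fmt.Printf("replica %d failed: %v (would notify nsServer)\n", i, err)
		}
	}
}

type okReplica struct{}

func (okReplica) Push(StateContext) error { return nil }

type badReplica struct{}

func (badReplica) Push(StateContext) error { return errors.New("timeout") }

func main() {
	sc := &StateContext{Meta: map[string]string{}}
	applyUpdate(sc, "index/1", "created", []Replica{okReplica{}, badReplica{}})
	fmt.Println("CAS after update:", sc.CAS)
}
```

Note that this sketch reproduces the false-positive/false-negative windows the
text warns about: the master replies only after pushing to replicas, so a
crash between the local persist and the reply leaves the client uncertain.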
From 4f004c699668a566b4dd6f56e200e35dd8de060b Mon Sep 17 00:00:00 2001
From: prataprc
Date: Tue, 11 Mar 2014 17:13:31 +0530
Subject: [PATCH 04/13] Index manager writeup.
---
 .../docs/design/markdown/index_manager.md | 329 ++++++++++++++----
 1 file changed, 252 insertions(+), 77 deletions(-)

diff --git a/secondary/docs/design/markdown/index_manager.md b/secondary/docs/design/markdown/index_manager.md
index 341a2311d..e7a41785d 100644
--- a/secondary/docs/design/markdown/index_manager.md
+++ b/secondary/docs/design/markdown/index_manager.md

## IndexManager

IndexManager is the component that co-ordinates other components - like
projector, query, indexer and local-indexer nodes - during bootstrap,
rollback, reconnection, local-indexer-restart etc.

Since the rest of the system depends on IndexManager for both normal operation
and for failure recovery, restart etc., it can end up becoming a single point
of failure within the system. To avoid this, multiple instances of
IndexManager will be running on different nodes, where one of them will be
elected as
-master and other will act as replica to the current master.
+master and others will act as replica to the current master.

From now on we will refer to each instance of IndexManager as a node.

### scope of IndexManager

1. create or drop index DDLs.
2. defining the initial index topology - the no. of indexer-nodes, partitions
   and slices for an index.
3. add, delete topics in pub-sub. subscribe, un-subscribe nodes from topics,
   optionally based on predicates.
4. provide a network interface for other components to access index-metadata,
   index topology and the publish-subscribe framework.
5. generate restart timestamps for upr-reconnection.
6. create rollback context and update rollback context based on how rollback
   evolves within the system.
7. generate stability timestamps for the purpose of query and rollback;
   co-ordinate with every indexer-node to generate stability snapshots.

### scope of ns-server

1. master-election is done by ns-server during system start and whenever the
   current master fails to respond to a `heartbeat` request.
2. ns-server will be the central component that maintains the current master,
   the list of active replicas and the list of indexer-nodes.
3. actively poll the master node, replica nodes and other local-indexer-nodes
   for their liveliness using `heartbeat` requests.
4. provide API for index-manager instances and index-nodes to join or leave
   the cluster.

**Clarification needed**

1. When a client is interfacing with a master, will there be a situation for
   ns-server to conduct a new master-election?

### client interfacing with IndexManager and typical update cycle

1. a client must first get the current master from ns-server. If it cannot get
   one, it must retry or fail.
2. once the network address of the master is obtained from ns-server, the
   client can post an update request to the master.
3. master should get the current list of replicas from ns-server.
   * if ns-server is unreachable or doesn't respond, return an error to the
     client.
4. master updates its local StateContext.
5. master synchronously replicates its local StateContext on all replicas. If
   one of the replicas is not responding, skip that replica.
6. master notifies ns-server about non-responding replicas. If ns-server is
   unreachable, ignore.
7. if the master responds with anything other than SUCCESS to the client,
   retry from step 1.
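The update cycle above is essentially a resolve-and-retry loop on the client
side. The hedged Go sketch below illustrates it; `getMaster` and `postUpdate`
are hypothetical stand-ins for the ns-server master lookup and the master's
update API, and the simulated failure exists only to exercise the retry.

```go
package main

import (
	"errors"
	"fmt"
)

var calls int

// getMaster and postUpdate are hypothetical stand-ins for the ns-server
// master lookup and the master's update API.
func getMaster() (string, error) { return "indexer-1:9100", nil }

func postUpdate(master, payload string) error {
	calls++
	if calls < 2 {
		return errors.New("master not ready") // simulate one failed attempt
	}
	return nil
}

// updateWithRetry follows the cycle above: resolve the current master from
// ns-server, post the update, and restart from step 1 on any non-SUCCESS
// response.
func updateWithRetry(payload string, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		master, err := getMaster()
		if err != nil {
			continue // cannot resolve a master: retry or eventually fail
		}
		if err := postUpdate(master, payload); err == nil {
			fmt.Printf("update applied on attempt %d\n", attempt)
			return nil
		}
	}
	return errors.New("update failed after retries")
}

func main() { updateWithRetry(`{"command":"CreateIndex"}`, 3) }
```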
### master election

We expect ns-server to elect a new master,

* during system bootstrap.
* during a master crash (when the master fails to respond to heartbeat
  requests).
* when a master voluntarily leaves the cluster.

election process,

1. ns-server will maintain an election-term number which shall be incremented
   before every master-election.
2. if ns-server cannot reach the master, then it can retire the master as a
   failed node, increment the election-term, and start the master election
   process.
3. ns-server polls for active replicas.
4. pick one of the replicas as the master node.
5. post a `bootstrap` request to each replica. If a replica is already in
   `bootstrap` state, it will do nothing.
   * an active replica shall cancel outstanding updates happening to its
     replicated data structure, move to bootstrap state, and wait for the new
     master notification.
6. post the new master's connectionAddress to all active instances of
   IndexManager, along with the current election-term number.
7. the current election-term number will be preserved by all instances of
   IndexManager and used in all request/response across the cluster.
8. in case of unexpected behavior like,
   * an active replica not responding
   * a `bootstrap` request not responding
   * a `newmaster` request not responding

   ns-server will restart from step 2.
9. once an election is complete, with master and replicas identified,
   ns-server acts on join and leave requests from IndexManager nodes and
   indexer-nodes.

### failed, orphaned and outdated IndexManager

1. when an instance of IndexManager fails, it shall be restarted by ns-server,
   join the cluster, and enter bootstrap state.
2. when a master IndexManager becomes unreachable to ns-server, it shall
   restart itself by joining the cluster after a timeout period, and enter
   bootstrap state.
3. whenever an IndexManager instance receives a request or response whose
   current-election-term is higher than the local value, it will restart
   itself, join the cluster and enter bootstrap state.

### false-positive

A false positive is a scenario where the client thinks that an update request
succeeded but the system is not yet updated. This can happen when the master
node is reachable to the client but not reachable to its replicas and
ns-server, leading to a master-election, and ns-server elects a new master
that has not received the updates from the old master.

### false-negative

A false-negative is a scenario where the client thinks that an update request
has failed but the system has applied the update to the StateContext. This can
happen when a client posts an update to the StateContext which gets persisted
on master/replica, but the master node fails before replying success to the
client, thereby leading to a situation where the client will re-post the same
update to the new master.
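The election-term guard in the election process above (and in item 3 of the
failed/orphaned list) can be made concrete. This is a minimal, editorial
sketch under assumed names - `node`, `checkTerm` and the string states are
illustrative only.

```go
package main

import "fmt"

// node holds the locally preserved election term; checkTerm is an
// illustrative guard for the rule above: any request or response carrying a
// higher election term forces the node back through bootstrap.
type node struct {
	term  uint64
	state string
}

func (n *node) checkTerm(incoming uint64) {
	if incoming > n.term {
		n.term = incoming
		n.state = "bootstrap" // rejoin the cluster and re-sync StateContext
	}
}

func main() {
	n := &node{term: 3, state: "replica"}
	n.checkTerm(5)
	fmt.Println(n.state, n.term) // bootstrap 5
}
```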
## IndexManager design

Each instance of IndexManager will be modeled as a state machine backed by a
data structure, called StateContext, that contains meta-data about the
secondary index, index topology, pub-sub data and meta-data for normal
operation of the IndexManager cluster. An instance of IndexManager can be a
`master` or a `replica` operating in `bootstrap`, `normal` or `rollback`
state.

            | startup  | master   | replica
  ----------|----------|----------|---------
  bootstrap | yes      |          |
  normal    |          | yes      | yes
  rollback  |          | yes      |

**TODO: Since the system is still evolving we expect changes to the design of
IndexManager**
**TODO: Convert above table to state diagram**
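The role/state matrix in the table above can be encoded directly, which also
makes the constraints testable. The sketch below is an editorial illustration;
`Role`, `State` and `valid` are assumed names, not part of the design.

```go
package main

import "fmt"

// Role and State mirror the table above.
type Role int
type State int

const (
	Master Role = iota
	Replica
)

const (
	Bootstrap State = iota
	Normal
	Rollback
)

// valid encodes the table: every node passes through bootstrap at startup,
// normal applies to both roles, and rollback applies only to the master.
func valid(r Role, s State) bool {
	switch s {
	case Bootstrap:
		return true // startup, before a role is assigned
	case Normal:
		return r == Master || r == Replica
	case Rollback:
		return r == Master
	}
	return false
}

func main() {
	fmt.Println(valid(Master, Rollback), valid(Replica, Rollback)) // true false
}
```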
### StateContext

* contains a CAS field that will be incremented for every mutation (insert,
  delete, update) that happens to the StateContext.
* several APIs will be exposed by IndexManager to CREATE, READ, UPDATE and
  DELETE the StateContext or portions of the StateContext.

### bootstrap

* all nodes, when they start afresh, will be in bootstrap State.
* from bootstrap State, a node can either move to master State or replica
  State.
* sometimes moving to master State can be transient, that is, if the system
  was previously in rollback state and the rollback context was replicated to
  other nodes, then the new master shall immediately switch to rollback State
  and continue co-ordinating system rollback activity.
* while a node is in bootstrap State it will wait for the new master's
  connectionAddress to be posted via its API.
* if the connectionAddress is the same as the IndexManager instance, it will
  move to master State.
* otherwise the node will move to replica State.

### master

* when a node moves to master State, its local StateContext is restored from
  persistent storage and used as the system-wide StateContext.
* StateContext can be modified only by the master node. It is expected that
  other components within the system should somehow learn the current master
  (maybe from ns-server) and use the master's API to modify StateContext.
* upon receiving a new update to StateContext
  * master will fetch the active list of replicas from ns-server
  * master will update its local StateContext and persist the StateContext
    on durable media, then it will post an update to each of its replicas.
  * if one of the replicas does not respond or responds back with failure,
    master shall notify ns-server regarding the same
  * master responds back to the original client that initiated the
    update-request.
* in case a master crashes, it shall start again from bootstrap State.

### replica

* when a node moves to replica State, it will first fetch the latest
  StateContext from the current master and persist it as the local
  StateContext.
* if the replica's StateContext is newer than the master's StateContext
  (detectable by comparing the CAS value), then the latest mutations on the
  replica will be lost.
* a replica can receive updates to StateContext only from the master; if it
  receives them from other components in the system, it will respond with an
  error.
* upon receiving a new update to StateContext from the master, the replica
  will update its StateContext and persist the StateContext on durable media.
* in case a replica is unable to communicate with the master and comes back
  alive while the rest of the system has moved ahead, it shall go through
  bootstrap state,
  * due to heartbeat failures
  * by detecting the CAS value in subsequent updates from the master
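The replica rules above reduce to two checks on every incoming update: the
sender must be the master, and the CAS must move forward. A minimal sketch,
with `applyFromMaster` and its error strings as assumed names:

```go
package main

import (
	"errors"
	"fmt"
)

type StateContext struct{ CAS uint64 }

// applyFromMaster is an illustrative replica-side handler: accept a
// StateContext only from the master, reject stale CAS values, then persist
// the copy (persistence is stubbed out here).
func applyFromMaster(local *StateContext, update StateContext, fromMaster bool) error {
	if !fromMaster {
		return errors.New("updates accepted only from master")
	}
	if update.CAS <= local.CAS {
		// the local copy is newer or equal: the node is out of sync and
		// should go back through bootstrap and re-sync from the master.
		return errors.New("stale update, re-bootstrap required")
	}
	*local = update // a real replica persists this on durable media
	return nil
}

func main() {
	local := &StateContext{CAS: 5}
	fmt.Println(applyFromMaster(local, StateContext{CAS: 6}, true)) // <nil>
	fmt.Println(applyFromMaster(local, StateContext{CAS: 4}, true)) // stale
}
```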
### rollback

* only the master node can move to rollback mode.

  TBD

## Data structure

**cluster data structure**

```go
    type Cluster struct {
        masterAddr string   // connection address to master
        replicas   []string // list of connection addresses to replica nodes
        nodes      []string // list of connection addresses to indexer-nodes
    }
```

data structure is transient maintains the current state of the IndexManager
cluster

**StateContext**

```go
    type StateContext struct {
    }
```

### IndexManager APIs

API are defined and exposed by each and every IndexManager and explained with
HTTP URL and JSON arguments

#### /cluster/heartbeat
request:

    { "current_term": }

response:

    { "current_term": }

To be called by ns-server. The node will respond back with SUCCESS
irrespective of role or state.

* if a replica node does not respond back, it will be removed from the active
  list.
* if the master does not respond back, then ns-server can start a new
  master-election.

#### /cluster/bootstrap

To be called by ns-server; ns-server is expected to make this request during
master election.

* replica will cancel all outstanding updates and move to `bootstrap` state.
* master will cancel all outstanding updates and move to `bootstrap` state.

#### /cluster/newmaster
request:

    { "current_term": }

response:

    { "current_term": }

Once the master election is completed, ns-server will post the new master and
election-term to the elected master and each of its new replicas. After this,
the IndexManager node shall enter into `master` or `replica` state.

From 842374bc443a45241d4513b7175d8a911e4ebbc5 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Tue, 11 Mar 2014 17:34:26 +0530
Subject: [PATCH 05/13] index manager writeup: IndexManager APIs.
---
 .../docs/design/markdown/index_manager.md | 80 ++++++++++++++++++-
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/secondary/docs/design/markdown/index_manager.md b/secondary/docs/design/markdown/index_manager.md
index e7a41785d..25061bc79 100644
--- a/secondary/docs/design/markdown/index_manager.md
+++ b/secondary/docs/design/markdown/index_manager.md
@@ -220,6 +220,18 @@ cluster

 **StateContext**

+```go
+    type IndexInfo struct {
+        Name       string    `json:"name,omitempty"`       // name of the index
+        Uuid       string    `json:"uuid,omitempty"`       // unique id for every index
+        Using      IndexType `json:"using,omitempty"`      // indexing algorithm
+        OnExprList []string  `json:"onExprList,omitempty"` // expression list
+        Bucket     string    `json:"bucket,omitempty"`     // bucket name
+        IsPrimary  bool      `json:"isPrimary,omitempty"`
+        Exprtype   ExprType  `json:"exprType,omitempty"`
+    }
+```
+
 ```go
     type StateContext struct {
     }
 ```

@@ -261,12 +273,68 @@ request:

 response:

-    { "current_term": }
+    { "current_term": ,
+      "status": ...
+    }

Once master election is completed, ns-server will post the new master and
election-term to the elected master and each of its new replicas. After this,
the IndexManager node shall enter into `master` or `replica` state.
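Editorial aside: the cluster endpoints above follow the same shape - decode a
JSON body carrying `current_term`, act, and echo the local term back. A hedged
Go sketch of the heartbeat endpoint, as the simplest representative; the
handler wiring and port are illustrative assumptions, not the project's actual
server code.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// heartbeatMsg mirrors the request/response bodies above; the JSON field
// name follows the documents shown, everything else is an assumption.
type heartbeatMsg struct {
	CurrentTerm uint64 `json:"current_term"`
}

// heartbeatHandler illustrates the /cluster/heartbeat endpoint: decode the
// caller's term and echo the local one back, regardless of role or state.
func heartbeatHandler(localTerm func() uint64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req heartbeatMsg
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(heartbeatMsg{CurrentTerm: localTerm()})
	}
}

func main() {
	term := func() uint64 { return 7 } // the locally preserved election term
	http.Handle("/cluster/heartbeat", heartbeatHandler(term))
	http.ListenAndServe(":9101", nil) // listen address is illustrative
}
```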
+#### /cluster/index
+request:
+
+    { "current_term": ,
+      "command": CreateIndex,
+      "indexinfo": {},
+    }
+
+response:
+
+    { "current_term": ,
+      "status": ...,
+      "indexinfo": {},
+    }
+
+Creates a new index specified by the `indexinfo` field in the request body.
+The `indexinfo` property is the same as defined by the `IndexInfo` structure
+above, except that the `id` field will be generated by the master IndexManager
+and the same `indexinfo` structure will be sent back in the response.
+
+#### /cluster/index
+request:
+
+    { "current_term": ,
+      "command": DropIndex,
+      "indexid": ,
+    }
+
+response:
+
+    { "current_term": ,
+      "status": ...
+    }
+
+Drops all indexes listed in the `indexid` field.
+
+#### /cluster/index
+request:
+
+    { "current_term": ,
+      "command": ListIndex,
+      "indexids": ,
+    }
+
+response:
+
+    { "current_term": ,
+      "status": ...,
+      "indexinfo": ,
+    }
+
+Lists the index meta-data structures identified by `indexids` in the request
+body. If it is empty, list all active indexes.

### ns-server API requirements:

#### GetClusterMaster()

Returns the connection address for the current master. If no master is
currently elected, then returns an empty string.

#### Replicas()

Returns the list of connection addresses for all active replicas in the
system.

#### IndexerNodes()

Returns the list of connection addresses for all active local-indexer-nodes in
the system.

#### Join()

For a node to join the cluster.

#### Leave()

For a node to leave the cluster.

From 5a1814adf1a5eb7397900905b8d203ecdbca3c73 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Wed, 12 Mar 2014 10:59:09 +0530
Subject: [PATCH 06/13] Few thoughts on indexer.md
---
 secondary/docs/design/markdown/indexer.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/secondary/docs/design/markdown/indexer.md b/secondary/docs/design/markdown/indexer.md
index 23dc3183a..68a85c96c 100644
--- a/secondary/docs/design/markdown/indexer.md
+++ b/secondary/docs/design/markdown/indexer.md
@@ -1 +1,14 @@
-##Indexer
+## Indexer
+
+Indexer is a node in the secondary-index cluster that will host secondary-index
+components like IndexManager, Index-Coordinator, Index-Coordinator-Replica and
+the Local-Indexer process.
+
+### Scope of local indexer
+
+1. run one or more instances of a storage engine to persist projected
+   key-versions.
+2. manage index partitions and slices for each index.
+3. independent restart path.
+4. manage the storage engine:
+   * control compaction frequency
+   * snapshot creation/deletion

From 076e683e0aa4733be04b0bdbf5bebd8a98561a84 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Wed, 12 Mar 2014 10:59:41 +0530
Subject: [PATCH 07/13] Updated doc with new terminology.
 Index Manager -- resides in each index node
 Index Coordinator -- only one in the set of index nodes
 Index Coordinator Replica -- backup nodes for the Index Coordinator
---
 .../docs/design/markdown/index_manager.md | 72 ++++++++++++-------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/secondary/docs/design/markdown/index_manager.md b/secondary/docs/design/markdown/index_manager.md
index 25061bc79..46db27796 100644
--- a/secondary/docs/design/markdown/index_manager.md
+++ b/secondary/docs/design/markdown/index_manager.md
@@ -10,22 +10,41 @@
 failure within the system. To avoid this, multiple instance of IndexManager
 will be running on different nodes, where one of them will be elected as
 master and others will act as replica to the current master.

+During system execution an instance of _IndexManager_ will reside in each index
+node. One of the instances will be picked as master, called _Index-Coordinator_,
+that will assume responsibilities to co-ordinate other components.
Remaining +instances will be called as _Index-Coordinator-Replica_ which will act as a +replica for master to avoid single point of failure. + From now on we will refer to each instance of IndexManager as a node. ### scope of IndexManager -1. create or drop index DDLs. -2. defining initial index topology, no. of indexer-nodes, partitions and - slices for an index. -3. add, delete topics in pub-sub. subscribe, un-subscribe nodes from topics, - optionally based on predicates. -4. provide network interface for other components to access index-metadata, - index topology, publish-subscribe framework. -5. generate restart timestamp for upr-reconnection. -6. create rollback context and update rollback context based on how rollback - evolves withing the system. -7. generate stability timestamps for purpose of query and rollback. co-ordinate - with every indexer-node to generate stability snapshots. +1. handle scan and query request from client SDKs and N1QL clients. +2. co-operate with Index-Coordinator to generate stable scan. + +### scope of Index-Coordinator + +1. process index DDLs. +2. manage distribution topology, partitions and slices across the cluster. +3. broadcast topology updates across the cluster. +4. index replication and re-balance. +5. add, delete topics in pub-sub. subscribe, un-subscribe nodes from topics, + optionally based on predicates. +6. provide network API for other components to access index-metadata, + index topology and publish-subscribe framework. +7. generate restart timestamp for upr-reconnection. +8. negotiation with UPR producer for failover-log and restart sequence number. +9. create rollback context and update rollback context based on how rollback + evolves within the system. +10. generate stability timestamps for purpose of query and rollback. co-ordinate + with every indexer-node to generate stability snapshots. + +### scope of Index-Coordinator-Replica + +1. to maintain a replica of Index-Coordinator state. +2. to take part in master election. +3. scope of IndexManager is applicable to Index-Coordinator replica as well. ### scope of ns-server @@ -44,13 +63,14 @@ From now on we will refer to each instance of IndexManager as a node. 1. When a client is interfacing with a master, will there be a situation for ns-server to conduct a new master-election ? -### client interfacing with IndexManager and typical update cycle +### client interfacing with Index-Coordinator and typical update cycle 1. a client must first get current master from ns-server. If it cannot get one, it must retry or fail. -2. once network address of master is obtained from ns-server client can post +2. once network address of master is obtained from ns-server, client can post update request to master. -3. master should get the current list of replica from ns-server. +3. master should get the current list of Index-Coordinator-Replica from + ns-server. * if ns-server is unreachable or doesn't respond, return error to client. 4. master updates its local StateContext. 5. synchronously replicate its local StateContext on all replicas. If one of @@ -97,9 +117,9 @@ election process, ### failed, orphaned and outdated IndexManager -1. when an instance of IndexManager fails, it shall be restarted by ns-server, - join the cluster, enter bootstrap state. -2. when a master IndexManager becomes unreachable to ns-server, it shall +1. when an IndexManager fails, it shall be restarted by ns-server, join the + cluster, enter bootstrap state. +2. 
when a Index-Coordinator becomes unreachable to ns-server, it shall restart itself, by joining the cluster after a timeout period and enter bootstrap state. 3. when ever IndexManager instance receives a request or response who's @@ -145,7 +165,7 @@ IndexManager** * contains a CAS field that will be incremented for every mutation (insert, delete, update) that happen to the StateContext. -* several API will be exposed by IndexManager to CREATE, READ, UPDATE and +* several API will be exposed by Index-Coordinator to CREATE, READ, UPDATE and DELETE StateContext or portion of StateContext. ### bootstrap @@ -215,8 +235,8 @@ IndexManager** } ``` -data structure is transient maintains the current state of the IndexManager -cluster +data structure is transient and maintains the current state of the +secondary-index cluster **StateContext** @@ -239,9 +259,6 @@ cluster ### IndexManager APIs -API are defined and exposed by each and every IndexManager and explained with -HTTP URL and JSON arguments - #### /cluster/heartbeat request: @@ -277,10 +294,13 @@ response: "status": ... } + Once master election is completed ns-server will post the new master and election-term to the elected master and each of its new replica. After this, IndexManager node shall enter into `master` or `replica` state. +### Index-Coordinator APIs + #### /cluster/index request: @@ -357,3 +377,7 @@ For a node to join the cluster #### Leave() For a node to leave the cluster + +Q: +1) When do the index manager needs to subscribe/unsubcribe directly? + Should the router handles the change the subscriber based on the topology? From 80f8b4a2236c21b54e0ef43c9d0a4846a73ff6f5 Mon Sep 17 00:00:00 2001 From: prataprc Date: Fri, 14 Mar 2014 10:06:17 +0530 Subject: [PATCH 08/13] Documentation * Refactored index_manager documentation. * created projector and pubsub documentation. --- .../docs/design/markdown/index_manager.md | 387 ++------------ .../design/markdown/index_manager_design.md | 490 ++++++++++++++++++ secondary/docs/design/markdown/indexer.md | 5 +- secondary/docs/design/markdown/projector.md | 61 ++- secondary/docs/design/markdown/pubsub.md | 7 + 5 files changed, 604 insertions(+), 346 deletions(-) create mode 100644 secondary/docs/design/markdown/index_manager_design.md create mode 100644 secondary/docs/design/markdown/pubsub.md diff --git a/secondary/docs/design/markdown/index_manager.md b/secondary/docs/design/markdown/index_manager.md index 46db27796..39bc838af 100644 --- a/secondary/docs/design/markdown/index_manager.md +++ b/secondary/docs/design/markdown/index_manager.md @@ -1,50 +1,49 @@ ## IndexManager IndexManager is the component that co-ordinate other components - like -projector, query, indexer and local-indexer nodes during bootstrap, rollback, -reconnection, local-indexer-restart etc. +projector, query, indexer nodes during bootstrap, rollback, reconnection, +local-indexer-restart etc. Since rest of the system depends on IndexManager for both normal operation and for failure recovery, restart etc. it can end up becoming single-point-of failure within the system. To avoid this, multiple instance of IndexManager will be running on different nodes, where one of them will be elected as -master and others will act as replica to the current master. +master, called Index-Coordinator, and others will act as replica, called +Index-Coordinator-Replica, to the current master. -During system execution an instance of _IndexManager_ will reside in each index -node. 
One of the instance will be picked as master, called _Index-Coordinator_, -that will assume responsibilities to co-ordinate other components. Remaining -instances will be called as _Index-Coordinator-Replica_ which will act as a -replica for master to avoid single point of failure. +### StateContext -From now on we will refer to each instance of IndexManager as a node. +State context acts as the reference point for rest of the system. Typically it +contains fields to manage index DDLs, topology, event-publisher etc. +Several API will be exposed by Index-Coordinator to update StateContext or +portion of StateContext. ### scope of IndexManager 1. handle scan and query request from client SDKs and N1QL clients. 2. co-operate with Index-Coordinator to generate stable scan. +3. co-operate with ns-server for master election and master notification. ### scope of Index-Coordinator -1. process index DDLs. -2. manage distribution topology, partitions and slices across the cluster. -3. broadcast topology updates across the cluster. -4. index replication and re-balance. -5. add, delete topics in pub-sub. subscribe, un-subscribe nodes from topics, - optionally based on predicates. -6. provide network API for other components to access index-metadata, - index topology and publish-subscribe framework. -7. generate restart timestamp for upr-reconnection. -8. negotiation with UPR producer for failover-log and restart sequence number. -9. create rollback context and update rollback context based on how rollback - evolves within the system. -10. generate stability timestamps for purpose of query and rollback. co-ordinate - with every indexer-node to generate stability snapshots. - -### scope of Index-Coordinator-Replica - -1. to maintain a replica of Index-Coordinator state. -2. to take part in master election. -3. scope of IndexManager is applicable to Index-Coordinator replica as well. +1. save and restore StateContext from persistent storage. +2. hand-shake with local-indexer-nodes confirming the topology for each index. +3. process index DDLs. +4. for new index, generate a topology based on, + * administrator supplied configuration. + * list of local-indexer-nodes and load handled by each of them. +5. co-ordinate index re-balance. +6. generate and publish persistence timestamps to local-indexer-nodes. + * maintain a history of persistence timestamps for each bucket. +7. replicate changes in StateContext to other Index-Coordinator-Replica. +8. add, delete topics in pub-sub. subscribe, un-subscribe nodes from topics, + optionally based on predicates. +9. provide network API to other components to access index-metadata, + index topology and publish-subscribe framework. +10. generate restart timestamp for upr-reconnection. +11. negotiation with UPR producer for failover-log and restart sequence number. +12. create rollback context for kv-rollback and update rollback context based + on how rollback evolves within the system. ### scope of ns-server @@ -55,30 +54,24 @@ From now on we will refer to each instance of IndexManager as a node. list of active replica and list of indexer-nodes. 4. actively poll - master node, replica nodes and other local-indexer-nodes for its liveliness using `heartbeat` request. -5. provide API for index-manager instances and index-nodes to join or leave - the cluster. +5. 
provide API for IndexManagers and local-indexers to join or leave the cluster, + to fetch current Index-Coordinator, Index-Coordinator-Replicas and list of + indexer-nodes, and transaction API for updating StateContext. -**Clarification needed** +### a note on topology -1. When a client is interfacing with a master, will there be a situation for - ns-server to conduct a new master-election ? +A collection of local-indexer nodes take part in building and servicing +secondary index. For any given index a subset of local-indexer nodes will be +responsible for building the index, some of them acting as master and few others +acting as active replicas. -### client interfacing with Index-Coordinator and typical update cycle +Topology of any given index consist of the following elements, -1. a client must first get current master from ns-server. If it cannot get - one, it must retry or fail. -2. once network address of master is obtained from ns-server, client can post - update request to master. -3. master should get the current list of Index-Coordinator-Replica from - ns-server. - * if ns-server is unreachable or doesn't respond, return error to client. -4. master updates its local StateContext. -5. synchronously replicate its local StateContext on all replicas. If one of - the replica is not responding, skip the replica. -6. notify ns-server about not responding replicas. If ns-server is unreachable - ignore. -7. if master responds with anything other than SUCCESS to the client, retry from - step 1. +* list of indexer-nodes, aka local-indexer-nodes, hosting the index. +* index slice, where each slice will hold a subset of index. +* index partition, where each partition is hosted by a master-local-indexer and + zero or more replica-local-indexer. Each partition contains a collection of + one or more slices. ### master election @@ -88,296 +81,6 @@ We expect ns-server to elect a new master, * during a master crash (when master fails to respond for heartbeat request). * when a master voluntarily leaves the cluster. -election process, - -1. ns-server will maintain a election-term number which shall be incremented - before every master-election. -2. if ns-server cannot reach master, then it can retire master as failed node, - increment election-term, and start master election process -3. ns-server polls for active replica. -3. pick one of the replica as master node. -4. post `bootstrap` request to each replica. If a replica is already in - `bootstrap` state, it will do nothing. - * an active replica shall cancel outstanding updates happening to its - replicated data structure, move to bootstrap state, and wait for new - master notification. -6. post new master's connectionAddress to all active instance of IndexManagers - along with it the current election-term number. -7. current election-term number will be preserved by all instances of - IndexManager and used in all request/response across the cluster. -8. in case of unexpected behavior like, - * an active replica not responding - * `bootstrap` request is not responding - * `newmaster` request not responding - - ns-server will restart from step-2. -9. once an election is complete, with master and replicas identified, - ns-server act on join and leave request from IndexManager nodes and - indexer-nodes. - -### failed, orphaned and outdated IndexManager - -1. when an IndexManager fails, it shall be restarted by ns-server, join the - cluster, enter bootstrap state. -2. 
when a Index-Coordinator becomes unreachable to ns-server, it shall - restart itself, by joining the cluster after a timeout period and enter - bootstrap state. -3. when ever IndexManager instance receives a request or response who's - current-election-term is higher than the local value, it will restart - itself, join the cluster and enter bootstrap state. - -### false-positive - -false positive is a scenario when client thinks that an update request -succeeded but the system is not yet updated. This can happen when -master node is reachable to client but not reachable to its replicas and -ns-server, leading to a master-election, and ns-server elects a new-master that -has not received the updates from old-master. - -### false-negative - -false-negative is a scenario when client thinks that an update request has -failed but the system has applied the update into StateContext. This can happen -when client post an update to the StateContext which get persisted on -master/replica, but before replying success to client - master node fails, -there by leading to a situation where the client will re-post the same -update to new master. - -## IndexManager design - -Each instance of IndexManager will be modeled as a state machine backed by a -data structure, called StateContext, that contains meta-data about the secondary -index, index topology, pub-sub data and meta-data for normal operation of -IndexManager cluster. An instance of IndexManager can be a `master` or a -`replica` operating in `bootstrap`, `normal` or `rollback` state - - | startup | master | replica - ----------|----------|----------|--------- - boostrap | yes | | - normal | | yes | yes - rollback | | yes | - -**TODO: Since the system is still evolving we expect changes to design of -IndexManager** -**TODO: Convert above table to state diagram** - -### StateContext - -* contains a CAS field that will be incremented for every mutation (insert, - delete, update) that happen to the StateContext. -* several API will be exposed by Index-Coordinator to CREATE, READ, UPDATE and - DELETE StateContext or portion of StateContext. - -### bootstrap - -* all nodes, when they start afresh, will be in bootstrap State. -* from bootstrap State, a node can either move to master State or replica State. -* sometimes moving to master State can be transient, that is, if system was - previously in rollback state and rollback context was replicated to other - nodes, then the new master shall immediately switch to rollback State and - continue co-ordinating system rollback activity. -* while a node is in bootstrap State it will wait for new master's - connectionAddress to be posted via its API. -* if connectionAddress same as the IndexManager instance, it will move to - master State. -* otherwise the node will move to replica State. - -### master - -* when a node move to master State, its local StateContext is restored from - persistent storage and used as system-wide StateContext. - -* StateContext can be modified only by the master node. It is expected that - other components within the system should somehow learn the current master - (maybe from ns-server) and use master's API to modify StateContext. -* upon receiving a new update to StateContext - * master will fetch the active list of replica from ns-server - * master will update its local StateContext and persist the StateContext - on durable media, then it will post an update to each of its replica. 
- * if one of the replica does not respond or respond back with failure, - master shall notify ns-server regarding the same - * master responds back to the original client that initiated the - update-request. -* in case a master crashes, it shall start again from bootstrap State. - -### replica - -* when a node move to replica State, it will first fetch the latest - StateContext from the current master and persist as the local StateContext. -* if replica's StateContext is newer than the master's StateContext - (detectable by comparing the CAS value), then latest mutations on the - replica will be lost. -* replica can receive updates to StateContext only from master, if it receives - from other components in the system, it will respond with error. -* upon receiving a new update to StateContext from master, replica will - update its StateContext and persist the StateContext on durable media. -* in case if replica is unable to communicate with the master and comes back - alive while rest of the system have moved ahead, it shall go through - bootstrap state, - * due to hearbeat failures - * by detecting the CAS value in subsequent updates from master - -### rollback - - * only master node can move to rollback mode. - - TBD - -## Data structure - -**cluster data structure** - -```go - type Cluster struct { - masterAddr string // connection address to master - replicas []string // list of connection address to replica nodes - nodes []string // list of connection address to indexer-nodes - } -``` - -data structure is transient and maintains the current state of the -secondary-index cluster - -**StateContext** - -```go - type IndexInfo { - Name string `json:"name,omitempty"` // Name of the index - Uuid string `json:"uuid,omitempty"` // unique id for every index - Using IndexType `json:"using,omitempty"` // indexing algorithm - OnExprList []string `json:"onExprList,omitempty"` // expression list - Bucket string `json:"bucket,omitempty"` // bucket name - IsPrimary bool `json:"isPrimary,omitempty"` - Exprtype ExprType `json:"exprType,omitempty"` - } -``` - -```go - type StateContext struct { - } -``` - -### IndexManager APIs - -#### /cluster/heartbeat -request: - - { "current_term": } - -response: - - { "current_term": } - -To be called by ns-server. Node will respond back with SUCCESS irrespective of -role or state. - -* if replica node does not respond back, it will be removed from active list. -* if master does not respond back, then ns-server can start a new - master-election. - -#### /cluster/bootstrap - -To be called by ns-server, ns-server is expected to make this request during -master election. - -* replica will cancel all outstanding updates and move to `bootstrap` state. -* master will cancel all outstanding updates and move to `bootstrap` state. - -#### /cluster/newmaster -request: - - { "current_term": } - -response: - - { "current_term": - "status": ... - } - - -Once master election is completed ns-server will post the new master and -election-term to the elected master and each of its new replica. After this, -IndexManager node shall enter into `master` or `replica` state. - -### Index-Coordinator APIs - -#### /cluster/index -request: - - { "current_term": , - "command": CreateIndex, - "indexinfo": {}, - } - -response: - - { "current_term": - "status": ... - "indexinfo": {}, - } - -Create a new index specified by `indexinfo` field in request body. 
`indexinfo` -property is same as defined by the `IndexInfo` structure above, except that -`id` field will be generated by the master IndexManager and the same -`indexinfo` structure will be sent back as response. - -#### /cluster/index -request: - - { "current_term": , - "command": DropIndex, - "indexid": , - } - -response: - - { "current_term": - "status": ... - } - -Drop all index listed in `indexid` field. - -#### /cluster/index -request: - - { "current_term": , - "command": ListIndex, - "indexids": , - } - -response: - - { "current_term": , - "status": ... - "indexinfo": , - } - -List index meta-data structures identified by `indexids` in request body. If -it is empty, list all active indexes. - -### ns-server API requirements: - -#### GetClusterMaster() - -Returns connection address for current master. If no master is currently elected -then return empty string. - -#### Replicas() - -Returns list of connection address for all active replicas in the system. - -#### IndexerNodes() - -Returns list of connection address for all active local-indexer-nodes in the system. - -#### Join() - -For a node to join the cluster - -#### Leave() - -For a node to leave the cluster - -Q: -1) When do the index manager needs to subscribe/unsubcribe directly? - Should the router handles the change the subscriber based on the topology? +after a new master is elected, ns-server should post a bootstrap request to +each IndexManager. There after IndexManager can fetch the current master from +ns-server and become an Index-Coordinator or Index-Coordinator-Replica. diff --git a/secondary/docs/design/markdown/index_manager_design.md b/secondary/docs/design/markdown/index_manager_design.md new file mode 100644 index 000000000..332feb9df --- /dev/null +++ b/secondary/docs/design/markdown/index_manager_design.md @@ -0,0 +1,490 @@ +### failed, orphaned and outdated IndexManager + +1. when an IndexManager crashes, it shall be restarted by ns-server, join the + cluster, enter bootstrap state. +2. when a Index-Coordinator becomes unreachable to ns-server, it shall + restart itself, by joining the cluster after a timeout period and enter + bootstrap state. + +### client interfacing with Index-Coordinator + +1. a client must first get current Index-Coordinator from ns-server. If it + cannot get one, it must retry or fail. +2. once network address of Index-Coordinator is obtained from ns-server, client + can post update request to Index-Coordinator. +3. Index-Coordinator should create a new transaction in ns-server and then get + the current list of Index-Coordinator-Replicas from ns-server. + * if ns-server is unreachable or doesn't respond, return error to client. +5. Index-Coordinator goes through a 2-phase commit with its replicas and return + success or failure to client. + +### two-phase commit for co-ordination failures + +**false positive**, is a scenario when client thinks that an update request +succeeded but the system is not yet updated, due to failures. + +**false-negative**, is a scenario when client thinks that an update request has +failed but the system has applied the update into StateContext. + +Meanwhile, when system is going through a rebalance or a kv-rollback or +executing a new DDL statement, Index-Coordinator can crash. But this situation +should not put the system in in-consistent state. To achieve this, we propose a +solution with the help of ns-server. + +* Index-Coordinator and its replica shall maintain a persisted copy of its + local StateContext. +* ns-server will act as commit-point-site. 
+* for every update, Index-Coordinator will increment the CAS and create a
+  transaction tuple {CAS, status} with ns-server.
+* status will be "initial" when it is created and later move to either
+  "commit" or "rollback".
+* for every update accepted by Index-Coordinator it shall create a new copy of
+  StateContext, persist the new copy locally as a log and push the copy to its
+  replicas.
+  * if one of the replicas fails, Index-Coordinator will post "rollback"
+    status to ns-server for this transaction and issue a rollback to each of
+    the replicas.
+  * upon rollback, Index-Coordinator and its replicas will delete the local
+    StateContext log and return failure.
+* when all the replicas accept the new copy of StateContext and persist it
+  locally as a log, Index-Coordinator will post "commit" status to ns-server
+  for this transaction and issue a commit to each of the replicas.
+* during commit, the local log will be switched in as the new StateContext,
+  and the log file will be deleted.
+
+whenever a master or replica fails it will always go through a bootstrap
+phase, during which it shall detect a log of StateContext and consult
+ns-server for the corresponding transaction's status. If the status is
+"commit" it will switch the log in as the new StateContext; otherwise it
+shall delete the log and retain the current StateContext.
+
+* once ns-server creates a new transaction entry, it should not accept a new
+  IndexManager into the cluster until the transaction moves to "commit" or
+  "rollback" status.
+* during a master election ns-server shall pick a master only from the set of
+  IndexManagers that took part in the last transaction.
+* ns-server should maintain a rolling log of transaction status.
+## IndexManager design
+
+Each instance of IndexManager will be modeled as a state machine backed by a
+data structure, called StateContext, that contains meta-data about the secondary
+index, index topology, pub-sub data and meta-data for normal operation of
+IndexManager cluster.
+
+```
+
+                    *--------------*
+                    | IndexManager |
+                    *--------------*
+                           |
+                           | (program start, new-master-elected)
+                           V
+        elected      *-----------*
+        master *-----| bootstrap |-----* (not a master)
+               |     *-----------*     |
+               V                       V
+      *---------------*          *-----------*
+      | IMCoordinator |          | IMReplica |
+      *---------------*          *-----------*
+```
+
+**TODO: Replace it with a pictorial diagram**
+
+**main data structure that backs Index-Coordinator**
+
+```go
+    type Timestamp []uint64 // timestamp vector for vbuckets
+
+    type Node struct {
+        name           string
+        connectionAddr string
+        role           []string // list of roles performed by the node.
+    }
+
+    type IndexInfo struct {
+        Name       string    // Name of the index
+        Uuid       uint64    // unique id for every index
+        Using      IndexType // indexing algorithm
+        OnExprList []string  // expression list
+        Bucket     string    // bucket name
+        IsPrimary  bool
+        Exprtype   ExprType
+        Active     bool // whether index topology is created and index is ready
+    }
+
+    type IndexTopology struct {
+        indexinfo     IndexInfo
+        numPartitions int
+        servers       []Node // servers hosting this index
+        // server can be master or replica, where, len(partitionMap) == numPartitions
+        // first integer-value in each map will index to master-server in
+        // `servers` field, and remaining elements will point to replica-servers
+        partitionMap [][MAX_REPLICAS+1]int
+        sliceMap     [MAX_SLICES][MAX_REPLICAS+1]int
+    }
+
+    type StateContext struct {
+        // value gets incremented after every update.
+        cas uint64
+
+        // maximum number of persistence timestamps to maintain.
+        ptsMaxHistory int
+
+        // per bucket timestamp continuously updated by sync messages, and
+        // promoted to persistence timestamp.
+        currentTimestamp map[string]Timestamp
+
+        // promoted list of persistence timestamps for each bucket.
+        persistenceTimestamps map[string][]PersistenceTimestamp
+
+        // index info for every created index.
+        indexInfos map[uint64]IndexInfo
+
+        // per index map of index topology.
+        indexesTopology map[string]IndexTopology
+
+        // per index map of ongoing rebalance.
+        rebalances map[string]Rebalance
+
+        // list of projectors.
+        projectors []Node
+
+        // list of routers.
+        routers []Node
+    }
+```
+
+### bootstrap state for IndexManager
+
+* all IndexManagers, when they start afresh, will be in bootstrap State.
+* from bootstrap State, IndexManager can either become Index-Coordinator or
+  Index-Coordinator-Replica.
+* while a node is in bootstrap State it will wait for the Index-Coordinator's
+  connectionAddress from ns-server, or poll ns-server for it.
+* if the connectionAddress is that of the IndexManager instance itself, it
+  will become the new Index-Coordinator.
+* otherwise IndexManager will become a replica to the Index-Coordinator.
+
+* **recovery from failure:** before getting the new Index-Coordinator's
+  connectionAddress, IndexManager will check for an uncommitted transaction by
+  looking for a StateContext log.
+  * if found, it will consult ns-server for the transaction's status.
+    * if status is "commit", it will switch the log in as the new StateContext.
+    * else discard the log.
+
+* IndexManager, before leaving bootstrap state, will restore the StateContext
+  from its local persistence.
+
+#### Index-Coordinator initialization
+
+### normal state for Index-Coordinator
+
+* the restored StateContext will be used as the master copy. Optionally,
+  Index-Coordinator can consult its replicas for the latest StateContext,
+  based on the CAS value.
+* StateContext can be modified only by the master node. It is expected that
+  other components within the system should somehow learn the current master
+  (maybe from ns-server) and use the master's API to modify StateContext.
+* upon receiving a new update to StateContext, the master will use a 2-phase
+  commit to replicate the update to its replicas.
+
+#### Index-Coordinator publishing persistence timestamp
+
+Index-Coordinator will maintain a currentTimestamp vector for each bucket in
+its StateContext, which will be updated using SYNC messages received from
+projectors (a projector will periodically send a SYNC message for every
+vbucket in a bucket). The sync message will contain the latest
+sequence-number for the vbucket.
+
+Index-Coordinator will periodically receive HWHeartbeat messages from every
+local-indexer-node. Based on HWHeartbeat metrics and/or query requirements,
+Index-Coordinator will promote the currentTimestamp into a
+persistence-timestamp and publish it to all index-nodes hosting an index for
+that bucket.
+
+If,
+  * Index-Coordinator crashes while publishing the persistence timestamp,
+  * a local-indexer-node did not receive the persistence-timestamp,
+  * a local-indexer node received the timestamp but could not create a
+    snapshot due to a crash,
+  * compaction kicks in,
+then there is no guarantee that local-indexer-nodes hosting an index will
+have identical snapshots.
+
+Whenever a local-indexer node goes through a restart, it can fetch a log of
+persistence-timestamps, up to the latest one, from Index-Coordinator.
+
+##### algorithm to compute persistence timestamp
+
+The algorithm takes the following as inputs:
+
+- per bucket HighWatermark-timestamp from each of the local-indexer-nodes.
+- available free size in each local-indexer's mutation-queue.
+
+1. For each vbucket, compute the mean seqNo.
+2. Use the mean seqNo to create a persistence timestamp.
+3. If heartbeat messages indicate that the fastest indexer's mutation queue
+   is growing rapidly, it is possible to use a seqNo that matches that fast
+   indexer more closely.
+4. If a local indexer has not sent heartbeat messages within a certain time,
+   skip that local indexer, or consult the cluster manager on the indexer's
+   availability.
+5. If the new persistence timestamp is less than or equal to the last one, do
+   nothing.
+
+(a sketch of this computation, in Go, is included after the heartbeat API
+below.)
+
+**relevant data structure**
+
+```go
+    type PersistenceTimestamp struct {
+        ts        Timestamp
+        stability bool // whether this timestamp is also treated as a
+                       // stability timestamp
+    }
+```
+
+##### /index-coordinator/hwtimestamp
+
+supported by Index-Coordinator
+
+request:
+
+    { "command": HWheartbeat,
+      "payload": <HWHeartbeat-structure>,
+    }
+
+response:
+
+    { "cas": <cas>,
+      "status": ...
+    }
+
+##### /index-coordinator/logPersistenceTimestamps
+
+supported by Index-Coordinator
+
+request:
+
+    { "command": PersistenceTimestampLog,
+      "lastPersistenceTimestamp": Timestamp,
+    }
+
+response:
+
+    { "cas": <cas>,
+      "persistenceTimestamps": []PersistenceTimestamp,
+      "status": ...
+    }
+
+##### /local-indexer-node/persistenceTimestamp
+
+supported by local-indexer-node
+
+request:
+
+    { "cas": <cas>,
+      "command": NewPersistenceTimestamp,
+      "payload": <persistence-timestamp>,
+    }
+
+response:
+
+    { "status": ...
+    }
+
+### Index-Coordinator replica
+
+* when a node moves to replica State, it will first fetch the latest
+  StateContext from the current master and persist it as the local
+  StateContext.
+* if the replica's StateContext is newer than the master's StateContext
+  (detectable by comparing the CAS value), then the latest mutations on the
+  replica will be lost.
+* a replica can receive updates to StateContext only from the master; if it
+  receives updates from other components in the system, it will respond with
+  an error.
+* upon receiving a new update to StateContext from the master, the replica
+  will update its StateContext and persist the StateContext on durable media.
+* in case the replica is unable to communicate with the master, or comes back
+  alive while the rest of the system has moved ahead, it shall go through
+  bootstrap state,
+  * due to heartbeat failures
+  * by detecting the CAS value in subsequent updates from the master
+
+### rollback
+
+ * only the master node can move to rollback mode.
+
+ TBD
+
+## Data structure
+
+**cluster data structure**
+
+```go
+    type Cluster struct {
+        masterAddr string   // connection address to master
+        replicas   []string // list of connection addresses to replica nodes
+        nodes      []string // list of connection addresses to indexer-nodes
+    }
+```
+
+This data structure is transient and maintains the current state of the
+secondary-index cluster.
+
+**StateContext**
+
+```go
+    type Rebalance struct {
+        topology IndexTopology
+    }
+
+    type HWHeartbeat struct {
+        indexid         uint64
+        bucket          string
+        hw              Timestamp
+        lastPersistence uint64 // hash of Timestamp
+        lastStability   uint64 // hash of Timestamp
+        mutationQueue   uint64
+    }
+```
+
+### IndexManager APIs
+
+#### /cluster/heartbeat
+request:
+
+    { "current_term": <term> }
+
+response:
+
+    { "current_term": <term> }
+
+To be called by ns-server. Node will respond back with SUCCESS irrespective
+of role or state.
+
+* if a replica node does not respond back, it will be removed from the active
+  list.
+* if the master does not respond back, then ns-server can start a new
+  master-election.
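+
+The persistence-timestamp promotion described earlier can be condensed into a
+small routine. The following is a minimal, illustrative sketch, not part of
+the design: it reuses the `Timestamp` type defined above, leaves out steps 3
+and 4 (biasing toward a fast indexer and skipping silent indexers), and the
+function name `computePersistenceTimestamp` is an assumption.
+
+```go
+// computePersistenceTimestamp promotes per-node HighWatermark timestamps for
+// one bucket into a candidate persistence timestamp. It assumes every
+// timestamp vector covers the same number of vbuckets.
+func computePersistenceTimestamp(
+    hwTimestamps []Timestamp, // one HighWatermark timestamp per indexer node
+    lastPTS Timestamp, // last promoted persistence timestamp
+) (Timestamp, bool) {
+    if len(hwTimestamps) == 0 {
+        return nil, false
+    }
+    numVbuckets := len(hwTimestamps[0])
+    pts := make(Timestamp, numVbuckets)
+    for vb := 0; vb < numVbuckets; vb++ {
+        // step 1: compute the mean seqNo across indexer nodes.
+        var sum uint64
+        for _, hw := range hwTimestamps {
+            sum += hw[vb]
+        }
+        // step 2: the mean seqNo becomes this vbucket's entry.
+        pts[vb] = sum / uint64(len(hwTimestamps))
+    }
+    // step 5: publish only if at least one vbucket moved forward;
+    // otherwise do nothing.
+    for vb := 0; vb < numVbuckets; vb++ {
+        if pts[vb] > lastPTS[vb] {
+            return pts, true
+        }
+    }
+    return nil, false
+}
+```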
+
+#### /cluster/bootstrap
+
+To be called by ns-server; ns-server is expected to make this request during
+master election.
+
+* replica will cancel all outstanding updates and move to `bootstrap` state.
+* master will cancel all outstanding updates and move to `bootstrap` state.
+
+#### /cluster/newmaster
+request:
+
+    { "current_term": <term> }
+
+response:
+
+    { "current_term": <term>,
+      "status": ...
+    }
+
+Once master election is completed, ns-server will post the new master and
+election-term to the elected master and each of its new replicas. After this,
+the IndexManager node shall enter into `master` or `replica` state.
+
+### Index-Coordinator APIs
+
+#### /cluster/index
+request:
+
+    { "current_term": <term>,
+      "command": CreateIndex,
+      "indexinfo": {},
+    }
+
+response:
+
+    { "current_term": <term>,
+      "status": ...
+      "indexinfo": {},
+    }
+
+Create a new index specified by the `indexinfo` field in the request body.
+The `indexinfo` property is the same as defined by the `IndexInfo` structure
+above, except that the `id` field will be generated by the master
+IndexManager; the same `indexinfo` structure will be sent back in the
+response.
+
+#### /cluster/index
+request:
+
+    { "current_term": <term>,
+      "command": DropIndex,
+      "indexid": <list-of-indexid>,
+    }
+
+response:
+
+    { "current_term": <term>,
+      "status": ...
+    }
+
+Drop all indexes listed in the `indexid` field.
+
+#### /cluster/index
+request:
+
+    { "current_term": <term>,
+      "command": ListIndex,
+      "indexids": <list-of-indexid>,
+    }
+
+response:
+
+    { "current_term": <term>,
+      "status": ...
+      "indexinfo": <list-of-indexinfo>,
+    }
+
+List index meta-data structures identified by `indexids` in the request body.
+If the list is empty, list all active indexes.
+
+### ns-server API requirements:
+
+(a sketch of these requirements as a Go interface follows the open questions
+below.)
+
+#### GetIndexCoordinator()
+
+Returns the connection address of the current master. If no master is
+currently elected, returns an empty string.
+
+#### GetIndexCoordinatorReplicas()
+
+Returns the list of connection addresses of all active replicas in the
+system.
+
+#### GetIndexerNodes()
+
+Returns the list of connection addresses of all active local-indexer-nodes in
+the system.
+
+#### Join()
+
+For a node to join the cluster.
+
+#### Leave()
+
+For a node to leave the cluster.
+
+## Appendix A: replication and fault-tolerance.
+
+To avoid a single-point-of-failure in the system we need multiple nodes that
+are continuously synchronized with each other, so that in case of a master
+failure one of the replicas can be promoted to master. In this section we
+explore the challenges involved in synchronous replication of data from the
+master to its replica nodes.
+
+### full-replication
+
+Full replication means that, for every update, the master will copy the
+entire data structure to its replicas.
+
+A false-positive is possible when the master crashes immediately and a new
+node, one that did not participate in the previous update, becomes the new
+master. This can be avoided if the master locks ns-server to prevent any new
+nodes from joining the cluster until the update is completed.
+
+A false-negative is possible when the master crashes before updating all
+replicas and one of the updated replicas is elected as the new master.
+
+Q:
+1) When does the index manager need to subscribe/unsubscribe directly?
+   Should the router handle changing the subscribers based on the topology?
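+
+For clarity, the ns-server requirements listed above can be consolidated into
+a single Go interface. This is a hypothetical sketch only; the actual
+signatures, parameter types and error handling are not specified by this
+document.
+
+```go
+// NsServerAPI is an assumed consolidation of the ns-server requirements
+// named in the section above; every signature here is an assumption.
+type NsServerAPI interface {
+    // GetIndexCoordinator returns the connection address of the current
+    // master, or "" when no master is elected.
+    GetIndexCoordinator() (string, error)
+
+    // GetIndexCoordinatorReplicas returns the connection addresses of all
+    // active replicas in the system.
+    GetIndexCoordinatorReplicas() ([]string, error)
+
+    // GetIndexerNodes returns the connection addresses of all active
+    // local-indexer-nodes in the system.
+    GetIndexerNodes() ([]string, error)
+
+    Join(connectionAddr string) error  // for a node to join the cluster
+    Leave(connectionAddr string) error // for a node to leave the cluster
+}
+```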
diff --git a/secondary/docs/design/markdown/indexer.md b/secondary/docs/design/markdown/indexer.md
index 68a85c96c..47852b6bc 100644
--- a/secondary/docs/design/markdown/indexer.md
+++ b/secondary/docs/design/markdown/indexer.md
@@ -1,8 +1,7 @@
 ## Indexer
 
-Indexer is a node in secondary-index cluster that will host secondary-index
-components like IndexManager, Index-Coordinator, Index-Coordinator-Replica and
-Local-Indexer process.
+Indexer is a node in the secondary-index cluster that will host the
+IndexManager and Indexer processes.
 
 ### Scope of local indexer
diff --git a/secondary/docs/design/markdown/projector.md b/secondary/docs/design/markdown/projector.md
index 62698d70e..c0ddcb93d 100644
--- a/secondary/docs/design/markdown/projector.md
+++ b/secondary/docs/design/markdown/projector.md
@@ -1 +1,60 @@
-##Projector
+## Projector
+
+* projector is a stateless component in the system. That is, it is not backed
+  by any persistent storage.
+* it fetches all relevant information from Index-Coordinator.
+* connects with the KV cluster for UPR mutations.
+* evaluates each document using the index expressions, if the document
+  belongs to the bucket.
+* creates projected key-versions and sends them to the router component.
+* when a new index is created or an old index is deleted, Index-Coordinator
+  will post the information to all projectors.
+
+## projector bootstrap
+
+* projector will get the list of all active index-info from Index-Coordinator.
+* projector will spawn a thread, referred to as the per-bucket-thread, for
+  each bucket that has an index defined.
+* per-bucket-thread will,
+  * get a subset of vbuckets for which it has to start UPR streams.
+  * get a list of subscribers for each of these subsets who would like to
+    receive the key-versions.
+  * open a UPR connection with one or more kv-master-nodes and start a stream
+    for each vbucket in its subset.
+  * apply expressions from each index defined for this bucket and generate
+    the corresponding key-versions.
+  * publish the key-versions to subscribers.
+
+## changes to index DDLs
+
+Changes in index DDLs can affect projectors,
+
+* in case of CREATE index DDL,
+  * projectors may have to spawn or kill a per-bucket thread.
+  * projectors may want the new index definition and its expression to
+    generate key-versions for the new index.
+* in case of DROP index DDL,
+  * projectors may have to kill a per-bucket thread.
+  * projectors may want to delete the index definition and its expression
+    from their local copy to stop generating key-versions for the deleted
+    index.
+* sometimes, to balance load, Index-Coordinator might change the subscriber
+  list that receives index key-versions.
+
+**relevant data structures**
+
+```go
+    type ProjectorState struct {
+        listenAddr string // address to send/receive administration messages
+        kvAddr     string // address to connect with KV cluster
+        indexInfos []IndexInfo
+    }
+
+    type KeyVersion struct {
+        docid      []byte // primary document id.
+        vbucket    uint16 // vbucket in which document is located.
+        vbuuid     uint64 // uuid of the vbucket
+        sequenceno uint64 // sequence number corresponding to this mutation
+        indexid    uint64
+        keys       []byte
+    }
+```
diff --git a/secondary/docs/design/markdown/pubsub.md b/secondary/docs/design/markdown/pubsub.md
new file mode 100644
index 000000000..382a75e71
--- /dev/null
+++ b/secondary/docs/design/markdown/pubsub.md
@@ -0,0 +1,7 @@
+## Publisher subscriber
+
+Publisher subscriber is an event framework across the network. It is based on
+`topics`, to which network subscriptions are allowed. The event framework is
+actively managed by Index-Coordinator, and for every new topic created and
+when new nodes are subscribing to the topics, Index-Coordinator shall
+replicate the changes to the topic list.

From 8836edea960aa3a94962208b26f471e3ed9c08d5 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 14 Mar 2014 10:07:29 +0530
Subject: [PATCH 09/13] refactored index_manager_design.md

---
 .../design/markdown/index_manager_design.md   | 132 +++++++++---------
 1 file changed, 66 insertions(+), 66 deletions(-)

diff --git a/secondary/docs/design/markdown/index_manager_design.md b/secondary/docs/design/markdown/index_manager_design.md
index 332feb9df..c8fcd2d72 100644
--- a/secondary/docs/design/markdown/index_manager_design.md
+++ b/secondary/docs/design/markdown/index_manager_design.md
@@ -1,69 +1,3 @@
-### failed, orphaned and outdated IndexManager
-
-1. when an IndexManager crashes, it shall be restarted by ns-server, join the
-   cluster and enter bootstrap state.
-2. when an Index-Coordinator becomes unreachable to ns-server, it shall
-   restart itself by joining the cluster after a timeout period and enter
-   bootstrap state.
-
-### client interfacing with Index-Coordinator
-
-1. a client must first get the current Index-Coordinator from ns-server. If it
-   cannot get one, it must retry or fail.
-2. once the network address of the Index-Coordinator is obtained from
-   ns-server, the client can post update requests to the Index-Coordinator.
-3. Index-Coordinator should create a new transaction in ns-server and then get
-   the current list of Index-Coordinator-Replicas from ns-server.
-   * if ns-server is unreachable or doesn't respond, return error to client.
-4. Index-Coordinator goes through a 2-phase commit with its replicas and
-   returns success or failure to the client.
-
-### two-phase commit for co-ordination failures
-
-**false-positive** is a scenario where the client thinks that an update
-request succeeded but, due to failures, the system is not yet updated.
-
-**false-negative** is a scenario where the client thinks that an update
-request has failed but the system has applied the update to the StateContext.
-
-While the system is going through a rebalance, a kv-rollback or a new DDL
-statement, the Index-Coordinator can crash. This situation should not put the
-system in an inconsistent state. To achieve this, we propose a solution with
-the help of ns-server.
-
-* Index-Coordinator and its replicas shall maintain a persisted copy of the
-  local StateContext.
-* ns-server will act as commit-point-site.
-* for every update, Index-Coordinator will increment the CAS and create a
-  transaction tuple {CAS, status} with ns-server.
-* status will be "initial" when it is created and later move to either
-  "commit" or "rollback".
-* for every update accepted by Index-Coordinator it shall create a new copy of
-  StateContext, persist the new copy locally as a log and push the copy to its
-  replicas.
-  * if one of the replicas fails, Index-Coordinator will post "rollback"
-    status to ns-server for this transaction and issue a rollback to each of
-    the replicas.
-  * upon rollback, Index-Coordinator and its replicas will delete the local
-    StateContext log and return failure.
-* when all the replicas accept the new copy of StateContext and persist it
-  locally as a log, Index-Coordinator will post "commit" status to ns-server
-  for this transaction and issue a commit to each of the replicas.
-* during commit, the local log will be switched in as the new StateContext,
-  and the log file will be deleted.
-
-whenever a master or replica fails it will always go through a bootstrap
-phase, during which it shall detect a log of StateContext and consult
-ns-server for the corresponding transaction's status. If the status is
-"commit" it will switch the log in as the new StateContext; otherwise it
-shall delete the log and retain the current StateContext.
-
-* once ns-server creates a new transaction entry, it should not accept a new
-  IndexManager into the cluster until the transaction moves to "commit" or
-  "rollback" status.
-* during a master election ns-server shall pick a master only from the set of
-  IndexManagers that took part in the last transaction.
-* ns-server should maintain a rolling log of transaction status.
 ## IndexManager design
 
 Each instance of IndexManager will be modeled as a state machine backed by a
 data structure, called StateContext, that contains meta-data about the secondary
 index, index topology, pub-sub data and meta-data for normal operation of
 IndexManager cluster.
@@ -88,6 +88,72 @@ IndexManager cluster.
     }
 ```
 
+### failed, orphaned and outdated IndexManager
+
+1. when an IndexManager crashes, it shall be restarted by ns-server, join the
+   cluster and enter bootstrap state.
+2. when an Index-Coordinator becomes unreachable to ns-server, it shall
+   restart itself by joining the cluster after a timeout period and enter
+   bootstrap state.
+
+### client interfacing with Index-Coordinator
+
+1. a client must first get the current Index-Coordinator from ns-server. If it
+   cannot get one, it must retry or fail.
+2. once the network address of the Index-Coordinator is obtained from
+   ns-server, the client can post update requests to the Index-Coordinator.
+3. Index-Coordinator should create a new transaction in ns-server and then get
+   the current list of Index-Coordinator-Replicas from ns-server.
+   * if ns-server is unreachable or doesn't respond, return error to client.
+4. Index-Coordinator goes through a 2-phase commit with its replicas and
+   returns success or failure to the client.
+
+### two-phase commit for co-ordination failures
+
+**false-positive** is a scenario where the client thinks that an update
+request succeeded but, due to failures, the system is not yet updated.
+
+**false-negative** is a scenario where the client thinks that an update
+request has failed but the system has applied the update to the StateContext.
+
+While the system is going through a rebalance, a kv-rollback or a new DDL
+statement, the Index-Coordinator can crash. This situation should not put the
+system in an inconsistent state. To achieve this, we propose a solution with
+the help of ns-server.
+
+* Index-Coordinator and its replicas shall maintain a persisted copy of the
+  local StateContext.
+* ns-server will act as commit-point-site.
+* for every update, Index-Coordinator will increment the CAS and create a
+  transaction tuple {CAS, status} with ns-server.
+* status will be "initial" when it is created and later move to either
+  "commit" or "rollback".
+* for every update accepted by Index-Coordinator it shall create a new copy of
+  StateContext, persist the new copy locally as a log and push the copy to its
+  replicas.
+  * if one of the replicas fails, Index-Coordinator will post "rollback"
+    status to ns-server for this transaction and issue a rollback to each of
+    the replicas.
+  * upon rollback, Index-Coordinator and its replicas will delete the local
+    StateContext log and return failure.
+* when all the replicas accept the new copy of StateContext and persist it
+  locally as a log, Index-Coordinator will post "commit" status to ns-server
+  for this transaction and issue a commit to each of the replicas.
+* during commit, the local log will be switched in as the new StateContext,
+  and the log file will be deleted.
+
+whenever a master or replica fails it will always go through a bootstrap
+phase, during which it shall detect a log of StateContext and consult
+ns-server for the corresponding transaction's status. If the status is
+"commit" it will switch the log in as the new StateContext; otherwise it
+shall delete the log and retain the current StateContext.
+
+* once ns-server creates a new transaction entry, it should not accept a new
+  IndexManager into the cluster until the transaction moves to "commit" or
+  "rollback" status.
+* during a master election ns-server shall pick a master only from the set of
+  IndexManagers that took part in the last transaction.
+* ns-server should maintain a rolling log of transaction status.
 ### bootstrap state for IndexManager

From 1a99ce4bbe3ee14b198ac88b4a758bab5b4dc523 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 14 Mar 2014 11:14:51 +0530
Subject: [PATCH 10/13] Document for publisher-subscriber component.

---
 secondary/docs/design/markdown/pubsub.md | 88 ++++++++++++++++++++++--
 1 file changed, 84 insertions(+), 4 deletions(-)

diff --git a/secondary/docs/design/markdown/pubsub.md b/secondary/docs/design/markdown/pubsub.md
index 382a75e71..39a08099d 100644
--- a/secondary/docs/design/markdown/pubsub.md
+++ b/secondary/docs/design/markdown/pubsub.md
@@ -1,7 +1,87 @@
 ## Publisher subscriber
 
-Publisher subscriber is an event framework across the network. It is based on
+Publisher-subscriber is an event framework across the network. It is based on
 `topics`, to which network subscriptions are allowed. The event framework is
-actively managed by Index-Coordinator, and for every new topic created and
-when new nodes are subscribing to the topics, Index-Coordinator shall
-replicate the changes to the topic list.
+actively managed by Index-Coordinator, and for every topic created or deleted
+and for every node subscribed or unsubscribed, Index-Coordinator will manage
+them as part of its StateContext.
+
+#### Topic
+
+Each topic is represented by the following structure,
+
+```go
+    type Topic struct {
+        name        string
+        subscribers []string
+    }
+```
+
+- `name` is represented in path format, and the subscribe / un-subscribe APIs
+  will accept glob-patterns on name.
+- `subscribers` is a list of connection strings, where each connection string
+  is represented in **host:port** format.
+
+Subscribers should ensure that there is a thread/routine listening at the
+specified port. Topic publishers will get the latest list of subscribers for
+their topic, request a TCP connection with each subscriber and start pushing
+one or more events to it. It is up to the publisher and the subscriber to agree
+upon the connection, transport and protocol details.
+
+**Note:** In future we might extend this definition to make it easy for
+third-party components to interplay with secondary indexing system
+
+### Topics and publishers
+
+Following is the standard collection of topics and their publishers.
+
+#### /topic/_path_
+
+If a component wants to be notified whenever a topic specified by `path`
+changes, which happens when a topic or a subscriber to the topic is added or
+deleted, the process can subscribe to `/topic`, suffixed by the `path`.
+
+**Publisher: Index-Coordinator**, whenever Index-Coordinator makes a change
+to an active topic or to the list of Topics, it will check for a topic name
+`/topic/<mutating-topic>`, where mutating-topic is the topic to which a change
+is applied, and push those changes to the subscribers.
+
+event format,
+```json
+{
+    "topic": <topic>,
+    "topic-name": <topic-name>,
+    "subscribers": <list-of-subscribers>
+}
+```
+
+#### /indexinfos
+
+Subscribers will be notified when any index DDL changes in the StateContext.
+
+**Publisher: Index-Coordinator**, whenever Index-Coordinator makes a change
+to an index or to the list of index-information, it will publish the following
+event,
+
+```json
+{
+    "topic": "/indexinfos",
+    "indexinfo": <indexinfo-structure>,
+}
+```
+
+#### /indexinfos/_indexid_
+
+Subscribers will be notified when index DDL for _indexid_ changes in the
+StateContext.
+
+**Publisher: Index-Coordinator**, whenever Index-Coordinator makes a change
+to an index or to the list of index-information, it will publish the following
+event,
+
+```json
+{
+    "topic": "/indexinfos",
+    "indexinfo": <indexinfo-structure>,
+}
+```

From 274fe3f30f8c24a234d3b16a9c74fe7cd8ac3b98 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 14 Mar 2014 11:26:51 +0530
Subject: [PATCH 11/13] Update to pubsub document.

---
 secondary/docs/design/markdown/pubsub.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/secondary/docs/design/markdown/pubsub.md b/secondary/docs/design/markdown/pubsub.md
index 39a08099d..53d496844 100644
--- a/secondary/docs/design/markdown/pubsub.md
+++ b/secondary/docs/design/markdown/pubsub.md
@@ -3,7 +3,7 @@
 Publisher-subscriber is an event framework across the network. It is based on
 `topics`, to which network subscriptions are allowed. The event framework is
 actively managed by Index-Coordinator, and for every topic created or deleted
-and for every node subscribed or unsubscribed, Index-Coordinator will manage
+and for every node subscribed or un-subscribed, Index-Coordinator will manage
 them as part of its StateContext.
 
 #### Topic
@@ -31,9 +31,10 @@ upon the connection, transport and protocol details.
 **Note:** In future we might extend this definition to make it easy for
 third-party components to interplay with secondary indexing system
 
-### Topics and publishers
+### Special topic - TopicTopic
 
-Following is the standard collection of topics and their publishers.
+TopicTopic is a special kind of topic that can selectively publish
+subscription changes on any other topic.
 
 #### /topic/_path_
 
@@ -43,8 +44,7 @@
 **Publisher: Index-Coordinator**, whenever Index-Coordinator makes a change
 to an active topic or to the list of Topics, it will check for a topic name
-`/topic/<mutating-topic>`, where mutating-topic is the topic to which a change
-is applied, and push those changes to the subscribers.
+`/topic/<mutating-topic>`, and push those changes to topictopic's subscribers.
 
 event format,
 ```json
 {
-    "topic": <topic>,
-    "topic-name": <topic-name>,
+    "topic": "/topic/<topic-name>",
+    "topic-name": "<topic-name>",
     "subscribers": <list-of-subscribers>
 }
 ```
 
+### Standard topics and publishers
+
+Following is the standard collection of topics and their publishers.
+
 #### /indexinfos
 
 Subscribers will be notified when any index DDL changes in the StateContext.
@@ -69,7 +69,7 @@
 ```json
 {
-    "topic": "/indexinfos",
+    "topic": "/indexinfos",
     "indexinfo": <indexinfo-structure>,
 }
 ```
 
 #### /indexinfos/_indexid_
 
 Subscribers will be notified when index DDL for _indexid_ changes in the
 StateContext.
 
 **Publisher: Index-Coordinator**, whenever Index-Coordinator makes a change
-to an index or to the list of index-information, it will publish the following
-event,
+to index, `indexid`, it will publish the following event,
 
 ```json
 {
     "topic": "/indexinfos",
     "indexinfo": <indexinfo-structure>,
 }
 ```

From 6c5bac1d6fc5133d5315258174c26c8d73616f65 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 14 Mar 2014 11:33:05 +0530
Subject: [PATCH 12/13] Update to pubsub documentation.

---
 secondary/docs/design/markdown/pubsub.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/secondary/docs/design/markdown/pubsub.md b/secondary/docs/design/markdown/pubsub.md
index 53d496844..65980b400 100644
--- a/secondary/docs/design/markdown/pubsub.md
+++ b/secondary/docs/design/markdown/pubsub.md
@@ -3,7 +3,7 @@
 Publisher-subscriber is an event framework across the network. It is based on
 `topics`, to which network subscriptions are allowed. The event framework is
 actively managed by Index-Coordinator, and for every topic created or deleted
-and for every node subscribed or un-subscribed, Index-Coordinator will manage
+and for every node subscribed or un-subscribed Index-Coordinator will manage
 them as part of its StateContext.
@@ -49,8 +49,8 @@ event format,
 ```json
 {
     "topic": "/topic/<topic-name>",
-    "topic-name": "<topic-name>",
+    "topic-name": <topic-name>,
     "subscribers": <list-of-subscribers>
 }
 ```
@@ -69,7 +69,7 @@
 ```json
 {
-    "topic": "/indexinfos",
+    "topic": "/indexinfos",
     "indexinfo": <indexinfo-structure>,
 }
 ```
@@ -84,7 +84,7 @@ to index, `indexid`, it will publish the following event,
 ```json
 {
-    "topic": "/indexinfos",
+    "topic": "/indexinfos/<indexid>",
     "indexinfo": <indexinfo-structure>,
 }
 ```

From a0995efa0c3d196de40799a6fbb1c7f5ec0357c9 Mon Sep 17 00:00:00 2001
From: prataprc
Date: Fri, 14 Mar 2014 11:36:24 +0530
Subject: [PATCH 13/13] Update to pubsub.md

---
 secondary/docs/design/markdown/pubsub.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/secondary/docs/design/markdown/pubsub.md b/secondary/docs/design/markdown/pubsub.md
index 65980b400..349ca9646 100644
--- a/secondary/docs/design/markdown/pubsub.md
+++ b/secondary/docs/design/markdown/pubsub.md
@@ -50,7 +50,7 @@ event format,
 ```json
 {
     "topic": "/topic/<topic-name>",
-    "topic-name": <topic-name>,
+    "topic-name": <topic-name>
     "subscribers": <list-of-subscribers>
 }
 ```
@@ -70,7 +70,7 @@
 ```json
 {
     "topic": "/indexinfos",
-    "indexinfo": <indexinfo-structure>,
+    "indexinfo": <indexinfo-structure>
 }
 ```
@@ -85,7 +85,7 @@
 ```json
 {
     "topic": "/indexinfos/<indexid>",
-    "indexinfo": <indexinfo-structure>,
+    "indexinfo": <indexinfo-structure>
 }
 ```
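
As an illustration of the subscriber side of this framework, a subscriber
could run a listener routine like the one below. This is a sketch only, not
part of the design above: JSON-over-TCP is an assumption (the documents leave
transport and protocol to the publisher and subscriber to agree upon), the
event shape mirrors the `/indexinfos` format, and all names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// TopicEvent mirrors the event format published by Index-Coordinator;
// field names follow the JSON examples above.
type TopicEvent struct {
	Topic     string          `json:"topic"`
	IndexInfo json.RawMessage `json:"indexinfo,omitempty"`
}

// listen accepts publisher connections on the subscriber's host:port and
// decodes a stream of JSON events from each publisher.
func listen(addr string, events chan<- TopicEvent) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(c net.Conn) {
			defer c.Close()
			dec := json.NewDecoder(c)
			for {
				var ev TopicEvent
				if err := dec.Decode(&ev); err != nil {
					return // publisher closed or sent garbage
				}
				events <- ev
			}
		}(conn)
	}
}

func main() {
	events := make(chan TopicEvent, 16)
	go listen("localhost:9999", events) // error handling elided in this sketch
	for ev := range events {
		fmt.Println("event on topic:", ev.Topic)
	}
}
```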