First and foremost, we recommend you gain a strong understanding of the Incentive Mechanism and Evaluation used by the default validators, as this is what you are optimizing for.
Pay particular attention to the Penalties section, as penalties decrease your score exponentially.
Validators begin by verifying that the tokens in each chunk correspond to those in the source document. To ensure that your chunks match the source document, we highly encourage you to use NLTK's sent_tokenize to split the document into sentences before combining them into chunks.
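As a rough illustration of splitting by sentence and then combining, here is a minimal sketch. The names `chunk_by_sentences`, `naive_sentence_split`, and `max_chars` are hypothetical and not part of this repository. The regex splitter is only a stand-in so the example runs without extra downloads; in practice you would likely pass `nltk.tokenize.sent_tokenize` as `split`.

```python
import re
from typing import Callable, List


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(
    text: str,
    max_chars: int,
    split: Callable[[str], List[str]] = naive_sentence_split,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Sentences are never cut or repeated, so every chunk is built only from
    text that appears in the source. A single sentence longer than max_chars
    becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split(text):
        # Account for the joining space when the chunk already has content.
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that sentences are rejoined with a single space, so the original whitespace between sentences is not preserved exactly. Check how this interacts with the validator's token check before relying on it.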
Since this subnet evaluates chunk quality based on the semantic similarity within a given chunk and its dissimilarity to other chunks, do not overlap or repeat data. Repeating data across chunks will severely hurt dissimilarity, and thus tank your score. While overlapping chunks is a commonly used method in RAG, it comes with many drawbacks such as increased storage and inference costs, and is therefore not aligned with the goal of this subnet.
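Before submitting chunks, it may help to run a quick local sanity check for overlap. The sketch below is not the validator's actual verification. It assumes whitespace tokenization and assumes that your chunks are meant to cover the whole document exactly once, which is stricter than what is described above. The function name is hypothetical.

```python
from typing import List


def chunks_cover_document_once(document: str, chunks: List[str]) -> bool:
    """Return True if the chunks, read in order, reproduce the document's
    whitespace-separated tokens exactly once (no overlap, no gaps)."""
    doc_tokens = document.split()
    chunk_tokens = [tok for chunk in chunks for tok in chunk.split()]
    return chunk_tokens == doc_tokens
```

For example, `chunks_cover_document_once("a b c d", ["a b", "b c d"])` returns `False` because the token `b` is repeated across chunks.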
There are various approaches to chunking that can produce high-quality chunks. We recommend that you start by exploring recursive and semantic chunking. To learn more about chunking, we recommend you read this Pinecone article.
Recursive chunking begins by splitting the data set into a small number of chunks. It then checks whether each chunk meets the desired criteria (such as size or semantic self-similarity). If a chunk does not meet these criteria, the algorithm recursively splits that chunk into smaller chunks. This process continues until all chunks satisfy the specified criteria.
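The recursion described above can be sketched as follows. This is a simplified illustration, not the subnet's reference implementation. It assumes the input is already split into sentences, uses character length as the only criterion, starts from a single chunk, and splits at the middle sentence. `recursive_chunk` and `max_chars` are hypothetical names.

```python
from typing import List


def recursive_chunk(sentences: List[str], max_chars: int) -> List[str]:
    """Split a list of sentences into chunks of at most max_chars characters
    by recursively halving any chunk that is too large."""
    if not sentences:
        return []
    text = " ".join(sentences)
    # Base case: the chunk meets the size criterion, or it is a single
    # sentence and cannot be split further at a sentence boundary.
    if len(text) <= max_chars or len(sentences) == 1:
        return [text]
    # Recursive case: split at the middle sentence and process each half.
    mid = len(sentences) // 2
    return recursive_chunk(sentences[:mid], max_chars) + recursive_chunk(
        sentences[mid:], max_chars
    )
```

A semantic criterion could replace the length test in the base case, and a smarter split point (for example, the least similar adjacent sentence pair) could replace the midpoint.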
Here is a diagram of this process:
Semantic chunking starts by splitting the entire data set into individual sentences. Each sentence, called an anchor sentence, is then grouped with a number of surrounding sentences to form a sentence group. These sentence groups are compared sequentially. A chunk boundary is established wherever the semantic difference between adjacent sentence groups crosses some threshold.
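The following is a minimal sketch of the semantic chunking steps described above. To stay self-contained, it uses a bag-of-words cosine distance as a stand-in for a real embedding model, which is an assumption made purely for illustration. The names `semantic_chunk`, `window`, and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding' (assumption): lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunk(
    sentences: List[str], window: int = 1, threshold: float = 0.8
) -> List[str]:
    """Place a chunk boundary wherever adjacent sentence groups differ by
    more than threshold."""
    if not sentences:
        return []
    # Group each anchor sentence with up to `window` sentences on each side.
    groups = [
        bag_of_words(" ".join(sentences[max(0, i - window): i + window + 1]))
        for i in range(len(sentences))
    ]
    chunks: List[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare the group of the previous anchor with the current one.
        if cosine_distance(groups[i - 1], groups[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

With a real embedding model, only `bag_of_words` and `cosine_distance` would need to change. The grouping and boundary logic stays the same.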
Here is an example with a threshold of 1:
Many freely available chunking utilities can help you get a head start on your chunking algorithm. See the following links:
- https://www.youtube.com/watch?v=8OJC21T2SL4
- https://www.youtube.com/watch?v=uhVMFZjUOJI
- https://www.youtube.com/watch?v=TcRRfcbsApw
- https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
- https://research.trychroma.com/evaluating-chunking#chunking-algorithms
Finally, as the load increases, miners may need to deprioritize or ignore requests from lower-stake validators. Not responding to a request, or taking too long to respond, will result in a score of zero.
By default, miners prioritize requests by stake. Edit the logic in blacklist() and priority() in miner.py to protect your miner.
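As a starting point for thinking about that logic, here is a standalone sketch of stake-based filtering and prioritization. It does not use the Bittensor API and does not reproduce the code in miner.py. The function names, the `stakes` mapping, and the `MIN_STAKE` value are all hypothetical, and you would need to adapt the idea to the actual blacklist() and priority() signatures.

```python
from typing import Dict, Tuple

# Hypothetical cutoff; pick a value that suits your miner's capacity.
MIN_STAKE = 1000.0


def blacklist_decision(
    hotkey: str, stakes: Dict[str, float], min_stake: float = MIN_STAKE
) -> Tuple[bool, str]:
    """Return (should_reject, reason) for a request from the given hotkey."""
    stake = stakes.get(hotkey)
    if stake is None:
        return True, "unrecognized hotkey"
    if stake < min_stake:
        return True, "stake below minimum"
    return False, "ok"


def priority_score(hotkey: str, stakes: Dict[str, float]) -> float:
    """Higher stake means the request is served earlier; unknown hotkeys get 0."""
    return float(stakes.get(hotkey, 0.0))
```

Raising the minimum stake sheds load from low-stake validators, but any validator you ignore will score your miner zero, so tune the cutoff conservatively.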