Running ASV Benchmarks

grusev edited this page Dec 13, 2024 · 10 revisions

What are ASV Benchmarks and how do they work?

ASV is a benchmarking tool used by many prominent Python projects to benchmark and compare the performance of a library over time. Notable users include NumPy, Arrow and SciPy.

The tool has out-of-the-box support for creating benchmarks and using them to measure performance over time (e.g. per commit, per release, etc.). It does this by checking out, building and benchmarking each version, and then uses the results to plot the performance of the various versions on the various benchmarks. The latest benchmarks on the master versions can be seen here.

The benchmarks get run automatically in the following cases:

  • nightly on the master branch - this updates the performance graphs
  • on every push to a branch with an open PR - this benchmarks the PR branch against the master branch, and the benchmarks fail if there is a regression of more than 15%
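The 15% rule can be read as a ratio check between the timing on the PR branch and the timing on master. Below is a minimal sketch, assuming the comparison is a plain ratio of two timings; the function name is hypothetical and the exact comparison performed by the CI workflow is not taken from its source.

```python
def is_regression(master_seconds: float, branch_seconds: float, threshold: float = 0.15) -> bool:
    """Return True when the branch is slower than master by more than `threshold`."""
    if master_seconds <= 0:
        raise ValueError("master_seconds must be positive")
    return branch_seconds / master_seconds > 1.0 + threshold


print(is_regression(2.0, 2.2))  # 10% slower -> False
print(is_regression(2.0, 2.5))  # 25% slower -> True
```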

Normally, ASV keeps track of the results in JSON files, but we transform them into data frames and store them in an ArcticDB database. There is a special script that helps with saving the JSON files to the database and extracting them from it.

Adding new benchmarks

All of the code for the actual benchmarks is located in the benchmarks folder. Any new benchmarks should be added there, either in one of the existing classes/files or in a new one. We mainly use tests that benchmark the runtime (time_...) or peak memory usage (peakmem_...). ASV supports other benchmark types as well; you can read more about them in their docs.

Currently, we have the following 5 major groups of benchmarks:

  1. Basic functions - for benchmarking operations such as read/write/append/update, their batch variant, etc. against a local storage
  2. List functions - for benchmarking operations such as listing symbols, versions, etc. against a local storage
  3. Local query builder - for benchmarking QB against a local storage (e.g. LMDB)
  4. Persistent Query Builder - for benchmarking QB against a persistent storage (e.g. AWS S3) so we can read bigger data size
  5. Resample - benchmarking resampling functionality. Parametrized over input rows, output rows, column types, and supported aggregators
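As an illustration of what "parametrized over" means in ASV: when params is a list of lists, the benchmark is run for every combination of the values, and param_names labels each axis. The sketch below uses made-up parameter values and a pure-Python stand-in for the aggregation; it is not the actual Resample benchmark.

```python
import itertools


class ResampleLikeBenchmark:
    # Hypothetical parameter values, for illustration only.
    params = (
        [1_000, 100_000],        # input rows
        [10, 1_000],             # output rows
        ["sum", "mean", "max"],  # aggregator
    )
    param_names = ["input_rows", "output_rows", "aggregator"]

    def setup(self, input_rows, output_rows, aggregator):
        # Cheap in-memory input; a real benchmark would prepare a DataFrame.
        self.values = list(range(input_rows))
        self.bucket = max(1, input_rows // output_rows)

    def time_aggregate(self, input_rows, output_rows, aggregator):
        func = {"sum": sum, "mean": lambda b: sum(b) / len(b), "max": max}[aggregator]
        for start in range(0, len(self.values), self.bucket):
            func(self.values[start:start + self.bucket])


# Every combination of the three axes becomes one benchmark case.
combinations = list(itertools.product(*ResampleLikeBenchmark.params))
print(len(combinations))  # 2 * 2 * 3 = 12
```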

It is important to understand how each benchmark run is set up and torn down:

  • setup_cache - is called only once, so any heavy computation should go in here (e.g. prepopulating some results in the DB)
  • setup - is called before each benchmark run, so any setup that is light on computation should go in here (e.g. initializing the Arctic client)
  • teardown - is the opposite of setup and is called after each benchmark run, so any cleanup that should be performed should go in here (be careful not to clean up a library/symbol that is still needed by a benchmark)
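The three hooks can be sketched as follows. This is a minimal example, assuming a JSON file on disk as a stand-in for a prepopulated ArcticDB library; the class name, file name and the run_lifecycle_once() driver are hypothetical. The driver only calls the hooks by hand in the order described above, which is not how asv itself invokes them.

```python
import json
import os
import tempfile


class LifecycleSketch:
    DATA_FILE = "prepared_rows.json"  # hypothetical file name

    def setup_cache(self):
        # Heavy, one-off preparation: written once, read by every benchmark run.
        rows = [{"id": i, "value": i * 0.5} for i in range(10_000)]
        with open(self.DATA_FILE, "w") as f:
            json.dump(rows, f)
        return len(rows)  # passed as the first argument to the other methods

    def setup(self, row_count):
        # Light preparation before each benchmark run.
        with open(self.DATA_FILE) as f:
            self.rows = json.load(f)

    def time_sum_values(self, row_count):
        sum(row["value"] for row in self.rows)

    def peakmem_copy_rows(self, row_count):
        copies = [dict(row) for row in self.rows]
        return len(copies)

    def teardown(self, row_count):
        # Release per-run state, but leave DATA_FILE alone: other runs need it.
        self.rows = None


def run_lifecycle_once():
    bench = LifecycleSketch()
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            cached = bench.setup_cache()
            bench.setup(cached)
            bench.time_sum_values(cached)
            bench.teardown(cached)
        finally:
            os.chdir(previous)
    return cached
```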

If you have made any changes to the benchmarks, you need to run them locally at least once and push the changed benchmarks.json file to GitHub. This file is very important for ASV and the results of the benchmarks will not be generated properly without it.

How many times will a single time test be executed?

Although ASV has documentation, reading it may not be enough to plan and write your tests properly for ArcticDB. The reason is that we are trying to benchmark library functions that depend on the current internal state of the symbol/library. That is especially true for functions like append: when we want to append a time-indexed dataframe, when and how should it be generated? Then what about finalize_staged_data()? This function is a problem because it can be executed exactly once on a given set of staged data; any subsequent execution is bound to fail. So how do you make sure that the dataframes you generate are always in the future? And how do you measure functions that have an expensive setup and can be executed exactly once?

Here are some ideas derived from experience.

To plan your tests you must know the following specifics of asv:

  • asv runs tests in separate processes, so it is important to know how many processes you have to "synchronize" across in your scenario
  • debugging asv benchmarks is not possible with plain print() statements - their output appears in the console only when they are part of setup_cache(). The other methods run in different processes and their output is perhaps not kept. The only solution is to append logs to a file or use a logging library
  • several properties influence how many times your time tests are repeated. Note that they are also related to the number of processes that asv will create
    • rounds
    • number
    • repeat
    • params
    • the time that your code takes - if it is short, asv can and will decide on its own and will not use your suggestions

See: Asv timing logic, Asv timing attributes.

To test your hypothesis you can use the following code:
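For the "use a logging library" option mentioned above, here is a minimal sketch based on the standard logging module. The log location, logger name and class name are hypothetical; the process id is included in every line so that entries coming from different asv processes can be told apart.

```python
import logging
import os
import tempfile

LOG_PATH = os.path.join(tempfile.gettempdir(), "asv_trace.log")  # hypothetical location


def get_trace_logger(path):
    logger = logging.getLogger("asv_trace")
    if not logger.handlers:  # avoid attaching a second handler on repeated setup() calls
        handler = logging.FileHandler(path)  # opens the file in append mode by default
        handler.setFormatter(logging.Formatter("%(asctime)s pid=%(process)d %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


class LoggedBenchmark:
    def setup(self):
        self.log = get_trace_logger(LOG_PATH)
        self.log.info("setup")

    def time_noop(self):
        self.log.info("time_noop")

    def teardown(self):
        self.log.info("teardown")
```

Keep in mind that the logging call itself is part of what time_noop() measures, so this is meant for investigating asv behaviour rather than for benchmarks whose numbers you intend to keep.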

import os
import time

import psutil


class TestASV:
    
    rounds = 2
    number = 2 
    repeat = 3
    min_run_count = 1

    params = [1,2]

    file = "/tmp/trace.txt"
    
    def setup_cache(self):

        with open(TestASV.file, 'a') as f:
            f.write(f">>>>>>>>>>>>>>>>>>>     At Setup_Cache\n")
            f.write(f"  CWD            : {os.getcwd()}\n")
            f.write(f"  PID: {psutil.Process()}\n\n")

        return time.time()
    
    def setup(self, _time, param):

        with open(TestASV.file, 'a') as f:
            f.write(f">>>>>>>>>>>>>>>>>>>     At Setup-{_time}-{psutil.Process()}-{param}\n")
            f.write(f"  CWD            : {os.getcwd()}\n")

    def teardown(self, _time, param):

        with open(TestASV.file, 'a') as f:
            f.write(f">>>>>>>>>>>>>>>>>>>     At Teardown-{_time}-{psutil.Process()}-{param}\n")
            f.write(f"  CWD            : {os.getcwd()}\n")

    def time_test(self, _time, param):

        with open(TestASV.file, 'a') as f:
            f.write(f">>>>>>>>>>>>>>>>>>>     At time_test()-{_time}-{psutil.Process()}-{param}\n")
            f.write(f"  CWD            : {os.getcwd()}\n")

        time.sleep(1)

First run this code as is, with time.sleep() in place, and observe the "/tmp/trace.txt" file. Do not dive deep yet; now comment out time.sleep(), run it again and look at the "/tmp/trace.txt" file. See the difference? Now the file is really huge, because asv obviously made many more attempts to take a measurement. Therefore the first rule for making the attributes above actually take effect is to make the time test take a significant amount of time on the machine, so that its measured performance is more reliable and varies less because of other processes that may be running on the machine.

Once the test gives a significant and relatively stable result, the above parameters will most probably work.
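To compare the two runs without reading the whole file, the trace can be summarised by counting how often each phase was written. This is a small sketch that relies only on the marker strings written by the class above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the phase markers written by TestASV; Setup_Cache is listed first
# so that it is not mistaken for a plain Setup entry.
PHASE_PATTERN = re.compile(r"At (Setup_Cache|Setup|Teardown|time_test\(\))")


def count_phases(trace_text):
    return Counter(match.group(1) for match in PHASE_PATTERN.finditer(trace_text))


sample = (
    ">>>>     At Setup_Cache\n"
    ">>>>     At Setup-1.0-proc-1\n"
    ">>>>     At time_test()-1.0-proc-1\n"
    ">>>>     At time_test()-1.0-proc-1\n"
    ">>>>     At Teardown-1.0-proc-1\n"
)
print(count_phases(sample))
```

For a real run, pass the file contents instead of the sample, e.g. count_phases(open("/tmp/trace.txt").read()).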

Here is my understanding:

  • setup_cache() is executed exactly once in a separate process, and that process is not used again. The returned result will be available to the other processes
  • there will be one setup() and one teardown() for every 2 time_test() calls (the number parameter)
  • the number of processes will be equal to rounds*len(params)
  • in each process there will be rounds+1 repetitions of setup() - time_test() x number - teardown() (note the +1, which perhaps comes from the "calibrate" part of the algorithm)
  • all tests will run in one and the same temporary directory
  • all tests will have access to one and the same result returned from setup_cache(), which is passed as the first parameter to all your other methods

Having that information will help you plan the exercise. Note that if you intend to create a single library in setup_cache(), you must make sure that you have at least as many symbols as the number of processes that asv will create; otherwise you might end up with timing problems that are very hard to debug. You can exchange information between repetitions inside a single process, but you cannot exchange information between processes. Luckily we have a library, which could hold a shared symbol ;-) for interprocess communication.

Another thing to keep in mind is that your test may run several times after setup and before teardown. So if you have logic that would be affected by this, make sure you account for it.

Generally, working with more data in our operations and having each measured operation last at least 1 second should be a stable enough guideline when you have to deal with the state of the symbol you work with. So going fast is not the best choice here ...

Note that 'repeat' can be either a number or a tuple - (min_repeat, max_repeat, max_time).

Lastly, it was discovered that the value of "warmup_time" plays a significant role. Its default value is 1 sec, and that is the time asv will use for 'calibration': setup() is executed first, then time_test() is executed as many times as will fit within that time ("warmup_time"), and after that teardown() is executed. Thus, during this warmup time the number of executions of time_test() is not predictable - it can be one or more. You also do not get a setup-teardown cycle around each such call. The best approach is therefore to take this parameter out of the picture by assigning it the value 0. Now the parameters are meaningful and reliable:
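One way to keep an append-style test valid when it is called several times between setup and teardown is to keep a small stateful generator on the benchmark instance, so that every call receives data that starts after the previous chunk. The sketch below is an assumption-laden illustration: it uses plain (timestamp, value) tuples from the standard library as a stand-in for a time-indexed dataframe, and the class name is hypothetical.

```python
from datetime import datetime, timedelta


class FutureChunkGenerator:
    """Hands out chunks whose timestamps always follow the previous chunk."""

    def __init__(self, start, freq=timedelta(seconds=1)):
        self._next = start
        self._freq = freq

    def next_chunk(self, rows):
        if rows < 1:
            raise ValueError("rows must be at least 1")
        timestamps = [self._next + i * self._freq for i in range(rows)]
        # Remember where the next chunk has to start.
        self._next = timestamps[-1] + self._freq
        return [(ts, float(i)) for i, ts in enumerate(timestamps)]


gen = FutureChunkGenerator(datetime(2024, 1, 1))
first = gen.next_chunk(3)
second = gen.next_chunk(3)
print(first[-1][0] < second[0][0])  # True: the second chunk starts after the first ends
```

In a benchmark, the generator would be created in setup() and next_chunk() called inside the timed method, which means the generation cost is included in the measurement unless the chunks are prepared in advance.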

class TestASV:
    
    rounds = 2
    number = 2 
    repeat = 3
    min_run_count = 1
    warmup_time = 0

    params = [1,2]

  • we will have 5 processes in total:
    • one for setup_cache()
    • 4 = rounds * len(params)
  • we will have a total of 3 repetitions (in one process) of 2 measurements each (number=2 defines that). The sequence will be:
    • Repetition X for X in repetitions
      • setup()
      • time_test()
      • time_test()
      • teardown()
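The arithmetic above can be written down as a small helper. Note that it only encodes the behaviour observed on this page with warmup_time = 0; it is not a guarantee from asv, and the function name is hypothetical.

```python
def expected_counts(rounds, number, repeat, n_params):
    """Expected process and call counts, per the observations above (warmup_time = 0)."""
    return {
        "processes": 1 + rounds * n_params,  # one extra process for setup_cache()
        "setup_calls_per_process": repeat,
        "time_test_calls_per_process": repeat * number,
        "teardown_calls_per_process": repeat,
    }


print(expected_counts(rounds=2, number=2, repeat=3, n_params=2))
```

For the class above this gives 5 processes, with 3 setup() calls, 6 time_test() calls and 3 teardown() calls in each benchmark process.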

Therefore the suggested values for most of our tests would perhaps be:

    
    rounds = 1
    number = 1 
    repeat = X # you can tune that
    min_run_count = 1
    warmup_time = 0

    params = [a1...an] # you can tune that per your needs

The above example will have exactly len(params) processes, each repeating the time test X times, and each repetition having its own setup() followed by teardown().

Running the benchmarks on master

There is a workflow that automatically benchmarks the latest master commit every night. But if you need to run it manually for some reason, you can issue a manual build from here and click on the Run workflow menu. This will start a build that will benchmark only the latest version.

If you have made changes to the benchmarks, you might need to regenerate all of the benchmarks. You will need to start a new manual build on master and select the run_all_benchmarks option.

Running the benchmarks on a non-master branch

Local Run

To run ASV locally, you first need to make sure that you have some prerequisites installed (both can be installed with pip), namely:

  • asv
  • virtualenv

You also need to change the asv.conf.json file to point to your branch instead of master (e.g. "branches": ["some_branch"],). If you have introduced any new hard dependencies, you need to add them to the matrix of dependencies that will be installed.

After that you can simply run:

python -m asv run -v --show-stderr HEAD^! => if you want to benchmark only the latest commit

To run a subset of benchmarks, use --bench <regex>.

After running this once, if you are only changing the benchmarks and not the ArcticDB code itself, you can run the updated benchmarks without committing, and therefore without rebuilding, with:

python3 -m asv run --python=python/.asv/env/<some hash>/bin/python -v --show-stderr

where the path should be obvious from the first ASV run from HEAD^!
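If the path is not at hand, the candidate interpreters can also be listed with a short script. This sketch assumes the environments live under python/.asv/env/ relative to the repository root, as in the command above; the function name is hypothetical.

```python
import glob
import os


def find_asv_env_pythons(repo_root="."):
    """Return the interpreters of all ASV environments found under repo_root."""
    pattern = os.path.join(repo_root, "python", ".asv", "env", "*", "bin", "python")
    return sorted(glob.glob(pattern))


for candidate in find_asv_env_pythons():
    print(f"python3 -m asv run --python={candidate} -v --show-stderr")
```

If more than one environment is listed, pick the one created by your most recent HEAD^! run.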

The benchmarks take roughly 90 minutes per commit to run, depending on the machine. After the benchmarks have run successfully, you can view the results as output in the console. You cannot generate the results as HTML on a non-master branch, as this is a limitation of ASV.

GitHub Actions Run

If you want to benchmark more than one commit (e.g. if you have added new benchmarks), it might be better to run them on a GH Runner instead of locally. You will again need to change the asv.conf.json file to point to your branch instead of master (e.g. "branches": ["some_branch"],). And if you have introduced any new hard dependencies, you need to add them to the matrix of dependencies that will be installed.

Then push your changes and start a manual build from here. Make sure to select your branch and whether or not you want to run the benchmarks against all commits.