Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample of permutations? #1

Open
dbolser opened this issue Oct 21, 2024 · 5 comments
Open

Sample of permutations? #1

dbolser opened this issue Oct 21, 2024 · 5 comments

Comments

@dbolser
Copy link

dbolser commented Oct 21, 2024

Thanks for making this available!

I'm hitting my head against cython to try to write a permutation sampling function. The issues is that my arrays are quite large (e.g. 500,000 rows, and I want to calculate statistics for 10,000,000 permutations...).

I can't see that your code produces a sample of permutations, unless I'm wrong?

If I get it working, would you look at a PR?

Many thanks,
Dan.

@hansalemaos
Copy link
Owner

Could you give me an input/output example of the data you are trying to produce?

@dbolser
Copy link
Author

dbolser commented Oct 24, 2024

Here is an example, but there is a detail:

def get_data(
    xsize=500000, ysize=2000, score_type="manual"
) -> tuple[np.ndarray, np.ndarray]:

    state = bernoulli.rvs(p=0.001, size=(xsize, ysize))
    state = state.astype(bool)  # type: ignore

    # Pull scores from a distribution
    score = norm.rvs(size=xsize, loc=0.1, scale=0.5)
    score = score**2

I actually have 10k different scores to test.

The statistic looks like this:

def statistic(score: np.ndarray, state: np.ndarray, axis: int) -> float:
    return np.sum(score * state, axis=axis)

The permutation test simply compares the statistic from 9999 random 'states' vs. the real data.

I discovered that, because my data is quite sparse per 'y' axis, looping through the 2k states and using the fisher yates shuffle is faster. I was trying to permute the 2d array, but actually, I need to keep each y separate (there are a different number of each, which is the detail I mentioned).

@hansalemaos
Copy link
Owner

So you want to shuffel (not permute) the output of get_data()?

@hansalemaos
Copy link
Owner

Seems to be kind of the same thing, right?
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
I wouldn't change the original data. I would create an index array of a smaller datatype and loop through the index and get the data. But I don't know if it is the best idea. I couldn't even run the code you sent me (out of memory). After work, I will try it with a smaller array;

@hansalemaos
Copy link
Owner

hansalemaos commented Oct 24, 2024

This should give you a rough outline how you could calculate it. I didn't get 100% what you want to do, but I think it is close to my code. Very important: YOU HAVE TO USE MEMORY VIEWS and ALWAYS TYPED INDEXING!! The way you are using Cython is completely useless. That gives you a speed-up of max. 5%. With typed memory views you get to C-speed

cimport numpy as np
def statistic():
    cdef:
        Py_ssize_t xsize=5000
        Py_ssize_t ysize=2000
        np.ndarray score=bernoulli.rvs(p=0.001, size=(xsize, ysize)).astype(bool).astype(np.uint8) #there might be something better to save memory, i don't like 2x as type.
        np.ndarray state=norm.rvs(size=xsize, loc=0.1, scale=0.5)**2
        unsigned char[:,:] scoreview
        double[:] stateview
        np.ndarray axis0_full=np.arange(0,xsize,dtype=np.int64).reshape((-1,xsize)) # replace with algor. for permutation -> you can use mine
        np.ndarray axis1_full=np.arange(0,ysize,dtype=np.int64).reshape((-1,ysize)) # replace with  algor. for permutation
        Py_ssize_t[:,:] axis0_full_view=axis0_full
        Py_ssize_t[:,:] axis1_full_view=axis1_full
        Py_ssize_t i,j,k,l
        Py_ssize_t number_of_loops=1 ; # number of desired permutations - should be axis 0 of the permutation array
        double result

    #print(score) 
    #print(state)
    scoreview=score
    stateview=state
    for k in range(number_of_loops):
        result=0
        for i in range(axis0_full_view.shape[1]):
            for j in range(axis1_full_view.shape[1]):
                result=result+(scoreview[axis1_full_view[k,j],axis0_full_view[k,i]]*stateview[axis1_full_view[k,j]])
        print(result)

statistic()



And you can add prange (parallel) to the outer loop (don't forget to compile with openmp)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants