Jigsaw - fails for Kaguya TC for certain cases #5173

Closed
lwellerastro opened this issue Apr 5, 2023 · 5 comments · Fixed by #5267

@lwellerastro
Contributor

lwellerastro commented Apr 5, 2023

ISIS version(s) affected: 7.1.0

Description
Jigsaw fails on a combined southern hemisphere network (latitudes 0 to -90, all longitudes) when solving for radius, camera acceleration, twist and spacecraft position. The network is a merge of 15 independent lunar quads (including the south pole), each of which bundled successfully as an independent quad with the same solve parameters and the general settings shown below. Jigsaw runs successfully on the same merged network when spacecraft position is not being solved for, but the data need to have spacecraft position solved for. Additionally, when the south polar quad is not included in the merged network, jigsaw will solve for spacecraft position.

How to reproduce
An image list (files on scratch disk pointing to >2 TB of data) and the networks used in the commands below are in my user work area under Isis3Tests/Jigsaw/Kaguya_South/.

Failing Case:
15 combined quads including the south pole and ground control points
solves for radius, camera acceleration, twist, spacecraft position
81112 Images
5448001 Points
21320688 Measures

jigsaw froml=KTC_Morning_SouthHemisphere_Image.lis cnet=KTC_Morning_SouthHemisphere_ImageGCP.net \
        onet=JigOut_KTC_Morning_SouthHemisphere_ImageGCP.net \
        radius=yes update=no \
        sigma0=1.0e-5 maxits=10 \
        camsolve=accelerations twist=yes overexisting=yes \
        spsolve=position overhermite=yes \
        camera_angles_sigma=0.25 \
        camera_angular_velocity_sigma=0.1 \
        camera_angular_acceleration_sigma=0.01 \
        spacecraft_position_sigma=1000 \
        point_radius_sigma=100 \
        file_prefix=RadAccelTwist_SpkPos

CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062

As per suggestions in post #3871, I ran the above command with global uncertainties added for point_latitude_sigma and point_longitude_sigma, but the same error was produced.

Successful Case:
15 combined quads including the south pole and ground control points
solves for radius, camera acceleration, twist
81112 Images
5448001 Points
21320688 Measures

jigsaw froml=KTC_Morning_SouthHemisphere_Image.lis cnet=KTC_Morning_SouthHemisphere_ImageGCP.net \
        onet=JigOut_KTC_Morning_SouthHemisphere_ImageGCP.net \
        radius=yes update=no \
        sigma0=1.0e-5 maxits=10 \
        camsolve=accelerations twist=yes overexisting=yes \
        camera_angles_sigma=0.25 \
        camera_angular_velocity_sigma=0.1 \
        camera_angular_acceleration_sigma=0.01 \
        point_radius_sigma=100 \
        file_prefix=RadAccelTwist_ImageGCP

This required a maximum of 55 G of memory to run and took about 11 hours to converge in 4 iterations.

Successful Case:
14 combined quads and ground control points, excluding the south pole
solves for radius, camera acceleration, twist, spacecraft position
66282 Images
4899811 Points
17441866 Measures

jigsaw froml=KTC_Morning_SouthHemisphere_MidLat_Image.lis cnet=KTC_Morning_SouthHemisphere_MidLat_ImageGCP.net \
        onet=JigOut_KTC_Morning_SouthHemisphere_MidLat_ImageGCP.net \
        radius=yes update=no \
        sigma0=1.0e-5 maxits=10 \
        camsolve=accelerations twist=yes overexisting=yes \
        spsolve=position overhermite=yes \
        camera_angles_sigma=0.25 \
        camera_angular_velocity_sigma=0.1 \
        camera_angular_acceleration_sigma=0.01 \
        spacecraft_position_sigma=1000 \
        point_radius_sigma=100 \
        file_prefix=RadAccelTwist_SpkPos

This required 50 G of memory to run and took 7.5 hours to converge in 3 iterations.

Additional context
The south pole network that was excluded from the second successful case has the following basic stats:
16233 Images
548190 Points
3878822 Measures

Somehow adding this network breaks solving for spacecraft position. If the program determines there is inadequate memory available for the input data and parameters being solved for, then it would be good to know that ahead of time, along with the specific amount of memory required. If the network is not sparse enough (?), then we need to know that, preferably ahead of time, not after we have spent many hundreds of hours generating networks. Calculating sparsity has been brought up by a number of people (2 of whom are now retired) as something that could be a problem. I have a low-level idea of what that is, but as a user I have absolutely no way of calculating it to determine if it might be a problem or not.
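
As a rough illustration of what "sparsity" could mean here, the following is a minimal sketch in Python. It assumes the quantity of interest is the fraction of nonzero image-to-image blocks in the reduced normal equations, where two images share a block when at least one control point has measures on both. The helper `block_density` and the toy network are hypothetical and are not part of ISIS or jigsaw.

```python
from itertools import combinations


def block_density(points):
    """Fraction of image-pair blocks (lower triangle, diagonal included)
    that are nonzero before any factorization.

    `points` maps a point id to the list of image ids that measure it.
    """
    images = {img for imgs in points.values() for img in imgs}
    linked = set()
    for imgs in points.values():
        # Every pair of images measuring the same point shares a block.
        for a, b in combinations(sorted(set(imgs)), 2):
            linked.add((a, b))
    n = len(images)
    total_blocks = n * (n + 1) // 2    # lower triangle including diagonal
    nonzero_blocks = n + len(linked)   # diagonal blocks are always nonzero
    return nonzero_blocks / total_blocks


# Hypothetical toy network: 4 points measured on 4 images.
toy_network = {
    "p1": ["A", "B"],
    "p2": ["B", "C"],
    "p3": ["C", "D"],
    "p4": ["A", "B", "C"],
}
density = block_density(toy_network)
print(f"block density: {density:.2f}")
```

This only counts the blocks present before factorization. A Cholesky factorization introduces fill-in, so the factor is generally denser than this number suggests, and the sketch says nothing about where jigsaw's actual limits lie.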

An enormous amount of time and effort has been put into creating the quads. I acknowledge the quad networks are larger than I would like (findfeatures excels at finding points even when that parameter is restricted), but I'm worried that reducing their sizes (particularly the south pole) will break things, especially since I don't understand what the specific problem is, given that some things seem to run fine in jigsaw. Additional information is needed from this or another program to determine if my data are going to exceed some software limit. (Another knowledgeable person has also suggested that there may be a limit in how jigsaw allocates the size of the matrix (possibly a bug), and that perhaps the data are overrunning something there that expects only x amount of input while the data need x+?? slots. I don't understand this well enough to articulate it properly, and yes, it would be helpful if those folks could add to this conversation, but 2 of them no longer work here.)

A spreadsheet is being compiled to compare various large network jigsaw runs from over the years to help inform why we are having problems with some networks and not others. Large polar datasets seem to have some influence. The spreadsheet could be shared internally when that information is needed.

!! Could someone please add the Products label. That is no longer an option for me - something that has recently changed. Thanks.

@lwellerastro lwellerastro added the bug Something isn't working label Apr 5, 2023
@lwellerastro
Contributor Author

Additional information:
The prior photogrammetrist who worked on jigsaw personally contacted me about one year ago with an idea to try for fixing this cholmod error, in regard to post #3871. I documented it as best as I could and passed the information on to one of our developers (now gone) and recently to a manager. I never heard one way or another what anyone thought about the idea. This person said it took them about 30 minutes to implement, and they tried it on a smaller dataset they had (one that didn't have the cholmod error) to confirm that it at least didn't break anything.

In essence, when they wrote the cholmod code they used the default function calls, which use a 32-bit integer to index into the normal equations. They did a little more research on the error and found that 64-bit integer (long integer) versions could be called instead. They thought the default version could be exceeding the number of allowable entries. They changed their local jigsaw code to call those versions by adding "_l" to the function names. They said they weren't 100% sure this would help because they did not have the failing data to test it, but that it was worth a try since it took so little time to make the change and their research suggested it might help. They also said that if it did work, they would recommend making it an option rather than having jigsaw default to it.
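
To make the 32-bit versus 64-bit idea concrete, here is a minimal sketch in Python of the arithmetic involved. It is not jigsaw or CHOLMOD code. It assumes 12 unknowns per image for the failing case (3 camera angles times 3 polynomial coefficients for `camsolve=accelerations` with `twist=yes`, plus 3 for `spsolve=position`), and it uses the entry count of a fully dense lower triangle as a worst-case bound. The function names are hypothetical.

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit index can hold
INT64_MAX = 2**63 - 1   # largest value a signed 64-bit index can hold


def dense_lower_triangle_entries(n_params):
    """Entries in the lower triangle (diagonal included) of an n x n matrix."""
    return n_params * (n_params + 1) // 2


def fits(count, limit):
    """True if `count` can be represented by an index type with this maximum."""
    return count <= limit


# Image count from the failing case in the original post.
images = 81112
# Assumption: 9 camera pointing unknowns + 3 spacecraft position unknowns.
params_per_image = 12

n_params = images * params_per_image
worst_case = dense_lower_triangle_entries(n_params)

print("image parameters:", n_params)
print("worst-case entries fit in 32 bits:", fits(worst_case, INT32_MAX))
print("worst-case entries fit in 64 bits:", fits(worst_case, INT64_MAX))
```

The real matrix is sparse, so the true number of entries in the factor is far below this bound and is not known from the information in this post. The bound also cannot predict which runs fail, since it exceeds the 32-bit limit even without spacecraft position. It only shows that entry counts for a network of this size can plausibly exceed what a 32-bit index can address, while a 64-bit index has ample headroom.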

I apologize if this makes no sense. I'm going off of one-year-old notes and relaying information about things I am unfamiliar with. I thought it was worth mentioning since an expert shared it with me, encouraged it and is familiar with the code and the work we do here. It seems it should be considered.

@lwellerastro
Contributor Author

I have run numerous tests on my Kaguya TC data, mostly running into failure. I have documented the size of each network (# of images, points and measures), whether poles are included or not (most cases are w/out the polar quads), what I'm solving for, and, in the cases that were successful, how much memory was used as reported by slurm since these go to the cluster. The document is internal to Astro and I can share it with the appropriate individuals. I'm hoping there is useful information there.

In addition to working with the networks as they exist, I have also greatly (and excessively) reduced their size using the program cnetthinner, but those tests are also failing. I am tearing apart my networks, diminishing connections in unknown ways and literally breaking the networks in an attempt to find a scenario where jigsaw works, but no global, let alone semi-global, version of the networks is working. I am doing everything I can on my end to create a combined network (including undoing thousands of hours of work) that jigsaw will work on. This problem needs attention or we cannot create a global Kaguya TC network or the products based upon it.

  • Does jigsaw have a memory leak?
    I have run various versions of combined networks on cluster nodes having 435G of memory where jigsaw fails. The only case I can get success on is equatorial-only quads (+/-30 latitude, 0-360 longitude), where jigsaw runs to success solving for radius, camera acceleration and twist using 48G of memory, and again solving for radius, camera acceleration, spacecraft position and twist using 59G of memory. Including spacecraft position did not increase memory usage to an unfeasible level.

However, if I try to pass jigsaw a combined network for quads running from -30 to 65 latitude, 0-360 longitude, it runs successfully solving for radius, camera acceleration and twist (using 75G of memory), but fails with the cholmod error when spacecraft position is added. Does it really need an additional 360 Gigs of memory when adding spacecraft position?

  • Does jigsaw have a limit on the size of the matrix it can work with based on # of images, points, measures and what is being solved for?
    If so, what specifically are those limits? I need to know what to work toward and determine if those limits are feasible for a global dataset of this size (139k+ images). We need to know for future work, Kaguya or otherwise.

@lwellerastro
Contributor Author

Changes made in reference to #5176 have addressed this post and then some.

In addition to a successful jigsaw run for the network in the original post (covering the southern hemisphere of the Moon), I have also been able to bundle a network covering all mid-latitude (+/- 60 latitude) data, solving for radius, camera accelerations and spacecraft position. The latter only required 124G of memory.

I am currently running a global Kaguya TC network and expect it will take 2-3 days to run, but anticipate good results in line with the other tests.

Thanks so much @AustinSanders!!

@AustinSanders
Contributor

Awesome -- pending successful tests + merge of the PR, this sounds like it's closable! Thank you for the testing + feedback :)

@lwellerastro
Contributor Author

Final update - A global Kaguya TC control network (90S-90N, 0-360E) has bundled successfully!
42 hours, 167G of memory, solving for radius, camera accelerations and spacecraft position.

Number of Images = 168154
NumberOfPoints   = 11786686
NumberOfMeasures = 45916089
