LROC SouthPole network cannot solve for acceleration in jigsaw #3871
Comments
The network is likely too large/connected to solve with the available memory. What system did you run this on? |
@blandoplanet Is this a products concern? Seems like it to me. |
Yes. The Products tag is appropriate. |
|
I think it was a mistake that this post was closed via the last reply. Re-opened. |
This may be more information than needed (so do not feel the need to read everything), but I would like to document it for any future related efforts.

Final description of the problem:

cholmod_analyze uses various methods to order the matrix passed in for easy factorization and, if it fails, returns null. This function can fail if the initial memory allocation fails, if all ordering methods fail, or if the supernodal analysis (if requested) fails. Based on the ‘problem too big’ error, I am guessing the memory allocation is failing (https://github.com/PetterS/SuiteSparse/blob/master/CHOLMOD/Cholesky/cholmod_analyze.c).

I checked which methods were being used for cholmod_analyze by running a successful bundle, printing out the m_cholmodCommon variable, and looking at the ‘nmethods’ field. This indicated that AMD or COLAMD is used to order m_cholmodNormal, depending on whether the matrix passed in is symmetric or non-symmetric. If a symmetric matrix is passed into cholmod_analyze, only the upper or lower triangle is accessed and ordered by the function, which saves memory. If a non-symmetric matrix (A) is passed in, the function brings in the whole matrix and orders AA’, effectively using twice the memory.

To evaluate the symmetry of m_cholmodNormal, I went through the columns and checked whether the largest associated row index was less than or greater than the current column index for all columns. In this way I evaluated whether the sparsely stored m_cholmodNormal was an upper triangular or lower triangular matrix, respectively. In BundleAdjust::loadCholmodTriplet():
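The column-wise check described above can be illustrated with a small Python sketch. This is not the code from BundleAdjust::loadCholmodTriplet(); it assumes a compressed-column layout, and the names `triangle_of_csc`, `col_ptr`, and `row_idx` are hypothetical, chosen only for this example.

```python
def triangle_of_csc(col_ptr, row_idx):
    """Classify a compressed-column matrix by where its stored entries sit.

    col_ptr[j]..col_ptr[j+1] delimits the entries of column j inside row_idx.
    Returns 'upper', 'lower', 'diagonal', or 'full'.
    """
    has_above = False  # at least one stored entry with row < column
    has_below = False  # at least one stored entry with row > column
    n_cols = len(col_ptr) - 1
    for col in range(n_cols):
        for k in range(col_ptr[col], col_ptr[col + 1]):
            row = row_idx[k]
            if row < col:
                has_above = True
            elif row > col:
                has_below = True
    if has_above and has_below:
        return "full"
    if has_above:
        return "upper"
    if has_below:
        return "lower"
    return "diagonal"


# Made-up 3x3 example with entries at (0,0), (0,1), (1,1), (0,2), (2,2).
col_ptr = [0, 1, 3, 5]
row_idx = [0, 0, 1, 0, 2]
print(triangle_of_csc(col_ptr, row_idx))
```

A matrix that comes back as 'upper' or 'lower' stores only one triangle, which is the case the comment above treats as symmetric storage; 'full' would correspond to the non-symmetric case.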
Using this method I confirmed that the network passed in during the failure (/scratch/jrichie/STechnique/SouthPole_2017Merged_Lidar2Image1_cnetedit.net) is stored as a sparse upper triangular matrix, meaning it should be treated as a symmetric matrix and the less memory-intensive AMD ordering method would be used. However, a larger network of the same LROC data is able to run through the same process without failure, so I am genuinely at a loss as to why this particular network is failing. My only theory is that cholmod_analyze is taking in m_cholmodNormal as a non-symmetric matrix for this network, therefore requiring twice the memory typically needed. That would require my upper triangular analysis to somehow be incorrect, but it is the only thing I can think of that would cause cholmod_analyze to act differently (especially in terms of memory) for two very similarly sized networks.

What else was tried:

Next I tried running jigsaw with fewer parameters by stepping CAMSOLVE down to velocities and keeping everything else the same. This ran successfully, using 45 GB of memory. Since this bundle (with fewer parameters) used more memory than the bundle that fails because the ‘problem is too big’, it is very likely that cholmod_analyze is failing during the memory allocation step: there is not enough memory for every allocation, so the function fails, but the memory is never actually used.

I wanted to double-check that the error had to do with the SIZE of the bundle and not the fact that acceleration was one of the solve parameters, so I reran the bundle with the same number of parameters (12 total) but no acceleration solves. This was done by switching CAMSOLVE from accelerations (9 parameters) to velocities (6 parameters) and switching SPSOLVE from positions (3 parameters) to velocities (6 parameters). This bundle failed with the same ‘problem too big’ error.
Therefore, there is not necessarily an error with how jigsaw handles the acceleration parameter, and it is indeed a size issue. We then thought that perhaps the connectivity of the matrix was causing the reduced normal camera matrix (m_cholmodNormal) to have enough off-diagonal elements to make it significantly larger than what jigsaw could handle. I began creating a memory calculator for the various bundle matrices (located at /home/ladoramkershner/projects/notebooks/JigsawSizeCheck_Prototype.ipynb; it is rough and needs to be reconfigured to account for the sparse storage of some of the matrices). I was pointed at a network created in previous years that bundled (/work/projects/laser_a_work/lweller/SPoleNet/2018MayJune_Network/SouthPole_2017Merged_SP_and_Lidar2Image3.net; new_final_jig_00to350.lis). I reran that bundle with the same parameters as the one in this ticket and it was successful. Then I compared the number of graph nodes and edges in each network. In a graph diagram, nodes represent images and edges represent a shared point between images in a pair-wise fashion. Therefore, edges are a good way to evaluate the connectivity and number of off-diagonal elements in the reduced normal camera matrix for a network.
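As an illustration of the node and edge counting described above, here is a minimal Python sketch. It is not the tool used in this investigation; it assumes the control network has already been reduced to a mapping from point id to the images that measure it, and the names `graph_size` and `points` are hypothetical.

```python
from itertools import combinations


def graph_size(points):
    """Count nodes (images) and edges (image pairs sharing at least one point).

    `points` maps a point id to an iterable of image ids that measure it.
    """
    nodes = set()
    edges = set()
    for images in points.values():
        unique = sorted(set(images))
        nodes.update(unique)
        # Every pair of images measuring the same point is one edge;
        # the set keeps each pair only once across all points.
        edges.update(combinations(unique, 2))
    return len(nodes), len(edges)


# Made-up example: four images tied together by three points.
points = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["C", "D"],
}
n_nodes, n_edges = graph_size(points)
print(n_nodes, n_edges)
```

Each edge corresponds to one potential off-diagonal image-to-image block in the reduced normal matrix, which is why the edge count is a reasonable proxy for how dense that matrix can get.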
The archived network has more images, edges, points, and measures, so the connectivity could not explain the memory issue. To verify what cholmod_analyze was seeing, I printed out the size and number of non-zero elements of m_cholmodNormal.
Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe it is not just barely exceeding the memory requirements. |
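For a rough sense of how matrix size and non-zero count translate into storage, the Python sketch below estimates the bytes needed to hold a matrix in compressed-column form versus dense form. It assumes 8-byte values and 4-byte indices, covers only the input matrix (not the fill-in or workspace a factorization needs), and the function names and sample numbers are made up for illustration.

```python
def csc_bytes(n_cols, nnz, value_bytes=8, index_bytes=4):
    """Approximate storage for a compressed-column sparse matrix."""
    values = nnz * value_bytes                 # one value per stored entry
    row_indices = nnz * index_bytes            # one row index per stored entry
    col_pointers = (n_cols + 1) * index_bytes  # start offset of each column
    return values + row_indices + col_pointers


def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Storage if the same matrix were held densely."""
    return n_rows * n_cols * value_bytes


# Made-up dimensions, only to show the gap between the two layouts.
n = 1000
nnz = 5000
print(csc_bytes(n, nnz), dense_bytes(n, n))
```

Comparing two networks this way only bounds the memory for the matrix itself, so a network with fewer non-zeros failing while a larger one succeeds is hard to explain by input size alone.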
Thanks Lauren! Do you think we should see if Ken Edmundson has any ideas on how to resolve this? |
Working directly with Ken has some ethics issues around how people can work with Astro after they leave. I'm also not sure if he'd be able to work on this without access to the cluster and scratch. |
Where should we go from here? The message that is output by this error is not descriptive enough to be helpful and I am still not sure why jigsaw is erroring. Jigsaw operates as expected on a network of the same size using the same amount of memory. So I am not sure if this is a bug, but it does concern me that we cannot isolate the difference between the handling of two networks tested. |
@blandoplanet What is the status on this issue? I believe we should close based on email conversation, but I do not want to close prematurely! Either way, I believe this is off the developers' plate? |
Jay,
I wish that we could investigate further why the SP will not solve for acceleration and not close the post, but it is not up to me. Although NP is similar, SP has some numbers that are greater, and I suspect there is a parameter limiting the effort. The software reports that the bundle is too large, which I believe has merit. Below is a comparison of numbers in SP (red) versus NP. Of course I originally opened the post, but Brent Archinal and Mike Bland decide what to do next.
Images: 9687 18673
Points: 405532 1425784
Total Measures: 3128764 9472039
Total Observations: 6257528 18944078
Good Observations: 6257528 18944078
Rejected Observations: 0 0
Constrained Point Parameters: 438144 1438252
Constrained Image Parameters: 116244 168057
Unknowns: 1332840 444540
Degrees of Freedom: 5479076 16104978
Convergence Criteria: 1e-05(Sigma0) 1e-05(Sigma0)
Iterations: 4 5
…-Janet
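The two columns above can be cross-checked with a little arithmetic. The Python sketch below assumes two relations that are inferred from the numbers themselves rather than quoted from the jigsaw documentation: degrees of freedom equal observations plus constrained point and image parameters minus unknowns, and unknowns equal three coordinates per point plus the image parameters. Under those assumptions the first column is self-consistent, while the second column implies 4445409 unknowns rather than the 444540 listed, which suggests a digit may have been dropped when the figure was copied.

```python
def implied_unknowns(observations, constrained_point, constrained_image, dof):
    """Unknowns implied by the assumed degrees-of-freedom relation."""
    return observations + constrained_point + constrained_image - dof


def unknowns_from_counts(points, image_parameters):
    """Unknowns assuming three coordinates per point plus all image parameters."""
    return 3 * points + image_parameters


# First column of the comparison above.
first = implied_unknowns(6257528, 438144, 116244, 5479076)
print(first, unknowns_from_counts(405532, 116244))

# Second column of the comparison above.
second = implied_unknowns(18944078, 1438252, 168057, 16104978)
print(second, unknowns_from_counts(1425784, 168057))

# Image parameters per image in each column.
print(116244 // 9687, 168057 // 18673)
```

The last line works out to 12 parameters per image in the first column and 9 in the second, so the two runs being compared appear to have used different solve settings.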
|
@jorichie Thanks for the post! I agree 100% that finding out what is going on has value. Right now, though, the development team has put two weeks of debugging effort into this. Check out the conclusion of the lengthy report post from @ladoramkershner above, where she concludes with:
At this point, I do not believe that the development team has other avenues to explore on this as the problem is not constrained well enough to have us aim in any particular direction. We have an internal email chain (@blandoplanet, @ladoramkershner, and Brent) discussing some other options that do not include devs. |
Some additional information from @ladoramkershner about a possible next step in the future... "the next task would be to install a custom build of cholmod and extract more specific information from cholmod_analyze (the function that is failing). During my troubleshooting I was checking things upstream of the command and cross-referencing the documentation to predict which methods cholmod_analyze would use. However, more information may come from actually printing variables and status from inside the function itself." |
Update 2020/09/14

@ladoramkershner, @jorichie, and I came back to this and did some more work last week; here's what we did and found.

Testing with other parameters

We tested running the bundle with some slightly different parameters. First, we ran the bundle with rectangular coordinates. The hope was that this would help eliminate any errors from the longitude domain or the pole. It would also slightly change the math because we would be solving for different ground point coordinates. One change that was required for this was converting the ground point sigmas from lat/lon/rad to x/y/z. In the latitudinal bundle only the radius was constrained, so we decided to constrain the Z coordinate by the same amount. The data set is close to the pole, so the vast majority of radius variation would be in the Z direction. This test still failed in the same place with the same error.

Next, we ran the bundle without the overhermite setting enabled. This could help check for errors in the polynomial setup portion of the bundle. Unfortunately, this also failed in the same way.

Extracting a subregion of the network

We wanted to see whether a subregion of the network would successfully bundle with acceleration. This could help us narrow down whether any specific images, points, or measures are causing this error. We identified regions that contained issues in the mosaic when solving for velocities and then extracted those subregions of the network. Unfortunately, extracting the subregions compromised the integrity of the network, and additional work would have been required to make the subregions bundle by themselves. We ultimately decided that attempting to make various subregions bundle by themselves would take too long and that the potential value was not worth it.

Checking the values in the normal matrix

It is possible that the normal matrix is ill-conditioned and CHOLMOD could be running into issues when it tries to analyze it.
Computing the normal matrix requires computing partial derivatives and these can sometimes run into discontinuities resulting in extremely large numbers. When this happens, the resulting normal matrix could have extremely large values in some places that will result in a failure to solve the iterations. To check for this, we inserted a small bit of code to compute some statistics on the non-zero values in the normal matrix. We then ran the debug code on the network described in this issue (the Active Net), and an older network consisting of LRO NAC images of the North Pole (the Archived Net). Here are the results for the active network in question and the archived network that is able to bundle with accelerations:
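The debug code itself is not shown in this thread. As a rough idea of what such a check might look like, here is a Python sketch; it is not the ISIS code, and the function name `nonzero_stats` and the sample values are made up purely for illustration.

```python
import math


def nonzero_stats(values):
    """Summarize the non-zero entries of a matrix given as a flat list of floats."""
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        raise ValueError("matrix has no non-zero entries")
    return {
        "count": len(nonzero),
        "min": min(nonzero),
        "max": max(nonzero),
        "mean": sum(nonzero) / len(nonzero),
        # Largest magnitude is the quickest flag for a blown-up partial derivative.
        "max_abs": max(abs(v) for v in nonzero),
        # Infinities or NaNs would point at a discontinuity in the partials.
        "non_finite": sum(1 for v in nonzero if not math.isfinite(v)),
    }


# Made-up values standing in for the stored entries of a normal matrix.
sample = [4.0, -1.5, 0.0, 2.5e3]
stats = nonzero_stats(sample)
print(stats)
```

Running the same summary on a network that bundles and one that does not gives a like-for-like comparison of value ranges, which is the kind of comparison the paragraph above describes.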
None of the values stand out as too large to work with.

Narrowing down when the network became unable to bundle with acceleration

After some discussion it was found that a previous version of the network could be bundle adjusted solving for acceleration on old hardware at the ASC, but when we moved to new hardware, it could only solve for velocities. This could help narrow down any changes in the network or code that caused this. We looked for old processing and log files to determine exactly which version of ISIS and network were used on the old hardware and which were used on the new hardware. We found a log that successfully solved for acceleration. Here is the version info:
Here is the network
The print file can be found at

We could not find a log file from just after the transition that attempted to solve for acceleration on new hardware. The closest we found was a successful solve for velocities on the new hardware. Here is the version info:
Here is the network
The print file can be found at

This gives us a rough bound between February 2014 and August 2017, ISIS 3.5.0 to ISIS 3.5.2.

Checking old ISIS versions

We still have access to ISIS 3.5.0 and later on hardware at the ASC, so we decided to test and see if we could narrow this down further. We attempted to run the bundle solving for acceleration under versions 3.5.0, 3.5.1, 3.5.2, and 3.6.0. Unfortunately, for 3.5.0 and 3.5.1 we ran into an error:
There is some sort of issue reading the Table blobs that contain the SPICE data. We may be able to work around this by re-running spiceinit on the images using the version of ISIS we plan to bundle with. Unfortunately, we ran out of time and also ran into some issues with our processing cluster that need to be resolved before this can continue. We also looked at all of the changes to jigsaw between 3.5.0 and 3.5.2. Here are the changes to the bundle adjust during that period; the changes that we think could impact this issue are in bold:
Potential Future Work

The most promising lead is narrowing down when this worked and when this stopped working. Checking each ISIS version is a good idea but will require duplicating the data and then re-processing. Investigating suspected code changes will require careful examination of the code at the time they were made and going over the execution path. Compiling CHOLMOD with DEBUG flags enabled is also still an option. |
Thank you for your contribution! Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.' If no additional action is taken, this issue will be automatically closed in 180 days. |
Still waiting for a good test case here |
Thank you for your contribution! Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.' If no additional action is taken, this issue will be automatically closed in 180 days. |
Unresolved, keep open. |
Jigsaw not only fails to solve for acceleration but now won't solve for velocity. Solves for camera angles and position okay. Still getting the same cholmod error as when I opened the post. This network was the most recent network successfully used in jigsaw to solve for velocity before it failed. This network follows the successful run, but jigsaw failed to solve for velocity. /usgs/shareall/FOR_ISIS//usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net |
I don't know why that would be. Maybe something about the version you were trying to use?
I left before the network was fully loaded, but it did load and I can load points, etc. @jessemapel, qnet is up and running on astrovm4 for me and is currently using about 34G of memory. I'm not sure it requires more than that when it is loading the images, but I suppose lack of memory could be a problem for astrovm5. vm5 has maybe 65G of memory, but vm4 has a bit over 100G. Maybe there were too many other things running on vm5 when Janet had her problem, but wouldn't it just run slower and maybe use swap? Update: ast{104}> free -h |
Yeah the
failure indicates that the OS abruptly stopped the application for some reason. The most common cause for this is memory issues. Hopefully IT can get the VMs reset and good to use again. |
I put a ticket in for IT to have a look and see if they see anything unusual. sinteractive on the cluster is a good alternative to use if astrovm4 has problems too. |
Lynn's sinteractive suggestion allows the proper loading of the network. IT plans to resolve "some stuck processes and excess memory consumption," on Aug. 13 at 5:00 PM. We are now able to run jobs and load qnet after the IT changes without having to use sinteractive. |
@jorichie, if jigsaw still can't solve for accelerations for your network, you probably want to keep this post open. |
I have been able to reproduce this with a Mars network when adding a single ground control point with a computed covariance matrix. I have networks available with and without ground points that illustrate the problem. I have linked those internally only at a code.chs.usgs.gov repository and provided access to the folks that are going to be working the problem. |
@jorichie Can you post the most recent jigsaw command lines that you ran? We're investigating some issues potentially related to not setting sigmas for every parameter. For example, your original post doesn't set point latitude/longitude sigmas. |
Just curious @jessemapel - do you think something might have changed how point latitude/longitude sigmas are being used over the past several years? Those sigmas were not used for the LROC NAC north pole network or any of the Themis IR bundles (including the global), both of which had ground points and solved for radius, camera accelerations and spacecraft position without problems. All of that work ended around 2017/2018. Since that time, I have found I need to add point lat/lon sigmas for Europa, Titan and now Phobos (but not Kaguya, though maybe it would help with difficult quads). I figured for the global data sets with limb and disk images having no or few ground points, it was necessary to help keep things from leaping all over the place, and the extra constraints have generally helped. But those are not settings we've been encouraged to use in the past or had a need for. I've tried to pick sigmas that make some sense for the data (considering resolution mostly and how good/bad the spice is), so they have ranged from 1500-5000 meters or more, and they have helped for my problem projects. |
We've made several changes to things in the bundle over the last several years. The big ones that could impact point sigmas are the rectangular (XYZ) bundle and the lidar support. They were supposed to preserve the existing functionality, but could have introduced a bug. We also haven't 100% confirmed that not setting sigmas is the problem; it's just our best lead right now. Jay is doing a bunch more testing today. @lwellerastro your comments also help back up this being the potential issue. |
Also, I agree we should not be setting point lat/lon sigmas in most cases. It seems the logic that is supposed to leave them free has a bug. Setting them, for right now, is a workaround. |
I just had a look at Europa and Titan and see that point sigmas helped when sorting out islands of images, but once all data were manually joined to main network I didn't need the extra constraints. I'm recalling now it was Enceladus that needed the lat/lon constraints because the corrections were excessive. That suggestion was made by Brent at the time and although things like residuals and camera angles didn't change radically (or hardly at all), the lat/lon constraints kept the whole network from radically sliding away from where they started. The spice was not horrendous so it didn't make sense for the points to move so much. There was no ground for Enceladus either, so that added to the thought process and adding the constraints generally helped keep things closer to apriori locations (we used 1000 meters there). I agree for better mapping missions they shouldn't be necessary, but having a work around could be useful. |
Hey Jesse, please see /sbatch3_jigsaw.bsh for the commands we are currently using. Thanks for taking a look at this. |
Afternoon @jorichie, we really should not post full paths in a public place like this. I am going to edit your post to remove the path. Would you mind copy/pasting the contents into your post or a new response? That is helpful not only from a security perspective, but also for anyone who does not have access to the machine with that path who is interested in the discussion. Thanks! |
Thanks for the information, Jay. I have copied, pasted the requested information here.
jigsaw fromlist=SouthPole_2020Merged_SP_and_Lidar2Image4_updated12_image_temp.lis cnet=SouthPole_2022_Merged_Lidar2Image_redo105.net
onet= test_SouthPole_2022_Merged_Lidar2Image_redo105_remerge.net
radius=yes camsolve=angles twist=yes overexisting=yes spsolve=position
outlier_rejection=no
overhermite=no
camera_angles_sigma=1.0
camera_angular_velocity_sigma=0.5
camera_angular_acceleration_sigma=0.25
spacecraft_position_sigma=250
point_radius_sigma=150
maxits=3
|
@jorichie thanks! Have you tried this with the following added:
I don't know what values you want to put in for the ???. Maybe 500, 100, 100, 50 (total guesses that will need some iteration on to figure out; or a look at the LROC kernels to see if they have reported accuracies). I would probably try both with and without overhermite - I don't think that that is the issue. The |
I'm not sure which sigmas need to be set, but I would test setting all of them. I think point lat/lon sigmas of 500-1000m should be "safe" guesses. Your spacecraft velocity sigma can be around 100. |
Here is the result from the first test, including setting values for point_radius and longitude sigmas, point radius sigma, and spacecraft velocity: Next, I tried jigsaw specifying overexisting=no with no change to the other parameters. There was no change in the result, and the error was identical to the one described here. |
@jorichie That is terrific! I believe that you are getting a new error? And it looks like it could be a memory issue? How much memory are you requesting and are you exceeding the available memory on the machine? Problem too large looks like it can occur when cholmod is converting a sparse matrix to a dense matrix. Also, are you intentionally setting spsolve=positions or do you usually use spsolve=velocities or accelerations? Is using one of those resulting in this same error or a different error? |
@jlaura, it is not and never has been a memory problem. This network has been watched on multiple occasions and uses <50G of memory. Janet is asking for an entire node, I believe. I believe Lauren/Jesse also confirmed there are no memory overruns. The current error appears to be identical to what is in the original post. This work has only required spsolve=position. |
Here is the source code that is throwing the error. Please do a search in there if you like for the word 'problem' (pulled from the error message). You will see two instances in the code where that is happening, and both are after checks to see if the problem will fit into memory. None of the above is to say that the problem is not in jigsaw (for example, if a sparse matrix is being made dense for some reason), but that error is being raised by CHOLMOD. Lauren's post from June 15 also looked at memory-related issues and did not find that the network size necessarily corresponded to the decomposition path that CHOLMOD was selecting, as far as I can tell. Jesse's post from September 14 specifically indicates solving for accelerations, which is why I asked about that aspect. |
The latest version of this south pole network that failed to bundle using camera accelerations (and I think spacecraft position) now successfully bundles under a test version of jigsaw which utilizes changes in #5176. The network now solves for radius, camera accelerations and spacecraft position. The process solved in about 15 hours and used 80G of memory. I think this post can now be closed. |
Closing with comments and positive reporting from @lwellerastro |
ISIS version(s) affected: isis3.10.2 on astrovm4, previously astrovm2
Description
Jigsaw will not solve for acceleration. Solves for camera angles and velocity okay.
Error:
Validation complete!...
starting iteration 1
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job6627021/slurm_script: line 39: 16235 Segmentation fault (core dumped)
How to reproduce
Here are the parameters used for jigsaw: *also /scratch/jrichie/S-Technique/sbatch4_jigsaw.bsh; all pertinent files can be found at /scratch/jrichie/STechnique
jigsaw fromlist= new11_updated_jig_00to350.lis cnet= SouthPole_2017Merged_Lidar2Image1_cnetedit.net
onet= SouthPole_2017Merged_Lidar2Image2.net
update=no sigma0=1.0e-5 maxits=3 errorpropagation=no
radius=yes camsolve=accelerations twist=yes overexisting=yes
outlier_rejection=no
spsolve=position overhermite=yes
camera_angles_sigma=1.0
camera_angular_velocity_sigma= 0.5
camera_angular_acceleration_sigma=0.25
spacecraft_position_sigma=250
point_radius_sigma=150 \
Possible Solution
Check whether the available memory is sufficient, or whether there are any "number of" limitations in the code. Weller was able to solve for acceleration for the north pole. The differences are that the South Pole is twice as large and has a significant amount of shadows.
Additional context