LROC SouthPole network cannot solve for acceleration in jigsaw #3871

jorichie · 2020-05-16T03:56:25Z

ISIS version(s) affected: isis3.10.2 on astrovm4, previously astrovm2

Description
Jigsaw will not solve for acceleration. Solves for camera angles and velocity okay.
Error:
Validation complete!...
starting iteration 1
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job6627021/slurm_script: line 39: 16235 Segmentation fault (core dumped)

How to reproduce
Here are the parameters used for jigsaw: *also /scratch/jrichie/S-Technique/sbatch4_jigsaw.bsh; all pertinent files can be found at /scratch/jrichie/STechnique
jigsaw fromlist= new11_updated_jig_00to350.lis cnet= SouthPole_2017Merged_Lidar2Image1_cnetedit.net
onet= SouthPole_2017Merged_Lidar2Image2.net
update=no sigma0=1.0e-5 maxits=3 errorpropagation=no
radius=yes camsolve=accelerations twist=yes overexisting=yes
outlier_rejection=no
spsolve=position overhermite=yes
camera_angles_sigma=1.0
camera_angular_velocity_sigma= 0.5
camera_angular_acceleration_sigma=0.25
spacecraft_position_sigma=250
point_radius_sigma=150 \

Possible Solution
Check whether the available memory is sufficient, and whether the code has any "number of" limitations. Weller was able to solve for acceleration for the North Pole network. The differences are that the South Pole network is twice as large and has a significant amount of shadow.
Additional context

jessemapel · 2020-05-18T15:57:29Z

The network is likely too large/connected to solve with the available memory. What system did you run this on?

jessemapel · 2020-05-18T16:53:12Z

@blandoplanet Is this a products concern? Seems like it to me.

blandoplanet · 2020-05-18T20:07:26Z

Yes. The Products tag is appropriate.

jorichie · 2020-05-28T16:33:40Z

"The network is likely too large/connected to solve with the available memory. What system did you run this on?"
As per my conversation with Jesse, astrovm4 and astrovm2. Also noted at beginning of report.

lwellerastro · 2020-05-28T16:35:57Z

I think it was a mistake that this post was closed via the last reply. Re-opened.

ladoramkershner · 2020-06-15T19:27:03Z

This may become more information than needed (so do not feel the need to read everything), but I would like to document for any future related efforts.

Final description of Problem:
The initial error is caused by a memory issue on the call cholmod_analyze(m_cholmodNormal, &m_cholmodCommon) on line 1556 in BundleAdjust.cpp, where m_cholmodNormal is the reduced normal camera matrix used in the bundle solution and m_cholmodCommon carries around information repeatedly needed for cholmod functions.

cholmod_analyze uses various methods to order the passed-in matrix for easy factorization and, if it fails, returns null. This function can fail if the initial memory allocation fails, if all ordering methods fail, or if the supernodal analysis (if requested) fails. Based on the ‘problem too large’ error, I am guessing the memory allocation is failing (https://github.com/PetterS/SuiteSparse/blob/master/CHOLMOD/Cholesky/cholmod_analyze.c).
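The three CHOLMOD messages in the original report are consistent with this: once the analysis step returns null, the later factorize and solve calls receive a null factor and report "argument missing". Below is a minimal sketch of that cascade, and of a guard that stops at the first failure. The functions analyzeStub, factorizeStub, and solveGuarded are hypothetical stand-ins for illustration, not CHOLMOD or ISIS APIs.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT CHOLMOD or ISIS functions.
struct Factor {
  int n;
};

// Pretend symbolic analysis: returns null when the problem exceeds a limit.
std::unique_ptr<Factor> analyzeStub(long long workSize, long long limit,
                                    std::vector<std::string> &log) {
  if (workSize > limit) {
    log.push_back("analyze: problem too large");
    return nullptr;
  }
  return std::make_unique<Factor>(Factor{1});
}

// Pretend numeric step: reports a missing argument when handed a null factor.
bool factorizeStub(const Factor *factor, std::vector<std::string> &log) {
  if (factor == nullptr) {
    log.push_back("factorize: argument missing");
    return false;
  }
  return true;
}

// Guarded driver: stop at the first failure instead of passing null downstream.
bool solveGuarded(long long workSize, long long limit,
                  std::vector<std::string> &log) {
  std::unique_ptr<Factor> factor = analyzeStub(workSize, limit, log);
  if (!factor) {
    return false;  // only the root-cause message has been logged
  }
  return factorizeStub(factor.get(), log);
}
```

With a guard like this, only the first message would be reported and the run could exit with an error instead of a segmentation fault.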

I checked which methods were being used for cholmod_analyze by running a successful bundle, printing out the m_cholmodCommon variable, and looking at the ‘nmethods’ field. This indicated that AMD or COLAMD is used for ordering m_cholmodNormal, depending on whether the matrix passed in is symmetric or non-symmetric. If a symmetric matrix is passed into cholmod_analyze, only the upper or lower triangle is accessed and ordered by the function to save on memory. If a non-symmetric matrix (A) is passed in, it brings in the whole matrix and orders AA’, effectively using twice the memory.

To evaluate the symmetry of m_cholmodNormal I went through the columns and compared the smallest and largest row index stored in each column against the column index. If the largest row index in every column is the column index itself, the sparsely stored m_cholmodNormal is an upper triangular matrix; if the smallest row index in every column is the column index, it is a lower triangular matrix.

In BundleAdjust::loadCholmodTriplet():

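// Assume both until a column proves otherwise.
bool ut = true;
bool lt = true;
for (int columnIndex = 0; columnIndex < m_sparseNormals.size(); columnIndex++) {
  // Keys are row block indices: lastKey() is the largest, firstKey() the smallest.
  int lastKey = m_sparseNormals[columnIndex]->lastKey();
  int firstKey = m_sparseNormals[columnIndex]->firstKey();
  // Upper triangular: the diagonal block is the last (largest) row in the column.
  ut = ut && (columnIndex == lastKey);
  // Lower triangular: the diagonal block is the first (smallest) row in the column.
  lt = lt && (columnIndex == firstKey);
}
std::cout << "Upper Triangular: " << ut << std::endl;
std::cout << "Lower Triangular: " << lt << std::endl;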

Using this method I confirmed that the normal matrix for the network passed in during failure (/scratch/jrichie/STechnique/SouthPole_2017Merged_Lidar2Image1_cnetedit.net) is stored as a sparse upper triangular matrix, meaning it should be treated as a symmetric matrix and the less memory-intensive AMD ordering method should be used. However, a larger network of the same LROC data is able to run through the same process without failure, so I am genuinely at a loss as to why this particular network is failing.
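For anyone who wants to repeat the check outside of ISIS, here is a minimal standalone sketch of the same idea. It assumes each column is a std::map from row index to value, as a stand-in for the real block columns. Unlike the snippet above, it uses <= and >= comparisons, so it does not require the diagonal entry to be present. The type and function names are illustrative only.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// One column of a sparse matrix: row index -> stored value.
// (A stand-in for illustration; the real ISIS structure stores matrix blocks.)
using SparseColumn = std::map<int, double>;

struct Triangularity {
  bool upper;
  bool lower;
};

// Classifies a column-wise sparse matrix as upper and/or lower triangular.
Triangularity checkTriangular(const std::vector<SparseColumn> &columns) {
  Triangularity result{true, true};
  for (std::size_t col = 0; col < columns.size(); ++col) {
    if (columns[col].empty()) {
      continue;  // an empty column constrains nothing
    }
    const int c = static_cast<int>(col);
    const int firstRow = columns[col].begin()->first;  // smallest stored row
    const int lastRow = columns[col].rbegin()->first;  // largest stored row
    result.upper = result.upper && (lastRow <= c);   // nothing below the diagonal
    result.lower = result.lower && (firstRow >= c);  // nothing above the diagonal
  }
  return result;
}
```

A purely diagonal matrix reports true for both flags, which is expected.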

My only theory is that cholmod_analyze is taking in the m_cholmodNormal as a non-symmetric matrix for this network, therefore requiring twice the memory as is typically needed. That would require my upper triangular analysis to somehow be incorrect, but it is the only thing I can think of that would cause cholmod_analyze to act differently (especially in terms of memory) for two very similarly sized networks.

What Else Was Tried:
Since it was a memory issue, I first attempted to max out the memory allocation request on big mem for computation, using the same jigsaw call as in the ticket. This allowed for a maximum of 375 GB of memory, but the problem only used 26 GB and resulted in the same failure.

Next I tried running jigsaw with fewer parameters by stepping down CAMSOLVE to velocities and keeping everything else consistent. This ran successfully, using 45 GB of memory. Since this bundle (with fewer parameters) used more memory than the bundle which fails because the ‘problem is too large’, it is very likely that cholmod_analyze is failing during the memory allocation step, where there is not enough memory for every allocation, so the function fails before the memory is ever actually used.
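The low memory use would also fit a different kind of failure: a size or index computed during the analysis exceeding what a 32-bit signed integer can hold, so that the function gives up before allocating anything. That reading is an assumption and is not confirmed anywhere in this thread. A minimal sketch of the arithmetic, with illustrative helper names that do not come from CHOLMOD or ISIS:

```cpp
#include <cstdint>
#include <limits>

// True if a count can be represented by a 32-bit signed index.
bool fitsInInt32(std::uint64_t count) {
  return count <=
         static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Entries in a completely dense triangle of an n x n matrix
// (the worst case for the size of a Cholesky factor).
std::uint64_t denseTriangleEntries(std::uint64_t n) {
  return n * (n + 1) / 2;
}

// Largest fill ratio (factor entries / matrix entries) that still fits
// in a 32-bit signed index for a matrix with the given stored non-zeros.
double maxFillRatioForInt32(std::uint64_t matrixNonZeros) {
  return static_cast<double>(std::numeric_limits<std::int32_t>::max()) /
         static_cast<double>(matrixNonZeros);
}
```

With the ticket network's sizes printed further down (nrow 224100, nzmax 150399738), the stored non-zeros fit comfortably, a fully dense triangle (about 2.5e10 entries) would not, and a factor with more than roughly 14 times the fill of the input would cross the limit. How much fill a given network produces depends on its structure and ordering, not only on its size.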

I wanted to double check that the error had to do with the SIZE of the bundle and not the fact that acceleration was one of the solve parameters, so I reran the bundle with the same number of parameters (12 total) but no acceleration solves. This was done by switching CAMSOLVE from accelerations (9 parameters) to velocities (6 parameters) and switching SPSOLVE from positions (3 parameters) to velocities (6 parameters). This bundle failed with the same ‘problem too large’ error. Therefore, there is not necessarily an error with how jigsaw handles the acceleration parameter, and it is indeed a size issue.
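The parameter counts above follow from multiplying the number of angles or coordinates by the number of polynomial coefficients solved for each. A minimal sketch of that bookkeeping, assuming three pointing angles when twist=yes (two otherwise) and three position coordinates; the enum and function names are illustrative, not ISIS identifiers:

```cpp
// Number of polynomial coefficients solved per angle or coordinate.
enum class SolveLevel { None = 0, Constant = 1, Velocities = 2, Accelerations = 3 };

// Pointing parameters: right ascension and declination, plus twist when enabled.
int cameraParameters(SolveLevel level, bool solveTwist) {
  const int angles = solveTwist ? 3 : 2;
  return angles * static_cast<int>(level);
}

// Position parameters: three coordinates times the coefficients per coordinate.
int positionParameters(SolveLevel level) {
  return 3 * static_cast<int>(level);
}
```

Here camsolve=angles and spsolve=position both correspond to SolveLevel::Constant. Both configurations tested above come out to 12 parameters per image: 9 + 3 for accelerations with positions, and 6 + 6 for velocities with velocities.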

We then thought that perhaps the connectivity of the matrix was causing the reduced normal camera matrix (m_cholmodNormal) to have enough off diagonal elements to make it significantly larger than what jigsaw could handle. I began creating a memory calculator for the various bundle matrices (located /home/ladoramkershner/projects/notebooks/JigsawSizeCheck_Prototype.ipynb; it is rough and needs to be reconfigured to account for the sparse storage of some of the matrices).

I was pointed at a network created in previous years that bundled successfully (/work/projects/laser_a_work/lweller/SPoleNet/2018MayJune_Network/SouthPole_2017Merged_SP_and_Lidar2Image3.net; new_final_jig_00to350.lis). I reran that bundle with the same parameters as the one in this ticket and it was successful. Then I compared the number of graph nodes and edges in each network. In a graph diagram, nodes represent images and edges represent a shared point between images in a pair-wise fashion. Therefore, edges are a good way to evaluate the connectivity and number of off-diagonal elements in the reduced normal camera matrix for a network.

Count	Ticket Network	Archived Network
vertices	18675	18929
edges	1034327	1165021
npoints	1425791	1649017
nmeas	9469089	13752809
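As a sketch of how the edge count relates to the control network: every pair of distinct images that measure the same point contributes one undirected edge, counted once no matter how many points the pair shares. The representation below (each point as a list of image indices) and the function name are assumptions for illustration, not how ISIS stores a network.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Each control point is represented by the image indices that measure it.
using PointMeasures = std::vector<int>;

// Returns the unique undirected image pairs that share at least one point.
std::set<std::pair<int, int>> imageEdges(const std::vector<PointMeasures> &points) {
  std::set<std::pair<int, int>> edges;
  for (const PointMeasures &images : points) {
    for (std::size_t i = 0; i < images.size(); ++i) {
      for (std::size_t j = i + 1; j < images.size(); ++j) {
        if (images[i] == images[j]) {
          continue;  // ignore a repeated image on the same point
        }
        // Store the pair in sorted order so (a, b) and (b, a) are one edge.
        edges.emplace(std::min(images[i], images[j]),
                      std::max(images[i], images[j]));
      }
    }
  }
  return edges;
}
```

A point measured on k images contributes up to k(k-1)/2 edges, which is why well-overlapped polar networks become densely connected.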

The archived network has more images, edges, points, and measures, so the connectivity could not explain the memory issue. To verify what cholmod_analyze was seeing, I printed out the size and number of non-zero elements of m_cholmodNormal:

m_cholmodNormal field	Ticket Network	Archived Network
nrow	224100	227148
ncol	224100	227148
nzmax	150399738	169239486
xtype	1	1
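These sizes line up with the graph counts above if each image contributes 12 parameters, each edge contributes one full 12 x 12 off-diagonal block, and only the upper triangle of each diagonal block is stored. That storage layout is inferred from the numbers rather than taken from the code, so treat it as an assumption. A minimal sketch of the calculation, with illustrative names:

```cpp
#include <cstdint>

struct NormalMatrixSize {
  std::uint64_t rows;            // also the number of columns
  std::uint64_t storedNonZeros;  // entries kept in upper-triangular storage
};

// Predicts the size of the reduced normal matrix from the image graph.
NormalMatrixSize predictSize(std::uint64_t images, std::uint64_t edges,
                             std::uint64_t paramsPerImage) {
  const std::uint64_t p = paramsPerImage;
  const std::uint64_t diagonalBlock = p * (p + 1) / 2;  // upper triangle of p x p
  const std::uint64_t offDiagonalBlock = p * p;         // full p x p block per edge
  return {images * p, images * diagonalBlock + edges * offDiagonalBlock};
}
```

For the ticket network, predictSize(18675, 1034327, 12) gives 224100 rows and 150399738 stored non-zeros; for the archived network, predictSize(18929, 1165021, 12) gives 227148 and 169239486. Both match the printed values, and at 8 bytes per value the stored entries alone are on the order of 1.2 to 1.4 GB.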

Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe the ticket network is not just barely exceeding the memory requirements.

jorichie · 2020-06-15T19:50:05Z

Thanks Lauren! Do you think we should see if Ken Edmundson has any ideas on how to resolve this?

jessemapel · 2020-06-15T20:02:13Z

Working directly with Ken has some ethics issues around how people can work with Astro after they leave. I'm also not sure if he'd be able to work on this without access to the cluster and scratch.

ladoramkershner · 2020-06-19T16:50:05Z

Where should we go from here?

The message that is output by this error is not descriptive enough to be helpful and I am still not sure why jigsaw is erroring. Jigsaw operates as expected on a network of the same size using the same amount of memory. So I am not sure if this is a bug, but it does concern me that we cannot isolate the difference between the handling of two networks tested.

jlaura · 2020-08-17T17:07:17Z

@blandoplanet What is the status on this issue? I believe we should close based on email conversation, but I do not want to close prematurely! Either way, I believe this is off the developers plate?

jorichie · 2020-08-17T21:30:41Z

Jay, I wish that we could investigate further into why the SP will not solve for acceleration and not close the post, but it is not up to me. Whereas NP is similar, SP has some numbers that are greater, and I suspect there is a parameter limiting the effort. The software reports that the bundle is too large, which I believe has merit. Of course I originally opened the post, but Brent Archinal and Mike Bland make such decisions as to what to do next. Below is a comparison of numbers in NP versus SP.

Statistic	NP	SP
Images	9687	18673
Points	405532	1425784
Total Measures	3128764	9472039
Total Observations	6257528	18944078
Good Observations	6257528	18944078
Rejected Observations	0	0
Constrained Point Parameters	438144	1438252
Constrained Image Parameters	116244	168057
Unknowns	1332840	444540
Degrees of Freedom	5479076	16104978
Convergence Criteria	1e-05(Sigma0)	1e-05(Sigma0)
Iterations	4	5


-Janet


jlaura · 2020-08-17T22:38:32Z

@jorichie Thanks for the post! I agree 100% that finding out what is going on has value. Right now though, the development team has put two weeks of debugging effort into this. Check out the lengthy report post from @ladoramkershner above, which concludes with:

"Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe it is not just barely exceeding the memory requirements."

At this point, I do not believe that the development team has other avenues to explore on this as the problem is not constrained well enough to have us aim in any particular direction. We have an internal email chain (@blandoplanet, @ladoramkershner, and Brent) discussing some other options that do not include devs.

blandoplanet · 2020-08-31T17:04:55Z

Some additional information from @ladoramkershner about a possible next step in the future...

"the next task would be to install a custom build of cholmod and extract more specific information from cholmod_analyze (the function that is failing). During my troubleshooting I was checking things upstream of the command and cross-referencing the documentation to predict which methods cholmod_analyze would use. However, more information may come from actually printing variables and status from inside the function itself."

jessemapel · 2020-09-14T18:53:09Z

Update 2020/09/14

@ladoramkershner, @jorichie, and I came back to this and did some more work last week; here's what we did and found.

Testing with other parameters

We tested running the bundle with some slightly different parameters.

First, we ran the bundle with rectangular coordinates. The hope was that this would help eliminate any errors from the longitude domain or the pole. It would also slightly change the math because we would be solving for different ground point coordinates. One change that was required for this was converting the ground point sigmas from lat/lon/rad to x/y/z. In the latitudinal bundle only the radius was constrained, so we decided to constrain the Z coordinate by the same amount. The data set is close to the pole, so the vast majority of radius variation would be in the Z direction. This test still failed in the same place with the same error.
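The reasoning behind constraining Z can be checked with a little trigonometry: for a point at planetocentric latitude lat and radius r, z = r sin(lat), so a radial change shows up along Z scaled by |sin(lat)| and in the XY plane scaled by |cos(lat)|. A minimal sketch, with illustrative function names and an example latitude of -85 degrees chosen only for demonstration:

```cpp
#include <cmath>

// Fraction of a radial change that appears along the body-fixed Z axis
// at a given planetocentric latitude: z = r * sin(lat), so |dz/dr| = |sin(lat)|.
double radialFractionAlongZ(double latitudeDegrees) {
  const double pi = std::acos(-1.0);
  return std::fabs(std::sin(latitudeDegrees * pi / 180.0));
}

// Fraction of a radial change that appears in the XY plane: |cos(lat)|.
double radialFractionInXY(double latitudeDegrees) {
  const double pi = std::acos(-1.0);
  return std::fabs(std::cos(latitudeDegrees * pi / 180.0));
}
```

At -85 degrees more than 99% of a radial change lies along Z and less than 9% lies in the XY plane, so reusing the radius sigma as the Z sigma is a reasonable approximation that close to the pole.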

Next, we ran the bundle without the overhermite setting enabled. This could help check for errors in the polynomial setup portion of the bundle. Unfortunately, this also failed in the same way.

Extracting a subregion of the network

We wanted to see whether a subregion of the network would successfully bundle with acceleration. This could help us narrow down if there are any specific images, points, or measures that are causing this error.

We identified regions that contained issues in the mosaic when solving for velocities and then extracted those subregions of the network. Unfortunately, extracting the subregions compromised the integrity of the network and additional work would have been required to make the subregions bundle by themselves.

We ultimately decided that attempting to make various subregions bundle by themselves would take too long and that the potential value was not worth it.

Checking the values in the normal matrix

It is possible that the normal matrix is ill-conditioned and CHOLMOD could be running into issues when it tries to analyze it. Computing the normal matrix requires computing partial derivatives and these can sometimes run into discontinuities resulting in extremely large numbers. When this happens, the resulting normal matrix could have extremely large values in some places that will result in a failure to solve the iterations.

To check for this, we inserted a small bit of code to compute some statistics on the non-zero values in the normal matrix. We then ran the debug code on the network described in this issue (the Active Net), and an older network consisting of LRO NAC images of the North Pole (the Archived Net). Here are the results for the active network in question and the archived network that is able to bundle with accelerations:

Stat	Archived Net	Active Net	Difference
Minimum	-15495184677439.5	-4855168835499.75	10640015841939.75
Maximum	37370058443140.8	16392435737669.1	20977622705471.695
Average	445321714.549715	307498754.542037	137822960.00767797
Standard Deviation	56499889942.841	38040249205.63005	18459640737.210503
Non-zero Elements	169239486	150399738	18839748

None of the values stand out as too large to work with.
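For reference, statistics like these can be gathered in a single pass over the stored values. The sketch below is not the debug code we inserted into jigsaw; it is a standalone illustration that assumes the values are available as a flat array, skips exact zeros, and reports the population standard deviation using a running (Welford) update.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ValueStats {
  double minimum;
  double maximum;
  double average;
  double standardDeviation;  // population standard deviation
  std::size_t count;         // number of non-zero values
};

// Single-pass statistics over the non-zero entries of a value array.
ValueStats nonZeroStats(const std::vector<double> &values) {
  ValueStats s{0.0, 0.0, 0.0, 0.0, 0};
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations from the mean
  for (double v : values) {
    if (v == 0.0) {
      continue;
    }
    if (s.count == 0) {
      s.minimum = v;
      s.maximum = v;
    } else {
      s.minimum = std::min(s.minimum, v);
      s.maximum = std::max(s.maximum, v);
    }
    ++s.count;
    const double delta = v - mean;
    mean += delta / static_cast<double>(s.count);
    m2 += delta * (v - mean);
  }
  s.average = mean;
  s.standardDeviation =
      s.count > 0 ? std::sqrt(m2 / static_cast<double>(s.count)) : 0.0;
  return s;
}
```

The running update avoids summing squares of values on the order of 1e13 directly, which would lose precision in double arithmetic.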

Narrowing down when the network became unable to bundle with acceleration

After some discussion it was found that a previous version of the network could be bundle adjusted solving for acceleration on old hardware at the ASC, but when we moved to new hardware, it could only solve for velocities. This could help narrow down any changes in the network or code that caused this.

We looked for old processing and log files to determine exactly which version of ISIS and network were used on the old hardware and which were used on the new hardware.

We found a log that successfully solved for acceleration. Here is the version info:

IsisVersion       = "3.5.00.7260 beta | 2016-01-25"
ProgramVersion    = 2014-02-13

Here is the network

CNET                              = SouthPole_2017Merged_SP_and_Lidar2Ima-
                                    ge2_cnetedit.net

The print file can be found at /scratch/jrichie/SOUTHPOLE.old/NEW/print.prt

We could not find a log file from just after the transition that attempted to solve for acceleration on new hardware. The closest we found was a successful solve for velocities on the new hardware. Here is the version info:

IsisVersion       = "3.5.2.8306 beta | 2017-11-04"
ProgramVersion    = 2017-08-09

Here is the network

  CNET                              = SouthPole_5test_velocity_not_updated_-
                                      cnetedit.net

The print file can be found at /usgs/shareall/FOR_SP_ISIStest/onelasttest.prt

This gives us a rough bound between February 2014 and August 2017, ISIS 3.5.0 to ISIS 3.5.2.

Checking old ISIS versions

We still have access to ISIS 3.5.0 and later on hardware at the ASC, so we decided to test and see if we could narrow this down further. We attempted to run the bundle with solving for acceleration under version 3.5.0, 3.5.1, 3.5.2, and 3.6.0. Unfortunately, for 3.5.0 and 3.5.1 we ran into an error:

**ERROR** Unable to create camera for cube file /work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub in ControlNet.cpp at 1617. 
**ERROR** Unable to initialize camera model from group [Instrument] in CameraFactory.cpp at 97. 
**I/O ERROR** Unable to open [/work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub] in Blob.cpp at 278.

There is some sort of issue reading the Table blobs that contain the SPICE data. We may be able to work around this by re-running spiceinit on the images using the version of ISIS we plan to bundle with. Unfortunately we ran out of time and also ran into some issues that need to be resolved with our processing cluster before this can continue.

We also looked at all of the changes to jigsaw between 3.5.0 and 3.5.2. Here are the changes to the bundle adjust during that period, the changes that we think could impact this issue are in bold:

7007a39#diff-609b21c7e4f5fb96e0de7623daf06786 Large changes for error propagation and output, but we are erroring before any of that. There is also refactoring of the sparse matrix object. All checks of the sparse matrix structure have found nothing, but this is a possibility.
acfb601#diff-609b21c7e4f5fb96e0de7623daf06786 Small bug fix that just assigns a default param value.
31e5388#diff-609b21c7e4f5fb96e0de7623daf06786 Small bug fix dealing with jigsaw file output. We are erroring before this.
6b355ed#diff-609b21c7e4f5fb96e0de7623daf06786 Fixed a memory leak during iteration clean-up. We are erroring before this.
96af40d#diff-609b21c7e4f5fb96e0de7623daf06786 Continuation of previous fix.
a76c259#diff-609b21c7e4f5fb96e0de7623daf06786 Fixes output for Maximum Likelihood Estimation. We are neither using MLE nor getting to report files.
87ccb35#diff-609b21c7e4f5fb96e0de7623daf06786 Fixed MLE initialization. This section of the code is not being executed.
d3d2af2#diff-609b21c7e4f5fb96e0de7623daf06786 More output clean-up.
8569b50#diff-609b21c7e4f5fb96e0de7623daf06786 New methods in the sparse matrix object that could be causing this.
a458608#diff-609b21c7e4f5fb96e0de7623daf06786 Coding standards update, this does not actually change any execution.
c33e6ac#diff-609b21c7e4f5fb96e0de7623daf06786 This is an SVN branch merge, can be ignored.
e87e527#diff-609b21c7e4f5fb96e0de7623daf06786 More output changes
7445518#diff-609b21c7e4f5fb96e0de7623daf06786 Adds some hooks for the IPCE GUI. The hooks only fire after an iteration completes. We are erroring before this.
b95b713#diff-609b21c7e4f5fb96e0de7623daf06786 Modifies how the input image list is fed in. We are not seeing any issues with the image list being handled correctly. We are successfully opening each Cube prior to erroring.
a5b5df1#diff-609b21c7e4f5fb96e0de7623daf06786 More image list updates.
5ad6843#diff-609b21c7e4f5fb96e0de7623daf06786 Makes several changes to the initialization and filling of the normal matrix. This needs more investigation. It is also very concerning this is labeled as a merge commit, could be a feature branch merge.
80c2498#diff-609b21c7e4f5fb96e0de7623daf06786 Ensures that Cube files are closed after the camera model is created. We are not running into any I/O issues, so this is not a concern.
d15ef97 Adds some exceptions around control network loading. We are successfully loading the control network and not seeing any of these exceptions so this is not a concern.

Potential Future Work

The most promising lead is narrowing down when this worked and when this stopped working. Checking each ISIS version is a good idea but will require duplicating the data and then re-processing. Investigating suspected code changes will require careful examination of the code at the time they were made and going over the execution path.

Compiling CHOLMOD with DEBUG flags enabled is also still an option.

github-actions · 2021-05-26T15:21:24Z

Thank you for your contribution!

Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.'

If no additional action is taken, this issue will be automatically closed in 180 days.

jessemapel · 2021-05-26T17:04:56Z

Still waiting for a good test case here

github-actions · 2021-11-23T15:15:20Z

Thank you for your contribution!

Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.'

If no additional action is taken, this issue will be automatically closed in 180 days.

ladoramkershner · 2021-11-24T16:32:48Z

Unresolved, keep open.

jorichie · 2022-03-29T20:03:52Z

Jigsaw not only fails to solve for acceleration but now won't solve for velocity. Solves for camera angles and position okay. Still getting the same cholmod error as when I opened the post.
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job45368968/slurm_script: line 39: 29769 Segmentation fault (core dumped)

This network was the most recent network successfully used in jigsaw to solve for velocity before it failed.
/usgs/shareall/FOR_ISIS/SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis
/usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo12.net

This network follows the successful run, but jigsaw failed to solve for velocity.

/usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net
new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis (updated)
Note: Only points and measures had been added to the network. (miscommunication happened here).

lwellerastro · 2022-08-12T18:13:19Z

I can't activate conda on astrovm4. I get command not found. Any clue why that would be? Ella also gets command not found.

I don't know why that would be. Maybe something about the version you were trying to use?
I copied your image list and network to my work users area and set conda to a recent version of isis and launched qnet:

conda activate isis7.0.0
qnet

I left before the network was fully loaded, but it did load and I can load points, etc.
Maybe try that version of isis.

@jessemapel, qnet is up and running on astrovm4 for me and is currently using about 34G of memory. I'm not sure it requires more than that when it is loading the images, but I suppose lack of memory could be a problem for astrovm5. vm5 has maybe 65G of memory, but vm4 has a bit over 100G. Maybe there were too many other things running on vm5 when Janet had her problem, but wouldn't it just run slower and maybe use swap?

Update:
No memory available on astrovm5 - IT needs to know about this because it doesn't look like much is actively running over there, so maybe zombie processes or something in the background is using it up.

ast{104}> free -h
        total   used   free   shared   buff/cache   available
Mem:    62G     58G    613M   328K     3.7G         3.7G
Swap:   12G     5.4G   7.4G

jessemapel · 2022-08-12T18:53:15Z

Yeah, the "error: killedprocess" failure indicates that the OS abruptly stopped the application for some reason. The most common cause for this is memory issues. Hopefully IT can get the VMs reset and good to use again.

lwellerastro · 2022-08-12T18:55:38Z

I put a ticket in for IT to have a look and see if they see anything unusual. sinteractive on the cluster is a good alternative to use if astrovm4 has problems too.

jorichie · 2022-08-12T21:41:27Z

Lynn's sinteractive suggestion allows the proper loading of the network. IT plans to resolve "some stuck processes and excess memory consumption," on Aug. 13 at 5:00 PM. We are now able to run jobs and load qnet after the IT changes without having to use sinteractive.

lwellerastro · 2022-08-15T21:47:46Z

@jorichie, if jigsaw still can't solve for accelerations for your network, you probably want to keep this post open.

jlaura · 2022-10-24T21:31:03Z

I have been able to reproduce this with a Mars network when adding a single ground control point with a computed covariance matrix.

I have networks available with and without ground points that illustrate the problem. I have linked those internally only at a code.chs.usgs.gov repository and provided access to the folks that are going to be working the problem.

jessemapel · 2022-11-08T16:27:20Z

@jorichie Can you post the most recent jigsaw command lines that you ran? We're investigating some issues potentially related to not setting sigmas for every parameter. For example, your original post doesn't set point latitude/longitude sigmas.

lwellerastro · 2022-11-08T17:20:51Z

Just curious @jessemapel - do you think something might have changed how point latitude/longitude sigmas are being used over the past several years? Those sigmas were not used for the LROC NAC north pole network or any of the Themis IR bundles (including the global), both of which had ground points and solved for radius, camera accelerations and spacecraft position without problems. All of that work ended around 2017/2018.

Since that time, I have found I need to add point lat/lon sigmas for Europa, Titan and now Phobos (but not Kaguya, though maybe it would help with difficult quads). I figured for the global data sets with limb and disk images having no or few ground points, it was necessary to help keep things from leaping all over the place, and the extra constraints have generally helped. But those are not settings we've been encouraged to use in the past or had a need for. I've tried to pick sigmas that make some sense for the data (considering resolution mostly and how good/bad the spice is); values ranging from 1500-5000 meters or more have helped for my problem projects.

jessemapel · 2022-11-08T17:23:53Z

We've made several changes to things in the bundle over the last several years. The big ones that could impact point sigmas are the rectangular (XYZ) bundle and the lidar support. They were supposed to preserve the existing functionality, but could have introduced a bug. We also haven't 100% confirmed that not setting sigmas is the problem; it's just our best lead right now. Jay is doing a bunch more testing today. @lwellerastro your comments also help back up this being the potential issue.

jessemapel · 2022-11-08T17:27:27Z

Also, I agree we should not be setting point lat/lon sigmas in most cases. It seems the logic that is supposed to leave them free has a bug.

Setting them is a workaround for right now.

lwellerastro · 2022-11-08T17:40:11Z

I just had a look at Europa and Titan and see that point sigmas helped when sorting out islands of images, but once all data were manually joined to main network I didn't need the extra constraints.

I'm recalling now it was Enceladus that needed the lat/lon constraints because the corrections were excessive. That suggestion was made by Brent at the time and although things like residuals and camera angles didn't change radically (or hardly at all), the lat/lon constraints kept the whole network from radically sliding away from where they started. The spice was not horrendous so it didn't make sense for the points to move so much. There was no ground for Enceladus either, so that added to the thought process and adding the constraints generally helped keep things closer to apriori locations (we used 1000 meters there).

I agree for better mapping missions they shouldn't be necessary, but having a work around could be useful.

jorichie · 2022-11-08T18:56:29Z

Hey Jesse, please see /sbatch3_jigsaw.bsh for the commands we are currently using. Thanks for taking a look at this.

jlaura · 2022-11-08T21:57:38Z

Afternoon @jorichie, we really should not post full paths in a public place like this. I am going to edit your post to remove the path. Would you mind copy/pasting the contents into your post or a new response? That is helpful not only from a security perspective, but also for anyone who does not have access to the machine with that path who is interested in the discussion. Thanks!

jorichie · 2022-11-08T22:05:03Z

Thanks for the information, Jay. I have copied and pasted the requested information here.

jigsaw fromlist=SouthPole_2020Merged_SP_and_Lidar2Image4_updated12_image_temp.lis cnet=SouthPole_2022_Merged_Lidar2Image_redo105.net onet= test_SouthPole_2022_Merged_Lidar2Image_redo105_remerge.net radius=yes camsolve=angles twist=yes overexisting=yes spsolve=position outlier_rejection=no overhermite=no camera_angles_sigma=1.0 camera_angular_velocity_sigma=0.5 camera_angular_acceleration_sigma=0.25 spacecraft_position_sigma=250 point_radius_sigma=150 maxits=3


jlaura · 2022-11-09T16:21:27Z

@jorichie thanks! Have you tried this with the following added:

point_radius_sigma=???
point_longitude_sigma=??? 
point_latitude_sigma=???
spacecraft_velocity_sigma=???

I don't know what values you want to put in for the ???. Maybe 500, 100, 100, 50 (total guesses that will need some iteration to get right, or a look at the LROC kernels to see whether they report accuracies).

I would probably try both with and without overhermite, though I don't think that is the issue. The spacecraft_velocity_sigma might also not be needed. Since you are solving for positions I think it would be used, but @jessemapel should correct me if I'm wrong!

jessemapel · 2022-11-09T17:10:47Z

I'm not sure which sigmas need to be set, but I would test setting all of them. I think point lat/lon sigmas of 500-1000m should be "safe" guesses. Your spacecraft velocity sigma can be around 100.

jorichie · 2022-11-17T04:52:07Z

Here is the result from the first test, which set values for the point latitude and longitude sigmas, the point radius sigma, and the spacecraft velocity sigma:
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job46016900/slurm_script: line 42: 5477 Segmentation fault (core dumped) jigsaw fromlist= SouthPole_2020Merged_SP_notupdated12_image.lis cnet=SouthPole_2022_Merged_Lidar2Image_redo112.net onet= test_SouthPole_2022_Merged_Lidar2Image_redo112.net update=no sigma0=1.0e-5 maxits=3 errorpropagation=no radius=yes camsolve=accelerations twist=yes overexisting=yes outlier_rejection=no spsolve=position overhermite=yes camera_angles_sigma=1.0 camera_angular_velocity_sigma= 0.5 camera_angular_acceleration_sigma=0.25 spacecraft_position_sigma=250 point_radius_sigma=500 point_longitude_sigma=1000 point_latitude_sigma=1000 spacecraft_velocity_sigma=100 file_prefix=test_SouthPole_2022_Merged_Lidar2Image_red0112_ -log=Southpole_112.prt

Next, I tried jigsaw specifying overexisting=no with no change to the other parameters. The result did not change, and the error was identical to the one shown here.

jlaura · 2022-11-17T21:15:09Z

@jorichie That is terrific! I believe that you are getting a new error?

And it looks like it could be a memory issue? How much memory are you requesting, and are you exceeding the available memory on the machine? "Problem too large" looks like it can occur when CHOLMOD is converting a sparse matrix to a dense matrix.

Also, are you intentionally setting spsolve=positions, or do you usually use spsolve=velocities or accelerations? Does using one of those result in this same error or a different error?

lwellerastro · 2022-11-17T21:22:33Z

@jlaura, it is not and never has been a memory problem. This network has been watched on multiple occasions and uses <50G of memory. I believe Janet is asking for an entire node. I believe Lauren/Jesse also confirmed there are no memory overruns.

The current error appears to be identical to what is in the original post.

This work has only required spsolve=position.

jlaura · 2022-11-17T21:42:44Z

Here is the source code that is throwing the error. Please search in there, if you like, for the word 'problem' (pulled from the error message). You will see two instances in the code where that happens, and both come after checks to see if the problem will fit into memory.

None of the above is to say that the problem is not in jigsaw (for example, if a sparse matrix is being made dense for some reason), but that error is being raised by CHOLMOD. Lauren's post from June 15 also looked at memory-related issues and, as far as I can tell, did not find that the network size necessarily corresponded to the decomposition path that CHOLMOD was selecting.

Jesse's post September 14 specifically indicates solving for accelerations which is why I asked about that aspect.

lwellerastro · 2023-08-18T17:20:04Z

The latest version of this south pole network that failed to bundle using camera accelerations (and I think spacecraft position) now successfully bundles under a test version of jigsaw which utilizes changes in #5176.

The network now solves for radius, camera accelerations and spacecraft position. The process solved in about 15 hours and used 80G of memory.

I think this post can now be closed.
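
One possible reading of why the 64-bit change helped, offered only as a hedged sketch: the linked issue #5176 is titled "Change jigsaw cholmod calls from 32bit to 64bit", and a signed 32-bit index can count at most 2^31 - 1 entries no matter how much memory the node has. The Python below is illustrative arithmetic only, not code from jigsaw or CHOLMOD. The helper names are hypothetical, and treating the reported 80G as if it were all 8-byte factor values is an assumption made purely to get a sense of scale.

```python
# Illustrative arithmetic only; not taken from jigsaw or CHOLMOD.
INT32_MAX = 2**31 - 1    # largest value a signed 32-bit index can hold
INT64_MAX = 2**63 - 1    # largest value a signed 64-bit index can hold
BYTES_PER_DOUBLE = 8     # size of one double-precision value


def entries_in(n_bytes: int) -> int:
    """Number of 8-byte doubles that occupy n_bytes (hypothetical helper)."""
    return n_bytes // BYTES_PER_DOUBLE


def fits(n_entries: int, index_max: int) -> bool:
    """True if a count of n_entries can be represented by the given index type."""
    return n_entries <= index_max


def int32_limit_gib() -> float:
    """Size in GiB of the largest array of doubles a signed 32-bit index can count."""
    return INT32_MAX * BYTES_PER_DOUBLE / 2**30


if __name__ == "__main__":
    # Assumption: pretend the whole 80G reported above is factor values.
    n = entries_in(80 * 10**9)
    print("entries:", n)
    print("fits 32-bit index:", fits(n, INT32_MAX))
    print("fits 64-bit index:", fits(n, INT64_MAX))
    print("32-bit limit in GiB of doubles:", int32_limit_gib())
```

Under that assumption, 80 x 10^9 bytes is 10^10 doubles, well above 2^31 - 1 (about 2.1 x 10^9). A 32-bit count tops out at just under 16 GiB of doubles, so a problem could be "too large" for the index type while still fitting comfortably in the memory of a node.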

acpaquette · 2023-08-22T19:00:49Z

Closing with comments and positive reporting from @lwellerastro

blandoplanet added the Products Issues which are impacting the products group label May 18, 2020

jorichie closed this as completed May 28, 2020

lwellerastro reopened this May 28, 2020

ladoramkershner self-assigned this Jun 1, 2020

github-actions bot added the inactive Issue that has been inactive for at least 6 months label May 26, 2021

jessemapel removed the inactive Issue that has been inactive for at least 6 months label May 26, 2021

github-actions bot added the inactive Issue that has been inactive for at least 6 months label Nov 23, 2021

github-actions bot removed the inactive Issue that has been inactive for at least 6 months label Nov 25, 2021

lwellerastro mentioned this issue Feb 2, 2022

Jigsaw -segfaults on cluster when solving for camera accelerations and spacecraft position for Kaguya TC pole and Titan global network #4770

Closed

AustinSanders added this to FY22 Q3 Software Support Mar 25, 2022

jorichie closed this as completed Mar 29, 2022

jorichie closed this as completed Aug 15, 2022

Repository owner moved this from Deferred to Done in ASC Software Support Aug 15, 2022

Repository owner moved this from Todo to Done in FY22 Q3 Software Support Aug 15, 2022

lwellerastro reopened this Aug 15, 2022

lwellerastro mentioned this issue Apr 5, 2023

Jigsaw - fails for Kaguya TC for certain cases #5173

Closed

This was referenced Apr 7, 2023

Change jigsaw cholmod calls from 32bit to 64bit #5176

Closed

Have jigsaw calculate the sparsity #5177

Closed

acpaquette closed this as completed Aug 22, 2023

