
LROC SouthPole network cannot solve for acceleration in jigsaw #3871

Closed
jorichie opened this issue May 16, 2020 · 62 comments
Labels
Products Issues which are impacting the products group

Comments

@jorichie

jorichie commented May 16, 2020

ISIS version(s) affected: isis3.10.2 on astrovm4, previously astrovm2

Description
Jigsaw will not solve for acceleration. Solves for camera angles and velocity okay.
Error:
Validation complete!...
starting iteration 1
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job6627021/slurm_script: line 39: 16235 Segmentation fault (core dumped)

How to reproduce
Here are the parameters used for jigsaw (also in /scratch/jrichie/S-Technique/sbatch4_jigsaw.bsh; all pertinent files can be found at /scratch/jrichie/STechnique):
jigsaw fromlist=new11_updated_jig_00to350.lis cnet=SouthPole_2017Merged_Lidar2Image1_cnetedit.net \
  onet=SouthPole_2017Merged_Lidar2Image2.net \
  update=no sigma0=1.0e-5 maxits=3 errorpropagation=no \
  radius=yes camsolve=accelerations twist=yes overexisting=yes \
  outlier_rejection=no \
  spsolve=position overhermite=yes \
  camera_angles_sigma=1.0 \
  camera_angular_velocity_sigma=0.5 \
  camera_angular_acceleration_sigma=0.25 \
  spacecraft_position_sigma=250 \
  point_radius_sigma=150

Possible Solution
Check that the memory requirement is sufficient, or whether there are any "number of" limitations in the code. Weller was able to solve for acceleration for the north pole. The differences are that the South Pole network is twice as large and has a significant amount of shadows.
Additional context

@jessemapel
Contributor

The network is likely too large/connected to solve with the available memory. What system did you run this on?

@jessemapel
Contributor

@blandoplanet Is this a products concern? Seems like it to me.

@blandoplanet

Yes. The Products tag is appropriate.

@blandoplanet blandoplanet added the Products Issues which are impacting the products group label May 18, 2020
@jorichie
Author

> The network is likely too large/connected to solve with the available memory. What system did you run this on?

As per my conversation with Jesse: astrovm4 and astrovm2. Also noted at the beginning of the report.

@lwellerastro
Contributor

I think it was a mistake that this post was closed via the last reply. Re-opened.

@lwellerastro lwellerastro reopened this May 28, 2020
@ladoramkershner ladoramkershner self-assigned this Jun 1, 2020
@ladoramkershner
Contributor

This may be more information than is needed (so do not feel the need to read everything), but I would like to document it for any future related efforts.

Final description of Problem:
The initial error is caused by a memory issue on the call cholmod_analyze(m_cholmodNormal, &m_cholmodCommon) on line 1556 in BundleAdjust.cpp, where m_cholmodNormal is the reduced normal camera matrix used in the bundle solution and m_cholmodCommon carries around information repeatedly needed for cholmod functions.

cholmod_analyze uses various methods to order the passed-in matrix for easy factorization and, if it fails, returns null. This function can fail if the initial memory allocation fails, if all ordering methods fail, or if the supernodal analysis (if requested) fails. Based on the 'problem too large' error, I am guessing the memory allocation is failing (https://github.com/PetterS/SuiteSparse/blob/master/CHOLMOD/Cholesky/cholmod_analyze.c).

I checked which methods were being used for cholmod_analyze by running a successful bundle, printing out the m_cholmodCommon variable, and looking at the 'nmethods' field. This indicated that AMD or COLAMD is used for ordering m_cholmodNormal, based on whether the matrix passed in is symmetric or non-symmetric. If a symmetric matrix is passed into cholmod_analyze, only the upper or lower triangle is accessed and ordered by the function, to save memory. If a non-symmetric matrix (A) is passed in, the whole matrix is brought in and AA' is ordered, effectively using twice the memory.
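To make that behavior observable rather than inferred from the documentation, the same check can be instrumented directly around the failing call. A minimal C++ sketch (not the ISIS code; it only assumes access to the cholmod_sparse normal matrix, the cholmod_common object, and the standard CHOLMOD API):

#include <cstdio>
#include "cholmod.h"

// Sketch: run the analysis and report what CHOLMOD actually decided.
void analyzeAndReport(cholmod_sparse *normal, cholmod_common *common) {
  cholmod_factor *factor = cholmod_analyze(normal, common);

  if (factor == NULL || common->status != CHOLMOD_OK) {
    // CHOLMOD_OUT_OF_MEMORY means an allocation failed; CHOLMOD_TOO_LARGE
    // means the analysis decided the factorization cannot even be
    // represented, which is a different failure from simply running out of RAM.
    std::printf("cholmod_analyze failed, status = %d\n", common->status);
    return;
  }

  // Which ordering CHOLMOD actually chose (e.g. AMD vs COLAMD) for this matrix.
  std::printf("selected method index = %d, ordering code = %d\n",
              common->selected, common->method[common->selected].ordering);

  cholmod_free_factor(&factor, common);
}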

To evaluate the symmetry of m_cholmodNormal, I went through the columns and checked whether, for every column, the largest associated row index was no greater than the current column index, or the smallest was no less than it. In this way I evaluated whether the sparsely stored m_cholmodNormal was an upper triangular or a lower triangular matrix, respectively.

In BundleAdjust::loadCholmodTriplet():

// Check whether the sparse normal matrix is stored as an upper or lower
// triangular matrix: for an upper triangular matrix the largest row index in
// each column is the diagonal (columnIndex == lastKey); for a lower
// triangular matrix the smallest row index is the diagonal.
bool ut = true;
bool lt = true;
for (int columnIndex = 0; columnIndex < m_sparseNormals.size(); columnIndex++) {
  int lastKey  = m_sparseNormals[columnIndex]->lastKey();
  int firstKey = m_sparseNormals[columnIndex]->firstKey();
  ut = ut && (columnIndex == lastKey);
  lt = lt && (columnIndex == firstKey);
}
std::cout << "Upper Triangular: " << ut << std::endl;
std::cout << "Lower Triangular: " << lt << std::endl;

Using this method I confirmed that the network passed in during the failure (/scratch/jrichie/STechnique/SouthPole_2017Merged_Lidar2Image1_cnetedit.net) is stored as a sparse upper triangular matrix, meaning it should be treated as a symmetric matrix and the less memory-intensive AMD ordering method should be used. However, a larger network of the same LROC data is able to run through the same process without failure, so I am genuinely at a loss as to why this particular network is failing.

My only theory is that cholmod_analyze is treating m_cholmodNormal as a non-symmetric matrix for this network, therefore requiring twice the memory typically needed. That would require my upper triangular analysis to somehow be incorrect, but it is the only thing I can think of that would cause cholmod_analyze to act differently (especially in terms of memory) for two very similarly sized networks.
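One way to test this theory would be to inspect the matrix CHOLMOD actually receives just before the cholmod_analyze call. A sketch (not ISIS code; it assumes the packed compressed-column arrays use 32-bit int indices, the default for the non-_l_ CHOLMOD entry points):

#include <cstddef>
#include <cstdio>
#include "cholmod.h"

// CHOLMOD decides symmetric vs. non-symmetric handling from the stype field,
// not from the stored pattern, so printing stype right before cholmod_analyze
// shows whether the upper-triangular storage is actually being honored.
void reportNormalMatrixShape(const cholmod_sparse *A) {
  std::printf("stype = %d (0 = unsymmetric, >0 = use upper triangle, <0 = use lower)\n",
              A->stype);

  const int *colPtr = static_cast<const int *>(A->p);  // column pointers
  const int *rowIdx = static_cast<const int *>(A->i);  // row indices
  bool upper = true;
  for (size_t col = 0; col < A->ncol && upper; col++) {
    for (int k = colPtr[col]; k < colPtr[col + 1]; k++) {
      if (rowIdx[k] > static_cast<int>(col)) {  // an entry below the diagonal
        upper = false;
        break;
      }
    }
  }
  std::printf("stored pattern is %supper triangular\n", upper ? "" : "NOT ");
}

If stype came back 0 here, CHOLMOD would order AA' with COLAMD even though only the upper triangle is stored, which would match the doubled-memory behavior described above.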

What Else Was Tried:
Since it was a memory issue, I first attempted to max out the memory allocation request on the big-memory node for computation, using the same jigsaw call as in the ticket. This allowed for a maximum of 375 GB of memory, but the problem only used 26 GB and resulted in the same failure.

Next I tried running jigsaw with fewer parameters by stepping CAMSOLVE down to velocities and keeping everything else consistent. This ran successfully, using 45 GB of memory. Since this bundle (with fewer parameters) used more memory than the bundle that fails because the 'problem is too large', it is very likely cholmod_analyze is failing during the memory allocation step: there is not enough memory for every allocation, so the function fails, but the memory is never actually used.

I wanted to double-check that the error had to do with the SIZE of the bundle and not the fact that acceleration was one of the solve parameters, so I reran the bundle with the same number of parameters (12 total) but no acceleration solves. This was done by switching CAMSOLVE from accelerations (9 parameters) to velocities (6 parameters) and switching SPSOLVE from position (3 parameters) to velocities (6 parameters). This bundle failed with the same 'problem too large' error. Therefore, there is not necessarily an error with how jigsaw handles the acceleration parameter, and it is indeed a size issue.

We then thought that perhaps the connectivity of the matrix was causing the reduced normal camera matrix (m_cholmodNormal) to have enough off-diagonal elements to make it significantly larger than what jigsaw could handle. I began creating a memory calculator for the various bundle matrices (located at /home/ladoramkershner/projects/notebooks/JigsawSizeCheck_Prototype.ipynb; it is rough and needs to be reconfigured to account for the sparse storage of some of the matrices).

I was pointed to a network created in previous years that bundled successfully (/work/projects/laser_a_work/lweller/SPoleNet/2018MayJune_Network/SouthPole_2017Merged_SP_and_Lidar2Image3.net; new_final_jig_00to350.lis). I reran that bundle with the same parameters as the one in this ticket and it was successful. Then I compared the number of graph nodes and edges in each network. In a graph diagram, nodes represent images and edges represent a shared point between images in a pair-wise fashion. Therefore, edges are a good way to evaluate the connectivity and the number of off-diagonal elements in the reduced normal camera matrix for a network (see the sketch below).
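A minimal sketch of how such vertex/edge counts can be computed from a point-to-images mapping (the data here is hypothetical; the counts in the table below came from the real networks):

#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
  // Hypothetical control network: each point lists the images that measure it.
  std::map<std::string, std::vector<std::string>> pointToImages = {
      {"pt0001", {"M101.cub", "M102.cub"}},
      {"pt0002", {"M101.cub", "M102.cub", "M103.cub"}}};

  std::set<std::string> nodes;                          // one node per image
  std::set<std::pair<std::string, std::string>> edges;  // one edge per image pair

  for (const auto &point : pointToImages) {
    const std::vector<std::string> &imgs = point.second;
    for (size_t i = 0; i < imgs.size(); i++) {
      nodes.insert(imgs[i]);
      for (size_t j = i + 1; j < imgs.size(); j++) {
        // Store each pair in a canonical order so duplicate pairs collapse.
        std::pair<std::string, std::string> edge =
            (imgs[i] < imgs[j]) ? std::make_pair(imgs[i], imgs[j])
                                : std::make_pair(imgs[j], imgs[i]);
        edges.insert(edge);
      }
    }
  }
  std::printf("vertices: %zu  edges: %zu\n", nodes.size(), edges.size());
  return 0;
}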

            Ticket Network    Archived Network
vertices    18675             18929
edges       1034327           1165021
npoints     1425791           1649017
nmeas       9469089           13752809

The archived network has more images, edges, points, and measures, so connectivity could not explain the memory issue. To verify what cholmod_analyze was seeing, I printed out the size and non-zero elements of m_cholmodNormal:

                         Ticket Network    Archived Network
m_cholmodNormal nrow     224100            227148
m_cholmodNormal ncol     224100            227148
m_cholmodNormal nzmax    150399738         169239486
m_cholmodNormal xtype    1                 1

Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe it is not a case of just barely exceeding the memory requirements.
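For a sense of scale, a back-of-the-envelope estimate of the storage for the ticket network's m_cholmodNormal alone (my own arithmetic, assuming 8-byte double values and 4-byte indices in compressed-column form; the factor produced from it will be considerably larger because of fill-in):

#include <cstdint>
#include <cstdio>

// Rough storage estimate for the sparse normal matrix itself, using the
// ncol/nzmax figures from the table above.
int main() {
  const std::int64_t ncol  = 224100;
  const std::int64_t nzmax = 150399738;
  std::int64_t bytes = nzmax * 8        // numeric values
                     + nzmax * 4        // row indices
                     + (ncol + 1) * 4;  // column pointers
  std::printf("~%.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));  // about 1.7 GiB
  return 0;
}

Even allowing for the factor being far larger than the input matrix, the 26 GB measured at failure was nowhere near the 375 GB that was available, which is consistent with the conclusion above.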

@jorichie
Author

jorichie commented Jun 15, 2020

Thanks Lauren! Do you think we should see if Ken Edmundson has any ideas on how to resolve this?

@jessemapel
Contributor

Working directly with Ken has some ethics issues around how people can work with Astro after they leave. I'm also not sure if he'd be able to work on this without access to the cluster and scratch.

@ladoramkershner
Contributor

Where should we go from here?

The message that is output by this error is not descriptive enough to be helpful, and I am still not sure why jigsaw is erroring. Jigsaw operates as expected on a network of the same size using the same amount of memory. So I am not sure whether this is a bug, but it does concern me that we cannot isolate the difference between the handling of the two networks tested.

@jlaura
Collaborator

jlaura commented Aug 17, 2020

@blandoplanet What is the status on this issue? I believe we should close it based on the email conversation, but I do not want to close prematurely! Either way, I believe this is off the developers' plate?

@jorichie
Author

jorichie commented Aug 17, 2020 via email

@jlaura
Collaborator

jlaura commented Aug 17, 2020

@jorichie Thanks for the post! I agree 100% that finding out what is going on has value. Right now, though, the development team has put two weeks of debugging effort into this. Check out the conclusion of the lengthy report from @ladoramkershner above, where she concludes:

> Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe it is not just barely exceeding the memory requirements.

At this point, I do not believe that the development team has other avenues to explore on this as the problem is not constrained well enough to have us aim in any particular direction. We have an internal email chain (@blandoplanet, @ladoramkershner, and Brent) discussing some other options that do not include devs.

@blandoplanet

Some additional information from @ladoramkershner about a possible next step in the future...

"the next task would be to install a custom build of cholmod and extracting more specific information from choldmod_analyze (the function that is failing). During my troubleshooting I was checking things upstream of the command and cross-referencing the documentation to predict which methods choldmod_analyze would use. However, more information may come of actually printing variables and status from inside the actual function. "

@jessemapel
Contributor

jessemapel commented Sep 14, 2020

Update 2020/09/14

@ladoramkershner, @jorichie, and myself came back to this and did some more work last week; here's what we did and found.

Testing with other parameters

We tested running the bundle with some slightly different parameters.

First, we ran the bundle with rectangular coordinates. The hope was that this would help eliminate any errors from the longitude domain or the pole. It would also slightly change the math because we would be solving for different ground point coordinates. One change that was required for this was converting the ground point sigmas from lat/lon/rad to x/y/z. In the latitudinal bundle only the radius was constrained, so we decided to constrain the Z coordinate by that much. The data set is close to the pole, so the vast majority of the radius variation would be in the Z direction. This test still failed in the same place with the same error.
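For reference, a small sketch of the geometry behind constraining Z by the radius sigma near the pole (my own illustration of the reasoning, not ISIS code; the latitude value is an assumption):

#include <cmath>
#include <cstdio>

// For a ground point at latitude lat and radius r, z = r * sin(lat), so a
// radius-only uncertainty maps to sigma_z = |sin(lat)| * sigma_r, which is
// essentially sigma_r close to the pole.
int main() {
  const double kPi = 3.14159265358979323846;
  double sigmaRadius = 150.0;   // meters, the point_radius_sigma from the original call
  double latitudeDeg = -88.0;   // representative south-polar latitude (assumed)
  double lat = latitudeDeg * kPi / 180.0;
  double sigmaZ = std::fabs(std::sin(lat)) * sigmaRadius;
  std::printf("sigma_Z ~= %.1f m\n", sigmaZ);  // ~149.9 m, nearly the full radius sigma
  return 0;
}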

Next, we ran the bundle without the overhermite setting enabled. This could help check for errors in the polynomial setup portion of the bundle. Unfortunately, this also failed in the same way.

Extracting a subregion of the network

We wanted to see if a subregion of the network would successfully bundle with acceleration. This could help us narrow down whether there are any specific images, points, or measures that are causing this error.

We identified regions that contained issues in the mosaic when solving for velocities and then extracted those subregions of the network. Unfortunately, extracting the subregions compromised the integrity of the network and additional work would have been required to make the subregions bundle by themselves.

We ultimately decided that attempting to make various subregions bundle by themselves would take too long and that the potential value was not worth it.

Checking the values in the normal matrix

It is possible that the normal matrix is ill-conditioned and CHOLMOD could be running into issues when it tries to analyze it. Computing the normal matrix requires computing partial derivatives and these can sometimes run into discontinuities resulting in extremely large numbers. When this happens, the resulting normal matrix could have extremely large values in some places that will result in a failure to solve the iterations.
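A sketch of the kind of debug code described below (hypothetical, not the exact snippet that was inserted; it assumes a packed, real-valued cholmod_sparse with 32-bit int column pointers):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include "cholmod.h"

// Scan the stored values of a packed, real-valued cholmod_sparse matrix and
// report simple statistics (min, max, mean, standard deviation, count).
void printNonZeroStats(const cholmod_sparse *A) {
  const double *values = static_cast<const double *>(A->x);
  const int *colPtr = static_cast<const int *>(A->p);
  size_t n = static_cast<size_t>(colPtr[A->ncol]);  // entries stored in a packed matrix
  if (n == 0) return;

  double minVal = values[0], maxVal = values[0], sum = 0.0, sumSq = 0.0;
  for (size_t k = 0; k < n; k++) {
    minVal = std::min(minVal, values[k]);
    maxVal = std::max(maxVal, values[k]);
    sum   += values[k];
    sumSq += values[k] * values[k];
  }
  double mean   = sum / n;
  double stdDev = std::sqrt(sumSq / n - mean * mean);
  std::printf("min %g  max %g  mean %g  stddev %g  nnz %zu\n",
              minVal, maxVal, mean, stdDev, n);
}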

To check for this, we inserted a small bit of code to compute some statistics on the non-zero values in the normal matrix. We then ran the debug code on the network described in this issue (the Active Net), and an older network consisting of LRO NAC images of the North Pole (the Archived Net). Here are the results for the active network in question and the archived network that is able to bundle with accelerations:

Stat                 Archived Net         Active Net           Difference
Minimum              -15495184677439.5    -4855168835499.75    10640015841939.75
Maximum              37370058443140.8     16392435737669.1     20977622705471.695
Average              445321714.549715     307498754.542037     137822960.00767797
Standard Deviation   56499889942.841      38040249205.63005    18459640737.210503
Non-zero Elements    169239486            150399738            18839748

None of the values stand out as too large to work with.

Narrowing down when the network became unable to bundle with acceleration

After some discussion it was found that a previous version of the network could be bundle adjusted solving for acceleration on old hardware at the ASC, but when we moved to new hardware, it could only solve for velocities. This could help narrow down any changes in the network or code that caused this.

We looked for old processing and log files to determine exactly which version of ISIS and network were used on the old hardware and which were used on the new hardware.

We found a log that successfully solved for acceleration. Here is the version info:

IsisVersion       = "3.5.00.7260 beta | 2016-01-25"
ProgramVersion    = 2014-02-13

Here is the network

CNET = SouthPole_2017Merged_SP_and_Lidar2Image2_cnetedit.net

The print file can be found at /scratch/jrichie/SOUTHPOLE.old/NEW/print.prt

We could not find a log file from just after the transition that attempted to solve for acceleration on new hardware. The closest we found was a successful solve for velocities on the new hardware. Here is the version info:

IsisVersion       = "3.5.2.8306 beta | 2017-11-04"
ProgramVersion    = 2017-08-09

Here is the network

CNET = SouthPole_5test_velocity_not_updated_cnetedit.net

The print file can be found at /usgs/shareall/FOR_SP_ISIStest/onelasttest.prt

This gives us a rough bound between February 2014 and August 2017, ISIS 3.5.0 to ISIS 3.5.2.

Checking old ISIS versions

We still have access to ISIS 3.5.0 and later on hardware at the ASC, so we decided to test and see if we could narrow this down further. We attempted to run the bundle with solving for acceleration under version 3.5.0, 3.5.1, 3.5.2, and 3.6.0. Unfortunately, for 3.5.0 and 3.5.1 we ran into an error:

**ERROR** Unable to create camera for cube file /work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub in ControlNet.cpp at 1617. 
**ERROR** Unable to initialize camera model from group [Instrument] in CameraFactory.cpp at 97. 
**I/O ERROR** Unable to open [/work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub] in Blob.cpp at 278. 

There is some sort of issue reading the Table blobs that contain the SPICE data. We may be able to work around this by re-running spiceinit on the images using the version of ISIS we plan to bundle with. Unfortunately we ran out of time and also ran into some issues that need to be resolved with our processing cluster before this can continue.

We also looked at all of the changes to jigsaw between 3.5.0 and 3.5.2. Here are the changes to the bundle adjust during that period; the changes that we think could impact this issue are in bold:

Potential Future Work

The most promising lead is narrowing down when this worked and when this stopped working. Checking each ISIS version is a good idea but will require duplicating the data and then re-processing. Investigating suspected code changes will require careful examination of the code at the time they were made and going over the execution path.

Compiling CHOLMOD with DEBUG flags enabled is also still an option.

@github-actions

Thank you for your contribution!

Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.'

If no additional action is taken, this issue will be automatically closed in 180 days.

@github-actions github-actions bot added the inactive Issue that has been inactive for at least 6 months label May 26, 2021
@jessemapel jessemapel removed the inactive Issue that has been inactive for at least 6 months label May 26, 2021
@jessemapel
Contributor

Still waiting for a good test case here

@github-actions

Thank you for your contribution!

Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.'

If no additional action is taken, this issue will be automatically closed in 180 days.

@github-actions github-actions bot added the inactive Issue that has been inactive for at least 6 months label Nov 23, 2021
@ladoramkershner
Contributor

Unresolved, keep open.

@jorichie
Author

jorichie commented Mar 29, 2022

Jigsaw not only fails to solve for acceleration but now won't solve for velocity. Solves for camera angles and position okay. Still getting the same cholmod error as when I opened the post.
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job45368968/slurm_script: line 39: 29769 Segmentation fault (core dumped)

This network was the most recent network successfully used in jigsaw to solve for velocity before it failed.
/usgs/shareall/FOR_ISIS/SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis
/usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo12.net

This network follows the successful run, but jigsaw failed to solve for velocity.

/usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net
new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis (updated)
Note: only points and measures had been added to the network (a miscommunication happened here).

@lwellerastro
Contributor

lwellerastro commented Aug 12, 2022

> I can't activate conda on astrovm4. I get command not found. Any clue why that would be? Ella also gets command not found.

I don't know why that would be. Maybe something about the version you were trying to use?
I copied your image list and network to my work users area, set conda to a recent version of ISIS, and launched qnet:

conda activate isis7.0.0
qnet

I left before the network was fully loaded, but it did load and I can load points, etc.
Maybe try that version of isis.

@jessemapel, qnet is up and running on astrovm4 for me and is currently using about 34G of memory. I'm not sure it requires more than that when it is loading the images, but I suppose lack of memory could be a problem for astrovm5. vm5 has maybe 65G of memory, but vm4 has a bit over 100G. Maybe there were too many other things running on vm5 when Janet had her problem, but wouldn't it just run slower and maybe use swap?

Update:
No memory available on astrovm5 - IT needs to know about this because it doesn't look like much is actively running over there, so maybe zombie processes or something in the background is using it up.

ast{104}> free -h
              total    used    free    shared    buff/cache    available
Mem:            62G     58G    613M      328K          3.7G         3.7G
Swap:           12G    5.4G    7.4G

@jessemapel
Contributor

Yeah the

error: killedprocess

failure indicates that the OS abruptly stopped the application for some reason. The most common cause for this is memory issues. Hopefully IT can get the VMs reset and good to use again.

@lwellerastro
Contributor

I put a ticket in for IT to have a look and see if they see anything unusual. sinteractive on the cluster is a good alternative to use if astrovm4 has problems too.

@jorichie
Author

jorichie commented Aug 12, 2022

Lynn's sinteractive suggestion allows the proper loading of the network. IT plans to resolve "some stuck processes and excess memory consumption," on Aug. 13 at 5:00 PM. We are now able to run jobs and load qnet after the IT changes without having to use sinteractive.

Repository owner moved this from Deferred to Done in ASC Software Support Aug 15, 2022
Repository owner moved this from Todo to Done in FY22 Q3 Software Support Aug 15, 2022
@lwellerastro
Contributor

@jorichie, if jigsaw still can't solve for accelerations for your network, you probably want to keep this post open.

@lwellerastro lwellerastro reopened this Aug 15, 2022
@jlaura
Collaborator

jlaura commented Oct 24, 2022

I have been able to reproduce this with a Mars network when adding a single ground control point with a computed covariance matrix.

I have networks available with and without ground points that illustrate the problem. I have linked those internally only at a code.chs.usgs.gov repository and provided access to the folks that are going to be working the problem.

@jessemapel
Contributor

@jorichie Can you post your most recent jigsaw command lines that you ran? We're investigating some issues potentially related to not setting sigmas for every parameter. For example, your original post doesn't set point latitude/longitude sigmas.

@lwellerastro
Contributor

Just curious @jessemapel - do you think something might have changed how point latitude/longitude sigmas are being used over the past several years? Those sigmas were not used for the LROC NAC north pole network or any of the Themis IR bundles (including the global), both of which had ground points and solved for radius, camera accelerations and spacecraft position without problems. All of that work ended around 2017/2018.

Since that time, I have found I need to add point lat/lon sigmas for Europa, Titan, and now Phobos (but not Kaguya, though maybe it would help with difficult quads). I figured that for the global data sets with limb and disk images having no or few ground points, it was necessary to help keep things from leaping all over the place, and the extra constraints have generally helped. But those are not settings we've been encouraged to use in the past or had a need for. I've tried to pick sigmas that make some sense for the data (considering mostly resolution and how good/bad the spice is), so they have ranged from 1500-5000 meters or more and have helped for my problem projects.

@jessemapel
Contributor

jessemapel commented Nov 8, 2022

We've made several changes to things in the bundle over the last several years. The big ones that could impact point sigmas are the rectangular (XYZ) bundle and the lidar support. They were supposed to preserve the existing functionality, but could have introduced a bug. We also haven't 100% confirmed that not setting sigmas is the problem; it's just our best lead right now. Jay is doing a bunch more testing today. @lwellerastro your comments also help back up this being the potential issue.

@jessemapel
Contributor

jessemapel commented Nov 8, 2022

Also, I agree we should not be setting point lat/lon sigmas in most cases. It seems the logic that is supposed to leave them free has a bug.

Setting them, for right now, is a workaround.

@lwellerastro
Contributor

I just had a look at Europa and Titan and see that point sigmas helped when sorting out islands of images, but once all data were manually joined to the main network I didn't need the extra constraints.

I'm recalling now it was Enceladus that needed the lat/lon constraints because the corrections were excessive. That suggestion was made by Brent at the time, and although things like residuals and camera angles didn't change radically (or hardly at all), the lat/lon constraints kept the whole network from radically sliding away from where it started. The spice was not horrendous, so it didn't make sense for the points to move so much. There was no ground for Enceladus either, so that added to the thought process, and adding the constraints generally helped keep things closer to a priori locations (we used 1000 meters there).

I agree that for better mapping missions they shouldn't be necessary, but having a workaround could be useful.

@jorichie
Author

jorichie commented Nov 8, 2022

Hey Jesse, please see /sbatch3_jigsaw.bsh for the commands we are currently using. Thanks for taking a look at this.

@jlaura
Collaborator

jlaura commented Nov 8, 2022

Afternoon @jorichie, we really should not post full paths in a public place like this. I am going to edit your post to remove the path. Would you mind copy/pasting the contents into your post or a new response? That is helpful not only from a security perspective, but also for anyone who does not have access to the machine with that path who is interested in the discussion. Thanks!

@jorichie
Author

jorichie commented Nov 8, 2022 via email

@jlaura
Collaborator

jlaura commented Nov 9, 2022

@jorichie thanks! Have you tried this with the following added:

point_radius_sigma=???
point_longitude_sigma=??? 
point_latitude_sigma=???
spacecraft_velocity_sigma=???

I don't know what values you want to put in for the ???. Maybe 500, 100, 100, 50 (total guesses that will need some iteration to figure out; or a look at the LROC kernels to see if they have reported accuracies).

I would probably try both with and without overhermite - I don't think that that is the issue. The spacecraft_velocity_sigmas might also not be needed. Since you are solving for positions I think it would be used, but @jessemapel should correct me when I'm wrong!

@jessemapel
Contributor

I'm not sure which sigmas need to be set, but I would test setting all of them. I think point lat/lon sigmas of 500-1000m should be "safe" guesses. Your spacecraft velocity sigma can be around 100.

@jorichie
Author

jorichie commented Nov 17, 2022

Here is the result from the first test, which included setting values for the point latitude and longitude sigmas, the point radius sigma, and the spacecraft velocity sigma:
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job46016900/slurm_script: line 42: 5477 Segmentation fault (core dumped) jigsaw fromlist= SouthPole_2020Merged_SP_notupdated12_image.lis cnet=SouthPole_2022_Merged_Lidar2Image_redo112.net onet= test_SouthPole_2022_Merged_Lidar2Image_redo112.net update=no sigma0=1.0e-5 maxits=3 errorpropagation=no radius=yes camsolve=accelerations twist=yes overexisting=yes outlier_rejection=no spsolve=position overhermite=yes camera_angles_sigma=1.0 camera_angular_velocity_sigma= 0.5 camera_angular_acceleration_sigma=0.25 spacecraft_position_sigma=250 point_radius_sigma=500 point_longitude_sigma=1000 point_latitude_sigma=1000 spacecraft_velocity_sigma=100 file_prefix=test_SouthPole_2022_Merged_Lidar2Image_red0112_ -log=Southpole_112.prt

Next, I tried jigsaw specifying overexisting=no with no change to the other parameters. There was no change in the result, and the error was identical to the one described here.

@jlaura
Collaborator

jlaura commented Nov 17, 2022

@jorichie That is terrific! I believe that you are getting a new error?

And it looks like it could be a memory issue? How much memory are you requesting and are you exceeding the available memory on the machine? Problem too large looks like it can occur when cholmod is converting a sparse matrix to a dense matrix.

Also, are you intentionally setting spsolve=position or do you usually use spsolve=velocities or accelerations? Is using one of those resulting in this same error or a different error?

@lwellerastro
Contributor

@jlaura, it is not and never has been a memory problem. This network has been watched on multiple occasions and uses <50G of memory. Janet is asking for an entire node, I believe. I believe Lauren/Jesse also confirmed there are no memory overruns.

The current error appears to be identical to what is in the original post.

This work has only required spsolve=position.

@jlaura
Collaborator

jlaura commented Nov 17, 2022

Here is the source code that is throwing the error. Please do a search in there, if you like, for the word 'problem' (pulled from the error message). You will see two instances in the code where that is happening, and both are after checks to see if the problem will fit into memory.

None of the above is to say that the problem is not in jigsaw (for example, if a sparse matrix is being made dense for some reason), but that error is being raised by CHOLMOD. Lauren's post from June 15 also looked at memory-related issues and, as far as I can tell, did not find that the network size necessarily corresponded to the decomposition path that CHOLMOD was selecting.

Jesse's post from September 14 specifically indicates solving for accelerations, which is why I asked about that aspect.

@lwellerastro
Contributor

The latest version of this south pole network that failed to bundle using camera accelerations (and I think spacecraft position) now successfully bundles under a test version of jigsaw which utilizes changes in #5176.

The network now solves for radius, camera accelerations and spacecraft position. The process solved in about 15 hours and used 80G of memory.

I think this post can now be closed.

@acpaquette
Collaborator

Closing with comments and positive reporting from @lwellerastro
