
--mfaktc and --mfakto prototype #24

Merged: 60 commits into tdulcet:primenet-2.2 on Aug 23, 2024
Conversation

@brubsby (Contributor) commented Jul 17, 2024

Very much a work in progress, but creating this PR mostly as a jumping off point for discussion and feedback. Probably won't be ready for v2.2, and definitely shouldn't hold up the release of CERT support.

Current TODO:

  • finish mfaktc changes
    • add os info to json
    • build on linux
    • investigate changing checkpoint time to entire bitlevel instead of class
  • switch to new worktype once implemented by James George
  • tune default work fetch options around TF assignments
  • nudge results.txt to be results.json.txt (not sure whether raising an error or doing this automatically is better), maybe see what the p95/mprime case does
  • mfakto
  • Iterative work fetch (Teal TODO)
  • Workfetch status line at each iteration
  • TF bitrange in completion date messages
  • Additional line after setup wizard
  • Optparse default usage line
  • Default tf days of work = 1

Nice to have:

  • TF1G integration
  • nag user if they have StopAfterFactor=2 (suggest Stages=1 and StopAfterFactor=1)

Testing areas:

  • new setup default parameter experience
  • multiple bitlevel ini settings
  • other mfaktc .ini configuration combinations

Bugs:

  • "stall detection" if mfaktc not making progress since last checkin
  • work estimation for multiple bitlevel assignments (checkpoints only for single bitlevel)
  • don't send d=1 if exponent still in assignments
  • fix percentage calculation for Stages=1 and multiple bitlevel assignments

Server-side changes wishlist:

  • have work preference 12 respect max and min arguments to ga
  • have work preference 12 send highest ROI assignment in the 1m bins that each category is currently giving PRP out from.

Probably won't do:

  • automatically generate and run TF worktodo lines for PRP or P-1 assignments that are underfactored (very long term stretch goal)
  • create configuration options for TF work preferences that are passed to Manual_GPU_Assignment
  • delete worktodo lines that fail to get AID via selfassign (current behavior just puts N/A, deleting instead of poaching (even if just an option) might be nice)
  • GPU72 integration

@brubsby (Contributor, Author) commented Jul 17, 2024

Ended up forking mfaktc to add json results that include the AID, will probably add default logging soon too, so we can parse the logs and use those for work estimation.
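As a rough illustration of how the script could consume that output, here is a sketch that reads a JSON-lines results file; the file name and every field name ("aid", "exponent", "bitlo", "bithi") are hypothetical placeholders, not the format the mfaktc fork actually emits:

import json

# Hypothetical file name and schema, purely for illustration.
def read_tf_results(path="results.json.txt"):
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines mixed into the file
            # Hypothetical fields: the assignment ID plus the factored range.
            results.append((entry.get("aid"), entry.get("exponent"),
                            entry.get("bitlo"), entry.get("bithi")))
    return results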

@tdulcet added the "enhancement" (New feature or request) label on Jul 18, 2024
primenet.py (outdated), review thread on:

@@ -4795,14 +4795,18 @@ def get_assignments(adapter, adir, cpu_num, progress, tasks):
    time_left = timedelta(seconds=time_left)
    days_work = timedelta(days=options.days_of_work)
    if time_left <= days_work:
-       num_cache += 1
+       num_cache = math.ceil(num_existing * days_work / time_left)
brubsby (Contributor, Author):
This is perhaps the "exact" calculation of how big our cache should be, assuming the time_left estimate is correct and all jobs are similarly sized (and it should work for every worktype).

One drawback of doing it this way is that a fresh instance has no time estimate yet, so we have to pick some number of assignments for the first run when the user hasn't set num_cache. I've chosen 10 for TF and 1 for other worktypes (1 was the previous behavior in this case, I believe).

The other drawback is that we might accidentally fetch the user a large number of assignments if their time_left is underestimated or something. Some guard rails would probably be prudent, but I can't think of an easy approach that wouldn't be annoying when someone is trying to fetch 10 days of work right before taking their computer offline.

After all, the user could then --unreserve-all if we accidentally fetch too much.

brubsby (Contributor, Author):
10 assignments would hold over my GPU for roughly two hours, so anything above that for --timeout would result in mfaktc closing before more work is received. Perhaps in the "fresh" case, we want to get a small number of assignments and wake every minute until we have a good time estimate, and then get more work?

Not sure what the most general solution to this is, I don't necessarily want to have a bunch of carveouts for TF if we don't have to.

tdulcet (Owner):
If the --num-cache option is larger than what is needed for the --days-work option, this will select the smaller of the two values. I think you want something like this instead (which also eliminates the math.ceil):

Suggested change:
-   num_cache = math.ceil(num_existing * days_work / time_left)
+   num_cache = max(num_cache, -(days_work * num_existing // -time_left))
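For reference, the negated floor division in that suggestion is the usual way to get a ceiling division without math.ceil, and it works on timedelta objects too, since timedelta // timedelta floors to an int. A quick standalone check with made-up numbers (not values from the script):

from datetime import timedelta
import math

num_existing = 4                          # hypothetical: assignments already queued
time_left = timedelta(days=1, hours=10)   # hypothetical: estimated work remaining
days_work = timedelta(days=3)             # hypothetical: --days-work target

# Original form
a = math.ceil(num_existing * days_work / time_left)

# Suggested form: ceiling division via negated floor division
b = -(days_work * num_existing // -time_left)

print(a, b)  # both print 9

The outer max() in the suggestion then keeps whichever is larger: the user's --num-cache value or the days-of-work estimate.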

brubsby (Contributor, Author):
Good catch. I've redone the debug lines relating to this so they make a bit more sense in the case where num_cache is bigger than what the days_of_work calculation says is necessary.

@brubsby (Contributor, Author) commented Jul 25, 2024

--max-exp (and presumably --min-exp) aren't working with work preference 12.

connectionpool.py: [MainThread 2024-07-25 11:45:42,315]  DEBUG: http://v5.mersenne.org:80 "GET /v5server/?px=GIMPS&v=0.95&t=ga&g=[guid]=0&a=&max=138000000&ss=19191919&sh=ABCDABCDABCDABCDABCDABCDABCDABCD HTTP/11" 200 179
primenet.py: [MainThread 2024-07-25 11:45:42,316]  INFO: PrimeNet success code with additional info:
primenet.py: [MainThread 2024-07-25 11:45:42,316]  INFO: Server assigned trial factoring work.
primenet.py: [MainThread, Worker #1 2024-07-25 11:45:42,316]  INFO: Got assignment D5BBDEFA9AA42BBBC44F3AD5CFA8DA31: Trial factor M146204119
primenet.py: [MainThread, Worker #1 2024-07-25 11:45:42,316]  DEBUG: Fetching using v5 API
primenet.py: [MainThread, Worker #1 2024-07-25 11:45:42,316]  INFO: Getting assignment from server

We can maybe ask George or James to add support for these, as it looks correct from our side, but either way it's not extremely pressing. I just wanted to test them out since you mentioned them.

primenet.py (outdated), review thread on lines 4816 to 4822:

if config.has_option(SEC.PrimeNet, "MaxExponents"):
    amax = config.getint(SEC.PrimeNet, "MaxExponents")
    if amax < num_cache:
        adapter.debug(
            "Config option MaxExponents ({0:n}) is smaller than num_cache ({1:n}), num_cache decreased to {0:n}".format(
                amax, num_cache, workfile))
        num_cache = amax
tdulcet (Owner):
If only to prevent a bug in the program from causing it to request too many assignments, I would prefer to keep a default value for the MaxExponents config file option, even if it is significantly larger when the user is doing TF assignments.

brubsby (Contributor, Author) commented Jul 28, 2024:

On one level I agree, as it will surely happen that someone gets way too many assignments due to a bug or misunderstanding if we don't include some safeguard, but I just personally found the experience of having to discover the default MaxExponents option in the code to be a bad user experience.

There are several factors that contribute to the confusion in the previous version:

  1. there is no cli option for max exponents, so -h doesn't lead you in the right direction
  2. it's not mentioned in --setup
  3. there's a cli option for "max exponent" but not max exponents (two different options that are confusingly similar)
  4. in the previous configuration, even if you have no MaxExponents in the config (which happens by default), you're still limited to 15, so you have to search through the actual code to figure out what's going on. (even if we raise this, it will still happen when a user puts in a high days_of_work for certain worktypes)
  5. throughout this whole process, you're not actually told about MaxExponents in the log

I have added some debug messages to hopefully make it clearer when this is happening to someone in the future (5), but they would need to have debugging on. If a user has to enable debug logging to figure out why the program isn't behaving the way they told it to (via days of work), I think we've already failed some test of the user experience.

10 days of TF work for my machine can be ~1500 assignments at small (but still FTC-wavefront) bitlevels, which is an enormous number of exponents, and the maximum number of days of work to fetch is 90, which would be ~9470 assignments of more average size for me.

I think I heard 10,000 was the maximum number of assignments you can have in PrimeNet, though if I grabbed 10,000 wavefront TF assignments that would likely have a significant negative impact on the project. We surely want the limit to be below this, but still respect days_of_work.

TL;DR: I think we should have safeguards but they should be upfront.

Proposal:

  • We could add it to --setup with an appropriately sized default for the worktype.
    • (People become aware of its existence and set it according to their comfort level)
  • When creating the .ini file, we put that default in
    • do this even if the user doesn't use --setup
    • when they're editing the .ini they will see it and realize that's what's keeping the script from getting more assignments.
  • We can use the previously mentioned default if they delete the line in the .ini, but display a warning that doesn't require debug verbosity to see, saying that we are using max_exponents=15 or 100 (or something) as a default.
    • (Alternatively, just make all the num_cache related debugs into info messages)

tdulcet (Owner):

Thanks for the detailed write up.

The MaxExponents config file option comes from Prime95/MPrime, which also defaults to 15 assignments. Note that this limit is per worker, so with multiple workers one can have many more assignments. I was hesitant at first to add support for this option, but over time I became concerned that a bug either in the script or on the server could cause the program to request too many assignments, so I decided to add this as a safeguard. Prime95/MPrime of course goes a step further and will actually automatically unreserve assignments, which I would never do.

Anyway, at least with CPUs, 15 wavefront assignments is typically a lot of work (days or weeks), especially with primality tests, so most users should never hit this limit. However, GPUs can obviously be faster, so feel free to increase the limit for TF work to whatever value makes sense, even if that is hundreds or thousands of assignments. Ideally, only the most advanced users should ever need to change this option. I believe TF (and P-1) assignments are only valid for 30 days, so requesting 90 days of work would not be applicable here.

Another problem I am aware of is that there is no exhaustive list of all supported config file options (without reading my release notes for each version). It would be easy enough to create such a list, but I am not sure the best place to store it. Maybe on the README? Long term, we probably need to write more extensive documentation on how to use the program.

Regarding your proposal, adding a log message when the MaxExponents value is hit was a good improvement, so thanks for implementing that. Note that in general I do not want the script to be too verbose by default, as users may miss important messages. I wish the Python logging module supported multiple levels of debug messages instead of the critical level (which is currently unused), but alas all we have is debug. Feel free to put a default value for MaxExponents in the local.ini config file. In that case, you might also restore the results_file. I would prefer not to add a prompt to the --setup option, as users would then just set a large value, defeating the purpose of this feature, but I am open to being convinced otherwise.

brubsby (Contributor, Author):

> The MaxExponents config file option comes from Prime95/MPrime, which also defaults to 15 assignments.

Ah, didn't realize this was directly from p95, that makes a bit more sense then.

> I believe TF (and P-1) assignments are only valid for 30 days, so requesting 90 days of work would not be applicable here.

This is a good point. I think non-Cat 0-1 exponents are set to 60 days by default; mine mostly show 60 for all of my assignment types here: https://www.mersenne.org/manual_extension/
Though I manually extended some P-1 jobs that I'm doing jointly and they now say 218 days.

Either way, 1000 TF assignments is certainly enough for most people; if you have more than 100 assignments you get put on a watchlist anyway. (https://www.mersenne.org/assignments/topusers.php)

Though I think this list actually uses 1000 for the TF threshold, otherwise I would be on it.

> Feel free to put a default value for MaxExponents in the local.ini config file.

Can do!

> In that case, you might also restore the results_file as well.

I'm not sure what you mean by restoring the results_file, though it may come to me while I'm in the code.

> I would prefer to not add a prompt to the --setup option, as users would then just set a large value defeating the purpose of this feature, but I am open to being convinced otherwise.

That's fair. I suppose gating getting an obscene amount of assignments behind config editing is pretty good. I was envisioning it as a "launch control lock switch" sort of deal, where the user is allowed to do whatever they want with the program, but we make it difficult to do on accident.

brubsby (Contributor, Author):

Just spitballing, but maybe we have the MaxExponents default to some value based on their selected days of work and their worktype? This way we guard against wildly inaccurate/buggy time estimates causing lots of work to be fetched, but we stay out of the way of users who actually want that much work. For TF this could maybe be something like 150 assignments per day; I'm not sure what other worktypes should default to, though.
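To make that idea concrete, here is a rough sketch; the function name and the per-day rates are illustrative placeholders, not values used by primenet.py or settled on in this discussion:

# Hypothetical sketch of a days-of-work-scaled MaxExponents default.
# The per-day rates are made-up illustrations, not calibrated numbers.
ASSIGNMENTS_PER_DAY = {
    "TF": 150,   # a GPU can burn through TF assignments very quickly
    "P-1": 10,
    "PRP": 2,    # a wavefront primality test usually spans days
    "LL": 2,
}

def default_max_exponents(worktype, days_of_work, floor=15):
    """Scale the MaxExponents default by worktype and --days-work,
    never going below the historical Prime95/MPrime default of 15."""
    rate = ASSIGNMENTS_PER_DAY.get(worktype, 2)
    return max(floor, int(rate * days_of_work))

print(default_max_exponents("TF", 10))   # 1500
print(default_max_exponents("PRP", 10))  # 20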

tdulcet (Owner):

> Ah, didn't realize this was directly from p95, that makes a bit more sense then.

Yes, for just a general hint, the camel cased config file options are either the same or adapted from a similar option in Prime95/MPrime, while the snake cased ones are unique to the PrimeNet program.

> I'm not sure what you mean by restoring the results_file, though it may come to me while I'm in the code.

When we removed the default value for the --resultsfile option, this also removed it from the config file by default. I was just proposing that you manually add it back in the config file when you add the default value for the MaxExponents option.

> I was envisioning it as a "launch control lock switch" sort of deal, where the user is allowed to do whatever they want with the program, but we make it difficult to do on accident.

Yes, I agree, that was my original goal with this feature. I am all for empowering power users to do what they need to do.

> Just spitballing, but maybe we have the MaxExponents default to some value based on their selected days of work and their worktype?

This would be fine with me, but it would of course be more difficult to implement, as we could then no longer have a hard coded default in the config file. I was previously going for simplicity.

brubsby (Contributor, Author):

All these should be done now!

@brubsby (Contributor, Author) commented Jul 27, 2024

Starting another spot for discussion here, as it's not really mfaktx related per se, but primenet related, and I feel I've saturated my outgoing communications to folks for the time being :)

For my TODO post above:

> fix percentage calculation for Stages=1 and multiple bitlevel assignments

This raises some questions about what exactly we should report as STAGE and percentage in various cases.
Let's go through an example of an assignment that is TF from 2^75 to 2^77.

If mfaktc.ini has Stages=0, then my understanding is that it's running every bitrange in the assignment at once, so if we find a 76.5-"bit" factor, we aren't sure there isn't also a 75.5-"bit" factor to be found in the range as well. Given that, reporting a simple TF75 stage seems a bit misleading, as the percentage associated with this stage is actually the joint percentage of TF75-77.

Similarly if mfaktc.ini has Stages=1, and we're reporting progress on TF from 2^75 to 2^77, we will report progress stage TF75 (and a percentage) during 75-76, and then TF76 (and a percentage) during 76-77. What exactly these percentages should be is unclear to me. If we are reporting 0-100% for each stage, then TF75 (and a percentage) is indistinguishable from the Stages=0 case of TF75 (and a percentage).

Estimated completion date, I would think, should just be when we expect to release the exponent, not necessarily when the current stage is done. But I'm not sure whether percentage should work that way as well.

TL;DR: Should percentage be our progress on the current stage, or the progress of the whole assignment?
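One way to make the whole-assignment option concrete is to weight each bit level by its cost, using the usual rule of thumb that each successive bit level takes roughly twice as long as the previous one. A sketch with illustrative names, not code from this PR:

def assignment_percent(bit_lo, bit_hi, current_bit, stage_percent):
    """Whole-assignment completion for TF from 2^bit_lo to 2^bit_hi under Stages=1,
    where current_bit is the level being worked and stage_percent is its 0-100
    progress. Assumes each bit level costs roughly 2x the previous one."""
    weights = [2 ** i for i in range(bit_hi - bit_lo)]
    total = sum(weights)
    done = sum(weights[:current_bit - bit_lo])                     # finished levels
    done += weights[current_bit - bit_lo] * stage_percent / 100.0  # current level
    return 100.0 * done / total

# For the TF 2^75 to 2^77 example: halfway through 75-76 is only ~1/6 of the
# assignment, because 76-77 is roughly twice as much work as 75-76.
print(assignment_percent(75, 77, 75, 50.0))  # ~16.7
print(assignment_percent(75, 77, 76, 50.0))  # ~66.7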

@brubsby (Contributor, Author) commented Aug 3, 2024

As the number of changes in this PR keeps growing, merging upstream changes starts to get a bit more painful (I think it might've been better if I had started with a rebase, but my tooling kept telling me a merge was a better idea). Would you be open to merging this into the 2.2 branch at some point soon, and doing more changes as another PR? Or would you rather keep a version of the code with the CERT stuff separate from the TF stuff, in case we want to release them independently?

@brubsby (Contributor, Author) commented Aug 3, 2024

Here's the code I used to regenerate the merged table, on the chance it saves you any amount of manual effort:

import tables
tables.array(
	(
		(4,    "✔",  "✔",   "",   "",   "",  "✔"),
		(12,    "",   "",   "",   "",  "✔",   ""),
		(100,  "✔", "✔*",  "✔",  "✔",   "",   ""),
		(101,  "✔",   "",   "",  "✔",   "",   ""),
		(102,  "✔", "✔*",  "✔",  "✔",   "",   ""),
		(104,  "✔", "✔*",  "✔",  "✔",   "",   ""),
		(106,   "", "✔*",  "✔",  "",   "",   ""),
		(150,  "✔",  "✔",  "✔",   "",   "",   ""),
		(151,  "✔",  "✔",  "✔",   "",   "",   ""),
		(152,  "✔",  "✔",  "✔",   "",   "",   ""),
		(153,  "✔",  "✔",  "✔",   "",   "",   ""),
		(154,  "✔", "✔*",   "",   "",   "",   ""),
		(155,   "",  "✔",  "✔",   "",   "",   ""),
		(156,   "",   "",   "",   "",   "",   ""),
		(160,  "✔",   "",   "",   "",   "",   ""),
		(161,  "✔",   "",   "",   "",   "",   ""),
	),
	("Worktype", "Mlucas", "GpuOwl", "PRPLL", "CUDALucas", "mfaktc/mfakto", "CUDAPm1"),
	headerrow=True,
)

(I really wish the check glyph used had a monospaced glyph in most of the fonts used in editors and stuff, lol)

primenet.py (outdated), review thread on:

@@ -5686,6 +5686,10 @@ def ping_server(ping_type=1):
    return None


+def is_pyinstaller():
+    return getattr(sys, 'frozen', False) and hasattr(sys, '_MEIPASS')
brubsby (Contributor, Author):

This seems to be the canonical way to determine whether we're in a pyinstaller instance, so I've added it.

https://pyinstaller.org/en/stable/runtime-information.html
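For context, the usual reason for this check is to resolve paths to files bundled with the executable, since PyInstaller unpacks them under sys._MEIPASS rather than next to the script. A generic sketch, not code from this PR:

import os
import sys

def resource_path(relative):
    """Resolve a data file whether running from source or from a PyInstaller bundle."""
    if getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS"):
        base = sys._MEIPASS  # temporary extraction directory of the one-file bundle
    else:
        base = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base, relative)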

tdulcet (Owner):

For future reference, I added this link to a comment in the function.

@tdulcet (Owner) commented Aug 4, 2024

> would you be open to merging this into the 2.2 branch at some point soon, and doing more changes as another PR?

Sorry about the hassle. Yes, we can merge this soon and then make any additional improvements in another PR. If we are going to support mfakto, it might be good to get that working with the script. For now, I will wait to push any more changes to the 2.2 branch until this is merged.

> I really wish the check glyph used had a monospaced glyph in most of the fonts used in editors and stuff, lol

Yeah, I liked the checkmark symbol, but I agree that it is annoying that this table does not align in editors, so feel free to use a different symbol. You could also change the appearance of the table by adjusting some of the options for my library.

@tdulcet merged commit a1ef975 into tdulcet:primenet-2.2 on Aug 23, 2024