--mfaktc and --mfakto prototype #24
Conversation
Ended up forking mfaktc to add json results that include the AID, will probably add default logging soon too, so we can parse the logs and use those for work estimation.
# Conflicts: # README.md # primenet.py
…mfaktc checkpoint
primenet.py
Outdated
@@ -4795,14 +4795,18 @@ def get_assignments(adapter, adir, cpu_num, progress, tasks):
     time_left = timedelta(seconds=time_left)
     days_work = timedelta(days=options.days_of_work)
     if time_left <= days_work:
-        num_cache += 1
+        num_cache = math.ceil(num_existing * days_work / time_left)
This is perhaps the "exact" calculation of how big our cache should be, assuming the time_left estimation is correct and all jobs are similarly sized (and it should work for every worktype).
One drawback of doing it this way is that if we start up a fresh instance, we won't have a time calculation, so we have to choose some number of assignments for the first run in the case that the user hasn't used num_cache. I've chosen 10 for TF and 1 for other types (1 was the previous behavior in this case, I believe).
The other drawback is that we might accidentally fetch the user a large number of assignments if they have an underestimate for time_left or something. Some guard-rails would maybe be prudent, but I can't think of an easy way to do it that won't be annoying when someone is trying to fetch 10 days of work before taking their computer immediately offline.
After all, the user could then --unreserve-all if we accidentally fetch too much.
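To make the sizing concrete, here is a minimal standalone sketch of the calculation above; the numbers are made up, and num_existing/time_left/days_work just mirror the names in the diff:

```python
from datetime import timedelta
import math

# Hypothetical numbers, purely for illustration.
num_existing = 4                      # assignments already in the worktodo file
time_left = timedelta(hours=12)       # estimated time to finish them
days_work = timedelta(days=3)         # --days-work target

if time_left <= days_work:
    # Scale the queue so it covers days_work, assuming every new
    # assignment takes roughly the same time as the existing ones.
    num_cache = math.ceil(num_existing * days_work / time_left)
    print(num_cache)  # ceil(4 * 3 / 0.5) = 24
```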
10 assignments would hold over my GPU for roughly two hours, so anything above that for --timeout would result in mfaktc closing before more work is received. Perhaps in the "fresh" case, we want to get a small number of assignments, and wake every minute until we have a good time estimate, and then get more work?
Not sure what the most general solution to this is, I don't necessarily want to have a bunch of carveouts for TF if we don't have to.
If the --num-cache option is larger than what is needed for the --days-work option, this will select the smaller of the two values. I think you want something like this instead (which also eliminates the math.ceil):
-num_cache = math.ceil(num_existing * days_work / time_left)
+num_cache = max(num_cache, -(days_work * num_existing // -time_left))
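For anyone reading along: the suggested line avoids math.ceil by using the -(a // -b) ceiling-division idiom (floor division of one timedelta by another returns an int, and negating twice turns the floor into a ceiling), and the outer max() keeps the larger of the user's --num-cache and the days-of-work estimate. A small self-contained illustration with made-up values:

```python
from datetime import timedelta

num_cache = 10                        # hypothetical --num-cache setting
num_existing = 4
time_left = timedelta(hours=12)
days_work = timedelta(days=3)

# timedelta // timedelta floors to an int; -(a // -b) turns that into a ceiling.
estimate = -(days_work * num_existing // -time_left)
num_cache = max(num_cache, estimate)
print(estimate, num_cache)  # 24 24 -- the larger of the two values wins
```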
Good catch, I've redone the debug lines relating to this to make a bit more sense in the case where num_cache is bigger than what days_of_work calculates is necessary.
--max-exp (and presumably --min-exp) aren't working with work preference 12.
We can maybe ask George or James to add support for these, as it looks correct from our side, but either way it's not extremely pressing. Just wanted to test them out since you mentioned them.
primenet.py
Outdated
if config.has_option(SEC.PrimeNet, "MaxExponents"):
    amax = config.getint(SEC.PrimeNet, "MaxExponents")
    if amax < num_cache:
        adapter.debug(
            "Config option MaxExponents ({0:n}) is smaller than num_cache ({1:n}), num_cache decreased to {0:n}".format(
                amax, num_cache, workfile))
        num_cache = amax
If only to prevent a bug in the program from causing it to request too many assignments, I would prefer to keep a default value for the MaxExponents config file option, even if it is significantly larger when the user is doing TF assignments.
On one level I agree, as it will surely happen that someone gets way too many assignments due to a bug or misunderstanding if we don't include some safeguard, but I just personally found the experience of having to discover the default MaxExponents option in the code to be a bad user experience.
There are several factors that contribute to the confusion in the previous version:
- there is no cli option for max exponents, so -h doesn't lead you in the right direction
- it's not mentioned in --setup
- there's a cli option for "max exponent" but not max exponents (two different options that are confusingly similar)
- in the previous configuration, even if you have no MaxExponents in the config (which happens by default), you're still limited to 15, so you have to search through the actual code to figure out what's going on. (even if we raise this, it will still happen when a user puts in a high days_of_work for certain worktypes)
- throughout this whole process, you're not actually told about MaxExponents in the log
I have added some debug messages to hopefully make it clearer if this is happening to someone in the future (point 5 above), but they would need to have debugging on. If they're having to use debugging to figure out why the program isn't behaving like they told it to (using days of work), I think we've already failed some test of the user experience.
10 days of TF work for my machine can be ~1500 assignments at small (but still TF wavefront) bit levels, which is an enormous number of exponents, and the maximum number of days of work to fetch is 90, which would be ~9470 assignments of more typical size for me.
I think I heard 10,000 was the maximum number of assignments you could have in PrimeNet, though if I grabbed 10,000 wavefront TF assignments that would likely have a significant negative impact on the project. We surely want the limit to be below this, but still respect days_of_work.
TL;DR: I think we should have safeguards but they should be upfront.
Proposal:
- We could add it to --setup with an appropriately sized default for the worktype.
- (People become aware of its existence and set it according to their comfort level)
- When creating the .ini file, we put that default in (a rough sketch of this follows the list below)
- do this even if the user doesn't use --setup
- when they're editing the .ini they will see it and realize that's what's keeping the script from getting more assignments.
- We can use the previously mentioned default if they delete the line in .ini, but display a warning that doesn't require the debug verbosity to see, that we are using max_exponents=15 or 100 (or something) as a default.
- (Alternatively, just make all the num_cache related debugs into info messages)
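A rough configparser sketch of the "write the default up front" idea, reusing the script's existing config/SEC.PrimeNet naming as an assumption; the per-worktype defaults and the ensure_max_exponents helper are hypothetical placeholders, not existing code:

```python
from configparser import ConfigParser

# Hypothetical defaults per worktype; the actual values are up for discussion.
DEFAULT_MAX_EXPONENTS = {"TF": 1000, "other": 15}

def ensure_max_exponents(config, section, worktype):
    """Write a visible MaxExponents default into the .ini if it is missing."""
    if not config.has_option(section, "MaxExponents"):
        default = DEFAULT_MAX_EXPONENTS.get(worktype, DEFAULT_MAX_EXPONENTS["other"])
        config.set(section, "MaxExponents", str(default))
        # Surfaced without needing debug verbosity, per the proposal above.
        print("MaxExponents not set, defaulting to {0}".format(default))
    return config.getint(section, "MaxExponents")

# Example usage with a throwaway config object:
config = ConfigParser()
config.add_section("PrimeNet")
print(ensure_max_exponents(config, "PrimeNet", "TF"))  # 1000
```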
Thanks for the detailed write up.
The MaxExponents config file option comes from Prime95/MPrime, which also defaults to 15 assignments. Note that this limit is per worker, so with multiple workers one can have many more assignments. I was hesitant at first to add support for this option, but over time I became concerned that a bug either in the script or on the server could cause the program to request too many assignments, so I decided to add this as a safeguard. Prime95/MPrime of course goes a step further and will actually automatically unreserve assignments, which I would never do.
Anyway, at least with CPUs, 15 wavefront assignments is typically a lot of work (days or weeks), especially with primality tests, so most users should never hit this limit. However, GPUs can obviously be faster, so feel free to increase the limit for TF work to whatever value makes sense, even if that is hundreds or thousands of assignments. Ideally, only the most advanced users should ever need to change this option. I believe TF (and P-1) assignments are only valid for 30 days, so requesting 90 days of work would not be applicable here.
Another problem I am aware of is that there is no exhaustive list of all supported config file options (without reading my release notes for each version). It would be easy enough to create such a list, but I am not sure of the best place to store it. Maybe in the README? Long term, we probably need to write more extensive documentation on how to use the program.
Regarding your proposal, adding a log message when the MaxExponents value is hit was a good improvement, so thanks for implementing that. Note that in general I do not want the script to be too verbose by default, as users may miss important messages. I wish the Python logging module supported multiple levels of debug messages instead of the critical level, which is currently unused, but alas all we have is debug. Feel free to put a default value for MaxExponents in the local.ini config file. In that case, you might also restore the results_file as well. I would prefer not to add a prompt to the --setup option, as users would then just set a large value, defeating the purpose of this feature, but I am open to being convinced otherwise.
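As an aside on the multiple-debug-levels point: the stdlib does technically allow registering extra numeric levels below DEBUG (though without a dedicated convenience method). A minimal sketch, not something the script currently does:

```python
import logging

VERBOSE = 5  # any value below logging.DEBUG (10)
logging.addLevelName(VERBOSE, "VERBOSE")

logging.basicConfig(level=VERBOSE, format="%(levelname)s: %(message)s")
logger = logging.getLogger("example")  # hypothetical logger name

logger.log(VERBOSE, "extra-chatty detail, filtered out at a normal DEBUG threshold")
logger.debug("regular debug message")
```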
The MaxExponents config file option comes from Prime95/MPrime, which also defaults to 15 assignments.
Ah, didn't realize this was directly from p95, that makes a bit more sense then.
I believe TF (and P-1) assignments are only valid for 30 days, so requesting 90 days of work would not be applicable here.
This is a good point. I think non Cat 0-1 exponents are set to 60 days by default; mine mostly show 60 for all of my assignment types here: https://www.mersenne.org/manual_extension/
Though I manually extended some P-1 jobs that I'm doing jointly and they now say 218 days.
Either way, 1000 TF assignments is certainly enough for most people; if you have more than 100 assignments you get put on a watchlist anyway (https://www.mersenne.org/assignments/topusers.php).
Though I think this list actually uses 1000 for the TF threshold, otherwise I would be on it.
Feel free to put a default value for MaxExponents in the local.ini config file.
Can do!
In that case, you might also restore the results_file as well.
I'm not sure what you mean by restoring the results_file, though it may come to me while I'm in the code.
I would prefer to not add a prompt to the --setup option, as users would then just set a large value defeating the purpose of this feature, but I am open to being convinced otherwise.
That's fair. I suppose gating getting an obscene amount of assignments behind config editing is pretty good. I was envisioning it as a "launch control lock switch" sort of deal, where the user is allowed to do whatever they want with the program, but we make it difficult to do on accident.
Just spitballing, but maybe we have the MaxExponents default to some value based on their selected days of work and their worktype? This way we guard against wildly inaccurate/buggy time estimates causing lots of work being fetched, but we stay out of the way of users that actually want that much work? For TF this can maybe be like 150 assignments per day, I'm not sure what other worktypes should default to though.
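A sketch of what that scaling could look like; the per-day rates are guesses (only the ~150/day TF figure comes from the comment above), and default_max_exponents is a hypothetical helper, not existing code:

```python
# Hypothetical assignments-per-day rates by worktype.
ASSIGNMENTS_PER_DAY = {"TF": 150, "default": 2}

def default_max_exponents(worktype, days_of_work):
    rate = ASSIGNMENTS_PER_DAY.get(worktype, ASSIGNMENTS_PER_DAY["default"])
    # Keep the Prime95-style floor of 15 and leave headroom above the
    # estimate, so we only catch wildly wrong time calculations.
    return max(15, int(rate * days_of_work))

print(default_max_exponents("TF", 10))   # 1500
print(default_max_exponents("LL", 10))   # 20
```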
Ah, didn't realize this was directly from p95, that makes a bit more sense then.
Yes, as a general hint, the camel-cased config file options are either the same as or adapted from a similar option in Prime95/MPrime, while the snake-cased ones are unique to the PrimeNet program.
I'm not sure what you mean by restoring the results_file, though it may come to me while I'm in the code.
When we removed the default value for the --resultsfile option, this also removed it from the config file by default. I was just proposing that you manually add it back in the config file when you add the default value for the MaxExponents option.
I was envisioning it as a "launch control lock switch" sort of deal, where the user is allowed to do whatever they want with the program, but we make it difficult to do on accident.
Yes, I agree, that was my original goal with this feature. I am all for empowering power users to do what they need to do.
Just spitballing, but maybe we have the MaxExponents default to some value based on their selected days of work and their worktype?
This would be fine with me, but it would of course be more difficult to implement, as we could then no longer have a hard coded default in the config file. I was previously going for simplicity.
All these should be done now!
Starting another spot for discussion here, as it's not really mfaktx related per se, but primenet related, and I feel I've saturated my outgoing communications to folks for the time being :) For my TODO post above:
If mfaktc.ini has Stages=0, then my understanding is that it's running every single bit range in the assignment at once, so for example, if we find a 76.5 "bit" factor, we aren't sure that there isn't also a 75.5 "bit" factor to be found in the range as well. So from this, reporting a simple TF75 stage seems a bit misleading, as the percentage associated with this stage is actually the joint percentage of TF75-77.
Similarly, if mfaktc.ini has Stages=1 and we're reporting progress on TF from 2^75 to 2^77, we will report stage TF75 (and a percentage) during 75-76, and then TF76 (and a percentage) during 76-77. What exactly these percentages should be is unclear to me. If we are reporting 0-100% for each stage, then TF75 (and a percentage) is indistinguishable from the Stages=0 case of TF75 (and a percentage).
Estimated completion date, I would think, should just be when we expect to release the exponent, and not necessarily when the current stage is done. But I'm not sure if the percentage should be that as well.
TL;DR: Should percentage be our progress on the current stage, or the progress of the whole assignment?
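One way to compute the whole-assignment number, assuming each bit level takes roughly twice the work of the previous one (an approximation, not something mfaktc reports), is to fold the per-stage percentage into an overall figure like this hypothetical helper:

```python
def overall_tf_progress(start_bits, end_bits, current_bits, stage_pct):
    """Rough whole-assignment progress for TF from 2^start_bits to 2^end_bits.

    Assumes each bit level is ~2x the work of the one before it, which is
    an approximation rather than anything reported by mfaktc itself.
    """
    total = sum(2 ** b for b in range(start_bits, end_bits))
    done = sum(2 ** b for b in range(start_bits, current_bits))
    done += (2 ** current_bits) * (stage_pct / 100.0)
    return 100.0 * done / total

# Stages=1 example: 40% through the 76-77 step of a 75..77 assignment.
print(round(overall_tf_progress(75, 77, 76, 40.0), 1))  # ~60.0% of the whole assignment
```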
# Conflicts: # primenet.py
As the number of changes in this PR keeps growing, merging upstream changes starts to get a bit more painful (I think it might've been better if I had started with a rebase, but my tooling kept telling me a merge was a better idea). Would you be open to merging this into the 2.2 branch at some point soon, and doing more changes as another PR? Or would you rather keep a version of the code with the CERT stuff separate from the TF stuff, in case we want to release them independently?
Here's the code I used to regenerate the merged table, on the chance it saves you any amount of manual effort:

import tables

tables.array(
    (
        (4, "✔", "✔", "", "", "", "✔"),
        (12, "", "", "", "", "✔", ""),
        (100, "✔", "✔*", "✔", "✔", "", ""),
        (101, "✔", "", "", "✔", "", ""),
        (102, "✔", "✔*", "✔", "✔", "", ""),
        (104, "✔", "✔*", "✔", "✔", "", ""),
        (106, "", "✔*", "✔", "", "", ""),
        (150, "✔", "✔", "✔", "", "", ""),
        (151, "✔", "✔", "✔", "", "", ""),
        (152, "✔", "✔", "✔", "", "", ""),
        (153, "✔", "✔", "✔", "", "", ""),
        (154, "✔", "✔*", "", "", "", ""),
        (155, "", "✔", "✔", "", "", ""),
        (156, "", "", "", "", "", ""),
        (160, "✔", "", "", "", "", ""),
        (161, "✔", "", "", "", "", ""),
    ),
    ("Worktype", "Mlucas", "GpuOwl", "PRPLL", "CUDALucas", "mfaktc/mfakto", "CUDAPm1"),
    headerrow=True,
)

(I really wish the check glyph used had a monospaced glyph in most of the fonts used in editors and stuff, lol)
primenet.py
Outdated
@@ -5686,6 +5686,10 @@ def ping_server(ping_type=1):
     return None
+
+
+def is_pyinstaller():
+    return getattr(sys, 'frozen', False) and hasattr(sys, '_MEIPASS')
This seems to be the canonical way to determine whether we're in a pyinstaller instance, so I've added it.
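For context, the usual reason to combine those two checks is to resolve paths to data files bundled into a one-file executable, since PyInstaller unpacks them under sys._MEIPASS at run time. A small sketch of that pattern (the resource_path helper and file name are made up, not part of the script):

```python
import os
import sys

def is_pyinstaller():
    return getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS")

def resource_path(name):
    # Frozen: bundled files live under the temporary _MEIPASS directory.
    # Not frozen: fall back to the directory containing this script.
    base = sys._MEIPASS if is_pyinstaller() else os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base, name)

print(resource_path("version.txt"))  # hypothetical bundled file
```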
For future reference, I added this link to a comment in the function.
Sorry about the hassle. Yes, we can merge this soon and then make any additional improvements in another PR. If we are going to support mfakto, it might be good to get that working with the script. For now, I will wait to push any more changes to the 2.2 branch until this is merged.
Yeah, I liked the checkmark symbol, but I agree that it is annoying that this table does not align in editors, so feel free to use a different symbol. You could also change the appearance of the table by adjusting some of the options for my library.
…ting for multilevel assignments
# Conflicts: # primenet.py
Very much a work in progress, but creating this PR mostly as a jumping off point for discussion and feedback. Probably won't be ready for v2.2, and definitely shouldn't hold up the release of CERT support.
Current TODO:
James
George
Nice to have:
Testing areas:
Bugs:
Server-side changes wishlist:
Probably won't do: