Play data doesn't update after game closes #34
I have a few other examples of this that I noticed as well. When I get back to a computer I will try to pull examples from the json. There were a lot of fumbles being counted twice, including: Christian McCaffrey, Alex Collins, Ben Roethlisberger, Dalvin Cook, Alfred Morris, Dak Prescott, Trubisky. player_id | gsis_id | play_id | fumbles_lost | team |
This is interesting, and an issue I have not previously seen, but I can confirm that I saw this with the Minnesota/San Francisco game. It was pointed out to me that my data showed the Vikings had 2 defensive fumble recoveries & 4 sacks when in reality they only had a single fumble recovery and 3 sacks. Sure enough, there were records being duplicated in a single play. For example, for this play:
The raw data under that play showed:
There are just too many sequences listed for the actions that went into that play. Deleting the game's cached file and re-generating it cleaned the play up: the redundant copies of A.Sandejo's contributions to the play are now gone. This was not an isolated case, as 4 defenses were pointed out to me as having wrong stats. Deleting all of the games from the week and then re-generating the data resolved the discrepancies in all cases. |
So is deleting all the .json.gz files for the weekend the best way to remove those and then regenerate? And do you know if there is a separate step for cleaning up nfldb? |
In Windows I just went to the cache folder and deleted them manually. Yes, I see that. |
I don't really know the best way to delete the games from nfldb. One thing you can do is run this little script, and then run the updater (nfldb-update) to re-import the games:

```python
import nfldb

db = nfldb.connect()

# Or hard-code the game ids, e.g.:
# gsis_ids = [2018090600, 2018090900, 2018090901, 2018090902, 2018090903,
#             2018090904, 2018090905, 2018090906, 2018090907, 2018090908,
#             2018090909, 2018090910, 2018090911, 2018090912, 2018091000,
#             2018091001]
q = nfldb.Query(db)
q.game(season_year=2018, season_type='Regular', week=1)
gsis_ids = [game.gsis_id for game in q.as_games()]

for gsis_id in gsis_ids:
    # Parameterized query instead of string interpolation into SQL.
    with nfldb.Tx(db) as cursor:
        cursor.execute("DELETE FROM game WHERE gsis_id = %s", (gsis_id,))
```

That will clear the affected games so the updater can re-insert corrected data.
|
If this is a very rare thing, then maybe it's not worth it, but these resources do send the Last-Modified header. |
Seems like it's worth addressing to me. Reading this, Last-Modified seems elegant enough to me! Why not just store that directly on the game json? It seems pretty low impact to me... |
I noticed this happened again in last night's BAL/CIN game. CIN was being
credited with an extra INT, 2 extra fumble recoveries, and 1.5 extra sacks
in the play-by-play that was available when the last play of the game occurred.
At some point between last night and this morning that data was fixed and
deleting my local .json and regenerating fixed the discrepancy.
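For what it's worth, the delete-and-regenerate step can be scripted; a minimal sketch, assuming the cache directory simply holds one `<gsis_id>.json.gz` per game (the directory layout here is an assumption, pass your own path):

```python
import os

def delete_cached_game(gsis_id, cache_dir):
    """Remove one game's cached gamecenter JSON so the next access
    re-downloads it from NFL.com with any post-game corrections."""
    path = os.path.join(cache_dir, "{}.json.gz".format(gsis_id))
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```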
|
If this is going to happen every week, I'd say it's a fairly significant problem. |
Can this be renamed "Play data doesn't update after game closes"? Just want to make sure I am clear on the issue here. |
Yeah that makes sense.
|
Storing the timestamp in the game-json would indeed be low impact. What needs to be handled in that case is support for reading both (older) files without the timestamp and files with the timestamp. Easy to fix. What needs to be decided is when, and for how long, to check for updates of games. For instance, we don't need to check the correctness of a game played in 2016. Games of the current week, x weeks back...? So the proposed logic would be to only check recent games. |
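Reading both file versions is indeed easy; a minimal sketch, assuming the timestamp would live under a top-level key (the key name `last_modified` is an assumption):

```python
import gzip
import json

def load_game_json(path):
    """Load a cached game file, tolerating older files that were
    written before any timestamp field existed."""
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    # Older files simply lack the key; treat that as "never verified".
    return data, data.get("last_modified")
```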
Hmmmm, very good point! I need to look at how this is done to provide a more informed opinion... but perhaps an easy/non-obtrusive way would be to only check if the game occurred within 2 weeks. IMO the way to ensure the most accuracy would be to check EVERY game, or check if the game occurred in the given season... but I'm unsure if this would result in too much traffic that the NFL may crack down on. |
Actually, we could get away with simply changing the logic: instead of checking/updating currently playing games, we update all games in the past two weeks. I'm not so sure tracking the Last-Modified header does much if we need to keep track of how recently the game was played anyway. |
If I'm understanding correctly, on Monday nights, every time someone accesses the API we will be dumping and re-downloading ~31 games worth of data instead of the one game being played? I hit the API roughly every 3 minutes (and I think nfldb is every 30 seconds). That seems like a lot of unnecessary calls for data that is almost certainly static. I think that might be a bit too lazy of a solution here. |
In practice, after analyzing how much traffic is generated just by opening NFL.com, and seeing how many page visits they have every month, I don't think we really need to worry about calling them too often (again, in practice). With that said, I think it's great that we are cautious and try to be smart. Not only does that decrease the number of calls to the NFL, it makes for a better application that runs smoother.
We have the schedule on file/in memory. Getting entries from a certain relative time range is an easy operation, so no worries there. Come to think of it, we could make use of an extra flag. A user that runs the application pretty much all the time would have calls scattered over time, whereas a user that hasn't run the application in, say, three weeks will have to download everything anyway, and all games older than (today - x) will automatically be considered final. For the historical data, it's not impossible that we have errors, so one could of course verify these as well if we wanted to (preferably a single user that commits all changes to the repo). |
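Pulling a relative window out of the schedule could look like this (a sketch only; entries here are assumed to carry a kickoff datetime, which is not necessarily the schedule's real shape):

```python
from datetime import datetime, timedelta

def games_to_recheck(schedule, now, weeks_back=2):
    """Return only the schedule entries whose kickoff falls within the
    last `weeks_back` weeks; everything older is considered final."""
    cutoff = now - timedelta(weeks=weeks_back)
    return [g for g in schedule if cutoff <= g["kickoff"] <= now]
```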
I wonder if we should make this thing configurable, ya know? VERIFY_POST_GAME=(False|interval) or perhaps a manual call. Like... some kind of force update function. |
Again, I'm not 100% sure on how the player stats are updated, but ASSUMING it's done in a batch process of games that are active and removed from the dictionary of games to update once they are complete: VERIFY_POST_GAME could be checked whenever we look up whether a game is active or not. If the game was marked active within the interval, we continue checking it. I believe this would prevent those extraneous calls @ochawkeye mentioned. |
Manually deleting the files works for now. @derek-adair, not sure if I grasp the difference between your (last) suggestion and my suggestion. @ochawkeye and @brbeaird, did you notice any wrong data this week? Also, is this the first season you have noticed this? My point being that this might be something going on at NFL.com just right now; surely they're aware of the problem and people will have pointed this out (if it's wrong here, it's wrong on NFL.com). So I'm guessing that for them this is a high-priority bug, probably solved in the near future. Another option is of course that they abandon these feeds completely (they are old, and feed very old-looking and pretty poorly functioning web content on nfl.com) in favor of newer ones, in which case we have bigger problems. If this is a one-time-in-six-years (?) thing and it's back to normal now, then I think @derek-adair's suggestion of a configurable update function is nice (a force-update command that can be run manually as desired, as well as periodically updating games if the user wishes to have such a setup). |
Yes, this is a data problem that occurred again this week. Just looking at defenses' accumulated stats, I saw changes in the following after deleting and regenerating stats: Baltimore: 4 sacks > 3 sacks. This is not a problem that I ever witnessed before this season. |
In regards to it being a high-priority bug, I'm not so certain on that one. Just because the underlying JSON data shows the redundant play information doesn't mean that their Game Center is displaying all of that information. I haven't watched the games live in the Game Center, but it's probable that their display of the information handles the redundancies properly. |
Yep, this is the first season I've seen this, and I looked at things pretty closely last year. It does seem like a bug, but seeing as how the NFL feed eventually gets corrected, that may just be how they've got it working now. I have no idea what's going on behind the scenes there, but it definitely feels like something we'll end up having to deal with to make sure we go back and get the corrected data at some point. |
Yep, we can't know if it's high prio @ NFL or not, so I guess we'd have to assume that it's not. Right, so I don't mind coding fixes (that's kinda why I'm around; I don't actually use the project more than for inspiration/as a knowledge source), but the question is how. How do you guys feel about the suggestions given so far? |
Regardless of any redundancies there will be corrections after the game is no longer active. The doubly counted plays may or may not be corrected post-game-closing. I'm all about the VERIFY_POST_GAME config option. Also note that this project is basically running itself so "high priority" is a relative term. |
"High Priority" was in the context of whether or not NFL.com viewed this as a bug in their source data, not a grade of how quickly it needs to be addressed here. |
Storing the last-pulled information in the JSON would be fine, but even that might not be necessary. It all comes down to how long after the game the data ends up being corrected. I've found that if a change is going to happen, it happens by the morning after. But I have not compared stats from any other site to see if they match what we are recording, so I can't say with 100% certainty that no changes occur after the morning after. If we could collectively agree that the morning after looks like the cutoff for changes, then we could probably just tap into the metadata for the .json.gz file itself: if the file was last modified on the day of a game that happened in the past, delete it and pull it again. If it is newer than the game date, then call it good. As mentioned, these post-game changes are not something we have seen previously. Each week the NFL has a handful of plays they reclassify based on further analysis. For example, there might be a pass play that, after further tape review, gets classified as a running play instead. In that case, they publish a list of statistical changes for the play (e.g. Drew Brees: -1 completion, -1 pass attempt, -6 passing yards; Mark Ingram: -1 reception, -6 receiving yards, +1 rushing attempt, +6 rushing yards). But those stat changes have never been retroactively applied to the play-by-play data. So these post-game changes are all pretty new behavior for those of us that have been using nflgame for years. |
I know with most Fantasy sites, stat corrections usually go through on Wed
or Thurs morning for the previous week's games; I'm not sure the longest
possible time NFL could make changes after the fact, though.
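The mtime heuristic floated earlier (delete and re-pull if the cached file was last written on game day) is only a couple of lines; a sketch, with the stat-correction window left configurable since its exact length is still being debated:

```python
import os
from datetime import datetime, timedelta

def cache_is_stale(path, game_date, grace=timedelta(days=1)):
    """True if the cached file was last written before game day plus the
    grace period, i.e. before post-game corrections could have landed."""
    mtime = datetime.fromtimestamp(os.path.getmtime(path))
    return mtime < game_date + grace
```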
|
So after some thinking... Storing the timestamp outside the game files would work. Also, each time we check a remote game file on nfl.com (via its Last-Modified header), we can record when we last checked it. No tampering with the json-files; they're all just copies of the nfl.com data. I kinda like that approach. What we haven't touched on is where this should happen, e.g. where in the code; we've discussed this in prior issues ("hi-jacking" the import, for instance). EDIT: Making this configurable - I can't really decide if I think this should be configurable or not. I kinda like the idea that the application doesn't just download stuff from the internet without the user really knowing it. On the other hand, if this is not enabled by default, then a high percentage of the users will not know about it/how to enable it = file issues about it and not "enjoy the product" as it can be enjoyed. |
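A conditional request is one cheap way to check a remote game file; a sketch with the stdlib (the URL and header handling are illustrative only, not nflgame's actual fetch code):

```python
import urllib.request

def build_conditional_request(url, last_known=None):
    """Build a request that lets the server reply 304 Not Modified
    (empty body) if the game file hasn't changed since `last_known`,
    a previously seen Last-Modified header value."""
    req = urllib.request.Request(url)
    if last_known:
        req.add_header("If-Modified-Since", last_known)
    return req
```

Opening such a request with `urllib.request.urlopen` raises an `HTTPError` with code 304 when nothing changed, which the caller can treat as "the cached copy is still good".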
Main reason I don't like that is that it is possible to run nflgame without ever accessing any of the games. The following is perfectly valid code:

```python
import nflgame

print len(nflgame.players)
```

Adding the overhead of checking every single game (thousands) against when it appears in the schedule seems overly aggressive. Instead of determining if we need to replace the .JSON, what if we just delay caching the .JSON until some point after the game has concluded? https://github.com/derek-adair/nflgame/blob/master/nflgame/game.py#L307-L309 It would mean pulling the same data from nfl.com over and over again until we hit that magic date/time when we say we're satisfied that changes won't be happening anymore. Just throwing it out there, not suggesting it is the best solution to the problem. |
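The delayed-caching idea reduces to one predicate at the point where game.py decides to write the file; a sketch, with the "satisfied" window as an assumed 24 hours:

```python
from datetime import datetime, timedelta

CACHE_DELAY = timedelta(hours=24)  # assumed window; tune as agreed

def ok_to_cache(kickoff, now):
    """Only persist the downloaded JSON once the game is old enough that
    further corrections are unlikely; until then, always re-fetch."""
    return now - kickoff >= CACHE_DELAY
```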
I don't like it either; in fact I think it's poor design. I've just gotten the impression that we are OK with (like?) the schedule being automatically updated on import, for instance, so I've just kinda followed that philosophy. Personally I don't like it, but I can live with it if it's what the team wishes 🙂
Unclear of me, I was thinking maybe the last x week(s) but didn't write it down. A user initiated function call/script could override this to make checks earlier in time.
Best solution so far, I think. This is easy to do for now and solves the problem. Well done! The only thing I'm not sure about is whether all games get synced at the same time, like TNF vs MNF... so would it be a set time (like Thursdays at 8 AM) or a relative time (like 24 hours after kickoff)? |
Marking on hold b/c #46 will address this if we are lucky |
Also, yes, updating should not be done on import; it should only happen when you run the live script. I've come to see the error in my thinking and understanding of the code base. |
@brbeaird I've updated nfldb to use python3 and have a compatible database w/ (hopefully) all of the 2019 data. |
/feeds-rs/playbyplay/ has data back to 1998. We can use this to verify the accuracy of our play data. I will be writing a script to do this. I'm not sure of the best way to invoke this in real time to ensure accuracy. I'm considering adding config for stuff like this, where you set your preference as a user. config.ini example:
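Purely as a hypothetical illustration (every section and key name here is invented, not an actual nflgame config), such a config might look like:

```ini
[verification]
; re-check finished games against /feeds-rs/playbyplay/ for this many days
verify_post_game_days = 3
; off | on_demand | automatic
verify_mode = on_demand
```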
|
Ooh nice. I’ll check that out when I get a chance. Eventually want to get that in a docker. |
Looking over the data from 2018 week 1, I'm seeing some things that don't quite line up. Check out play 3673 in 2018091001.json. This play was a 2 yard gain for Todd Gurley. However, he is listed further down in one of the sequence nodes with 32 yards.
While this may not be visible in nflgame, that data flows into nfldb as incorrect data, so if you're trying to do any aggregate queries there, they will not be correct.
Any ideas here?