Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEVERE performance issues doing EDDBlink import when DB is large. #126

Open
eyeonus opened this issue Apr 23, 2024 · 69 comments
Open

SEVERE performance issues doing EDDBlink import when DB is large. #126

eyeonus opened this issue Apr 23, 2024 · 69 comments

Comments

@eyeonus
Copy link
Owner

eyeonus commented Apr 23, 2024

Not sure what the problem is, but there's a huge difference in processing time when doing an import using EDDBlink plugin when the database file is large vs. small. To whit:

TradeDangerous.db file size: 6.5 GiB
listings-live.csv file size: 10.3 MiB
Time to completion: ~24 minutes

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 11:14:51.958359
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 11:38:17.200988

versus:
TradeDangerous.db file size: 43.7 MiB (empty StationItem table, otherwise identical to above database)
listings-live.csv file size: 10.3 MiB (Same file as above)
Time to completion: ~7 seconds

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 12:20:00.816731
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 12:20:07.871285

Everything is exactly the same in both cases except for the size of the StationItem table in the database being imported to.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 23, 2024

The problem is cumulative:
During testing, importing listings.csv, a 2.5 GiB file, starting with an empty database, getting from 10% to 12% took under a minute, getting from 70% to 72% took just under 10 minutes, in the same import.

@Tromador
Copy link
Collaborator

Can I suggest we expand this concept:

The server side works.
The client side works (apart from one warning, because a Rares station is in lockdown and doesn't appear in market data).

It works.

So I suggest concentrating on optimization, memory use in any case where that remains an issue and then speed rather than adding any new features. Draw a line under it, call it feature complete until we have something which can process imports and output runs in a more reasonable timeframe.

@kfsone
Copy link
Contributor

kfsone commented Apr 24, 2024

I saw we recently enabled incremental vacuuming and some other sqlite options to improve performance? It's probably worth turning those off locally and investigating SQLite's features for reporting on what the vacuum will do (keywords: sqlite3, pragma, vacuum, analyze, optimize) which may explain the issue - maybe there's an index that needs tweaking or scrapping.

You also made it more aggressive about flushing the cache (closing the db, even) yesterday, to counter the memory bloat that my simple executemany was causing. So this make sit smell like a transaction issue - either an absence of them at some point or an unexpected nesting of them.

Look for manual transactions (execute "BEGIN"/"COMMIT") and switch those to context-managed cursors so that the sqlite3 lib isn't taken by surprise.

I also strongly recommend looking at the apsw module; it's designed as a drop-in-replacement for sqlite3 but less of it is implemented in python so it has a potential to be faster, but I have not bench'd it with our particular use case.

@kfsone
Copy link
Contributor

kfsone commented Apr 24, 2024

https://stackoverflow.com/a/17794805/257645
this is intriguing, it would potentially give us a great way to get the data into the database fast and let sqlite worry about finalization, make it actually earn some of it's keep:
https://stackoverflow.com/a/3167052/257645

I strongly recommend using executemany whenever there's a loop, if possible. If you have to cap it, that's fine, but you save a bunch of overhead.

Did you try running the imports with pyinstrument?

python -m venv venv
. venv/Scripts/activate.ps1   # or . venv/bin/activate on mac/linux
pip install -r requirements/dev.txt
pip install pyinstrument
pip install -e .
cd tradedangerous
pyinstrument ./trade.py import ...

at the end it will give you a bunch of blurb and you can customize how it does that, but I tend to run it like the above and then grab the session from the last two lines:

To view this report with different options, run:
    pyinstrument --load-prev 2024-04-24T10-19-42 [options]

and run something like:

> pyinstrument --load-prev 2024-04-24T10-19-42 -r html >speed.html  # local output
> start speed.html
# or
> pyinstrument --load-prev 2024-04-24T10-19-42 -r speedscope >speed.scope
# and upload that file here: https://www.speedscope.app/

sample speed.scope from an eddblink -O solo: speedscope.zip unzip, hit https://www.speedscope.app/ and drag it into the browse button. use the buttons in the top left to change type of view.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 24, 2024

The only transaction being performed when starting from an empty database is:

        listingStmt = """INSERT OR IGNORE INTO StationItem
                        (station_id, item_id, modified,
                         demand_price, demand_units, demand_level,
                         supply_price, supply_units, supply_level, from_live)
                        VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )"""

Removing the "OR IGNORE" part doesn't have any effect.

eddblink used to use the executemany function after it finished gathering the list of all changes to be made, by appending the values for each change to the list listingsList within the for listing in listings loop, and then running executemany(listingsStmt, listingsList) after exiting the loop, which resulted in about 6 hours (on my machine) of nothing appearing to be happening while the database was being updated.

Changing it to what it is now, where each listing is inserted during processing with execute(listingStmt, {values}) still took about 6 hours, but at least there was some indication of progress.

The pragmas that were introduced knocked it down to a little over 2 hours.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 24, 2024

I've been working on TD for a week straight now, during any free time I had outside of work and sleep. I'm taking a break for a while.

If it ain't a bug, it can wait or someone else can tackle it, is my current mode.

@Tromador
Copy link
Collaborator

The pragmas was me. WAL journal has a massive speed benefit on Linux, though seems less so on Windows and the vacuum keeps the db size to a minimum. The server is set up to vacuum every couple of days and I do want it to do incremental, as a full vac takes a chunk of time. That said, the first vac and optimise on the server's db reduced it from 6GB and change to 4GB and change so it's definitely worth it.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

I hope I didn't sound like I was handing out assignments :( if you can run the pyinstrument command on the server and snag a .scope file it might be really interesting.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

I hope I didn't sound like I was handing out assignments :( if you can run the pyinstrument command on the server and snag a .scope file it might be really interesting.

I hope I didn't sound like I was feeling put upon. I'm just so tired from working on this for a week solid and need a break for awhile.

I'm running pyistrument, I'll post the results for you if you want to have a look at it.

IT definitely seems to be more an issue on my *nix box than on Tromador's Win one.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

Jonathon first, trade calculator for online game later, man.

TD is a vague distant memory for me - I honestly thought one of you added the plugin system after I'd ambled off, and I was kinda wondering wtf the saitek crap was until I looked at one of the files and saw the first line comment. cough

Point: I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me, and while I'm human and will get my knickers in a twist or my butt hurt in the moment when it comes up that I'm responsible for any particular cretinous decision or demonstrate of derp, I'm not a diva: my first job was for a UK "national" contract software company as an embedded engineer, and when the company went belly up I got offers from each customer I'd worked for to go finish development, but I either didn't really "get" the companies or feel adequate, except for one, in my home town, on the fishing docks. So I accepted their offer, which came with a caveat: they weren't big enough to pay for a programmer, they were a cold-store/blast-freezer/fish packing plant. So if it got busy, I worked packing fish, loading/unloading 50lb boxes of frozen shark bellies or halibut. Yeah, I've worked some fancy places since, but I didn't stay.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

In some countries, TD is almost old enough to start putting bread on the table :)

image

@Tromador
Copy link
Collaborator

IT definitely seems to be more an issue on my *nix box than on Tromador's Win one.

We aren't on identical hardware though so although I agree from our testing that Linux and Windows do seem to behave differently, don't forget I have lots of CPU, on the other hand you have more RAM than me. It's hard to simply draw comparisons therefore just based on OS.
The server is another game entirely, running an archaic version of Centos, such that I have had to build TD (from source) its own special environment to run in with its very own sqlite3 and OpenSSL libraries and a super special Python just for it, along with all the environment shenanigans to make it not use the outdated system versions. Not to mention it's kinda short on RAM, but again is free vhost and server hosting isn't cheap.

I to tend to waffle when I am tired, point being, many variables outside of OS. Particularly hardware.

@Tromador
Copy link
Collaborator

In some countries, TD is almost old enough to start putting bread on the table :)

I quite deliberately put Est 2015 at the start of the thread on E:D forums. Other tools may be quicker, flashier whatever, but nothing else can do what we can and nothing else has our stubborn will to survive.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me,

This is my list of things needing doing at some point. The Niceness value is how important I consider each item, with lower being more important just like *nix CPU prioritizing.


1) N = -20

Anything you can do to improve import performance is welcome. On my box, with the current TD, both spansh and eddblink take a bit over 3 hours to do an import. Obviously different machines will take different times, but ideally anything but a potato should have sub 1 hour import duration. My box has an 8-core CPU with 64GB RAM, so it's not a potato. (Also I checked the drive, and it has 45MB/s+ write speeds, but neither import ever went above 5MB/s, so I highly doubt that's the bottleneck for me.)

2) N = 5

The spicing up of the visuals is certainly welcome, I like what you've done there and would be pleased to see the visual improvements you've done to spansh done to eddblink (and the rest of TD) as well.

3) N=-10

TD needs to be able to build the other Tables. We have two reliable sources, those being the daily dumps from Spansh, and the EDCD lists, such as rare_items.csv

We currently use the Spansh data (and EDDN via the listener) to fill out System, Station, Item, and StationItem, and could potentially use it to fill out Ship, ShipVendor, Upgrade, and UpgradeVendor.

We currently use EDCD to fill out FDevOutfitting and FDevShipyard, and could potentially use it to fill out RareItem, Item, and Category.

There's obviously some overlap, so in the case where we can use either, I think EDCD is the better option.
Spansh (and listener): System, Station, StationItem, ShipVendor, UpgradeVendor, Ship, Upgrade
EDCD: FDevShipyard, FDevOutfitting, RareItem, Item, Category

That leaves Added as the only one that can't be easily updated, but as that's only used in System, to indicate (I think?) what version of Elite Dangerous a system was added to the galaxy, I feel like it should individually get N=20

Getting everything populated via Spansh means updating to utilize the information provided by the dump that is currently being ignored, as well as pulling from the EDCD sources when building the DB from scratch or they've been updated.

Getting everything populated via the TD listener means listening to the other EDDN messages besides just the commoditiy ones, and processing them accordingly. Since the listener uses Spansh, this only matters for things which change more often than once a day. The eddblink plugin pulls from Trom's server, so getting this into the listener automatically gets this into eddblink as well.

4) N=-18

Rescue Ships are EVIL to TD, because not only do they move, they also change their name, as they are named after whatever station they are currently rescuing. This results in duplicate entries in the prices file whenever two different rescue ships go to the same station.
For example, "Rescue Ship - Arc's Faith" has two entries with a different station_id because it's been visited by two different rescue ships. The FDev Id for the ships are different, and searching EDSM, I found one of them is listed as being at HIP 17694 \ Hudson Observatory and the other is at Didio \ Laumer Orbital.
The next Spansh dump should have those updates, but even with that, both of the rescue ships are now named differently since neither one is at Arc's Faith anymore. Renaming the station needs to be done so that users don't go looking for the wrong station, and maybe adding the FDev ID to the name to prevent the duplicate station error is a good idea?

5) N=12

Working GUI, as in feature complete with TDHelper.
If you have TD installed via pip, you can run tradegui and see the awful attempt I've got so far.

6) N=20

Update the wiki to account for changes made to TD since the last time it was updated.


I can probably get all that done myself, but I know it'll take a long time and be coded poorly, because I'm entirely self-taught on this stuff. I'm currently in college for computer science, so I'm better at this stuff than I was when I picked up the torch, but I'm still learning, and most if I'm learning the hard way.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

Which reminds me, I set this to run before I left for work.
speed.tar.gz

@Tromador
Copy link
Collaborator

I'd venture to add. Import performance is the biggest problem as eyeonus says. No doubt. But...

I reinstalled E:D being as we had working TD. I guess I was feeling nostalgic. So I have been using it in anger and there are in fact two things which leave me twiddling my thumbs. The first is absolutely importing data. The second is exporting data, by which I mean getting query results from trade run. The size of the dataset has grown in just the same way for making queries, so that the tree of possibilities which must be examined has become vast. This can be mitigated by filtering down, excluding by age for example, or no odyssey is a good one, but if you want to do a full check with a huge hold, enough money to fill it with whatever, and a jump range sufficient to cross the bubble? It's not usable. I am taking to lying about my real jump range to constrain the tree size.

So yes. Import speed does need optimization, but so does the output.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?

GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.

So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.

Juts thinking out loud here: We pay an entry fee for storing our data: every operation we need to do has to be expressed textually as SQL, parsed by the SQLite code, sqlite has to plan what to do and then execute on it. Lot of cpu cycles just to start reading. The ordinary payoff is a sort of storage-jit: tell the engine what you want in the abstract and it can figure out the fastest way to do it. We're not doing elaborate queries with sophisticated multi-table joins, groups, sorts, wheres, etc.

I mean, SELECT * FROM System is actually better for us than SELECT * FROM System ORDER BY .... We don't have geospatial indexing turned on, and the SQLite options there would hurt us anyway.

Spitballs, consider each individually not as a sequence:

  • Incremental imports

for system in systems:
INSERT system INTO SystemIncoming
for station in stations:
INSERT station into StationIncoming
executemany (
INSERT item INTO ItemIncoming FOR item in station.items
)
if thread: thread.join
thread = Thread( move *incoming -> * )

Using the temp table means we can avoid either building our own list or the cost of per-item queries or database indexes while we're doing this work.

We can also parallelize the work here; we fully populate the temporary table without index overheads etc, and then have sqlite move things over. Now we're making SQL pay for itself because the query planner/executor can use its own internal data about quantities and stuff to do the transfer.

That's the theory - I don't know if it's true for SQLite.

  • Persistent database

We intuitively get why a movie ticket costs $25 when renting the film costs $5, but TD is paying $25 to rent. SQLite's benefit is that you don't need a server, it's con is that you dont' get a server doing all the heavy lifting in another thread/process or possibly on a different cpu. That's a fundamental part of the hitch here. We've turned on indexes and joins. We've started doing vacuums and the problem is they can only happen while there's a process running, and we're it.

We could: a) have TD split itself in two, a TradeAdder that implements TradeDB and soaks up the costs of keeping the database fresh, and a TradeGecko that shuttles stuff back and forth between the user and the backend; not entirely implausible, mostly separating concerns and building facades/bridges b) Make it possible or even a default to have a persistent database spun up one of several ways - docker, podman, vm, local or remote url and use that, with sqlite as a naive fallback

This is where we might look to something like SQLAlchemy, PeeWee or pony. It's also where we might actually change some of the operations encumbered on TradeDB so that it doesn't have to in-memory everything.

  • Persistent td/REPL

A variant of (a) in the above, we literally just make TD keep itself alive over probably a non-sql persistant store (either embedded or a second process; key value or nosql: leveldb, rocksdb, memcache) and then have a very simple api for talking to it that it can use on itself; either thru json-rpc with something like flask/eel or literally have it read stdin and accept/process trade commands and pass them to the existing command-line arg parser:

> trade.py repl
trade@SOL$ import -P eddb -O solo

via pseudo code:

for line in stdin:
  args = ["trade.py"] + split_args(line)  # deals with quotes etc
  main(args)  # <-- the old main, which can't tell they didn't actually come from the command line
  • async or threading

I don't know off the top of my head where we can leverage this right now to any good use; I created an experimental Rust-based python-package providing a rust version of the "count lines in file" method. That's a hard one to optimize because cpu isn't the main bottleneck. I did a version where it would on startup create a number of counting workers, and then you simply gave them byte-regions to count from.

And there was no version where it wasn't slower with concurrency. I wasn't exhaustive by any means, but you're trying to apply a very simple set of logic and rush thru memory, so there's not a lot of juggle room.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

Which reminds me, I set this to run before I left for work. speed.tar.gz

that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server?

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

Which reminds me, I set this to run before I left for work. speed.tar.gz

that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server?

That's not the server, that's my machine.

Tromador would have to provide you with any server runs.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?

??? What requests.txt file?

GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.

I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.

So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.

Spitballs, consider each individually not as a sequence:

* Incremental imports

Since the speed of the import goes down as the table gets bigger, this sounds like a great idea. The vast majority of the information goes into StationItem: all the other tables, completely filled with the spansh data, don't amolunt to even .5GB, but add in the market data and it grows to ~7GB, so while I doubt we'd see much improvement doing this on any of the other tables, it might be a big help for that particular one.

* Persistent database
* Persistent td/REPL

I"m sorry, that's all over my head. I defer to your expertise.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

| I"m sorry, that's all over my head. I defer to your expertise.

Oh, just to make you regret saying that:

| ??? What requests.txt file?

... was a typo of "requirements.txt". What I mean't was, in order for anyone to pip install Trade-Dangerous, it depends on requirements.txt on its own, right?

Duh. Gonna check that real quick, dum-me:

image

so it should be safe to remove all that crap screwing around to make sure requests is installed (in transfers.py), haha

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 25, 2024

Oh. Yes.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

| I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.

I don't have useful input on GUIs really; 'gradio' is - I believe - what people are using to build web-based-gui-apps for things in python, so it's not sexy but you spend a lot more time on what the gui elements do than on the screen itself. So, for instance, "StableDiffusion" has a gradio-based UI, as does the popular "run your own local GPT" https://github.com/oobabooga/text-generation-webui... https://www.gradio.app/ Yeah, I know it says machine learning, but they're pitching to their new market because of those last two.

I've used flask for building UIs before but then you need to build an entire front-end application in angular or react or cut-my-wrists-for-me-now-please.

If you're going to be old school about it and do a desktopy app; wxWidgets is shit, but you can keep it relatively simple and use 'wxFormBuilder' to design the app - I've done that a few times lately: https://github.com/wxFormBuilder/wxFormBuilder.

So pySimpleGui might be rockstar - check that out before wxWidgets, because I hate it as much as I use it.

You could conceivably do something with Dear imgui - I found it easy enough to use I wrote wrappers for it in C++ (https://github.com/kfsone/imguiwrap) when I used it a few years ago, the python version is ok, but it's best use case imho is for in-game debug screens/menus etc.

So I'll look at doing some experiments on the item import and how it is currently working.

How often have frontier been adding items?

I wonder if it would make sense to actually alter the row format of StationItem so that there are two rows per station: one for selling, one for buying.

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

Right now I'm trying a little experiment that reintroduces a per-station cursor; we're not building a list of items for executemany, but we're also not doing the full cost of adding each and every item in its own mini-transaction. Also breaking up the code a little-bit where we're not invoking methods in loops and we can afford the extra 40-50ns it takes to call a method in python lol.

So for right now I think it would be useful for us to look at:

  • apsw instead of sqlite3; it pulls some tricks like automatic stored-procedures that we might benefit from where we're not using executeman. replacing sqlite3 across the whole program will take a bit of work, but just replacing it in one plugin might be fairly easy for benchmark purposes.

  • incremental import, i.e. write all the new entries into a pristine, unindexed table, and then let sqlite transfer them. could either be radically worse, since its doing "more" work, or radically faster because it's the database doing that work and the overheads we pay are supposed to make exactly that kind of thing faster...

  • alternative storage backend; perhaps literally take eddblink_plug and replace the sqlite3 calls with something that talks to postgres or mysql and see what it's like having it dump the data into that? not a full transition or anything, but just a naive "what if"

@kfsone
Copy link
Contributor

kfsone commented Apr 25, 2024

Win|PS> python -m apsw .\data\TradeDangerous.db
SQLite version 3.45.3 (APSW 3.45.3.0)
Enter ".help" for instructions
sqlite> select count(*) from item;
┌──────────┐
│ count(*) │
│ 395      │
└──────────┘
sqlite>

forget the row-shape idea then lol

@kfsone
Copy link
Contributor

kfsone commented Apr 26, 2024

Just so you know - I'm not remotely above figuring out how to get an AI to help me, but I'm also not expecting it to be actual help. However... Giving it the ability to read the repository definitely makes it more useful:
https://github.com/kfsone/ggama

@kfsone
Copy link
Contributor

kfsone commented Apr 26, 2024

$ ls -1sh /dev/sda /dev/sdb
/dev/sda 1PB
/dev/sdb 15MB
$ rm -rf pride  # surplus to requirements
$ mv /sdb/shame /sdb  # storage upgrade required

image

@kfsone
Copy link
Contributor

kfsone commented Apr 26, 2024

Do we care about the added column any more?

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 26, 2024

Do we care about the added column any more?

I gave the Added TABLE an N=20

I never cared about that column. :D

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 26, 2024

Just want to say, the recent PR is definitely an imrpovement. Even with the additional 'optimization' step, it finishes processing both listings and listings-live faster than the old way could get through just listings.

I'm happy, and extremely grateful.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

@ultimatespirit appreciate the diligence and the detailed analysis; apologies to both of you for the churn I've caused lately - @eyeonus I'll try to do better about making less work for you man. I wanna make sure you don't think I'm gonna be a diva if you have to put a flea in my ear or point out I'm a nub.

| Oh. So the filesystem-eliminates-empty-pages thing, and the hash-bucketing thing, wouldn't eliminate the empty bits in that case? Or am I misunderstanding something?

The eliminates-empty-pages thing would be "spare files", but when I say "empty pages" I'm thinking in terms of a handful, as opposed to the file starting with 3906 empty pages (pages required to represent the 128000 unused IDs at the start of the ID-space).

But I'm spitballing there, the hash-bucket has pros-and-cons, primarily that it requires a computation to get from actual-fdev-id to bit-no in the bitmap, and after doing it the resulting bitmap is only good for a one-way lookup: the "stations selling this item" bitmap on an item object can be used to go from a station id to do a quick bit-check of "does this station have this item", but it might not be possible to do the reverse, to say "bit 40401 is set, what does that mean?" because there isn't guaranteed to be a meaningful way to reverse the algorithm and calculate what station id bit 40401 represents.

I did some experimentation to try and find algorithms that might work but at the end of the day I think it will be easiest to just have some kind of per-instance hidden column or table that collapses fdev_ids into near-sequential smaller numbers. I'll table those for an actual presentation of implementation and thinking when I have opportunity to put it together.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

The reason there's no progress display during the listings.csv download is that @Tromador's server isn't giving a size on the download, and the transfers.py code decides not to try and represent the progress.

The rich progress.py changes I have in the "import-via-table" branch at least has a go at this, and we can improve on it:

image

and the Processing stage is night-and-day. The nice thing about using rich is that the progress stuff happens in a thread so you aren't bleeding away performance in the tight loop.

Recording.2024-04-28.213320.mp4

It also uses these to let you know it's still working when it gets to rebuilding the index at the end, too.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or sometihng)

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

image

@ultimatespirit
Copy link
Contributor

The eliminates-empty-pages thing would be "spare files", but when I say "empty pages" I'm thinking in terms of a handful, as opposed to the file starting with 3906 empty pages (pages required to represent the 128000 unused IDs at the start of the ID-space).

Not sure what the pattern is for fdev IDs, but you can remove the starting empty spots by just reducing all IDs by whatever the lowest fdev ID is. I.e. if the IDs start at 128,001 just offset all IDs by 128,000 as your zero index. From there though it's still not necessarily efficient to use a bitmap, are IDs always increasing and densely packed? If they aren't, and there are holes in the range, you'd need to non-trivially map them by order, 128,001 -> 0, 128,010 -> 1, etc. and that has no easy conversion back that doesn't involve an array access at least (have an array of IDs matching indices to bitmap positions). Which then runs into the next question, if it's sparsely populated does fdev fill in holes later? Because if they ever fill in a hole well... you have to completely rebuild that mapping, not fun to update an existing database to handle, though perhaps a worthwhile one-time cost if it has enough performance benefits and is considered a rare event.

The reason there's no progress display during the listings.csv download ...

Also, in case that was in response to my talking about progress above, I was actually referring to the DB rebuild steps / prices file processing steps, less so the downloads. But your prototype for more progress messages looks great. I noticed that new UI when spinning up the listener earlier today (though had to turn it off shortly after as it seems to download the spansh dump every hour even if it was listening to the EDDN the entire time in between, which is what spansh uses too?? I'm guessing there's some edge case requiring that...).

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

| I was actually referring to the DB rebuild steps / prices file processing steps,

The 10s video I posted is actually it processing the csv file, not downloading it.

I didn't record the tail part because that's the part I'm working on, and at the moment it just sits there for 25 minutes showing this:

image
and then a few more showing this:
image

(but the spinner is animated the entire time as is the timer)

@eyeonus the cause of this is these two indexes in particular:
image

I can look at splitting the data into two tables so it doesn't need that index, if that's useful. It'll mean my doing some diligence on the various commands that access price information from the StationItem table to get it right.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

| Not sure what the pattern is for fdev IDs, but you can remove the starting empty spots by just reducing all IDs by whatever the lowest fdev ID is. I.e. if the IDs start at 128,001 just offset all IDs by 128,000 as your zero index. From there though it's still not necessarily efficient to use a bitmap, are IDs always increasing and densely packed? If they aren't, and there are holes in the range, you'd need to non-trivially map them by order, 128,001 -> 0, 128,010 -> 1, etc. and that has no easy conversion back that doesn't involve an array access at least (have an array of IDs matching indices to bitmap positions). Which then runs into the next question, if it's sparsely populated does fdev fill in holes later? Because if they ever fill in a hole well... you have to completely rebuild that mapping, not fun to update an existing database to handle, though perhaps a worthwhile one-time cost if it has enough performance benefits and is considered a rare event.

Yeah, that's why I think something akin to row-id is "good enough". It's sole purpose is as a guide for the local bitmap - which is not something that should ever get shared, it's entirely local stateful stuff.

My import just got to the regenerating .prices file part, which doesn't have a progress bar,
image
but the way the rich stuff works let me get rid of a bunch of code while also adding a context-manager styling for increasing the use of progress bars, so you do something like:

with progress.Progress(None, 40, style=pbar.CountingBar) as pbar:
... regenerate prices()

and that's enough to add a concurrent progress spinner/time-elapsed bar.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 29, 2024

(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or sometihng)

That's certainly possible, the header we pass is headers = {'User-Agent': 'Trade-Dangerous'}

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 29, 2024

I'm fine with using a row_id if you think it'll help things. StationItem currently uses (station_id, item_id) as the PK, so adding a row_id to the table shouldn't bork anything, except maybe the two unique ids thing with the export?

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

They would be something totally transient/private to the data set, almost even to the particular generation. I've been running various experiments/investigations but I probably need to actually create a couple of branches in my fork and try some of them out in more detail.

for instance - I took the hash-bucket thing for a spin, trying to find if there are numbers that would actually work for us, and I actually found some cheap algorithms for transforming the range of fdev_ids in my local Station table into a 0-N range of unique numbers that essentially required 3-bits of bitmap per station to be able to create a bitmap for each item saying "these are all my stations" - so the indexes we currently have could be condensed down from 64-bits-per-station-vs-item to ~3-bits-per-station on each item's row.

I ran an import, which added a station ... ran my test again, and the new station collided with another id after running the algo again. son of a...

there's a python version here https://gist.github.com/kfsone/f7f7de6d7c0216cf44cdf97b526a474e and a rust version (cause the python version was gonna take several hours to run, I should have tried it under pypy instead) https://github.com/kfsone/tinker/tree/rust

---------- it's late, I'm blathering ... sorry.

Current big hurt has to be the two indexes that have to present on the StationItem table to make it practical for buy/sell to distinguish only rows important to them. At a smaller scale, it was actually better having one table with sell/buy prices because you avoided a join, but now it forces us to maintain a massive conditional index. power hurt assemble

So I can focus on splitting that into two tables on that import-via-table branch over the next several days, if you like. It currently has the new progress bar stuff in it because I wanted/needed the extra visibility while running the long tests.

Rich is pretty frickin' cool; https://rich.readthedocs.io/en/stable/ it brings a ton of power to the table, down to the ability to generate a recording of an apps output, and almost all the capabilities are packaged behind self-demoing modules:
python -m rich.progress
python -m rich.live
python -m rich.table
python -m rich.logging
python -m rich.tree
python -m rich.spinner
python -m rich.status
python -m rich.layout
use it as a terminal markdown viewer:
python -m rich.markdown README.md

I had written various bits and pieces of this stuff into TD early on, but I'm awful at UI so none of it is great, and switching it over to using rich should improve quality/appearance/speed and reduce the amount of maintainable code at the same time.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 29, 2024

That all sounds great. Feel free to work on the two table thing, and if it turns out to be much better than the current implementation, I consider that a win.

@Tromador
Copy link
Collaborator

Tromador commented Apr 29, 2024

(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or sometihng)

So long as it's an old school Routemaster, I will happily be under the bus.

The reason the server doesn't send content length header is because it streams listings.csv as gzipped content and because it's doing the compression on the fly, it doesn't actually know how the size of the data it's ultimately going to send.

I guess listener could gzip listings.csv and then apache would know how much data it's sending?

I have a terrible feeling of deja vu here. See, I have sarcoidosis and one of the many symptoms is brain fog and I have become much more forgetful than I once was, but I have an itch in the back of my brain which suggests we've been here before. Why did we choose to have the webserver do on the fly compression instead of just creating compressed files?

@Tromador
Copy link
Collaborator

I guess if listener compressed the files, apache sent listings.csv.gz then eddblink will have to uncompress it.

As it is, the download is handled by urllib, which reads the headers, and uncompresses on the fly as the file downloads. Ultimately it's about saving a lot of bandwidth (and thus download time) given the innate compressibility of text.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

Interesting: We do an initial probe for the timestamp of the file using urllib.request using POST:
image
so I could capture the "actual length" to give me a guide.

But I should probably also eliminate that separate query, because it means we have two - possibly international - SSL-based queries going on (it takes about 2s to determine if there's an update from here in CA) and worse, you actually start sending the file - so it's doubling the bandwidth use (and why the actual downloads start slow for the end user)

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

I can save a bunch of bandwidth here by switching from urllib.request.request_open to requests.head:

image

and this is actually good stuff for the downloader, because we actually want the uncompressed size, since the downloader never sees the compressed data in the first place.

Win win win.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

I name this branch: make imports lightning fast

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 29, 2024

Interesting: We do an initial probe for the timestamp of the file using urllib.request using POST...

Yeah, that's totally my fault. I didn't know any other way to do it.

@Tromador
Copy link
Collaborator

But I should probably also eliminate that separate query, because it means we have two - possibly international

Server hosting kindly donated by Alpha Omega Computers Ltd who are in the UK. Until I got sick, I was a company director there. The deal is that I get free hosting and they get to call me to ask dumb questions once in a while.

@Tromador
Copy link
Collaborator

I can save a bunch of bandwidth here by switching from urllib.request.request_open to requests.head:

Pretty sure that 'text' is not a valid encoding standard. IANA HTTP content coding registry.

If it is allowed (or works anyway) make sure you don't use it for the download as you will be telling the server you only grok plain text and can't accept compressed content and it will presumably comply, sending the whole thing in the requested, uncompressed format.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 29, 2024

I can save a bunch of bandwidth here by switching from urllib.request.request_open to requests.head:

Pretty sure that 'text' is not a valid encoding standard.

I tested it, and it didn't give any errors. Progress bar showed up and everything.

@kfsone
Copy link
Contributor

kfsone commented Apr 29, 2024

@Tromador Just wasn't sure how to tell it to not encode. And this is only for an HTTP HEAD request:

requests.head(url, headers={...})

Should probably give transfers.py the ability to open and maintain a "Connection" so that it can turn around the queries more rapidly - at the minute it has to do the https handshake on all of them which takes time and costs cpu/bandwidth it doesn't strictly need to, but I don't think it's hurting anyone atm so nice +19 :)

@Tromador
Copy link
Collaborator

@Tromador Just wasn't sure how to tell it to not encode.

I think you want 'identity' then, which says "I want the original with no encoding whatsoever".

@eyeonus It may work, but that doesn't make it right, I have been through the RFC and it refers right back to the list in the link I gave above. Doing things correctly per standards is important for the day such a loophole is plugged in a patch and then suddenly it doesn't work. I had a good trawl on the net looking for "text" in the context of an "accept-encoding" header and can't find it. If you can, I am happy to be educated.

@eyeonus
Copy link
Owner Author

eyeonus commented Apr 30, 2024

No, in this case I'm the one to be educated.

@kfsone
Copy link
Contributor

kfsone commented Apr 30, 2024

No! It were me as wuz educated. Ready for a round of "wen ah were a lad?"

@Tromador
Copy link
Collaborator

No! It were me as wuz educated. Ready for a round of "wen ah were a lad?"

Well, I am a Yorkshireman :)

@kfsone
Copy link
Contributor

kfsone commented Apr 30, 2024

I'm from a place called Grimsby which is neither Up North, Down South, Off East or Out West. It has its own ordinality: Theranaz. People from Grimsby are Up-In Theranaz.

GRIMSBY (n.)

A lump of something gristly and foultasting concealed in a mouthful of stew or pie. Grimsbies are sometimes merely the result of careless cookery, but more often they are placed there deliberately by Freemasons. Grimsbies can be purchased in bulk from any respectable Masonic butcher on giving him the secret Masonic handbag. One is then placed correct masonic method of dealing with it. If the guest is not a Mason, the host may find it entertaining to watch how he handles the obnoxious object. It may be (a) manfully swallowed, invariably bringing tears to the eyes. (b) chewed with resolution for up to twenty minutes before eventually resorting to method (a) (c) choked on fatally.

The Masonic handshake is easily recognised by another Mason incidentally, for by it a used grimsby is passed from hand to hand. The secret Masonic method for dealing with a grimsby is as follows: remove it carefully with the silver tongs provided, using the left hand. Cross the room to your host, hopping on one leg, and ram the grimsby firmly up his nose, shouting, 'Take that, you smug Masonic bastard.'

-- Douglas Adams, Meaning of Liff

@eyeonus
Copy link
Owner Author

eyeonus commented Jun 10, 2024

@kfsone I'm in the process of making the spansh plugin capable of filling the Ship, ShipVendor, Upgrade, and UpgradeVendor tables from the spansh data.

Because of the data that is (not) provided by spansh for the ships (cost) and upgrades (weight, cost), I'm wanting to change those tables to include the information that is avaialble.

Current testing indicates that the new code works as expected on a clean run (no pre-existing DB), but with an already existing DB the tables won't match, which borks everything.

Do you know if TD already has code to do this?

If not, where would be the best place to have TD check if the DB needs updated, and if so drop the old table and add the new version? I could put it in the plugin itself, but it might be better to put it somewhere in tradedb or tradeenv?

@eyeonus
Copy link
Owner Author

eyeonus commented Jun 10, 2024

Ideally, I'd like to see if ./tradedangerous/templates/TradeDangerous.sql matches ./data/TradeDangerous.sql and if not, apply the difference from the template to the DB. (So for example if the Ship table in the template doesn't match, TD will drop the DB Ship table and create the template's Ship table in the DB)

@kfsone
Copy link
Contributor

kfsone commented Jun 13, 2024

Yes there's a schema-migration system, but it's not particularly smart; basically it operates on the sqlite "pragma user_version". Take a look in cache.py.

In a bit of karmic bloody-nosing while I was working on it (4 weeks ago?), we suddenly needed to change our sqlite schema at work for the first time in 15 years, and I ended up writing a more advanced version there, but I'll see if the boss is OK with me donating it.

From the runbook here's a "change 3" which is recovering from a previous mistake but compounds it, and change 4 that fixes things finally.

MIGRATIONS = {
    3: [
        {
            "if": "SELECT COUNT(*) FROM pragma_table_info('History') WHERE name='Preview'",
            "eq": 0,
            "then": "ALTER TABLE History ADD COLUMN Preview BOOL;"
        },
    ],
    4: [
        "DELETE FROM History WHERE HighWaterMark IS NULL OR HighWaterMark < 1;",
        "ALTER TABLE History ADD COLUMN NewWaterMark FLOAT NOT NULL DEFAULT 0;",
        "ALTER TABLE History ADD COLUMN NewPreview BOOL;",
        "UPDATE History SET NewWaterMark = HighWaterMark, NewPreview = Preview;",
        "ALTER TABLE History DROP COLUMN HighWaterMark;",
        "ALTER TABLE History DROP COLUMN Preview;",
        "ALTER TABLE History RENAME COLUMN NewWaterMark to HighWaterMark;",
        "ALTER TABLE History RENAME COLUMN NewPreview to Preview;",
    ]
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants