
EDDB will soon cease operations #110

Open
bgol opened this issue Apr 2, 2023 · 121 comments

Comments

@bgol
Contributor

bgol commented Apr 2, 2023

In case you didn't notice:
https://forums.frontier.co.uk/threads/eddb-a-site-about-systems-stations-commodities-and-trade-routes-in-elite-dangerous.97059/page-37#post-10114765

@eyeonus
Owner

eyeonus commented Apr 2, 2023

Well that's not helpful.

@Meowcat285

Meowcat285 commented Apr 11, 2023

EDDB has now shut down. Are there any plans to update TD to use something else, like Inara for example?

Edit: It looks like Inara doesn't have an API for exporting data.

@eyeonus
Owner

eyeonus commented Apr 11, 2023

Working on it.

@Tromador
Collaborator

For now TD is working, but it uses the stations and systems from the day EDDB died. That said, the first phase of the server work for this change is now functionally complete: we are producing our own listings.csv and continue to produce listings-live.csv as normal. The next step is dealing with new systems and stations, as those need entirely new code to handle them so they are imported correctly.

@aadler
Contributor

aadler commented May 21, 2023

Late to the party, here, but would it be possible to pull from EDSM, Inara, or even Spansh? The first two rely on EDDN, as did EDDB. Perhaps there is a way to hook into their feed.

@eyeonus
Owner

eyeonus commented May 21, 2023

I don't think any of them have a means of obtaining bulk data. I know I looked at this back when the end of EDDB was first announced, and things didn't pan out. IIRC, one of the places I looked at, I think it was Inara, did have an API, but it was for single queries only, as in "give me the data for this station", so that wouldn't work.

I would love to be wrong about this, because figuring out how to do it ourselves sucks.

@aadler
Contributor

aadler commented May 22, 2023

I'm not an expert in the slightest, so I don't know if this is even feasible, but can @Tromador read and aggregate EDDN's commodities feed for pricing purposes? Start with what we have now and update hourly/daily from an EDDN feed. I'm pretty sure Inara does this. Perhaps EDSM can be approached for system information. Or we can ask @spansh (I believe that's Gareth) if we can download his data dumps for systems.

What is the major problem, not having an authoritative source for ships, modules, components?

@eyeonus
Owner

eyeonus commented May 22, 2023

I'm not an expert in the slightest, so I don't know if this is even feasible, but can @Tromador read and aggregate EDDN's commodities feed for pricing purposes?...

This is what we have now. Tromador's server runs a Python script that does exactly that; that's how listings-live.csv was generated before the end of EDDB, and since the "first phase" Tromador mentioned, it's also how the listings.csv file is generated.

For details: https://github.com/eyeonus/EDDBlink-listener

What is the major problem, not having an authoritative source for ships, modules, components?

Yes. As far as market data is concerned, we've got that covered. However, we have no means of updating TD with new anything; new commodities, new stations, any of it. (Actually I think I did make it so if new commodities show up they do get added to the DB, but I'm not certain, and I'm too lazy to look right now.)

For some things, like rare items or ship modules, that's not a big problem, because new ones don't get added to the game very often, so adding them manually isn't a huge deal, even if it would be nice to have it all done automagically.

Basically, right now we can get all the information contained in an EDDN commodity message and process it for inclusion in the DB, but some information TD needs isn't contained in that message, so the script also needs to start processing other EDDN message types.

For example, if we want to know which star system a station is in, we need to process the Docked event from a Journal message.
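
For anyone who wants to poke at that, here's a minimal sketch of subscribing to the EDDN relay and pulling the system/station pair out of journal Docked events. It assumes the public relay endpoint and the pyzmq library; the field names follow the journal schema, but treat the details as illustrative rather than the listener's actual code.

import json
import zlib

import zmq  # pyzmq

EDDN_RELAY = "tcp://eddn.edcd.io:9500"

def docked_events():
    """Yield (star_system, station_name) pairs from EDDN journal Docked events."""
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to everything, filter ourselves
    sub.connect(EDDN_RELAY)
    while True:
        # EDDN messages arrive as zlib-compressed JSON envelopes
        envelope = json.loads(zlib.decompress(sub.recv()))
        if "journal" not in envelope.get("$schemaRef", ""):
            continue
        msg = envelope.get("message", {})
        if msg.get("event") == "Docked":
            yield msg["StarSystem"], msg["StationName"]

if __name__ == "__main__":
    for system, station in docked_events():
        print(f"{station} is in {system}")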

@eyeonus
Owner

eyeonus commented May 22, 2023

I'm not going to lie, my life is in a bit of an upheaval right now, so I haven't had time to work on this very much.

If anyone who reads this wants to take a crack at it please feel free.

@aadler
Contributor

aadler commented May 22, 2023

Completely understood; real life comes first, second, and third. We're extremely grateful for the work you (and @Tromador and @bgol and @kfsone ) have done to make our lives both easier and more fun.

@EyeMWing

EyeMWing commented Dec 28, 2023

Is there any interest in bringing this back? @eyeonus @Tromador in particular.

I've got the most egregious problems with eddblink_listener hammered out and running on my machine, and I'm working on a mechanism to replay the archived EDDN streams to load the data from between when EDDB went down and now. That should get TD up and going with the old (EDDB-era) star systems.

After that, I don't think it would be a very big lift to get a star system feed out of the EDDN journals. I haven't actually looked at the guts of TD to see what else it might need.

@Tromador
Collaborator

Tromador commented Dec 28, 2023 via email

@eyeonus
Owner

eyeonus commented Dec 28, 2023 via email

@aadler
Contributor

aadler commented Jan 25, 2024

I note that @spansh (https://github.com/spansh), of neutron plotter fame, now has market data. Here is an example. He also has system data dumps. Should we reach out to him to see if there is anything TradeDangerous can leverage?

@eyeonus
Owner

eyeonus commented Jan 25, 2024

We could potentially use the dumps https://spansh.co.uk/dumps
I haven't looked at them yet, but nightly dumps are what we used from EDDB, so....

Also, whatever happened to @EyeMWing? I expected to see a PR from them at some point.

@rmb4253

rmb4253 commented Jan 30, 2024

I don't have the know-how to help with this in any way, but I'm so pleased that TD has not been completely forgotten. I could probably help with testing as a user, though.

@Clivuus

Clivuus commented Jan 30, 2024

I have been trying to update TDHelper every time I play Elite Dangerous Odyssey, but it seems to have stopped updating about 7 months ago. Hopefully something will happen soon. I would also be happy to help with testing a new and improved version.

@eyeonus
Owner

eyeonus commented Jan 30, 2024

TDHelper is run by somebody else; it's not something I have anything to do with.

@spansh

spansh commented Jan 31, 2024 via email

@Tromador
Collaborator

I'm more than happy to help populate data. We have the new system dumps at https://spansh.co.uk/dumps which
are purely system data (no bodies). However if you also want station data you can grab the full galaxy file though
that's probably a little large for players to download.

Thanks for the offer of support. Big files don't really scare me. Potentially we can have the server grab it and output something smaller for clients. I always used to have the server configured to grab and hold more data than the average user would download, at least by default (they could still grab it via options if they really wanted it).

I too was hoping for this PR from @EyeMWing. That said, with @spansh willing to help with a reliable data source, I am willing to run up the TD server on current code, on the assumption we can start looking again at some of the long-standing issues - I mean it does work, but there were some niggles.

Assuming we do that, I would ask for patience (especially from @eyeonus 🙂); it's been a very long time since I looked at this, and the brain fog from my illness and associated meds will likely have me going over old ground previously discussed as though it never happened. I know this can be a little frustrating at times; it certainly annoys me when I know my cranium isn't firing on all cylinders.

@EyeMWing

EyeMWing commented Jan 31, 2024 via email

@Tromador
Collaborator

@EyeMWing You posted over a month ago that you had some time "this evening". Please can you have a think and honestly decide if you have the time and inclination to do this work. If you don't, that's fine, everything here is voluntary. We'll decide if/how we want to proceed without you and that's ok.
Conversely, if you still intend to put in your promised PR, please can you do so? It's not really fair to tell us you have solutions for these problems (you said in December that you had code running and working on your system) and then never send the PR. Perhaps if you've lost interest, you might simply send what you have so we can look it over and use it?

@lanzz
Contributor

lanzz commented Mar 14, 2024

I had a bit of free time today, so I put together a quick parser for @spansh's dump files. I did some (very cursory) research into fast JSON parsers and settled on cysimdjson. It can ingest the 8.8GB (uncompressed size) galaxy_stations.json in about 23 seconds on my M1 Pro Macbook (without doing anything with the data, that's just load time). It does process the input line by line to avoid needing insane amounts of memory, which means it makes some assumptions about the format of the galaxy dumps, namely that each system is on a single line, and that the first and last lines of the JSON are the opening and closing square brackets.

Here it is as a proof of concept:

import cysimdjson
from collections import namedtuple

DEFAULT_INPUT = 'galaxy_stations.json'

Commodity = namedtuple('Commodity', 'name,sell,buy,demand,supply,ts')

def ingest(filename):
    parser = cysimdjson.JSONParser()
    with open(filename, 'r') as f:
        f.readline()    # skip over initial open bracket
        for line in f:
            line = line.rstrip().rstrip(',')
            if line == ']':
                # end of dump
                break
            system_data = parser.loads(line)
            yield from _ingest_system_data(system_data)

def _ingest_system_data(system_data):
    for station_name, update_time, commodities in _find_markets_in_system(system_data):
        yield f'{system_data["name"]}/{station_name}', _ingest_commodities(commodities, update_time)

def _ingest_commodities(commodities, update_time):
    for category, category_commodities in commodities.items():
        yield category, _ingest_category_commodities(category_commodities, update_time)

def _ingest_category_commodities(commodities, update_time):
    for commodity, market_data in commodities.items():
        yield Commodity(
            name=commodity,
            sell=market_data["sellPrice"],
            buy=market_data["buyPrice"],
            demand=market_data["demand"],
            supply=market_data["supply"],
            ts=update_time,
        )

def _find_markets_in_system(system_data):
    for station in system_data['stations']:
        if 'Market' not in station.get('services', []):
            continue
        if not station.get('market', {}).get('commodities', []):
            continue
        yield (
            station['name'],
            station['market'].get('updateTime', None),
            _categorize_commodities(station['market']['commodities'], ),
        )

def _categorize_commodities(commodities):
    commodities_by_category = {}
    for commodity in commodities:
        commodities_by_category.setdefault(commodity['category'], {})[commodity['name']] = commodity
    return commodities_by_category

if __name__ == '__main__':
    print('#     {name:35s}  {sell:>7s}  {buy:>7s}  {demand:>10s}  {supply:>10s}  {ts}'.format(
        name='Item Name',
        sell='SellCr',
        buy='BuyCr',
        demand='Demand',
        supply='Supply',
        ts='Timestamp',
    ))
    print()
    for station_name, market in ingest(DEFAULT_INPUT):
        print(f'@ {station_name}')
        for category, commodities in market:
            print(f'   + {category}')
            for commodity in commodities:
                print('      {name:35s}  {sell:7d}  {buy:7d}  {demand:10d}  {supply:10d}  {ts}'.format(
                    name=commodity.name,
                    sell=commodity.sell,
                    buy=commodity.buy,
                    demand=commodity.demand,
                    supply=commodity.supply,
                    ts=commodity.ts,
                ))
        print()

That POC prints the result in Trade Dangerous's .prices format, but it is intended to provide the data in a programmatically convenient way, so it doesn't necessarily need to pass through a conversion step; Trade Dangerous could potentially just load the prices directly from the galaxy dumps.

@spansh

spansh commented Mar 14, 2024 via email

@lanzz
Contributor

lanzz commented Mar 15, 2024

The cysimdjson library that I went with is also supposed to wrap the same underlying JSON implementation (simdjson), but I'll benchmark pysimdjson tomorrow.

@lanzz
Contributor

lanzz commented Mar 15, 2024

I've fixed a bug (it wasn't picking up surface stations), so ingestion times have now jumped to the 50-70 second range.
Here's the latest iteration, supporting both cysimdjson and pysimdjson:

import cysimdjson
import simdjson
import time
from collections import namedtuple

DEFAULT_INPUT = 'galaxy_stations.json'
DEFAULT_PARSER = cysimdjson.JSONParser().loads
ALT_PARSER = lambda line: simdjson.Parser().parse(line)

Commodity = namedtuple('Commodity', 'name,sell,buy,demand,supply,ts')

def ingest(filename, parser):
    """Ingest a spansh-style galaxy dump and emits a generator cascade yielding the market data."""
    with open(filename, 'r') as f:
        f.readline()    # skip over initial open bracket
        for line in f:
            line = line.rstrip().rstrip(',')
            if line == ']':
                # end of dump
                break
            system_data = parser(line)
            yield from _ingest_system_data(system_data)

def _ingest_system_data(system_data):
    for station_name, update_time, commodities in _find_markets_in_system(system_data):
        yield f'{system_data["name"].upper()}/{station_name}', _ingest_commodities(commodities, update_time)

def _ingest_commodities(commodities, update_time):
    for category, category_commodities in commodities.items():
        yield category, _ingest_category_commodities(category_commodities, update_time)

def _ingest_category_commodities(commodities, update_time):
    for commodity, market_data in commodities.items():
        yield Commodity(
            name=commodity,
            sell=market_data["sellPrice"],
            buy=market_data["buyPrice"],
            demand=market_data["demand"],
            supply=market_data["supply"],
            ts=update_time,
        )

def _find_markets_in_system(system_data):
    # look for stations in the system and on all bodies
    targets = [system_data, *system_data.get('bodies', [])]
    for target in targets:
        for station in target['stations']:
            if 'Market' not in station.get('services', []):
                continue
            if not station.get('market', {}).get('commodities', []):
                continue
            yield (
                station['name'],
                station['market'].get('updateTime', None),
                _categorize_commodities(station['market']['commodities'], ),
            )


def _categorize_commodities(commodities):
    commodities_by_category = {}
    for commodity in commodities:
        commodities_by_category.setdefault(commodity['category'], {})[commodity['name']] = commodity
    return commodities_by_category

def benchmark(filename, parser, parser_name=None, iterations=5):
    """Benchmark a JSON parser.

    Prints timing for consuming the entire stream, without doing anything with the data.
    """
    times = []
    for _ in range(iterations):
        start_ts = time.perf_counter()
        stream = ingest(filename, parser)
        for _, market in stream:
            for _, commodities in market:
                for _ in commodities:
                    pass
        end_ts = time.perf_counter()
        elapsed = end_ts - start_ts
        times.append(elapsed)
    min_time = min(times)
    avg_time = sum(times) / len(times)
    max_time = max(times)
    if parser_name is None:
        parser_name = repr(parser)
    print(f'{min_time:6.2f} {avg_time:6.2f} {max_time:6.2f}  {parser_name}')

def benchmark_parsers(filename=DEFAULT_INPUT, **parsers):
    """Benchmark all parsers passed in as keyword arguments."""
    for name, parser in parsers.items():
        benchmark(filename, parser, parser_name=name)

def convert(filename, parser=DEFAULT_PARSER):
    """Converts spansh-style galaxy dump into TradeDangerous-style prices."""
    print('#     {name:35s}  {sell:>7s}  {buy:>7s}  {demand:>10s}  {supply:>10s}  {ts}'.format(
        name='Item Name',
        sell='SellCr',
        buy='BuyCr',
        demand='Demand',
        supply='Supply',
        ts='Timestamp',
    ))
    print()
    for station_name, market in ingest(filename, parser=parser):  # use the filename argument, not the module default
        print(f'@ {station_name}')
        for category, commodities in market:
            print(f'   + {category}')
            for commodity in commodities:
                print('      {name:35s}  {sell:7d}  {buy:7d}  {demand:10d}  {supply:10d}  {ts}'.format(
                    name=commodity.name,
                    sell=commodity.sell,
                    buy=commodity.buy,
                    demand=commodity.demand,
                    supply=commodity.supply,
                    ts=commodity.ts,
                ))
        print()

if __name__ == '__main__':
    benchmark_parsers(
        cysimdjson=DEFAULT_PARSER,
        pysimdjson=ALT_PARSER,
    )

I've benchmarked them and pysimdjson seems to be noticeably faster:

# min / avg / max time
 67.54  67.71  67.81  cysimdjson
 49.94  50.86  51.97  pysimdjson

@eyeonus
Owner

eyeonus commented Mar 15, 2024

Very nice. Do me a flavour and submit a PR for this, formatted as an import plugin.

@lanzz
Contributor

lanzz commented Mar 15, 2024

Yeah, that's WIP, I was just focusing on getting the parsing logic right first 👍

@bgol
Contributor Author

bgol commented Mar 15, 2024

You don't need the category in the price file (saves some bytes), see:
https://github.com/eyeonus/Trade-Dangerous/blob/master/tradedangerous/cache.py#L583-L586
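
To illustrate, here's roughly what the POC output above would look like with the "+ Category" lines simply dropped; the station and figures below are invented for the example, the columns mirror the POC's header (item, sell, buy, demand, supply, timestamp):

@ SOL/EXAMPLE ORBITAL
      Gold                                    9401     9372        1200         340  2024-03-15 12:00:00
      Bertrandite                             2400        0       15000           0  2024-03-15 12:00:00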

@Tromador
Collaborator

@lanzz Probably a stupid question, but I'd rather ask and not need to: I presume that carriers count as "stations in the system"?

@eyeonus
Owner

eyeonus commented Apr 22, 2024

I did not even see that. Thanks!

@kfsone
Contributor

kfsone commented Apr 22, 2024

It's like the four doctors in here :) 👋 @bgol :)

@bgol
Contributor Author

bgol commented Apr 22, 2024

Yeah, hi Oliver, nice to hear from you. Now, where is madavo? :)
(Didn't expect this issue to become an old guys' chat ;)

@lanzz
Contributor

lanzz commented Apr 22, 2024

Sorry I haven't been following up the developments in this thread.

Yes, the reason I implemented it to generate a .prices file was that this seemed to be what the plugin system itself was expecting. import_cmd.run() allows the plugin either to pass control back by returning True, or to cancel the default implementation by returning anything else, so it seemed like the plugin was expected to do some loading up to a point and then leave the actual import to the default implementation; that's what I went with. I also did not want to reproduce a lot of complexity in updating the database directly, as I couldn't immediately find any nicely reusable way to do that in the existing logic (I might not have looked too hard).
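
For anyone following along, here is a rough sketch of that contract as an import plugin might express it. The base class, module path, and helper names here are assumptions for illustration, not the actual TD plugin API:

from tradedangerous import plugins

class ImportPlugin(plugins.ImportPluginBase):
    """Hypothetical plugin showing the two ways run() can hand off control."""

    def run(self):
        prices_path = self.write_prices_from_dump()     # hypothetical helper
        if prices_path:
            # Returning True passes control back so the default import
            # machinery loads the .prices file the plugin just wrote.
            return True
        # Otherwise update the database ourselves and return something other
        # than True to cancel the default implementation.
        self.update_database_directly()                  # hypothetical helper
        return False

    def write_prices_from_dump(self):
        raise NotImplementedError

    def update_database_directly(self):
        raise NotImplementedError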

@aadler
Contributor

aadler commented Apr 22, 2024

Yeah, hi Oliver, nice to hear from you. Now, where is madavo? :) (Didn't expect this issue to become an old guys' chat ;)

Chiming in for the 50+ crew 👴

@eyeonus
Owner

eyeonus commented Apr 22, 2024

It's like the four doctors in here :) 👋 @bgol :)

Lol, only if I'm Tennant. Also bowties are not cool.

Sorry I haven't been following up the developments in this thread.

Yes, the reason I implemented it to generate .prices file was because that seemed like what the plugin system itself was expecting.

No worries. I've changed it to go directly into the database myself; we all kind of assumed that's why you did it that way. The RAM usage was just too much for the server doing it the way you had it.

Returning False means that the plugin handled everything itself. There's no expectation either way; it's just there to give the plugin author the ability to go either way.

Your work is sincerely appreciated; after all, you did the hard part. All I did was some refactoring to make it less RAM-intensive.

@kfsone
Contributor

kfsone commented Apr 22, 2024

@lanzz the simdjson usage seemed to run into a problem I hit at Super Evil recently with our Python-based asset pipeline and recent CPython optimizations that make it garbage-collect less often. As you'd spotted, you had to allocate a new simdjson parser each loop or else it complained you still had references. Also, the TD code is full of make-a-string-with-joins, because that was the optimal way to join two short strings back in those versions of Python. Now it seems to be bad because, with the aforementioned garbage reduction, the likelihood Python will actually have to allocate an object for the string is extra high. le sigh.

That got me looking at ijson, and I have a local version using it that is clearly slower than simdjson for small loops, but starts catching up by the page (4kb).

(please note: I live in a near perpetual state of 'wtf python, really, why?' these days -- if any of that seems to be pointed at anything but python and/or myself, I derped at typing)

@spansh

spansh commented Apr 22, 2024

We still need to figure out how to populate the other tables. At this point, we have Added, Category, and RareItem templated, and we have Item, Station, StationItem, and System built using the spansh data. The rest are empty until we have some means of generating them, either from the galaxy_stations.json dumps via the spansh plugin or from the EDDN messages via the listener, preferably both.

If you let me know what the missing/extra data is, I can point you to the fields if they're available in the dump, and if they're not, tell you where I normally source that data from when/if I put it into my search index.

@eyeonus
Owner

eyeonus commented Apr 22, 2024

@spansh
All the tables not currently being populated:

TABLE Ship:
    ship_id  = fdev_id, 
    name, 
    cost
TABLE ShipVendor: 
    ship_id = fdev_id ( ref: SHIP.ship_id ), 
    station_id = fdev_id ( ref: STATION.station_id ),
    modified = format( "YYYY-MM-DD HH:MM:SS" )
TABLE Upgrade: 
    upgrade_id = fdev_id, 
    name, 
    weight, 
    cost
TABLE UpgradeVendor: 
    upgrade_id = fdev_id ( ref: UPGRADE.upgrade_id ), 
    station_id = fdev_id ( ref: STATION.station_id ), 
    cost, 
    modified = format( "YYYY-MM-DD HH:MM:SS" )

It'd also be nice to have a way to automatically add new rares to the RareItem table:

TABLE RareItem: 
    rare_id = fdev_id, 
    station_id = fdev_id ( ref: STATION.station_id ), 
    category_id = ( ref: CATEGORY.id ), 
    name, 
    cost, 
    max_allocation, 
    illegal = ( '?', 'Y', 'N' ), 
    suppressed = ( '?', 'Y', 'N' )
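
To make the shape of that concrete, here's a quick SQLite sketch of the first two of those tables. The column names come from the list above, but the types, constraints, and database filename are my guesses rather than the actual TradeDangerous schema:

import sqlite3

# Illustrative only: column names follow the list above; types/constraints are assumptions.
SHIP_TABLES = """
CREATE TABLE IF NOT EXISTS Ship (
    ship_id  INTEGER PRIMARY KEY,   -- fdev_id
    name     TEXT NOT NULL,
    cost     INTEGER
);
CREATE TABLE IF NOT EXISTS ShipVendor (
    ship_id    INTEGER NOT NULL REFERENCES Ship(ship_id),        -- fdev_id
    station_id INTEGER NOT NULL REFERENCES Station(station_id),
    modified   TEXT,                -- "YYYY-MM-DD HH:MM:SS"
    PRIMARY KEY (ship_id, station_id)
);
"""

with sqlite3.connect("TradeDangerous.db") as db:   # hypothetical path
    db.executescript(SHIP_TABLES)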

@Tromador
Collaborator

Tromador commented Apr 22, 2024 via email

@kfsone
Contributor

kfsone commented Apr 22, 2024

Is there a reason to prefer the human-readable/strptime datetimes? Having them as UTC timestamps, either int or float, would make parsing and import much, much faster.

@eyeonus
Owner

eyeonus commented Apr 22, 2024

Historical inertia (that's how it was set up when I took over TD maintenance), and potentially backwards compatibility.

Regarding the former, I've no problems with changing it.

Regarding the latter, as long as it doesn't break anything, I've no problems changing it.

@kfsone
Contributor

kfsone commented Apr 22, 2024

"Your fault, you daft old fart" is perfectly fine by me :)
^- the you being me, not you, if you see what I mean, oh this is making it worse isn't it? 🤯

@kfsone
Contributor

kfsone commented Apr 22, 2024

@eyeonus I'm seeing this at the moment with latest; is this my fault?

Win|PS> ./trade.py import -P eddblink -O clean,solo
...
NOTE: Rebuilding cache file: this may take a few moments.
NOTE: Missing "C:\Users\oliver.smith\source\github.com\kfsone\Trade-Dangerous\tradedangerous\data\TradeDangerous.prices" file - no price data.
NOTE: Import completed.

that seems like something that shouldn't be missing at the end of an import?

@kfsone
Contributor

kfsone commented Apr 22, 2024

Oh, that's not the same as regenerating prices file. Duh. I used to figure that eventually the cost of generating the .prices file and stability of TD would mean we didn't have to keep generating the thing, is that all this is?

@eyeonus
Owner

eyeonus commented Apr 22, 2024

Nope, it's fine. It's a warning by tdb.reloadCache(), and is expected since it's a clean run.

@eyeonus
Owner

eyeonus commented Apr 22, 2024

Also, since you did solo, it didn't download or import the listings, so there's nothing to export to a prices file, and you won't have a prices file after the command finishes, either.

@eyeonus
Owner

eyeonus commented Apr 22, 2024

Oh, that's not the same as regenerating prices file. Duh. I used to figure that eventually the cost of generating the .prices file and stability of TD would mean we didn't have to keep generating the thing, is that all this is?

No, it will regenerate prices IFF listings are imported, but not otherwise.

@kfsone
Contributor

kfsone commented Apr 23, 2024

While I'm re-finding my feet, I've made a number of QoL changes - at least, if you're using an IDE like PyCharm/VSCode.

I tend to configure tox so that my IDEs pick up the settings from it and I get in-ide guidance and refactoring options.

I've also introduced a little bit of flair to make watching it do its import thing a little less tedious, but I'm trying to stagger how I do it so that there's always an easy way to dump the new presentation. This is what happens when I've been watching Sebastian Lague videos lately https://www.youtube.com/watch?v=SO83KQuuZvg ... but it's probably also going to be nice for end-users too.

Recording.2024-04-22.171344.mp4

These are currently in my kfsone/cleanup-pass branch.

@kfsone
Contributor

kfsone commented Apr 23, 2024

prices.py -> dumpPrices: "oliver waz 'ere" writ large... took me a while to realize that somehow, if you don't capture them, the cursor has the rows ready for you to iterate on. nasty. BAD SMITH.

image

@kfsone
Contributor

kfsone commented Apr 23, 2024

I'm doing some tuning of the tox config; it doesn't seem like we were actually running a "just flake8" pass, or we weren't really using it? I've got it enabled in my test branch. It should be fast (demo from PyCharm):

Recording.2024-04-22.221802.mp4

@eyeonus
Owner

eyeonus commented Apr 23, 2024

Oh, that's not the same as regenerating prices file. Duh. I used to figure that eventually the cost of generating the .prices file and stability of TD would mean we didn't have to keep generating the thing, is that all this is?

No, it will regenerate prices IFF listings are imported, but not otherwise.

Also, both the eddblink and spansh plugins use the existence of that file to determine if the database needs to be built.
(i.e., if it doesn't exist, assume starting from scratch)

@kfsone
Contributor

kfsone commented Apr 23, 2024

I was checking a few of the 1MR posts/videos about how they tackle it in Python. We don't have 1 billion but it's not that dissimilar to what we do. Discovering that just using "rb" and doing byte-related operations was a bit of a stunner, but it's annoying trying to switch large tracts of code from utf8-to-bytes. However, it can provide a 4-8x speed up.

For instance, we count the number of lines in some of our files so we can do a progress bar, right? If the file is 86mb that takes ~250ms.

Just using "rb" gets that down to 50ms and a little use of fixed-size buffering gets it down to 41ms.

https://gist.github.com/kfsone/dcb0d7811570e40e73136a14c23bf128
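
For the curious, here's a minimal sketch of the technique being described: fixed-size binary reads to count newlines, skipping text decoding entirely. The buffer size is arbitrary and this isn't the code from the gist:

def count_lines(path, bufsize=1024 * 1024):
    """Count newline bytes with fixed-size binary reads (no UTF-8 decoding)."""
    count = 0
    with open(path, "rb") as f:          # "rb": no decode, no newline translation
        while True:
            chunk = f.read(bufsize)      # fixed-size buffer keeps memory flat
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

# e.g. to size a progress bar before importing:
# total_lines = count_lines("listings.csv")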

@eyeonus
Owner

eyeonus commented Apr 24, 2024

Faster is good. I like faster.

@aadler
Contributor

aadler commented Apr 24, 2024

I was checking a few of the 1MR posts/videos about how they tackle it in Python. We don't have 1 billion but it's not that dissimilar to what we do. Discovering that just using "rb" and doing byte-related operations was a bit of a stunner, but it's annoying trying to switch large tracts of code from utf8-to-bytes. However, it can provide a 4-8x speed up.

For instance, we count the number of lines in some of our files so we can do a progress bar, right? If the file is 86mb that takes ~250ms.

Just using "rb" gets that down to 50ms and a little use of fixed-size buffering gets it down to 41ms.

https://gist.github.com/kfsone/dcb0d7811570e40e73136a14c23bf128

See https://stackoverflow.com/a/27518377/2726543

@kfsone
Contributor

kfsone commented Apr 25, 2024

I was checking a few of the 1MR posts/videos about how they tackle it in Python. We don't have 1 billion but it's not that dissimilar to what we do. Discovering that just using "rb" and doing byte-related operations was a bit of a stunner, but it's annoying trying to switch large tracts of code from utf8-to-bytes. However, it can provide a 4-8x speed up.
For instance, we count the number of lines in some of our files so we can do a progress bar, right? If the file is 86mb that takes ~250ms.
Just using "rb" gets that down to 50ms and a little use of fixed-size buffering gets it down to 41ms.
https://gist.github.com/kfsone/dcb0d7811570e40e73136a14c23bf128

See https://stackoverflow.com/a/27518377/2726543

ooorrrrr.... https://github.com/KingFisherSoftware/traderusty/ :)

I'm thinking I should have called it "tradedangersy" since rusty projects like to end with "rs" and python with "y" :)

image

@kfsone
Contributor

kfsone commented Apr 25, 2024

Don't read too much into that - it was an excuse to try a Rust-Python extension in anger (see https://github.com/kfsone/rumao3) and to see how much pain setting up PyPI and everything would be (it wasn't). And I'm not sure eyeonus is likely to want a second language added to the problem :)

@kfsone
Contributor

kfsone commented Apr 26, 2024

@Tromador is listings.csv guaranteed to be in station,item order? I think I can optimize by doing a lock-step walk through the database and listings (you create two generators, one with database entries and the other with listing entries, and you keep advancing the one that is "behind"; if the listings one runs out, you stop; if the database one runs out, you just don't need to compare).
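
A minimal sketch of that lock-step walk, assuming both sides are sorted by (station_id, item_id); the row shape and field names are illustrative, not TD's actual code:

def lockstep(db_rows, listing_rows):
    """Pair sorted DB rows with sorted listing rows on (station_id, item_id).

    Yields (db_row_or_None, listing) for every listing; stops when listings
    run out, and stops comparing once the DB side is exhausted.
    """
    db_iter = iter(db_rows)
    db_row = next(db_iter, None)
    for listing in listing_rows:
        key = (listing["station_id"], listing["item_id"])
        # advance the DB side while it is behind the listings side
        while db_row is not None and (db_row["station_id"], db_row["item_id"]) < key:
            db_row = next(db_iter, None)
        if db_row is not None and (db_row["station_id"], db_row["item_id"]) == key:
            yield db_row, listing
        else:
            yield None, listing   # listing with no matching DB entry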

@eyeonus
Owner

eyeonus commented Apr 26, 2024

@Tromador is listings.csv guaranteed to be in station,item order?

Yes, both listings.csv and listings-live.csv are guaranteed to be sorted by station_id, item_id.
