This document is specifically about storing and processing futures data.
Related documents:
- Using pysystemtrade as a production trading environment
- Main user guide
- Connecting pysystemtrade to interactive brokers
It is broken into three sections. The first, A futures data workflow, gives an overview of how data is typically processed. It describes how you would get some data from Quandl, store it, and create back-adjusted prices. The next section, Storing futures data, describes in detail each component of the API for storing futures data. In the third and final section, simData objects, you will see how we hook together individual data components to create a simData object that is used by the main simulation system.
Although this document is about futures data, parts two and three are necessary reading if you are trying to create or modify any data objects.
- A futures data workflow
- Setting up some instrument configuration
- Roll parameter configuration
- Getting historical data for individual futures contracts
- Roll calendars
- Creating and storing multiple prices
- Creating and storing back adjusted prices
- Backadjusting 'on the fly'
- Changing the stitching method
- Getting and storing FX data
- Storing and representing futures data
- Futures data objects and their generic data storage objects
- Instruments: futuresInstrument() and futuresInstrumentData()
- Contract dates: contractDate()
- Roll cycles: rollCycle()
- Roll parameters: rollParameters() and rollParametersData()
- Contract date with roll parameters: contractDateWithRollParameters()
- Futures contracts: futuresContracts() and futuresContractData()
- Prices for individual futures contracts: futuresContractPrices(), dictFuturesContractPrices() and futuresContractPriceData()
- Final prices for individual futures contracts: futuresContractFinalPrices(), dictFuturesContractFinalPrices()
- Roll calendars: rollCalendar() and rollCalendarData()
- Multiple prices: futuresMultiplePrices() and futuresMultiplePricesData()
- Adjusted prices: futuresAdjustedPrices() and futuresAdjustedPricesData()
- Spot FX data: fxPrices() and fxPricesData()
- Creating your own data objects, and data storage objects; a few pointers
- Data storage objects for specific sources
- Static csv files used for initialisation of databases
- Csv files for time series data
- csvFuturesInstrumentData() inherits from futuresInstrumentData
- csvFuturesContractPriceData() inherits from futuresContractPriceData
- csvRollCalendarData() inherits from rollCalendarData
- csvFuturesMultiplePricesData() inherits from futuresMultiplePricesData
- csvFuturesAdjustedPricesData() inherits from futuresAdjustedPricesData
- csvFxPricesData() inherits from fxPricesData
- mongo DB
- Specifying a mongoDB connection
- mongoFuturesInstrumentData() inherits from futuresInstrumentData
- mongoRollParametersData() inherits from rollParametersData
- mongoFuturesContractData() inherits from futuresContractData
- Quandl
- Arctic
- Specifying an arctic connection
- arcticFuturesContractPriceData() inherits from futuresContractPriceData
- arcticFuturesMultiplePricesData() inherits from futuresMultiplePricesData()
- arcticFuturesAdjustedPricesData() inherits from futuresAdjustedPricesData()
- arcticFxPricesData() inherits from fxPricesData()
- Creating your own data storage objects for a new source
- Futures data objects and their generic data storage objects
- simData objects
- Updating the provided .csv data from a production system
This section describes a typical workflow for setting up futures data from scratch:
- Set up some static configuration information for instruments, and roll parameters
- Get, and store, some historical data
- Build, and store, roll calendars
- Create and store 'multiple' price series containing the relevant contracts we need for any given time period
- Create and store back-adjusted prices
- Get, and store, spot FX prices
In future versions of pysystemtrade there will be code to keep your prices up to date.
The first step is to store some instrument configuration information. In principle this can be done in any way, but we are going to read from .csv files, and write to a Mongo Database. There are two kinds of configuration: instrument configuration and roll configuration. Instrument configuration consists of static information about each instrument, keyed by instrument code such as EDOLLAR (it also includes cost levels, which are required in the simulation environment).
The relevant script to setup information configuration is in sysinit - the part of pysystemtrade used to initialise a new system. Here is the script you need to run instruments_csv_mongo.py. Notice it uses two types of data objects: the object we write to mongoFuturesInstrumentData
and the object we read from csvFuturesInstrumentData
. These objects both inherit from the more generic futuresInstrumentData, and are specialist versions of that. You'll see this pattern again and again, and I describe it further in part two of this document.
Make sure you are running a Mongo Database before running this.
The information is sucked out of this file and into the mongo database whose connections are defined here. The file includes a number of futures contracts that I don't actually trade or get prices for. Any configuration information for these may not be accurate and you use it at your own risk.
For roll configuration we need to initialise by running the code in this file roll_parameters_csv_mongo.py. Again it uses two types of data objects: we read from a csv file with initCsvFuturesRollData
, and write to a mongo db with mongoRollParametersData
. Again you need to make sure you are running a Mongo Database before executing this script.
It's worth explaining the available options for roll configuration. First of all we have two roll cycles: 'priced' and 'hold'. Roll cycles use the usual definition for futures months (January is F, February G, March H, and the rest of the year is JKMNQUVX, with December Z). The 'priced' contracts are those that we can get prices for, whereas the 'hold' cycle contracts are those we actually hold. We may hold all the priced contracts (as for equities), or only some because of liquidity issues (eg Gold), or to keep a consistent seasonal position (eg CRUDEW is Winter Crude, so we only hold December).
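The month-code convention above is easy to express in code. This is a minimal illustrative sketch using plain Python; the names here are not pysystemtrade's own (its logic lives in the `rollCycle` class described later):

```python
# A minimal sketch of futures month-code arithmetic; these helper names
# are illustrative, not pysystemtrade's actual API.
MONTH_CODES = "FGHJKMNQUVXZ"  # Jan..Dec

def month_code(month_number: int) -> str:
    """Map a calendar month (1-12) to its futures letter code."""
    return MONTH_CODES[month_number - 1]

def next_in_cycle(code: str, cycle: str) -> str:
    """Return the contract month that follows `code` in a roll cycle."""
    idx = cycle.index(code)
    return cycle[(idx + 1) % len(cycle)]

print(month_code(1))               # F (January)
print(month_code(12))              # Z (December)
print(next_in_cycle("Z", "HMUZ"))  # wraps round to H
```

A quarterly hold cycle like 'HMUZ' is simply a subsequence of the full yearly cycle, which is why the same navigation code works for both.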
'RollOffsetDays': This indicates how many calendar days before a contract expires that we'd normally like to roll it. These vary from zero (Korean bonds KR3 and KR10, which you can't roll until the expiry date) down to -1100 (Eurodollar, where I like to stay several years out on the curve).
'ExpiryOffset': How many days to shift the expiry date in a month, eg (the day of the month that a contract expires)-1. These values are just here so we can build roughly correct roll calendars (of which more later). In live trading you'd get the actual expiry date for each contract.
Using these two dates together will indicate when we'd ideally roll an instrument, relative to the first of the month.
For example for Bund futures, the ExpiryOffset is 6; the contract notionally expires on day 1+6 = 7th of the month. The RollOffsetDays is -5, so we roll 5 days before this. So we'd normally roll on the 1+6-5 = 2nd day of the month.
Let's take a more extreme example, Eurodollar. The ExpiryOffset is 18, and the roll offset is -1100 (no not a typo!). We'd roll this product 1100 days before it expired on the 19th day of the month.
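The Bund and Eurodollar arithmetic above can be checked with a few lines of standard-library Python. This is a worked illustration of the offsets, not pysystemtrade's roll calendar code:

```python
# Worked example of the roll-date arithmetic described above, using
# Python's standard datetime; illustrative only.
import datetime

def approx_expiry(year: int, month: int, expiry_offset: int) -> datetime.date:
    """The contract notionally expires on the (1 + expiry_offset)th of its month."""
    return datetime.date(year, month, 1) + datetime.timedelta(days=expiry_offset)

def desired_roll_date(expiry: datetime.date, roll_offset_days: int) -> datetime.date:
    """roll_offset_days is negative: roll that many days before expiry."""
    return expiry + datetime.timedelta(days=roll_offset_days)

# Bund: ExpiryOffset=6, RollOffsetDays=-5 -> expires on the 7th, roll on the 2nd
bund_expiry = approx_expiry(2018, 3, 6)
print(bund_expiry, desired_roll_date(bund_expiry, -5))

# Eurodollar: ExpiryOffset=18, RollOffsetDays=-1100 -> roll roughly 3 years early
ed_expiry = approx_expiry(2021, 6, 18)
print(ed_expiry, desired_roll_date(ed_expiry, -1100))
```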
'CarryOffset': Whether we take carry from an earlier dated contract (-1, which is preferable) or a later dated contract (+1, which isn't ideal, but if we hold the front contract we have no choice). This calculation is done based on the priced roll cycle, so for example for winter crude, where the hold roll cycle is just 'Z' (we hold December) and the carry offset is -1, we take the previous month in the priced roll cycle (which is a full year FGHJKMNQUVXZ), i.e. November (whose code is 'X'). You can read more in Appendix B of my first book.
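The winter crude example works out like this in code. Again, this is a sketch of the lookup, not the actual pysystemtrade implementation:

```python
# Sketch of the carry contract lookup: with carry_offset = -1 we step
# backwards through the *priced* roll cycle (illustrative code only).
PRICED_CYCLE = "FGHJKMNQUVXZ"  # full-year priced cycle, as for crude

def carry_contract(held_month: str, carry_offset: int,
                   priced_cycle: str = PRICED_CYCLE) -> str:
    idx = priced_cycle.index(held_month)
    return priced_cycle[(idx + carry_offset) % len(priced_cycle)]

# Winter crude: we hold December ('Z'); carry_offset -1 gives November ('X')
print(carry_contract("Z", -1))  # X
```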
Now let's turn our attention to getting prices for individual futures contracts. We could get this from anywhere, but we'll use Quandl. Obviously you will need to get the python Quandl library, and you may want to set a Quandl key.
NOTE: Quandl are no longer supporting free futures data except for a limited number of instruments. I am looking for alternatives, but the most likely outcome is that I will use IB to get historical data although this will only go back one year and excludes closed contracts.
We can also store it, in principle, anywhere, but I will be using the open source Arctic library which was released by my former employers AHL. This sits on top of Mongo DB (so we don't need yet another database) but provides straightforward and fast storage of pandas DataFrames.
We'll be using this script. Unlike the first two initialisation scripts this is set up to run for a single market.
By the way I can't just pull down this data myself and put it on github to save you time. Storing large amounts of data in github isn't a good idea regardless of whether it is in .csv or Mongo files, and there would also be licensing issues with basically just copying and pasting raw data from Quandl. You have to get, and then store, this stuff yourself. And of course at some point in a live system you would be updating this yourself.
This uses quite a few data objects:
- Price data for individual futures contracts: quandlFuturesContractPriceData and arcticFuturesContractPriceData
- Configuration needed when dealing with Quandl: quandlFuturesConfiguration - this reads this .csv and defines the code and market; but also the first contract in Quandl's database.
- Instrument data (that we prepared earlier): mongoFuturesInstrumentData
- Roll parameters data (that we prepared earlier): mongoRollParametersData
- Two generic data objects (not for a specific source): listOfFuturesContracts, futuresInstrument
The script does two things:
- Generate a list of futures contracts, starting with the first contract defined in this .csv and following the price cycle. The last contract in that series is the contract we'll currently hold, given the 'ExpiryOffset' parameter.
- For each contract, get the prices from Quandl and write them into Arctic / Mongo DB.
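The first step, enumerating contracts from the first one in the database through the price cycle, can be sketched like this. The function name and signature are hypothetical; the real logic lives in the script and in `listOfFuturesContracts`:

```python
# Rough sketch of generating yyyymm00 contract identifiers from a first
# contract, following a priced roll cycle (illustrative names only).
MONTH_FOR_CODE = {c: i + 1 for i, c in enumerate("FGHJKMNQUVXZ")}

def contracts_in_cycle(first_contract: str, priced_cycle: str, last_year: int):
    """List contract identifiers from first_contract up to the end of last_year."""
    first_year = int(first_contract[:4])
    first_month = int(first_contract[4:6])
    cycle_months = [MONTH_FOR_CODE[c] for c in priced_cycle]
    contracts = []
    for year in range(first_year, last_year + 1):
        for m in cycle_months:
            if (year, m) >= (first_year, first_month):
                contracts.append("%04d%02d00" % (year, m))
    return contracts

print(contracts_in_cycle("20160300", "HMUZ", 2016))
# ['20160300', '20160600', '20160900', '20161200']
```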
We're now ready to set up a roll calendar. A roll calendar is the series of dates on which we roll from one futures contract to the next. It might be helpful to read my blog post on rolling futures contracts (though bear in mind some of the details relate to my current trading system and do not reflect how pysystemtrade works).
You can see a roll calendar for Eurodollar futures, here. On each date we roll from the current_contract shown to the next_contract. We also see the current carry_contract; we use the differential between this and the current_contract to calculate forecasts for carry trading rules.
There are two ways to generate roll calendars:
- Generate an approximate calendar based on the 'ExpiryOffset' parameter, and then adjust it so it is viable given the futures prices we have from the previous stage.
- Infer from existing 'multiple price' data. Multiple price data are data series that include the prices for three types of contract: the current, next, and carry contract (though of course there may be overlaps between these).
This is the method you'd use if you were starting from scratch, and you'd just got some prices for each futures contract. The relevant script is here. Again it is only set up to run a single instrument at a time.
In this script:
- We get prices for individual futures contract from Arctic that we created in the previous stage
- We get roll parameters from Mongo, that we made earlier
- We calculate the roll calendar:
  ```python
  roll_calendar = rollCalendar.create_from_prices(dict_of_futures_contract_prices, roll_parameters)
  ```
- We do some checks on the roll calendar, for monotonicity and validity (these checks will generate warnings if things go wrong)
- If we're happy with the roll calendar we write our roll calendar into a csv file
The actual code that generates the roll calendar is here
The interesting part is:
```python
approx_calendar = _generate_approximate_calendar(list_of_contract_dates, roll_parameters_object)
adjusted_calendar = _adjust_to_price_series(approx_calendar, dict_of_futures_contract_prices)
adjusted_calendar_with_carry = _add_carry_calendar(adjusted_calendar, roll_parameters_object)
```
So we first generate an approximate calendar, for when we'd ideally want to roll each of the contracts, based on our roll parameter `RollOffsetDays`. However we may find that there weren't matching prices for a given roll date. A matching price is when we have prices for both the current and next contract on the relevant day. If we don't have that, then we can't calculate an adjusted price. The adjustment stage finds the closest date to the ideal date (looking both forwards and backwards in time). If there are no dates with matching prices, then the process will return an error. Finally we add the carry contract to the roll calendar - this isn't used for back adjustment, but we still need it for forecasting with the carry trading rule.
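The adjustment step amounts to a nearest-date search over the dates where both contracts have prices. A pure-Python sketch of that idea (not the actual `_adjust_to_price_series` implementation):

```python
# Illustrative: find the date closest to the ideal roll date on which
# *both* the current and next contract have a price.
import datetime

def closest_matching_date(ideal_date, current_price_dates, next_price_dates):
    """Search forwards and backwards for a date with prices in both contracts."""
    matching = set(current_price_dates) & set(next_price_dates)
    if not matching:
        raise LookupError("No date with prices for both contracts")
    return min(matching, key=lambda date: abs((date - ideal_date).days))

d = datetime.date
current = [d(2018, 3, 1), d(2018, 3, 2), d(2018, 3, 5)]
next_c = [d(2018, 3, 5), d(2018, 3, 6)]
print(closest_matching_date(d(2018, 3, 2), current, next_c))  # 2018-03-05
```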
We then check that the roll calendar is monotonic and valid.
A monotonic roll calendar will have increasing datestamps in the index. It's possible, if your data is messy, to get non-monotonic calendars. Unfortunately there is no automatic way to fix this, you need to dive in and rebuild the data (this is why I store the calendars as .csv files to make such hacking easy).
A valid roll calendar will have current and next contract prices on the roll date. Since this is how we generate the roll calendars they should always pass this test (if we couldn't find a date when we have aligned prices then the calendar generation would have failed with an exception).
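The monotonicity check is conceptually just "are the index datestamps strictly increasing". A tiny sketch of that check (the real one lives in the `rollCalendar` class and works on a pandas index):

```python
# Illustrative monotonicity check on a sequence of roll datestamps.
def is_monotonic(roll_dates):
    """True if every datestamp is strictly after the previous one."""
    return all(a < b for a, b in zip(roll_dates, roll_dates[1:]))

print(is_monotonic(["2018-03-02", "2018-06-01", "2018-09-03"]))  # True
print(is_monotonic(["2018-03-02", "2018-03-02", "2018-06-01"]))  # False: duplicate datestamp
```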
Roll calendars are stored in .csv format here. Of course you could put these into Mongo DB, or Arctic, but I like the ability to hack them if required.
In the next section we learn how to use roll calendars, and price data for individual contracts, to create DataFrames of multiple prices: the series of prices for the current, forward and carry contracts; as well as the identity of those contracts. But it's also possible to reverse this operation: work out roll calendars from multiple prices.
Of course you can only do this if you've already got these prices, which means you already need to have a roll calendar: a catch 22. Fortunately sets of multiple prices have been provided in pysystemtrade for some time, here. These are copies of the data in my legacy trading system, for which I had to generate historic roll calendars; the data since early 2014 includes the actual dates when I rolled.
We run this script which by default will loop over all the instruments for which we have data in the multiple prices directory.
The next stage is to store multiple prices. Multiple prices are the price and contract identifier for the current contract we're holding, the next contract we'll hold, and the carry contract we compare with the current contract for the carry trading rule. They are required for the next stage, calculating back-adjusted prices, but are also used directly by the carry trading rule in a backtest. Constructing them requires a roll calendar, and prices for individual futures contracts.
We can store these prices in either Arctic or .csv files. The relevant script gives you the option of doing either or both of these.
Once we have multiple prices we can then create a backadjusted price series. The relevant script will read multiple prices from Arctic, do the backadjustment, and then write the prices to Arctic. It's easy to modify this to read/write to/from different sources.
It's also possible to implement the back-adjustment 'on the fly' within your backtest. More details later in this document, here.
If you don't like panama stitching then you can modify the method. More details later in this document, here.
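To make the panama idea concrete, here is a toy illustration on plain lists of prices. It assumes the simplest form of the method: at each roll, all earlier prices are shifted by the gap between the new and old contract so the stitched series is continuous at the roll date. The function and its alignment convention are mine, not pysystemtrade's:

```python
# Toy panama stitching. price_segments is a list of per-contract price
# lists, oldest first; the last price of each contract aligns in time
# with the first price of the next (the roll date).
def panama_stitch(price_segments):
    adjusted = list(price_segments[-1])      # newest contract is unchanged
    for segment in reversed(price_segments[:-1]):
        gap = adjusted[0] - segment[-1]      # new minus old price at the roll
        # shift the older contract's prices, dropping its duplicate roll-date point
        adjusted = [p + gap for p in segment[:-1]] + adjusted
    return adjusted

# Roll from a contract priced [100, 102] into one priced [99, 101]:
# the older prices are shifted by 99 - 102 = -3.
print(panama_stitch([[100, 102], [99, 101]]))  # [97, 99, 101]
```

Note how panama stitching preserves day-to-day price differences but not price levels, which is why stitched series can go negative far back in history.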
Although strictly not futures prices we also need spot FX prices to run our simulation. The github for pysystemtrade contains spot FX data, but you will probably wish to update it. In live trading we'd use interactive brokers, but for now I'm going to use one of the many free data websites: investing.com
You need to register and then download enough history. To see how much FX data there already is:
```python
from sysdata.csv.csv_spot_fx import *
data = csvFxPricesData()
data.get_fx_prices("GBPUSD")
```
Save the files in a directory with no other content, using the filename format "GBPUSD.csv". Using this simple script they are written to Arctic and/or .csv files. You will need to modify the script to point to the right directory, and you can also change the column and formatting parameters to use data from other sources.
The paradigm for data storage is that we have a bunch of data objects for specific types of data, i.e. futuresInstrument is the generic class for storing static information about instruments. Each of those objects then has a matching data storage object which accesses data for that object, i.e. futuresInstrumentData. Then we have specific instances of those for different data sources, i.e. mongoFuturesInstrumentData for storing instrument data in a mongo DB database.
Instruments: futuresInstrument() and futuresInstrumentData()
Futures instruments are the things we actually trade, eg Eurodollar futures, but not specific contracts. Apart from the instrument code we can store metadata about them. This isn't hard wired into the class, but currently includes things like the asset class, cost parameters, and so on.
Contract dates: contractDate()
Note: There is no data storage for contract dates, they are stored only as part of futures contracts.
A contract date allows us to identify a specific futures contract for a given instrument. Futures contracts can either be for a specific month (eg '201709') or for a specific day (eg '20170903'). The latter is required to support weekly futures contracts, or if we already know the exact expiry date of a given contract. A monthly date will be represented with trailing zeros, eg '20170900'.
We can also store expiry dates in contract dates. This can be done either by passing the exact date (which we'd do if we were getting the contract specs from our broker) or an approximate expiry offset, where 0 (the default) means the expiry is on day 1 of the relevant contract month.
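A hypothetical helper showing the yyyymmdd / yyyymm00 convention (not the actual `contractDate` implementation):

```python
# Illustrative parser for the contract date convention: dd == '00'
# denotes a monthly contract, otherwise a specific day.
def parse_contract_date(date_str: str):
    """Return (year, month, day_or_None); day is None for monthly contracts."""
    year, month, day = int(date_str[:4]), int(date_str[4:6]), int(date_str[6:8])
    return (year, month, None if day == 0 else day)

print(parse_contract_date("20170900"))  # (2017, 9, None): monthly contract
print(parse_contract_date("20170903"))  # (2017, 9, 3): specific-day contract
```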
Roll cycles: rollCycle()
Note: There is no data storage for roll cycles, they are stored only as part of roll parameters.
Roll cycles are the mechanism by which we know how to move forwards and backwards between contracts as they expire, or when working out carry trading rule forecasts. Roll cycles use the usual definition for futures months (January is F, February G, March H, and the rest of the year is JKMNQUVX, with December Z).
Roll parameters: rollParameters() and rollParametersData()
The roll parameters include all the information we need about how a given instrument rolls:
- `hold_rollcycle` and `priced_rollcycle`: The 'priced' contracts are those that we can get prices for, whereas the 'hold' cycle contracts are those we actually hold. We may hold all the priced contracts (as for equities), or only some because of liquidity issues (eg Gold), or to keep a consistent seasonal position (eg CRUDEW is Winter Crude, so we only hold December).
- `roll_offset_day`: This indicates how many calendar days before a contract expires that we'd normally like to roll it. These vary from zero (Korean bonds KR3 and KR10, which you can't roll until the expiry date) down to -1100 (Eurodollar, where I like to stay several years out on the curve).
- `carry_offset`: Whether we take carry from an earlier dated contract (-1, which is preferable) or a later dated contract (+1, which isn't ideal, but if we hold the front contract we have no choice). This calculation is done based on the priced roll cycle, so for example for winter crude, where the hold roll cycle is just 'Z' (we hold December) and the carry offset is -1, we take the previous month in the priced roll cycle (which is a full year FGHJKMNQUVXZ), i.e. November (whose code is 'X'). You can read more in Appendix B of my first book and in my blog post.
- `approx_expiry_offset`: How many days to shift the expiry date in a month, eg (the day of the month that a contract expires)-1. These values are just here so we can build roughly correct roll calendars (of which more later). In live trading you'd get the actual expiry date for each contract.
Contract date with roll parameters: contractDateWithRollParameters()
Note: There is no data storage for contract dates, they are stored only as part of futures contracts.
Combining a contract date with some roll parameters means we can answer important questions like, what is the next (or previous) contract in the priced (or held) roll cycle? What is the contract I should compare this contract to when calculating carry? On what date would I want to roll this contract?
Futures contracts: futuresContracts() and futuresContractData()
The combination of a specific instrument and a contract date (possibly with roll parameters) is a `futuresContract`.

`listOfFuturesContracts`: This dull class exists purely so we can generate a series of historical contracts from some roll parameters.
Prices for individual futures contracts: futuresContractPrices(), dictFuturesContractPrices() and futuresContractPriceData()
The price data for a given contract is just stored as a DataFrame with specific column names. Notice that we store Open, High, Low, and Final prices; but currently in the rest of pysystemtrade we effectively throw away everything except Final.
(A 'final' price is either a close or a settlement price, depending on how the data has been parsed from its underlying source.)
`dictFuturesContractPrices`: When calculating roll calendars we work with prices from multiple contracts at once.
Final prices for individual futures contracts: futuresContractFinalPrices(), dictFuturesContractFinalPrices()
This is just the final prices alone. There is no data storage required for these since we don't need to store them separately; we just extract them from either `futuresContractPrices` or `dictFuturesContractPrices` objects.

`dictFuturesContractFinalPrices`: When calculating roll calendars we work with prices from multiple contracts at once.
Roll calendars: rollCalendar() and rollCalendarData()
A roll calendar is a pandas DataFrame with columns for:
- current_contract
- next_contract
- carry_contract
Each row shows when we'd roll from holding current_contract (and using carry_contract) on to next_contract. As discussed earlier they can be created from a set of roll parameters and price data, or inferred from existing multiple price data.
Multiple prices: futuresMultiplePrices() and futuresMultiplePricesData()
A multiple prices object is a pandas DataFrame with columns for:PRICE, CARRY, PRICE_CONTRACT, CARRY_CONTRACT, FORWARD, and FORWARD_CONTRACT.
We'd normally create these from scratch using a roll calendar, and some individual futures contract prices (as discussed here). Once created they can be stored and reloaded.
Adjusted prices: futuresAdjustedPrices() and futuresAdjustedPricesData()
The representation of adjusted prices is boring beyond words; they are a pandas Series. More interesting is the fact you can create one with a back adjustment process given a multiple prices object:
```python
from sysdata.futures.adjusted_prices import futuresAdjustedPrices
from sysdata.arctic.arctic_multiple_prices import arcticFuturesMultiplePricesData

# assuming we have some multiple prices already stored
arctic_multiple_prices = arcticFuturesMultiplePricesData()
multiple_prices = arctic_multiple_prices.get_multiple_prices("EDOLLAR")

adjusted_prices = futuresAdjustedPrices.stich_multiple_prices(multiple_prices)
```
The adjustment defaults to the panama method. If you want to use your own stitching method then override the method `futuresAdjustedPrices.stich_multiple_prices`.
Spot FX data: fxPrices() and fxPricesData()
Technically bugger all to do with futures, but implemented in pysystemtrade as it's required for position scaling.
You should store your objects in this directory (for futures) or a new subdirectory of the sysdata directory (for new asset classes). Data objects and data storage objects should live in the same file. Data objects may inherit from other objects (for example for options you might want to inherit from the underlying future), but they don't have to. Data storage objects should all inherit from baseData.
Data objects should be prefixed with the asset class if there is any potential confusion, i.e. futuresInstrument, equitiesInstrument. Data storage objects should have the same name as their data object, but with a Data suffix, eg futuresInstrumentData.
Methods you'd probably want to include in a data object:
- `create_from_dict` (a `@classmethod`): Useful when reading data from a source
- `as_dict`: Useful when writing data to a source
- `create_empty` (a `@classmethod`): Useful when reading data from a source if the object is unavailable; it's better to return one of these than throw an error, in case the calling process is indifferent about missing data
- `empty`: returns True if this is an empty object
Methods you'd probably want to include in a data storage object:
- `keys()` and `__getitem__`. It's nice if data storage objects look like dicts. `keys()` should be mapped to `get_list_of_things_with_data`; `__getitem__` should be mapped to `get_some_data`.
- `get_list_of_things_with_data`, i.e. the list of instrument codes with valid data. Should `raise NotImplementedError`
- `get_some_data`: Check to see if `is_thing_in_data` is True, then call `_get_some_data_without_checking`. If not in data, return an empty instance of the data object.
- `is_thing_in_data`, i.e. is a particular instrument code in the list of codes with valid data
- `_get_some_data_without_checking`: `raise NotImplementedError`
- `delete_data_for_thing`: Check that an 'are you sure' flag is set, and that `is_thing_in_data` is True, then call `_delete_data_for_thing_without_checking`
- `_delete_data_for_thing_without_checking`: `raise NotImplementedError`
- `add_data_for_thing`: Check to see if `is_thing_in_data` is False (or that an ignore duplicates flag is set), then call `_add_data_for_thing_without_checking`
- `_add_data_for_thing_without_checking`: `raise NotImplementedError`
By the way, you shouldn't actually use method names like `get_list_of_things_with_data`; that's just plain silly. Instead use `get_list_of_instruments` or whatnot.
Notice the use of private methods to interact with the data inside public methods that perform standard checks. The methods that actually interact with the data (rather than just mapping to other methods, or performing checks) should raise a NotImplementedError; this will then be overridden in the data storage object for a specific data source.
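A stripped-down sketch of this pattern, with an in-memory dict standing in for a real data source (class and method names here are illustrative, not pysystemtrade's actual classes):

```python
# The base class does the checks; the private method raises
# NotImplementedError until a concrete storage class overrides it.
class baseInstrumentData(object):
    def get_instrument_data(self, code):
        if not self.is_code_in_data(code):
            return None  # the real code returns an empty data object instead
        return self._get_instrument_data_without_checking(code)

    def is_code_in_data(self, code):
        return code in self.get_list_of_instruments()

    def get_list_of_instruments(self):
        raise NotImplementedError

    def _get_instrument_data_without_checking(self, code):
        raise NotImplementedError


class dictInstrumentData(baseInstrumentData):
    """A concrete 'source': an in-memory dict, purely for illustration."""
    def __init__(self, data):
        self._data = data

    def get_list_of_instruments(self):
        return list(self._data.keys())

    def _get_instrument_data_without_checking(self, code):
        return self._data[code]


data = dictInstrumentData({"EDOLLAR": {"Pointsize": 2500}})
print(data.get_instrument_data("EDOLLAR"))  # {'Pointsize': 2500}
print(data.get_instrument_data("MISSING"))  # None
```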
This section covers the various sources for reading and writing data objects I've implemented in pysystemtrade.
In the initialisation part of the workflow (in section one of this document) I copied some information from .csv files to initialise a database. To achieve this we need to create some read-only access methods to the relevant .csv files (which are stored here).
csvFuturesInstrumentData() (/sysdata/csv/csv_instrument_config.py) inherits from futuresInstrumentData
The script instruments_csv_mongo.py reads instrument object data from here using csvFuturesInstrumentData. This class is not specific to initialising the database; it is also used later for simulation data.
initCsvFuturesRollData() inherits from rollParametersData
The script roll_parameters_csv_mongo.py reads roll parameters for each instrument from here
Storing data in .csv files has some obvious disadvantages, and doesn't feel like the sort of thing a 21st century trading system ought to be doing. However it's good for roll calendars, which sometimes need manual hacking when they're created. It's also good for the data required to run backtests that lives as part of the github repo for pysystemtrade (storing large binary files in git is not a great idea, although various workarounds exist I haven't yet found one that works satisfactorily).
For obvious (?) reasons we only implement get and read methods for .csv files (So... you want to delete the .csv file? Do it through the filesystem. Don't get python to do your dirty work for you).
csvFuturesInstrumentData() inherits from futuresInstrumentData
Reads futures configuration information from here (note this is a separate file from the one used to initialise the mongoDB database earlier although this uses the same class method to get the data). Columns currently used by the simulation engine are: Instrument, Pointsize, AssetClass, Currency, Slippage, PerBlock, Percentage, PerTrade. Extraneous columns don't affect functionality.
csvFuturesContractPriceData() inherits from futuresContractPriceData
Reads prices for individual futures contracts. There is no default directory for these, as this is provided as a convenience method if you have acquired .csv contract level data and wish to put it into your system. For this reason there is a lot of flexibility in the arguments, to allow different formats to be included. As an example, this code will read data downloaded from barcharts.com (with files renamed in the format `EDOLLAR_201509.csv`):
```python
csv_futures_contract_prices = csvFuturesContractPriceData(
    datapath="/home/username/data/barcharts_csv",
    input_date_index_name="Date Time",
    input_skiprows=1,
    input_skipfooter=1,
    input_column_mapping=dict(OPEN='Open',
                              HIGH='High',
                              LOW='Low',
                              FINAL='Close'))
```
csvRollCalendarData() inherits from rollCalendarData
Reads roll calendars from here. File names are just instrument names. File format is index DATE_TIME; columns: current_contract, next_contract, carry_contract. Contract identifiers should be in yyyymmdd format, with dd='00' for monthly contracts (currently weekly contracts aren't supported).
csvFuturesMultiplePricesData() inherits from futuresMultiplePricesData
Reads multiple prices (the prices of contracts that are currently interesting) from here. File names are just instrument names. File format is index DATETIME; columns: PRICE, CARRY, FORWARD, CARRY_CONTRACT, PRICE_CONTRACT, FORWARD_CONTRACT. Prices are floats. Contract identifiers should be in yyyymmdd format, with dd='00' for monthly contracts (currently weekly contracts aren't supported).
csvFuturesAdjustedPricesData() inherits from futuresAdjustedPricesData
Reads back adjusted prices from here. File names are just instrument names. File format is index DATETIME; columns: PRICE.
csvFxPricesData() inherits from fxPricesData
Reads spot FX prices from here. File names are CC1CC2, where CC1 and CC2 are three letter ISO currency abbreviations (eg GBPEUR). Cross rates do not have to be stored; they will be calculated on the fly. File format is index DATETIME; columns: FX.
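The on-the-fly cross rate is just a ratio of two USD-quoted rates. This is the underlying arithmetic only, not the `fxPricesData` implementation (which does the same thing across whole time series):

```python
# Illustrative cross-rate arithmetic: GBPEUR = GBPUSD / EURUSD,
# since the USD legs cancel.
def cross_rate(ccy1_usd: float, ccy2_usd: float) -> float:
    return ccy1_usd / ccy2_usd

gbpusd, eurusd = 1.30, 1.15
print(round(cross_rate(gbpusd, eurusd), 4))  # GBPEUR, about 1.1304
```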
For production code, and storing large amounts of data (eg for individual futures contracts) we probably need something more robust than .csv files. MongoDB is a no-sql database which is rather fashionable at the moment, though the main reason I selected it for this purpose is that it is used by Arctic.
Obviously you will need to make sure you already have a Mongo DB instance running. You might find you already have one running: in Linux use `ps wuax | grep mongo` to check, and kill the relevant process if you need to stop it.
Personally I like to keep my Mongo data in a specific subdirectory; that is achieved by starting up with `mongod --dbpath ~/data/mongodb/` (in Linux). Of course this isn't compulsory.
You need to specify an IP address (host), and database name when you connect to MongoDB. These are set with the following priority:
- Firstly, arguments passed to a `mongoDb()` instance, which is then optionally passed to any data object with the argument `mongo_db=mongoDb(host='localhost', database_name='production')`. All arguments are optional.
- Then, variables set in the private `.yaml` configuration file: mongo_host, mongo_db
- Finally, default arguments in the system defaults configuration file: mongo_host, mongo_db
Note that 'localhost' is equivalent to '127.0.0.1', i.e. this machine. Note also that no port can be specified. This is because the port is hard coded in Arctic. You should stick to the default port 27017.
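The priority cascade can be sketched in a few lines. This is purely illustrative - the real resolution happens inside the mongoDb class and the config loading code, and the default values below are stand-ins:

```python
DEFAULT_MONGO_HOST = "127.0.0.1"  # stand-in for the system defaults file
DEFAULT_MONGO_DB = "production"

def resolve_mongo_settings(explicit_host=None, explicit_db=None, private_config=None):
    """Explicit mongoDb() arguments beat the private .yaml file, which beats the defaults."""
    private_config = private_config or {}
    host = explicit_host or private_config.get("mongo_host") or DEFAULT_MONGO_HOST
    db_name = explicit_db or private_config.get("mongo_db") or DEFAULT_MONGO_DB
    return host, db_name

# The private .yaml sets the host; nothing passed explicitly, so the db name defaults
print(resolve_mongo_settings(private_config={"mongo_host": "192.168.0.5"}))
# ('192.168.0.5', 'production')
```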
If your mongoDB is running on your local machine then you can stick with the defaults (assuming you are happy with the database name 'production'). If you have different requirements, eg mongo running on another machine or you want a different database name, then you should set them in the private .yaml file. If you have highly bespoke needs, eg you want to use a different database or different host for different types of data, then you will need to add code like this:
# Instead of:
mfidata=mongoFuturesInstrumentData()
# Do this
from sysdata.mongodb import mongoDb
mfidata=mongoFuturesInstrumentData(mongo_db = mongoDb(database_name='another database')) # could also change host
mongoFuturesInstrumentData() inherits from futuresInstrumentData
This stores instrument static data in a dictionary format.
mongoRollParametersData() inherits from rollParametersData
This stores roll parameters in a dictionary format.
mongoFuturesContractData() inherits from futuresContractData
This stores futures contract data in a dictionary format.
Quandl is an awesome way of getting data, much of which is free, via a simple Python API.
At the time of writing you get this from here (external link, may fail).
Having a Quandl API key means you can download a fair amount of data for free without being throttled. If you have one then you should first create a file 'private_config.yaml' in the private directory of pysystemtrade. Then add this line:
quandl_key: 'your_key_goes_here'
Accesses this .csv file, which contains the codes and markets required to get data from Quandl.
quandlFuturesContractPriceData() inherits from futuresContractPriceData
Reads price data and returns in the form of futuresContractPrices objects. Notice that as this is purely a source of data we don't implement write methods.
quandlFxPricesData() inherits from fxPricesData
DEPRECATE THIS: NO LONGER WORKS. Reads FX spot prices from Quandl. Accesses this .csv file, which contains the codes required to get data from Quandl for a specific currency.
Arctic is a superb open source time series database which sits on top of Mongo DB and provides straightforward and fast storage of pandas DataFrames. It was created by my former colleagues at Man AHL (in fact I beta tested a very early version of Arctic), and then very generously released as open source. You don't need to run multiple instances of Mongo DB when using my data objects for Mongo DB and Arctic, they use the same one. However we configure them separately; the configuration for Arctic objects is here (so in theory you could use two instances on different machines with separate host names).
Basically my mongo DB objects are for storing static information, whilst Arctic is for time series.
Arctic has several storage engines, in my code I use the default VersionStore.
You need to specify an IP address (host), and database name when you connect to Arctic. Usually Arctic data objects will default to using the same settings as Mongo data objects.
Note:
- No port is specified - Arctic can only use the default port. For this reason I strongly discourage changing the port used when connecting to other mongo databases.
- In actual use Arctic prepends 'arctic-' to the database name. So instead of 'production' it specifies 'arctic-production'. This shouldn't be an issue unless you are connecting directly to the mongo database.
These are set with the following priority:
- Firstly, arguments passed to a mongoDb() instance, which is then optionally passed to any Arctic data object with the argument mongo_db=mongoDb(host='localhost', database_name='production'). All arguments are optional.
- Then, arguments set in the private .yaml configuration file: mongo_host, mongo_db
- Finally, default arguments hardcoded in mongo_connection.py: DEFAULT_MONGO_DB, DEFAULT_MONGO_HOST, DEFAULT_MONGO_PORT
Note that 'localhost' is equivalent to '127.0.0.1', i.e. this machine.
If your mongoDB is running on your local machine with the standard port settings, then you can stick with the defaults (assuming you are happy with the database name 'production'). If you have different requirements, eg mongo running on another machine, then you should code them up in the private .yaml file. If you have highly bespoke needs, eg you want to use a different database for different types of data, then you will need to add code like this:
# Instead of:
afcpdata=arcticFuturesContractPriceData()
# Do this
from sysdata.mongodb import mongoDb
afcpdata=arcticFuturesContractPriceData(mongo_db = mongoDb(database_name='another database')) # could also change host
arcticFuturesContractPriceData() inherits from futuresContractPriceData
Reads and writes per-contract futures price data.
arcticFuturesMultiplePricesData() inherits from futuresMultiplePricesData()
Reads and writes multiple price data for each instrument.
arcticFuturesAdjustedPricesData() inherits from futuresAdjustedPricesData()
Reads and writes adjusted price data for each instrument.
arcticFxPricesData() inherits from fxPricesData()
Reads and writes spot FX data for each currency pair.
Creating your own data storage objects is trivial, assuming they are for an existing kind of data object.
They should live in a subdirectory of sysdata, named for the data source, e.g. sysdata/arctic.
Look at an existing data storage object for a different source to see which methods you'd need to implement, and to see the generic data storage object you should inherit from. Normally you'd need to override all the methods in the generic object which raise NotImplementedError; the exception is if you have a read-only source like Quandl, or if you're working with .csv or similar files, in which case I wouldn't recommend implementing delete methods.
Use the naming convention sourceNameOfGenericDataObject, i.e. class arcticFuturesContractPriceData(futuresContractPriceData)
.
For databases you may want to create connection objects (like this for Arctic)
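Putting the convention together, here is a minimal skeleton of a new read-only storage object. All class and method names here are illustrative stand-ins, not the real pysystemtrade ones - check the actual generic object for the methods you must override:

```python
class futuresContractPriceData(object):
    """Stand-in for the real generic base class in sysdata/futures."""

    def get_prices_for_contract(self, instrument_code, contract_date):
        raise NotImplementedError

    def write_prices_for_contract(self, price_data, instrument_code, contract_date):
        raise NotImplementedError

# Naming convention: sourceNameOfGenericDataObject
class mysourceFuturesContractPriceData(futuresContractPriceData):
    def __init__(self, connection=None):
        self._connection = connection  # eg a database or API connection

    def get_prices_for_contract(self, instrument_code, contract_date):
        # implement the read against your source here
        return "prices for %s %s" % (instrument_code, contract_date)

    # read-only source, like Quandl: leave write (and delete) methods unimplemented
```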
The simData object is a compulsory part of the pysystemtrade system object which runs simulations (or in live trading generates desired positions). The API required for that is laid out in the user guide, here. For maximum flexibility, as of version 0.17 these objects are in turn constructed of methods that hook into data storage objects for specific sources. So for example in the default csvFuturesSimData the compulsory method (for futures) get_backadjusted_futures_price is hooked into an instance of csvFuturesAdjustedPricesData.
This modularity allows us to easily replace the data objects, so we could load our adjusted prices from mongo DB, or do 'back adjustment' of futures prices 'on the fly'.
For futures, simData objects need to know the source of:
- back adjusted prices
- multiple price data
- spot FX prices
- instrument configuration and cost information
Direct access to other kinds of information isn't necessary for simulations.
I've provided two complete simData objects which get their data from different sources: csvFuturesSimData and arcticFuturesSimData.
The simplest simData object gets all of its data from .csv files, making it ideal for simulations if you haven't built a process yet to get your own data. It's essentially a like-for-like replacement for the simpler csvSimData objects that pysystemtrade used in versions before 0.17.0.
This is a simData object which gets its data out of Mongo DB (static) and Arctic (time series). (Yes, the class name should include both terms; I shortened it so it isn't ridiculously long, and most of the interesting stuff comes from Arctic.) It is better suited to live trading.
Because the mongoDB data isn't included in the github repo, before using this you need to write the required data into Mongo and Arctic. You can do this from scratch, as per the 'futures data workflow' at the start of this document. Alternatively you can run the provided scripts, which will copy the data from the existing github .csv files. Of course it's also possible to mix these two methods. Once you have the data it's just a matter of replacing the default csv data object:
from systems.provided.futures_chapter15.basesystem import futures_system
from sysdata.arctic.arctic_and_mongo_sim_futures_data import arcticFuturesSimData
system = futures_system(data = arcticFuturesSimData(), log_level="on")
print(system.accounts.portfolio().sharpe())
Configuration information about futures instruments is stored in a number of different places:
- Instrument configuration and cost levels in this .csv file, used by default with csvFuturesSimData, or copied to the database with this script
- Roll configuration information in this .csv file, which will be copied to Mongo DB with this script
- Interactive Brokers configuration in this file, this file and this file.
The instruments in these lists won't necessarily match up; however, under DRY there shouldn't be duplicated column headings across files.
The system.get_instrument_list()
method is used by the simulation to decide which markets to trade; if no explicit list of instruments is included then it will fall back on the method system.data.get_instrument_list()
. In both the provided simData objects this will resolve to the method get_instrument_list
in the class which gets back adjusted prices, or in whatever overrides it for a given data source (.csv or Mongo DB). In practice this means it's okay if your instrument configuration (or roll configuration, when used) is a superset of the instruments you have adjusted prices for. But it's not okay if you have adjusted prices for an instrument, but no configuration information.
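That superset requirement can be expressed in a couple of lines (an illustrative helper, not part of pysystemtrade):

```python
def instruments_missing_config(config_instruments, adjusted_price_instruments):
    # It's fine for the configuration to cover more instruments than we have
    # adjusted prices for; the reverse means the simulation will break
    return sorted(set(adjusted_price_instruments) - set(config_instruments))

# Config is a superset of the priced instruments: nothing missing
print(instruments_missing_config(["EDOLLAR", "US10", "CORN"], ["EDOLLAR", "US10"]))  # []

# An instrument with adjusted prices but no configuration: a problem
print(instruments_missing_config(["EDOLLAR"], ["EDOLLAR", "US10"]))  # ['US10']
```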
Constructing simData objects in the way I've done makes it relatively easy to modify them. Here are a few examples.
Let's suppose you want to use Arctic and Mongo DB data, but get your spot FX prices from a .csv file. OK this is a silly example, but hopefully it will be easy to generalise this to doing more sensible things. Modify the file arctic_and_mongo_sim_futures_data.py:
# add import
from sysdata.csv.csv_sim_futures_data import csvFXData
# replace this class: class arcticFuturesSimData()
# with:
class arcticFuturesSimData(csvFXData, arcticFuturesAdjustedPriceSimData,
                           mongoFuturesConfigDataForSim, arcticFuturesMultiplePriceSimData):

    def __repr__(self):
        return "arcticFuturesSimData for %d instruments getting FX data from csv land" % len(self.get_instrument_list())
If you want to specify a custom .csv directory you'll also need to write a special init method to achieve that (bearing in mind that these are specified in the init for csvPaths and dbconnections, which are ultimately both inherited by arcticFuturesSimData) - I haven't tried it myself.
This is a modification to csvSimData which calculates back adjusted prices 'on the fly', rather than getting them pre-loaded from another database. This would allow you to use different back adjustments and see what effect they had. Note that this will work 'out of the box' for any 'single point' back adjustment where the roll happens on a single day, and where you can use multiple price data (which we already have). For any back adjustment where the process happens over several days you'd need to add extra methods to access individual futures contract prices and roll calendars. This is explained in the next section.
Create a new class:
from sysdata.futures.futuresDataForSim import futuresAdjustedPriceData, futuresAdjustedPrice
from sysdata.futures.adjusted_prices import futuresAdjustedPrices
class backAdjustOnTheFly(futuresAdjustedPriceData):
    def get_backadjusted_futures_price(self, instrument_code):
        multiple_prices = self._get_all_price_data(instrument_code)
        adjusted_prices = futuresAdjustedPrices.stitch_multiple_prices(multiple_prices)
        return adjusted_prices
In the file csv_sim_futures_data replace:
class csvFuturesSimData(csvFXData, csvFuturesAdjustedPriceData, csvFuturesConfigDataForSim, csvFuturesMultiplePriceData):
with:
class csvFuturesSimData(csvFXData, backAdjustOnTheFly, csvFuturesConfigDataForSim, csvFuturesMultiplePriceData):
If you want to test different adjustment techniques other than the default 'Panama stitch', then you need to override futuresAdjustedPrices.stitch_multiple_prices()
.
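For example, a 'ratio' (proportional) back adjustment could look something like the sketch below, working directly on a pandas DataFrame with the multiple prices columns. This is a simplified stand-alone illustration - the real override would live in a subclass of futuresAdjustedPrices and operate on futuresMultiplePrices objects, with proper handling of missing data:

```python
import pandas as pd

def ratio_back_adjust(multiple_prices):
    # At each roll (a change in PRICE_CONTRACT) rescale all earlier prices by
    # new contract price / old contract price on the day before the roll;
    # FORWARD holds the new contract's price at that point
    adjusted = multiple_prices["PRICE"].copy()
    contract = multiple_prices["PRICE_CONTRACT"]
    roll_points = contract.ne(contract.shift()).to_numpy().nonzero()[0][1:]
    for idx in reversed(list(roll_points)):
        ratio = multiple_prices["FORWARD"].iloc[idx - 1] / multiple_prices["PRICE"].iloc[idx - 1]
        adjusted.iloc[:idx] = adjusted.iloc[:idx] * ratio
    return adjusted

# Toy data: one roll, with the new contract trading at twice the old price
demo = pd.DataFrame({
    "PRICE": [100.0, 100.0, 200.0],
    "FORWARD": [201.0, 200.0, 202.0],
    "PRICE_CONTRACT": ["20181200", "20181200", "20190300"],
})
print(ratio_back_adjust(demo).tolist())  # [200.0, 200.0, 200.0]
```

Unlike the Panama stitch, a ratio adjustment preserves percentage returns rather than price differences, at the cost of distorting absolute price levels further back in time.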
For any back adjustment where the process happens over multiple days you'd need to add extra methods to access individual futures contract prices and roll calendars. Let's suppose we want to get these from Arctic (prices) and .csv files (roll calendars).
You'll need to override futuresAdjustedPrices.stitch_multiple_prices()
so it uses roll calendars and individual contract; I assume you inherit from futuresAdjustedPrices and have a new class with the override: futuresAdjustedPricesExtraData
. Then create the following classes:
from sysdata.futures.futuresDataForSim import futuresAdjustedPriceData, futuresAdjustedPrice
from somewhere import futuresAdjustedPricesExtraData # you need to provide this yourself
from sysdata.arctic.arctic_futures_per_contract_prices import arcticFuturesContractPriceData
from sysdata.csv.csv_roll_calendars import csvRollCalendarData
class backAdjustOnTheFlyExtraData(futuresAdjustedPriceData):
    def get_backadjusted_futures_price(self, instrument_code):
        individual_contract_prices = self._get_individual_contract_prices(instrument_code)
        roll_calendar = self._get_roll_calendar(instrument_code)
        adjusted_prices = futuresAdjustedPricesExtraData.stitch_multiple_prices(roll_calendar, individual_contract_prices)
        return adjusted_prices

class arcticContractPricesForSim(object):
    def _get_individual_contract_prices(self, instrument_code):
        arctic_contract_prices_data_object = self._get_arctic_contract_prices_data_object()
        return arctic_contract_prices_data_object.get_all_prices_for_instrument(instrument_code)

    def _get_arctic_contract_prices_data_object(self):
        # this will just use the default connection, but you can change that if you like
        arctic_contract_prices_data_object = arcticFuturesContractPriceData()
        arctic_contract_prices_data_object.log = self.log
        return arctic_contract_prices_data_object

class csvRollCalendarForSim(object):
    def _get_roll_calendar(self, instrument_code):
        roll_calendar_data_object = self._get_csv_roll_calendar_data_object()
        return roll_calendar_data_object.get_roll_calendar(instrument_code)

    def _get_csv_roll_calendar_data_object(self):
        pathname = self._resolve_path("roll_calendars")
        roll_calendar_data_object = csvRollCalendarData(pathname)
        roll_calendar_data_object.log = self.log
        return roll_calendar_data_object
In the file csv_sim_futures_data replace:
class csvFuturesSimData(csvFXData, csvFuturesAdjustedPriceData, csvFuturesConfigDataForSim, csvFuturesMultiplePriceData):
with:
class csvFuturesSimData(csvFXData, backAdjustOnTheFlyExtraData, csvRollCalendarForSim, arcticContractPricesForSim, csvFuturesConfigDataForSim, csvFuturesMultiplePriceData):
If you want to construct your own simData objects it's worth understanding their internals in a bit more detail.
The base class is simData. This in turn inherits from baseData, which is also the parent class for the data storage objects described earlier in this document. simData implements a number of compulsory methods that we need to run simulations. These are described in more detail in the main user guide for pysystemtrade.
We then inherit from simData for a specific asset class implementation, i.e. for futures we have the class futuresSimData in futuresDataForSim.py. This adds methods for additional types of data (eg carry) but can also override methods (eg get_raw_price is overridden so it gets backadjusted futures prices).
We then inherit for specific data source implementations. For .csv files we have the class csvFuturesSimData in csv_sim_futures_data.py.
Notice the naming convention: sourceAssetclassSimData.
Because they are quite complex I've broken down the futures simData objects into sub-classes, bringing everything back together with multiple inheritance in the final simData classes we actually use.
So for futures we have the following classes in futuresDataForSim.py, which are generic regardless of source (all inheriting from simData):
- futuresAdjustedPriceData(simData)
- futuresMultiplePriceData(simData)
- futuresConfigDataForSim(simData)
- futuresSimData: This class is redundant for reasons that will become obvious below
Then for csv files we have the following in csv_sim_futures_data.py:
- csvPaths(simData): To ensure consistent resolution of path names when locating .csv files
- csvFXData(csvPaths, simData): Covers methods unrelated to futures, so directly inherits from the base simData class
- csvFuturesConfigDataForSim(csvPaths, futuresConfigDataForSim)
- csvFuturesAdjustedPriceData(csvPaths, futuresAdjustedPriceData)
- csvMultiplePriceData(csvPaths, futuresMultiplePriceData)
- csvFuturesSimData(csvFXData, csvFuturesAdjustedPriceData, csvFuturesConfigDataForSim, csvMultiplePriceData)
Classes 3, 4 and 5 each inherit from one of the futures sub-classes (class 2 bypasses the futures specific classes and inherits directly from simData - strictly speaking we should probably have an fxSimData class in between these). Then class 6 ties all these together. Notice that futuresSimData isn't referenced anywhere; it is included only as a template to show how you should do this 'gluing' together.
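The ordering in that final class matters: with Python's multiple inheritance, the leftmost parent providing a method wins under the method resolution order (MRO). A toy illustration with stand-in class names:

```python
class simData(object):
    def get_fx_data(self):
        raise NotImplementedError

class csvFXData(simData):
    def get_fx_data(self):
        return "fx from csv"

class otherFXData(simData):
    def get_fx_data(self):
        return "fx from somewhere else"  # loses: it's further right in the MRO

# Like csvFuturesSimData: the leftmost class supplying each method is used
class combinedSimData(csvFXData, otherFXData):
    pass

print(combinedSimData().get_fx_data())  # fx from csv
```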
The methods we write for specific sources to override the methods in simData or simFuturesData type objects should all 'hook' into a data storage object for the appropriate source. I suggest using common methods to get the relevant data storage object, and to look up path names or otherwise configure the storage options (eg database hostname).
Eg here is the code for csvMultiplePriceData in csv_sim_futures_data.py, with additional annotations:
class csvMultiplePriceData(csvPaths, futuresMultiplePriceData):
    def _get_all_price_data(self, instrument_code): # overrides a method in futuresMultiplePriceData
        csv_multiple_prices_data = self._get_all_prices_data_object() # get a data storage object (see method below)
        instr_all_price_data = csv_multiple_prices_data.get_multiple_prices(instrument_code) # call the relevant method of the data storage object
        return instr_all_price_data

    def _get_all_prices_data_object(self): # data storage object
        pathname = self._resolve_path("multiple_price_data") # call to csvPaths class method to get path
        csv_multiple_prices_data = csvFuturesMultiplePricesData(datapath=pathname) # create a data storage object for .csv files with the pathname
        csv_multiple_prices_data.log = self.log # ensure logging is consistent
        return csv_multiple_prices_data # return the data storage object instance
If you have set up pysystemtrade as a production trading environment you may wish to continue storing your backtest data in .csv files rather than in databases (this step is also required for the BDFL of pysystemtrade to ensure the data provided on github is up to date). The following functions will allow you to update the .csv files: