Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to request data in csv format in get_dataframe method #524

Merged
merged 4 commits into from
Oct 22, 2020

Conversation

2xmm
Copy link
Contributor

@2xmm 2xmm commented Oct 17, 2020

Hi @hydrosquall,

Summary of Changes

Checklist

  • [ x] Code follows the style guidelines of this project
  • [ x] Added tests for new functionality
  • [ x] Update HISTORY.rst with an entry summarizing your change
  • [ x] Tag a project maintainer

@codecov
Copy link

codecov bot commented Oct 17, 2020

Codecov Report

Merging #524 into master will increase coverage by 0.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #524      +/-   ##
==========================================
+ Coverage   94.64%   94.73%   +0.09%     
==========================================
  Files           5        5              
  Lines         224      228       +4     
==========================================
+ Hits          212      216       +4     
  Misses         12       12              
Impacted Files Coverage Δ
tiingo/api.py 94.30% <100.00%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82b9903...7e6af79. Read the comment docs.

Copy link
Owner

@hydrosquall hydrosquall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @2xmm,

First of all, thanks for taking the time to submit this clearly documented + well-tested contribution, and improving some of the adjacent code along the way. I'm tempted to suggest making the default behavior for the library fetching JSON instead of CSV given the significant performance implications, but never know who may be depending on the presence of timezones in their datetime strings.

I have 3 followup comments for you:

  1. Would you like to be added to the AUTHORS.rst file? You can provide your preferred name / email, whichever you're comfortable with.
  2. To the comment about the YAML files looking different: is this referring to how the arrays all seem to be broken out onto newlines instead remaining on the line of their associated key? E.g.
Response-Type: [*]

# vs

Response-Type
  - '*"

If so, my guess would be that perhaps an text editor plugin is may be reformatting the YAML on save, depending on whether you used the helper python script or direct file edits to remove your API keys. Were there any other differences that stood out to you?

In any event, the Travis CI tests passing indicates to me that the alternate syntax is still valid YAML, so I think this is fine, but thanks for pointing this out. A future TODO for this project may be to add a consistency linter like black and run that as a pre-commit hook to remove this sort of variation.

  1. Do you have any suggestions about where to document the presence of this new parameter? I was thinking that adding a few lines to the get_dataframe code block in README.rst could be a good starting point, indicating both the option and the performance benefit + tradeoff (e.g. no timezones in your date field). (If this feels out of scope, I'll come up with something before we cut a new version.

Once again, great work! Thanks for sharing your fix with the world, I'm excited to bring this new capability into the library.

tiingo/api.py Outdated
prices.index = pd.to_datetime(prices.index)

# Localize to UTC to ensure equivalence between data returned in json format and
# csv format. Data requested in csv format does not include a timezone.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data requested in csv format does not include a timezone

Nice catch, I was curious about whether there was still any benefit to using json format if CSV has so much better performance.

[non-blocking] @tiingo are there any other major differences between the data returned by each format that users should know about besides query performance and the presence of timezones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should specify more clearly that daily data returned by Tiingo does not include a timezone. For intraday data, a timezone is included with the csv response.

This timezone issue exposes a larger issue which I think we should discuss. Tiingo daily data and IEX intraday data are both being returned by the get_dataframe method but I think they would be better off with their own methods. This would follow the Tiingo API more closely and users would be more aware of the source of the underlying data. I would suggest replacing get_dataframe with get_daily_dataframe and get_iex_dataframe.

tiingo/api.py Show resolved Hide resolved
tiingo/api.py Show resolved Hide resolved
@2xmm
Copy link
Contributor Author

2xmm commented Oct 20, 2020

Responding to your questions from above:

  1. Yes, I can add myself to AUTHORS.rst.
  2. Yes, these were my thoughts as well. As long as all the tests are passing, the YAML format is not so important.
  3. I’m happy to update the docs. Perhaps first we look at resolving my proposal to separate get_dataframe into multiple methods. Let me know what you think.

@hydrosquall
Copy link
Owner

This would follow the Tiingo API more closely and users would be more aware of the source of the underlying data. I would suggest replacing get_dataframe with get_daily_dataframe and get_iex_dataframe

I think that's a good idea! It's consistent with how we have dedicated methods for the other endpoints (e.g. crypto, news, etc), and prevents us from having to overload get_dataframe with too many new parameters as new endpoints come out. If it kept code the simple we could keep get_dataframe available (for backwards compatibility), but document these new methods instead. I will say that I think it's good that the get_ticker_price endpoint hides the user from needing to know what the endpoints are called, but that power users of this library are likely interested in knowing exactly what endpoint is being used to fulfill their request.

I also think that once these new methods are made, it may become reasonable to make "return csv format" the default instead of JSON so people get fast performance out of the box.

Would you be interested in addressing this in a follow up? If not, I'll file an issue so someone else can pick it up, but will leave off that step if you already have the context for making the change. Let me know if you have any other questions or ideas to discuss.

I'll plan to merge this branch whenever you push a commit that updates AUTHORS.rst, unless you'd like to make your changes part of this PR. Thanks again for your contributions!

@2xmm
Copy link
Contributor Author

2xmm commented Oct 22, 2020

@hydrosquall

I've made the changes discussed above so this is ready for merge.

I will file a new issue with my design proposal.

@hydrosquall
Copy link
Owner

@2xmm This has been released to all users using version 0.13.0 as of this evening.

#545

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants