-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Docs describing the Genotype Call XArray #78
Conversation
@alimanfoo @hammer I was going to do an example with a bed file, but the bed is in sgkit_plink. Should I not bother then? |
Hey @jerowe thanks for the contribution! We're still formulating the processes and tools we will use for documentation (https://github.com/pystatgen/sgkit/issues/19). Once our documentation is in place, we will write up some contributing guidelines (https://github.com/pystatgen/sgkit/issues/8). Until those guidelines are ready, I think it would be best to either leave this PR open or stash these notebooks in a personal repo until we're ready to figure out where to put them. Also, we'd prefer that every PR has an issue open that it can reference. That's something we plan to document in our contributing guidelines, so no worries for not having an issue for this PR! |
@hammer if you don't mind I'd prefer to keep these here, even as a placeholder. The PR itself is marked as a draft. I'm working on this with Quansight with @daletovar and @dharhas, so I don't have a personal repo. ;-) I have no strong feelings regarding doc standards. I was on a call with @alimanfoo and he said it was a bit up in the air, so just go with notebooks for now. I can certainly open an issue for this. ;-) |
@jerowe this is great! I think documenting the xarray representation is core, so it's probably best to use reST format so it gets included in the main documentation. (I've set up Sphinx in this PR: https://github.com/pystatgen/sgkit/pull/51, not yet merged.) I'm thinking that it would be a bit like http://xarray.pydata.org/en/stable/data-structures.html. For a longer tutorial, a Jupyter Notebook might be more appropriate so users can run the whole thing, but that could be a separate PR. |
@jerowe - do you think it's reasonable to convert the fundamental information here into rst/sphinx rather than a notebook? It would be really excellent to get this nailed down in solid documentation. |
@jeromekelleher I'd be curious to hear why you perceive this to be something we should do right now. While I think API docs are great to have as we implement new functionality, I'd prefer to hold off on user docs until we have stabilized the package structure and core data structures. |
@hammer - so having gone through the notebooks now (it's sooo annoying that GitHub doesn't show the actual notebooks in the PR!), I see what you mean, and these are more example driven and user-focused that we'd want to commit to right now. I was imagining more high-level documentation describing the data structure layout and design philosophy, which I think would be very helpful at this stage where we're trying to onboard quite a few people. I'm increasingly a fan of documentation-driven-development, and I think we should try to document the core data structures now. Do you see where I'm coming from? |
Yes! And in fact one reason for not using Hail is their unwillingness to document their file format, so I want to be clear that I don't imagine postponing this sort of documentation for too long. I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon. I perceive the documentation in this PR to be targeted at new users and I think it's great to have this sort of documentation written by people relatively new to the project, and don't want to block it forever. I just don't think we're at a place for the project where we can support new users, and so we should wait to have this sort of documentation until we at least have a PyPI release, for example. Also, I can imagine user-targeted notebooks might best live in a separate repo, similar to the Xarray tutorial linked above. |
OK, thanks @hammer, I think we're on the same page then. @jerowe - how would you like to proceed here? I think it would be useful to have fresh eyes on the internal data structures, but maybe it's best to get some basics down on paper first as @hammer says. But I agree that we're not really ready for the type of notebooks that are in this PR, and we'd probably want these in different, more tutorial-focused repo. |
@jeromekelleher @hammer (and other interested parties) I am totally fine with changing it to RST. I've had success going from notebook -> markdown and I'm sure I can find a markdown -> RST converter, or I can just rewrite the docs in RST. |
End result - I will convert these to RST and ping all involved. |
@jerowe as noted in https://github.com/pystatgen/sgkit/pull/78#issuecomment-668076872 I will open 2 issues, one to discuss documentation of data structures and design philosophy, and one to discuss a tutorial for new users. I think your documentation is best targeted at the latter issue. To ensure your efforts are not in vain, you may want to wait until we achieve a resolution on those issues, as it's not obvious to me we want to merge this documentation into this repository right now, nor is it obvious that this documentation is better as reST. |
I'm fine with whatever the sgkit hive minds decision is. ;-) I'm not working on this until Wednesday, but I'll pop back in and check on issues / PRs and to keep myself informed. |
Oh an done other quick note. Jupyter notebooks can be imported into sphinx docs through an extension. |
Sooooo I can't change files without making the CI freak out? |
I personally find this kind of attitude really chases people away from contributing. Generally speaking, if someone wants to contribute docs on another way of understanding something and gets told not to worry their little head until the people in charge make the decisions they will go through the process once or twice, but at some point, they simply move on. At that point, the opportunity is gone and the project has missed out on very often valuable insight from a third party that probably has expertise the software engineers lack. In this instance am I considered a core developer or a new user? I am perfectly capable of understanding a fancy dict that is itself mostly a transposed VCF file. ;-) |
Hi @jerowe, apologies for the radio silence, I'm still catching up from vacation. FWIW I really appreciated you submitting this PR, we will absolutely need documentation like this. Also apologies for not doing more to connect you with the other developers yet and get everyone aligned. There have been some recent discussions about how to organise documentation and those are ongoing, so it would be good to include you in shaping that.
I think there may have been some crossed wires here, I can see how the comment could be read that way, but I don't think that was the intention. Please give us the benefit of the doubt if you can, I know everyone in the project is committed to working collectively and supportively as a team. |
@jerowe I apologize for making your first PR to I've invited you and the rest of the Quansight team to a Zoom call on Friday and I hope we can discuss the best path forward then, as clearly I've done a poor job communicating asynchronously here on GitHub. I'd absolutely like you and the rest of the Quansight team to be "core developers" as soon as possible! My clumsy use of that term in https://github.com/pystatgen/sgkit/issues/87 was meant to refer to @alimanfoo and @eric-czech who wrote the code that inspired the creation of To seed the discussion for Friday, here are some concrete reasons why I think this documentation may be hard to maintain over the next few weeks. I do hope things will stabilize a bit by then and you will be a part of these discussions to help us get the project in shape for user-oriented documentation. Minimal Numpy example
Load from MalariaGEN Zarr
Load from VCF
|
Hey @hammer, don't worry. I don't take anything through text too seriously, as is a general best practice on the internet. ;-) I would like to see sgkit avoid some of the pitfalls of other projects I've worked on, where it's impossible to contribute anything so people simply go elsewhere if they have time to volunteer. I personally prefer to document early with lots of "hey its still early disclaimers". In my experience it tends to never get done otherwise, and if you change variable names I can switch those around with sed anyways. With that said I can certainly see the other side that it's an extra thing to maintain. I don't actually mind either way, because I tend to write these things up as I work through a problem or understanding a concept. I'll leave this here for a bit and go work on other things. |
Hi @jerowe, I had another thought for you on design philosophy as it relates to these examples and documentation in general. This is a bit reductive but there are two opposing perspectives that have come up often in the API discussions:
Perspective 1 is certainly more common (e.g. scikit-allele) and I think it lends well to these kinds of examples as well as documentation formalized around data structure models. Perspective 2 would suggest that all genetic data structures are some manifestation of an Xarray dataset and that there is nothing particularly special about a combination of variables like the one implied in I've been a proponent of perspective 2 for a number of reasons and one of them is that when you look at the individual quantities (i.e data variables) required across a collection of statgen methods like this, there isn't a clear domain-specific data structure (or smallish set of data structures) that will satisfy most or all of them without being inflexible for others. tl;dr There are two competing design philosophies that would lead to two nearly orthogonal documentation strategies. One of them is easier to learn, provides more concise access to fewer capabilities, and has greater alignment with docs/examples like this. The other is easier to maintain, facilitates more flexible workflows, and more closely resembles recent genetics toolkits for distributed systems -- at the cost of being less friendly to contributors. I've thought about trying to attempt some documentation myself a few times and I end up dropping it because we should figure out how to support all of the above first in addition to what @hammer mentioned. Separating core and user APIs is a promising path forward IMO (https://github.com/pystatgen/sgkit/issues/25#issuecomment-656093123). |
Should I close this out or leave it up for historical purposes? There's some good discussions in here, but I don't think the PR is useful ATM. Thoughts? |
I think we can probably close the PR for tidyness sake, thanks @jerowe, but it'd be good to keep the branch around so that we can view the content later (I'm assuming github doesn't keep contents of a PR available if you delete the branch?). |
Branch stays, and the notebooks are all on @alimanfoo 's datalab instance anyways. |
I thought it would be nice to have some docs that conceptually describe what the Genotype Call Xarray Dataset is all about.
The intended audience is a scientist who is familiar with variant data, with at least some programming ability, but does not assume they are familiar with numpy, Xarray in general, or the Genotype Call Dataset format in particular.