Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Standalone Keras Repository #202

Merged
merged 23 commits into from
Mar 4, 2020
Merged
Changes from 11 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
26cafee
Merge pull request #1 from tensorflow/master
qlzh727 Feb 5, 2020
7af3f0b
Initial commit for RFC standalone Keras repository
qlzh727 Feb 6, 2020
0145eea
Update wording and nits.
qlzh727 Feb 6, 2020
0405d80
Update 20200205-standalone-keras-repository.md
fchollet Feb 6, 2020
7064794
Merge pull request #2 from fchollet/patch-1
qlzh727 Feb 6, 2020
3cce978
Update the RFC nubmer
qlzh727 Feb 6, 2020
3f0dc92
Update rfcs/20200205-standalone-keras-repository.md
qlzh727 Feb 10, 2020
0109f0f
Update rfcs/20200205-standalone-keras-repository.md
qlzh727 Feb 10, 2020
b654b44
Update tf to Keras usage with more details.
qlzh727 Feb 10, 2020
7c57540
Change to use the list view for comparison.
qlzh727 Feb 10, 2020
86d2c37
Update the section for PIP package dependency.
qlzh727 Feb 10, 2020
1504e52
Update the preferred PIP package version.
qlzh727 Feb 10, 2020
f4af07b
Update more details about the Github migration.
qlzh727 Feb 10, 2020
6872453
Update wording wrt to pure python project.
qlzh727 Feb 18, 2020
7144521
Update the sectoin for TF to keras dependency.
qlzh727 Feb 18, 2020
ecbc169
Address the comment from @apassos
qlzh727 Feb 25, 2020
ce8a20f
Update the states for python change history.
qlzh727 Feb 25, 2020
9123f7f
Updating the section for CI and presubmit.
qlzh727 Feb 25, 2020
11f3828
Update style and fix typo.
qlzh727 Feb 25, 2020
5de7dfe
Adding a section for Alternative considered.
qlzh727 Feb 25, 2020
132f542
Fix typo for LasyLoader
qlzh727 Feb 26, 2020
3cb07db
Update the RFC wrt to the design meeting outcome.
qlzh727 Mar 4, 2020
1343f17
Update the RFC status to 'Accepted'
qlzh727 Mar 4, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
361 changes: 361 additions & 0 deletions rfcs/20200205-standalone-keras-repository.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,361 @@
# Standalone Keras Repository

| Status | Proposed |
:-------------- |:---------------------------------------------------- |
| **RFC #** | [202](https://github.com/tensorflow/community/pull/202) |
| **Author(s)** | Qianli Zhu ([email protected]), Francois Chollet ([email protected]) |
| **Sponsor** | Karmel Allison ([email protected]) |
| **Updated** | 2020-02-05 |

## Objective

Move the Keras code from the TensorFlow main GitHub repository to its own
repository, with TensorFlow as a dependency.

## Motivation

### Build times

Building the open-source TensorFlow project end-to-end is an extensive exercise.
With a standard GCP instance, it might take more than one hour to finish the whole
build process (it might take longer with a Mac laptop). Although the local build
cache might help speed up the follow-up builds, the initial time cost is too
high for regular software development workflows. Internally, Google has a
distributed build and caching service, which Googlers heavily rely on,
that can build TensorFlow and run all Keras tests within 5 mins. Sadly,
we can't expose this to external contributors.

Currently, any contribution to Keras code will require building all of
TensorFlow, which is quite expensive to do for average users.
Having a separate repository will allow the Keras package to be built
without building TensorFlow. This should greatly improve the
velocity of open-source developers when they contribute to Keras code.

### Community Benefit

The difficulty of building TensorFlow from scratch in order to make a PR
to Keras code has been a significant source of issues:

* It discouraged contributions, since many external developers couldn't test
their changes and make sure they were correct.
* External developers would send unverified PRs, and Google reviewers spend time back
and forth, fixing the PR. Sometimes PR is just not moving forward because of the
lengthy feedback loop.

With the new standalone Keras repository, external contributors
should experience much shorter turn-around time when
building/testing Keras, since they don't need to build TensorFlow anymore.
This should have a positive impact on building a vibrant open-source
developer community.

In addition, by getting the Keras team at Google to start developing Keras
using the same public tools and infrastructure as third-party developers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would split this point into multiple sub-points, because I believe that this argument is not sufficiently highlighted.

  • Github is not made to manage monorepos like tensorflow. Keras issues and pull requests get mixed up with the tensorflow core ones.
  • The source of truth being in github is very important because it allows the community to manage the repo. Currently, some commits pop up without any public pull request or public user, making it impossible for non-googlers to understand what is the rational behind a commit, or even who to contact if it caused a bug: e.g. tensorflow/tensorflow@9698ae1
  • As long as keras is in tensorflow, it's hard for us to have any impact on the processes/tests/tools that keras use internally. Imagine for example that we want to use Isort, or pytest, or github actions... In all cases it's a lot more work (code) and a lot more people who need to review and give us a green light. With a repository with the size of tensorflow, it's just unfeasable unless we have a face-to-face meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keras issues and pull requests get mixed up with the tensorflow core ones.

before the split, we had the same issue. Not sure how to fix that.

we make the development process more transparent and more community-oriented.

### TensorFlow API modularity

There are other side-benefits if we split the repository. Currently, Keras
has to rely on a number of private TensorFlow APIs. However, a litmus test
of the quality of the public TensorFlow low-level APIs is that they should
be strictly sufficient to a higher-level API like Keras.
After splitting the repository, Keras will have to import TensorFlow and
rely exclusively on public APIs. If Keras still ends up using TensorFlow
private features, it might be an indication of tight coupling of
implementation details. If certain private features are extensively used,
we might want to consider exposing them as public low level API.

This design is also aligned with the design for
[Modular TensorFlow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md),
which splits the TensorFlow project into smaller components that are not
tightly coupled together.


## Design Proposal

### New location of the code

GitHub: the code will live at [keras-team/keras](https://github.com/keras-team/keras),
joining the other Keras SIG projects and replacing the current external Keras
codebase. `tf.Keras` will also replace Keras on PyPI.

Also considered: `tensorflow/keras`.

Pros:
1. Under the umbrella of Keras SIG, which hosts all other Keras related projects
like keras-application, KerasTuner etc.
1. Lots of existing followers on keras-team, who may not be easily migrated to
TF project.
1. Can't easily delete keras project, which already have tons of stars and
incoming reference links. Continued existence of external Keras code will create
confusion ("why is there tensorflow/keras AND keras-team/keras?").

Cons:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does keras will be the only Tensorflow "core project"/sig that is external to Tensorflow/?
What is the storytelling here?
I suppose the other one is MLIR but it is not only dedicated to Tensorflow and now it is under LLVM that is a standalone foundation so the rationale is different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fchollet here.

1. The repo isn't under the same organization as tensorflow, which makes it hard
to manage issues/PRs and references across the organization.
1. Existing issue/PR under the same org can be transferred easily, but not cross the different org. See [here](https://help.github.com/en/github/managing-your-work-on-github/transferring-an-issue-to-another-repository).

### Source of Truth

TensorFlow uses a Google-internal code repository as its source of truth. Every PR
submitted though GitHub is converted to a Google-internal change first,
submitted through the internal system, and then copied to GitHub as commits.
At the same time, PR is marked as merged with the corresponding commit hash.

Likewise, issue tracking and code review takes place through Google-internal tools.

For Keras, since we are trying to promote community engagement, we hope to use
GitHub as source of truth. This will have the following implications:

* We expect the majority of the code development/contribution from GitHub
and the dev tools / tests / scripts should focus on the GitHub development use
case. See below for more details.
* Keras CI/presubmit build for the GitHub repo should target a stable PIP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the disadvantage of slowing down keras's adoption of new TF features and APIs.

This document should clarify when would CI against tf stable be expected to break, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also important to highlight the release process for TF)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not supposed to slow down adoption of new features compared to targeting an unpinned nightly all the time. Maybe it's not clear in the document, but if a pull request requires a new feature in tensorflow nightly, the person making the pull request can change the targeted version of tensorflow in the CI. So we don't have to wait at all.

Do I get it right or you were referring to another problem?

version of tensorflow package as dependency. It could either be a `tf-nightly`
with explicit version, or a release candidate version, or a stable version.
Depend on a floating `tf-nightly` could cause CI build to be instable, which has
been observed in other repository
[like tf-addons](https://github.com/tensorflow/addons/pull/912).
* The Keras code will be mirrored to a Google-internal code repository via
Google-internal tools within a very short time window after each change.
The Google-internal CI tests will run on HEAD for both Keras and TF code.
* The CI build for the repository on GitHub might break when it points to a
new version of `tf-nightly`, if certain behavior has been changed and wasn't
caught by unit tests. We have observed a few similar cases with
[tf/addons](https://github.com/tensorflow/addons).
We hope this can be reduced by stronger unit test coverage by Google internel
systems, when both TF and Keras code are tested at HEAD.
* pip package management. Keras will now follow the `tf-estimator` approach.
"pip install tensorflow" should also install Keras (from PyPI) as well.
There are more details for the pip package in the
[Improved pip package structure](https://github.com/tensorflow/community/pull/182) RFC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also considerconda this time? Anaconda is a huge part of Data Science and ML ecosystem and tf.keras not being a part of it doesn't feel right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gunan from TF build team, and also @seanpmorgan who mentioned this in the SIG-addon meeting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this item on the design review meeting. So far this is not active target we are supporting. The meeting conclusion are:

Conda uses a different toolchain, with no ABI compatibility between TF+Conda. Because we don’t have C interfaces, any custom ops built for TF will not work for Conda, and vice versa. There is a set of people who publish TF for Conda. But tf addons, io, etc. are not published for Conda. There’s no easy way around this because it would require doubling the TF release burden entirely.
Keras in theory should be easier? Insofar as it is Python only (currently). But Keras will in the future include C(++). That should be fine as long as there is a Python interface to the exposed parts, or C parts are used internally.


### Dependency Cleanup

As the high-level API of TensorFlow, Keras should have a direct dependency on
TF low-level APIs, but not the other way around. Unfortunately, there is some existing reverse
logic in the TF code that relies on Keras, which we should update/remove
when we split the repository.

The current usage of Keras from TensorFlow are:
* Unit tests, which should be converted to integration tests, or port the tests
to Keras repository.
* `feature_column`, which uses Keras base layer and model.
* Legacy `tf.layers` in v1 API, which uses Keras base layer as base class.
* legacy RNN cells, which uses Keras serialization and deserialization.
* TPU support code for `optimizer_v2`.
* TF Lite for keras model saving utils.
* Aliases from tf.losses/metrics/initializers/optimizers in tf.compat.v1.

All Keras imports in integration tests can be changed to use dynamic import like below:

```python
try:
from tensorflow.python.keras.engine import base_layer
except ImportError:
tf.logging.error('keras is not installed, please pip install keras')
base_layer = None
```

### Update Keras to only use public TF APIs

The current Keras code will still work if we do e.g.:
```python
from tensorflow.python.ops import array_ops

ones = array_ops.ones([2, 3])
```

However, since Keras is a separate repository, having it only use TF
public APIs will heavily reduce the chance of breakage caused by relying
on private methods or implementation details. We think this point is
critial to the health of the project. This also allows TF to change internal
implementation details without worrying about breaking Keras.

The converted code should look like e.g.:

```python
import tensorflow as tf

ones = tf.ones([2, 3])
```

During this conversion, we might notice that certain TF features used in Keras are
not public. A decision should be made on a case-by-case basis:

* Copy the functionality from TF to Keras.
* Replace the usage with another alternative TF public API.
* Make the functionality a new TF public API.

**Note that the open-source community is encouraged to contribute to this effort.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice description of the problem. Since there is a lot of decision making going on there, I believe it would be nice for people in charge of decisions to open issues describing what they want to do for all the unwanted imports we'll have. It's hard to help when you don't know what will be decided to resolve the problem (here option 1? 2? or 3?).

If we don't do that, I believe the open-source community won't help much just because of a lack of guidance.


### Two-stage change process

For any change that is affecting both TensorFlow and Keras, the change
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we have again the cache sharing problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you detail what is the cache you are mentioning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to be split into two, one as a PR to the TF repo,
and the other as a PR to the Keras repo. Here are some common scenarios:

1. Adding a new feature to TensorFlow, and having Keras rely on it. Note that the
TF change needs to be submitted first, and the Keras PR needs to wait for the new TF
nightly to become available on PyPI.

Also note that any rollback of the TF PR will cause Keras to break, the
rollback sequence should be PR 33333 and then PR 22222 (see example below).
The Google-internal test for TF should catch the error if the rollback sequence
is not correct.

```python
# Existing scenario.
# PR 11111 (2 files updated)
# +++ tensorflow/python/ops/array_ops.py
def some_new_function(inputs):
...

# +++ tensorflow/python/keras/layers/core.py

class new_layer(Layer):

def call(inputs):
array_ops.some_new_function(inputs)
...
```

```python
# New scenario.
# PR 22222 (1 file updated)
# +++ tensorflow/python/ops/array_ops.py
@tf.export('some_new_function')
def some_new_function(inputs):
...

==================================
# PR 33333 (1 file updated)
# +++ tensorflow/python/keras/layers/core.py

class new_layer(Layer):

def call(inputs):
tf.some_new_function(inputs)
...
```

2. Changing the behavior of an existing TF API.

Note that the PR 22222 needs to be submitted with both the new and old
function since Google internal CI is still testing from HEAD.
The previous function can be
deleted after PR 33333 is submitted. Also note that this issue is caused by
Keras not using exclusively public TF API, but relying on TF implementation details.
Moving towards only using public APIs should reduce the likelihood of this kind of issue.

```python
# Existing scenario.
# PR 11111 (2 files updated)
# tensorflow/python/ops/array_ops.py
<<<
def existing_function(inputs):
...
>>>
def new_function(inputs, knob1=False, knob2=1):
...
# tensorflow/python/keras/layers/core.py

class existing_layer(Layer):

def call(inputs):
<<<
array_ops.existing_function(inputs)
>>>
array_ops.new_function(
inputs,
knob1=True,
knob2=3)
```

```python
# New scenario.
# PR 22222 (1 file updated)
# tensorflow/python/ops/array_ops.py
<<<
def existing_function(inputs):
...
>>>
def existing_function(inputs):
return new_function(
inputs,
knob1=False,
knob2=1)

def new_function(inputs, knob1, knob2=1):
...

==================================
# PR 33333 (1 file updated)
# tensorflow/python/keras/layers/core.py
class existing_layer(Layer):

def call(inputs):
<<<
array_ops.existing_function(inputs)
...
>>>
array_ops.new_function(
inputs,
knob1=True,
knob2=3)
```


### Performance Implications

There may be some performance implications as we move towards only using
public TF APIs. We need to maintain a benchmark to ensure that there
is no performance regression.

### Dependencies

The TensorFlow pip package will auto-install the Keras package, which shouldn't make
any difference on the end-user side. Under the hood, Keras will be a different
package imported by `tf_core`, like what we do for TF estimator.

### Developer experience Impact

* The local build and test times should be greatly reduced, since compiling TF
is no longer needed, and Keras is a pure-Python project.
* Cross-boundary changes will require some extra handling since such changes
needs to be split into two or more PRs. Same for rollbacks.
* Tooling on the GitHub side (for code review, etc.) is not as good as
Google-internal tools. This may impact the develoment velocity for
Keras team members at Google.

### Best Practices, Tutorials and Examples

* The new Keras repository should have a new contribution guide about how to
setup a local test environment and iterate based on that. A similar one in
tf/addons can be used as an example.
* The existing TF docs needs to be updated to highlight that Keras code now lives
in a different repository, with a new process for sending PRs, etc.
* When filing an issue, people might need to consider where to send the issue,
e.g. is it a Keras issue or an issue caused by TF but surfaced by Keras. The
different ownership of the repository will also cause difficulties for
transferring the issue.

### User Impact

* No end-user facing change for current TF users; the split would only affect
developers, e.g. in-flight PRs during the transition period.
* For current Keras pip package users, they will get the new TF keras package
when they update their pip, which should have more features than the current
Keras-team version.

## Questions and Discussion Topics

1. Tools for issue tracking: we can't rely on Google-internal bug tracking tool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This another critical point that need to be solved. We cannot have a weaker issue tracking than TF. Check the full thread at #29

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the notice. I will check with infra team and support team to see if they have any suggestions for improving this situation. Will leave this open for discussion in the design meeting.

since it's not publicly visible, also if managing GitHub issues across the orgs
is hard, we might need to find some alternatives for tracking bugs/features etc.
2. OSS tests for TPU-related code. Since TPUs are not available during local
testing, the verification will have to be done when the PR is mirrored to Google's
internal systems.
3. Transition period for moving the Keras code from `tensorflow/tensorflow` to
`keras-team/keras`. All in-flight PRs / issues will be affected: they need
to be copied to `keras-team/keras`, or if they also touch TensorFlow, then they
need to split into two.