Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --input-dir and --output-dir as options #119

Merged
merged 2 commits into from
Apr 28, 2017

Conversation

rhuss
Copy link
Contributor

@rhuss rhuss commented Apr 20, 2017

  • Use these options when a whole directory hierarchy needs to be processed.
  • Extracted file processing logic in an extra process.go

Partial solution to #117

Bear with me, Golang is not my mother tongue (which was in fact Perl at those days, but is Java & JavaScript since some times). Looking forward to a review ;-)

@rhuss rhuss force-pushed the pr/input-dir branch 3 times, most recently from cd4b04f to 3608037 Compare April 20, 2017 18:09
process.go Outdated

// == Rendering ====================================================

func renderTemplate(g *Gomplate, inString string, outPath string) error {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather this function live in main.go. This file seems like a pretty good place to keep all of the file I/O functions, but this one doesn't really fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, will move it back :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only concern I have is a circular deps on the files then: main -> process -> main (or maybe I think here a bit to Javaesque). From a purity pov it should go in a dedicated render.go but thats then probably overkill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking again about it, imo renderTemplate in process.go is maybe not so a bad fit as it is the final step in 'process' (so process could io + render).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a circular deps on the files

not really an issue... to the Go compiler the files are almost irrelevant - everything in the directory is all part of the same flat namespace. Definitely a Java-esque thing - which Go isn't 😉

process.go Outdated
}

for n, input := range inputs {
if err := renderTemplate(g, input, outputs[n]); err != nil {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be ideal if this were done in parallel, using goroutines - I think it should be pretty straightforward...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably makes sense cpu-wise, but I wonder whether io-wise it reasonable to write in parallel to disk. My gut feeling is that a serial approach might be not worse here

But without a benchmark its hard to judge.

Should I still change this ? Also, this branch is only entered when using multiple --file options (not with the new --input-dir). So its not that many probably anyway.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, rendering templates aren't always pure disk IO - there's often network IO that blocks for far longer. So running these in parallel could be a big win.

I didn't realize this wasn't called with --input-dir - I haven't had a chance to look too closely yet 😉. I'd definitely rather all separate files be rendered in parallel, whether or not they come from --file or --input-dir.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo renderTemplate in process.go is maybe not so a bad fit as it is the final step in 'process' (so process could io + render)

Figuring out which files to read and reading them is IMO a separate concern from the render step. Admittedly, main.go has a lot of mixed concerns, but if a new file is created I'd rather it be fairly single-responsibility

process.go Outdated
func ensureDir(dir string, mode os.FileMode) error {
_, err := os.Stat(dir)
if err != nil {
if os.IsNotExist(err) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is unnecessary - os.MkdirAll will be a no-op if the path already exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the heads up

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 20, 2017
@@ -11,7 +11,7 @@ load helper

@test "errors with non-existant env var using .Env" {
gomplate -i '{{.Env.FOO}}'
[ "$status" -eq 2 ]
[ "$status" -eq 1 ]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this is because RunTemplate's signature has changed... I'd rather it stay the same to not break compatibility.

main.go Outdated
@@ -26,16 +27,13 @@ type Gomplate struct {
}

// RunTemplate -
func (g *Gomplate) RunTemplate(text string, out io.Writer) {
func (g *Gomplate) RunTemplate(text string, out io.Writer) error {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is public API, so the signature can't change. Please remove the error. panicing isn't the best way to error out, but it's not the end of the world, and I'd much rather keep API compatibility. I know of at least one user who bundles gomplate into a different tool and calls RunTemplate directly, so this would break them.

README.md Outdated
@@ -194,6 +194,18 @@ By default, `gomplate` will read from `Stdin` and write to `Stdout`. This behavi

You can specify multiple `--file` and `--out` arguments. The same number of each much be given. This allows `gomplate` to process multiple templates _slightly_ faster than invoking `gomplate` multiple times in a row.

##### `--input-dir` and `--output-dir`

For processing multiple templates in a directory you can use `--input-dir` and `--output-dir` together. In this case all files in input directory will be processed as templates and the resulting files stored in `--output-dir` (which must exist). The directory structure will be preserved.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which must exist)

I don't think that's true, is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here where it is asserted

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, commented about that in-line in process.go

@@ -0,0 +1,2 @@
one: eins
two: zwei
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing EOL here, no big deal though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

main.go Outdated

func validateInOutOptions(c *cli.Context) error {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is totally necessary, I really don't like that this has to exist... I've been thinking of migrating to github.com/jawher/mow.cli for a while and maybe it's time to do that. It can handle option exclusion pretty well, so this validation would be unnecessary. Maybe I'll work on migrating to mow.cli in the next little while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me know when and from where (branch) to rebase when you added it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might need to take this back. I started working on it in #122 but it seems to have trouble with taking - as an argument for --file or --out. I think it maybe makes sense to keep this code in for now and not block this PR on the other one.

@hairyhenderson
Copy link
Owner

Thanks for this, @rhuss - and sorry for the rough review so far - I want to make sure this feature is implemented well. I hope you're not taking any of my nitpicks and criticisms personally - I really appreciate the time you've put into this so far! 😄

Thinking about this further, I think I'd like --output-dir to be optional if --input-dir is provided. Essentially, it would default to the current working directory (.) in that case.

I think I'm also going to migrate the commandline flag processing to mow.cli to make the option validation a little easier. It'll be a pretty disruptive PR though, so it'll require some careful merging - sorry in advance!

process.go Outdated
func processInputDir(c *cli.Context, g *Gomplate) error {
inputDir := c.String("input-dir")
outDir := c.String("output-dir")
if _, err := assertDir(outDir); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hairyhenderson here is the assertion that the outDir must exist. I'm not sure if should we require this, but I think its common practice to require the top-level directory to exists. Otherwise typos in the usage can easily slip through leading to unexpected results.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm... 🤔 I think I'd rather create outDir if it doesn't exist. gomplate is usually used in scripts and typos would be caught pretty quickly, by virtue of the files not existing in the right location. Besides, --out will create the output if it doesn't exist, so it makes sense (to me anyway) that --out-dir would create the directory too.

Also it makes the code simpler, which is a good thing 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@rhuss
Copy link
Contributor Author

rhuss commented Apr 21, 2017

Thinking about this further, I think I'd like --output-dir to be optional if --input-dir is provided. Essentially, it would default to the current working directory (.) in that case.

ok, I can implement it that way.

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 21, 2017
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 21, 2017
@rhuss
Copy link
Contributor Author

rhuss commented Apr 21, 2017

Just added parallel file processing (both for the --file and the --input-dir case). I also did a small stupid benchmark with 1000 files a 40k each and 1000 template instructions (on an 16 GB Mac with 8 cores)

For the serial implementation I get:

time bin/gomplate --input-dir test/t/in --output-dir test/t/out --datasource config=test/t/config.yml

322.71s user 66.78s system 152% cpu 4:14.92 total

For the parallel implementation:

time bin/gomplate --input-dir test/t/in --output-dir test/t/out --datasource config=test/t/config.yml

797.23s user 151.41s system 734% cpu 2:09.22 total

So roughly a factor two for the total time (although more overhead for io and context switching I guess, there are 10k go routines created in this sample). Not so bad.

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 21, 2017
If not given, the default is the current working directory
Relates to PR hairyhenderson#119
@rhuss
Copy link
Contributor Author

rhuss commented Apr 21, 2017

Addressed all review comments so far. wrt. to the parallel processing: I had to introduce some locking around the cache access when processing the Datasource. Please review this as I have quite some general experience with threading but not with shared memory when using goroutines.

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 21, 2017
… success

Thanks to the awesome IT suite for triggering this
Relates to hairyhenderson#119
@rhuss
Copy link
Contributor Author

rhuss commented Apr 21, 2017

all green ;-) like the integration test suite, need to add some for --input-dir, too ...

@hairyhenderson
Copy link
Owner

So roughly a factor two for the total time (although more overhead for io and context switching I guess, there are 10k go routines created in this sample). Not so bad.

Interesting! I think it would be pretty unusual to be processing directories so large, but it's 👍 anyway.

data.go Outdated
}
cached, ok := d.cache[cacheKey]
if ok {
func (d *Data) ReadSource(fs vfs.Filesystem, source *Source, args ...string) (map[string]interface{}, error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is public API - can you find a way to accomplish this without changing the signature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it since it was necessary to add the creation of the JSON and YAML documents in the mutex protected section when doing parallel processing. Also, I think it makes sense to include the (potentially costly) deserialized objects into the cache so that this conversion needs to be called only once (and not for every call to the datasource). I don't consider this optimisation as premature since the call to the datasource can happen quite often also when doing only a single call.

I suggest adding a second method ReadSourceAsObjects (could be private, too) which is used by DataSource and leave an (unoptimised) ReadSource with the original signature. Then we could decide whether to increase the cache to hold both kinds of objects (with the cost of increasing memory usage) or we make ReadSource to not use a cache.

wdyt ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it makes sense to include the (potentially costly) deserialized objects into the cache so that this conversion needs to be called only once (and not for every call to the datasource). I don't consider this optimisation as premature since the call to the datasource can happen quite often also when doing only a single call.

If you're using the same datasource multiple times in a template, it makes sense to assign its output to a variable:

{{ $config := datasource "config" }}
hostname: {{ $config.hostname }}
port: {{ $config.port }}
...

So datasources should almost never be called multiple times in a template. It's non-obvious if someone is new to gomplate and go templates in general, but maybe that could be mitigated with documentation for now... Ultimately it would be nice to have some sort of subcommand that could lint the template, but that's a long way off 😉

@hairyhenderson
Copy link
Owner

wrt. to the parallel processing: I had to introduce some locking around the cache access when processing the Datasource. Please review this as I have quite some general experience with threading but not with shared memory when using goroutines.

Ok, just had a deeper look at the code, and I was a little confused about the cache lock at first, but now I understand why it's necessary.

I think I'm going to change my mind (sorry!) - let's not parallelize rendering. At this point it feels like premature optimization and I'm not sure I'm comfortable with the extra complexity in the code that it introduces...

@rhuss
Copy link
Contributor Author

rhuss commented Apr 22, 2017

@hairyhenderson what about the following suggestion: We keep the current login without synchronising but add an option --parallel which then would switch to the threaded version. This flag could be marked as experimental.

I think it should be possible to keep the current API and clearly separate those functions without much duplication.

Please let me know whether I should give it a try. Otherwise, I revert the code again (but let me know whether you want the cache optimisation mentioned above)

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 22, 2017
@hairyhenderson
Copy link
Owner

Please let me know whether I should give it a try. Otherwise, I revert the code again (but let me know whether you want the cache optimisation mentioned above)

Let's just revert the code for now, including the cache changes. I'd rather keep this PR smaller and focused solely on --input-dir and --output-dir.

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
If not given, the default is the current working directory
Relates to PR hairyhenderson#119
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
… success

Thanks to the awesome IT suite for triggering this
Relates to hairyhenderson#119
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 23, 2017
@rhuss
Copy link
Contributor Author

rhuss commented Apr 23, 2017

Reworked again and moved the optional parallel processing to #123 which is based on this PR. Think it still makes sense, and since it was quite a bit of work and testing I didn't want to simply remove it.

@rhuss
Copy link
Contributor Author

rhuss commented Apr 23, 2017

let me know whether i should do a git squash (so weed out the threading meander on this PR)

@hairyhenderson
Copy link
Owner

let me know whether i should do a git squash

Yes please!

}
return "."
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be accomplished by defaulting to . in the StringFlag, though that'd change the logic for validateInOutOptions...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that, but this doesn't work in the validation logic, which checks for whether the outputDir is set. if it has a default value, then it is always set. You then also can't distinguish between people setting "." explicitely (and so switching on the ferature) or whether it comes from the default.

Dont think its so bad here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You then also can't distinguish between people setting "." explicitely (and so switching on the ferature) or whether it comes from the default.

Not really sure that this matters - the feature should be dependent on --input-dir, no?

Copy link
Contributor Author

@rhuss rhuss Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you then error on

  • when --output-dir="." and --out="parsed-file.txt" is provided
  • when --output-dir="." is provided and no --input-dir is given

Of course I can remove these checks (or make them apply only when output-dir is not "." but that inconsistent imo).

Again, is it really that bad here ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - good points. There would be ways to apply this, but I see now that the logic would probably be uglier - let's just leave this as-is then 🙂

}

// read directory
entries, err := ioutil.ReadDir(input)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still want this put above the MkdirAll call. If something goes wrong reading it, I don't want the output dir to be left behind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we want to make this really atomar, then we should also check the whole traversal to have not partially filled (and remember which directories we created and which were already there).

But I can move that up (which however makes it a bit harder to refactor of course as this method is already close to the cyclomatic complexity limit of 10)

Please be sure whether you really want this up (don't want to make many turns yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved it.

process.go Outdated
if err != nil {
return nil, err
}
if !si.IsDir() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary - ioutil.ReadDir will error if given a non-directory. If you remove this block, this function can disappear and you can just call os.Stat directly on ln39

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

process.go Outdated
return os.OpenFile(filename, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0644)
}

func checkClose(c io.Closer, err *error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too clear what the point of this function is... Why not simply defer file.Close()? Nothing will be done with err so it doesn't seem too useful to assign it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not me, its the gometalinter which croaks if the error is not checked ;-).

Well, if *error is nil then it's set with the error from the Close() so it makes a little sense.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add // nolint: errcheck on the line to shut gometalinter up.

Well, if *error is nil then it's set with the error from the Close() so it makes a little sense.

Right, but after it's set, nothing actually reads that error... 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually you are right, but the calling method could initialize err with nil and then always return err (which in case of a Close() error would be filled before returning, otherwise it stays nil). That was missing.

But I agree that its an academic question, will remove that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks 🙂

@rhuss
Copy link
Contributor Author

rhuss commented Apr 25, 2017

squashed it

@hairyhenderson
Copy link
Owner

Ok! Thanks for your patience with all this - I have just one more request, then it'll be ready to go 😉 - can you write some integration tests for this? Just add some tests to test/integration/basic.bats, using the other tests as examples...

rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 28, 2017
rhuss added 2 commits April 28, 2017 09:56
All filese from --input-dir will be processed as templates and stored
with the same directory hierachy in --ouput-dir

- Use both options when a whole directory hierarchy needs to be processed.
- Extracted file processing logic in an extra process.go
- --output-dir is optional and default to "."
- --output-dir is created automatically if not existing

Fixes hairyhenderson#117

Signed-off-by: Roland Huss <[email protected]>
rhuss added a commit to rhuss/gomplate that referenced this pull request Apr 28, 2017
@rhuss
Copy link
Contributor Author

rhuss commented Apr 28, 2017

test added

@hairyhenderson
Copy link
Owner

@rhuss thank you! This is ready to merge now - thanks again for your patience with this 🙂

@hairyhenderson hairyhenderson merged commit 082cba0 into hairyhenderson:master Apr 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants