xargs interleaving bytes, resulting in invalid wof.extract #142

Closed
missinglink opened this issue Oct 8, 2018 · 4 comments
missinglink commented Oct 8, 2018

It seems that #134 introduced a bug where the output of xargs processes running in parallel can be interleaved, resulting in invalid JSON when running wof_extract.sh.

I have been seeing a bunch of errors such as Unexpected token { in JSON at position 3657 in the logs. Using jq, I can confirm that the file is corrupt:

$ cat wof.extract | jq . >/dev/null
parse error: Expected separator between values at line 4187, column 9

An example of interleaving can be found in the extract below, near name:por_x_preferred:

"name:pam_x_preferred":["Seattle"],
"name:pap_x_preferred":["Seattle"],
"name:per_x_preferred":["سیاتل"],
"name:pms_x_preferred":["Seattle"],
"name:pnb_x_preferred":["سیاٹل"],
"name:pol_x_preferred":["Seattle"],
"name:por_x_preferred":[{"geom:area":0.000228,"geom:bbox":"-117.293484,47.678915,-117.26234,47.691216","geom:latitude":47.685772,"geom:longitude":-117.280507,"gn:population":1786,"iso:country":"US","lbl:latitude":47.685512,"lbl:longitude":-117.283914,"mz:is_current":1,"name:ara_x_preferred":["ميلوود"],"name:azb_x_preferred":["میلوود، واشینقتون"],"name:bul_x_preferred":["Милуд"],"name:cat_x_preferred":["Millwood"],"name:dut_x_preferred":["Millwood"],"name:eng_x_preferred":["Millwood"],"name:fas_x_preferred":["میلوود، واشینگتن"],"name:fre_x_preferred":["Millwood"],"name:ger_x_preferred":["Millwood"],"name:hat_x_preferred":["Millwood"],"name:hbs_x_preferred":["Millwood"],"name:hrv_x_preferred":["Millwood"],"name:ita_x_preferred":["Millwood"],"name:mlg_x_preferred":["Millwood"],"name:nan_x_preferred":["Millwood"],"name:nld_x_preferred":["Millwood"],"name:per_x_preferred":["میلوود، واشینگتن"],"name:pol_x_preferred":["Millwood"],"name:por_x_preferred":["Millwood"],"name:spa_x_preferred":["Millwood"],"name:srp_x_preferred":["Милвуд"],"name:unk_x_variant":["Woodward's"],"name:uzb_x_preferred":["Millwood"],"name:vol_x_preferred":["Millwood"],"qs:pop":0,"wof:hierarchy":[{"continent_id":102191575,"country_id":85633793,"county_id":102087555,"locality_id":101730019,"region_id":85688623}],"wof:id":101730019,"wof:name":"Millwood","wof:parent_id":102087555,"wof:placetype":"locality","wof:population":1786,"wof:superseded_by":[]}

cc/ @Joxit, could you please confirm the issue? I'd like to find a fix for xargs; if that's not possible, we might need to revert to parallel, because it has flags for this (see the sketch after the man page excerpt below):

From the GNU parallel man page:

--line-buffer
--lb
Buffer output on line basis. --group will keep the output together for a whole job. --ungroup allows output to mixup with half a line coming from one job and half a line coming from another job. --line-buffer fits between these two: GNU parallel will print a full line, but will allow for mixing lines of different jobs.

--line-buffer takes more CPU power than both --group and --ungroup, but can be much faster than --group if the CPU is not the limiting factor.

Normally --line-buffer does not buffer on disk, and can thus process an infinite amount of data, but it will buffer on disk when combined with: --keep-order, --results, --compress, and --files. This will make it as slow as --group and will limit output to the available disk space.

With --keep-order --line-buffer will output lines from the first job while it is running, then lines from the second job while that is running. It will buffer full lines, but jobs will not mix. Compare:

  parallel -j0 'echo {};sleep {};echo {}' ::: 1 3 2 4
  parallel -j0 --lb 'echo {};sleep {};echo {}' ::: 1 3 2 4
  parallel -j0 -k --lb 'echo {};sleep {};echo {}' ::: 1 3 2 4
See also: --group --ungroup
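If we do end up reverting, something along these lines might work (a hypothetical sketch only; the find pattern, paths, and the jq step stand in for whatever wof_extract.sh actually does per file):

  # hypothetical: --line-buffer emits whole lines only, so records from
  # different jobs can reorder but never split mid-line
  $ find /data/whosonfirst -name '*.geojson' -print0 \
      | parallel -0 --line-buffer -j8 jq -c . {} \
      >> /data/placeholder/wof.extract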
@missinglink

Using the docker containers, so whatever versions we have installed in there; maybe we can upgrade xargs?
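For example, to check what the container currently ships (the image name is a guess; GNU xargs supports --version, busybox's applet may not):

  $ docker run --rm pelias/placeholder xargs --version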

@missinglink

FYI, the bug is intermittent:

$ pelias prepare placeholder
Creating extract at /data/placeholder/wof.extract
Done!
import...
populate fts...
optimize...
close...
Done!



$ pelias prepare placeholder
Creating extract at /data/placeholder/wof.extract
Done!
import...
invalid json SyntaxError: Unexpected token g in JSON at position 17200
    at JSON.parse (<anonymous>)
    at DestroyableTransform._transform (/code/pelias/placeholder/lib/jsonParseStream.js:7:23)
    at DestroyableTransform.Transform._read (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_transform.js:184:10)
    at DestroyableTransform.Transform._write (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_transform.js:172:83)
    at doWrite (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:428:64)
    at writeOrBuffer (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:417:5)
    at DestroyableTransform.Writable.write (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:334:11)
    at DestroyableTransform.ondata (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_readable.js:619:20)
    at emitOne (events.js:116:13)
    at DestroyableTransform.emit (events.js:211:7)
invalid json SyntaxError: Unexpected token B in JSON at position 0
    at JSON.parse (<anonymous>)
    at DestroyableTransform._transform (/code/pelias/placeholder/lib/jsonParseStream.js:7:23)
    at DestroyableTransform.Transform._read (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_transform.js:184:10)
    at DestroyableTransform.Transform._write (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_transform.js:172:83)
    at doWrite (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:428:64)
    at writeOrBuffer (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:417:5)
    at DestroyableTransform.Writable.write (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_writable.js:334:11)
    at DestroyableTransform.ondata (/code/pelias/placeholder/node_modules/readable-stream/lib/_stream_readable.js:619:20)
    at emitOne (events.js:116:13)
    at DestroyableTransform.emit (events.js:211:7)
populate fts...
optimize...
close...
Done!

Joxit commented Oct 9, 2018

I can confirm the issue; it occurs quite often on a hard drive, less often on my SSD :/

Working on it
