[Java][Docs] Document environment variables/java properties #22595
Comments
Ji Liu / @tianchen92: I just noticed that something in this file should also be updated like 'Java Code Style Guide' |
Micah Kornfield / @emkornfield: |
Ji Liu / @tianchen92: One more question: why should the JVM param be set for JVM >= 9? I'm not quite familiar with this; it seems <io.netty.tryReflectionSetAccessible>true</io.netty.tryReflectionSetAccessible> is already in pom.xml, it doesn't work?
|
Micah Kornfield / @emkornfield:
"<io.netty.tryReflectionSetAccessible>true</io.netty.tryReflectionSetAccessible> is already in pom.xml, it doesn't work?" The only reference I found in pom.xml was for test execution. I think consumers of the library have to set this property themselves when running the JVM. But my Maven knowledge is weak, so I might be misunderstanding something. |
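To make the point about consumers concrete, here is a minimal sketch of a startup check an application could run before touching any off-heap code. The class and method names (`NettyReflectionFlagCheck`, `featureVersion`, `flagMissing`) are hypothetical and not part of Arrow or Netty; only the property name `io.netty.tryReflectionSetAccessible` and the "JVM >= 9" condition come from this thread.

```java
// Hypothetical helper, not part of Arrow or Netty: warns when the Netty
// reflection flag discussed in this issue has not been passed to the JVM.
class NettyReflectionFlagCheck {
    static final String PROP = "io.netty.tryReflectionSetAccessible";

    // Maps "1.8" -> 8, "9" -> 9, "11" -> 11, "17" -> 17.
    static int featureVersion(String javaSpecVersion) {
        String v = javaSpecVersion.startsWith("1.")
                ? javaSpecVersion.substring(2)
                : javaSpecVersion;
        int dot = v.indexOf('.');
        return Integer.parseInt(dot < 0 ? v : v.substring(0, dot));
    }

    // True when running on Java 9+ and the property is not set to "true".
    static boolean flagMissing(int javaFeatureVersion, String propertyValue) {
        return javaFeatureVersion >= 9 && !"true".equalsIgnoreCase(propertyValue);
    }

    public static void main(String[] args) {
        int version = featureVersion(System.getProperty("java.specification.version"));
        if (flagMissing(version, System.getProperty(PROP))) {
            System.err.println("Consider starting the JVM with -D" + PROP + "=true");
        }
    }
}
```

The check only reads the property; it does not try to set it at runtime, since whether a late `System.setProperty` call takes effect depends on when the relevant Netty classes are loaded.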
Ji Liu / @tianchen92: If you would like to provide a PR for this, I think the 'Java Code Style Guide' could also be updated (unused imports, redundant modifiers), or I can take this issue :) (If so, please let me know if there is other info that should be updated besides the above.) |
Micah Kornfield / @emkornfield: |
Ji Liu / @tianchen92: |
Jim Northrup: is there a charter for what java usecases will be supported, and THEN, what among these items will leverage NIO, and what among these can use pure heap implementations of objects exclusively? the utilities should abandon all hope of stability or useful benchmarks while there is a NIO component in a piece of code. the oracle engineers this year are certainly not on the same page as the jdk8 team, or the jdk6 team. Unsafe/NIO usecases number about 2: if you're utilizing mmap files to minimize page faults, go there. |
Micah Kornfield / @emkornfield: I don't fully understand this question. My best attempt to answer it is below. The system property is needed because we use Netty as an off-heap memory allocator; this could potentially be replaced with something JNI-based. The core of the current Java implementation is off-heap memory. If you have specific requirements or use cases in mind, discussing them on the dev@ or user@ mailing list is probably the way to go.
Could you provide a link to the text you quoted? I'd be interested in reading it.
|
Micah Kornfield / @emkornfield: |
Micah Kornfield / @emkornfield: |
the process cleanup of the underlying OS will be the best protection against java NIO/JNI memory handles – if you have a daemon or long-running thing, or you must use directbuffers, assume that the reference counting is imperfect, and it will bite you one day (it may take days) if you trust it. so things that use nio should be short lived, and wherever possible process encapsulated. netty is the jboss-endorsed c10k java representative with the popular marketshare. iiuc arrow is a team that picked up netty derived off-heap tools naively and demonstrated that in 2019 it's still prone to some gotchas that are a little bit stronger than edge cases when the unit tests pass. indeed, my initial testing with writing jdbc to arrow on kilobytes of records succeeded well, and gave me the confidence to assume this will do the job faster than python. and so began this thread on 800+ megabytes of data. considering the age and size of the netty ecosystem, there is no lack of scrutiny or open source virtue here. it's a VM-level weakness that java NIO is still something like peanuts in the kitchen, you should really put a consumer facing notice on where NIO is and is not present. |
Micah Kornfield / @emkornfield: It is true the Java Arrow library has a steep learning curve, and could use better documentation so new developers aren't bitten. There has also been less focus on the non-core Java libraries (i.e. adapters) until recently, and we need to do something to distinguish the maturity between them so these types of things are less surprising. If you have suggestions, please let us know. I would suggest perhaps sending mail to the dev@ or user@ mailing lists, since generally more people monitor those than conversations on JIRA. FWIW, the core library was adapted from Apache Drill and used by Dremio in their product, both of which, iiuc, are long-running processes that provide competitive analytic performance (I don't know how prone to resource leakage they are).
"and gave me the confidence to assume this will do the job faster than python. and so began this thread on 800+ megabytes of data." I'm sorry you ran into this. If you are working in the Python ecosystem, I think Turbodbc might be your best bet for getting data into Arrow. In general, most of the Python code is just a facade on top of C++, so I would expect it to be pretty performant. Please discuss on the mailing list or continue to file JIRAs if you are seeing unexpected performance/behavior. We want to know.
"you should really put a consumer facing notice on where NIO is and is not present." Would you mind opening up a JIRA/Pull Request describing how you think it is best to publicize it?
|
Jim Northrup:
I admire Arrow for doing a thing well. I hope that if I simply call “mvn maven-versions-plugin:latest” in the future this simple jdbc code will work better than before.
I appreciate the attention to the details.
I think through this discussion the gist is that tensorflow one-hot columns may quickly test the expected norms of arrow. Likewise, timeseries datasets have us blowing gaskets all over the place in terms of time-to-completion and RAM using pandas. What do we do with a 300 gig numpy dataset living in swap that takes 3 days to build? There are no LSTM examples to demonstrate anything but toy datasets.
Turbodbc looks like a good fit for reducing transcription times.
For what I need in the space of Arrow, I think the ideal tool is something to work in and out of numpy and delegate to and from apache Geode or Hazelcast as the main substrate.
If perchance arrow can act as a window to memory grids, all the better.
As I find the time for signups and 2FAs, I will write this up for the lists
|
This shouldn't be too complicated, all you have to do is send an e-mail to [email protected] |
Specifically, "-Dio.netty.tryReflectionSetAccessible=true" for JVMs >= 9 and BoundsChecking/NullChecking for get.
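As an illustration of what such documentation might cover, the sketch below assembles the JVM flags a consumer would pass on the command line. The helper `ArrowJvmFlags.jvmFlags` is hypothetical. The two Arrow property names, `arrow.enable_unsafe_memory_access` (setting it to `true` turns bounds checking off) and `arrow.enable_null_check_for_get`, are given here as they appear in Arrow's `BoundsChecking` and `NullCheckingForGet` classes to the best of my knowledge; they are not stated in this issue, so verify them against the Arrow version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the -D flags discussed in this issue.
class ArrowJvmFlags {
    static List<String> jvmFlags(int javaFeatureVersion,
                                 boolean boundsChecking,
                                 boolean nullChecking) {
        List<String> flags = new ArrayList<>();
        if (javaFeatureVersion >= 9) {
            // Needed on JVMs >= 9 because Netty is used for off-heap allocation.
            flags.add("-Dio.netty.tryReflectionSetAccessible=true");
        }
        // Assumed property name; "unsafe access = true" means bounds checks are off.
        flags.add("-Darrow.enable_unsafe_memory_access=" + !boundsChecking);
        // Assumed property name; controls null checks in get methods.
        flags.add("-Darrow.enable_null_check_for_get=" + nullChecking);
        return flags;
    }

    public static void main(String[] args) {
        // Example: Java 11 with both checks disabled.
        System.out.println(String.join(" ", jvmFlags(11, false, false)));
    }
}
```

Passing these as command-line flags rather than setting them in code avoids depending on class-loading order, since such settings are typically read once when the corresponding classes initialize.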
Reporter: Micah Kornfield / @emkornfield
Assignee: Ji Liu / @tianchen92
PRs and other links:
Note: This issue was originally created as ARROW-6206. Please see the migration documentation for further details.