Initial import #1

Merged
merged 106 commits into main from the import branch
Jun 1, 2022
Conversation

lidavidm
Member

@lidavidm lidavidm commented Jun 1, 2022

No description provided.

jacques-n and others added 30 commits June 1, 2022 16:10
So far this only adds successful compilation on Windows; the tests don't
run yet.

Author: Uwe L. Korn <[email protected]>

Closes #213 from xhochy/ARROW-202 and squashes the following commits:

d5088a6 [Uwe L. Korn] Correctly reference Kudu in LICENSE and NOTICE
72a583b [Uwe L. Korn] Differentiate Boost libraries based on build type
6c75699 [Uwe L. Korn] Add license header
e33b08c [Uwe L. Korn] Pick up shared Boost libraries correctly
5da5f5d [Uwe L. Korn] ARROW-202: Integrate with appveyor ci for windows
… not implicitly skip

I have

```
$ py.test pyarrow/tests/test_hdfs.py
================================== test session starts ==================================
platform linux2 -- Python 2.7.11, pytest-2.9.0, py-1.4.31, pluggy-0.3.1
rootdir: /home/wesm/code/arrow/python, inifile:
collected 15 items

pyarrow/tests/test_hdfs.py sssssssssssssss
```

But

```
$ py.test pyarrow/tests/test_hdfs.py --hdfs -v
================================== test session starts ==================================
platform linux2 -- Python 2.7.11, pytest-2.9.0, py-1.4.31, pluggy-0.3.1 -- /home/wesm/anaconda3/envs/py27/bin/python
cachedir: .cache
rootdir: /home/wesm/code/arrow/python, inifile:
collected 15 items

pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_close PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_download_upload PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_file_context_manager PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_ls PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_mkdir PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_orphaned_file PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_read_multiple_parquet_files SKIPPED
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_read_whole_file PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_close PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_download_upload PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_file_context_manager PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_ls PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_mkdir PASSED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_read_multiple_parquet_files SKIPPED
pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_read_whole_file PASSED
```

The `py.test pyarrow --only-hdfs` option will run only the HDFS tests.
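The opt-in behavior shown above (all tests skipped by default, run with `--hdfs`) can be sketched with a minimal `conftest.py`. This is an illustrative sketch, not the actual Arrow conftest; the marker name and help text are assumptions.

```python
# conftest.py -- sketch of an opt-in pytest flag, modeled on the --hdfs
# option described above. Names other than --hdfs are illustrative.
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag; it defaults to off, so HDFS tests skip.
    parser.addoption("--hdfs", action="store_true", default=False,
                     help="run tests that require a live HDFS cluster")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--hdfs"):
        return  # opt-in flag given: run everything as collected
    skip_hdfs = pytest.mark.skip(reason="need --hdfs option to run")
    for item in items:
        # Any test marked (or named) hdfs gets a skip marker attached.
        if "hdfs" in item.keywords:
            item.add_marker(skip_hdfs)
```

With this in place, a plain `py.test` run reports the HDFS tests as `s` (skipped), matching the first transcript, while `py.test --hdfs` runs them.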

Author: Wes McKinney <[email protected]>

Closes #353 from wesm/ARROW-557 and squashes the following commits:

52e03db [Wes McKinney] Add conftest.py file, hdfs group to opt in to HDFS tests with --hdfs
This supersedes apache/arrow#467

This is ready for review. Next steps are
- Integration with the arrow CI
- Write docs on how to use the object store

There is one remaining compilation error (it doesn't find Python.h for one of the Travis configurations; if anybody has an idea what is going on, let me know).

Author: Philipp Moritz <[email protected]>
Author: Robert Nishihara <[email protected]>

Closes #742 from pcmoritz/plasma-store-2 and squashes the following commits:

c100a453 [Philipp Moritz] fixes
d67160c5 [Philipp Moritz] build dlmalloc with -O3
16d1f716 [Philipp Moritz] fix test hanging
0f321e16 [Philipp Moritz] try to fix tests
80f9df40 [Philipp Moritz] make format
4c474d71 [Philipp Moritz] run plasma_store from the right directory
85aa1710 [Philipp Moritz] fix mac tests
61d421b5 [Philipp Moritz] fix formatting
4497e337 [Philipp Moritz] fix tests
00f17f24 [Philipp Moritz] fix licenses
81437920 [Philipp Moritz] fix linting
5370ae06 [Philipp Moritz] fix plasma protocol
a137e783 [Philipp Moritz] more fixes
b36c6aaa [Philipp Moritz] fix fling.cc
214c426c [Philipp Moritz] fix eviction policy
e7badc48 [Philipp Moritz] fix python extension
6432d3fa [Philipp Moritz] fix formatting
b21f0814 [Philipp Moritz] fix remaining comments about client
27f9c9e8 [Philipp Moritz] fix formatting
7b08fd2a [Philipp Moritz] replace ObjectID pass by value with pass by const reference and fix const correctness
ca80e9a6 [Philipp Moritz] remove plain pointer in plasma client, part II
627b7c75 [Philipp Moritz] fix python extension name
30bd68b7 [Philipp Moritz] remove plain pointer in plasma client, part I
77d98227 [Philipp Moritz] put all the object code into a common library
0fdd4cd5 [Philipp Moritz] link libarrow.a and remove hardcoded optimization flags
8daea699 [Philipp Moritz] fix includes according to google styleguide
65ac7433 [Philipp Moritz] remove offending c++ flag from c flags
7003a4a4 [Philipp Moritz] fix valgrind test by setting working directory
217ff3d8 [Philipp Moritz] add valgrind heuristic
9c703c20 [Philipp Moritz] integrate client tests
9e5ae0e1 [Philipp Moritz] port serialization tests to gtest
0b8593db [Robert Nishihara] Port change from Ray. Change listen backlog size from 5 to 128.
b9a5a06e [Philipp Moritz] fix includes
ed680f97 [Philipp Moritz] reformat the code
f40f85bd [Philipp Moritz] add clang-format exceptions
d6e60d26 [Philipp Moritz] do not compile plasma on windows
f936adb7 [Philipp Moritz] build plasma python client only if python is available
e11b0e86 [Philipp Moritz] fix pthread
74ecb199 [Philipp Moritz] don't link against Python libraries
b1e0335a [Philipp Moritz] fix linting
7f7e7e78 [Philipp Moritz] more linting
79ea0ca7 [Philipp Moritz] fix clang-tidy
99420e8f [Philipp Moritz] add rat exceptions
6cee1e25 [Philipp Moritz] fix
c93034fb [Philipp Moritz] add Apache 2.0 headers
63729130 [Philipp Moritz] fix malloc?
99537c94 [Philipp Moritz] fix compiler warnings
cb3f3a38 [Philipp Moritz] compile C files with CMAKE_C_FLAGS
e649c2af [Philipp Moritz] fix compilation
04c2edb3 [Philipp Moritz] add missing file
51ab9630 [Philipp Moritz] fix compiler warnings
9ef7f412 [Philipp Moritz] make the plasma store compile
e9f9bb4a [Philipp Moritz] Initial commit of the plasma store. Contributors: Philipp Moritz, Robert Nishihara, Richard Shin, Stephanie Wang, Alexey Tumanov, Ion Stoica @ RISElab, UC Berkeley (2017) [from ray-project/ray@b94b4a3]
Also added some missing status checks to builder-benchmark

Author: Wes McKinney <[email protected]>

Closes #782 from wesm/ARROW-1151 and squashes the following commits:

9b488a0e [Wes McKinney] Try to fix snappy warning
06276119 [Wes McKinney] Restore check macros used in libplasma
83b3f36d [Wes McKinney] Add branch prediction to RETURN_NOT_OK
…m parquet-cpp

I will make a corresponding PR to parquet-cpp to ensure that this code migration is complete enough.

Author: Wes McKinney <[email protected]>

Closes #785 from wesm/ARROW-1154 and squashes the following commits:

08b54c98 [Wes McKinney] Fix variety of compiler warnings
ddc7354b [Wes McKinney] Fixes to get PARQUET-1045 working
f5cd0259 [Wes McKinney] Import miscellaneous computational utility code from parquet-cpp
…and Clang warning fixes

This was tedious, but overdue. The Status class in Arrow was originally imported from Apache Kudu, which had modified it from its standard use in Google projects. I simplified the implementation to bring it more in line with the Status implementation used in TensorFlow.

This also addresses ARROW-111 by providing an attribute that makes Clang warn if a Status is ignored.

Author: Wes McKinney <[email protected]>

Closes #814 from wesm/status-cleaning and squashes the following commits:

7b7e6517 [Wes McKinney] Bring Status implementation somewhat more in line with TensorFlow and other Google codebases, remove unused posix code. Add warn_unused_result attribute and fix clang warnings
An additional pair of eyes would be helpful; somewhat strangely, the tests pass for some datetime objects and not for others.

Author: Philipp Moritz <[email protected]>

Closes #1153 from pcmoritz/serialize-datetime and squashes the following commits:

f3696ae4 [Philipp Moritz] add numpy to LICENSE.txt
a94bca7d [Philipp Moritz] put PyDateTime_IMPORT higher up
0ae645e9 [Philipp Moritz] windows fixes
cbd1b222 [Philipp Moritz] get rid of gmtime_r
f3ea6699 [Philipp Moritz] use numpy datetime code to implement time conversions
e644f4f5 [Philipp Moritz] linting
f38cbd46 [Philipp Moritz] fixes
6e549c47 [Philipp Moritz] serialize datetime
… be a stateful kernel

I only intended to implement selective categorical conversion in `to_pandas()`, but it seems a lot is missing to do this in a clean fashion.

Author: Wes McKinney <[email protected]>

Closes #1266 from xhochy/ARROW-1559 and squashes the following commits:

50249652 [Wes McKinney] Fix MSVC linker issue
b6cb1ece [Wes McKinney] Export CastOptions
4ea3ce61 [Wes McKinney] Return NONE Datum in else branch of functions
4f969c6b [Wes McKinney] Move deprecation suppression after flag munging
7f557cc0 [Wes McKinney] Code review comments, disable C4996 warning (equivalent to -Wno-deprecated) in MSVC builds
84717461 [Wes McKinney] Do not compute hash table threshold on each iteration
ae8f2339 [Wes McKinney] Fix double to int64_t conversion warning
c1444a26 [Wes McKinney] Fix doxygen warnings
2de85961 [Wes McKinney] Add test cases for unique, dictionary_encode
383b46fd [Wes McKinney] Add Array methods for Unique, DictionaryEncode
0962f06b [Wes McKinney] Add cast method for Column, chunked_array and column factory functions
62c3cefd [Wes McKinney] Datum stubs
27151c47 [Wes McKinney] Implement Cast for chunked arrays, fix kernel implementation. Change kernel API to write to a single Datum
1bf2e2f4 [Wes McKinney] Fix bug with column using wrong type
eaadc3e5 [Wes McKinney] Use macros to reduce code duplication in DoubleTableSize
6b4f8f3c [Wes McKinney] Fix datetime64->date32 casting error raised by refactor
2c77a19e [Wes McKinney] Some Decimal->Decimal128 renaming. Add DecimalType base class
c07f91b3 [Wes McKinney] ARROW-1559: Add unique kernel
…integration tests

This PR adds a workaround for reading the metadata layout for C++ dictionary-encoded vectors.

I added tests that validate against the C++/Java integration suite. In order to make the new tests pass, I had to update the generated flatbuffers format and add a few types the JS version didn't have yet (Bool, Date32, and Timestamp). It also uses the new `isDelta` flag on DictionaryBatches to determine whether the DictionaryBatch vector should replace or append to the existing dictionary.
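The `isDelta` semantics described above (append vs. replace) can be sketched as follows. This is a plain-Python illustration of the dispatch, not the JS reader's actual code; the function and variable names are hypothetical.

```python
def apply_dictionary_batch(dictionaries: dict, dict_id: int,
                           values: list, is_delta: bool) -> None:
    """Apply one DictionaryBatch to the set of known dictionaries.

    is_delta=True appends the batch's values to the existing dictionary;
    otherwise the batch replaces the dictionary wholesale.
    """
    if is_delta and dict_id in dictionaries:
        dictionaries[dict_id].extend(values)
    else:
        dictionaries[dict_id] = list(values)


d = {}
apply_dictionary_batch(d, 0, ["a", "b"], is_delta=False)
apply_dictionary_batch(d, 0, ["c"], is_delta=True)   # delta: append
print(d[0])  # ['a', 'b', 'c']
apply_dictionary_batch(d, 0, ["x"], is_delta=False)  # replacement
print(d[0])  # ['x']
```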

I also added a script for generating test arrow files from the C++ and Java implementations, so we don't break the tests updating the format in the future. I saved the generated Arrow files in with the tests because I didn't see a way to pipe the JSON test data through the C++/Java json-to-arrow commands without writing to a file. If I missed something and we can do it all in-memory, I'd be happy to make that change!

This PR is marked WIP because I added an [integration test](apache/arrow@6e98874#diff-18c6be12406c482092d4b1f7bd70a8e1R22) that validates the JS reader reads C++ and Java files the same way, but unfortunately it doesn't. Debugging, I noticed a number of other differences between the buffer layout metadata between the C++ and Java versions. If we go ahead with @jacques-n [comment in ARROW-1693](https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16244812&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16244812) and remove/ignore the metadata, this test should pass too.

cc @TheNeuralBit

Author: Paul Taylor <[email protected]>
Author: Wes McKinney <[email protected]>

Closes #1294 from trxcllnt/generate-js-test-files and squashes the following commits:

f907d5a7 [Paul Taylor] fix aggressive closure-compiler mangling in the ES5 UMD bundle
57c7df45 [Paul Taylor] remove arrow files from perf tests
5972349c [Paul Taylor] update performance tests to use generated test data
14be77f4 [Paul Taylor] fix Date64Vector TypedArray, enable datetime integration tests
5660eb34 [Wes McKinney] Use openjdk8 for integration tests, jdk7 for main Java CI job
019e8e24 [Paul Taylor] update closure compiler with full support for ESModules, and remove closure-compiler-scripts
48111290 [Paul Taylor] Add support for reading Arrow buffers < MetadataVersion 4
c72134a5 [Paul Taylor] compile JS source in integration tests
c83a700d [Wes McKinney] Hack until ARROW-1837 resolved. Constrain unsigned integers max to signed max for bit width
fd3ed475 [Wes McKinney] Uppercase hex values
224e041c [Wes McKinney] Remove hard-coded file name to prevent primitive JSON file from being clobbered
0882d8e9 [Paul Taylor] separate JS unit tests from integration tests in CI
1f6a81b4 [Paul Taylor] add missing mkdirp for test json data
19136fbf [Paul Taylor] remove test data files in favor of auto-generating them in CI
9f195682 [Paul Taylor] Generate test files when the test run if they don't exist
0cdb74e0 [Paul Taylor] Add a cli arg to integration_test.py generate test JSON files for JS
cc744564 [Paul Taylor] resolve LICENSE.txt conflict
33916230 [Paul Taylor] move js license to top-level license.txt
d0b61f49 [Paul Taylor] add validate package script back in, make npm-release.sh suitable for ASF release process
7e3be574 [Paul Taylor] Copy license.txt and notice.txt into target dirs from arrow root.
c8125d2d [Paul Taylor] Update readme to reflect new Table.from signature
49ac3398 [Paul Taylor] allow unrecognized cli args in gulpfile
3c52587e [Paul Taylor] re-enable node_js job in travis
cb142f11 [Paul Taylor] add npm release script, remove unused package scripts
d51793dd [Paul Taylor] run tests on src folder for accurate jest coverage statistics
c087f482 [Paul Taylor] generate test data in build scripts
1d814d00 [Paul Taylor] excise test data csvs
14d48964 [Paul Taylor] stringify Struct Array cells
1f004968 [Paul Taylor] rename FixedWidthListVector to FixedWidthNumericVector
be73c918 [Paul Taylor] add BinaryVector, change ListVector to always return an Array
02fb3006 [Paul Taylor] compare iterator results in integration tests
e67a66a1 [Paul Taylor] remove/ignore test snapshots (getting too big)
de7d96a3 [Paul Taylor] regenerate test arrows from master
a6d3c83e [Paul Taylor] enable integration tests
44889fbe [Paul Taylor] report errors generating test arrows
fd68d510 [Paul Taylor] always increment validity buffer index while reading
562eba7d [Paul Taylor] update test snapshots
d4399a8a [Paul Taylor] update integration tests, add custom jest vector matcher
8d44dcd7 [Paul Taylor] update tests
6d2c03d4 [Paul Taylor] clean arrows folders before regenerating test data
4166a9ff [Paul Taylor] hard-code reader to Arrow spec and ignore field layout metadata
c60305d6 [Paul Taylor] refactor: flatten vector folder, add more types
ba984c61 [Paul Taylor] update dependencies
5eee3eaa [Paul Taylor] add integration tests to compare how JS reads cpp vs. java arrows
d4ff57aa [Paul Taylor] update test snapshots
407b9f5b [Paul Taylor] update reader/table tests for new generated arrows
85497069 [Paul Taylor] update cli args to execute partial test runs for debugging
eefc256d [Paul Taylor] remove old test arrows, add new generated test arrows
0cd31ab9 [Paul Taylor] add generate-arrows script to tests
3ff71384 [Paul Taylor] Add bool, date, time, timestamp, and ARROW-1693 workaround in reader
4a34247c [Paul Taylor] export Row type
141194e7 [Paul Taylor] use fieldNode.length as vector length
c45718e7 [Paul Taylor] support new DictionaryBatch isDelta flag
9d8fef97 [Paul Taylor] split DateVector into Date32 and Date64 types
8592ff3c [Paul Taylor] update generated format flatbuffers
Author: Uwe L. Korn <[email protected]>

Closes #1334 from xhochy/ARROW-1703 and squashes the following commits:

7282583f [Uwe L. Korn] ARROW-1703: [C++] Vendor exact version of jemalloc we depend on
… UniqueID bytes

Currently, the hashing of UniqueIDs in Plasma is too simple, and this has caused a problem. In some cases (for example, in github/ray, a UniqueID is composed of a task ID and an index), the UniqueIDs may look like "ffffffffffffffffffff00", "ffffffffffffffffff01", "fffffffffffffffffff02", and so on. The current hashing method just copies the first few bytes of the UniqueID, so most of the hashed IDs come out the same. The Plasma store keeps these IDs in an unordered_map, and when many keys hash to the same value, lookups degrade from hash-map speed to linked-list speed.

The same change has already been merged into Ray; see ray-project/ray#2174.

I have also benchmarked the new hashing method against the original one by putting lots of objects continuously; the new method does not seem to cost more time.
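The failure mode can be illustrated with a small sketch (plain Python, not the Plasma C++ code): hashing only a fixed-size prefix collapses IDs that share a common prefix into one bucket, while a hash that mixes in every byte (FNV-1a here, standing in for the MurmurHash adopted by this PR) keeps them spread out.

```python
def prefix_hash(uid: bytes, n: int = 8) -> int:
    # Old scheme (sketch): copy only the first few bytes of the ID.
    return int.from_bytes(uid[:n], "little")


def fnv1a(uid: bytes) -> int:
    # Stand-in for MurmurHash: mixes every byte of the ID into the hash.
    h = 0xCBF29CE484222325
    for b in uid:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


# IDs sharing a long common prefix, as produced by task-ID + index schemes.
ids = [b"\xff" * 18 + i.to_bytes(2, "big") for i in range(1000)]
print(len({prefix_hash(u) for u in ids}))  # 1 -- every ID collides
print(len({fnv1a(u) for u in ids}))        # 1000 -- no collisions here
```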

Author: songqing <[email protected]>

Closes #2220 from songqing/oid-hashing and squashes the following commits:

5c803aa0 <songqing> modify murmurhash LICENSE
8b8aa3e1 <songqing> add murmurhash LICENSE
d8d5f93f <songqing> lint fix
426cd1e2 <songqing> lint fix
4767751d <songqing> Use hashing function that takes into account all UniqueID bytes
Author: Wes McKinney <[email protected]>

Closes #2221 from wesm/ARROW-2634 and squashes the following commits:

c65a8193 <Wes McKinney> Add Go license details to LICENSE.txt
cloudera/hs2client. Add Thrift to thirdparty toolchain

This patch incorporates patches developed at cloudera/hs2client (Apache 2.0) by
the following authors:

* 12  Wes McKinney <[email protected]>, <[email protected]>
*  2  Thomas Tauber-Marshall <[email protected]>
*  2  陈晓发 <[email protected]>
*  2  Matthew Jacobs <[email protected]>, <[email protected]>
*  1  Miki Tebeka <[email protected]>
*  1  Tim Armstrong <[email protected]>
*  1  henryr <[email protected]>

Closes #2444

Change-Id: I88aed528a9f4d2069a4908f6a09230ade2fbe50a
…ibrary

This is very minimal in functionality, it just gives a simple R package that calls a function from the arrow C++ library.

Author: Romain Francois <[email protected]>
Author: Wes McKinney <[email protected]>

Closes #2489 from romainfrancois/r-bootstrap and squashes the following commits:

89f14b4ba <Wes McKinney> Add license addendums
9e3ffb4d2 <Romain Francois> skip using rpath linker option
79c50011d <Romain Francois> follow up from @wesm comments on #2489
a1a5e7c33 <Romain Francois> + installation instructions
fb412ca1d <Romain Francois> not checking for headers on these files
2848fd168 <Romain Francois> initial R 📦 with travis setup and testthat suite, that links to arrow c++ library and calls arrow::int32()
1. `glog` provides richer information.
2. `glog` can print good call stack while crashing, which is very helpful for debugging.
3. Make logging pluggable with `glog` or original log using a macro. Users can enable/disable `glog` using the cmake option `ARROW_USE_GLOG`.

Author: Yuhong Guo <[email protected]>
Author: Wes McKinney <[email protected]>

Closes #2522 from guoyuhong/glog and squashes the following commits:

b359640d4 <Yuhong Guo> Revert some useless changes.
38560c06e <Yuhong Guo> Change back the test code to fix logging-test
e3203a598 <Wes McKinney> Some fixes, run logging-test
4a9d1728b <Wes McKinney> Fix Flatbuffers download url
f36430836 <Yuhong Guo> Add test code to only include glog lib and init it without other use.
c8269fd88 <Yuhong Guo> Change ARROW_JEMALLOC_LINK_LIBS setting to ARROW_LINK_LIBS
34e6841f8 <Yuhong Guo> Add pthread
48afa3484 <Yuhong Guo> Address comment
12f9ba7e9 <Yuhong Guo> Disable glog from ARROW_BUILD_TOOLCHAIN
62f20002d <Yuhong Guo> Add -pthread to glog
673dbebe5 <Yuhong Guo> Try to fix ci FAILURE
69c1e7979 <Yuhong Guo> Add pthread for glog
fbe9cc932 <Yuhong Guo> Change Thirdpart to use EP_CXX_FLAGS
6f4d1b8fc <Yuhong Guo> Add lib64 to lib path suffix.
84532e338 <Yuhong Guo> Add glog to Dockerfile
ccc03cb12 <Yuhong Guo> Fix a bug
7bacd53ef <Yuhong Guo> Add LICENSE information.
9a3834caa <Yuhong Guo> Enable glog and fix building error
2b1f7e00e <Yuhong Guo> Turn glog off.
7d92091a6 <Yuhong Guo> Hide glog symbols from libarrow.so
a6ff67110 <Yuhong Guo> Support offline build of glog
14865ee93 <Yuhong Guo> Try to fix MSVC building failure
53cecebef <Yuhong Guo> Change log level to enum and refine code
09c6af7b9 <Yuhong Guo> Enable glog in plasma
…es to apache license.

Fix clang-format, cpplint warnings, -Wconversion warnings and other warnings
with -DBUILD_WARNING_LEVEL=CHECKIN. Fix some build toolchain issues, Arrow
target dependencies. Remove some unused CMake code
The baseline UTF8 decoder is adapted from Bjoern Hoehrmann's DFA-based implementation.
The common case of runs of ASCII characters benefits from a fast path that handles 8 bytes at a time.

Benchmark results (on a Ryzen 7 machine with gcc 7.3):
```
-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BM_ValidateTinyAscii/repeats:1                 3 ns          3 ns  245245630   3.26202GB/s
BM_ValidateTinyNonAscii/repeats:1              7 ns          7 ns  104679950   1.54295GB/s
BM_ValidateSmallAscii/repeats:1               10 ns         10 ns   66365983   13.0928GB/s
BM_ValidateSmallAlmostAscii/repeats:1         37 ns         37 ns   18755439   3.69415GB/s
BM_ValidateSmallNonAscii/repeats:1            68 ns         68 ns   10267387   1.82934GB/s
BM_ValidateLargeAscii/repeats:1             4140 ns       4140 ns     171331   22.5003GB/s
BM_ValidateLargeAlmostAscii/repeats:1      24472 ns      24468 ns      28565   3.80816GB/s
BM_ValidateLargeNonAscii/repeats:1         50420 ns      50411 ns      13830   1.84927GB/s
```

The case of tiny strings is probably the most important for the use case of CSV type inference.

PS: benchmarks on the same machine with clang 6.0:
```
-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BM_ValidateTinyAscii/repeats:1                 3 ns          3 ns  213945214   2.84658GB/s
BM_ValidateTinyNonAscii/repeats:1              8 ns          8 ns   90916423   1.33072GB/s
BM_ValidateSmallAscii/repeats:1                7 ns          7 ns   91498265   17.4425GB/s
BM_ValidateSmallAlmostAscii/repeats:1         34 ns         34 ns   20750233   4.08138GB/s
BM_ValidateSmallNonAscii/repeats:1            58 ns         58 ns   12063206   2.14002GB/s
BM_ValidateLargeAscii/repeats:1             3999 ns       3999 ns     175099   23.2937GB/s
BM_ValidateLargeAlmostAscii/repeats:1      21783 ns      21779 ns      31738   4.27822GB/s
BM_ValidateLargeNonAscii/repeats:1         55162 ns      55153 ns      12526   1.69028GB/s
```
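The ASCII fast path mentioned above can be sketched roughly as follows (a Python illustration, not the vendored C++ DFA): runs of ASCII bytes are cleared eight at a time by checking that no byte in a 64-bit word has its high bit set, dropping to a general validator only when a non-ASCII byte appears.

```python
def validate_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    # Fast path: consume 8 ASCII bytes per iteration by testing the high
    # bit of all eight bytes in a single 64-bit mask operation.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & 0x8080808080808080:
            break  # a non-ASCII byte: fall back to the general validator
        i += 8
    # General path (stand-in for the DFA): defer to Python's decoder.
    # Since i only ever advances past fully-ASCII words, no multi-byte
    # sequence can start before position i.
    try:
        data[i:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(validate_utf8("hello, naïve café".encode()))  # True
print(validate_utf8(b"abc\xff"))                    # False
```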

Author: Antoine Pitrou <[email protected]>

Closes #2916 from pitrou/ARROW-3536-utf8-validation and squashes the following commits:

9c9713b78 <Antoine Pitrou> Improve benchmarks
e6f23963a <Antoine Pitrou> Use a larger state table allowing for single lookups
29d6e347c <Antoine Pitrou> Help clang code gen
e621b220f <Antoine Pitrou> Use memcpy for safe aligned reads, and improve speed of non-ASCII runs
89f6843d9 <Antoine Pitrou> ARROW-3536:  Add UTF8 validation functions
Vendor the `std::string_view` backport from https://github.com/martinmoene/string-view-lite

Author: Antoine Pitrou <[email protected]>

Closes #2974 from pitrou/ARROW-3800-string-view-backport and squashes the following commits:

4353414b6 <Antoine Pitrou> ARROW-3800:  Vendor a string_view backport
Second granularity is supported (we might want to add support for fractions of a second, e.g. in the "YYYY-MM-DD[T ]hh:mm:ss.ssssss" format).

Timestamp conversion also participates in CSV type inference, since it's unlikely to produce false positives (e.g. a semantically "string" column that would be entirely made of valid timestamp strings).
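The inference idea can be sketched like this (Python; the actual implementation is C++ and parses the ISO-8601 subset by hand): a column is inferred as timestamp only if every non-empty value parses, which is why false positives are unlikely.

```python
from datetime import datetime

# Second-granularity ISO-8601-like formats, with 'T' or ' ' as separator.
_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")


def parse_timestamp(value: str):
    """Return a datetime if the value matches a supported format, else None."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None


def infer_timestamp_column(values) -> bool:
    # Infer timestamp only if every non-empty cell parses successfully.
    non_empty = [v for v in values if v]
    return bool(non_empty) and all(parse_timestamp(v) for v in non_empty)


print(infer_timestamp_column(["2022-06-01 12:34:56", "2022-06-02T00:00:00"]))  # True
print(infer_timestamp_column(["2022-06-01", "not a date"]))                    # False
```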

Author: Antoine Pitrou <[email protected]>

Closes #2952 from pitrou/ARROW-3738-csv-timestamps and squashes the following commits:

005a6e3f7 <Antoine Pitrou> ARROW-3738:  Parse ISO8601-like timestamps in CSV columns
1. Get rid of all macros and sprinkled out hash table handling code

2. Improve performance by more careful selection of hash functions
   (and better collision resolution strategy)

Integer hashing benefits from a very fast specialization.
Small-string hashing benefits from a fast specialization with fewer branches
and less computation.
Generic string hashing falls back on hardware CRC32 or Murmur2-64, which
probably has sufficient performance given the typical distribution of string
key lengths.

3. Add some tests and benchmarks
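The specialization split above can be sketched as dispatch on key type (Python illustration; the real code is C++ and uses hardware CRC32 or Murmur2-64, for which FNV-1a stands in here):

```python
def hash_int(x: int) -> int:
    # Integer specialization: one multiply by a large odd constant.
    return (x * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF


def hash_bytes(data: bytes) -> int:
    # Generic fallback (FNV-1a standing in for CRC32 / Murmur2-64).
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def hash_key(key) -> int:
    if isinstance(key, int):
        return hash_int(key)  # fast integer specialization
    data = key.encode() if isinstance(key, str) else key
    if len(data) <= 16:
        # Small-string specialization: pad to a fixed width so the work
        # (and branching) is independent of the exact length; mix in the
        # length so "ab" and "ab\0" hash differently.
        padded = int.from_bytes(data.ljust(16, b"\0"), "little")
        return hash_int(padded ^ len(data))
    return hash_bytes(data)  # long strings take the generic path
```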

Author: Antoine Pitrou <[email protected]>

Closes #3005 from pitrou/ARROW-2653 and squashes the following commits:

0c2dcc3de <Antoine Pitrou> ARROW-2653:  Refactor hash table support
Also update mapbox::variant to v1.1.5 (I'm not sure which version was previously vendored).

Author: Antoine Pitrou <[email protected]>

Closes #3184 from pitrou/ARROW-4017-vendored-libraries and squashes the following commits:

fe69566d7 <Antoine Pitrou> ARROW-4017:  Move vendored libraries in dedicated directory
…edstock after compiler migration

Crossbow builds:
- [kszucs/crossbow/build-403](https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-403)
- [kszucs/crossbow/build-404](https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-404)
- [kszucs/crossbow/build-405](https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-405)
- [kszucs/crossbow/build-406](https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-406)
- [kszucs/crossbow/build-407](https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-407)

Author: Krisztián Szűcs <[email protected]>

Closes #3368 from kszucs/conda_forge_migration and squashes the following commits:

e0a5a6422 <Krisztián Szűcs>  use --croot
3749a2ff9 <Krisztián Szűcs>  git on osx; set FEEDSTOSK_ROOT
ca7217d7f <Krisztián Szűcs>  support channel sources from variant files
33cba7118 <Krisztián Szűcs>  fix conda path on linux
2505828b7 <Krisztián Szűcs> fix task names
0c4a10bc3 <Krisztián Szűcs> conda recipes for python 3.7; compiler migration
Added Howard Hinnant's date project as a third-party library.
Used the system timezone database for timezone information.

Author: Antoine Pitrou <[email protected]>
Author: shyam <[email protected]>

Closes #3352 from shyambits2004/timestamp and squashes the following commits:

882a5cf6 <Antoine Pitrou> Tweak wording of vendored date library README
7f524805 <Antoine Pitrou> Small tweaks to license wording for the date library
9ee8eff4 <shyam> ARROW-4198 :  Added support to cast timestamp
- Ported parquet-cpp external license references
- Removed spurious duplicates (boost, mapbox)

Author: François Saint-Jacques <[email protected]>

Closes #3692 from fsaintjacques/ARROW-4546-parquet-license and squashes the following commits:

a5aa81e48 <François Saint-Jacques> ARROW-4546: Update LICENSE with parquet-cpp licenses
This includes a Dockerfile that can be used to create wheels based on Ubuntu 14.04 which are compatible with TensorFlow.

TODO before this can be merged:
- [x] write documentation how to build this
- [x] do more testing

Author: Philipp Moritz <[email protected]>

Closes #3766 from pcmoritz/ubuntu-wheels and squashes the following commits:

f708c29b <Philipp Moritz> remove tensorflow import check
599ce2e7 <Philipp Moritz> fix manylinux1 build instructions
f1fbedf8 <Philipp Moritz> remove tensorflow hacks
bf47f579 <Philipp Moritz> improve wording
4fb1d38b <Philipp Moritz> add documentation
078be98b <Philipp Moritz> add licenses
0ab0bccb <Philipp Moritz> cleanup
c7ab1395 <Philipp Moritz> fix
eae775d5 <Philipp Moritz> update
2820363e <Philipp Moritz> update
ed683309 <Philipp Moritz> update
e8c96ecf <Philipp Moritz> update
8a3b19e8 <Philipp Moritz> update
0fcc3730 <Philipp Moritz> update
fd387797 <Philipp Moritz> update
78dcf42d <Philipp Moritz> update
7726bb6a <Philipp Moritz> update
82ae4828 <Philipp Moritz> update
f44082ea <Philipp Moritz> update
deb30bfd <Philipp Moritz> update
50e40320 <Philipp Moritz> update
58f6c121 <Philipp Moritz> update
5e8ca589 <Philipp Moritz> update
5fa73dd5 <Philipp Moritz> update
595d0fe1 <Philipp Moritz> update
79006722 <Philipp Moritz> add libffi-dev
9ff5236d <Philipp Moritz> update
ca972ad0 <Philipp Moritz> update
60805e22 <Philipp Moritz> update
7a66ba35 <Philipp Moritz> update
1b56d1f1 <Philipp Moritz> zlib
eedef794 <Philipp Moritz> update
3ae2b5ab <Philipp Moritz> update
df297e1c <Philipp Moritz> add python build script
358e4f85 <Philipp Moritz> update
65afcebe <Philipp Moritz> update
11ccfc7e <Philipp Moritz> update
f1784245 <Philipp Moritz> update
b3039c8b <Philipp Moritz> update
9064c3ca <Philipp Moritz> update
c39f92a9 <Philipp Moritz> install tensorflow
ec4e2210 <Philipp Moritz> unicode
773ca2b6 <Philipp Moritz> link python
b690d64a <Philipp Moritz> update
5ce7f0d6 <Philipp Moritz> update
a9302fce <Philipp Moritz> install python-dev
f12e0cfe <Philipp Moritz> multibuild python 2.7
9342006b <Philipp Moritz> add git
ab2ef8e7 <Philipp Moritz> fix cmake install
cef997b5 <Philipp Moritz> install cmake and ninja
5d560faf <Philipp Moritz> add build-essential
adf2f705 <Philipp Moritz> add curl
f8d66963 <Philipp Moritz> remove xz
e439356e <Philipp Moritz> apt update
79fe557e <Philipp Moritz> add docker image for ubuntu wheel
This change refactors much of our CMake logic to make use of built-in CMake paths and remove custom logic. It also switches to more modern dependency management via CMake targets instead of plain text variables.

This includes the following fixes:

- Use CMake's standard find features, e.g. respecting the `*_ROOT` variables: https://issues.apache.org/jira/browse/ARROW-4383
- Add a Dockerfile for Fedora: https://issues.apache.org/jira/browse/ARROW-4730
- Add a Dockerfile for Ubuntu Xenial: https://issues.apache.org/jira/browse/ARROW-4731
- Add a Dockerfile for Ubuntu Bionic: https://issues.apache.org/jira/browse/ARROW-4849
- Add a Dockerfile for Debian Testing: https://issues.apache.org/jira/browse/ARROW-4732
- Change the clang-7 entry to use system packages without any dependency on conda(-forge): https://issues.apache.org/jira/browse/ARROW-4733
- Support `double-conversion<3.1`: https://issues.apache.org/jira/browse/ARROW-4617
- Use google benchmark from toolchain: https://issues.apache.org/jira/browse/ARROW-4609
- Use the `compilers` metapackage to install the correct binutils when using conda, otherwise system binutils to fix https://issues.apache.org/jira/browse/ARROW-4485
- RapidJSON throws compiler errors with GCC 8+ https://issues.apache.org/jira/browse/ARROW-4750
- Handle `EXPECT_OK` collision: https://issues.apache.org/jira/browse/ARROW-4760
- Activate flight build in ci/docker_build_cpp.sh: https://issues.apache.org/jira/browse/ARROW-4614
- Build Gandiva in the docker containers: https://issues.apache.org/jira/browse/ARROW-4644

Author: Uwe L. Korn <[email protected]>

Closes #3688 from xhochy/build-on-fedora and squashes the following commits:

88e11fcfb <Uwe L. Korn> ARROW-4611:  Rework CMake logic
Author: Jeroen Ooms <[email protected]>

Closes #3923 from jeroen/cpuidex and squashes the following commits:

59429f02 <Jeroen Ooms> Mention mingw-w64 polyfill in LICENSE.txt
28619330 <Jeroen Ooms> run clang-format
9e780465 <Jeroen Ooms> polyfill for __cpuidex on mingw-w64
Replace mapbox::variant with Michael Park's variant implementation.

Author: Antoine Pitrou <[email protected]>

Closes #4259 from pitrou/ARROW-5252-variant-backport and squashes the following commits:

03dbc0e14 <Antoine Pitrou> ARROW-5252:  Use standard-compliant std::variant backport
Some antiquated C++ build chains miss the standard <codecvt> header.
Use a small vendored UTF8 implementation instead.

Author: Antoine Pitrou <[email protected]>

Closes #4616 from pitrou/ARROW-5648-simple-utf8 and squashes the following commits:

54b1b2f68 <Antoine Pitrou> ARROW-5648:  Avoid using codecvt
@lidavidm lidavidm merged commit 64b0127 into main Jun 1, 2022
@lidavidm lidavidm deleted the import branch June 1, 2022 21:37
lidavidm pushed a commit that referenced this pull request Oct 5, 2023
… correctly (#1168)

Fixes #1100 

Test before fix:
```
Expected equality of these values:
  AdbcGetObjectsDataGetTableByName(&mock_data, "mock_catalog", "mock_schema", "table_suffix")
    Which is: 0x16d014ee8
  &mock_table_suffix
    Which is: 0x16d014ea8
arrow-adbc/c/driver/common/utils_test.cc:220: Failure
Expected equality of these values:
  AdbcGetObjectsDataGetColumnByName(&mock_data, "mock_catalog", "mock_schema", "table", "column_suffix")
    Which is: 0x16d014df8
  &mock_column_suffix
    Which is: 0x16d014d48
arrow-adbc/c/driver/common/utils_test.cc:224: Failure
Expected equality of these values:
  AdbcGetObjectsDataGetConstraintByName(&mock_data, "mock_catalog", "mock_schema", "table", "constraint_suffix")
    Which is: 0x16d014d08
  &mock_constraint_suffix
    Which is: 0x16d014cc8
[  FAILED  ] AdbcGetObjectsData.GetObjectsByName (0 ms)
```

Test after fix:
```
$ ctest                                                                           
Test project arrow-adbc/build
    Start 1: adbc-driver-common-test
1/2 Test #1: adbc-driver-common-test ..........   Passed    0.25 sec
    Start 2: adbc-driver-sqlite-test
2/2 Test #2: adbc-driver-sqlite-test ..........   Passed    0.19 sec

100% tests passed, 0 tests failed out of 2

Label Time Summary:
driver-common    =   0.25 sec*proc (1 test)
driver-sqlite    =   0.19 sec*proc (1 test)
unittest         =   0.43 sec*proc (2 tests)

Total Test time (real) =   0.44 sec
```
birschick-bq referenced this pull request in birschick-bq/arrow-adbc May 6, 2024