BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k#14064
BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k#14064tworec wants to merge 1 commit into
Conversation
820a4a6 to
a6543ec
Compare
Codecov Report
@@ Coverage Diff @@
## master #14064 +/- ##
==========================================
- Coverage 86.32% 86.32% -0.01%
==========================================
Files 141 141
Lines 51165 51170 +5
==========================================
+ Hits 44169 44170 +1
- Misses 6996 7000 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
why are you changing this?
There was a problem hiding this comment.
For two reasons:
- it is independent from datetime package
- it is 4x faster
In [1]:
import timeit
timeit.timeit('import numpy as np; from datetime import datetime; np.datetime64(datetime.utcfromtimestamp(float(1234567890.123)))', number=1000000)
Out[1]:
6.163272747769952
In [2]:
timeit.timeit('import numpy as np; np.datetime64(int(float(1234567890.123)*1e6), "us")', number=1000000)
Out[2]:
1.5235848873853683
There was a problem hiding this comment.
- doesn't really matter, but ok
- sure, for a single value. you are much better off leaving these as floats, then coercing all in one go if you care about perf (using
to_datetime)
There was a problem hiding this comment.
It's not the point of this PR. I can restore previous version or leave mine - what do you prefer?
|
you are changing lots of things, but have minimal tests for this. pls add some. |
|
is this meant to close #14020 ? |
|
I didn't see #14020 before, but this PR seems to answer @aschmolck request |
|
needs more tests for datetimes |
|
this PR is all about numeric precision loss. I'll restore my datetime improvement. It can be added as separate PR I think. |
a6543ec to
8e44062
Compare
8e44062 to
e36b30b
Compare
|
Ok, I've rebased and added full tests for this change (+ test clean up) |
e36b30b to
505792e
Compare
|
cc @parthea |
There was a problem hiding this comment.
see Dependencies sub-chapter
There was a problem hiding this comment.
then you need a pointer from the install.rst to the deps section
There was a problem hiding this comment.
you need to add a pointer from here to the Dependency section of the docs (that you added below).
There was a problem hiding this comment.
beause I think a new user should start reading about BQ support from the beginning of the section to acknowledge all prerequisites (eg. additional deps)
There was a problem hiding this comment.
are these unsupported types validated?
There was a problem hiding this comment.
there is no need to validate them
read_gbqcan not encounterRECORD(because query can not return RECORD column), andBYTESare stored asobject(so, yes, there is some kind of support for it)to_gbqdoes not generate those types within schema generation (see_generate_bq_schema)
There was a problem hiding this comment.
needs to be the same length as the heading
There was a problem hiding this comment.
ahh I c, ok well then make sure the pointer from the installation page goes here.
There was a problem hiding this comment.
from where exactly - what file?
There was a problem hiding this comment.
don't change things that are not related to this PR.
There was a problem hiding this comment.
why are you taking things out like this? pls don't
There was a problem hiding this comment.
because I don't see them in resulting doc... can this 'thing' cause something to be displayed in generated html?
There was a problem hiding this comment.
move to 0.19.1 (only this particular change)
There was a problem hiding this comment.
Or rather 0.20.0 ? As the current state of this PR changes the behaviour for certain integer columns, so I don't think it is needed to go into 0.19.1
There was a problem hiding this comment.
@jreback what's your final preference here?
There was a problem hiding this comment.
@jorisvandenbossche makes a good point, so 0.20.0
There was a problem hiding this comment.
acutally I disagree with this approach. I would only cast to object if the values are not representable (IOW big), otherwise follow the pandas standard and promot to float. It is much more natural from a user point of view.
There was a problem hiding this comment.
I were trying to get your opinion on this approach about a month ago, but there was silence which was sign to me that you have no objections.
Now when code is ready and tested I see this. That's one point.
Second point. If I understand correctly, your proposed approach will take into account three data types to store BQ INTEGER columns:
int(orint64) by defaultfloatwhen there arenulls and all other values are less than2**53objectwhen there arenulls and values greater than2**53
Are you sure it is worth complicating things wich can be much simpler (vide my solution)?
There was a problem hiding this comment.
@tworec we have 96 PR's open at the moment. I don't always comment till someone indicates things are ready.
can you see whether the column is nullable or not a-priori?
There was a problem hiding this comment.
ok, I understand :)
no, all columns in BigQuery queries are nullable
There was a problem hiding this comment.
just make a new class in stead of changing existing code.
There was a problem hiding this comment.
@parthea changed this class to use private key path to authorize (before it used user account) but did not changed name of the class. This rename is ment to fix this inconsistency.
505792e to
f32edcb
Compare
|
I think it might be better to force the user to specify a |
|
I offer solution with default behavior. Can we stop here? |
|
@jorisvandenbossche I've corrected docs. regarding float storage: my proposition is described in comment and implemented with tests |
|
gentelmen @jreback @jorisvandenbossche 😃 |
|
@tworec sorry we got a bit lost in the shuffle. can you rebase and we can get this in. |
f641fad to
aa248da
Compare
|
@jreback ok, I've rebased |
aa248da to
f97fcdb
Compare
|
@parthea thoughts? |
f97fcdb to
5a22fca
Compare
|
hi @jreback , can we finally merge this? Please decide if you want this change or drop it. |
|
I was waiting for @parthea to come back on this. |
|
oh, I c. We'll wait then. |
parthea
left a comment
There was a problem hiding this comment.
I added a few minor comments. I think _get_private_key_path() should not be called in TestToGBQIntegrationWithLocalUserAccountAuth.
Separately, it would be great if you could follow the steps in the contributing docs to run the integration tests from your fork on Travis-CI. You only need to do the Travis configuration described in the contributing docs once and you're all set for future PRs.
| Integer and boolean ``NA`` handling | ||
| +++++++++++++++++++++++++++++++++++ | ||
|
|
||
| .. versionadded:: 0.19 |
There was a problem hiding this comment.
Please update the .. versionadded::
| from math import pi | ||
| query = '''select * from | ||
| (select PI() * POW(10, 307) as NULLABLE_DOUBLE), | ||
| (select null as NULLABLE_DOUBLE)''' |
There was a problem hiding this comment.
SQL reserved words should be upper case for consistency with existing code. We should either use the following or change existing code if there is a preference to change the SQL formatting.
query = '''SELECT * FROM
(SELECT PI() * POW(10, 307) AS Nullable_Double),
(SELECT NULL AS Nullable_Double)'''
| from math import pi | ||
| query = '''select * from | ||
| (select PI() as NULLABLE_FLOAT), | ||
| (select null as NULLABLE_FLOAT)''' |
There was a problem hiding this comment.
SQL keywords should be uppercase for consistency with existing code.
There was a problem hiding this comment.
Oh, much of old code was not consistent in that sense. But I've tried to fix all inconsistencies. Please check me :)
| _skip_local_auth_if_in_travis_env() | ||
|
|
||
| _setup_common() | ||
| clean_gbq_environment(_get_private_key_path()) |
There was a problem hiding this comment.
Since we are testing local user account auth, can we change this to clean_gbq_environment()? We should also make the same change in tearDownClass.
5a22fca to
5e476d6
Compare
tworec
left a comment
There was a problem hiding this comment.
I fixed things as you @parthea suggested. Please review once again.
I don't fully trust travis CI so I need to create separete empty project for this purpose. And I'm not on my own. My company need to set up billing for it. It will take time.
| _skip_local_auth_if_in_travis_env() | ||
|
|
||
| _setup_common() | ||
| clean_gbq_environment(_get_private_key_path()) |
| Integer and boolean ``NA`` handling | ||
| +++++++++++++++++++++++++++++++++++ | ||
|
|
||
| .. versionadded:: 0.19 |
| from math import pi | ||
| query = '''select * from | ||
| (select PI() as NULLABLE_FLOAT), | ||
| (select null as NULLABLE_FLOAT)''' |
There was a problem hiding this comment.
Oh, much of old code was not consistent in that sense. But I've tried to fix all inconsistencies. Please check me :)
| from math import pi | ||
| query = '''select * from | ||
| (select PI() * POW(10, 307) as NULLABLE_DOUBLE), | ||
| (select null as NULLABLE_DOUBLE)''' |
7d978d9 to
65e3327
Compare
|
@parthea please review, because every time I look at this change it need manual rebase (conflicts). |
|
@parthea my Travis CI setup is broken, it causes errors with pickles eg. https://travis-ci.org/RTBHOUSE/pandas/jobs/199354963 Can you help? |
|
@tworec I'm sorry for the delay. The latest commit looks good. Could you try running |
|
@parthea thx I've done it. Output is below. Why is this needed -- tags should not affect building right? |
65e3327 to
50d31ce
Compare
fixes: - lost precision for longs above 2^53 - and floats above 10k
50d31ce to
788ccee
Compare
|
i've squashed |
|
@tworec when git clones a repo it doesn't clone the tags, so they are from your master whenever it was cloned, NOT the current one. This behavior baffles me, but that's how it works :< |
|
ok, now my travis build works well |
|
thanks for the patience @tworec you can also put up a PR to: https://github.com/pydata/pandas-gbq (will need to take out the documentation changes, but outherwise should be clean). |
…e 10k closes pandas-dev#14020 closes pandas-dev#14305 Author: Piotr Chromiec <[email protected]> Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits: 788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
fixes:
Also contains
test_gbq.pyclean upgit diff upstream/master | flake8 --diff