Faster dataframe construction#128
Conversation
Codecov Report
@@ Coverage Diff @@
## master #128 +/- ##
=========================================
Coverage ? 31.03%
=========================================
Files ? 8
Lines ? 1621
Branches ? 0
=========================================
Hits ? 503
Misses ? 1118
Partials ? 0
Continue to review full report at Codecov.
|
I added this. Though needs a review, as it's possible I haven't thought of a corner case EDIT: Travis is fake news, this currently fails https://travis-ci.org/maxim-lian/pandas-gbq/builds/343686588. Will look into soon |
tswast
left a comment
There was a problem hiding this comment.
I'm not seeing anything obviously wrong, but I trust Travis. Thanks for linking to that build.
|
|
||
| column_dtypes = { | ||
| str(field['name']): | ||
| dtype_map.get(field['type'].upper(), object) for field in fields |
There was a problem hiding this comment.
I wonder if our dtype_map needs some updating, as google-cloud-bigquery already does some type conversions. Travis seems to think something is up with this.
pandas_gbq/gbq.py:853: in read_gbq
final_df[field['name']].astype(type_map[field['type'].upper()])
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype
raise_on_error=raise_on_error, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype
return self.apply('astype', dtype=dtype, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply
applied = getattr(b, f)(**kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype
values=values, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype
values = _astype_nansafe(values.ravel(), dtype, copy=True)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409)
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'
Sidenote (note relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is REPEATED.
ecd6a2b to
5641fc3
Compare
|
This now works - was actually very close to working prior. This is probably 40% of the way to the 'load in each page into an array and then concat at the end'. 20% of the work though; so a decent place to pause & merge |
|
|
||
| df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys()) | ||
| for column in df: | ||
| df[column] = df[column].astype(column_dtypes[column]) |
There was a problem hiding this comment.
Do we even need the astype? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this a 0.6.1 release if you'd prefer to get this out as-is.
There was a problem hiding this comment.
That's a good point re int with nulls. I'll add a test and see what happens.
I think we do need this for dates; otherwise they'll be str
There was a problem hiding this comment.
Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165
What's happening: pandas is ignoring setting a column with NaN to int, and keeping it as object
This is an attempt to push the DataFrame construction down to Pandas, which is extremely fast, rather than looping over the entire dataframe in python ourselves
One downside is that the DataFrame doesn't accept multiple dtypes, so the resulting dataframe will be defined only by the data.
We could run
.astypeon each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time