Skip to content
This repository was archived by the owner on Mar 13, 2026. It is now read-only.

Faster dataframe construction#128

Merged
max-sixty merged 11 commits into
googleapis:masterfrom
max-sixty:no-python-loop
Aug 22, 2018
Merged

Faster dataframe construction#128
max-sixty merged 11 commits into
googleapis:masterfrom
max-sixty:no-python-loop

Conversation

@max-sixty

Copy link
Copy Markdown
Contributor

This is an attempt to push the DataFrame construction down to Pandas, which is extremely fast, rather than looping over the entire dataframe in python ourselves

One downside is that the DataFrame doesn't accept multiple dtypes, so the resulting dataframe will be defined only by the data.

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

@codecov-io

codecov-io commented Feb 20, 2018

Copy link
Copy Markdown

Codecov Report

❗ No coverage uploaded for pull request base (master@d6b7507). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #128   +/-   ##
=========================================
  Coverage          ?   31.03%           
=========================================
  Files             ?        8           
  Lines             ?     1621           
  Branches          ?        0           
=========================================
  Hits              ?      503           
  Misses            ?     1118           
  Partials          ?        0
Impacted Files Coverage Δ
pandas_gbq/gbq.py 20.57% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6b7507...729b9f3. Read the comment docs.

@max-sixty

max-sixty commented Feb 20, 2018

Copy link
Copy Markdown
Contributor Author

We could run .astype on each column, which I think would be fairly performant. But we'd need to write around cases such as non-float/obj columns having nulls. And the DataFrame constructor likely gets it right a vast majority of the time

I added this.

Though needs a review, as it's possible I haven't thought of a corner case

EDIT: Travis is fake news, this currently fails https://travis-ci.org/maxim-lian/pandas-gbq/builds/343686588. Will look into soon

@tswast tswast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing anything obviously wrong, but I trust Travis. Thanks for linking to that build.

Comment thread pandas_gbq/gbq.py Outdated

column_dtypes = {
str(field['name']):
dtype_map.get(field['type'].upper(), object) for field in fields

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if our dtype_map needs some updating, as google-cloud-bigquery already does some type conversions. Travis seems to think something is up with this.

pandas_gbq/gbq.py:853: in read_gbq
    final_df[field['name']].astype(type_map[field['type'].upper()])
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype
    raise_on_error=raise_on_error, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype
    return self.apply('astype', dtype=dtype, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply
    applied = getattr(b, f)(**kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype
    values=values, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype
    values = _astype_nansafe(values.ravel(), dtype, copy=True)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409)
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'

Sidenote (note relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is REPEATED.

@max-sixty

Copy link
Copy Markdown
Contributor Author

This now works - was actually very close to working prior.

This is probably 40% of the way to the 'load in each page into an array and then concat at the end'. 20% of the work though; so a decent place to pause & merge

@tswast tswast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread pandas_gbq/gbq.py

df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys())
for column in df:
df[column] = df[column].astype(column_dtypes[column])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need the astype? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this a 0.6.1 release if you'd prefer to get this out as-is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point re int with nulls. I'll add a test and see what happens.

I think we do need this for dates; otherwise they'll be str

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165

What's happening: pandas is ignoring setting a column with NaN to int, and keeping it as object

@max-sixty max-sixty merged commit 428845d into googleapis:master Aug 22, 2018
@max-sixty max-sixty deleted the no-python-loop branch August 22, 2018 17:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants