Tutorial

With a Python 3 virtual environment activated, clone the repository into your project root folder, copy the package directory out of the clone, and install the required libraries:

git clone https://github.com/dougpm/gcp_toolkit.git && \
cp -r gcp_toolkit/gcp_toolkit gcp_toolkit2 && \
cp gcp_toolkit/requirements.txt . && \
pip install -r requirements.txt && \
rm -rf gcp_toolkit && \
mv gcp_toolkit2 gcp_toolkit
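
As an optional sanity check (not part of the original steps), you can confirm the copied package is importable from the project root:

import gcp_toolkit
print(gcp_toolkit.__file__)  # should point to the copy inside your project root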

io module

Using the IO class:

import gcp_toolkit as gtk

io = gtk.IO('your-bucket-name', 'your-dataset-name')

This automatically creates google.cloud.storage and google.cloud.bigquery Client instances, but you can pass your own to the constructor if you need to control details such as credentials or the target project.
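
For example, you could build the clients yourself and hand them to the constructor. This is only a sketch: the keyword argument names (storage_client, bq_client) are illustrative, so check the IO signature in your copy of the package before using it.

from google.cloud import bigquery, storage

import gcp_toolkit as gtk

# Hypothetical argument names -- verify them against the IO constructor.
storage_client = storage.Client(project='your-project-id')
bq_client = bigquery.Client(project='your-project-id')

io = gtk.IO('your-bucket-name', 'your-dataset-name',
            storage_client=storage_client, bq_client=bq_client)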

Note: You must have Create Table permissions on the specified dataset.

Loading data from BigQuery into a pandas DataFrame:

df = io.bq_to_df('SELECT fields FROM dataset.table_name')
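
The result is a regular pandas DataFrame, so the usual inspection methods apply:

print(df.shape)
print(df.head())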

Loading data from a pandas DataFrame into BigQuery:

io.df_to_bq(df, 'dataset.table_name')
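
To illustrate, a small hand-built DataFrame can be written the same way. Whether df_to_bq appends to or replaces an existing table is up to the package, so check its behaviour before pointing it at real data; the table name below is a placeholder.

import pandas as pd

sample = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
io.df_to_bq(sample, 'dataset.sample_table')  # placeholder table name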

Loading data from a Storage bucket into a pandas DataFrame:

df = io.bucket_to_df('path/to/bucket/files/files_prefix*')

Moving data from BigQuery to Storage:

io.bq_to_bucket('SELECT fields FROM dataset.table_name',
                'path/to/files/file_name')

Note: The above may occasionally fail when the table is too big to be extracted to a single file. In that case, add a '*' wildcard to the file name so the export is sharded across multiple files, like so:

io.bq_to_bucket('SELECT fields FROM dataset.table_name',
                'path/to/files/file_name*')
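
Because the export is then sharded across several objects, the same wildcard prefix can be passed back to bucket_to_df to load every shard into a single DataFrame. This is a sketch and assumes the exported files are in a format bucket_to_df can read:

io.bq_to_bucket('SELECT fields FROM dataset.table_name',
                'path/to/files/file_name*')
df = io.bucket_to_df('path/to/files/file_name*')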