Google BigQuery Introduction & Features
Google BigQuery (GBQ) is a scalable cloud base data warehouse tool made by Google. It enables you to integrate and analyze all of your data regardless of its size in a matter of seconds. You can choose from a variety of data sources and formats which makes it very convenient to implement and adopt. Once you come up with queries which you would like to transform into business information you may enjoy multiple export and integration options like Google sheets or Data Studio.
GBQ uses ansi-SQL queries which can be executed against gigabytes or petabytes of data. It is also easy to access public or commercial datasets that have already been stored in GBQ.
Keep in mind that GBQ utilizes so-called column storage which requires a specific mindset and relates to the way how are you billed for using the tool.
Data is stored in columns and can be retrieved using commonly known SQL. If you have Google Analytics 360, then your Web Analytics data maybe be exported to GBQ thanks to one of Google’s many backend integrations. Furthermore, you can enjoy GBQ dumps from other Google tools like DCM, Adwords or even Youtube. Data in GBQ usually is stored daily tables. Newer implementations may use data partitioning based on a pseudo column and then daily tables that are not in use. Since data may be massive GBQ () delivers astonishing performance to help you to perform your analysis.
In January 2018 running costs were looking as follows:
Storing 1TB size table - $20
Accessing (Querying) 1TB size table once - $5
(So if you store 1TB and accessed it once during the billing period, bill would be $25).
While this cost is undoubtedly less compared to the physical storage, you might still be subjected to higher bills if querying happens without control.
The more you query, the more you pay.
Let’s say that you run about 10 queries per day, that’s $50 for storing and accessing data.
Also, querying cost is independent to the type of the query that’s executed - whether you are accessing one row of data or the entire column, if the query results in opening the file (which stores data from a particular column), then cost is incurred. Also if in the same query, you referenced the same column more than once or you ran different operations on the columns (regex, aggregation), you’re charged for reading once the file
Google Data Studio Integration
GBQ tables and view can be exposed to data visualization tools such as Data Studio, Tableau, or exported to Google sheets for further analysis.
While Data Studio integrates with GBQ seamlessly for visualising and reporting, there are querying costs associated to pull the data. DataStudio demands the constant access for fetching the data. There are caching functionalities inside Data Studio, however often they are not utilised properly due to the complexity of queries used.
When working with Data Studio please keep in mind:
- If there are many report components (e.g.charts & scorecards) different queries are executed, one per item.
- By default data studio refreshes reports on a daily basis even if the user hasn’t opened the report. So all queries are executed daily, this means 30 executions a month.
- If report uses timestamp functions in queries like last seven days or last month then the caching system won’t work and each time a report is presented, queries will be executed. So if you have one report which is used by 100 users daily this means that you will pay for around 2000 queries.
As it can be seen, depending on the number of times the reports are opened or refreshed or scheduled, cost may vary.
Analyze your usage needs
Reflect on your usage levels by raising these questions -
- How many users are going to use your report?
- How many times will they query the data?
- Are there any scheduled reports in Data Studio that result in more queries?
- You might discover unnecessary or redundant queries and optimize them. For example, we found that there are reports that are scheduled during the weekends and disabled them for one of our clients. This resulted in 14% savings in the query costs.
Create the aggregates
Create aggregates whenever possible and minimise number of columns. Write few lines of SQL code to aggregate the rows. Use tools like cron to process the data regularly and store it separately. Even 1TB of data can be reduced to few megabytes after aggregation. Create more aggregate tables to minimise size of datasets (low storage -> less to query -> economical).
Move old datasets to Coldline Storage
Data is generally stored in nearline storage so that it can be accessed quickly. However, Google provides an option to move less used data to ‘coldline’ storage where the SLA for retrieval varies from 4 to 7 days. The cost of cold ine storage is 1.5 times cheaper than the nearline storage. For one of our clients, we moved data that was older than 1 year ($35TB dataset) into coldline storage which resulted in over $100 in savings.