How to do Data Cleansing using SQL BigQuery and Connect Custom Query to Google Data Studio (2/4)

Fitrie Ratnasari
5 min readFeb 23, 2022

It is very delighted when we have a clean-raw dataset, but in real cases, it is too good to be true. Mostly we have messy data here and there that need to handle, such as working with replacing its data type, filling in missing value, replacing or even removing the peculiar data or missing value, adding new columns (formula) that are needed, or perhaps we need to joining several tables from different sheets or even data sources. Here SQL BigQuery comes in place to do such work, moreover when we have the main objective to visualize the data.

Quick Introduction to BigQuery

Google BigQuery is a cloud-based, fully managed, serverless enterprise data warehouse that supports analytics over petabyte-scale data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL. It delivers high-speed analysis of large data sets while reducing or eliminating investments in onsite infrastructure or database administrators. Google Cloud Platform, is designed to streamline big data analysis and storage. It also has built-in machine learning capabilities.

BigQuery Advantages (Source: Enterprise Strategy Group)

Seeing the robust, efficient, and security of the BigQuery, we can utilize its capability to do data cleansing before we launch our data into Google Data Studio.

As been explained earlier article (see the first part of this sequel article here), after we connect the dataset into the Google Cloud Platform environment, sneaking the dataset reliability, and investigate what kind of data wrangling we need to conduct, the next thing that takes an imperative role is data cleansing, of which 70% of data analyst’s work is this magnificent of data cleansing.

If we recall the dataset we have been using in an earlier article about Android App Market, here are the display of the dataset and its data type:

Display of Dataset (1)
Display of Dataset (2)
Datatype of Dataset

But beforehand, it is imperative for us to launch the dataset into BigQuery data source, hence later we can do data wrangling using BigQuery.

Launch the Dataset into BigQuery Database

Here are the steps:
1. Go to https://console.cloud.google.com/
2. Authorize your google account for the very first time
3. Create New Project

4. Choose the New Project we’ve just created and click ‘Create Dataset’ on the right side of the new project Name.
5. Fulfill the columns required especially on DatasetID (this will be the name of the dataset) and Data Location (choose nearest your home country), finally click create a dataset.

6. When a dataset is created, there will be a pop-up message that prompts us to go to the dataset.

7. Right click of new_dataset we have just created, and click ‘Create Table’

8. Next choose the table source (is it from the drive, upload, etc), type the table name, and fulfill the Schema. The schema we see here is the same as the schema in MySQL means the dataset structure itself. Field name indicating the column name, type is for its data type, Mode whether it is nullable, required or repeated, Description if you’d like to add notes every one of each column, and finally click Create Table.

Data Cleansing using BigQuery

The scenario here is we are going to see the general overview from all applications available on the market, that need columns: App, Category, Rating, Size, Installs, and Type. Seeing the columns we need and its data structures (also looking back its investigation result in an earlier article), there are several works required to do in data cleansing, such as:

  1. Installs column: Remove the separated comma and ‘+’ sign and change the data type into Integer.
  2. Price column: Applying the logic; If the price is zero, then write as it is, otherwise removes the ‘$’ sign and change the data type into float.
  3. Remove nan value in Rating and Size columns

The SQL syntax to do such work is as follows:

SELECT App, Category, Rating, Size, Type,CASE WHEN Price != ‘0’ then SAFE_CAST((SPLIT(Price, ‘$’)[SAFE_OFFSET(1)]) AS FLOAT64) 
else SAFE_CAST(Price AS FLOAT64) end as price
,CAST(REGEXP_REPLACE((SPLIT(Installs, ‘+’)[SAFE_OFFSET(0)]), r’[⁰-9.]’, ‘’) AS INT64) InstallsFROM `bia-binar-2021.Kaggle_app.apps`WHERE rating is not null and size is not null

Now we run that SQL syntax on BigQuery Editor to see whether our syntax is working…

Drumroll for the result

and here are the result:

Perfect, halfway to go! Our data is ready to be further analyzed!
The next thing we do is to launch our SQL syntax to be the data source of later visualization.

Connect your SQL Query to Google Data Studio

The good thing we have when using Google Cloud Platform is we can get benefit from all platform interconnection under Google’s roof. When we have a personalized query that works in BigQuery, we can connect those results into Google Data Studio to visualize the cleaned dataset. The steps to connect SQL query to GDS are as follows:

  1. Go to datastudio.google.com
  2. Add new report
  3. Choose ‘BigQuery’ in data connection

4. Next choose Custom Query, Project Name is where your dataset lies, and on the right box paste your personalized query.

Yeay that’s it!

Now we can create any charts or perhaps a dashboard to visualize the data, as I created as follows:

Wait, how can we create such a dashboard?
If you are curious, stay tuned for the next story.

Until next time, data enthusiasts! :D

--

--