End to End Process on How to Visualize Your Data using Google Cloud Platform (1/4)

Fitrie Ratnasari
7 min readFeb 15, 2022

Presenting your data is always a great way to strengthen your statement with evidence-based information. No matter the environment, whether work or study, everyone eventually meets an occasion when they need to present with data. The challenges: sometimes the data itself is not ready to be analyzed (it needs further data wrangling), perhaps we do not know what kind of data cleansing is appropriate, or we might be confused about which data visualization tool fits our particular conditions (it needs to stay private, and we have a minimal budget, for instance).

In this article, I’ll share an end-to-end process for visualizing your data with Google’s data visualization tool, Google Data Studio, since it is completely free and gives us plenty of room to explore if we want to create a useful (not to mention fancy) dashboard.

To make it easier to follow, I will split the process into a few parts:

  1. Getting your data and inspecting it thoroughly to get a sense of it — covered in this article
  2. How to do data cleansing or data wrangling in SQL (using BigQuery — Google Cloud Platform) — see here
  3. How to visualize each part of your data with GDS features, and finally
  4. How to present your data effortlessly (a framework I have used for more than 5 years of work for top-level governmental bodies and corporations).

Without further ado, let’s get started!

Connect your Data to Google Cloud Platform

To visualize your data in Google Data Studio (as part of Google Cloud Platform), first we need to authorize our Google account on the platform. Afterward, we need to create a new report (as a dashboard sheet) and add a data source, as the pictures below show:

Create New Report as Dashboard Sheet
Add data source to dashboard sheet

After we click Add Data, we are going to connect our dashboard to one of these data source options as shown in the picture below:

What if you only have .xls data? Since the platform is owned by Google, you need to upload the file to your Google Drive first, so that it can be detected as a Sheets file. If the file is not yours but has been shared with you by someone else, choose ‘Shared With Me’ after you select Google Sheets, as shown below:

On the other hand, if you have a .csv file you can choose ‘File Upload’ right away in the Google Connectors. You can also connect your SQL database to GDS.

Sneaking your data as if you’re the Data ‘Sherlock’

We as Data ‘Sherlock’ Magnifying The Case

Now that we have our data, the next thing we should do is investigate its peculiarities. There are at least several checks we need to run to get a basic idea of the dataset and ensure its reliability, and most of the time I use the basic Python libraries Pandas and NumPy for this prerequisite checking (though it’s OK if you prefer other tools, MS Excel or SQL for example). To give you a better understanding, in this article I’m using a dataset of the Android App Market, and here is the work:

  1. The number of rows and columns of the dataset

First things first, we need to know how big the dataset we’re using is; simply calling .shape in pandas gives us the answer:

import pandas as pd
import numpy as np
data = pd.read_csv('Apps.csv')
data.shape #to know the number of rows and columns of dataset

So in the dataset, we have 9,659 rows with 14 columns or fields.

2. List each column/field and its data type (Int, Float, String/Text, Object, etc)

data.dtypes #show the list of column and their data types

Wait, why do we need to know each column's data type? Most of the time, the data types in the original dataset are not ready for further analysis, nor for any data visualization tool, so with this check we learn which data types are not appropriate and need data cleansing afterward. For example, in this dataset the number of Installs has object type (it is supposed to be an integer, not an object) and Price has object type (it is supposed to be a float).
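To see for ourselves why pandas reads these columns as object, here is a minimal sketch that peeks at a few raw values (the column names Installs and Price are assumed; adjust them to match your CSV):

data.Installs.head() #values such as '10,000+' contain commas and a plus sign, so pandas keeps them as text
data.Price.head() #values such as '$4.99' contain a currency symbol, so they are not parsed as float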

3. Basic (Descriptive) Statistics of each numeric column/field

Not to worry, we’re just doing the very basic ones :D
data.describe() #to get descriptive statistics in numerical columns

The advantage of doing basic stats is that we can spot any peculiarity in each field/column and see what the skewness of a column implies. For example, in the Size column the min is 0 MB and the max 100 MB, with a median (50%) of 12 MB and a mean of 20 MB, so we know app sizes are right-skewed, with most data points concentrated below 28 MB.
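If you want to double-check that reading of the skew, here is a small sketch (assuming Size is already numeric in this dataset, as the describe() output suggests):

data.Size.mean() #mean of Size, about 20 MB here
data.Size.median() #median of Size, about 12 MB; a mean well above the median hints at a right skew
data.Size.skew() #a positive value confirms the right (positive) skew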

Another good thing about doing this is that, at a glance, we can tell whether we have null values in each column. For example, the Reviews and Unnamed columns have 9659 rows, while Rating has 8196 rows and Size has 8432 rows. Another way to check for null values is with isna() or isnull(), covered in point 6 of this section.
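As a quick sketch, here is another way to see the non-null counts of every column at once, including the non-numeric columns that describe() skips by default:

data.count() #non-null row count per column; any column with fewer than 9659 rows has missing values
data.info() #also lists each column's non-null count together with its data type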

4. How many unique values are there in each column?

Sometimes, out of curiosity, we want to know what the dataset tells us about a particular column. For example, this time I wanted to see how the Category column is filled, so that we know what’s inside Category and how many rows fall under each particular category. We can simply use .value_counts() to see this:

data.Category.value_counts() #value_counts counts how many rows fall into each category across the entire dataset

5. Are there any duplicate rows we need to handle?

Duplicate rows are noise in our dataset that can distort the outcome, so we need to know whether there are any duplicated data points.

data.duplicated(keep='first').sum() #to count how many duplicated rows we have; 'first' means the first occurrence is kept and the others are flagged as duplicates
data[data.duplicated()] #to show the duplicated rows (duplicated() already returns booleans, so no comparison with 'True' is needed)

6. Are there any null or NaN values?

For example, say we want to know whether we have null values in the Rating column:

data.Rating.isna().sum() #to count how many null values we have in the Rating column
data[data.Rating.isna()] #to show the rows where Rating is null

So with this we learn that we have 1463 null values in the Rating column. As noted in the pandas documentation, isna() and isnull() do the same thing: find missing values in a dataframe. So we could use isnull() here as well.

Oh well dear, what if I’m very new to these Python things but still want to try? You can use data exploration libraries named Mito and Bamboolib, which feel like dragging a spreadsheet into your notebook (just click here and there, and they’ll write the code for you automatically).
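As a rough sketch of how those point-and-click helpers are opened in a Jupyter notebook (both are separate packages you need to install first, and their exact behavior may differ between versions):

import mitosheet
mitosheet.sheet(data) #opens an interactive spreadsheet view of the dataframe

import bamboolib as bam
data #once bamboolib is imported, displaying a dataframe in a cell shows its point-and-click UI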

OK good, now we have quite a good sense of the dataset, the peculiarities of each data field, and the kind of data cleansing we need to conduct. As a subsequent process, we’ll do that magnificent data cleansing or data wrangling to polish our dataset, and that will be explained in the next article :D

See ya!
