Startup Profit Prediction Using Multiple Linear Regression

Fitrie Ratnasari
5 min read · Aug 9, 2021

The aim of this work is to create a multiple linear regression model that predicts a startup's profit from its R&D, marketing, and administrative spending.

I remember working with the Ministry of Creative Economy back in 2014, when we created a compact program for local digital startups to help them raise the quality of their companies; one of the KPIs was to increase their profits. My basic question at the time was: which predictors drive profit up? Are those predictors correlated with profit in a way that could help prioritise business decisions? To answer this question I looked for a dataset that matches the goal, and Kaggle has one (the dataset is a simple one, but it is enough to show that multiple linear regression can be a solution for profit prediction).

So let’s get started!

Dataset Description & Initial Planning

The dataset describes startups and contains each company's R&D Expense, Marketing, Administrative Cost, State of Origin, and Profit.

The aim is to create a regression model that predicts a startup's profit from its R&D, marketing, and administrative spending. The model is built with several approaches: Linear Regression, Ridge, Lasso, and Elastic Net.

The dataset is in CSV format and consists of 50 entries (rows) with 5 attributes, whose data types are float and object, as shown in the picture below:

Exploratory Data Analysis & Data Wrangling

After acquiring the dataset, we run a series of data quality checks to see how good the data is:

  1. Missing values: none were found.

2. Multicollinearity between variables: using the VIF (variance inflation factor) score, we found that two of the variables, R&D Spending and Marketing, are highly correlated. To explain further:

VIF = 1 means there is no correlation between this independent variable and the other variables.

A VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others. By this criterion, the result below shows that the features R&D Spend and Marketing Spend have high multicollinearity, that is, they are correlated with each other, which can inflate the variance of the coefficient estimates and increase the model's error rate.

For that reason we will later use models that are more robust to this issue (Lasso, Ridge, and ElasticNet) and compare them with plain Linear Regression.
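To make the VIF rule concrete, here is a minimal sketch of how the score can be computed: each feature is regressed on the remaining features, and VIF = 1 / (1 - R²) of that regression. The function name vif_scores and the synthetic columns are assumptions made for illustration; they stand in for the Kaggle data and are not the exact code behind the result above.

```python
import numpy as np

def vif_scores(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic stand-in data (NOT the Kaggle startup data).
rng = np.random.default_rng(0)
rd = rng.normal(size=200)
marketing = 0.95 * rd + 0.2 * rng.normal(size=200)  # strongly tied to rd
admin = rng.normal(size=200)                        # independent of both

vif = vif_scores(np.column_stack([rd, marketing, admin]))
for name, score in zip(["rd", "marketing", "admin"], vif):
    print(f"{name}: VIF = {score:.2f}")
```

In this toy example the two tied columns score well above 5 while the independent one stays close to 1, which mirrors the pattern described above.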

3. Correlation with the target: the result is consistent with what we saw in the VIF check.
A bluer colour indicates a stronger correlation with the profit, so we can say that R&D Spending and Marketing are the variables most correlated with profit.

4. Normality of the target distribution: the test fails to reject the null hypothesis of normality for the profit variable, so we treat profit as approximately normally distributed.

5. Skewness of the X variables (excluding Profit): none of the variables is notably skewed (every skewness value is below 0.75), so no additional transformation is needed.
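Checks 4 and 5 can be sketched as below. The article does not state which normality test was used, so SciPy's normaltest (D'Agostino's K² test) is an assumption here, as are the helper name check_distribution and the synthetic samples; only the 0.75 skewness threshold comes from the text above.

```python
import numpy as np
from scipy import stats

def check_distribution(values, alpha=0.05, skew_limit=0.75):
    """Summarise normality and skewness for one numeric column."""
    values = np.asarray(values, dtype=float)
    # Null hypothesis of normaltest: the sample comes from a normal distribution.
    _, p_value = stats.normaltest(values)
    skewness = stats.skew(values)
    return {
        "p_value": float(p_value),
        "reject_normality": bool(p_value < alpha),
        "skewness": float(skewness),
        "needs_transform": bool(abs(skewness) > skew_limit),
    }

# Synthetic samples (NOT the Kaggle data): one symmetric, one right-skewed.
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=100_000, scale=20_000, size=500)
skewed = rng.exponential(scale=50_000, size=500)

print("symmetric:", check_distribution(symmetric))
print("skewed:   ", check_distribution(skewed))
```

A column whose absolute skewness exceeds the limit would be a candidate for a log or similar transformation; a column below it can be left as it is.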

In data cleansing we carry out two steps:

  1. Label-encode the state of origin.

2. Remove outliers: the profit column had 1 outlier, which was removed. Before removing the outlier:

After removing the outlier:
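The two cleansing steps could look roughly like this. The outlier rule is not spelled out in the article, so the 1.5 × IQR box-plot rule used here is an assumption, and the small frame with its column names and values is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame (NOT the Kaggle data); column names are assumed.
df = pd.DataFrame({
    "State": ["New York", "California", "Florida",
              "New York", "Florida", "California"],
    "Profit": [105000.0, 98000.0, 110000.0, 102000.0, 15000.0, 99000.0],
})

# Step 1: turn the state names into integer codes.
df["State"] = LabelEncoder().fit_transform(df["State"])

# Step 2: drop rows whose profit lies outside the 1.5 * IQR fences.
q1, q3 = df["Profit"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = df[df["Profit"].between(lower, upper)].reset_index(drop=True)

print(f"rows before: {len(df)}, rows after: {len(cleaned)}")
```

Note that LabelEncoder assigns arbitrary integer codes, which a linear model reads as an ordering; one-hot encoding is a common alternative when that ordering is not meaningful.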

Regression Models & Result

In this work we built four different models to predict the profit of a startup company:

  1. Linear Regression
  2. Ridge Cross Validation
  3. Lasso Cross Validation
  4. ElasticNet Cross Validation

As a result, we found that all four models work well with the dataset. Ridge regression performs best, with a slightly better R2 score and a lower RMSE than the others, as we can see in the table below:

Comparison table of the regression models

Ridge and Lasso play an important role here because they add a penalty that shrinks the coefficient of each variable, which reduces the risk that multicollinearity among the variables destabilises the model. This is the advantage of using Ridge and Lasso. That said, the dataset is already quite neat, so even without much transformation the models can predict the company's profit well.
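The four-model comparison described above can be sketched as follows. Everything specific in this block is an assumption: the data is synthetic (with two correlated spending columns to mimic the multicollinearity), the features are standardised before fitting, and the alpha grid and train/test split are arbitrary choices rather than the settings behind the table.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Kaggle startup data).
rng = np.random.default_rng(42)
n = 200
rd = rng.uniform(0, 160_000, n)
marketing = 2.0 * rd + rng.normal(0, 30_000, n)  # correlated with rd
admin = rng.uniform(50_000, 180_000, n)
profit = 50_000 + 0.8 * rd + 0.03 * marketing + rng.normal(0, 8_000, n)
X = np.column_stack([rd, admin, marketing])

X_train, X_test, y_train, y_test = train_test_split(
    X, profit, test_size=0.2, random_state=0
)

alphas = np.logspace(-3, 3, 25)  # candidate penalty strengths
models = {
    "Linear": LinearRegression(),
    "RidgeCV": RidgeCV(alphas=alphas),
    "LassoCV": LassoCV(alphas=alphas, cv=5, max_iter=10_000),
    "ElasticNetCV": ElasticNetCV(alphas=alphas, l1_ratio=[0.5, 0.9],
                                 cv=5, max_iter=10_000),
}

results = {}
for name, estimator in models.items():
    # Standardise the features so the penalty treats them on the same scale.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_train, y_train)
    predicted = pipeline.predict(X_test)
    r2 = r2_score(y_test, predicted)
    rmse = float(np.sqrt(mean_squared_error(y_test, predicted)))
    results[name] = (r2, rmse)
    print(f"{name:13s} R2 = {r2:.4f}  RMSE = {rmse:,.0f}")
```

On clean, nearly linear data like this the four scores end up close to one another, which is the same qualitative picture as in the comparison above.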

To visualise this better, we plot the actual profit against the predicted profit for the four models:

Because the R2 scores differ only slightly, the lines of the models overlap one another. All four models predict the profit well, with very little spread between their predictions.

The next question is: how can we validate that our model is reliable?

Looking at the residual error plot above, we can say that the regression model is valid: the residuals are scattered, do not follow any pattern, and are evenly distributed around y = 0 (i.e. mean = 0.000).

Moreover, when we check the distribution of the residuals, we find that it follows a normal distribution, as we expected.
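The residual checks above can also be expressed as numbers. This is a hedged sketch on synthetic data with a plain LinearRegression; the held-out split, the use of a correlation coefficient as a stand-in for "no pattern", and the normaltest call are all assumptions, not the procedure behind the plots.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Kaggle startup data).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 5.0 + X @ np.array([3.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
predicted = model.predict(X_test)
residuals = y_test - predicted

# 1. Residuals should be centred on zero.
residual_mean = float(residuals.mean())
# 2. Residuals should show no linear trend against the predictions.
trend = float(np.corrcoef(predicted, residuals)[0, 1])
# 3. Residuals should look normal (null hypothesis: normally distributed).
_, p_value = stats.normaltest(residuals)

print(f"mean = {residual_mean:.3f}, trend = {trend:.3f}, p = {p_value:.3f}")
```

The checks are done on held-out rows on purpose: for ordinary least squares with an intercept, the training residuals have zero mean and zero correlation with the fitted values by construction, so those two numbers say little when computed in-sample.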

Suggestion & Summary

There are a couple of suggestions for taking the analysis further:

  1. Use a dataset that also includes other variables related to startup profit, such as financial ratios, corporate actions, the company's stock price if it is traded on the market, etc.
  2. Try feature engineering, such as polynomial features, to see whether it brings any improvement.
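As a starting point for the second suggestion, the sketch below compares a plain linear model with one that first expands the inputs into degree-2 polynomial features. The data is synthetic and deliberately contains a squared term, so the improvement shown here is built in; whether the startup data benefits in the same way would have to be tested.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a genuinely quadratic effect (NOT the Kaggle data).
rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

linear = LinearRegression()
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds squares and products
    LinearRegression(),
)

# Mean R2 over 5 cross-validation folds for each model.
linear_r2 = cross_val_score(linear, X, y, cv=5, scoring="r2").mean()
quadratic_r2 = cross_val_score(quadratic, X, y, cv=5, scoring="r2").mean()

print(f"linear R2    = {linear_r2:.3f}")
print(f"quadratic R2 = {quadratic_r2:.3f}")
```

With only about 50 rows in the real dataset, a cross-validated comparison like this is a safer guide than a single train/test split, since extra polynomial terms can easily overfit.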

Trivia: predicting a new dummy data point

If a company has R&D spending of USD 165,000, an administrative cost of USD 137,000, and a marketing expense of USD 470,000, how much profit could it make?

Using the best model we have, RidgeCV regression, we find that the predicted profit is USD 190,582 (remember that this result still carries around 4.8% residual error, so the actual profit may differ slightly).

So that is all! Feel free to visit my GitHub to see this work.
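Mechanically, scoring a new company looks like the sketch below. The training data here is synthetic and the column names are assumed, so the number it prints is not the USD 190,582 quoted above; the point is only to show how a fitted RidgeCV pipeline is applied to one new row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (NOT the Kaggle data); column names are assumed.
rng = np.random.default_rng(3)
n = 200
train = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 170_000, n),
    "Administration": rng.uniform(50_000, 185_000, n),
    "Marketing Spend": rng.uniform(0, 480_000, n),
})
profit = (50_000 + 0.8 * train["R&D Spend"]
          + 0.03 * train["Marketing Spend"] + rng.normal(0, 8_000, n))

model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(train, profit)

# The new company from the question, with columns in the training order.
new_startup = pd.DataFrame([{
    "R&D Spend": 165_000,
    "Administration": 137_000,
    "Marketing Spend": 470_000,
}])
predicted_profit = float(model.predict(new_startup)[0])
print(f"Predicted profit: USD {predicted_profit:,.0f}")
```

Keeping the scaler and the regressor in one pipeline means the new row is scaled with the statistics learned from the training data, which avoids a common mistake when predicting on fresh inputs.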

Have a good day! :)
