5 min to read
Predict the number of days before a house is sold
I was reading this article - Waiting to be sold: Prediction of Time-Dependent House Selling Probability. I loved reading about the data science approach taken to predict when a house would be sold on an online listing site - and I just had to explore more :).
The problem

Real estate websites such as trulia.com do not tell you how long it will take for a house to sell after it gets listed.
This information is equally important to both the potential buyer and the seller. With it, the seller gains an understanding of what she can do to expedite the sale, e.g. reduce the asking price, renovate/remodel some home features, etc. On the other hand, a potential buyer gets an idea of how much time she has to react, i.e. to place an offer.
In data science terms, this seems like a linear regression problem at first glance. The predicted variable is the unit of time it takes for a house to sell - regardless of whether the unit is days, weeks or months.
We expect that the data is collected in the following way: the website is crawled over a certain window of time - say 2 months.
What is likely to happen? The data will contain houses which are listed but not yet sold - i.e. we are missing the label for these data points.
I have 9+ years of experience in machine learning, and I have worked on regression problems numerous times.
The data set here suggests that a simple regression could be used to predict the number of days a house stays listed on a webpage before it gets sold. However, for many houses we do not know the actual sale date. Say that we take a window of 1 month of observations from this website. Chances are high that only a few houses would get sold during this month, and many houses listed before we started observing would still be for sale. In addition, new houses would be added, which also would likely not get sold during our month of observation. This leaves us with a lot of data points, i.e. houses, but few labels, i.e. the number of days it took for a house to get sold after its original listing day. This means that we:
either discard the data points we cannot get the label for
or impute the missing label value with something of our choosing.
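To make the trade-off concrete, here is a minimal sketch of both options on made-up listings (all prices and durations are invented for illustration; `None` marks a house still unsold when the crawl window closed):

```python
# Hypothetical listings from a 1-month crawl window.
# days_to_sale is known only for houses that sold inside the window;
# None means the house was still listed when we stopped observing.
listings = [
    {"price": 250_000, "days_to_sale": 12},
    {"price": 310_000, "days_to_sale": 25},
    {"price": 420_000, "days_to_sale": None},  # still listed
    {"price": 180_000, "days_to_sale": None},  # still listed
    {"price": 275_000, "days_to_sale": 8},
]

# Option 1: discard unlabeled points -- shrinks the training set.
labeled = [h for h in listings if h["days_to_sale"] is not None]

# Option 2: impute the missing label, e.g. with the mean of the
# observed labels -- biases the model toward fast sales, since the
# unsold houses have already been listed longer than we can see.
mean_days = sum(h["days_to_sale"] for h in labeled) / len(labeled)
imputed = [
    {**h, "days_to_sale": h["days_to_sale"]
     if h["days_to_sale"] is not None else mean_days}
    for h in listings
]

print(len(labeled))   # 3 of the 5 houses survive option 1
print(mean_days)      # (12 + 25 + 8) / 3 = 15.0
```

Either way we distort the data: option 1 throws away 40% of it here, and option 2 pretends the slowest-selling houses sell at the average speed.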
This all is great, however this intervention takes us farther from reality: it lets us train a regression algorithm, but one we should not expect to work well in real life. For years I thought that since we have missing labels and incomplete data, there is only so much we can do to solve problems like this, and that the best strategy is simply to wait patiently and collect more data diligently.
divine music plays
Doctors and researchers often face the following problem: a new treatment is tested on a group of patients to see how effective it is. In many cases it is a life or death situation.
Imagine that a new lung cancer drug is tested and the researchers need to understand whether it saves the lives of the patients or not. It is important to note here that saves is a very mild way to put it, and depending on the clinical trial it might mean many things, such as cures, relieves symptoms, extends life expectancy, among others.
Clinical trials are time-boxed - say to 1 year. 100 patients might enroll for the trial at the beginning of the year. 50 could drop out during the year. The doctors might lose contact with 7 more patients, and sadly, 23 might die from lung cancer during the trial year.
If the goal is to predict the number of days a patient might live after starting the new treatment, we would have only 23 data points for which we know the exact number of days. That is a 77% reduction of the original data set. We as data scientists might try to impute some of the missing labels so as to expand the data set on which we train a regression model.
Doctors are not data scientists. When what is being predicted is so serious - how many days a patient has left to live - it is dangerous to impute data and make assumptions, to put it mildly. Survival analysis was created to work with problems like this.
Survival analysis takes into consideration that 20 patients lived through the 1-year trial. This means that their number of days to live is greater than 365 and still counting - an important piece of information which needs to be taken into account. Likewise, the fact that no further information could be collected for the other 57 patients does not mean we know nothing about them: we know they were alive up to the day we lost track of them. Survival analysis values this information as well; such observations are called censored.
Instead of predicting how many days a patient would live, what is predicted is the percentage of all patients who are still alive at incremental time periods - e.g. the % of patients alive after 6 months, 1 year, 1 year 2 months, 1 year 4 months, etc., from the beginning of the trial.
Survival analysis can then be applied to many other mundane real-life problems such as house sales (real estate), employee retention (HR), customer subscription lifespan (sales), economics, insurance and others. Read more about it in the links I have shared at the bottom of the page.
Such a wonderful introduction to survival analysis deserves a practical tutorial in Python. Here goes my take on it.
- Survival Analysis in Python