Getting the Data
Cooking is both art and science, and so is analytics. Both start with getting the right ingredients. No matter how many spices and cooking techniques you apply, the dish won’t taste right if the correct ingredients aren’t present. You have to know the right places that keep the good stuff, and tap into them. Getting hold of quality ingredients is the first step in cooking. Analogously, getting your hands on the right kind of data is the most important prerequisite in analytics. Cooking with the wrong ingredients ultimately leaves you a victim of unhealthy food, pesticide-laden vegetables and a whole host of diseases. A data science project likewise involves data you cannot blindly trust to be authentic; otherwise, all the effort spent downstream will go to waste.
Cleansing the data
This part of the workflow is getting all the ingredients ready. They have to be washed, peeled, cut to the right shape and size, and then maybe washed again. Bringing the data into the right shape is an equally important step before any analytics can work on it properly. So you need to know techniques to slice and dice the data, remove the unwanted weeds in it, and bring it into a sleek shape by studying and processing it thoroughly.
The cooking part
Here’s where the magic happens in both cases. In the data science world this step corresponds to applying the right algorithm so that the right results are achieved. The ingredients have to be processed in the right order, sometimes in parallel; the right spices need to be added at the right time; and sufficient time needs to be given for the processing. You keep tasting the dish intermittently just to see that the ingredients are balanced, make minor corrections, and make sure it is all going in the right direction.
The same holds in data science. Once the data has been cleansed, the right algorithm needs to be applied with the right configuration. The models need to be trained thoroughly and validated intermittently to be sure they are on track. Once this is done, the dish is ready to be served, i.e. the system is ready to be deployed!
We often leverage multiple sources of heat in the kitchen – the microwave oven, the conventional oven, the stovetop burners and so on. These sources are used independently and in parallel to get the job done quickly. In data science that is exactly how things are done. This part of the process is called parallel/distributed computing, which is the underlying concept behind “map-reduce” – a term you might have heard of and assumed was difficult to understand. It is nothing but a lot of calculations happening simultaneously, so that a lot of data can be processed in a short span of time.
Data science is not rocket science. Anyone with the right amount of talent and the hunger to learn and explore can solve some really great problems. The right data science feature in your product will leave the user savoring the taste of the machine intelligence. So, strap up your aprons, and start having fun with data!
Above is a visual representation of the map-reduce framework.
Map-reduce is a programming technique for solving aggregation, grouping and summation problems over huge amounts of data, where the operations are performed in parallel across a number of distributed machines.
In the map function(s), we collect the frequency of the data after processing some subset of it.
In the reduce function(s), we merge and aggregate the intermediate results coming from the map function(s), and can apply further statistical formulas on the result to produce the final output.
So the final result of map-reduce is a summary of the raw data that sat in the repository before processing.
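The map and reduce steps described above can be sketched on a single machine in Python. The function names `map_phase` and `reduce_phase` are illustrative, not a real framework API; real systems such as Hadoop distribute these steps across machines.

```python
# A minimal single-machine sketch of the map-reduce idea, applied to
# word-frequency counting over chunks of text.
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Each mapper counts word frequencies in its own subset of the data.
    return Counter(chunk.split())

def reduce_phase(counts_a, counts_b):
    # The reducer merges intermediate counts produced by the mappers.
    return counts_a + counts_b

chunks = ["the king in the north", "the north remembers"]
mapped = [map_phase(c) for c in chunks]          # would run in parallel in practice
total = reduce(reduce_phase, mapped, Counter())  # aggregate the intermediate results
print(total["the"])  # 3
```

On a cluster, each `map_phase` call would run on a different machine against its own slice of the data, and the merge would happen in a shuffle/reduce stage; the logic, however, is exactly this.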
We should understand the scenarios where the map-reduce operation can and cannot be applied.
1. Map-reduce is a batch operation, so it should not be applied in online (real-time) scenarios.
2. Map-reduce cannot be applied to recursive problems, because in recursion the outer cycle depends on the inner cycle of the problem, which is not the case for map-reduce.
3. If the data size is small, i.e. it can be processed on a single machine, we should not apply it; map-reduce programming on a cluster would be over-engineering there.
4. We can use map-reduce for frequency calculation over unique objects, and apply any sort of aggregation function on those frequencies to derive a new dimension of information.
Readers are welcome to comment on further scenarios where map-reduce should or should not be applied.
We will describe some map-reduce scenarios in our next articles.
Nailing down the right question – in those competitions someone else already did that job. The question is already formulated; competitors just need to answer it. But most of the time, one prime task of a data scientist is to formulate the right question. You will often come across clients or collaborators who hold tons of data and know something good can be extracted from it, but just don’t know the exact question to ask. Asking the right question is a key skill all data scientists should develop.
Collecting more data – in those competitions someone already collected the data; no data seeking is left for you. But in practice another key task for a data scientist is to collect more relevant data from disparate sources and add it to the analysis to get a better answer to the business question. Data scientists should be skilled at probing the domain expert and looking around the web for more data.
Thinking on scalability – in those competitions the size of the data is fixed. But in reality there is a continuous influx of data (velocity != 0). Data scientists always need to be prepared to accommodate more and more data. That means they should be skilled at developing algorithms, data management and data handling tools that scale properly with the size of the data.
User interaction, user interface – most of the time those competitions ask you to submit results in some machine-readable format, which is compared automatically, so you never learn to care about your end user’s experience. In practice, what a data scientist produces in terms of algorithms and methods is usually consumed by some other group of people through a report or a dashboard. Even if the algorithms are designed for automated decisions, some expert will intervene to make sure the automated system is working properly.
Product deployability – in those challenges there is no question of deployment; you never really learn what becomes of all your efforts. Deployment comes with much more responsibility. A continuous monitoring system is typically designed to make sure the algorithms are working within expectations, and fallback options are prepared. It also needs to be ensured that the system cannot be tricked.
Maintainability – in those challenges there is no question of maintainability. You can make your algorithms as complicated as you like as long as you are chasing accuracy and jumping higher on the leaderboard. Reality is different: algorithms and methods should be simple and effective. For a 0.1% increase in accuracy one cannot make the system twice as complex. Maintainability also requires the software to be modular, so that new data sources can be added easily and new algorithms can be tested easily in the future.
Now many of you will ask me: should we stop participating in those competitions? What is the alternative?
As I said, web-based competitions will not make you a complete data scientist, but they will definitely give you a good starting point. Such competitions give you scope to develop the core skill of designing algorithms. Moreover, they are sources of some real data; for many practising individuals or teams, the biggest problem is getting hold of real data.
The best way to become a complete data scientist (with all the complementary skills listed above) is to come up with your own question and use your own data. How? OK, you bring the question, and I’ll show you the data: the whole of social media is a huge source of data. Most tools, like Python and R, provide APIs to access Facebook, LinkedIn, Twitter, and many others. Your smartphone is loaded with multiple sensors; you just need an app like Sensor Logger to get those values out as CSV files for further analysis. The same holds for your car. In general, sensors are getting cheap, and you can collect data on temperature, pressure and many more parameters around your home in real time.
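As a small illustration of working with such sensor exports, here is a Python sketch that parses a CSV of readings and computes a summary statistic. The column names and values are made up for the example; a real sensor-logger export would be read from a file instead of an in-memory string.

```python
# Parse a (synthetic) CSV of sensor readings and summarize it with the
# standard library only.
import io
import csv
import statistics

# Stand-in for a CSV file exported by a phone sensor-logging app.
raw = io.StringIO(
    "timestamp,temperature_c\n"
    "2016-01-01T00:00,21.5\n"
    "2016-01-01T00:01,21.9\n"
    "2016-01-01T00:02,22.3\n"
)
readings = [float(row["temperature_c"]) for row in csv.DictReader(raw)]
print(statistics.mean(readings))
```

From here the same readings could be resampled, plotted, or joined with other sensors; the point is simply that your own devices already produce analyzable data.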
Here is a useful list of data sources that are available to the public: http://www.reddit.com/r/datasets/, and the list goes on…
Thank You !
- Scraped all the subtitle (.srt) files from the internet, up to Season 5.
- Lords of Text Analytics processed all the text to make some sense out of it.
- One True Algorithm was implemented to teach Mr. Machine all he should know.
The files have been loaded, but what about the functions in the above snippet that look unknown? The complete code is given on my GitHub page, and the data is here. You can refer to those if you want to go for an implementation of your own.
import pysrt
import os

files = os.listdir('s04/')
sentences = []
for f in files:
    try:
        subs = pysrt.open('s04/' + f)
    except:
        continue
Data Cleaning Part
The above snippet was used inside the loop where the files were processed one by one. The cleaning part had roughly these steps –
text = clean_text(subs.text)
sentences += review_to_sentences(text, tokenizer, remove_stopwords=True)
- Lower case all the text
- Sentence tokenize and then word tokenize
- Since this is a special dataset with its own vocabulary, I had to come up with my own brand of stopwords –
more_stops = ['would','ill','come','one','up.','up','whose','get','',' ','well','say','see','going','like','tell','want','make','know','year','go','yes','take','time','never','could','need','let','enough','many','keep','nothing','oh','look','father','think','cant','thing','still','even','heard','call','back','hear','u','ever','said','better','every','find','may','word','boy','man','lady','woman','give','must','day','done','right','good','always','little','long','seven','girl','son','brother','way','child','king','lord','mother','away','got','whats','ask','wanted','put','first','much','something','friend','sure','course','told','made','war','god','old','people','world']
- I also had to normalize some of the text so that variants would be interpreted as the same word. Here’s the list –
from bs4 import BeautifulSoup

text = text.replace('\n', ' ')

# Strip the <font> tags that appear inside subtitle files
soup = BeautifulSoup(text, 'html.parser')
for tag in soup.find_all('font'):
    tag.replaceWith('')
text = soup.get_text()

# Expand contractions and normalize a few domain-specific variants
# (same substitutions as the original replace chain)
replacements = {
    "they're": 'they are', "They're": 'They are',
    "They've": 'They have', "they've": 'they have',
    "I've": 'I have', "won't": 'would not',
    "don't": 'do not', "Don't": 'Do not',
    "he'll": 'he will', "It's": 'It is', "it's": 'it is',
    "we're": 'we are', "We're": 'We are',
    "you've": 'you have', "You've": 'You have',
    "You're": 'You are', "you're": 'you are',
    "he's": 'he is', "He's": 'He is',
    "she's": 'she is', "She's": 'She is',
    "I'm": 'I am', "one's": 'one is',
    "didn't": 'did not', "That's": 'That is', "that's": 'that is',
    "There's": 'There is', "there's": 'there is',
    "We'll": 'We will', "we'll": 'we will',
    "We've": 'We have', "we've": 'we have',
    "Where's": 'Where is', "where's": 'where is',
    "haven't": 'have not', "we'd": 'we would',
    "Isn't": 'Is not', "isn't": 'is not',
    "you'd": 'you would', "You'd": 'You would', "I'd": 'I would',
    "aren't": 'are not', "you'll": 'you will', "You'll": 'You will',
    "it'll": 'it will', "It'll": 'It will',
    "weren't": 'were not', 'men': 'man',
    'lannisters': 'lannister', "robb's": 'robb', "wasn't": 'was not',
}
for old, new in replacements.items():
    text = text.replace(old, new)
- I did not want to stem the data, as for this dataset I thought stemming would be too aggressive an approach. So I lemmatized it instead –
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
words = [wordnet_lemmatizer.lemmatize(w) for w in words if w not in stops]  # lemmatize
And now, for the algorithm… I was itching to use word2vec on some dataset, and this looked like the perfect match. Here’s the Wikipedia definition –
Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word, and must guess which words occurred in adjacent positions in an input text. I won’t be going into the details of the algorithm here, but you can check it out at this link. Python code –
And stunningly, once the algorithm had done its bit, the results turned out to be great. Word2vec can be used in many ways; I used it in the following –
import logging
from gensim.models import word2vec

# Configure the built-in logging module so that Word2Vec
# creates nice output messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality
min_word_count = 2    # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 30          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
print "Training model..."
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_4minwords_10context"
model.save(model_name)
- Input a keyword to get the context
This was what I got across seasons for the same keywords that I gave –
[(u'die', 0.9999886155128479), (u'dead', 0.9999884366989136), (u'name', 0.9999882578849792), (u'sansa', 0.9999881982803345), (u'killed', 0.9999881386756897)]
The evolution of Winterfell, for example, becomes quite evident with every passing season. The first season was about the Starks and their honor and grace; Season 4 was more about Starks getting murdered and losing control of Winterfell. Simply amazing!
Associated keywords by season:
- winterfell – S01: stark, lannister, grace, hand, ser; S02: grace, stark, ship, night, true; S03: stark, grace, kill, die, sansa; S04: die, dead, name, sansa, killed; S05: north, name, thank, night, wall
- stark – S01: hand, ser, grace, last, honor; S02: night, grace, sister, ship, city; S03: kill, grace, sword, wedding, fight; S04: night, dead, last, sister, ser; S05: fight, queen, night, house, north
- tyrion – S01: ser, jamie, stark, kingdom, lannister; S02: jamie, please, hard, ser, home; S03: wedding, grace, sansa, sword, stark; S04: name, die, night, ser, love; S05: ser, people, army, grace, death
- westeros – S01: lannisters, life, die, sword, war; S02: night, life, grace, ship, people, true; S03: kill, die, sword, grace, life; S04: die, life, tyrion, night, family; S05: people, life, queen, world, free
- cersei – S01: robert, landing, queen, house, stark; S02: city, queen, grace, stark, night; S03: landing, blood, life, sansa, people; S04: tyrion, queen, last, family, dead; S05: queen, people, world, army, hand
- baratheon – S01: joffrey, hand, old, wall, war; S02: stannis, grace, ship, night, wall; S03: hand, wedding, landing, blood, stark; S04: wall, lannister, die, night, last; S05: wall, world, queen, north, death
- Given a set of words, predict the odd man out!
model.doesnt_match('khal greyjoy targaryen'.split())
Input: night, watch, wall, westeros – Odd man out: westeros
Input: khal, greyjoy, targaryen – Odd man out: greyjoy
Input: ned, robb, catelyn, baelish – Odd man out: baelish
Input: arya, robb, sansa, snow, catelyn – Odd man out: catelyn
I want to talk about one of the issues data scientists face, yet one of the least discussed topics in modeling today.
If you Google “modeling”, you will be bombarded with innumerable links to tutorial videos giving you easy, step-by-step instructions on how to build a predictive model, but nobody tells you what to do if your model fails at the validation stage.
To understand the points below, you first need to know how to build a model. A modeling exercise can typically be classified into 4 broad categories:
- Data Preparation – data imputation and correlation checks
- Model Creation – build the model over multiple iterations, dropping insignificant variables
- Model Validation – check the goodness of fit of the model on the training data
- Prediction – predict with the final model on the test data
What if, after you have built your model, you see the R-squared is a mere 30-40%, i.e. the independent variables explain only 30-40% of the variation in your dependent variable? That is clearly a bad fit. This situation is very common and typically occurs due to:
- Parent data: you have too few observations, or you are using an incorrect set of data for modeling
- Variables: it may also be the case that the independent variables simply lack the explanatory power to explain your dependent variable
Solution 1: Poor data preparation is the most important and common cause holding back your R-squared. So even before you start modeling, you need to perform thorough data preparation. How well you clean and impute your data (outlier capping/flooring, missing-value imputation) correlates directly with the goodness of fit you will get. Also, keep in mind that data imputation should be performed with the business in mind, not just by textbook statistical procedure.
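The two preparation steps named above can be sketched as follows. This is an illustrative Python sketch, not the article's own code: the 1st/99th-percentile capping thresholds and the choice of median imputation are assumptions for the example.

```python
# Illustrative outlier capping/flooring and missing-value imputation.
import numpy as np

def cap_and_floor(x, lower_pct=1, upper_pct=99):
    # Cap extreme values at the chosen percentiles (thresholds are a choice).
    lo, hi = np.nanpercentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def impute_median(x):
    # Fill missing values with the median of the observed values.
    x = np.asarray(x, dtype=float)
    x[np.isnan(x)] = np.nanmedian(x)
    return x

values = np.array([1.0, 2.0, np.nan, 3.0, 1000.0])  # one missing, one outlier
clean = cap_and_floor(impute_median(values))
```

In practice the capping level and the imputation rule should be chosen with the business context in mind, as the paragraph above stresses, rather than applied mechanically.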
Solution 2: To be blatantly honest, if you are working on a small dataset, it is not advisable to model with that kind of data: even if you see a good fit, it is going to fail on the test dataset. If you have historical data from a time cross-section, i.e. time series data, and your variables are not able to capture the time trend or cyclicity present in the data, you must check for trend, cyclicity and volatility, remove them, and then try to rebuild your model.
Now, let’s say you have done all of this to the dot and your R-squared is still trailing. You may try variable transformations, also called feature engineering. In this exercise you take the log, square, or square root of a variable and check its correlation with the response (dependent) variable. It is a genuine scenario that a variable in its naked form has no explanatory power for the response, yet when tweaked a little by transforming it, a strong correlation is observed. You then use the transformed variable in your model instead of the variable in its original form. For some data transformation techniques, refer here.
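The transformation check described above can be sketched in Python. The data here is synthetic (constructed so that the response is roughly linear in the log of the predictor), and the `pearson` helper is written out for illustration.

```python
# Compare how well raw, log, sqrt and squared versions of a predictor
# correlate with the response, on synthetic data.
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

x = [1, 2, 4, 8, 16, 32]
y = [0.1, 1.1, 1.9, 3.1, 3.9, 5.2]   # roughly linear in log(x)

candidates = {
    "raw": x,
    "log": [math.log(v) for v in x],
    "sqrt": [math.sqrt(v) for v in x],
    "square": [v * v for v in x],
}
for name, xt in candidates.items():
    print(name, round(pearson(xt, y), 3))
```

Here the log transform correlates with the response far better than the raw variable, which is exactly the situation the paragraph describes: the transformed variable, not the original, belongs in the model.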
If after all this your R-squared still does not reach your desired value, one thing I would suggest is to perform a stepwise regression.
A stepwise regression is simply an iterative way of doing regression where, at each iteration, variables are added to the model depending on the t-statistics of their estimated coefficients. The steps are performed either forward or backward. This method should boost your R-squared, as you are adding only highly significant variables to your model.
The stepwise selection process consists of a series of alternating forward selection and backward elimination steps. The former adds variables to the model, while the latter removes variables from the model.
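The iterative idea can be sketched in Python. This is a deliberately simplified illustration on synthetic data: it greedily adds the variable giving the biggest R-squared gain, whereas real implementations select on t-statistics or information criteria; the entry threshold plays a role loosely analogous to an SLENTRY significance level.

```python
# A simplified forward-selection sketch: repeatedly add the candidate
# variable that most improves R-squared, stopping when the gain is small.
import numpy as np

def r_squared(X, y):
    # Fit OLS (with intercept) via least squares and return R-squared.
    X1 = np.column_stack([np.ones(len(y))] + X)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)  # x3 is pure noise

candidates = {"x1": x1, "x2": x2, "x3": x3}
selected = []
while candidates:
    gains = {name: r_squared([c for _, c in selected] + [col], y)
             for name, col in candidates.items()}
    best = max(gains, key=gains.get)
    base = r_squared([c for _, c in selected], y) if selected else 0.0
    if gains[best] - base < 0.01:   # entry threshold for adding a variable
        break
    selected.append((best, candidates.pop(best)))

print([name for name, _ in selected])
```

With this synthetic data the procedure picks up the two genuinely informative variables and leaves the noise variable out, which is the behavior stepwise selection is meant to deliver.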
The following statements use PROC PHREG to produce a stepwise regression analysis. Stepwise selection is requested by specifying the SELECTION=STEPWISE option in the MODEL statement. The option SLENTRY=0.25 specifies that a variable has to be significant at the 0.25 level before it can be entered into the model, while the option SLSTAY=0.15 specifies that a variable in the model has to be significant at the 0.15 level for it to remain in the model. The DETAILS option requests detailed results for the variable selection process.
proc phreg data=Myeloma;
   model Time*VStatus(0)=LogBUN HGB Platelet Age LogWBC Frac
         LogPBM Protein SCalc
         / selection=stepwise slentry=0.25 slstay=0.15 details;
run;
Try this on a model you abandoned earlier due to poor R-squared and see if it lifts the measure.
require(MASS)

# Fit a logistic regression, then run stepwise selection with stepAIC
# (the stepwise() calls in the original are not part of MASS; stepAIC is).
# 'mydata' is a placeholder for your own data frame.
m.glm <- glm(y ~ ., family = binomial, data = mydata)
stepAIC(m.glm, trace = FALSE)
stepAIC(m.glm, direction = "both")   # alternating forward/backward steps