Sunday, 9 August 2015

Visual Representation using Charts

In this blog post we will try to understand how charting works in Python, using a bar plot as the example.

Taking the Radish Survey results, we can create a chart in the following manner.


Let us try to understand the code, one step at a time.

import matplotlib.pyplot as plt
import numpy as np

names = []
votes = []
for radish in counts:
    names.append(radish)
    votes.append(counts[radish])

x = np.arange(len(counts))
plt.bar(x, votes)
plt.xticks(x + 0.5, names, rotation=90)


First, two modules are imported. pyplot is Matplotlib's plotting interface, and NumPy provides numerical functions and arrays for Python.

names = []
votes = []
for radish in counts:
    names.append(radish)
    votes.append(counts[radish])

Next, this loop splits the counts dictionary into two parallel lists, names and votes, so that the data can be passed to Matplotlib easily.
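As a minimal sketch of what this loop produces, the example below runs it on a small, made-up counts dictionary (the numbers are placeholders, not the real survey results). It also shows a shorter way to get the same two lists, which is an alternative and not the code used above.

```python
# Hypothetical vote counts; the real values come from the Radish Survey.
counts = {"April Cross": 72, "Champion": 76, "Cherry Belle": 58}

names = []
votes = []
for radish in counts:
    names.append(radish)          # variety name (dictionary key)
    votes.append(counts[radish])  # number of votes (dictionary value)

# The same two lists, built without an explicit loop.
names_short = list(counts.keys())
votes_short = list(counts.values())

print(names)
print(votes)
```

The two lists line up by index: votes[0] is the count for names[0], and so on.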

x = np.arange(len(counts))

This creates an array of evenly spaced integers (0, 1, 2, ...), one per variety, which serve as the bar positions on the x-axis.

plt.bar(x, votes)

This function draws the actual bar graph, where x gives the position of each bar and votes gives its height.

plt.xticks(x + 0.5, names, rotation=90)

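This sets the tick positions on the x-axis and uses the variety names as their labels, rotated 90 degrees so that they do not overlap. The 0.5 offset moves each label toward the middle of its bar when bars are left-aligned, as they were by default in Matplotlib at the time; with center-aligned bars (the default since Matplotlib 2.0), plain x does the same job.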




No Strings Attached



In this blog post we'll look at how Python helps us find the winner in a voting scenario. Here we are looking at votes cast for varieties of radish: 300 voters voting for 11 varieties. These are the steps we followed -

Step 1 - Reading Data
The data is available as a text file and is not comma separated. This is what the first few lines of the data look like -

Evie Pulsford - April Cross
Matilda Condon - April Cross
Samantha Mansell - Champion
geronima trevisani - cherry belle

Now we use the string method split() to separate each line into a name and a vote, and then print one name and one vote per line with "voted for" between the two.
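A minimal sketch of this step is shown below. It assumes each entry has the form "Name - Vote" and uses an in-memory list in place of the text file, so the variable names here are illustrative only.

```python
# Sample entries taken from the excerpt above; assumes one "Name - Vote" entry per line.
lines = [
    "Evie Pulsford - April Cross",
    "Matilda Condon - April Cross",
    "Samantha Mansell - Champion",
    "geronima trevisani - cherry belle",
]

formatted = []
for line in lines:
    # strip() removes surrounding whitespace; split(" - ") cuts the line at the separator.
    name, vote = line.strip().split(" - ")
    formatted.append(name + " voted for " + vote)

for text in formatted:
    print(text)
```

Splitting on " - " (with the spaces) keeps the separator's padding out of the name and the vote.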


Step 2 - Inspecting Votes
Here we try to find out how many people have voted for the variety White Icicle by simply searching for the string "White Icicle" on every line.
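One way this search could look is sketched below, using made-up sample lines with placeholder voter names. It is an illustration under those assumptions, not the exact code from the original exercise.

```python
# Made-up sample lines in the "Name - Vote" format; the names are placeholders.
lines = [
    "Voter One - White Icicle",
    "Voter Two - April Cross",
    "Voter Three - white Icicle",
    "Voter Four - White Icicle",
]

# Keep only the lines that contain the exact text "White Icicle".
matches = [line for line in lines if "White Icicle" in line]

for line in matches:
    print(line)
print(len(matches), "matching lines")
```

The search is case sensitive, so the entry spelled "white Icicle" is not matched.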
 
Step 3 - Counting Actual Votes
Moving forward, we count the votes for White Icicle, Daikon, and Sicily Giant. To make our lives easier, we define a function named 'count_votes' for this.
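A minimal sketch of such a function follows. Only the name count_votes comes from the post; the parameters (a list of lines and a variety name) and the sample data are assumptions made for illustration.

```python
def count_votes(lines, radish):
    """Return how many lines record a vote for the given variety."""
    count = 0
    for line in lines:
        name, vote = line.strip().split(" - ")
        if vote == radish:
            count += 1
    return count


# Made-up sample lines; the voter names are placeholders.
lines = [
    "Voter One - White Icicle",
    "Voter Two - Daikon",
    "Voter Three - White Icicle",
    "Voter Four - April Cross",
]

for radish in ["White Icicle", "Daikon", "Sicily Giant"]:
    print(radish, count_votes(lines, radish))
```

Comparing the vote field exactly, instead of searching the whole line, avoids counting a voter whose name happens to contain the variety.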

Step 4 - Counting All the Votes
Now we count the votes for every variety, using slightly different code. You will notice something funny: Python reports some varieties more than once. This happens because the data is not uniform. For example, we have -
"White Icicle"
"white Icicle"
" White Icicle"
Python reads these as 3 different varieties, so our next goal is to combine the votes for all 3 spellings.
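The effect can be reproduced with a short sketch. The function name count_all and the sample lines are hypothetical; the point is only that a dictionary keyed on the raw vote text treats each spelling as its own variety.

```python
def count_all(lines):
    """Count votes per variety, using the vote text exactly as written."""
    counts = {}
    for line in lines:
        name, vote = line.strip().split(" - ")
        counts[vote] = counts.get(vote, 0) + 1
    return counts


# Made-up lines showing the three spellings; the third has a double space after the dash.
lines = [
    "Voter One - White Icicle",
    "Voter Two - white Icicle",
    "Voter Three -  White Icicle",
]

counts = count_all(lines)
for vote in counts:
    print(repr(vote), counts[vote])
```

Printing with repr() makes the leading space in the third key visible.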

Step 5 - Uniting the Varieties with Spelling Errors
Stripping the surrounding whitespace and using consistent capitalization for every vote makes the different spellings count as one variety.

Step 6 - Correcting Double Spaces
Double-space errors are corrected by replacing every double space with a single space throughout the data.

Step 7 - Finding the Winner
We are almost done here. What remains is to detect double votes, print the results in a more user-friendly manner, and finally find the winner.
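The clean-up and the final count can be sketched together as follows. The helper names clean and find_winner, the sample lines, and the choice to keep only a voter's first vote are all assumptions for illustration; the original exercise may handle double votes differently.

```python
def clean(text):
    """Collapse double spaces, trim whitespace, and normalize capitalization."""
    return text.replace("  ", " ").strip().title()


def find_winner(lines):
    """Return (winner, counts), counting only the first vote of each voter."""
    counts = {}
    voted = set()
    for line in lines:
        name, vote = line.strip().split(" - ")
        name, vote = clean(name), clean(vote)
        if name in voted:
            continue  # assumption: a repeat voter's later votes are ignored
        voted.add(name)
        counts[vote] = counts.get(vote, 0) + 1
    # max() with key=counts.get picks the variety with the most votes.
    winner = max(counts, key=counts.get)
    return winner, counts


# Made-up sample lines; the voter names are placeholders.
lines = [
    "Ann Example - White Icicle",
    "bob example - white Icicle",
    "Cara Example -  White Icicle",
    "Dan Example - Champion",
    "Ann Example - Champion",
    "Eve Example - Cherry  Belle",
]

winner, counts = find_winner(lines)
for vote in counts:
    print(vote + ": " + str(counts[vote]))
print("Winner: " + winner)
```

If two varieties tie, max() returns the one that was counted first, so a real tally may need an explicit tie check.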

Sunday, 19 July 2015

The Animal Kingdom of Data Science



It was 4.54 billion years ago that this beautiful planet, which we now call Earth, came into existence. A couple of billion years later came living organisms that, in some way or other, we call animals. 200,000 years ago, we humans came along and evolved much faster than most living organisms. 74 years ago, some of us contributed towards making a new world with 0s and 1s at its core. A few years ago, another set of animals came along to make our lives comfortable. I would like to tell you about some of these animals from behind the screen.




Python-
Python is a high-level programming language whose origins date back to the late 1980s. Its design emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. Used together with Spyder, Python becomes an even more user-friendly coding environment. The following is an example program for the Fibonacci series-
def fib(n):
    # Return the n-th Fibonacci number (1, 1, 2, 3, 5, ...)
    a, b = 1, 1
    for i in range(n - 1):
        a, b = b, a + b
    return a

print(fib(5))  # prints 5



Anaconda - Anaconda is a free distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing, that aims to simplify package management and deployment. Its package management system is conda.

 




Spyder- Spyder (formerly Pydee) is an open source cross-platform IDE for scientific programming in the Python language. Spyder integrates NumPy, SciPy, Matplotlib and IPython, as well as other open source software.






Hadoop – This tiny elephant is capable of working big wonders. Technically put, “Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.” What is interesting is how the name Hadoop came into existence: it is the name that the son of Doug Cutting (the creator of Hadoop) gave to his yellow toy elephant. Apparently, even the name Google has similar origins.





Hive - The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

So, we can happily conclude that data scientists are animal lovers! On a more serious note, the reason for this naming pattern can be understood in Doug Cutting's own words: "The rules of names for software is they're meaningless because sometimes the use of a particular piece of software drifts, and if your name is too closely associated with that, it could end up being wrong over time."


(Source : Wikipedia, hive.apache.org)