Recommendation algorithms also make heavy use of high dimensional feature vectors to represent users of interest and items to be recommended. The items whose feature vectors are nearest to the feature vector of a target user are often recommended to that user. Recommendation systems are vital components of today’s commerce and web portals.

However, computing distances between two high dimensional vectors can be computationally expensive. In the era of big data, we often need to deal with databases of billions of records. A linear scan of the database records, followed by expensive distance computations, is simply impractical. This is where hashing algorithms, or more accurately, learn-to-hash algorithms, come to the rescue. The simple idea behind this class of hashing algorithms is that, for each high dimensional vector, we can derive a short binary hash code that preserves similarities from the original vector space under the simple Hamming distance in the hash code space. This way, computing distances among data points in the code space is trivially fast. Combined with other techniques, we can deliver fast, sub-linear search implementations.
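To make the Hamming-distance idea concrete, here is a toy sketch of my own (the item names and 8-bit codes are made up; real systems use longer codes and bit-level tricks):

```
# Hamming distance between two binary codes stored as integers:
# XOR the codes, then count the differing bits.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Hypothetical 8-bit codes for three database items and one query.
codes = {"item1": 0b10110010, "item2": 0b10110011, "item3": 0b01001100}
query = 0b10110110

# Rank items by Hamming distance to the query -- a cheap integer
# operation, versus a full floating-point distance in the original space.
ranked = sorted(codes, key=lambda name: hamming(codes[name], query))
print(ranked)
```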

Combining powerful ML modeling algorithms that produce high-quality feature vectors with powerful learn-to-hash algorithms that produce short and efficient binary codes, we can design highly accurate and yet highly scalable information retrieval systems. That is why I surveyed the field of learning to hash, and the result is the following PPT:

l2h-survey


Published in: *2018 21st International Conference on Information Fusion (FUSION)*

**Abstract**:

In the presence of unknown correlations, the optimal data fusion, in the sense of Minimum Mean Square Error, can be formulated as a problem of minimizing a nondifferentiable but convex function. The popular projected subgradient methods are known to converge slowly. The single-projection optimal subgradient method, OSGA-V, is known to be numerically more efficient. This paper presents necessary formulations and methods for the application of the OSGA-V algorithm in the minimization of the optimal data fusion problem, achieving much faster convergence rate than the projected subgradient method. We expect this method to significantly reduce the computational cost and time to achieve optimal data fusion in the presence of unknown correlations.

**Conference PPT**:


Abstract: In this article, we first presented a mean 3D face model from [1], [2], with 21 facial landmark coordinates, in an easy-to-use CSV file format. We reviewed the popular POSIT algorithm for head pose estimation, and then presented our simplified derivation of it. At the end, we provided a Python implementation of the modernPosit algorithm and demonstrated the computation of the face bounding rectangle based on the computed head pose. The novelty lies in the simplified interpretation of the POSIT algorithm.

During one of our company meetings, a co-worker raised an interview question that aroused my intense interest. The question goes like this:

**Given a positive integer N, how many different ways can one express it as a sum of consecutive integers?**

For a naive brute force solution, the complexity of the algorithm is obviously $O(N^2)$. With some optimization, he was able to improve it to $O(N)$.
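For reference, a naive baseline might look like the following sketch (my own illustration, not my co-worker's actual code): try every starting integer and extend the run until the sum reaches or passes $N$.

```
def count_ways_bruteforce(N):
    """Count expressions of N as a sum of 2+ consecutive positive integers,
    by trying every starting integer and accumulating until the sum >= N."""
    count = 0
    for start in range(1, N):
        total, k = 0, start
        while total < N:
            total += k
            k += 1
        # a valid expression needs the exact sum and more than one term
        if total == N and k - start > 1:
            count += 1
    return count

print(count_ways_bruteforce(15))  # -> 3:  7+8, 4+5+6, 1+2+3+4+5
```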

For me, intuitively, this is a math problem, and it immediately aroused my interest in looking at it in more detail. With a little mathematical derivation, it turns out I can solve the question with a complexity of $O(\sqrt{N})$.

But before I present my algorithm, let me first clarify a couple of edge cases.

- The number $N$ can obviously be expressed as the sum of the single number $N$ itself. I would like to exclude this trivial case.
- For the set of consecutive integers, we assume the integers have to be positive.

Thus the original question can be stated more precisely as:

**Given a positive integer N, how many different ways can one express it as a sum of two or more consecutive positive integers?**

Let’s denote the consecutive numbers summation expression as:

$$ m\ +\ (m+1)\ +\ …\ +\ (m+n-1) $$

where $$ m > 0\ ,\ n > 1 $$

then we have

$$ N\ = \ m\ +\ (m+1)\ +\ …\ +\ (m+n-1) $$

or $$ N\ =\ mn + (1\ +\ 2\ +\ …\ +\ (n-1))\ =\ mn+\frac{n(n-1)}{2}$$

or $$ N=n(m + \frac{n-1}{2})$$

The question can now be rephrased as:

How many different *positive integer pairs* $(m, n)$ can we find satisfying:

$$N=n(m + \frac{n-1}{2}),\ \ \ \ \ m>0,n>1\ \ \ \ (1)$$

Since $m$ is a positive integer $(m \ge 1)$, we have

$$ N \ge n(n+1)/2$$ or $$ n^2+n-2N \le 0$$

This leads to

$$1\lt n \le \frac{\sqrt{8N+1}-1}{2}\ \ \ \ \ (2)$$

Inequality (2) effectively limits the possible values of $n$. We still need to find the value of $m$ given an $n$ satisfying inequality (2). We consider two different cases.

**If $n$ is an odd number**

then $\frac{n-1}{2}$ is an integer. For $n$ to be a solution of equation (1), $n$ MUST be a factor (divisor) of the number $N$. Conversely, if $n$ is indeed a factor of $N$, and $n$ satisfies inequality (2), then a positive integer $m$ exists satisfying equation (1).

**If $n$ is an even number**, then we can rewrite equation (1) as equation (3):

$$ 2N=n(2m+n-1)\ \ \ \ \ (3)$$

Because $n$ is assumed to be an even number, $2m+n-1$ is an odd number. Thus the number $n$ must be a factor of $2N$ and it MUST contain the full power of 2 in $2N$ (so that $\frac{2N}{n}$ is odd). Again, inequality (2) must be satisfied for a viable $m$ to exist given $n$.

Summarizing the above two cases, we get the following algorithm:

- Let $P=\lfloor \frac{\sqrt{8N+1}-1}{2} \rfloor$.
- Find the set **$O$** of all the odd factors of $N$ between 3 and $P$ inclusive.
- Denote the maximum exponent of 2 in $N$ as $e$, and let $q=2^{e+1}$.
- Add $q$ into the set **$E$** if $q \le P$. For each element $o$ in **$O$**, if $oq \le P$, then add $oq$ into the set **$E$**.
- The number of elements in **$O$** and **$E$** determines the number of summation expressions for $N$.
- If we want to further identify the summation expressions: for each element $n$ in **$O$** and **$E$**, use $m=\frac{N}{n}-\frac{n-1}{2}$ to find the corresponding $m$.

Python code would be the following:

In [1]:

```
import numpy as np

def get_consecutive_sum_n_values(N):
    P = int(np.floor((np.sqrt(8*N + 1) - 1) / 2))
    # odd factors of N between 3 and P
    n_values = [o for o in range(3, P + 1, 2) if N % o == 0]
    # q = 2^(e+1), where e is the maximum exponent of 2 in N
    q = 1
    while q <= N and N % q == 0:
        q <<= 1
    # even candidate values of n
    if q <= P:
        n_values.extend([q * o for o in n_values if o * q <= P])
        n_values.append(q)
    return n_values

def print_consecutive_sum_expression(N):
    n_values = get_consecutive_sum_n_values(N)
    print('expressions for:', N)
    for n in n_values:
        m = int((2*N - n*(n-1)) / 2 / n)
        if n == 2:
            print(' ', n, 'integers: (', m, '+', m + n - 1, ')')
        else:
            print(' ', n, 'integers: (', m, '+ ... +', m + n - 1, ')')
```

Some testing code:

In [2]:

```
print_consecutive_sum_expression(9)
print_consecutive_sum_expression(15)
print_consecutive_sum_expression(100)
print_consecutive_sum_expression(1000000)
```

The algorithm above is simple enough: it takes less than 10 lines of code. However, with further mathematical analysis, it is possible to simplify it even more.

Let’s first relax the requirement that $m$ be positive, allowing $m$ to take the value of zero or a negative integer. We then have the following observations:

- Note: $$ (-p)\ +\ (-p+1)\ +\ …\ +\ 0\ +\ …\ +\ (p-1)\ +\ p\ +\ (p+1)\ +\ …\ +\ (p+n-1)\ =\ (p+1)\ +\ …\ +\ (p+n-1)$$ $$where\ p \ge 0 $$ That is, we can convert the summation of a consecutive sequence of integers that starts with a non-positive integer into the summation of the corresponding consecutive sequence that starts with a positive integer.
- Note: $$(p+1)\ +\ …\ +\ (p+n-1)\ =\ (-p)\ +\ (-p+1)\ +\ …\ +\ 0\ +\ …\ +\ (p-1)\ +\ p\ +\ (p+1)\ +\ …\ +\ (p+n-1)$$ $$where\ p \ge 0 $$ Conversely, we can convert the summation of a consecutive sequence of integers that starts with a positive integer into the summation of the corresponding consecutive sequence that starts with a non-positive integer.
- Summarizing the above two observations, we conclude that viable summation sequences have a one-to-one mapping between those that start with positive integers and those that start with non-positive integers. Or we can say:

**Relaxing the requirement of $m \gt 0$, the number of viable sequences that start with positive integers is the same as the number of viable sequences that start with non-positive integers. Both are half of the total number of viable sequences.**

Please note that in the above conversions, we added or removed $2p+1$ integers. So if the number of elements of the sequence before conversion (think $n$) is odd, the number of elements after conversion is even, and vice versa. Thus we have:

**With the requirement of $m \gt 0$ relaxed, the number of viable sequences with an odd number of elements is the same as the number of viable sequences with an even number of elements. Both are half of the total number of viable sequences.**

Combining the above two bold observations, we get:

**The number of viable sequences starting with positive integers is the same as the number of viable sequences containing an odd number of elements.**

So the original question becomes: **Without requiring $m$ to be positive, how many pairs of $m$ and $n$ can we find for the following equation:** $$N=n(m + \frac{n-1}{2}),\ \ \ \ \ n>1,\ n\ \text{odd}\ \ \ \ (4)$$

It should be easy to see that the answer is exactly: **the number of odd factors of the given $N$ greater than 1**.

Please note that once you get a non-positive $m$ out of $m=\frac{N}{n}-\frac{n-1}{2}$, you need to follow the conversion procedure to get the sequence that starts with a positive integer.

In [3]:

```
def get_consecutive_sum_n_values_simplified(N):
    # odd factors of N greater than 1
    return [o for o in range(3, N + 1, 2) if N % o == 0]

def print_consecutive_sum_expression_simplified(N):
    n_values = get_consecutive_sum_n_values_simplified(N)
    print('expressions for:', N)
    for n in n_values:
        m = int((2*N - n*(n-1)) / 2 / n)
        if m <= 0:
            # convert to the equivalent sequence starting with a positive integer
            n -= (-m*2 + 1)
            m = -m + 1
        if n == 2:
            print(' ', n, 'integers: (', m, '+', m + n - 1, ')')
        else:
            print(' ', n, 'integers: (', m, '+ ... +', m + n - 1, ')')
```

The same testing codes:

In [4]:

```
print_consecutive_sum_expression_simplified(9)
print_consecutive_sum_expression_simplified(15)
print_consecutive_sum_expression_simplified(100)
print_consecutive_sum_expression_simplified(1000000)
```

The simplified algorithm in Python takes only one line of code. However, in terms of computational complexity, the simplified algorithm as written is actually worse, at $O(N)$.

However, if we are smart about finding the odd factors instead of using the brute-force scan, the simplified algorithm will have the same $O(\sqrt{N})$ complexity as the algorithm given earlier.
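For example, here is one possible sketch of my own (assuming trial division is acceptable): strip the even part of $N$, then count the remaining factors by trial division up to $\sqrt{N}$.

```
def count_consecutive_sums(N):
    """Number of ways to write N as a sum of 2+ consecutive positive
    integers == number of odd factors of N greater than 1, in O(sqrt(N))."""
    while N % 2 == 0:          # strip the even part; odd factors are unchanged
        N //= 2
    count = 0
    d = 1
    while d * d <= N:          # N is odd now, so only odd candidates matter
        if N % d == 0:
            count += 1 if d * d == N else 2   # count both d and N // d
        d += 2
    return count - 1           # exclude the trivial odd factor 1

print(count_consecutive_sums(15))       # -> 3:  7+8, 4+5+6, 1+2+3+4+5
print(count_consecutive_sums(1000000))  # matches the earlier algorithms
```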

This article is for those who just got into the Python world; experienced Python developers are excused to leave now. Python is extremely simple to start with: the traditional *Hello World* program takes only one line. In no time, you will be writing classes, methods, and system calls, and really enjoying the convenience and power of the language. But quickly, you also realize that there is a lot of strange code floating around, and you hear plenty of unfamiliar jargon. You could feel a little puzzled, and a little confused about the big picture of the Python world.

That is where this article comes to help. I intend to draw a broad picture of the Python world, with pointers for you to explore in depth. At the end of the article, you won’t become an expert on anything with Python. But, hopefully, you will be satisfied in understanding the big picture, and know where to go for further information and to gain expertise.

All software packages start with installation. As of the writing of this post, two major versions of Python are prevalent: Python 2 and Python 3. Usually, Python 2 refers to version 2.7.x, and Python 3 refers to version 3.5+. Python 2 and Python 3 are incompatible in many ways. Which one to choose? The simplified answer: choose Python 3 for new projects, and for old projects use whatever version they were already using.

Chances are, in the near future, you will need to work in both Python 2 and Python 3 worlds for different projects. So you install both. But then, how do you switch between the two? When you run *python -m xxx* command, how do you know which version is started? Do you need to manually set up your environment variables for switching? That would be too complicated and error prone, and you need tools. That is when Virtualenv and Anaconda come into play.

Virtualenv and Anaconda solve more problems than Python versions. Remember, any real project needs support from many libraries, right? Different projects usually rely on different versions of the same libraries; i.e., it is incorrect to assume all of your projects can always use the latest versions, because different projects can be at different stages of their lifecycles. So the industry’s best practice is to isolate *the Python version + specific library versions* into one bundle called an **environment**.

The difference between Virtualenv and Anaconda? Again, the simplest answer is

Anaconda = Virtualenv + Pip

functionality wise. Wait, what is Pip? Be patient, I will talk about it. But before that, I’d like to show some code:

On macOS, with Virtualenv, the commands to install and set up both Python 2 & 3 environments are:

```
brew install python
brew install python3
virtualenv env2
virtualenv --python=python3 env3
```

In the above commands, env2 and env3 are simply subfolders (you can change them to your favorite folder names) where the environments will be set up, i.e. all Python binaries and libraries will be installed and copied into those subfolders.

Environments need to be activated before invoking Python commands:

```
source env2/bin/activate
deactivate
source env3/bin/activate
deactivate
```

For Anaconda:

```
conda create --prefix ~/bio-env python=3.4
source activate ~/bio-env
source deactivate
```

In the above commands, you can change the Python version to use Python 2.

As promised, I will now talk about Pip. As you would expect, any programmer needs libraries. In Python, libraries are called packages. As in the Java world, you pull and install packages from a central repository, called PyPI – the Python Package Index. Most likely you don’t have to install Pip, because Pip is installed along with your Python environments.

Some sample pip commands for installing packages:

```
pip install SomePackage                   # latest version
pip install SomePackage==1.0.4            # specific version
pip install 'SomePackage>=1.0.4'          # minimum version
pip install -r requirements.txt           # "requirements files" list the items to be installed
pip list                                  # list installed packages
pip search "query"                        # search for packages
pip install --upgrade SomePackage         # upgrade a package if needed
```

Another concept you need to understand is Python Wheels.

*pip install*, by default, installs from source files, which involves compilation. This process could be demanding at times, especially when non-Python libraries are involved. Python Wheels allow others to pre-build a package for you, so that you can quickly install it without worrying about how to compile it. Makes sense, right?

```
pip install SomePackage-1.0-py2.py3-none-any.whl          # install from a wheel
pip install wheel
pip wheel --wheel-dir=/local/wheels -r requirements.txt   # build wheels into a local directory
```

Remember Anaconda? It has its own package installer tool, called Conda. Its usage pattern is similar to pip’s, but it uses a different central repository, the one from Continuum.

```
conda create --name snakes python=3   # create a Python 3 environment called snakes
conda info --envs                     # display the list of environments
source activate snakes                # activate the snakes environment
conda search beautiful-soup           # search packages
conda install package_name            # install a package
conda list                            # list installed packages
conda update package_name             # update a package
conda update conda                    # update conda itself
conda update python                   # update to the latest 2.x or 3.x version of python
conda remove package_name             # uninstall a package
```

In addition to the central repository, Anaconda also supports “private” repositories hosted on anaconda.org, called channels. To install from a channel, you can use the -c option, e.g.

```
conda install -c pandas bottleneck
```

Don’t worry, with Anaconda, you can still use Pip.

I personally prefer Anaconda because it has more integrated support across platforms and it has enough packages for scientific computing.

With the installation and environments taken care of, you are ready to do some serious coding.

You write a Python file with .py as the extension. Congratulations, you’ve got your first module developed. Of course, you will want to write your own libraries for others to use. Well, it is quite simple:

- Create a folder named after your library name
- Within that folder, put an empty __init__.py in it

Congratulations, you’ve got your first package ready. Of course, there is no meat in it yet, for you have not done any real coding.

If you want some classes visible to others when they write *import your_package_name*, you define those classes in the __init__.py file. If you prefer others to write *import your_package_name.your_module* or *from your_package_name import your_module*, you keep the __init__.py empty and create your_module.py.
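As a sanity check, the layout above can be simulated end to end. In this sketch, the package name demo_pkg, the module my_module, and the greet function are all made up for illustration:

```
import os
import sys
import tempfile

# Build the minimal package layout in a temporary folder.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(pkg_dir)

# An empty __init__.py marks the folder as a package.
open(os.path.join(pkg_dir, "__init__.py"), "w").close()

# One module with a trivial function in it.
with open(os.path.join(pkg_dir, "my_module.py"), "w") as f:
    f.write("def greet(name):\n    return 'Hello, ' + name\n")

# Import it the way a consumer would.
sys.path.insert(0, root)
from demo_pkg import my_module
print(my_module.greet("world"))   # Hello, world
```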

Don’t expect me to talk about how to write Python classes. It should be another of your Google searches.

Time to share your work with others. Visit this page to equip yourself with the full knowledge. Basically you need to provide a set of meta information to inform the packager how to package your stuff. Briefly:

- setup.py – your main file, containing a call to setup() from the Setuptools package, indicating the package name, URL, etc.
- setup.cfg – an ini file that contains option defaults for setup.py commands
- MANIFEST.in – needed in certain cases where you need to package additional files such as resource files
- optional stuff – license file, author file, contributions file, etc.
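For orientation, a minimal setup.py might look like the sketch below; the name, version, and URL are placeholders, and a real project will want more metadata:

```
# setup.py -- minimal sketch; all metadata values here are placeholders.
from setuptools import setup, find_packages

setup(
    name="your_package_name",
    version="0.1.0",
    url="https://example.com/your_package_name",
    packages=find_packages(),   # auto-discover packages with __init__.py
)
```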

The following commands will give you a sense of how to package:

```
python setup.py sdist                     # create a source distribution
python setup.py bdist_wheel --universal   # a universal (meaning pure-python) wheel
```

If you want to upload your package to the PyPI repository to reach a broader audience, you will further need the twine tool.

For any serious project, you would need to deploy it to some servers and get it running. The Fabric package is here to make it easy. Python Fabric allows you to easily invoke shell scripts locally or remotely on the servers.

At the top of your project folder, you want to create a *fabfile.py* which contains your deployment scripts. Some important imports at the top of that script:

```
from fabric.api import task, run, local, settings
from fabric.colors import green, red
from fabric.context_managers import prefix
```

Each method you define in the *fabfile.py* becomes a subcommand of the *fab* command. For example, the following routine:

```
@task
def mytask():
    run("a command")
```

should allow you to invoke the command from within the top folder of your project:

```
fab mytask
```

It is generally a good practice to have separate tasks for bootstrapping, deployment and testing. A tutorial is available here.

We should never forget about tests whenever we are coding, right? Pytest is the tool to use for that purpose.

Here REST means Representational State Transfer. We often need to deliver a micro service out of Python, and Flask is a good choice for that. It is RESTful, it supports templating, and it is fully WSGI compliant.

- **IPython** – I just can’t stress enough how useful this tool is. It is a Python prompt on steroids. It has completion, history, shell capabilities, and a lot more. Make sure you take a look at it.
- **NumPy** – You do math or science? You cannot live without it.
- **SciPy** – When we talk about NumPy, we have to talk about SciPy. It is a library of algorithms and mathematical tools for Python and has caused many scientists to switch from Ruby to Python.
- **matplotlib** – A numerical plotting library. It is very useful for any data scientist or data analyst.
- **SymPy** – SymPy can do algebraic evaluation, differentiation, expansion, complex numbers, etc. It is contained in a pure Python distribution.
- **NLTK** – Natural Language Toolkit – I realize most people won’t be using this one, but it’s generic enough. It is a very useful library if you want to manipulate strings, but its capabilities go beyond that. Do check it out.
- **SQLAlchemy** – A database library. Many love it and many hate it. The choice is yours.
- **BeautifulSoup** – I know it’s slow, but this XML and HTML parsing library is very useful for beginners.
- **Requests** – The most famous HTTP library, written by Kenneth Reitz. It’s a must-have for every Python developer.
- **Cython** – An extension language for the CPython runtime. It translates Python code to fast C code and supports calling external C and C++ code natively.
- **Pandas** – A library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.
- **Statsmodels** – A Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
- **scikit-learn** – An open source machine learning library for Python. It features various classification, regression and clustering algorithms, including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
- **Theano** – Uses NumPy-like syntax to optimize and evaluate mathematical expressions. What sets Theano apart is that it takes advantage of the computer’s GPU, making data-intensive calculations up to 100x faster than the CPU alone. Theano’s speed makes it especially valuable for deep learning and other computationally complex tasks.
- **TensorFlow** – Another high-profile entrant into machine learning, developed by Google as an open-source successor to DistBelief, their previous framework for training neural networks. TensorFlow uses a system of multi-layered nodes that allows you to quickly set up, train, and deploy artificial neural networks with large datasets. It’s what allows Google to identify objects in photos or understand spoken words in its voice-recognition app.
- **Scrapy** – An aptly named library for creating spider bots to systematically crawl the web and extract structured data like prices, contact info, and URLs. Originally designed for web scraping, Scrapy can also extract data from APIs.
- **Pattern** – Combines the functionality of Scrapy and NLTK in a massive library designed to serve as an out-of-the-box solution for web mining, NLP, machine learning, and network analysis. Its tools include a web crawler; APIs for Google, Twitter, and Wikipedia; and text-analysis algorithms like parse trees and sentiment analysis that can be performed with just a few lines of code.
- **Seaborn** – A popular visualization library that builds on matplotlib’s foundation. The first thing you’ll notice about Seaborn is that its default styles are much more sophisticated than matplotlib’s. Beyond that, Seaborn is a higher-level library, meaning it’s easier to generate certain kinds of plots, including heat maps, time series, and violin plots.
- **Bokeh** – Another great visualization library, aimed at interactive visualizations. In contrast to Seaborn, it is independent of matplotlib. The main focus of Bokeh is interactivity, and it renders in modern browsers in the style of Data-Driven Documents.
- **Keras** – An open-source library for building neural networks with a high-level interface, written in Python. It is minimalistic and straightforward, with a high level of extensibility. It uses Theano or TensorFlow as its backend, and Microsoft is now making efforts to integrate CNTK (Microsoft’s Cognitive Toolkit) as a new backend.

Understanding all these, you are officially an expert!

Between **Dropwizard** and Spring, I personally prefer **Dropwizard**, for it seems simpler and has a more elegant design. Then, what will Dropwizard do for us as developers? Let’s quickly run through a virtual micro-service-building exercise.

Obviously our micro-service will be built on top of an HTTP server, so we will need an embedded HTTP server as a starting point. When it comes to embedded HTTP servers, we had quite a few options in the past. You can read here for all the possible options. But, as of now, the clear winner is **Jetty**, so we will use **Jetty** as our embedded HTTP server.

Our micro-service will need to support a RESTful API. There is even a standard for that: JAX-RS, the Java API for RESTful Web Services. The best implementation in this case is **Jersey**.

With **Jersey**, you can use many pleasant annotations to turn your POJOs (Plain Old Java Objects) into web requests or web responses with ease. Take a peek at these annotations:

@Path, @GET, @PUT, @POST, @DELETE, @HEAD, @PathParam, @QueryParam, @MatrixParam, @HeaderParam, @CookieParam, @FormParam, @DefaultValue, @Context

Great, **Jersey** does the heavy lifting of marshaling between web request/response protocols and our Java objects, so that we can focus on our Java models instead of HTTP protocols. There are still gaps: we would like to load our application configuration from, most likely, a JSON/YAML file; additionally, we might want our requests/responses to contain JSON. We will need another library. Yes, that library is **Jackson**.

Apart from the above, we might also need support for things like **Metrics**, **Logging**, **JDBI**, and user input **validation**.

The good news is, Dropwizard provides all of the things above. I hope by now, you are convinced that using Dropwizard is a good choice for you when building your micro services. You can simply think of Dropwizard as a bundle of libraries that reflects the best engineering practices in the Java world.

To understand what a Dropwizard project looks like and how it works, I found this a very good starting point.

I recently did some Pandas-based analysis of Uber trip data for trip-prediction purposes. During the process, I found Pandas really helpful in data processing and visualization, and thus this post.

However, for obvious reasons, I am not supposed to publish internal business data online. Hence I stripped down the original code and changed the data source to stock prices.

Unfortunately, Uber trip data and stock price data are fundamentally different. For example, Uber trip data exhibit strong weekly patterns while stock prices do not. As a side effect, some of the analysis in this post is not practically useful, and some of it might not even make sense. But I still consider it valuable practice with Pandas.

Pandas is really a great tool for data transformation, analyzing and visualization, as long as the data set can fit in memory.

To better understand Pandas, one first needs a grasp of the basic concepts. I found the best starting point was to read through Introduction to Data Structures. Once you understand the basic data structures, the rest should be easier to follow.

For those who need to deal with time series, you should also read through Time Series / Date functionality.
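As a tiny taste of what those two documents cover (a toy example of my own, not from the Uber analysis), here is a Series on a business-day index with date-string access and weekly resampling:

```
import pandas as pd

# Five business days of toy data on a DatetimeIndex (2017-05-01 is a Monday).
idx = pd.date_range("2017-05-01", periods=5, freq="B")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

print(s.loc["2017-05-03"])         # label-based access by date string
print(s.resample("W-FRI").sum())   # weekly aggregation, anchored on Fridays
```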

As with any data science work, we start with data. Here we need historical stock data. We used to have the Yahoo Finance and Google Finance Python packages at our disposal. The problem is that Yahoo seems to be a sinking ship now: as of May 2017, the Yahoo Finance API for historical data is broken, while Google Finance provides realtime data only.

So let me roll up my sleeves, take a detour, and provide an API for querying historical data from the Google Finance web site.

In [1]:

```
from datetime import date
from dateutil.relativedelta import relativedelta
import urllib.request

class GoogleFinanceAPI():
    def __init__(self, symbol):
        self._symbol = symbol.upper()

    # Warning: if you call this method for too much data, Google Finance
    # truncates it and only gives you partial results
    def get_simple_historical(self, start_date, end_date=date.today()):
        url_string = "http://www.google.com/finance/historical?q={0}&startdate={1}&enddate={2}&output=csv".format(
            self._symbol, str(start_date), str(end_date))
        with urllib.request.urlopen(url_string) as f:
            csv = f.read()
        return csv

    # Split the whole query into 10-year chunks to avoid truncation
    def get_full_historical(self, start_date, end_date=date.today()):
        results = []
        while start_date < end_date:
            query_start_date = end_date + relativedelta(years=-10)
            if query_start_date < start_date:
                query_start_date = start_date
            partial_historical_data = self.get_simple_historical(query_start_date, end_date)
            results.append(partial_historical_data)
            end_date = query_start_date
        return results
```

Go get the data

In [2]:

```
msft = GoogleFinanceAPI('msft')
# Google Finance data indicates MSFT stock price started from 3/13/1986
# Is that Microsoft's IPO day? I am not old enough to know, and too lazy to search at this moment
msft_csvs = msft.get_full_historical(date(1986, 3, 13))
```

Pandas enjoys CSV data. In the following code, we see:

- read_csv() for creating a Pandas data frame from CSV data. Please note the three named parameters passed:
  - The dtype parameter informs Pandas about the date field’s data type.
  - The encoding parameter tells Pandas that the string data contains the BOM mark ‘\xef\xbb\xbf’. Without the ‘utf-8-sig’ encoding, you would run into a key error when accessing the ‘Date’ column, because there would be hidden characters before the string ‘Date’ in the column name.
  - The parse_dates parameter tells Pandas to parse the first column and generate date objects. Without this option, you won’t be able to use many of Pandas’s nice time series manipulation capabilities.
- concat() to concatenate multiple homogeneous data frames together.
- head() and tail() for quick inspection of the data; they dump the first and last few rows of the data table respectively.
- set_index() to make the date field the indexing field.

In [3]:

```
import numpy as np
import pandas as pd
import io

msft_dfs = []
for msft_csv in msft_csvs:
    msft_dfs.append(pd.read_csv(io.BytesIO(msft_csv), encoding='utf-8-sig',
                                parse_dates=[0], dtype={'Date': date}))
msft_df = pd.concat(msft_dfs)
msft_df = msft_df.set_index(['Date'])
print(msft_df.head())
print(msft_df.tail())
```

Without further ado, let’s get some visualization. A few points:

- If for any reason you would like to reverse the order of the x-axis, you can use the invert_xaxis() call after plot().
- title() and set_size_inches() are self-explanatory.
- We only plotted the closing price. If you want to plot multiple columns at the same time, you can use [‘Open’, ‘Close’] in place of ‘Close’; you can put an arbitrary number of entries in the list. I did not do it because all these numbers are so close that, without zooming in, practically only one series is visible even if you show all of them.
- You don’t want to plot everything, i.e. msft_df.plot(), because the high numerical values of the volume data would obscure everything else.
- Note that when you pass a list as the parameter, you need double brackets, i.e. [[‘Open’, ‘Close’]].

In [4]:

```
import matplotlib.pyplot as plt
%matplotlib inline
msft_df['Close'].plot()
plt.title('Microsoft stock closing prices - All')
plt.gcf().set_size_inches(15, 8)
```
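To make the double-bracket selection concrete, here is a minimal self-contained sketch (the toy frame and its values are made up, standing in for msft_df):

```python
import pandas as pd

# Tiny stand-in for msft_df with hypothetical prices
toy = pd.DataFrame({'Open': [60.0, 61.0], 'Close': [60.5, 61.2]},
                   index=pd.to_datetime(['2017-01-03', '2017-01-04']))

single = toy['Close']           # a string key returns a Series
multi = toy[['Open', 'Close']]  # a list key (hence double brackets) returns a DataFrame
# multi.plot() would then draw one line per selected column
```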

Extracting and plotting data for a single year is almost a one-liner:

In [5]:

```
msft_df['2017']['Close'].plot()
plt.title('Microsoft stock closing prices - 2017')
plt.gcf().set_size_inches(15, 8)
```
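The year-based slice works because of pandas' partial string indexing on a DatetimeIndex; here is a minimal sketch with made-up dates and prices (on recent pandas versions, prefer .loc for this kind of row selection):

```python
import pandas as pd

# Four made-up trading days straddling year boundaries, with hypothetical prices
idx = pd.to_datetime(['2016-12-30', '2017-01-03', '2017-12-29', '2018-01-02'])
prices = pd.Series([62.1, 62.6, 85.5, 85.9], index=idx)

# A partial date string selects every row that falls inside that period
year_2017 = prices.loc['2017']
```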

Let’s check whether the stock exhibits any movement pattern within a week, e.g. does MSFT usually close higher on Monday than on Tuesday? We evaluate this by dividing each day’s closing price by the sum of that week’s five weekday closing prices. There are obvious exceptions, e.g. holiday weeks; please excuse me for not dealing with that complexity, since the whole exercise at this point is hypothetical and only serves to illustrate the code.

- The resample() call reduces the number of points and allows aggregation.
- asfreq() is the inverse of resample(): it expands the data back into finer granularity, and the “bfill” method copies data points backwards in time.
- The spikes in the data are due to the fact that the code does not account for holidays, so weeks containing holidays show higher contribution values for each remaining weekday.
- As we can see from the plot, the distribution looks random.

In [6]:

```
from pandas.tseries.offsets import BDay
weekly_sum = msft_df.resample('W-FRI').sum()
weekly_sum_daily = weekly_sum.asfreq(BDay(), method = 'bfill')
weekly_contrib = (msft_df / weekly_sum_daily)
weekly_contrib['2016']['Close'].plot()
plt.title('Microsoft stock ratio of every weekday price to the weekly price total - 2016')
plt.gcf().set_size_inches(15, 8)
```
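The resample()/asfreq() round trip is easier to see on a toy series; here is a sketch with made-up closing prices:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Two full trading weeks of hypothetical closes: Mon 2016-01-04 .. Fri 2016-01-15
days = pd.bdate_range('2016-01-04', '2016-01-15')
close = pd.Series(range(1, 11), index=days, dtype=float)

# Weekly totals, each bin labeled by its Friday
weekly = close.resample('W-FRI').sum()  # 1+2+3+4+5 = 15, 6+7+8+9+10 = 40

# Expand back to business-day frequency; 'bfill' copies each Friday-labeled
# total backwards onto the weekdays of its week (days before the first label
# fall outside the expanded range and come out NaN after the division)
weekly_daily = weekly.asfreq(BDay(), method='bfill')
contrib = close / weekly_daily          # each weekday's share of its week
```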

For the closing price, let’s create a table (DataFrame) with columns as Monday, Tuesday, Wednesday, Thursday, Friday, and each row represents a week. From this table, we can easily extract Monday’s value of the weekly distribution.

In the Monday plot below, the gaps appear where that Monday is a holiday; the spikes come from weeks with a holiday on some other weekday.

In [7]:

```
# After the following line, each row represents a week, and the single value
# is the list of that week's weekday contribution values
weekly_contrib_resample = weekly_contrib['Close'].resample('W-FRI').apply(lambda x: x.tolist()[0:5])
# Separate the list of weekday values into five columns
weekly_contrib_rect = weekly_contrib_resample.apply(pd.Series)
# Rename the columns
column_names = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
weekly_contrib_rect = weekly_contrib_rect.rename(columns=lambda x: column_names[x])
weekly_contrib_rect['2016'].plot()
plt.title('Microsoft stock ratio of weekday price to the week price total - 2016')
plt.gcf().set_size_inches(15, 8)
plt.figure()
weekly_contrib_rect['2016']['Mon'].plot()
plt.title('Microsoft stock weekly ratio of monday price to the week price total - 2016')
plt.gcf().set_size_inches(15, 8)
```
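The list-to-columns expansion via apply(pd.Series) is the least obvious step above; here is a toy sketch with made-up contribution values:

```python
import pandas as pd

# A Series whose values are lists, as produced by the resample/apply step
s = pd.Series([[0.19, 0.20, 0.21], [0.18, 0.22, 0.20]],
              index=pd.to_datetime(['2016-01-08', '2016-01-15']))

# apply(pd.Series) turns each list into a row, one column per list element
rect = s.apply(pd.Series)  # integer columns 0, 1, 2
rect = rect.rename(columns=lambda i: ['Mon', 'Tue', 'Wed'][i])
```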

Histograms are a good approximation of distributions.

In [8]:

```
stacked = False
weekly_contrib_rect['Mon'].plot.hist(stacked = stacked, bins=80)
plt.title('weekly contribution - All')
plt.gcf().set_size_inches(15,8)
```

It is always a good idea to set up an SSL certificate for your site. An SSL certificate is an online identification document issued by a CA (*Certificate Authority*). There are many CA vendors to choose from; I bought mine from register.com at an annual cost of around $27. You should stay away from StartCom and WoSign because they are blacklisted by major browsers (read here).

Once your website certificate and its associated private key are downloaded, you can follow this article to set it up.

For those who have no prior knowledge of certificates, I hope the following quick recap of the major concepts helps:

- Certificate – the public ID document, usually with a file extension of .pem or .crt.
- Key – the secret file you keep to yourself. A certificate and its key form an inseparable pair.
- CSR – certificate signing request, i.e. a request signed by YOU and sent to the CA, providing basic identification information. One important piece of information is the *common name*, which should be your domain name. You create the CSR with a command line similar to mine: openssl req -new -newkey rsa:2048 -nodes -keyout xiaohaionline.com.key -out xiaohaionline.com.csr
- You submit the content of your CSR file through a webpage to your CA, and after verification of some basic information the CA issues the certificate for you to use.
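Since a certificate and its key must pair up, a common sanity check is to compare their RSA moduli. Here is a sketch; it generates a throwaway self-signed certificate just for illustration, whereas your real certificate comes from the CA:

```shell
# Create a throwaway key plus self-signed certificate (stand-in for a CA-issued one)
openssl req -new -newkey rsa:2048 -nodes -x509 -days 1 \
    -subj "/CN=example.com" -keyout demo.key -out demo.crt 2>/dev/null

# A certificate and a key belong together exactly when their moduli match
cert_mod=$(openssl x509 -noout -modulus -in demo.crt | openssl md5)
key_mod=$(openssl rsa -noout -modulus -in demo.key | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "pair OK"
```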

Another note about **WordPress** itself: if you want your installation to support installing new themes or auto-upgrading the software, you need to make the apache user account the owner of the /var/www/html/ folder. You can use the following command:

sudo chown -R apache:apache /var/www/html/

Please note I am running:

Red Hat Enterprise Linux Server release 7.3 (Maipo)

Commands might differ if you use different OS versions.

I recently (September 2018) moved the server from RHEL to Amazon Linux (AMI). The Amazon AMI’s default installation instructions install PHP 5.3, which is not compatible with some recent plugins (e.g. mathjax-latex in my case). I decided to upgrade to PHP 5.5, using the following commands:

sudo yum update -y

sudo yum remove php httpd php-cli php-xml php-common httpd-tools

sudo yum install httpd24

sudo yum install php55 php55-devel php55-common php55-cli php55-pecl-apc php55-pdo php55-mysql php55-xml php55-gd php55-mbstring php-pear php55-mysqlnd php55-mcrypt

sudo yum install mod24_ssl

sudo service httpd restart
