Combination Sum

2016-07-19

Problem 1 : Combination Sum

Given a set of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sum to T.
The same number may be chosen from C an unlimited number of times.

Note:

All numbers (including target) will be positive integers.
The solution set must not contain duplicate combinations.

For example, given candidate set [2, 3, 6, 7] and target 7,
A solution set is:
[ [7], [2, 2, 3] ]

Solution

The solution uses recursion (backtracking). The helper function in the code below has the signature:

recursive(vector<vector<int> > &results, vector<int> &candidates, int target, int index, vector<int> res);
Here results collects the final answers, index is the position in candidates from which the search continues, and res is a temporary vector holding the combination built so far.

class Solution {
public:
    vector<vector<int> > combinationSum(vector<int> &candidates, int target) {
        vector<vector<int> > results;
        sort(candidates.begin(),candidates.end());  // sort so combinations come out in non-descending order and the search can stop early
        vector<int> res;
        recursive(results,candidates,target,0,res); //we start from index 0
        return results;
    }
    void recursive(vector<vector<int> > &results,vector<int> &candidates, int target, int index, vector<int> res) {
        if (target == 0) {   //a valid solution is achieved
            results.push_back(res);
            return;
        }
        for (int i=index; i<candidates.size(); i++) {
            if (target-candidates.at(i) >= 0) {
                vector<int> temp = res;
                temp.push_back(candidates.at(i));
                // new target, new temp vector; index stays i because the same candidate may be used an unlimited number of times.
                recursive(results,candidates,target-candidates.at(i),i,temp); 
            } else     //if current sum is bigger than target, then adding bigger candidates is useless
                return;
        }
    }
};
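
The same backtracking idea can be sketched in Python. This is a minimal illustration rather than a port of the C++ above: the names combination_sum and backtrack are made up for this sketch, and it keeps one shared path list (append, recurse, pop) instead of copying a vector on every call.

```python
def combination_sum(candidates, target):
    """Return all unique combinations (with reuse) that sum to target."""
    candidates = sorted(set(candidates))  # sorted so the loop can stop early
    results = []

    def backtrack(start, remaining, path):
        if remaining == 0:
            results.append(list(path))  # store a copy of the finished combination
            return
        for i in range(start, len(candidates)):
            value = candidates[i]
            if value > remaining:
                break  # every later candidate is at least as large
            path.append(value)
            backtrack(i, remaining - value, path)  # i, not i + 1: reuse is allowed
            path.pop()  # undo the choice before trying the next candidate

    backtrack(0, target, [])
    return results


print(combination_sum([2, 3, 6, 7], 7))  # [[2, 2, 3], [7]]
```

Passing i (not i + 1) to the recursive call is what allows a candidate to be reused, and starting the loop at start rather than 0 is what prevents the same combination from appearing in a different order.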

Problem 2 : Combination Sum II

Given a collection of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sum to T.
Each number in C may be used only once in the combination.

Note:
All numbers (including target) will be positive integers.
The solution set must not contain duplicate combinations.
For example, given candidate set [10, 1, 2, 7, 6, 1, 5] and target 8,
A solution set is:
[ [1, 7], [1, 2, 5], [2, 6], [1, 1, 6] ]

Solution

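The difference between Combination Sum and Combination Sum II is the constraint "Each number in C may be used only once in the combination".
The same depth-first recursion applies; in the Java code below the helper is dfs_com(int[] cand, int cur, int target, List<Integer> path, List<List<Integer>> res).
The recursive call continues from i+1 instead of i so that no element is reused, and the check cand[i] == cand[i-1] skips equal values at the same depth to avoid duplicate combinations. Everything else is the same.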

class Solution {
public List<List<Integer>> combinationSum2(int[] cand, int target) {
    Arrays.sort(cand);
    List<List<Integer>> res = new ArrayList<List<Integer>>();
    List<Integer> path = new ArrayList<Integer>();
    dfs_com(cand, 0, target, path, res);
    return res;
}
void dfs_com(int[] cand, int cur, int target, List<Integer> path, List<List<Integer>> res) {
    if (target == 0) {
        res.add(new ArrayList<Integer>(path));  // copy, because path keeps changing
        return ;
    }
    if (target < 0) return;
    for (int i = cur; i < cand.length; i++){
        if (i > cur && cand[i] == cand[i-1]) continue;
        path.add(path.size(), cand[i]);
        dfs_com(cand, i+1, target - cand[i], path, res);
        path.remove(path.size()-1);
    }
}
};
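
A Python sketch of the same idea, mainly to make the duplicate-skipping rule easy to experiment with. The function names are invented for this illustration; the early break on value > remaining is an addition that relies on the sorted input.

```python
def combination_sum2(candidates, target):
    """Each element may be used once; equal values must not yield duplicate results."""
    candidates = sorted(candidates)
    results = []

    def backtrack(start, remaining, path):
        if remaining == 0:
            results.append(list(path))
            return
        for i in range(start, len(candidates)):
            value = candidates[i]
            if value > remaining:
                break  # sorted input: nothing further can fit
            if i > start and value == candidates[i - 1]:
                continue  # this value was already tried at this depth
            path.append(value)
            backtrack(i + 1, remaining - value, path)  # i + 1: no reuse
            path.pop()

    backtrack(0, target, [])
    return results


print(combination_sum2([10, 1, 2, 7, 6, 1, 5], 8))
```

The condition i > start (rather than i > 0) matters: the second 1 may still follow the first 1 one level deeper, which is how [1, 1, 6] is found, but it may not start a new branch at the same level.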

Problem 3 : Combination Sum III

Find all possible combinations of k numbers that add up to a number n, given that only numbers from 1 to 9 can be used and each combination should be a unique set of numbers.

Example 1:
Input: k = 3, n = 7
Output:
[[1,2,4]]

Example 2:
Input: k = 3, n = 9
Output:
[[1,2,6], [1,3,5], [2,3,4]]

Solution

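The approach is the same backtracking, with an extra counter k that limits how many numbers are used: a combination is accepted only when k and n reach 0 together.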

class Solution:
    # @param {integer} k
    # @param {integer} n
    # @return {integer[][]}
    def __init__(self):
        self.ret=[]

    def combinationSum3(self, k, n):
        self.findTuple(k, n, [], 1)
        return self.ret

    def findTuple(self, k, n, tmp, start):
        if k<0 or n<0:
            return
        if k==0 and n==0:
            if sorted(tmp) not in self.ret:
                self.ret.append(sorted(tmp))
            return
        for cand in range(start, 10):
            if cand>n:
                break
            if cand not in tmp and cand<=n:
                temp = list(tmp)  # copy so sibling branches do not share state
                temp.append(cand)
                self.findTuple(k-1, n-cand, temp, cand+1)
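
Because the search space is tiny (subsets of 1..9), a brute-force version is handy for cross-checking the backtracking code. This is not the approach used above, only a sketch built on itertools.combinations from the standard library; the function name is made up here.

```python
from itertools import combinations


def combination_sum3_bruteforce(k, n):
    """Enumerate every k-subset of 1..9 and keep those that sum to n."""
    return [list(c) for c in combinations(range(1, 10), k) if sum(c) == n]


print(combination_sum3_bruteforce(3, 7))  # [[1, 2, 4]]
print(combination_sum3_bruteforce(3, 9))  # [[1, 2, 6], [1, 3, 5], [2, 3, 4]]
```

There are at most 126 subsets for any k, so the brute force is cheap; the backtracking version mainly pays off when the number range grows.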

Problem 4 : Combination Sum IV

Given an integer array with all positive numbers and no duplicates, find the number of possible combinations that add up to a positive integer target.

nums = [1, 2, 3]
target = 4

The possible combination ways are:
(1, 1, 1, 1)
(1, 1, 2)
(1, 2, 1)
(1, 3)
(2, 1, 1)
(2, 2)
(3, 1)

Note that different sequences are counted as different combinations.

Therefore the output is 7.

Solution

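This is a DP problem; simply enumerating every sequence may exceed the time limit. Let comb[t] be the number of ordered sequences that sum to t, with comb[0] = 1 (the empty sequence):

comb[target] = sum(comb[target - nums[i]]), where 0 <= i < nums.length, and target >= nums[i]

The code is: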

public int combinationSum4(int[] nums, int target) {
    int[] comb = new int[target + 1];
    comb[0] = 1;
    for (int i = 1; i < comb.length; i++) {
        for (int j = 0; j < nums.length; j++) {
            if (i - nums[j] >= 0) {
                comb[i] += comb[i - nums[j]];
            }
        }
    }
    return comb[target];
}
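
To see the recurrence at work, here is a small Python sketch that returns the whole table instead of only the last entry. The function name comb_table is invented for this illustration.

```python
def comb_table(nums, target):
    """comb[t] = number of ordered sequences drawn from nums that sum to t."""
    comb = [0] * (target + 1)
    comb[0] = 1  # one way to reach 0: pick nothing
    for total in range(1, target + 1):
        for num in nums:
            if num <= total:
                comb[total] += comb[total - num]  # sequences ending in num
    return comb


print(comb_table([1, 2, 3], 4))  # [1, 1, 2, 4, 7]
```

For nums = [1, 2, 3] the last entry is comb[4] = comb[3] + comb[2] + comb[1] = 4 + 2 + 1 = 7, which matches the seven sequences listed in the problem statement.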

Binary Tree Maximum Path Sum

2016-07-19

Binary Tree, LeetCode

Problem

Given a binary tree, find the maximum path sum.

For this problem, a path is defined as any sequence of nodes from some starting node to any node in the tree along the parent-child connections. The path does not need to go through the root.

For example:
Given the below binary tree,

   1
  / \
 2   3

Return 6.


Solution

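The function getMaxRoot(r) computes the maximum sum of a path that starts at node r and only goes downward, so r is both an endpoint and the highest node of that path. For example:

   1
  / \
 2   3

getMaxRoot(1) returns 4 (the path 1 -> 3).
Every path has exactly one highest node, so it is enough to consider, for every node r, the best path whose highest node is r. A downward branch with a negative sum is dropped (treated as 0):
maxPrice = max(maxPrice, max(0, getMaxRoot(r->left)) + max(0, getMaxRoot(r->right)) + r->val);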

class Solution {
public:
    int maxPrice;
public:
    int maxPathSum(TreeNode *root) {
        maxPrice=INT_MIN;
        getMaxRoot(root);
        return maxPrice;
    }
    // compute the maximum sum of a downward path whose highest node (and endpoint) is r.
    int getMaxRoot(TreeNode *r) {
        if (r == NULL)
            return 0;
        int leftM  = max(0,getMaxRoot(r->left));
        int rightM = max(0,getMaxRoot(r->right));
        maxPrice = max(maxPrice, leftM+rightM+r->val);
        return max(leftM,rightM)+r->val;
    }
};
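
The same recursion in Python, as a sketch for quick experiments. TreeNode here is a minimal stand-in class written for this example, not LeetCode's own definition, and the helper name gain is arbitrary.

```python
class TreeNode:
    """Minimal binary tree node used only for this illustration."""

    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def max_path_sum(root):
    best = float('-inf')

    def gain(node):
        """Best sum of a downward path that starts at node."""
        nonlocal best
        if node is None:
            return 0
        left = max(0, gain(node.left))    # drop a branch that would lower the sum
        right = max(0, gain(node.right))
        best = max(best, left + right + node.val)  # path whose highest node is node
        return max(left, right) + node.val         # only one branch can continue upward

    gain(root)
    return best


print(max_path_sum(TreeNode(1, TreeNode(2), TreeNode(3))))  # 6
```

Two values are tracked on purpose: the return value of gain may extend through the parent, so it can use only one child, while best may use both children because that path ends at the current node.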

Best Time to Buy and Sell Stock

2016-07-19

DP, Greedy, LeetCode, State Machine

Problem : Best Time to Buy and Sell Stock II

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete as many transactions as you like (ie, buy one and sell one share of the stock multiple times). However, you may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).

Solution

Greedy algorithm, assuming you may buy again on the same day you sell: the rise 1-2-3 can be split into 1-2 and 2-3.

public class Solution {
    public int maxProfit(int[] prices) {
        int total = 0;
        for (int i=0; i< prices.length-1; i++) {
            if (prices[i+1]>prices[i]) total += prices[i+1]-prices[i];
        }
        return total;
    }
}

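If buying again on the day of a sale is not allowed, the greedy algorithm no longer describes real trades, although the total it returns is still correct.
In that case, repeatedly find a local minimum and the following local maximum, and add their difference to the result.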

public int maxProfit(int[] prices) {
    int profit = 0, i = 0;
    while (i < prices.length) {
        // find the local minimum
        while (i < prices.length-1 && prices[i+1] <= prices[i]) i++;
        int min = prices[i++]; // prices[i+1] > prices[i] here (or i is the last day), so advance i
        // find the local maximum
        while (i < prices.length-1 && prices[i+1] >= prices[i]) i++;
        profit += i < prices.length ? prices[i++] - min : 0;
    }
    return profit;
}
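
The claim that both methods give the same total can be checked with a short Python sketch. Both functions are re-implementations written for this comparison (the names are made up), and the valley/peak loop is a slightly simplified variant of the Java version above.

```python
def max_profit_greedy(prices):
    """Sum every positive day-to-day rise."""
    return sum(max(0, b - a) for a, b in zip(prices, prices[1:]))


def max_profit_peak_valley(prices):
    """Buy at each local minimum and sell at the following local maximum."""
    profit, i, n = 0, 0, len(prices)
    while i < n - 1:
        while i < n - 1 and prices[i + 1] <= prices[i]:
            i += 1  # walk down to the valley
        valley = prices[i]
        while i < n - 1 and prices[i + 1] >= prices[i]:
            i += 1  # walk up to the peak
        profit += prices[i] - valley
    return profit


for prices in ([7, 1, 5, 3, 6, 4], [1, 2, 3], [3, 2, 1], [2, 1, 2, 0, 1], []):
    assert max_profit_greedy(prices) == max_profit_peak_valley(prices)
print(max_profit_greedy([7, 1, 5, 3, 6, 4]))  # 7
```

The two agree because the rise from a valley to the next peak is exactly the sum of the daily rises between them.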

Problem : Best Time to Buy and Sell Stock III

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete at most two transactions.

Note:
You may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).


Solution

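This is again a DP problem. A first attempt stores, for every pair (i, j), the maximum profit of a single transaction between index i and index j (maxpay[i][j] in the code below):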

public int maxProfit(int[] prices) {
    int n = prices.length;
    if (n <= 1)
        return 0;
    int[][] minElement=new int[n][n], maxpay=new int[n][n];
    for (int i=0; i<n; i++) {
        minElement[i][i] = prices[i];
        maxpay[i][i] = 0;
    }
    for (int len=1; len<n; len++) {
        for (int i=0; i+len<n; i++) {
            int j = i+len;
            // best single transaction inside prices[i..j]: keep the old best or sell on day j
            maxpay[i][j] = Math.max(maxpay[i][j-1], prices[j]-minElement[i][j-1]);
            minElement[i][j] = Math.min(prices[j], minElement[i][j-1]);
        }
    }
    int div = 2, maxprofit = maxpay[0][n-1];
    if (n <= 3)
        return maxprofit;
    for (int i=div; i<n-2; i++) {
        int x1 = maxpay[0][i];
        int x2 = maxpay[i][n-1];
        if (x1+x2 > maxprofit)
            maxprofit = x1+x2;
    }
    return maxprofit;
}

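However, this takes O(n^2) time and memory, which results in TLE (time limit exceeded).
The following idea is borrowed from other people's solutions.

Let f[k][i] denote the maximum profit obtainable with at most k transactions using prices[0..i].
The recurrence considers two cases:

1. No sale happens on day i, so f[k][i] = f[k][i-1].
2. The k-th sale happens on day i, after buying on some earlier day j, so f[k][i] = max{f[k-1][j]+p[i]-p[j]} = p[i]+max{f[k-1][j]-p[j]} (where 0 <= j < i).

Therefore f[k][i] = max{f[k][i-1], max{f[k-1][j]+p[i]-p[j]} }.
Implemented directly, the inner maximum over j makes the cost O(n^2) again.
Notice the term "max{f[k-1][j]-p[j]}" in case 2: a single variable can hold the largest f[k-1][j]-p[j] seen so far and be updated at each step.
For each transaction loop its initial value is tmp = f[k-1][0]-p[0]; at every new i, update tmp = max{tmp, f[k-1][i]-p[i]}.
The recurrence then becomes f[k][i] = max{f[k][i-1], p[i]+tmp}, which brings the time complexity down to O(n) for a fixed number of transactions.
The problem also generalizes to at most k transactions: just replace the 2 and 3 in the code with k and k+1.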

public class Solution {
    public int maxProfit(int[] prices) {
        int n = prices.length;
        if (n <= 1)
            return 0;
        // maxpay[t][i] indicates the max profit in t-th transaction
        int[][] maxpay=new int[3][n];
        // initialize maxpay, when t=0, maxpay[t][i]=0
        for (int j=0; j<3; j++)
            for (int i=0; i<n; i++)
                maxpay[j][i] = 0;
        for (int t=1; t<3; t++) {
            // for every t, initialize t
            int tmp = maxpay[t-1][0]-prices[0];
            for (int i=1; i<n; i++) {
                //iteration formula
                //maxpay[t][i] = Math.max{ maxpay[t][i-1] , prices[i]+max<j>{maxpay[t-1][j]-prices[j]} };
                maxpay[t][i] = Math.max(maxpay[t][i-1], prices[i]+tmp);
                // make sure that tmp = max(tmp , maxpay[t-1][i]-prices[i])
                tmp = Math.max(tmp, maxpay[t-1][i]-prices[i]);
            }
        }
        return maxpay[2][n-1];
    }
}
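
As a sketch of the running-maximum trick with the transaction count as a parameter, here is a Python version. It is an illustration rather than a translation: the names max_profit_k and best_buy are invented, and only two rows of the table (prev and cur) are kept instead of the full 2-D array.

```python
def max_profit_k(k, prices):
    """Max profit with at most k transactions, in O(k * n) time."""
    n = len(prices)
    if n < 2 or k <= 0:
        return 0
    prev = [0] * n  # row for k - 1 transactions (all zeros when k - 1 == 0)
    for _ in range(k):
        cur = [0] * n
        best_buy = prev[0] - prices[0]  # max over j of prev[j] - prices[j]
        for i in range(1, n):
            cur[i] = max(cur[i - 1], prices[i] + best_buy)  # rest, or sell today
            best_buy = max(best_buy, prev[i] - prices[i])   # allow buying today
        prev = cur
    return prev[-1]


print(max_profit_k(2, [3, 3, 5, 0, 0, 3, 1, 4]))  # 6
```

With k = 2 this covers the two-transaction problem: buy at 0 and sell at 3, then buy at 1 and sell at 4, for a total of 6.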

Problem : Best Time to Buy and Sell Stock IV

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete at most k transactions.

Note:
You may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).


Solution

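We are allowed to perform at most k transactions.
The algorithm above applies directly, with one thing to notice:
once adjacent rises are merged, every transaction needs its own buy day and sell day, so the best unrestricted profit can always be realized with at most n/2 transactions. When k >= n/2 the limit never binds,
and the problem becomes Best Time to Buy and Sell Stock II.
The greedy algorithm handles that case; without this shortcut a very large k leads to TLE.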

public class Solution {
    public int maxProfit(int k, int[] prices) {
        int len = prices.length;
        if (k >= len / 2) return quickSolve(prices);
        
        int[][] t = new int[k + 1][len];
        for (int i = 1; i <= k; i++) {
            int tmpMax =  -prices[0];
            for (int j = 1; j < len; j++) {
                t[i][j] = Math.max(t[i][j - 1], prices[j] + tmpMax);
                tmpMax =  Math.max(tmpMax, t[i - 1][j - 1] - prices[j]);
            }
        }
        return t[k][len - 1];
    }
    

    private int quickSolve(int[] prices) {
        int len = prices.length, profit = 0;
        for (int i = 1; i < len; i++)
            // as long as there is a price gap, we gain a profit.
            if (prices[i] > prices[i - 1]) profit += prices[i] - prices[i - 1];
        return profit;
    }
}

Problem : Best Time to Buy and Sell Stock with Cooldown

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete as many transactions as you like (ie, buy one and sell one share of the stock multiple times) with the following restrictions:

You may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).
After you sell your stock, you cannot buy stock on next day. (ie, cooldown 1 day)

Example:

prices = [1, 2, 3, 0, 2]
maxProfit = 3
transactions = [buy, sell, cooldown, buy, sell]

Solution

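In this problem the possible actions are buy, sell and rest (do nothing), so it can be solved with a state machine.
The state machine has three states: s0 (holding no stock and free to buy), s1 (holding stock) and s2 (just sold, in cooldown). The transitions are s0 -rest-> s0, s0 -buy-> s1, s1 -rest-> s1, s1 -sell-> s2 and s2 -rest-> s0.

The transition equations are:

s0[i] = max(s0[i - 1], s2[i - 1]);
s1[i] = max(s1[i - 1], s0[i - 1] - prices[i]);
s2[i] = s1[i - 1] + prices[i];

Because s1 is the state right after buying, the maximum profit can never be found in s1; it is enough to take the larger of s0 and s2.
Initial values:

s0 = 0, because starting in s0 means holding no stock.
Starting in s1 means having bought the stock on the first day, so the initial value of s1 is -prices[0].
s2 starts at INT_MIN; setting it to 0 works just as well (with no stock to sell, no money can be made).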

class Solution {
public:
    int maxProfit(vector<int>& prices){
        if (prices.size() <= 1) return 0;
        vector<int> s0(prices.size(), 0);
        vector<int> s1(prices.size(), 0);
        vector<int> s2(prices.size(), 0);
        s1[0] = -prices[0];
        s0[0] = 0;
        s2[0] = INT_MIN;
        for (int i = 1; i < prices.size(); i++) {
            s0[i] = max(s0[i - 1], s2[i - 1]);
            s1[i] = max(s1[i - 1], s0[i - 1] - prices[i]);
            s2[i] = s1[i - 1] + prices[i];
        }
        return max(s0[prices.size() - 1], s2[prices.size() - 1]);
    }
};

The space complexity is O(n); it can be reduced to O(1).

class Solution {
public:
    int maxProfit(vector<int>& prices) {
        if (prices.size() < 2) return 0;
        int s0 = 0, s1 = -prices[0], s2 = 0;
        for (int i = 1; i < prices.size(); ++i) {
            int last_s2 = s2;
            s2 = s1 + prices[i];
            s1 = max(s0 - prices[i], s1);
            s0 = max(s0, last_s2);
        }
        return max(s0, s2);
    }
};
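
In Python the O(1) version becomes very compact, because tuple assignment evaluates the whole right-hand side before updating, so no temporary such as last_s2 is needed. A sketch (the function name is made up here):

```python
def max_profit_cooldown(prices):
    """State machine: s0 = resting without stock, s1 = holding, s2 = just sold."""
    if len(prices) < 2:
        return 0
    s0, s1, s2 = 0, -prices[0], 0
    for price in prices[1:]:
        # All three right-hand sides use the values from the previous day.
        s0, s1, s2 = max(s0, s2), max(s1, s0 - price), s1 + price
    return max(s0, s2)


print(max_profit_cooldown([1, 2, 3, 0, 2]))  # 3
```

For the example [1, 2, 3, 0, 2] the states after the last day are s0 = 2, s1 = 1, s2 = 3, so the answer is 3.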

python visualization

2016-07-10

Algorithm

python, visualization

matplotlib

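matplotlib configuration

Option 1: edit the configuration file (matplotlibrc) located in the .matplotlib directory.

Option 2: use the rc method; the groups that can be configured include 'figure', 'axes', 'xtick', 'ytick', 'grid', 'legend', etc.

plt.rc('figure', figsize=(10,10)) # set the default figure size
# a dict of options also works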
font_option = {'family':'monospace',
               'weight':'bold',
               'size':'small'}
plt.rc('font', **font_option)

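Using matplotlib

import matplotlib.pyplot as plt
import numpy as np

Create a new figure; every plot lives inside a Figure object.

fig = plt.figure(2) # the figure number is 2

An empty Figure cannot be drawn on; subplots must be created with add_subplot().
Lay out a 2x2 grid of subplots (three of the four slots are created here):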

ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
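fig.show() # display the figure

Draw on each subplot separately

ax1.plot(...)
ax2.scatter(...)
ax3.bar(...)
fig.show()

plt.subplots returns the figure together with a NumPy array holding the created subplot objects.
The array can be indexed as axes[i][j] (or axes[i, j]).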

fig, axes = plt.subplots(2,3)
axes
$-> array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0A8B2EF0>,
            <matplotlib.axes._subplots.AxesSubplot object at 0x0AA5F5D0>,
            <matplotlib.axes._subplots.AxesSubplot object at 0x0AA9DD90>],
           [<matplotlib.axes._subplots.AxesSubplot object at 0x0AAD1E70>,
            <matplotlib.axes._subplots.AxesSubplot object at 0x0AB1B810>,
            <matplotlib.axes._subplots.AxesSubplot object at 0x0697C130>]], dtype=object)

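subplots_adjust controls the spacing
wspace and hspace set the width and height of the gap between subplots, as a fraction of the subplot size.

plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)

Options of pyplot.subplots

parameter	explanation
nrows	number of subplot rows
ncols	number of subplot columns
sharex	all subplots share the same x-axis ticks (adjusting xlim affects all of them)
sharey	all subplots share the same y-axis ticks (adjusting ylim affects all of them)
subplot_kw	dict of keywords used to create each subplot
**fig_kw	additional keywords used when creating the figure, e.g. plt.subplots(2,2,figsize=(8,6))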

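Setting the x-axis and y-axis ticks

ticks = ax1.set_xticks([0,200,400,600,800])
# rotate the labels by 45 degrees, font size 9
labels = ax1.set_xticklabels(['one','two','three','four','five'],rotation=45,fontsize=9)
# let matplotlib choose a good location for the legend
ax1.legend(loc='best')

Annotations: draw text at position (x, y)

ax.text(x, y, 'hello world', family='monospace', fontsize=10)
# annotate draws both an arrow and text
# xy is where the arrow points, xytext is where the text goes; the result is shown in the figure below: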
ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

![](http://matplotlib.org/_images/annotation_basic.png)

Adding patches (shapes) to a plot

rect = plt.Rectangle((0.2,0.75), 0.4, 0.15, color='r', alpha=0.3)
circ = plt.Circle((0.7,0.2), 0.15, color='b', alpha=0.3)
pgon = plt.Polygon([[0.15,0.15],[0.35,0.4],[0.2,0.6]], color='g', alpha=0.3)

ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)

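Plot attributes

attribute	explanation
color	color='g' sets the color; hex values such as '#555555' also work
linestyle	linestyle='--' sets the line style
marker	marker='o' sets the marker
label	label='algorithm 1' sets the legend entry
xlim,ylim	ranges of the x-axis and y-axis

Saving to a file; the parameters are listed in the table below

plt.savefig()

params	introduction
fname	file name
dpi	resolution (dots per inch), default 100
facecolor, edgecolor	background color, default white
format	file format: png, pdf, etc.
bbox_inches	the part of the figure to save; 'tight' tries to trim the empty space around the figure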

Visualization methods in pandas

Basic plot

import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
matplotlib.style.use('ggplot')

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()

![](http://pandas.pydata.org/pandas-docs/stable/_images/series_plot_basic.png)

On DataFrame, plot() is a convenience to plot all of the columns with labels

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
df.plot()
plt.show()
# other keywords: subplots=True draws each column in its own subplot
# layout=(2, 3) means two rows and three columns
# sharex=False, sharey=False -> do not share the x and y axes

![](http://pandas.pydata.org/pandas-docs/stable/_images/frame_plot_basic.png)

You can plot one column versus another using the x and y keywords in plot()

df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))
df3.plot(x='A', y='B')
plt.show()

Using a second y-axis

Use the secondary_y keyword

df.A.plot()
df.B.plot(secondary_y=True, style='g')
# alternatively; mark_right defaults to True
ax = df.plot(secondary_y=['A', 'B'], mark_right=True)
ax.set_ylabel('CD scale')
ax.right_ax.set_ylabel('AB scale')

![](http://pandas.pydata.org/pandas-docs/stable/_images/frame_plot_secondary_y.png)

Scales

Use the logy, logx and loglog keywords

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = np.exp(ts.cumsum())
ts.plot(logy=True)

Other plots, selected with kind

value	function	value	function
bar	bar chart	hist	histogram
kde, density	density plot	box	box plot
area	area plot	scatter	scatter plot
hexbin	hexagonal bin plot	pie	pie chart
barh	horizontal bar chart

df.iloc[5].plot(kind='bar')
plt.show()

![](http://pandas.pydata.org/pandas-docs/stable/_images/bar_plot_ex.png)

Other usage: the same plots are available as methods of df.plot

df = pd.DataFrame()
$-> df.plot.area    df.plot.box     df.plot.hist    df.plot.pie
    df.plot.bar     df.plot.density df.plot.kde     df.plot.scatter
    df.plot.barh    df.plot.hexbin  df.plot.line

bar plot

df.iloc[5].plot.bar()
plt.show()

df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2.plot.bar()
df2.plot.bar(stacked=True)  # stacked
df2.plot.barh(stacked=True)  # horizontal

histogram

Histogram can be drawn by using the DataFrame.plot.hist() and Series.plot.hist() methods

df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df4.plot.hist(stacked=True, bins=20) # figure below
df4['a'].plot.hist(orientation='horizontal', cumulative=True)

![](http://pandas.pydata.org/pandas-docs/stable/_images/hist_new_stacked.png)

Box plot

Boxplot can be drawn calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot() to visualize the distribution of values within each column

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
# set the color of each part of the box plot
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
# sym sets the outlier marker; vert=False draws the boxes horizontally
# the positions=[1, 4, 5, 6, 8] parameter can also be used to place the boxes
df.plot.box(color=color, sym='r+', vert=False)

![](http://pandas.pydata.org/pandas-docs/stable/_images/box_new_colorize.png)

Area plot

Series.plot.area() and DataFrame.plot.area()

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area(stacked=True)
# with stacked=False the areas are not stacked

![](http://pandas.pydata.org/pandas-docs/stable/_images/area_plot_stacked.png)

Scatter plot

using the DataFrame.plot.scatter() method

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
ax = df.plot.scatter(x='a', y='b', color='DarkBlue', label='Group 1')
# two groups in different colors; note ax=ax
df.plot.scatter(x='c', y='d', color='DarkGreen', label='Group 2', ax=ax)

df.plot.scatter(x='a', y='b', c='c', s=50)  # figure below
# use the values of column c to set the bubble size
df.plot.scatter(x='a', y='b', s=df['c']*200)

![](http://pandas.pydata.org/pandas-docs/stable/_images/scatter_plot_colored.png) ![](http://pandas.pydata.org/pandas-docs/stable/_images/scatter_plot_bubble.png)

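Hexagonal bin plot

When the data is too dense to show every point, a hexbin plot shows the density instead
use DataFrame.plot.hexbin()

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['b'] = df['b'] + np.arange(1000)
# gridsize is the number of hexagons in the x direction, default 100
df.plot.hexbin(x='a', y='b', gridsize=25)

# a and b are the 2-D coordinates and C supplies the values; reduce_C_function reduces the values in each bin to one number
# reduce_C_function can be mean, max, sum, std, etc.; see the figure below
df['z'] = np.random.uniform(0, 3, 1000)
df.plot.hexbin(x='a', y='b', C='z', reduce_C_function=np.max, gridsize=25)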

![](http://pandas.pydata.org/pandas-docs/stable/_images/hexbin_plot_agg.png)

Pie chart

DataFrame.plot.pie() or Series.plot.pie()

series = pd.Series(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], name='series')
# pie chart of a Series
series.plot.pie(figsize=(6, 6))
# with subplots, every column becomes its own pie chart; subplots=True is required
df = pd.DataFrame(3 * np.random.rand(4, 2), index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
df.plot.pie(subplots=True, figsize=(8, 4))
# labels=['AA', 'BB', 'CC', 'DD']   label of each wedge
# colors=['r', 'g', 'b', 'c']    color of each wedge
# autopct='%.2f'    show the percentage with the given precision
# fontsize=20       font size

Density plot

ser = pd.Series(np.random.randn(1000))
ser.plot.kde() # the more samples, the closer the curve gets to the Gaussian distribution

![](http://pandas.pydata.org/pandas-docs/stable/_images/kde_plot.png)

Scatter Matrix Plot

from pandas.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')

![](http://pandas.pydata.org/pandas-docs/stable/_images/scatter_matrix_kde.png)

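Visualizing multivariate data

Andrews curves

Applicable to multivariate data: each sample is drawn as a curve whose Fourier series coefficients are the sample's attributes.

from pandas.plotting import andrews_curves
data = pd.read_csv('data/iris.data')
andrews_curves(data, 'Name')   # Name is the class column; curves are colored by class

Data in iris.data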

SepalLength	SepalWidth	PetalLength	PetalWidth	Name
5.1	3.5	1.4	0.2	Iris-setosa

![](http://pandas.pydata.org/pandas-docs/stable/_images/andrews_curves.png)

Parallel coordinates

Applicable to multivariate data; each vertical line corresponds to one attribute

from pandas.plotting import parallel_coordinates
data = pd.read_csv('data/iris.data')
parallel_coordinates(data, 'Name')

![](http://pandas.pydata.org/pandas-docs/stable/_images/parallel_coordinates.png)

Checking randomness

Lag Plot

Used to check whether a data set or time series is random; it plots data[t] against data[t+1].
If the plot shows no structure, the data is very likely random.

from pandas.plotting import lag_plot
data = pd.Series(np.arange(1000))  # structured
data = pd.Series(np.random.rand(1000)) # unstructured
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))  # structured; shown in the figure
lag_plot(data)

![](http://pandas.pydata.org/pandas-docs/stable/_images/lag_plot.png)

Autocorrelation Plot

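Checks the randomness of a time series by computing the autocorrelation of the data at different time lags
If the series is random, the autocorrelation should be close to 0 for every lag. Otherwise, at least one lag has an autocorrelation significantly different from 0

from pandas.plotting import autocorrelation_plot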
data = pd.Series(0.7 * np.random.rand(1000) + 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))
autocorrelation_plot(data)

![](http://pandas.pydata.org/pandas-docs/stable/_images/autocorrelation_plot.png)

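The black line in the middle marks 0; the solid and dashed lines further out are the 95% and 99% confidence bands, and the colored line is the autocorrelation at each lag.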

Bootstrap Plot

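Visually assesses the uncertainty of a statistic such as the mean, median or midrange
Method: draw a random subset of a given size from the data set, compute the statistic on it, and repeat a given number of times

from pandas.plotting import bootstrap_plot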
data = pd.Series(np.random.rand(1000))
bootstrap_plot(data, size=50, samples=500, color='grey')

![](http://pandas.pydata.org/pandas-docs/stable/_images/bootstrap_plot.png)

Colormaps

from matplotlib import cm
# df.plot(colormap='cubehelix')
# df.plot(colormap=cm.cubehelix)
# colormap='Greens'
# colormap='gist_rainbow'
# colormap='winter'
dd = pd.DataFrame(np.random.randn(10, 10)).abs()
dd = dd.cumsum()
dd.plot.bar(colormap='Greens')

Plotting Tables

Use the keyword table=True; the table keyword also accepts a DataFrame or Series as its value

fig, ax = plt.subplots(1, 1)
df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
ax.get_xaxis().set_visible(False)   # Hide Ticks
df.plot(table=True, ax=ax)

Python data loading, storage, and file formats

2016-06-23

Dev

pandas, python, files

csv

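input function	description
read_csv	load delimited data from a file, URL or file-like object; the default delimiter is a comma
read_table	load delimited data from a file, URL or file-like object; the default delimiter is a tab (\t)
read_fwf	read data in fixed-width column format (no delimiters)
read_clipboard	read data from the clipboard; the clipboard version of read_table
from_csv	a Series method that reads directly into a Series instance

Some extended uses of the functions above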

import pandas as pd
from pandas import DataFrame,Series

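# specify the separator (delimiter also works) and read only the first 10 rows
pd.read_table('filename', sep=',',nrows=10)
# read the file in chunks of 1000 rows; this returns an iterator
pd.read_table('filename', chunksize=1000)

# the file has no header row; use default column names
pd.read_csv('filename', header=None)
# supply the column names when reading into a DataFrame
pd.read_csv('filename', names=['a','b','c','d'])

# use a column as the index; here column d
pd.read_csv('filename', names=['a','b','c','d'], index_col='d')
# for a hierarchical index, pass several column names to index_col
pd.read_csv('filename', names=['a','b','c','d'], index_col=['c','d'])

# skip specific rows of the file
pd.read_csv('filename', skiprows=[1,3,6])
# number of rows to ignore, counted from the end of the file
pd.read_csv('filename', skipfooter=10, engine='python')

# missing values on read: treat these strings in the file as NaN
pd.read_csv('filename', na_values=['NULL'])
# per-column sentinels: values listed in box become NaN in the matching column
box = {'col1':['foo','NA'], 'col3':['two']}
pd.read_csv('filename', na_values=box)
# missing values on write: NaN is written as na_rep (data is a DataFrame)
data.to_csv('filename', na_rep='NULL')

# date parsing: parse_dates=True tries to parse the index as dates; specific columns can also be listed; default False
# dayfirst=True reads ambiguous dates in day-first (international) format, e.g. 28/6/2016; default False
pd.read_csv('filename', parse_dates=True, dayfirst=True)

# set the encoding; if the parsed data has a single column, return a Series
pd.read_csv('filename', encoding='utf-8').squeeze('columns')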

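output function	description
to_csv	write data to a file or output stream, comma-separated by default; Series has this method too

import sys
# print directly
data.to_csv(sys.stdout)
# write to a file, replacing NaN with 'NULL'
data.to_csv('filename', na_rep='NULL')
# the column names and the index can both be turned off
data.to_csv('filename', index=False, header=False)
# write only selected columns by passing columns
data.to_csv('filename', index=False, columns=['a','b'])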

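Handling delimited formats by hand

For any file with a single-character delimiter, Python's built-in csv module can be used directly: pass any open file or file-like object to csv.reader: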

import csv
f = open('filename')
reader = csv.reader(f)
# iterating over reader yields one list of strings per row
for line in reader:
    print(line)

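CSV files come in many flavors; defining a subclass of csv.Dialect is enough to describe a new format

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # required on Python 3

reader = csv.reader(f, dialect=my_dialect())

The attributes of csv.Dialect include:

attribute	description
delimiter	one-character string that separates fields, default ','
lineterminator	line terminator used when writing, default '\r\n'; ignored when reading, which recognizes cross-platform line endings
quotechar	quote character for fields that contain special characters
quoting	quoting convention: csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (quote only fields with special characters), csv.QUOTE_NONNUMERIC (quote all non-numeric fields), csv.QUOTE_NONE (no quoting)
skipinitialspace	ignore whitespace after the delimiter, default False
doublequote	how to handle the quote character inside a field; if True it is doubled
escapechar	string used to escape the delimiter, disabled by default

To write a delimited file by hand, use csv.writer

with open('filename', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))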
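
A self-contained round trip makes the effect of a custom dialect visible without touching the disk. This sketch uses io.StringIO in place of a real file; the class name SemicolonDialect and the sample rows are made up for the illustration.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect: semicolon-separated, minimal quoting."""
    delimiter = ';'
    quotechar = '"'
    doublequote = True
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL


buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer, dialect=SemicolonDialect)
writer.writerow(['one', 'two', 'three'])
writer.writerow(['1', '2;x', '3'])  # the middle field contains the delimiter

print(buffer.getvalue())

buffer.seek(0)
rows = list(csv.reader(buffer, dialect=SemicolonDialect))
print(rows)
```

The field containing a semicolon is written inside quotes, and the reader, given the same dialect, restores it as a single field.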

JSON

obj = """
{"name":"wes",
 "place":["usa","russia","china"],
 "pet": null,
 "siblings": [{"name":"scott","age":25, "pet":"Zuko"},{"name":"katie","age":33, "pet":"Cisco"}]
}
"""

import json
# convert the JSON string to Python objects
result = json.loads(obj)
$-> {'name': 'wes',
      'place': ['usa', 'russia', 'china'],
      'pet': None,
      'siblings': [{'name': 'scott', 'age': 25, 'pet': 'Zuko'},
     {'name': 'katie', 'age': 33, 'pet': 'Cisco'}]}
# convert the Python object back to a JSON string
asjson = json.dumps(result)

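At the time of writing, the pandas team was working on native, efficient JSON export (to_json) and decoding (from_json); current pandas releases provide DataFrame.to_json and pd.read_json.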
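
A short standard-library sketch of the type mapping between JSON and Python. The sample document is a trimmed variant of the one above, and the variable names are arbitrary.

```python
import json

text = '{"name": "wes", "pet": null, "siblings": [{"name": "scott", "age": 25}]}'

record = json.loads(text)
# JSON object -> dict, array -> list, null -> None, number -> int/float
print(type(record), record['pet'], record['siblings'][0]['age'])

# dumps goes the other way; sort_keys and indent only affect the layout
pretty = json.dumps(record, sort_keys=True, indent=2)
print(pretty)

# the round trip preserves the data
assert json.loads(pretty) == record
```

Note that JSON object keys are always strings, so a dict with non-string keys does not survive the round trip unchanged.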

XML and HTML parsing

Use lxml.html to process HTML content

from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/xxxx/xxx/xxx///xxxx'))
doc = parsed.getroot()
# through doc, all HTML tags of a given type (table, etc.) can be retrieved

links = doc.findall('.//a')   # links
links[0].get('href')
$-> 'http://baidu.com'
links[0].text_content()
$-> '百度一下'

Processing tables: './/table'
Each row of a table is './/tr'
The first row is the header row, whose cells are th
The remaining rows are data rows, whose cells are td

from pandas.io.parsers import TextParser
# parsing helper: extract the cells of one row
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

# parse the whole table and return a DataFrame
def parse_table(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    # TextParser converts numeric columns to numeric types
    return TextParser(data, names=header).get_chunk()

# locate all the tables
tables = doc.findall('.//table')
tab = tables[0]
parse_table(tab)

Use lxml.objectify to process XML content
root.INDICATOR returns a generator that yields each <INDICATOR> XML element.

from lxml import objectify

parsed = objectify.parse(open('hello.xml'))
root = parsed.getroot()
root.get('href')
$-> 'http://baidu.com'
root.text
$-> '百度一下'

data = []

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        el_data[child.tag] = child.pyval
    data.append(el_data)
# convert to a DataFrame
perf = DataFrame(data)

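Binary data and Excel files

pandas objects have a method that saves data to disk in pickle-serialized form: to_pickle (named save in old pandas versions)

# write to disk
frame.to_pickle('filename')
# read back into memory
frame = pd.read_pickle('filename')

Use the xlrd and openpyxl packages (which must be installed) to read and write xls or xlsx files

# create an ExcelFile instance
xls_file = pd.ExcelFile('data.xls')
# data stored in a worksheet can be read into a DataFrame with parse
table = xls_file.parse('sheet1')

Using databases

Relational databases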

import sqlite3

# create table
query = """
create table test
(a varchar(20),
 b integer);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

# insert
data = [('tom',20),('jerry',15)]
stmt = "insert into test values(?,?)"
con.executemany(stmt, data)
con.commit()

# query; select returns a list of tuples (most Python SQL drivers behave this way)
cursor = con.execute('select * from test')
rows = cursor.fetchall()
print(rows)

# building a DataFrame from these rows by hand is tedious, so pandas provides a ready-made function
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
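
For reference, the manual route that the pandas helper replaces looks roughly like this. The sketch uses only the standard library: the column names come from cursor.description, and each row is zipped into a dict (a list of such dicts is one of the inputs the DataFrame constructor accepts). The table and values repeat the example above; the order by clause is added only to make the output deterministic.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table test (a varchar(20), b integer)')
con.executemany('insert into test values (?, ?)', [('tom', 20), ('jerry', 15)])
con.commit()

cursor = con.execute('select * from test order by b')
columns = [col[0] for col in cursor.description]  # first item of each entry is the column name
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
con.close()

print(columns)  # ['a', 'b']
print(records)  # [{'a': 'jerry', 'b': 15}, {'a': 'tom', 'b': 20}]
```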

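Non-relational databases
Non-relational (NoSQL) databases take many forms: some are dictionary-like key-value stores and others are document-based; they are not covered further here.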

NumPy Basics

2016-06-23

Dev

numpy, python

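import numpy as np
from numpy.random import randn

ndarray: a multidimensional array object

Creation functions

function	description
array	convert input data (list, tuple, array or other sequence type) to an ndarray; a dtype can be specified
asarray	convert the input to an ndarray; no copy is made if the input is already an ndarray
arange	like range, but returns an ndarray
ones,ones_like	the former creates an array of all 1s with the given shape and dtype; the latter copies the shape and dtype of another array
zeros,zeros_like	the former creates an array of all 0s with the given shape and dtype; the latter copies the shape and dtype of another array
empty,empty_like	allocate memory only, without filling in any values
eye,identity	identity matrix

NumPy dtypes; in actual use add the np. prefix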

dtype	dtype
int8, uint8	int16,uint16
int32, uint32	int64,uint64
float16, float32	float64,float128
complex64	complex128
complex256	bool

# construct
arr1 = np.array([1,2,3,4,5])
arr1.ndim
$-> 1 # one-dimensional array
arr1.shape
$-> (5,) # 5 elements along the single axis
arr1.dtype
$-> dtype('int64')

# type conversion; astype always creates a new array, even when the target dtype equals the original dtype
float_arr = arr1.astype(np.float64)

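Indexing and slicing

Indexing and slicing work much like the corresponding list operations, with one difference:
an array slice is a view of the original array, so the data is not copied and any change made through the view shows up directly in the source array. A list slice, by contrast, is a copy; see the example below.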

a = [1,2,3,4,5,6]
b = a[2:5]
b[1] = 100
print(a)
$-> [1, 2, 3, 4, 5, 6]

arr = np.array([1,2,3,4,5,6])
br = arr[2:5]
br[1] = 99
arr
$-> array([ 1,  2,  3, 99,  5,  6])

# if a copy of an ndarray slice is needed rather than a view, copy it explicitly
new_arr = arr[2:5].copy()

Slicing multidimensional arrays

arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d[1][1]
$-> 5
arr2d[1,1]
$-> 5

# slice along axis 0 (the first axis)
arr2d[:2]
$-> array([[1, 2, 3],
           [4, 5, 6]])
arr2d[:2,1:]
$-> array([[2, 3],
           [5, 6]])
arr2d[:,:1]
$-> array([[1],
           [4],
           [7]])

# boolean indexing and assignment (arr is the 1-D array from above)
arr<4
$-> array([ True,  True,  True, False, False, False], dtype=bool)
arr[arr<4] = 4
arr
$-> array([ 4,  4,  4, 99,  5,  6])

Fancy indexing: fancy indexing always copies the data into a new array, whereas slicing returns a view.

arr = np.empty((8,4))
for i in range(8):
    arr[i] = i
arr
$-> array([[ 0.,  0.,  0.,  0.],
           [ 1.,  1.,  1.,  1.],
           [ 2.,  2.,  2.,  2.],
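           [ 3.,  3.,  3.,  3.],
           [ 4.,  4.,  4.,  4.],
           [ 5.,  5.,  5.,  5.],
           [ 6.,  6.,  6.,  6.],
           [ 7.,  7.,  7.,  7.]])

# select a subset of rows in a particular order by passing a list or ndarray of integers in that order
arr[[4,2,6]]  # rows 5, 3 and 7 (counting from 1)
$-> array([[ 4.,  4.,  4.,  4.],
           [ 2.,  2.,  2.,  2.],
           [ 6.,  6.,  6.,  6.]])
# negative indices count from the end
arr[[-2,-1,4]] # second-to-last row, last row, and the 5th row from the top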
$-> array([[ 6.,  6.,  6.,  6.],
           [ 7.,  7.,  7.,  7.],
           [ 4.,  4.,  4.,  4.]])

arr = np.arange(32).reshape((8,4))
arr[[1,5,7,2],[0,3,1,2]]
$-> array([ 4, 23, 29, 10])
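# the call above returns a 1-D array of the elements at (1,0), (5,3), (7,1), (2,2), not the rectangular block formed by those rows and columns; to get the block: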
arr[np.ix_([1,5,7,2],[0,3,1,2])]
$-> array([[ 4,  7,  5,  6],
           [20, 23, 21, 22],
           [28, 31, 29, 30],
           [ 8, 11,  9, 10]])
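The copy semantics of fancy indexing and the effect of np.ix_ can be checked directly. This is a small sketch; the names picked, block and same_block are only illustrative:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows, cols = [1, 5, 7, 2], [0, 3, 1, 2]

picked = arr[[4, 2, 6]]      # fancy indexing returns a copy
picked[0, 0] = -1
print(arr[4, 0])             # 16, the source array is untouched

# np.ix_ builds an open mesh from the two lists; selecting the rows first and
# then the columns yields the same rectangular block.
block = arr[np.ix_(rows, cols)]
same_block = arr[rows][:, cols]
print(np.array_equal(block, same_block))   # True
```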

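Swapping axes and transposing
Arrays have a transpose method (which returns a view of the source data without copying) as well as a T attribute. The two-dimensional case is easy to understand; below is an example of swapping axes in higher dimensions.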

arr = np.arange(16).reshape((2,2,4))
arr
$-> array([[[ 0,  1,  2,  3],
            [ 4,  5,  6,  7]],

           [[ 8,  9, 10, 11],
            [12, 13, 14, 15]]])
arr.transpose((1,0,2))
$-> array([[[ 0,  1,  2,  3],
            [ 8,  9, 10, 11]],

           [[ 4,  5,  6,  7],
            [12, 13, 14, 15]]])

# swapaxes also returns a view of the source data without copying
arr = np.arange(16).reshape((2,2,4))
arr.swapaxes(1,2)
$-> array([[[ 0,  4],
            [ 1,  5],
            [ 2,  6],
            [ 3,  7]],

           [[ 8, 12],
            [ 9, 13],
            [10, 14],
            [11, 15]]])
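A short sketch of how the axis permutation maps elements, and of the common 2-D use of T to compute an inner product matrix. The names t, x and gram are only illustrative:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
t = arr.transpose((1, 0, 2))         # swap the first two axes: t[i, j, k] == arr[j, i, k]
print(np.array_equal(t[0, 1], arr[1, 0]))   # True
print(arr.swapaxes(1, 2).shape)      # (2, 4, 2)
print(np.shares_memory(arr, t))      # True, transpose is a view

# For a 2-D array, T is the ordinary matrix transpose.
x = np.arange(6).reshape((3, 2))
gram = x.T.dot(x)                    # X^T X, a 2x2 matrix
print(gram)
```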

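Universal functions: element-wise array functions

Unary function	Description	Unary function	Description
abs,fabs	Absolute value; fabs is faster for real-valued data	sqrt	Square root of each element, equivalent to arr**0.5
square	Square of each element, equivalent to arr**2	exp	e raised to the power of each element
log,log10,log2,log1p	Logarithms; the last one is log(1+x)	sign	Sign of each element (1, 0, -1)
ceil	Round up to the nearest integer	floor	Round down to the nearest integer
rint	Round to the nearest integer, preserving the dtype	modf	Return the fractional and integer parts of the array as two separate arrays
isnan	Boolean array indicating which values are NaN	isfinite,isinf	Boolean array indicating which values are finite / infinite
cos,cosh,sin,sinh,tan,tanh	Regular and hyperbolic trigonometric functions	logical_not	Truth value of not x for each element, equivalent to ~arr for boolean arrays
arccos,arccosh,arcsin,arcsinh,arctan,arctanh	Inverse trigonometric functions

Binary function	Description	Binary function	Description
add	Add corresponding elements	subtract	Subtract elements of the second array from the first
multiply	Multiply array elements	divide,floor_divide	Division; floor division
power	Raise elements of the first array to the powers given by the second	maximum,fmax	Element-wise maximum; fmax ignores NaN
minimum,fmin	Element-wise minimum; fmin ignores NaN	mod	Element-wise modulus
copysign	Copy the sign of the second array's elements onto the first array's elements	greater,greater_equal,less,less_equal	Comparisons, producing a boolean array
equal,not_equal	Comparisons, producing a boolean array	logical_and,logical_or,logical_xor	Element-wise logical operations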

arr = np.arange(10)
np.sqrt(arr)

x = randn(8)
y = randn(8)
np.maximum(x,y)

arr = randn(8)*5
np.modf(arr)
$-> (array([-0.06742411, -0.37438428,  0.7491406 ,  0.63876896, -0.73629364,
             0.60505606, -0.38540241,  0.78968936]),
     array([-1., -1.,  0.,  0., -1.,  1., -1.,  0.]))
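Because randn output changes on every run, a version with fixed inputs makes the behaviour of a few ufuncs easier to verify. This is only a sketch with hand-picked values:

```python
import numpy as np

arr = np.array([1.5, -2.25, 3.0])
frac, whole = np.modf(arr)           # fractional parts first, then integer parts
print(frac, whole)                   # [ 0.5  -0.25  0.  ] [ 1. -2.  3.]

bigger = np.maximum([1, 5, 3], [4, 2, 3])         # element-wise maximum
no_nan = np.fmax([1.0, np.nan], [np.nan, 2.0])    # fmax skips NaN when the other value is valid

mask = np.array([True, False])
flipped = np.logical_not(mask)       # same result as ~mask for boolean arrays
```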

Data processing and statistics

points = np.arange(-5,5,0.01) # 1000 equally spaced points
xs,ys = np.meshgrid(points, points)
z = np.sqrt(xs**2+ys**2)

xarr = np.array([1.1,1.2,1.3,1.4,1.5])
yarr = np.array([2.1,2.2,2.3,2.4,2.5])
cond = np.array([True,False,True,True,False])
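# pick from xarr where cond is True, otherwise from yarr
result = [(x if c else y) for x,y,c in zip(xarr,yarr,cond)]

# the list comprehension is slow on large arrays and does not work for multidimensional arrays
# np.where is typically used to build a new array from existing ones based on a condition
result = np.where(cond, xarr, yarr)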

arr = randn(4,4)
arr
$-> array([[ 1.52182545,  0.87451946,  1.04261881, -1.0087171 ],
           [-0.17748091, -0.11488603, -0.29951479,  0.67766543],
           [-0.21761354, -0.83476571,  1.69775644,  1.45229995],
           [ 0.1791336 ,  1.55750933,  0.23509194,  0.39716205]])
np.where(arr>0, 2, -2) # replace positive values with 2 and all others with -2
$-> array([[ 2,  2,  2, -2],
           [-2, -2, -2,  2],
           [-2, -2,  2,  2],
           [ 2,  2,  2,  2]])
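As a quick check that np.where matches the element-by-element selection, and that a scalar and an array can be mixed as the two alternatives, here is a small sketch (by_loop, by_where and clipped are illustrative names):

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

by_loop = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
by_where = np.where(cond, xarr, yarr)
print(by_where)                      # [1.1 2.2 1.3 1.4 2.5]

# Mixing a scalar and an array: set positives to 2 and keep everything else.
arr = np.array([[1.5, -0.5], [-2.0, 0.25]])
clipped = np.where(arr > 0, 2, arr)
print(clipped)
```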

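Basic array statistics methods
These are available both as methods on array instances and as top-level NumPy functions. They accept an optional axis argument giving the axis to compute along.

Method	Description
sum	Sum; for a boolean array, sum counts the True values
mean	Mean; the mean of a zero-length array is NaN
std,var	Standard deviation and variance; the degrees of freedom are adjustable (ddof), and the default denominator is n
min,max	Minimum and maximum
argmin,argmax	Indices of the minimum and maximum elements
cumsum	Cumulative sum of the elements
cumprod	Cumulative product of the elements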

arr.mean(axis=1)
arr = np.arange(9).reshape((3,3))
arr.cumsum(axis=0) # cumulative sum down each column
$-> array([[ 0,  1,  2],
           [ 3,  5,  7],
           [ 9, 12, 15]])
arr.cumprod(1)  # cumulative product along each row
$-> array([[  0,   0,   0],
           [  3,  12,  60],
           [  6,  42, 336]])

Methods for boolean arrays

(arr>0).sum()  # number of positive values
bools = np.array([False, True, True, False])
# any tests whether at least one value is True; all tests whether every value is True
bools.any()
$-> True
bools.all()
$-> False

Sorting
The top-level function np.sort returns a sorted copy, whereas the sort method sorts the array in place.

arr = randn(8)
arr.sort()
arr
$-> array([-1.24322946, -0.29419419,  0.4811325 ,  0.71035914,  0.71318192,
            0.90270716,  1.29337938,  1.49888281])
arr = randn(5,3)
arr.sort(axis=0)
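The difference between the copying and in-place forms is easiest to see with fixed values. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([3.0, -1.0, 2.0])
sorted_copy = np.sort(arr)       # new array; arr keeps its original order
before = arr.tolist()            # [3.0, -1.0, 2.0]
order = arr.argsort()            # indices that would sort arr
arr.sort()                       # in place: arr itself is now sorted

grid = np.array([[3, 1], [2, 4], [1, 0]])
grid.sort(axis=0)                # sort each column independently
print(grid)
```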

Array set operations
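The original notes give no example for this heading. As a sketch of what NumPy offers for 1-D arrays (the choice of functions and the sample values are assumptions, not taken from the notes):

```python
import numpy as np

a = np.array([3, 1, 2, 3, 1])
b = np.array([2, 3, 4])

uniq = np.unique(a)             # sorted unique values of a
common = np.intersect1d(a, b)   # sorted values present in both arrays
either = np.union1d(a, b)       # sorted union of the two arrays
only_a = np.setdiff1d(a, b)     # values in a that are not in b
member = np.isin(a, b)          # element-wise test: is each element of a in b?
print(uniq, common, either, only_a, member)
```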

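Linear algebra basics

Function	Description	Function	Description
diag	Given a matrix, returns the diagonal elements as a 1-D array; given a 1-D array, returns a square matrix with that array on the diagonal	dot	Matrix multiplication
trace	Trace of a matrix	det	Determinant
eig	Eigenvalues and eigenvectors of a square matrix	inv	Inverse of a square matrix
pinv	Moore-Penrose pseudo-inverse of a matrix	qr	QR decomposition
svd	Singular value decomposition	solve	Solve the linear system Ax=b, where A is a square matrix
lstsq	Least-squares solution to Ax=b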

from numpy.linalg import inv,qr

x = np.arange(6).reshape((3,2))
y = np.arange(6).reshape((2,3))
x.dot(y)  # equivalent to np.dot(x,y)

q,r = qr(arr)
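A small worked check of solve, inv and qr on a fixed 2x2 system. The matrix A and vector b are made-up values chosen so the answer is easy to verify by hand:

```python
import numpy as np
from numpy.linalg import inv, qr, solve

A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])

x = solve(A, b)                          # solves A x = b; here x is about [0.1, 0.6]
print(np.allclose(A.dot(x), b))          # True

print(np.allclose(A.dot(inv(A)), np.eye(2)))   # A times its inverse is the identity

q, r = qr(A)                             # q has orthonormal columns, r is upper triangular
print(np.allclose(q.dot(r), A))          # True
print(np.allclose(q.T.dot(q), np.eye(2)))      # True
```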

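Random number generation
Some of the functions in numpy.random

Function	Description	Function	Description
seed	Seed the random number generator	permutation	Return a random permutation of a sequence, or a permuted range
shuffle	Randomly permute a sequence in place	rand	Draw samples from a uniform distribution
randint	Draw random integers from a given low-to-high range	randn	Draw samples from the standard normal distribution (mean=0, std=1)
binomial	Draw samples from a binomial distribution	normal	Draw samples from a normal (Gaussian) distribution
beta	Draw samples from a beta distribution	chisquare	Draw samples from a chi-square distribution
gamma	Draw samples from a gamma distribution	uniform	Draw samples from a uniform distribution, [0,1) by default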
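A brief sketch of how seed makes draws reproducible, along with two of the other functions from the table. The seed value 42 is arbitrary:

```python
import numpy as np

np.random.seed(42)
first = np.random.randn(3)
np.random.seed(42)               # re-seeding replays the same sequence
second = np.random.randn(3)
print(np.array_equal(first, second))     # True

perm = np.random.permutation(5)          # a random ordering of 0..4
ints = np.random.randint(0, 10, size=4)  # integers in [0, 10); the upper bound is exclusive
print(perm, ints)
```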

Basic git Usage

2016-06-08

Dev

git

List the current remotes

$-> git remote
$-> git remote -v
BITDM   https://github.com/BITDM/bitdm.github.io.git (fetch)
BITDM   https://github.com/BITDM/bitdm.github.io.git (push)
bitdm   https://github.com/BITDM/bitdm.github.io (fetch)
bitdm   https://github.com/BITDM/bitdm.github.io (push)
origin  https://github.com/Atlantic8/bitdm.github.io.git (fetch)
origin  https://github.com/Atlantic8/bitdm.github.io.git (push)

Add a remote

$-> git remote add emacs git://github.com/lishuo/emacs

Fetch data from a remote

$-> git fetch [remote-name]

Add an upstream remote (the name upstream is arbitrary)

$-> git remote add upstream https://github.com/BITDM/bitdm.github.io.git
$-> git remote -v
BITDM   https://github.com/BITDM/bitdm.github.io.git (fetch)
BITDM   https://github.com/BITDM/bitdm.github.io.git (push)
origin  https://github.com/Atlantic8/bitdm.github.io.git (fetch)
origin  https://github.com/Atlantic8/bitdm.github.io.git (push)
upstream        https://github.com/BITDM/bitdm.github.io.git (fetch)
upstream        https://github.com/BITDM/bitdm.github.io.git (push)

Update your own project from the upstream remote

$-> git fetch upstream
$-> git merge upstream/master

Push data to a remote repository

$-> git push [remote-name] [branch-name]

Remove and rename remotes

$-> git remote rm [remote-name]
$-> git remote rename from-name to-name

Put a local folder on GitHub as a repository

First create the GitHub repository on the website
Change into the target folder
git init   # initialize an empty repository
git remote add origin http://xxxxxx.git  # add remote repository address
git add --all   # stage all files
git commit -m 'add'  # commit with a message
git push origin master  # push to the remote
Done

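Edit local files and sync them to the repository

# first add the file
git add test.txt
git commit -m "add test.txt"
# delete a file; to remove it from the repository as well, use git rm and then commit
rm test.txt
git status   # check the status
git push origin master  # push to the remote

# to modify a file, edit it first
git status  # shows the modifications
git add [file-name]
git commit -m "a message describing the change"
# if something goes wrong before the commit: git reset --hard HEAD returns to the last commit, discarding all uncommitted changes (staged or not)
git push # done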

Getting Started with pandas

2016-05-17

Dev

pandas, python

Series

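A Series is a one-dimensional array-like object made up of a set of data and an associated set of labels. A simple Series can be created from the data alone: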

from pandas import Series, DataFrame
import pandas as pd

obj = Series([4,7,-5,3])
obj.index
$-> RangeIndex(start=0, stop=4, step=1)
obj.values
$-> array([ 4,  7, -5,  3], dtype=int64)

obj2 = Series([4,7,-5,3], index=['a','b','c','d'])
# a 4, b 7, c -5, d 3
obj2['a']
$-> 4
obj2[['a','b']]
$-> a    4
    b    7
    dtype: int64
obj2[obj2 > 3]
$-> a    4
    b    7
    dtype: int64

# a Series can also be seen as a dict: the index holds the keys and values holds the values
'b' in obj2
$-> True
# a Series can be built from a dict
obj3 = Series({'a':4, 'b':7, 'c':-5, 'd':3})

sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
$-> California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
obj4.isnull()
$-> California     True
    Ohio          False
    Oregon        False
    Texas         False

# a Series object has a name attribute
obj4.name = 'population'

DataFrame

Tabular data with both a row index and a column index

data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
frame = DataFrame(data)
$->   pop   state  year
   0  1.5    Ohio  2000
   1  1.7    Ohio  2001
   2  3.6    Ohio  2002
   3  2.4  Nevada  2001
   4  2.9  Nevada  2002

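Specify the column names; the columns are arranged in the order given by columns

DataFrame(data, columns=['year', 'state', 'pop'])

Specify the index as well; a name in columns that is not a key of data produces a column of NaN values

DataFrame(data, columns=['year','state','pop','A'], index=['a','b','c','d','e'])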

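# selecting a single column of a DataFrame gives a Series
# it has the same index, and its name attribute is the column name
frame['year']  # equivalent to frame.year

# get the column names, delete a column
frame.columns
$->  Index([u'pop', u'state', u'year'], dtype='object')
del frame['year']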

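Building a DataFrame from a nested dict: the outer keys become the columns and the inner keys become the row index

pop = {'Nvidia':{2001:2.4, 2002:2.9}, 'AMD':{2000:1.5,2001:1.7,2002:3.6}}
frame3 = DataFrame(pop)  # pass index=[...] to set the row index explicitly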
$->       AMD  Nvidia
    2000  1.5     NaN
    2001  1.7     2.4
    2002  3.6     2.9
frame3.T
$->         2000  2001  2002
    AMD      1.5   1.7   3.6
    Nvidia   NaN   2.4   2.9

# set the name of the row and column indexes; once set, the names are displayed too
frame3.index.name = 'year'
frame3.columns.name = 'state'
# show the data in the DataFrame as an ndarray
frame3.values
$-> array([[ 1.5,  nan],
          [ 1.7,  2.4],
          [ 3.6,  2.9]])

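DataFrame constructor inputs

Data type	Explanation
2-D ndarray	A matrix of data; row and column labels can be passed as well
dict of arrays, lists, or tuples	Each sequence becomes a column of the DataFrame; all sequences must have the same length
NumPy structured/record array	Treated like a dict of arrays
dict of Series	Each Series becomes a column; if no index is specified explicitly, the indexes of the Series are unioned to form the row index of the result
dict of dicts	Each inner dict becomes a column; the keys are unioned to form the row index
list of dicts or Series	Each item becomes a row of the DataFrame; the union of the dict keys or Series indexes becomes the column index
list of lists or tuples	Treated like a 2-D ndarray
another DataFrame	The indexes of the original DataFrame are kept unless specified explicitly
NumPy MaskedArray	Like the 2-D ndarray case, except that masked values become NaN/missing in the resulting DataFrame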

Index objects

obj = Series(range(3), index=['a','b','c'])
index = obj.index
index[1] = 'x'
$-> TypeError: Index does not support mutable operations

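Reindexing: if an index value is not currently present, a missing value is introduced

# reindex a Series
obj = Series([1,2,3], index=['a','b','c'])
obj2 = obj.reindex(['a','b','c','d'], fill_value=0)
# reindex the rows and columns of a DataFrame
frame2 = frame.reindex(index=['a','b','c','d'], columns=['x','y','z'], fill_value=0)
# older pandas versions also accepted the following (.ix was removed in pandas 1.0)
frame.ix[['a','b','c','d'], ['x','y','z']]

When working with time series data you may need some interpolation or filling while reindexing; use the method option
ffill | pad : forward fill (carry values forward)
bfill | backfill : backward fill (carry values backward)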

obj3=Series(['blue','red','yellow'], index=[0,2,3])
obj3.reindex(range(6), method='ffill')
$-> 0      blue
    1      blue
    2       red
    3    yellow
    4    yellow
    5    yellow
    dtype: object

Dropping entries from an axis

obj = Series([1,2,3], index=['a','b','c'])
obj.drop('c')
$-> a    1
    b    2
    dtype: int64

Indexing, selection, and filtering

obj = Series(np.arange(4.), index=['a','b','c','d'])
obj[0]
$-> 0.0
obj['b':'d']
$-> b    1.0
    c    2.0
    d    3.0
    dtype: float64
# set the values at labels b, c and d to 5 (label slices include the end point)
obj['b':'d'] = 5

data = DataFrame(np.arange(16).reshape((4,4)), columns=['a','b','c','d'], index=['A','B','C','D'])
data > 5
$->            a      b      c      d
    A      False  False  False  False
    B      False  False   True   True
    C       True   True   True   True
    D       True   True   True   True
data[data < 5] = 0
data
$->         a   b   c   d
    A       0   0   0   0
    B       0   5   6   7
    C       8   9  10  11
    D      12  13  14  15

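Indexing options for DataFrame

Type	Explanation
obj[val]	Select a single column or a set of columns from the DataFrame
obj.ix[val]	Select a single row or a set of rows from the DataFrame
obj.ix[:,val]	Select a single column or a subset of columns
obj.ix[val1,val2]	Select both rows and columns
reindex method	Conform one or more axes to a new index
xs method	Select a single row or column by label, returned as a Series
icol,irow methods	Select a single column or row by integer position, returned as a Series
get_value	Get the value at the given row and column labels
set_value	Set the value at the given row and column labels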

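Arithmetic and data alignment

Series objects can be added together: values with matching index labels are added, and labels present in only one of the objects produce NaN
DataFrame objects can be added as well, with alignment on both rows and columns
To avoid NaN, use the add method with a fill_value

df1.add(df2, fill_value=0)
# other operations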
df1.sub(df2, fill_value=0)
df1.div(df2, fill_value=0)
df1.mul(df2, fill_value=0)
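The snippet above leaves df1 and df2 undefined. A minimal sketch with two small made-up frames shows how alignment and fill_value interact; note that a cell missing from both frames stays NaN even with fill_value:

```python
import numpy as np
from pandas import DataFrame

# Two hypothetical frames that overlap only partly in rows and columns.
df1 = DataFrame(np.arange(4.).reshape((2, 2)), columns=list('ab'), index=['x', 'y'])
df2 = DataFrame(np.arange(6.).reshape((2, 3)), columns=list('abc'), index=['x', 'z'])

plain = df1 + df2                      # NaN wherever a label is missing on either side
filled = df1.add(df2, fill_value=0)    # a value missing on one side is treated as 0

print(plain)
print(filled)                          # only ('y', 'c') is still NaN: it is absent from both frames
```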

Function application and mapping

NumPy ufuncs also work on pandas objects

frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index=list('xyzw'))
# absolute value
np.abs(frame)

The apply method of DataFrame

frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index=list('xyzw'))
f = lambda x: x.max()-x.min()
frame.apply(f)
# besides a scalar, the function passed to apply can also return a Series of several values
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
$->             b         d         e
    min -0.983954 -0.578884 -0.916169
    max  1.111333  0.982841  0.663686

Sorting and ranking

obj = Series(range(4), index=['a','d','b','c'])
obj.sort_index() # order by index
$-> a    0
    b    2
    c    3
    d    1
    dtype: int64
obj.sort_values() # sort by value (formerly obj.order()); any NaN values are placed last
$-> a    0
    d    1
    b    2
    c    3
    dtype: int64

# DataFrame objects can be sorted too
frame.sort_index(axis=0)
frame = DataFrame({'a':[1,2,3], 'b':[4,5,6]})
# sort by columns a and b, in descending order
frame.sort_values(by=['a','b'],ascending=False)
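The heading mentions ranking, but the notes only show sorting. As a sketch of the rank method (the sample values are made up), ties receive the mean of their ranks by default, and method='first' breaks ties by order of appearance:

```python
from pandas import Series

obj = Series([7, -5, 7, 4, 2, 0, 4])

avg_rank = obj.rank()                    # ties share the average rank
first_rank = obj.rank(method='first')    # ties ranked in the order they appear
print(avg_rank.tolist())                 # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first_rank.tolist())               # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
```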

Axis indexes with duplicate values

obj = Series(range(5), index=['a','a','b','b','c'])
obj.index.is_unique
$-> False

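Summarizing and computing descriptive statistics

df = DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
# axis=1 sums across each row; axis=0 sums down each column
# skipna=True (the default) ignores NaN
df.sum(axis=1, skipna=False)

# idxmin/idxmax return indirect statistics: the index label where the minimum/maximum occurs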
df.idxmax()
$-> one    b
    two    d
    dtype: object
df.cumsum()
$->     one  two
    a  1.40  NaN
    b  8.50 -4.5
    c   NaN  NaN
    d  9.25 -5.8
df.describe() # produces several summary statistics in one call
$->             one       two
    count  3.000000  2.000000
    mean   3.083333 -2.900000
    std    3.493685  2.262742
    min    0.750000 -4.500000
    25%    1.075000 -3.700000
    50%    1.400000 -2.900000
    75%    4.250000 -2.100000
    max    7.100000 -1.300000

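Method	Explanation
count	Number of non-NaN values
describe	Summary statistics
min,max	Minimum and maximum values
argmin,argmax	Integer positions of the minimum and maximum values
idxmin,idxmax	Index labels of the minimum and maximum values
quantile	Sample quantiles
sum,mean,median	Sum, mean, and median
mad	Mean absolute deviation from the mean
var,std	Sample variance and standard deviation
skew	Sample skewness (third moment)
kurt	Sample kurtosis (fourth moment)
cumsum	Cumulative sum
cummin,cummax	Cumulative minimum and maximum
cumprod	Cumulative product
diff	First-order difference (useful for time series)
pct_change	Percent change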

Unique values, value counts, and membership

obj = Series(['a','b','c','b','a','c','d','b'])
obj.unique()
$-> array(['a', 'b', 'c', 'd'], dtype=object)
obj.value_counts()
$-> b    3
    c    2
    a    2
    d    1
    dtype: int64
pd.value_counts(obj.values, sort=False)
# test whether each element of the Series is in the given list
mask = obj.isin(['b','c'])
$-> 0    False
    1     True
    2     True
    3     True
    4    False
    5     True
    6    False
    7     True
    dtype: bool
obj[mask]
$-> 1    b
    2    c
    3    b
    5    c
    7    b
    dtype: object

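Handling missing data

Method	Explanation
dropna	Filter axis labels based on whether their values contain missing data; a threshold can adjust how much missing data is tolerated
fillna	Fill missing data with a given value or an interpolation method
isnull	Boolean values of the same shape, where True marks a missing value/NA
notnull	The negation of isnull

from numpy import nan as NA
data = Series([1,NA,3.5,NA,7])
# dropna discards entries containing NA; for a DataFrame, how='all' discards only rows or columns that are entirely NA
# for a DataFrame the default is rows; set axis=1 to drop columns instead
data.dropna()
data.dropna(how='all')
data[data.notnull()]  # same result as data.dropna()

# replace NA; returns a new object
# set inplace=True to modify the existing object in place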
data.fillna(0)

Hierarchical indexing

data = Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
$-> a  1    0.506070
       2   -2.293016
       3    1.391751
    b  1   -1.218733
       2    0.390983
       3    1.462456
    c  1    0.162262
       2   -0.091724
    d  2    0.321799
       3    0.203933

Selecting values

data['b']
data['b':'c']
data[:,2] # select on the inner level

# unstack rearranges this data into a DataFrame
# stack is the inverse of unstack
data.unstack()
$->           1         2         3
    a  0.506070 -2.293016  1.391751
    b -1.218733  0.390983  1.462456
    c  0.162262 -0.091724       NaN
    d       NaN  0.321799  0.203933

# a DataFrame can use hierarchical indexes on either axis
frame = DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohis','Ohis','Colorado'],['Green','Red','Green']])
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
$-> state      Ohis     Colorado
    color     Green Red    Green
    key1 key2
    a    1        0   1        2
         2        3   4        5
    b    1        6   7        8
         2        9  10       11

Reordering levels

frame.swaplevel('key1','key2')
$-> state      Ohis     Colorado
    color     Green Red    Green
    key2 key1
    1    a        0   1        2
    2    a        3   4        5
    1    b        6   7        8
    2    b        9  10       11

Summary statistics by level

# to aggregate over a column level, set axis=1
frame.sum(level='key2')
$-> state  Ohis     Colorado
    color Green Red    Green
    key2
    1         6   8       10
    2        12  14       16
frame.sum(level='color',axis=1)
$-> color      Green  Red
    key1 key2
    a    1         2    1
         2         8    4
    b    1        14    7
         2        20   10
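Newer pandas releases no longer accept the level argument on sum, so the same per-level totals are usually written with groupby. The sketch below rebuilds a frame with the same labels as in the text and is one possible equivalent, not the only one; grouping on a column level is done here by transposing first:

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohis', 'Ohis', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group the rows, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```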

mean shift

2016-05-11

Algorithm

machine learning

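Basic Mean Shift

Illustration
Given n sample points $x_i$, i=1,…,n in the d-dimensional space $R^d$, pick any point x in the space. The basic form of the Mean Shift vector is then defined as: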

![](http://pic002.cnblogs.com/images/2012/358029/2012051213564761.jpg)

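Here $S_k$ is the set of sample points that fall inside a high-dimensional ball of radius h centered at x.

Kernel-based Mean Shift

![](http://pic002.cnblogs.com/images/2012/358029/2012051215383189.jpg)
To explain: K() is the kernel function, h is the radius (bandwidth), and $\frac{C_{k,d}}{nh^d}$ is the unit-density (normalizing) factor. To maximize the density estimate f above, the most natural approach is to take its derivative, and mean shift does indeed come from differentiating that expression.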

Mean Shift Clustering pseudocode

// e is a predefined threshold value.
for data in dataset:
    x = data;
    do :
        calculate the shifted mean of x: ms;
        error = f(ms-x);
        x = ms;
    while (error > e);
    dict{data} = x;
dict{x}=dict{y} -> x,y in same cluster
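The pseudocode can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the basic (flat-kernel) form, where the new position is simply the mean of the points within distance h; the function names, the tolerance, and the rule of merging modes closer than h/2 are all choices made for this example.

```python
import numpy as np

def shift_to_mode(x, points, h, tol=1e-6, max_iter=200):
    """Repeat x <- mean of the points within distance h of x until the move is below tol."""
    for _ in range(max_iter):
        in_ball = points[np.linalg.norm(points - x, axis=1) <= h]
        if len(in_ball) == 0:          # empty window: nothing to average, stay put
            break
        new_x = in_ball.mean(axis=0)   # equals x + M_h(x) for the flat kernel
        moved = np.linalg.norm(new_x - x)
        x = new_x
        if moved < tol:
            break
    return x

def mean_shift_cluster(points, h, merge_tol=None):
    """Label each point by the mode it converges to; modes closer than merge_tol are merged."""
    points = np.asarray(points, dtype=float)
    merge_tol = h / 2.0 if merge_tol is None else merge_tol
    modes, labels = [], []
    for p in points:
        mode = shift_to_mode(p, points, h)
        for k, known in enumerate(modes):
            if np.linalg.norm(known - mode) < merge_tol:
                labels.append(k)       # converged to an already known mode
                break
        else:
            modes.append(mode)         # first point to reach this mode
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)

# Two small, well-separated groups of hand-picked points.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
modes, labels = mean_shift_cluster(data, h=1.0)
print(labels)   # [0 0 0 1 1 1]
```

The result depends strongly on the bandwidth h: a value that is too large merges distinct groups, while a value that is too small leaves almost every point as its own mode.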

Using IPython

2016-05-11

Dev

IPython

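A summary of some IPython features

Tab completion

Completes what you are typing, much like the same feature in a Linux shell

Introspection

Shows general information about an object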

b = [1, 2, 3, 4, 5]
b?
b??
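np.*load*? # list all names in the NumPy namespace that contain load

b can also be a function object; following a function with two question marks shows its source code.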

%run

Run a script file

# run the Python code in my_work.py
%run my_work.py

Both absolute and relative paths work

Pasting code at the command line

%paste
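# paste and run the code from the clipboard in one go

%cpaste
# paste code at a special prompt
# you can paste several times
# type -- on a line by itself to finish
# --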

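Profiling code

The main profiling tool for Python is the cProfile module

# run script.py and print the execution time of each function
python -m cProfile script.py
# sort by cumulative time
python -m cProfile -s cumulative script.py
%run -p -s cumulative script.py

# IPython interface: %prun profiles a statement rather than a whole module
%prun -l 7 -s cumulative func()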
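Outside IPython, roughly the same report can be produced from plain Python with the cProfile and pstats modules. This is a sketch; slow_sum is a made-up function standing in for whatever you want to profile:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical workload: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Sort by cumulative time and print at most 7 entries, like `%prun -l 7 -s cumulative`.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(7)
report = buf.getvalue()
print(report)
```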

For line-by-line profiling you can use the line_profiler library; see page 74 of "Python for Data Analysis" (利用Python进行数据分析).

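Magic commands

Commands prefixed with % are called magic commands

Command	Explanation
%quickref	Show the IPython quick reference
%magic	Show detailed documentation for all magic commands
%debug	Enter the interactive debugger at the bottom of the most recent exception traceback
%hist	Print the command input (and optionally output) history
%pdb	Automatically enter the debugger after an exception
%paste	Run the Python code in the clipboard
%cpaste	Open a special prompt for manually pasting Python code to run
%reset	Delete all variables/names in the interactive namespace
%page object	Pretty-print the object through a pager
%run script.py	Run the code in script.py
%prun statement	Run statement under cProfile and print the profiler results
%time statement	Report the execution time of statement
%timeit statement	Run statement multiple times and report the average time
%who %who_ls %whos	Show the variables defined in the interactive namespace, with increasing levels of detail
%xdel variable	Delete variable and try to clear all references to the object inside IPython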

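Magic commands related to the operating system

Command	Explanation
!cmd	Run cmd in the system shell
output=!cmd args	Run cmd and store its stdout in output
%alias alias_name cmd	Define an alias for a system shell command
%bookmark	Use IPython's directory bookmarking system
%cd directory	Change the current working directory to directory
%pwd	Return the current working directory
%pushd directory	Push the current directory onto the stack and change to the target directory
%popd	Pop the directory at the top of the stack and change to it
%dirs	Return a list containing the current directory stack
%dhist	Print the history of visited directories
%env	Return the system environment variables as a dict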