Yiiikaiiii YikaiLL

Java Interview

1 Arrays and strings

HashMap

Key -> Hashcode -> array of linked list->in case of collisions

Resizeable arrays

The array copied itself to a new array each time, let's say doubling its origin size to get a new size array. Even with resizing, the amortized insertion is still of O(1) time complexity. Proof: Given a array of size N, the total number of copies to insert N elements is roughly: $$N/2+N/4+N/8+...+2+1 < N$$

Chapter 8 Classfication

8.5 Model Evaluation and Selection

8.5.4 Bootstrap

Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the given training tuples uniformly with replacement.

.632 boostrap

$$(1-1/d)^d=1-e^{-1}=0.632$$ 63.2% will form the training set and 36.8% would be at the test set. Bootstrapping tends to be overly optimistic. It works best with small data sets.

Bootstrapping tends to be overly optimistic. It works best with small data sets.

Chapter 1

1.3 Basic Outlier Detection Models

Z-value test

Consider a set of 1-dimensional quantitative data observations, $$Z_i = ^{|X_i -\mu|}/_{\sigma}$$

Enough Data In cases where the mean and standard deviation of the distribution can be accurately estimated, a good “rule-of-thumb” is to use Zi ≥ 3 as a proxy for the anomaly.
Few Data at Hand However, in scenarios in which very few samples are available, the mean and standard deviation of the underlying distribution cannot be estimated robustly. In such cases, the results from the Z-value test need to be interpreted more carefully with the use of the (related) Student’s t-distribution rather than a normal distribution. This issue will be discussed in Chapter 2.

Chapter 1

Time domain approach

This is generally motivated by the presumption that correlation between adjacent points in time is best explained in terms of a dependence of the current value on past values

mulitplicative models and additvie models

ARIMA / autoregressive integrated moving average

Frequency Domain Approach

assumes the primary characteristics of interest in time series analyses relate to periodic or systematic sinusoidal variations found naturally in most data.

Chapter 2

Cassandra in 50 words

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, row-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Distributed and Decentralized

Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole.

No master-slave

	def knn_detect_outlier(self):
	"""
	Using knn_detect_outlier to output the outlier score for different sequence.
	The idea is that for knn_detect_outlier, each sequence in the old cluster are representative of normal sequence.
	Distance between
	:return:
	"""
	# Normal
	from sklearn.neighbors import NearestNeighbors
	neigh = NearestNeighbors(n_neighbors=self.K_NEIGHBOR, metric=self.mydist)

	import urllib
	from lxml import html
	from urllib.request import urlopen
	from lxml import etree
	import time
	import json
	from collections import defaultdict
	from send_email import send_email
	# from open_html import fill_input