Below is a program that draws a simple random sample of websites from Alexa Analytics' Top One Million Sites CSV. The function create_sample
generates a random integer between 0 and 999,999, which serves as the index of the site we pull from the population CSV (e.g. the site at index 0 is google.com).
Once a site has been selected for our sample, we need to resolve its IP address (note: we use only IPv4 addresses, as Alexa offers only those in their dataset and all websites should handle both IPv4 and IPv6). If we are unable to resolve the address, then we randomly select another site to replace it.
When sampling is complete, the sample dataset (domains and IPv4 addresses) is exported to a CSV file.
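The selection step can be sketched with a tiny stand-in list (the five domains below are placeholders for the million-row Alexa file, an assumption for illustration). `random.sample` draws distinct indices in a single call, which sidesteps the need for a duplicate check in the sampling loop:

```python
import random

# Hypothetical five-site stand-in for the million-row Alexa list
top_sites_demo = ['google.com', 'youtube.com', 'facebook.com',
                  'baidu.com', 'wikipedia.org']

# random.sample draws distinct indices in one call, so no duplicate check is needed
indices = random.sample(range(len(top_sites_demo)), 3)
picked = [top_sites_demo[i] for i in indices]
print(picked)  # three distinct domains from the list
```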
from random import randint
import pandas as pd
import socket
# List of top one million sites according to Alexa Analytics/Website Ranking
# https://s3.amazonaws.com/alexa-static/top-1m.csv.zip
top_sites = pd.read_csv('top-1m.csv', header=None)[1]
# n is the sample size
n = 1000
# Dictionary used to store the sample data
sample = {
    'Website Domain': [],
    'IPv4 Address': []
}
def create_sample(n):
    i = 0
    while i < n:
        i += 1
        # Get a random index between 0 and 999,999
        random_index = randint(0, len(top_sites) - 1)
        # If the site has already been selected, pick a replacement
        if top_sites[random_index] in sample['Website Domain']:
            i -= 1
            continue
        try:
            # print("\033[0mGetting IPv4 Address for %s..." % top_sites[random_index])
            ipv4 = socket.gethostbyname(top_sites[random_index])
        # If we can't resolve the IP from the host name, replace it with a different host name
        except socket.gaierror:
            # print("\033[1mFailed. Selecting new site for sample.")
            i -= 1
            continue
        sample['Website Domain'].append(top_sites[random_index])
        sample['IPv4 Address'].append(ipv4)
create_sample(n)
# Save sample to a CSV file
dataset = pd.DataFrame.from_dict(sample)
dataset.to_csv('website_sample.csv')
dataset
If a sample has already been created, we can simply import website_sample.csv
as our dataset so we don't have to generate a new sample.
n = 1000
dataset = pd.read_csv('website_sample.csv', index_col=0)
dataset
The program takes the list of all IPv4 address ranges owned by AWS and compares them to the addresses in our sample. AWS does not publish a list of individual IPv4 addresses; instead it publishes subnets of their addresses, meaning the addresses were purchased in bulk and are grouped together. To properly compare an IPv4 address to a subnet, Python offers a library called ipaddress
that parses subnets and IP addresses into a data format that can easily be compared.
If an address appears in Amazon's IPv4 ranges (their owned addresses), then the domain associated with that IP address is appended to a list. The list of websites is then exported as a CSV file.
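The membership test at the heart of this comparison can be shown in isolation. The subnet below is the reserved documentation range from RFC 5737, used here as a stand-in for an AWS prefix:

```python
import ipaddress

# 192.0.2.0/24 is a reserved documentation range (RFC 5737), a stand-in subnet
subnet = ipaddress.ip_network('192.0.2.0/24')

inside = ipaddress.ip_address('192.0.2.57')
outside = ipaddress.ip_address('198.51.100.7')

print(inside in subnet)   # True: the address falls within the /24 block
print(outside in subnet)  # False: different network entirely
```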
import json, requests, ipaddress
# List of IP Ranges (IPv4 and IPv6) owned by Amazon and used for AWS
# https://ip-ranges.amazonaws.com/ip-ranges.json
aws_ip_ranges = json.loads(requests.get('https://ip-ranges.amazonaws.com/ip-ranges.json').text)
# Determine if a given IP address (ip_input) shows up in the AWS IPv4 ranges
def check_aws(ip_input):
    # Parse the given IPv4 address once for comparison
    site_ip = ipaddress.ip_address(ip_input)
    # Compare the given IP against every AWS IPv4 subnetwork
    for prefix in aws_ip_ranges['prefixes']:
        # Parse the AWS IPv4 subnet
        aws_subnet = ipaddress.ip_network(prefix['ip_prefix'])
        # If the IP is within an AWS range, the website is run on AWS
        if site_ip in aws_subnet:
            return True
    # If the IP is not within any range, the
    # website operates independently of AWS
    return False
# List of websites using AWS
websites_using_aws = []
def get_aws_domains():
    # Check every IP within our sample against the AWS IPv4 ranges
    for i in range(len(dataset)):
        if check_aws(dataset['IPv4 Address'][i]):
            websites_using_aws.append(dataset['Website Domain'][i])
get_aws_domains()
# Save dataset of AWS websites to a CSV file
aws_df = pd.DataFrame({'AWS Websites':websites_using_aws})
aws_df.to_csv('websites_using_aws.csv')
aws_df
This is a procedure for completing a one-proportion z-test given the sample dataset and the proportion of websites using AWS. First we declare p
as the claimed market-share percentage (the Forbes article claimed AWS has a 31% market share). We calculate q
and proceed with our success/failure condition.
The assert
lines test the success/failure condition: if np or nq is less than 10, the program stops and an exception (error) is raised. Because the index values used when sampling were random, we can assume our sample is random.
Next we calculate our z-score with a one-proportion z-test: $$ \begin{align} z = \frac{\hat{p} - p}{\sigma} && \sigma = \sqrt{\frac{pq}{n}} \end{align} $$
The code for calculating the z-score is z = ((len(websites_using_aws)/n) - p)/sd
, and for the standard deviation it is sd = math.sqrt((p*q)/n)
.
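As a worked example of these two formulas, suppose 61 of 1,000 sampled sites use AWS (a hypothetical count chosen to be near this study's result, not the notebook's exact output):

```python
import math

p_claim = 0.31            # claimed population proportion
q_claim = 1 - p_claim
n_demo = 1000             # hypothetical sample size
p_hat = 61 / n_demo       # hypothetical sample proportion (61 AWS sites of 1,000)

sd_demo = math.sqrt((p_claim * q_claim) / n_demo)  # sigma = sqrt(pq/n)
z_demo = (p_hat - p_claim) / sd_demo               # one-proportion z statistic
print(round(z_demo, 2))  # -17.03
```

A z-score this far below zero means the sample proportion sits many standard deviations under the claimed 31%.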
Once we have our z-score, we can finally get our p-value. SciPy's statistics module allows us to calculate the p-value similar to how a calculator would. Once complete, we can do hypothesis testing using a significance level of 5%.
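SciPy's st.norm.cdf returns the left-tail area under the standard normal curve. As a cross-check, the same quantity can be computed with only the standard library, since the normal CDF satisfies Phi(z) = erfc(-z / sqrt(2)) / 2:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

print(norm_cdf(0))    # 0.5: half the area lies left of the mean
print(norm_cdf(-17))  # vanishingly small left-tail area, effectively zero
```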
import math
import scipy.stats as st
# Creating initial values from the dataset/claim
claimed_marketshare = 0.31
p = claimed_marketshare
q = 1 - claimed_marketshare
# Success/failure condition: an exception is raised if np or nq is less than 10
assert n*p >= 10, 'np must be at least 10'
assert n*q >= 10, 'nq must be at least 10'
# Calculate z-score & p-value
sd = math.sqrt((p*q)/n)
z = ((len(websites_using_aws)/n) - p)/sd
p_value = st.norm.cdf(z)
print('P: %f\tQ: %f\nNP: %f\tNQ: %f\n\nP-Hat: %f\n\nZ-Score: %f\nP-Value: %f\n'
      % (p, q, n*p, n*q, (len(websites_using_aws)/n), z, p_value))
# Hypothesis testing
significance_level = 0.05
if p_value <= significance_level: print('\033[1mReject H0')
else: print('\033[1mFail to reject H0')
We reject the null hypothesis, as our p-value is approximately zero. According to this observational study, the claim made in the Forbes article is invalid: we can support the alternative hypothesis that Amazon does not have a 31% market share in cloud computing/hosting.
This is a standard method to produce a confidence interval given the z-score, sample size, claimed population proportions and sample proportions produced by the one-proportion z-test.
We use the following equations to calculate the standard deviation and standard error for our confidence interval: $$ \begin{align} \sigma_p = \sqrt{\frac{pq}{n}} && SE_p = \sqrt{\frac{\hat{p}\hat{q}}{n}} \end{align} $$
Both $\hat{p}$ and $\hat{q}$ are simply $p$ and $q$ for the sample proportion. Using the z-score we calculated during our hypothesis test as our critical value, we can multiply it by our standard error to produce our margin of error, like so: $$ ME = Z_c\sqrt{\frac{\hat{p}\hat{q}}{n}} $$
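Plugging in hypothetical numbers close to the study's result (a sample proportion of 0.061 with n = 1,000, and the test z-score of about -17.03 as the critical value; all of these are assumptions for illustration) gives:

```python
import math

# Hypothetical values near the study's result (assumptions, not the
# exact numbers from the notebook run above)
n_demo = 1000
p_hat = 0.061
q_hat = 1 - p_hat
z_c = -17.03  # critical value taken from the hypothesis test, per the method above

se_demo = math.sqrt((p_hat * q_hat) / n_demo)  # SE_p = sqrt(p_hat * q_hat / n)
me_demo = z_c * se_demo                        # margin of error (negative since z_c < 0)
print('Interval: (%f, %f)' % (p_hat + me_demo, p_hat - me_demo))  # roughly (-0.068, 0.190)
```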
# Sample p (statistic) and q values
aws_p = len(websites_using_aws)/n
aws_q = 1 - aws_p
# Standard error for the sample proportion
se = math.sqrt((aws_p*aws_q)/n)
# Margin of error (negative here because the z-score is negative)
me = z*se
# Since me is negative, aws_p + me is the lower bound of the interval
print('Interval: (%f, %f)' % (aws_p + me, aws_p - me))
I am 95% confident that the true proportion of websites using AWS within the top one-million-sites population is between -6.78% and 19%. By 95% confident I mean that if the above procedure is reproduced with a sample size of 1,000, the proportion of sampled websites that use AWS as their cloud provider will fall between -6.78% and 19%. Because zero lies within our interval, the results of this study can be considered insignificant.