Below is a program that draws a simple random sample of websites from Alexa Analytics' Top One Million Sites CSV. The function create_sample
generates a random integer between 0 and 999,999, which serves as the index of the site we pull from the population CSV (e.g. the site at index 0 is google.com).
Once a site has been selected for our sample, we need to resolve its IP address (note: we use only IPv4 addresses, as Alexa offers only those in their dataset and all websites should handle both IPv4 and IPv6). If we are unable to resolve the address, then we randomly select another site to replace it.
When sampling is complete, the sample dataset (domains and IPv4 addresses) is exported to a CSV file.
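The selection step can be sketched with a tiny stand-in list (the five domains below are placeholders for the million-row Alexa file, an assumption for illustration). `random.sample` draws distinct indices in a single call, which sidesteps the need for a duplicate check in the sampling loop:

```python
import random

# Hypothetical five-site stand-in for the million-row Alexa list
top_sites_demo = ['google.com', 'youtube.com', 'facebook.com',
                  'baidu.com', 'wikipedia.org']

# random.sample draws distinct indices in one call, so no duplicate check is needed
indices = random.sample(range(len(top_sites_demo)), 3)
picked = [top_sites_demo[i] for i in indices]
print(picked)  # three distinct domains from the list
```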
from random import randint
import pandas as pd
import socket
# List of top one million sites according to Alexa Analytics/Website Ranking
# https://s3.amazonaws.com/alexa-static/top-1m.csv.zip
top_sites = pd.read_csv('top-1m.csv', header=None)[1]
# n is the sample size
n = 1000
# Dictionary used to store the sample data
sample = {
    'Website Domain': [],
    'IPv4 Address': []
}
def create_sample(n):
    i = 0
    while i < n:
        i += 1
        # Get a random index between 0 and 999,999
        random_index = randint(0, len(top_sites) - 1)
        # If the site has already been selected, pick a replacement
        if top_sites[random_index] in sample['Website Domain']:
            i -= 1
            continue
        try:
            # print("\033[0mGetting IPv4 Address for %s..." % top_sites[random_index])
            ipv4 = socket.gethostbyname(top_sites[random_index])
        # If we can't resolve the IP from the host name, replace it with a different host name
        except socket.gaierror:
            # print("\033[1mFailed. Selecting new site for sample.")
            i -= 1
            continue
        sample['Website Domain'].append(top_sites[random_index])
        sample['IPv4 Address'].append(ipv4)
create_sample(n)
# Save sample to a CSV file
dataset = pd.DataFrame.from_dict(sample)
dataset.to_csv('website_sample.csv')
dataset
If a sample has already been created, we can simply import website_sample.csv
as our dataset so we don't have to generate a new sample.
n = 1000
dataset = pd.read_csv('website_sample.csv', index_col=0)
dataset
The program takes the list of all IPv4 address ranges owned by AWS and compares them to the addresses in our sample. AWS does not publish a list of individual IPv4 addresses; instead it publishes subnets of their addresses, meaning the addresses were purchased in bulk and are grouped together. To properly compare an IPv4 address to a subnet, Python offers a library called ipaddress
that parses subnets and IP addresses into a data format that can easily be compared.
If an address appears in Amazon's IPv4 ranges (their owned addresses), then the domain associated with that IP address is appended to a list. The list of websites is then exported as a CSV file.
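The membership test at the heart of this comparison can be shown in isolation. The subnet below is the reserved documentation range from RFC 5737, used here as a stand-in for an AWS prefix:

```python
import ipaddress

# 192.0.2.0/24 is a reserved documentation range (RFC 5737), a stand-in subnet
subnet = ipaddress.ip_network('192.0.2.0/24')

inside = ipaddress.ip_address('192.0.2.57')
outside = ipaddress.ip_address('198.51.100.7')

print(inside in subnet)   # True: the address falls within the /24 block
print(outside in subnet)  # False: different network entirely
```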
import json, requests, ipaddress
# List of IP Ranges (IPv4 and IPv6) owned by Amazon and used for AWS
# https://ip-ranges.amazonaws.com/ip-ranges.json
aws_ip_ranges = json.loads(requests.get('https://ip-ranges.amazonaws.com/ip-ranges.json').text)
# Determine if a given IP address (ip_input) shows up in the AWS IPv4 ranges
def check_aws(ip_input):
    # Parse the given IPv4 address once for comparison
    site_ip = ipaddress.ip_address(ip_input)
    # Compare the given IP against every AWS IPv4 subnetwork
    for prefix in aws_ip_ranges['prefixes']:
        # Parse the AWS IPv4 subnet
        aws_subnet = ipaddress.ip_network(prefix['ip_prefix'])
        # If the IP is within an AWS range, the website is run on AWS
        if site_ip in aws_subnet:
            return True
    # If the IP is not within any range, the
    # website operates independently of AWS
    return False
# List of websites using AWS
websites_using_aws = []
def get_aws_domains():
    # Check every IP within our sample against the AWS IPv4 ranges
    for i in range(len(dataset)):
        if check_aws(dataset['IPv4 Address'][i]):
            websites_using_aws.append(dataset['Website Domain'][i])
get_aws_domains()
# Save dataset of AWS websites to a CSV file
aws_df = pd.DataFrame({'AWS Websites':websites_using_aws})
aws_df.to_csv('websites_using_aws.csv')
aws_df
This is a procedure for completing a one-proportion z-test given the sample dataset and the proportion of websites using AWS. First we declare p
as the claimed market-share percentage (the Forbes article claimed AWS has a 31% market share). We calculate q
and proceed with our success/failure condition.
The assert
lines test the success/failure condition: if np or nq is less than 10, the program stops and an exception (error) is raised. Because the index values used when sampling were random, we can assume our sample is random.
Next we calculate our z-score with a one-proportion z-test: $$ \begin{align} z = \frac{\hat{p} - p}{\sigma} && \sigma = \sqrt{\frac{pq}{n}} \end{align} $$
The code for calculating the z-score is z = ((len(websites_using_aws)/n) - p)/sd
, and for the standard deviation it is sd = math.sqrt((p*q)/n)
.
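As a worked example of these two formulas, suppose 61 of 1,000 sampled sites use AWS (a hypothetical count chosen to be near this study's result, not the notebook's exact output):

```python
import math

p_claim = 0.31            # claimed population proportion
q_claim = 1 - p_claim
n_demo = 1000             # hypothetical sample size
p_hat = 61 / n_demo       # hypothetical sample proportion (61 AWS sites of 1,000)

sd_demo = math.sqrt((p_claim * q_claim) / n_demo)  # sigma = sqrt(pq/n)
z_demo = (p_hat - p_claim) / sd_demo               # one-proportion z statistic
print(round(z_demo, 2))  # -17.03
```

A z-score this far below zero means the sample proportion sits many standard deviations under the claimed 31%.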
Once we have our z-score, we can finally get our p-value. SciPy's statistics module allows us to calculate the p-value similar to how a calculator would. Once complete, we can do hypothesis testing using a significance level of 5%.
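SciPy's st.norm.cdf returns the left-tail area under the standard normal curve. As a cross-check, the same quantity can be computed with only the standard library, since the normal CDF satisfies Phi(z) = erfc(-z / sqrt(2)) / 2:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

print(norm_cdf(0))    # 0.5: half the area lies left of the mean
print(norm_cdf(-17))  # vanishingly small left-tail area, effectively zero
```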
import math
import scipy.stats as st
# Creating initial values from the dataset/claim
claimed_marketshare = 0.31
p = claimed_marketshare
q = 1 - claimed_marketshare
# Success/failure condition: an exception is raised if np or nq is less than 10
assert n*p >= 10, 'np must be at least 10'
assert n*q >= 10, 'nq must be at least 10'
# Calculate z-score & p-value
sd = math.sqrt((p*q)/n)
z = ((len(websites_using_aws)/n) - p)/sd
p_value = st.norm.cdf(z)
print('P: %f\tQ: %f\nNP: %f\tNQ: %f\n\nP-Hat: %f\n\nZ-Score: %f\nP-Value: %f\n'
      % (p, q, n*p, n*q, (len(websites_using_aws)/n), z, p_value))
# Hypothesis testing
significance_level = 0.05
if p_value <= significance_level: print('\033[1mReject H0')
else: print('\033[1mFail to reject H0')
We reject the null hypothesis, as our p-value is approximately zero. According to this observational study, the claim made in the Forbes article is invalid: we can support the alternative hypothesis that Amazon does not have a 31% market share in cloud computing/hosting.
This is a standard method to produce a confidence interval given the z-score, sample size, claimed population proportions and sample proportions produced by the one-proportion z-test.
We use the following equations to calculate the standard deviation and standard error for our confidence interval: $$ \begin{align} \sigma_p = \sqrt{\frac{pq}{n}} && SE_p = \sqrt{\frac{\hat{p}\hat{q}}{n}} \end{align} $$
Both $\hat{p}$ and $\hat{q}$ are simply $p$ and $q$ for the sample proportion. Using the z-score we calculated during our hypothesis test as our critical value, we can multiply it by our standard error to produce our margin of error, like so: $$ ME = Z_c\sqrt{\frac{\hat{p}\hat{q}}{n}} $$
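Plugging in hypothetical numbers close to the study's result (a sample proportion of 0.061 with n = 1,000, and the test z-score of about -17.03 as the critical value; all of these are assumptions for illustration) gives:

```python
import math

# Hypothetical values near the study's result (assumptions, not the
# exact numbers from the notebook run above)
n_demo = 1000
p_hat = 0.061
q_hat = 1 - p_hat
z_c = -17.03  # critical value taken from the hypothesis test, per the method above

se_demo = math.sqrt((p_hat * q_hat) / n_demo)  # SE_p = sqrt(p_hat * q_hat / n)
me_demo = z_c * se_demo                        # margin of error (negative since z_c < 0)
print('Interval: (%f, %f)' % (p_hat + me_demo, p_hat - me_demo))  # roughly (-0.068, 0.190)
```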
# Sample p (statistic) and q values
aws_p = len(websites_using_aws)/n
aws_q = 1 - aws_p
# Standard error for the sample proportion
se = math.sqrt((aws_p*aws_q)/n)
# Margin of error (negative here because the z-score is negative)
me = z*se
# Since me is negative, aws_p + me is the lower bound of the interval
print('Interval: (%f, %f)' % (aws_p + me, aws_p - me))
I am 95% confident that the true proportion of websites using AWS within the top one-million-sites population is between -6.78% and 19%. By 95% confident I mean that if the above procedure is reproduced with a sample size of 1,000, the proportion of sampled websites that use AWS as their cloud provider will fall between -6.78% and 19%. Because zero lies within our interval, the results of this study can be considered insignificant.