STAT 250 Spring 2019 Data Analysis Assignment 1
Your submitted document should include the following items. Points will be deducted if the
following are not included.
1. Type your Name and STAT 250 with your correct section number (e.g. STAT 250-xxx)
right justified and then Data Analysis Assignment #1 centered on the top of page 1 below
2.
3.
corresponding number and subpart. Keep the answers in order. Do not include the questions
4. Generate all requested graphs and tables using StatCrunch.
5. Elements of good technical writing:
Use complete and coherent sentences to answer the questions.
Graphs must be appropriately titled and should refer to the context of the question.
Graphical displays must include labels with units if appropriate for each axis.
Units should always be included when referring to numerical values.
When making a comparison you must use comparative language, such as “greater than”, “less
than”, or “about the same as.”
Ensure that all graphs and tables appear on one page and are not split across two pages.
Type all mathematical calculations when directed to compute an answer ‘by-hand.’
Pictures of actual handwritten work are not accepted on this assignment.
When writing mathematical expressions into your document you may use either an equation editor
or common shortcuts such as:
√x can be written as sqrt(x), p̂ can be written as p-hat, x̄ can be
written as x-bar.
Problem 1: 2018 Movies
Moviepass is a subscription service that allows users to see one movie per day at select theaters.
AMC Theatres released their own movie subscription service called A-List to compete with
Moviepass; A-List allows users to see up to three movies per week. Raw data was collected from
one user who purchased an annual Moviepass subscription in January 2018 and a subscription to
A-List in November 2018. The dataset found in our StatCrunch Group presents 169 movies seen
along with other variables describing each movie. The data set is called “2018 Movies.”
a) Use StatCrunch to create a one-way table for the variable “Genre” using both counts and
percentages. Select Stat → Tables → Frequency and select both ‘Frequency’ and ‘Percent of
total’ in the Statistic(s) box by holding down the Ctrl Key (Command Key on Macs) when
making these selections. Copy your table into your document and then manually round the
values in the ‘Percent of total’ column to two decimal places in the StatCrunch table that
you have copied into your document.
b) Interpret your findings from the table in part (a) by identifying the least and most popular
genre by percent of total. Use complete sentences with context and include the genre and
percentage in the sentences.
c) Use StatCrunch to generate a two-way table for the variables “Genre” and “Viewer
Rating”. Go to Stat → Tables → Contingency → With Data (since you have the raw data in
StatCrunch). Select “Genre” as your row variable and “Viewer Rating” as your column
variable. In the display box, select only Percent of Total. Lastly, unclick (or deselect) “Chi-Square test for independence” since it is highlighted by default by holding the Ctrl key and
d) How many and what percentage of the 169 movies did the viewer dislike? Answer this
question in a complete sentence.
e) What values are the same when looking at both your one-way table and your two-way table?
Be specific if referencing rows or columns.
f) Now, create two more two-way tables keeping “Genre” as your row variable and “Viewer
Rating” as your column variable. One table needs to include row percentages and the other
needs to include column percentages. To do this, change what you select in the display box
from percent of total (in part (c)) to row percent for the first table and column percent for the
second table. Include both tables in your document.
g) Specifically interpret the meaning of the row percentage found in the “Children’s/Animated”
and “Liked” cell. Note that there are 14 movies in that cell.
h) Now, specifically interpret the meaning of the column percentage found in the
“Children’s/Animated” and “Liked” cell. Note that there are 14 movies in that cell.
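The difference between percent of total, row percent, and column percent in parts (c) through (h) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the `movies` list and the `cell_percentages` helper are hypothetical, the data are made up rather than taken from the “2018 Movies” data set, and the assignment itself still requires StatCrunch output.

```python
from collections import Counter

# Hypothetical (genre, viewer rating) pairs; NOT the real "2018 Movies" data.
movies = [
    ("Action", "Liked"), ("Action", "Liked"), ("Action", "Disliked"),
    ("Comedy", "Liked"), ("Comedy", "Disliked"), ("Comedy", "Disliked"),
    ("Drama", "Liked"), ("Drama", "Liked"), ("Drama", "Liked"),
    ("Drama", "Disliked"),
]


def cell_percentages(pairs, row, col):
    """Return (percent of total, row percent, column percent) for one cell."""
    cells = Counter(pairs)                       # count for each (row, col) cell
    row_totals = Counter(r for r, _ in pairs)    # one-way counts for the row variable
    col_totals = Counter(c for _, c in pairs)    # one-way counts for the column variable
    count = cells[(row, col)]
    return (
        100 * count / len(pairs),       # share of all movies
        100 * count / row_totals[row],  # share of movies in this genre
        100 * count / col_totals[col],  # share of movies with this rating
    )


total_pct, row_pct, col_pct = cell_percentages(movies, "Drama", "Liked")
print(f"Percent of total: {total_pct:.2f}%")
print(f"Row percent:      {row_pct:.2f}%")
print(f"Column percent:   {col_pct:.2f}%")
```

In this made-up example the same cell count is divided by three different totals: all movies, the movies in that genre (row), and the movies with that rating (column). That is why the three percentages for one cell generally differ.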
Problem 2: 2018 Movies Revisited
Which genre is most popular among the 169 movies seen? Use the “2018 Movies” data set posted
in our StatCrunch group to answer the following questions.
a) Using the variable named “Genre”, produce a relative frequency bar chart using Graph →
Bar Plot → With Data. Please properly label axes and provide a meaningful title and copy it
b) Using the variable “Genre”, produce a relative frequency Pareto chart. Begin with your bar
chart, and edit it by changing “Order by” to Count Descending. Properly title and label your
graph and copy it into your document.
c) Using the variable “Genre”, produce a Pie Chart using Graph → Pie Chart → With Data.
Add an appropriate title and copy this entire graph including the legend into your document.
d) Use the three graphs to answer the question: Which genre of movie did this individual see
the most of? Present both the count and the proportion and write your answer in one
sentence.
e) Now produce two grouped relative frequency bar charts (to copy to your document) by
following the directions below.
Go to Graph → Bar Plot → With Data.
For the first grouped bar chart, graph the variable “Viewer Rating” and group by “Genre.”
To “group by” click the arrow next to Group by box (the third box down) and select the
variable you are asked to group by. In the Type box (5th box down from the top) choose
relative frequency within category. Title these graphs clearly. You may keep the default
labels for the x and y-axis.
For the second grouped bar chart, graph the variable “Genre” and group by “Viewer
Rating.” In the Type box (5th box down from the top) choose relative frequency within
category. Title these graphs clearly. You may keep the default labels for the x and y-axis.
f) Compare the graph variable among the categories of the genres. Describe what you see
from each graph in one sentence each. Specifically with the graph grouped by Viewer
See next page for Problem 3
Problem 3: Metro Bike Share
On July 7, 2016, the Los Angeles County Metropolitan Transportation Authority launched a bicycle
sharing system called Metro Bike Share. The system uses a fleet of about 1,400 bikes and includes
93 stations in Downtown Los Angeles, Venice, and the Port of Los Angeles. It is the first bike
share system in the United States to be integrated as part of the city’s existing public transit system.
The “Metro Bike Share” data set includes a random sample of 300 trips lasting between one and
60 minutes. Twelve variables are included for each observation. The Duration variable indicates
the length of the trip in minutes.
a) Create a frequency histogram for the variable “Duration” by using Graph → Histogram.
Properly title and label your graph and copy it into your document.
b) Interpret the shape of this distribution in one complete sentence.
c) Use StatCrunch to obtain the sample size, mean, and standard deviation for the “Duration”
variable by using Stat → Summary Stats → Columns. Note: in the Statistics box, select the
summary statistics listed above in the exact order given. Copy the entire table into your
document and manually round each value to two decimal places.
d) Use StatCrunch to obtain the five number summary and the IQR for the “Duration” variable
(the five number summary includes Min, Q1, Median, Q3, Max). Go to Stat → Summary
Stats → Columns to obtain these values. Note: in the Statistics box, select the summary
statistics listed above in the exact order given. Copy the entire table into your document and
manually round each value to two decimal places.
e) Choose the appropriate summary statistics for center and spread (presented in either 3c or
3d) based on your stated shape of the distribution in 3b.
f) Use your summary statistics from part 3d and determine the fences used to mathematically
identify outliers for the “Duration” variable. To do this, show all steps in your calculations
manually including how you obtained the upper and lower fences. Please type your work
and calculations.
g) Construct a horizontally oriented boxplot of the “Duration” variable by using Graph →
Boxplot. To do this, click the “Draw boxes horizontally” box. Properly title and label and
copy this graph into your document.
h) How many outliers do you identify (please use both the boxplot and your results from 3f)?
Write your response in a complete sentence.
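The fence calculation in part 3f follows the usual 1.5 × IQR rule, which can be sketched in Python as a way to check typed work. The quartile values and the `durations` list below are hypothetical placeholders, not values from the “Metro Bike Share” data set, and `outlier_fences` is an illustrative helper name.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper


# Hypothetical quartiles in minutes; substitute the values from your own table.
q1, q3 = 8.0, 20.0
lower, upper = outlier_fences(q1, q3)

# Hypothetical trip durations in minutes, used only to show the counting step.
durations = [3, 7, 9, 12, 15, 21, 25, 40, 58]
outliers = [d for d in durations if d < lower or d > upper]

print(f"IQR = {q3 - q1} minutes")
print(f"Lower fence = {lower} minutes, upper fence = {upper} minutes")
print(f"Outliers: {outliers}")
```

Any observation below the lower fence or above the upper fence is flagged as an outlier. A negative lower fence simply means that no trip can be a low outlier, since durations cannot be negative.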
Problem 4: SAT Scores
This data set presents SAT Verbal and Math scores for a random sample of 300 individuals. In
addition, the individual’s gender and college is recorded. The sample was collected from one of six
colleges (numbered 1 – 6). The data set is called “SAT Scores.”
a) Construct two relative frequency histograms using the “Math” variable (one for Males and
one for Females). To do this, go to Graph → Histogram. Select Math to enter it in the
graph box and then click the arrow in the “Group by:” box and select Gender. Properly title
and label your graphs. Finally, below the titling area, under “For multiple graphs” change
Columns per page from 1 to 2 and click Compute! Once the graph is computed, click the
three lines in the bottom left of the leftmost graph. Select x-axis and change the minimum
to 250 and select the y-axis and change the maximum to 0.24. (This is done so that both graphs
use the same scale on the x and y axes)*. Copy and paste your graphs into
b) Describe the shape of each distribution in context in one sentence each.
c) Use StatCrunch to obtain the sample size (n), the mean, and the standard deviation of the “Math”
variable by Gender (using “Group by:”). Copy and paste the table into your document.
For parts 4d-4f, determine how well the Empirical Rule does in predicting the percentage of
observations within some number of standard deviations of the mean.
d) Use your rounded summary statistics for females from part 4c to calculate the interval
corresponding to one, two, and three standard deviations about the mean SAT Math
Score. Type your work showing how you obtained these intervals. Round the endpoints
of the final intervals correctly to whole numbers and clearly label and list these three
intervals in your document as shown below:
68% interval (lower value, upper value)
95% interval (lower value, upper value)
99.7% interval (lower value, upper value)
e) Use StatCrunch to determine the count and percentage of observations falling in each of
these intervals by following the instructions listed below or using another appropriate
counting method. Properly label and list these counts and percentages in your document.
Start in the “Female Math SAT Scores” data set (found in your StatCrunch Group). Go to
Data → Row Selection → Interactive Tools. In the slider selectors box, click the variable
Math into the variable box. Then Click compute.
The box that appears has a slider under the words Math that allows you to create ranges of
scores that you determined in 4d. Use the slider to obtain the count for each interval by
looking at the “# rows selected” presented in the first line of the box. Calculate the
percentages from the counts you obtained for each interval and include them in your
document.
f) Does each of the three percentages found in part 4e match what the Empirical Rule
predicts? Compare your results in 4e with the expected percentage stated in the empirical
g) Suppose a new female student with a Math SAT score of 700 was recorded. Calculate the z-
score of this ‘new’ score and explain in a complete sentence what this z-score indicates.
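The arithmetic behind parts 4d through 4g can be sketched in Python as a way to check typed calculations. Everything numeric here is an assumption for illustration: the mean, standard deviation, and `scores` list are made up and are not the “SAT Scores” data, the helper names are hypothetical, and the interval endpoints are treated as inclusive.

```python
def empirical_intervals(mean, sd):
    """Return intervals of 1, 2, and 3 standard deviations about the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


def percent_within(values, low, high):
    """Return (count, percent) of values in [low, high], endpoints inclusive."""
    count = sum(low <= v <= high for v in values)
    return count, 100 * count / len(values)


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical summary statistics and scores; replace with your own rounded values.
mean, sd = 500.0, 100.0
scores = [350, 420, 480, 500, 510, 530, 560, 590, 610, 720]

expected = {1: 68.0, 2: 95.0, 3: 99.7}  # percentages predicted by the Empirical Rule
for k, (low, high) in empirical_intervals(mean, sd).items():
    count, pct = percent_within(scores, low, high)
    print(f"{expected[k]}% interval ({low:.0f}, {high:.0f}): "
          f"{count} scores, {pct:.1f}% observed")

print(f"z-score for 700: {z_score(700, mean, sd):.2f}")
```

Comparing each observed percentage with 68%, 95%, and 99.7% shows how closely the sample follows the Empirical Rule. A positive z-score means the score lies above the mean by that many standard deviations.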
Sample Solution to Display Formatting
A random sample of 30 students was selected from a STAT 250 course taught during the
summer session and their first exam scores were recorded.
a) Create a histogram in StatCrunch. Be sure to title and label it correctly.
b) Interpret the histogram’s shape
See sample solution and formatting on page 2.
Following these main points will help you submit a professionally completed assignment.
1) Right justify your name and provide your correct section and the due date.
2) Center the specific homework assignment title.
3) Bold each complete problem number.
4) The graph can be about the size shown below for readability (click on the graph once and
adjust its size using only the bottom right dot).
5) Keep the assignment in problem and part order (present 1a, then 1b, and so on).
Kenneth Strazzeri

Data Analysis Assignment 1
Problem X
a)
b) The shape of this distribution is left skewed because the majority of the data values
fall in the upper end of the distribution, with a few scores in the 50s and 60s skewing the shape.
There do not seem to be any outliers visible on the graph.

