Description

Predict the Most Valuable Player in the National League for 2017 and support your prediction with data

2 references

attached material to include from the lecture

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 1/12

Module 3

This is a single, concatenated file, suitable for printing or saving as a PDF for of fline viewing. Please

note that some animations or images may not work.

Module 3 Study Guide and Deliver ables

Topic: Introducing SQL and Hitting Sabermetrics

Readings: 1. OpenIntro Statistics (3rd ed.) (https://www.openintro.org/stat/textbook.php

stat_book=os) : Section 1.7 from Introduction to Data: Categorical Data

2. edX: Module 2: Hitting:

Sabermetrics: Hitting Metrics

Tech: Basic SQL syntax

Tech: GROUP BY and Aggregate Functions

View: Exploring Categorical Data (https://www.youtube.com/watch v=7NhNeADL8fA)

Mean, median, and mode (https://www.youtube.com/watch v=zLHunbpH5Hg)

Discussions: Module 3: Discussion 1

Initial Response due 11:59pm ET, Saturday, 19 November 2016

Response to another post due 11:59pm ET, Monday, 21 November 2016

Assignments: Module 3: Assignment 1 due 6am ET, Tuesday, 22 November 2016

Assessment: None

Live

Classrooms:

9pm ET, Tuesday, 15 November 2016

9pm ET, Thursday, 17 November 2016

Study Guide

This module (metis342_m03_00_printable.htm) is also available as a concatenated page, suitable for

printing or saving as a PDF for offline viewing.

Lesson Objectives

1. Understand categorical variables, their difference from numerical variables, and how graph them

2. Understand how to evaluate batting performance in the game of baseball to help better understand various hitting metrics

3. Understand the SELECT statement syntax in SQL using the Lahman baseball database

An Introduction to Data, Categorical Data

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 2/12

Exploring Categorical Data

Exploring Categorical Data

Source: https://youtu.be/7NhNeADL8fA (https://youtu.be/7NhNeADL8fA)

Exploring relationships between categorical variables

Exploring relationships between categorical variables

Source: https://youtu.be/zLHunbpH5Hg (https://youtu.be/zLHunbpH5Hg)

Hitting Metrics in Baseball

The Problem with Batting A verage

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 3/12

BUXSABERT215-V016300

Source: https://youtu.be/sOu4y7 21×4 (https://youtu.be/sOu4y7 21×4)

Test Yourself

What is the formula for batting average

Choose the best answer.

BA = H/PA

BA = (H+B)/PA

BA = H/AB

BA = (H+B)/AB

Are SF (sacrifice flies) part of the formula for AB (at bats)

Yes

No

Are SF (sacrifice flies) part of the formula for PA (at bats)

Yes

No

Are BB (walks) part of the formula for AB (at bats)

Yes

No

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 4/12

Are BB (walks) part of the formula for PA (at bats)

Yes

No

Hitting Metrics in Baseball (continued)

Relationship with T eam Runs Scored

BUXSABERT215-V016500

Source: https://youtu.be/9cFbMomu_yw (https://youtu.be/9cFbMomu_yw)

Test Yourself

Suppose that the coeficient of determination the R2 value between Team Doubles and Team Runs

Scored is R2=.31. Which of the following choices is equivalent to this statement

Choose the best answer.

31% of the time, a double will lead to a run being scored.

31% of doubles drive in at least one run.

31% of variation in Team Runs Scored can be explained by the number of doubles a team hits.

On average, for every 31 doubles a team hits, they will score 1 additional run.

Which of the following relationships between two variables would most likely have the smallest

correlation (R2 value closest to 0)

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 5/12

The relationship between a player’s foot speed in Miles/Hour and that same player’s number of Stolen

Bases in a season.

The number of letters in the home plate umpires first name and the temperature in degrees Fahrenheit

at the start of the game.

The Batting Average of a player and the On Base Percentage of the same player in that same

season.

The relationship between Team Home Runs in a season and Team Runs Scored in that same season.

Better Hitting Metrics

BUXSABERT215-V016400

Source: https://youtu.be/eJB6HVWDeXE (https://youtu.be/eJB6HVWDeXE)

Select all of the following statements that are TRUE:

Runs Created has the highest R2 value shown, and does the best job of explaining the variation in

Team Runs Scored of all metrics mentioned in the last lecture.

Batting Average is an entirely useless statistic, because it does not account for several important

parts of the game, such as BBs and SFs.

Though On Base Percentage and Slugging Percentage have dif ferent denominators and scale

differently, OPS has one of the highest correlations with Team Runs Scored of the metrics shown in

the videos.

Team Runs Scored will have an R2 value of 1 when related to itself.

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 6/12

Hitting Metrics in Baseball (continued)

Linear Weights Reading

To continue our exploration of ”Better Hitting Metrics,” we will now turn to various ”Linear Weights” systems. Fangraphs has a great introduction to

the idea behind Linear Weights (http://www.fangraphs.com/library/principles/linear weights/) . The basic idea behind using Linear Weights, instead of a

statistic like OPS, is that On Base Percentage and Slugging Percentage (and thus OPS) give improper values to each of fensive event that is

possible in baseball. For instance, OBP gives an equal value (1.000) to a Walk as it does to a Homerun. This is clearly not reflective of their

contribution towards scoring runs. Slugging Percentage, in a similar vein, gives each Double twice the value of a Single, when in reality , a Double is

worth more than a Single, but not twice as much. How do we solve this problem of ”value”, of how much weight to give each of fensive event

To understand the framework behind Linear Weights we must first look at a couple of Run Expectancy tables

(http://www.fangraphs.com/library/misc/re24/) . The whole article is valuable and may explain the concept better than is done here but the most

important aspect is the Run Expectancy Chart located in the ”Calculation:” section. That chart explains how many runs are to be expected given a

ny possible setup in a baseball inning. These setups are called ”Base/Out” states, and range from nobody on base with 0 outs, to bases loaded

with 2 outs. Because there are 8 ”Base” states (None on, Man on First, Man on Second ), and 3 ”Outs” states (0 Outs, 1 Out, 2 Out), we have

24 (8*3) possible Base/Out scenarios. This is where the ”24″ in RE24 comes from. The important part of this Base/Out concept is that each

Base/Out state is easily translated to an aforementioned Run Expectancy. These specific values are seen in the chart on the linked Fangraphs

page. They are derived using baseball history; for example, there have been so many situations in baseball history where there was a man on first

and 1 out. By finding the average number of runs that score in an inning with this Base/Out state, the Run Expectancy for that state can be found.

For this example, the Run Expectancy is .489 runs.

How does this concept of Run Expectancy help us create a value metric The method is rather simple, yet ef fective. For a good introduction to

RE24 (http://joeposnanski.com/stat of the day re24/) by a great baseball writer. We just subtract the Run Expectancy value of the Base/Out state

before a player has batted from the RE value of the base/out state after his at bat. If the Run Expectancy is higher afterwards, he has contributed

to his team in a positive way, and his RE24 value will be positive for that at bat. The opposite is true as well. Let s say a player comes to bat with a

runner on 1st and 1 out. We have already seen that this situation is worth .489 runs. If the batter doubles, and the team now has runners on 2nd

and 3rd with 1 out, the team is now expected to score 1.352 runs. The player who hits that double is credited with 1.352 .489 = .863 runs. If that

batter had come to bat in the exact same situation and hit a double, but the runner from first scores on the play, the situation is slightly different.

Now, the team has scored a run (trivially worth 1 run) and has a runner on 2nd with 1 out (worth .644 runs), so we credit the batter with 1 + .644

.489 = 1.155 runs. Pretty cool!

Now, if we were to add up all these values for a season for each player, we would have their RE24 value for that year . Here (http://www.baseballreference.com/leaders/re24_bat_top_ten.shtml)

are the leaders in RE24 for each season since 1940. However , RE24 is a very context dependent

statistic. If two different players have the exact same number of Hits, Doubles, Triples, Homeruns, Batting Average, and so on in a given year, they

are still likely to have diferent RE24 s, solely because they came to bat in different Base/Out states beyond their control. So now, we will look at a

Linear Weights system that is not reliant on the situations in which a player comes to the plate a context independent statistic wOBA.

wOBA (http://www.fangraphs.com/library/of fense/woba/) is just one example of using Linear Weights to measure offensive value, but it’s a damn good

one. What wOBA does is use multipliers (http://www.fangraphs.com/guts.aspx type=cn) for every offensive event. How do we get these multipliers, or

wOBA constants They are very related to the calculations behind RE24. However , instead of using the situational Run Expectancy change for

each Plate Appearance by a batter, we take the average Run Expectancy change for whatever of fensive event occurs during that Plate Appearance.

The value for each event is the average Run Expectancy change of that event; therefore, a Single with the bases loaded is weighted the exact

same as a Single with nobody on base. These two singles would have dif ferent values when using RE24, but using wOBA, they are equal events.

By looking at thousands and thousands of each event in a given season (for example, Singles in 2010), we can find the average RE change that a

Single gives a team, and that will be the wOBA constant for a Single in that season (in 2010, the wOBA weight for a Single was .895 the average

Single in 2010 added .895 runs to the teams expected total compared to if an out was made). This is an important note, the baseline for each

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 7/12

event is an out, so every event is compared to an out (meaning an out has a wOBA weight of 0), whereas in RE24, an out would’ve been a net

negative event.

Through this use of weights, wOBA is able to give a much more accurate value to each of fensive event. While OBP and SLG (and therefore OPS)

attempt to account for diferences in value between different ofensive events, their measurement weights are inaccurate in measuring contribution

to Run Scoring capabilities. By using a Linear Weights system such as wOBA, we are able to properly measure the contributions of a player

towards scoring runs, and thus to winning games (as will be seen in a Runs to Wins track in a later Module). wOBA is also scaled to the league

average OBP for every season, making it easier to see who the average, above average, and truly superb hitters are in a given year .

Weighted On Base Average (wOBA) is a rate statistics which attempts to credit a hitter for the value of each outcome (single,

double, etc.) rather than treating all hits or times on base equally. wOBA is on the same scale as On Base Percentage (OBP) and

is a better representation of offensive value than batting average, RBI or OPS. The weights change slightly with the run

environment, but the general formula is:

(displaystyle wOBA = frac{.69 谞uBB + .72 谞HBP + .89 1B + 1.27 谞2B + 1.62 谞3B + 2.10 谞HR}{AB + BB IBB + SF +

HBP})

This is the general wOBA formula, and the constants will differ slightly for each season, but it is a good indicator of the

approximate weights put onto each offensive event.

Pete Palmer Linear W eights formula (displaystyle = .33 BB + .33 HBP + .46 1B + .80 2B + 1.02 3B + 1.40 HR +

.30 SB .60 CS .25 Out )

This formula, put forth by sabermetrician Pete Palmer in his 1984 work The Hidden Game of Baseball, is an early example of a Linear Weights

measurement system. As you can see, there are diferences between his formula and the wOBA formula, but they are clearly from the same

school of thought. Find out what an event is worth in terms of runs, and credit the player for each contribution in an accurate way . In Palmer’s

formula the weights are lower across the board than the wOBA formula; that is because the wOBA formula has a large divisor , while Palmer chose

to include Outs in his formula as a subtraction. Furthermore, Palmer includes base running stats such as Stolen Bases, while wOBA focuses

solely on hitting, and leaves base running measurements for other statistics.

In summary, wOBA is another example of a ”Better Hitting Metric”, and while OPS does a very good job of explaining Run Scoring on a team level

(as shown in the preceding lectures), Linear Weights measurements such as wOBA attempt to put a precise, consistent, and accurate value on

each offensive event and its contribution towards scoring runs

More (Optional) Readings (Some Duplicated From Reading Above)

RE24 (http://www.fangraphs.com/library/misc/re24/)

Linear Weights (http://www.fangraphs.com/library/principles/linear weights/)

Run Expectancy Chart (http://www.baseballprospectus.com/sortable/index.php cid=18191 15)

RE24 Yearly Leaders (http://www.baseball reference.com/leaders/re24_bat_top_ten.shtml)

RE24 Write Up (Posnanski) (http://joeposnanski.com/stat of the day re24/)

Another Valuable Metric: ISO Isolated Slugging (http://www.fangraphs.com/library/of fense/iso/)

Test Yourself

Which of the following are ADVANTAGES of using Linear Weights compared to using OPS (Select

ALL that apply)

Linear Weights metrics assign a value to each offensive event that better reflects their true

contribution to Run Scoring than those used in OBP or SLG

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 8/12

It is easier to calculate quickly than OPS when given a player’s BA, OBP, and SLG

Because the ”Weights” change based on the run scoring environment, a Linear Weights metric such

as wOBA works accurately even in any Run Scoring Environment

Basic SQL Syntax

SQL Select

BUXSABERT215-V017300

Source: https://youtu.be/T7ehDERAICc (https://youtu.be/T7ehDERAICc)

For more information:

1. Basic Syntax (http://www.w3schools.com/sql/sql_syntax.asp)

2. SELECT (http://www.w3schools.com/sql/sql_select.asp)

Note: You can look up playerID by searching baseball reference.com (http://www.baseballreference.com/)

.

Explore the Sandbox:

Write a SQL query that selects playerID, awardID, and yearID from the AwardsPlayers Table. Refer to

the Lahman Database Documentation if necessary. Don’t forget that table names are case sensitive!

Now, modify your previous query, adding aliases that rename the awardID column as AwardName and

the yearID column as Year.

Note: The SQL Sandbox is located in the edX Sabermetrics course. If you have not created an account

in edX, please do so in order for you to access the Sandbox. The link to the edX: SQL Sandbox is

located on the left navigation bar.

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 9/12

Please Share

Want to share an interesting query you wrote Have a question you want to investigate, but need help

with the syntax We expect you’ll use the query Collaboration Forum located in the Class Discussion to

collaborate, learn, and further explore the material in this class! Your Participation here is required, and

will be part of your Discussion Grade. Your own exploration of data (in our case baseball data, but it

applies broadly) is crucial to your development as an analyst.

Basic SQL Syntax – Where

SQL Where

BUXSABERT215-V019000

Source: https://youtu.be/T7ehDERAICc (https://youtu.be/W Sd1qWszFE)

For more information:

1. WHERE (http://www.w3schools.com/sql/sql_where.asp)

2. AND operator (http://www.w3schools.com/sql/sql_and_or .asp)

Explore the Sandbox: WHERE P art 1

Write a SQL Query that selects the playerID, Strikeouts, Walks, and HomeRuns from the Pitching table

for all players in 2014. Make sure to use aliases to match the column names to the ones just listed.

Refer to the Lahman Database Documentation (http://seanlahman.com/files/database/readme2014.txt) if

necessary.

Now modify your previous query, adding a new column called FIP (Fielding Independent Pitching). The

formula for FIP that will we use is below. Note that the Lahman Database records Innings Pitched as:

(displaystyle IPOuts = IP 谞3 )

(displaystyle FIP = frac{(3 Walks +3 Hit By Pitches + 13 Home Runs 2 Strikeouts)}{(frac

{IPOuts}{3}) + 3.132})

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 10/12

Explore the Sandbox: WHERE P art 2

Write a SQL query that selects playerID, yearID, ERA, and Innings_Pitched from the Pitching Table for

all players since (and including) 2010 with an ERA of less than 3.00 and at least 100 Innings Pitched

(300 outs recorded). ERA is included in the database and does not need to be calculated manually.

Recall that the Lahman Database contains only Outs Recorded, and Innings Pitched is equal to Outs

divided by 3. Refer to the Lahman Database Documentation

(http://seanlahman.com/files/database/readme2014.txt) if necessary.

Basic SQL Syntax – Order By & Distinct

SQL Order By

BUXSABERT215-V017700

Source: https://youtu.be/UHxjlokjSl8 (https://youtu.be/UHxjlokjSl8)

For more information:

1. ORDER BY (http://www.w3schools.com/sql/sql_orderby .asp)

SQL Distinct

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 11/12

BUXSABERT215-V017400

Source: https://youtu.be/p1akosQ7jy4 (https://youtu.be/p1akosQ7jy4)

For more information:

1. Distinct (http://www.w3schools.com/sql/sql_distinct.asp)

Explore the Sandbox:

Use DISTINCT to get every unique teamID from the Teams table.

Basic SQL Syntax – GROUP BY & Aggregate F unctions – Part 1

Introduction to Group By

BUXSABERT215-V017200

Source: https://youtu.be/f975qZB0Y5A (https://youtu.be/f975qZB0Y5A)

For more information:

1. GROUP BY (http://www.w3schools.com/sql/sql_orderby .asp)

2. HAVING (http://www.w3schools.com/sql/sql_having.asp)

11/26/2016 Module 3

https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 12/12

3. AGGREGATE FUNCTIONS (http://www.w3schools.com/sql/sql_functions.asp)

Group By for Career Totals

BUXSABERT215-V017600

Source: https://youtu.be/YSsoLLXg0rU (https://youtu.be/YSsoLLXg0rU)

Explore the Sandbox:

Write a query that selects the playerID and Career Home Run totals from the Batting Table for each

player in the Lahman Database. Call this variable Career_HR.

Please Share

Want to share an interesting query you wrote Have a question you want to investigate, but need help

with the syntax We expect you’ll use the query Collaboration Forum located in the Class Discussion to

collaborate, learn, and further explore the material in this class! Your Participation here is required, and

will be part of your Discussion Grade. Your own exploration of data (in our case baseball data, but it

applies broadly) is crucial to your development as an analyst.