# Sabermetrics?and?2017?Predictions

Description
Predict the Most Valuable Player in the National League for 2017 and support your prediction with data
2 references
attached material to include from the lecture

11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 1/12
Module 3
This is a single, concatenated file, suitable for printing or saving as a PDF for of fline viewing. Please
note that some animations or images may not work.
Module 3 Study Guide and Deliver ables
Topic: Introducing SQL and Hitting Sabermetrics
Readings: 1. OpenIntro Statistics (3rd ed.) (https://www.openintro.org/stat/textbook.php
stat_book=os) : Section 1.7 from Introduction to Data: Categorical Data
2. edX: Module 2: Hitting:
Sabermetrics: Hitting Metrics
Tech: Basic SQL syntax
Tech: GROUP BY and Aggregate Functions
Mean, median, and mode (https://www.youtube.com/watch v=zLHunbpH5Hg)
Discussions: Module 3: Discussion 1
Initial Response due 11:59pm ET, Saturday, 19 November 2016
Response to another post due 11:59pm ET, Monday, 21 November 2016
Assignments: Module 3: Assignment 1 due 6am ET, Tuesday, 22 November 2016
Assessment: None
Live
Classrooms:
9pm ET, Tuesday, 15 November 2016
9pm ET, Thursday, 17 November 2016
Study Guide
This module (metis342_m03_00_printable.htm) is also available as a concatenated page, suitable for
printing or saving as a PDF for offline viewing.
Lesson Objectives
1. Understand categorical variables, their difference from numerical variables, and how graph them
2. Understand how to evaluate batting performance in the game of baseball to help better understand various hitting metrics
3. Understand the SELECT statement syntax in SQL using the Lahman baseball database
An Introduction to Data, Categorical Data
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 2/12
Exploring Categorical Data
Exploring Categorical Data
Exploring relationships between categorical variables
Exploring relationships between categorical variables
Source: https://youtu.be/zLHunbpH5Hg (https://youtu.be/zLHunbpH5Hg)
Hitting Metrics in Baseball
The Problem with Batting A verage
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 3/12
BUXSABERT215-V016300
Source: https://youtu.be/sOu4y7 21×4 (https://youtu.be/sOu4y7 21×4)
Test Yourself
What is the formula for batting average
BA = H/PA
BA = (H+B)/PA
BA = H/AB
BA = (H+B)/AB
Are SF (sacrifice flies) part of the formula for AB (at bats)
Yes
No
Are SF (sacrifice flies) part of the formula for PA (at bats)
Yes
No
Are BB (walks) part of the formula for AB (at bats)
Yes
No
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 4/12
Are BB (walks) part of the formula for PA (at bats)
Yes
No
Hitting Metrics in Baseball (continued)
Relationship with T eam Runs Scored
BUXSABERT215-V016500
Source: https://youtu.be/9cFbMomu_yw (https://youtu.be/9cFbMomu_yw)
Test Yourself
Suppose that the coeficient of determination the R2 value between Team Doubles and Team Runs
Scored is R2=.31. Which of the following choices is equivalent to this statement
31% of the time, a double will lead to a run being scored.
31% of doubles drive in at least one run.
31% of variation in Team Runs Scored can be explained by the number of doubles a team hits.
On average, for every 31 doubles a team hits, they will score 1 additional run.
Which of the following relationships between two variables would most likely have the smallest
correlation (R2 value closest to 0)
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 5/12
The relationship between a player’s foot speed in Miles/Hour and that same player’s number of Stolen
Bases in a season.
The number of letters in the home plate umpires first name and the temperature in degrees Fahrenheit
at the start of the game.
The Batting Average of a player and the On Base Percentage of the same player in that same
season.
The relationship between Team Home Runs in a season and Team Runs Scored in that same season.
Better Hitting Metrics
BUXSABERT215-V016400
Source: https://youtu.be/eJB6HVWDeXE (https://youtu.be/eJB6HVWDeXE)
Select all of the following statements that are TRUE:
Runs Created has the highest R2 value shown, and does the best job of explaining the variation in
Team Runs Scored of all metrics mentioned in the last lecture.

Batting Average is an entirely useless statistic, because it does not account for several important
parts of the game, such as BBs and SFs.

Though On Base Percentage and Slugging Percentage have dif ferent denominators and scale
differently, OPS has one of the highest correlations with Team Runs Scored of the metrics shown in
the videos.

Team Runs Scored will have an R2 value of 1 when related to itself.

11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 6/12
Hitting Metrics in Baseball (continued)
To continue our exploration of ”Better Hitting Metrics,” we will now turn to various ”Linear Weights” systems. Fangraphs has a great introduction to
the idea behind Linear Weights (http://www.fangraphs.com/library/principles/linear weights/) . The basic idea behind using Linear Weights, instead of a
statistic like OPS, is that On Base Percentage and Slugging Percentage (and thus OPS) give improper values to each of fensive event that is
possible in baseball. For instance, OBP gives an equal value (1.000) to a Walk as it does to a Homerun. This is clearly not reflective of their
contribution towards scoring runs. Slugging Percentage, in a similar vein, gives each Double twice the value of a Single, when in reality , a Double is
worth more than a Single, but not twice as much. How do we solve this problem of ”value”, of how much weight to give each of fensive event
To understand the framework behind Linear Weights we must first look at a couple of Run Expectancy tables
(http://www.fangraphs.com/library/misc/re24/) . The whole article is valuable and may explain the concept better than is done here but the most
important aspect is the Run Expectancy Chart located in the ”Calculation:” section. That chart explains how many runs are to be expected given a
ny possible setup in a baseball inning. These setups are called ”Base/Out” states, and range from nobody on base with 0 outs, to bases loaded
with 2 outs. Because there are 8 ”Base” states (None on, Man on First, Man on Second ), and 3 ”Outs” states (0 Outs, 1 Out, 2 Out), we have
24 (8*3) possible Base/Out scenarios. This is where the ”24″ in RE24 comes from. The important part of this Base/Out concept is that each
Base/Out state is easily translated to an aforementioned Run Expectancy. These specific values are seen in the chart on the linked Fangraphs
page. They are derived using baseball history; for example, there have been so many situations in baseball history where there was a man on first
and 1 out. By finding the average number of runs that score in an inning with this Base/Out state, the Run Expectancy for that state can be found.
For this example, the Run Expectancy is .489 runs.
How does this concept of Run Expectancy help us create a value metric The method is rather simple, yet ef fective. For a good introduction to
RE24 (http://joeposnanski.com/stat of the day re24/) by a great baseball writer. We just subtract the Run Expectancy value of the Base/Out state
before a player has batted from the RE value of the base/out state after his at bat. If the Run Expectancy is higher afterwards, he has contributed
to his team in a positive way, and his RE24 value will be positive for that at bat. The opposite is true as well. Let s say a player comes to bat with a
runner on 1st and 1 out. We have already seen that this situation is worth .489 runs. If the batter doubles, and the team now has runners on 2nd
and 3rd with 1 out, the team is now expected to score 1.352 runs. The player who hits that double is credited with 1.352 .489 = .863 runs. If that
batter had come to bat in the exact same situation and hit a double, but the runner from first scores on the play, the situation is slightly different.
Now, the team has scored a run (trivially worth 1 run) and has a runner on 2nd with 1 out (worth .644 runs), so we credit the batter with 1 + .644
.489 = 1.155 runs. Pretty cool!
Now, if we were to add up all these values for a season for each player, we would have their RE24 value for that year . Here (http://www.baseballreference.com/leaders/re24_bat_top_ten.shtml)
are the leaders in RE24 for each season since 1940. However , RE24 is a very context dependent
statistic. If two different players have the exact same number of Hits, Doubles, Triples, Homeruns, Batting Average, and so on in a given year, they
are still likely to have diferent RE24 s, solely because they came to bat in different Base/Out states beyond their control. So now, we will look at a
Linear Weights system that is not reliant on the situations in which a player comes to the plate a context independent statistic wOBA.
wOBA (http://www.fangraphs.com/library/of fense/woba/) is just one example of using Linear Weights to measure offensive value, but it’s a damn good
one. What wOBA does is use multipliers (http://www.fangraphs.com/guts.aspx type=cn) for every offensive event. How do we get these multipliers, or
wOBA constants They are very related to the calculations behind RE24. However , instead of using the situational Run Expectancy change for
each Plate Appearance by a batter, we take the average Run Expectancy change for whatever of fensive event occurs during that Plate Appearance.
The value for each event is the average Run Expectancy change of that event; therefore, a Single with the bases loaded is weighted the exact
same as a Single with nobody on base. These two singles would have dif ferent values when using RE24, but using wOBA, they are equal events.
By looking at thousands and thousands of each event in a given season (for example, Singles in 2010), we can find the average RE change that a
Single gives a team, and that will be the wOBA constant for a Single in that season (in 2010, the wOBA weight for a Single was .895 the average
Single in 2010 added .895 runs to the teams expected total compared to if an out was made). This is an important note, the baseline for each
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 7/12
event is an out, so every event is compared to an out (meaning an out has a wOBA weight of 0), whereas in RE24, an out would’ve been a net
negative event.
Through this use of weights, wOBA is able to give a much more accurate value to each of fensive event. While OBP and SLG (and therefore OPS)
attempt to account for diferences in value between different ofensive events, their measurement weights are inaccurate in measuring contribution
to Run Scoring capabilities. By using a Linear Weights system such as wOBA, we are able to properly measure the contributions of a player
towards scoring runs, and thus to winning games (as will be seen in a Runs to Wins track in a later Module). wOBA is also scaled to the league
average OBP for every season, making it easier to see who the average, above average, and truly superb hitters are in a given year .
Weighted On Base Average (wOBA) is a rate statistics which attempts to credit a hitter for the value of each outcome (single,
double, etc.) rather than treating all hits or times on base equally. wOBA is on the same scale as On Base Percentage (OBP) and
is a better representation of offensive value than batting average, RBI or OPS. The weights change slightly with the run
environment, but the general formula is:
(displaystyle wOBA = frac{.69 谞uBB + .72 谞HBP + .89 1B + 1.27 谞2B + 1.62 谞3B + 2.10 谞HR}{AB + BB IBB + SF +
HBP})
This is the general wOBA formula, and the constants will differ slightly for each season, but it is a good indicator of the
approximate weights put onto each offensive event.
Pete Palmer Linear W eights formula (displaystyle = .33 BB + .33 HBP + .46 1B + .80 2B + 1.02 3B + 1.40 HR +
.30 SB .60 CS .25 Out )
This formula, put forth by sabermetrician Pete Palmer in his 1984 work The Hidden Game of Baseball, is an early example of a Linear Weights
measurement system. As you can see, there are diferences between his formula and the wOBA formula, but they are clearly from the same
school of thought. Find out what an event is worth in terms of runs, and credit the player for each contribution in an accurate way . In Palmer’s
formula the weights are lower across the board than the wOBA formula; that is because the wOBA formula has a large divisor , while Palmer chose
to include Outs in his formula as a subtraction. Furthermore, Palmer includes base running stats such as Stolen Bases, while wOBA focuses
solely on hitting, and leaves base running measurements for other statistics.
In summary, wOBA is another example of a ”Better Hitting Metric”, and while OPS does a very good job of explaining Run Scoring on a team level
(as shown in the preceding lectures), Linear Weights measurements such as wOBA attempt to put a precise, consistent, and accurate value on
each offensive event and its contribution towards scoring runs
RE24 (http://www.fangraphs.com/library/misc/re24/)
Linear Weights (http://www.fangraphs.com/library/principles/linear weights/)
Run Expectancy Chart (http://www.baseballprospectus.com/sortable/index.php cid=18191 15)
RE24 Write Up (Posnanski) (http://joeposnanski.com/stat of the day re24/)
Another Valuable Metric: ISO Isolated Slugging (http://www.fangraphs.com/library/of fense/iso/)
Test Yourself
Which of the following are ADVANTAGES of using Linear Weights compared to using OPS (Select
ALL that apply)
Linear Weights metrics assign a value to each offensive event that better reflects their true
contribution to Run Scoring than those used in OBP or SLG

11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 8/12
It is easier to calculate quickly than OPS when given a player’s BA, OBP, and SLG

Because the ”Weights” change based on the run scoring environment, a Linear Weights metric such
as wOBA works accurately even in any Run Scoring Environment

Basic SQL Syntax
SQL Select
BUXSABERT215-V017300
Source: https://youtu.be/T7ehDERAICc (https://youtu.be/T7ehDERAICc)
1. Basic Syntax (http://www.w3schools.com/sql/sql_syntax.asp)
2. SELECT (http://www.w3schools.com/sql/sql_select.asp)
Note: You can look up playerID by searching baseball reference.com (http://www.baseballreference.com/)
.
Explore the Sandbox:
Write a SQL query that selects playerID, awardID, and yearID from the AwardsPlayers Table. Refer to
the Lahman Database Documentation if necessary. Don’t forget that table names are case sensitive!
Now, modify your previous query, adding aliases that rename the awardID column as AwardName and
the yearID column as Year.
Note: The SQL Sandbox is located in the edX Sabermetrics course. If you have not created an account
in edX, please do so in order for you to access the Sandbox. The link to the edX: SQL Sandbox is
located on the left navigation bar.
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 9/12
Want to share an interesting query you wrote Have a question you want to investigate, but need help
with the syntax We expect you’ll use the query Collaboration Forum located in the Class Discussion to
collaborate, learn, and further explore the material in this class! Your Participation here is required, and
will be part of your Discussion Grade. Your own exploration of data (in our case baseball data, but it
Basic SQL Syntax – Where
SQL Where
BUXSABERT215-V019000
Source: https://youtu.be/T7ehDERAICc (https://youtu.be/W Sd1qWszFE)
1. WHERE (http://www.w3schools.com/sql/sql_where.asp)
2. AND operator (http://www.w3schools.com/sql/sql_and_or .asp)
Explore the Sandbox: WHERE P art 1
Write a SQL Query that selects the playerID, Strikeouts, Walks, and HomeRuns from the Pitching table
for all players in 2014. Make sure to use aliases to match the column names to the ones just listed.
Refer to the Lahman Database Documentation (http://seanlahman.com/files/database/readme2014.txt) if
necessary.
Now modify your previous query, adding a new column called FIP (Fielding Independent Pitching). The
formula for FIP that will we use is below. Note that the Lahman Database records Innings Pitched as:
(displaystyle IPOuts = IP 谞3 )
(displaystyle FIP = frac{(3 Walks +3 Hit By Pitches + 13 Home Runs 2 Strikeouts)}{(frac
{IPOuts}{3}) + 3.132})
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 10/12
Explore the Sandbox: WHERE P art 2
Write a SQL query that selects playerID, yearID, ERA, and Innings_Pitched from the Pitching Table for
all players since (and including) 2010 with an ERA of less than 3.00 and at least 100 Innings Pitched
(300 outs recorded). ERA is included in the database and does not need to be calculated manually.
Recall that the Lahman Database contains only Outs Recorded, and Innings Pitched is equal to Outs
divided by 3. Refer to the Lahman Database Documentation
Basic SQL Syntax – Order By & Distinct
SQL Order By
BUXSABERT215-V017700
Source: https://youtu.be/UHxjlokjSl8 (https://youtu.be/UHxjlokjSl8)
1. ORDER BY (http://www.w3schools.com/sql/sql_orderby .asp)
SQL Distinct
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 11/12
BUXSABERT215-V017400
Source: https://youtu.be/p1akosQ7jy4 (https://youtu.be/p1akosQ7jy4)
1. Distinct (http://www.w3schools.com/sql/sql_distinct.asp)
Explore the Sandbox:
Use DISTINCT to get every unique teamID from the Teams table.
Basic SQL Syntax – GROUP BY & Aggregate F unctions – Part 1
Introduction to Group By
BUXSABERT215-V017200
Source: https://youtu.be/f975qZB0Y5A (https://youtu.be/f975qZB0Y5A)
1. GROUP BY (http://www.w3schools.com/sql/sql_orderby .asp)
2. HAVING (http://www.w3schools.com/sql/sql_having.asp)
11/26/2016 Module 3
https://learn.bu.edu/bbcswebdav/pid 4608750 dt content rid 16002752_1/courses/16fallmetis342_o1/module3/metis342_m03_00_printable.htm 12/12
3. AGGREGATE FUNCTIONS (http://www.w3schools.com/sql/sql_functions.asp)
Group By for Career Totals
BUXSABERT215-V017600
Source: https://youtu.be/YSsoLLXg0rU (https://youtu.be/YSsoLLXg0rU)
Explore the Sandbox:
Write a query that selects the playerID and Career Home Run totals from the Batting Table for each
player in the Lahman Database. Call this variable Career_HR.
Want to share an interesting query you wrote Have a question you want to investigate, but need help
with the syntax We expect you’ll use the query Collaboration Forum located in the Class Discussion to
collaborate, learn, and further explore the material in this class! Your Participation here is required, and
will be part of your Discussion Grade. Your own exploration of data (in our case baseball data, but it