The latest release of the baseballr package includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).
The function, ncaa_scrape, requires the user to pass values for three parameters:
school_id: numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters or pitchers
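Because all three values are required, it can help to check them before sending a request. The following is only a minimal sketch: check_ncaa_args is a hypothetical helper, not part of baseballr, and the two accepted type values are assumed from the "batting" and "pitching" calls used in this article.

```r
# Hypothetical helper (not part of baseballr): checks the three inputs
# described above before any request is sent to the NCAA site.
check_ncaa_args <- function(school_id, year, type) {
  # school_id should be a single whole number, e.g. 736
  if (!is.numeric(school_id) || length(school_id) != 1 || school_id %% 1 != 0) {
    stop("school_id must be a single whole number")
  }
  # year should be a single four-digit number, e.g. 2021
  if (!is.numeric(year) || length(year) != 1 ||
      !grepl("^[0-9]{4}$", as.character(year))) {
    stop("year must be a four-digit year")
  }
  # type must be one of the two values used in this article
  type <- match.arg(type, c("batting", "pitching"))
  invisible(list(school_id = school_id, year = year, type = type))
}

# Passes silently and returns the validated values
args <- check_ncaa_args(736, 2021, "batting")
```

A two-digit year or an unsupported type stops with an error instead of reaching the scraper.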
If you want to pull batting statistics for Vanderbilt for the 2021 season, you would use the following:
library(baseballr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
select(year:OBPct)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ───────────────────
#> ℹ Data updated: 2022-04-30 07:17:50 UTC
#> # A tibble: 41 × 12
#> year school conference division Jersey Player Yr Pos GP
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 2021 Vanderbilt SEC 1 51 Bradfi… Fr OF 67
#> 2 2021 Vanderbilt SEC 1 25 Noland… So INF 66
#> 3 2021 Vanderbilt SEC 1 99 Gonzal… So INF 61
#> 4 2021 Vanderbilt SEC 1 9 Young,… So INF 61
#> 5 2021 Vanderbilt SEC 1 12 Keegan… Jr UT 60
#> 6 2021 Vanderbilt SEC 1 8 Thomas… Jr OF 59
#> 7 2021 Vanderbilt SEC 1 5 Rodrig… So C 58
#> 8 2021 Vanderbilt SEC 1 16 Bulger… Fr UT 50
#> 9 2021 Vanderbilt SEC 1 6 Kolwyc… Jr INF 43
#> 10 2021 Vanderbilt SEC 1 19 LaNeve… So OF 37
#> # … with 31 more rows, and 3 more variables: GS <dbl>, BA <dbl>,
#> # OBPct <dbl>
The same can be done for pitching, just by changing the type parameter:
ncaa_scrape(736, 2021, "pitching") %>%
select(year:ERA)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ───────────────────
#> ℹ Data updated: 2022-04-30 07:17:52 UTC
#> # A tibble: 41 × 12
#> year school conference division Jersey Player Yr Pos GP
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 2021 Vanderbilt SEC 1 51 Bradfi… Fr OF 67
#> 2 2021 Vanderbilt SEC 1 25 Noland… So INF 66
#> 3 2021 Vanderbilt SEC 1 99 Gonzal… So INF 61
#> 4 2021 Vanderbilt SEC 1 9 Young,… So INF 61
#> 5 2021 Vanderbilt SEC 1 12 Keegan… Jr UT 60
#> 6 2021 Vanderbilt SEC 1 8 Thomas… Jr OF 59
#> 7 2021 Vanderbilt SEC 1 5 Rodrig… So C 58
#> 8 2021 Vanderbilt SEC 1 16 Bulger… Fr UT 50
#> 9 2021 Vanderbilt SEC 1 6 Kolwyc… Jr INF 43
#> 10 2021 Vanderbilt SEC 1 19 LaNeve… So OF 37
#> # … with 31 more rows, and 3 more variables: App <dbl>, GS <dbl>,
#> # ERA <dbl>
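Since the two calls differ only in the type value, both tables can be collected in one step. This is a rough sketch under stated assumptions: scrape_both_types is a hypothetical helper, not a baseballr function, and it takes the scraping function as an argument so it can be tried without network access.

```r
# Hypothetical helper (not part of baseballr): calls a scraping function
# once per stat type and returns the results in a named list.
# scrape_fn is expected to take (school_id, year, type), like the
# ncaa_scrape calls shown above.
scrape_both_types <- function(school_id, year, scrape_fn) {
  types <- c("batting", "pitching")
  results <- lapply(types, function(type) scrape_fn(school_id, year, type))
  names(results) <- types
  results
}

# Usage with the real scraper (needs network access, so left commented out):
# vandy_2021 <- scrape_both_types(736, 2021, baseballr::ncaa_scrape)
# vandy_2021$pitching
```

Returning a named list keeps the batting and pitching tables separate, which matters because their stat columns differ.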
The function depends on the user knowing the school_id used by the NCAA website. For that reason, I’ve included an ncaa_school_id_lu function so that users can find the school_id they need.
Just pass a string to the function and it will return possible matches based on the school’s name:
ncaa_school_id_lu("Vand")
#> # A tibble: 10 × 6
#> school conference school_id year division conference_id
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Vanderbilt SEC 736 2013 1 911
#> 2 Vanderbilt SEC 736 2014 1 911
#> 3 Vanderbilt SEC 736 2015 1 911
#> 4 Vanderbilt SEC 736 2016 1 911
#> 5 Vanderbilt SEC 736 2017 1 911
#> 6 Vanderbilt SEC 736 2018 1 911
#> 7 Vanderbilt SEC 736 2019 1 911
#> 8 Vanderbilt SEC 736 2020 1 911
#> 9 Vanderbilt SEC 736 2021 1 911
#> 10 Vanderbilt SEC 736 2022 1 911
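The lookup returns one row per school and year, so a single school_id still has to be picked out before it can be passed to ncaa_scrape. One way to do that is sketched below; pick_school_id is a hypothetical helper, not part of baseballr, and it assumes the lookup has the school, school_id, and year columns shown in the output above.

```r
# Hypothetical helper (not part of baseballr): picks the school_id for an
# exact school name and season from a lookup table shaped like the
# ncaa_school_id_lu() output above (columns school, school_id, year).
pick_school_id <- function(lookup, school_name, season) {
  keep <- lookup$school == school_name & lookup$year == season
  ids <- unique(lookup$school_id[keep])
  if (length(ids) != 1) {
    stop("expected exactly one school_id, found ", length(ids))
  }
  ids
}

# A small stand-in for the lookup result, copied from two of the rows above
lookup <- data.frame(
  school = c("Vanderbilt", "Vanderbilt"),
  school_id = c(736, 736),
  year = c(2020, 2021)
)
vandy_id <- pick_school_id(lookup, "Vanderbilt", 2021)

# With baseballr loaded, the real lookup could feed the scraper directly
# (needs network access, so left commented out):
# id <- pick_school_id(ncaa_school_id_lu("Vand"), "Vanderbilt", 2021)
# ncaa_scrape(id, 2021, "batting")
```

Stopping when zero or several ids match guards against a partial name such as "Vand" matching more than one school.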