vignettes/Introduction_DemographixeR.Rmd
Let’s illustrate the usefulness of DemografixeR with a simple example. Say we know the first names of a sample of customers, but useful information about gender, age or nationality is unavailable:
Customers: | Maria | Ben | Claudia | Adam | Hannah | Robert |
It’s common knowledge that names carry a strong sociocultural influence - the popularity of names varies across time and location - and these naming conventions may be good predictors of other useful variables such as gender, age & nationality. Here’s where DemografixeR comes in:
“DemografixeR allows R users to connect directly to the (1) genderize.io API, the (2) agify.io API and the (3) nationalize.io API to obtain the (1) gender, (2) age & (3) nationality of a name in a tidy format.”
DemografixeR deals with the hassle of API pagination, missing values, duplicated names, trimming whitespace and parsing the results into a tidy format, giving the user time to analyze the data instead of tidying it.
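To make the housekeeping above more concrete, here is a minimal base R sketch of trimming whitespace, querying each distinct name only once and mapping the answers back to the original positions. It is not the package's internal code: `fake_lookup()` is a hypothetical stand-in for a remote call, and the input vector is made up for illustration.

```r
# Hypothetical stand-in for a remote lookup; a real call would hit an API.
fake_lookup <- function(unique_names) {
  known <- c(Maria = "female", Ben = "male")
  unname(known[unique_names])  # names that are not known come back as NA
}

raw_names <- c(" Maria", "Ben ", "Maria", NA, "Zzz")

clean <- trimws(raw_names)                # strip surrounding whitespace
to_query <- unique(clean[!is.na(clean)])  # query each distinct name only once
answers <- fake_lookup(to_query)

# Map the per-unique-name answers back onto the original positions.
result <- answers[match(clean, to_query)]
result
```

The result has the same length and order as the input, with `NA` wherever the input was missing or the name was unknown.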
To do so, DemografixeR is built around three main functions, which we will use to predict the key demographic variables of the previous sample of customers, so that we can ‘fix’ the missing demographic information:
API | R function | Estimated variable |
---|---|---|
https://genderize.io | genderize(name) | Gender |
https://agify.io | agify(name) | Age |
https://nationalize.io | nationalize(name) | Nationality |
They all work similarly and can be integrated into many workflows. Using the previous group of customers, we can obtain the following results:
Customers: | Maria | Ben | Claudia | Adam | Hannah | Robert |
Estimated gender: | female | male | female | male | female | male |
Estimated age: | 21 | 48 | 45 | 34 | 27 | 59 |
Estimated nationality: | CY | AU | CL | PL | SL | US |
To see how to get to these results, read on!
The following step is optional: it is only necessary if you plan to estimate gender, age or nationality for more than 1000 different names a day. In that case, you need to obtain an API key from the following link:
To use the API key, save it once with the save_api_key(key) function and you’re all set. All the functions will automatically retrieve the key once saved:
save_api_key(key = "__YOUR_API_KEY__")
Please be careful when dealing with secrets/tokens/credentials and do not share them publicly. If you wish to check which API key you’ve saved, retrieve it with the get_api_key() function. To fully remove the saved key, use the remove_api_key() function.
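As a rough illustration of the save / retrieve / remove cycle, the sketch below keeps a key in an environment variable using base R only. This is an assumption made for illustration and not necessarily how save_api_key(), get_api_key() and remove_api_key() are implemented; the variable name `MY_DEMO_API_KEY` is hypothetical.

```r
# Hypothetical variable name, chosen only for this sketch.
key_var <- "MY_DEMO_API_KEY"

# Save: store the key for the current R session.
Sys.setenv(MY_DEMO_API_KEY = "__YOUR_API_KEY__")

# Retrieve: returns NA if the variable is not set.
stored_key <- Sys.getenv(key_var, unset = NA)

# Remove: delete the variable again.
Sys.unsetenv(key_var)
key_after_removal <- Sys.getenv(key_var, unset = NA)
```

Keeping the key out of scripts in this way makes it less likely to end up in shared code or version control.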
We start by predicting the gender of our customers. For this we use the genderize(name) function:
customers_names <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")
customers_predicted_gender <- genderize(name = customers_names)
customers_predicted_gender # Print results
#> [1] "female" "male" "female" "male" "female" "male"
We see that genderize(name) returns the estimated gender for each name as a character vector:
class(customers_predicted_gender)
#> [1] "character"
It is also possible to obtain a detailed data.frame object with additional information by setting simplify = FALSE. DemografixeR also works with ‘pipes’:
gender_df <- genderize(name = customers_names, simplify = FALSE)
customers_names %>%
  genderize(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
name | type | gender | probability | count |
---|---|---|---|---|
Maria | gender | female | 0.98 | 334287 |
Ben | gender | male | 0.95 | 77991 |
Claudia | gender | female | 0.98 | 118604 |
Adam | gender | male | 0.98 | 116396 |
Hannah | gender | female | 0.97 | 13198 |
Robert | gender | male | 0.99 | 177418 |
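The probability and count columns make it possible to keep only the predictions you trust. The sketch below rebuilds the table above by hand so that it runs without an API call, and then filters it; the two thresholds are arbitrary values chosen for illustration, not recommendations.

```r
# Values copied from the table above; in practice this data.frame would be
# the result of genderize(name = customers_names, simplify = FALSE).
gender_df <- data.frame(
  name        = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"),
  type        = "gender",
  gender      = c("female", "male", "female", "male", "female", "male"),
  probability = c(0.98, 0.95, 0.98, 0.98, 0.97, 0.99),
  count       = c(334287, 77991, 118604, 116396, 13198, 177418),
  stringsAsFactors = FALSE
)

min_probability <- 0.97   # arbitrary threshold for this sketch
min_count       <- 50000  # arbitrary threshold for this sketch

# Keep rows that are both high-probability and backed by many observations.
confident <- gender_df[gender_df$probability >= min_probability &
                         gender_df$count >= min_count, ]
confident$name
```

With these thresholds, Ben is dropped because of his probability and Hannah because of her count.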
We continue with the age estimation of our customers. As with the genderize(name) function, the simplify parameter also works with the agify(name) function to retrieve a data.frame:
customers_predicted_age <- agify(name = customers_names, simplify = FALSE)
customers_names %>%
  agify(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
name | type | age | count |
---|---|---|---|
Maria | age | 21 | 517258 |
Ben | age | 48 | 75632 |
Claudia | age | 45 | 110105 |
Adam | age | 34 | 110754 |
Hannah | age | 27 | 12843 |
Robert | age | 59 | 160915 |
Last but not least, we finish with the nationality estimation. Just as with the genderize(name) and agify(name) functions, the simplify parameter also works with the nationalize(name) function to retrieve a data.frame:
customers_predicted_nationality <- nationalize(name = customers_names, simplify = FALSE)
customers_names %>%
  nationalize(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
name | type | country_id | probability |
---|---|---|---|
Maria | nationality | CY | 0.0550798 |
Ben | nationality | AU | 0.0665534 |
Claudia | nationality | CL | 0.0559340 |
Adam | nationality | PL | 0.0905836 |
Hannah | nationality | SL | 0.2673254 |
Robert | nationality | US | 0.0909442 |
country_id parameter
Responses will in many cases be more accurate if the data is narrowed to a specific country. Luckily, both the genderize(name) and agify(name) functions support passing a country code parameter (following the common ISO 3166-1 alpha-2 country code convention). For obvious reasons, the nationalize(name) function does not:
us_customers_predicted_gender <- genderize(name = customers_names, country_id = "US")
us_customers_predicted_gender
#> [1] "female" "male" "female" "male" "female" "male"
us_customers_predicted_age <- agify(name = customers_names, country_id = "US")
us_customers_predicted_age
#> [1] NA 67 69 65 54 70
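Note that the US-specific age for the first name is NA. One possible way to handle this, sketched here with the numbers already shown in this vignette, is to fall back to the estimate obtained without a country wherever the country-specific one is missing. Whether such a fallback is appropriate depends on your analysis; the vectors are typed in by hand so that the example runs without an API call.

```r
# Ages with country_id = "US", as printed above.
us_age <- c(NA, 67, 69, 65, 54, 70)

# Ages without a country restriction, from the introductory table.
global_age <- c(21, 48, 45, 34, 27, 59)

# Use the country-specific estimate where available, otherwise the global one.
combined_age <- ifelse(is.na(us_age), global_age, us_age)
combined_age
```

Only the first element changes; all other names keep their US-specific estimate.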
To obtain a data.frame of all supported countries, use the supported_countries(type) function. Here’s an example of 5 countries:
supported_countries(type = "genderize") %>% head(5) %>% knitr::kable(row.names = FALSE)
country_id | name | total |
---|---|---|
AD | Andorra | 29783 |
AE | United Arab Emirates | 145847 |
AF | Afghanistan | 23531 |
AG | Antigua and Barbuda | 1723 |
AI | Anguilla | 1081 |
In this case the total column reflects the number of observations the API has for each country. The beauty of the country_id parameter is that it accepts either a single character string or a character vector with the same length as the name parameter. An example illustrates this better:
agify(name = c("Hannah", "Ben"), country_id = c("US", "GB"), simplify = FALSE) %>% knitr::kable(row.names = FALSE)
name | type | age | count | country_id |
---|---|---|---|---|
Hannah | age | 54 | 67 | US |
Ben | age | 38 | 1980 | GB |
In the previous example we passed two names - Hannah & Ben - and two country codes - US & GB. The functions are thus vectorized over both arguments, which is especially useful for workflows where we are using a data.frame with one variable containing names and another containing country codes.
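The data.frame workflow described above could look roughly like the sketch below. To keep it runnable offline, `fake_agify()` is a hypothetical stand-in that returns the two ages from the table above and mimics the length rule for country_id (a single value or one value per name); in a real session you would call agify() instead.

```r
customers <- data.frame(
  name    = c("Hannah", "Ben"),
  country = c("US", "GB"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for agify(); it only knows the two pairs shown above.
fake_agify <- function(name, country_id) {
  # country_id must be a single value or have one value per name.
  stopifnot(length(country_id) %in% c(1L, length(name)))
  country_id <- rep_len(country_id, length(name))
  lookup <- c("Hannah|US" = 54, "Ben|GB" = 38)
  unname(lookup[paste(name, country_id, sep = "|")])
}

# Names and country codes are paired element by element.
customers$age <- fake_agify(customers$name, customers$country)
customers
```

Because the pairing is positional, the row order of the data.frame is preserved and no join is needed afterwards.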
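meta parameter
All three functions have a parameter called meta, which returns information about the API itself, such as the rate limit, the number of remaining requests, the rate limit reset and the timestamp of the request.
Here’s an example: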
name | type | gender | probability | count | api_rate_limit | api_rate_remaining | api_rate_reset | api_request_timestamp |
---|---|---|---|---|---|---|---|---|
Hannah | gender | female | 0.97 | 13198 | 1000 | 977 | 7218 | 2020-05-05 21:59:42 |
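One practical use of these columns is to check the remaining quota before sending a larger batch. The sketch below copies the relevant values from the row above into a small data.frame; it assumes that such a row comes from a call with meta switched on (for example meta = TRUE, which is an assumption about the argument's type) and that each distinct name counts once against the limit. The batch size is hypothetical.

```r
# Values copied from the example row above.
meta_row <- data.frame(
  name               = "Hannah",
  api_rate_limit     = 1000,
  api_rate_remaining = 977,
  stringsAsFactors   = FALSE
)

names_to_send <- 1200  # hypothetical number of distinct names still to query

# Requests already used in the current window.
used_so_far <- meta_row$api_rate_limit - meta_row$api_rate_remaining

# Would the whole batch fit into what is left of the quota?
fits_in_quota <- names_to_send <= meta_row$api_rate_remaining
```

Here 23 requests have been used and the batch does not fit, so it would have to be split, postponed or run with an API key.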
sliced parameter
The nationalize(name) function has the useful sliced parameter. Names can have multiple estimated nationalities, and the nationalize(name) function automatically ranks them by probability. This logical parameter ‘slices’ the results, keeping only the value with the highest probability so that there is a single estimate for each name (one country per name). It is set to TRUE by default. But you may wish to see all the potential countries a name can be associated with. For this, simply set the parameter to FALSE:
nationalize(name = "Matthias", simplify = FALSE, sliced = FALSE) %>%
  knitr::kable(row.names = FALSE)
name | type | country_id | probability |
---|---|---|---|
Matthias | nationality | DE | 0.4161638 |
Matthias | nationality | AT | 0.2650625 |
Matthias | nationality | CH | 0.1106922 |
In the last example you see that instead of returning a single country code, it returns multiple country codes with their associated probability.
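To show what ‘slicing’ means in terms of the data, the following base R sketch reduces the unsliced result above to one row per name by keeping the highest probability. It illustrates the idea only and is not necessarily how nationalize() does it internally; the data.frame is typed in from the table above.

```r
# Unsliced result, copied from the table above.
nat_df <- data.frame(
  name        = "Matthias",
  type        = "nationality",
  country_id  = c("DE", "AT", "CH"),
  probability = c(0.4161638, 0.2650625, 0.1106922),
  stringsAsFactors = FALSE
)

# Sort by name, then by descending probability ...
ordered_df <- nat_df[order(nat_df$name, -nat_df$probability), ]

# ... and keep only the first (most probable) row for each name.
top_per_name <- ordered_df[!duplicated(ordered_df$name), ]
top_per_name$country_id
```

The same two steps work unchanged when the data.frame contains several names.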
Let’s replicate the initial example with our group of customers. Voilà!
library(dplyr)
df <- data.frame("Customers:" = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"),
                 stringsAsFactors = FALSE, check.names = FALSE)
df <- df %>%
  mutate(`Estimated gender:` = genderize(`Customers:`),
         `Estimated age:` = agify(`Customers:`),
         `Estimated nationality:` = nationalize(`Customers:`))
df %>% t() %>% knitr::kable(col.names = NULL)
Customers: | Maria | Ben | Claudia | Adam | Hannah | Robert |
Estimated gender: | female | male | female | male | female | male |
Estimated age: | 21 | 48 | 45 | 34 | 27 | 59 |
Estimated nationality: | CY | AU | CL | PL | SL | US |
For more information, see the package documentation at https://matbmeijer.github.io/DemografixeR.