Use string as random seed in R

Mehrad Mahmoudian published on
4 min, 779 words

Categories: Programming

Tags: R

Abstract

A quick and easy way to use a string as the random number generator seed in the R language.

A little bit of background

Using a random seed is an essential part of reproducible research, but because results fluctuate with the seed, some people go seed hunting for the "best" results (e.g. accuracy in Machine Learning). For this reason, and to make it easier to tell honest research from seed hunting, people usually use random seeds like 1 or 12345. The latter has been my random seed of choice in all my projects and research since 2016. I was once accused of having tried 1, 12, 123, 1234, and 12345 and then picking the best one! My response was simple: all my papers and my PhD thesis have used 12345 and they are all publicly available, so they can simply go and check :) . But wouldn't it be more convenient, and easier to trust the person and the project, if we were able to use a string (e.g. our first name or full name) as the random seed? Unfortunately, and yet logically, R only accepts an integer as the seed for random number generation. This was my motivation to start looking for a way to use my name as the random seed.

So the first thing we need to do is to find a way to convert a string into an integer. There are many ways to achieve this, and two come to mind:

  1. convert each character to its index. For example "a" would be 1, "b" would be 2, and so on.
  2. convert the string into a number.

The first method is suboptimal if we limit ourselves to c(letters, LETTERS): what if the user puts a space between their first and last name, or uses digits or other characters? Yes, we can expand that vector, but ... isn't that a waste of time?
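To make the limitation concrete, here is a minimal sketch of the first method using only base R. The variable names (alphabet, chars, idx) are mine and purely illustrative; the point is only that any character outside the lookup vector has no index.

```r
# Method 1 (sketch): map each character to its position in a fixed alphabet.
alphabet <- c(letters, LETTERS)

# Split the string into single characters.
chars <- strsplit("Mehrad Mahmoudian", "")[[1]]

# match() returns NA for anything not in the alphabet (here: the space).
idx <- match(chars, alphabet)
idx

# One NA is enough to make the sum NA, so it cannot be used as a seed as-is.
sum(idx)
```

The space at position 7 has no entry in c(letters, LETTERS), so both idx[7] and sum(idx) are NA.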

So let's get to the second method. There are many ways to do this as well. You can perhaps use an instance of CyberChef to play around, use your creativity, and find a way.

For instance, I quickly put together this simple recipe: "mehrad" -> to decimal -> sum, which you can see here. This is a very simple method and it definitely does not create a unique number for every string (i.e. collisions are possible, as with a weak hash), but it is good enough. If you want a practically unique number, which would be definite overkill for such a use case, you can use one of the hashing algorithms, but you would need a third-party package such as openssl.

So here I would just go with the plan above and use only things in the base package.

Note: I will use the base pipe (|>, available since R 4.1.0) in the following code blocks because it makes it very easy to add steps to the code without pressing Home and End multiple times, but you don't have to use the pipe and you can use the "normal" R syntax if you prefer. If you want to learn more about pipes, watch John Mount's video or read David Selby's blog post.

The first step is to convert our string to integers:

utf8ToInt("mehrad")
[1] 109 101 104 114  97 100
utf8ToInt("mehrad") |> length()
[1] 6

We now have one integer per character, but set.seed() takes a single integer, so we should somehow turn them into one number. There are many things we can do, but perhaps the simplest is to add them up :) So let's do that:

utf8ToInt("mehrad") |> sum()
[1] 625
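As mentioned earlier, summing does not give a unique number for every string. A quick way to see this is that addition ignores order, so any rearrangement of the same characters lands on the same seed. The string "dahrem" below is just a made-up anagram for illustration.

```r
# Summing code points ignores character order, so anagrams collide.
seed_a <- sum(utf8ToInt("mehrad"))
seed_b <- sum(utf8ToInt("dahrem"))  # same letters, different order

seed_a == seed_b  # TRUE: both are 625
```

For the purpose here (a transparent, easy-to-verify seed) this is harmless, but it is worth knowing that two different names can map to the same seed.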

Now that we have a single integer, we can use it as the seed:

utf8ToInt("mehrad") |> sum() |> set.seed()
# or in classic R syntax
set.seed(sum(utf8ToInt("mehrad")))
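If you use this often, it may be convenient to wrap it in a small function. The sketch below is one possible way to do it; the name set_seed_from_string is hypothetical (it is not part of base R or any package), and the input check is my own addition.

```r
# Hypothetical helper: derive an integer seed from a single string and set it.
set_seed_from_string <- function(x) {
  # Accept exactly one non-missing character string.
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))

  # Sum of the Unicode code points, same as in the steps above.
  seed <- sum(utf8ToInt(x))
  set.seed(seed)

  # Return the seed invisibly so it can be logged or reported if needed.
  invisible(seed)
}

used_seed <- set_seed_from_string("mehrad")
used_seed  # 625
```

Returning the integer makes it easy to report the actual seed alongside the string it came from.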

And as always it is good to test if our method really works:

set.seed(sum(utf8ToInt("mehrad"))) ; rnorm(10)
set.seed(sum(utf8ToInt("mehrad"))) ; rnorm(10)
set.seed(sum(utf8ToInt("mehrad"))) ; rnorm(10)
[1] -0.3949023  0.6945254 -0.2651758 -0.4293754  0.2215511 -1.0237239  0.4103700  0.6291080  0.4894505 -1.7841721
[1] -0.3949023  0.6945254 -0.2651758 -0.4293754  0.2215511 -1.0237239  0.4103700  0.6291080  0.4894505 -1.7841721
[1] -0.3949023  0.6945254 -0.2651758 -0.4293754  0.2215511 -1.0237239  0.4103700  0.6291080  0.4894505 -1.7841721
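Instead of comparing the printed numbers by eye, the same check can be done programmatically with identical(). This is only a sketch; draw_with_seed is a hypothetical helper name introduced for this example.

```r
# Hypothetical helper: seed from a string, then draw n normal random numbers.
draw_with_seed <- function(s, n = 10) {
  set.seed(sum(utf8ToInt(s)))
  rnorm(n)
}

first  <- draw_with_seed("mehrad")
second <- draw_with_seed("mehrad")

# The same string gives the same seed, hence exactly the same draws.
identical(first, second)  # TRUE
```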

And if we change the string we get another set of random numbers:

# capitalize my name
set.seed(sum(utf8ToInt("Mehrad"))) ; rnorm(10)

# using full name
set.seed(sum(utf8ToInt("Mehrad Mahmoudian"))) ; rnorm(10)
 [1]  1.3306590  1.0458094 -0.9870942  0.6997714 -0.7041404 -0.5791050 -2.1145910  1.3888766  0.8057116  0.1516002
 [1] -0.54792314 -0.01842653 -0.28491016  1.15766660  0.60054169 -0.45916757  0.66566250 -0.11515557  0.64156640 -0.20554542