Guessing Gender from a Chinese Name with Bayes
A small idea for guessing gender from a Chinese name using Bayes.
published Apr 12, 2017 · tags: #javascript #experiments
A couple of years ago I saw someone do this in Python and figured I’d do a JS port.
The mechanic is Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
What each term means:
- P(A|B): the conditional probability of A given B; A’s posterior probability given B’s value.
- P(B|A): the conditional probability of B given A.
- P(A): A’s prior (or marginal) probability. “Prior” because it ignores anything from B.
- P(B): B’s prior probability.
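Applied to names, A becomes the gender and B the characters of the given name. As a sketch, writing g for the gender and c_1 … c_n for the characters (notation introduced here, not from the original), and assuming the characters are independent given the gender, which is what a per-character product amounts to:

$$P(g \mid c_1 \dots c_n) = \frac{P(g)\prod_{i=1}^{n} P(c_i \mid g)}{P(c_1 \dots c_n)} \propto P(g)\prod_{i=1}^{n} P(c_i \mid g)$$

The denominator is the same for both genders, so comparing the two numerators is enough to pick the likelier one.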
The Python reference:
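```python
def prob_for_gender(self, firstName, gender=0):
    # Prior P(gender): index 0 is female, 1 is male.
    p = 1. * self.female_total / self.total \
        if gender == 0 \
        else 1. * self.male_total / self.total
    # Multiply in each character's frequency entry for that gender;
    # a character missing from self.freq contributes 0.
    for char in firstName:
        p *= self.freq.get(char, (0, 0))[gender]
    return p
```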
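To try the method outside its class, here is a minimal self-contained sketch. The `GenderGuesser` class, the `guess` helper, and every number in it are hypothetical, invented for illustration; only `prob_for_gender` mirrors the reference. It assumes `freq` maps each character to a (female, male) pair of relative frequencies, and it normalises the two scores so they sum to 1.

```python
class GenderGuesser:
    """Toy wrapper around the reference method; every number here is made up."""

    def __init__(self, female_total, male_total, freq):
        self.female_total = female_total
        self.male_total = male_total
        self.total = female_total + male_total
        # Assumed shape: char -> (P(char | female), P(char | male))
        self.freq = freq

    def prob_for_gender(self, first_name, gender=0):
        # Prior for the gender (0 = female, 1 = male), as in the reference.
        p = (self.female_total if gender == 0 else self.male_total) / self.total
        for char in first_name:
            p *= self.freq.get(char, (0, 0))[gender]
        return p

    def guess(self, first_name):
        """Return (P(female | name), P(male | name)), or None if both scores are 0."""
        pf = self.prob_for_gender(first_name, 0)
        pm = self.prob_for_gender(first_name, 1)
        if pf + pm == 0:
            return None  # some character was never seen for either gender
        return pf / (pf + pm), pm / (pf + pm)


# Hypothetical frequencies, not taken from any real dataset.
guesser = GenderGuesser(
    female_total=50,
    male_total=50,
    freq={"璇": (0.04, 0.01), "刚": (0.01, 0.05)},
)
print(guesser.guess("璇"))
print(guesser.guess("未"))  # unseen character, so both scores are 0
```

The `None` case shows the weak spot of the plain product: a single character missing from the table wipes out the score for both genders.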
The per-character product has obvious holes: it treats every character as independent evidence and ignores how characters combine. “翁胜男” has three masculine characters but reads as a woman’s name overall; “刘璇” uses a feminine character but the full name lands closer to neutral. To make this work I’d need to:
- Build a name → gender dataset and compute character- and word-level frequencies.
- Apply Bayes at the whole-name level, with a character-level fallback.
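The two steps above could be sketched roughly as follows. This is an illustration, not a tested design: the `train` and `guess` functions, the add-one smoothing in the fallback, and the four-row dataset are all assumptions made up for this example. It counts whole given names and individual characters per gender, answers from the whole-name counts when the name has been seen, and otherwise falls back to a per-character product.

```python
from collections import Counter

FEMALE, MALE = 0, 1


def train(rows):
    """rows: iterable of (given_name, gender) pairs; returns the count tables."""
    name_counts = (Counter(), Counter())   # whole-name counts per gender
    char_counts = (Counter(), Counter())   # character counts per gender
    totals = [0, 0]                        # names seen per gender
    for name, gender in rows:
        name_counts[gender][name] += 1
        char_counts[gender].update(name)
        totals[gender] += 1
    return name_counts, char_counts, totals


def guess(model, name):
    """Return P(female | name): whole-name counts first, characters as fallback."""
    name_counts, char_counts, totals = model
    f, m = name_counts[FEMALE][name], name_counts[MALE][name]
    if f + m > 0:
        # Whole-name level: with raw counts, Bayes reduces to f / (f + m).
        return f / (f + m)
    # Character-level fallback with add-one smoothing (an assumption of this
    # sketch), so an unseen character does not zero out the product.
    vocab = len(set(char_counts[FEMALE]) | set(char_counts[MALE]))
    scores = []
    for g in (FEMALE, MALE):
        p = totals[g] / sum(totals)
        n_chars = sum(char_counts[g].values())
        for ch in name:
            p *= (char_counts[g][ch] + 1) / (n_chars + vocab + 1)
        scores.append(p)
    return scores[FEMALE] / sum(scores)


# Made-up rows for illustration only.
model = train([("胜男", FEMALE), ("璇", FEMALE), ("建国", MALE), ("志刚", MALE)])
print(guess(model, "胜男"))  # seen as a whole name
print(guess(model, "璇璇"))  # unseen name, falls back to characters
```

Looking up the whole name first is what lets a name like “胜男” override its individual characters, provided the dataset contains it.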
Parked here for now.