
Guessing Gender from a Chinese Name with Bayes

A small idea for guessing gender from a Chinese name using Bayes.

published Apr 12, 2017 tags #javascript #experiments



A couple of years ago I saw someone do this in Python and figured I’d do a JS port.

The underlying mechanism is Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

What each term means:

  • P(A|B) — the conditional probability of A given B; A’s posterior probability given B’s value.
  • P(B|A) — the conditional probability of B given A.
  • P(A) — A’s prior (or marginal) probability. “Prior” because it ignores anything from B.
  • P(B) — B’s prior probability.
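
To make the terms concrete, here is a small worked example in Python. Take A as "the name belongs to a woman" and B as "the name contains a particular character". The counts are invented purely for illustration and do not come from any real dataset; the `posterior` helper is a hypothetical name, not part of the original code.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical counts, made up for this example only.
female_total, male_total = 400, 600        # names per gender
female_with_char, male_with_char = 40, 6   # names containing the character

total = female_total + male_total
p_female = female_total / total                        # P(A)
p_char_given_female = female_with_char / female_total  # P(B|A)
p_char = (female_with_char + male_with_char) / total   # P(B)

# P(A|B): probability the name is a woman's, given that it contains the character.
p_female_given_char = posterior(p_char_given_female, p_female, p_char)
print(round(p_female_given_char, 3))
```

With these numbers the posterior works out to 40 / 46, which is just the share of names containing the character that belong to women. The theorem arrives at the same figure by going through the prior and the conditional.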

The Python reference:

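def prob_for_gender(self, firstName, gender=0):
    # gender is an index: 0 = female, 1 = male.
    # Start from the prior P(gender).
    p = 1. * self.female_total / self.total \
        if gender == 0 \
        else 1. * self.male_total / self.total

    # Multiply in each character's per-gender frequency.
    # A character missing from self.freq falls back to (0, 0),
    # which zeroes the whole product for both genders.
    for char in firstName:
        p *= self.freq.get(char, (0, 0))[gender]

    return p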
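
The reference returns an unnormalized score for one gender, and any character missing from `freq` drives that score to zero. Below is a minimal sketch of one way to round this out, not the original author's code. It assumes training data arrives as `(given_name, gender)` pairs with 0 = female and 1 = male, that both genders appear in the data, and it swaps in add-alpha smoothing and log-space sums, neither of which is in the reference. The class name `NameGenderGuesser` and the toy samples are hypothetical.

```python
import math
from collections import defaultdict


class NameGenderGuesser:
    """Hypothetical naive Bayes guesser; index 0 = female, 1 = male."""

    def __init__(self, samples, alpha=1.0):
        # samples: iterable of (given_name, gender) pairs (an assumed format).
        self.alpha = alpha
        self.name_totals = [0, 0]   # names seen per gender
        self.char_totals = [0, 0]   # characters seen per gender
        self.counts = defaultdict(lambda: [0, 0])
        for given_name, gender in samples:
            self.name_totals[gender] += 1
            for char in given_name:
                self.counts[char][gender] += 1
                self.char_totals[gender] += 1
        self.vocab_size = len(self.counts)

    def log_score(self, given_name, gender):
        # log P(gender) + sum of log P(char | gender), with add-alpha smoothing
        # so that an unseen character no longer zeroes the result.
        score = math.log(self.name_totals[gender] / sum(self.name_totals))
        denom = self.char_totals[gender] + self.alpha * (self.vocab_size + 1)
        for char in given_name:
            count = self.counts.get(char, (0, 0))[gender]
            score += math.log((count + self.alpha) / denom)
        return score

    def prob_female(self, given_name):
        # Normalize the two scores into a posterior P(female | name).
        f = self.log_score(given_name, 0)
        m = self.log_score(given_name, 1)
        top = max(f, m)  # subtract the max before exp() for numeric stability
        ef, em = math.exp(f - top), math.exp(m - top)
        return ef / (ef + em)


# Toy data, invented for illustration only.
samples = [("丽娜", 0), ("婷", 0), ("秀英", 0),
           ("强", 1), ("建国", 1), ("伟", 1)]
guesser = NameGenderGuesser(samples)
print(guesser.prob_female("丽"))  # leans female on this toy data
print(guesser.prob_female("强"))  # leans male on this toy data
print(guesser.prob_female("鑫"))  # unseen character: stays strictly between 0 and 1
```

Working in log space avoids underflow when many small probabilities are multiplied, and normalizing the two scores turns them into something that can be read as a probability.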

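The per-character product has obvious holes, because it treats each character as independent evidence and ignores how characters combine. “翁胜男” (Weng Shengnan) has three masculine characters but reads as a woman’s name overall; “刘璇” (Liu Xuan) uses a feminine character but the full name lands closer to neutral. To make this work I’d need to: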

  • Build a name → gender dataset and compute character- and word-level frequencies.
  • Apply Bayes at the whole-name level, with a character-level fallback.
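
The two steps above could be sketched roughly as follows. This is an untested idea rather than a finished design: the function names, the `min_seen` threshold, the add-one smoothing in the fallback, and the toy samples are all assumptions made for illustration. It also assumes the training data is non-empty and comes as `(given_name, gender)` pairs.

```python
from collections import Counter

FEMALE, MALE = 0, 1


def train(samples):
    """Count whole given names and single characters per gender."""
    name_counts = {FEMALE: Counter(), MALE: Counter()}
    char_counts = {FEMALE: Counter(), MALE: Counter()}
    for given_name, gender in samples:
        name_counts[gender][given_name] += 1
        char_counts[gender].update(given_name)  # counts each character
    return name_counts, char_counts


def prob_female(given_name, name_counts, char_counts, min_seen=2):
    """Return (P(female | name), which level produced it)."""
    f = name_counts[FEMALE][given_name]
    m = name_counts[MALE][given_name]
    if f + m >= min_seen:
        # Whole-name level: Bayes reduces to count(name, female) / count(name).
        return f / (f + m), "name"

    # Character-level fallback: prior times smoothed per-character likelihoods.
    totals = {g: sum(name_counts[g].values()) for g in (FEMALE, MALE)}
    n = totals[FEMALE] + totals[MALE]
    vocab = len(set(char_counts[FEMALE]) | set(char_counts[MALE]))
    scores = {}
    for g in (FEMALE, MALE):
        score = totals[g] / n
        denom = sum(char_counts[g].values()) + vocab + 1
        for char in given_name:
            score *= (char_counts[g][char] + 1) / denom
        scores[g] = score
    return scores[FEMALE] / (scores[FEMALE] + scores[MALE]), "chars"


# Toy data, invented for illustration only.
samples = [("胜男", FEMALE), ("胜男", FEMALE), ("胜男", FEMALE),
           ("胜利", MALE), ("胜", MALE), ("浩男", MALE)]
name_counts, char_counts = train(samples)

print(prob_female("胜男", name_counts, char_counts))  # answered at the whole-name level
print(prob_female("胜浩", name_counts, char_counts))  # unseen name, falls back to characters
```

On this toy data the whole-name lookup lets “胜男” come out female even though its individual characters also appear in male names, which is the case the per-character product gets wrong.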

Parked here for now.
