004/365
I think the most important part of a project like this is explaining, step by step, how to do it and how to replicate it. The main question here is whether some names are used more for dogs than for people, and to what extent we share names with them. I think this matters because we, as humans, give names to things, to ourselves, and to animals as a way to identify them and us.
Naming is also a way to express the masculinities and femininities of our society, and I think comparing the two could be useful (and also kind of fun). Our names are a manifestation of our identity, so factors such as biological sex and gender come into play, but I will not say much about that here; the floor is open in the comments.
For example, my name is Carlos. Only 0.002% of all the dogs in Catalonia share my name. In the Homo sapiens world, just 0.53% of the inhabitants of Catalonia do. So, are there names that are used mostly for dogs and not for people? Also, how "fluid" are these names? Do both sexes of Canis lupus familiaris share the same names, or is the gender binary of our society also reflected in the dog world?
Let's find out.
First, where do we find this data? For the dogs' names, I made a transparency request to ANICOM, a census register for all the dogs in Catalonia. For people's names, I scraped the Idescat portal to get the 65,000 most frequent names in Catalonia.
Once we have this, it is time to open Python (or R; as we say in Spanish, "se acepta pulpo como animal de compañía", literally "octopus is accepted as a pet").
import polars as pl
pl.Config.set_tbl_rows(50) # Let's configure this to show just 50 rows in our notebook when we print.
raw_dogs = pl.read_csv("path to our dataset")
Okay, the next step is to filter the dataset to keep only dogs (it contains both dogs and cats, but let's stick with dogs for now). We also need the number of rows left after filtering, that is, how many dogs are in the dataset, so we can calculate percentages later.
total = raw_dogs.filter(
    (pl.col("Espècie") == "Gos")
    & pl.col("Nom").is_not_null()
).height
names_freq = (
    raw_dogs
    .filter(
        (pl.col("Espècie") == "Gos")
        & pl.col("Nom").is_not_null()
    )
    # Same filter as above. Then we count how many dogs there are for each sex and name
    .select(["Sexe", "Nom"])
    .group_by(["Sexe", "Nom"]).agg([
        pl.len().alias("count")
    ])
    .with_columns(
        ((pl.col("count") / total) * 100).round(2).alias("pct")
    )  # Share of all dogs for each (sex, name) pair, in percent
).sort("count", descending=True)
dogs = (
    names_freq
    .select(["Nom", "Sexe", "count", "pct"])
    .pivot(index="Nom", on="Sexe", values="count")
    # Dogs with no recorded sex end up in a "null" column after the pivot; we drop it
    .fill_null(0).drop("null")
    .with_columns([
        (pl.col("Femella") + pl.col("Mascle")).alias("total_dogs"),
        # Share of males among the dogs with this name (0 = all female, 1 = all male)
        (pl.col("Mascle") / (pl.col("Femella") + pl.col("Mascle"))).alias("masculinitat_gos")
    ])
    .with_columns([
        ((pl.col("total_dogs") / total) * 100).alias("pct_dogs")
    ])
)
dogs
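To make the two derived columns concrete, here is a minimal sketch of the same arithmetic in plain Python, without polars. The records are invented for illustration and do not come from the ANICOM data, and `masculinity` and `pct_dogs` are hypothetical helper names.

```python
from collections import Counter

# Invented toy records (name, sex); not real registry data.
records = [
    ("LUNA", "Femella"), ("LUNA", "Femella"), ("LUNA", "Mascle"),
    ("ROCKY", "Mascle"), ("ROCKY", "Mascle"),
    ("KIRA", "Femella"),
]

total = len(records)       # every dog counts once in the denominator
counts = Counter(records)  # (name, sex) -> number of dogs


def masculinity(name: str) -> float:
    """Share of male dogs among all dogs with this name (0 = all female, 1 = all male)."""
    males = counts[(name, "Mascle")]
    females = counts[(name, "Femella")]
    return males / (males + females)


def pct_dogs(name: str) -> float:
    """Percentage of all dogs in the toy dataset that carry this name."""
    return (counts[(name, "Mascle")] + counts[(name, "Femella")]) / total * 100


print(masculinity("LUNA"), pct_dogs("LUNA"))
```

A name given only to males gets a masculinity of 1, a name given only to females gets 0, and everything else falls in between.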
Okay, let's do the same cleaning for the people dataset.
raw_p = pl.read_csv(
    "../data/raw_data/noms_catalunya.csv",
    # Read the frequency as text first, so the dots can be stripped before casting
    schema_overrides={"frequencia": pl.Utf8},
).with_columns(
    pl.col("frequencia").str.replace_all(r"\.", "").cast(pl.Int64)
)
total_poblacio_idscat = raw_p["frequencia"].sum()
p = (
    raw_p
    .with_columns([
        # Uppercase the names so they match the "Nom" column of the dogs table
        pl.col("nom").str.to_uppercase().alias("Nom"),
        ((pl.col("frequencia") / total_poblacio_idscat) * 100).alias("pct_person")
    ])
    .select(["Nom", "sexe", "pct_person"])
    .pivot(index="Nom", on="sexe", values="pct_person", aggregate_function="sum")
    .fill_null(0)
    .with_columns([
        (pl.col("H") + pl.col("M")).alias("total_pct_person"),
        # Share of men among the people with this name (0 = all women, 1 = all men)
        (pl.col("H") / (pl.col("H") + pl.col("M"))).alias("masculinitat_person")
    ])
)
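The frequency column needs that extra cleaning step because the code strips dots before casting, which suggests the scraped values use "." as a thousands separator. A minimal sketch of that step in plain Python, with invented sample values and a hypothetical helper called `parse_frequency`:

```python
def parse_frequency(raw: str) -> int:
    """Turn a string such as '12.345' (dot used as thousands separator) into 12345."""
    return int(raw.replace(".", ""))


# Invented sample values; not real Idescat figures.
sample = ["1.234", "56", "1.000.000"]
parsed = [parse_frequency(value) for value in sample]

# Each name's share of the total, in percent (the same idea as pct_person).
total_people = sum(parsed)
shares = [value / total_people * 100 for value in parsed]
```

Without removing the dots first, a value like "1.234" could be read as the decimal number 1.234 instead of one thousand two hundred thirty-four.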
Okay, here comes the fun part: how to visualize this data. Imagine a chart with four quadrants. The top-right quadrant holds the names that are male and doggy, and the top-left one holds the names that are female and doggy; inside each quadrant, both traits are measured on a scale from 0 to 1.
The two bottom quadrants do the same for people’s names. A name right in the middle is used equally for males and females, and for both species (Homo sapiens and Canis lupus).
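The quadrant idea can be sketched as a small function. It assumes the convention used for the coordinates in this post: the horizontal axis runs from female (-1) to male (+1) and the vertical axis from people (-1) to dogs (+1). The function name `quadrant` and its labels are hypothetical, chosen only for illustration.

```python
def quadrant(x: float, y: float) -> str:
    """Describe where a name lands on the chart, given coordinates in [-1, 1]."""
    # Vertical axis: above the middle means the name leans towards dogs.
    species = "dog" if y > 0 else "people" if y < 0 else "both species"
    # Horizontal axis: right of the middle means the name leans male.
    sex = "male" if x > 0 else "female" if x < 0 else "both sexes"
    return f"{sex} / {species}"


print(quadrant(0.8, 0.9))    # top-right quadrant
print(quadrant(-0.5, -0.4))  # bottom-left quadrant
```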
Then, how do we calculate this? First, we have to join the two datasets, which are still separate. How? On the name (Nom), keeping both name columns for the moment. Then we coalesce them: we take the second name column (the one from the people dataset) whenever the first one (from the dogs dataset) is null.
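# Do not fill every null here: the masculinity columns must stay null for the X-axis coalesce
j = dogs.join(p, on="Nom", how="full")
if "Nom_right" in j.columns:
    j = j.with_columns(
        pl.coalesce(["Nom", "Nom_right"]).alias("Nom")
    ).drop("Nom_right")
# Fill only counts and shares with 0 (a missing value means nobody of that species has the name)
j = j.with_columns(
    pl.col(["Femella", "Mascle", "total_dogs", "pct_dogs", "H", "M", "total_pct_person"]).fill_null(0)
)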
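For readers who have not used a full join before, here is a minimal sketch of what it does, written with plain dictionaries instead of polars. The dog figures are invented for illustration; only the 0.53 for CARLOS echoes the number quoted earlier in this post.

```python
# Toy tables keyed by name; the values are percentages.
dogs_pct = {"LUNA": 1.2, "ROCKY": 0.8}        # invented
people_pct = {"LUNA": 0.01, "CARLOS": 0.53}  # LUNA is invented

# A full join keeps every name that appears on either side.
all_names = sorted(set(dogs_pct) | set(people_pct))

joined = [
    {
        "Nom": name,
        # A missing share means nobody of that species has the name, so 0 is a safe default.
        "pct_dogs": dogs_pct.get(name, 0.0),
        "pct_person": people_pct.get(name, 0.0),
    }
    for name in all_names
]

for row in joined:
    print(row)
```

The result has three rows: one name shared by both species, one that only dogs have, and one that only people have.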
Okay, this is the fun part, but also the tricky one. The data as it stands (how masculine or feminine a name is) is not directly usable, because we need coordinates to draw on an HTML canvas with Svelte. Let's begin with the Y-axis.
Y-axis: dogs are +1, people are -1. We divide the name's share among dogs by the sum of its share among dogs and its share among people, then multiply by 2 and subtract 1.
X-axis: male is +1, female is -1. The same rescaling (multiply by 2, subtract 1) is applied to the masculinity ratio.
If a name only exists for dogs, the division gives 1. Multiplying by 2 and subtracting 1 gives +1 (the very top of the chart).
If a name only exists for people, the division gives 0, and 0 * 2 - 1 gives -1 (the very bottom).
If a name is exactly equally popular in both worlds, the division gives 0.5, and 0.5 * 2 - 1 gives 0 (right in the middle of the Y-axis).
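The three cases above can be checked with a small sketch in plain Python. The function names `coord_y` and `coord_x` are hypothetical, and returning 0 when a name is absent from both datasets is an assumption made here to avoid dividing by zero.

```python
def coord_y(pct_dogs: float, pct_person: float) -> float:
    """+1 = only dogs, -1 = only people, 0 = equally common in both."""
    denominator = pct_dogs + pct_person
    if denominator == 0:
        return 0.0  # assumption: a name found nowhere sits in the middle
    return pct_dogs / denominator * 2 - 1


def coord_x(masculinity: float) -> float:
    """+1 = only male, -1 = only female, 0 = evenly split."""
    return masculinity * 2 - 1


print(coord_y(1.5, 0.0))  # name used only for dogs
print(coord_y(0.0, 0.5))  # name used only for people
print(coord_y(0.3, 0.3))  # name equally popular in both worlds
print(coord_x(0.75))      # three out of four bearers are male
```

Both functions only rescale a value from the range 0 to 1 into the range -1 to +1, so that the middle of the chart is the origin.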
j = j.with_columns([
    # Y-axis: Dogs (+1) vs People (-1)
    (
        (
            ((pl.col("total_dogs") / total) * 100) /
            (((pl.col("total_dogs") / total) * 100) + pl.col("total_pct_person"))
        ) * 2 - 1
    ).fill_nan(0).alias("coord_y"),
    # X-axis: Male (+1) vs Female (-1)
    # We use coalesce: if a name is only in one place, it takes that one.
    (
        (pl.coalesce(["masculinitat_gos", "masculinitat_person"]) * 2) - 1
    ).alias("coord_x")
])
j.write_json("../data/clean_data/_names.json")
We got it!