Many games that have a combat element have some type of rationale for determining damage dealt to the opponent. Sid Meier's Civilization VI is the sixth installment in the series and it features, as it has in previous iterations, a combat element between military (and non-military) units. In this article, we'll use the combat log file that the game saves to learn a thing or two about modeling and data analysis. If you are new to data science / data analytics, I hope you'll find value in this. If you are well-versed in the methods, I hope you'll give your thoughts on how you might have approached this. There's more than one way to do this. If you are a business manager / executive and you're trying to understand how your data science teams do their work, take this analysis as an example of what the process looks like on a sample data set. Finally, if you're a teacher, I hope you will use this in your classroom! That's the real point of this blog!

In [1]:
from IPython.display import Image
Image(filename = "civattack.png")
Out[1]:

Civilization VI, Unit Combat Analysis!

Can we try to reconstruct the math for combat in CIV VI? Why, in the above image, will the crossbowman take no damage and why does the swordsman lose almost half his health? This is what we want to explore. Maybe we can even extract a model for how damage is dealt?

Load libraries

Note: I despise using "np" as an alias for numpy and "pd" for pandas, so I stick with "numpy" and "pandas", though I do prefer "plt" for pyplot. I also generally dislike doing things like "import sklearn.linear_model as linear_model"; I prefer to see the entire module path. With IDEs and even Jupyter notebooks, tab completion removes the keystroke-saving argument. As for style / readability, I find code to be more readable when the module is fully specified. Of course, there are limits to this. Anyway ... let's get on with it. :)

In [2]:
import math
import random
import itertools
In [3]:
import numpy
import scipy
import scipy.stats
import pandas
import sklearn
import sklearn.linear_model
import matplotlib
import matplotlib.pyplot as plt

Load Data

This can be found in /Documents/My Games/Sid Meier's Civilization VI/Logs or the equivalent for your OS / computer setup. If you find the folder, look for a file called "combatlog.csv". If you inspect it, you'll notice that some columns have a ":" in them. The colon needs to be replaced by a comma.

In [4]:
combatdf = pandas.read_csv("CombatLog_forblog.csv", skipinitialspace = True)
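
The colon-to-comma cleanup mentioned above can be sketched as follows. This is a minimal sketch, not the exact steps I used: the two-line excerpt is made up for illustration, and the real log's layout may differ. A blind replace like this assumes that no legitimate value in the file contains a colon.

```python
import io
import pandas

# Hypothetical two-line excerpt; the real log's columns and layout may differ.
raw_text = (
    "Game Turn, AttackerID, AttackerStr:AttackerStrMod\n"
    "3, UNIT_WARRIOR, 20:0\n"
)

# Replace every colon with a comma so that each value gets its own column.
cleaned_text = raw_text.replace(":", ",")

# skipinitialspace drops the blank that follows each comma.
sample_df = pandas.read_csv(io.StringIO(cleaned_text), skipinitialspace=True)
print(sample_df.columns.tolist())
```

For a real file you would read the text from disk, apply the same replace, and either write it back out or pass it to read_csv as shown.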

Inspect Your File!

I can't stress this enough: you have to look at the data format and quality before beginning any analysis. Make sure that columns are aligned correctly, see if there is missing data, etc. The next few steps are a bunch of things to look at.

What columns do we have?

In [5]:
combatdf.columns
Out[5]:
Index(['Game Turn', 'Attacking Civ', 'DefendingCiv', 'AttackerObjType',
       'DefenderObjType', 'Attacker Type', 'Defender Type', 'AttackerID',
       'DefenderID', 'AttackerStr', 'DefenderStr', 'AttackerStrMod',
       'DefenderStrMod', 'AttackerDmg', 'DefenderDmg'],
      dtype='object')

Use "describe"

"describe" is your friend. This is a small enough data set so I can spot check this with my eyes. For larger datasets, I would encourage that you write a wrapper around describe. Here, I want to do a few things.

  • See which columns are numeric
  • Make sure that the count for each column is the same (1193 in this case)
  • The basic and default statistics that are computed are helpful in a number of ways.
    • Is the standard deviation for any particular column zero? If it is, then you have a constant value. That may be an issue. It may not be. Depends on context.
    • Knowing the range (max - min) is useful as well. For example, AttackerObjType and DefenderObjType look to be in a tight range. Whereas "Attacker Type" and "Defender Type" are spread out.
    • Identifying which columns are actually numeric and which are categorical is helpful as well. Attacking Civ, DefendingCiv, AttackerObjType, DefenderObjType, Attacker Type and Defender Type all appear to be categorical values represented as numbers. The column names also suggest this.
  • Notice that the "ID" columns are missing from describe. They hold strings, and describe only summarizes numeric columns by default.
In [6]:
combatdf.describe()
Out[6]:
Game Turn Attacking Civ DefendingCiv AttackerObjType DefenderObjType Attacker Type Defender Type AttackerStr DefenderStr AttackerStrMod DefenderStrMod AttackerDmg DefenderDmg
count 1193.000000 1193.000000 1193.000000 1193.000000 1193.000000 1.193000e+03 1.193000e+03 1193.000000 1193.000000 1193.000000 1193.000000 1193.000000 1193.000000
mean 104.241408 31.993294 28.775356 1.241408 1.056999 4.497726e+06 3.747162e+06 26.101425 22.078793 -0.645432 0.382230 11.999162 36.830679
std 60.182207 28.297615 27.862796 0.651839 0.332930 5.171176e+06 4.666581e+06 9.662889 11.103335 5.876313 5.604686 16.820684 18.164881
min 3.000000 0.000000 0.000000 1.000000 1.000000 6.553600e+04 6.553600e+04 3.000000 5.000000 -25.000000 -12.000000 0.000000 1.000000
25% 51.000000 4.000000 3.000000 1.000000 1.000000 4.587550e+05 5.242930e+05 20.000000 15.000000 -3.000000 -3.000000 0.000000 24.000000
50% 100.000000 17.000000 15.000000 1.000000 1.000000 1.507336e+06 1.245189e+06 25.000000 20.000000 0.000000 0.000000 0.000000 33.000000
75% 150.000000 63.000000 63.000000 1.000000 1.000000 8.388681e+06 5.373972e+06 30.000000 26.000000 0.000000 3.000000 21.000000 47.000000
max 210.000000 63.000000 63.000000 3.000000 3.000000 1.671174e+07 1.677725e+07 110.000000 100.000000 42.000000 45.000000 100.000000 100.000000
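
For larger data sets, the wrapper around describe suggested above could look something like this. It is only a sketch: the function name describe_with_checks and the toy data frame are hypothetical, and the three checks are simply the ones listed in the bullets (range, constant columns, matching counts).

```python
import pandas

def describe_with_checks(df):
    """Transpose describe() and add a few sanity-check columns per numeric column."""
    summary = df.describe().T
    summary["range"] = summary["max"] - summary["min"]
    # A standard deviation of zero means the column holds a single constant value.
    summary["is_constant"] = summary["std"] == 0
    # describe() counts non-missing values, so the shortfall is the number missing.
    summary["n_missing"] = len(df) - summary["count"]
    return summary

# Toy data for illustration only; these are not rows from the combat log.
toy_df = pandas.DataFrame({
    "AttackerStr": [20, 25, 20, 10],
    "DefenderObjType": [1, 1, 1, 1],
    "DefenderDmg": [38.0, 12.0, None, 14.0],
    "AttackerID": ["UNIT_WARRIOR", "UNIT_SPEARMAN", "UNIT_WARRIOR", "UNIT_SCOUT"],
})

report = describe_with_checks(toy_df)
print(report[["count", "range", "is_constant", "n_missing"]])
```

With one row per column, it becomes easy to sort or filter the report instead of scanning a wide table by eye.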

It doesn't hurt to look at the file

Since this data set is fairly small, we can print it directly to screen. Otherwise, we may want to inspect it by printing out random subsamples. Regardless, pandas is pretty good about not displaying the entire data set.

In [7]:
combatdf #notice that there are "UNKNOWN" unit types.
Out[7]:
Game Turn Attacking Civ DefendingCiv AttackerObjType DefenderObjType Attacker Type Defender Type AttackerID DefenderID AttackerStr DefenderStr AttackerStrMod DefenderStrMod AttackerDmg DefenderDmg
0 3 2 63 1 1 131073 655369 UNIT_WARRIOR UNIT_SCOUT 20 10 0 0 22 38
1 3 10 63 1 1 196610 524295 UNIT_WARRIOR UNIT_SCOUT 20 10 0 0 18 37
2 4 7 63 1 1 131073 917517 UNIT_WARRIOR UNIT_SCOUT 20 10 0 0 20 49
3 5 5 63 1 1 131073 458758 UNIT_WARRIOR UNIT_SPEARMAN 20 25 10 6 32 30
4 6 5 63 1 1 131073 458758 UNIT_WARRIOR UNIT_SPEARMAN 20 25 7 3 32 33
5 6 7 63 1 1 131073 917517 UNIT_WARRIOR UNIT_SCOUT 20 10 -2 4 24 31
6 7 7 63 1 1 131073 917517 UNIT_WARRIOR UNIT_SCOUT 20 10 -4 1 27 30
7 7 63 7 1 1 851980 131073 UNIT_SPEARMAN UNIT_WARRIOR 25 20 0 6 36 33
8 9 5 63 1 1 131073 458758 UNIT_WARRIOR UNIT_SPEARMAN 20 25 6 0 28 34
9 9 6 63 1 1 131073 262147 UNIT_WARRIOR UNIT_SCOUT 20 10 0 3 21 37
10 10 6 63 1 1 131073 262147 UNIT_WARRIOR UNIT_SCOUT 20 10 -2 -4 18 42
11 10 10 63 1 1 131073 458758 UNIT_WARRIOR UNIT_SPEARMAN 20 25 10 -4 19 49
12 12 8 63 1 1 131073 1441812 UNIT_WARRIOR UNIT_SPEARMAN 20 25 10 14 41 20
13 13 2 63 1 1 131073 589832 UNIT_WARRIOR UNIT_SPEARMAN 20 25 13 6 32 33
14 13 8 63 1 1 131073 1441812 UNIT_WARRIOR UNIT_SPEARMAN 20 25 11 12 43 23
15 14 3 63 1 1 131073 1048591 UNIT_WARRIOR UNIT_SPEARMAN 20 25 17 6 28 40
16 14 8 63 1 1 196610 1441812 UNIT_WARRIOR UNIT_SPEARMAN 20 25 15 10 25 26
17 15 3 63 1 1 131073 1048591 UNIT_WARRIOR UNIT_SPEARMAN 20 25 14 2 18 45
18 15 4 63 1 1 131073 1310733 UNIT_WARRIOR UNIT_SPEARMAN 20 25 10 6 26 33
19 15 8 63 1 1 196610 1441812 UNIT_WARRIOR UNIT_SPEARMAN 20 25 13 7 24 35
20 15 63 3 1 1 2490403 196608 UNIT_SLINGER UNIT_SCOUT 15 10 0 3 0 36
21 15 63 3 1 1 1703960 196608 UNIT_WARRIOR UNIT_SCOUT 20 10 0 -1 19 47
22 15 63 3 1 1 2555940 131073 UNIT_QUADRIREME UNIT_WARRIOR 25 20 0 -5 0 41
23 15 63 3 1 1 1048591 131073 UNIT_SPEARMAN UNIT_WARRIOR 25 20 -8 1 40 12
24 16 4 63 1 1 131073 1310733 UNIT_WARRIOR UNIT_SPEARMAN 20 25 7 3 33 28
25 16 7 63 1 1 262144 1900570 UNIT_SLINGER UNIT_SCOUT 15 10 0 3 0 31
26 16 18 63 1 1 196610 1507349 UNIT_WARRIOR UNIT_SCOUT 20 10 5 0 13 54
27 16 63 3 1 1 1703960 196608 UNIT_WARRIOR UNIT_SCOUT 20 10 -2 -5 19 42
28 16 63 18 1 1 1507349 196610 UNIT_SCOUT UNIT_WARRIOR 10 20 -5 7 73 14
29 16 63 8 1 1 2359329 262144 UNIT_SPEARMAN UNIT_SLINGER 25 5 0 5 14 59
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1163 207 2 63 1 1 1703937 16515104 UNIT_SCOUT UNIT_SCOUT 10 10 5 0 21 40
1164 207 2 63 3 1 65536 16515104 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 25 10 0 -4 0 73
1165 207 6 0 3 1 65536 4128797 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 0 9 0 12
1166 207 12 8 1 3 786437 65536 UNIT_CROSSBOWMAN UNKNOWN 40 26 -18 12 0 15
1167 207 63 19 1 1 16187457 983045 UNIT_CROSSBOWMAN UNIT_KNIGHT 40 48 0 3 0 16
1168 207 63 19 1 1 15794253 983045 UNIT_ARCHER UNIT_KNIGHT 25 48 -3 1 0 10
1169 208 0 63 3 1 1048591 15728701 DISTRICT_ENCAMPMENT DISTRICT_ENCAMPMENT 40 20 0 0 0 70
1170 208 0 6 1 1 4128797 2097152 UNIT_KNIGHT UNIT_CATAPULT 48 23 1 4 14 81
1171 208 0 63 1 1 5505057 15728701 UNIT_CATAPULT UNIT_QUADRIREME 35 20 -17 -7 0 40
1172 208 2 63 1 1 1703937 16777247 UNIT_SCOUT UNIT_SWORDSMAN 10 36 3 0 79 14
1173 208 6 0 3 1 65536 4456448 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 0 16 0 9
1174 208 6 0 3 1 196610 5242881 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 0 0 0 23
1175 208 6 0 3 1 458758 4128797 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 0 4 0 17
1176 208 8 12 1 1 1048576 786437 UNIT_CROSSBOWMAN UNIT_CROSSBOWMAN 40 30 0 -1 0 50
1177 208 12 8 1 3 786437 65536 UNIT_CROSSBOWMAN UNKNOWN 40 26 -23 13 0 11
1178 208 63 19 1 1 16187457 983045 UNIT_CROSSBOWMAN UNIT_KNIGHT 40 48 0 3 0 22
1179 208 63 19 1 1 15794253 983045 UNIT_ARCHER UNIT_KNIGHT 25 48 -3 1 0 10
1180 209 0 63 3 1 1048591 7405577 DISTRICT_ENCAMPMENT DISTRICT_ENCAMPMENT 40 25 0 0 0 61
1181 209 0 63 1 1 5505057 7405577 UNIT_CATAPULT UNIT_SPEARMAN 35 25 -17 -6 0 28
1182 209 0 6 1 3 4259871 196610 UNIT_KNIGHT UNKNOWN 48 20 0 17 18 51
1183 209 0 6 1 3 5242881 196610 UNIT_KNIGHT UNKNOWN 48 20 5 15 13 71
1184 209 0 6 1 3 4063262 196610 UNIT_KNIGHT UNKNOWN 48 20 -1 11 16 52
1185 209 3 63 1 1 1441792 11206711 UNIT_NORWEGIAN_LONGSHIP UNIT_QUADRIREME 30 20 -2 -2 24 44
1186 209 6 0 3 1 65536 4128797 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 0 10 0 14
1187 209 6 0 3 1 196610 5242881 DISTRICT_CITY_CENTER DISTRICT_CITY_CENTER 40 48 -1 -4 0 23
1188 209 8 12 1 1 1048576 786437 UNIT_CROSSBOWMAN UNIT_CROSSBOWMAN 40 30 0 -6 0 65
1189 209 63 19 1 1 16187457 983045 UNIT_CROSSBOWMAN UNIT_KNIGHT 40 48 0 4 0 15
1190 209 63 19 1 1 15794253 983045 UNIT_ARCHER UNIT_KNIGHT 25 48 -3 3 0 11
1191 209 63 19 1 1 15400978 983045 UNIT_SWORDSMAN UNIT_KNIGHT 36 48 2 2 46 20
1192 210 0 6 1 3 4259871 196610 UNIT_KNIGHT UNKNOWN 48 20 -2 8 17 68

1193 rows × 15 columns
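
For a data set too large to print, the random subsamples mentioned above can be drawn with DataFrame.sample. A minimal sketch on a made-up frame (the values are placeholders, not combat data):

```python
import pandas

# Placeholder frame standing in for a much larger log.
toy_df = pandas.DataFrame({
    "Game Turn": list(range(1, 101)),
    "DefenderDmg": [turn % 50 for turn in range(1, 101)],
})

# random_state makes the draw repeatable, which helps when sharing a notebook.
subsample = toy_df.sample(n=5, random_state=0)
print(subsample)
```

Drawing a few different subsamples (by changing random_state) gives a quick feel for the rows that head and tail never show.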

What did we want to do?

We're interested in seeing how attacker strength and defender strength relate to how much damage each takes in combat. But before we start, there's still a bunch of investigating and general data exploration.

1-D Histograms are your friend

What kind of values are we working with? Are there boundary conditions? Are there obvious "visual" outliers? Just because the data formatting turned out to be ok, that doesn't mean that the values are actually ok.

In [8]:
"""we can help ourselves a little bit by doing some automation. This print command is just a visual aid for me so that I
can more easily reference the needed columns without scrolling all around the notebook"""
for col in combatdf.describe():
    print(col)
Game Turn
Attacking Civ
DefendingCiv
AttackerObjType
DefenderObjType
Attacker Type
Defender Type
AttackerStr
DefenderStr
AttackerStrMod
DefenderStrMod
AttackerDmg
DefenderDmg
In [9]:
columns_we_care_about = ['AttackerStr','DefenderStr','AttackerStrMod','DefenderStrMod','AttackerDmg','DefenderDmg']
nrows = 2
ncols = 3
fig, ax = plt.subplots(nrows = nrows, ncols = ncols, figsize = (12,8))
count = 0
describe = combatdf.describe()
for col in describe:
    if col not in columns_we_care_about:
        continue
    row_idx = count%nrows
    col_idx = count//nrows
    col_min = describe.loc['min'][col]
    col_max = describe.loc['max'][col]
    b = int(col_max - col_min + 1) #this is ok since by inspection of the data we see that the six columns we care about are integers
    counts, bins, patchobjs = ax[row_idx][col_idx].hist(combatdf[col],bins = b, alpha = 0.7, density = True) #"density" replaces the "normed" argument, which newer matplotlib versions no longer accept
    ax[row_idx][col_idx].set_title(col)
    count += 1

What do you notice?

Given that this game is still only halfway through and I'm playing on Prince and am not a rockstar Civ player, unit strengths haven't gotten too large [first column]. Though it looks like some civ or civs have units with strength over 100! [It's over 9000!]

Next, for the attacker and defender strength modifiers, it looks like there are spikes at zero. That makes sense if most unit vs unit combat happens between unmodified units, or at least that's what this game implies. You may also notice, by the eyeball metric, that there are modification spikes around 5, 7, and 10. If you've played the game then you know that in-game bonuses to units come in those varieties, so that makes sense too. Finally, negative modifiers surely exist because units can be damaged (resulting in a negative modifier) and terrain can also harm a unit's overall strength.

Finally, in the third column, for the damage that the attacker took (AttackerDmg), there is a spike at 0. We'll have to investigate this. If you've played Civ VI, you may already have a guess as to why. Also, it looks clear that we have some natural boundary conditions. Units can't take more than 100 in damage nor can they take negative damage. So that's at least good in the sense that we don't have outliers representing a data quality issue. For diehards of the game, you may know that later in the game it's possible for damage to go beyond 100 when considering city attacks. But that hasn't happened yet here.
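
The boundary check described above (damage should stay between 0 and 100) does not have to be done by eye. Here is a minimal sketch on toy rows, where one out-of-range value has been planted on purpose to show what a hit looks like:

```python
import pandas

# Toy rows for illustration; the 101 is deliberately planted and is not from the log.
toy_df = pandas.DataFrame({
    "AttackerDmg": [0, 22, 100],
    "DefenderDmg": [38, 101, 59],
})
damage_cols = ["AttackerDmg", "DefenderDmg"]

# between() is inclusive on both ends by default, so 0 and 100 count as valid.
in_bounds = toy_df[damage_cols].apply(lambda s: s.between(0, 100)).all(axis=1)
out_of_bounds = toy_df[~in_bounds]
print(out_of_bounds)
```

An empty result on the real data would confirm what the histograms suggest.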

Investigate AttackerDmg

We noticed a spike at zero for AttackerDmg. It's not a bad idea to look to see what's causing that. Now, here's something to keep in mind. Knowing something about the context of the data is extremely helpful, more often than not. We don't have to search in an unstructured, "non-parametric" way. The latter is how "sexy" data science is sold. In reality, you (a) don't have time for this nonsense and (b) can use prior knowledge! In this case, we're going to look at AttackerID and AttackerDmg to see if we can find any association. Let's just query the data frame and see if any natural grouping pops out.

In [10]:
combatdf.query('AttackerDmg == 0')['AttackerID'].unique()
Out[10]:
array(['UNIT_SLINGER', 'UNIT_QUADRIREME', 'UNIT_ARCHER',
       'DISTRICT_CITY_CENTER', 'UNIT_CATAPULT', 'UNIT_CROSSBOWMAN',
       'DISTRICT_ENCAMPMENT'], dtype=object)
In [11]:
combatdf.query('AttackerDmg != 0')['AttackerID'].unique()
Out[11]:
array(['UNIT_WARRIOR', 'UNIT_SPEARMAN', 'UNIT_SCOUT',
       'UNIT_SUMERIAN_WAR_CART', 'UNIT_HEAVY_CHARIOT',
       'UNIT_NORWEGIAN_LONGSHIP', 'UNIT_GALLEY',
       'UNIT_MACEDONIAN_HETAIROI', 'UNIT_INDIAN_VARU', 'UNIT_SWORDSMAN',
       'UNIT_APOSTLE', 'UNIT_KNIGHT', 'UNIT_INQUISITOR', 'UNIT_PIKEMAN'],
      dtype=object)

What do we notice?

You can see that the AttackerIDs for when AttackerDmg is zero and when it is not zero are disjoint. In other words, if the AttackerDmg is zero you won't find that unit when AttackerDmg is not zero. In Civ VI there's a distinction between ranged attacks and melee attacks. Melee attacks are when the defender can deal damage back to the attacker. Ranged attacks are one-sided --- the attacker fires at the defender from "afar". This is true for air units as well. Fighter vs Fighter is an "air melee" but Fighter vs non-anti-air unit is ranged. In any case, the point is, there are two different cases --- when attacker damage is zero and when it is not.

One note: this way of separation may not be general enough. It could be possible (air vs air or air vs ground anti-air) that a unit type (or more generally, an attacker type) shows up in both cases. However, the overall point should be clear. When attacker damage is zero, it is a ranged attack. If it is not a ranged attack, the attacker will take some non-zero damage.
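
Rather than comparing the two arrays by eye, the disjointness can be checked with a set intersection. A minimal sketch on toy rows (the unit names are real, the damage values are made up):

```python
import pandas

# Toy rows for illustration only.
toy_df = pandas.DataFrame({
    "AttackerID": ["UNIT_SLINGER", "UNIT_WARRIOR", "UNIT_ARCHER", "UNIT_WARRIOR"],
    "AttackerDmg": [0, 22, 0, 18],
})

zero_dmg_ids = set(toy_df.query("AttackerDmg == 0")["AttackerID"])
nonzero_dmg_ids = set(toy_df.query("AttackerDmg != 0")["AttackerID"])

# Any unit type that appears in both groups would show up here.
overlap = zero_dmg_ids & nonzero_dmg_ids
print(overlap)
```

An empty set means the two groups share no attacker types. A non-empty set would point to exactly the mixed cases raised in the note above.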

Next, it might help to put together some scatter plots. This first one is just looking at the damage the attacker took against the damage the defender took on a per record basis. To be clear, this isn't how combat works. In actuality, the attacker and defender strengths are what cause damage to the attacker and defender. But it may be interesting just to see what kind of relationship might exist, if any.

In [12]:
plt.scatter(combatdf['AttackerDmg'],combatdf['DefenderDmg'],alpha = 0.7, marker = '.')
plt.xlabel('Damage to attacker')
plt.ylabel('Damage to defender')
Out[12]:
<matplotlib.text.Text at 0xdcffdd0>

How interesting!

So this is interesting! Ignoring the boundary conditions on both the horizontal and vertical axes (0 and 100), it certainly looks like there is some parity between how much damage an attacker takes and how much damage a defender takes. In other words, if the attacker takes a small amount of damage, odds are the defender is taking a high amount of damage and vice versa. So there is a level curve on which damage is dealt. It also appears that there is a small amount of randomness incorporated into combat.

We're not going to try to model this, but if you're interested, give it a try! I would recommend that if you want to fit a model, then exclude all cases when the attacker damage is 0 or 100 and same for defender. These are boundary conditions that will make a mess of most parametric models. Also, exclude all data involving districts.
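
If you do try it, the exclusions recommended above might be sketched like this. The rows are toy values chosen so that only one survives; the column names match the log, but nothing else here comes from it.

```python
import pandas

# Toy rows for illustration only.
toy_df = pandas.DataFrame({
    "AttackerID": ["UNIT_WARRIOR", "DISTRICT_CITY_CENTER", "UNIT_SCOUT", "UNIT_ARCHER"],
    "DefenderID": ["UNIT_SCOUT", "UNIT_KNIGHT", "UNIT_INDIAN_VARU", "UNIT_SLINGER"],
    "AttackerDmg": [22, 0, 100, 0],
    "DefenderDmg": [38, 12, 7, 100],
})

# Drop the boundary values (0 and 100) on both sides of the fight.
interior = toy_df.query(
    "AttackerDmg > 0 and AttackerDmg < 100 and DefenderDmg > 0 and DefenderDmg < 100"
)

# Drop anything where either side is a district.
model_df = interior[
    ~interior["AttackerID"].str.contains("DISTRICT")
    & ~interior["DefenderID"].str.contains("DISTRICT")
]
print(model_df)
```

Note that excluding zero attacker damage also removes every ranged attack, so what remains is melee-only data.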

Another quick investigation

Let's just take a look at what type of ids we are working with. We can see from below that we have "DISTRICT", "UNIT", and "UNKNOWN" types. For this investigation, let's keep our focus on UNIT vs UNIT and separate by what kind of damage the attacker took. Let's also throw out religious units.

In [13]:
numpy.array(sorted(combatdf['AttackerID'].unique().tolist() + combatdf['DefenderID'].unique().tolist()))
Out[13]:
array(['DISTRICT_CITY_CENTER', 'DISTRICT_CITY_CENTER',
       'DISTRICT_ENCAMPMENT', 'DISTRICT_ENCAMPMENT', 'UNIT_APOSTLE',
       'UNIT_ARCHER', 'UNIT_ARCHER', 'UNIT_CATAPULT', 'UNIT_CATAPULT',
       'UNIT_CROSSBOWMAN', 'UNIT_CROSSBOWMAN', 'UNIT_GALLEY',
       'UNIT_GALLEY', 'UNIT_HEAVY_CHARIOT', 'UNIT_HEAVY_CHARIOT',
       'UNIT_INDIAN_VARU', 'UNIT_INDIAN_VARU', 'UNIT_INQUISITOR',
       'UNIT_INQUISITOR', 'UNIT_KNIGHT', 'UNIT_KNIGHT',
       'UNIT_MACEDONIAN_HETAIROI', 'UNIT_MACEDONIAN_HETAIROI',
       'UNIT_MISSIONARY', 'UNIT_NORWEGIAN_BERSERKER',
       'UNIT_NORWEGIAN_LONGSHIP', 'UNIT_NORWEGIAN_LONGSHIP',
       'UNIT_PIKEMAN', 'UNIT_PIKEMAN', 'UNIT_QUADRIREME',
       'UNIT_QUADRIREME', 'UNIT_SCOUT', 'UNIT_SCOUT', 'UNIT_SLINGER',
       'UNIT_SLINGER', 'UNIT_SPEARMAN', 'UNIT_SPEARMAN',
       'UNIT_SUMERIAN_WAR_CART', 'UNIT_SUMERIAN_WAR_CART',
       'UNIT_SWORDSMAN', 'UNIT_SWORDSMAN', 'UNIT_WARRIOR', 'UNIT_WARRIOR',
       'UNIT_WARRIOR_MONK', 'UNKNOWN'], dtype='<U24')
In [14]:
unit_combatdf = combatdf[combatdf['AttackerID'].str.contains("UNIT") & combatdf['DefenderID'].str.contains("UNIT") & \
                         ~combatdf['AttackerID'].str.contains('APOSTLE') & ~combatdf['DefenderID'].str.contains('APOSTLE') &\
                        ~combatdf['AttackerID'].str.contains('INQUISITOR') & ~combatdf['DefenderID'].str.contains('INQUISITOR')]
numpy.array(sorted(unit_combatdf['AttackerID'].unique().tolist() + unit_combatdf['DefenderID'].unique().tolist()))
Out[14]:
array(['UNIT_ARCHER', 'UNIT_ARCHER', 'UNIT_CATAPULT', 'UNIT_CATAPULT',
       'UNIT_CROSSBOWMAN', 'UNIT_CROSSBOWMAN', 'UNIT_GALLEY',
       'UNIT_GALLEY', 'UNIT_HEAVY_CHARIOT', 'UNIT_HEAVY_CHARIOT',
       'UNIT_INDIAN_VARU', 'UNIT_INDIAN_VARU', 'UNIT_KNIGHT',
       'UNIT_KNIGHT', 'UNIT_MACEDONIAN_HETAIROI',
       'UNIT_MACEDONIAN_HETAIROI', 'UNIT_NORWEGIAN_BERSERKER',
       'UNIT_NORWEGIAN_LONGSHIP', 'UNIT_NORWEGIAN_LONGSHIP',
       'UNIT_PIKEMAN', 'UNIT_PIKEMAN', 'UNIT_QUADRIREME',
       'UNIT_QUADRIREME', 'UNIT_SCOUT', 'UNIT_SCOUT', 'UNIT_SLINGER',
       'UNIT_SLINGER', 'UNIT_SPEARMAN', 'UNIT_SPEARMAN',
       'UNIT_SUMERIAN_WAR_CART', 'UNIT_SUMERIAN_WAR_CART',
       'UNIT_SWORDSMAN', 'UNIT_SWORDSMAN', 'UNIT_WARRIOR', 'UNIT_WARRIOR',
       'UNIT_WARRIOR_MONK'], dtype='<U24')

Replot

It's good to compare this plot against the plot above for a quick visual about what we've excluded.

In [15]:
plt.scatter(unit_combatdf['AttackerDmg'],unit_combatdf['DefenderDmg'],alpha = 0.7, marker = '.')
plt.xlabel('Damage to attacker')
plt.ylabel('Damage to defender')
Out[15]:
<matplotlib.text.Text at 0xe3020d0>

Outliers?

There's one point that looks a little strange.

In [16]:
unit_combatdf.query("AttackerDmg < 10 and DefenderDmg < 50 and AttackerDmg > 0")
Out[16]:
Game Turn Attacking Civ DefendingCiv AttackerObjType DefenderObjType Attacker Type Defender Type AttackerID DefenderID AttackerStr DefenderStr AttackerStrMod DefenderStrMod AttackerDmg DefenderDmg
496 82 17 63 1 1 196610 9109580 UNIT_WARRIOR UNIT_WARRIOR 20 20 3 -7 4 47

Well?

That one attack looks legitimate from a data formatting standpoint, but certainly appears a bit outside the visual clustering. We would have expected that if the defender took almost 50 points of damage, the attacker should've taken damage close to 20 points. Let's keep this in mind, but leave the point in our data set for now. There are two other points that look visually askew, but we could be here all day scrutinizing every point. So let's not bother.

Melee vs Ranged

Let's make two dataframes. One for melee and one for ranged.

In [17]:
meleedf = unit_combatdf.query('AttackerDmg > 0').copy() #copy so that columns can be added later without a SettingWithCopyWarning
rangedf = unit_combatdf.query('AttackerDmg == 0').copy()

Incorporating attacker and defender strengths

So far, we have looked at just the output --- the damage done to the attacker and the defender. Odds are, this damage is dependent on the attacker and defender strengths (AttackerStr and DefenderStr, respectively). There are also two columns for modifiers to the strength.

Again, if you've played the game then you know that the modifiers are additive. So the "net" attacker strength is the sum of the base strength (AttackerStr) and the modifier (AttackerStrMod). The same goes for the defender. So let's just add a column in both of our data frames for this net strength.

In [18]:
cols = [('AttackerStr','AttackerStrMod','AttackerNetStr'),('DefenderStr','DefenderStrMod','DefenderNetStr')]
for c in cols:
    #net strength = base strength + modifier, computed for both data frames
    meleedf[c[2]] = meleedf[c[0]] + meleedf[c[1]]
    rangedf[c[2]] = rangedf[c[0]] + rangedf[c[1]]
In [19]:
meleedf.columns
Out[19]:
Index(['Game Turn', 'Attacking Civ', 'DefendingCiv', 'AttackerObjType',
       'DefenderObjType', 'Attacker Type', 'Defender Type', 'AttackerID',
       'DefenderID', 'AttackerStr', 'DefenderStr', 'AttackerStrMod',
       'DefenderStrMod', 'AttackerDmg', 'DefenderDmg', 'AttackerNetStr',
       'DefenderNetStr'],
      dtype='object')
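
The same net-strength columns can also be built without a loop by using DataFrame.assign, which returns a new frame instead of modifying one in place. This is just an alternative sketch; the two toy rows reuse the strength and modifier values of records 3 and 605 from the log.

```python
import pandas

# Two rows with the strengths and modifiers of log records 3 and 605.
toy_df = pandas.DataFrame({
    "AttackerStr": [20, 10],
    "AttackerStrMod": [10, -14],
    "DefenderStr": [25, 40],
    "DefenderStrMod": [6, -3],
})

# Net strength = base strength + modifier (modifiers are additive).
toy_df = toy_df.assign(
    AttackerNetStr=toy_df["AttackerStr"] + toy_df["AttackerStrMod"],
    DefenderNetStr=toy_df["DefenderStr"] + toy_df["DefenderStrMod"],
)
print(toy_df[["AttackerNetStr", "DefenderNetStr"]])
```

The second row already hints at something worth checking: a base strength of 10 with a modifier of -14 gives a negative net strength.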

Exploration of net strength

Garbage in, garbage out. It's good to check up on the inputs just like we did for the outputs.

In [20]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6))
ax[0].scatter(meleedf['AttackerStr'],meleedf['DefenderStr'],alpha = 0.7, marker = '.')
ax[0].set_xlabel('Attacker Strength')
ax[0].set_ylabel('Defender Strength')
ax[0].set_title('Melee')
ax[1].scatter(rangedf['AttackerStr'],rangedf['DefenderStr'],alpha = 0.7, marker = '.')
ax[1].set_xlabel('Attacker Strength')
ax[1].set_ylabel('Defender Strength')
ax[1].set_title('Range')
Out[20]:
<matplotlib.text.Text at 0xe100d70>
In [21]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6))
ax[0].scatter(meleedf['AttackerNetStr'],meleedf['DefenderNetStr'],alpha = 0.7, marker = '.')
ax[0].set_xlabel('Attacker Net Strength')
ax[0].set_ylabel('Defender Net Strength')
ax[0].set_title('Melee')
ax[1].scatter(rangedf['AttackerNetStr'],rangedf['DefenderNetStr'],alpha = 0.7, marker = '.')
ax[1].set_xlabel('Attacker Net Strength')
ax[1].set_ylabel('Defender Net Strength')
ax[1].set_title('Range')
Out[21]:
<matplotlib.text.Text at 0xe1de470>

Some comments

The above graphs show how the combat match-ups from base strength become more clumped after adjusting for the modifiers, especially for melee. That makes sense to some extent if we assume that all players are rational and won't routinely put their troops in harm's way needlessly. Of course, someone will be stronger and someone will be weaker from time to time, but the disparities don't appear to be large (we can look at ratios if we want).

However, net negative values for attacker and defender strengths are a bit of a concern. Let's take a look at those cases first and see if anything pops out. Odds are, with terrain modifiers and damage the unit may have already taken, it's possible net strength can be negative. But peeking at the data doesn't hurt.

In [22]:
meleedf.query('AttackerNetStr < 0 or DefenderNetStr < 0')
Out[22]:
Game Turn Attacking Civ DefendingCiv AttackerObjType DefenderObjType Attacker Type Defender Type AttackerID DefenderID AttackerStr DefenderStr AttackerStrMod DefenderStrMod AttackerDmg DefenderDmg AttackerNetStr DefenderNetStr
76 24 63 1 1 1 3735599 262146 UNIT_WARRIOR UNIT_SLINGER 20 5 0 -6 11 60 20 -1
93 27 3 63 1 1 131073 2490403 UNIT_WARRIOR UNIT_SLINGER 20 5 9 -6 9 95 29 -1
182 37 1 63 1 1 196608 3997733 UNIT_SCOUT UNIT_SLINGER 10 5 2 -6 16 51 12 -1
197 38 2 63 1 1 393218 3014677 UNIT_WARRIOR UNIT_SLINGER 20 5 5 -7 10 80 25 -2
334 56 6 63 1 1 524289 4653111 UNIT_WARRIOR UNIT_SLINGER 20 5 -2 -9 14 71 18 -4
605 101 63 4 1 1 8257607 1048583 UNIT_SCOUT UNIT_INDIAN_VARU 10 40 -14 -3 100 7 -4 37
826 134 63 4 1 1 10813476 1638409 UNIT_SCOUT UNIT_INDIAN_VARU 10 40 -16 10 100 4 -6 50
In [23]:
rangedf.query('AttackerNetStr < 0 or DefenderNetStr < 0')
Out[23]:
Game Turn Attacking Civ DefendingCiv AttackerObjType DefenderObjType Attacker Type Defender Type AttackerID DefenderID AttackerStr DefenderStr AttackerStrMod DefenderStrMod AttackerDmg DefenderDmg AttackerNetStr DefenderNetStr
276 48 63 3 1 1 5767227 393216 UNIT_QUADRIREME UNIT_SLINGER 25 5 -3 -7 0 63 22 -2
324 55 4 1 1 1 720896 589828 UNIT_ARCHER UNIT_SLINGER 25 5 5 -6 0 100 30 -1
366 60 2 63 1 1 655360 5701651 UNIT_ARCHER UNIT_SLINGER 25 5 0 -6 0 68 25 -1
589 99 1 63 1 1 1179650 6094880 UNIT_ARCHER UNIT_SLINGER 25 5 5 -9 0 100 30 -4
864 143 3 63 1 1 1245191 7143478 UNIT_ARCHER UNIT_SLINGER 25 5 4 -6 0 78 29 -1

Notice anything? All of these cases except for one in range-type attacks involve Civ #63. Barbarians. The modifier involved isn't excessive. The base strength was low enough that net negative would be possible. So I'm not really worried about this. But it's good to know! Let's take a quick look at attacker and defender net strength against the civilization id. I'm civilization index 0, if you care.

In [24]:
acivs = sorted(meleedf['Attacking Civ'].unique())
dcivs = sorted(meleedf['DefendingCiv'].unique())
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6),sharey = True)
boxplot = ax[0].boxplot([meleedf[meleedf['Attacking Civ'] == v]['AttackerNetStr'].values for v in acivs], notch = True,
                     labels = acivs, bootstrap = 10000) #notch = True draws notched boxes; bootstrap sets the resamples used for the notch intervals
boxplot = ax[0].set_xlabel('Attacking Civ')
boxplot = ax[0].set_ylabel('Net Strength')
boxplot = ax[1].boxplot([meleedf[meleedf['DefendingCiv'] == v]['DefenderNetStr'].values for v in dcivs], notch = True,
                     labels = dcivs, bootstrap = 10000)
boxplot = ax[1].set_xlabel('Defending Civ')
boxplot = ax[1].set_ylabel('Net Strength')
fig.suptitle('Melee')
Out[24]:
<matplotlib.text.Text at 0xe538d90>
In [25]:
acivs = sorted(rangedf['Attacking Civ'].unique())
dcivs = sorted(rangedf['DefendingCiv'].unique())
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6),sharey = True)
boxplot = ax[0].boxplot([rangedf[rangedf['Attacking Civ'] == v]['AttackerNetStr'].values for v in acivs], notch = True,
                     labels = acivs, bootstrap = 10000)
boxplot = ax[0].set_xlabel('Attacking Civ')
boxplot = ax[0].set_ylabel('Net Strength')
boxplot = ax[1].boxplot([rangedf[rangedf['DefendingCiv'] == v]['DefenderNetStr'].values for v in dcivs], notch = True,
                     labels = dcivs, bootstrap = 10000)
boxplot = ax[1].set_xlabel('Defending Civ')
boxplot = ax[1].set_ylabel('Net Strength')
fig.suptitle('Range')
Out[25]:
<matplotlib.text.Text at 0xee5c750>

This all looks reasonable. Sanity checks are useful before we start doing analysis. Let's do one more survey of our data --- the modifiers.

In [26]:
acivs = sorted(meleedf['Attacking Civ'].unique())
dcivs = sorted(meleedf['DefendingCiv'].unique())
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6),sharey = True)
boxplot = ax[0].boxplot([meleedf[meleedf['Attacking Civ'] == v]['AttackerStrMod'].values for v in acivs], notch = True,
                     labels = acivs, bootstrap = 10000)
boxplot = ax[0].set_xlabel('Attacking Civ')
boxplot = ax[0].set_ylabel('Strength Modifier')
boxplot = ax[1].boxplot([meleedf[meleedf['DefendingCiv'] == v]['DefenderStrMod'].values for v in dcivs], notch = True,
                     labels = dcivs, bootstrap = 10000)
boxplot = ax[1].set_xlabel('Defending Civ')
boxplot = ax[1].set_ylabel('Strength Modifier')
fig.suptitle('Melee')
Out[26]:
<matplotlib.text.Text at 0xf49fb50>
In [27]:
acivs = sorted(rangedf['Attacking Civ'].unique())
dcivs = sorted(rangedf['DefendingCiv'].unique())
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6),sharey = True)
boxplot = ax[0].boxplot([rangedf[rangedf['Attacking Civ'] == v]['AttackerStrMod'].values for v in acivs], notch = True,
                     labels = acivs, bootstrap = 10000)
boxplot = ax[0].set_xlabel('Attacking Civ')
boxplot = ax[0].set_ylabel('Strength Modifier')
boxplot = ax[1].boxplot([rangedf[rangedf['DefendingCiv'] == v]['DefenderStrMod'].values for v in dcivs], notch = True,
                     labels = dcivs, bootstrap = 10000)
boxplot = ax[1].set_xlabel('Defending Civ')
boxplot = ax[1].set_ylabel('Strength Modifier')
fig.suptitle('Range')
Out[27]:
<matplotlib.text.Text at 0xf1a9850>

Again, things look reasonable, though civilizations 11 and 19 seem to attack with significantly negative modifiers on ranged attacks. Barbarians generally are terrible.

Now let's take a look at our attacker and defender net strengths against the damage they take. But how to do this? Once again, let's think about the game itself. Does it seem sensible that we look at the difference between the attacker and defender net strengths or their ratio or something else? I want to reject ratio because of the possibility of division by zero and because strengths can go negative. Difference seems like a good place to start. So let's take a look.

In [28]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,6), sharey = True, sharex = True)
ax[0].scatter(meleedf['AttackerNetStr']-meleedf['DefenderNetStr'],meleedf['DefenderDmg'],marker = '.', alpha = 0.7, label = 'Defender')
ax[0].scatter(meleedf['AttackerNetStr']-meleedf['DefenderNetStr'],meleedf['AttackerDmg'],marker = '.', alpha = 0.7, label = 'Attacker')
ax[0].legend(loc = 6)
ax[0].set_xlabel('Net Strength Differential (Attacker - Defender)')
ax[0].set_ylabel('Damage Taken')
ax[0].set_title('Melee')
ax[1].scatter(rangedf['AttackerNetStr']-rangedf['DefenderNetStr'],rangedf['DefenderDmg'],marker = '.', alpha = 0.7, label = 'Defender')
ax[1].legend(loc = 6)
ax[1].set_xlabel('Net Strength Differential (Attacker - Defender)')
ax[1].set_ylabel('Damage Taken')
ax[1].set_title('Range')
Out[28]:
<matplotlib.text.Text at 0xf9c53d0>

Well, that's not bad! Clearly, there's a functional form here. Likely exponential if I had to guess. Of course, there's a hard cap of 100 and a hard floor of 0. The melee graph corroborates something we saw earlier --- that there seems to be some zero-sum type parity between the amount of damage the attacker and defender take. Out of curiosity, let's just look at the sum of attacker damage and defender damage for melee.

In [29]:
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (8,8), sharey = True, sharex = True)
ax.scatter(meleedf['AttackerNetStr']-meleedf['DefenderNetStr'],meleedf['AttackerDmg'] + meleedf['DefenderDmg'],marker = '.', alpha = 0.7, label = 'Attacker + Defender Damage')
ax.legend(loc = 9)
ax.set_xlabel('Net Strength Differential (Attacker - Defender)')
ax.set_ylabel('Sum of Damage Taken')
ax.set_title('Melee')
Out[29]:
<matplotlib.text.Text at 0xfa612d0>

Interesting! What're your guesses? Quadratic? Hyperbolic?

Let's see if we can fit an exponential to the damage the defender takes based on the net strength differential between attacker and defender. To do this, we'll "linearize" the data with a log transformation. But we'll also get rid of the boundary values of 0 and 100 from the damage taken: the log of 0 is undefined, and damage pinned at the floor or the cap no longer follows the underlying curve. Normally, I'd look to rescale the data to be centered around zero, but in this case, the numbers are generally tame and the behavior looks well-defined enough that I'll bypass these additional steps.

First let's plot the log-transformed data alongside the square-root-transformed data --- we'll do melee for defender damage taken.

In [30]:
data = meleedf.query('DefenderDmg > 0 and DefenderDmg < 100')
x = data['AttackerNetStr']-data['DefenderNetStr']
y1 = numpy.log(data['DefenderDmg'])
y2 = numpy.sqrt(data['DefenderDmg'])
plt.scatter(x,y1,marker = '.', alpha = 0.7,label = 'log transform')
plt.scatter(x,y2,marker = '.', alpha = 0.7,label = 'sqrt transform')
plt.legend()
plt.ylabel('Transformed Defender Damage')
plt.xlabel('Net Strength Differential')
Out[30]:
<matplotlib.text.Text at 0xfa0b4d0>

Looks pretty clear that a log transform is more appropriate, suggesting an exponential form relating the difference between attacker and defender net strength and the damage the defender takes. There are ways for us to automatically detect this, but it's a bit out of scope here. Additionally, it's ok to look at the graphs!
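For the curious, one simple way to automate the comparison is to fit a straight line to each transformed series and compare the $R^{2}$ values. This is only a sketch of that idea, not the method used in this analysis: `linear_r2` is a hypothetical helper, and the data are synthetic (an exponential with made-up constants) because the combat log isn't reproduced here.

```python
import numpy

def linear_r2(x, y):
    """R^2 of an ordinary least-squares straight line fitted to (x, y)."""
    slope, intercept = numpy.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

# Synthetic stand-in for the melee data: exponential growth with
# multiplicative noise. The constants are illustrative assumptions.
rng = numpy.random.RandomState(0)
x = rng.uniform(-40, 30, size=500)
damage = numpy.exp(0.04 * x + 3.4 + rng.normal(0, 0.05, size=500))

r2_log = linear_r2(x, numpy.log(damage))
r2_sqrt = linear_r2(x, numpy.sqrt(damage))
print(r2_log, r2_sqrt)
```

Whichever transform gives the straighter line (the higher $R^{2}$) is the better candidate. With the real data you would pass the strength differential and `data['DefenderDmg']` instead of the synthetic arrays.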

Now, let's set up a linear regression on the log transformed data.

In [31]:
reg = sklearn.linear_model.LinearRegression()
reg.fit(x.values.reshape(-1,1),y1)
reg.coef_, reg.intercept_
Out[31]:
(array([0.03794637]), 3.387202364737919)

Ok, so we have regression coefficients, but they are on the log-transformed data. In other words, we fitted the model $\log(y) = mx + b$ where $y$ is the defender damage and $x$ is the difference between net attacker strength and net defender strength. So, to get back $y$, we exponentiate both sides: $y = e^{mx + b}$.
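As a quick sketch of what the back-transformation implies, the snippet below plugs in the defender coefficients printed above. The function name and the explicit cap at 100 are my own additions for illustration.

```python
import math

# Coefficients copied from the defender fit output above.
M_DEFENDER = 0.03794637
B_DEFENDER = 3.387202364737919

def predicted_defender_damage(strength_diff, m=M_DEFENDER, b=B_DEFENDER):
    """Back-transform log(y) = m*x + b to y, capped at the 100-point maximum."""
    return min(100.0, math.exp(m * strength_diff + b))

# Each extra point of strength multiplies the predicted damage by e^m, so the
# number of points needed to double the damage is ln(2) / m.
per_point_factor = math.exp(M_DEFENDER)
doubling_points = math.log(2) / M_DEFENDER

print(predicted_defender_damage(0), per_point_factor, doubling_points)
```

Reading the slope this way is handy in play: under this fit, each point of net strength advantage scales the defender's damage by a constant factor rather than adding a constant amount.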

In [32]:
xs = numpy.linspace(-60,40,121)
plt.plot(xs,numpy.minimum(numpy.exp(reg.coef_[0]*xs + reg.intercept_),100),label = 'fit',alpha = 0.7, color = 'black')
plt.scatter(x,numpy.exp(y1),label = 'actual', alpha = 0.7, marker = '.')
plt.legend()
plt.xlabel('Net Strength Differential')
plt.ylabel('Defender Damage Taken')
plt.title('Melee')
Out[32]:
<matplotlib.text.Text at 0xfd0ae70>

That looks beautiful!

What about attacker damage? We should be able to model that with the same process.

In [33]:
data = meleedf.query('AttackerDmg > 0 and AttackerDmg < 100')
x = data['AttackerNetStr']-data['DefenderNetStr']
y1 = numpy.log(data['AttackerDmg'])
y2 = numpy.sqrt(data['AttackerDmg'])
plt.scatter(x,y1,marker = '.', alpha = 0.7,label = 'log transform')
plt.scatter(x,y2,marker = '.', alpha = 0.7,label = 'sqrt transform')
plt.legend()
plt.ylabel('Transformed Attacker Damage')
plt.xlabel('Net Strength Differential')
Out[33]:
<matplotlib.text.Text at 0xfa92d90>

Again, we'll go with a log transform.

In [34]:
reg = sklearn.linear_model.LinearRegression()
reg.fit(x.values.reshape(-1,1),y1)
reg.coef_, reg.intercept_
Out[34]:
(array([-0.03884319]), 3.3699317974670757)

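Oooh. Compare these coefficients against those found for defender damage taken! The slopes are nearly equal and opposite (0.0379 versus -0.0388) and the intercepts nearly match (3.387 versus 3.370). It indeed looks like the sum of the damage taken is a type of hyperbolic cosine.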
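To see where the hyperbolic cosine comes from, suppose (as an approximation that the two fits only roughly support) that both models share one slope magnitude $m$ and one intercept $b$. Writing $D(x)$ for defender damage and $A(x)$ for attacker damage at strength differential $x$:

$$D(x) + A(x) = e^{mx + b} + e^{-mx + b} = e^{b}\left(e^{mx} + e^{-mx}\right) = 2e^{b}\cosh(mx)$$

Under that assumption the sum is smallest at parity ($x = 0$), where it equals $2e^{b}$, and it grows symmetrically in either direction until the cap of 100 on each side's damage takes over.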

In [35]:
xs = numpy.linspace(-60,40,121)
plt.plot(xs,numpy.minimum(numpy.exp(reg.coef_[0]*xs + reg.intercept_),100),label = 'fit',alpha = 0.7, color = 'black')
plt.scatter(x,numpy.exp(y1),label = 'actual', alpha = 0.7, marker = '.')
plt.legend()
plt.xlabel('Net Strength Differential')
plt.ylabel('Attacker Damage Taken')
plt.title('Melee')
Out[35]:
<matplotlib.text.Text at 0xfde6030>

Visually, this looks like a pretty solid fit.

In [36]:
reg.score(x.values.reshape(-1,1),y1)
Out[36]:
0.9100817324843044

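That $R^{2}$ of about 0.91 looks pretty good! Keep in mind that it is measured on the log-transformed attacker damage, not on the raw damage values.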

There's a bunch of other stuff we can do to tighten this analysis up, but that really depends on what we want to do. In my case, I just want to understand what extra strength really means and how much is necessary. From looking at the model and the graphs, it looks like if I am attacking with a +30 advantage, the defender takes close to full damage. At parity, we both take about 30 points of damage. So, let's just print out a table with attacker and defender damage taken based on our model. I haven't done an explicit hypothesis test on the coefficients, but suffice it to say that the slopes look equal up to a sign change and the intercepts look the same between attacker and defender damage. So, all else being equal, let's take a slope of 0.038 and an intercept of 3.38.

In [37]:
xs = numpy.linspace(-35,35,15)
slope = .038
intercept = 3.38
ydefender = numpy.exp(slope*xs + intercept)
yattacker = numpy.exp(-slope*xs + intercept)
modeldf = pandas.DataFrame({'StrengthDiff': xs, 'DefenderDamage': ydefender, 'AttackerDamage': yattacker}, columns = ['StrengthDiff','DefenderDamage','AttackerDamage'])
modeldf
Out[37]:
StrengthDiff DefenderDamage AttackerDamage
0 -35.0 7.767901 111.052160
1 -30.0 9.393331 91.835598
2 -25.0 11.358882 75.944287
3 -20.0 13.735724 62.802821
4 -15.0 16.609918 51.935367
5 -10.0 20.085537 42.948426
6 -5.0 24.288427 35.516593
7 0.0 29.370771 29.370771
8 5.0 35.516593 24.288427
9 10.0 42.948426 20.085537
10 15.0 51.935367 16.609918
11 20.0 62.802821 13.735724
12 25.0 75.944287 11.358882
13 30.0 91.835598 9.393331
14 35.0 111.052160 7.767901
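Note that the table reports the raw exponential, so the rows at -35 and +35 exceed the 100-point cap. A minimal sketch of applying the cap, and of solving for the differential at which the uncapped model reaches 100, might look like this (it reuses the rounded slope and intercept from the cell above; `full_damage_diff` is a name I made up):

```python
import numpy

slope, intercept = 0.038, 3.38
xs = numpy.linspace(-35, 35, 15)

# Clip the model output to the game's damage range of 0 to 100.
ydefender = numpy.clip(numpy.exp(slope * xs + intercept), 0, 100)
yattacker = numpy.clip(numpy.exp(-slope * xs + intercept), 0, 100)

# Solve exp(slope * x + intercept) = 100 for x.
full_damage_diff = (numpy.log(100) - intercept) / slope
print(full_damage_diff)
```

Under these rounded coefficients the defender model reaches the cap at a little over +32, which is consistent with reading "close to full damage" off the +30 row.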

Why don't you try to model defender damage on range attacks?
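If you want a starting point for that exercise, one option is to wrap the melee steps in a small helper. This is a sketch of how you might organize it, not code from the analysis above: `fit_log_linear` is a hypothetical name, it uses `numpy.polyfit` in place of `sklearn.linear_model.LinearRegression`, and the demo data are synthetic with made-up constants.

```python
import numpy

def fit_log_linear(strength_diff, damage, floor=0, cap=100):
    """Fit log(damage) = m * strength_diff + b, ignoring floor and cap hits."""
    strength_diff = numpy.asarray(strength_diff, dtype=float)
    damage = numpy.asarray(damage, dtype=float)
    keep = (damage > floor) & (damage < cap)
    m, b = numpy.polyfit(strength_diff[keep], numpy.log(damage[keep]), 1)
    return m, b

# Synthetic check: data generated from a known exponential, then capped at 100.
rng = numpy.random.RandomState(1)
diff = rng.uniform(-40, 40, size=400)
noise = rng.normal(0, 0.05, size=400)
dmg = numpy.minimum(100, numpy.exp(0.04 * diff + 3.4 + noise))

m, b = fit_log_linear(diff, dmg)
print(m, b)
```

With the real data you would pass `rangedf['AttackerNetStr'] - rangedf['DefenderNetStr']` and `rangedf['DefenderDmg']`, then compare the resulting slope and intercept with the melee values.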

Oh, one last thing. I didn't do this, but you can certainly try. Can you figure out how much random fluctuation there is? The model we derived is a least-squares best fit, i.e., an average model (the average minimizes squared error). We're not going to get every point exactly correct. The variation around our model is the implicit unexplained variance --- error. Clearly, the game designers added some randomness to combat, but not so much that their underlying model was washed away; rather, just enough that all combats aren't exactly the same. Or it could be that there are other deterministic factors that contribute to the damage taken. If there are, we don't have access to them from the combatlog.csv file. This is reality! From our model's standpoint, everything the model can't explain is "random", unexplained variance.
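One hedged way to put a number on that fluctuation is to look at the residuals of the log-scale fit. The sketch below runs on synthetic data with made-up constants (including the noise level), so the printed values say nothing about the game itself; it only shows the mechanics.

```python
import numpy

# Synthetic stand-in for the filtered melee rows; all constants are assumptions.
rng = numpy.random.RandomState(2)
diff = rng.uniform(-40, 30, size=500)
log_dmg = 0.038 * diff + 3.38 + rng.normal(0, 0.1, size=500)

m, b = numpy.polyfit(diff, log_dmg, 1)
residuals = log_dmg - (m * diff + b)

# ddof=2 because two parameters (slope and intercept) were estimated.
sigma_log = residuals.std(ddof=2)

# On the original damage scale, one sigma is a multiplicative band.
low_factor, high_factor = numpy.exp(-sigma_log), numpy.exp(sigma_log)
print(sigma_log, low_factor, high_factor)
```

With the fitted `reg` from the notebook, the residuals would be `y1 - reg.predict(x.values.reshape(-1,1))`. Plotting them against the strength differential is also a quick check on whether the spread is constant or depends on the matchup.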