Today was the last day of my SF trip visiting my cousin and her family, and on the flight back from SF to New Jersey I was able to get some studying done.
I realized during a convo that some of the transformer math is a little fuzzy in my mind, so I spent time making sure I really understand the math and the different mechanisms used in a transformer, and that I can explain the intuition for why this specific math is used and what benefits it gives the model.
I feel like I now better understand why self-attention works well, and also that it's not particularly special: it's just a way to create a weighted average over the tokens, using learned parameters and the data itself, where the weights come from the pairwise compatibility between tokens. I also revisited a few of the other mechanisms the transformer uses.
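To make that intuition concrete for myself, here's a minimal NumPy sketch of single-head scaled dot-product self-attention (the dimensions and random weights are just toy values, not from any real model): the softmax of the query-key dot products gives the pairwise compatibility weights, and each token's output is simply the compatibility-weighted average of the value vectors.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: each output row is a weighted average
    of the value vectors, weighted by pairwise query-key compatibility."""
    Q = X @ W_q   # queries: what each token is looking for
    K = X @ W_k   # keys: what each token offers
    V = X @ W_v   # values: the content that actually gets averaged
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise compatibility scores
    # softmax over each row so the weights for a token sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V   # weighted average of values per token

# toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4): one averaged value vector per token
```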
I took some more notes, which can be found in this Google Doc: https://docs.google.com/document/d/1HBdClydi9yagHKQcUlLzjf9EMJDC2lgI9AGhp9uP7A0/edit?usp=sharing