Today was the last day of my SF trip visiting my cousin and her family, and on the flight back from SF to New Jersey I was able to get some studying done.
I realized during a convo that some of the transformer math is a little fuzzy in my mind, so I spent time making sure I really understand the math and the different mechanisms used in a transformer, and that I can explain the intuition for why this specific math is used and what benefits it gives the model.
I feel like I now better understand why self-attention works well, and also that it's not particularly special: it's just a way to create a weighted average over the tokens, using learned parameters and the data itself, where the weights come from the pairwise compatibility between tokens. I also revisited a few of the other mechanisms the transformer uses.
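To make that intuition concrete for myself, here's a minimal NumPy sketch of single-head scaled dot-product self-attention (the dimensions and random weights are just toy values, not from any real model): the softmax of the query-key dot products gives the pairwise compatibility weights, and each token's output is simply the compatibility-weighted average of the value vectors.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: each output row is a weighted average
    of the value vectors, weighted by pairwise query-key compatibility."""
    Q = X @ W_q   # queries: what each token is looking for
    K = X @ W_k   # keys: what each token offers
    V = X @ W_v   # values: the content that actually gets averaged
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise compatibility scores
    # softmax over each row so the weights for a token sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V   # weighted average of values per token

# toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4): one averaged value vector per token
```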
I took some more notes, which can be found in this Google Doc: https://docs.google.com/document/d/1HBdClydi9yagHKQcUlLzjf9EMJDC2lgI9AGhp9uP7A0/edit?usp=sharing