Neural Networks
Table of Contents
- 1. Neural Networks
- 1.1. CNN
- 1.2. RNN & LSTM
- 1.3. Neural Networks Scaling Laws
- 1.4. Why Neural Networks can learn almost anything - YouTube
- 1.5. OpenWorm
- 1.6. Activation functions
- 1.7. Dropout in Neural Networks
- 1.8. Learning Rate & Batch Size
- 1.9. Transformers
- 1.9.1. Positional encoding and fourier transform
- 1.9.2. Rotary Embeddings: A Relative Revolution | EleutherAI Blog
- 1.9.3. “Attention”, “Transformers”, in Neural Network “Large Language Models”
- 1.9.4. https://github.com/rasbt/LLMs-from-scratch
- 1.9.5. Transformer Circuits Thread
- 1.9.5.1. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- 1.9.5.2. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- 1.9.5.3. Mapping the Mind of a Large Language Model \ Anthropic
- 1.9.5.4. What Do Neural Networks Really Learn? Exploring the Brain of an AI Model - YouTube
- 1.9.5.5. [2410.14670] Decomposing The Dark Matter of Sparse Autoencoders
- 1.9.5.6. Transformer Circuits Thread - distill.pub
- 1.9.6. Attention in transformers, visually explained | DL6 - YouTube
- 1.10. From Scratch
- 1.10.0.1. Hello Deep Learning - Bert Hubert’s writings
- 1.10.0.2. Diffusion models from scratch
- 1.10.0.3. Create a Simple Neural Network in Python from Scratch - YouTube
- 1.10.0.4. Gradient Descent into Madness - Building an LLM from scratch
- 1.10.0.5. Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch
- 1.11. Diffusion models
- 1.12. Regularization
- 1.13. Superposition Hypothesis
- 1.14. Autoregressive predictors
- 1.15. Neural Networks, Pre-Lenses, and Triple Tambara Modules | Bartosz Milewski’s Programming Cafe
- 1.16. Multiverse Model Compression
- 1.17. Neural circuit policies enabling auditable autonomy
- 1.18. Why neural networks aren’t neural networks - YouTube
- 1.19. We are doing Neural Networks wrong
- 1.20. Neural network design
- 1.21. Evolutionary Tree of LLMs Lineage Genealogy of models
- 1.22. Watch “Manifold Mixup: Better Representations by Interpolating Hidden States” on YouTube
- 1.23. 2305.15586.pdf - Manifold Diffusion Fields
- 1.24. Watch “Introduction to GANs, NIPS 2016 | Ian Goodfellow, OpenAI” on YouTube
1. Neural Networks
1.1. CNN
1.1.2. Pooling + Convolutional
http://databookuw.com/databook.pdf#section.6.5
It is common to periodically insert a Pooling layer between successive convolutional layers in a DCNN architecture.
Its function is to progressively reduce the spatial size of the representation in order to reduce the number of parameters and computation in the network.
- This is an effective strategy to:
  - help control overfitting and
  - fit the computation in memory.
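For concreteness, a minimal PyTorch sketch (layer sizes are illustrative, not from the book) of a convolution followed by max pooling that halves the spatial resolution:

```python
import torch
import torch.nn as nn

# Illustrative conv -> ReLU -> pool block; channel counts are arbitrary.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves height and width
)

x = torch.randn(1, 3, 32, 32)  # a batch with one 32x32 RGB image
y = block(x)
print(y.shape)  # torch.Size([1, 16, 16, 16]) -- spatial size reduced, downstream computation shrinks
```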
1.2. RNN & LSTM
- The Unreasonable Effectiveness of Recurrent Neural Networks
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://ahmetozlu93.medium.com/long-short-term-memory-lstm-networks-in-a-nutshell-363cd470ccac
- Stanford CS 224N | Natural Language Processing with Deep Learning
1.3. Neural Networks Scaling Laws
Has Generative AI Already Peaked? - Computerphile - YouTube
We consistently find across our experiments that, across concepts, the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept (see Fig. 2).
Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially, i.e., we observe a consistent log-linear scaling trend.
We find that this log-linear trend is robust to controlling for correlated factors (similar samples in pretraining and test data [81]) and testing across different concept distributions along with samples generated entirely synthetically [52].
- AI can’t cross this line and we don’t know why. - YouTube
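A rough numerical sketch of what such a log-linear trend means (made-up numbers, not the paper’s data): accuracy is approximately linear in the logarithm of concept frequency.

```python
import numpy as np

# Hypothetical concept frequencies and zero-shot accuracies (illustrative only).
freq = np.array([1e2, 1e3, 1e4, 1e5, 1e6])      # occurrences in the pretraining data
acc = np.array([0.12, 0.25, 0.39, 0.51, 0.64])  # accuracy on that concept

# Log-linear trend: acc ≈ a * log10(freq) + b
a, b = np.polyfit(np.log10(freq), acc, deg=1)
print(f"accuracy gain per 10x more data: {a:.3f}")
# Each 10x increase in concept frequency buys roughly a constant accuracy gain,
# so improving performance on rare concepts needs exponentially more data.
```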
1.5. OpenWorm
OpenWorm is an international open-science project that aims to simulate the roundworm Caenorhabditis elegans at the cellular level.
Although the long-term goal is to model all 959 cells of C. elegans, the first stage is to model the worm’s locomotion by simulating its 302 neurons and 95 muscle cells.
This bottom-up simulation is being pursued by the OpenWorm community.
As of this writing, a physics engine called Sibernetic has been built for the project and models of the neural connectome and a muscle cell have been created in NeuroML format.
A 3D model of the worm anatomy can be accessed through the web via the OpenWorm browser.
The OpenWorm project is also contributing to the development of Geppetto, a web-based, multi-algorithm, multi-scale simulation platform engineered to support the simulation of the whole organism.
1.5.1. OpenWorm GitHub
1.6. Activation functions
1.6.1. ReLU as folding
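A tiny illustration of the folding picture (my own sketch, not taken from a linked source): two ReLUs implement \(|x|\), which folds the negative half-line onto the positive one; composing such folds is how deep ReLU networks cut input space into many linear regions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-2.0, 2.0, 9)

# |x| = ReLU(x) + ReLU(-x): the layer "folds" the line at the origin.
folded = relu(x) + relu(-x)
print(np.allclose(folded, np.abs(x)))  # True
```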
1.7. Dropout in Neural Networks
1.8. Learning Rate & Batch Size
- How does Batch Size impact your model learning | by Devansh | Geek Culture | Medium
- Don’t Decay the Learning Rate, Increase the Batch Size - 1711.00489v2.pdf
  - Decaying the learning rate is simulated annealing
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour - 1706.02677v2.pdf
  - Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
- Why Small Batch sizes lead to greater generalization in Deep Learning | by Devansh | Geek Culture | Medium
- [1609.04836] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
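A minimal sketch of the Linear Scaling Rule (the base values are assumptions for illustration; Goyal et al. also pair the rule with a gradual warmup):

```python
# Linear Scaling Rule: when the minibatch size is multiplied by k,
# multiply the learning rate by k. Base values below are illustrative.
base_lr = 0.1
base_batch_size = 256

def scaled_lr(batch_size: int) -> float:
    k = batch_size / base_batch_size
    return base_lr * k

for bs in (256, 512, 1024, 8192):
    print(bs, scaled_lr(bs))
# 256 -> 0.1, 512 -> 0.2, 1024 -> 0.4, 8192 -> 3.2 (the ImageNet-in-1-hour setting)
```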
1.9. Transformers
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time.
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
- The Illustrated GPT-2 (Visualizing Transformer Language Models) – Jay Alammar – Visualizing machine learning one concept at a time.
- The Annotated Transformer
- Categories | Ketan Doshi Blog
- Transformers from scratch | peterbloem.nl
multi-head attention: each head captures a “type” of relation
mary, gave, roses, to, susan → who gave the roses? and who received them? You need two attention heads for that (see the sketch at the end of this list).
- poloclub.github.io/transformer-explainer/
- Transformers from Scratch
- Transformers Explained Visually - Overview of Functionality | Ketan Doshi Blog
- Transformers Explained Visually - How it works, step-by-step | Ketan Doshi Blog
- Transformers Explained Visually - Multi-head Attention, deep dive | Ketan Doshi Blog
- Transformers Explained Visually - Not just how, but Why they work so well | Ketan Doshi Blog
- Transformer Models 101: Getting Started — Part 1 | by Nandini Bansal | Feb, 2023 | Towards Data Science
- https://projector.tensorflow.org/
- openai/transformer-debugger
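To go with the multi-head attention note above, a minimal self-contained sketch of scaled dot-product attention with several heads (NumPy, random weights standing in for learned ones; illustrative, not any particular model’s code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=2, rng=np.random.default_rng(0)):
    """X: (seq_len, d_model). Each head gets its own random Q/K/V projections."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)   # (seq_len, seq_len) relevance scores
        weights = softmax(scores, axis=-1)   # each row: attention over all tokens
        heads.append(weights @ V)            # each head mixes the values its own way
    return np.concatenate(heads, axis=-1)    # (seq_len, d_model)

# Toy "sentence" of 5 token embeddings: mary gave roses to susan
X = np.random.default_rng(1).normal(size=(5, 8))
print(multi_head_attention(X, n_heads=2).shape)  # (5, 8)
```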
1.9.1. Positional encoding and fourier transform
Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad’s Blog
Fourier transform is ubiquitous, but I have a “theory” that angle encoding in quantum machine learning could’ve been the source of inspiration for positional encoding
- Master Positional Encoding: Part I | by Jonathan Kernes | Towards Data Science
- Master Positional Encoding: Part II | by Jonathan Kernes | Towards Data Science
- Fourier Feature Encoding
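For reference, a minimal sketch of the sinusoidal positional encoding from “Attention Is All You Need” (interleaved sines and cosines at geometrically spaced frequencies, which is where the Fourier-feature connection comes from):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16) -- added to the token embeddings before the first attention layer
```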
1.9.2. Rotary Embeddings: A Relative Revolution | EleutherAI Blog
Used by GPT-J
1.9.6. Attention in transformers, visually explained | DL6 - YouTube
The attention layers “nudge” the current token’s embedding, step by step, towards a vector that encodes the most probable next token.
1.10. From Scratch
1.10.0.2. Diffusion models from scratch
1.10.0.4. Gradient Descent into Madness - Building an LLM from scratch
- Building an LLM from Scratch: Automatic Differentiation (2023) | Hacker News
1.10.0.5. Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch
- Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch | Hacker News
1.11. Diffusion models
1.11.1. The Physics Principle That Inspired Modern AI Art | Quanta Magazine → Stable Diffusion explained
1.12. Regularization
1.12.2. Lp space - Wikipedia: L0, L1, L2
\(L_\infty\) is the max function: \(\|x\|_\infty = \max_i |x_i|\)
\(L_0\) is the number of non-zero entries; taking the \(L_0\) “norm” of the difference of two vectors gives the Hamming distance
The level set \(\|\theta\|_0 = k\), with \(k \le\) the parameter dimension, is a union of lines, planes, cubes, hypercubes, …: the coordinate subspaces in which all but \(k\) parameters are set to zero
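A quick numerical check of these norms on a sparse vector (plain NumPy, nothing specific to the linked pages):

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(x)      # number of non-zero entries -> 2
l1 = np.abs(x).sum()          # sum of absolute values     -> 7.0
l2 = np.sqrt((x ** 2).sum())  # Euclidean length           -> 5.0
linf = np.abs(x).max()        # largest absolute entry     -> 4.0

# Hamming distance via the L0 "norm" of the difference of two vectors
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
hamming = np.count_nonzero(a - b)  # -> 2
print(l0, l1, l2, linf, hamming)
```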
1.12.2.1. Compressed sensing - Wikipedia
1.13. Superposition Hypothesis
How might LLMs store facts | Chapter 7, Deep Learning - YouTube (Superposition Hypothesis)
- The Johnson-Lindenstrauss bound for embedding with random projections — scikit-learn 1.6.0 documentation
- Johnson–Lindenstrauss lemma - Wikipedia
You can encode \(\sim \exp(c\,\epsilon^2 N)\) vectors that are pairwise almost perpendicular (to within an error of \(\epsilon\)) in an \(N\)-dimensional space.
- Compressed sensing. In general, if one projects a vector into a lower-dimensional space, one can’t reconstruct the original vector. However, this changes if one knows that the original vector is sparse. In this case, it is often possible to recover the original vector.
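A small numerical illustration of this (my own, plain NumPy): random unit vectors in a high-dimensional space are nearly orthogonal, so many more than \(N\) feature directions can coexist with only small interference, which is the geometric intuition behind superposition.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000   # ambient dimension
M = 10000  # many more candidate feature directions than dimensions

V = rng.normal(size=(M, N))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # random unit vectors

# Pairwise cosine similarities of a subsample (a full M x M matrix would be large)
sample = V[:500]
cos = sample @ sample.T
off_diag = cos[~np.eye(len(sample), dtype=bool)]
print(f"max |cos| between distinct vectors: {np.abs(off_diag).max():.3f}")
# Typically on the order of 0.1 for N = 1000: thousands of directions share
# 1000 dimensions while staying almost perpendicular to one another.
```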
1.14. Autoregressive predictors
1.16. Multiverse Model Compression
https://multiversecomputing.com/
Model Compression: https://paperswithcode.com/task/model-compression
https://github.com/HuangOwen/Awesome-LLM-Compression
They use better model compression: https://multiversecomputing.com/papers/compactifai-extreme-compression-of-large-language-models-using-quantum-inspired-tensor-networks
https://arxiv.org/abs/2401.14109
The mechanism is the Tensor Network: https://en.wikipedia.org/wiki/Tensor_network
They show that LLMs do not have to get larger to get better. Instead, they can perform just as well with just a fraction of the parameters. Moreover, the article also presented a novel compression technique that, unlike previous methods, is controllable and explainable.
They can also edit the network (Knowledge Editing) and explain its safety (e.g. remove a concept from it)
1.16.3. Tensor Train (TT) Decomposition
This is the concept to look for.
It is also called a Matrix Product State (MPS).
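A minimal sketch of a tensor-train / MPS decomposition built from successive SVDs (my own NumPy illustration; CompactifAI’s actual procedure reshapes LLM weight matrices into higher-order tensors and truncates the bond dimensions, which this toy version only hints at):

```python
import numpy as np

def tensor_train(T, max_rank=8):
    """Decompose a d-way tensor into a chain of 3-way TT/MPS cores via repeated SVD."""
    dims = T.shape
    cores, rank = [], 1
    mat = T.reshape(rank * dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))                        # truncate the bond dimension
        cores.append(U[:, :r].reshape(rank, dims[k], r))
        rank = r
        mat = (np.diag(S[:r]) @ Vt[:r]).reshape(rank * dims[k + 1], -1)
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores

def contract(cores):
    """Rebuild the full tensor by contracting neighbouring cores."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

T = np.random.default_rng(0).normal(size=(4, 4, 4, 4))
cores = tensor_train(T, max_rank=16)
print(np.allclose(contract(cores), T))  # True here; lowering max_rank trades accuracy for compression
```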
1.17. Neural circuit policies enabling auditable autonomy
1.18. Why neural networks aren’t neural networks - YouTube
https://youtu.be/CfAL_cL3SGQ
They are just alternating linear maps and elementwise non-linear transformations, applied in sequence.
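Stated as code, the whole forward pass is just this alternation (a minimal NumPy sketch with arbitrary layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three layers: each is a linear map (W, b) followed by an elementwise non-linearity.
layers = [(rng.normal(size=(8, 16)), np.zeros(16)),
          (rng.normal(size=(16, 16)), np.zeros(16)),
          (rng.normal(size=(16, 4)), np.zeros(4))]

def forward(x):
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)  # linear transformation, then ReLU
    return x

print(forward(rng.normal(size=(1, 8))).shape)  # (1, 4)
```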
1.19. We are doing Neural Networks wrong
https://link.medium.com/97kwPdKMSkb
- Artificial NNs are too simple (it takes 1000 artificial neurons to simulate one biological neuron at the single-spike timescale)
- We do not know how they work
- We could compensate for the lack of complexity in artificial neurons with larger models, tons of computing power, and gigantic datasets, but that’s too inefficient to be the eventual last step of this quest.
- ANNs should be more neuroscience-based for two reasons:
  - (future) the difference in complexity between biological and artificial neurons will result in differences in outcome — AGI won’t come without a reform
  - (present) the inefficiency with which we’re pursuing this goal is damaging our society and the planet
1.20. Neural network design
1.21. Evolutionary Tree of LLMs Lineage Genealogy of models
- [2307.09793] On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
- Constellation
- llm evolutionary tree - Google Search
- Mooler0410/LLMsPracticalGuide: A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers)
- https://www.reddit.com/r/MachineLearning/comments/13wkcn3/d_llm_evolutionare_tree_from_the_practical_guides/
- LLM Evolutionary Tree. LLM Proliferation. – blog.biocomm.ai
1.22. Watch “Manifold Mixup: Better Representations by Interpolating Hidden States” on YouTube
https://youtu.be/1L83tM8nwHU
Softer decision boundaries by mixing up (interpolating) pairs of hidden states and their labels
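A minimal sketch of the Manifold Mixup step (illustrative NumPy; in the paper the interpolation is applied at a randomly chosen hidden layer during training, and the mixed pair is then trained on as usual):

```python
import numpy as np

rng = np.random.default_rng(0)

def manifold_mixup(h1, h2, y1, y2, alpha=2.0):
    """Interpolate a pair of hidden states and their one-hot labels."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient ~ Beta(alpha, alpha)
    h_mix = lam * h1 + (1 - lam) * h2  # mixed hidden representation
    y_mix = lam * y1 + (1 - lam) * y2  # mixed (soft) label
    return h_mix, y_mix

h1, h2 = rng.normal(size=16), rng.normal(size=16)    # hidden states of two examples
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # their one-hot labels
h_mix, y_mix = manifold_mixup(h1, h2, y1, y2)
print(y_mix)  # a soft label between the two classes; training on these smooths decision boundaries
```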