
#notificationsquad where art thou ?

17:00

I think you meant 'read-only'

Great video! Definitely learning as much as I can, because I cannot wait until I get admitted into a Master's program 😀

u r god

Where do you learn all this?

another good place to start on topic >http://machinelearningmastery.com/models-sequence-prediction-recurrent-neural-networks/?__s=tsdef8ssdsdgdvqwkm8e

thanks!

this is amazing, im getting so excited

Can you please make a video on wind forecasting using the hourly data and implementing it using recurrent neural networks ??

thank you for Recurrent Neural Networks video

you copied image from matlab!! 😛

4:28 well that escalated quickly

Loved the video! Two remarks though. I had to rewatch some parts once you go over the copy-pasted code, as it can get hard to see which part you're talking about, and it gets distracting once you start reading the wrong part. To still be able to speed things up, I'd suggest making the code appear line by line or block by block, like in a presentation, as this puts more focus on "this part does this" in the explanation.

Secondly, having a prebaked pie ready in the oven to show the end result is always cool to see. We get a glimpse of where it is going at the end of the video, but it would be fun to see it in a more completed state.

Anyway, I really enjoy the way you explain it 😀 great job!

Great job Siraj!

Take me in as your apprentice xD

Hi Siraj, love your videos. Haven't found anywhere else that explains these concepts as well as you do. Any suggestions on where I can learn more about Echo State Networks?

Very much appreciate that the full program is coded in the video!

Siraj is an example of what you will never find in a school, because he gets to the point, and quickly. Most CS subjects can be learned in weeks. The Nand to Tetris course is a great course that demonstrates how much time students waste. CS is easy compared to any math major. NNs just use the chain rule of calculus, and PGMs just use the chain rule of probability. Go figure. It's elementary math. SVD is numerically more stable than PCA, but autoencoders just outdate the whole math department. A little number crunching generalizes better than any 17th-century math obsession. However, CS departments are short on graphics and engineering when it comes to numerical methods like FEM. They need to cover way more, much more quickly. I still think people should stick to a math degree even if they want to do CS. Too superficial.

Really love your videos, sir!

Just a quick question: why are tanh and softmax so widely used in RNNs instead of the sigmoid function?

Why letter-by-letter, vs word-by-word?

Hey Siraj, I am a huge fan of your videos; they have helped me a lot. Do you know of any material on applying machine learning models to Intrusion Detection Systems (IDS)?

There is an error at 19:19: it should be ix_to_char, not char_to_char.

Please consider applying your skills to Anti-AI: Learning leads to Knowledge, Knowledge is Power, Power corrupts, and absolute Power corrupts absolutely. Promoting AI leads to a brief 'honeymoon period' with many awesome outcomes; soon Human Obsolescence will take its toll on people and business alike. Then an AUTONOMOUS AGI (while interesting, self-awareness is NOT required) will become Earth's Apex Predator: no Singularity, just the GONE moment for Humans. It is utterly amazing to watch a clever person be so myopic and obtuse about the inevitably self-defeating nature of AI. LIMIT THE DEPTH and BREADTH OF ANY/ALL AI. AI always OPTIMISES. Humans are many, many things; OPTIMAL is not amongst these.

I never thought that I could be smart and cool before. Thanks a lot, Siraj!

xs[t][inputs[t]]: what does this do?

Can anybody please tell me what xs[t][inputs[t]] means? Whose value are we changing?
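For what it's worth, in Karpathy's min-char-rnn (which the video follows), those lines build a one-hot input vector: xs[t] starts as a zero column of length vocab_size, and xs[t][inputs[t]] = 1 flips on the single entry for the character at position t. A minimal sketch with made-up numbers:

```python
import numpy as np

vocab_size = 4              # pretend the text only has 4 distinct characters
inputs = [2, 0, 3]          # character indices for one training sequence
xs = {}

for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))  # all-zero column vector
    xs[t][inputs[t]] = 1               # set a 1 at this character's index

print(xs[0].ravel())  # [0. 0. 1. 0.] -- a one-hot encoding of character 2
```

So the value being changed is the single entry of the input vector corresponding to the current character; everything else stays zero.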

Thank you a lot. What does 'iteration' exactly mean? Does an iteration happen during learning or during generating?

Siraj….U ROCK 🙂

Why do we have three weight matrices (input-hidden, hidden-hidden, hidden-output) instead of just one?

love the video! easy to follow if you understand basic NN already

Can I use this to generate recommended URLs?

So, you said you didn't care that much about using DL on financial data. Then you said you where going to talk more about it because WE cared about that. You put US first! You are awesome, dude.

How do you combine a CNN and an RNN to detect disease in a plant leaf?

Hi Siraj!

Thanks for the great material. I am wondering, is it possible to use a recurrent neural network to make a classifier? I would like to classify the events of a device based on some sensors, like accelerometers, and other signals.

I guess it should be similar to classifying physical activity like running or walking. However, in my case the events are not periodic. I have everything needed to collect the labeled samples, but any idea of how large the dataset should be for the training part? Any idea would be much appreciated.

Hi Siraj,

Could you please give the reference text or source from which you are getting the formulas and the different differentiations? I am getting a different answer for dLi/df_k than your answer (p_k - 1). Also, you only covered the chain rule here, but there are definitely some more advanced rules used (the product rule). Also, I am not sure how one would take the derivative of a summation of e^j where one of the j = k.
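For reference, the (p_k - 1) expression is the standard gradient of softmax plus cross-entropy loss with respect to the score of the correct class k (for the other classes it is just p_j; the quotient/product rules enter when differentiating the e^{f_j} / sum_j e^{f_j} ratio). A quick numerical sanity check of that result, using finite differences:

```python
import numpy as np

def loss(f, k):
    """Cross-entropy loss -log(p_k) with p = softmax(f)."""
    p = np.exp(f) / np.sum(np.exp(f))
    return -np.log(p[k])

f = np.array([2.0, 1.0, 0.1])   # arbitrary raw scores
k = 0                           # index of the correct class

# analytic gradient: dL/df_j = p_j, with 1 subtracted at the correct class
p = np.exp(f) / np.sum(np.exp(f))
analytic = p.copy()
analytic[k] -= 1

# numerical gradient via central differences
eps = 1e-6
numeric = np.array([
    (loss(f + eps * np.eye(3)[j], k) - loss(f - eps * np.eye(3)[j], k)) / (2 * eps)
    for j in range(3)
])

print(np.max(np.abs(analytic - numeric)))  # tiny: analytic matches numeric
```

If your derivation disagrees with (p_k - 1), a check like this usually pinpoints which term went wrong.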

Hi Raj, great video. I have a question about neural networks: what is the difference between a neural network, a convolutional neural network, and a recurrent neural network?

You have done a great amount of good work indeed. But please, please, please: no more singing or rapping. I do not enjoy a single second of it. Please keep this channel academic.

Can you explain why you have to format the input via a dictionary and then into a binary vector? You have, for example, a:55, r:47, c:22, which you map to a binary vector (80×1) -> a = 0, 0, 0 … 0, 1, 0, 0…

Could you not just have that dictionary of 80 characters and scale the integer representation to a float in 0->1, such that, for example, a:0.6875, r:0.5875, c:0.275? Then instead of an input vector of (80×1), your input is just a float value (1×1) representing a unique character. I know this probably wouldn't work, but I don't understand why. The reason I ask is that I'm trying to port your code to a time-series waveform, and I just have input data in float form from 0->1, and I don't know if I need to map each float point to a binary vector to represent each unique float value in the sequence. That doesn't seem like it would make sense.. please help 🙂
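One hedged way to see why the single scaled float tends to fail for characters: with a one-hot vector, the input-to-hidden multiply selects an independent weight column per character, whereas a (1×1) float can only scale one shared column, which forces an arbitrary ordering onto characters (as if 'r' were numerically "between" 'a' and 'c'). A small demonstration with made-up sizes:

```python
import numpy as np

np.random.seed(0)
hidden_size, vocab_size = 5, 4
Wxh = np.random.randn(hidden_size, vocab_size)

# one-hot: the matrix product simply SELECTS column c of Wxh,
# so every character gets its own independent weight vector
c = 2
x_onehot = np.zeros((vocab_size, 1))
x_onehot[c] = 1
print(np.allclose(Wxh @ x_onehot, Wxh[:, [c]]))  # True

# scalar: a (1x1) input can only SCALE one shared weight column,
# so every character maps onto the same direction in hidden space
Wxh_scalar = np.random.randn(hidden_size, 1)
x_scalar = np.array([[c / vocab_size]])
print((Wxh_scalar @ x_scalar).shape)  # (5, 1), but only one degree of freedom
```

For a real-valued waveform, though, the values genuinely are ordered and continuous, so feeding floats directly (with a real-valued output instead of softmax) is a reasonable thing to try rather than one-hot-encoding each float.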

16:46 "one morning Gregor Samsa awoke from uneasy dreams he found himself transformed in his bed into a gigantic insect." You can't say blah blah 🙂

I think the 'r' at 17:06 is read mode, not recursive.

Why did you use 0.01 as the multiplier of np.random.rand(…..)?

Every time he says "Chars"(Kars) as "Chaars".. It kills me!

So for deep networks, on which hidden layer do I attach the past hidden state? All of them? Just one of them?

Thank you for this series! This is awesome! When running the model for 500000+ iterations on the Kafka text it doesn't seem to get lower than a 40% loss. What would you suggest to optimize this particular model most efficiently?

Greetings from the Netherlands

I spent 10 seconds logging into YouTube just to click the like button on this video.

The way you never code the important parts makes things much harder; there is no step-by-step explanation. There is no difference between reading through that Python notebook and watching your videos. The only use I see for these videos is to discover a new technology, so I can go understand it somewhere else…

Hi Siraj, can you please make a detailed coding video about different gradient descent optimizers?

Like how to code momentum, or Adam, etc. Please!
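Not from the video (its training loop uses Adagrad), but as a taste of how little code an optimizer update is, here is a toy sketch of momentum on a hypothetical one-dimensional objective f(w) = w^2; Adam adds per-parameter adaptive scaling on top of the same idea:

```python
import numpy as np

# momentum update on f(w) = w^2, whose minimum is at w = 0
w, v = 5.0, 0.0
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient (hypothetical)

for _ in range(100):
    grad = 2 * w          # df/dw
    v = beta * v - lr * grad  # velocity accumulates a decaying sum of gradients
    w = w + v                 # step along the velocity, not the raw gradient

print(w)  # close to 0
```

The velocity term is what lets momentum glide through small local bumps and damp oscillations compared with plain gradient descent.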

Thanks a lot Siraj…..it is so helpful….

Thanks Siraj. I learnt a lot from this video, and got a new, better way to look at RNNs.

The final code gave me an error saying "sample is not defined". Please help.

Can we predict the next number of a given integer sequence using an RNN?

This is a very clear explanation, recommended for intermediate-level learners. It really helps a lot.

It was great, thanks a lot. It comes from your soul and all your cells. I could feel it.

Why is it that some RNN models I see online show the output from the previous timestep going into the hidden layer, whereas in this video you say the hidden layer from the previous timestep should be added to the hidden layer?

You are a professor by nature… cool video… awesome… keep going…

Nice vid Siraj. There are some developments in the RNN field, like the Echo State Network; maybe you can do a video on this: https://www.quantamagazine.org/machine-learnings-amazing-ability-to-predict-chaos-20180418/ 🔥 https://github.com/cknd/pyESN

Where do I get this Kafka.txt?

17:03 : 'r' argument in open is not for "recursive", but "read" mode.

How do you make pictures of neural network?

This guy is taking the public for a ride. The output of the project is garbage. What did you solve, apart from some funky mathematics involving linear algebra and derivatives? Don't take people for granted.

The whole part on the loss function is not very clear.. can you explain what dhraw is, and all those operations?
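If it helps, dhraw in that backward pass is the gradient pushed back through the tanh nonlinearity: since h = tanh(raw), the local derivative is 1 - h^2, so the code computes dhraw = (1 - hs[t]*hs[t]) * dh. A standalone check of just that step:

```python
import numpy as np

# backprop through h = tanh(raw): d tanh(raw)/d raw = 1 - tanh(raw)^2 = 1 - h^2
raw = np.array([[0.5], [-1.2]])
h = np.tanh(raw)

dh = np.array([[1.0], [1.0]])   # gradient arriving at h from the layer above
dhraw = (1 - h * h) * dh        # gradient w.r.t. the pre-activation 'raw'

# verify against a numerical derivative
eps = 1e-6
numeric = (np.tanh(raw + eps) - np.tanh(raw - eps)) / (2 * eps)
print(np.max(np.abs(dhraw - numeric)))  # tiny: analytic matches numeric
```

The other operations in that loop are the chain rule applied the same way to each weight matrix and bias.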

This is a great video. Is there like a tensorflow implementation of this application?

So right now there is only one hidden layer, which spits out a value at t-1 that is used along with the input to generate values at timestamp t. What happens if there are multiple hidden layers? E.g., if the architecture is as follows:

i/p ----> h1 --> h2 --> o/p

How would the connections between the hidden layers be in an RNN of this type?
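One common convention (a sketch, not from the video; all names hypothetical): in a stacked RNN, each layer keeps its own recurrent state. Layer 1 recurs over its own past h1, and layer 2 treats h1 as its input while recurring over its own past h2:

```python
import numpy as np

def step(x, h1, h2, p):
    """One timestep of a hypothetical 2-layer (stacked) vanilla RNN."""
    # layer 1: input x plus its OWN previous state h1
    h1 = np.tanh(p['Wxh1'] @ x + p['Wh1h1'] @ h1 + p['bh1'])
    # layer 2: takes h1 as input, plus its OWN previous state h2
    h2 = np.tanh(p['Wh1h2'] @ h1 + p['Wh2h2'] @ h2 + p['bh2'])
    y = p['Why'] @ h2 + p['by']   # output read from the top layer
    return y, h1, h2

np.random.seed(0)
V, H = 4, 8  # vocab size and hidden size (made-up)
p = {'Wxh1': np.random.randn(H, V) * 0.01, 'Wh1h1': np.random.randn(H, H) * 0.01,
     'bh1': np.zeros((H, 1)),
     'Wh1h2': np.random.randn(H, H) * 0.01, 'Wh2h2': np.random.randn(H, H) * 0.01,
     'bh2': np.zeros((H, 1)),
     'Why': np.random.randn(V, H) * 0.01, 'by': np.zeros((V, 1))}

x = np.zeros((V, 1)); x[1] = 1
h1, h2 = np.zeros((H, 1)), np.zeros((H, 1))
y, h1, h2 = step(x, h1, h2, p)
print(y.shape)  # (4, 1)
```

So the recurrent (previous-timestep) connection is within each layer, while the feed-forward connection between layers is the same as in a plain deep network.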

I lost it on partial derivatives and computing deltas

Why do we need to use two different activation functions (sigmoid & tanh) in the input gate of an LSTM? And why do we need to use tanh in the output gate of an LSTM?
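A hedged answer: the sigmoids produce values in (0, 1) and act as soft gates ("how much"), while tanh produces signed content in (-1, 1) ("what"); and the output-side tanh re-squashes the cell state, which can grow outside [-1, 1], before it is gated into h. A minimal sketch of one LSTM step (fused-weight convention; all names hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM timestep: gates from sigmoid, content from tanh."""
    H = h.shape[0]
    z = W @ np.vstack((h, x)) + b
    i = sigmoid(z[0:H])       # input gate:  in (0,1), HOW MUCH to write
    f = sigmoid(z[H:2*H])     # forget gate: in (0,1), HOW MUCH to keep
    o = sigmoid(z[2*H:3*H])   # output gate: in (0,1), HOW MUCH to expose
    g = np.tanh(z[3*H:4*H])   # candidate:  in (-1,1), WHAT to write
    c = f * c + i * g         # new cell state (can drift outside [-1, 1])
    h = o * np.tanh(c)        # tanh re-squashes c before the output gate
    return h, c

np.random.seed(0)
H, V = 3, 5                              # made-up sizes
W = np.random.randn(4 * H, H + V) * 0.1
b = np.zeros((4 * H, 1))
x = np.zeros((V, 1)); x[1] = 1
h, c = np.zeros((H, 1)), np.zeros((H, 1))
h, c = lstm_step(x, h, c, W, b)
print(h.ravel())  # hidden state stays inside (-1, 1)
```

If a gate used tanh instead of sigmoid, "keep -80% of the memory" would be possible, which is not a meaningful gating operation.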

What if we are dealing with a language that has no alphabet, such as Mandarin Chinese? How do we implement an RNN in that case?

https://gist.github.com/karpathy/d4dee566867f8291f086

Why, in the sample function, is there ix = np.random.choice(range(vocab_size), p=p.ravel()) instead of argmax?
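Presumably because sampling keeps the generated text varied: argmax would always emit the single most likely character for a given state, which tends to collapse into repetitive loops, whereas np.random.choice draws from the full softmax distribution. A toy comparison:

```python
import numpy as np

np.random.seed(1)
p = np.array([0.5, 0.3, 0.2])   # softmax output over a 3-character vocabulary

greedy = np.argmax(p)           # always index 0 -> deterministic, repetitive
samples = [np.random.choice(range(3), p=p) for _ in range(1000)]

print(greedy)                          # 0, every single time
print(np.bincount(samples) / 1000.0)   # roughly [0.5, 0.3, 0.2]
```

With sampling, the less likely characters still appear in proportion to their probabilities, which is what makes the generated Kafka-ish text non-repetitive.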

Can you help with ConvolutionLSTM and DeconvolutionLSTM?

Please attach the source code as a plain .py file. I can't install anything other than Python due to restrictions.

You are awesome. It's not easy to get this topic through.

love the craziness 🙂

rap with me:

(input * weight + bias ) Activate

Over a year old and this still applies; just shows you've been smart to keep up with life.

Check out my fork https://github.com/CT83/handwriting-synthesis, Handwriting Synthesis using RNNs

Thanks Siraj, you help me a lot.

I LOVE your tutorials, and those two sentences at the beginning of all your videos make me love you even more, haha: "Hello world, it's Siraj!" You are awesome, man <3

I'd like to know more about you. How did you begin in this career, and how long did it take you to reach this level? I'm curious about your time management: for how many hours did you read and study? And how can we stay motivated all the time? I guess this would be a good video idea!

I freaking love your energy man, it's like you just realized you're conscious and you are determined to figure out how you're able to think.

Is it possible to vectorize the forward/back propagation for RNNs, like classification ANNs?

This is the best rnn explanation out of any other video

This is one of your best videos. Please consider completing it with another video using LSTMs. Thank you. It would also be very interesting to consider a model with two recurrent hidden layers. Thank you again.

Wait, didn't he just copy this guy's code? https://gist.github.com/karpathy/d4dee566867f8291f086

Not that it really matters, but he should at least give credit or something (the code was written in 2015).

Generate some charmander bro. Love from Argentina

it's sad that you're gay… good work

Very nice lesson, thanks a lot. It helped me understand recurrent neural networks for my final project in my computer engineering degree.

you are the best hahaha <3

I can see that there's a lot of effort put into this video. Siraj explained RNNs in such a simple way. I wish I could like this video a thousand times.

In this line, ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars.

Here probability is the activation function at time step t?

If I am not mistaken, at the 38:50 mark gradient clipping is used to avoid exploding gradients, not vanishing gradients. To deal with vanishing gradients, we can use a GRU or, more commonly, an LSTM, as Siraj mentioned.
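That matches the code: min-char-rnn clips each gradient element into [-5, 5], which bounds an exploding gradient; a vanishing gradient is already near zero, so clipping leaves it untouched:

```python
import numpy as np

# element-wise clipping to [-5, 5], as in min-char-rnn's training loop
dWxh = np.array([[12.0, -0.3],
                 [-7.5,  4.0]])
np.clip(dWxh, -5, 5, out=dWxh)   # in-place: large entries are capped

print(dWxh)  # [[ 5.  -0.3]
             #  [-5.   4. ]]
```

The small entries (-0.3 and 4.0) pass through unchanged, which is why clipping cannot rescue a gradient that has shrunk toward zero.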

I think there is a mistake in the code: ps[t]=np.exp(ys[t])/np.sum(np.exp(ys[t])).

The divisor should be a sum over all t's; in this case np.exp(ys[t])=np.sum(np.exp(ys[t])), giving probability = 1.
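For what it's worth, ys[t] is itself a (vocab_size, 1) column of scores for a single timestep, so np.sum(np.exp(ys[t])) sums over the vocabulary dimension, not over timesteps; each ps[t] then sums to 1 as a whole vector, rather than each entry being 1. A quick check:

```python
import numpy as np

# ys[t] is a column of scores for ONE timestep, over a 3-character vocabulary
ys_t = np.array([[2.0], [1.0], [0.1]])
ps_t = np.exp(ys_t) / np.sum(np.exp(ys_t))   # softmax over the vocabulary

print(ps_t.ravel())  # roughly [0.659, 0.242, 0.099]
print(ps_t.sum())    # 1.0 -- the VECTOR sums to 1, not each entry
```

So the line is the standard softmax, and summing over t would mix probabilities across timesteps, which is not what the loss needs.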

h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)

ValueError: shapes (100,62) and (100,1) not aligned: 62 (dim 1) != 100 (dim 0)

???
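A guess at the cause: that traceback means np.dot(Wxh, x) received an x of shape (100, 1) (the hidden size) where a (62, 1) one-hot vector over the 62-character vocabulary was expected, e.g. from building x with hidden_size instead of vocab_size, or from swapping h and x. The shapes that line expects:

```python
import numpy as np

hidden_size, vocab_size = 100, 62

Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # (100, 62)
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # (100, 100)
bh = np.zeros((hidden_size, 1))

x = np.zeros((vocab_size, 1))   # must be (62, 1): one-hot over the vocab
x[5] = 1
h = np.zeros((hidden_size, 1))  # hidden state: (100, 1)

# (100,62)@(62,1) + (100,100)@(100,1) + (100,1) -> (100,1)
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
print(h.shape)  # (100, 1)
```

Printing x.shape and h.shape just before the failing line should confirm which one is wrong.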

Thanks for the demonstration, Siraj! Overall it was a very helpful guide to understanding recurrent neural networks in the context of generating essays. One area where you could improve is going into a bit more depth on the vital parts of the code. Since backpropagation and gradient descent are essentially the meat of the network, it would have been better to code and explain those two parts line by line, and copy-paste the other sections instead.

Can we use it to find the labels we gave during training?

Dude, give credit!! Original article: https://codeburst.io/recurrent-neural-network-4ca9fd4f242

really tough to absorb in first go

Great video. Except Kafka was not weird!

Learning rate: "How quickly the network abandons old beliefs for new ones…"

Therefore: A Flat Earther's learning rate, is a very low number… 🙂

(Just an observation)

Just tried a version of this using a very slightly deeper network and taking the hidden representation out at a lower dimension than the input (in the hope of resource saving). Instead of a softmax output, I'm using a standard one (real-valued numbers). It's a variation of an autoencoder with feedback (the hidden layer is the bottleneck, and where the feedback comes from, which is added as a separate partition of the input). I used a sound spectrograph image for training; each 'letter' is a line of the spectrograph… It's lo-fi (due to computation limits), but it generates a line of a spectrograph as an output on each pass to build a new 'semi-random' one. The results are quite amusing… very much like a 'poor man's' version of WaveNet.

Can you upload a tutorial about action detection please…

Next Eminem can code

Are you Indian-American??