Let’s start by establishing some context. Earlier this year, I started a new YouTube channel called Making it in Africa. The whole idea of the channel was to attempt to build a multi-million dollar company in Africa in 2-3 years, starting with $0, and document the journey. As far as I could tell, there was barely any documentation of successful startups showing what they did to reach their level of success. In retrospect, I can see why. Building a startup is hard, and making YouTube videos consistently is hard. Both require immense time investments, and if you are doing everything on both ends, one quickly starts to eat into the time of the other. So you face a dilemma: focus on documenting the journey and slow it down, or focus on actually working on the business. Earlier on, I picked the former, but now I am more inclined towards the latter, which explains why the channel has had much less activity lately. I still try to post at least once a month, but it is much harder to make guarantees now, at least until I get someone to help out with the editing. Shameless plug: if you’d like to help out, feel free to email me: [email protected].

The challenges

There are a couple of challenges when it comes to creating content for the YouTube channel. I do not own a camera; I borrow one from a friend and fellow YouTuber. I actually don’t have any equipment, except for a Blue Yeti microphone and an old tripod from the 80s that feels like it is constantly on the verge of collapsing. The lighting is terrible and most videos end up substandard, at least according to me. Based on my short experience as a content creator, one of the first investments I’d recommend is lighting. Good lighting can make the difference between a good and a bad video, visually at least.

I have a certain standard of quality that I’d like the videos to meet, but I simply cannot achieve it with my current resources. So I figured I could still achieve decent quality by changing how I make videos to account for the lack of resources.

Recently, I started experimenting with a new style of video editing that involved me not being on camera at all. Essentially, the video is a montage of various videos, images, memes and GIFs with a voice over guiding the narrative. This video style was inspired by some of my favorite content creators: Jake Tran and Economics Explained. Several other big content creators have also used this type of video to narrate content while engaging their audience.

The key distinguishing factor between me and these other content creators is that I am doing YouTube part time. They have more time dedicated to creating content than I do. While editing my latest video in this style, I noticed just how absurdly long it takes to find videos that match what I am trying to say, and to find the right sections of movies to use in an edit. Let’s actually focus on that last part about extracting clips from movies. I mean, you can’t really expect me to watch a whole movie just to extract 5-10 seconds worth of footage to use in a video. That would take eons.

But why?

The short answer is: I want to edit videos faster and tell compelling stories in less time.

This idea was partially inspired by Carykh’s video (embedded below). It might be cool to combine this with the ability to edit videos later.

The solution – my first AI

TL;DR A machine learning AI that can contextually read emotions and actions going on in videos and return that data to us.

Ideally, I’d like to tell a computer what kind of emotion or words I am trying to represent and have it skim through the movies and return all the parts that match my criteria. For example, imagine giving a computer a list of all your movies and then telling it, find me all the places where people are crying, or find me all shots that show a rich guy. This would save hours of time for anyone wanting to use movie excerpts in their videos. It would allow for a new age of storytelling where creators are limited not by their time but by their creativity. Basically, the computer would do the boring work for creators while they focus on what they do best: making great content.
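To make the "find me all the places where people are crying" idea a bit more concrete, here is a minimal sketch of what the search side could look like once every clip has labels. Everything in it is an assumption for illustration: the `LabeledClip` fields, the `find_clips` helper, the file names and the hand-written index are all hypothetical, and in the real tool the labels would come from the trained model rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledClip:
    """One labeled stretch of a movie (all field names are hypothetical)."""
    movie: str
    start_s: float      # clip start, in seconds from the start of the movie
    end_s: float        # clip end, in seconds
    labels: frozenset   # emotions/actions tagged on this clip


def find_clips(index, wanted):
    """Return every clip whose labels include all of the wanted labels."""
    wanted = frozenset(wanted)
    return [clip for clip in index if wanted <= clip.labels]


# Hand-written toy index; the real one would be produced by the model.
INDEX = [
    LabeledClip("movie_a.mp4", 12.0, 19.5, frozenset({"crying", "sad"})),
    LabeledClip("movie_a.mp4", 300.0, 304.0, frozenset({"laughing"})),
    LabeledClip("movie_b.mp4", 45.0, 52.0, frozenset({"crying", "angry"})),
]

for clip in find_clips(INDEX, {"crying"}):
    print(f"{clip.movie}: {clip.start_s:.1f}s to {clip.end_s:.1f}s")
```

Asking for several labels at once, for example `find_clips(INDEX, {"crying", "angry"})`, narrows the results to clips tagged with all of them.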

The possibilities

With a technology that tells us what kind of context a video is in, a lot of possibilities open up.

  • Film boards can use the AI (with a few additional tweaks) to determine what kind of ratings a film should have based on the content of the video.
  • Directors can use the AI to establish just how much of an emotion various shots are likely to exude. This in turn will allow them to experiment on a whole new level with an AI helping them make decisions on which shots are likeliest to bring out the emotion they want.

Films are about making your audience feel a certain way and this Artificial Intelligence objectively tells you what people are likely to feel by watching your video. With this computer-human collaboration, we will be able to create films that bring out a whole new level of emotional immersion.

How I’m building it

That sounds cool and all, but how exactly are you planning to pull it off? Well, I don’t have all the answers yet, but I do have some. Let’s start with the fundamental building block of machine learning: data. Machine learning based AIs use data to learn patterns and predict results. This is similar to what children do: they observe their surroundings (get data) and learn how to interact with them (the output). In this particular case, we need a collection of videos showing a myriad of emotions to observe. The kind of data we have determines how well our AI can perform. Fortunately, we now generate more data than ever before. YouTube has 300 hours of content uploaded every single minute!

So we know that there is plenty of data, but how do we use it to learn? We’ll be using a sub-branch of machine learning called deep learning; in particular, supervised deep learning with a reinforcement-style feedback loop on top. Supervised means a human provides the initial data already labeled; think of this as giving a machine an exam along with the answers. Deep learning basically means building layers of artificial neurons that relay information to each other and learn together (case in point: what our brains do). That was a fairly rudimentary explanation but it shall suffice for now.
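The "exam along with the answers" idea can be shown with something far simpler than deep learning. The sketch below is a nearest-centroid classifier, not a neural network, and it is only meant to illustrate learning from labeled examples. The two features per clip (loudness and amount of motion) and all the numbers are made up for the example.

```python
from collections import defaultdict
from math import dist


def train(examples):
    """Average the feature vectors of each label into one centroid per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Guess the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Made-up features per clip: (audio loudness, amount of motion), both in 0..1.
# The label next to each one is the "answer" a human would have provided.
LABELED = [
    ((0.9, 0.8), "angry"),
    ((0.8, 0.9), "angry"),
    ((0.2, 0.1), "sad"),
    ((0.1, 0.2), "sad"),
]

model = train(LABELED)
print(predict(model, (0.85, 0.7)))   # a loud, busy clip the model has not seen
print(predict(model, (0.15, 0.3)))   # a quiet, still clip the model has not seen
```

A deep network replaces the hand-picked features and the averaging with layers that learn their own features from raw frames, but the contract is the same: labeled examples in, a guessing function out.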

What this means is that we can give an AI the videos (the input) and have it guess what action and emotion are being portrayed, then score its own guess. This score will be based on the initial data and results we gave it. As time goes by, the machine will get better at guessing and will have more data generated by itself based on the patterns it has discovered. As a side effect, it will also get better at scoring its own guesses, and the result is a compounding effect where the machine learns from its previous mistakes. I have been calling this reinforcement learning; strictly speaking, a model that retrains on its own confident guesses is doing what is usually called self-training (or pseudo-labeling), while reinforcement learning proper means learning from rewards for actions taken.
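Here is a rough sketch of the "guess, then score the guess" step, assuming the model outputs one raw score per label for each clip. The scores are turned into probabilities with a softmax, and only guesses above a confidence threshold are kept as new training data. The clip ids, raw scores and the 0.9 threshold are invented for illustration, and `pseudo_label` is a hypothetical helper name.

```python
from math import exp


def softmax(scores):
    """Turn raw per-label scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def pseudo_label(raw_outputs, threshold=0.9):
    """Keep only the guesses the model is confident about."""
    accepted = []
    for clip_id, scores in raw_outputs.items():
        probs = softmax(scores)
        guess = max(probs, key=probs.get)
        if probs[guess] >= threshold:
            accepted.append((clip_id, guess, probs[guess]))
    return accepted


# Hypothetical raw model outputs for three unlabeled clips.
RAW = {
    "clip_001": {"crying": 4.0, "laughing": 0.5, "neutral": 0.2},
    "clip_002": {"crying": 1.1, "laughing": 1.0, "neutral": 0.9},
    "clip_003": {"crying": 0.1, "laughing": 5.0, "neutral": 0.3},
}

for clip_id, label, confidence in pseudo_label(RAW):
    print(f"{clip_id}: {label} ({confidence:.2f})")
```

The threshold is the design choice that matters here: set it too low and the model starts teaching itself its own mistakes, set it too high and it rarely generates any new data.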

For this to happen, we’ll go over a series of steps (this may change as I learn more):

  • Getting the video data
  • Splitting the videos into shorter clips that humans can easily label
  • Labeling the emotions and actions in videos – humans watch through thousands of videos and label what is going on in each clip
  • Finding a way to store the emotion data as resource-efficiently as possible
  • Figuring out how the AI can detect context changes, i.e. when we switch from one scene to another
  • Finding the inputs and outputs we’d like the AI to work with
  • Training a Convolutional Neural Network (hereafter referred to as a CNN) model on the data given – this is a type of network loosely inspired by how the human brain learns: different neurons fire up as we learn new information.
  • Giving the CNN footage it has never seen and asking it to classify it and score itself based on previous experience
  • Building software that people can use that uses the trained model to detect actions and emotions.
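A few of the steps above (splitting videos into short clips, spotting scene changes, and storing labels compactly) can be sketched without any video library at all. This is only a toy under stated assumptions: frames are plain lists of grayscale pixel values rather than real decoded video, the 5-second clip length and the difference threshold of 50 are arbitrary, and all three function names are hypothetical. A real version would need a video decoding library to get at the frames.

```python
def split_into_clips(duration_s, clip_len_s=5.0):
    """Cut [0, duration_s) into consecutive (start, end) windows in seconds."""
    clips = []
    start = 0.0
    while start < duration_s:
        clips.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips


def scene_cuts(frames, threshold=50.0):
    """Return indices where a frame differs sharply from the one before it.

    Each frame is a flat list of grayscale pixel values (0-255).
    """
    cuts = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        mean_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_diff > threshold:
            cuts.append(i)
    return cuts


def run_length_encode(labels):
    """Collapse runs of identical per-clip labels into [label, count] pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    return runs


print(split_into_clips(12.0))
print(scene_cuts([[10, 10, 10, 10], [12, 11, 10, 9], [200, 210, 190, 205]]))
print(run_length_encode(["sad", "sad", "sad", "angry", "sad"]))
```

Run-length encoding is one cheap way to approach the storage step: emotions tend to persist across neighbouring clips, so storing "sad for 3 clips" takes less space than storing "sad" three times.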


All in all, things will probably change along the way. In the meantime, I would actually like your suggestions on what the AI should be called. I’ll also be sharing my updated thoughts and sketches of my thought process as time goes by. Perhaps we shall simplify it further if it gets out of hand.

How long will it take? Probably 2-3 months with 3-4 hours each to have a decent prototype. Could be less, could be more, who knows. We’ll see how it goes.

If you read this far, I’m impressed. Thank you and I hope you found this educative and insightful.

Where you come in

In the meantime, here’s what I’d like from you: a name suggestion for the AI. Shoot me some suggestions in the comments section.

Yeah, that’s pretty much it, I think I’m done writing for now. Jesus, I need a coffee.

2 thoughts on “ Building an AI that detects actions and emotions in videos ”

  1. Name suggestion : Emo.

    Play on the word “Emotion”
    Also it literally means a person who is emotional.
