So Sora, OpenAI’s latest text-to-video model, has made waves on social media. From my vantage point across various apps, I saw swathes of video examples accompanied by excitement, fear and anger - the usual response to a newly upgraded technology. It also speaks to a rising discord in the realm of generative AI, where discourse tends to fall into two camps: that AI art takes work from artists and promotes consumption over soul, or that AI art allows people to fully articulate their creations without being gatekept by technical skill. As you can imagine, it’s a controversial subject. My stance is that AI should complement human creation, not replace it. I’m generally pro-AI with some exceptions, such as deepfakes. The main issue, I feel, is the conflation of genAI with AI in general, when applications of the latter have already been seamlessly integrated into our lives in so many ways.
I think it’s worth going through Sora’s technical and safety reports. I’m planning to make a video version of this as well, so the two can be cross-referenced.
Safety First!
First of all, the safety measures outlined include red teaming; a text classifier to reject input prompts that would generate deepfakes, impersonation or other potentially harmful content; and a detection classifier that can discern whether or not a video was generated by Sora. There’s also a note that the OpenAI team are engaging with policymakers, educators and artists, which altogether gives me a sense of reassurance. After all, the model isn’t open to the public yet (as of 27th Feb 2024), so to have this in writing now is a good sign.
The main point of the deep dive for me, however, is the technical report. They state that “Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world”, which gives the impression that this is a stepping stone towards something greater than a video generator - perhaps a hint of AGI or even ASI further down the line (Artificial General Intelligence and Artificial Superintelligence, respectively)?
Under the Hood
What really interested me is how this came to be, and, interestingly enough, it reminded me of my PhD research in a rather abstract way. My thesis examined different ways in which Science Fiction theatre can be staged by highlighting spatiotemporal differences through linguistic worldbuilding and dramaturgy, as opposed to the more visual media of film and television. This was key to evoking a sense of alterity: creating a world behind the scenes and hinting at it through language and gesture. Sora seems to work in a similar vein.
So if we look at the report, it lists RNNs (Recurrent Neural Networks), GANs (Generative Adversarial Networks), autoregressive transformers and diffusion models as the stepping stones towards Sora’s approach. Whilst those methods focus on narrow categories of visual data, Sora is more generalist - inspired by LLMs (Large Language Models) and their use of tokens to unify modalities of text such as code, maths and natural language. In the same way, Sora can effectively represent and unify diverse modalities of visual data through what the report calls visual patches. These patches are highly scalable, allowing generative models to be trained on a wide range of images and video.
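To make the token analogy concrete, here is a minimal sketch (my own illustration, not OpenAI’s code) of what it means to cut a video tensor into fixed-size “spacetime patches” - each patch plays a role analogous to a text token. The function name and patch sizes are my assumptions, chosen purely for demonstration.

```python
import numpy as np

def patchify(video, pt=2, ph=4, pw=4):
    """Hypothetical patchifier: split a video of shape (T, H, W, C)
    into non-overlapping spacetime patches of shape (pt, ph, pw, C),
    each flattened into one "token" vector."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve the time, height and width axes into blocks...
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...group the block indices together, then the within-block axes...
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten each block into a single patch vector.
    return blocks.reshape(-1, pt * ph * pw * C)

video = np.zeros((8, 16, 16, 3))   # a tiny 8-frame, 16x16 RGB clip
tokens = patchify(video)
print(tokens.shape)                # (64, 96): 64 patches, 96 values each
```

The key point is that, after this step, a clip of any length or resolution becomes just a sequence of uniform vectors - the same kind of input an LLM-style transformer already knows how to handle.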
So how does this work, exactly? Sora breaks down images and videos into smaller pieces that can be analysed and generated independently before being reassembled into a coherent whole. Dissecting visual data into patches allows a greater focus on small, localised areas of an image or frame, capturing texture, pattern and object dynamics in more detail than global analysis methods. The patches are then compressed into a latent space, summarising this highly detailed data into a more compact representation that captures the essential information in a more generalised way.
A decoder is then used to map these latent representations back into the high-dimensional space, so that the details, continuity and dynamics of the original scenes are preserved, essentially reconstructing the video. This allows Sora not only to process and generate video but also to gain an understanding of patterns and relationships within visual data. It’s amazing how streamlined the process of creating complex visual data is on both micro and macro scales, also taking into account the dynamics between units of patches. I aimed to do something similar when staging Science Fiction theatre: the worldbuilding aspects are complex and dynamic but are compressed into the latent space of the stage, using dialogue and theatrical elements, and then reconstructed in the audience’s mind. This allows Sora not only to generate narrative but also to learn from it - which makes me think about the concept of simulations as a whole. The use of patches also allows Sora to train on videos and images of various dimensions, as well as to generate extended videos and interpolate between two input videos. Maybe this could help filmmakers finish particular scenes or sequences.
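The decode step mirrors the compression in reverse. Again a hedged sketch rather than Sora’s real decoder (which is a learned network): a second linear map lifts each compact latent vector back into full patch space, after which the patches could be reassembled into frames.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "decoder": a linear map from latent space back to patch space,
# standing in for the trained decoder network.
latent_dim, patch_dim = 16, 96
W_dec = rng.normal(scale=latent_dim ** -0.5, size=(latent_dim, patch_dim))

latents = rng.normal(size=(64, latent_dim))  # one latent vector per patch
reconstructed = latents @ W_dec              # back to full patch vectors
print(reconstructed.shape)                   # (64, 96): ready to reassemble
```

In a trained model, the encoder and decoder are optimised jointly so that this round trip loses as little detail as possible; the generative work then happens entirely in the cheap latent space in between.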
Future Implications
Of course, Sora is in its early stages, and OpenAI are very transparent about the current limitations of the model. They state that it “may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect”, as well as confusing the spatial details of a prompt and temporal trajectories (e.g. a tracking camera shot). However, how could this be expanded in the future? With models like Sora, does this mark the start of lending credence to simulation theory, or even the beginning of us creating simulations of our own?
I’m reminded of one of my recent reads, David Chalmers’ book Reality+, in which he builds on insights from Hans Moravec, Nick Bostrom and Robin Hanson regarding the evidence that we could indeed be in a simulation. Chalmers lists criteria such as the “interestingness” of a character, our position relatively early in the universe, and simply the increasing interest in simulation theory among people. I’d extend the first point to “interesting times”, which I would say we fit; we’re still on the first rung of the Kardashev scale (even if we’ve made so many developments in other ways); and with the advancement of generative AI, conversations about post-truth and simulated reality feel increasingly common, especially on social media. Maybe, then, we’re closer to finding the answer one way or another.