Thu. 07/25 – Reddit’s Throwing Elbows Again


Jul 25, 2024 · 16 min

Episode description

Once again, Reddit looks like it’s not worried about upsetting people. New generative search on Bing, new models from Mistral, and a new video model from Stability. But did Runway train its video models on YouTube videos? We might have a smoking gun. But what if the dream of synthetic data for AI training is a mirage?

Links:

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Transcript

Welcome to the Techmeme Ride Home for Thursday, July 25th, 2024. I'm Brian McCullough. Once again, Reddit looks like it's not worried about upsetting people. New generative search on Bing, new models from Mistral, and a new video model from Stability. But did Runway train its video models on YouTube videos? We might have a smoking gun there. But what if the dream of synthetic data for AI training is just a mirage? Here's what you missed today in the world of tech.

Or any other alternative search engine that doesn't rely on Google's indexing, and search Reddit using the site:reddit.com operator: you will not see any results from the last week. DuckDuckGo is currently turning up seven links when searching Reddit, but provides no data on where the links go or why, instead only saying that, quote, we would like to show you a description

here, but the site won't allow us. Older results will still show up, but these search engines are no longer able to crawl Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward. Searching for Reddit still works on Kagi, an independent paid search engine that buys part of its search index from Google.

The news shows how Google's near monopoly on search is now actively hindering other companies' ability to compete, at a time when Google is facing increasing criticism over the quality of its search results. And while neither Reddit nor Google responded to a request for comment, it appears that the exclusion of other search engines is the result of a multi-million-dollar deal that gives Google the right to scrape Reddit for data to train its AI products.

"They're killing everything for search but Google," Colin Hayhurst, CEO of the search engine Mojeek, told me on a call. Robots.txt files are just instructions, which crawlers can and have ignored, but according to Hayhurst, Reddit is also actively blocking his crawler. Reddit has been upset about AI companies scraping the site to train large language models, and has taken public and aggressive steps to stop them from continuing to do so.

Last year, Reddit broke a lot of third-party apps beloved by the Reddit community when it started charging for access to its API, making many of those third-party apps too expensive to operate. Earlier this year, Reddit announced that it signed a $60 million deal with Google, allowing it to license Reddit content to train its AI products.

Reddit's robots.txt used to include a bunch of jokes, like forbidding the robot Bender from Futurama from scraping it, and specified pages that search engines were and were not allowed to access: /rss was allowed, while /login was not. Today, Reddit's robots.txt is much simpler and stricter. In addition to a few links to Reddit's new public content policies, the file simply includes instructions which basically mean no user agent or bot should scrape any part of the site.
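To make that concrete, here's a small sketch using Python's standard-library robots.txt parser. The rules text below paraphrases the article's description of Reddit's new blanket rule; it is not Reddit's verbatim file.

```python
from urllib import robotparser

# Paraphrase of the new rule: every user agent is disallowed everywhere.
rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Any well-behaved crawler checking these rules is told to fetch nothing.
print(rp.can_fetch("Mojeekbot", "https://www.reddit.com/r/technology/"))
# → False
```

Crawlers that honor robots.txt, like Bing's or Mojeek's, stop at this check; as Hayhurst notes, the file is only an instruction, which is why Reddit also blocks crawlers actively.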

Reddit appears to have updated its robots.txt file around June 25, after Mojeek's Hayhurst noticed its crawler was getting blocked. Reddit's announcement of the change said that, quote, good faith actors, like researchers and organizations such as the Internet Archive, will continue to have access to Reddit content for non-commercial use, and that we are selective about who we work with and trust with large-scale access to Reddit content, end quote.

It also links to a guide on accessing Reddit data, which plainly states that Reddit considers search or website ads a commercial purpose, and that no one can use Reddit data without permission or paying a fee, end quote. Microsoft, by the way, confirmed to Search Engine Land that Bing stopped crawling Reddit after Reddit updated its robots.txt file on July 1.

Meanwhile, Microsoft unveiled Bing Generative Search, which shows AI-generated answers with the sources used to create them, currently available to a small subset of users, quoting Windows Central: At the very top of the page will be an AI-generated answer created by large and small language models that have reviewed millions of sources to provide the most accurate answer.

It will break down that answer into a document index that can provide more information about particular subjects within that search query if you'd like to learn more. The search engine will also list the sources that the AI generated text was created from below the answer, and will even present traditional search results in a sidebar on the right for those who are uninterested in Bing's curated AI experience.

Microsoft says it continues to evaluate the impact that AI in search is having on websites in terms of direct traffic and readership. There's a growing concern in the industry that websites that create content for free will eventually go out of business if AI bots scrape that content to present it directly in a chat window or a search page. This new AI search experience has been built from the ground up with this concern in mind.

The company claims this new experience maintains the same number of clicks to websites as traditional search does, but time will tell if that's true. Which means it must be time to do a whip-around to discuss the newest models folks have released. Mistral has announced Mistral Large 2, the new generation of its flagship model, with 123 billion parameters, quoting VentureBeat.

However, in an important caveat, the model is only licensed as open for non-commercial research uses, including open weights, allowing third parties to fine-tune it to their liking. Those seeking to use it for commercial/enterprise-grade applications will need to obtain a separate license and usage agreement from Mistral, as the company states in its blog post and an X post from research scientist Devendra Singh Chaplot.

While it has a lower number of parameters, the internal model settings that guide its performance, than Llama 3.1's 405 billion, it still nears that model's performance. Available on the company's main platform and via cloud partners, Mistral Large 2 builds on the original Large model and brings advanced multilingual capabilities, with improved performance across reasoning, code generation and mathematics.

It is being hailed as a GPT-4-class model, with performance closely matching GPT-4o, Llama 3.1 405B and Anthropic's Claude 3.5 Sonnet across several benchmarks. And Stability AI unveiled Stable Video 4D, a model based on its Stable Video Diffusion model that takes video input and generates videos from eight new perspectives, also quoting VentureBeat.

While there is a growing set of gen AI tools for video generation, including OpenAI's Sora, Runway, Haiper and Luma AI, among others, Stable Video 4D is something a bit different. Stable Video 4D builds on the foundation of Stability AI's existing Stable Video Diffusion model, which converts images into videos. The new model takes this concept further by accepting video input and generating multiple novel-view videos from eight different perspectives.

"We see Stable Video 4D being used in movie production, gaming, AR/VR and other use cases where there is a need to view dynamically moving 3D objects from arbitrary camera angles," Varun Jampani, team lead for 3D research at Stability AI, told VentureBeat. Jampani noted that Stable Video 4D is a first-of-its-kind network where a single network does both novel view synthesis and video generation; existing networks leverage separate video generation and novel view synthesis networks for this task.

He also explained that Stable Video 4D is different from Stable Video Diffusion and Stable Video 3D in terms of how the attention mechanisms work. "We carefully design attention mechanisms in the diffusion network which allow generation of each video frame to attend to its neighbors at different camera views or timestamps, thus resulting in better 3D coherence and temporal smoothness in the output videos," Jampani said.

Lumen is the world's first handheld metabolic coach.

It's a device that measures your metabolism through your breath, and on the app it lets you know if you're burning fat or carbs and gives you tailored guidance to improve your nutrition, workouts, sleep, even stress management. All you have to do is breathe into your Lumen first thing in the morning and you'll know what's going on with your metabolism, whether you're burning mostly fats or carbs. Then Lumen gives you a personalized nutrition plan for that day based on your measurements.

You can also breathe into it before and after workouts and meals, so you know exactly what's going on in your body in real time, and Lumen will give you tips to keep you on top of your health game, because your metabolism is at the center of everything your body does. Optimal metabolic health translates to a bunch of benefits, including easier weight management, improved energy levels, better fitness results, better sleep, etc. Lumen has helped my wife and me get healthier this summer, so join us.

If you want to take the next step in improving your health, go to lumen.me/ride to get 15% off your Lumen. That is l-u-m-e-n dot me slash ride for 15% off your purchase. Thank you, Lumen, for sponsoring this episode. Selling a little or a lot? Shopify helps you do your thing, however you cha-ching.

Shopify is the global commerce platform that helps you sell at every stage of your business from the Want Your Online Shop stage to the first real-life store stage all the way to that did we just hit a million orders stage. Shopify is there to help you grow whether you're selling scented soap or offering outdoor outfits. Shopify helps you sell everywhere. From their all-in-one ecommerce platform to their in-person POS system wherever and whatever you're selling, Shopify's got you covered.

Shopify helps you turn browsers into buyers with the internet's best-converting checkout, 36% better on average compared to other leading commerce platforms. And sell more with less, thanks to Shopify Magic, your AI-powered all-star. My 25-year-old ecommerce company runs on Shopify, and it runs flawlessly. Shopify powers 10% of all ecommerce in the US, and Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and millions of other entrepreneurs of every size across 175 countries.

Sign up for a $1-per-month trial period at Shopify.com/ride, all lowercase. Go to Shopify.com/ride now to grow your business, no matter what stage you're in. Shopify.com/ride. Shot and chaser: 404 Media has a source, and has seen an internal document, that they say reveals AI startup Runway scraped thousands of videos from YouTube creators and brands, including Disney and Vice News, to train its Gen-3 AI video generation tool.

Quote. The model, initially codenamed Jupiter and released officially as Gen-3, drew widespread praise from the AI development community and technology outlets covering its launch when Runway released it in June. Last year, Runway raised $141 million from investors including Google and Nvidia, at a $1.5 billion valuation. When TechCrunch asked Runway co-founder Anastasis Germanidis in June where the training data for Gen-3 came from, he would not offer specifics.

"We have an in-house research team that oversees all our training and we use curated internal datasets to train our models," Germanidis told TechCrunch. The spreadsheet of training data viewed by 404 Media, and our testing of the model, indicates that part of its training data is popular content from the YouTube channels of thousands of media and entertainment companies, including The New Yorker, Vice News, Pixar, Disney, Netflix, Sony and many others.

It also includes links to channels and individual videos belonging to popular influencers and content creators, including Casey Neistat, Sam Kolder, Benjamin Hardman, Marques Brownlee and numerous others. While 404 Media couldn't confirm that every single video included in the spreadsheet was used to train Gen-3, it's possible that some content was filtered out later, or that not every single link on the spreadsheet was scraped.

The training data reveals specifics about the generative AI industry, which has been repeatedly accused of training models on copyrighted material. Runway did not respond to multiple requests for comment via email, LinkedIn and its official Discord channel.

When reached for comment, Google, which operates YouTube and is a Runway investor, pointed us to a Bloomberg story from April in which the company told the publication that OpenAI training its AI video generator Sora with YouTube videos would violate YouTube's rules. "Our previous comments on this still stand," a Google spokesperson told 404 Media in an email when asked about Runway scraping YouTube videos.

There was a company-wide effort to compile videos into spreadsheets to serve as training data, a former Runway employee told 404 Media. After the list of videos was compiled, Runway scraped the videos using open source software, specifically youtube-dl, which has a proxy configuration option.

Runway purchased proxies from a provider, the source said, which gives customers an IP address that download requests are routed through in order to not get blocked by YouTube. 404 Media granted the source in this article anonymity because they feared professional retribution.
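The proxying described here is an ordinary youtube-dl feature. A hedged sketch of how such a download command might be assembled; the helper function, proxy URL and video URL are purely illustrative, not anything from Runway's actual tooling.

```python
# Hypothetical sketch: building a youtube-dl invocation that routes
# traffic through a proxy. youtube-dl's real --proxy flag accepts an
# HTTP/HTTPS/SOCKS proxy URL, so YouTube sees the proxy's IP address.
def build_download_cmd(video_url, proxy=None):
    cmd = ["youtube-dl", "--output", "%(id)s.%(ext)s"]
    if proxy:
        cmd += ["--proxy", proxy]  # download requests exit via the proxy
    cmd.append(video_url)
    return cmd

cmd = build_download_cmd(
    "https://www.youtube.com/watch?v=EXAMPLE",      # illustrative URL
    proxy="http://user:pass@proxy.example.com:8080" # made-up proxy
)
```

The resulting list would typically be handed to something like subprocess.run; it is left unexecuted here.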

"The channels in that spreadsheet were a company-wide effort to find good quality videos to build the model with," the former employee said. "This was then used as input to a massive web crawler, which downloaded all the videos from all those channels, using proxies to avoid getting blocked by Google." The document contains 14 spreadsheets, each labeled with different categories.

One of the spreadsheets contains what appears to be a list of 117 terms, like beach, doctor and rain, with the names of Runway employees next to each of those terms. The former employee told 404 Media that these names were either people tasked by others to find videos related to the keywords, or the employees themselves noting that they were working on that keyword.

Next to the term rainbow and an employee's name, someone wrote a note that said, "no channels or playlists dedicated to it but found good individual videos for fine tuning." Notes like this in the document show that the company was trying to obtain videos that had specific types of subject matter and camera work, and with a diverse set of people in them.

The high camera movement sheet contains 177 links to YouTube channels, including the official Call of Duty channel, filmmaker Josh Newman's channel, and the Unreal Engine and Vans channels. A spreadsheet titled cinematic masterpieces contains 206 links to individual channels and videos of especially high quality, including animated shorts and student films. On that sheet, a note next to a link to the DeFi Studio YouTube channel says, "the holy grail of car cinematics so far."

A sheet called single great videos for fine tuning is a stockpile of another 253 videos, along with a column for topics like waxing eyebrows, ice sculpting, smiling and screaming. The non-YouTube source sheet also contains a link to an archive of Studio Ghibli films, several anime piracy sites and a fan site for Xbox game clips, as well as a now-offline movie piracy site called A.Z.I. movies that has a note with it from someone at Runway, quote, "tons of stuff in here." And finally, pair that with this.

Researchers suggest that using synthetic data created by AI systems to train other AI systems could lead to the rapid degradation of AI models and a collapse over time, quoting the FT: The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology.

Leading AI companies, including OpenAI and Microsoft, have tested the use of so-called synthetic data, information created by AI systems, to then also train large language models as they reach the limits of the human-made material that can improve the cutting-edge technology. Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models.

One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. The work underlines why AI developers have hurried to buy troves of human-generated data for training, and raises questions of what will happen once those finite sources are exhausted.

Synthetic data is amazing if we manage to make it work, said Ilya Shumailov, lead author of the research, but what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens. The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. The early stages of collapse typically involve a, quote, loss of variance, which means majority sub-populations in the data become progressively overrepresented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish.

So given the Llama news this week, I was interested to see this interaction: Sam Lessin threaded the original Techmeme link of this story to say, quote, the whole story of using AI to generate training data to keep training AI has always been a head-scratcher to me. It never made sense to me why that would work, versus just drift rapidly into nonsense. But for a long time, lots of people seemed to believe it.

I am glad to see that the research is coming back in line with what intuitive expectations would be, end quote. To which Mark Zuckerberg himself responded, quote, distilling models into smaller models that are almost as capable, but only a fraction as expensive to run, clearly works, and is a lot of what I expect people to use Llama 3.1 405B to do.

There's also evidence that you can further train these smaller models to surpass the intelligence of the teacher model, although that's not necessarily using synthetic data from the teacher, end quote. Also, there's this Crunchbase article that says that in the first half of this year, AI startups raised $500 million across 198 angel or seed deals. The Ride Home AI fund was in about a dozen of those, by my count.
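The distillation Zuckerberg describes is distinct from the naive train-on-your-own-output loop in the collapse research: the student is trained to match the teacher's full softened output distribution, a well-defined objective. A minimal sketch of that loss, with made-up logits and an illustrative temperature:

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, exposing the teacher's
    # "dark knowledge" about which wrong answers are nearly right.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between softened teacher and student distributions;
    # minimized exactly when the student reproduces the teacher's outputs.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [4.0, 1.0, 0.2]   # hypothetical big-model logits for one token
student = [3.5, 1.2, 0.1]   # a smaller model that nearly matches
```

Because the target is the teacher's distribution rather than the student's own samples, each training step pulls toward a fixed reference instead of compounding generational drift.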

And add one more, because we made our final bet just yesterday, closing the investment tomorrow: the AI fund has officially deployed all of its capital in a little over a year. Quick and dirty, just like Chris and I said we'd do. Talk to you tomorrow.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.