
LLMs and You: How AI Labs Use the Grass Network


Synopsis: Buyers on the Grass network use your bandwidth to scrape data from the internet. By exploring how AI labs train their language models, we can learn a bit about what types of material they use Grass to access, and why your personal data is not part of the equation.

Introduction

By now, you probably know that Grass is a network that sells your unused internet bandwidth to companies that use it to view the public web. As we’ve explained a bit in the past, the highest-profile use case for this service is AI labs, which need massive amounts of web data to train their language models. But what kind of data are they downloading, exactly? And why do they need it?

To understand the answers to these questions, we need to learn a bit about how large language models work. So strap in for a minute while we take a quick look at what’s going on behind the scenes. Today, we’ll try to explain LLMs like you’re five — or at least like you’re a sophisticated adult who wants to understand AI a little better. So where to begin?

LLMs and the Word Vectors They Produce

Let’s start simple: LLMs are AI algorithms to which you can pose questions in plain language and get an actual answer. You may ask for a summary of a given topic, a translation of a particular passage, or a detailed solution to a complex problem. In response, they will generate predictive text to satisfy whatever prompt you’ve decided to input. To the untrained eye, it’s a robot that can speak English.

But how do they work? Ultimately, LLMs comb through massive amounts of written language, find patterns in the ways that certain words relate to one another, then translate these words into strings of numbers that reflect these relationships. These numbers are the language that LLMs actually speak, and they are known as “word vectors.” Let’s give an example to see how they work.

Say you’re in the mood to eat something with meatballs, but you can’t remember the name of that pasta that goes with them. If you ask an LLM what to call this mysterious noodle, it will search for a noun that is A) a pasta, and B) likely to appear in the same sentence as “meatball.” Voila: “Spaghetti.”

In a very simple model, trained only to answer meatball-related questions for forgetful diners, each word vector might have only two dimensions.

1: Does this word describe a noodle? (1 for yes and 0 for no.)

2: How strong is the correlation between this word and “meatballs” in written text?

In this case, spaghetti might be represented as [1, 0.95], with the 1 signifying that spaghetti is a noodle and the 0.95 signifying a 95% correlation with the word “meatball.” This is a higher score than any other word the model has encountered, and thus most likely to be the correct answer. There you have it: Spaghetti and Meatballs.
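To make that concrete, here is a minimal sketch of the toy meatball model in Python. The words, vectors, and scores are all illustrative assumptions, not output from any real model.

```python
# Toy two-dimensional "word vectors": [is it a noodle?, correlation with "meatball"].
# All values are made up for illustration.
toy_vectors = {
    "spaghetti": [1, 0.95],
    "penne":     [1, 0.40],
    "risotto":   [0, 0.30],
    "marinara":  [0, 0.80],
}

def best_noodle_for_meatballs(vectors):
    # Keep only the words flagged as noodles, then pick the one most strongly
    # correlated with "meatball" in the (imaginary) training text.
    noodles = {word: vec for word, vec in vectors.items() if vec[0] == 1}
    return max(noodles, key=lambda word: noodles[word][1])

print(best_noodle_for_meatballs(toy_vectors))  # -> spaghetti
```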

So now we understand how LLMs communicate a word’s relationship with other words — but what happens when the questions become more complicated? Instead of asking what to call “spaghetti,” what if you asked what a seven-year-old would call spaghetti?

To find out, you’d have to read quotes from millions of seven-year-olds and determine which word has the highest correlation with “meatball” in these very specific contexts. As it turns out, seven-year-olds — hardly known for their facility with the Italian language — are liable to mispronounce the word as “sketti” or “basketti.” At least, that’s what ChatGPT reported back a few moments ago.
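If we extend the same toy sketch with a third dimension, say, how often a word shows up in quotes from seven-year-olds, the “right” answer changes with the context. Again, every word and number here is an invented assumption for illustration.

```python
# Third dimension added: [is it a noodle?, correlation with "meatball",
# frequency in quotes from seven-year-olds]. All values are invented.
toy_vectors_3d = {
    "spaghetti": [1, 0.95, 0.20],
    "sketti":    [1, 0.70, 0.90],  # rare in adult writing, common in kids' speech
    "penne":     [1, 0.40, 0.05],
}

def kids_word_for_meatball_pasta(vectors):
    # Weight the meatball correlation by the kid-speech frequency, so the answer
    # now depends on context rather than raw correlation alone.
    noodles = {word: vec for word, vec in vectors.items() if vec[0] == 1}
    return max(noodles, key=lambda word: noodles[word][1] * noodles[word][2])

print(kids_word_for_meatball_pasta(toy_vectors_3d))  # -> sketti
```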

Now, this raises a few questions. When answering our prompt simply required a two-dimensional assessment of general correlation, it was easy to comb through limited data and see which word appeared in the most sentences with “meatballs.” As soon as we started asking more complex questions, though, the word vectors needed to be dramatically longer, and thus draw on much larger banks of information. Perhaps you can see where this is going: if you want to train an LLM to answer any question a user could possibly ask, you’re going to have to access much larger datasets.

Big Data

While the scientists in our example above may be content to study meatballs alone, major AI labs are working to create incredibly refined LLMs that will someday have access to all recorded human knowledge. This requires them to spit out word vectors with far more than two dimensions, which can capture more subtle relationships between the words they read. To illustrate, let’s use this model, which was trained on the entire English Wikipedia.

Consider the word “Donkey.” In English, it’s spelled D-O-N-K-E-Y. Vectorized, it’s spelled -0.092339 followed by another 5,507 digits: a mouthful to say, and impossible to remember.

The word vectors in this model are so long because it was trained on 199,430 unique words, and it can produce a vector for each one that encodes its relationship with all the others. Trained this way on the entirety of Wikipedia, the model is able to answer any question contained in the articles within. The 5,000-character vector lengths reflect the sheer amount of information that each one relates back to. The takeaway: the correlations a model draws between words, and the patterns it discovers in written content, get more and more accurate as the datasets it’s trained on get larger and larger. If we want LLMs to give accurate answers, they need to be trained on more data.
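If you’d like to poke at real word vectors of roughly this kind, here is a minimal sketch using the gensim library and its downloadable “glove-wiki-gigaword-300” vectors. That particular model was trained on a Wikipedia dump plus a news corpus, and it is an assumption here, a stand-in for the model discussed above rather than the exact one.

```python
# Minimal sketch: querying pretrained 300-dimensional word vectors with gensim.
# "glove-wiki-gigaword-300" is an illustrative stand-in, not necessarily the
# model referenced in the post. Requires: pip install gensim
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")  # downloads a few hundred MB on first run

donkey = vectors["donkey"]   # a 300-dimensional array of floats
print(donkey[:5])            # the first few of the hundreds of numbers that "spell" the word

# Words whose vectors sit closest to "meatball" in this space
print(vectors.most_similar("meatball", topn=5))
```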

But how could an AI lab possibly access this much data?

The Grass Connection

This is where it all ties back to you, and the bandwidth you sell to these AI labs on Grass. If you look at the list of models on the website we linked earlier, you can see that a variety of them are available. One was trained by reading all of the words on Wikipedia, one by combing through mountains of Google News articles, and one on the British National Corpus. Whatever data a lab wants its model trained on, that is the content it needs to access.

Here’s the thing: this is relatively simple when the data is crystallized and the answers won’t change. If someone asks an LLM when Columbus discovered America, the answer will always be 1492. A lab could simply train its model on the Encyclopedia Britannica.

But what if an LLM wants to answer questions about contemporary information? What if it wants to answer questions about popular sentiment, or how the average person feels about a certain topic? Where could you find billions of people expressing their thoughts and opinions on any topic imaginable, refreshed eternally in a never-ending stream? Modern problems, as they say, require modern solutions. In this case, the solution is social media.

Accessing this information, however, requires a nonstop connection to the internet, viewed from every corner of the Earth and capable of downloading unfathomable volumes of written language. This, my friend, is the origin of Grass, Wynd Network’s marketplace where ordinary users sell internet bandwidth to AI labs for downloading written words off the web.

Conclusion

So now you understand who these labs are, the LLMs they are trying to train, the types of data they use to train them, and where they can access it with the help of your internet bandwidth. This is only the most rudimentary explanation of how LLMs are trained, and we’ve obviously left a lot out in the interest of simplicity. But hopefully it goes some way toward explaining what exactly your bandwidth is being used for, and how AI labs use public data posted on social media websites to train their models.

You’ll notice that nowhere in this conversation is your personal data mentioned even once, and that’s because it doesn’t factor into the equation. When we tell people they’re selling bandwidth so AI companies can download data, that’s often their first assumption: that they’re giving up their data, just like they do by using social media in the first place. We wrote this primer so you would know that this doesn’t happen at all. Buyers simply access public web data, often from sites like Reddit, and nothing about you is visible whatsoever. So you can rest assured that your privacy is intact, and maybe you learned something along the way.

Grass: Progress Update and Road Ahead


Key Points:

  1. The Grass network has seen 80,000 individual downloads and nearly 1,000,000 unique residential IPs through our referral program alone
  2. The network will go live once we pass key thresholds in certain metrics, defined below
  3. The launch of our Android and iPhone mobile apps stands to significantly increase the size of the network and the uptime for users
  4. Formal compensation will occur when the network launches, but earnings are accruing now

Over the past few months, we’ve focused on educating people about the vision we have for Grass: defining proxy networks, identifying the use cases, and making sure we’re transparent about what we’re planning. Now it’s time to shift into phase two. We’re carrying this thing to market, and today we’ll update you on what’s next.

As you know, Grass is currently building out its network of residential proxies by recruiting users like you to act as individual nodes.  The sooner we reach a threshold of active nodes, the sooner we launch and we all start to see the money roll in.  By signing up, downloading, and referring your friends, you’re playing a very active part in this process.  Thanks, by the way.

So where do things stand now?

Well, progress has been pretty incredible since our first announcement four months ago.  The network itself has seen almost 1,000,000 unique IP addresses since early June.  The entire point of a proxy network is to provide IP addresses for buyers to route web traffic through, so this is pretty massive.  

To date, 80,000 people have downloaded the web extension, and all of them were referrals from the ref links you’ve been sharing on Twitter, YouTube, and TikTok. This is really extraordinary for a word-of-mouth campaign, and we are more confident than ever that Grass will exceed every expectation we have.

So this is all great news, but now is the time to focus on the future.  We have some very concrete milestones we’re working towards, and a handful of key metrics that will tell us when the time has come.  Over the next few months, you’ll be able to watch the progress with your own eyes and feel the network get closer and closer to the fateful day we go live.  

Here’s what to look out for:

Development Milestones

  • New UI: The first order of business is to roll out our new UI, which will make the dashboard more intuitive and appeal more to a mass audience.  You can expect this sometime in the next few weeks.
  • Open Access to Grass: Currently, Grass is only available through the ref links you post.  Soon we’ll be opening up access to anyone and everyone, and we expect to see a substantial increase in downloads once this barrier is lifted.
  • Launch Android App: Grass is currently only available as a downloadable web extension, but soon we’ll be releasing an app that you can install on your phone. This has the potential to be a watershed moment for a few reasons. First, most people’s phones are on all the time, so the number of active nodes will balloon when we start getting users with 24/7 uptime. Second, the overwhelming majority of time people spend on the internet occurs on mobile devices, so when they see an ad or ref link, they’ll be able to download and install the app within mere seconds. This means more users, more downloads, and more active nodes on the network.
  • Launch iOS App: This has the potential to be a huge milestone for all of the same reasons, but on an even larger scale. 62% of the mobile phones in America are iPhones, which means we’ll experience the Android effect at almost twice the intensity. This might just be the event that pushes us over the line – but only time will tell.

Key Network Metrics

As we progress through these development milestones, we’ll also be keeping our eye on the underlying growth of the network itself.  This is what really matters when it comes to how soon we can launch, and we have a handful of metrics for measuring how far we are in the process.  If you familiarize yourself with them, it will be easier to follow along on the road ahead.

  • Downloads: This refers to the number of individual users who have downloaded the web extension. 
  • Referrals:  As you can probably guess, this is the number of people who have been referred by other members.  To us, it’s a measure of how much footwork the people themselves are putting in to spread the word, and how much faith the community has in Grass’s vision.
  • Unique IPs: This refers to the number of IP addresses that have been active on the network. This metric is a bit more complex, as it also accounts for individual users who have provided bandwidth from multiple locations. If you’ve checked your dashboard from a friend’s house or a coffee shop, you’ve probably seen additional addresses show up at the bottom left. This shows increased breadth in the network, but the number of active users matters most of all.
  • Concurrent active nodes: Concurrent active nodes refers to the number of users who are active at any given time. Essentially, if a buyer logs on to use bandwidth on the Grass network, this is the number of different proxies they could route their web traffic through (a small sketch of how a metric like this could be computed follows this list).
  • Concurrent active nodes (US): The holy grail. US IP addresses are in the highest demand in the world, and more buyers are willing to pay for American proxies than for proxies from anywhere else. We’re building a global network of residential proxies, but it’s particularly important to get Americans signed up.
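For a sense of how a metric like concurrent active nodes could be measured, here is a small, hypothetical sketch (not Grass’s actual implementation) that derives the peak number of simultaneously active nodes from a list of node sessions, where each session is a (connect, disconnect) timestamp pair.

```python
# Hypothetical illustration only: peak concurrent active nodes from session intervals.
def peak_concurrent_nodes(sessions):
    # Turn each session into a +1 (connect) and -1 (disconnect) event,
    # sweep through them in time order, and track the running maximum.
    events = []
    for start, end in sessions:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # disconnects sort before connects at the same timestamp
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

# Three sessions, two of which overlap in time
print(peak_concurrent_nodes([(0, 100), (50, 150), (200, 300)]))  # -> 2
```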

Over the next few months, we’ll be providing updates not only when we attain the milestones listed above, but also when we reach key numbers on all of these metrics.  

The Road to Compensation

As soon as we launch, the network will start generating independent revenue – 100% of which will go to compensating active nodes.  The sooner that day comes, the sooner your points will be converted.

Don’t ever forget, though: you might have to wait until then to get compensated, but you’re already earning now.  If you’ve been checking your dashboard for four months at this point, that means you have four months of earnings stacked up before we even launch.  We’re currently testing the network with several proxy buyers who are looking to ramp up their usage once our network has sufficiently grown, and if you’re reading this, you are very early.

Hopefully this update gave you an idea of where Grass stands today, how far we’ve come, and a concrete sense of the path to going live.  We’ll continue to update you on all of the events we described above, so keep a close eye on our Twitter, Discord, and Blog for more news.  And remember – the more people you refer, the more earnings you can stack up before launch, and the faster we can all get there.  So go out and touch Grass!
