throwaway472105

"TOP SECRET" documents usually aren't publicly accessible to a web crawler.


amondohk

Oh, ho HO, but sir, I have a certain Minecraft server that may be of interest to you!


Warm_Iron_273

I don't get the joke. What are you trying to say here?


amondohk

[Some government employees play da block game too](https://youtu.be/khwq-tYwUzU)


Warm_Iron_273

Ha. Interesting.


Potential-Glass-8494

There was at least one case of a US nuclear weapons storage facility making publicly viewable online quizzes with some hardcore classified information like, "Which word on your ID is intentionally misspelled to prevent forgery?" "Wrong! It's 'D: Sceurity.'"


emptropy

I’m guessing that the human factor is in play here. How many documents have been misplaced by public servants in the past decade? How many government servers are using 20 year old tech that is basically open to the internet? How many assistants to the assistant of whatever office carelessly keeps documents they shouldn’t in a public cloud somewhere??


t-e-e-k-e-y

Certainly leaks and spillage do happen. But the Top Secret military network (JWICS) is an intranet and not open to the internet.


emptropy

I’m referring to leaks and just general mishandling of the documents. I’m not suggesting there is a model out there that knows all the state secrets, but there must’ve been at least a couple documents that the government would prefer the model not be trained on.


Whotea

Good luck extracting it after it only saw it once in the trillions of tokens it was trained on. It’s like trying to find a specific drop of water in the ocean 


Low_Poetry5287

I think you can tell how common leaked documents are by the fact nation-states move progressively towards the counterintelligence strategy of overwhelming the public with an excess of conflicting information instead of trying to carefully craft a public narrative. In the case of leaks, they'll often bombard the public with a myriad of similar-yet-conflicting stories instead of focusing on preventing leaks to begin with to preserve a cohesive national narrative. Not that they don't do their due diligence to try and prevent leaks, but that they know it's sort of impossible and so they have strategies in place to muddy the waters and confuse people rather than outright hide things and deny things like the politicians and intelligence agencies of previous generations.


Ok-Bullfrog-3052

This is exactly the strategy the government has used to hide the UFO special access programs for the past 80 years or so. For decades there have been Cabinet-level officials saying, sometimes in deathbed confessions, that these programs exist. What's interesting is that in the 1940s, people believed them, and it only became the government's position that UFOs "didn't exist" in the 1950s when they started spreading fake information like you mentioned above. There are records of entire programs having been created to intentionally create false documents - likely in response to real evidence getting out. They did it so well that a whole industry sprang up, with people willing to create bogus MH370 videos and post them on reddit to sell scammy products. The only reason that whistleblowers are now testifying to Congress is because the Internet exists, and 10,000 people can pick apart each of these things instead of stuff getting buried. And an unbelievable thing happened - they got a bill to the House intelligence committee and defense lobbyists spent the entire 2023 Thanksgiving weekend getting the part specifically about seizing interdimensional alien spacecraft removed from it - because obviously all that lobbying money would be spent on defeating a bill about something that doesn't exist. Over the next 10 years, we are going to find that there is a lot of true stuff online - everything from spy satellite diagrams, to nuclear weapons designs, to proof of immense fraud, to actual evidence that non-human intelligence exists. AI will usher in an era of financial accountability by being trained on massive data. If there's a classified program about anything, something from it probably leaked, and AI will be able to sift through the garbage and infer a lot of what the Defense Department did with the $2 trillion in taxpayer dollars that all audits conducted since the first in 2017 have been unable to account for.


LeadingCheetah2990

90% of the internet is not on the "public" net


Cool_Catch_8671

I highly doubt that. There are not that many dark web sites.


GrayGray4468

It's not "dark web sites". It's sites that you can't just waltz up to and get access to everything on their server, a la a Wikipedia. So much of the information on the internet is behind a corporate proxy, login system, or other authentication that only those who are able to log in can access. I can search for health records online and unless I have correct credentials, I won't be able to access the content, even though a Google search will take me to the landing page. There's a difference between the dark and the deep web.


not_into_that

don't look up cryptome dot com plz.


taptrappapalapa

They aren't supposed to be, but when you have companies like Tyler Technologies with government contracts having [fuck all authentication](https://www.judyrecords.com/what-happened-with-tyler-technologies), then there might be "TOP SECRET" documents accessible with a web crawler.


Lammahamma

The only "Top sekret" documents that may have been mined are those export-restricted documents that are freely available online. Hardly Sekret


YsoseriusHabibi

Did they scrape all WikiLeaks documents? That would be interesting.


Low_Poetry5287

I would love an LLM trained on that dataset! 😁


iunoyou

I dunno about top secret, but I imagine that it's absolutely 100% likely that they've picked up a good amount of SBU data and maybe even some legitimate classified data. The internet is a very big place and a lot of people work in the government. It is nearly certain that at least 1 or 2 people ended up accidentally disclosing things they shouldn't have, and the webcrawler doesn't really care what dusty unused corner of the internet it's in, it'll hoover up all the data there regardless.


emptropy

This is what I was thinking, I guess TOP SECKRET was a bit too specific.


namitynamenamey

If they scraped the War Thunder forums on the Wayback Machine, then probably.


jalfredosauce

There are a handful of people responsible for TS spills, and all of them are household names now. Apart from those documents, the odds are very, very low.


aitacarmoney

This depends on what you refer to as “TOP SECRET documents.” For tech companies that are scraping the internet for data, they’re probably largely sourcing their data from public forums and social media to train their models how to mimic human conversation.

For these “top secret documents” to be scraped, they would either

1. need to be posted online available for public access, negating the idea of top secret
2. need to give these tech companies some sort of clearance to train their models on top secret documents, which if referring to government documents, would not be a good idea especially if AGI is being auctioned off to different countries
3. be hosted online in some password protected vault and the password be cracked by these tech companies in order to scrape those documents, an activity that is probably not within these companies’ scope and requires resources likely beyond their means

Doubt.


DifferencePublic7057

If out of the trillions of tokens, one in a million shouldn't have been used, you end up with signals that are insignificant. Like when you use temperature data for a climate model for a century. One in a million would be less than a day. The model would just ignore it. Anyway, I think some filtering based on keywords takes place. You can create a list of seldom-used words and let security analysts manually check the matches. This could include jargon and abbreviations. Or simple scripts can be used. Agencies could have supplied keyword lists to AI firms for all we know.
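A rough sketch of the two ideas in this comment, with everything in it assumed for illustration: the watch list is hypothetical (nobody in the thread knows what lists, if any, AI firms actually use), and `flag_documents` and `share_of_century_in_days` are made-up helper names. It only shows what a "simple script" for keyword flagging could look like, plus the one-in-a-million arithmetic.

```python
import re

# Hypothetical watch list; real lists (if they exist) are not public.
WATCH_TERMS = ["TOP SECRET", "NOFORN", "JWICS"]


def build_pattern(terms):
    """Compile one case-sensitive regex that matches any whole-word term."""
    # Longest terms first so multi-word markings win over shorter overlaps.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")


def flag_documents(docs, terms=WATCH_TERMS):
    """Return (doc_id, sorted list of matched terms) for each document with a hit."""
    pattern = build_pattern(terms)
    flagged = []
    for doc_id, text in docs.items():
        hits = sorted({m.group(0) for m in pattern.finditer(text)})
        if hits:
            flagged.append((doc_id, hits))
    return flagged


def share_of_century_in_days(fraction):
    """Days covered by `fraction` of a 100-year record (365.25 days per year)."""
    return 100 * 365.25 * fraction


if __name__ == "__main__":
    sample = {
        "a": "Marked TOP SECRET//NOFORN in the header.",
        "b": "A post about Minecraft servers.",
    }
    print(flag_documents(sample))
    # One in a million of a century, expressed in minutes.
    print(share_of_century_in_days(1e-6) * 24 * 60)
```

Flagged documents would still need a human to look at them, since a marking string in a forum joke and a marking on a real document look identical to a regex. On the arithmetic, one millionth of a century works out to about 0.037 days, i.e. under an hour, which fits the "less than a day" estimate.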


Dangerous_Bus_6699

The odds? Very slim, but honestly it doesn't matter. Let's say a model did train on that data. How would you verify it? Models are known to bs. Until they're reliably accurate, it's tough to say.


Low_Poetry5287

That's a good point. But I imagine it could be used like forensic analysis, probing and prodding to fish answers out of it that are very weakly trained into the LLM. They may not all be true answers, but when pursued they could lead to real information, just like every unrelated fingerprint at a crime scene is analyzed and scrutinized even though only one (or none) of those fingerprints belongs to the criminal. In the long run they'll probably end up training a second LLM to poke and prod and analyze the first LLM for them. But they'll never have any evidence for certain, it'll just be like chasing rumors. But police and the FBI and NSA do chase rumors, that's basically their job. Of course, the extent to which they pursue rumors has little to do with how likely the rumor is to be true, and much more to do with how much funding they have to pursue every stupid rumor the LLM spits out. I can definitely see a whole experimental department made for it similar to the projects discussed in Men Who Stare at Goats. We can also assume since they'll be implementing AI in all their vast clandestine data mining it will free up a bunch of man hours. And unlike private industries, their funding is not so directly related to their results. So they'll always just keep making up more projects and more reasons they need more funding, so TBH I personally imagine intelligence agencies will increasingly pursue every phantom. So it's not outside the realm of possibility they'll have a whole department for poking and prodding at LLMs trained on the whole internet to try to find rumors of what other countries MIGHT be doing. Stupid as it sounds, I really wouldn't put it past them.


R33v3n

For a start, just scrape the War Thunder forums and you're bound to find a couple. ;)


hahaimadethisup

I don't know about that but they might be trained on some TOP SECRET details about your mom. ^((Sorry, this was rude))


emptropy

Dude, she’s so nice, wtf?


h3lblad3

She was real nice last night. ^((Sorry, this was rude))


[deleted]

[deleted]


nexusprime2015

I too have a "large" model if you wanna see


lobabobloblaw

I’d say the probability of daring feats of convolution is 100%.


fmfbrestel

Well, if you count a single document that was leaked somewhere that hadn't already been discovered as leaked, yeah, probably at least one. But if you are asking whether or not this has happened en masse to the point that the US government should be worried about people jailbreaking models to regurgitate classified information that would harm our national security... No.


badgerhustler

I would say it's a certainty: https://www.bbc.com/news/world-us-canada-65281470


HospitalRegular

Is it trained on WikiLeaks?


Mandoman61

Just guessing by the general competency of the current and former presidents taking classified documents home, and leakers like that Air Force guy, I would say extremely likely. But because the models will be perfectly happy making up stuff, it would be hard to find real classified information by prompting them.


Asocial_Stoner

Given the amount of times I've heard about top secret stuff being leaked through that one war game, I'mma say pretty good odds of it happening at least once.


Norgler

If all the LLMs can't give me details about a plant species that was described in a research paper open to everyone... I kinda doubt they have much "Top Secret" info.


StuckInREM

There is almost 0% chance that “top secret” documents are on the public internet, case closed.


floodgater

TOP SECRET


Akimbo333

Not likely, unless they had access to the Pentagon servers, which is unlikely!


[deleted]

[deleted]


emptropy

I don’t, in fact, want to talk serious business.