I have done a lot of scraping in the past. Cookies are a pain, and this is a really elegant solution. Of course the biggest problem is that everything interesting is hidden away behind JavaScript these days, and then you have to resort to Selenium and the whole thing just spirals out of control. But I'm looking forward to giving this a shot for non-JavaScript content in the future.
Do you mean JavaScript? I have never run into content hidden by Java, but many pages load content dynamically using JavaScript.
I have found it's quite easy to snoop on those JavaScript API requests using the Network tab of Chrome DevTools, then copy the network request as a curl command for bash scripts or as JavaScript for browser extensions.
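If you'd rather replay the captured request from Python instead of bash, the same idea looks roughly like this; a minimal sketch, where the endpoint, headers and cookie are placeholders for whatever the Network tab actually shows:

    import requests  # pip install requests

    # Hypothetical values copied out of the Network tab for one XHR/fetch request.
    url = "https://example.com/api/v1/items?page=1"
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": "session=PASTE_FROM_DEVTOOLS",
    }

    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    print(resp.json())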
Tongue in cheek: You'd never know - servers running Java code generating HTML pages have probably conditionally not-rendered many pieces of HTML that you've never come across in your browsing :)
The term "everything interesting" is of course subjective. What is interesting to person A might not be interesting to person B. I never use Selenium and I generally have no problem acessing "everything interesting". The simplest example is reading and submitting HN comments. Presumably we all find this interesting enough. Javascript is neither required to read, vote nor submit to HN.
What if the phrase "everything interesting" were replaced with specific examples and questions? Something like, "I cannot access X without JavaScript. How do I access X without using JavaScript?"
It's possible that people might disagree on the definition of "works". For example, perhaps web developers might be biased toward a definition that puts them in control instead of the user. If I can retrieve information from a server with HTTP requests, then the website "works" for me. As a user, I certainly do not need to use JavaScript to make HTTP requests. Nor do I need to use a particular client.
One could argue that even HN does not "work" completely without JavaScript. For example, the script at https://news.ycombinator.com/hn.js will not run.
I once used Selenium to run JavaScript in the webpage to steal a few dynamic tokens required by the site's API, then reused them in my more well-trodden python-requests workflow.
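Roughly what that looked like, as a sketch; the global token name and API endpoint here are made up, and the page is assumed to expose the token somewhere reachable from JS:

    from selenium import webdriver
    import requests

    driver = webdriver.Chrome()
    driver.get("https://example.com/app")

    # Hypothetical: the page keeps a CSRF/session token on a global.
    token = driver.execute_script("return window.__csrfToken")
    cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
    driver.quit()

    # Reuse the token and cookies in the plain python-requests workflow.
    resp = requests.get(
        "https://example.com/api/data",      # made-up endpoint
        headers={"X-CSRF-Token": token},
        cookies=cookies,
    )
    print(resp.status_code, resp.text[:200])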
Using the tools at hand is often the best approach. That said, I've spent most of the last 13 years of my career automating browsers. For years, I used Selenium with a variety of libraries. After switching to Puppeteer/Playwright, I have zero interest in going back lol. Playwright actually has first-party Python support. (Puppeteer has a Python port called Pyppeteer, but it's no longer maintained and the author recommends using Playwright.)
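For anyone curious, the first-party Python support (sync API) looks roughly like this; the URL is just a placeholder:

    from playwright.sync_api import sync_playwright  # pip install playwright && playwright install

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")  # placeholder URL
        print(page.title())
        # Cookies for the whole browser context, handy for the scraping use case above.
        print(page.context.cookies())
        browser.close()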
Feels like you could just read Chrome's cookies from the file (and filter out the ones you need by site, of course), so you don't need to bother running Chrome in debugging mode?
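Something like this is what I had in mind, as a rough sketch assuming the default Linux profile path; the catch is that the useful payload sits in the encrypted_value column and needs an OS-specific key to decrypt (and Chrome may keep the file locked while running), which is probably why one would reach for a helper library or the debugging port instead:

    import sqlite3
    from pathlib import Path

    # Default Chrome profile path on Linux; adjust for your OS/profile (assumption).
    db = Path.home() / ".config/google-chrome/Default/Cookies"

    conn = sqlite3.connect(db)
    rows = conn.execute(
        "SELECT host_key, name, value, encrypted_value FROM cookies "
        "WHERE host_key LIKE ?", ("%example.com",)
    ).fetchall()
    for host, name, value, encrypted in rows:
        # 'value' is usually empty; the real cookie is in 'encrypted_value',
        # encrypted with a key held by the OS keyring / DPAPI.
        print(host, name, value, len(encrypted))
    conn.close()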
Thanks for the link. I know yt-dlp does, but from your link I found another library (https://github.com/n8henrie/pycookiecheat) that can do that, and it seems more popular than browser_cookie3. (browser_cookie3 worked totally fine the last time I tried it.)
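Both end up being one-liners in practice; a sketch with a placeholder domain/URL:

    import browser_cookie3                     # pip install browser-cookie3
    import requests
    from pycookiecheat import chrome_cookies   # pip install pycookiecheat

    # browser_cookie3 returns a CookieJar that requests accepts directly.
    cj = browser_cookie3.chrome(domain_name="example.com")
    print(requests.get("https://example.com/account", cookies=cj).status_code)

    # pycookiecheat returns a plain dict of cookie name -> value for a URL.
    cookies = chrome_cookies("https://example.com")
    print(requests.get("https://example.com/account", cookies=cookies).status_code)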
One could argue that cookies need to be more securely stored than passwords, because they can allow an attacker to bypass passwords and all other authentication factors.
> Tired of copy pasting cURL commands from chrome to your terminal ?
FYI for anyone who does this, mitmproxy is usually a better option for this type of stuff. Not sure about Chrome, but especially with Firefox, you have no way of getting the full raw request on anything with a request body like POST. You have to Copy Request Headers, then Copy POST Data. With mitmproxy or similar you can just get the full request at once. You can also inject headers like X-Forwarded-For into all or specific requests.
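Header injection is a tiny addon script, for example (the host filter and IP here are made up):

    # inject.py - run with: mitmdump -s inject.py
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Inject into every request, or gate it on a specific host (assumption).
        if flow.request.pretty_host.endswith("example.com"):
            flow.request.headers["X-Forwarded-For"] = "203.0.113.7"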
Repo has a single open pull request where someone (by their own admission) just fed the code into ChatGPT then pasted the results into a pull request without actually testing it.
Honestly thought it was a troll because of April Fools or something, but the guy actually has like 10% useful suggestions in the PR.
Will probably close that one tho.
Woah! I didn't realise it was 1st April when I raised that PR.
Sorry if it isn't of much use (as can be expected of straight GPT responses).
I didn't mean to come out as a troll.
I agree; I should've at least tested the change before raising a PR, but otherwise I would've just dropped the idea, so it felt best to share my findings at least.
I've had kind of the same problem in the past. I built a Chrome extension that generates a cookie-jar text file, because it turns out most relevant tracking or session cookies are on external domains or OAuth provider domains. [1]
You just need to copy/paste the generated text content into a cookies.txt and you're set, so it worked for my workflow in the terminal.
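That Netscape-format cookies.txt also plugs straight into Python's standard library if you want to go beyond curl; a sketch, with placeholder file name and URL:

    import requests
    from http.cookiejar import MozillaCookieJar

    cj = MozillaCookieJar("cookies.txt")   # the file the extension generated
    cj.load(ignore_discard=True, ignore_expires=True)

    # requests accepts any CookieJar, so the jar can be reused across calls.
    resp = requests.get("https://example.com/private", cookies=cj)
    print(resp.status_code)

curl -b cookies.txt reads the same format, so one file covers both workflows.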
NB. You cannot enable remote debugging if using Chrome in Guest mode on ChromeOS. Why is left as a question for the reader.
A more universal solution, one that does not require enabling websockets, is a localhost-bound forward proxy where HTTP traffic, including cookies, is saved in log files. No need to copy/paste with a GUI or mouse. Can use standard UNIX utilities to work with log files.
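A minimal sketch of the idea with just the standard library; it handles plain-HTTP GETs only (HTTPS would need CONNECT tunnelling or a MITM proxy, which is omitted), and the port and log file name are arbitrary:

    # logging_proxy.py - rough sketch of a localhost-bound forward proxy that
    # logs request lines and Cookie headers to a file.
    import http.server
    import urllib.request

    LOG_PATH = "proxy.log"  # arbitrary log file name

    class LoggingProxy(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # For a forward proxy, clients send the absolute URL in the request line.
            with open(LOG_PATH, "a") as log:
                log.write("GET " + self.path + "\n")
                cookie = self.headers.get("Cookie")
                if cookie:
                    log.write("Cookie: " + cookie + "\n")
            # Relay the request upstream and pass the body back to the client.
            try:
                with urllib.request.urlopen(self.path) as upstream:
                    body = upstream.read()
                    status = upstream.status
                self.send_response(status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except Exception:
                self.send_error(502)

    if __name__ == "__main__":
        # Bound to localhost only, as described above.
        http.server.HTTPServer(("127.0.0.1", 8080), LoggingProxy).serve_forever()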
I used to use an extension called "Get Cookies.txt" to do this, then several more by similar names with the same codebase. I looked at the code myself and didn't see anything suspicious, but they each got removed from the Chrome Web Store for security violations.
edit: JavaScript not Java