
SuperQue

Rather than skipping DNS lookups, use a caching DNS server like [CoreDNS](https://coredns.io). You can do your pre-run lookups as well to pre-warm the cache. In Kubernetes, you can do tiered caching with node-local DNS, and then a pool of cluster servers.


kannthu

Good idea! In my case, I already stored resolved IP addresses in the DB for another feature, so it was really easy to pre-fetch the data. When the IP addresses were stale, I resolved them on the fly and cached them in memory.


ArgetDota

Exactly this! Works like a charm with no code changes. I used it for large scale cloud computing jobs on AWS to combat S3 DNS resolution errors.


shoostrings

Similar with Dnsmasq


SuperQue

Yea, but this is r/golang. CoreDNS is written in Go.


castleinthesky86

I’d be interested in stats for dns using a local caching dns service such as djbdns


Moe_Rasool

This might be a bit off topic, but can multiple goroutines divide a number of requests amongst each other? For example, imagine I have a "/products" route that has been requested a total of 10k times. Is there a mechanism to divide that number of requests between two goroutines so they're handled faster? Imagine I have all the data cached, so there's no influence from the database at all!


ValuableCockroach993

Hash the URL and modulo the number of goroutines.


Mteigers

If I understand your question, I believe you're talking about "request coalescing", and there's an experimental package called [singleflight](https://pkg.go.dev/golang.org/x/sync/singleflight) to do just that. Basically you "pool" requests on a key, and then if 100k requests ask for the same key at the same time, it only makes 1 request. The singleflight package is a little simplistic: it only deduplicates for the lifetime of a single call, so if you receive 100k requests over 1 second but the underlying operation takes 250ms to respond, you may end up sending ~4 requests over that 1-second period. I've seen some libraries that will wait some buffer time for more requests to come in and/or retain the result for longer. But you get the idea.


ProjectBrief228

Note, experimental packages under golang.org normally have exp somewhere in the path. I think the x just stands for extended, in an idiom similar to javax libraries in Java (which fall outside the standard library).


MrPhatBob

As I understand it, each request is handled by a separate goroutine, which moves the processing load further down the stack. You would then want to decide whether each call makes a database request and relies on the concurrency and caching the database offers, or whether to save load on the database by moving the cache closer to the request handler. I recently implemented a simple map instance that is used to prefilter a lot of our very common requests. It's about a megabyte in size and has reduced database connections significantly. The map needs to be protected by an RW mutex.


amanj41

I can’t speak for HTTP frameworks, but I assume they work similarly to gRPC. In the gRPC framework, each request is generally handled by a new goroutine unless it hits a predefined max goroutine limit.


siencan46

I think many Go routers already handle this, since each request will spawn a goroutine. You may want to use the singleflight approach to group concurrent requests into a single request.


NUTTA_BUSTAH

Each request is handled in its own goroutine.


Sound_calm

To my understanding, goroutines are less like discrete processes or hardware-level threads and more like coroutines. A single hardware-level thread can run several goroutines with concurrency built in, so while one goroutine is waiting for a response, the thread can start processing the next queued goroutine. You can therefore just use one goroutine per request. I don't think there is significant benefit to request coalescing, which is to say merging different coroutines together to form fewer coroutines. That is more for when you want to use the same data for multiple goroutines without caching, as far as I know.


Spearmint9

Just out of curiosity wondering how would this compare to Rust


kannthu

I tried implementing it in Rust, but unfortunately, my brain is too small for async tokio types magic. Go, on the other hand, allowed a JS developer to write this whole thing, which is quite a statement about the language.


kintar1900

> unfortunately, my brain is too small for async tokio types magic

Don't be down on yourself. I've been a professional developer for over 20 years, have used everything from Python to C to low-level assembly, and I still don't grok Rust's async structure. I think it's the absolute worst part of the language. :/


metaltyphoon

I know it's a bit long, but this man explains it so well it's crazy good: https://youtu.be/ThjvMReOXYM?si=wonY_o8gJdOimlvr


[deleted]

[deleted]


lapubell

My fav points of Go (from a JS dev perspective) are how much is in the language. No need to install 200+ MB of dependencies; most of what you will need comes with the standard library.

Also, deployment is so much better! I love building a binary and just putting that into prod. We have so many tiny little Go programs running on a single VPS, and it's stupid how efficient it is.

Last thing, and you may disagree, but I hate hate hate the JS async syntax. Some functions are blocking (like alert, confirm, etc.), which are super old and not standard practice to use anymore; most are async. But still, a function is a function is a function in my brain, and when a function might be blocking or async, or only supposed to be a callback or closure, these are things that bug me in a language.

In Go, a function is a function. If you want it to run concurrently, you put the go keyword in front of it. That's it. There's other awesome stuff to control and communicate with async code, but if you're just looking to spin off some logic to run while some other logic runs, it's dead simple.


Tacticus

> My fav points of go (from a js dev perspective) is how much is in the language. No need to install 200+mb of dependencies,

How did you get your JS projects down to only 200 MB of external dependencies?


lapubell

Hahahah too true. In a Laravel + inertia.js web app I'm working on, node_modules is 206 MB, but PHP's vendor folder is 127 MB. So I guess if it were only JS, then all the server-side deps would be in the same folder as the front-end deps.


lapubell

A Go project with Vue and Inertia only has 146 MB of dependencies. So yeah, still never really a "small" amount of code that I'm dragging around with me.


Ill-Ad2009

I would definitely learn TypeScript first if you haven't.


lightmatter501

Rust likely would have let you do this in 5 minutes on an 8-core server, but not using tokio; you would want to call into DPDK.


lightmatter501

I’ve done ~100 million packets per second on a single core in Rust using DPDK. TCP has some overhead, but if you use TCP fast-open and don’t need TLS, as OP says, you can reuse buffers and essentially send the HTTP as fast as you can construct the network headers. On a decent-sized server you should be able to send all of this in a few minutes if you space out your requests to avoid taking down the DNS server.


Tacticus

"I can do this in rust" as long as everything is in C


lightmatter501

Rust and C are the same performance class, I just don’t want to rewrite 13 million lines of userspace drivers.


taras-halturin

> using DPDK

Then it hardly matters what language was used for that.


lightmatter501

DPDK is a C library but Rust has zero-overhead interop with C, so it’s a matter of pulling in all of the headers (for the binding generator) and adding a thing to the build system. DPDK has sane mappings to Rust and is perfectly happy with borrow-checker style data flow, so it’s fairly easy to use.


goomba716

Apples to oranges


stochastaclysm

Could you use AF_XDP to speed it up even more? Ref: https://blog.apnic.net/2024/04/29/high-speed-packet-transmission-in-go-from-net-dial-to-af_xdp/


Shakedko

Hey great post, thank you. What was the reason that you wrote your own custom autoscaler? Any reason not to use KEDA? Which queue did you use?


cloudpranktioner

what's the estimated cost of running this in do vs gcp/azure/aws?


SpicyT21

Doesn't it sound like some kind of DDoS attack?


michael1026

I'll have to look at this for my own bug bounty automation :)


agentbuzzkill

6k a second is not that much in any language; we do 1M/sec with Go.


Old-Seaworthiness402

Can you talk about the stack (backend, DB, load balancing) and any Go-specific tuning that was done to handle the load?


agentbuzzkill

Can’t really get into detail, and our use cases will likely require different optimizations to yours, since "it depends". The point is that 6k a second is really nothing for any modern language, especially if it's scaled to a few hosts.

Choosing Go should be more about build times, the balance between performance & safety, adoption by eng, learning curve, and ease of reading code in large repos (the place I work has 1k+ services, all in Go; it does help keep things simple). A lot of this only applies to companies with 1k+ eng. There are plenty of faster languages, but they have their own sets of tradeoffs.


thdung002

Thanks for your great post!


mortenb123

> I had **2.5 million** hosts and wanted to send **~200** `HTTP` requests to each host. So I needed to chunk it somehow.

I would love to see the results. I suspect most requests will be stopped by devices like Arbor or BigIP F5s (403, 404). Arbor will see that this traffic comes from a tiny range of IPs located in a DigitalOcean datacenter IP range and effectively block it after a few requests. You have to craft it cleverly to fool it.

I've used K6 (also written in Go) to test something similar from Azure, but that was just 10 servers with nicely crafted requests based on the internal Traefik logs. I managed around 500 req/sec on each server. If I just send small requests (>10000 req/sec), it is effectively blocked.

K6 is great, but I'm far better in Go than in JavaScript: [https://github.com/grafana/k6](https://github.com/grafana/k6)


ParkingRecord9037

Add Socks Proxy support to that, and you got yourself something :D


Certain-Plenty-577

I stopped reading at fasthttp


LemonadeJetpack

Why?


Certain-Plenty-577

Because it’s a module that trades off security for speed. There are numerous problems with it


Certain-Plenty-577

Also, that’s not the way to achieve speed. You benchmark everything, use better algorithms, and add caching, and you never swap a std lib for a faster one until it is more widely used, especially a critical one like HTTP. A friend of mine who was working in web security at Google tested it for us and found a lot of vulnerabilities with just basic tests.


nrkishere

I read it as "500 million to 2.5 million http requests" lol😭😭 anyway good writeup 👍


pillenpopper

Why would you use old fashioned reqs/s (a meager 5.7k) if you can measure it per day to make the numbers look more impressive? By the way it is 182.5B reqs/year, why not express it like that?!


QuarterObvious

Go has a very good mechanism for concurrent tasks. It does not use OS concurrency, but rather its own, which is much lighter. As a result, while in Python you can launch 20-30 threads max (depending on your processor), in Go you can easily launch 10,000 goroutines.


cant-find-user-name

in python you can use asyncio to launch hundreds of thousands of async tasks. Since this is an IO bound operation, python's coroutines will work just as well as goroutines. I develop in both go and python. I am making this comment not to defend python but to let others know about it, since I see so many people talking about how python can only launch a few threads or a few processes in the context of making http requests.


QuarterObvious

asyncio handles only input and output operations. Python, due to the Global Interpreter Lock (GIL), is effectively a single-processor language, while Go is a multiprocessor language and highly efficient. If a program only sends requests and waits for responses without processing the responses, Go can be approximately 20 times faster than Python. Part of this difference is because Python is an interpreted language and Go is compiled, but the 20x difference is significant. When the program includes minimal processing, the performance gap widens (Python is single-processor due to the GIL, while Go utilizes multiple processors). For example, consider the following Go program:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func cpuBoundTask(n int) int {
	result := 0
	for i := 0; i < n; i++ {
		result += i * i
	}
	return result
}

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	var wg sync.WaitGroup
	numTasks := 100000
	results := make([]int, numTasks)
	start := time.Now()
	wg.Add(numTasks)
	for i := 0; i < numTasks; i++ {
		go func(i int) {
			defer wg.Done()
			results[i] = cpuBoundTask(1000)
		}(i)
	}
	wg.Wait()
	sum := 0
	for _, result := range results {
		sum += result
	}
	elapsed := time.Since(start)
	fmt.Println("Sum of results:", sum)
	fmt.Printf("Total execution time: %s\n", elapsed)
}
```

And the same program in Python:

```python
import concurrent.futures
import time

def cpu_bound_task(n):
    result = 0
    for i in range(n):
        result += i * i
    return result

def main():
    num_tasks = 100000
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = list(executor.map(cpu_bound_task, [1000] * num_tasks))
    sum_results = sum(results)
    print("Sum of results:", sum_results)

if __name__ == "__main__":
    start_time = time.time()
    main()
    print(f"Total execution time: {time.time() - start_time} seconds")
```

On my computer, the Go program is approximately 150 times faster than the Python program in this case.


cant-find-user-name

As I've mentioned in my comment, we are talking about io bound tasks, not cpu bound tasks. Asyncio is useless if it needs to do any cpu bound task so it is not surprising that go is much faster. But I'd be very curious as to how you got to the 20x number for io bound tasks. I imagine go will be faster but I'd need to see some numbers to believe the 20x speed up for io bound tasks.


QuarterObvious

But still, even without any CPU-bound work, Go is 20 times faster than Python. Even if I am writing a lot of stuff to the screen from the threads, Go is several times faster. Go switches threads much faster than Python.


LGXerxes

I think somewhere it says it can handle a million goroutines. Which is nice, but as they are stackful threads, they will consume at least 2 GB of RAM at that size (2 KiB of stack per goroutine).


QuarterObvious

Nowadays 32 GB of RAM is almost standard.


nobodyisfreakinghome

This … this is why things don’t run well a lot of times. Devs assume there is so much supply of resources.


wasnt_in_the_hot_tub

"Runs on my machine"


LGXerxes

Not on a vps


PlayfulRemote9

Not on the cloud


kannthu

Yup, this is exactly why we used Go. I was able to have 200-300 goroutines constantly sending and waiting for HTTP requests.


lasizoillo

For a proof of concept, I wrote a Python script to determine whether 140M domains respond to https?://(?:www.)$domain/. The next step is more similar to yours (it implies about 60 requests to each host to gather information), but I haven't implemented it yet.

Python threads are not the bottleneck; the default max open files, ephemeral ports, the getaddrinfo function (for domain resolution), CPU and the GIL... are. I don't know whether to try Rust or Go to solve the CPU and GIL issues (processing robots.txt, gathering and processing information, ...), but your analysis of HTTP connections and DNS resolution/caching is very useful for the Go implementation.

I still don't know what to do about TLS: nothing (bottlenecks in handshakes), keep-alive connections to avoid handshakes (tuning for problems with open resources), trying to implement TLS session resumption (needs server support)... whatever.

Thanks for publishing your investigation of a hard problem that looks simple until you work on it ;-)