A trading bot is a script that hits an API. A trading platform is a system that other people trust with real money, in a live market, under regulators who don’t grade on a curve. The distance between the two is where most projects quietly fail.
That distance comes down to four things you have to get right at the same time: the financial domain, real-time correctness, cloud infrastructure that holds up at scale, and regulatory compliance. Most teams are genuinely good at one or two of them. The trouble is that a weakness in any one will sink the platform in a way the other three can’t rescue. A flawless matching engine doesn’t matter if you double-fill an order. Perfect uptime doesn’t matter if you’re operating without a licence.
To get specific about where the line actually sits, we spoke with Oleksii Samara, a Principal Software Architect at IT Craft, a software company that has worked in fintech and healthtech for over twenty years. Oleksii has spent 20 years in fintech, focused on trading platform architecture, cloud infrastructure on AWS and multi-cloud setups, and financial API integrations including the FIX protocol, the standard for institutional trading connectivity. At IT Craft he has led production trading systems for clients including Predira, an OTC platform builder that lets interdealer brokers stand up configured trading solutions quickly.
So what does it actually take? His answers, below.
The part that keeps you up at night: orders that can’t go wrong
Everything else is recoverable. An order that fills twice, or fills once but gets recorded as failed, is money moving in the real world based on a bug. This is where domain knowledge and systems engineering meet, and it’s the first thing Oleksii designs for.
If your trading bot executes an order and the exchange API returns an ambiguous response, how do you handle it, and how do you prevent a double fill?
My approach has three layers. First, every order submission carries a client-generated idempotency key. If we retry, we send the same key, and the exchange returns the existing order status rather than creating a new one. Second, rather than retrying the original submission, we immediately query the order status endpoint by client order ID to determine definitively whether the order was created. Only if that query also fails do we retry, with exponential backoff and a hard retry ceiling. Third, before any order touches the exchange, we write the intended order to a durable store.
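A minimal Python sketch of that three-layer flow may make the order of operations concrete. Everything named here is hypothetical: `FakeExchange` stands in for a real exchange client, `store` is a plain dict standing in for the durable store, and `AmbiguousResponse` represents any timeout or 5xx where the outcome is unknown. It also assumes the exchange deduplicates on the client order ID, as the answer describes.

```python
import time
import uuid
from dataclasses import dataclass, field


class AmbiguousResponse(Exception):
    """Timeout or server error: we cannot tell whether the order was accepted."""


@dataclass
class FakeExchange:
    """Hypothetical stand-in for an exchange client, for illustration only."""
    orders: dict = field(default_factory=dict)
    lost_replies: int = 0  # number of submits that are accepted but whose reply is lost

    def submit(self, client_order_id, payload):
        # Idempotent on the client order ID: a repeat returns the existing order.
        self.orders.setdefault(client_order_id, {"status": "open", **payload})
        if self.lost_replies > 0:
            self.lost_replies -= 1
            raise AmbiguousResponse("reply lost after send")
        return self.orders[client_order_id]

    def get_order(self, client_order_id):
        return self.orders.get(client_order_id)


def submit_safely(exchange, store, payload, max_retries=3, base_delay=0.05,
                  sleep=time.sleep):
    client_order_id = str(uuid.uuid4())  # doubles as the idempotency key
    # Record the intent before anything reaches the exchange.
    store[client_order_id] = {"status": "pending", **payload}

    for attempt in range(max_retries + 1):
        try:
            order = exchange.submit(client_order_id, payload)
        except AmbiguousResponse:
            # Ask what happened instead of blindly resending.
            try:
                order = exchange.get_order(client_order_id)
            except AmbiguousResponse:
                order = None
        if order is not None:
            store[client_order_id]["status"] = order["status"]
            return client_order_id
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # exponential backoff, same key next time

    # Hard ceiling reached: leave the record for reconciliation to resolve.
    store[client_order_id]["status"] = "unknown"
    raise RuntimeError(f"order {client_order_id} unresolved after {max_retries} retries")
```

In this sketch a lost reply results in one status query and no second order, because every attempt reuses the same client order ID.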
How do you guarantee an order is processed exactly once, never dropped and never duplicated?
Before submitting any order to the exchange, we write the intended order to a persistent store with status ‘pending’. That’s the authoritative record, no matter what happens next. Idempotency keys on every submission handle the retry case: the exchange recognises the key and returns the existing order rather than creating a duplicate. Between the decision layer and the submission layer, we use an SQS FIFO queue with a deduplication ID on each order, so a message resent inside the five-minute dedup window collapses to a single delivery. The thing you have to respect is that window: a retry that lands after it, or under a fresh ID, gets treated as new, so the retry policy and the dedup window have to be designed together rather than tuned separately. Behind all of this, a background reconciliation process compares our local order ledger against the exchange’s open and recent orders at regular intervals. That job isn’t only hunting for missing orders; it also has to fold in partial fills and cancel/replace activity, matching each fill report against the ledger so the volumes actually agree instead of quietly drifting apart.
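The reconciliation pass can be sketched roughly as follows. This is an illustration under assumptions, not the production job: the `LedgerOrder` shape, the fill-report fields (`exec_id`, `client_order_id`, `qty`) and the status names are all hypothetical, and cancel/replace handling is left out for brevity.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LedgerOrder:
    client_order_id: str
    quantity: Decimal
    filled: Decimal = Decimal("0")
    status: str = "pending"


def reconcile(ledger, fills, open_ids):
    """Fold exchange fill reports into the local ledger; return discrepancies found."""
    issues = []
    totals = {}
    seen_exec_ids = set()
    for fill in fills:
        if fill["exec_id"] in seen_exec_ids:  # the same fill reported twice
            continue
        seen_exec_ids.add(fill["exec_id"])
        cid = fill["client_order_id"]
        totals[cid] = totals.get(cid, Decimal("0")) + fill["qty"]

    for cid in sorted(totals.keys() - ledger.keys()):
        issues.append(f"{cid}: exchange reports fills for an order not in the ledger")

    for cid, order in ledger.items():
        filled = totals.get(cid, Decimal("0"))
        if filled != order.filled:
            issues.append(f"{cid}: ledger filled {order.filled}, exchange {filled}")
            order.filled = filled
        if filled > order.quantity:
            issues.append(f"{cid}: filled {filled} exceeds quantity {order.quantity}")
        if filled >= order.quantity:
            order.status = "filled"
        elif filled > 0:
            order.status = "partially_filled"
        elif cid in open_ids:
            order.status = "open"
        elif order.status == "pending":
            issues.append(f"{cid}: pending locally but unknown to the exchange")
    return issues
```

Quantities use `Decimal` so that summed partial fills compare exactly against the ledger rather than drifting through float rounding.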
Speaking the market’s language
Exchanges don’t talk REST and forgiveness. Institutional connectivity runs on protocols that assume you know exactly what you’re doing, and the details that look like trivia are the ones that take down a session in production.
Which exchange APIs have you integrated with in production?
My primary protocol-level experience is with FIX, the Financial Information eXchange protocol used for institutional trading across equities, futures, and FX. FIX is reliable and low-latency, but unforgiving: session management, heartbeat handling, sequence number recovery, and message validation all have to be implemented precisely.
Sequence handling is where most of the subtle bugs live. Every message carries a sequence number, and if one arrives lower than expected with the PossDupFlag set, you drop it instead of processing it, otherwise a routine retransmission turns into a duplicate order. We reset sequence numbers on a schedule, usually at the start of the trading week, and lean on heartbeats and test requests so a half-dead session gets detected and torn down rather than silently swallowing messages.
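As a rough illustration of the inbound sequence check, the Python sketch below classifies a parsed FIX message by its MsgSeqNum (tag 34) and PossDupFlag (tag 43). The function name, the `Action` enum and the dict-of-tags message shape are assumptions made for this example. Only the drop-on-duplicate branch comes from the answer above; the other branches follow standard FIX session-layer behaviour, and special cases such as SequenceReset messages are deliberately not handled.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process"
    IGNORE_DUPLICATE = "ignore_duplicate"
    REQUEST_RESEND = "request_resend"
    DISCONNECT = "disconnect"


def classify_inbound(expected_seq: int, msg: dict) -> Action:
    """Decide what to do with an inbound message, given the next expected MsgSeqNum."""
    seq = int(msg["34"])             # tag 34 = MsgSeqNum
    poss_dup = msg.get("43") == "Y"  # tag 43 = PossDupFlag

    if seq == expected_seq:
        return Action.PROCESS
    if seq > expected_seq:
        # Gap detected: messages were missed, so ask the counterparty to resend.
        return Action.REQUEST_RESEND
    if poss_dup:
        # Lower than expected and flagged as a possible duplicate:
        # a retransmission of something already handled, so drop it.
        return Action.IGNORE_DUPLICATE
    # Lower than expected without the flag means the two sides disagree
    # about session state; continuing would be unsafe.
    return Action.DISCONNECT
```

For example, `classify_inbound(10, {"34": "8", "43": "Y"})` yields `Action.IGNORE_DUPLICATE`, while the same message without tag 43 yields `Action.DISCONNECT`.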
On the Predira OTC platform, the application layer ran over GraphQL, and we hit a subtle ordering problem under load. WebSocket subscription messages for order status updates were arriving at the client out of sequence. We fixed it by adding sequence numbers to the subscription event stream and buffering on the client side to enforce correct ordering before showing state to the UI. Straightforward once we’d found it, but it forced us to drop an assumption that feels safe and isn’t: that WebSocket delivery order equals processing order.
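The client-side buffering can be sketched in a few lines. This is an illustrative Python version, not the Predira client code: the `ReorderBuffer` name and its interface are hypothetical, and it assumes the server stamps each subscription event with a gap-free, increasing sequence number.

```python
class ReorderBuffer:
    """Hold out-of-order events and release them strictly in sequence."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq  # the sequence number the UI is waiting for
        self.pending = {}          # seq -> event, for events that arrived early

    def push(self, seq, event):
        """Accept one event; return the events that can now be shown, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already delivered or already buffered: drop the duplicate
        self.pending[seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
early = buf.push(2, "filled")     # arrives first, held back
in_order = buf.push(1, "open")    # unblocks both events
```

A real client would also need a timeout or resync path for a sequence number that never arrives; that is omitted here.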
Holding up when the market gets loud
Markets don’t send you steady traffic. They send you nothing for an hour and then everything at once. A platform that drops orders during a volume spike is worse than useless, because the spike is exactly when the orders matter most.
How do you handle exchange API rate limits without dropping orders during peak volume?
The foundation is a token bucket rate limiter at the outbound API layer. Each endpoint has its own bucket that refills at the permitted rate, and if the bucket is empty, orders queue rather than drop. We prioritise: order submissions are high, status polls medium, balance checks low, so under pressure the capacity goes where it matters. Where the exchange supports batch submission, batching increases effective throughput. For multi-tenant platforms, separate API key pools for submission and polling keep polling traffic from eating into order submission quota. We track consumption via CloudWatch and alert before the system hits the ceiling. We usually start the alarm around 70% of quota and then tune it to the platform’s baseline, because the right threshold depends entirely on how spiky normal traffic is. It isn’t a number you can pick in the abstract.
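A token bucket with a priority queue in front of it might look like the sketch below. The class name, the priority constants and the explicit `now` argument are choices made for this example. For brevity it models a single bucket, so the per-endpoint arrangement described above would correspond to one instance per endpoint, or per rate limit.

```python
import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2  # lower number is served first


class PriorityRateLimiter:
    """Token bucket that queues requests when empty instead of dropping them."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a priority

    def enqueue(self, priority, request):
        heapq.heappush(self._queue, (priority, next(self._order), request))

    def drain(self, now):
        """Refill for elapsed time, then release as many queued requests as tokens allow."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        released = []
        while self._queue and self.tokens >= 1:
            _, _, request = heapq.heappop(self._queue)
            self.tokens -= 1
            released.append(request)
        return released

    def backlog(self):
        return len(self._queue)
```

Passing the clock in as `now` keeps the sketch deterministic; a caller would normally pass `time.monotonic()`. With a capacity of two and three queued requests, the order submission and status poll go out first and the balance check waits for the next token.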
What database and queuing architecture would you use for a platform handling 5,000 orders per second with full audit logging?
At that throughput, a relational database as the primary write path becomes the bottleneck immediately. The architecture I use separates three concerns: ingestion, operational state, and audit. Every order action is written as an immutable event to an append-only log, Apache Kafka at this scale. That event stream is both the audit trail and the source of truth. A separate consumer maintains a materialised view of current order state in DynamoDB, which handles the operational read load with single-digit-millisecond latency and scales horizontally. Analytical and reporting queries run against a separate store, Redshift or Athena on S3, populated from the same stream, so reporting never competes with operations. The SQS submission queue absorbs bursts and provides back-pressure: if the exchange is rate-limiting, orders queue durably instead of dropping. One detail that matters at this scale: the consumers building those views have to be idempotent, because events get replayed during recovery and a consumer will see some of them more than once. The DynamoDB view is eventually consistent with the log by design, so the system is built to tolerate a brief lag between an event landing and the read model catching up, rather than pretending that gap isn’t there.
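One common way to make such a consumer idempotent is to version each order’s events and ignore anything at or below the version already applied. The Python sketch below uses an in-memory dict in place of DynamoDB. The event fields (`order_id`, `seq`, `status`, `filled`) are hypothetical, and it assumes each event carries a per-order sequence number and absolute state rather than deltas. Against DynamoDB, the same guard would typically be expressed as a conditional write.

```python
def apply_event(view, event):
    """Apply one order event to the materialised view; return False for a replay."""
    current = view.get(event["order_id"])
    if current is not None and event["seq"] <= current["seq"]:
        return False  # already applied, or older than what the view holds
    view[event["order_id"]] = {
        "seq": event["seq"],
        "status": event["status"],
        "filled": event["filled"],  # cumulative quantity, not a delta
    }
    return True


def rebuild(view, log):
    """Replay a slice of the log; safe to call repeatedly over overlapping slices."""
    return sum(apply_event(view, event) for event in log)


log = [
    {"order_id": "o1", "seq": 1, "status": "open", "filled": 0},
    {"order_id": "o1", "seq": 2, "status": "partially_filled", "filled": 40},
    {"order_id": "o2", "seq": 1, "status": "open", "filled": 0},
]
view = {}
first_pass = rebuild(view, log)   # applies all three events
second_pass = rebuild(view, log)  # a recovery replay changes nothing
```

Carrying cumulative state in each event is what makes the replay harmless: applying a delta twice would double-count, whereas rewriting the same absolute value does not.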
Guarding the keys and the perimeter
A trading platform holds the keys to other people’s live accounts. That changes the security problem from “protect our data” to “make sure one client’s breach can never touch another’s money.”
How do you penetration-test a trading platform before production?
Security testing has to cover both the standard web application attack surface and the trading-specific vectors that general testers miss. Automated DAST tooling runs against staging as part of every CI/CD pipeline, which catches OWASP Top 10 class issues before a human tester sees the build. For major releases, we bring in an independent security firm for a structured test across the application layer, API layer, authentication flows, and data storage. Beyond that standard scope, we specifically test for order manipulation. Before the Predira MVP went live, we also reviewed every AWS Security Hub finding and remediated everything at High or Critical severity.
How do you store and manage exchange API keys for a multi-tenant platform where each client has keys connected to live accounts?
Keys never touch application databases or environment variables. They go into AWS Secrets Manager, retrieved at runtime via IAM-controlled API calls. Each tenant’s keys live under a separate secret ARN with separate IAM policies, so a vulnerability affecting one tenant’s keys can’t cascade to another’s. The keys themselves are envelope-encrypted with customer managed KMS keys. Where the exchange supports rotation, we automate it: Secrets Manager triggers a Lambda that generates a new key pair, updates the secret, and invalidates the old one.
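The per-tenant isolation can be illustrated by generating one IAM policy document per tenant, each scoped to that tenant’s secrets only. The sketch below is an assumption-laden example rather than the actual setup: the `exchange-keys/<tenant_id>/` naming scheme, the function name and the tenant ID format are invented here. A real policy would also need the matching `kms:Decrypt` permission for the customer managed key, which is left out.

```python
import json
import re


def tenant_secret_policy(region, account_id, tenant_id):
    """Build an IAM policy document that can read only one tenant's secrets."""
    # Reject anything that could widen the resource pattern (e.g. "*" or "a/../b").
    if not re.fullmatch(r"[A-Za-z0-9_-]+", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    resource = (
        f"arn:aws:secretsmanager:{region}:{account_id}:"
        f"secret:exchange-keys/{tenant_id}/*"  # hypothetical naming scheme
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": resource,
            }
        ],
    }


policy = tenant_secret_policy("eu-west-1", "123456789012", "tenant-a")
policy_json = json.dumps(policy, indent=2)  # ready to attach to the tenant's role
```

Validating the tenant ID before interpolating it matters as much as the policy itself, since a wildcard smuggled into the ARN would silently undo the isolation.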
Knowing when to call a lawyer
This is the discipline engineers most often skip, and the one that can void the whole build. You can’t write your way out of operating in a jurisdiction that says you needed a licence you never got.
At what point in a project do you raise regulatory licensing and compliance?
Early, before architecture decisions get locked in. The questions I ask in any engagement: what jurisdictions will the platform operate in, is the system making decisions on users’ behalf or just facilitating their own orders, will it hold custody of funds or assets, and has the client consulted a regulatory lawyer in their target market. When a client hasn’t thought about this, I tell them to get regulatory advice before we commit to a design. I’m not a lawyer, but I know which questions change what we build.
What this tells you
Read back through those answers and notice how little overlap there is between them. Idempotency keys and FIX sequence recovery and KMS envelope encryption and “have you talked to a regulatory lawyer” are four different kinds of knowledge. None of them is exotic on its own. The hard part is that a trading platform demands all of them, in the same system, holding together under live load.
That’s why a working trading bot tells you almost nothing about whether a team can ship a platform. The bot proves someone can call an API. The platform proves they can be wrong about the API response, the network ordering, the traffic spike, the breach, and the jurisdiction, and still not lose anyone’s money.
If you’re scoping a trading platform, the questions above make a decent litmus test for any team you’re evaluating, including your own. The right team won’t have a slick answer to every one. But they’ll recognise why each question matters, and they won’t be hearing any of them for the first time.
Oleksii Samara is a Principal Software Architect at IT Craft. If you’re weighing the build, these are the conversations worth having before a line of code gets written.