I couldn't let the year pass without one post. (Late) Merry Christmas.

CAD flow

Before you even begin CADing, flesh out what requirements must be met and what constraints are imposed. These fundamentally characterize the hardware you'll be describing; when we program, it is in service of the specification. CAD is the means of formalizing the design; it does not lead. To this end, the model serves to verify that the spec (which it ought to represent) is realizable and self-consistent.
Good CAD/RTL/HDL is no more complicated than it needs to be. Generally, but especially when reasoning about complex systems, additional cognitive load is not good.

Only once you have an MVP will you have a reference for (and can begin thinking about) microarchitectural (implementation) changes. 'Quantitatively backed predictions are how you seriously approach problems' (Gaben). It's a legitimate consideration to weigh whether muddling the design affords a tangible benefit -- there's an adage that 'the best part is no part'. Does early-restart or critical-word-first actually buy you more throughput in the context of the larger system? When you have an FSM, it may be possible to skip returning to the IDLE state if the producer is valid when the transaction goes thru -- but is saving the +1-cycle penalty of keeping it plain/reasonable really worth the (complexity) overhead of introducing another transition edge, and with it another MUX condition? Are you sure that this MUXing priority is sound, that it doesn't introduce any hazards or timing issues, and that it doesn't violate the implicit and explicit guarantees and assumptions present in the OG design?
Being able to actually ameliorate a design is a productive skill. From the design/integration POV: what is the justification for deviating from a clearly defined and obvious structure? Inside the context of the system, how are these gains recognizable? When you find it's okay to afford design relaxation, you may find solace.

The (CAD) Model is not the End

Harking back to FRC (HS robotics) days with 'real' CAD, it'd be easy to draw up something that was painful to assemble: there's no room for the wrench/screwdriver, or there's no reference for mounting this motor and you need to hack together a jig on bag day, or these two riveted/lock-tightened-together subassemblies must be taken apart because they geometrically cannot fit otherwise. I would get SolidWorks-myopic, prioritizing having a complete CAD rather than making it make-able, like a noob.

I'll relate these manufacturability/assembly oversights to loosely related VLSI examples:

From a physical-design perspective, flyover routes are ugly; at scale, the tools cannot converge these unconstrained paths.
chip_top should only represent connectivity (routes). When a module has logic or registers, it needs to meet timing, and this closure analysis can only begin once all blocks below it have closed too -- this bottleneck is not conducive to the tapeout schedule.
When you instantiate a module with varying parameters, each variant will synthesize to a different netlist. This means for every variant you've gotta go thru the whole rigmarole on the back-end even though the parametrization may have marginal impact. In some scenarios, it's possible to work around this by introducing an input that'll be tied to a constant (externally).
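The tie-off workaround from the last item could look something like the sketch below. The block (`pkt_filter_*`), its ports and the threshold values are all made up for illustration; the only point is that variant B is a single module regardless of the constant the parent straps in.

```systemverilog
// Variant A (hypothetical): every distinct THRESH elaborates to a distinct
// module, so each one goes thru synthesis and the back-end separately.
module pkt_filter_param #(
    parameter int THRESH = 8
) (
    input  logic [7:0] len,
    output logic       pass
);
    assign pass = (len >= 8'(THRESH));
endmodule

// Variant B (hypothetical): the threshold is an input, so there is one
// netlist for the block no matter what value the parent ties it to.
module pkt_filter_strap (
    input  logic [7:0] thresh,  // expected to be tied to a constant externally
    input  logic [7:0] len,
    output logic       pass
);
    assign pass = (len >= thresh);
endmodule

// Parent: two instances of the same module, differing only in the strap.
module pkt_filter_pair (
    input  logic [7:0] len,
    output logic       pass_lo,
    output logic       pass_hi
);
    pkt_filter_strap u_lo (.thresh(8'd8),  .len(len), .pass(pass_lo));
    pkt_filter_strap u_hi (.thresh(8'd64), .len(len), .pass(pass_hi));
endmodule
```

The cost is that the comparison against a constant is no longer folded inside the block itself, which is why this only makes sense when the parametrization has marginal impact.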

VLSI involves a lot of people and has a lot of processes. Being a friendly designer means you understand, and own, the context of your block and the collaterals that surround it; you should hand off a 'well-oiled' design which considers the people and flows that it'll encounter down the line. It's not like software where you can get away with spamming the test suite -- school doesn't hammer this home well, nor does it do justice to the scale faced in industry, where just the pre-compile steps, like ingesting the massive filelists, take significant time and compute.

Structure

The basics: organize for maintainability; the design should not be brittle and break when someone else attempts modifications.

Design to the invariant. Consider a trivial pipeline: you have some compute divvied up across a number of sequential stages, with a producer and consumer. In hardware, there'll need to be a 'stall' indicator, a wire, from back to front (i.e. when all intermediary stages have partials, ignore the producer if the consumer isn't ready). A naive way of coding this would be to define a stall flag, conditioning on signals from all stages (wide-fanout), and praying the tools are able to de-jumble it. A more modular approach would have each stage's control signals depend solely on its neighbors (left as an exercise to the reader). This'll elaborate a net spanning all stages, but this repeating structure is easier for the tools to massage. Similarly, it's easier on you: once you get the adjacent-stage handshake (invariant) down, the rest is pretty much free, plus it's a simple representation to modify/extend.
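One possible answer to the exercise, as a sketch rather than the canonical solution: a valid/ready stage whose control depends only on its two neighbors, chained with a generate loop. The module names, the `W`/`N` parameters, the active-low async reset and the pass-through 'compute' are all assumptions made for the example.

```systemverilog
// One stage: talks only to its upstream and downstream neighbor.
module pipe_stage #(
    parameter int W = 32  // hypothetical payload width
) (
    input  logic         clk,
    input  logic         rst_n,
    input  logic         in_vld,
    output logic         in_rdy,
    input  logic [W-1:0] in_data,
    output logic         out_vld,
    input  logic         out_rdy,
    output logic [W-1:0] out_data
);
    // Invariant: accept when empty, or when the held item leaves this cycle.
    assign in_rdy = ~out_vld | out_rdy;

    // Control is reset; it holds while stalled.
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)      out_vld <= 1'b0;
        else if (in_rdy) out_vld <= in_vld;
    end

    // Data is not reset; out_vld says whether it means anything.
    always_ff @(posedge clk) begin
        if (in_vld & in_rdy) out_data <= in_data;  // stand-in for the stage's compute
    end
endmodule

// N identical stages; the rdy chain is the net that spans all of them.
module pipe #(
    parameter int W = 32,
    parameter int N = 3
) (
    input  logic         clk,
    input  logic         rst_n,
    input  logic         in_vld,
    output logic         in_rdy,
    input  logic [W-1:0] in_data,
    output logic         out_vld,
    input  logic         out_rdy,
    output logic [W-1:0] out_data
);
    logic         vld  [N+1];
    logic         rdy  [N+1];
    logic [W-1:0] data [N+1];

    assign vld[0]   = in_vld;
    assign data[0]  = in_data;
    assign in_rdy   = rdy[0];
    assign out_vld  = vld[N];
    assign out_data = data[N];
    assign rdy[N]   = out_rdy;

    for (genvar i = 0; i < N; i++) begin : g_stage
        pipe_stage #(.W(W)) u_stage (
            .clk      (clk),
            .rst_n    (rst_n),
            .in_vld   (vld[i]),
            .in_rdy   (rdy[i]),
            .in_data  (data[i]),
            .out_vld  (vld[i+1]),
            .out_rdy  (rdy[i+1]),
            .out_data (data[i+1])
        );
    end
endmodule
```

Changing the depth is a one-parameter edit, and a stage that needs different behavior can be swapped in as long as it honors the same handshake.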

Related to practices 'alleged to make the tools happy': control and data signals should be made disjoint. The distinction between data and control reveals that the "initialize to zero" maxim is not universal -- it is perfectly fine to leave data registers uninitialized; the onus is on the control signals (that ought to be reset to a proper state) to determine the data's validity / how it should be transformed. 'Poisoning' here is revealing. Likewise, it's a usual circumstance that if somehow the default-case block is reached it's cuz something goof'd -- the proper behavior is to indicate/propagate this error, letting it be 'X.
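A minimal sketch of both habits in one place, assuming an active-low async reset and a made-up 2-bit `op` field whose encoding 2'b11 is never supposed to occur. None of the names come from a real design.

```systemverilog
module vld_data_reg (
    input  logic        clk,
    input  logic        rst_n,    // assumed active-low async reset
    input  logic        in_vld,
    input  logic [31:0] in_data,
    input  logic [1:0]  op,       // hypothetical opcode; 2'b11 is unused
    output logic        out_vld,
    output logic [31:0] out_data
);
    logic [31:0] data_q;

    // Control: reset to a known state.
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) out_vld <= 1'b0;
        else        out_vld <= in_vld;
    end

    // Data: no reset. It powers up as X in simulation, and out_vld is what
    // tells the consumer whether to look at it.
    always_ff @(posedge clk) begin
        if (in_vld) data_q <= in_data;
    end

    always_comb begin
        case (op)
            2'b00:   out_data = data_q;
            2'b01:   out_data = ~data_q;
            2'b10:   out_data = data_q << 1;
            default: out_data = 'x;  // should be unreachable; poison if it isn't
        endcase
    end
endmodule
```

If a consumer ever samples `out_data` while `out_vld` is low, or `op` takes the unused encoding, the X shows up downstream in simulation instead of being masked by a harmless-looking zero.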

Consider the 2nd-order effects of the implementation. For instance, when an external signal (a module port) is assigned, you are creating a coupling, defining a path (to somewhere, we don't know) that must meet timing. As an example, say we have an 'out' flipflop that holds bits which (1) must be written to memory, indicated by setting mem_wr_en, and (2) will be consumed when out_rdy is high. It would save power to say mem_wr_en = out_vld & out_rdy (only writing during the cycle the transaction is processed). However, connecting the memory to this external ready signal 'induces tool stress' to meet timing -- it's ambiguous what combo logic is generating out_rdy (so, how delayed it is), and entangling an external signal with the memory (commonly a critical-path candidate) to save some picojoules is silly.
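The two options side by side, as a sketch. The decoupled form shown here (write whenever the flop is valid) is my assumption of what the alternative looks like; it presumes repeated writes of the same data are harmless. `in_fire` is a hypothetical 'upstream result accepted' pulse, assumed to only occur when the flop is free.

```systemverilog
module out_stage (
    input  logic clk,
    input  logic rst_n,     // assumed active-low async reset
    input  logic in_fire,   // hypothetical: a new result is captured this cycle
    input  logic out_rdy,   // external; we don't know what logic drives it
    output logic out_vld,
    output logic mem_wr_en
);
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                 out_vld <= 1'b0;
        else if (in_fire)           out_vld <= 1'b1;
        else if (out_vld & out_rdy) out_vld <= 1'b0;
    end

    // Coupled: fewer writes, but the path from out_rdy (unknown arrival
    // time) now runs into the memory's write-enable.
    // assign mem_wr_en = out_vld & out_rdy;

    // Decoupled: depends only on a local flop, so the memory's timing is
    // self-contained. Costs redundant writes while waiting on out_rdy.
    assign mem_wr_en = out_vld;
endmodule
```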

Text

Lint! RTL in particular has a lot of repetitive signal names arising from the variations and flavors -- my taste is to instantiate/declare variables one-per-line and turn on vertical alignment. If you group thoughtfully, you'll hopefully be able to see that something is awry by just glancing (scanning vertically) rather than actually needing to read/process. Tangentially, my take is that list entries and module params should also each hold their own line for the sake of neater git diffs. The ordering of entries should be consistent: with Verilog modules, it's natural to follow the port ordering in the module's file, even while including the redundant .portname(portname); generically, for list entries, sorting lexicographically ensures you don't duplicate entries when merging numerous commits (you may see Python tools do this for imports).
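What that looks like in practice, with a made-up pass-through block (`skid_stub`) and made-up signal names. Ports are one per line, vertically aligned, and connected in the same order as the module declares them.

```systemverilog
// Hypothetical leaf block, just so there is something to instantiate.
module skid_stub (
    input  logic       wr_vld,
    output logic       wr_rdy,
    input  logic [7:0] wr_data,
    output logic       rd_vld,
    input  logic       rd_rdy,
    output logic [7:0] rd_data
);
    assign rd_vld  = wr_vld;
    assign wr_rdy  = rd_rdy;
    assign rd_data = wr_data;
endmodule

module skid_wrap (
    input  logic       a_vld,
    output logic       a_rdy,
    input  logic [7:0] a_data,
    output logic       b_vld,
    input  logic       b_rdy,
    output logic [7:0] b_data
);
    // Scanning down each column, a swapped pair (say .wr_data wired to
    // b_data) stands out without having to read every line.
    skid_stub u_skid (
        .wr_vld  (a_vld ),
        .wr_rdy  (a_rdy ),
        .wr_data (a_data),
        .rd_vld  (b_vld ),
        .rd_rdy  (b_rdy ),
        .rd_data (b_data)
    );
endmodule
```

Adding or removing a port then touches exactly one line in the diff.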

Like a C programmer, read the (compiled) output. You'll find what gets optimized for free, and where you'll need to intervene with explicit representation. Usually, the tools do a fairly good job: spending thought and time on the minutiae of structure is wasted if it's going to be distilled into the same result. Long one-liners sure are edifying, but they're functionally equivalent to self-commenting code (i.e. where you write out the partial terms).
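For instance, the two outputs below describe the same boolean function, so I'd expect synthesis to reduce them to the same logic; the second one just names its partial terms. The signals are invented for the example.

```systemverilog
module stall_terms (
    input  logic in_vld,
    input  logic in_rdy,
    input  logic out_vld,
    input  logic out_rdy,
    input  logic flush,
    output logic advance_a,
    output logic advance_b
);
    // One-liner.
    assign advance_a = (in_vld & in_rdy) & (~out_vld | (out_vld & out_rdy)) & ~flush;

    // Same function, with the partial terms written out and named.
    logic in_fire;
    logic out_fire;
    logic out_free;

    assign in_fire   = in_vld  & in_rdy;
    assign out_fire  = out_vld & out_rdy;
    assign out_free  = ~out_vld | out_fire;
    assign advance_b = in_fire & out_free & ~flush;
endmodule
```

Checking the synthesized netlist for both is the way to confirm the tools really did collapse them to the same thing in your flow.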

Actual `.v`

typedef struct packed creates a contiguous bit-vector where the fields serve to index the corresponding bits. Very useful for when you've the same set of signals flowing across numerous stages/states. Rather than an always @(posedge clk) with non-blocking assignments for each signal, you can hook a packed struct up to a register module (and if you're linting you'll get a warning when a field is left hanging). Somewhat related are interfaces, which you can think of as structs but with modports letting you specify the directionality of the signals corresponding to each side of a master-slave relation; integration will be grateful if you use interfaces because of the guarantees and the assurance that ~~there won't be~~ it's harder to make a silly wiring mismatch.
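A small sketch of the struct-into-register pattern. The package, the `req_t` fields and the generic `dff` module are hypothetical; the struct only carries data here, on the assumption that the accompanying valid bit is registered (and reset) separately.

```systemverilog
package req_pkg;
    // Hypothetical bundle of data signals that travel together.
    typedef struct packed {
        logic [3:0]  tag;
        logic [7:0]  len;
        logic [31:0] addr;
    } req_t;
endpackage

// Generic register: knows nothing about the fields, only the width.
module dff #(
    parameter int W = 1
) (
    input  logic         clk,
    input  logic [W-1:0] d,
    output logic [W-1:0] q
);
    always_ff @(posedge clk) q <= d;
endmodule

module req_stage
    import req_pkg::*;
(
    input  logic clk,
    input  req_t req_d,
    output req_t req_q
);
    // A packed struct is just a bit-vector, so it plugs straight into
    // the register; $bits keeps the width in sync with the typedef.
    dff #(.W($bits(req_t))) u_req_reg (
        .clk (clk),
        .d   (req_d),
        .q   (req_q)
    );
endmodule
```

Adding a field to `req_t` then requires no change to `req_stage`, and downstream code still refers to `req_q.addr` and friends by name.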

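Arrays

| Verilog | arr [3:0][7:0] | [3:0] arr [7:0] | [3:0][7:0] arr |
| --- | --- | --- | --- |
| What | 2D unpacked | 8 x 4-bit vecs | Packed vector |
| arr = '0 | ✗ | ✗ | ✓ |
| arr = '{default:0} | ✓ | ✓ | ✓ |
| arr[2] = '0 | ✗ | ✓ | ✓ |
| arr[2][3] = '0 | ✓ | ✓ | ✓ |
| arr[2][3:1] = '0 | ✗† | ✓‡ | ✓‡ |
| arr[2:1] = '0 | ✗† | ✗† | ✓‡ |

†: slicing an unpacked dimension (the slice is itself unpacked, so '0 can't be assigned to it)
‡: slicing a packed dimension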

Obligatory FSM Gold Model

We one-hot encode state as it simplifies comparison/resolution logic.

localparam IDX_IDLE  = 0;
localparam IDX_READ  = 1;
localparam IDX_WRITE = 2;
localparam FSM_STATE_CNT = 3;
typedef enum logic [FSM_STATE_CNT-1:0] {
    IDLE  = FSM_STATE_CNT'(1 << IDX_IDLE),
    READ  = FSM_STATE_CNT'(1 << IDX_READ),
    WRITE = FSM_STATE_CNT'(1 << IDX_WRITE)
} fsm_state_t;

// transition edges
assign fsm_idle_2_read  = in_fire;
assign fsm_read_2_write = 1'b1;  // mem asm'd to respond in single cycle
assign fsm_write_2_idle = out_fire;

fsm_state_t state, nxt_state;
always_comb begin
    case (state)
        IDLE:    nxt_state = fsm_idle_2_read  ? READ  : state;
        READ:    nxt_state = fsm_read_2_write ? WRITE : state;
        WRITE:   nxt_state = fsm_write_2_idle ? IDLE  : state;
        default: nxt_state = 'X;
    endcase
end
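To round the snippet out into something that compiles on its own, here is one way it could be wrapped, as a sketch. The ports, the active-low async reset, and the definitions of `in_fire`/`out_fire` as valid & ready are my assumptions, not part of the model above. The default branch is written with an explicit cast because stricter tools reject assigning a bare 'X to an enum.

```systemverilog
module fsm_gold (
    input  logic clk,
    input  logic rst_n,    // assumed active-low async reset
    input  logic in_vld,
    output logic in_rdy,
    output logic out_vld,
    input  logic out_rdy
);
    localparam int IDX_IDLE      = 0;
    localparam int IDX_READ      = 1;
    localparam int IDX_WRITE     = 2;
    localparam int FSM_STATE_CNT = 3;

    typedef enum logic [FSM_STATE_CNT-1:0] {
        IDLE  = FSM_STATE_CNT'(1 << IDX_IDLE),
        READ  = FSM_STATE_CNT'(1 << IDX_READ),
        WRITE = FSM_STATE_CNT'(1 << IDX_WRITE)
    } fsm_state_t;

    fsm_state_t state, nxt_state;

    // One-hot payoff: 'which state am I in' is a single-bit test.
    assign in_rdy  = state[IDX_IDLE];
    assign out_vld = state[IDX_WRITE];

    logic in_fire, out_fire;
    assign in_fire  = in_vld  & in_rdy;   // assumed handshake definition
    assign out_fire = out_vld & out_rdy;  // assumed handshake definition

    always_comb begin
        case (state)
            IDLE:    nxt_state = in_fire  ? READ  : state;
            READ:    nxt_state = WRITE;   // mem assumed to respond in one cycle
            WRITE:   nxt_state = out_fire ? IDLE  : state;
            default: nxt_state = fsm_state_t'({FSM_STATE_CNT{1'bx}});
        endcase
    end

    // State is control, so it gets a reset.
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) state <= IDLE;
        else        state <= nxt_state;
    end
endmodule
```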

Misc.

Module names can contain $ (!?)
You can instantiate a module passing .* to the portlist and it'll greedily match portnames with the exact signal name; useful for wrappers, with the downside of it being unsearchable. Personally, I'm against using this.
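Both oddities in one throwaway sketch (all names invented). The `$` is legal anywhere in an identifier except as the first character.

```systemverilog
// A module name containing '$'.
module leaf$v2 (
    input  logic clk,
    input  logic en,
    output logic q
);
    always_ff @(posedge clk) q <= en;
endmodule

module leaf_wrap (
    input  logic clk,
    input  logic en,
    output logic q0,
    output logic q
);
    // Implicit: every port is matched to a same-named signal in this scope.
    // Grepping this file for '.en' or '.clk' finds nothing here.
    leaf$v2 u_implicit (.*);

    // Explicit: redundant, but each connection is searchable.
    leaf$v2 u_explicit (
        .clk (clk),
        .en  (en ),
        .q   (q0 )
    );
endmodule
```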