Compile-Time And Run-Time Optimizations For Enhancing Locality And Parallelism On Multi-Core And Many-Core Systems