Pattern Recognition and Machine Learning by Bishop is one of the canonical textbooks. It helps to have a linear algebra background, though the book includes a refresher.
This one is interesting for seeing the "statistical" side of the industry, as opposed to the machine-learning side. For example, I don't think gradient descent is used once in that book.